8 Groupby
The dataset we’re going to use has information on members of the US congress - first and last names - birth date - gender - type (“rep” for House of Representatives or “sen” for Senate) - U.S. state - political party.
| last_name | first_name | birthday | gender | type | state | party | |
|---|---|---|---|---|---|---|---|
| 0 | Bassett | Richard | 1745-04-02 | M | sen | DE | Anti-Administration |
| 1 | Bland | Theodorick | 1742-03-21 | M | rep | VA | NaN |
| 2 | Burke | Aedanus | 1743-06-16 | M | rep | SC | NaN |
| 3 | Carroll | Daniel | 1730-07-22 | M | rep | MD | NaN |
| 4 | Clymer | George | 1739-03-16 | M | rep | PA | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 12043 | Loeffler | Kelly | 1970-11-27 | F | sen | GA | Republican |
| 12044 | Wright | Ron | 1953-04-08 | M | rep | TX | Republican |
| 12045 | Fudge | Marcia | 1952-10-29 | F | rep | OH | Democrat |
| 12046 | Haaland | Debra | 1960-12-02 | F | rep | NM | Democrat |
| 12047 | Hastings | Alcee | 1936-09-05 | M | rep | FL | Democrat |
12048 rows × 7 columns
last_name object
first_name category
birthday datetime64[ns]
gender category
type category
state category
party category
dtype: object
What is the count of Congressional members, on a state-by-state basis, over the entire history of the dataset?
You call .groupby() and pass the name of the column you want to group on, which is “state”. Then, you use ["last_name"] to specify the columns on which you want to perform the actual aggregation.
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/3228017122.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
df.groupby("state")["last_name"].count()
state
AK 16
AL 209
AR 117
AS 2
AZ 49
CA 367
CO 92
CT 240
DC 2
DE 97
DK 9
FL 161
GA 317
GU 4
HI 24
IA 205
ID 59
IL 488
IN 343
KS 143
KY 373
LA 199
MA 427
MD 305
ME 175
MI 296
MN 161
MO 334
MS 155
MT 53
NC 356
ND 44
NE 127
NH 181
NJ 359
NM 57
NV 56
NY 1467
OH 675
OK 93
OL 2
OR 90
PA 1053
PI 13
PR 19
RI 107
SC 251
SD 51
TN 301
TX 263
UT 55
VA 433
VI 4
VT 115
WA 96
WI 198
WV 120
WY 40
Name: last_name, dtype: int64
You can pass a lot more than just a single column name to .groupby() as the first argument. You can also specify any of the following: - A list of multiple column names - A dictionary - A Pandas Series - A NumPy array or Pandas Index
8.1 multi index
Here’s an example of grouping jointly on two columns, which finds the count of Congressional members broken out by state and then by gender:
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/503199787.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
df.groupby(["state", "gender"])["last_name"].count()
state gender
AK F 0
M 16
AL F 4
M 205
AR F 5
...
WI M 198
WV F 1
M 119
WY F 1
M 39
Name: last_name, Length: 116, dtype: int64
Note: When we use groupby on more than one column, we create a MultiIndex
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/197250217.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
n_by_state_gender = df.groupby(["state", "gender"])["last_name"].count()
pandas.core.series.Series
MultiIndex([('AK', 'F'),
('AK', 'M'),
('AL', 'F'),
('AL', 'M'),
('AR', 'F'),
('AR', 'M'),
('AS', 'F'),
('AS', 'M'),
('AZ', 'F'),
('AZ', 'M'),
...
('VT', 'F'),
('VT', 'M'),
('WA', 'F'),
('WA', 'M'),
('WI', 'F'),
('WI', 'M'),
('WV', 'F'),
('WV', 'M'),
('WY', 'F'),
('WY', 'M')],
names=['state', 'gender'], length=116)
If we want to keep the data in columns, we can use the as_index parameter
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/1045572165.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
df.groupby(["state", "gender"], as_index=False)["last_name"].count()
| state | gender | last_name | |
|---|---|---|---|
| 0 | AK | F | 0 |
| 1 | AK | M | 16 |
| 2 | AL | F | 4 |
| 3 | AL | M | 205 |
| 4 | AR | F | 5 |
| ... | ... | ... | ... |
| 111 | WI | M | 198 |
| 112 | WV | F | 1 |
| 113 | WV | M | 119 |
| 114 | WY | F | 1 |
| 115 | WY | M | 39 |
116 rows × 3 columns
You might have noticed that groupby automatically sorts the data. We can also change this, if we want.
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/966111394.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
df.groupby("state", sort=False)["last_name"].count()
state
DE 97
VA 433
SC 251
MD 305
PA 1053
MA 427
NJ 359
GA 317
NY 1467
NC 356
CT 240
VT 115
KY 373
RI 107
NH 181
TN 301
OH 675
MS 155
OL 2
IN 343
LA 199
IL 488
MO 334
AL 209
AR 117
ME 175
FL 161
MI 296
IA 205
WI 198
TX 263
CA 367
OR 90
MN 161
NM 57
NE 127
WA 96
KS 143
UT 55
NV 56
CO 92
WV 120
DK 9
AZ 49
ID 59
MT 53
WY 40
DC 2
ND 44
SD 51
OK 93
HI 24
PR 19
AK 16
PI 13
VI 4
GU 4
AS 2
Name: last_name, dtype: int64
8.2 What is actually happening here?
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/3630249550.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
by_state = df.groupby("state")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12f9b0890>
for state, frame in by_state:
print(f"First 2 entries for {state!r}")
print("------------------------")
print(frame.head(2), end="\n\n")First 2 entries for 'AK'
------------------------
last_name first_name birthday gender type state party
6617 Waskey Frank 1875-04-20 M rep AK Democrat
6645 Cale Thomas 1848-09-17 M rep AK Independent
First 2 entries for 'AL'
------------------------
last_name first_name birthday gender type state party
911 Crowell John 1780-09-18 M rep AL Republican
990 Walker John 1783-08-12 M sen AL Republican
First 2 entries for 'AR'
------------------------
last_name first_name birthday gender type state party
1000 Bates James 1788-08-25 M rep AR NaN
1278 Conway Henry 1793-03-18 M rep AR NaN
First 2 entries for 'AS'
------------------------
last_name first_name birthday gender type state party
10795 Sunia Fofó 1937-03-13 M rep AS Democrat
11752 Faleomavaega Eni 1943-08-15 M rep AS Democrat
First 2 entries for 'AZ'
------------------------
last_name first_name birthday gender type state party
3672 Poston Charles 1825-04-20 M rep AZ Republican
3723 Goodwin John 1824-10-18 M rep AZ Republican
First 2 entries for 'CA'
------------------------
last_name first_name birthday gender type state party
2648 Gilbert Edward NaT M rep CA Democrat
2740 Wright George 1816-06-04 M rep CA Independent
First 2 entries for 'CO'
------------------------
last_name first_name birthday gender type state \
3612 Bennet Hiram 1826-09-02 M rep CO
3931 Bradford Allen 1815-07-23 M rep CO
party
3612 Conservative Republican
3931 Republican
First 2 entries for 'CT'
------------------------
last_name first_name birthday gender type state party
14 Huntington Benjamin 1736-04-19 M rep CT NaN
16 Johnson William 1727-10-07 M sen CT Pro-Administration
First 2 entries for 'DC'
------------------------
last_name first_name birthday gender type state party
4230 Chipman Norton 1834-03-07 M rep DC Republican
10819 Fauntroy Walter 1933-02-06 M rep DC Democrat
First 2 entries for 'DE'
------------------------
last_name first_name birthday gender type state party
0 Bassett Richard 1745-04-02 M sen DE Anti-Administration
40 Read George 1733-09-18 M sen DE NaN
First 2 entries for 'DK'
------------------------
last_name first_name birthday gender type state party
3640 Jayne William 1826-10-08 M rep DK NaN
3685 Todd John 1814-04-04 M rep DK Democrat
First 2 entries for 'FL'
------------------------
last_name first_name birthday gender type state party
1030 Hernández Joseph 1788-05-26 M rep FL NaN
1089 Call Richard 1792-10-24 M rep FL NaN
First 2 entries for 'GA'
------------------------
last_name first_name birthday gender type state party
8 Few William 1748-06-08 M sen GA Anti-Administration
20 Mathews George 1739-08-30 M rep GA NaN
First 2 entries for 'GU'
------------------------
last_name first_name birthday gender type state party
10702 Won Pat Antonio 1908-12-10 M rep GU Democrat
10872 Blaz Ben 1928-02-14 M rep GU Republican
First 2 entries for 'HI'
------------------------
last_name first_name birthday gender type state party
6393 Wilcox Robert 1855-02-15 M rep HI NaN
7522 Baldwin Henry 1871-01-12 M rep HI Republican
First 2 entries for 'IA'
------------------------
last_name first_name birthday gender type state party
1930 Chapman William 1808-08-11 M rep IA Democrat
2390 Hastings Serranus 1813-11-22 M rep IA Democrat
First 2 entries for 'ID'
------------------------
last_name first_name birthday gender type state party
3691 Wallace William 1811-07-19 M rep ID Republican
3844 Holbrook Edward 1836-05-06 M rep ID Democrat
First 2 entries for 'IL'
------------------------
last_name first_name birthday gender type state party
595 Bond Shadrack 1773-11-24 M rep IL NaN
772 Stephenson Benjamin NaT M rep IL NaN
First 2 entries for 'IN'
------------------------
last_name first_name birthday gender type state party
431 Parke Benjamin 1777-09-22 M rep IN NaN
984 Taylor Waller NaT M sen IN Republican
First 2 entries for 'KS'
------------------------
last_name first_name birthday gender type state party
3161 Whitfield John 1818-03-11 M rep KS Democrat
3420 Parrott Marcus 1828-10-27 M rep KS Republican
First 2 entries for 'KY'
------------------------
last_name first_name birthday gender type state party
55 Edwards John NaT M sen KY Anti-Administration
113 Greenup Christopher NaT M rep KY Republican
First 2 entries for 'LA'
------------------------
last_name first_name birthday gender type state party
584 Destréhan Jean NaT M sen LA Unknown
585 Magruder Allan NaT M sen LA Republican
First 2 entries for 'MA'
------------------------
last_name first_name birthday gender type state party
6 Dalton Tristram 1738-05-28 M sen MA Pro-Administration
12 Grout Jonathan 1737-07-23 M rep MA NaN
First 2 entries for 'MD'
------------------------
last_name first_name birthday gender type state party
3 Carroll Daniel 1730-07-22 M rep MD NaN
5 Contee Benjamin NaT M rep MD NaN
First 2 entries for 'ME'
------------------------
last_name first_name birthday gender type state party
1018 Dane Joseph 1778-10-25 M rep ME Federalist
1028 Harris Mark 1779-01-27 M rep ME Republican
First 2 entries for 'MI'
------------------------
last_name first_name birthday gender type state party
1061 Sibley Solomon 1769-10-07 M rep MI NaN
1132 Richard Gabriel 1767-10-15 M rep MI NaN
First 2 entries for 'MN'
------------------------
last_name first_name birthday gender type state party
2864 Sibley Henry 1811-02-20 M rep MN NaN
3239 Kingsbury William 1828-06-04 M rep MN Democrat
First 2 entries for 'MO'
------------------------
last_name first_name birthday gender type state party
627 Hempstead Edward 1780-06-03 M rep MO NaN
712 Easton Rufus 1774-05-04 M rep MO NaN
First 2 entries for 'MS'
------------------------
last_name first_name birthday gender type state party
256 Greene Thomas 1758-02-26 M rep MS NaN
260 Hunter Narsworthy NaT M rep MS NaN
First 2 entries for 'MT'
------------------------
last_name first_name birthday gender type state party
3754 McLean Samuel 1826-08-07 M rep MT Democrat
3939 Cavanaugh James 1823-07-04 M rep MT Democrat
First 2 entries for 'NC'
------------------------
last_name first_name birthday gender type state party
13 Hawkins Benjamin 1754-08-15 M sen NC Pro-Administration
17 Johnston Samuel 1733-12-15 M sen NC Pro-Administration
First 2 entries for 'ND'
------------------------
last_name first_name birthday gender type state party
5382 Casey Lyman 1837-05-06 M sen ND Republican
5477 Pierce Gilbert 1839-01-11 M sen ND Republican
First 2 entries for 'NE'
------------------------
last_name first_name birthday gender type state party
2952 Giddings Napoleon 1816-01-02 M rep NE Democrat
3066 Chapman Bird 1821-08-24 M rep NE Democrat
First 2 entries for 'NH'
------------------------
last_name first_name birthday gender type state party
99 Wingate Paine 1739-05-14 M rep NH NaN
120 Langdon John 1741-06-26 M sen NH Republican
First 2 entries for 'NJ'
------------------------
last_name first_name birthday gender type state party
7 Elmer Jonathan 1745-11-29 M sen NJ Pro-Administration
23 Paterson William 1745-12-24 M sen NJ Pro-Administration
First 2 entries for 'NM'
------------------------
last_name first_name birthday gender type state party
2883 Weightman Richard 1816-12-28 M rep NM Democrat
3418 Otero Miguel 1829-06-21 M rep NM NaN
First 2 entries for 'NV'
------------------------
last_name first_name birthday gender type state party
3508 Cradlebaugh John 1819-02-22 M rep NV NaN
3662 Mott Gordon 1812-10-21 M rep NV Republican
First 2 entries for 'NY'
------------------------
last_name first_name birthday gender type state party
9 Floyd William 1734-12-17 M rep NY NaN
26 Van Rensselaer Jeremiah 1738-08-27 M rep NY NaN
First 2 entries for 'OH'
------------------------
last_name first_name birthday gender type state party
226 McMillan William 1764-03-02 M rep OH NaN
254 Fearing Paul 1762-02-28 M rep OH Federalist
First 2 entries for 'OK'
------------------------
last_name first_name birthday gender type state party
5599 Harvey David 1845-03-20 M rep OK Republican
6058 Callahan James 1852-12-19 M rep OK Free Silver
First 2 entries for 'OL'
------------------------
last_name first_name birthday gender type state party
404 Clark Daniel NaT M rep OL NaN
503 Poydras Julien 1740-04-03 M rep OL NaN
First 2 entries for 'OR'
------------------------
last_name first_name birthday gender type state party
2726 Thurston Samuel 1816-04-15 M rep OR Democrat
3396 Lane Joseph 1801-12-14 M sen OR Democrat
First 2 entries for 'PA'
------------------------
last_name first_name birthday gender type state party
4 Clymer George 1739-03-16 M rep PA NaN
19 Maclay William 1737-07-20 M sen PA Anti-Administration
First 2 entries for 'PI'
------------------------
last_name first_name birthday gender type state party
6833 Ocampo Pablo 1853-01-25 M rep PI NaN
6937 Legarda Y Tuason Benito 1853-09-27 M rep PI NaN
First 2 entries for 'PR'
------------------------
last_name first_name birthday gender type state party
6424 Degetau Federico 1862-12-05 M rep PR Republican
6809 Larrinaga Tulio 1847-01-15 M rep PR Unionist
First 2 entries for 'RI'
------------------------
last_name first_name birthday gender type state party
61 Bradford William 1729-11-04 M sen RI Federalist
105 Bourne Benjamin 1755-09-09 M rep RI Federalist
First 2 entries for 'SC'
------------------------
last_name first_name birthday gender type state party
2 Burke Aedanus 1743-06-16 M rep SC NaN
15 Izard Ralph NaT M sen SC Pro-Administration
First 2 entries for 'SD'
------------------------
last_name first_name birthday gender type state party
5421 Gifford Oscar 1842-10-20 M rep SD Republican
5460 Moody Gideon 1832-10-16 M sen SD Republican
First 2 entries for 'TN'
------------------------
last_name first_name birthday gender type state party
141 White James 1749-06-16 M rep TN NaN
142 Blount William 1749-03-26 M sen TN Republican
First 2 entries for 'TX'
------------------------
last_name first_name birthday gender type state party
2567 Pilsbury Timothy 1789-04-12 M rep TX Democrat
2669 Kaufman David 1813-12-18 M rep TX Democrat
First 2 entries for 'UT'
------------------------
last_name first_name birthday gender type state party
3482 Bernhisel John 1799-07-23 M rep UT Whig
3645 Kinney John 1816-04-02 M rep UT Democrat
First 2 entries for 'VA'
------------------------
last_name first_name birthday gender type state party
1 Bland Theodorick 1742-03-21 M rep VA NaN
11 Grayson William NaT M sen VA Anti-Administration
First 2 entries for 'VI'
------------------------
last_name first_name birthday gender type state party
10494 Evans Melvin 1917-08-07 M rep VI Republican
11086 de Lugo Ron 1930-08-02 M rep VI Democrat
First 2 entries for 'VT'
------------------------
last_name first_name birthday gender type state party
41 Robinson Moses 1741-03-22 M sen VT Anti-Administration
86 Niles Nathaniel 1741-04-03 M rep VT NaN
First 2 entries for 'WA'
------------------------
last_name first_name birthday gender type state party
2977 Lancaster Columbia 1803-08-26 M rep WA Democrat
3050 Anderson James 1822-02-16 M rep WA Democrat
First 2 entries for 'WI'
------------------------
last_name first_name birthday gender type state party
2409 Martin Morgan 1805-03-31 M rep WI Democrat
2502 Darling Mason 1801-05-18 M rep WI Democrat
First 2 entries for 'WV'
------------------------
last_name first_name birthday gender type state \
3613 Blair Jacob 1821-04-11 M rep WV
3688 Van Winkle Peter 1808-09-07 M sen WV
party
3613 Unconditional Unionist
3688 Republican
First 2 entries for 'WY'
------------------------
last_name first_name birthday gender type state party
4007 Nuckolls Stephen 1825-08-16 M rep WY Democrat
4136 Jones William 1842-02-20 M rep WY Republican
When you use groupby, you automatically create a dictionary with the different group labels.
8.3 get group
You can also use .get_group() as a way to drill down to the sub-table from a single group:
| last_name | first_name | birthday | gender | type | state | party | |
|---|---|---|---|---|---|---|---|
| 4 | Clymer | George | 1739-03-16 | M | rep | PA | NaN |
| 19 | Maclay | William | 1737-07-20 | M | sen | PA | Anti-Administration |
| 21 | Morris | Robert | 1734-01-20 | M | sen | PA | Pro-Administration |
| 27 | Wynkoop | Henry | 1737-03-02 | M | rep | PA | NaN |
| 38 | Jacobs | Israel | 1726-06-09 | M | rep | PA | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 11887 | Brady | Robert | 1945-04-07 | M | rep | PA | Democrat |
| 11926 | Shuster | Bill | 1961-01-10 | M | rep | PA | Republican |
| 11938 | Rothfus | Keith | 1962-04-25 | M | rep | PA | Republican |
| 11952 | Costello | Ryan | 1976-09-07 | M | rep | PA | Republican |
| 11965 | Marino | Tom | 1952-08-15 | M | rep | PA | Republican |
1053 rows × 7 columns
This is virtually equivalent to using .loc[]. You could get the same output with something like df.loc[df["state"] == "PA"].