8  Groupby

import pandas as pd

The dataset we’re going to use has information on members of the US congress - first and last names - birth date - gender - type (“rep” for House of Representatives or “sen” for Senate) - U.S. state - political party.

dtypes = {
    "first_name": "category",
    "gender": "category",
    "type": "category",
    "state": "category",
    "party": "category",
}
df = pd.read_csv(
    "./congress.csv",
    dtype=dtypes,
    usecols=list(dtypes) + ["birthday", "last_name"],
    parse_dates=["birthday"]
)
df.shape
(12048, 7)
df
last_name first_name birthday gender type state party
0 Bassett Richard 1745-04-02 M sen DE Anti-Administration
1 Bland Theodorick 1742-03-21 M rep VA NaN
2 Burke Aedanus 1743-06-16 M rep SC NaN
3 Carroll Daniel 1730-07-22 M rep MD NaN
4 Clymer George 1739-03-16 M rep PA NaN
... ... ... ... ... ... ... ...
12043 Loeffler Kelly 1970-11-27 F sen GA Republican
12044 Wright Ron 1953-04-08 M rep TX Republican
12045 Fudge Marcia 1952-10-29 F rep OH Democrat
12046 Haaland Debra 1960-12-02 F rep NM Democrat
12047 Hastings Alcee 1936-09-05 M rep FL Democrat

12048 rows × 7 columns

df.dtypes
last_name             object
first_name          category
birthday      datetime64[ns]
gender              category
type                category
state               category
party               category
dtype: object

What is the count of Congressional members, on a state-by-state basis, over the entire history of the dataset?

You call .groupby() and pass the name of the column you want to group on, which is “state”. Then, you use ["last_name"] to specify the columns on which you want to perform the actual aggregation.

df.groupby("state")["last_name"].count()
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/3228017122.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby("state")["last_name"].count()
state
AK      16
AL     209
AR     117
AS       2
AZ      49
CA     367
CO      92
CT     240
DC       2
DE      97
DK       9
FL     161
GA     317
GU       4
HI      24
IA     205
ID      59
IL     488
IN     343
KS     143
KY     373
LA     199
MA     427
MD     305
ME     175
MI     296
MN     161
MO     334
MS     155
MT      53
NC     356
ND      44
NE     127
NH     181
NJ     359
NM      57
NV      56
NY    1467
OH     675
OK      93
OL       2
OR      90
PA    1053
PI      13
PR      19
RI     107
SC     251
SD      51
TN     301
TX     263
UT      55
VA     433
VI       4
VT     115
WA      96
WI     198
WV     120
WY      40
Name: last_name, dtype: int64

You can pass a lot more than just a single column name to .groupby() as the first argument. You can also specify any of the following: - A list of multiple column names - A dictionary - A Pandas Series - A NumPy array or Pandas Index

8.1 multi index

Here’s an example of grouping jointly on two columns, which finds the count of Congressional members broken out by state and then by gender:

df.groupby(["state", "gender"])["last_name"].count()
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/503199787.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(["state", "gender"])["last_name"].count()
state  gender
AK     F           0
       M          16
AL     F           4
       M         205
AR     F           5
                ... 
WI     M         198
WV     F           1
       M         119
WY     F           1
       M          39
Name: last_name, Length: 116, dtype: int64

Note: When we use groupby on more than one column, we create a MultiIndex

n_by_state_gender = df.groupby(["state", "gender"])["last_name"].count()
type(n_by_state_gender)
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/197250217.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  n_by_state_gender = df.groupby(["state", "gender"])["last_name"].count()
pandas.core.series.Series
n_by_state_gender.index
MultiIndex([('AK', 'F'),
            ('AK', 'M'),
            ('AL', 'F'),
            ('AL', 'M'),
            ('AR', 'F'),
            ('AR', 'M'),
            ('AS', 'F'),
            ('AS', 'M'),
            ('AZ', 'F'),
            ('AZ', 'M'),
            ...
            ('VT', 'F'),
            ('VT', 'M'),
            ('WA', 'F'),
            ('WA', 'M'),
            ('WI', 'F'),
            ('WI', 'M'),
            ('WV', 'F'),
            ('WV', 'M'),
            ('WY', 'F'),
            ('WY', 'M')],
           names=['state', 'gender'], length=116)

If we want to keep the data in columns, we can use the as_index parameter

df.groupby(["state", "gender"], as_index=False)["last_name"].count()
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/1045572165.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby(["state", "gender"], as_index=False)["last_name"].count()
state gender last_name
0 AK F 0
1 AK M 16
2 AL F 4
3 AL M 205
4 AR F 5
... ... ... ...
111 WI M 198
112 WV F 1
113 WV M 119
114 WY F 1
115 WY M 39

116 rows × 3 columns

You might have noticed that groupby automatically sorts the data. We can also change this, if we want.

df.groupby("state", sort=False)["last_name"].count()
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/966111394.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  df.groupby("state", sort=False)["last_name"].count()
state
DE      97
VA     433
SC     251
MD     305
PA    1053
MA     427
NJ     359
GA     317
NY    1467
NC     356
CT     240
VT     115
KY     373
RI     107
NH     181
TN     301
OH     675
MS     155
OL       2
IN     343
LA     199
IL     488
MO     334
AL     209
AR     117
ME     175
FL     161
MI     296
IA     205
WI     198
TX     263
CA     367
OR      90
MN     161
NM      57
NE     127
WA      96
KS     143
UT      55
NV      56
CO      92
WV     120
DK       9
AZ      49
ID      59
MT      53
WY      40
DC       2
ND      44
SD      51
OK      93
HI      24
PR      19
AK      16
PI      13
VI       4
GU       4
AS       2
Name: last_name, dtype: int64

8.2 What is actually happening here?

by_state = df.groupby("state")
by_state
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_39943/3630249550.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
  by_state = df.groupby("state")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x12f9b0890>
for state, frame in by_state:
    print(f"First 2 entries for {state!r}")
    print("------------------------")
    print(frame.head(2), end="\n\n")
First 2 entries for 'AK'
------------------------
     last_name first_name   birthday gender type state        party
6617    Waskey      Frank 1875-04-20      M  rep    AK     Democrat
6645      Cale     Thomas 1848-09-17      M  rep    AK  Independent

First 2 entries for 'AL'
------------------------
    last_name first_name   birthday gender type state       party
911   Crowell       John 1780-09-18      M  rep    AL  Republican
990    Walker       John 1783-08-12      M  sen    AL  Republican

First 2 entries for 'AR'
------------------------
     last_name first_name   birthday gender type state party
1000     Bates      James 1788-08-25      M  rep    AR   NaN
1278    Conway      Henry 1793-03-18      M  rep    AR   NaN

First 2 entries for 'AS'
------------------------
          last_name first_name   birthday gender type state     party
10795         Sunia       Fofó 1937-03-13      M  rep    AS  Democrat
11752  Faleomavaega        Eni 1943-08-15      M  rep    AS  Democrat

First 2 entries for 'AZ'
------------------------
     last_name first_name   birthday gender type state       party
3672    Poston    Charles 1825-04-20      M  rep    AZ  Republican
3723   Goodwin       John 1824-10-18      M  rep    AZ  Republican

First 2 entries for 'CA'
------------------------
     last_name first_name   birthday gender type state        party
2648   Gilbert     Edward        NaT      M  rep    CA     Democrat
2740    Wright     George 1816-06-04      M  rep    CA  Independent

First 2 entries for 'CO'
------------------------
     last_name first_name   birthday gender type state  \
3612    Bennet      Hiram 1826-09-02      M  rep    CO   
3931  Bradford      Allen 1815-07-23      M  rep    CO   

                        party  
3612  Conservative Republican  
3931               Republican  

First 2 entries for 'CT'
------------------------
     last_name first_name   birthday gender type state               party
14  Huntington   Benjamin 1736-04-19      M  rep    CT                 NaN
16     Johnson    William 1727-10-07      M  sen    CT  Pro-Administration

First 2 entries for 'DC'
------------------------
      last_name first_name   birthday gender type state       party
4230    Chipman     Norton 1834-03-07      M  rep    DC  Republican
10819  Fauntroy     Walter 1933-02-06      M  rep    DC    Democrat

First 2 entries for 'DE'
------------------------
   last_name first_name   birthday gender type state                party
0    Bassett    Richard 1745-04-02      M  sen    DE  Anti-Administration
40      Read     George 1733-09-18      M  sen    DE                  NaN

First 2 entries for 'DK'
------------------------
     last_name first_name   birthday gender type state     party
3640     Jayne    William 1826-10-08      M  rep    DK       NaN
3685      Todd       John 1814-04-04      M  rep    DK  Democrat

First 2 entries for 'FL'
------------------------
      last_name first_name   birthday gender type state party
1030  Hernández     Joseph 1788-05-26      M  rep    FL   NaN
1089       Call    Richard 1792-10-24      M  rep    FL   NaN

First 2 entries for 'GA'
------------------------
   last_name first_name   birthday gender type state                party
8        Few    William 1748-06-08      M  sen    GA  Anti-Administration
20   Mathews     George 1739-08-30      M  rep    GA                  NaN

First 2 entries for 'GU'
------------------------
      last_name first_name   birthday gender type state       party
10702   Won Pat    Antonio 1908-12-10      M  rep    GU    Democrat
10872      Blaz        Ben 1928-02-14      M  rep    GU  Republican

First 2 entries for 'HI'
------------------------
     last_name first_name   birthday gender type state       party
6393    Wilcox     Robert 1855-02-15      M  rep    HI         NaN
7522   Baldwin      Henry 1871-01-12      M  rep    HI  Republican

First 2 entries for 'IA'
------------------------
     last_name first_name   birthday gender type state     party
1930   Chapman    William 1808-08-11      M  rep    IA  Democrat
2390  Hastings   Serranus 1813-11-22      M  rep    IA  Democrat

First 2 entries for 'ID'
------------------------
     last_name first_name   birthday gender type state       party
3691   Wallace    William 1811-07-19      M  rep    ID  Republican
3844  Holbrook     Edward 1836-05-06      M  rep    ID    Democrat

First 2 entries for 'IL'
------------------------
      last_name first_name   birthday gender type state party
595        Bond   Shadrack 1773-11-24      M  rep    IL   NaN
772  Stephenson   Benjamin        NaT      M  rep    IL   NaN

First 2 entries for 'IN'
------------------------
    last_name first_name   birthday gender type state       party
431     Parke   Benjamin 1777-09-22      M  rep    IN         NaN
984    Taylor     Waller        NaT      M  sen    IN  Republican

First 2 entries for 'KS'
------------------------
      last_name first_name   birthday gender type state       party
3161  Whitfield       John 1818-03-11      M  rep    KS    Democrat
3420    Parrott     Marcus 1828-10-27      M  rep    KS  Republican

First 2 entries for 'KY'
------------------------
    last_name   first_name birthday gender type state                party
55    Edwards         John      NaT      M  sen    KY  Anti-Administration
113   Greenup  Christopher      NaT      M  rep    KY           Republican

First 2 entries for 'LA'
------------------------
     last_name first_name birthday gender type state       party
584  Destréhan       Jean      NaT      M  sen    LA     Unknown
585   Magruder      Allan      NaT      M  sen    LA  Republican

First 2 entries for 'MA'
------------------------
   last_name first_name   birthday gender type state               party
6     Dalton   Tristram 1738-05-28      M  sen    MA  Pro-Administration
12     Grout   Jonathan 1737-07-23      M  rep    MA                 NaN

First 2 entries for 'MD'
------------------------
  last_name first_name   birthday gender type state party
3   Carroll     Daniel 1730-07-22      M  rep    MD   NaN
5    Contee   Benjamin        NaT      M  rep    MD   NaN

First 2 entries for 'ME'
------------------------
     last_name first_name   birthday gender type state       party
1018      Dane     Joseph 1778-10-25      M  rep    ME  Federalist
1028    Harris       Mark 1779-01-27      M  rep    ME  Republican

First 2 entries for 'MI'
------------------------
     last_name first_name   birthday gender type state party
1061    Sibley    Solomon 1769-10-07      M  rep    MI   NaN
1132   Richard    Gabriel 1767-10-15      M  rep    MI   NaN

First 2 entries for 'MN'
------------------------
      last_name first_name   birthday gender type state     party
2864     Sibley      Henry 1811-02-20      M  rep    MN       NaN
3239  Kingsbury    William 1828-06-04      M  rep    MN  Democrat

First 2 entries for 'MO'
------------------------
     last_name first_name   birthday gender type state party
627  Hempstead     Edward 1780-06-03      M  rep    MO   NaN
712     Easton      Rufus 1774-05-04      M  rep    MO   NaN

First 2 entries for 'MS'
------------------------
    last_name  first_name   birthday gender type state party
256    Greene      Thomas 1758-02-26      M  rep    MS   NaN
260    Hunter  Narsworthy        NaT      M  rep    MS   NaN

First 2 entries for 'MT'
------------------------
      last_name first_name   birthday gender type state     party
3754     McLean     Samuel 1826-08-07      M  rep    MT  Democrat
3939  Cavanaugh      James 1823-07-04      M  rep    MT  Democrat

First 2 entries for 'NC'
------------------------
   last_name first_name   birthday gender type state               party
13   Hawkins   Benjamin 1754-08-15      M  sen    NC  Pro-Administration
17  Johnston     Samuel 1733-12-15      M  sen    NC  Pro-Administration

First 2 entries for 'ND'
------------------------
     last_name first_name   birthday gender type state       party
5382     Casey      Lyman 1837-05-06      M  sen    ND  Republican
5477    Pierce    Gilbert 1839-01-11      M  sen    ND  Republican

First 2 entries for 'NE'
------------------------
     last_name first_name   birthday gender type state     party
2952  Giddings   Napoleon 1816-01-02      M  rep    NE  Democrat
3066   Chapman       Bird 1821-08-24      M  rep    NE  Democrat

First 2 entries for 'NH'
------------------------
    last_name first_name   birthday gender type state       party
99    Wingate      Paine 1739-05-14      M  rep    NH         NaN
120   Langdon       John 1741-06-26      M  sen    NH  Republican

First 2 entries for 'NJ'
------------------------
   last_name first_name   birthday gender type state               party
7      Elmer   Jonathan 1745-11-29      M  sen    NJ  Pro-Administration
23  Paterson    William 1745-12-24      M  sen    NJ  Pro-Administration

First 2 entries for 'NM'
------------------------
      last_name first_name   birthday gender type state     party
2883  Weightman    Richard 1816-12-28      M  rep    NM  Democrat
3418      Otero     Miguel 1829-06-21      M  rep    NM       NaN

First 2 entries for 'NV'
------------------------
        last_name first_name   birthday gender type state       party
3508  Cradlebaugh       John 1819-02-22      M  rep    NV         NaN
3662         Mott     Gordon 1812-10-21      M  rep    NV  Republican

First 2 entries for 'NY'
------------------------
         last_name first_name   birthday gender type state party
9            Floyd    William 1734-12-17      M  rep    NY   NaN
26  Van Rensselaer   Jeremiah 1738-08-27      M  rep    NY   NaN

First 2 entries for 'OH'
------------------------
    last_name first_name   birthday gender type state       party
226  McMillan    William 1764-03-02      M  rep    OH         NaN
254   Fearing       Paul 1762-02-28      M  rep    OH  Federalist

First 2 entries for 'OK'
------------------------
     last_name first_name   birthday gender type state        party
5599    Harvey      David 1845-03-20      M  rep    OK   Republican
6058  Callahan      James 1852-12-19      M  rep    OK  Free Silver

First 2 entries for 'OL'
------------------------
    last_name first_name   birthday gender type state party
404     Clark     Daniel        NaT      M  rep    OL   NaN
503   Poydras     Julien 1740-04-03      M  rep    OL   NaN

First 2 entries for 'OR'
------------------------
     last_name first_name   birthday gender type state     party
2726  Thurston     Samuel 1816-04-15      M  rep    OR  Democrat
3396      Lane     Joseph 1801-12-14      M  sen    OR  Democrat

First 2 entries for 'PA'
------------------------
   last_name first_name   birthday gender type state                party
4     Clymer     George 1739-03-16      M  rep    PA                  NaN
19    Maclay    William 1737-07-20      M  sen    PA  Anti-Administration

First 2 entries for 'PI'
------------------------
             last_name first_name   birthday gender type state party
6833            Ocampo      Pablo 1853-01-25      M  rep    PI   NaN
6937  Legarda Y Tuason     Benito 1853-09-27      M  rep    PI   NaN

First 2 entries for 'PR'
------------------------
      last_name first_name   birthday gender type state       party
6424    Degetau   Federico 1862-12-05      M  rep    PR  Republican
6809  Larrinaga      Tulio 1847-01-15      M  rep    PR    Unionist

First 2 entries for 'RI'
------------------------
    last_name first_name   birthday gender type state       party
61   Bradford    William 1729-11-04      M  sen    RI  Federalist
105    Bourne   Benjamin 1755-09-09      M  rep    RI  Federalist

First 2 entries for 'SC'
------------------------
   last_name first_name   birthday gender type state               party
2      Burke    Aedanus 1743-06-16      M  rep    SC                 NaN
15     Izard      Ralph        NaT      M  sen    SC  Pro-Administration

First 2 entries for 'SD'
------------------------
     last_name first_name   birthday gender type state       party
5421   Gifford      Oscar 1842-10-20      M  rep    SD  Republican
5460     Moody     Gideon 1832-10-16      M  sen    SD  Republican

First 2 entries for 'TN'
------------------------
    last_name first_name   birthday gender type state       party
141     White      James 1749-06-16      M  rep    TN         NaN
142    Blount    William 1749-03-26      M  sen    TN  Republican

First 2 entries for 'TX'
------------------------
     last_name first_name   birthday gender type state     party
2567  Pilsbury    Timothy 1789-04-12      M  rep    TX  Democrat
2669   Kaufman      David 1813-12-18      M  rep    TX  Democrat

First 2 entries for 'UT'
------------------------
      last_name first_name   birthday gender type state     party
3482  Bernhisel       John 1799-07-23      M  rep    UT      Whig
3645     Kinney       John 1816-04-02      M  rep    UT  Democrat

First 2 entries for 'VA'
------------------------
   last_name  first_name   birthday gender type state                party
1      Bland  Theodorick 1742-03-21      M  rep    VA                  NaN
11   Grayson     William        NaT      M  sen    VA  Anti-Administration

First 2 entries for 'VI'
------------------------
      last_name first_name   birthday gender type state       party
10494     Evans     Melvin 1917-08-07      M  rep    VI  Republican
11086   de Lugo        Ron 1930-08-02      M  rep    VI    Democrat

First 2 entries for 'VT'
------------------------
   last_name first_name   birthday gender type state                party
41  Robinson      Moses 1741-03-22      M  sen    VT  Anti-Administration
86     Niles  Nathaniel 1741-04-03      M  rep    VT                  NaN

First 2 entries for 'WA'
------------------------
      last_name first_name   birthday gender type state     party
2977  Lancaster   Columbia 1803-08-26      M  rep    WA  Democrat
3050   Anderson      James 1822-02-16      M  rep    WA  Democrat

First 2 entries for 'WI'
------------------------
     last_name first_name   birthday gender type state     party
2409    Martin     Morgan 1805-03-31      M  rep    WI  Democrat
2502   Darling      Mason 1801-05-18      M  rep    WI  Democrat

First 2 entries for 'WV'
------------------------
       last_name first_name   birthday gender type state  \
3613       Blair      Jacob 1821-04-11      M  rep    WV   
3688  Van Winkle      Peter 1808-09-07      M  sen    WV   

                       party  
3613  Unconditional Unionist  
3688              Republican  

First 2 entries for 'WY'
------------------------
     last_name first_name   birthday gender type state       party
4007  Nuckolls    Stephen 1825-08-16      M  rep    WY    Democrat
4136     Jones    William 1842-02-20      M  rep    WY  Republican

When you use groupby, you automatically create a dictionary with the different group labels.

by_state.groups["PA"]
Index([    4,    19,    21,    27,    38,    57,    69,    76,    84,    88,
       ...
       11838, 11862, 11871, 11873, 11883, 11887, 11926, 11938, 11952, 11965],
      dtype='int64', length=1053)

8.3 get group

You can also use .get_group() as a way to drill down to the sub-table from a single group:

by_state.get_group("PA")
last_name first_name birthday gender type state party
4 Clymer George 1739-03-16 M rep PA NaN
19 Maclay William 1737-07-20 M sen PA Anti-Administration
21 Morris Robert 1734-01-20 M sen PA Pro-Administration
27 Wynkoop Henry 1737-03-02 M rep PA NaN
38 Jacobs Israel 1726-06-09 M rep PA NaN
... ... ... ... ... ... ... ...
11887 Brady Robert 1945-04-07 M rep PA Democrat
11926 Shuster Bill 1961-01-10 M rep PA Republican
11938 Rothfus Keith 1962-04-25 M rep PA Republican
11952 Costello Ryan 1976-09-07 M rep PA Republican
11965 Marino Tom 1952-08-15 M rep PA Republican

1053 rows × 7 columns

This is virtually equivalent to using .loc[]. You could get the same output with something like df.loc[df["state"] == "PA"].