1  Tutorial 1

1.1 Getting to Know Pandas and Jupyter Notebooks

Jupyter Notebooks are an open-source web application that can be used to create and share documents that contain live code, equations, visualizations and narrative text.

Pandas is a software library written for the Python programming language for data manipulation and analysis.

1.1.1 In today’s tirgul we will learn:

  • A bit about Jupyter Notebook
  • How to import data into Pandas and
  • Some basic tools for manipulation.

1.2 Opening a Jupyter Notebook

If you’ve never used a Jupyter Notebook before…

  1. Install Anaconda (https://docs.anaconda.com/anaconda/install/)

  2. Option 1: Open Anaconda Navigator and click on “Jupyter Notebook.” Option 2: Open terminal and type jupyter notebook followed by Enter.**

  3. Navigate to the folder (i.e., directory) that you want to be in.

  4. Click “New–>Python 3 Notebook”

**There are lots of other ways to do this. You can also use Visual Studio Code, or Binder… the choice is yours.

1.3 About Jupyter Notebook

Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text. It’s a great tool for interactive data analysis, scientific computing, and educational purposes.

In Jupyter Notebook, you’ll see two main types of cells: code cells and markdown cells.

1.3.1 Code cell

Code cells are where you can write and execute Python (or other programming language) code. To create a new code cell, click on the + icon in the toolbar or use the Insert menu. To run a code cell, you can click the Run button in the toolbar, or press Shift + Enter on your keyboard. The code in the cell will be executed, and any output will be displayed below the cell.

1.3.2 Markdown cell

Markdown cells are where you can write formatted text using Markdown syntax. Markdown is a lightweight markup language that allows you to add headings, lists, links, images, and more to your text. In fact, this is a markdown cell To create a new markdown cell, click on the + icon and select Markdown from the dropdown menu. To render the markdown as formatted text, you can also run the cell using Shift + Enter.

1.3.3 Execution order

The order of appearance of cells in your notebook doesn’t necessarily determine the order in which they are executed. Instead, Jupyter Notebook keeps track of the order in which cells are run, and uses that order to ensure that all necessary code is executed before it’s used in later cells. You can see the execution order of cells by looking at the number to the left of the cell (e.g., [1]) - this indicates the order in which the cell was executed.

In summary, Jupyter Notebook is a powerful tool for interactive data analysis and scientific computing. Code cells allow you to write and execute Python code, while markdown cells allow you to write formatted text using Markdown syntax. You can run cells using the Run button or Shift + Enter, and Jupyter Notebook keeps track of the order in which cells are executed to ensure that your code runs correctly.

1.4 Short Markdown example

To create headers, you can use the “#” symbol followed by a space, and then type your header text. The number of “#” symbols determines the level of the header, with one “#” indicating the largest header and six “#” indicating the smallest header. For example:

# This is a level 1 header

## This is a level 2 header

### This is a level 3 header

To go down a row and write bullet points, you can use the “*” symbol followed by a space, and then type your bullet text. For example:

  • This is a bullet point
  • This is another bullet point

1.4.1 There are many more things you can do, this cheat sheet can be very helpful.

1.5 Download the required dataset

We are going to be using a dataset based on information from 60k OkCupid users. The dataset was collected by web scraping the OKCupid.com website on 2012/06/30, and includes profiles of people within a 25 mile radius of San Francisco, who were online in the previous year (after 06/30/2011), with at least one profile picture. The data includes age, sex, orientation as well as text data from open ended descriptions. You can download the dataset here: https://github.com/ErezFeuer/Introduction-to-Data-Science-Tutorials/blob/master/tirgul_1/okcupid_profiles.csv.

How to download from files from Github https://www.gitkraken.com/learn/git/github-download#how-to-downlaod-a-file-from-github

1.6 Getting started with Pandas

# import the pandas library
import pandas as pd
# import the dataset
okcupid_df = pd.read_csv('okcupid_profiles.csv')

You just created a pandas DataFrame.

We can look at our data by using the .head() method.

By default, this shows the header (column names) and the first five rows.

Passing an integer, n, to .head() returns that number of rows.

# peak at the data
okcupid_df.head(10)
age status sex orientation body_type diet drinks drugs education ethnicity ... income job last_online location offspring pets religion sign smokes speaks
0 22 single m straight a little extra strictly anything socially never working on college/university asian, white ... -1 transportation 2012-06-28-20-30 south san francisco, california doesn't have kids, but might want them likes dogs and likes cats agnosticism and very serious about it gemini sometimes english
1 35 single m straight average mostly other often sometimes working on space camp white ... 80000 hospitality / travel 2012-06-29-21-41 oakland, california doesn't have kids, but might want them likes dogs and likes cats agnosticism but not too serious about it cancer no english (fluently), spanish (poorly), french (...
2 38 available m straight thin anything socially NaN graduated from masters program NaN ... -1 NaN 2012-06-27-09-10 san francisco, california NaN has cats NaN pisces but it doesn’t matter no english, french, c++
3 23 single m straight thin vegetarian socially NaN working on college/university white ... 20000 student 2012-06-28-14-22 berkeley, california doesn't want kids likes cats NaN pisces no english, german (poorly)
4 29 single m straight athletic NaN socially never graduated from college/university asian, black, other ... -1 artistic / musical / writer 2012-06-27-21-26 san francisco, california NaN likes dogs and likes cats NaN aquarius no english
5 29 single m straight average mostly anything socially NaN graduated from college/university white ... -1 computer / hardware / software 2012-06-29-19-18 san francisco, california doesn't have kids, but might want them likes cats atheism taurus no english (fluently), chinese (okay)
6 32 single f straight fit strictly anything socially never graduated from college/university white, other ... -1 NaN 2012-06-25-20-45 san francisco, california NaN likes dogs and likes cats NaN virgo NaN english
7 31 single f straight average mostly anything socially never graduated from college/university white ... -1 artistic / musical / writer 2012-06-29-12-30 san francisco, california doesn't have kids, but wants them likes dogs and likes cats christianity sagittarius no english, spanish (okay)
8 24 single f straight NaN strictly anything socially NaN graduated from college/university white ... -1 NaN 2012-06-29-23-39 belvedere tiburon, california doesn't have kids likes dogs and likes cats christianity but not too serious about it gemini but it doesn’t matter when drinking english
9 37 single m straight athletic mostly anything not at all never working on two-year college white ... -1 student 2012-06-28-21-08 san mateo, california NaN likes dogs and likes cats atheism and laughing about it cancer but it doesn’t matter no english (fluently)

10 rows × 21 columns

Alternatively, to see the last n rows, use .tail().

okcupid_df.tail(10)
age status sex orientation body_type diet drinks drugs education ethnicity ... income job last_online location offspring pets religion sign smokes speaks
9990 33 single m straight average mostly anything socially never graduated from masters program asian ... -1 banking / financial / real estate 2012-06-30-08-51 san leandro, california doesn't have kids likes dogs NaN leo but it doesn’t matter NaN english, chinese
9991 27 single m gay athletic NaN socially never graduated from college/university asian ... -1 student 2011-10-13-01-58 stanford, california NaN NaN NaN NaN no english (fluently), chinese (fluently)
9992 39 single f straight fit mostly anything socially never graduated from masters program indian ... -1 computer / hardware / software 2012-06-07-14-13 san francisco, california doesn't have kids likes dogs NaN aquarius and it’s fun to think about NaN english (fluently)
9993 38 single f straight athletic NaN socially sometimes graduated from college/university NaN ... -1 computer / hardware / software 2012-06-30-01-51 oakland, california doesn't have kids, but wants them NaN buddhism aries no english
9994 29 single m straight average NaN socially never college/university NaN ... -1 NaN 2012-01-09-21-35 oakland, california NaN NaN NaN NaN sometimes english (fluently), chinese (fluently)
9995 24 single f straight curvy anything socially never graduated from two-year college black ... 20000 other 2012-01-16-23-10 burlingame, california has a kid likes dogs and likes cats christianity and very serious about it libra but it doesn’t matter when drinking english
9996 24 single f straight skinny NaN socially never graduated from college/university asian ... -1 NaN 2012-06-30-02-24 san francisco, california doesn't have kids NaN NaN virgo no english, chinese
9997 19 seeing someone f bisexual athletic NaN rarely never NaN pacific islander, white, other ... -1 NaN 2012-03-08-13-08 martinez, california NaN has dogs and has cats NaN aquarius but it doesn’t matter sometimes english
9998 47 single f straight a little extra NaN rarely never NaN black ... 50000 transportation 2012-06-30-06-12 emeryville, california NaN likes dogs and likes cats other scorpio and it’s fun to think about no english
9999 29 single f straight NaN mostly vegan socially NaN graduated from college/university white ... -1 entertainment / media 2012-06-03-23-30 san francisco, california doesn't have kids, but might want them NaN NaN leo no english, spanish

10 rows × 21 columns

1.6.1 To find the number of rows, you can use the len() function. Alternatively, you can use the shape attribute.

# get the length of the dataframe
len(okcupid_df)
10000
# get the size of the dataframe (rows, columns)
okcupid_df.shape
(10000, 21)

1.6.2 What are all the columns in our dataset?

# get columns
okcupid_df.columns
Index(['age', 'status', 'sex', 'orientation', 'body_type', 'diet', 'drinks',
       'drugs', 'education', 'ethnicity', 'height', 'income', 'job',
       'last_online', 'location', 'offspring', 'pets', 'religion', 'sign',
       'smokes', 'speaks'],
      dtype='object')
# get column data types
okcupid_df.dtypes
age             int64
status         object
sex            object
orientation    object
body_type      object
diet           object
drinks         object
drugs          object
education      object
ethnicity      object
height          int64
income          int64
job            object
last_online    object
location       object
offspring      object
pets           object
religion       object
sign           object
smokes         object
speaks         object
dtype: object

1.6.3 Accessing a single column

# access a single column
okcupid_df[['height']]
height
0 75
1 70
2 68
3 71
4 66
... ...
9995 66
9996 59
9997 61
9998 64
9999 68

10000 rows × 1 columns

# let's try that a different way...
okcupid_df['height']
0       75
1       70
2       68
3       71
4       66
        ..
9995    66
9996    59
9997    61
9998    64
9999    68
Name: height, Length: 10000, dtype: int64
# and another way...
okcupid_df.height
0       75
1       70
2       68
3       71
4       66
        ..
9995    66
9996    59
9997    61
9998    64
9999    68
Name: height, Length: 10000, dtype: int64

It is preferrable to use the bracket notation as a column name might inadvertently have the same name as a DataFrame (or Series) method. In addition, only bracket notation can be used to create a new column. If you try and use attribute access to create a new column, you’ll create a new attribute, not a new column.

You can either use a single bracket or a double bracket. The single bracket will output a pandas Series, while a double bracket will output a pandas DataFrame. A pandas Series is a single vector of data (e.g., a NumPy array) with “an associated array of data labels, called its index.” A DataFrame also has an index.

In our example, the indices are an array of sequential integers, which is the default. You can find them in the left-most position, without a column label. Indices need not be a sequence of integers. They can, for example, be dates or strings. Note that indices do not need to be unique.

1.7 Rename, Index, and Slice

You may have noticed that the height column data is in inches. Let’s rename the column so that the units are included in the name.

# rename the height column
okcupid_df.rename(columns={'height' : 'height_inches'}, inplace=True)
# check to see that the change was implemented
okcupid_df.columns
Index(['age', 'status', 'sex', 'orientation', 'body_type', 'diet', 'drinks',
       'drugs', 'education', 'ethnicity', 'height_inches', 'income', 'job',
       'last_online', 'location', 'offspring', 'pets', 'religion', 'sign',
       'smokes', 'speaks'],
      dtype='object')

Indices, like column names, can be used to select data. Indices can be used to select particular rows. In fact, you can do something like .head() with slicing using the [] operator.

okcupid_df[:5]
age status sex orientation body_type diet drinks drugs education ethnicity ... income job last_online location offspring pets religion sign smokes speaks
0 22 single m straight a little extra strictly anything socially never working on college/university asian, white ... -1 transportation 2012-06-28-20-30 south san francisco, california doesn't have kids, but might want them likes dogs and likes cats agnosticism and very serious about it gemini sometimes english
1 35 single m straight average mostly other often sometimes working on space camp white ... 80000 hospitality / travel 2012-06-29-21-41 oakland, california doesn't have kids, but might want them likes dogs and likes cats agnosticism but not too serious about it cancer no english (fluently), spanish (poorly), french (...
2 38 available m straight thin anything socially NaN graduated from masters program NaN ... -1 NaN 2012-06-27-09-10 san francisco, california NaN has cats NaN pisces but it doesn’t matter no english, french, c++
3 23 single m straight thin vegetarian socially NaN working on college/university white ... 20000 student 2012-06-28-14-22 berkeley, california doesn't want kids likes cats NaN pisces no english, german (poorly)
4 29 single m straight athletic NaN socially never graduated from college/university asian, black, other ... -1 artistic / musical / writer 2012-06-27-21-26 san francisco, california NaN likes dogs and likes cats NaN aquarius no english

5 rows × 21 columns

Before we continue, let’s spend some more time looking at a few useful ways to index data—that is, select rows. .loc primarily works with string labels. It accepts a single label, a list (or array) of labels, or a slice of labels (e.g., ‘a’ : ‘f’).

# another way to get a single column...
okcupid_df.loc[:, 'sex']
0       m
1       m
2       m
3       m
4       m
       ..
9995    f
9996    f
9997    f
9998    f
9999    f
Name: sex, Length: 10000, dtype: object
# we can also index based on rows
okcupid_df[2:4]
age status sex orientation body_type diet drinks drugs education ethnicity ... income job last_online location offspring pets religion sign smokes speaks
2 38 available m straight thin anything socially NaN graduated from masters program NaN ... -1 NaN 2012-06-27-09-10 san francisco, california NaN has cats NaN pisces but it doesn’t matter no english, french, c++
3 23 single m straight thin vegetarian socially NaN working on college/university white ... 20000 student 2012-06-28-14-22 berkeley, california doesn't want kids likes cats NaN pisces no english, german (poorly)

2 rows × 21 columns

The difference is that the former returns a Series because we selected a single lable, while the latter returns a DataFrame because we selected a range of positions.

Another indexing option, .iloc, primarily works with integer positions. To select specific rows, we can do the following.

okcupid_df.iloc[[1, 5, 6, 9]]
age status sex orientation body_type diet drinks drugs education ethnicity ... income job last_online location offspring pets religion sign smokes speaks
1 35 single m straight average mostly other often sometimes working on space camp white ... 80000 hospitality / travel 2012-06-29-21-41 oakland, california doesn't have kids, but might want them likes dogs and likes cats agnosticism but not too serious about it cancer no english (fluently), spanish (poorly), french (...
5 29 single m straight average mostly anything socially NaN graduated from college/university white ... -1 computer / hardware / software 2012-06-29-19-18 san francisco, california doesn't have kids, but might want them likes cats atheism taurus no english (fluently), chinese (okay)
6 32 single f straight fit strictly anything socially never graduated from college/university white, other ... -1 NaN 2012-06-25-20-45 san francisco, california NaN likes dogs and likes cats NaN virgo NaN english
9 37 single m straight athletic mostly anything not at all never working on two-year college white ... -1 student 2012-06-28-21-08 san mateo, california NaN likes dogs and likes cats atheism and laughing about it cancer but it doesn’t matter no english (fluently)

4 rows × 21 columns

We can select a range of rows and specify the step value.

okcupid_df.iloc[25:50:5]
age status sex orientation body_type diet drinks drugs education ethnicity ... income job last_online location offspring pets religion sign smokes speaks
25 28 single m straight fit anything rarely never graduated from college/university asian, white ... -1 medicine / health 2012-06-26-01-27 atherton, california doesn't have kids, but wants them has dogs NaN gemini and it’s fun to think about no english
30 27 single f straight average anything socially NaN working on college/university white ... -1 other 2011-11-10-13-15 san francisco, california NaN NaN agnosticism gemini trying to quit english, spanish (poorly)
35 26 single m straight athletic NaN NaN never NaN white ... -1 banking / financial / real estate 2012-06-27-00-32 san francisco, california NaN NaN NaN NaN NaN english, russian
40 30 single m straight average NaN often never graduated from masters program NaN ... -1 computer / hardware / software 2012-06-29-22-56 menlo park, california doesn't have kids likes cats agnosticism NaN no english (fluently), dutch (fluently), lisp (fl...
45 27 single m straight average mostly anything socially never college/university pacific islander ... -1 education / academia 2012-06-28-18-09 oakland, california doesn't have kids NaN catholicism and somewhat serious about it aquarius NaN english (fluently), tagalog (fluently)

5 rows × 21 columns

1.7.1 We can also create a new column

Here we create a new column with height in cm

okcupid_df['height_cm'] = okcupid_df['height_inches']*2.54
okcupid_df['height_cm']
0       190.50
1       177.80
2       172.72
3       180.34
4       167.64
         ...  
9995    167.64
9996    149.86
9997    154.94
9998    162.56
9999    172.72
Name: height_cm, Length: 10000, dtype: float64

1.7.1.1 Another example

okcupid_df['speaks']
0                                                 english
1       english (fluently), spanish (poorly), french (...
2                                    english, french, c++
3                                english, german (poorly)
4                                                 english
                              ...                        
9995                                              english
9996                                     english, chinese
9997                                              english
9998                                              english
9999                                     english, spanish
Name: speaks, Length: 10000, dtype: object

Now we’ll create a column that has only the first language the person speaks.

okcupid_df['speaks first'] = okcupid_df['speaks'].dropna().apply(lambda x: x.split(',')[0])
okcupid_df['speaks first']
0                  english
1       english (fluently)
2                  english
3                  english
4                  english
               ...        
9995               english
9996               english
9997               english
9998               english
9999               english
Name: speaks first, Length: 10000, dtype: object

1.7.2 Slicing by condition

Lets say we want a df with only the females.

f_okcupid_df = okcupid_df.loc[okcupid_df['sex'] == 'f']
f_okcupid_df
age status sex orientation body_type diet drinks drugs education ethnicity ... last_online location offspring pets religion sign smokes speaks height_cm speaks first
6 32 single f straight fit strictly anything socially never graduated from college/university white, other ... 2012-06-25-20-45 san francisco, california NaN likes dogs and likes cats NaN virgo NaN english 165.10 english
7 31 single f straight average mostly anything socially never graduated from college/university white ... 2012-06-29-12-30 san francisco, california doesn't have kids, but wants them likes dogs and likes cats christianity sagittarius no english, spanish (okay) 165.10 english
8 24 single f straight NaN strictly anything socially NaN graduated from college/university white ... 2012-06-29-23-39 belvedere tiburon, california doesn't have kids likes dogs and likes cats christianity but not too serious about it gemini but it doesn’t matter when drinking english 170.18 english
13 30 single f straight skinny mostly anything socially never graduated from high school white ... 2012-06-13-16-06 san francisco, california NaN has dogs and likes cats christianity but not too serious about it NaN no english 167.64 english
14 29 single f straight thin mostly anything socially never working on college/university hispanic / latin, white ... 2012-06-29-08-55 san leandro, california doesn't have kids, but wants them likes dogs and has cats catholicism taurus no english 157.48 english
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9995 24 single f straight curvy anything socially never graduated from two-year college black ... 2012-01-16-23-10 burlingame, california has a kid likes dogs and likes cats christianity and very serious about it libra but it doesn’t matter when drinking english 167.64 english
9996 24 single f straight skinny NaN socially never graduated from college/university asian ... 2012-06-30-02-24 san francisco, california doesn't have kids NaN NaN virgo no english, chinese 149.86 english
9997 19 seeing someone f bisexual athletic NaN rarely never NaN pacific islander, white, other ... 2012-03-08-13-08 martinez, california NaN has dogs and has cats NaN aquarius but it doesn’t matter sometimes english 154.94 english
9998 47 single f straight a little extra NaN rarely never NaN black ... 2012-06-30-06-12 emeryville, california NaN likes dogs and likes cats other scorpio and it’s fun to think about no english 162.56 english
9999 29 single f straight NaN mostly vegan socially NaN graduated from college/university white ... 2012-06-03-23-30 san francisco, california doesn't have kids, but might want them NaN NaN leo no english, spanish 172.72 english

4056 rows × 23 columns

1.8 Summary statistics

1.8.1 A useful method that generates various summary statistics is .describe().

# get summary statistics
okcupid_df.describe()
age height_inches income height_cm
count 10000.000000 10000.000000 10000.000000 10000.000000
mean 32.073500 68.331200 19307.189900 173.561248
std 9.444025 3.908482 93842.703841 9.927545
min 18.000000 36.000000 -1.000000 91.440000
25% 25.000000 66.000000 -1.000000 167.640000
50% 30.000000 68.000000 -1.000000 172.720000
75% 36.000000 71.000000 -1.000000 180.340000
max 110.000000 95.000000 1000000.000000 241.300000

The summary statistics are based on non-missing values and count reflects that. The values depend on what it’s called on. If the DataFrame includes both numeric and object (e.g., strings) dtypes, it will default to summarizing the numeric data. If .describe() is called on strings, for example, it will return the count, number of unique values, and the most frequent value along with its count.

# describe can be used on a single column
okcupid_df['income'].describe()
count      10000.000000
mean       19307.189900
std        93842.703841
min           -1.000000
25%           -1.000000
50%           -1.000000
75%           -1.000000
max      1000000.000000
Name: income, dtype: float64
# and on non-numerical data  
okcupid_df['sex'].describe()
count     10000
unique        2
top           m
freq       5944
Name: sex, dtype: object

1.8.2 Unique values

We can find out how the unique values in a column

okcupid_df['drinks'].unique()
array(['socially', 'often', 'not at all', 'rarely', nan, 'very often',
       'desperately'], dtype=object)

1.9 Manipulating Data

1.9.1 Let’s seperate the data based on sex

# male users
males=okcupid_df.loc[okcupid_df["sex"]=="m"] 
n_males = len(males)

# female users
females=okcupid_df.loc[okcupid_df["sex"]=="f"] 
n_females = len(females)

n_population = len(okcupid_df)

# how many users are male and how many are female?
print(f"{n_males} males ({n_males/n_population:.1%}), {n_females} females ({n_females/n_population})")
5944 males (59.4%), 4056 females (0.4056)

Note that above I set the male percentage for 1 decimal point with :.1 and for the female I didn’t

1.9.2 Let’s examine the age distribution

# summary statistics on the 'age' column
okcupid_df['age'].describe()
count    10000.000000
mean        32.073500
std          9.444025
min         18.000000
25%         25.000000
50%         30.000000
75%         36.000000
max        110.000000
Name: age, dtype: float64
# how many users are older than 80?
(okcupid_df["age"]>80).sum()
1
# can also be written this way
sum(okcupid_df['age']>80)
1

1.9.3 Who are the age outliers?

okcupid_df[okcupid_df["age"]>80]
age status sex orientation body_type diet drinks drugs education ethnicity ... last_online location offspring pets religion sign smokes speaks height_cm speaks first
2512 110 single f straight NaN NaN NaN NaN NaN NaN ... 2012-06-27-22-16 daly city, california NaN NaN NaN NaN NaN english 170.18 english

1 rows × 23 columns

Let’s assume the 110-year-old lady and the athletic 109-year-old gentleman (who’s working on a masters program) are outliers: we get rid of them so the following plots look better. They didn’t say much else about themselves, anyway.

# eliminate users with age >= 80
okcupid_df=okcupid_df[okcupid_df["age"]<=80] 

Is the age distribution for men different than that for women?

print("Mean and median age for males:   {:.2f}, {:.2f}".format(males["age"].mean(),males["age"].median()))
print("Mean and median age for females: {:.2f}, {:.2f}".format(females["age"].mean(),females["age"].median()))
Mean and median age for males:   31.75, 30.00
Mean and median age for females: 32.55, 30.00

1.9.4 We can also plot the data (don’t worry if you don’t understand the code)

# add plotting libraries to notebook
%matplotlib inline
%config InlineBackend.figure_format='svg'
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set_style("ticks")
sns.set_context(context="notebook",font_scale=1)
# draw age histograms for male and female users
fig,(ax1,ax2) = plt.subplots(ncols=2,figsize=(10,3),sharey=True,sharex=True)
sns.histplot(males["age"], ax=ax1,
             bins=range(okcupid_df["age"].min(),okcupid_df["age"].max()),
             kde=False,
             color="b")
ax1.set_title("Age distribution for males")
sns.histplot(females["age"], ax=ax2,
             bins=range(okcupid_df["age"].min(),okcupid_df["age"].max()),
             kde=False,
             color="r")
ax2.set_title("Age distribution for females")
ax1.set_ylabel("Number of users in age group")
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()

1.9.5 Now let’s study height distribution and compare with official data from the US Centers of Disease Control and Prevention

fig,(ax,ax2) = plt.subplots(nrows=2,sharex=True,figsize=(6,6),gridspec_kw={'height_ratios':[2,1]})

# Plot histograms of height
bins=range(55,80)
sns.histplot(males["height_inches"].dropna(), ax=ax,
             bins=bins,
             kde=False,
             color="b",
             label="males")
sns.histplot(females["height_inches"].dropna(), ax=ax,
             bins=bins,
             kde=False,
             color="r",
             label="females")
ax.legend(loc="upper left")
ax.set_xlabel("")
ax.set_ylabel("Number of users with given height")
ax.set_title("height distribution of male and female users");

# Make aligned boxplots
sns.boxplot(data=okcupid_df,y="sex",x="height_inches",orient="h",ax=ax2,palette={"m":"b","f":"r"})
plt.setp(ax2.artists, alpha=.5)
ax2.set_xlim([min(bins),max(bins)])
ax2.set_xlabel("Self-reported height [inches]")

sns.despine(ax=ax)
fig.tight_layout()
/var/folders/wn/2bz1970d2w5182zy7h96yfcc0000gn/T/ipykernel_35920/2199488932.py:21: FutureWarning: 

Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `y` variable to `hue` and set `legend=False` for the same effect.

  sns.boxplot(data=okcupid_df,y="sex",x="height_inches",orient="h",ax=ax2,palette={"m":"b","f":"r"})

Males are obviously taller than females, and the two distributions make sense.

1.9.6 How do OkCupid user’s heights compare to the general population?

How does this compare with general population data? Are OkCupid users maybe cheating and overreporting their height?

The CDC publishes growth charts, which contain height data for the general US population. The dataset reports statistics (3rd, 5th, 10th, 25th, 50th, 75th, 90th, 95th, 97th percentiles) for stature for different ages from 2 to 20 years. This (and more) data is plotted by the CDC in these beautiful charts (https://www.cdc.gov/growthcharts/data/set2clinical/set2color.pdf).

# Import a CSV file for growth chart data
cdc=pd.read_csv("https://www.cdc.gov/growthcharts/data/zscore/statage.csv")

# print first ten rows of data
cdc.head(10)
Sex Agemos L M S P3 P5 P10 P25 P50 P75 P90 P95 P97
0 1 24.0 0.941524 86.452201 0.040322 79.910844 80.729773 81.991714 84.102892 86.452201 88.805249 90.926191 92.196879 93.022654
1 1 24.5 1.007208 86.861609 0.040396 80.260371 81.088685 82.364010 84.494706 86.861609 89.228048 91.357530 92.631767 93.459230
2 1 25.5 0.837251 87.652473 0.040578 81.005294 81.834452 83.113871 85.258877 87.652473 90.056755 92.229661 93.534066 94.382780
3 1 26.5 0.681493 88.423264 0.040723 81.734157 82.564061 83.847162 86.005173 88.423264 90.862604 93.076082 94.408849 95.277617
4 1 27.5 0.538780 89.175492 0.040833 82.448456 83.278986 84.565344 86.735069 89.175492 91.647114 93.898271 95.257541 96.145118
5 1 28.5 0.407697 89.910409 0.040909 83.149450 83.980453 85.269620 87.449772 89.910409 92.411590 94.697570 96.081488 96.986625
6 1 29.5 0.286762 90.629078 0.040952 83.838194 84.669484 85.960983 88.150284 90.629078 93.157190 95.475221 96.881980 97.803453
7 1 30.5 0.174489 91.332424 0.040965 84.515583 85.346943 86.640272 88.837454 91.332424 93.884956 96.232394 97.660267 98.596905
8 1 31.5 0.069445 92.021272 0.040950 85.182380 86.013566 87.308201 89.512019 92.021272 94.595854 96.970215 98.417579 99.368283
9 1 32.5 -0.029721 92.696379 0.040909 85.839250 86.669993 87.965401 90.174637 92.696379 95.290799 97.689781 99.155138 100.118893
# Adjust the data to fit our format
cdc["Age"]=cdc["Agemos"]/12 # convert age in months to age in fractional years
cdc["Sex"]=cdc["Sex"].replace({1:"male",2:"female"}) # align to our convention

# group the data into percentiles
percentiles=[3,5,10,25,50,75,90,95,97]
percentile_columns=["P"+str(p) for p in percentiles] # names of percentile columns
cdc[percentile_columns]=cdc[percentile_columns]*0.393701 # convert percentile columns from centimeters to inches (ugh)
cdc20=cdc[cdc["Age"]==20].set_index("Sex") # Select the two rows corresponding to 20-year-olds (males and females)

print("Height Percentiles for 20-year-old US population [inches]")
cdc20[percentile_columns]
Height Percentiles for 20-year-old US population [inches]
P3 P5 P10 P25 P50 P75 P90 P95 P97
Sex
male 64.304496 64.975961 66.007627 67.725520 69.625720 71.517283 73.212615 74.224061 74.879691
female 59.492618 60.098742 61.030770 62.584736 64.306433 66.023163 67.564157 68.484561 69.081584

1.9.7 Let’s compare the stats for reported heights of the OkCupid 20-year-olds to the CDC stats for 20-year-olds.

mheights=males.loc[males["age"]==20,"height_inches"] # heights of 20-year-old males
fheights=females.loc[females["age"]==20,"height_inches"] # heights of 20-year-old females

# To smooth the computation of percentiles, jitter height data by adding
# uniformly distributed noise in the range [-0.5,+0.5]
mheightsj=mheights+np.random.uniform(low=-0.5,high=+0.5,size=(len(mheights),))
fheightsj=fheights+np.random.uniform(low=-0.5,high=+0.5,size=(len(fheights),))

# For each of the available percentiles in CDC data, compute the corresponding percentile from our 20-year-old users
stats=[]
for percentile,percentile_column in zip(percentiles,percentile_columns):
    stats.append({"sex":"male",
                  "percentile":percentile,
                  "CDC":cdc20.loc["male",percentile_column],
                  "OkCupid":mheightsj.quantile(percentile/100)})
    stats.append({"sex":"female",
                  "percentile":percentile,
                  "CDC":cdc20.loc["female",percentile_column],
                  "OkCupid":fheightsj.quantile(percentile/100)})
stats=pd.DataFrame(stats).set_index(["sex","percentile"]).sort_index()

# For each percentile, compute the gap between users and CDC
stats["gap"]=stats["OkCupid"]-stats["CDC"]
# plot data
fig,(ax1,ax2)=plt.subplots(ncols=2,sharex=True,figsize=(10,4))
stats.loc["male"][["OkCupid","CDC"]].plot.bar(ax=ax1,color=[[0.5,0.5,1],"k"],alpha=1,width=0.8,rot=0)
stats.loc["female"][["OkCupid","CDC"]].plot.bar(ax=ax2,color=[[1,0.5,0.5],"k"],alpha=1,width=0.8,rot=0)
ax1.set_ylim([64,77])
ax2.set_ylim([58,71])
ax1.set_ylabel("Height [inches]")
ax2.set_ylabel("Height [inches]")
ax1.set_title("Height percentiles in 20y-old male users vs CDC data")
ax2.set_title("Height percentiles in 20y-old female users vs CDC data")
for ax in (ax1,ax2):
    sns.despine(ax=ax)
fig.tight_layout()

The height statistics of our user population matches remarkably well with CDC data. It looks like the OkCupid users, males and females alike, may be slightly over-reporting their height (still, much less than one could expect), but there could be many other reasonable explanations for this gap. For example, the San-Francisco population may be taller than the general US population. This might be a starting point to further investigate the issue.

1.10 More Pandas Resources

  • Lectures on Pandas from UC Berekely https://ds100.org/fa20/lecture/lec05/

  • Getting started with Pandas https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html#getting-started

  • 10 minute guide to Pandas https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

  • The Pandas Cookbook. This provides a nice overview of some of the basic Pandas functions. However, it is slightly out of date. https://nbviewer.jupyter.org/github/jvns/pandas-cookbook/tree/master/cookbook/

  • Learn Pandas. A set of lessons providing an overview of the Pandas library. https://bitbucket.org/hrojas/learn-pandas/src/master/ Python for Data Science Another set of notebook demonstrating Pandas functionality.