Subsets of DataFrames#

Let’s load the DataFrames that we have been working with.

import pandas as pd

small_url = 'https://gist.githubusercontent.com/van-alfen/bca77fd9f4c0fcf74308bc7cfa12a8a7/raw/030a8fea2ddfd1b8fd257d5a601b15a2942c78d7/small_data.csv'
small_df = pd.read_csv(small_url)
small_df.set_index('ticker', inplace=True)

big_url = 'https://gist.githubusercontent.com/van-alfen/08de23cbed187502c0dcc40b4c446da6/raw/ef55891239532995e4fa648bd97f407260b1953f/big_data.csv'
big_df = pd.read_csv(big_url)
big_df.set_index( ['ticker', 'year'], inplace=True)

Selecting a single cell/value#

small_df
price name assets_current liab_current
ticker
AAPL 137 Apple Inc 162.82 105.72
MSFT 229 Microsoft Corp 181.92 72.31
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

If you want to get a single value from a dataframe, use the .at() method.

the_name = small_df.at['AAPL', 'name']
the_name
'Apple Inc'

We can also use .at() to set a new value in a specific cell. For example, if you want to change the price of MSFT to $200.

small_df.at['MSFT', 'price'] = 200
small_df
price name assets_current liab_current
ticker
AAPL 137 Apple Inc 162.82 105.72
MSFT 200 Microsoft Corp 181.92 72.31
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

How to do it with a multi index:

big_df.head(10)
assets_current liab_current price
ticker year
AAPL 2013 73286.0 43658.0 476.75
2014 68531.0 63448.0 100.75
AET 2013 9718.9 12602.9 68.59
2014 11764.0 15356.5 88.83
BA 2013 65074.0 51486.0 136.49
2014 67785.0 56717.0 129.98
BUD 2013 18690.0 25627.0 106.46
2014 18541.0 27208.0 112.32
CAT 2013 38335.0 27297.0 90.81
2014 38867.0 27877.0 91.53

Need to use a tuple as the first argument.

my_price = big_df.at[ ('BUD', 2014), 'price']
my_price
112.32

Selecting columns#

We can select a single column by “indexing” that column. This is very similar to how we referenced items in dictionaries.

small_df['name']
ticker
AAPL         Apple Inc
MSFT    Microsoft Corp
GOOG      Alphabet Inc
AMZN    Amazon.com Inc
Name: name, dtype: object

Since we only grabbed one column, it returns a pandas series, which is basically a list. We could also get multiple columns at the same time by indexing a list of columns and storing the result in fewer_cols.

col_list = ['name', 'price']
fewer_cols = small_df[col_list]
fewer_cols
name price
ticker
AAPL Apple Inc 137
MSFT Microsoft Corp 200
GOOG Alphabet Inc 1677
AMZN Amazon.com Inc 3535

Selecting rows#

The loc method#

Starting with the small_df DataFrame.

small_df
price name assets_current liab_current
ticker
AAPL 137 Apple Inc 162.82 105.72
MSFT 200 Microsoft Corp 181.92 72.31
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

Basic usage#

Suppose you just want the Amazon data. It is very easy to do this with the .loc method. Notice that with this function we use square brackets.

Side Note:

Anytime we select a single row (or a single column), we create something called a pandas Series. It is not a pandas DataFrame. It is more like a list.

amzn_data = small_df.loc['AMZN']
amzn_data
price                       3535
name              Amazon.com Inc
assets_current             96.33
liab_current               87.81
Name: AMZN, dtype: object

Rather that just getting AMZN, let’s grab MSFT and GOOG.

firm_list = ['MSFT', 'GOOG']
two_firms = small_df.loc[firm_list]
two_firms
price name assets_current liab_current
ticker
MSFT 200 Microsoft Corp 181.92 72.31
GOOG 1677 Alphabet Inc 152.58 45.22

Selecting ranges#

We can select a range of index values by using the colon :
For example if we want everything from MSFT to AMZN.

small_df.loc['MSFT':'AMZN']
price name assets_current liab_current
ticker
MSFT 200 Microsoft Corp 181.92 72.31
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

Columns with .loc#

We can also pass a second argument to .loc to specify the columns we want.

firm_list = ['MSFT', 'GOOG']
col_list = ['assets_current', 'liab_current']

two_firms = small_df.loc[ firm_list, col_list ]
two_firms
assets_current liab_current
ticker
MSFT 181.92 72.31
GOOG 152.58 45.22
col_list = ['assets_current', 'liab_current']

two_firms = small_df.loc[ 'AAPL':'GOOG', col_list ]
two_firms
assets_current liab_current
ticker
AAPL 162.82 105.72
MSFT 181.92 72.31
GOOG 152.58 45.22

We can use .loc to select rows using the index. To select rows using columns that are not the index, see the Filters section below.

.loc with multi index#

For a DataFrame with a multi index, the process is similar. We can still select rows for AAPL since that is the first level of the index:

big_df.head()
assets_current liab_current price
ticker year
AAPL 2013 73286.0 43658.0 476.75
2014 68531.0 63448.0 100.75
AET 2013 9718.9 12602.9 68.59
2014 11764.0 15356.5 88.83
BA 2013 65074.0 51486.0 136.49
big_df.loc['AAPL']
assets_current liab_current price
year
2013 73286.0 43658.0 476.75
2014 68531.0 63448.0 100.75

Like we did earlier, we can store this AAPL subset in a new DataFrame if we want to use it later:

aapl_df = big_df.loc['AAPL']
aapl_df
assets_current liab_current price
year
2013 73286.0 43658.0 476.75
2014 68531.0 63448.0 100.75

We can also use ranges the same way.

big_df.loc[ 'BA':'DAL' ]
assets_current liab_current price
ticker year
BA 2013 65074.000 51486.000 136.49
2014 67785.000 56717.000 129.98
BUD 2013 18690.000 25627.000 106.46
2014 18541.000 27208.000 112.32
CAT 2013 38335.000 27297.000 90.81
2014 38867.000 27877.000 91.53
COKE 2013 239.215 208.816 73.19
2014 287.001 227.420 88.03
CVX 2013 50250.000 33018.000 124.91
2014 42232.000 31926.000 112.18
DAL 2013 9651.000 14152.000 27.47
2014 12465.000 16879.000 49.19

But if we want to use deeper index levels in a multi index — e.g., data for AAPL in 2013 — then we can pass a tuple to .loc.

specific_row = ('AAPL', 2013)
big_df.loc[specific_row]
assets_current    73286.00
liab_current      43658.00
price               476.75
Name: (AAPL, 2013), dtype: float64

We can also pass a tuple of lists for multiple rows:

firm_list = ['BA', 'GM']
tuple_with_list = ( firm_list, 2013 )
big_df.loc[ tuple_with_list, : ]
assets_current liab_current price
ticker year
BA 2013 65074.0 51486.0 136.49
GM 2013 81501.0 62412.0 40.87

Notice the difference between a list of tuples and a tuple of lists:

list_of_tuples = [
    ('BA', 2013),
    ('GM', 2014),
]

big_df.loc[list_of_tuples, :]
assets_current liab_current price
ticker year
BA 2013 65074.0 51486.0 136.49
GM 2014 83670.0 65701.0 34.91
tuple_of_lists = (
    ['BA', 'GM'],
    [2013, 2014],
)

big_df.loc[tuple_of_lists, :]
assets_current liab_current price
ticker year
BA 2013 65074.0 51486.0 136.49
2014 67785.0 56717.0 129.98
GM 2013 81501.0 62412.0 40.87
2014 83670.0 65701.0 34.91

But if we want to reference only the second index — year — we need the IndexSlice.

# we want all tickers, but just the year 2014
just_2014 = pd.IndexSlice[ :, 2014 ]

# pass this "slice" to .loc[], and also specify the columns
big_df.loc[ just_2014, : ]
assets_current liab_current price
ticker year
AAPL 2014 68531.000 63448.000 100.75
AET 2014 11764.000 15356.500 88.83
BA 2014 67785.000 56717.000 129.98
BUD 2014 18541.000 27208.000 112.32
CAT 2014 38867.000 27877.000 91.53
COKE 2014 287.001 227.420 88.03
CVX 2014 42232.000 31926.000 112.18
DAL 2014 12465.000 16879.000 49.19
DIS 2014 15176.000 13292.000 89.03
DOW 2014 24267.000 11593.000 45.61
DTV 2014 8819.000 6959.000 86.70
GM 2014 83670.000 65701.000 34.91
HOG 2014 3948.095 2389.286 65.91
IBM 2014 49422.000 39600.000 160.44
LMT 2014 12329.000 11112.000 192.57
LUV 2014 4404.000 5923.000 42.32
MSFT 2014 114246.000 45625.000 41.70
PFE 2014 57702.000 21631.000 31.15
TAP 2014 1578.900 2325.300 74.52
TWC 2014 2316.000 4497.000 152.06
VZ 2014 29623.000 28064.000 46.78

Important

When using multi indexes with .loc, you should always include the columns argument (even if you want to include all of the columns).

For example, do this:
big_df.loc[ ('AAPL', 2013), : ]
Instead of this:
big_df.loc[ ('AAPL', 2013) ]

Filters#

The .loc method is great for selecting rows using the DataFrame’s index. But in order to select rows based on values in the other columns (using conditional statements), we should use filters.

small_df
price name assets_current liab_current
ticker
AAPL 137 Apple Inc 162.82 105.72
MSFT 200 Microsoft Corp 181.92 72.31
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

This is what a filter returns on its own. Notice, it is essentially just a list of True/False values.

price_filter = ( small_df['price'] > 500 )
price_filter
ticker
AAPL    False
MSFT    False
GOOG     True
AMZN     True
Name: price, dtype: bool

Rule of Thumb

When using conditionals with pandas DataFrames, it is best practice to always wrap conditional statements in parentheses.

Now let us index our DataFrame using that filter:

small_df[price_filter]
price name assets_current liab_current
ticker
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

.contains()#

Here is another filter that checks whether a string is contained in a certain column:

fancy_filter = ( small_df['name'].str.contains('Inc') )
small_df[fancy_filter]
price name assets_current liab_current
ticker
AAPL 137 Apple Inc 162.82 105.72
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

Here is another filter that combines multiple filters.

fancy_filter2 = ( ( small_df['name'].str.contains('Inc') ) & (small_df['price'] > 1000) )
small_df[fancy_filter2]
price name assets_current liab_current
ticker
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

Note

See here for a discussion on & versus and for logical expressions in pandas.

TLDR; Use & in pandas.

Suppose we have the following DataFrame which contains student data:

Note

None is a way of storing a missing or empty value in python.

student_data = {
  'student': ['981', '428', '762', '304', '698', '170'],
  'quiz1': [90, None, 80, 85, 75, 95],
  'major': ['FIN', 'FIN', 'ACT', None, 'FIN', 'MGMT']
}

student_df = pd.DataFrame(student_data)
student_df.set_index('student', inplace=True)
student_df
quiz1 major
student
981 90.0 FIN
428 NaN FIN
762 80.0 ACT
304 85.0 None
698 75.0 FIN
170 95.0 MGMT

.between()#

If we want to select only the students that got 90-100 on the quiz, we can use the .between() method.

grade_filter = ( student_df['quiz1'].between(90, 100) )
student_df[grade_filter]
quiz1 major
student
981 90.0 FIN
170 95.0 MGMT

.isin()#

If we want only the students that are finance or accounting majors, we can use the .isin() method.

major_list = ['FIN', 'ACT']
major_filter = ( student_df['major'].isin(major_list) )
student_df[major_filter]
quiz1 major
student
981 90.0 FIN
428 NaN FIN
762 80.0 ACT
698 75.0 FIN

.isna()#

If we want only the students for whom we do not have data about their major, we can use the .isna() method:

missing_filter1 = ( student_df['major'].isna() )
student_df[missing_filter1]
quiz1 major
student
304 85.0 None

.notna()#

If we want only the students that are not missing data for their major or their quiz score, we can use the .notna() method:

missing_filter2 = ( student_df['major'].notna() & student_df['quiz1'].notna() )
student_df[missing_filter2]
quiz1 major
student
981 90.0 FIN
762 80.0 ACT
698 75.0 FIN
170 95.0 MGMT

The query method#

Another way to select rows is the .query() method. In many situations it is easier to implement than filters or .loc.

small_df
price name assets_current liab_current
ticker
AAPL 137 Apple Inc 162.82 105.72
MSFT 200 Microsoft Corp 181.92 72.31
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81
small_df.query('name == "Apple Inc"')
price name assets_current liab_current
ticker
AAPL 137 Apple Inc 162.82 105.72
small_df.query('price > 1000')
price name assets_current liab_current
ticker
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

.query() also works with named indexes.

small_df.query('index == "GOOG"')
price name assets_current liab_current
ticker
GOOG 1677 Alphabet Inc 152.58 45.22

Or similarly, using the name of the index instead of index:

small_df.query('ticker == "GOOG"')
price name assets_current liab_current
ticker
GOOG 1677 Alphabet Inc 152.58 45.22
student_df
quiz1 major
student
981 90.0 FIN
428 NaN FIN
762 80.0 ACT
304 85.0 None
698 75.0 FIN
170 95.0 MGMT

If we want to use functionality like .isin() or .isna() inside .query(), then we need to include one more option — engine='python'.

student_df.query("student.isin( ['428', '698'] )", engine='python')
quiz1 major
student
428 NaN FIN
698 75.0 FIN
student_df.query('major.isna()', engine='python')
quiz1 major
student
304 85.0 None

It is also possible to reference other/external variables in the query statement by using the @ character.

best_major = 'FIN'

student_df.query('major == @best_major')
quiz1 major
student
981 90.0 FIN
428 NaN FIN
698 75.0 FIN

A final example#

Here is an animal example (from the pandas documentation) of a DataFrame with three levels to the multi index. How would we reference the various elements?

d = {
  'num_legs': [4, 4, 2, 2, 2],
    'num_wings': [0, 0, 2, 2, 2],
    'category': ['mammal', 'mammal', 'mammal', 'bird', 'bird'],
    'animal': ['cat', 'dog', 'bat', 'penguin', 'robin'],
    'locomotion': ['walks', 'walks', 'flies', 'walks', 'flies'],
    'pet': [True, True, False, False, False],
}

df = pd.DataFrame(d)
df.set_index( ['category', 'animal', 'locomotion'], inplace=True )
df
num_legs num_wings pet
category animal locomotion
mammal cat walks 4 0 True
dog walks 4 0 True
bat flies 2 2 False
bird penguin walks 2 2 False
robin flies 2 2 False

Using .loc#

If we want just the mammals:

df.loc['mammal']
num_legs num_wings pet
animal locomotion
cat walks 4 0 True
dog walks 4 0 True
bat flies 2 2 False

If we want just the animals that fly:

fly_slice = pd.IndexSlice[ :, :, 'flies' ]
df.loc[ fly_slice, :]
num_legs num_wings pet
category animal locomotion
mammal bat flies 2 2 False
bird robin flies 2 2 False

If we want just the dog and penguin:

animal_list = ['dog', 'penguin']
animal_slice = pd.IndexSlice[ :, animal_list ]
df.loc[ animal_slice, : ]
num_legs num_wings pet
category animal locomotion
mammal dog walks 4 0 True
bird penguin walks 2 2 False

Or, in one line (even though it is harder to read):

df.loc[ pd.IndexSlice[:, ['dog', 'penguin'] ], : ]
num_legs num_wings pet
category animal locomotion
mammal dog walks 4 0 True
bird penguin walks 2 2 False

If we want mammals that fly:

fly_mam_slice = pd.IndexSlice[ 'mammal', :, 'flies' ]
df.loc[ fly_mam_slice, : ]
num_legs num_wings pet
category animal locomotion
mammal bat flies 2 2 False

Using .query#

See below for how to do the same things with the .query method.

df.query('category == "mammal"')
num_legs num_wings pet
category animal locomotion
mammal cat walks 4 0 True
dog walks 4 0 True
bat flies 2 2 False
df.query('locomotion == "flies"')
num_legs num_wings pet
category animal locomotion
mammal bat flies 2 2 False
bird robin flies 2 2 False
df.query('animal.isin(["dog", "penguin"])', engine='python')
num_legs num_wings pet
category animal locomotion
mammal dog walks 4 0 True
bird penguin walks 2 2 False
df.query('category=="mammal" & locomotion=="flies"')
num_legs num_wings pet
category animal locomotion
mammal bat flies 2 2 False