Subsets of DataFrames#
Let’s load the DataFrames that we have been working with.
import pandas as pd
small_url = 'https://gist.githubusercontent.com/van-alfen/bca77fd9f4c0fcf74308bc7cfa12a8a7/raw/030a8fea2ddfd1b8fd257d5a601b15a2942c78d7/small_data.csv'
small_df = pd.read_csv(small_url)
small_df.set_index('ticker', inplace=True)
big_url = 'https://gist.githubusercontent.com/van-alfen/08de23cbed187502c0dcc40b4c446da6/raw/ef55891239532995e4fa648bd97f407260b1953f/big_data.csv'
big_df = pd.read_csv(big_url)
big_df.set_index( ['ticker', 'year'], inplace=True)
Selecting a single cell/value#
small_df
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
MSFT | 229 | Microsoft Corp | 181.92 | 72.31 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
If you want to get a single value from a DataFrame, use .at. Notice that it uses square brackets, not parentheses.
the_name = small_df.at['AAPL', 'name']
the_name
'Apple Inc'
We can also use .at to set a new value in a specific cell. For example, suppose you want to change the price of MSFT to $200.
small_df.at['MSFT', 'price'] = 200
small_df
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
MSFT | 200 | Microsoft Corp | 181.92 | 72.31 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
Here is how to do the same thing with a multi index:
big_df.head(10)
assets_current | liab_current | price | ||
---|---|---|---|---|
ticker | year | |||
AAPL | 2013 | 73286.0 | 43658.0 | 476.75 |
2014 | 68531.0 | 63448.0 | 100.75 | |
AET | 2013 | 9718.9 | 12602.9 | 68.59 |
2014 | 11764.0 | 15356.5 | 88.83 | |
BA | 2013 | 65074.0 | 51486.0 | 136.49 |
2014 | 67785.0 | 56717.0 | 129.98 | |
BUD | 2013 | 18690.0 | 25627.0 | 106.46 |
2014 | 18541.0 | 27208.0 | 112.32 | |
CAT | 2013 | 38335.0 | 27297.0 | 90.81 |
2014 | 38867.0 | 27877.0 | 91.53 |
With a multi index, we need to pass a tuple as the first argument.
my_price = big_df.at[ ('BUD', 2014), 'price']
my_price
112.32
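As with a single index, .at can also set a value when the row key is a tuple. Here is a minimal sketch; the new price and the temp_df copy are just for illustration:
# work on a copy so big_df itself is unchanged
temp_df = big_df.copy()
temp_df.at[ ('BUD', 2014), 'price' ] = 120.00
temp_df.at[ ('BUD', 2014), 'price' ]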
Selecting columns#
We can select a single column by “indexing” that column. This is very similar to how we referenced items in dictionaries.
small_df['name']
ticker
AAPL Apple Inc
MSFT Microsoft Corp
GOOG Alphabet Inc
AMZN Amazon.com Inc
Name: name, dtype: object
Since we only grabbed one column, it returns a pandas Series, which is basically a list. We can also get multiple columns at the same time by indexing with a list of columns and storing the result in fewer_cols.
col_list = ['name', 'price']
fewer_cols = small_df[col_list]
fewer_cols
name | price | |
---|---|---|
ticker | ||
AAPL | Apple Inc | 137 |
MSFT | Microsoft Corp | 200 |
GOOG | Alphabet Inc | 1677 |
AMZN | Amazon.com Inc | 3535 |
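As an aside, indexing with a bare column name returns a Series, while indexing with a list (even a one-item list) returns a DataFrame. A quick sketch:
type(small_df['name'])      # pandas.core.series.Series
type(small_df[['name']])    # pandas.core.frame.DataFrame, with one column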
Selecting rows#
The .loc method#
Starting with the small_df DataFrame.
small_df
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
MSFT | 200 | Microsoft Corp | 181.92 | 72.31 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
Basic usage#
Suppose you just want the Amazon data. It is very easy to do this with the .loc method. Notice that .loc uses square brackets, not parentheses.
Side Note:
Anytime we select a single row (or a single column), we create something called a pandas Series. It is not a pandas DataFrame. It is more like a list.
amzn_data = small_df.loc['AMZN']
amzn_data
price 3535
name Amazon.com Inc
assets_current 96.33
liab_current 87.81
Name: AMZN, dtype: object
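Once we have this Series, we can pull individual values out of it by label, much like a dictionary. A quick sketch:
amzn_data['price']    # 3535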
Rather than just getting AMZN, let’s grab MSFT and GOOG.
firm_list = ['MSFT', 'GOOG']
two_firms = small_df.loc[firm_list]
two_firms
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
MSFT | 200 | Microsoft Corp | 181.92 | 72.31 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
Selecting ranges#
We can select a range of index values by using the colon (:). Unlike regular Python slicing, a .loc range includes both endpoints. For example, if we want everything from MSFT to AMZN:
small_df.loc['MSFT':'AMZN']
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
MSFT | 200 | Microsoft Corp | 181.92 | 72.31 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
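Either endpoint of the range can also be left open. A small sketch:
small_df.loc[:'GOOG']    # everything from the first row through GOOG
small_df.loc['MSFT':]    # everything from MSFT through the last row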
Columns with .loc#
We can also pass a second argument to .loc to specify the columns we want.
firm_list = ['MSFT', 'GOOG']
col_list = ['assets_current', 'liab_current']
two_firms = small_df.loc[ firm_list, col_list ]
two_firms
assets_current | liab_current | |
---|---|---|
ticker | ||
MSFT | 181.92 | 72.31 |
GOOG | 152.58 | 45.22 |
col_list = ['assets_current', 'liab_current']
two_firms = small_df.loc[ 'AAPL':'GOOG', col_list ]
two_firms
assets_current | liab_current | |
---|---|---|
ticker | ||
AAPL | 162.82 | 105.72 |
MSFT | 181.92 | 72.31 |
GOOG | 152.58 | 45.22 |
We can use .loc to select rows using the index. To select rows using columns that are not the index, see the Filters section below.
.loc with multi index#
For a DataFrame with a multi index, the process is similar. We can still select rows for AAPL since that is the first level of the index:
big_df.head()
assets_current | liab_current | price | ||
---|---|---|---|---|
ticker | year | |||
AAPL | 2013 | 73286.0 | 43658.0 | 476.75 |
2014 | 68531.0 | 63448.0 | 100.75 | |
AET | 2013 | 9718.9 | 12602.9 | 68.59 |
2014 | 11764.0 | 15356.5 | 88.83 | |
BA | 2013 | 65074.0 | 51486.0 | 136.49 |
big_df.loc['AAPL']
assets_current | liab_current | price | |
---|---|---|---|
year | |||
2013 | 73286.0 | 43658.0 | 476.75 |
2014 | 68531.0 | 63448.0 | 100.75 |
Like we did earlier, we can store this AAPL subset in a new DataFrame if we want to use it later:
aapl_df = big_df.loc['AAPL']
aapl_df
assets_current | liab_current | price | |
---|---|---|---|
year | |||
2013 | 73286.0 | 43658.0 | 476.75 |
2014 | 68531.0 | 63448.0 | 100.75 |
We can also use ranges the same way.
big_df.loc[ 'BA':'DAL' ]
assets_current | liab_current | price | ||
---|---|---|---|---|
ticker | year | |||
BA | 2013 | 65074.000 | 51486.000 | 136.49 |
2014 | 67785.000 | 56717.000 | 129.98 | |
BUD | 2013 | 18690.000 | 25627.000 | 106.46 |
2014 | 18541.000 | 27208.000 | 112.32 | |
CAT | 2013 | 38335.000 | 27297.000 | 90.81 |
2014 | 38867.000 | 27877.000 | 91.53 | |
COKE | 2013 | 239.215 | 208.816 | 73.19 |
2014 | 287.001 | 227.420 | 88.03 | |
CVX | 2013 | 50250.000 | 33018.000 | 124.91 |
2014 | 42232.000 | 31926.000 | 112.18 | |
DAL | 2013 | 9651.000 | 14152.000 | 27.47 |
2014 | 12465.000 | 16879.000 | 49.19 |
But if we want to use deeper index levels in a multi index (e.g., data for AAPL in 2013), then we can pass a tuple to .loc.
specific_row = ('AAPL', 2013)
big_df.loc[specific_row]
assets_current 73286.00
liab_current 43658.00
price 476.75
Name: (AAPL, 2013), dtype: float64
We can also put a list inside the tuple to select multiple rows:
firm_list = ['BA', 'GM']
tuple_with_list = ( firm_list, 2013 )
big_df.loc[ tuple_with_list, : ]
assets_current | liab_current | price | ||
---|---|---|---|---|
ticker | year | |||
BA | 2013 | 65074.0 | 51486.0 | 136.49 |
GM | 2013 | 81501.0 | 62412.0 | 40.87 |
Notice the difference between a list of tuples and a tuple of lists:
list_of_tuples = [
('BA', 2013),
('GM', 2014),
]
big_df.loc[list_of_tuples, :]
assets_current | liab_current | price | ||
---|---|---|---|---|
ticker | year | |||
BA | 2013 | 65074.0 | 51486.0 | 136.49 |
GM | 2014 | 83670.0 | 65701.0 | 34.91 |
tuple_of_lists = (
['BA', 'GM'],
[2013, 2014],
)
big_df.loc[tuple_of_lists, :]
assets_current | liab_current | price | ||
---|---|---|---|---|
ticker | year | |||
BA | 2013 | 65074.0 | 51486.0 | 136.49 |
2014 | 67785.0 | 56717.0 | 129.98 | |
GM | 2013 | 81501.0 | 62412.0 | 40.87 |
2014 | 83670.0 | 65701.0 | 34.91 |
But if we want to reference only the second index level (year), we need pd.IndexSlice.
# we want all tickers, but just the year 2014
just_2014 = pd.IndexSlice[ :, 2014 ]
# pass this "slice" to .loc[], and also specify the columns
big_df.loc[ just_2014, : ]
assets_current | liab_current | price | ||
---|---|---|---|---|
ticker | year | |||
AAPL | 2014 | 68531.000 | 63448.000 | 100.75 |
AET | 2014 | 11764.000 | 15356.500 | 88.83 |
BA | 2014 | 67785.000 | 56717.000 | 129.98 |
BUD | 2014 | 18541.000 | 27208.000 | 112.32 |
CAT | 2014 | 38867.000 | 27877.000 | 91.53 |
COKE | 2014 | 287.001 | 227.420 | 88.03 |
CVX | 2014 | 42232.000 | 31926.000 | 112.18 |
DAL | 2014 | 12465.000 | 16879.000 | 49.19 |
DIS | 2014 | 15176.000 | 13292.000 | 89.03 |
DOW | 2014 | 24267.000 | 11593.000 | 45.61 |
DTV | 2014 | 8819.000 | 6959.000 | 86.70 |
GM | 2014 | 83670.000 | 65701.000 | 34.91 |
HOG | 2014 | 3948.095 | 2389.286 | 65.91 |
IBM | 2014 | 49422.000 | 39600.000 | 160.44 |
LMT | 2014 | 12329.000 | 11112.000 | 192.57 |
LUV | 2014 | 4404.000 | 5923.000 | 42.32 |
MSFT | 2014 | 114246.000 | 45625.000 | 41.70 |
PFE | 2014 | 57702.000 | 21631.000 | 31.15 |
TAP | 2014 | 1578.900 | 2325.300 | 74.52 |
TWC | 2014 | 2316.000 | 4497.000 | 152.06 |
VZ | 2014 | 29623.000 | 28064.000 | 46.78 |
Important
When using multi indexes with .loc, you should always include the columns argument (even if you want to include all of the columns).
For example, do this:
big_df.loc[ ('AAPL', 2013), : ]
Instead of this:
big_df.loc[ ('AAPL', 2013) ]
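Keeping the column argument explicit also makes it clear whether we want the whole row or a single value. A short sketch:
row_2013 = big_df.loc[ ('AAPL', 2013), : ]        # the entire row, as a Series
price_13 = big_df.loc[ ('AAPL', 2013), 'price' ]  # just the single price value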
Filters#
The .loc method is great for selecting rows using the DataFrame’s index. But in order to select rows based on values in the other columns (using conditional statements), we should use filters.
small_df
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
MSFT | 200 | Microsoft Corp | 181.92 | 72.31 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
This is what a filter returns on its own. Notice that it is essentially just a list of True/False values.
price_filter = ( small_df['price'] > 500 )
price_filter
ticker
AAPL False
MSFT False
GOOG True
AMZN True
Name: price, dtype: bool
Rule of Thumb
When using conditionals with pandas DataFrames, it is best practice to always wrap conditional statements in parentheses.
Now let us index our DataFrame using that filter:
small_df[price_filter]
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
.str.contains()#
Here is another filter that checks whether a string is contained in a certain column:
fancy_filter = ( small_df['name'].str.contains('Inc') )
small_df[fancy_filter]
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
Here is a filter that combines multiple conditions:
fancy_filter2 = ( ( small_df['name'].str.contains('Inc') ) & (small_df['price'] > 1000) )
small_df[fancy_filter2]
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
Note
Use & rather than Python’s and keyword for logical expressions in pandas; and does not work element-wise on a Series, but & does.
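The other element-wise logical operators work the same way: | is "or" and ~ is "not". A quick sketch (the filter names are just for illustration):
# "or": price above 1000 OR current assets above 160
either_filter = ( (small_df['price'] > 1000) | (small_df['assets_current'] > 160) )
# "not": names that do NOT contain 'Inc'
not_inc_filter = ~( small_df['name'].str.contains('Inc') )
small_df[either_filter]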
Suppose we have the following DataFrame, which contains student data:
Note
None is a way of storing a missing or empty value in Python.
student_data = {
'student': ['981', '428', '762', '304', '698', '170'],
'quiz1': [90, None, 80, 85, 75, 95],
'major': ['FIN', 'FIN', 'ACT', None, 'FIN', 'MGMT']
}
student_df = pd.DataFrame(student_data)
student_df.set_index('student', inplace=True)
student_df
quiz1 | major | |
---|---|---|
student | ||
981 | 90.0 | FIN |
428 | NaN | FIN |
762 | 80.0 | ACT |
304 | 85.0 | None |
698 | 75.0 | FIN |
170 | 95.0 | MGMT |
.between()#
If we want to select only the students that scored between 90 and 100 on the quiz, we can use the .between() method.
grade_filter = ( student_df['quiz1'].between(90, 100) )
student_df[grade_filter]
quiz1 | major | |
---|---|---|
student | ||
981 | 90.0 | FIN |
170 | 95.0 | MGMT |
.isin()#
If we want only the students that are finance or accounting majors, we can use the .isin() method.
major_list = ['FIN', 'ACT']
major_filter = ( student_df['major'].isin(major_list) )
student_df[major_filter]
quiz1 | major | |
---|---|---|
student | ||
981 | 90.0 | FIN |
428 | NaN | FIN |
762 | 80.0 | ACT |
698 | 75.0 | FIN |
.isna()#
If we want only the students for whom we do not have data about their major, we can use the .isna() method:
missing_filter1 = ( student_df['major'].isna() )
student_df[missing_filter1]
quiz1 | major | |
---|---|---|
student | ||
304 | 85.0 | None |
.notna()#
If we want only the students that are not missing data for their major or their quiz score, we can use the .notna() method:
missing_filter2 = ( student_df['major'].notna() & student_df['quiz1'].notna() )
student_df[missing_filter2]
quiz1 | major | |
---|---|---|
student | ||
981 | 90.0 | FIN |
762 | 80.0 | ACT |
698 | 75.0 | FIN |
170 | 95.0 | MGMT |
The .query method#
Another way to select rows is the .query() method. In many situations it is easier to implement than filters or .loc.
small_df
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
MSFT | 200 | Microsoft Corp | 181.92 | 72.31 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
small_df.query('name == "Apple Inc"')
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
small_df.query('price > 1000')
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
.query() also works with named indexes.
small_df.query('index == "GOOG"')
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
Or, similarly, using the name of the index (ticker) instead of the keyword index:
small_df.query('ticker == "GOOG"')
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
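The same idea works with a multi index: .query() can reference each index level by its name. A short sketch using big_df (still indexed by ticker and year):
# all tickers, just the year 2014 (same rows as the IndexSlice example above)
big_df.query('year == 2014')
# one specific firm-year
big_df.query('ticker == "AAPL" and year == 2013')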
student_df
quiz1 | major | |
---|---|---|
student | ||
981 | 90.0 | FIN |
428 | NaN | FIN |
762 | 80.0 | ACT |
304 | 85.0 | None |
698 | 75.0 | FIN |
170 | 95.0 | MGMT |
If we want to use functionality like .isin() or .isna() inside .query(), then we need to include one more option: engine='python'.
student_df.query("student.isin( ['428', '698'] )", engine='python')
quiz1 | major | |
---|---|---|
student | ||
428 | NaN | FIN |
698 | 75.0 | FIN |
student_df.query('major.isna()', engine='python')
quiz1 | major | |
---|---|---|
student | ||
304 | 85.0 | None |
It is also possible to reference external variables in the query statement by using the @ character.
best_major = 'FIN'
student_df.query('major == @best_major')
quiz1 | major | |
---|---|---|
student | ||
981 | 90.0 | FIN |
428 | NaN | FIN |
698 | 75.0 | FIN |
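The @ syntax also works with lists, which pairs nicely with .isin(). A small sketch:
major_list = ['FIN', 'ACT']
student_df.query('major.isin(@major_list)', engine='python')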
A final example#
Here is an animal example (from the pandas documentation) of a DataFrame with three levels to the multi index. How would we reference the various elements?
d = {
'num_legs': [4, 4, 2, 2, 2],
'num_wings': [0, 0, 2, 2, 2],
'category': ['mammal', 'mammal', 'mammal', 'bird', 'bird'],
'animal': ['cat', 'dog', 'bat', 'penguin', 'robin'],
'locomotion': ['walks', 'walks', 'flies', 'walks', 'flies'],
'pet': [True, True, False, False, False],
}
df = pd.DataFrame(d)
df.set_index( ['category', 'animal', 'locomotion'], inplace=True )
df
num_legs | num_wings | pet | |||
---|---|---|---|---|---|
category | animal | locomotion | |||
mammal | cat | walks | 4 | 0 | True |
dog | walks | 4 | 0 | True | |
bat | flies | 2 | 2 | False | |
bird | penguin | walks | 2 | 2 | False |
robin | flies | 2 | 2 | False |
Using .loc#
If we want just the mammals:
df.loc['mammal']
num_legs | num_wings | pet | ||
---|---|---|---|---|
animal | locomotion | |||
cat | walks | 4 | 0 | True |
dog | walks | 4 | 0 | True |
bat | flies | 2 | 2 | False |
If we want just the animals that fly:
fly_slice = pd.IndexSlice[ :, :, 'flies' ]
df.loc[ fly_slice, :]
num_legs | num_wings | pet | |||
---|---|---|---|---|---|
category | animal | locomotion | |||
mammal | bat | flies | 2 | 2 | False |
bird | robin | flies | 2 | 2 | False |
If we want just the dog and penguin:
animal_list = ['dog', 'penguin']
animal_slice = pd.IndexSlice[ :, animal_list ]
df.loc[ animal_slice, : ]
num_legs | num_wings | pet | |||
---|---|---|---|---|---|
category | animal | locomotion | |||
mammal | dog | walks | 4 | 0 | True |
bird | penguin | walks | 2 | 2 | False |
Or, in one line (even though it is harder to read):
df.loc[ pd.IndexSlice[:, ['dog', 'penguin'] ], : ]
num_legs | num_wings | pet | |||
---|---|---|---|---|---|
category | animal | locomotion | |||
mammal | dog | walks | 4 | 0 | True |
bird | penguin | walks | 2 | 2 | False |
If we want mammals that fly:
fly_mam_slice = pd.IndexSlice[ 'mammal', :, 'flies' ]
df.loc[ fly_mam_slice, : ]
num_legs | num_wings | pet | |||
---|---|---|---|---|---|
category | animal | locomotion | |||
mammal | bat | flies | 2 | 2 | False |
Using .query#
See below for how to do the same things with the .query method.
df.query('category == "mammal"')
num_legs | num_wings | pet | |||
---|---|---|---|---|---|
category | animal | locomotion | |||
mammal | cat | walks | 4 | 0 | True |
dog | walks | 4 | 0 | True | |
bat | flies | 2 | 2 | False |
df.query('locomotion == "flies"')
num_legs | num_wings | pet | |||
---|---|---|---|---|---|
category | animal | locomotion | |||
mammal | bat | flies | 2 | 2 | False |
bird | robin | flies | 2 | 2 | False |
df.query('animal.isin(["dog", "penguin"])', engine='python')
num_legs | num_wings | pet | |||
---|---|---|---|---|---|
category | animal | locomotion | |||
mammal | dog | walks | 4 | 0 | True |
bird | penguin | walks | 2 | 2 | False |
df.query('category == "mammal" & locomotion == "flies"')
num_legs | num_wings | pet | |||
---|---|---|---|---|---|
category | animal | locomotion | |||
mammal | bat | flies | 2 | 2 | False |