Pandas basics
Contents
Pandas basics#
Importing the package#
In order to use pandas
, you may need to first install the package in your Terminal via conda install pandas
.
Once it has been installed, you can use the package by importing it at the beginning of your program:
import pandas as pd
The as pd
is like a shortcut. Whenever we want to use something from the pandas library, we can write pd
instead of typing out pandas
.
Creating a dataframe#
# First let's set up a dictionary using data from Marketwatch
small_data = {
'ticker': ['AAPL', 'MSFT', 'GOOG', 'AMZN'],
'price': [137, 229, 1677, 3535],
'name': ['Apple Inc', 'Microsoft Corp', 'Alphabet Inc', 'Amazon.com Inc'],
'assets_current': [162.82, 181.92, 152.58, 96.33],
'liab_current': [105.72, 72.31, 45.22, 87.81]
}
small_data
{'ticker': ['AAPL', 'MSFT', 'GOOG', 'AMZN'],
'price': [137, 229, 1677, 3535],
'name': ['Apple Inc', 'Microsoft Corp', 'Alphabet Inc', 'Amazon.com Inc'],
'assets_current': [162.82, 181.92, 152.58, 96.33],
'liab_current': [105.72, 72.31, 45.22, 87.81]}
Creating a dataframe (basically a table) is very easy when we have a dictionary that contains lists.
Just load it using the DataFrame
method.
small_df = pd.DataFrame(small_data)
small_df
ticker | price | name | assets_current | liab_current | |
---|---|---|---|---|---|
0 | AAPL | 137 | Apple Inc | 162.82 | 105.72 |
1 | MSFT | 229 | Microsoft Corp | 181.92 | 72.31 |
2 | GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
3 | AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
We can also create a new dataframe using lists instead of a dictionary.
# the lists containing the data
columns_list = ['student', 'id', 'year', 'grade']
student_data = [
['James', 1234, 'Freshman', 'A'],
['John', 2345, 'Sophomore', 'B'],
['Mary', 3456, 'Junior', 'C'],
['Pat', 4567, 'Senior', 'D'],
]
# create a dataframe from the data
student_df = pd.DataFrame(
columns=columns_list,
data=student_data,
)
student_df
student | id | year | grade | |
---|---|---|---|---|
0 | James | 1234 | Freshman | A |
1 | John | 2345 | Sophomore | B |
2 | Mary | 3456 | Junior | C |
3 | Pat | 4567 | Senior | D |
Previewing large data#
This small_df
example is pretty small. But we will frequently use much larger datasets.
A useful way to view the first few observations is with the .head()
method.
big_data = {
'year': [2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014],
'ticker': ['AAPL', 'AAPL', 'AET', 'AET', 'BA', 'BA', 'BUD', 'BUD', 'CAT', 'CAT', 'COKE', 'COKE', 'CVX', 'CVX', 'DAL', 'DAL', 'DIS', 'DIS', 'DOW', 'DOW', 'DTV', 'DTV', 'GM', 'GM', 'HOG', 'HOG', 'IBM', 'IBM', 'LMT', 'LMT', 'LUV', 'LUV', 'MSFT', 'MSFT', 'PFE', 'PFE', 'TAP', 'TAP', 'TWC', 'TWC', 'VZ', 'VZ'],
'assets_current': [73286, 68531, 9718.9, 11764, 65074, 67785, 18690, 18541, 38335, 38867, 239.215, 287.001, 50250, 42232, 9651, 12465, 14109, 15176, 24977, 24267, 5953, 8819, 81501, 83670, 3988.803, 3948.095, 51350, 49422, 13329, 12329, 4456, 4404, 101466, 114246, 56244, 57702, 1537.7, 1578.9, 2144, 2316, 70994, 29623],
'liab_current': [43658, 63448, 12602.9, 15356.5, 51486, 56717, 25627, 27208, 27297, 27877, 208.816, 227.42, 33018, 31926, 14152, 16879, 11704, 13292, 11971, 11593, 6530, 6959, 62412, 65701, 2509.586, 2389.286, 40154, 39600, 11120, 11112, 5676, 5923, 37417, 45625, 23366, 21631, 2142.1, 2325.3, 5226, 4497, 27050, 28064],
'price': [476.75, 100.75, 68.59, 88.83, 136.49, 129.98, 106.46, 112.32, 90.81, 91.53, 73.19, 88.03, 124.91, 112.18, 27.47, 49.19, 64.49, 89.03, 44.4, 45.61, 69.06, 86.7, 40.87, 34.91, 69.24, 65.91, 187.57, 160.44, 148.66, 192.57, 18.84, 42.32, 34.545, 41.7, 30.63, 31.15, 56.15, 74.52, 135.5, 152.06, 49.14, 46.78],
}
big_df = pd.DataFrame(big_data)
big_df.head()
year | ticker | assets_current | liab_current | price | |
---|---|---|---|---|---|
0 | 2013 | AAPL | 73286.0 | 43658.0 | 476.75 |
1 | 2014 | AAPL | 68531.0 | 63448.0 | 100.75 |
2 | 2013 | AET | 9718.9 | 12602.9 | 68.59 |
3 | 2014 | AET | 11764.0 | 15356.5 | 88.83 |
4 | 2013 | BA | 65074.0 | 51486.0 | 136.49 |
You can also see the last few observations with the .tail()
method.
big_df.tail()
year | ticker | assets_current | liab_current | price | |
---|---|---|---|---|---|
37 | 2014 | TAP | 1578.9 | 2325.3 | 74.52 |
38 | 2013 | TWC | 2144.0 | 5226.0 | 135.50 |
39 | 2014 | TWC | 2316.0 | 4497.0 | 152.06 |
40 | 2013 | VZ | 70994.0 | 27050.0 | 49.14 |
41 | 2014 | VZ | 29623.0 | 28064.0 | 46.78 |
Pandas options#
You can also change the default options when displaying a dataframe:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
big_df
year | ticker | assets_current | liab_current | price | |
---|---|---|---|---|---|
0 | 2013 | AAPL | 73286.000 | 43658.000 | 476.750 |
1 | 2014 | AAPL | 68531.000 | 63448.000 | 100.750 |
2 | 2013 | AET | 9718.900 | 12602.900 | 68.590 |
3 | 2014 | AET | 11764.000 | 15356.500 | 88.830 |
4 | 2013 | BA | 65074.000 | 51486.000 | 136.490 |
5 | 2014 | BA | 67785.000 | 56717.000 | 129.980 |
6 | 2013 | BUD | 18690.000 | 25627.000 | 106.460 |
7 | 2014 | BUD | 18541.000 | 27208.000 | 112.320 |
8 | 2013 | CAT | 38335.000 | 27297.000 | 90.810 |
9 | 2014 | CAT | 38867.000 | 27877.000 | 91.530 |
10 | 2013 | COKE | 239.215 | 208.816 | 73.190 |
11 | 2014 | COKE | 287.001 | 227.420 | 88.030 |
12 | 2013 | CVX | 50250.000 | 33018.000 | 124.910 |
13 | 2014 | CVX | 42232.000 | 31926.000 | 112.180 |
14 | 2013 | DAL | 9651.000 | 14152.000 | 27.470 |
15 | 2014 | DAL | 12465.000 | 16879.000 | 49.190 |
16 | 2013 | DIS | 14109.000 | 11704.000 | 64.490 |
17 | 2014 | DIS | 15176.000 | 13292.000 | 89.030 |
18 | 2013 | DOW | 24977.000 | 11971.000 | 44.400 |
19 | 2014 | DOW | 24267.000 | 11593.000 | 45.610 |
20 | 2013 | DTV | 5953.000 | 6530.000 | 69.060 |
21 | 2014 | DTV | 8819.000 | 6959.000 | 86.700 |
22 | 2013 | GM | 81501.000 | 62412.000 | 40.870 |
23 | 2014 | GM | 83670.000 | 65701.000 | 34.910 |
24 | 2013 | HOG | 3988.803 | 2509.586 | 69.240 |
25 | 2014 | HOG | 3948.095 | 2389.286 | 65.910 |
26 | 2013 | IBM | 51350.000 | 40154.000 | 187.570 |
27 | 2014 | IBM | 49422.000 | 39600.000 | 160.440 |
28 | 2013 | LMT | 13329.000 | 11120.000 | 148.660 |
29 | 2014 | LMT | 12329.000 | 11112.000 | 192.570 |
30 | 2013 | LUV | 4456.000 | 5676.000 | 18.840 |
31 | 2014 | LUV | 4404.000 | 5923.000 | 42.320 |
32 | 2013 | MSFT | 101466.000 | 37417.000 | 34.545 |
33 | 2014 | MSFT | 114246.000 | 45625.000 | 41.700 |
34 | 2013 | PFE | 56244.000 | 23366.000 | 30.630 |
35 | 2014 | PFE | 57702.000 | 21631.000 | 31.150 |
36 | 2013 | TAP | 1537.700 | 2142.100 | 56.150 |
37 | 2014 | TAP | 1578.900 | 2325.300 | 74.520 |
38 | 2013 | TWC | 2144.000 | 5226.000 | 135.500 |
39 | 2014 | TWC | 2316.000 | 4497.000 | 152.060 |
40 | 2013 | VZ | 70994.000 | 27050.000 | 49.140 |
41 | 2014 | VZ | 29623.000 | 28064.000 | 46.780 |
Details about the DataFrame#
We can see the number of rows and columns in a data frame with the .shape
property. This will show us that we have 42 rows and 5 columns.
big_df.shape
(42, 5)
We can also see a list of the column names in our dataframe by using the .columns
property.
big_df.columns
Index(['year', 'ticker', 'assets_current', 'liab_current', 'price'], dtype='object')
The .info()
method will give us a few more details about the dataframe (e.g., data types).
big_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 42 non-null int64
1 ticker 42 non-null object
2 assets_current 42 non-null float64
3 liab_current 42 non-null float64
4 price 42 non-null float64
dtypes: float64(3), int64(1), object(1)
memory usage: 1.8+ KB
Notice:
shape
and columns
are properties of the dataframe. Since they are just properties, we do not use parentheses.
info()
is a method — like a function — and so we include parentheses because we could potentially pass in arguments.
Setting the index#
Notice above how the rows are identified with numbers (i.e., 0, 1, 2, 3). But each row contains data about a specific ticker, so we should probably use that as our index.
small_df.set_index('ticker', inplace=True)
small_df
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
MSFT | 229 | Microsoft Corp | 181.92 | 72.31 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
Note: the inplace=True
option just means that we are modifying our small_df
table/DataFrame and not creating a new table/DataFrame.
Multi index#
It is a good rule of thumb that the index should uniquely identify each row. For example, in our big_df
, we could set ticker
as the index, but that doesn’t uniquely identify each row since each ticker has a row for 2013 and a row for 2014. In this case, we want to use both ticker
and year
as our index. This is known as a multi index.
# list of columns to use as index
index_list = ['ticker', 'year']
# the rest is the same
big_df.set_index(index_list, inplace=True)
big_df.head(10)
assets_current | liab_current | price | ||
---|---|---|---|---|
ticker | year | |||
AAPL | 2013 | 73286.0 | 43658.0 | 476.75 |
2014 | 68531.0 | 63448.0 | 100.75 | |
AET | 2013 | 9718.9 | 12602.9 | 68.59 |
2014 | 11764.0 | 15356.5 | 88.83 | |
BA | 2013 | 65074.0 | 51486.0 | 136.49 |
2014 | 67785.0 | 56717.0 | 129.98 | |
BUD | 2013 | 18690.0 | 25627.0 | 106.46 |
2014 | 18541.0 | 27208.0 | 112.32 | |
CAT | 2013 | 38335.0 | 27297.0 | 90.81 |
2014 | 38867.0 | 27877.0 | 91.53 |
Sorting DataFrames#
We can sort the data using any variable.
inplace=True
again just means that we are modifying the data “in place” and not creating a new table.
small_df.sort_values(by='assets_current', inplace=True)
small_df
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
MSFT | 229 | Microsoft Corp | 181.92 | 72.31 |
Instead of sorting small to large, we can sort large to small by changing the ascending
option.
The default value (if we do not specify it) is True
.
small_df.sort_values(by='price', ascending=False, inplace=True)
small_df
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
MSFT | 229 | Microsoft Corp | 181.92 | 72.31 |
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
Instead of sorting by a column, we can sort by the index using .sort_index()
.
small_df.sort_index(inplace=True)
small_df
price | name | assets_current | liab_current | |
---|---|---|---|---|
ticker | ||||
AAPL | 137 | Apple Inc | 162.82 | 105.72 |
AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
MSFT | 229 | Microsoft Corp | 181.92 | 72.31 |
Transposing tables#
Suppose we have the following table (current_df
) for a particular company:
current_df = pd.DataFrame(
index = ['total debt', 'total assets'],
columns = [2022, 2021, 2020],
data = [
[25, 24, 22],
[70, 77, 80],
],
)
current_df
2022 | 2021 | 2020 | |
---|---|---|---|
total debt | 25 | 24 | 22 |
total assets | 70 | 77 | 80 |
But what if we want the table to instead have the years as the rows and the debt/assets as the columns?
We call that a transposed table, and it is very easy to do in pandas
:
transposed_df = current_df.T
transposed_df
total debt | total assets | |
---|---|---|
2022 | 25 | 70 |
2021 | 24 | 77 |
2020 | 22 | 80 |