Pandas basics#

Importing the package#

In order to use pandas, you may need to first install the package in your Terminal via conda install pandas.

Once it has been installed, you can use the package by importing it at the beginning of your program:

import pandas as pd

The as pd is like a shortcut. Whenever we want to use something from the pandas library, we can write pd instead of typing out pandas.

Creating a dataframe#

# First let's set up a dictionary using data from Marketwatch

small_data = {
    'ticker': ['AAPL', 'MSFT', 'GOOG', 'AMZN'],
    'price': [137, 229, 1677, 3535],
    'name': ['Apple Inc', 'Microsoft Corp', 'Alphabet Inc', 'Amazon.com Inc'],
    'assets_current': [162.82, 181.92, 152.58, 96.33],
    'liab_current': [105.72, 72.31, 45.22, 87.81]
}

small_data
{'ticker': ['AAPL', 'MSFT', 'GOOG', 'AMZN'],
 'price': [137, 229, 1677, 3535],
 'name': ['Apple Inc', 'Microsoft Corp', 'Alphabet Inc', 'Amazon.com Inc'],
 'assets_current': [162.82, 181.92, 152.58, 96.33],
 'liab_current': [105.72, 72.31, 45.22, 87.81]}

Creating a dataframe (basically a table) is very easy when we have a dictionary that contains lists. Just load it using the DataFrame method.

small_df = pd.DataFrame(small_data)
small_df
ticker price name assets_current liab_current
0 AAPL 137 Apple Inc 162.82 105.72
1 MSFT 229 Microsoft Corp 181.92 72.31
2 GOOG 1677 Alphabet Inc 152.58 45.22
3 AMZN 3535 Amazon.com Inc 96.33 87.81

We can also create a new dataframe using lists instead of a dictionary.

# the lists containing the data
columns_list = ['student', 'id', 'year', 'grade']
student_data = [
    ['James', 1234, 'Freshman', 'A'],
    ['John', 2345, 'Sophomore', 'B'],
    ['Mary', 3456, 'Junior', 'C'],
    ['Pat', 4567, 'Senior', 'D'],
]

# create a dataframe from the data
student_df = pd.DataFrame(
    columns=columns_list,
    data=student_data,
)
student_df
student id year grade
0 James 1234 Freshman A
1 John 2345 Sophomore B
2 Mary 3456 Junior C
3 Pat 4567 Senior D

Previewing large data#

This small_df example is pretty small. But we will frequently use much larger datasets.

A useful way to view the first few observations is with the .head() method.

big_data = {
    'year': [2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014],
    'ticker': ['AAPL', 'AAPL', 'AET', 'AET', 'BA', 'BA', 'BUD', 'BUD', 'CAT', 'CAT', 'COKE', 'COKE', 'CVX', 'CVX', 'DAL', 'DAL', 'DIS', 'DIS', 'DOW', 'DOW', 'DTV', 'DTV', 'GM', 'GM', 'HOG', 'HOG', 'IBM', 'IBM', 'LMT', 'LMT', 'LUV', 'LUV', 'MSFT', 'MSFT', 'PFE', 'PFE', 'TAP', 'TAP', 'TWC', 'TWC', 'VZ', 'VZ'],
    'assets_current': [73286, 68531, 9718.9, 11764, 65074, 67785, 18690, 18541, 38335, 38867, 239.215, 287.001, 50250, 42232, 9651, 12465, 14109, 15176, 24977, 24267, 5953, 8819, 81501, 83670, 3988.803, 3948.095, 51350, 49422, 13329, 12329, 4456, 4404, 101466, 114246, 56244, 57702, 1537.7, 1578.9, 2144, 2316, 70994, 29623],
    'liab_current': [43658, 63448, 12602.9, 15356.5, 51486, 56717, 25627, 27208, 27297, 27877, 208.816, 227.42, 33018, 31926, 14152, 16879, 11704, 13292, 11971, 11593, 6530, 6959, 62412, 65701, 2509.586, 2389.286, 40154, 39600, 11120, 11112, 5676, 5923, 37417, 45625, 23366, 21631, 2142.1, 2325.3, 5226, 4497, 27050, 28064],
    'price': [476.75, 100.75, 68.59, 88.83, 136.49, 129.98, 106.46, 112.32, 90.81, 91.53, 73.19, 88.03, 124.91, 112.18, 27.47, 49.19, 64.49, 89.03, 44.4, 45.61, 69.06, 86.7, 40.87, 34.91, 69.24, 65.91, 187.57, 160.44, 148.66, 192.57, 18.84, 42.32, 34.545, 41.7, 30.63, 31.15, 56.15, 74.52, 135.5, 152.06, 49.14, 46.78],
}

big_df = pd.DataFrame(big_data)

big_df.head()
year ticker assets_current liab_current price
0 2013 AAPL 73286.0 43658.0 476.75
1 2014 AAPL 68531.0 63448.0 100.75
2 2013 AET 9718.9 12602.9 68.59
3 2014 AET 11764.0 15356.5 88.83
4 2013 BA 65074.0 51486.0 136.49

You can also see the last few observations with the .tail() method.

big_df.tail()
year ticker assets_current liab_current price
37 2014 TAP 1578.9 2325.3 74.52
38 2013 TWC 2144.0 5226.0 135.50
39 2014 TWC 2316.0 4497.0 152.06
40 2013 VZ 70994.0 27050.0 49.14
41 2014 VZ 29623.0 28064.0 46.78

Pandas options#

You can also change the default options when displaying a dataframe:

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
big_df
year ticker assets_current liab_current price
0 2013 AAPL 73286.000 43658.000 476.750
1 2014 AAPL 68531.000 63448.000 100.750
2 2013 AET 9718.900 12602.900 68.590
3 2014 AET 11764.000 15356.500 88.830
4 2013 BA 65074.000 51486.000 136.490
5 2014 BA 67785.000 56717.000 129.980
6 2013 BUD 18690.000 25627.000 106.460
7 2014 BUD 18541.000 27208.000 112.320
8 2013 CAT 38335.000 27297.000 90.810
9 2014 CAT 38867.000 27877.000 91.530
10 2013 COKE 239.215 208.816 73.190
11 2014 COKE 287.001 227.420 88.030
12 2013 CVX 50250.000 33018.000 124.910
13 2014 CVX 42232.000 31926.000 112.180
14 2013 DAL 9651.000 14152.000 27.470
15 2014 DAL 12465.000 16879.000 49.190
16 2013 DIS 14109.000 11704.000 64.490
17 2014 DIS 15176.000 13292.000 89.030
18 2013 DOW 24977.000 11971.000 44.400
19 2014 DOW 24267.000 11593.000 45.610
20 2013 DTV 5953.000 6530.000 69.060
21 2014 DTV 8819.000 6959.000 86.700
22 2013 GM 81501.000 62412.000 40.870
23 2014 GM 83670.000 65701.000 34.910
24 2013 HOG 3988.803 2509.586 69.240
25 2014 HOG 3948.095 2389.286 65.910
26 2013 IBM 51350.000 40154.000 187.570
27 2014 IBM 49422.000 39600.000 160.440
28 2013 LMT 13329.000 11120.000 148.660
29 2014 LMT 12329.000 11112.000 192.570
30 2013 LUV 4456.000 5676.000 18.840
31 2014 LUV 4404.000 5923.000 42.320
32 2013 MSFT 101466.000 37417.000 34.545
33 2014 MSFT 114246.000 45625.000 41.700
34 2013 PFE 56244.000 23366.000 30.630
35 2014 PFE 57702.000 21631.000 31.150
36 2013 TAP 1537.700 2142.100 56.150
37 2014 TAP 1578.900 2325.300 74.520
38 2013 TWC 2144.000 5226.000 135.500
39 2014 TWC 2316.000 4497.000 152.060
40 2013 VZ 70994.000 27050.000 49.140
41 2014 VZ 29623.000 28064.000 46.780

Details about the DataFrame#

We can see the number of rows and columns in a data frame with the .shape property. This will show us that we have 42 rows and 5 columns.

big_df.shape
(42, 5)

We can also see a list of the column names in our dataframe by using the .columns property.

big_df.columns
Index(['year', 'ticker', 'assets_current', 'liab_current', 'price'], dtype='object')

The .info() method will give us a few more details about the dataframe (e.g., data types).

big_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            42 non-null     int64  
 1   ticker          42 non-null     object 
 2   assets_current  42 non-null     float64
 3   liab_current    42 non-null     float64
 4   price           42 non-null     float64
dtypes: float64(3), int64(1), object(1)
memory usage: 1.8+ KB

Notice:

shape and columns are properties of the dataframe. Since they are just properties, we do not use parentheses.

info() is a method — like a function — and so we include parentheses because we could potentially pass in arguments.

Setting the index#

Notice above how the rows are identified with numbers (i.e., 0, 1, 2, 3). But each row contains data about a specific ticker, so we should probably use that as our index.

small_df.set_index('ticker', inplace=True)
small_df
price name assets_current liab_current
ticker
AAPL 137 Apple Inc 162.82 105.72
MSFT 229 Microsoft Corp 181.92 72.31
GOOG 1677 Alphabet Inc 152.58 45.22
AMZN 3535 Amazon.com Inc 96.33 87.81

Note: the inplace=True option just means that we are modifying our small_df table/DataFrame and not creating a new table/DataFrame.

Multi index#

It is a good rule of thumb that the index should uniquely identify each row. For example, in our big_df, we could set ticker as the index, but that doesn’t uniquely identify each row since each ticker has a row for 2013 and a row for 2014. In this case, we want to use both ticker and year as our index. This is known as a multi index.

# list of columns to use as index
index_list = ['ticker', 'year']

# the rest is the same
big_df.set_index(index_list, inplace=True)
big_df.head(10)
assets_current liab_current price
ticker year
AAPL 2013 73286.0 43658.0 476.75
2014 68531.0 63448.0 100.75
AET 2013 9718.9 12602.9 68.59
2014 11764.0 15356.5 88.83
BA 2013 65074.0 51486.0 136.49
2014 67785.0 56717.0 129.98
BUD 2013 18690.0 25627.0 106.46
2014 18541.0 27208.0 112.32
CAT 2013 38335.0 27297.0 90.81
2014 38867.0 27877.0 91.53

Sorting DataFrames#

We can sort the data using any variable.
inplace=True again just means that we are modifying the data “in place” and not creating a new table.

small_df.sort_values(by='assets_current', inplace=True)
small_df
price name assets_current liab_current
ticker
AMZN 3535 Amazon.com Inc 96.33 87.81
GOOG 1677 Alphabet Inc 152.58 45.22
AAPL 137 Apple Inc 162.82 105.72
MSFT 229 Microsoft Corp 181.92 72.31

Instead of sorting small to large, we can sort large to small by changing the ascending option. The default value (if we do not specify it) is True.

small_df.sort_values(by='price', ascending=False, inplace=True)
small_df
price name assets_current liab_current
ticker
AMZN 3535 Amazon.com Inc 96.33 87.81
GOOG 1677 Alphabet Inc 152.58 45.22
MSFT 229 Microsoft Corp 181.92 72.31
AAPL 137 Apple Inc 162.82 105.72

Instead of sorting by a column, we can sort by the index using .sort_index().

small_df.sort_index(inplace=True)
small_df
price name assets_current liab_current
ticker
AAPL 137 Apple Inc 162.82 105.72
AMZN 3535 Amazon.com Inc 96.33 87.81
GOOG 1677 Alphabet Inc 152.58 45.22
MSFT 229 Microsoft Corp 181.92 72.31

Transposing tables#

Suppose we have the following table (current_df) for a particular company:

current_df = pd.DataFrame(
	index = ['total debt', 'total assets'],
	columns = [2022, 2021, 2020],
	data = [
		[25, 24, 22],
		[70, 77, 80],
	],
)
current_df
2022 2021 2020
total debt 25 24 22
total assets 70 77 80

But what if we want the table to instead have the years as the rows and the debt/assets as the columns?

We call that a transposed table, and it is very easy to do in pandas:

transposed_df = current_df.T
transposed_df
total debt total assets
2022 25 70
2021 24 77
2020 22 80