Pandas basics

Pandas basics#

Importing the package#

We do not have access to the pandas package by default. We need to import it into our program before we can use it.

import pandas as pd

as pd is like a shortcut. Whenever we want to use something from the pandas library, we can write pd instead of typing out pandas.

Creating a dataframe#

# First let's set up a dictionary using data from Marketwatch

small_data = {
    'ticker': ['AAPL', 'MSFT', 'GOOG', 'AMZN'],
    'price': [137, 229, 1677, 3535],
    'name': ['Apple Inc', 'Microsoft Corp', 'Alphabet Inc', 'Amazon.com Inc'],
    'assets_current': [162.82, 181.92, 152.58, 96.33],
    'liab_current': [105.72, 72.31, 45.22, 87.81]
}

small_data

{'ticker': ['AAPL', 'MSFT', 'GOOG', 'AMZN'],
 'price': [137, 229, 1677, 3535],
 'name': ['Apple Inc', 'Microsoft Corp', 'Alphabet Inc', 'Amazon.com Inc'],
 'assets_current': [162.82, 181.92, 152.58, 96.33],
 'liab_current': [105.72, 72.31, 45.22, 87.81]}

Creating a dataframe (basically a table) is very easy when we have a dictionary that contains lists. Just load it using the DataFrame method.

small_df = pd.DataFrame(small_data)
small_df

	ticker	price	name	assets_current	liab_current
0	AAPL	137	Apple Inc	162.82	105.72
1	MSFT	229	Microsoft Corp	181.92	72.31
2	GOOG	1677	Alphabet Inc	152.58	45.22
3	AMZN	3535	Amazon.com Inc	96.33	87.81

We can also create a new dataframe using lists instead of a dictionary.

# the lists containing the data
columns_list = ['student', 'id', 'year', 'grade']
student_data = [
    ['James', 1234, 'Freshman', 'A'],
    ['John', 2345, 'Sophomore', 'B'],
    ['Mary', 3456, 'Junior', 'C'],
    ['Pat', 4567, 'Senior', 'D'],
]

# create a dataframe from the data
student_df = pd.DataFrame(
    columns=columns_list,
    data=student_data,
)
student_df

	student	id	year	grade
0	James	1234	Freshman	A
1	John	2345	Sophomore	B
2	Mary	3456	Junior	C
3	Pat	4567	Senior	D

Previewing large data#

This small_df example is pretty small. But we will frequently use much larger datasets.

A useful way to view the first few observations is with the .head() method.

big_data = {
    'year': [2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014, 2013, 2014],
    'ticker': ['AAPL', 'AAPL', 'AET', 'AET', 'BA', 'BA', 'BUD', 'BUD', 'CAT', 'CAT', 'COKE', 'COKE', 'CVX', 'CVX', 'DAL', 'DAL', 'DIS', 'DIS', 'DOW', 'DOW', 'DTV', 'DTV', 'GM', 'GM', 'HOG', 'HOG', 'IBM', 'IBM', 'LMT', 'LMT', 'LUV', 'LUV', 'MSFT', 'MSFT', 'PFE', 'PFE', 'TAP', 'TAP', 'TWC', 'TWC', 'VZ', 'VZ'],
    'assets_current': [73286, 68531, 9718.9, 11764, 65074, 67785, 18690, 18541, 38335, 38867, 239.215, 287.001, 50250, 42232, 9651, 12465, 14109, 15176, 24977, 24267, 5953, 8819, 81501, 83670, 3988.803, 3948.095, 51350, 49422, 13329, 12329, 4456, 4404, 101466, 114246, 56244, 57702, 1537.7, 1578.9, 2144, 2316, 70994, 29623],
    'liab_current': [43658, 63448, 12602.9, 15356.5, 51486, 56717, 25627, 27208, 27297, 27877, 208.816, 227.42, 33018, 31926, 14152, 16879, 11704, 13292, 11971, 11593, 6530, 6959, 62412, 65701, 2509.586, 2389.286, 40154, 39600, 11120, 11112, 5676, 5923, 37417, 45625, 23366, 21631, 2142.1, 2325.3, 5226, 4497, 27050, 28064],
    'price': [476.75, 100.75, 68.59, 88.83, 136.49, 129.98, 106.46, 112.32, 90.81, 91.53, 73.19, 88.03, 124.91, 112.18, 27.47, 49.19, 64.49, 89.03, 44.4, 45.61, 69.06, 86.7, 40.87, 34.91, 69.24, 65.91, 187.57, 160.44, 148.66, 192.57, 18.84, 42.32, 34.545, 41.7, 30.63, 31.15, 56.15, 74.52, 135.5, 152.06, 49.14, 46.78],
}

big_df = pd.DataFrame(big_data)

big_df.head()

	year	ticker	assets_current	liab_current	price
0	2013	AAPL	73286.0	43658.0	476.75
1	2014	AAPL	68531.0	63448.0	100.75
2	2013	AET	9718.9	12602.9	68.59
3	2014	AET	11764.0	15356.5	88.83
4	2013	BA	65074.0	51486.0	136.49

We can also see the last few observations with the .tail() method.

big_df.tail()

	year	ticker	assets_current	liab_current	price
37	2014	TAP	1578.9	2325.3	74.52
38	2013	TWC	2144.0	5226.0	135.50
39	2014	TWC	2316.0	4497.0	152.06
40	2013	VZ	70994.0	27050.0	49.14
41	2014	VZ	29623.0	28064.0	46.78

Pandas options#

We can also change the default options when displaying a dataframe:

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)

big_df

	year	ticker	assets_current	liab_current	price
0	2013	AAPL	73286.000	43658.000	476.750
1	2014	AAPL	68531.000	63448.000	100.750
2	2013	AET	9718.900	12602.900	68.590
3	2014	AET	11764.000	15356.500	88.830
4	2013	BA	65074.000	51486.000	136.490
5	2014	BA	67785.000	56717.000	129.980
6	2013	BUD	18690.000	25627.000	106.460
7	2014	BUD	18541.000	27208.000	112.320
8	2013	CAT	38335.000	27297.000	90.810
9	2014	CAT	38867.000	27877.000	91.530
10	2013	COKE	239.215	208.816	73.190
11	2014	COKE	287.001	227.420	88.030
12	2013	CVX	50250.000	33018.000	124.910
13	2014	CVX	42232.000	31926.000	112.180
14	2013	DAL	9651.000	14152.000	27.470
15	2014	DAL	12465.000	16879.000	49.190
16	2013	DIS	14109.000	11704.000	64.490
17	2014	DIS	15176.000	13292.000	89.030
18	2013	DOW	24977.000	11971.000	44.400
19	2014	DOW	24267.000	11593.000	45.610
20	2013	DTV	5953.000	6530.000	69.060
21	2014	DTV	8819.000	6959.000	86.700
22	2013	GM	81501.000	62412.000	40.870
23	2014	GM	83670.000	65701.000	34.910
24	2013	HOG	3988.803	2509.586	69.240
25	2014	HOG	3948.095	2389.286	65.910
26	2013	IBM	51350.000	40154.000	187.570
27	2014	IBM	49422.000	39600.000	160.440
28	2013	LMT	13329.000	11120.000	148.660
29	2014	LMT	12329.000	11112.000	192.570
30	2013	LUV	4456.000	5676.000	18.840
31	2014	LUV	4404.000	5923.000	42.320
32	2013	MSFT	101466.000	37417.000	34.545
33	2014	MSFT	114246.000	45625.000	41.700
34	2013	PFE	56244.000	23366.000	30.630
35	2014	PFE	57702.000	21631.000	31.150
36	2013	TAP	1537.700	2142.100	56.150
37	2014	TAP	1578.900	2325.300	74.520
38	2013	TWC	2144.000	5226.000	135.500
39	2014	TWC	2316.000	4497.000	152.060
40	2013	VZ	70994.000	27050.000	49.140
41	2014	VZ	29623.000	28064.000	46.780

Details about the DataFrame#

We can see the number of rows and columns in a data frame with the .shape property. This will show us that we have 42 rows and 5 columns.

big_df.shape

(42, 5)

We can also see a list of the column names in our dataframe by using the .columns property.

big_df.columns

Index(['year', 'ticker', 'assets_current', 'liab_current', 'price'], dtype='object')

The .info() method will give us a few more details about the dataframe (e.g., data types).

big_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42 entries, 0 to 41
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   year            42 non-null     int64  
 1   ticker          42 non-null     object 
 2   assets_current  42 non-null     float64
 3   liab_current    42 non-null     float64
 4   price           42 non-null     float64
dtypes: float64(3), int64(1), object(1)
memory usage: 1.8+ KB

Notice:

shape and columns are properties of the dataframe. Since they are just properties, we do not use parentheses.

info() is a method — like a function — and so we include parentheses because we can potentially pass in arguments.

Setting the index#

Notice above how the rows are identified with numbers (i.e., 0, 1, 2, 3). But each row contains data about a specific ticker, so we should probably use that as our index.

small_df.set_index('ticker', inplace=True)
small_df

	price	name	assets_current	liab_current
ticker
AAPL	137	Apple Inc	162.82	105.72
MSFT	229	Microsoft Corp	181.92	72.31
GOOG	1677	Alphabet Inc	152.58	45.22
AMZN	3535	Amazon.com Inc	96.33	87.81

Note: the inplace=True option just means that we are modifying our small_df table/DataFrame and not creating a new table/DataFrame.

Multi index#

It is a good rule of thumb that the index should uniquely identify each row. For example, in our big_df, we could set ticker as the index, but that doesn’t uniquely identify each row since each ticker has a row for 2013 and a row for 2014. In this case, we want to use both ticker and year as our index. This is known as a multi index.

# list of columns to use as index
index_list = ['ticker', 'year']

# the rest is the same
big_df.set_index(index_list, inplace=True)
big_df.head(10)

		assets_current	liab_current	price
ticker	year
AAPL	2013	73286.0	43658.0	476.75
AAPL	2014	68531.0	63448.0	100.75
AET	2013	9718.9	12602.9	68.59
AET	2014	11764.0	15356.5	88.83
BA	2013	65074.0	51486.0	136.49
BA	2014	67785.0	56717.0	129.98
BUD	2013	18690.0	25627.0	106.46
BUD	2014	18541.0	27208.0	112.32
CAT	2013	38335.0	27297.0	90.81
CAT	2014	38867.0	27877.0	91.53

Sorting DataFrames#

We can sort the data using any variable.
inplace=True again just means that we are modifying the data “in place” and not creating a new table.

small_df.sort_values(by='assets_current', inplace=True)
small_df

	price	name	assets_current	liab_current
ticker
AMZN	3535	Amazon.com Inc	96.33	87.81
GOOG	1677	Alphabet Inc	152.58	45.22
AAPL	137	Apple Inc	162.82	105.72
MSFT	229	Microsoft Corp	181.92	72.31

Instead of sorting small to large, we can sort large to small by changing the ascending option. The default value (if we do not specify it) is True.

small_df.sort_values(by='price', ascending=False, inplace=True)
small_df

	price	name	assets_current	liab_current
ticker
AMZN	3535	Amazon.com Inc	96.33	87.81
GOOG	1677	Alphabet Inc	152.58	45.22
MSFT	229	Microsoft Corp	181.92	72.31
AAPL	137	Apple Inc	162.82	105.72

Instead of sorting by a column, we can sort by the index using .sort_index().

small_df.sort_index(inplace=True)
small_df

	price	name	assets_current	liab_current
ticker
AAPL	137	Apple Inc	162.82	105.72
AMZN	3535	Amazon.com Inc	96.33	87.81
GOOG	1677	Alphabet Inc	152.58	45.22
MSFT	229	Microsoft Corp	181.92	72.31

Manipulating dataframes#

Transpose#

Suppose we have the following table (current_df) for a particular company:

current_df = pd.DataFrame(
	index = ['total debt', 'total assets'],
	columns = [2022, 2021, 2020],
	data = [
		[25, 24, 22],
		[70, 77, 80],
	],
)
current_df

	2022	2021	2020
total debt	25	24	22
total assets	70	77	80

But what if we want the table to instead have the years as the rows and debt & assets as the columns?

We call that a transposed table, and it is very easy to do in pandas:

current_df.T

	total debt	total assets
2022	25	70
2021	24	77
2020	22	80

Stack and unstack#

Also see pivot() and wide_to_long().

Let’s import some data from plotly. In this table each column is a ticker.

import plotly.express as px

df_stock = px.data.stocks()
df_stock.set_index('date', inplace=True)
df_stock

	GOOG	AAPL	AMZN	FB	NFLX	MSFT
date
2018-01-01	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
2018-01-08	1.018172	1.011943	1.061881	0.959968	1.053526	1.015988
2018-01-15	1.032008	1.019771	1.053240	0.970243	1.049860	1.020524
2018-01-22	1.066783	0.980057	1.140676	1.016858	1.307681	1.066561
2018-01-29	1.008773	0.917143	1.163374	1.018357	1.273537	1.040708
...	...	...	...	...	...	...
2019-12-02	1.216280	1.546914	1.425061	1.075997	1.463641	1.720717
2019-12-09	1.222821	1.572286	1.432660	1.038855	1.421496	1.752239
2019-12-16	1.224418	1.596800	1.453455	1.104094	1.604362	1.784896
2019-12-23	1.226504	1.656000	1.521226	1.113728	1.567170	1.802472
2019-12-30	1.213014	1.678000	1.503360	1.098475	1.540883	1.788185

105 rows × 6 columns

Let’s structure the table as a firm-date panel using the .stack() method.

df_panel = df_stock.stack().to_frame()
df_panel

		0
date
2018-01-01	GOOG	1.000000
	AAPL	1.000000
	AMZN	1.000000
	FB	1.000000
	NFLX	1.000000
...	...	...
2019-12-30	AAPL	1.678000
	AMZN	1.503360
	FB	1.098475
	NFLX	1.540883
	MSFT	1.788185

630 rows × 1 columns

Now just clean it up a bit.

df_panel.index.set_names(['date', 'ticker'], inplace=True)
df_panel.rename(columns={0: 'dollar_value'}, inplace=True)
df_panel = df_panel.reorder_levels(['ticker', 'date'])
df_panel.sort_index(inplace=True)
df_panel

		dollar_value
ticker	date
AAPL	2018-01-01	1.000000
	2018-01-08	1.011943
	2018-01-15	1.019771
	2018-01-22	0.980057
	2018-01-29	0.917143
...	...	...
NFLX	2019-12-02	1.463641
	2019-12-09	1.421496
	2019-12-16	1.604362
	2019-12-23	1.567170
	2019-12-30	1.540883

630 rows × 1 columns

Now let’s do the opposite with the .unstack() method.

df_unstack = df_panel.unstack(level=0)
df_unstack

	dollar_value
ticker	AAPL	AMZN	FB	GOOG	MSFT	NFLX
date
2018-01-01	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
2018-01-08	1.011943	1.061881	0.959968	1.018172	1.015988	1.053526
2018-01-15	1.019771	1.053240	0.970243	1.032008	1.020524	1.049860
2018-01-22	0.980057	1.140676	1.016858	1.066783	1.066561	1.307681
2018-01-29	0.917143	1.163374	1.018357	1.008773	1.040708	1.273537
...	...	...	...	...	...	...
2019-12-02	1.546914	1.425061	1.075997	1.216280	1.720717	1.463641
2019-12-09	1.572286	1.432660	1.038855	1.222821	1.752239	1.421496
2019-12-16	1.596800	1.453455	1.104094	1.224418	1.784896	1.604362
2019-12-23	1.656000	1.521226	1.113728	1.226504	1.802472	1.567170
2019-12-30	1.678000	1.503360	1.098475	1.213014	1.788185	1.540883

105 rows × 6 columns

And clean it up.

df_unstack.columns = df_unstack.columns.droplevel(0)
df_unstack.columns.name = None
df_unstack

	AAPL	AMZN	FB	GOOG	MSFT	NFLX
date
2018-01-01	1.000000	1.000000	1.000000	1.000000	1.000000	1.000000
2018-01-08	1.011943	1.061881	0.959968	1.018172	1.015988	1.053526
2018-01-15	1.019771	1.053240	0.970243	1.032008	1.020524	1.049860
2018-01-22	0.980057	1.140676	1.016858	1.066783	1.066561	1.307681
2018-01-29	0.917143	1.163374	1.018357	1.008773	1.040708	1.273537
...	...	...	...	...	...	...
2019-12-02	1.546914	1.425061	1.075997	1.216280	1.720717	1.463641
2019-12-09	1.572286	1.432660	1.038855	1.222821	1.752239	1.421496
2019-12-16	1.596800	1.453455	1.104094	1.224418	1.784896	1.604362
2019-12-23	1.656000	1.521226	1.113728	1.226504	1.802472	1.567170
2019-12-30	1.678000	1.503360	1.098475	1.213014	1.788185	1.540883

105 rows × 6 columns

Pandas basics

Contents

Pandas basics#

Importing the package#

Creating a dataframe#

Previewing large data#

Pandas options#

Details about the DataFrame#

Setting the index#

Multi index#

Sorting DataFrames#

Manipulating dataframes#

Transpose#

Stack and unstack#