Statistics and other calculations

Statistics and other calculations#

import pandas as pd

small_data = {
    'ticker': ['AAPL', 'MSFT', 'GOOG', 'AMZN'],
    'price': [137, 229, 1677, 3535],
    'name': ['Apple Inc', 'Microsoft Corp', 'Alphabet Inc', 'Amazon.com Inc'],
    'assets_current': [162.82, 181.92, 152.58, 96.33],
    'liab_current': [105.72, 72.31, 45.22, 87.81]
}
small_df = pd.DataFrame(small_data)
small_df

	ticker	price	name	assets_current	liab_current
0	AAPL	137	Apple Inc	162.82	105.72
1	MSFT	229	Microsoft Corp	181.92	72.31
2	GOOG	1677	Alphabet Inc	152.58	45.22
3	AMZN	3535	Amazon.com Inc	96.33	87.81

Column statistics#

Calculate the average of the price column and store the result in the avg_value variable.

avg_value = small_df['price'].mean()
avg_value

1394.5

Calculate the median of the price column.

small_df['price'].median()

953.0

Calculate the minimum of the price column.

small_df['price'].min()

Calculate the maximum of the price column.

small_df['price'].max()

Calculate the standard deviation of the price column.

small_df['price'].std()

1591.775005038924

Calculate the product of the price column.

While it does not really make sense to multiply all of the prices together, there are other situations where this functionality will be useful.

small_df['price'].product()

185985261735

Group statistics#

For this section, we will use data from plotly.

import plotly.express as px

tips_df = px.data.tips()
tips_df.tail(10)

	total_bill	tip	sex	smoker	day	time	size
234	15.53	3.00	Male	Yes	Sat	Dinner	2
235	10.07	1.25	Male	No	Sat	Dinner	2
236	12.60	1.00	Male	Yes	Sat	Dinner	2
237	32.83	1.17	Male	Yes	Sat	Dinner	2
238	35.83	4.67	Female	No	Sat	Dinner	3
239	29.03	5.92	Male	No	Sat	Dinner	3
240	27.18	2.00	Female	Yes	Sat	Dinner	2
241	22.67	2.00	Male	Yes	Sat	Dinner	2
242	17.82	1.75	Male	No	Sat	Dinner	2
243	18.78	3.00	Female	No	Thur	Dinner	2

Suppose we want to know the following:

general statistics for tip
what the possible values for time are
the number of observations for each time value
average tip, by time
maximum total_bill, by sex
the number of tips earned each day
the total tip earned each day
average table size and tip, by day and smoker

tips_df['tip'].describe()

count    244.000000
mean       2.998279
std        1.383638
min        1.000000
25%        2.000000
50%        2.900000
75%        3.562500
max       10.000000
Name: tip, dtype: float64

tips_df['time'].unique()

array(['Dinner', 'Lunch'], dtype=object)

tips_df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

tips_df.groupby('time')['tip'].mean()

time
Dinner    3.102670
Lunch     2.728088
Name: tip, dtype: float64

tips_df.groupby('sex')['total_bill'].max()

sex
Female    44.30
Male      50.81
Name: total_bill, dtype: float64

tips_df.groupby('day')['tip'].count()

day
Fri     19
Sat     87
Sun     76
Thur    62
Name: tip, dtype: int64

tips_df.groupby('day')['tip'].sum()

day
Fri      51.96
Sat     260.40
Sun     247.39
Thur    171.83
Name: tip, dtype: float64

groupby_list = [ 'day', 'smoker' ]
variable_list = ['size', 'tip']
tips_df.groupby( groupby_list )[ variable_list ].mean()

		size	tip
day	smoker
Fri	No	2.250000	2.812500
Fri	Yes	2.066667	2.714000
Sat	No	2.555556	3.102889
Sat	Yes	2.476190	2.875476
Sun	No	2.929825	3.167895
Sun	Yes	2.578947	3.516842
Thur	No	2.488889	2.673778
Thur	Yes	2.352941	3.030000

Make a table that has the the average tip, average bill, max table size, and number of tables, for each day.

Use .agg().

summary_df = tips_df.groupby('day').agg(
	avg_tip=('tip', 'mean'),
	avg_bill=('total_bill', 'mean'),
	max_size=('size', 'max'),
	num_tables=('tip', 'count'),
)
summary_df

	avg_tip	avg_bill	max_size	num_tables
day
Fri	2.734737	17.151579	4	19
Sat	2.993103	20.441379	5	87
Sun	3.255132	21.410000	6	76
Thur	2.771452	17.682742	6	62

Other calculations#

Changes and rolling averages#

Let’s calculate returns from prices and then calculate a rolling average.

# dataframe containing months and prices for AMZN stock
cols_list = ['month', 'price']
price_data = [
	['Jan',  250.00],
	['Feb',  240.00],
	['Mar',  255.00],
	['Apr',  250.00],
	['May',  250.00],
	['Jun',  270.00],
	['Jul',  265.00],
]

price_df = pd.DataFrame(price_data, columns=cols_list)
price_df

	month	price
0	Jan	250.0
1	Feb	240.0
2	Mar	255.0
3	Apr	250.0
4	May	250.0
5	Jun	270.0
6	Jul	265.0

Use the pct_change() method to calculate percent change each month.

Warning

This will only work if your table is sorted by ascending date (i.e., the oldest observations are at the top of your table).

# calculate monthly returns
price_df['return'] = price_df['price'].pct_change()
price_df

	month	price	return
0	Jan	250.0	NaN
1	Feb	240.0	-0.040000
2	Mar	255.0	0.062500
3	Apr	250.0	-0.019608
4	May	250.0	0.000000
5	Jun	270.0	0.080000
6	Jul	265.0	-0.018519

Use the .rolling() method to calculate a simple moving average price over the last three months.

price_df['rolling_avg_price'] = price_df.rolling(3)['price'].mean()
price_df

	month	price	return	rolling_avg_price
0	Jan	250.0	NaN	NaN
1	Feb	240.0	-0.040000	NaN
2	Mar	255.0	0.062500	248.333333
3	Apr	250.0	-0.019608	248.333333
4	May	250.0	0.000000	251.666667
5	Jun	270.0	0.080000	256.666667
6	Jul	265.0	-0.018519	261.666667