Statistics and other calculations#

import pandas as pd

small_url = 'https://gist.githubusercontent.com/van-alfen/bca77fd9f4c0fcf74308bc7cfa12a8a7/raw/030a8fea2ddfd1b8fd257d5a601b15a2942c78d7/small_data.csv'
small_df = pd.read_csv(small_url)
small_df
ticker price name assets_current liab_current
0 AAPL 137 Apple Inc 162.82 105.72
1 MSFT 229 Microsoft Corp 181.92 72.31
2 GOOG 1677 Alphabet Inc 152.58 45.22
3 AMZN 3535 Amazon.com Inc 96.33 87.81

Column statistics#

Calculate the average of the price column and store the result in the avg_value variable.

avg_value = small_df['price'].mean()
avg_value
1394.5

Calculate the median of the price column.

small_df['price'].median()
953.0

Calculate the minimum of the price column.

small_df['price'].min()
137

Calculate the maximum of the price column.

small_df['price'].max()
3535

Calculate the standard deviation of the price column.

small_df['price'].std()
1591.775005038924

Calculate the product of the price column.

While it does not really make sense to multiply all of the prices together, there are other situations where this functionality will be useful.

small_df['price'].product()
185985261735

Group statistics#

For this section, we will use data from plotly.

import plotly.express as px

tips_df = px.data.tips()
tips_df.tail(10)
total_bill tip sex smoker day time size
234 15.53 3.00 Male Yes Sat Dinner 2
235 10.07 1.25 Male No Sat Dinner 2
236 12.60 1.00 Male Yes Sat Dinner 2
237 32.83 1.17 Male Yes Sat Dinner 2
238 35.83 4.67 Female No Sat Dinner 3
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

Suppose we want to know the following:

  • general statistics for tip

  • what the possible values for time are

  • average tip, by time

  • maximum total_bill, by sex

  • the number of tips earned each day

  • the total tip earned each day

  • average table size and tip, by day and smoker

tips_df['tip'].describe()
count    244.000000
mean       2.998279
std        1.383638
min        1.000000
25%        2.000000
50%        2.900000
75%        3.562500
max       10.000000
Name: tip, dtype: float64
tips_df['time'].unique()
array(['Dinner', 'Lunch'], dtype=object)
tips_df.groupby('time')['tip'].mean()
time
Dinner    3.102670
Lunch     2.728088
Name: tip, dtype: float64
tips_df.groupby('sex')['total_bill'].max()
sex
Female    44.30
Male      50.81
Name: total_bill, dtype: float64
tips_df.groupby('day')['tip'].count()
day
Fri     19
Sat     87
Sun     76
Thur    62
Name: tip, dtype: int64
tips_df.groupby('day')['tip'].sum()
day
Fri      51.96
Sat     260.40
Sun     247.39
Thur    171.83
Name: tip, dtype: float64
groupby_list = [ 'day', 'smoker' ]
variable_list = ['size', 'tip']
tips_df.groupby( groupby_list )[ variable_list ].mean()
size tip
day smoker
Fri No 2.250000 2.812500
Yes 2.066667 2.714000
Sat No 2.555556 3.102889
Yes 2.476190 2.875476
Sun No 2.929825 3.167895
Yes 2.578947 3.516842
Thur No 2.488889 2.673778
Yes 2.352941 3.030000

Make a table that has the the average tip, average bill, max table size, and number of tables, for each day.

Use .agg().

summary_df = tips_df.groupby('day').agg(
	avg_tip=('tip', 'mean'),
	avg_bill=('total_bill', 'mean'),
	max_size=('size', 'max'),
	num_tables=('tip', 'count'),
)
summary_df
avg_tip avg_bill max_size num_tables
day
Fri 2.734737 17.151579 4 19
Sat 2.993103 20.441379 5 87
Sun 3.255132 21.410000 6 76
Thur 2.771452 17.682742 6 62

Other calculations#

Changes and rolling averages#

Let’s calculate returns from prices and then calculate a rolling average.

# dataframe containing months and prices for AMZN stock
cols_list = ['month', 'price']
price_data = [
	['Jan',  250.00],
	['Feb',  240.00],
	['Mar',  255.00],
	['Apr',  250.00],
	['May',  250.00],
	['Jun',  270.00],
	['Jul',  265.00],
]

price_df = pd.DataFrame(price_data, columns=cols_list)
price_df
month price
0 Jan 250.0
1 Feb 240.0
2 Mar 255.0
3 Apr 250.0
4 May 250.0
5 Jun 270.0
6 Jul 265.0

Use the pct_change() method to calculate percent change each month.

Warning

This will only work if your table is sorted by ascending date (i.e., the oldest observations are at the top of your table).

# calculate monthly returns
price_df['return'] = price_df['price'].pct_change()
price_df
month price return
0 Jan 250.0 NaN
1 Feb 240.0 -0.040000
2 Mar 255.0 0.062500
3 Apr 250.0 -0.019608
4 May 250.0 0.000000
5 Jun 270.0 0.080000
6 Jul 265.0 -0.018519

Use the .rolling() method to calculate a simple moving average price over the last three months.

price_df['rolling_avg_price'] = price_df.rolling(3)['price'].mean()
price_df
month price return rolling_avg_price
0 Jan 250.0 NaN NaN
1 Feb 240.0 -0.040000 NaN
2 Mar 255.0 0.062500 248.333333
3 Apr 250.0 -0.019608 248.333333
4 May 250.0 0.000000 251.666667
5 Jun 270.0 0.080000 256.666667
6 Jul 265.0 -0.018519 261.666667