Statistics and other calculations#
import pandas as pd
small_data = {
    'ticker': ['AAPL', 'MSFT', 'GOOG', 'AMZN'],
    'price': [137, 229, 1677, 3535],
    'name': ['Apple Inc', 'Microsoft Corp', 'Alphabet Inc', 'Amazon.com Inc'],
    'assets_current': [162.82, 181.92, 152.58, 96.33],
    'liab_current': [105.72, 72.31, 45.22, 87.81]
}
small_df = pd.DataFrame(small_data)
small_df
| ticker | price | name | assets_current | liab_current | |
|---|---|---|---|---|---|
| 0 | AAPL | 137 | Apple Inc | 162.82 | 105.72 | 
| 1 | MSFT | 229 | Microsoft Corp | 181.92 | 72.31 | 
| 2 | GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 | 
| 3 | AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 | 
Column statistics#
Calculate the average of the price column and store the result in the avg_value variable.
avg_value = small_df['price'].mean()
avg_value
1394.5
Calculate the median of the price column.
small_df['price'].median()
953.0
Calculate the minimum of the price column.
small_df['price'].min()
137
Calculate the maximum of the price column.
small_df['price'].max()
3535
Calculate the standard deviation of the price column.
small_df['price'].std()
1591.775005038924
Calculate the product of the price column.
While it does not really make sense to multiply all of the prices together, there are other situations where this functionality will be useful.
small_df['price'].product()
185985261735
Group statistics#
For this section, we will use data from plotly.
import plotly.express as px
tips_df = px.data.tips()
tips_df.tail(10)
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 234 | 15.53 | 3.00 | Male | Yes | Sat | Dinner | 2 | 
| 235 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 | 
| 236 | 12.60 | 1.00 | Male | Yes | Sat | Dinner | 2 | 
| 237 | 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 | 
| 238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 | 
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 
| 243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 | 
Suppose we want to know the following:
general statistics for
tipwhat the possible values for
timearethe number of observations for each
timevalueaverage
tip, bytimemaximum
total_bill, bysexthe number of
tips earned eachdaythe total
tipearned eachdayaverage table
sizeandtip, bydayandsmoker
tips_df['tip'].describe()
count    244.000000
mean       2.998279
std        1.383638
min        1.000000
25%        2.000000
50%        2.900000
75%        3.562500
max       10.000000
Name: tip, dtype: float64
tips_df['time'].unique()
array(['Dinner', 'Lunch'], dtype=object)
tips_df['time'].value_counts()
time
Dinner    176
Lunch      68
Name: count, dtype: int64
tips_df.groupby('time')['tip'].mean()
time
Dinner    3.102670
Lunch     2.728088
Name: tip, dtype: float64
tips_df.groupby('sex')['total_bill'].max()
sex
Female    44.30
Male      50.81
Name: total_bill, dtype: float64
tips_df.groupby('day')['tip'].count()
day
Fri     19
Sat     87
Sun     76
Thur    62
Name: tip, dtype: int64
tips_df.groupby('day')['tip'].sum()
day
Fri      51.96
Sat     260.40
Sun     247.39
Thur    171.83
Name: tip, dtype: float64
groupby_list = [ 'day', 'smoker' ]
variable_list = ['size', 'tip']
tips_df.groupby( groupby_list )[ variable_list ].mean()
| size | tip | ||
|---|---|---|---|
| day | smoker | ||
| Fri | No | 2.250000 | 2.812500 | 
| Yes | 2.066667 | 2.714000 | |
| Sat | No | 2.555556 | 3.102889 | 
| Yes | 2.476190 | 2.875476 | |
| Sun | No | 2.929825 | 3.167895 | 
| Yes | 2.578947 | 3.516842 | |
| Thur | No | 2.488889 | 2.673778 | 
| Yes | 2.352941 | 3.030000 | 
Make a table that has the the average tip, average bill, max table size, and number of tables, for each day.
Use .agg().
summary_df = tips_df.groupby('day').agg(
	avg_tip=('tip', 'mean'),
	avg_bill=('total_bill', 'mean'),
	max_size=('size', 'max'),
	num_tables=('tip', 'count'),
)
summary_df
| avg_tip | avg_bill | max_size | num_tables | |
|---|---|---|---|---|
| day | ||||
| Fri | 2.734737 | 17.151579 | 4 | 19 | 
| Sat | 2.993103 | 20.441379 | 5 | 87 | 
| Sun | 3.255132 | 21.410000 | 6 | 76 | 
| Thur | 2.771452 | 17.682742 | 6 | 62 | 
Other calculations#
Changes and rolling averages#
Let’s calculate returns from prices and then calculate a rolling average.
# dataframe containing months and prices for AMZN stock
cols_list = ['month', 'price']
price_data = [
	['Jan',  250.00],
	['Feb',  240.00],
	['Mar',  255.00],
	['Apr',  250.00],
	['May',  250.00],
	['Jun',  270.00],
	['Jul',  265.00],
]
price_df = pd.DataFrame(price_data, columns=cols_list)
price_df
| month | price | |
|---|---|---|
| 0 | Jan | 250.0 | 
| 1 | Feb | 240.0 | 
| 2 | Mar | 255.0 | 
| 3 | Apr | 250.0 | 
| 4 | May | 250.0 | 
| 5 | Jun | 270.0 | 
| 6 | Jul | 265.0 | 
Use the pct_change() method to calculate percent change each month.
Warning
This will only work if your table is sorted by ascending date (i.e., the oldest observations are at the top of your table).
# calculate monthly returns
price_df['return'] = price_df['price'].pct_change()
price_df
| month | price | return | |
|---|---|---|---|
| 0 | Jan | 250.0 | NaN | 
| 1 | Feb | 240.0 | -0.040000 | 
| 2 | Mar | 255.0 | 0.062500 | 
| 3 | Apr | 250.0 | -0.019608 | 
| 4 | May | 250.0 | 0.000000 | 
| 5 | Jun | 270.0 | 0.080000 | 
| 6 | Jul | 265.0 | -0.018519 | 
Use the .rolling() method to calculate a simple moving average price over the last three months.
price_df['rolling_avg_price'] = price_df.rolling(3)['price'].mean()
price_df
| month | price | return | rolling_avg_price | |
|---|---|---|---|---|
| 0 | Jan | 250.0 | NaN | NaN | 
| 1 | Feb | 240.0 | -0.040000 | NaN | 
| 2 | Mar | 255.0 | 0.062500 | 248.333333 | 
| 3 | Apr | 250.0 | -0.019608 | 248.333333 | 
| 4 | May | 250.0 | 0.000000 | 251.666667 | 
| 5 | Jun | 270.0 | 0.080000 | 256.666667 | 
| 6 | Jul | 265.0 | -0.018519 | 261.666667 |