Statistics and other calculations
Contents
Statistics and other calculations#
import pandas as pd
small_url = 'https://gist.githubusercontent.com/van-alfen/bca77fd9f4c0fcf74308bc7cfa12a8a7/raw/030a8fea2ddfd1b8fd257d5a601b15a2942c78d7/small_data.csv'
small_df = pd.read_csv(small_url)
small_df
ticker | price | name | assets_current | liab_current | |
---|---|---|---|---|---|
0 | AAPL | 137 | Apple Inc | 162.82 | 105.72 |
1 | MSFT | 229 | Microsoft Corp | 181.92 | 72.31 |
2 | GOOG | 1677 | Alphabet Inc | 152.58 | 45.22 |
3 | AMZN | 3535 | Amazon.com Inc | 96.33 | 87.81 |
Column statistics#
Calculate the average of the price column and store the result in the avg_value
variable.
avg_value = small_df['price'].mean()
avg_value
1394.5
Calculate the median of the price column.
small_df['price'].median()
953.0
Calculate the minimum of the price column.
small_df['price'].min()
137
Calculate the maximum of the price column.
small_df['price'].max()
3535
Calculate the standard deviation of the price column.
small_df['price'].std()
1591.775005038924
Calculate the product of the price column.
While it does not really make sense to multiply all of the prices together, there are other situations where this functionality will be useful.
small_df['price'].product()
185985261735
Group statistics#
For this section, we will use data from plotly.
import plotly.express as px
tips_df = px.data.tips()
tips_df.tail(10)
total_bill | tip | sex | smoker | day | time | size | |
---|---|---|---|---|---|---|---|
234 | 15.53 | 3.00 | Male | Yes | Sat | Dinner | 2 |
235 | 10.07 | 1.25 | Male | No | Sat | Dinner | 2 |
236 | 12.60 | 1.00 | Male | Yes | Sat | Dinner | 2 |
237 | 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 |
238 | 35.83 | 4.67 | Female | No | Sat | Dinner | 3 |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |
243 | 18.78 | 3.00 | Female | No | Thur | Dinner | 2 |
Suppose we want to know the following:
general statistics for
tip
what the possible values for
time
areaverage
tip
, bytime
maximum
total_bill
, bysex
the number of
tip
s earned eachday
the total
tip
earned eachday
average table
size
andtip
, byday
andsmoker
tips_df['tip'].describe()
count 244.000000
mean 2.998279
std 1.383638
min 1.000000
25% 2.000000
50% 2.900000
75% 3.562500
max 10.000000
Name: tip, dtype: float64
tips_df['time'].unique()
array(['Dinner', 'Lunch'], dtype=object)
tips_df.groupby('time')['tip'].mean()
time
Dinner 3.102670
Lunch 2.728088
Name: tip, dtype: float64
tips_df.groupby('sex')['total_bill'].max()
sex
Female 44.30
Male 50.81
Name: total_bill, dtype: float64
tips_df.groupby('day')['tip'].count()
day
Fri 19
Sat 87
Sun 76
Thur 62
Name: tip, dtype: int64
tips_df.groupby('day')['tip'].sum()
day
Fri 51.96
Sat 260.40
Sun 247.39
Thur 171.83
Name: tip, dtype: float64
groupby_list = [ 'day', 'smoker' ]
variable_list = ['size', 'tip']
tips_df.groupby( groupby_list )[ variable_list ].mean()
size | tip | ||
---|---|---|---|
day | smoker | ||
Fri | No | 2.250000 | 2.812500 |
Yes | 2.066667 | 2.714000 | |
Sat | No | 2.555556 | 3.102889 |
Yes | 2.476190 | 2.875476 | |
Sun | No | 2.929825 | 3.167895 |
Yes | 2.578947 | 3.516842 | |
Thur | No | 2.488889 | 2.673778 |
Yes | 2.352941 | 3.030000 |
Make a table that has the the average tip, average bill, max table size, and number of tables, for each day.
Use .agg()
.
summary_df = tips_df.groupby('day').agg(
avg_tip=('tip', 'mean'),
avg_bill=('total_bill', 'mean'),
max_size=('size', 'max'),
num_tables=('tip', 'count'),
)
summary_df
avg_tip | avg_bill | max_size | num_tables | |
---|---|---|---|---|
day | ||||
Fri | 2.734737 | 17.151579 | 4 | 19 |
Sat | 2.993103 | 20.441379 | 5 | 87 |
Sun | 3.255132 | 21.410000 | 6 | 76 |
Thur | 2.771452 | 17.682742 | 6 | 62 |
Other calculations#
Changes and rolling averages#
Let’s calculate returns from prices and then calculate a rolling average.
# dataframe containing months and prices for AMZN stock
cols_list = ['month', 'price']
price_data = [
['Jan', 250.00],
['Feb', 240.00],
['Mar', 255.00],
['Apr', 250.00],
['May', 250.00],
['Jun', 270.00],
['Jul', 265.00],
]
price_df = pd.DataFrame(price_data, columns=cols_list)
price_df
month | price | |
---|---|---|
0 | Jan | 250.0 |
1 | Feb | 240.0 |
2 | Mar | 255.0 |
3 | Apr | 250.0 |
4 | May | 250.0 |
5 | Jun | 270.0 |
6 | Jul | 265.0 |
Use the pct_change()
method to calculate percent change each month.
Warning
This will only work if your table is sorted by ascending date (i.e., the oldest observations are at the top of your table).
# calculate monthly returns
price_df['return'] = price_df['price'].pct_change()
price_df
month | price | return | |
---|---|---|---|
0 | Jan | 250.0 | NaN |
1 | Feb | 240.0 | -0.040000 |
2 | Mar | 255.0 | 0.062500 |
3 | Apr | 250.0 | -0.019608 |
4 | May | 250.0 | 0.000000 |
5 | Jun | 270.0 | 0.080000 |
6 | Jul | 265.0 | -0.018519 |
Use the .rolling()
method to calculate a simple moving average price over the last three months.
price_df['rolling_avg_price'] = price_df.rolling(3)['price'].mean()
price_df
month | price | return | rolling_avg_price | |
---|---|---|---|---|
0 | Jan | 250.0 | NaN | NaN |
1 | Feb | 240.0 | -0.040000 | NaN |
2 | Mar | 255.0 | 0.062500 | 248.333333 |
3 | Apr | 250.0 | -0.019608 | 248.333333 |
4 | May | 250.0 | 0.000000 | 251.666667 |
5 | Jun | 270.0 | 0.080000 | 256.666667 |
6 | Jul | 265.0 | -0.018519 | 261.666667 |