Performing Grouped Aggregations¶
Let us understand how to perform grouped or by key aggregations using Pandas.
Here are the steps we need to follow:
Make sure data is read into Data Frame.
Identify the key on which data should be aggregated. If the data has to be aggregated on derived field which is not available as part of the Data Frame, then first we need to update data frame with the derived field.
Using the key group the values using
groupby
function on data frame. We can only pass column names from Data Frame as part ofgroupby
.Apply required aggregate functions to get aggregated results based up on the key.
We can apply multiple aggregate functions at a time after creating grouped data frame.
Pandas Data Frame exposes a function called as
rename
to provide aliases to the aggregated fields.
%run 06_csv_to_pandas_data_frame.ipynb
Getting number of orders per day
orders
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
... | ... | ... | ... | ... |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE |
68883 rows × 4 columns
orders.groupby(orders['order_date'])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f1f811bb828>
list(orders.groupby(orders['order_date'])['order_id'])[:3]
[('2013-07-25 00:00:00.0',
0 1
1 2
2 3
3 4
4 5
...
57786 57787
57787 57788
57788 57789
67415 67416
68690 68691
Name: order_id, Length: 143, dtype: int64),
('2013-07-26 00:00:00.0',
104 105
105 106
106 107
107 108
108 109
...
67418 67419
67419 67420
67420 67421
67421 67422
68691 68692
Name: order_id, Length: 269, dtype: int64),
('2013-07-27 00:00:00.0',
346 347
347 348
348 349
349 350
350 351
...
67422 67423
67423 67424
67424 67425
67425 67426
68692 68693
Name: order_id, Length: 202, dtype: int64)]
orders.groupby(orders['order_date'])['order_id'].count()
order_date
2013-07-25 00:00:00.0 143
2013-07-26 00:00:00.0 269
2013-07-27 00:00:00.0 202
2013-07-28 00:00:00.0 187
2013-07-29 00:00:00.0 253
...
2014-07-20 00:00:00.0 285
2014-07-21 00:00:00.0 235
2014-07-22 00:00:00.0 138
2014-07-23 00:00:00.0 166
2014-07-24 00:00:00.0 185
Name: order_id, Length: 364, dtype: int64
Getting number of orders per status
orders.groupby('order_status')['order_status'].count()
order_status
CANCELED 1428
CLOSED 7556
COMPLETE 22899
ON_HOLD 3798
PAYMENT_REVIEW 729
PENDING 7610
PENDING_PAYMENT 15030
PROCESSING 8275
SUSPECTED_FRAUD 1558
Name: order_status, dtype: int64
Computing revenue per order
order_items
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
... | ... | ... | ... | ... | ... | ... |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
list(order_items. \
groupby('order_item_order_id')['order_item_subtotal'])[:5]
[(1,
0 299.98
Name: order_item_subtotal, dtype: float64),
(2,
1 199.99
2 250.00
3 129.99
Name: order_item_subtotal, dtype: float64),
(4,
4 49.98
5 299.95
6 150.00
7 199.92
Name: order_item_subtotal, dtype: float64),
(5,
8 299.98
9 299.95
10 99.96
11 299.98
12 129.99
Name: order_item_subtotal, dtype: float64),
(7,
13 199.99
14 299.98
15 79.95
Name: order_item_subtotal, dtype: float64)]
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
sum()
order_item_order_id
1 299.98
2 579.98
4 699.85
5 1129.86
7 579.92
...
68879 1259.97
68880 999.77
68881 129.99
68882 109.99
68883 2149.99
Name: order_item_subtotal, Length: 57431, dtype: float64
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count'])
sum | min | max | count | |
---|---|---|---|---|
order_item_order_id | ||||
1 | 299.98 | 299.98 | 299.98 | 1 |
2 | 579.98 | 129.99 | 250.00 | 3 |
4 | 699.85 | 49.98 | 299.95 | 4 |
5 | 1129.86 | 99.96 | 299.98 | 5 |
7 | 579.92 | 79.95 | 299.98 | 3 |
... | ... | ... | ... | ... |
68879 | 1259.97 | 129.99 | 999.99 | 3 |
68880 | 999.77 | 149.94 | 250.00 | 5 |
68881 | 129.99 | 129.99 | 129.99 | 1 |
68882 | 109.99 | 50.00 | 59.99 | 2 |
68883 | 2149.99 | 150.00 | 1999.99 | 2 |
57431 rows × 4 columns
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'min', 'max', 'count']). \
rename(columns={'count': 'item_count', 'sum': 'revenue'})
revenue | min | max | item_count | |
---|---|---|---|---|
order_item_order_id | ||||
1 | 299.98 | 299.98 | 299.98 | 1 |
2 | 579.98 | 129.99 | 250.00 | 3 |
4 | 699.85 | 49.98 | 299.95 | 4 |
5 | 1129.86 | 99.96 | 299.98 | 5 |
7 | 579.92 | 79.95 | 299.98 | 3 |
... | ... | ... | ... | ... |
68879 | 1259.97 | 129.99 | 999.99 | 3 |
68880 | 999.77 | 149.94 | 250.00 | 5 |
68881 | 129.99 | 129.99 | 129.99 | 1 |
68882 | 109.99 | 50.00 | 59.99 | 2 |
68883 | 2149.99 | 150.00 | 1999.99 | 2 |
57431 rows × 4 columns
order_items.rename(columns={'order_item_order_id': 'order_id'})
order_item_id | order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
... | ... | ... | ... | ... | ... | ... |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
Task 1¶
Get order_item_count and order_revenue for each order_id.
order_items
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
... | ... | ... | ... | ... | ... | ... |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
order_items. \
groupby('order_item_order_id')['order_item_subtotal']. \
agg(['sum', 'count']). \
rename(columns={'sum': 'order_revenue', 'count': 'order_item_count'}). \
reset_index()
order_item_order_id | order_revenue | order_item_count | |
---|---|---|---|
0 | 1 | 299.98 | 1 |
1 | 2 | 579.98 | 3 |
2 | 4 | 699.85 | 4 |
3 | 5 | 1129.86 | 5 |
4 | 7 | 579.92 | 3 |
... | ... | ... | ... |
57426 | 68879 | 1259.97 | 3 |
57427 | 68880 | 999.77 | 5 |
57428 | 68881 | 129.99 | 1 |
57429 | 68882 | 109.99 | 2 |
57430 | 68883 | 2149.99 | 2 |
57431 rows × 3 columns
Task 2¶
Get order count by month using orders data for specific order_status.
orders
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
... | ... | ... | ... | ... |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE |
68883 rows × 4 columns
orders.order_date.str.slice(0, 7)
0 2013-07
1 2013-07
2 2013-07
3 2013-07
4 2013-07
...
68878 2014-07
68879 2014-07
68880 2014-07
68881 2014-07
68882 2014-07
Name: order_date, Length: 68883, dtype: object
orders['order_month'] = orders.order_date.str.slice(0, 7)
orders
order_id | order_date | order_customer_id | order_status | order_month | |
---|---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED | 2013-07 |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT | 2013-07 |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE | 2013-07 |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED | 2013-07 |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE | 2013-07 |
... | ... | ... | ... | ... | ... |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE | 2014-07 |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE | 2014-07 |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT | 2014-07 |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD | 2014-07 |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE | 2014-07 |
68883 rows × 5 columns
orders.query('order_status == "COMPLETE"'). \
groupby('order_month')['order_id']. \
count(). \
sort_index()
order_month
2013-07 515
2014-07 1419
2013-10 1783
2014-06 1797
2014-05 1854
2014-02 1869
2013-08 1880
2013-12 1898
2014-01 1911
2014-04 1932
2013-09 1933
2014-03 1967
2013-11 2141
Name: order_id, dtype: int64
Task 3¶
Get order_revenue and order_quantity for each order_id. Add quantity of all items for each order_id to get order_quantity.
order_metrics = order_items. \
groupby('order_item_order_id')[['order_item_subtotal', 'order_item_quantity']]. \
agg(['sum'])
order_metrics.columns = ['order_revenue', 'order_quantity']
order_metrics
order_revenue | order_quantity | |
---|---|---|
order_item_order_id | ||
1 | 299.98 | 1 |
2 | 579.98 | 7 |
4 | 699.85 | 14 |
5 | 1129.86 | 10 |
7 | 579.92 | 7 |
... | ... | ... |
68879 | 1259.97 | 3 |
68880 | 999.77 | 17 |
68881 | 129.99 | 1 |
68882 | 109.99 | 2 |
68883 | 2149.99 | 4 |
57431 rows × 2 columns