Performing Grouped Aggregations¶
Let us understand how to perform grouped or by key aggregations using Pandas.
Here are the steps we need to follow:
Make sure data is read into Data Frame.
Identify the key on which data should be aggregated. If the data has to be aggregated on derived field which is not available as part of the Data Frame, then first we need to update data frame with the derived field.
Using the key group the values using
groupby
function on data frame. We can only pass column names from Data Frame as part ofgroupby
.Apply required aggregate functions to get aggregated results based up on the key.
We can apply multiple aggregate functions at a time after creating grouped data frame.
Pandas Data Frame exposes a function called as
rename
to provide aliases to the aggregated fields.
Getting number of orders per day
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
... | ... | ... | ... | ... |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE |
68883 rows × 4 columns
[('2013-07-25 00:00:00.0',
0 1
1 2
2 3
3 4
4 5
...
57786 57787
57787 57788
57788 57789
67415 67416
68690 68691
Name: order_id, Length: 143, dtype: int64),
('2013-07-26 00:00:00.0',
104 105
105 106
106 107
107 108
108 109
...
67418 67419
67419 67420
67420 67421
67421 67422
68691 68692
Name: order_id, Length: 269, dtype: int64),
('2013-07-27 00:00:00.0',
346 347
347 348
348 349
349 350
350 351
...
67422 67423
67423 67424
67424 67425
67425 67426
68692 68693
Name: order_id, Length: 202, dtype: int64)]
order_date
2013-07-25 00:00:00.0 143
2013-07-26 00:00:00.0 269
2013-07-27 00:00:00.0 202
2013-07-28 00:00:00.0 187
2013-07-29 00:00:00.0 253
...
2014-07-20 00:00:00.0 285
2014-07-21 00:00:00.0 235
2014-07-22 00:00:00.0 138
2014-07-23 00:00:00.0 166
2014-07-24 00:00:00.0 185
Name: order_id, Length: 364, dtype: int64
Getting number of orders per status
Computing revenue per order
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
... | ... | ... | ... | ... | ... | ... |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
[(1,
0 299.98
Name: order_item_subtotal, dtype: float64),
(2,
1 199.99
2 250.00
3 129.99
Name: order_item_subtotal, dtype: float64),
(4,
4 49.98
5 299.95
6 150.00
7 199.92
Name: order_item_subtotal, dtype: float64),
(5,
8 299.98
9 299.95
10 99.96
11 299.98
12 129.99
Name: order_item_subtotal, dtype: float64),
(7,
13 199.99
14 299.98
15 79.95
Name: order_item_subtotal, dtype: float64)]
sum | min | max | count | |
---|---|---|---|---|
order_item_order_id | ||||
1 | 299.98 | 299.98 | 299.98 | 1 |
2 | 579.98 | 129.99 | 250.00 | 3 |
4 | 699.85 | 49.98 | 299.95 | 4 |
5 | 1129.86 | 99.96 | 299.98 | 5 |
7 | 579.92 | 79.95 | 299.98 | 3 |
... | ... | ... | ... | ... |
68879 | 1259.97 | 129.99 | 999.99 | 3 |
68880 | 999.77 | 149.94 | 250.00 | 5 |
68881 | 129.99 | 129.99 | 129.99 | 1 |
68882 | 109.99 | 50.00 | 59.99 | 2 |
68883 | 2149.99 | 150.00 | 1999.99 | 2 |
57431 rows × 4 columns
revenue | min | max | item_count | |
---|---|---|---|---|
order_item_order_id | ||||
1 | 299.98 | 299.98 | 299.98 | 1 |
2 | 579.98 | 129.99 | 250.00 | 3 |
4 | 699.85 | 49.98 | 299.95 | 4 |
5 | 1129.86 | 99.96 | 299.98 | 5 |
7 | 579.92 | 79.95 | 299.98 | 3 |
... | ... | ... | ... | ... |
68879 | 1259.97 | 129.99 | 999.99 | 3 |
68880 | 999.77 | 149.94 | 250.00 | 5 |
68881 | 129.99 | 129.99 | 129.99 | 1 |
68882 | 109.99 | 50.00 | 59.99 | 2 |
68883 | 2149.99 | 150.00 | 1999.99 | 2 |
57431 rows × 4 columns
order_item_id | order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
... | ... | ... | ... | ... | ... | ... |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
Task 1¶
Get order_item_count and order_revenue for each order_id.
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
... | ... | ... | ... | ... | ... | ... |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
order_item_order_id | order_revenue | order_item_count | |
---|---|---|---|
0 | 1 | 299.98 | 1 |
1 | 2 | 579.98 | 3 |
2 | 4 | 699.85 | 4 |
3 | 5 | 1129.86 | 5 |
4 | 7 | 579.92 | 3 |
... | ... | ... | ... |
57426 | 68879 | 1259.97 | 3 |
57427 | 68880 | 999.77 | 5 |
57428 | 68881 | 129.99 | 1 |
57429 | 68882 | 109.99 | 2 |
57430 | 68883 | 2149.99 | 2 |
57431 rows × 3 columns
Task 2¶
Get order count by month using orders data for specific order_status.
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
... | ... | ... | ... | ... |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE |
68883 rows × 4 columns
order_id | order_date | order_customer_id | order_status | order_month | |
---|---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED | 2013-07 |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT | 2013-07 |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE | 2013-07 |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED | 2013-07 |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE | 2013-07 |
... | ... | ... | ... | ... | ... |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE | 2014-07 |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE | 2014-07 |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT | 2014-07 |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD | 2014-07 |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE | 2014-07 |
68883 rows × 5 columns
Task 3¶
Get order_revenue and order_quantity for each order_id. Add quantity of all items for each order_id to get order_quantity.