Performing Total Aggregations¶
Let us understand how to perform total or global aggregations using Pandas.
%run 06_csv_to_pandas_data_frame.ipynb
Getting number of records in the Data Frame.
orders.shape
(68883, 4)
orders.shape[0]
68883
order_items.shape[0]
172198
Getting number of non np.NaN values in each attribute in a Data Frame
orders.count()
order_id 68883
order_date 68883
order_customer_id 68883
order_status 68883
dtype: int64
type(orders.count())
pandas.core.series.Series
orders.count()['order_id']
68883
orders.order_id.count()
68883
orders['order_id'].count()
68883
Getting basic statistics of numeric fields of a Data Frame
orders.describe()
order_id | order_customer_id | |
---|---|---|
count | 68883.000000 | 68883.000000 |
mean | 34442.000000 | 6216.571099 |
std | 19884.953633 | 3586.205241 |
min | 1.000000 | 1.000000 |
25% | 17221.500000 | 3122.000000 |
50% | 34442.000000 | 6199.000000 |
75% | 51662.500000 | 9326.000000 |
max | 68883.000000 | 12435.000000 |
Get revenue for order id 2 from order_items
order_items[order_items.order_item_order_id == 2]
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
order_items[order_items.order_item_order_id == 2].order_item_subtotal
1 199.99
2 250.00
3 129.99
Name: order_item_subtotal, dtype: float64
order_items[order_items.order_item_order_id == 2].order_item_subtotal.sum()
579.98
order_items[order_items.order_item_order_id == 2]['order_item_subtotal'].sum()
579.98
Task 1¶
Use orders and get total number of records for a given month (201401).
orders
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
3 | 4 | 2013-07-25 00:00:00.0 | 8827 | CLOSED |
4 | 5 | 2013-07-25 00:00:00.0 | 11318 | COMPLETE |
... | ... | ... | ... | ... |
68878 | 68879 | 2014-07-09 00:00:00.0 | 778 | COMPLETE |
68879 | 68880 | 2014-07-13 00:00:00.0 | 1117 | COMPLETE |
68880 | 68881 | 2014-07-19 00:00:00.0 | 2518 | PENDING_PAYMENT |
68881 | 68882 | 2014-07-22 00:00:00.0 | 10000 | ON_HOLD |
68882 | 68883 | 2014-07-23 00:00:00.0 | 5533 | COMPLETE |
68883 rows × 4 columns
orders.order_date
0 2013-07-25 00:00:00.0
1 2013-07-25 00:00:00.0
2 2013-07-25 00:00:00.0
3 2013-07-25 00:00:00.0
4 2013-07-25 00:00:00.0
...
68878 2014-07-09 00:00:00.0
68879 2014-07-13 00:00:00.0
68880 2014-07-19 00:00:00.0
68881 2014-07-22 00:00:00.0
68882 2014-07-23 00:00:00.0
Name: order_date, Length: 68883, dtype: object
orders['order_date'].str.slice(0, 7)
0 2013-07
1 2013-07
2 2013-07
3 2013-07
4 2013-07
...
68878 2014-07
68879 2014-07
68880 2014-07
68881 2014-07
68882 2014-07
Name: order_date, Length: 68883, dtype: object
orders['order_date'].str.slice(0, 7).str.replace('-', '').astype('int64')
0 201307
1 201307
2 201307
3 201307
4 201307
...
68878 201407
68879 201407
68880 201407
68881 201407
68882 201407
Name: order_date, Length: 68883, dtype: int64
orders['order_date'].str.slice(0, 7).str.replace('-', '').astype('int64') == 201401
0 False
1 False
2 False
3 False
4 False
...
68878 False
68879 False
68880 False
68881 False
68882 False
Name: order_date, Length: 68883, dtype: bool
orders[orders['order_date'].str.slice(0, 7).str.replace('-', '').astype('int64') == 201401]
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
25875 | 25876 | 2014-01-01 00:00:00.0 | 3414 | PENDING_PAYMENT |
25876 | 25877 | 2014-01-01 00:00:00.0 | 5549 | PENDING_PAYMENT |
25877 | 25878 | 2014-01-01 00:00:00.0 | 9084 | PENDING |
25878 | 25879 | 2014-01-01 00:00:00.0 | 5118 | PENDING |
25879 | 25880 | 2014-01-01 00:00:00.0 | 10146 | CANCELED |
... | ... | ... | ... | ... |
68789 | 68790 | 2014-01-26 00:00:00.0 | 10302 | CLOSED |
68790 | 68791 | 2014-01-27 00:00:00.0 | 6524 | COMPLETE |
68791 | 68792 | 2014-01-28 00:00:00.0 | 9809 | CANCELED |
68792 | 68793 | 2014-01-30 00:00:00.0 | 5654 | COMPLETE |
68793 | 68794 | 2014-01-31 00:00:00.0 | 6873 | COMPLETE |
5908 rows × 4 columns
orders[orders['order_date'].str.slice(0, 7).str.replace('-', '').astype('int64') == 201401]['order_id'].count()
5908
Task 2¶
Use order_items data set and compute total revenue generated for a given product_id.
order_items
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
... | ... | ... | ... | ... | ... | ... |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
order_items.query('order_item_product_id == 502')
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
6 | 7 | 4 | 502 | 3 | 150.0 | 50.0 |
19 | 20 | 8 | 502 | 1 | 50.0 | 50.0 |
37 | 38 | 12 | 502 | 5 | 250.0 | 50.0 |
41 | 42 | 14 | 502 | 1 | 50.0 | 50.0 |
... | ... | ... | ... | ... | ... | ... |
172151 | 172152 | 68861 | 502 | 4 | 200.0 | 50.0 |
172173 | 172174 | 68871 | 502 | 4 | 200.0 | 50.0 |
172189 | 172190 | 68880 | 502 | 5 | 250.0 | 50.0 |
172195 | 172196 | 68882 | 502 | 1 | 50.0 | 50.0 |
172197 | 172198 | 68883 | 502 | 3 | 150.0 | 50.0 |
21035 rows × 6 columns
order_items.query('order_item_product_id == 502')['order_item_subtotal'].sum()
3147800.0
order_items.query('order_item_product_id == 502').order_item_subtotal.sum()
3147800.0
Task 3¶
Use order_items data set and get total number of items sold as well as total revenue generated for a given product_id.
order_items
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |
3 | 4 | 2 | 403 | 1 | 129.99 | 129.99 |
4 | 5 | 4 | 897 | 2 | 49.98 | 24.99 |
... | ... | ... | ... | ... | ... | ... |
172193 | 172194 | 68881 | 403 | 1 | 129.99 | 129.99 |
172194 | 172195 | 68882 | 365 | 1 | 59.99 | 59.99 |
172195 | 172196 | 68882 | 502 | 1 | 50.00 | 50.00 |
172196 | 172197 | 68883 | 208 | 1 | 1999.99 | 1999.99 |
172197 | 172198 | 68883 | 502 | 3 | 150.00 | 50.00 |
172198 rows × 6 columns
order_items_for_product_id = order_items.query('order_item_product_id == 502')
order_items_for_product_id
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
2 | 3 | 2 | 502 | 5 | 250.0 | 50.0 |
6 | 7 | 4 | 502 | 3 | 150.0 | 50.0 |
19 | 20 | 8 | 502 | 1 | 50.0 | 50.0 |
37 | 38 | 12 | 502 | 5 | 250.0 | 50.0 |
41 | 42 | 14 | 502 | 1 | 50.0 | 50.0 |
... | ... | ... | ... | ... | ... | ... |
172151 | 172152 | 68861 | 502 | 4 | 200.0 | 50.0 |
172173 | 172174 | 68871 | 502 | 4 | 200.0 | 50.0 |
172189 | 172190 | 68880 | 502 | 5 | 250.0 | 50.0 |
172195 | 172196 | 68882 | 502 | 1 | 50.0 | 50.0 |
172197 | 172198 | 68883 | 502 | 3 | 150.0 | 50.0 |
21035 rows × 6 columns
order_items_for_product_id[['order_item_quantity', 'order_item_subtotal']]
order_item_quantity | order_item_subtotal | |
---|---|---|
2 | 5 | 250.0 |
6 | 3 | 150.0 |
19 | 1 | 50.0 |
37 | 5 | 250.0 |
41 | 1 | 50.0 |
... | ... | ... |
172151 | 4 | 200.0 |
172173 | 4 | 200.0 |
172189 | 5 | 250.0 |
172195 | 1 | 50.0 |
172197 | 3 | 150.0 |
21035 rows × 2 columns
order_items_for_product_id[['order_item_quantity', 'order_item_subtotal']].sum()
order_item_quantity 62956.0
order_item_subtotal 3147800.0
dtype: float64
tuple(order_items_for_product_id[['order_item_quantity', 'order_item_subtotal']].sum())
(62956.0, 3147800.0)
dict(order_items_for_product_id[['order_item_quantity', 'order_item_subtotal']].sum())
{'order_item_quantity': 62956.0, 'order_item_subtotal': 3147800.0}
Task 4¶
Create a collection with sales and commission percentage. Using that collection compute total commission amount. If the commission percent is None or not present, treat it as 0.
Each element in the collection should be a tuple.
First element is the sales amount and second element is commission percentage.
Commission for each sale can be computed by multiplying commission percentage with sales (make sure to divide commission percentage by 100).
Some of the records does not have commission percentage, in that case commission amount for that sale shall be 0
transactions = [(376.0, 8),
(548.23, 14),
(107.93, 8),
(838.22, 14),
(846.85, 21),
(234.84,),
(850.2, 21),
(992.2, 21),
(267.01,),
(958.91, 21),
(412.59,),
(283.14,),
(350.01, 14),
(226.95,),
(132.7, 14)]
sales = pd.DataFrame(transactions, columns=['sale_amount', 'commission_pct'])
sales
sale_amount | commission_pct | |
---|---|---|
0 | 376.00 | 8.0 |
1 | 548.23 | 14.0 |
2 | 107.93 | 8.0 |
3 | 838.22 | 14.0 |
4 | 846.85 | 21.0 |
5 | 234.84 | NaN |
6 | 850.20 | 21.0 |
7 | 992.20 | 21.0 |
8 | 267.01 | NaN |
9 | 958.91 | 21.0 |
10 | 412.59 | NaN |
11 | 283.14 | NaN |
12 | 350.01 | 14.0 |
13 | 226.95 | NaN |
14 | 132.70 | 14.0 |
sales_filled = sales.fillna(0.0)
sales_filled
sale_amount | commission_pct | |
---|---|---|
0 | 376.00 | 8.0 |
1 | 548.23 | 14.0 |
2 | 107.93 | 8.0 |
3 | 838.22 | 14.0 |
4 | 846.85 | 21.0 |
5 | 234.84 | 0.0 |
6 | 850.20 | 21.0 |
7 | 992.20 | 21.0 |
8 | 267.01 | 0.0 |
9 | 958.91 | 21.0 |
10 | 412.59 | 0.0 |
11 | 283.14 | 0.0 |
12 | 350.01 | 14.0 |
13 | 226.95 | 0.0 |
14 | 132.70 | 14.0 |
(sales_filled['sale_amount'] * (sales_filled['commission_pct'] / 100))
0 30.0800
1 76.7522
2 8.6344
3 117.3508
4 177.8385
5 0.0000
6 178.5420
7 208.3620
8 0.0000
9 201.3711
10 0.0000
11 0.0000
12 49.0014
13 0.0000
14 18.5780
dtype: float64
(sales_filled['sale_amount'] * (sales_filled['commission_pct'] / 100)).sum()
1066.5104000000001
(sales_filled['sale_amount'] * (sales_filled['commission_pct'] / 100)).sum().round(2)
1066.51