Using groupby

Let us understand how we can use itertools.groupby to take care of aggregations by key.

  • itertools.groupby can be used to get the data grouped by a key.

  • It can be used to take care of use cases similar to following by using aggregate functions after grouping by key.

    • Get count by order status.

    • Get revenue for each order.

    • Get order count by month.

  • We need to ensure data is pre-sorted by the key, so that all the values associated with each key are grouped together.

import itertools as iter
iter.groupby?
Init signature: iter.groupby(self, /, *args, **kwargs)
Docstring:     
groupby(iterable, key=None) -> make an iterator that returns consecutive
keys and groups from the iterable.  If the key function is not specified or
is None, the element itself is used for grouping.
Type:           type
Subclasses:     
l = [1, 1, 3, 2, 1, 3, 2]
l_grouped = iter.groupby(l)
list(l_grouped)
[(1, <itertools._grouper at 0x7fdd9fb73cf8>),
 (3, <itertools._grouper at 0x7fdd41350d68>),
 (2, <itertools._grouper at 0x7fdd41350a90>),
 (1, <itertools._grouper at 0x7fdd3be07b00>),
 (3, <itertools._grouper at 0x7fdd3be07ef0>),
 (2, <itertools._grouper at 0x7fdd3be07f60>)]
l_sorted = sorted(l)
ls_grouped = iter.groupby(l_sorted)
list(ls_grouped)
[(1, <itertools._grouper at 0x7fdd3be781d0>),
 (2, <itertools._grouper at 0x7fdd3be07630>),
 (3, <itertools._grouper at 0x7fdd3be07be0>)]

Note

Rebuilding l_sorted and ls_grouped as ls_grouped will be flushed out after being read by list(ls_grouped).

l_sorted = sorted(l)
ls_grouped = iter.groupby(l_sorted)
list(iter.starmap(lambda key, values: (key, len(list(values))), ls_grouped))
[(1, 3), (2, 2), (3, 2)]
%run 02_preparing_data_sets.ipynb

Task 1 - Order Count by Status

Get count by order status using orders data set.

orders[:3]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE']
orders_sorted = sorted(orders, key=lambda k: k.split(',')[3])
orders_sorted[:3]
['50,2013-07-25 00:00:00.0,5225,CANCELED',
 '112,2013-07-26 00:00:00.0,5375,CANCELED',
 '527,2013-07-28 00:00:00.0,5426,CANCELED']
orders_grouped = iter.groupby(orders_sorted, lambda order: order.split(',')[3])
list(orders_grouped)[:3]
[('CANCELED', <itertools._grouper at 0x7fdd3be07b70>),
 ('CLOSED', <itertools._grouper at 0x7fdd3be62be0>),
 ('COMPLETE', <itertools._grouper at 0x7fdd3a12c0f0>)]
orders_sorted = sorted(orders, key=lambda k: k.split(',')[3])
orders_grouped = iter.groupby(orders_sorted, lambda order: order.split(',')[3])
order_count_by_status = iter.starmap(lambda key, values: (key, len(list(values))), orders_grouped)
list(order_count_by_status)
[('CANCELED', 1428),
 ('CLOSED', 7556),
 ('COMPLETE', 22899),
 ('ON_HOLD', 3798),
 ('PAYMENT_REVIEW', 729),
 ('PENDING', 7610),
 ('PENDING_PAYMENT', 15030),
 ('PROCESSING', 8275),
 ('SUSPECTED_FRAUD', 1558)]

Task 2 - Revenue per Order

Get revenue per order using order_items data set.

order_items[:4]
['1,1,957,1,299.98,299.98',
 '2,2,1073,1,199.99,199.99',
 '3,2,502,5,250.0,50.0',
 '4,2,403,1,129.99,129.99']
order_subtotals = map(lambda oi: (int(oi.split(',')[1]), float(oi.split(',')[4])), order_items)
list(order_subtotals)[:3]
[(1, 299.98), (2, 199.99), (2, 250.0)]
order_subtotals = map(lambda oi: (int(oi.split(',')[1]), float(oi.split(',')[4])), order_items)
order_subtotals_sorted = sorted(order_subtotals)
order_subtotals_grouped = iter.groupby(order_subtotals_sorted, lambda rec: rec[0])
list(order_subtotals_grouped)[:3]
[(1, <itertools._grouper at 0x7fdd3be62da0>),
 (2, <itertools._grouper at 0x7fdd38d9c208>),
 (4, <itertools._grouper at 0x7fdd38d9c390>)]
order_subtotals = map(lambda oi: (int(oi.split(',')[1]), float(oi.split(',')[4])), order_items)
order_subtotals_sorted = sorted(order_subtotals)

order_subtotals_grouped = iter.groupby(order_subtotals_sorted, lambda rec: rec[0])

item = list(order_subtotals_grouped)[0]
print(item[1]) # Contains similar to this [(2, 199.99), (2, 250.0), (2, 129.99)]
<itertools._grouper object at 0x7fdd40adb048>
i = [(2, 199.99), (2, 250.0), (2, 129.99)]
list(map(lambda rec: rec[1], i))
[199.99, 250.0, 129.99]
sum(list(map(lambda rec: rec[1], i))) # this will go as part of first argument to starmap
579.98
order_subtotals = map(lambda oi: (int(oi.split(',')[1]), float(oi.split(',')[4])), order_items)
order_subtotals_sorted = sorted(order_subtotals)

order_subtotals_grouped = iter.groupby(order_subtotals_sorted, lambda rec: rec[0])

order_revenue = iter.starmap(
    lambda key, values: (key, round(sum(list(map(lambda rec: rec[1], values))), 2)), 
    order_subtotals_grouped
)
list(order_revenue)[:3]
[(1, 299.98), (2, 579.98), (4, 699.85)]

Note

Alternative solution by avoiding first map.

order_items_sorted = sorted(order_items, key=lambda oi: int(oi.split(',')[1]))

order_items_grouped = iter.groupby(order_items_sorted, lambda oi: int(oi.split(',')[1]))
order_items[1:4]
['2,2,1073,1,199.99,199.99', '3,2,502,5,250.0,50.0', '4,2,403,1,129.99,129.99']
values = order_items[1:4]
list(map(lambda rec: float(rec.split(',')[4]), values))
[199.99, 250.0, 129.99]
sum(list(map(lambda rec: float(rec.split(',')[4]), values)))
579.98
order_revenue = iter.starmap(
    lambda key, values: (key, round(sum(list(map(lambda rec: float(rec.split(',')[4]), values))), 2)), 
    order_items_grouped
)
list(order_revenue)[:3]
[(1, 299.98), (2, 579.98), (4, 699.85)]