Validate myJoin¶
Let us perform a task to valiate myJoin. Join order_id and order_date from orders as well as order_id and order_item_subtotal from order_items as part of the join results.
Read orders data set
Create list of tuples using order_id and order_date with order_id as first element.
Read order_items data set
Create list of tuples using order_item_order_id and order_item_subtotal with order_item_order_id as first element.
Invoke myJoin function with these collections.
%run 04_develop_myMap.ipynb
orders_path = "/data/retail_db/orders/part-00000"
orders = open(orders_path). \
read(). \
splitlines()
orders[:10]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
'2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
'3,2013-07-25 00:00:00.0,12111,COMPLETE',
'4,2013-07-25 00:00:00.0,8827,CLOSED',
'5,2013-07-25 00:00:00.0,11318,COMPLETE',
'6,2013-07-25 00:00:00.0,7130,COMPLETE',
'7,2013-07-25 00:00:00.0,4530,COMPLETE',
'8,2013-07-25 00:00:00.0,2911,PROCESSING',
'9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
'10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
orders_map = myMap(orders,
lambda order: (int(order.split(',')[0]), order.split(',')[1])
)
orders_map[:10]
[(1, '2013-07-25 00:00:00.0'),
(2, '2013-07-25 00:00:00.0'),
(3, '2013-07-25 00:00:00.0'),
(4, '2013-07-25 00:00:00.0'),
(5, '2013-07-25 00:00:00.0'),
(6, '2013-07-25 00:00:00.0'),
(7, '2013-07-25 00:00:00.0'),
(8, '2013-07-25 00:00:00.0'),
(9, '2013-07-25 00:00:00.0'),
(10, '2013-07-25 00:00:00.0')]
order_items_path = "/data/retail_db/order_items/part-00000"
order_items = open(order_items_path). \
read(). \
splitlines()
order_items[:10]
['1,1,957,1,299.98,299.98',
'2,2,1073,1,199.99,199.99',
'3,2,502,5,250.0,50.0',
'4,2,403,1,129.99,129.99',
'5,4,897,2,49.98,24.99',
'6,4,365,5,299.95,59.99',
'7,4,502,3,150.0,50.0',
'8,4,1014,4,199.92,49.98',
'9,5,957,1,299.98,299.98',
'10,5,365,5,299.95,59.99']
order_items_map = myMap(order_items,
lambda order_item: (int(order_item.split(',')[1]),
float(order_item.split(',')[4])
)
)
%run 10_develop_myJoin.ipynb
orders_join = myJoin(orders_map, order_items_map)
orders_join[:10]
[(1, ('2013-07-25 00:00:00.0', 299.98)),
(2, ('2013-07-25 00:00:00.0', 199.99)),
(2, ('2013-07-25 00:00:00.0', 250.0)),
(2, ('2013-07-25 00:00:00.0', 129.99)),
(4, ('2013-07-25 00:00:00.0', 49.98)),
(4, ('2013-07-25 00:00:00.0', 299.95)),
(4, ('2013-07-25 00:00:00.0', 150.0)),
(4, ('2013-07-25 00:00:00.0', 199.92)),
(5, ('2013-07-25 00:00:00.0', 299.98)),
(5, ('2013-07-25 00:00:00.0', 299.95))]