Preparing Data Sets

We will be primarily using orders and order_items data set to understand about manipulating collections.

  • orders is available at path /data/retail_db/orders/part-00000

  • order_items is available at path /data/retail_db/order_items/part-00000

  • orders - columns

    • order_id - it is of type integer and unique

    • order_date - it can be considered as string

    • order_customer_id - it is of type integer

    • order_status - it is of type string

  • order_items - columns

    • order_item_id - it is of type integer and unique

    • order_item_order_id - it is of type integer and refers to orders.order_id

    • order_item_product_id - it is of type integer and refers to products.product_id

    • order_item_quantity - it is of type integer and represents number of products as an order item with in an order.

    • order_item_subtotal - it is item level revenue (product of order_item_quantity and order_item_product_price)

    • order_item_product_price - it is product price for each item with in an order.

  • orders is parent data set to order_items and will contain one record per order. Each order can contain multiple items.

  • order_items is child data set to orders and can contain multiple entries for a given order_item_order_id.

Task 1 - Read orders into collection

Let us read orders data set into the collection called as orders. This will be used later.

orders_path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(orders_path)
orders_raw = orders_file.read()
orders = orders_raw.splitlines()
orders[:10]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
len(orders) # same as number of records in the file
68883

Task 2 - Read order_items into collection

Let us read order_items data set into the collection called as order_items. This will be used later.

order_items_path = '/data/retail_db/order_items/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\order_items\\part-00000
order_items_file = open(order_items_path)
order_items_raw = order_items_file.read()
order_items = order_items_raw.splitlines()
order_items[:10]
['1,1,957,1,299.98,299.98',
 '2,2,1073,1,199.99,199.99',
 '3,2,502,5,250.0,50.0',
 '4,2,403,1,129.99,129.99',
 '5,4,897,2,49.98,24.99',
 '6,4,365,5,299.95,59.99',
 '7,4,502,3,150.0,50.0',
 '8,4,1014,4,199.92,49.98',
 '9,5,957,1,299.98,299.98',
 '10,5,365,5,299.95,59.99']
len(order_items) # same as number of records in the file
172198