Reading Data from File

Let us read both orders as well as order_items data set from files into Pandas Data Frame.

  • Both the files does not have header and hence we need to pass the schema while creating data frames.

%%sh

# ls -ltr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2999944 Nov 22 16:08 /data/retail_db/orders/part-00000
%%sh

# head /data/retail_db/orders/part-00000
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT
import pandas as pd
def get_df(path, schema):
    df = pd.read_csv(
        path,
        header=None,
        names=schema
    )
    return df
orders_path = "/data/retail_db/orders/part-00000"
orders_schema = [
    "order_id",
    "order_date",
    "order_customer_id",
    "order_status"
]
orders = get_df(orders_path, orders_schema)
# type(orders)
pandas.core.frame.DataFrame
# orders.head(3)
order_id order_date order_customer_id order_status
0 1 2013-07-25 00:00:00.0 11599 CLOSED
1 2 2013-07-25 00:00:00.0 256 PENDING_PAYMENT
2 3 2013-07-25 00:00:00.0 12111 COMPLETE

Note

When it comes to loading data into database using cursor.executemany, we need to pass data as list of tuples or list of lists (not as Pandas Dataframe). We can use orders.values.tolist() to convert records in the Pandas Dataframe to list of lists.

# orders.values.tolist()[:3]
[[1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'],
 [2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'],
 [3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE']]
# type(orders.values.tolist())
list
# type(orders.values.tolist()[2])
list
order_items_path = "/data/retail_db/order_items/part-00000"
order_items_schema = [
    "order_item_id",
    "order_item_order_id",
    "order_item_product_id",
    "order_item_quantity",
    "order_item_subtotal",
    "order_item_product_price"
]
order_items = get_df(order_items_path, order_items_schema)
# order_items.head(3)
order_item_id order_item_order_id order_item_product_id order_item_quantity order_item_subtotal order_item_product_price
0 1 1 957 1 299.98 299.98
1 2 2 1073 1 199.99 199.99
2 3 2 502 5 250.00 50.00