Reading Data from File¶
Let us read both orders as well as order_items data set from files into Pandas Data Frame.
Both the files does not have header and hence we need to pass the schema while creating data frames.
%%sh
# ls -ltr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2999944 Nov 22 16:08 /data/retail_db/orders/part-00000
%%sh
# head /data/retail_db/orders/part-00000
1,2013-07-25 00:00:00.0,11599,CLOSED
2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT
3,2013-07-25 00:00:00.0,12111,COMPLETE
4,2013-07-25 00:00:00.0,8827,CLOSED
5,2013-07-25 00:00:00.0,11318,COMPLETE
6,2013-07-25 00:00:00.0,7130,COMPLETE
7,2013-07-25 00:00:00.0,4530,COMPLETE
8,2013-07-25 00:00:00.0,2911,PROCESSING
9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT
10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT
import pandas as pd
def get_df(path, schema):
df = pd.read_csv(
path,
header=None,
names=schema
)
return df
orders_path = "/data/retail_db/orders/part-00000"
orders_schema = [
"order_id",
"order_date",
"order_customer_id",
"order_status"
]
orders = get_df(orders_path, orders_schema)
# type(orders)
pandas.core.frame.DataFrame
# orders.head(3)
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
0 | 1 | 2013-07-25 00:00:00.0 | 11599 | CLOSED |
1 | 2 | 2013-07-25 00:00:00.0 | 256 | PENDING_PAYMENT |
2 | 3 | 2013-07-25 00:00:00.0 | 12111 | COMPLETE |
Note
When it comes to loading data into database using cursor.executemany
, we need to pass data as list of tuples or list of lists (not as Pandas Dataframe). We can use orders.values.tolist()
to convert records in the Pandas Dataframe to list of lists.
# orders.values.tolist()[:3]
[[1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'],
[2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'],
[3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE']]
# type(orders.values.tolist())
list
# type(orders.values.tolist()[2])
list
order_items_path = "/data/retail_db/order_items/part-00000"
order_items_schema = [
"order_item_id",
"order_item_order_id",
"order_item_product_id",
"order_item_quantity",
"order_item_subtotal",
"order_item_product_price"
]
order_items = get_df(order_items_path, order_items_schema)
# order_items.head(3)
order_item_id | order_item_order_id | order_item_product_id | order_item_quantity | order_item_subtotal | order_item_product_price | |
---|---|---|---|---|---|---|
0 | 1 | 1 | 957 | 1 | 299.98 | 299.98 |
1 | 2 | 2 | 1073 | 1 | 199.99 | 199.99 |
2 | 3 | 2 | 502 | 5 | 250.00 | 50.00 |