{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading Data from File\n",
"\n",
"Let us read both orders as well as order_items data set from files into Pandas Data Frame.\n",
"* Both the files does not have header and hence we need to pass the schema while creating data frames."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [
{
"data": {
"text/html": [
"\n"
],
"text/plain": [
""
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%%HTML\n",
""
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"-rw-r--r-- 1 root root 2999944 Nov 22 16:08 /data/retail_db/orders/part-00000\n"
]
}
],
"source": [
"%%sh\n",
"\n",
"# ls -ltr /data/retail_db/orders/part-00000"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1,2013-07-25 00:00:00.0,11599,CLOSED\n",
"2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n",
"3,2013-07-25 00:00:00.0,12111,COMPLETE\n",
"4,2013-07-25 00:00:00.0,8827,CLOSED\n",
"5,2013-07-25 00:00:00.0,11318,COMPLETE\n",
"6,2013-07-25 00:00:00.0,7130,COMPLETE\n",
"7,2013-07-25 00:00:00.0,4530,COMPLETE\n",
"8,2013-07-25 00:00:00.0,2911,PROCESSING\n",
"9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT\n",
"10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT\n"
]
}
],
"source": [
"%%sh\n",
"\n",
"# head /data/retail_db/orders/part-00000"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"def get_df(path, schema):\n",
" df = pd.read_csv(\n",
" path,\n",
" header=None,\n",
" names=schema\n",
" )\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"orders_path = \"/data/retail_db/orders/part-00000\"\n",
"orders_schema = [\n",
" \"order_id\",\n",
" \"order_date\",\n",
" \"order_customer_id\",\n",
" \"order_status\"\n",
"]\n",
"orders = get_df(orders_path, orders_schema)"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# type(orders)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" order_id | \n",
" order_date | \n",
" order_customer_id | \n",
" order_status | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 2013-07-25 00:00:00.0 | \n",
" 11599 | \n",
" CLOSED | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 2013-07-25 00:00:00.0 | \n",
" 256 | \n",
" PENDING_PAYMENT | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 2013-07-25 00:00:00.0 | \n",
" 12111 | \n",
" COMPLETE | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" order_id order_date order_customer_id order_status\n",
"0 1 2013-07-25 00:00:00.0 11599 CLOSED\n",
"1 2 2013-07-25 00:00:00.0 256 PENDING_PAYMENT\n",
"2 3 2013-07-25 00:00:00.0 12111 COMPLETE"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# orders.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"```{note}\n",
"When it comes to loading data into database using `cursor.executemany`, we need to pass data as list of tuples or list of lists (not as Pandas Dataframe). We can use `orders.values.tolist()` to convert records in the Pandas Dataframe to list of lists.\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'],\n",
" [2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'],\n",
" [3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE']]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# orders.values.tolist()[:3]"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"list"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# type(orders.values.tolist())"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"list"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# type(orders.values.tolist()[2])"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"order_items_path = \"/data/retail_db/order_items/part-00000\"\n",
"order_items_schema = [\n",
" \"order_item_id\",\n",
" \"order_item_order_id\",\n",
" \"order_item_product_id\",\n",
" \"order_item_quantity\",\n",
" \"order_item_subtotal\",\n",
" \"order_item_product_price\"\n",
"]\n",
"order_items = get_df(order_items_path, order_items_schema)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" order_item_id | \n",
" order_item_order_id | \n",
" order_item_product_id | \n",
" order_item_quantity | \n",
" order_item_subtotal | \n",
" order_item_product_price | \n",
"
\n",
" \n",
" \n",
" \n",
" 0 | \n",
" 1 | \n",
" 1 | \n",
" 957 | \n",
" 1 | \n",
" 299.98 | \n",
" 299.98 | \n",
"
\n",
" \n",
" 1 | \n",
" 2 | \n",
" 2 | \n",
" 1073 | \n",
" 1 | \n",
" 199.99 | \n",
" 199.99 | \n",
"
\n",
" \n",
" 2 | \n",
" 3 | \n",
" 2 | \n",
" 502 | \n",
" 5 | \n",
" 250.00 | \n",
" 50.00 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" order_item_id order_item_order_id order_item_product_id \\\n",
"0 1 1 957 \n",
"1 2 2 1073 \n",
"2 3 2 502 \n",
"\n",
" order_item_quantity order_item_subtotal order_item_product_price \n",
"0 1 299.98 299.98 \n",
"1 1 199.99 199.99 \n",
"2 5 250.00 50.00 "
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# order_items.head(3)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}