Limitations of using Loops¶
There are several limitations using loops.
If you look at the below examples related to processing collections using loops, most of the functions have similar logic to iterate through elements.
We are spending more time on coding non business logic.
It results in too much of code and it can become a maintenance problem.
def get_customer_orders(orders, customer_id):
orders_filtered = []
for order in orders:
if int(order.split(',')[2]) == customer_id:
orders_filtered.append(order)
return orders_filtered
def get_customer_orders_for_month(orders, customer_id, order_month):
orders_filtered = []
for order in orders:
order_elements = order.split(',')
if (int(order_elements[2]) == customer_id and
order_elements[1].startswith(order_month)):
orders_filtered.append(order)
return orders_filtered
for order in orders:
order_elements = order.split(',')
if int(order_elements[2]) == 12431 \
and order_elements[1].startswith('2014-01') \
and (order_elements[3] in ('PROCESSING', 'PENDING_PAYMENT')):
print(order)
Map Reduce APIs or Higher level libraries such as Pandas will solve these problems.
We do not have to develop loops and conditionals.
Loops and Conditionals are taken care by the existing APIs.
We can just focus on business logic. It can be passed using Lambda Functions.
%run 07_preparing_data_sets.ipynb
Note
Here is the approach using filter
that comes as part of Map Reduce APIs. You will learn about Map Reduce APIs soon.
orders_filtered = filter(
lambda order: int(order.split(',')[2]) == 12431,
orders
)
list(orders_filtered)
['3774,2013-08-16 00:00:00.0,12431,CANCELED',
'3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT',
'4032,2013-08-17 00:00:00.0,12431,ON_HOLD',
'22812,2013-12-12 00:00:00.0,12431,PENDING',
'22927,2013-12-13 00:00:00.0,12431,CLOSED',
'25614,2013-12-30 00:00:00.0,12431,CLOSED',
'27585,2014-01-12 00:00:00.0,12431,PROCESSING',
'28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
'29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
'29232,2014-01-21 00:00:00.0,12431,ON_HOLD',
'45894,2014-05-06 00:00:00.0,12431,CLOSED',
'46217,2014-05-07 00:00:00.0,12431,CLOSED',
'49678,2014-05-31 00:00:00.0,12431,PENDING',
'51865,2014-06-15 00:00:00.0,12431,PROCESSING',
'63146,2014-02-13 00:00:00.0,12431,PENDING_PAYMENT',
'67110,2014-07-14 00:00:00.0,12431,PENDING']
orders_filtered = filter(
lambda order: int(order.split(',')[2]) == 12431
and order.split(',')[1].startswith('2014-01'),
orders
)
list(orders_filtered)
['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
'28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
'29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
'29232,2014-01-21 00:00:00.0,12431,ON_HOLD']
orders_filtered = filter(
lambda order: int(order.split(',')[2]) == 12431
and order.split(',')[1].startswith('2014-01')
and (order.split(',')[3] in ('PROCESSING', 'PENDING_PAYMENT')),
orders
)
list(orders_filtered)
['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
'28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT']
Note
Here is the approach using Pandas library. You will learn about how to process data using Pandas in subsequent sections.
import pandas as pd
orders_schema = [
'order_id',
'order_date',
'order_customer_id',
'order_status'
]
orders = pd.read_csv('/data/retail_db/orders/part-00000', names=orders_schema)
orders.query('order_customer_id == 12431')
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
3773 | 3774 | 2013-08-16 00:00:00.0 | 12431 | CANCELED |
3869 | 3870 | 2013-08-17 00:00:00.0 | 12431 | PENDING_PAYMENT |
4031 | 4032 | 2013-08-17 00:00:00.0 | 12431 | ON_HOLD |
22811 | 22812 | 2013-12-12 00:00:00.0 | 12431 | PENDING |
22926 | 22927 | 2013-12-13 00:00:00.0 | 12431 | CLOSED |
25613 | 25614 | 2013-12-30 00:00:00.0 | 12431 | CLOSED |
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
29108 | 29109 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
29231 | 29232 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
45893 | 45894 | 2014-05-06 00:00:00.0 | 12431 | CLOSED |
46216 | 46217 | 2014-05-07 00:00:00.0 | 12431 | CLOSED |
49677 | 49678 | 2014-05-31 00:00:00.0 | 12431 | PENDING |
51864 | 51865 | 2014-06-15 00:00:00.0 | 12431 | PROCESSING |
63145 | 63146 | 2014-02-13 00:00:00.0 | 12431 | PENDING_PAYMENT |
67109 | 67110 | 2014-07-14 00:00:00.0 | 12431 | PENDING |
orders.query('order_customer_id == 12431 & order_date.str.startswith("2014-01")')
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |
29108 | 29109 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
29231 | 29232 | 2014-01-21 00:00:00.0 | 12431 | ON_HOLD |
orders.query('order_customer_id == 12431 & ' +
'order_date.str.startswith("2014-01") &' +
'order_status in ("PROCESSING", "PENDING_PAYMENT")'
)
order_id | order_date | order_customer_id | order_status | |
---|---|---|---|---|
27584 | 27585 | 2014-01-12 00:00:00.0 | 12431 | PROCESSING |
28243 | 28244 | 2014-01-15 00:00:00.0 | 12431 | PENDING_PAYMENT |