Filtering Data¶
Let us perform few tasks to understand how to filter the data in collections using loops and conditionals.
Here are the details about orders.
Data is in text file format
Each line in the file contains one record.
Each record contains 4 attributes which are separated by “,”
order_id
order_date
order_customer_id
order_status
%%sh
ls -ltr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2999944 Nov 22 16:08 /data/retail_db/orders/part-00000
%%sh
tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE
68875,2014-07-04 00:00:00.0,10637,ON_HOLD
68876,2014-07-06 00:00:00.0,4124,COMPLETE
68877,2014-07-07 00:00:00.0,9692,ON_HOLD
68878,2014-07-08 00:00:00.0,6753,COMPLETE
68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(path)
type(orders_file)
_io.TextIOWrapper
orders_raw = orders_file.read()
type(orders_raw)
str
orders_raw.splitlines?
Docstring:
S.splitlines([keepends]) -> list of strings
Return a list of the lines in S, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends
is given and true.
Type: builtin_function_or_method
orders = orders_raw.splitlines()
type(orders)
list
orders[:10]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
'2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
'3,2013-07-25 00:00:00.0,12111,COMPLETE',
'4,2013-07-25 00:00:00.0,8827,CLOSED',
'5,2013-07-25 00:00:00.0,11318,COMPLETE',
'6,2013-07-25 00:00:00.0,7130,COMPLETE',
'7,2013-07-25 00:00:00.0,4530,COMPLETE',
'8,2013-07-25 00:00:00.0,2911,PROCESSING',
'9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
'10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
type(orders[0])
str
len(orders)
68883
%%sh
wc -l /data/retail_db/orders/part-00000
68883 /data/retail_db/orders/part-00000
Task 1¶
Create a function by name get_customer_orders which take orders list and customer_id as arguments and return all the orders placed by customer_id
orders[:10]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
'2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
'3,2013-07-25 00:00:00.0,12111,COMPLETE',
'4,2013-07-25 00:00:00.0,8827,CLOSED',
'5,2013-07-25 00:00:00.0,11318,COMPLETE',
'6,2013-07-25 00:00:00.0,7130,COMPLETE',
'7,2013-07-25 00:00:00.0,4530,COMPLETE',
'8,2013-07-25 00:00:00.0,2911,PROCESSING',
'9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
'10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'
int(order.split(',')[2]) == 12111
True
def get_customer_orders(orders, customer_id):
orders_filtered = []
for order in orders:
if int(order.split(',')[2]) == customer_id:
orders_filtered.append(order)
return orders_filtered
# Use the function and get all the orders which are placed by customer with id 12431
get_customer_orders(orders, 12431)
['3774,2013-08-16 00:00:00.0,12431,CANCELED',
'3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT',
'4032,2013-08-17 00:00:00.0,12431,ON_HOLD',
'22812,2013-12-12 00:00:00.0,12431,PENDING',
'22927,2013-12-13 00:00:00.0,12431,CLOSED',
'25614,2013-12-30 00:00:00.0,12431,CLOSED',
'27585,2014-01-12 00:00:00.0,12431,PROCESSING',
'28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
'29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
'29232,2014-01-21 00:00:00.0,12431,ON_HOLD',
'45894,2014-05-06 00:00:00.0,12431,CLOSED',
'46217,2014-05-07 00:00:00.0,12431,CLOSED',
'49678,2014-05-31 00:00:00.0,12431,PENDING',
'51865,2014-06-15 00:00:00.0,12431,PROCESSING',
'63146,2014-02-13 00:00:00.0,12431,PENDING_PAYMENT',
'67110,2014-07-14 00:00:00.0,12431,PENDING']
len(get_customer_orders(orders, 12431))
16
Task 2¶
Create a function by name get_customer_orders_for_month which take orders list, customer_id and month in the format YYYY-MM as arguments and return all the orders placed by customer_id for a given month.
order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'
int(order.split(',')[2]) == 12111
True
order.split(',')[1].startswith('2013-07')
True
import datetime as dt
d = dt.datetime.strptime(order.split(',')[1], '%Y-%m-%d %H:%M:%S.%f')
d.year == 2013 and d.month == 7
True
int(order.split(',')[2]) == 12111 and order.split(',')[1].startswith('2013-07')
True
def get_customer_orders_for_month(orders, customer_id, order_month):
orders_filtered = []
for order in orders:
order_elements = order.split(',')
if (int(order_elements[2]) == customer_id and
order_elements[1].startswith(order_month)):
orders_filtered.append(order)
return orders_filtered
# Use the function and get all the orders which are placed by customer with id 12431 in January 2014
get_customer_orders_for_month(orders, 12431, '2014-01')
['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
'28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
'29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
'29232,2014-01-21 00:00:00.0,12431,ON_HOLD']
Task 3¶
Write ad hoc code to get all the orders which are placed by customer with id 12431 in January 2014 and status is in PENDING_PAYMENT or PROCESSING
for order in orders:
order_elements = order.split(',')
if int(order_elements[2]) == 12431 \
and order_elements[1].startswith('2014-01') \
and (order_elements[3] in ('PROCESSING', 'PENDING_PAYMENT')):
print(order)
27585,2014-01-12 00:00:00.0,12431,PROCESSING
28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT