Filtering Data

Let us perform few tasks to understand how to filter the data in collections using loops and conditionals.

Here are the details about orders.

  • Data is in text file format

  • Each line in the file contains one record.

  • Each record contains 4 attributes which are separated by “,”

    • order_id

    • order_date

    • order_customer_id

    • order_status

%%sh

ls -ltr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2999944 Nov 22 16:08 /data/retail_db/orders/part-00000
%%sh

tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE
68875,2014-07-04 00:00:00.0,10637,ON_HOLD
68876,2014-07-06 00:00:00.0,4124,COMPLETE
68877,2014-07-07 00:00:00.0,9692,ON_HOLD
68878,2014-07-08 00:00:00.0,6753,COMPLETE
68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(path)
type(orders_file)
_io.TextIOWrapper
orders_raw = orders_file.read()
type(orders_raw)
str
orders_raw.splitlines?
Docstring:
S.splitlines([keepends]) -> list of strings

Return a list of the lines in S, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends
is given and true.
Type:      builtin_function_or_method
orders = orders_raw.splitlines()
type(orders)
list
orders[:10]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
type(orders[0])
str
len(orders)
68883
%%sh

wc -l /data/retail_db/orders/part-00000
68883 /data/retail_db/orders/part-00000

Task 1

Create a function by name get_customer_orders which take orders list and customer_id as arguments and return all the orders placed by customer_id

orders[:10]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'
int(order.split(',')[2]) == 12111
True
def get_customer_orders(orders, customer_id):
    orders_filtered = []
    for order in orders:
        if int(order.split(',')[2]) == customer_id:
            orders_filtered.append(order)
    return orders_filtered
# Use the function and get all the orders which are placed by customer with id 12431
get_customer_orders(orders, 12431)
['3774,2013-08-16 00:00:00.0,12431,CANCELED',
 '3870,2013-08-17 00:00:00.0,12431,PENDING_PAYMENT',
 '4032,2013-08-17 00:00:00.0,12431,ON_HOLD',
 '22812,2013-12-12 00:00:00.0,12431,PENDING',
 '22927,2013-12-13 00:00:00.0,12431,CLOSED',
 '25614,2013-12-30 00:00:00.0,12431,CLOSED',
 '27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '45894,2014-05-06 00:00:00.0,12431,CLOSED',
 '46217,2014-05-07 00:00:00.0,12431,CLOSED',
 '49678,2014-05-31 00:00:00.0,12431,PENDING',
 '51865,2014-06-15 00:00:00.0,12431,PROCESSING',
 '63146,2014-02-13 00:00:00.0,12431,PENDING_PAYMENT',
 '67110,2014-07-14 00:00:00.0,12431,PENDING']
len(get_customer_orders(orders, 12431))
16

Task 2

Create a function by name get_customer_orders_for_month which take orders list, customer_id and month in the format YYYY-MM as arguments and return all the orders placed by customer_id for a given month.

order = '3,2013-07-25 00:00:00.0,12111,COMPLETE'
int(order.split(',')[2]) == 12111
True
order.split(',')[1].startswith('2013-07')
True
import datetime as dt
d = dt.datetime.strptime(order.split(',')[1], '%Y-%m-%d %H:%M:%S.%f')
d.year == 2013 and d.month == 7
True
int(order.split(',')[2]) == 12111 and order.split(',')[1].startswith('2013-07')
True
def get_customer_orders_for_month(orders, customer_id, order_month):
    orders_filtered = []
    for order in orders:
        order_elements = order.split(',')
        if (int(order_elements[2]) == customer_id and
            order_elements[1].startswith(order_month)):
            orders_filtered.append(order)
    return orders_filtered
# Use the function and get all the orders which are placed by customer with id 12431 in January 2014
get_customer_orders_for_month(orders, 12431, '2014-01')
['27585,2014-01-12 00:00:00.0,12431,PROCESSING',
 '28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT',
 '29109,2014-01-21 00:00:00.0,12431,ON_HOLD',
 '29232,2014-01-21 00:00:00.0,12431,ON_HOLD']

Task 3

Write ad hoc code to get all the orders which are placed by customer with id 12431 in January 2014 and status is in PENDING_PAYMENT or PROCESSING

for order in orders:
    order_elements = order.split(',')
    if int(order_elements[2]) == 12431 \
        and order_elements[1].startswith('2014-01') \
        and (order_elements[3] in ('PROCESSING', 'PENDING_PAYMENT')):
        print(order)
27585,2014-01-12 00:00:00.0,12431,PROCESSING
28244,2014-01-15 00:00:00.0,12431,PENDING_PAYMENT