Exercises - Manipulating Collections using Loops

Let us go throuh some of the exercises to understand how to process collections using conventional loops and conditionals. Create functions for each of the below problem statement.

  • Get number of COMPLETE orders placed by each customer

  • Get total number of PENDING or PENDING_PAYMENT orders for the month of 2014 January.

  • Get outstanding amount for each month considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING.

Details of Data

Here are the details about the data which you can leverage to take care of these exercises.

  • Location: /data/retail_db/orders/part-00000

  • Each record is line separated or line delimited.

  • Attribute in each record is comma separated.

# Get the details about file
!ls -ltr /data/retail_db/orders/part-00000
# Get first five lines from the file
!head -5 /data/retail_db/orders/part-00000
# Get number of lines from the file
# We can use linux command wc with -l
!wc -l /data/retail_db/orders/part-00000

Exercise 1 - Read data from file

Before getting into problem statement, develop the code to read the file into list of elements.

  • We should be able to use this function to read any file with text data using line as record delimiter.

# Update the logic here
def get_list_from_file(file_path):
    data_list = open(file_path).read().splitlines()
    return data_list
  • Run below cells to validate the function

  • You should see 68883 records as part of the output for the cell with len(orders) below.

  • You should see 172198 records as part of the output for the cell with len(order_items) below.

orders = get_list_from_file('/data/retail_db/orders/part-00000')
orders[:5]
len(orders)
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')
order_items[:5]
len(order_items)

Exercise 2 - Complete Order Count

Get number of COMPLETE orders placed by each customer. Develop a function which read the orders data and get us complete order count.

  • The function should take the complete order list as argument and return count of Complete orders.

  • The order is said to be complete if the status is COMPLETE.

  • The last element in each comma separated record gives us the order status.

# Update the logic here
def get_complete_order_count(orders):

    return order_count
  • Run below cell to validate the function. You should get 22899 as output.

get_complete_order_count(orders)
  • You can also validate results using simple linux scripts.

!egrep -w 'COMPLETE' /data/retail_db/orders/part-00000|wc -l

Exercise 2 - Pending Order Count

Get total number of PENDING or PENDING_PAYMENT orders for the month of 2014 January. Develop a function which read the orders data and get us pending order count.

  • The function should take the complete order list as argument and return count of pending orders.

  • The order is said to be complete if the status is PENDING or PENDING_PAYMENT. We should only consider the orders placed in the month of 2014 January.

  • The second element in each comma separated record gives us the date

  • The 4th or last element in each comma separated record gives us the order status.

# Update the logic here
def get_pending_order_count(orders):
    return order_count
  • Run below cell to validate your function. You should get 1969 as output.

get_pending_order_count(orders)
  • You can also validate results using simple linux scripts.

!egrep -w '(PENDING|PENDING_PAYMENT)' /data/retail_db/orders/part-00000|grep 2014-01|wc -l

Exercise 3 - Pending Order Count

Get outstanding amount for each month considering orders with status PAYMENT_REVIEW, PENDING, PENDING_PAYMENT and PROCESSING. Modularize by developing multiple functions.

  • Develop a function which takes orders list as argument and return a collection of order ids with one of the pending statuses.

  • Develop a function which takes order_items list as well as orders dict with only status as arguments and return outstanding amount.

  • Develop a function to create list of orders with pending status and lookup into it.

# Update the logic here
def get_pending_orders(orders):
    return pending_orders
  • Validate by running below cells to see if the dict is created with order id and order status.

orders = get_list_from_file('/data/retail_db/orders/part-00000')
pending_orders = get_pending_orders(orders)
# It should return list
type(pending_orders)
# Preview first five elements
pending_orders[:5]
# Reading first element from the list
pending_orders[0] 
# It should return 31644
len(pending_orders)
def get_outstanding_revenue(order_items, pending_orders):
    return round(outstanding_revenue, 2)
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')
%%time
# You should get 15982030.54 as output. Even if it is different by few dollars it is fine.
get_outstanding_revenue(order_items, pending_orders)
  • Develop a function to create set of orders with pending status and lookup into it.

# Update the logic here
def get_pending_orders_set(orders):
    return pending_orders
  • Validate by running below cells to see if the dict is created with order id and order status.

orders = get_list_from_file('/data/retail_db/orders/part-00000')
pending_orders = get_pending_orders_set(orders)
# It should return set
type(pending_orders)
# Preview first five elements
list(pending_orders)[:5]
# Reading first element from the list
list(pending_orders)[0] 
# It should return 31644
len(pending_orders)
def get_outstanding_revenue(order_items, pending_orders):
    return round(outstanding_revenue, 2)
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')
%%time
# You should get 15982030.54 as output. Even if it is different by few dollars it is fine.
get_outstanding_revenue(order_items, pending_orders)
  • Develop a function to create dict of orders with pending status and lookup into it. The dict need to have order id as key and some constant value as value. In my case, I have used 1 as value.

# Update the logic here
def get_pending_orders_dict(orders):
    return pending_orders
  • Validate by running below cells to see if the dict is created with order id and order status.

orders = get_list_from_file('/data/retail_db/orders/part-00000')
pending_orders = get_pending_orders_dict(orders)
# It should return dict
type(pending_orders)
# Preview first five elements
list(pending_orders.items())[:5]
# Reading first element from the dict
list(pending_orders.items())[0] 
# It should return 31644
len(pending_orders)
def get_outstanding_revenue(order_items, pending_orders):
    return round(outstanding_revenue, 2)
order_items = get_list_from_file('/data/retail_db/order_items/part-00000')
%%time
# You should get 15982030.54 as output. Even if it is different by few dollars it is fine.
get_outstanding_revenue(order_items, pending_orders)

Exercise 4 - Compare Performance

As part of the previous exercise you were asked to come up with the solution using 3 different approaches. You need to add a markdown cell below each question and provide answer.

  • Question: Which of the 3 approaches is faster? Add a markdown cell below and provide your answer.

    • list

    • set

    • dict

  • Question: Provide explanation why the option you have chosen is faster over others. Add a markdown cell below and provide your answer.