Row level transformations

Let us understand how to perform row level transformations using orders data set. Here are the details about orders.

  • Data is in text file format

  • Each line in the file contains one record.

  • Each record contains 4 attributes which are separated by “,”

    • order_id

    • order_date

    • order_customer_id

    • order_status

%%sh

ls -ltr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2999944 Nov 22 16:08 /data/retail_db/orders/part-00000
%%sh

tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE
68875,2014-07-04 00:00:00.0,10637,ON_HOLD
68876,2014-07-06 00:00:00.0,4124,COMPLETE
68877,2014-07-07 00:00:00.0,9692,ON_HOLD
68878,2014-07-08 00:00:00.0,6753,COMPLETE
68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(path)
type(orders_file)
_io.TextIOWrapper
orders_raw = orders_file.read()
type(orders_raw)
str
orders_raw.splitlines?
Docstring:
S.splitlines([keepends]) -> list of strings

Return a list of the lines in S, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends
is given and true.
Type:      builtin_function_or_method
orders = orders_raw.splitlines()
type(orders)
list
orders[:10]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
type(orders[0])
str
len(orders)
68883
%%sh

wc -l /data/retail_db/orders/part-00000
68883 /data/retail_db/orders/part-00000

Task 1

Get all order ids and associated statuses. Each record in the output should be comma separated string.

order = '1,2013-07-25 00:00:00.0,11599,CLOSED' # -> '1,CLOSED'
# We invokde join on delimiter
str.join?
Docstring:
S.join(iterable) -> str

Return a string which is the concatenation of the strings in the
iterable.  The separator between elements is S.
Type:      method_descriptor
':'.join(['1', '2', '3', '4'])
'1:2:3:4'
order.split(',')[0]
'1'
order.split(',')[3]
'CLOSED'
[order.split(',')[0], order.split(',')[3]]
['1', 'CLOSED']
','.join([order.split(',')[0], order.split(',')[3]])
'1,CLOSED'
l = [1]
l.append(2)
l
[1, 2]
order_statuses = []
for order in orders:
    order_statuses.append(','.join([order.split(',')[0], order.split(',')[3]]))
order_statuses[:10]
['1,CLOSED',
 '2,PENDING_PAYMENT',
 '3,COMPLETE',
 '4,CLOSED',
 '5,COMPLETE',
 '6,COMPLETE',
 '7,COMPLETE',
 '8,PROCESSING',
 '9,PENDING_PAYMENT',
 '10,PENDING_PAYMENT']
len(order_statuses)
68883
order_statuses = [','.join([order.split(',')[0], order.split(',')[3]]) for order in orders] # alternative solution
order_statuses[:10]
['1,CLOSED',
 '2,PENDING_PAYMENT',
 '3,COMPLETE',
 '4,CLOSED',
 '5,COMPLETE',
 '6,COMPLETE',
 '7,COMPLETE',
 '8,PROCESSING',
 '9,PENDING_PAYMENT',
 '10,PENDING_PAYMENT']
len(order_statuses)
68883

Task 2

Get all order ids, the dates on which order is placed and order status. Each record in the output should be dict with following column names as keys.

  • order_id

  • order_date

  • order_status

{'order_id': 1, 'order_date': '2020-12-22', 'order_status': 'COMPLETE'}
{'order_id': 1, 'order_date': '2020-12-22', 'order_status': 'COMPLETE'}
def get_order_details(order):
    """Extract order details such as id, date as well as status and return as dict"""
    order_values = order.split(',')
    return ({
        'order_id': int(order_values[0]),
        'order_date': order_values[1],
        'order_status': order_values[3]
    })
get_order_details('1,2013-07-25 00:00:00.0,11599,CLOSED')
{'order_id': 1,
 'order_date': '2013-07-25 00:00:00.0',
 'order_status': 'CLOSED'}
order_details = []
for order in orders:
    order_details.append(get_order_details(order))
order_details[:10]
[{'order_id': 1,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 2,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 3,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 4,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 5,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 6,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 7,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 8,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PROCESSING'},
 {'order_id': 9,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 10,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'}]
len(order_details)
68883
order_details = [get_order_details(order) for order in orders]
order_details[:10]
[{'order_id': 1,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 2,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 3,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 4,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'CLOSED'},
 {'order_id': 5,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 6,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 7,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'COMPLETE'},
 {'order_id': 8,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PROCESSING'},
 {'order_id': 9,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'},
 {'order_id': 10,
  'order_date': '2013-07-25 00:00:00.0',
  'order_status': 'PENDING_PAYMENT'}]
len(order_details)
68883