List of tuples

Let us see an example of how to read data from a file into a list of tuples using Python.

  • When we read data from a file into a list, each element is typically a string (str), or bytes if the file is opened in binary mode.

  • We can convert each element into a tuple to simplify processing.

  • Once each element is converted to a tuple, we can access its fields by position (index).

  • Let us read the data from a file into a list of tuples and access the order dates.

%%sh

ls -ltr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2999944 Nov 22 16:08 /data/retail_db/orders/part-00000
%%sh

tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE
68875,2014-07-04 00:00:00.0,10637,ON_HOLD
68876,2014-07-06 00:00:00.0,4124,COMPLETE
68877,2014-07-07 00:00:00.0,9692,ON_HOLD
68878,2014-07-08 00:00:00.0,6753,COMPLETE
68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
# Reading data from file into a list
path = '/data/retail_db/orders/part-00000'
# On Windows, the path might look like C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
orders_file = open(path)
type(orders_file)
_io.TextIOWrapper
orders_raw = orders_file.read()
orders_file.close()  # release the file handle once the contents are in memory
type(orders_raw)
str
str.splitlines?
Docstring:
S.splitlines([keepends]) -> list of strings

Return a list of the lines in S, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends
is given and true.
Type:      method_descriptor
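
To make the docstring concrete, here is a minimal sketch on a two-record sample (sample_raw is a hypothetical stand-in for orders_raw, built from the first two records of the file). It suggests why splitlines is convenient here: the line breaks are dropped by default, and a trailing newline does not produce an empty last element the way split('\n') does.

```python
# Hypothetical stand-in for orders_raw: two records, each ending with a newline
sample_raw = (
    '1,2013-07-25 00:00:00.0,11599,CLOSED\n'
    '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT\n'
)

lines = sample_raw.splitlines()                           # line breaks removed
lines_with_breaks = sample_raw.splitlines(keepends=True)  # line breaks kept
lines_via_split = sample_raw.split('\n')                  # leaves a trailing ''

print(len(lines))            # 2
print(len(lines_via_split))  # 3
print(lines_with_breaks[0].endswith('\n'))  # True
```
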
orders_raw[:10]
'1,2013-07-'
orders = orders_raw.splitlines()
type(orders)
list
orders[:10]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
len(orders) # same as number of records in the file
68883
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
order
'1,2013-07-25 00:00:00.0,11599,CLOSED'
order.split(',')
['1', '2013-07-25 00:00:00.0', '11599', 'CLOSED']
tuple(order.split(','))
('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED')
(*order.split(','),)  # unpacks the list into a tuple; equivalent to tuple(order.split(','))
('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED')
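
The two conversions above give the same result. The following minimal sketch compares them side by side; the names fields, via_constructor and via_unpacking are introduced here only for illustration. Note that the trailing comma is what makes the unpacking form a tuple: without it, (*fields) is not valid syntax.

```python
order = '1,2013-07-25 00:00:00.0,11599,CLOSED'
fields = order.split(',')          # list of four strings

via_constructor = tuple(fields)    # built-in tuple constructor
via_unpacking = (*fields,)         # iterable unpacking; the trailing comma is required

print(via_constructor == via_unpacking)                        # True
print(type(fields).__name__, type(via_unpacking).__name__)     # list tuple
```
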
order_tuples = [(*order.split(','),) for order in orders]
# Equivalent, using the tuple constructor
order_tuples = [tuple(order.split(',')) for order in orders]
type(order_tuples)
list
order_tuples[0]
('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED')
order_tuples[:3]
[('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED'),
 ('2', '2013-07-25 00:00:00.0', '256', 'PENDING_PAYMENT'),
 ('3', '2013-07-25 00:00:00.0', '12111', 'COMPLETE')]
len(order_tuples)
68883
order_dates = [order[1] for order in order_tuples]
order_dates[:3]
['2013-07-25 00:00:00.0', '2013-07-25 00:00:00.0', '2013-07-25 00:00:00.0']
len(order_dates)
68883
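
Positional access on a single tuple can be sketched as follows. The field names used for unpacking (order_id, order_customer_id, status) are assumptions made for readability; the file itself carries no header.

```python
order_tuple = ('1', '2013-07-25 00:00:00.0', '11599', 'CLOSED')

order_date = order_tuple[1]      # index 1 is the second field
order_status = order_tuple[-1]   # negative indexes count from the end

# Unpacking assigns every position to a name in one step (names are assumed)
order_id, order_date_again, order_customer_id, status = order_tuple

print(order_date)         # 2013-07-25 00:00:00.0
print(order_status)       # CLOSED
print(order_customer_id)  # 11599
```

Unpacking requires the number of names to match the number of fields, so it also acts as a light check that each record has exactly four fields.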
# We can also convert the data types of the fields while building the tuples
def get_order_details(order):
    # Convert the first and third fields to int; the date and status remain strings
    order_details = order.split(',')
    return (int(order_details[0]), order_details[1], int(order_details[2]), order_details[3])
order_tuples = [get_order_details(order) for order in orders]
order_tuples[:3]
[(1, '2013-07-25 00:00:00.0', 11599, 'CLOSED'),
 (2, '2013-07-25 00:00:00.0', 256, 'PENDING_PAYMENT'),
 (3, '2013-07-25 00:00:00.0', 12111, 'COMPLETE')]
order_customer_ids = [order[2] for order in order_tuples]
order_customer_ids[:3]
[11599, 256, 12111]
type(order_customer_ids[0])
int
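
The same idea can be extended to the date field. The sketch below is one possible variation, not part of the original example: get_order_details_with_date is a hypothetical helper, and the format string assumes every record uses the timestamp layout seen in the sample records above.

```python
from datetime import datetime

# Assumed layout of the date field, e.g. '2013-07-25 00:00:00.0'
ORDER_DATE_FORMAT = '%Y-%m-%d %H:%M:%S.%f'

def get_order_details_with_date(order):
    """Hypothetical variant of get_order_details that also parses the date."""
    order_details = order.split(',')
    return (
        int(order_details[0]),
        datetime.strptime(order_details[1], ORDER_DATE_FORMAT),
        int(order_details[2]),
        order_details[3],
    )

order_tuple = get_order_details_with_date('1,2013-07-25 00:00:00.0,11599,CLOSED')
print(order_tuple[1].year, order_tuple[1].month, order_tuple[1].day)  # 2013 7 25
```

With a datetime object in the tuple, parts such as the year or month can be read as attributes instead of slicing the string.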
path = '/data/retail_db/orders/part-00000'
# On Windows, the path might look like C:\\users\\itversity\\Research\\data\\retail_db\\orders\\part-00000
# The with statement closes the file once the contents are read
with open(path) as orders_file:
    orders_raw = orders_file.read()
orders = orders_raw.splitlines()
order_tuples = [(*order.split(','),) for order in orders]
order_dates = [order[1] for order in order_tuples]
unique_dates = set(order_dates)
len(unique_dates)
364
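
As a small illustration of what set does here, the sketch below uses four dates taken from records shown earlier in this section (sample_dates is a hypothetical name). A set keeps one copy of each value and has no defined order, so sorting is needed to see the dates chronologically; this works on the strings directly only because the year-month-day layout sorts in date order.

```python
# Dates copied from records shown earlier; one value appears twice
sample_dates = [
    '2013-07-25 00:00:00.0',
    '2013-07-25 00:00:00.0',
    '2014-07-23 00:00:00.0',
    '2014-07-03 00:00:00.0',
]

unique_dates = set(sample_dates)      # duplicates collapse to a single entry
sorted_dates = sorted(unique_dates)   # sets are unordered, so sort explicitly

print(len(sample_dates), len(unique_dates))  # 4 3
print(sorted_dates[0])                       # 2013-07-25 00:00:00.0
print(sorted_dates[-1])                      # 2014-07-23 00:00:00.0
```
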