Reading files into collections

Let us understand how to read data from files into collections.

  • Python have simple and yet rich APIs to perform file I/O

  • We can create a file object with open in different modes (by default read only mode)

  • To read the contents from the file into memory, we have APIs on top of file object such as read()

  • read() will create large string using contents of the files

  • If the data have multiple records with new line character as delimiter, we can apply splitlines() on the output of read

  • splitlines() will convert the string into list with new line character as delimiter

%%sh

ls -ltr /data/retail_db/orders/part-00000
-rw-r--r-- 1 root root 2999944 Nov 22 16:08 /data/retail_db/orders/part-00000
%%sh

tail /data/retail_db/orders/part-00000
68874,2014-07-03 00:00:00.0,1601,COMPLETE
68875,2014-07-04 00:00:00.0,10637,ON_HOLD
68876,2014-07-06 00:00:00.0,4124,COMPLETE
68877,2014-07-07 00:00:00.0,9692,ON_HOLD
68878,2014-07-08 00:00:00.0,6753,COMPLETE
68879,2014-07-09 00:00:00.0,778,COMPLETE
68880,2014-07-13 00:00:00.0,1117,COMPLETE
68881,2014-07-19 00:00:00.0,2518,PENDING_PAYMENT
68882,2014-07-22 00:00:00.0,10000,ON_HOLD
68883,2014-07-23 00:00:00.0,5533,COMPLETE
path = '/data/retail_db/orders/part-00000'
# C:\\users\\itversity\\Research
orders_file = open(path)
type(orders_file)
_io.TextIOWrapper
orders_raw = orders_file.read()
type(orders_raw)
str
orders_raw.splitlines?
Docstring:
S.splitlines([keepends]) -> list of strings

Return a list of the lines in S, breaking at line boundaries.
Line breaks are not included in the resulting list unless keepends
is given and true.
Type:      builtin_function_or_method
orders = orders_raw.splitlines()
type(orders)
list
orders[:10]
['1,2013-07-25 00:00:00.0,11599,CLOSED',
 '2,2013-07-25 00:00:00.0,256,PENDING_PAYMENT',
 '3,2013-07-25 00:00:00.0,12111,COMPLETE',
 '4,2013-07-25 00:00:00.0,8827,CLOSED',
 '5,2013-07-25 00:00:00.0,11318,COMPLETE',
 '6,2013-07-25 00:00:00.0,7130,COMPLETE',
 '7,2013-07-25 00:00:00.0,4530,COMPLETE',
 '8,2013-07-25 00:00:00.0,2911,PROCESSING',
 '9,2013-07-25 00:00:00.0,5657,PENDING_PAYMENT',
 '10,2013-07-25 00:00:00.0,5648,PENDING_PAYMENT']
len(orders) # same as number of records in the file
68883