Writing Data Frames to Files

Pandas also provides simple APIs to write Data Frames back to files.

  • Let us write the revenue per order along with the order_id to a file.

Here are the steps you need to follow before writing a Data Frame to a file.

  • Make sure you have the Data Frame that is supposed to be written to a file.

  • Ensure that you have write permissions on the folder under which the files are supposed to be written.

  • Use appropriate keyword arguments to write the Data Frame to the file as per the requirements.

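The steps above can be sketched with a toy Data Frame (hypothetical values, not the retail_db data). Passing `path_or_buf=None` makes `to_csv` return the CSV content as a string, which is handy for checking keyword arguments before writing to disk:

```python
import pandas as pd

# Hypothetical stand-in for the revenue-per-order result
df = pd.DataFrame({
    'order_id': [1, 2, 4],
    'revenue': [299.98, 579.98, 699.85]
})

# No path given, so the CSV content comes back as a string;
# index=False drops the row index from the output.
csv_text = df.to_csv(index=False)
print(csv_text)
```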
%run 06_csv_to_pandas_data_frame.ipynb
order_items.to_csv?
Signature:
order_items.to_csv(
    path_or_buf:Union[str, pathlib.Path, IO[~AnyStr], NoneType]=None,
    sep:str=',',
    na_rep:str='',
    float_format:Union[str, NoneType]=None,
    columns:Union[Sequence[collections.abc.Hashable], NoneType]=None,
    header:Union[bool, List[str]]=True,
    index:bool=True,
    index_label:Union[bool, str, Sequence[collections.abc.Hashable], NoneType]=None,
    mode:str='w',
    encoding:Union[str, NoneType]=None,
    compression:Union[str, Mapping[str, str], NoneType]='infer',
    quoting:Union[int, NoneType]=None,
    quotechar:str='"',
    line_terminator:Union[str, NoneType]=None,
    chunksize:Union[int, NoneType]=None,
    date_format:Union[str, NoneType]=None,
    doublequote:bool=True,
    escapechar:Union[str, NoneType]=None,
    decimal:Union[str, NoneType]='.',
    errors:str='strict',
) -> Union[str, NoneType]
Docstring:
Write object to a comma-separated values (csv) file.

.. versionchanged:: 0.24.0
    The order of arguments for Series was changed.

Parameters
----------
path_or_buf : str or file handle, default None
    File path or object, if None is provided the result is returned as
    a string.  If a file object is passed it should be opened with
    `newline=''`, disabling universal newlines.

    .. versionchanged:: 0.24.0

       Was previously named "path" for Series.

sep : str, default ','
    String of length 1. Field delimiter for the output file.
na_rep : str, default ''
    Missing data representation.
float_format : str, default None
    Format string for floating point numbers.
columns : sequence, optional
    Columns to write.
header : bool or list of str, default True
    Write out the column names. If a list of strings is given it is
    assumed to be aliases for the column names.

    .. versionchanged:: 0.24.0

       Previously defaulted to False for Series.

index : bool, default True
    Write row names (index).
index_label : str or sequence, or False, default None
    Column label for index column(s) if desired. If None is given, and
    `header` and `index` are True, then the index names are used. A
    sequence should be given if the object uses MultiIndex. If
    False do not print fields for index names. Use index_label=False
    for easier importing in R.
mode : str
    Python write mode, default 'w'.
encoding : str, optional
    A string representing the encoding to use in the output file,
    defaults to 'utf-8'.
compression : str or dict, default 'infer'
    If str, represents compression mode. If dict, value at 'method' is
    the compression mode. Compression mode may be any of the following
    possible values: {'infer', 'gzip', 'bz2', 'zip', 'xz', None}. If
    compression mode is 'infer' and `path_or_buf` is path-like, then
    detect compression mode from the following extensions: '.gz',
    '.bz2', '.zip' or '.xz'. (otherwise no compression). If dict given
    and mode is one of {'zip', 'gzip', 'bz2'}, or inferred as
    one of the above, other entries passed as
    additional compression options.

    .. versionchanged:: 1.0.0

       May now be a dict with key 'method' as compression mode
       and other entries as additional compression options if
       compression mode is 'zip'.

    .. versionchanged:: 1.1.0

       Passing compression options as keys in dict is
       supported for compression modes 'gzip' and 'bz2'
       as well as 'zip'.

quoting : optional constant from csv module
    Defaults to csv.QUOTE_MINIMAL. If you have set a `float_format`
    then floats are converted to strings and thus csv.QUOTE_NONNUMERIC
    will treat them as non-numeric.
quotechar : str, default '\"'
    String of length 1. Character used to quote fields.
line_terminator : str, optional
    The newline character or character sequence to use in the output
    file. Defaults to `os.linesep`, which depends on the OS in which
    this method is called ('\n' for linux, '\r\n' for Windows, i.e.).

    .. versionchanged:: 0.24.0
chunksize : int or None
    Rows to write at a time.
date_format : str, default None
    Format string for datetime objects.
doublequote : bool, default True
    Control quoting of `quotechar` inside a field.
escapechar : str, default None
    String of length 1. Character used to escape `sep` and `quotechar`
    when appropriate.
decimal : str, default '.'
    Character recognized as decimal separator. E.g. use ',' for
    European data.
errors : str, default 'strict'
    Specifies how encoding and decoding errors are to be handled.
    See the errors argument for :func:`open` for a full list
    of options.

    .. versionadded:: 1.1.0

Returns
-------
None or str
    If path_or_buf is None, returns the resulting csv format as a
    string. Otherwise returns None.

See Also
--------
read_csv : Load a CSV file into a DataFrame.
to_excel : Write DataFrame to an Excel file.

Examples
--------
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'

Create 'out.zip' containing 'out.csv'

>>> compression_opts = dict(method='zip',
...                         archive_name='out.csv')  # doctest: +SKIP
>>> df.to_csv('out.zip', index=False,
...           compression=compression_opts)  # doctest: +SKIP
File:      /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages/pandas/core/generic.py
Type:      method
import getpass
username = getpass.getuser()
username
'itversity'
base_dir = f"/home/{username}/data/retail_db"
base_dir
'/home/itversity/data/retail_db'
output_dir = f'{base_dir}/revenue_per_order'
output_dir
'/home/itversity/data/retail_db/revenue_per_order'
%%sh

rm -rf /home/`whoami`/data/retail_db/revenue_per_order
%%sh

ls -ltr /home/`whoami`/data/retail_db
total 0
import subprocess

subprocess.call(['rm', '-rf', output_dir])
0
import subprocess

subprocess.call(['mkdir', '-p', output_dir])
0
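The same folder management can be sketched in pure Python with `shutil` and `os`, instead of shelling out via `subprocess` (the `/tmp` path below is a hypothetical stand-in for `output_dir`):

```python
import os
import shutil

# Hypothetical output folder, mirroring the subprocess calls above
output_dir = '/tmp/retail_db_demo/revenue_per_order'

shutil.rmtree(output_dir, ignore_errors=True)  # equivalent of rm -rf
os.makedirs(output_dir, exist_ok=True)         # equivalent of mkdir -p

print(os.path.isdir(output_dir))
```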
import subprocess
#ls -ltr /Users/itversity/Research/data/retail_db/revenue_per_order
subprocess.check_output(['ls', '-ltr', output_dir])
b'total 0\n'
%%sh
ls -ltr /data/retail_db
total 20156
drwxr-xr-x 2 root root     4096 Nov 22 16:08 categories
-rw-r--r-- 1 root root      806 Nov 22 16:08 README.md
drwxr-xr-x 2 root root     4096 Nov 22 16:08 customers
-rw-r--r-- 1 root root     1748 Nov 22 16:08 create_db_tables_pg.sql
-rw-r--r-- 1 root root 10303297 Nov 22 16:08 create_db.sql
drwxr-xr-x 2 root root     4096 Nov 22 16:08 departments
drwxr-xr-x 2 root root     4096 Nov 22 16:08 order_items
-rw-r--r-- 1 root root 10297372 Nov 22 16:08 load_db_tables_pg.sql
drwxr-xr-x 2 root root     4096 Nov 22 16:08 orders
drwxr-xr-x 2 root root     4096 Nov 22 16:08 products
orders
order_id order_date order_customer_id order_status
0 1 2013-07-25 00:00:00.0 11599 CLOSED
1 2 2013-07-25 00:00:00.0 256 PENDING_PAYMENT
2 3 2013-07-25 00:00:00.0 12111 COMPLETE
3 4 2013-07-25 00:00:00.0 8827 CLOSED
4 5 2013-07-25 00:00:00.0 11318 COMPLETE
... ... ... ... ...
68878 68879 2014-07-09 00:00:00.0 778 COMPLETE
68879 68880 2014-07-13 00:00:00.0 1117 COMPLETE
68880 68881 2014-07-19 00:00:00.0 2518 PENDING_PAYMENT
68881 68882 2014-07-22 00:00:00.0 10000 ON_HOLD
68882 68883 2014-07-23 00:00:00.0 5533 COMPLETE

68883 rows × 4 columns

order_items
order_item_id order_item_order_id order_item_product_id order_item_quantity order_item_subtotal order_item_product_price
0 1 1 957 1 299.98 299.98
1 2 2 1073 1 199.99 199.99
2 3 2 502 5 250.00 50.00
3 4 2 403 1 129.99 129.99
4 5 4 897 2 49.98 24.99
... ... ... ... ... ... ...
172193 172194 68881 403 1 129.99 129.99
172194 172195 68882 365 1 59.99 59.99
172195 172196 68882 502 1 50.00 50.00
172196 172197 68883 208 1 1999.99 1999.99
172197 172198 68883 502 3 150.00 50.00

172198 rows × 6 columns

order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum', 'min', 'max', 'count'])
sum min max count
order_item_order_id
1 299.98 299.98 299.98 1
2 579.98 129.99 250.00 3
4 699.85 49.98 299.95 4
5 1129.86 99.96 299.98 5
7 579.92 79.95 299.98 3
... ... ... ... ...
68879 1259.97 129.99 999.99 3
68880 999.77 149.94 250.00 5
68881 129.99 129.99 129.99 1
68882 109.99 50.00 59.99 2
68883 2149.99 150.00 1999.99 2

57431 rows × 4 columns

order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum', 'min', 'max', 'count']). \
    reset_index()
order_item_order_id sum min max count
0 1 299.98 299.98 299.98 1
1 2 579.98 129.99 250.00 3
2 4 699.85 49.98 299.95 4
3 5 1129.86 99.96 299.98 5
4 7 579.92 79.95 299.98 3
... ... ... ... ... ...
57426 68879 1259.97 129.99 999.99 3
57427 68880 999.77 149.94 250.00 5
57428 68881 129.99 129.99 129.99 1
57429 68882 109.99 50.00 59.99 2
57430 68883 2149.99 150.00 1999.99 2

57431 rows × 5 columns

order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum', 'min', 'max', 'count']). \
    rename(columns={'count': 'item_count', 'sum': 'revenue'}). \
    to_json(f'{output_dir}/revenue_per_order.json', orient='table')
%%sh
ls -ltr /home/`whoami`/data/retail_db/revenue_per_order
total 4884
-rw-rw-r-- 1 itversity itversity 4999377 Dec 14 10:54 revenue_per_order.json
%%sh
head -10 /home/`whoami`/data/retail_db/revenue_per_order/revenue_per_order.json
IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum', 'min', 'max', 'count']). \
    rename(columns={'count': 'item_count', 'sum': 'revenue'})
revenue min max item_count
order_item_order_id
1 299.98 299.98 299.98 1
2 579.98 129.99 250.00 3
4 699.85 49.98 299.95 4
5 1129.86 99.96 299.98 5
7 579.92 79.95 299.98 3
... ... ... ... ...
68879 1259.97 129.99 999.99 3
68880 999.77 149.94 250.00 5
68881 129.99 129.99 129.99 1
68882 109.99 50.00 59.99 2
68883 2149.99 150.00 1999.99 2

57431 rows × 4 columns

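As an aside, named aggregation (available since pandas 0.25) can produce the renamed columns in a single step, without a separate `rename`. A sketch with a hypothetical subset of `order_items`:

```python
import pandas as pd

# Hypothetical subset of order_items, enough to show the aggregation
order_items = pd.DataFrame({
    'order_item_order_id': [1, 2, 2, 2],
    'order_item_subtotal': [299.98, 199.99, 250.00, 129.99]
})

# Keyword arguments to agg name the output columns directly
revenue_per_order = order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(revenue='sum', min='min', max='max', item_count='count')

print(revenue_per_order)
```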
order_items. \
    groupby('order_item_order_id')['order_item_subtotal']. \
    agg(['sum', 'min', 'max', 'count']). \
    rename(columns={'count': 'item_count', 'sum': 'revenue'}). \
    round(2). \
    to_csv(output_dir + '/revenue_per_order.csv')
%%sh
ls -ltr /home/`whoami`/data/retail_db/revenue_per_order
total 6460
-rw-rw-r-- 1 itversity itversity 4999377 Dec 14 10:54 revenue_per_order.json
-rw-rw-r-- 1 itversity itversity 1610716 Dec 14 10:55 revenue_per_order.csv
%%sh
head /home/`whoami`/data/retail_db/revenue_per_order/revenue_per_order.csv
order_item_order_id,revenue,min,max,item_count
1,299.98,299.98,299.98,1
2,579.98,129.99,250.0,3
4,699.85,49.98,299.95,4
5,1129.86,99.96,299.98,5
7,579.92,79.95,299.98,3
8,729.84,50.0,299.95,4
9,599.96,199.98,199.99,3
10,651.92,21.99,199.99,5
11,919.79,49.98,399.96,5
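A sketch of validating the written CSV by reading it back; the inline CSV text below mirrors the first lines of the `head` output, and `index_col` restores `order_item_order_id` as the index:

```python
import io
import pandas as pd

# Inline copy of the first rows of the written file (from the head output)
csv_text = '''order_item_order_id,revenue,min,max,item_count
1,299.98,299.98,299.98,1
2,579.98,129.99,250.0,3
'''

# index_col turns the written index column back into the index
revenue = pd.read_csv(io.StringIO(csv_text), index_col='order_item_order_id')
print(revenue.dtypes)
```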