Pandas Data Structures – Overview
Let us go over the basics of Pandas and its data structures.
Pandas is not part of the Python standard library, so we need to install it using pip (pip install pandas).
It has 2 types of data structures: Series and DataFrame.
Series is a one-dimensional array while DataFrame is a two-dimensional array.
Series contains an index for each row and only one attribute or column.
DataFrame contains an index for each row and multiple columns.
Each attribute (column) in a DataFrame is itself a Series.
We can perform standard transformations such as filtering, aggregation, joining and sorting using Pandas APIs.
We also have SQL-based wrappers on top of Pandas where we can write queries.
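To make the last two points concrete, here is a minimal sketch of a filter and an aggregation done with Pandas APIs, followed by the same aggregation written as a SQL query. The SQL-based wrappers mentioned above are not shown here; as a stand-in, the sketch loads the DataFrame into an in-memory SQLite database using Python's built-in sqlite3 module. The emp_df data and the emp table name are hypothetical and used only for illustration.

```python
import sqlite3

import pandas as pd

# Hypothetical sample data, used only for illustration.
emp_df = pd.DataFrame(
    [(1, 'sales', 1500.0), (2, 'sales', 2000.0), (3, 'hr', 2200.0)],
    columns=['id', 'dept', 'sal']
)

# Filtering and aggregation with Pandas APIs.
high_paid = emp_df[emp_df['sal'] >= 2000.0]
total_by_dept = emp_df.groupby('dept')['sal'].sum()

# The same aggregation as a SQL query. This is a stand-in that uses
# an in-memory SQLite database, not a Pandas SQL wrapper.
conn = sqlite3.connect(':memory:')
emp_df.to_sql('emp', conn, index=False)
sql_total_by_dept = pd.read_sql_query(
    'SELECT dept, SUM(sal) AS sal FROM emp GROUP BY dept ORDER BY dept',
    conn
)
conn.close()

print(high_paid)
print(total_by_dept)
print(sql_total_by_dept)
```

Both approaches give the same totals per department; which one reads better usually depends on how comfortable the team is with SQL.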
Here are the steps to get started with Pandas Data Structures:
Make sure the Pandas library is installed using pip.
Import the Pandas library: import pandas as pd
We need to have a collection, or data in a file, to create Pandas Data Structures.
Use the appropriate APIs on the data to create Pandas Data Structures:
Series for a one-dimensional array.
DataFrame for a two-dimensional array.
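The steps above can be sketched end to end as follows. The values, the column names and the CSV text are hypothetical; io.StringIO stands in for a real file so that the example runs without anything on disk.

```python
import io

import pandas as pd

# A collection held in memory (hypothetical values).
order_amounts = [150.0, 200.0, 75.5]

# A one-dimensional collection becomes a Series.
amounts_s = pd.Series(order_amounts, name='amount')

# Data that would normally sit in a file; StringIO stands in for the file.
csv_data = io.StringIO('order_id,amount\n1,150.0\n2,200.0\n3,75.5\n')

# Tabular data read from a file-like object becomes a DataFrame.
orders_df = pd.read_csv(csv_data)

print(amounts_s)
print(orders_df)
```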
Note
Typically we use Series for a list of regular objects or a dict, and DataFrame for a list of tuples or a list of dicts. Let us use a list for Series and a list of tuples for DataFrame.
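For comparison, here is a small sketch of the other input types mentioned in the note: a dict for Series, and both a list of tuples and a list of dicts for DataFrame. The variable names and values are made up for illustration.

```python
import pandas as pd

# Series from a dict: the keys become the index labels.
sal_by_id = pd.Series({1: 1500.0, 2: 2000.0, 3: 2200.0}, name='sal')

# DataFrame from a list of tuples: column names are passed separately.
from_tuples = pd.DataFrame([(1, 1500.0), (2, 2000.0)], columns=['id', 'sal'])

# DataFrame from a list of dicts: the keys become the column names.
from_dicts = pd.DataFrame([{'id': 1, 'sal': 1500.0}, {'id': 2, 'sal': 2000.0}])

print(sal_by_id)
print(from_tuples)
print(from_dicts)
```

With a list of dicts the column names travel with the data, while a list of tuples needs the columns argument to get meaningful names.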
!pip install pandas
Defaulting to user installation because normal site-packages is not writeable
Requirement already satisfied: pandas in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (1.1.4)
Requirement already satisfied: python-dateutil>=2.7.3 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from pandas) (2.8.1)
Requirement already satisfied: pytz>=2017.2 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from pandas) (2020.4)
Requirement already satisfied: numpy>=1.15.4 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from pandas) (1.19.4)
Requirement already satisfied: six>=1.5 in /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)
import pandas as pd
sals_l = [1500.0, 2000.0, 2200.00]
pd.Series?
Init signature:
pd.Series(
    data=None,
    index=None,
    dtype=None,
    name=None,
    copy=False,
    fastpath=False,
)
Docstring:
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN).
Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.
Parameters
----------
data : array-like, Iterable, dict, or scalar value
    Contains data stored in Series.

    .. versionchanged:: 0.23.0
       If data is a dict, argument order is maintained for Python 3.6
       and later.

index : array-like or Index (1d)
    Values must be hashable and have the same length as `data`.
    Non-unique index values are allowed. Will default to
    RangeIndex (0, 1, 2, ..., n) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : str, numpy.dtype, or ExtensionDtype, optional
    Data type for the output Series. If not specified, this will be
    inferred from `data`.
    See the :ref:`user guide <basics.dtypes>` for more usages.
name : str, optional
    The name to give to the Series.
copy : bool, default False
    Copy input data.
File: /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages/pandas/core/series.py
Type: type
Subclasses: SubclassedSeries
sals_s = pd.Series(sals_l, name='sal')
sals_s
0    1500.0
1    2000.0
2    2200.0
Name: sal, dtype: float64
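Let us also look at what the Series carries besides its values. The sketch below rebuilds the same Series so that it runs on its own, prints its index, dtype, name and shape, and then creates a second Series with a custom index; the labels 'a', 'b' and 'c' are hypothetical.

```python
import pandas as pd

sals_s = pd.Series([1500.0, 2000.0, 2200.0], name='sal')

print(sals_s.index)   # default RangeIndex starting at 0
print(sals_s.dtype)   # inferred from the data
print(sals_s.name)
print(sals_s.shape)

# A custom index can be supplied; it must have the same length as the data.
labelled_s = pd.Series(
    [1500.0, 2000.0, 2200.0],
    index=['a', 'b', 'c'],   # hypothetical labels
    name='sal'
)
print(labelled_s['b'])
```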
sals_s[:2]
0    1500.0
1    2000.0
Name: sal, dtype: float64
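The docstring above mentions both integer- and label-based indexing, as well as index alignment in arithmetic. Here is a minimal sketch of both, assuming a Series with hypothetical string labels and a made-up bonus_s Series that covers only some of those labels.

```python
import pandas as pd

sals_s = pd.Series([1500.0, 2000.0, 2200.0], index=['a', 'b', 'c'], name='sal')

# Positional slicing with iloc excludes the end position.
first_two = sals_s.iloc[:2]

# Label-based slicing with loc includes the end label.
a_to_b = sals_s.loc['a':'b']

# Arithmetic aligns on index labels; a label missing on one side gives NaN.
bonus_s = pd.Series([100.0, 200.0], index=['b', 'c'])
total_s = sals_s + bonus_s

print(first_two)
print(a_to_b)
print(total_s)
```

Both slices return the rows labelled 'a' and 'b', and total_s has NaN for 'a' because bonus_s has no value for that label.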
sals_ld = [(1, 1500.0), (2, 2000.0), (3, 2200.00)]
pd.DataFrame?
Init signature:
pd.DataFrame(
    data=None,
    index:Union[Collection, NoneType]=None,
    columns:Union[Collection, NoneType]=None,
    dtype:Union[_ForwardRef('ExtensionDtype'), str, numpy.dtype, Type[Union[str, float, int, complex]], NoneType]=None,
    copy:bool=False,
)
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.
Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.
Parameters
----------
data : ndarray (structured or homogeneous), Iterable, dict, or DataFrame
    Dict can contain Series, arrays, constants, or list-like objects.

    .. versionchanged:: 0.23.0
       If data is a dict, column order follows insertion-order for
       Python 3.6 and later.

    .. versionchanged:: 0.25.0
       If data is a list of dicts, column order follows insertion-order
       for Python 3.6 and later.

index : Index or array-like
    Index to use for resulting frame. Will default to RangeIndex if
    no indexing information part of input data and no index provided.
columns : Index or array-like
    Column labels to use for resulting frame. Will default to
    RangeIndex (0, 1, 2, ..., n) if no column labels are provided.
dtype : dtype, default None
    Data type to force. Only a single dtype is allowed. If None, infer.
copy : bool, default False
    Copy data from inputs. Only affects DataFrame / 2d ndarray input.
See Also
--------
DataFrame.from_records : Constructor from tuples, also record arrays.
DataFrame.from_dict : From dicts of Series, arrays, or dicts.
read_csv : Read a comma-separated values (csv) file into DataFrame.
read_table : Read general delimited file into DataFrame.
read_clipboard : Read text from clipboard into DataFrame.
Examples
--------
Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9
File: /opt/anaconda3/envs/beakerx/lib/python3.6/site-packages/pandas/core/frame.py
Type: type
Subclasses: SubclassedDataFrame
sals_df = pd.DataFrame(sals_ld, columns=['id', 'sal'])
sals_df
| | id | sal |
|---|---|---|
| 0 | 1 | 1500.0 |
| 1 | 2 | 2000.0 |
| 2 | 3 | 2200.0 |
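Let us also check the structure of the DataFrame. The sketch below rebuilds the same DataFrame so that it runs on its own and prints its shape, column names and the data type inferred for each column.

```python
import pandas as pd

sals_df = pd.DataFrame(
    [(1, 1500.0), (2, 2000.0), (3, 2200.0)],
    columns=['id', 'sal']
)

print(sals_df.shape)             # (number of rows, number of columns)
print(sals_df.columns.tolist())  # column labels
print(sals_df.dtypes)            # one dtype per column
```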
sals_df['id']
0    1
1    2
2    3
Name: id, dtype: int64
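As the output above shows, selecting one column gives back a Series. Here is a short sketch that contrasts this with selecting a list of columns and with filtering rows; the names id_s, id_sal_df and high_df are introduced only for this example.

```python
import pandas as pd

sals_df = pd.DataFrame(
    [(1, 1500.0), (2, 2000.0), (3, 2200.0)],
    columns=['id', 'sal']
)

# A single column label returns a Series.
id_s = sals_df['id']

# A list of column labels returns a DataFrame.
id_sal_df = sals_df[['id', 'sal']]

# A boolean Series used as a mask keeps only the matching rows.
high_df = sals_df[sals_df['sal'] > 1500.0]

print(type(id_s))
print(type(id_sal_df))
print(high_df)
```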