pandas provides high-level data structures and functions designed to make working with structured or tabular data fast, easy, and expressive.
In this post I’ll introduce its two workhorse data structures: Series and DataFrame.
Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.
The simplest way to form a Series is to pass a sequence of values to pd.Series().
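As a minimal sketch (the values are made up for illustration), passing a plain list is enough, and pandas supplies a default integer index:

```python
import pandas as pd

# Build a Series from a plain list; with no index given,
# pandas assigns the default integer index 0 through N - 1.
obj = pd.Series([4, 7, -5, 3])
print(obj)
```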
The string representation of a Series displayed interactively shows the index on the left and the values on the right. To specify the index yourself, pass the index parameter to pd.Series(). You can get the array representation and the index object of the Series via its values and index attributes, respectively.
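A short sketch of both ideas, using made-up values and labels (the name obj2 is just illustrative):

```python
import pandas as pd

# Supply custom labels through the index parameter.
obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

values = obj2.values  # the data as a NumPy array
index = obj2.index    # the labels as a pandas Index object

print(values)
print(index)
```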
Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:
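For instance, assuming the same made-up Series with letter labels, a single label returns one value and a list of labels returns a new Series:

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])

# Select a single value by its label.
single = obj2["a"]

# Select several values by passing a list of labels;
# the result follows the order of the list.
subset = obj2[["c", "a", "d"]]

print(single)
print(subset)
```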
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. If you have data contained in a Python dict, you can create a Series from it by passing the dict.
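A small sketch of the dict analogy; the dict contents below are invented for illustration:

```python
import pandas as pd

# Hypothetical data: dict keys become the index, dict values become the data.
sdata = {"Ohio": 35000, "Texas": 71000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# Like a dict, membership tests look at the labels, not the values.
has_ohio = "Ohio" in obj3
has_35000 = 35000 in obj3

print(obj3)
print(has_ohio, has_35000)
```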
Both the Series object itself and its index have a name attribute, which integrates with other key areas of pandas functionality.
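One place where the names show up is when a Series is turned into a DataFrame. The sketch below uses made-up data and picks reset_index() as just one example of that integration:

```python
import pandas as pd

obj4 = pd.Series({"Ohio": 35000, "Texas": 71000})

# Name the Series and its index.
obj4.name = "population"
obj4.index.name = "state"

# reset_index() moves the index into a regular column; the index name
# and the Series name become the column labels of the resulting DataFrame.
as_frame = obj4.reset_index()
print(as_frame)
```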
A Series’s index can be altered in place by assignment.
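A minimal sketch with invented labels; the new index needs to have the same length as the Series:

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])

# Replace the default integer index with custom labels.
# The replacement must contain exactly as many labels as there are values.
obj.index = ["Bob", "Steve", "Jeff", "Ryan"]

print(obj)
```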
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:
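Here is a minimal sketch of that construction. The column names and numbers are made up for illustration:

```python
import pandas as pd

# Each dict key becomes a column; every list must have the same length.
data = {
    "state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
    "year": [2000, 2001, 2002, 2001, 2002],
    "pop": [1.5, 1.7, 3.6, 2.4, 2.9],
}
frame = pd.DataFrame(data)

print(frame)
```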
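If you pass the columns parameter, the DataFrame’s columns are arranged in that order. If you omit it, older versions of pandas placed the columns in sorted order, whereas recent versions keep the dict’s insertion order.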
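A sketch of the columns parameter with made-up data. The "debt" entry is deliberately absent from the dict to show that a requested column with no data is filled with missing values:

```python
import pandas as pd

data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": [1.5, 1.7, 2.4],
}

# Columns appear in the order requested; "debt" is not in the dict,
# so it is created as a column of missing values (NaN).
frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"])

print(frame2)
```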
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
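Both access styles in a short sketch (the data is invented):

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
})

by_key = frame["state"]  # dict-like notation
by_attr = frame.year     # attribute notation

# Either way, the result is a Series that shares the DataFrame's index.
print(type(by_key))
print(by_attr)
```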
Columns can be modified by assignment. For example, an empty ‘debt’ column could be assigned a scalar value or an array of values.
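A sketch of both kinds of assignment, assuming a small made-up DataFrame whose "debt" column starts out empty:

```python
import numpy as np
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Nevada"], "year": [2000, 2001, 2001]}
frame2 = pd.DataFrame(data, columns=["year", "state", "debt"])

# A scalar is broadcast to every row.
frame2["debt"] = 16.5
after_scalar = frame2["debt"].tolist()

# An array must match the number of rows in the DataFrame.
frame2["debt"] = np.arange(3.0)
after_array = frame2["debt"].tolist()

print(frame2)
```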
Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict.
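A sketch of creating and then deleting a column. The "eastern" column name is hypothetical and only serves as an example:

```python
import pandas as pd

frame = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
})

# Assigning to a name that does not exist yet creates the column.
frame["eastern"] = frame["state"] == "Ohio"
created = "eastern" in frame.columns

# del removes the column, just as it removes a key from a dict.
del frame["eastern"]
still_there = "eastern" in frame.columns

print(created, still_there)
```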
Dicts of Series are treated in much the same way:
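As a sketch with made-up numbers: each Series becomes a column, and the labels of all the Series are combined into the row index, with missing values where a Series has no entry for a label:

```python
import pandas as pd

pop = {
    "Nevada": pd.Series({2001: 2.4, 2002: 2.9}),
    "Ohio": pd.Series({2000: 1.5, 2001: 1.7, 2002: 3.6}),
}
frame3 = pd.DataFrame(pop)

# "Nevada" has no entry for 2000, so that cell is NaN.
nevada_2000_missing = pd.isna(frame3.loc[2000, "Nevada"])

print(frame3)
```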
As with Series, the values attribute returns the data contained in the DataFrame as a two-dimensional ndarray:
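A short sketch on invented numeric data. When the columns have different types, the array's dtype is chosen so that it can hold all of them:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "year": [2000, 2001, 2002],
    "pop": [1.5, 1.7, 3.6],
})

arr = frame.values  # two-dimensional NumPy array, one row per DataFrame row

# Integer and float columns are combined into a single float64 array.
print(arr.shape, arr.dtype)
```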
Notes
df[column] works for any column name, but df.column only works when the column name is a valid Python variable name.
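To illustrate, here is a sketch with a hypothetical column called "unit price". Because of the space, writing df.unit price would be a syntax error, so only the bracket form is usable:

```python
import pandas as pd

df = pd.DataFrame({"year": [2000, 2001], "unit price": [3.0, 4.0]})

year_attr = df.year       # "year" is a valid identifier, so attribute access works
price = df["unit price"]  # not a valid identifier, so brackets are required

# str.isidentifier() reports whether a name could be used after the dot.
year_ok = "year".isidentifier()
price_ok = "unit price".isidentifier()

print(year_ok, price_ok)
```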
Caution
- New columns can be created with the df[column] syntax, but CANNOT be created with the df.column syntax.
- The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied with the Series’s copy method. With pandas’ Copy-on-Write mode enabled, such modifications are no longer reflected in the DataFrame.
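Both cautions can be sketched as follows, using made-up data. pandas emits a warning on the attribute assignment, which is silenced here only to keep the output clean:

```python
import warnings

import pandas as pd

df = pd.DataFrame({"year": [2000, 2001]})

# Attribute assignment sets a plain Python attribute; no column is created.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df.debt = [1, 2]
created_by_attr = "debt" in df.columns

# Bracket assignment does create the column.
df["debt"] = [1, 2]
created_by_key = "debt" in df.columns

# An explicit copy can be modified without touching the DataFrame.
col_copy = df["debt"].copy()
col_copy.iloc[0] = 99

print(created_by_attr, created_by_key)
print(df["debt"].tolist(), col_copy.tolist())
```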
Reference
[1] Wes McKinney. 2017. “Chapter 5: Getting Started with pandas.” Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, pp. 124-136.