pandas provides high-level data structures and functions designed to
make working with structured or tabular data fast, easy and expressive.
In this blog I’ll introduce 2 workhorse data structures: Series and DataFrame.
Series
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.
The simplest way to form a series is:
>>> import pandas as pd
>>> obj = pd.Series([4, -7, 2, 1])
>>> obj
0 4
1 -7
2 2
3 1
dtype: int64
The string representation of a Series displayed interactively shows the index on
the left and the values on the right. Since no index was specified here, a default
one consisting of the integers 0 through N - 1 is created. To supply your own
labels, pass them through the index parameter of pd.Series(). You can get the array
representation and index object of the Series via its values and index attributes, respectively:
>>> obj.index
RangeIndex(start=0, stop=4, step=1)
>>> obj.values
array([ 4, -7, 2, 1])
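As a minimal sketch of the index parameter mentioned above (the labels and the name obj_labeled are illustrative and not part of the original example):

```python
import pandas as pd

# Pass explicit labels through the index parameter instead of relying on
# the default RangeIndex. The labels 'd', 'b', 'a', 'c' are made up.
obj_labeled = pd.Series([4, -7, 2, 1], index=['d', 'b', 'a', 'c'])

print(obj_labeled.index)   # an Index holding the four string labels
print(obj_labeled.values)  # the values as a NumPy array
```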
Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:
>>> obj[1]
-7
>>> obj[2] = 6
>>> obj
0 4
1 -7
2 6
3 1
dtype: int64
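The example above uses the default integer labels. With string labels the difference from positional NumPy indexing is easier to see. A small sketch, assuming an illustrative Series s with made-up labels:

```python
import pandas as pd

s = pd.Series([4, -7, 2, 1], index=['a', 'b', 'c', 'd'])

single = s['b']          # one value, selected by its label
subset = s[['c', 'a']]   # a list of labels selects several values, in that order
s.loc['d'] = 6           # label-based assignment through .loc
```

Here subset is itself a Series whose index is ['c', 'a'], so the labels travel with the selected values.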
Another way to think about a Series is as a fixed-length, ordered dict, as it is a mapping of index values to data values. If you have data contained in a Python dict, you can create a Series from it by passing the dict:
>>> sdata = {'Ohio': 35000, 'Texas': 80000, 'Utah': 9000}
>>> obj2 = pd.Series(sdata)
>>> obj2
Ohio 35000
Texas 80000
Utah 9000
dtype: int64
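The dict analogy also works in the other direction. A short sketch (the variable names has_ohio, has_nevada and round_trip are only for illustration):

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 80000, 'Utah': 9000}
obj2 = pd.Series(sdata)

has_ohio = 'Ohio' in obj2       # membership tests look at the index, like dict keys
has_nevada = 'Nevada' in obj2   # 'Nevada' is not a label in this Series
round_trip = obj2.to_dict()     # convert back to a plain Python dict
```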
Both the Series object itself and its index have a name attribute, which
integrates with other key areas of pandas functionality:
>>> obj2.name = 'population'
>>> obj2.index.name = 'state'
>>> obj2
state
Ohio 35000
Texas 80000
Utah 9000
Name: population, dtype: int64
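One place where these names show up, as a rough illustration: converting the Series to a DataFrame. The sketch below uses to_frame() and reset_index(), which the original text does not discuss:

```python
import pandas as pd

obj2 = pd.Series({'Ohio': 35000, 'Texas': 80000, 'Utah': 9000})
obj2.name = 'population'
obj2.index.name = 'state'

as_frame = obj2.to_frame()   # the Series name becomes the column label
flat = obj2.reset_index()    # the index name becomes a regular column label
```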
A Series’s index can be altered in place by assignment:
>>> obj
0 4
1 -7
2 6
3 1
dtype: int64
>>> obj.index = ['A', 'B', 'C', 'D']
>>> obj
A 4
B -7
C 6
D 1
dtype: int64
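The new index must have exactly one label per value. A minimal sketch of what happens otherwise (mismatch_error is a name introduced here for illustration):

```python
import pandas as pd

obj = pd.Series([4, -7, 6, 1])
obj.index = ['A', 'B', 'C', 'D']   # four labels for four values: accepted

try:
    obj.index = ['A', 'B', 'C']    # three labels for four values
except ValueError as exc:
    mismatch_error = exc           # pandas reports a length mismatch
else:
    mismatch_error = None
```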
DataFrame
A DataFrame represents a rectangular table of data and contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). There are many ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:
>>> data = {'state': ['Ohio', 'Ohio', 'Nevada'],
... 'year': [2000, 2001, 2001],
... 'pop': [1.5, 1.7, 3.6]}
>>> df = pd.DataFrame(data, columns=['year', 'state', 'pop'])
>>> df
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2001 Nevada 3.6
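The columns parameter arranges the DataFrame's columns in the given order. If it is
omitted, older pandas versions place the columns in sorted order, while recent
versions (pandas 0.23 and later on Python 3.6+) keep the dict's insertion order.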
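The columns parameter also accepts a name that is not a key in the dict. A minimal sketch, reusing the data above with a hypothetical 'debt' entry:

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Nevada'],
        'year': [2000, 2001, 2001],
        'pop': [1.5, 1.7, 3.6]}

# 'debt' is not a key in data, so that column is created with missing values.
df_debt = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'])
```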
A column in a DataFrame can be retrieved as a Series either by dict-like notation or by attribute:
>>> df['state']
0 Ohio
1 Ohio
2 Nevada
Name: state, dtype: object
>>> df.year
0 2000
1 2001
2 2001
Name: year, dtype: int64
Columns can be modified by assignment. For example, a ‘debt’ column can be assigned a scalar value or an array of values:
>>> df['debt'] = 16
>>> df
year state pop debt
0 2000 Ohio 1.5 16
1 2001 Ohio 1.7 16
2 2001 Nevada 3.6 16
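The example above covers the scalar case. For the array case, the array's length has to match the number of rows. A small sketch, with length_error as an illustrative name:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2001],
                   'state': ['Ohio', 'Ohio', 'Nevada'],
                   'pop': [1.5, 1.7, 3.6]})

df['debt'] = 16               # a scalar is broadcast to every row
df['debt'] = np.arange(3.0)   # an array must have one value per row

try:
    df['debt'] = [1.0, 2.0]   # two values for three rows
except ValueError as exc:
    length_error = exc        # the failed assignment leaves the column as it was
else:
    length_error = None
```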
Assigning a column that doesn’t exist will create a new column. The del keyword will delete columns as with a dict:
>>> data = {'state': ['Ohio', 'Ohio', 'Nevada'],
... 'year': [2000, 2001, 2001],
... 'pop': [1.5, 1.7, 3.6]}
>>> df2 = pd.DataFrame(data, columns=['year', 'state', 'pop'])
>>> df2['eastern'] = df2['state'] == 'Ohio'
>>> df2
year state pop eastern
0 2000 Ohio 1.5 True
1 2001 Ohio 1.7 True
2 2001 Nevada 3.6 False
>>> del df2['eastern']
>>> df2
year state pop
0 2000 Ohio 1.5
1 2001 Ohio 1.7
2 2001 Nevada 3.6
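Another common form of data is a nested dict of dicts. If it is passed to DataFrame, the outer dict keys become the columns and the inner keys become the row index. Dicts of Series are treated in much the same way: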
>>> df3 = pd.DataFrame({'Nevada': {2001: 2.4, 2002: 2.9},
... 'Ohio': {2000: 1.2, 2001: 2.1,
... 2002: 3.5}})
>>> df3
Nevada Ohio
2000 NaN 1.2
2001 2.4 2.1
2002 2.9 3.5
>>> pdata = {'Ohio': df3['Ohio'][:-1],
... 'Nevada': df3['Nevada'][:2]}
>>> pd.DataFrame(pdata)
Nevada Ohio
2000 NaN 1.2
2001 2.4 2.1
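To make the alignment explicit, here is a sketch that builds the two Series directly instead of slicing them out of df3 (nevada, ohio, frame and missing are names chosen for this example):

```python
import pandas as pd

nevada = pd.Series({2001: 2.4, 2002: 2.9})
ohio = pd.Series({2000: 1.2, 2001: 2.1, 2002: 3.5})

# The Series are aligned on the union of their indexes; years missing from
# one Series are filled with NaN in that column.
frame = pd.DataFrame({'Nevada': nevada, 'Ohio': ohio})

missing = pd.isna(frame.loc[2000, 'Nevada'])   # Nevada has no value for 2000
```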
As with Series, the values attribute returns the data contained in the
DataFrame as a two-dimensional ndarray:
>>> df3.values
array([[ nan, 1.2],
[ 2.4, 2.1],
[ 2.9, 3.5]])
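When the columns have different types, the array needs a dtype that can hold all of them. A brief sketch with two made-up frames (floats and mixed are illustrative names):

```python
import pandas as pd

floats = pd.DataFrame({'Nevada': [2.4, 2.9], 'Ohio': [2.1, 3.5]})
mixed = pd.DataFrame({'year': [2000, 2001], 'state': ['Ohio', 'Nevada']})

float_array = floats.values   # all columns are floats: a float64 array
mixed_array = mixed.values    # ints and strings: the dtype falls back to object
```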
Notes
df[column] works for any column name, but df.column only works when the column name is a valid Python variable name.
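A related pitfall, sketched below: a column whose name matches an existing DataFrame method, such as the 'pop' column used in this post, is also not reachable by attribute. The 'pop density' column is hypothetical and only there to show a name with a space:

```python
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001],
                   'pop': [1.5, 1.7],
                   'pop density': [10.0, 12.0]})   # hypothetical column

year_col = df.year            # works: 'year' is a valid identifier
density = df['pop density']   # a name with a space needs bracket notation
pop_attr = df.pop             # NOT the column: DataFrame.pop is a method
pop_col = df['pop']           # bracket notation returns the column
```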
Caution
- New columns can be created with the df[column] syntax, but CANNOT be created with the df.column syntax.
- The column returned from indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame (this is the behavior described in [1]; newer pandas releases with Copy-on-Write enabled behave differently). The column can be explicitly copied with the Series’s copy method.
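Both cautions can be checked with a short sketch. The names created_by_attribute and years are introduced here for illustration, and the warning filter is only there to keep the output quiet:

```python
import warnings
import pandas as pd

df = pd.DataFrame({'year': [2000, 2001, 2001],
                   'state': ['Ohio', 'Ohio', 'Nevada']})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')      # pandas warns about this pattern
    df.eastern = df['state'] == 'Ohio'   # sets a plain attribute, not a column

created_by_attribute = 'eastern' in df.columns

df['eastern'] = df['state'] == 'Ohio'    # bracket notation creates the column

years = df['year'].copy()                # explicit copy, independent of df
years.iloc[0] = 1999                     # changes the copy only
```

Working on an explicit copy is the safe choice whenever the DataFrame should stay untouched, whatever the view or copy rules of the installed pandas version are.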
Reference
[1] Wes McKinney. 2017. “Chapter 5: Getting Started with pandas.” Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython, pp. 124-136.