A DataFrame is a data structure for storing tabular data, provided by Python’s pandas package. It is one of the most important data structures for data-processing algorithms and is used to work with traditional structured data. It is similar to an Excel spreadsheet.

Pandas DataFrames are data structures that contain:

  • Data organized in two dimensions, rows and columns
  • Labels that correspond to the rows and columns. A label can be set for both columns and rows.

Creating a DataFrame

Let’s consider the following table:

id	name	age	decision
1	Fares	32	True
2	Elena	23	False
3	Steven	40	True

The example below represents the table above as a DataFrame.

import pandas as pd

df = pd.DataFrame([
    ['1', 'Fares', 32, True],
    ['2', 'Elena', 23, False],
    ['3', 'Steven', 40, True]])

# Column label
df.columns = ['id', 'name', 'age', 'decision']
print(df)

'''
# print output
  id    name  age  decision
0  1   Fares   32      True
1  2   Elena   23     False
2  3  Steven   40      True
'''

There are several ways to create a pandas DataFrame. You can pass the data as a two-dimensional list, tuple, dictionary, or NumPy array. The following example uses a dictionary to create a DataFrame instance.

import pandas as pd

# Dictionary: keys become column labels, values become column data
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

# Row label
row_labels = [101, 102, 103, 104, 105, 106, 107]

df = pd.DataFrame(data=data, index=row_labels)
print(df)

# Print first two rows
print(df.head(n=2))

'''
# print(df) output
       name         city  age  py-score
101  Xavier  Mexico City   41      88.0
102     Ann      Toronto   28      79.0
103    Jana       Prague   33      81.0
104      Yi     Shanghai   34      80.0
105   Robin   Manchester   38      68.0
106    Amal        Cairo   31      61.0
107    Nori        Osaka   37      84.0
'''
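The same constructor also accepts a NumPy array, as mentioned above. The following is a minimal sketch; the array values and the column names 'x' and 'y' are made up for illustration and are not part of the earlier examples.

```python
import numpy as np
import pandas as pd

# A 3x2 NumPy array; values and column names are illustrative only
arr = np.array([[1, 2], [3, 4], [5, 6]])

# Column labels are passed explicitly with the columns argument
df_from_array = pd.DataFrame(arr, columns=['x', 'y'])
print(df_from_array)

# When no index is passed, rows get the default labels 0, 1, 2, ...
print(list(df_from_array.index))
```

When neither an index nor column labels are supplied, pandas falls back to zero-based integer labels for both.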

To view a subset of the data, use head() to show the first few rows and tail() to show the last few rows. Both accept the number of rows as the argument n.
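As a small sketch of both calls, the snippet below rebuilds a shortened version of the DataFrame above (only the name and age columns, to keep it brief) and takes two rows from each end.

```python
import pandas as pd

# Shortened copy of the example data: two columns only
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'age': [41, 28, 33, 34, 38, 31, 37]
}
df = pd.DataFrame(data=data, index=[101, 102, 103, 104, 105, 106, 107])

# First two rows (labels 101 and 102)
first_two = df.head(n=2)
print(first_two)

# Last two rows (labels 106 and 107)
last_two = df.tail(n=2)
print(last_two)

# Without an argument, head() and tail() return up to five rows
print(len(df.head()))
```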

Creating a subset of a DataFrame

There are two main ways to create a subset of a DataFrame:

  • Column selection
  • Row selection

Not all of the rows and columns we have may be needed at a particular stage of an algorithm. The example below demonstrates how to select a given column and a given row of a DataFrame.

import pandas as pd

# Dictionary
data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}

# Row labels
row_labels = [101, 102, 103, 104, 105, 106, 107]

df = pd.DataFrame(data=data, index=row_labels)

# Select column having label 'city'
print(df['city'])

# Select the row having label 107
print(df.loc[107])

Rows and columns keep a fixed order in a DataFrame, so they can also be selected by position. The following example retrieves the first three rows and then the first three columns of the DataFrame.

# Print first three rows
print(df.iloc[0:3,:])

# Print first three columns
print(df.iloc[:,0:3])

loc[] retrieves rows or columns by their labels, while iloc[] retrieves rows or columns by their zero-based integer position.
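To make the difference concrete, here is a minimal sketch on a three-row version of the example data. It selects the same cells once by label and once by position, and then shows one behavior that is easy to miss: a slice in loc[] includes its end label, while a slice in iloc[] excludes its end position.

```python
import pandas as pd

# Three-row version of the example data
df = pd.DataFrame(
    {
        'name': ['Xavier', 'Ann', 'Jana'],
        'city': ['Mexico City', 'Toronto', 'Prague'],
        'age': [41, 28, 33]
    },
    index=[101, 102, 103]
)

# Label-based: the row labelled 102, columns 'name' and 'age'
by_label = df.loc[102, ['name', 'age']]

# Position-based: the same row sits at position 1, the columns at 0 and 2
by_position = df.iloc[1, [0, 2]]

# Both selections hold the same data
print(by_label.equals(by_position))

# loc slices include the end label: rows 101 and 102
print(df.loc[101:102])

# iloc slices exclude the end position: only the row at position 0
print(df.iloc[0:1])
```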

To create a subset with a filter, we use one or more columns to define the selection criterion. The example below shows how to select the rows that satisfy a condition.

print(df[(df.age<35)])

It selects all the rows for which age is less than 35 years.
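A filter can also combine several columns. The sketch below, using the same example data, joins two conditions with the & operator; each condition needs its own parentheses. The threshold of 80 for py-score is chosen only for illustration. Note that the column 'py-score' has to be accessed with brackets, because a hyphen is not allowed in attribute-style access such as df.age.

```python
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
}
df = pd.DataFrame(data=data, index=[101, 102, 103, 104, 105, 106, 107])

# Single condition: rows where age is below 35
under_35 = df[df['age'] < 35]
print(under_35)

# Two conditions combined with & (logical AND):
# age below 35 and py-score of at least 80
young_high = df[(df['age'] < 35) & (df['py-score'] >= 80)]
print(young_high)
```

Use | instead of & when a row should match either condition.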

Retrieving Labels

The index and columns attributes return the row labels and the column labels respectively, as shown in the example below.

print(df.index)
'''
Int64Index([101, 102, 103, 104, 105, 106, 107], dtype='int64')
# pandas 2.0 and later print: Index([101, 102, 103, 104, 105, 106, 107], dtype='int64')
'''

print(df.columns)
'''
Index(['name', 'city', 'age', 'py-score'], dtype='object')
'''

Accessors

Pandas has four accessors for retrieving data. We have already come across some of them in the examples above.

  • .loc[] accepts the labels of rows and columns and returns Series or DataFrames.
  • .iloc[] accepts the zero-based indices of rows and columns and returns Series or DataFrames.
  • .at[] accepts the labels of rows and columns and returns a single data value.
  • .iat[] accepts the zero-based indices of rows and columns and returns a single data value.
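The following sketch, on a three-row version of the example data, shows what each kind of accessor returns: .at[] and .iat[] give a single value, while .loc[] gives a Series for one row label and a DataFrame for a list of row labels.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Xavier', 'Ann', 'Jana'], 'age': [41, 28, 33]},
    index=[101, 102, 103]
)

# Single value by row label and column label
value_by_label = df.at[102, 'name']

# The same value by zero-based row and column position
value_by_position = df.iat[1, 0]
print(value_by_label, value_by_position)

# One row label returns a Series
print(type(df.loc[102]))

# A list of row labels returns a DataFrame, even with one entry
print(type(df.loc[[102]]))
```

.at[] and .iat[] are meant for reading or writing one cell at a time; for anything larger, .loc[] and .iloc[] are the usual choice.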