A DataFrame is a data structure in Python’s pandas package for storing tabular data. It is one of the most important data structures for algorithms that process traditional structured data. It is similar to an Excel spreadsheet.
Pandas DataFrames are data structures that contain:
- Data organized in two dimensions, rows and columns
- Labels that correspond to the rows and columns. Both row labels and column labels can be set explicitly.
Creating a DataFrame
Let’s consider the following table:
| id | name   | age | decision |
|----|--------|-----|----------|
| 1  | Fares  | 32  | True     |
| 2  | Elena  | 23  | False    |
| 3  | Steven | 40  | True     |

The example below represents this table as a DataFrame.
    import pandas as pd

    df = pd.DataFrame([
        ['1', 'Fares', 32, True],
        ['2', 'Elena', 23, False],
        ['3', 'Steven', 40, True]])

    # Column labels
    df.columns = ['id', 'name', 'age', 'decision']
    print(df)

    '''
    # print output
      id    name  age  decision
    0  1   Fares   32      True
    1  2   Elena   23     False
    2  3  Steven   40      True
    '''
There are several ways to create a pandas DataFrame. You can pass the data as a two-dimensional list, tuple, dictionary, or NumPy array. The following example uses a dictionary to create a DataFrame instance.
    import pandas as pd

    # Dictionary: each key becomes a column label
    data = {
        'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
        'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
                 'Manchester', 'Cairo', 'Osaka'],
        'age': [41, 28, 33, 34, 38, 31, 37],
        'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
    }

    # Row labels
    row_labels = [101, 102, 103, 104, 105, 106, 107]

    df = pd.DataFrame(data=data, index=row_labels)
    print(df)

    # Print first two rows
    print(df.head(n=2))

    '''
    # print(df) output
           name         city  age  py-score
    101  Xavier  Mexico City   41      88.0
    102     Ann      Toronto   28      79.0
    103    Jana       Prague   33      81.0
    104      Yi     Shanghai   34      80.0
    105   Robin   Manchester   38      68.0
    106    Amal        Cairo   31      61.0
    107    Nori        Osaka   37      84.0
    '''
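The other input types mentioned above follow the same pattern. The sketch below is a minimal illustration with made-up sample values (the variable names `array_data` and `tuple_data` are hypothetical, not from the original example); it builds one DataFrame from a NumPy array and one from a list of tuples.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each row is (age, py-score).
array_data = np.array([[41, 88.0],
                       [28, 79.0],
                       [33, 81.0]])

# An array carries no labels, so column labels are passed separately.
# A NumPy array has a single dtype, so both columns end up as floats.
df_from_array = pd.DataFrame(array_data, columns=['age', 'py-score'])
print(df_from_array)

# A list of tuples works the same way: one tuple per row.
tuple_data = [('Xavier', 41), ('Ann', 28), ('Jana', 33)]
df_from_tuples = pd.DataFrame(tuple_data, columns=['name', 'age'])
print(df_from_tuples)
```

Because no `index` argument is given here, pandas assigns the default row labels 0, 1, 2.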
To view a subset of the data, we use head() to show the first few rows and tail() to show the last few rows.
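As a small sketch of both methods, the example below uses a reduced version of the data from this article (only the name and age columns, chosen here for brevity).

```python
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'age': [41, 28, 33, 34, 38, 31, 37],
}
df = pd.DataFrame(data, index=[101, 102, 103, 104, 105, 106, 107])

first_two = df.head(2)     # rows labelled 101 and 102
last_two = df.tail(2)      # rows labelled 106 and 107
default_head = df.head()   # without an argument, n defaults to 5

print(first_two)
print(last_two)
```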
Creating a subset of a DataFrame
There are two main ways to create a subset of a DataFrame:
- Column selection
- Row selection
Not all of the rows and columns we have may be needed at a particular stage of an algorithm. The example below demonstrates how to select a given column and a given row of a DataFrame.
    import pandas as pd

    # Dictionary: each key becomes a column label
    data = {
        'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
        'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
                 'Manchester', 'Cairo', 'Osaka'],
        'age': [41, 28, 33, 34, 38, 31, 37],
        'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
    }

    # Row labels
    row_labels = [101, 102, 103, 104, 105, 106, 107]

    df = pd.DataFrame(data=data, index=row_labels)

    # Select the column having label 'city'
    print(df['city'])

    # Select the row having label 107
    print(df.loc[107])
The position of each column in a DataFrame is fixed, so rows and columns can also be selected by position. The following example retrieves the first three rows and then the first three columns of the DataFrame.
    # Print first three rows
    print(df.iloc[0:3, :])

    # Print first three columns
    print(df.iloc[:, 0:3])
loc[] retrieves rows or columns by their labels, while iloc[] retrieves rows or columns by their integer positions.
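The difference can be sketched by selecting the same block of data both ways. This is an illustrative example using a three-column version of the data above. One detail worth noting is that a label slice with loc[] includes its end label, whereas a position slice with iloc[] excludes its end position.

```python
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
}
df = pd.DataFrame(data, index=[101, 102, 103, 104, 105, 106, 107])

# Label-based: rows 102 through 104 (end label included), two named columns.
by_label = df.loc[102:104, ['name', 'city']]

# Position-based: row positions 1, 2, 3 (end position 4 excluded), columns 0 and 1.
by_position = df.iloc[1:4, 0:2]

# Both selections refer to the same rows and columns.
print(by_label.equals(by_position))

# A single row comes back as a Series either way.
row_by_label = df.loc[107]
row_by_position = df.iloc[-1]
print(row_by_label['name'], row_by_position['name'])
```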
To create a subset with a filter, we use one or more columns to define the selection criterion. The example below shows how to select the rows that satisfy a condition.
    print(df[df.age < 35])
It selects all the rows for which age is less than 35 years.
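A criterion built from more than one column might look like the sketch below. The threshold of 80 on py-score is an arbitrary value picked for illustration. Each condition is wrapped in parentheses and combined with `&`, and the py-score column is accessed with brackets because its label contains a hyphen and cannot be written as an attribute.

```python
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=[101, 102, 103, 104, 105, 106, 107])

# Rows where age is below 35 AND py-score is at least 80 (illustrative threshold).
young_high = df[(df['age'] < 35) & (df['py-score'] >= 80)]
print(young_high)
```

With this data, only the rows for Jana and Yi satisfy both conditions.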
Retrieving Labels
The index and columns attributes return the row labels and the column labels, respectively, as shown in the example below.
    print(df.index)
    '''
    Int64Index([101, 102, 103, 104, 105, 106, 107], dtype='int64')
    '''
    # Note: pandas 2.0 and later print Index([...], dtype='int64') here instead.

    print(df.columns)
    '''
    Index(['name', 'city', 'age', 'py-score'], dtype='object')
    '''
Accessors
Pandas has four accessors for retrieving data. We have already come across some of them in the examples above.
- .loc[] accepts the labels of rows and columns and returns Series or DataFrames.
- .iloc[] accepts the zero-based indices of rows and columns and returns Series or DataFrames.
- .at[] accepts the labels of rows and columns and returns a single data value.
- .iat[] accepts the zero-based indices of rows and columns and returns a single data value.
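The two single-value accessors can be sketched as follows, again using the sample data from this article. The row label 103 and the positions are picked only for illustration.

```python
import pandas as pd

data = {
    'name': ['Xavier', 'Ann', 'Jana', 'Yi', 'Robin', 'Amal', 'Nori'],
    'city': ['Mexico City', 'Toronto', 'Prague', 'Shanghai',
             'Manchester', 'Cairo', 'Osaka'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'py-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=[101, 102, 103, 104, 105, 106, 107])

# .at[]: row label and column label, returns one value.
score = df.at[103, 'py-score']

# .iat[]: zero-based row position and column position, returns one value.
# Row position 2 is the row labelled 103; column position 2 is 'age'.
age = df.iat[2, 2]

print(score, age)

# .loc[] with a single row label and column label reaches the same cell.
same_score = df.loc[103, 'py-score']
print(same_score == score)
```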