6. Pandas#

6.1 Introduction#

Pandas offers a wide range of functions to read data from various sources, such as CSV files, Excel files, SQL databases, and more. You can perform operations like filtering, sorting, grouping, merging, reshaping, and aggregating data using Pandas. It also integrates well with other popular libraries in the Python ecosystem, such as numpy and matplotlib.

It provides powerful data structures and functions to efficiently work with structured data, such as tabular data, time series, and more. With pandas, you can easily load, manipulate, analyze, and visualize data for tasks like data cleaning, exploration, transformation, and modeling.

To start using pandas, you need to import it into your Python script or Jupyter Notebook. The standard way to import Pandas is as follows:

import pandas as pd

The two primary data structures provided by Pandas are:

  1. Series: A Series in Pandas is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a single column of a SQL table. Each element in a Series is associated with a unique label called an index.

  2. DataFrame: A two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table.

6.2 Series#

Here is an example where we create a list of data containing some numbers and an index list with corresponding labels. We then use these lists to create a Series using pd.Series() function.

data = [10, 20, 30, 40, 50]
index = ['A', 'B', 'C', 'D', 'E']

s = pd.Series(data, index=index)
print("\nOriginal series:")
print (s)
Original series:
A    10
B    20
C    30
D    40
E    50
dtype: int64

We also change the created labels without having any effect on the data as follows:

s.index = ['X', 'Y', 'Zebra', 'W', 'V']
print("\nUpdated series:")
print(s)
Updated series:
X        10
Y        20
Zebra    30
W        40
V        50
dtype: int64

Two helpful functions when working with pandas are the iloc[ ] and loc[ ] functions. For more information see the table below:

Function

Description

Example

iloc[ ]

Integer-based indexing and selection

s.iloc[0] accesses the first row of a DataFrame

s.iloc[2:4] accesses a slice of rows in a DataFrame

loc[ ]

Label-based indexing and selection

s.loc[‘X’] accesses a row labeled ‘A’ in a DataFrame

s.loc[[‘X’, ‘W’]] accesses multiple rows in a DataFrame

s.iloc[2:4] 
Zebra    30
W        40
dtype: int64

6.3 DataFrame#

In the example below, we start by creating a dictionary data that contains information about names, ages, and cities. We then use this dictionary to create a dataframe df using pd.DataFrame( ).

data2 = {'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, 28, 22],
    'City': ['London', 'Paris', 'Berlin']}

df = pd.DataFrame(data2)

print("Original DataFrame:")
df      
Original DataFrame:
Name Age City
0 John 25 London
1 Alice 28 Paris
2 Bob 22 Berlin

You can then use the loc[ ] and loc[ ] functions to locate specific data from your dataframe. For example:

print("\nUsing iloc[]:")
print(df.iloc[0])                
print(df.iloc[1:3])               
print(df.iloc[0, 1])             

print("\nUsing loc[]:")
print(df.loc[0])                 
print(df.loc[1:2])                
print(df.loc[0, 'Age'])    
Using iloc[]:
Name      John
Age         25
City    London
Name: 0, dtype: object
    Name  Age    City
1  Alice   28   Paris
2    Bob   22  Berlin
25

Using loc[]:
Name      John
Age         25
City    London
Name: 0, dtype: object
    Name  Age    City
1  Alice   28   Paris
2    Bob   22  Berlin
25

We then add and remove some data from the created dataframe. You can remove rows and columns from a datafram by using the Yourdataframename.drop() function.

df.loc[0, 'City'] = 'New York'
df['Salary'] = [5000, 6000, 4500]

df = df.drop(1)                   
df = df.drop('Age', axis=1)      

print("\nUpdated DataFrame:")
df
Updated DataFrame:
Name City Salary
0 John New York 5000
2 Bob Berlin 4500

6.4 Importing data into DataFrames and exploring its attributes#

You can also make a dataframes from any csv file (excel files, delimited text files, etc). The table below summarizes the important functions you need to remember when applying the pd.read_csv() command.

Aspect

Description

Function

File Path

Specify the path or filename of the CSV file

pd.read_csv(‘file_path.csv’)

Delimiter

Define the delimiter used in the CSV file

pd.read_csv(‘file.csv’, delimiter=’,’)

Header

Specify whether the CSV file has a header row

pd.read_csv(‘file.csv’, header=0)

Columns

Select specific columns from the CSV file

pd.read_csv(‘file.csv’, usecols=[‘col1’, ‘col2’])

Index Column

Set a specific column as the index of the DataFrame

pd.read_csv(‘file.csv’, index_col=’column’)

Data Types

Specify data types for columns in the DataFrame

pd.read_csv(‘file.csv’, dtype={‘col1’: int, ‘col2’: float})