6. Pandas#
6.1 Introduction#
Pandas
offers a wide range of functions to read data from various sources, such as CSV files, Excel files, SQL databases, and more. You can perform operations like filtering, sorting, grouping, merging, reshaping, and aggregating data using Pandas. It also integrates well with other popular libraries in the Python ecosystem, such as numpy
and matplotlib
.
It provides powerful data structures and functions to efficiently work with structured data, such as tabular data, time series, and more. With pandas
, you can easily load, manipulate, analyze, and visualize data for tasks like data cleaning, exploration, transformation, and modeling.
To start using pandas
, you need to import it into your Python script or Jupyter Notebook. The standard way to import Pandas is as follows:
import pandas as pd
The two primary data structures provided by Pandas are:
Series
: ASeries
in Pandas is a one-dimensional labeled array that can hold any data type. It is similar to a column in a spreadsheet or a single column of a SQL table. Each element in a Series is associated with a unique label called anindex
.DataFrame
: A two-dimensional labeled data structure with columns of potentially different types. It is similar to a spreadsheet or a SQL table.
6.2 Series
#
Here is an example where we create a list of data containing some numbers and an index list with corresponding labels. We then use these lists to create a Series using pd.Series() function.
data = [10, 20, 30, 40, 50]
index = ['A', 'B', 'C', 'D', 'E']
s = pd.Series(data, index=index)
print("\nOriginal series:")
print (s)
Original series:
A 10
B 20
C 30
D 40
E 50
dtype: int64
We also change the created labels without having any effect on the data as follows:
s.index = ['X', 'Y', 'Zebra', 'W', 'V']
print("\nUpdated series:")
print(s)
Updated series:
X 10
Y 20
Zebra 30
W 40
V 50
dtype: int64
Two helpful functions when working with pandas are the iloc[ ]
and loc[ ]
functions. For more information see the table below:
Function |
Description |
Example |
---|---|---|
|
Integer-based indexing and selection |
s.iloc[0] accesses the first row of a DataFrame |
s.iloc[2:4] accesses a slice of rows in a DataFrame |
||
|
Label-based indexing and selection |
s.loc[‘X’] accesses a row labeled ‘A’ in a DataFrame |
s.loc[[‘X’, ‘W’]] accesses multiple rows in a DataFrame |
s.iloc[2:4]
Zebra 30
W 40
dtype: int64
6.3 DataFrame
#
In the example below, we start by creating a dictionary data that contains information about names, ages, and cities. We then use this dictionary to create a dataframe
df using pd.DataFrame( ).
data2 = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 28, 22],
'City': ['London', 'Paris', 'Berlin']}
df = pd.DataFrame(data2)
print("Original DataFrame:")
df
Original DataFrame:
Name | Age | City | |
---|---|---|---|
0 | John | 25 | London |
1 | Alice | 28 | Paris |
2 | Bob | 22 | Berlin |
You can then use the loc[ ]
and loc[ ]
functions to locate specific data from your dataframe
. For example:
print("\nUsing iloc[]:")
print(df.iloc[0])
print(df.iloc[1:3])
print(df.iloc[0, 1])
print("\nUsing loc[]:")
print(df.loc[0])
print(df.loc[1:2])
print(df.loc[0, 'Age'])
Using iloc[]:
Name John
Age 25
City London
Name: 0, dtype: object
Name Age City
1 Alice 28 Paris
2 Bob 22 Berlin
25
Using loc[]:
Name John
Age 25
City London
Name: 0, dtype: object
Name Age City
1 Alice 28 Paris
2 Bob 22 Berlin
25
We then add and remove some data from the created dataframe
. You can remove rows and columns from a datafram by using the Yourdataframename.drop() function.
df.loc[0, 'City'] = 'New York'
df['Salary'] = [5000, 6000, 4500]
df = df.drop(1)
df = df.drop('Age', axis=1)
print("\nUpdated DataFrame:")
df
Updated DataFrame:
Name | City | Salary | |
---|---|---|---|
0 | John | New York | 5000 |
2 | Bob | Berlin | 4500 |
6.4 Importing data into DataFrames
and exploring its attributes#
You can also make a dataframes
from any csv file (excel files, delimited text files, etc). The table below summarizes the important functions you need to remember when applying the pd.read_csv()
command.
Aspect |
Description |
Function |
---|---|---|
File Path |
Specify the path or filename of the CSV file |
pd.read_csv(‘file_path.csv’) |
Delimiter |
Define the delimiter used in the CSV file |
pd.read_csv(‘file.csv’, delimiter=’,’) |
Header |
Specify whether the CSV file has a header row |
pd.read_csv(‘file.csv’, header=0) |
Columns |
Select specific columns from the CSV file |
pd.read_csv(‘file.csv’, usecols=[‘col1’, ‘col2’]) |
Index Column |
Set a specific column as the index of the DataFrame |
pd.read_csv(‘file.csv’, index_col=’column’) |
Data Types |
Specify data types for columns in the DataFrame |
pd.read_csv(‘file.csv’, dtype={‘col1’: int, ‘col2’: float}) |
Note
The excel file needs to be in the same place in the folder that the python file is at for this to work!