Pandas is a popular Python package for data analysis. It is powerful and flexible and makes data cleaning and wrangling tasks easier. This tutorial explains how to handle various data analysis tasks using the pandas package, along with examples.
What is pandas package?
The pandas package provides many functions that are essential for data handling and manipulation. In short, it can perform the following tasks for you –
- Creating structured data similar to an MS Excel spreadsheet.
- Reading data from various sources such as CSV, TXT and XLSX files, SQL databases, etc.
- Selecting rows or columns from data set
- Arranging data in ascending or descending order
- Filtering data
- Summarizing data
- Reshaping data into wide or long format
- Merging and concatenating two datasets
Important pandas functions to remember
The following is a list of common tasks along with pandas functions.
Tasks | Functions |
---|---|
Extract Column Names | df.columns |
Select first 2 rows | df.iloc[:2] |
Select first 2 columns | df.iloc[:,:2] |
Select columns by name | df.loc[:,["col1","col2"]] |
Select random no. of rows | df.sample(n = 10) |
Select fraction of random rows | df.sample(frac = 0.2) |
Rename the variables | df.rename( ) |
Selecting a column as index | df.set_index( ) |
Removing rows or columns | df.drop( ) |
Sorting values | df.sort_values( ) |
Grouping variables | df.groupby( ) |
Filtering | df.query( ) |
Finding the missing values | df.isnull( ) |
Dropping the missing values | df.dropna( ) |
Removing the duplicates | df.drop_duplicates( ) |
Creating dummies | pd.get_dummies( ) |
Ranking | df.rank( ) |
Cumulative sum | df.cumsum( ) |
Quantiles | df.quantile( ) |
Selecting numeric variables | df.select_dtypes( ) |
Concatenating two dataframes | pd.concat() |
Merging on basis of common variable | pd.merge( ) |
Importing pandas library
You need to import or load the pandas library first in order to use it. Run the following code to do so:
import pandas as pd
Here "pd" is an alias or abbreviation that will be used as a shortcut to access or call pandas functions.
Importing Data
To read or import data from CSV file, you can use read_csv() function. In the function, you need to specify the file location of your CSV file. This dataset includes state income data from 2002 to 2015, comprising 51 rows and 16 columns.
mydata = pd.read_csv("https://github.com/deepanshu88/Datasets/raw/master/UploadedFiles/income.csv")
Output
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 \
0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
4 C California 1685349 1675807 1889570 1480280 1735069 1812546
Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
4 1487315 1663809 1624509 1639670 1921845 1156536 1388461 1644607
Get Variable Names
By using the mydata.columns command, you can fetch the names of the variables of a data frame.
Output
Index(['Index', 'State', 'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007',
'Y2008', 'Y2009', 'Y2010', 'Y2011', 'Y2012', 'Y2013', 'Y2014', 'Y2015'],
dtype='object')
mydata.columns[0:2]
returns the first two column names 'Index' and 'State'. In Python, indexing starts from 0.
How to Check Variable Types
You can use the DataFrameName.dtypes command to extract the information of types of variables stored in the data frame.
mydata.dtypes
Output
Index object
State object
Y2002 int64
Y2003 int64
Y2004 int64
Y2005 int64
Y2006 int64
Y2007 int64
Y2008 int64
Y2009 int64
Y2010 int64
Y2011 int64
Y2012 int64
Y2013 int64
Y2014 int64
Y2015 int64
dtype: object
Here 'object' means string or character variables. 'int64' refers to integer variables (numbers without decimals).
To see the variable type of a single variable (say "State") instead of all the variables, you can use the command below –
mydata['State'].dtypes
It returns dtype('O'). Here 'O' refers to object, i.e. the variable is of character type.
Changing Data Types
Y2008 is an integer. Suppose we want to convert it to float (a numeric variable with decimals), we can write:
mydata.Y2008 = mydata.Y2008.astype(float)
mydata.dtypes
Output
Index object
State object
Y2002 int64
Y2003 int64
Y2004 int64
Y2005 int64
Y2006 int64
Y2007 int64
Y2008 float64
Y2009 int64
Y2010 int64
Y2011 int64
Y2012 int64
Y2013 int64
Y2014 int64
Y2015 int64
dtype: object
How to Check No. of Rows and Columns
mydata.shape
Output
(51, 16)
51 is the number of rows and 16 is the number of columns.
You can also use shape[0] to see the number of rows (similar to nrow() in R) and shape[1] for number of columns (similar to ncol() in R).
mydata.shape[0]
mydata.shape[1]
View First/Last N Rows
By default head( ) shows the first 5 rows. If we want to see a specific number of rows, we can mention it in the parentheses. Similarly, the tail( ) function shows the last 5 rows by default.
mydata.head()
mydata.head(2) #shows first 2 rows.
mydata.tail()
mydata.tail(2) #shows last 2 rows
Alternatively, any of the following commands can be used to fetch the first five rows.
mydata[0:5]
mydata.iloc[0:5]
Define Categorical Variable
Like the factor() function in R, we can create categorical variables in Python using the "category" dtype.
s = pd.Series([1,2,3,1,2], dtype="category")
s
0 1
1 2
2 3
3 1
4 2
dtype: category
Categories (3, int64): [1, 2, 3]
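An existing column can also be converted to categorical with astype(). A minimal sketch using the income data loaded earlier (idx_cat is a throwaway name for illustration):
idx_cat = mydata["Index"].astype("category") #convert an existing column to the "category" dtype
idx_cat.dtype #returns a CategoricalDtype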
Extract Unique Values
The unique() function shows the unique levels or categories of a column.
mydata.Index.unique()
Output
array(['A', 'C', 'D', ..., 'U', 'V', 'W'], dtype=object)
The nunique( ) function shows the number of unique values.
mydata.Index.nunique()
It returns 19 as the Index column contains 19 distinct values.
Generate Cross Tab
pd.crosstab( ) is used to create a bivariate frequency distribution. Here it is between the Index and State columns.
pd.crosstab(mydata.Index,mydata.State)
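If totals are also needed, crosstab( ) accepts a margins parameter; an optional extra:
pd.crosstab(mydata.Index, mydata.State, margins = True) #adds an 'All' row and column with totals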
Creating frequency distribution
mydata.Index selects the 'Index' column of the 'mydata' dataset and value_counts( ) creates a frequency distribution. By default ascending = False, i.e. it will show the 'Index' value having the maximum frequency on top.
mydata.Index.value_counts(ascending = True)
Output
F 1
G 1
U 1
L 1
H 1
P 1
R 1
D 2
T 2
S 2
V 2
K 2
O 3
C 3
I 4
W 4
A 4
M 8
N 8
Name: Index, dtype: int64
Random Sampling
mydata.sample( ) is used to draw random samples from the dataset, containing all the columns. Here n = 5 means we need 5 random rows and frac = 0.1 means we need 10 percent of the data as our sample.
mydata.sample(n = 5)
mydata.sample(frac = 0.1)
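To make the sample reproducible across runs, you can pass a seed via the random_state parameter; an optional extra:
mydata.sample(n = 5, random_state = 123) #returns the same 5 random rows on every run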
Selecting Columns
There are multiple ways to select a particular column. Both of the following lines of code select the State variable from the mydata data frame.
mydata["State"]
mydata.State
To select multiple columns by name, you can use the following syntax.
mydata[["Index","State","Y2008"]]
To select specific columns and rows, we use either the loc[ ] or iloc[ ] functions. The indices or columns to be selected are passed as lists. "Index":"Y2008" denotes that all the columns from Index to Y2008 are to be selected.
Syntax of df.loc[ ]
df.loc[row_index , column_index]
mydata.loc[:,["Index","State","Y2008"]]
mydata.loc[0:2,["Index","State","Y2008"]] #Selecting rows with index labels 0 to 2 & the listed columns
mydata.loc[:,"Index":"Y2008"] #Selecting consecutive columns
#In the above command both Index and Y2008 are included.
mydata.iloc[:,0:5] #Columns from 1 to 5 are included. 6th column not included
Difference between loc and iloc
loc selects rows (or columns) with particular labels from the index, whereas iloc selects rows (or columns) at particular positions in the index, so it only takes integers.
import numpy as np
x = pd.DataFrame({"var1" : np.arange(1,20,2)}, index=[9,8,7,6,10, 1, 2, 3, 4, 5])
Output
var1
9 1
8 3
7 5
6 7
10 9
1 11
2 13
3 15
4 17
5 19
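A quick check with the dataframe x above makes the difference concrete:
x.loc[6] #selects by label: the row labelled 6, where var1 = 7
x.iloc[6] #selects by position: the seventh row (label 2), where var1 = 13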
Renaming Variables
We create a dataframe 'data' containing people's names and their respective zodiac signs.
data = pd.DataFrame({"A" : ["John","Mary","Julia","Kenny","Henry"], "B" : ["Libra","Capricorn","Aries","Scorpio","Aquarius"]})
data
Output
A B
0 John Libra
1 Mary Capricorn
2 Julia Aries
3 Kenny Scorpio
4 Henry Aquarius
If all the columns are to be renamed then we can use data.columns and assign the list of new column names.
#Renaming all the variables.
data.columns = ['Names','Zodiac Signs']
Output
Names Zodiac Signs
0 John Libra
1 Mary Capricorn
2 Julia Aries
3 Kenny Scorpio
4 Henry Aquarius
If only some of the variables are to be renamed then we can use rename( ) function where the new names are passed in the form of a dictionary.
#Renaming only some of the variables.
data.rename(columns = {"Names":"Cust_Name"},inplace = True)
Output
Cust_Name Zodiac Signs
0 John Libra
1 Mary Capricorn
2 Julia Aries
3 Kenny Scorpio
4 Henry Aquarius
By default inplace = False in pandas, which means that no changes are made in the original dataset. Thus, if we wish to alter the original dataset, we need to set inplace = True.
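For instance, without inplace = True the result has to be captured in a new variable (new_data is an illustrative name):
new_data = data.rename(columns = {"Cust_Name" : "Name"}) #returns a renamed copy; 'data' itself is unchanged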
Suppose we want to replace a particular character in the column names; then we can use the str.replace( ) function. For example, to rename the variables containing "Y" to "Year":
mydata.columns = mydata.columns.str.replace('Y' , 'Year ')
mydata.columns
Output
Index(['Index', 'State', 'Year 2002', 'Year 2003', 'Year 2004', 'Year 2005',
'Year 2006', 'Year 2007', 'Year 2008', 'Year 2009', 'Year 2010',
'Year 2011', 'Year 2012', 'Year 2013', 'Year 2014', 'Year 2015'],
dtype='object')
Setting Column as Index
Using set_index("column name") we can set that column as the index, and it is removed from the regular columns.
mydata.set_index("Index",inplace = True)
mydata.head()
#Note that the index has changed and 'Index' is no longer a column
mydata.columns
mydata.reset_index(inplace = True)
mydata.head()
reset_index( ) restores the default integer index.
Removing columns and rows
To drop a column we use drop( ), where the first argument is a list of columns (or rows) to be removed. By default axis = 0, which means the operation works row-wise. To remove a column we need to set axis = 1.
mydata.drop('Index',axis = 1)
#Alternatively
mydata.drop("Index",axis = "columns")
mydata.drop(['Index','State'],axis = 1)
mydata.drop(0,axis = 0)
mydata.drop(0,axis = "index")
mydata.drop([0,1,2,3],axis = 0)
Also inplace = False by default thus no alterations are made in the original dataset. axis = “columns” and axis = “index” means the column and row(index) should be removed respectively.
Sorting Data
To sort the data sort_values( ) function is deployed. By default inplace = False and ascending = True.
mydata.sort_values("State",ascending = False)
mydata.sort_values("State",ascending = False,inplace = True)
mydata.Y2006.sort_values()
We have duplicate values in Index, thus we sort the dataframe first by Index and then, within each Index, by Y2002.
mydata.sort_values(["Index","Y2002"])
Create new variables
New columns can be created either with direct arithmetic on existing columns or with eval( ), which evaluates an arithmetic expression passed as a string.
mydata["difference"] = mydata.Y2008-mydata.Y2009
#Alternatively
mydata["difference2"] = mydata.eval("Y2008 - Y2009")
mydata.head()
Output
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 \
0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
4 C California 1685349 1675807 1889570 1480280 1735069 1812546
Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015 \
0 1945229.0 1944173 1237582 1440756 1186741 1852841 1558906 1916661
1 1551826.0 1436541 1629616 1230866 1512804 1985302 1580394 1979143
2 1752886.0 1554330 1300521 1130709 1907284 1363279 1525866 1647724
3 1188104.0 1628980 1669295 1928238 1216675 1591896 1360959 1329341
4 1487315.0 1663809 1624509 1639670 1921845 1156536 1388461 1644607
difference difference2
0 1056.0 1056.0
1 115285.0 115285.0
2 198556.0 198556.0
3 -440876.0 -440876.0
4 -176494.0 -176494.0
mydata.ratio = mydata.Y2008/mydata.Y2009
The above command does not work because dot notation cannot create a new column; to create new columns we need to use square brackets.
We can also use the assign( ) function, but it does not change the original data as there is no inplace parameter. Hence we need to save the result in a new dataset.
data = mydata.assign(ratio = (mydata.Y2008 / mydata.Y2009))
data.head()
Calculating Descriptive Statistics
describe( ) is used to compute summary statistics like mean, minimum, quartiles etc. for numeric variables.
mydata.describe() #for numeric variables
Output
Y2002 Y2003 Y2004 Y2005 Y2006 \
count 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01
mean 1.566034e+06 1.509193e+06 1.540555e+06 1.522064e+06 1.530969e+06
std 2.464425e+05 2.641092e+05 2.813872e+05 2.671748e+05 2.505603e+05
min 1.111437e+06 1.110625e+06 1.118631e+06 1.122030e+06 1.102568e+06
25% 1.374180e+06 1.292390e+06 1.268292e+06 1.267340e+06 1.337236e+06
50% 1.584734e+06 1.485909e+06 1.522230e+06 1.480280e+06 1.531641e+06
75% 1.776054e+06 1.686698e+06 1.808109e+06 1.778170e+06 1.732259e+06
max 1.983285e+06 1.994927e+06 1.979395e+06 1.990062e+06 1.985692e+06
Y2007 Y2008 Y2009 Y2010 Y2011 \
count 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01
mean 1.553219e+06 1.538398e+06 1.658519e+06 1.504108e+06 1.574968e+06
std 2.539575e+05 2.958132e+05 2.361854e+05 2.400771e+05 2.657216e+05
min 1.109382e+06 1.112765e+06 1.116168e+06 1.103794e+06 1.116203e+06
25% 1.322419e+06 1.254244e+06 1.553958e+06 1.328439e+06 1.371730e+06
50% 1.563062e+06 1.545621e+06 1.658551e+06 1.498662e+06 1.575533e+06
75% 1.780589e+06 1.779538e+06 1.857746e+06 1.639186e+06 1.807766e+06
max 1.983568e+06 1.990431e+06 1.993136e+06 1.999102e+06 1.992996e+06
Y2012 Y2013 Y2014 Y2015
count 5.100000e+01 5.100000e+01 5.100000e+01 5.100000e+01
mean 1.591135e+06 1.530078e+06 1.583360e+06 1.588297e+06
std 2.837675e+05 2.827299e+05 2.601554e+05 2.743807e+05
min 1.108281e+06 1.100990e+06 1.110394e+06 1.110655e+06
25% 1.360654e+06 1.285738e+06 1.385703e+06 1.372523e+06
50% 1.643855e+06 1.531212e+06 1.580394e+06 1.627508e+06
75% 1.866322e+06 1.725377e+06 1.791594e+06 1.848316e+06
max 1.988270e+06 1.994022e+06 1.990412e+06 1.996005e+06
For character or string variables, you can write include = ['object']. It will return the total count, the most frequent string and its frequency.
mydata.describe(include = ['object']) #Only for strings / objects
To find specific descriptive statistics for each column of the data frame:
mydata.mean()
mydata.median()
mydata.agg(["mean","median"])
agg( ) performs aggregation with summary functions like sum, mean, median, min, max etc.
How to run functions for a particular column(s)?
mydata.Y2008.mean()
mydata.Y2008.median()
mydata.Y2008.min()
mydata.loc[:,["Y2002","Y2008"]].max()
GroupBy function
To group the data by a categorical variable we use the groupby( ) function, which lets us perform operations on each category.
mydata.groupby("Index")[["Y2002","Y2003"]].min()
Output
Y2002 Y2003
Index
A 1170302 1317711
C 1343824 1232844
D 1111437 1268673
F 1964626 1468852
G 1929009 1541565
H 1461570 1200280
I 1353210 1438538
K 1509054 1290700
L 1584734 1110625
M 1221316 1149931
N 1395149 1114500
O 1173918 1334639
P 1320191 1446723
R 1501744 1942942
S 1159037 1150689
T 1520591 1310777
U 1771096 1195861
V 1134317 1163996
W 1677347 1380662
To run multiple summary functions, we can use the agg( ) function, which aggregates the data.
mydata.groupby("Index")[["Y2002","Y2003"]].agg(["min","max","mean"])
The following command finds the minimum and maximum values for Y2002 and only the mean for Y2003:
mydata.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
Output
Y2002 Y2003
min max mean
Index
A 1170302 1742027 1810289.000
C 1343824 1685349 1595708.000
D 1111437 1330403 1631207.000
F 1964626 1964626 1468852.000
G 1929009 1929009 1541565.000
H 1461570 1461570 1200280.000
I 1353210 1776918 1536164.500
K 1509054 1813878 1369773.000
L 1584734 1584734 1110625.000
M 1221316 1983285 1535717.625
N 1395149 1885081 1382499.625
O 1173918 1802132 1569934.000
P 1320191 1320191 1446723.000
R 1501744 1501744 1942942.000
S 1159037 1631522 1477072.000
T 1520591 1811867 1398343.000
U 1771096 1771096 1195861.000
V 1134317 1146902 1498122.500
W 1677347 1977749 1521118.500
In order to rename the columns after groupby, you can pass tuples of (new_name, function). See the code below.
mydata.groupby("Index").agg({"Y2002" : [("Y2002_min","min"),("Y2002_max","max")], "Y2003" : [("Y2003_mean","mean")]})
Renaming columns can also be done via the method below.
dt = mydata.groupby("Index").agg({"Y2002": ["min","max"],"Y2003" : "mean"})
dt.columns = ['Y2002_min', 'Y2002_max', 'Y2003_mean']
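In pandas 0.25 and later, named aggregation gives the same result in one step: keyword arguments map new column names to (column, function) pairs.
dt = mydata.groupby("Index").agg(Y2002_min = ("Y2002", "min"),
                                 Y2002_max = ("Y2002", "max"),
                                 Y2003_mean = ("Y2003", "mean"))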
Groupby more than 1 column
mydata.groupby(["Index", "State"]).agg({"Y2002": ["min","max"],"Y2003" : "mean"})
By default, as_index=True is enabled in groupby, which means the columns you group by become the index of the new dataframe. To disable this, set as_index=False, which keeps the grouping variables as ordinary columns in the new dataframe.
dt = mydata.groupby(["Index","State"], as_index=False)[["Y2002","Y2003"]].min()
Filtering
To filter only those rows which have Index as "A", we write:
mydata[mydata.Index == "A"]
#Alternatively
mydata.loc[mydata.Index == "A",:]
Output
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 \
0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
To select the States having Index as "A":
mydata.loc[mydata.Index == "A","State"]
mydata.loc[mydata.Index == "A",:].State
To filter the rows with Index as "A" and Y2002 greater than 1500000:
mydata.loc[(mydata.Index == "A") & (mydata.Y2002 > 1500000),:]
To filter the rows with Index either "A" or "W", we can use the isin( ) function:
mydata.loc[(mydata.Index == "A") | (mydata.Index == "W"),:]
#Alternatively.
mydata.loc[mydata.Index.isin(["A","W"]),:]
Output
Index State Y2002 Y2003 Y2004 Y2005 Y2006 Y2007 \
0 A Alabama 1296530 1317711 1118631 1492583 1107408 1440134
1 A Alaska 1170302 1960378 1818085 1447852 1861639 1465841
2 A Arizona 1742027 1968140 1377583 1782199 1102568 1109382
3 A Arkansas 1485531 1994927 1119299 1947979 1669191 1801213
47 W Washington 1977749 1687136 1199490 1163092 1334864 1621989
48 W West Virginia 1677347 1380662 1176100 1888948 1922085 1740826
49 W Wisconsin 1788920 1518578 1289663 1436888 1251678 1721874
50 W Wyoming 1775190 1498098 1198212 1881688 1750527 1523124
Y2008 Y2009 Y2010 Y2011 Y2012 Y2013 Y2014 Y2015
0 1945229 1944173 1237582 1440756 1186741 1852841 1558906 1916661
1 1551826 1436541 1629616 1230866 1512804 1985302 1580394 1979143
2 1752886 1554330 1300521 1130709 1907284 1363279 1525866 1647724
3 1188104 1628980 1669295 1928238 1216675 1591896 1360959 1329341
47 1545621 1555554 1179331 1150089 1775787 1273834 1387428 1377341
48 1238174 1539322 1539603 1872519 1462137 1683127 1204344 1198791
49 1980167 1901394 1648755 1940943 1729177 1510119 1701650 1846238
50 1587602 1504455 1282142 1881814 1673668 1994022 1204029 1853858
Alternatively, we can use the query( ) function, which eliminates the need to specify the data frame while mentioning column(s) and lets you write the filtering criteria as a string:
mydata.query('Y2002>1700000 & Y2003 > 1500000')
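query( ) can also reference Python variables by prefixing them with @ (threshold below is an illustrative name):
threshold = 1700000
mydata.query('Y2002 > @threshold & Y2003 > 1500000')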
Dealing with missing values
We create a new dataframe named 'crops'. To create a NaN value we use np.nan, which requires importing numpy.
import numpy as np
mydata = {'Crop': ['Rice', 'Wheat', 'Barley', 'Maize'],
          'Yield': [1010, 1025.2, 1404.2, 1251.7],
          'cost' : [102, np.nan, 20, 68]}
crops = pd.DataFrame(mydata)
crops
isnull( ) returns True and notnull( ) returns False if the value is NaN.
crops.isnull() #same as is.na in R
crops.notnull() #opposite of previous command.
crops.isnull().sum() #No. of missing values.
crops.cost.isnull() first subsets 'cost' from the dataframe and then returns a logical vector with isnull().
crops[crops.cost.isnull()] #shows the rows with NAs.
crops[crops.cost.isnull()].Crop #shows the Crop values of the rows with NAs in cost
crops[crops.cost.notnull()].Crop #shows the Crop values of the rows without NAs in cost
To drop all the rows which have a missing value in any column, we use dropna(how = "any"). By default inplace = False. how = "all" drops a row only if all the elements in that row are missing.
crops.dropna(how = "any").shape
crops.dropna(how = "all").shape
To remove rows where any of 'Yield' or 'cost' is missing, we use the subset parameter and pass a list:
crops.dropna(subset = ['Yield',"cost"],how = 'any').shape
crops.dropna(subset = ['Yield',"cost"],how = 'all').shape
To replace the missing values in the cost column with "UNKNOWN":
crops['cost'].fillna(value = "UNKNOWN",inplace = True)
crops
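A numeric fill is often preferable to a string; for example, the column mean can be imputed instead (a sketch that assumes cost has not already been filled with "UNKNOWN"):
crops['cost'] = crops['cost'].fillna(crops['cost'].mean()) #impute missing costs with the column mean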
Dealing with duplicates
We create a new dataframe comprising items and their respective prices.
data = pd.DataFrame({"Items" : ["TV","Washing Machine","Mobile","TV","TV","Washing Machine"], "Price" : [10000,50000,20000,10000,10000,40000]})
data
Output
Items Price
0 TV 10000
1 Washing Machine 50000
2 Mobile 20000
3 TV 10000
4 TV 10000
5 Washing Machine 40000
duplicated() returns a logical vector which is True whenever a duplicated row is encountered.
data.loc[data.duplicated(),:]
data.loc[data.duplicated(keep = "first"),:]
By default keep = "first", i.e. the first occurrence is considered a unique value and its repetitions are considered duplicates. If keep = "last", the last occurrence is considered a unique value and all its earlier repetitions are considered duplicates.
data.loc[data.duplicated(keep = "last"),:] #last occurrences are treated as unique, so different indices are shown
If keep = False, then all the occurrences of the repeated observations are considered duplicates.
data.loc[data.duplicated(keep = False),:] #shows all the rows that are part of any duplicate group
To drop the duplicates, drop_duplicates( ) is used, with inplace = False by default; keep = 'first', 'last' or False carries the same meaning as in duplicated( ).
data.drop_duplicates(keep = "first")
data.drop_duplicates(keep = "last")
data.drop_duplicates(keep = False,inplace = True) #by default inplace = False
data
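drop_duplicates( ) also accepts a subset parameter for checking duplicates in selected columns only; an optional extra:
data.drop_duplicates(subset = ["Items"], keep = "first") #deduplicate on the Items column alone, ignoring Price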
Creating dummies
Let’s explore the iris dataset, consisting of 150 rows and 5 columns, providing details on 3 species of flowers.
iris = pd.read_csv("https://github.com/deepanshu88/Datasets/raw/master/UploadedFiles/iris.csv")
iris.head()
Output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
The map( ) function matches the values and replaces them in the new series that is automatically created.
iris["setosa"] = iris.Species.map({"setosa" : 1,"versicolor":0, "virginica" : 0})
iris.head()
To create dummies, get_dummies( ) is used. prefix = "Species" adds the prefix "Species_" to the names of the new columns.
pd.get_dummies(iris.Species,prefix = "Species")
pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:1] #only the column at position 0; position 1 is not included
species_dummies = pd.get_dummies(iris.Species,prefix = "Species").iloc[:,0:]
With the concat( ) function we can join multiple series or dataframes. axis = 1 denotes that they should be joined column-wise.
iris = pd.concat([iris,species_dummies],axis = 1)
iris.head()
Output
Sepal.Length Sepal.Width Petal.Length Petal.Width Species \
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
Species_setosa Species_versicolor Species_virginica
0 1 0 0
1 1 0 0
2 1 0 0
3 1 0 0
4 1 0 0
It is usual that for a variable with n categories we create n-1 dummies; to drop the first dummy column we pass drop_first = True.
pd.get_dummies(iris,columns = ["Species"],drop_first = True).head()
Ranking
To create a dataframe of all the ranks we use rank( )
iris.rank()
Ranking by a variable
Suppose we want to rank the Sepal.Length for different species in ascending order:
iris['Rank2'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(ascending=1)
iris.head()
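rank( ) also takes a method parameter that controls how ties are handled; 'average' (the default), 'min', 'max', 'first' and 'dense' are all valid. A sketch, with Rank3 as an illustrative column name:
iris['Rank3'] = iris['Sepal.Length'].groupby(iris["Species"]).rank(method = "dense", ascending = 1)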
Calculating the Cumulative sum
Using cumsum( ) function we can obtain the cumulative sum.
iris['cum_sum'] = iris["Sepal.Length"].cumsum()
iris.head()
Cumulative sum by a variable
To find the cumulative sum of sepal lengths for each species, we use groupby( ) and then cumsum( ).
iris["cumsum2"] = iris.groupby(["Species"])["Sepal.Length"].cumsum()
iris.head()
Calculating Percentiles
Various quantiles can be obtained by using quantile( )
iris.quantile(0.5)
iris.quantile([0.1,0.2,0.5])
iris.quantile(0.55)
if else in Python
We create a new dataframe of students' names and their respective zodiac signs.
students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
def name(row):
    if row["Names"] in ["John","Henry"]:
        return "yes"
    else:
        return "no"
students['flag'] = students.apply(name, axis=1)
students
Functions in Python are defined using the def keyword, followed by the function's name. The apply( ) function applies a function along the rows or columns of a dataframe.
Note: When using a plain if-else, we need to take care of the indentation. Python uses indentation rather than curly braces for loops and if-else blocks.
Output
Names Zodiac Signs flag
0 John Aquarius yes
1 Mary Libra no
2 Henry Gemini yes
3 Augustus Pisces no
4 Kenny Virgo no
Alternatively, by importing numpy we can use np.where. The first argument is the condition to be evaluated, the 2nd argument is the value if the condition is True and the last argument is the value if the condition evaluates to False.
import numpy as np
students['flag'] = np.where(students['Names'].isin(['John','Henry']), 'yes', 'no')
students
Multiple Conditions : If Else-if Else
def mname(row):
    if row["Names"] == "John" and row["Zodiac Signs"] == "Aquarius":
        return "yellow"
    elif row["Names"] == "Mary" and row["Zodiac Signs"] == "Libra":
        return "blue"
    elif row["Zodiac Signs"] == "Pisces":
        return "blue"
    else:
        return "black"
students['color'] = students.apply(mname, axis=1)
students
We create a list of conditions and their respective values if evaluated True, and use np.select, where default defines the value if all the conditions are False.
conditions = [
(students['Names'] == 'John') & (students['Zodiac Signs'] == 'Aquarius'),
(students['Names'] == 'Mary') & (students['Zodiac Signs'] == 'Libra'),
(students['Zodiac Signs'] == 'Pisces')]
choices = ['yellow', 'blue', 'purple']
students['color'] = np.select(conditions, choices, default='black')
students
Output
Names Zodiac Signs flag color
0 John Aquarius yes yellow
1 Mary Libra no blue
2 Henry Gemini yes black
3 Augustus Pisces no purple
4 Kenny Virgo no black
Select numeric or categorical columns only
To include numeric columns we use select_dtypes( )
data1 = iris.select_dtypes(include=[np.number])
data1.head()
The _get_numeric_data method also provides a utility to select the numeric columns only, though note it is an internal (underscore-prefixed) method.
data3 = iris._get_numeric_data()
data3.head(3)
Output
Sepal.Length Sepal.Width Petal.Length Petal.Width cum_sum cumsum2
0 5.1 3.5 1.4 0.2 5.1 5.1
1 4.9 3.0 1.4 0.2 10.0 10.0
2 4.7 3.2 1.3 0.2 14.7 14.7
For selecting categorical variables
data4 = iris.select_dtypes(include = ['object'])
data4.head(2)
Species
0 setosa
1 setosa
Combine Dataframes
We create 2 dataframes containing the details of the students:
students = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
'Zodiac Signs': ['Aquarius','Libra','Gemini','Pisces','Virgo']})
students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
'Marks' : [50,81,98,25,35]})
Using the pd.concat( ) function we can join the 2 dataframes:
data = pd.concat([students,students2]) #by default axis = 0
Output
Marks Names Zodiac Signs
0 NaN John Aquarius
1 NaN Mary Libra
2 NaN Henry Gemini
3 NaN Augustus Pisces
4 NaN Kenny Virgo
0 50.0 John NaN
1 81.0 Mary NaN
2 98.0 Henry NaN
3 25.0 Augustus NaN
4 35.0 Kenny NaN
By default axis = 0, thus the new dataframe is appended row-wise. If a column is not present in one of the dataframes, NaNs are created. To join column-wise we set axis = 1.
data = pd.concat([students,students2],axis = 1)
data
Names Zodiac Signs Marks Names
0 John Aquarius 50 John
1 Mary Libra 81 Mary
2 Henry Gemini 98 Henry
3 Augustus Pisces 25 Augustus
4 Kenny Virgo 35 Kenny
Using the append function we can join the dataframes row-wise. Note that DataFrame.append( ) was deprecated in pandas 1.4 and removed in pandas 2.0, so it only works on older versions.
students.append(students2) #for rows; pandas < 2.0 only
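On pandas 2.0 and later, the equivalent with pd.concat is:
pd.concat([students, students2]) #same result as append; add ignore_index = True for a fresh 0..n-1 index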
Alternatively, we can create a dictionary of the two data frames and use pd.concat to join them row-wise; the dictionary keys label each block in the resulting index:
classes = {'x': students, 'y': students2}
result = pd.concat(classes)
result
Output
Marks Names Zodiac Signs
x 0 NaN John Aquarius
1 NaN Mary Libra
2 NaN Henry Gemini
3 NaN Augustus Pisces
4 NaN Kenny Virgo
y 0 50.0 John NaN
1 81.0 Mary NaN
2 98.0 Henry NaN
3 25.0 Augustus NaN
4 35.0 Kenny NaN
Merging or Joining Dataframes
We take 2 dataframes with different numbers of observations:
students = pd.DataFrame({'Names': ['John','Mary','Henry','Maria'],
                         'Zodiac Signs': ['Aquarius','Libra','Gemini','Capricorn']})
students2 = pd.DataFrame({'Names': ['John','Mary','Henry','Augustus','Kenny'],
                          'Marks' : [50,81,98,25,35]})
Using pd.merge we can join the two dataframes. on = 'Names' specifies that 'Names' is the common variable on which the dataframes are to be combined.
result = pd.merge(students, students2, on='Names') #inner join: takes only the intersection
result
Output
Names Zodiac Signs Marks
0 John Aquarius 50
1 Mary Libra 81
2 Henry Gemini 98
By default how = "inner", thus it keeps only the elements common to both dataframes. If you want all the elements from both dataframes, set how = "outer".
result = pd.merge(students, students2, on='Names',how = "outer") #takes the union of both dataframes
result
Output
Names Zodiac Signs Marks
0 John Aquarius 50.0
1 Mary Libra 81.0
2 Henry Gemini 98.0
3 Maria Capricorn NaN
4 Augustus NaN 25.0
5 Kenny NaN 35.0
To keep all the rows from the left dataframe (and only the matching ones from the right), set how = 'left'.
result = pd.merge(students, students2, on='Names',how = "left")
result
Output
Names Zodiac Signs Marks
0 John Aquarius 50.0
1 Mary Libra 81.0
2 Henry Gemini 98.0
3 Maria Capricorn NaN
Similarly, how = 'right' keeps all the rows from the right dataframe and only the matching ones from the left. The indicator = True argument adds a _merge column showing the source of each row ('left_only', 'right_only' or 'both').
result = pd.merge(students, students2, on='Names',how = "right",indicator = True)
result
Output
Names Zodiac Signs Marks _merge
0 John Aquarius 50 both
1 Mary Libra 81 both
2 Henry Gemini 98 both
3 Augustus NaN 25 right_only
4 Kenny NaN 35 right_only