In this tutorial, we will cover how to remove one or more columns from a pandas dataframe. Pandas is a Python package that provides many functions for data analysis.
Syntax to Drop Columns
import pandas as pd
new_df = df.drop(['column_name1','column_name2'], axis=1)
In pandas, the drop() function is used to remove column(s) from a dataframe. It returns a new dataframe and leaves the original unchanged. axis=1 tells pandas to apply the function to columns instead of rows.
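As a quick check of how drop() behaves, here is a minimal sketch on a small made-up dataframe (the column names price, qty, and note are illustrative and not part of this tutorial's data). It suggests that drop() returns a new dataframe without modifying the original, and that the columns= keyword can be used as an equivalent to axis=1.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"price": [10, 20], "qty": [1, 2], "note": ["a", "b"]})

# axis=1 form: returns a new dataframe without the 'note' column
new_df = df.drop(["note"], axis=1)

# Equivalent form using the columns= keyword instead of axis=1
same_df = df.drop(columns=["note"])

print(list(new_df.columns))    # ['price', 'qty']
print(list(df.columns))        # ['price', 'qty', 'note'] (original unchanged)
print(new_df.equals(same_df))  # True
```

Which spelling to use is mostly a matter of taste; columns= makes the intent explicit without having to remember what axis=1 means.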
Drop Columns from Pandas Dataframe in Python
Let’s create a sample dataframe to use in the examples in this tutorial. The code below creates a dataframe with 6 rows of random numbers and 4 columns named ‘A’ through ‘D’.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
The following code removes column ‘A’ from the dataframe named ‘df’ and stores the result in a new dataframe named ‘newdf’.
newdf = df.drop(['A'], axis=1)
          B         C         D
0 -1.656038  1.655995 -1.413243
1  0.710933 -1.335381  0.832619
2 -0.411327  0.098119  0.768447
3 -0.093217  1.077528  0.196891
4  0.302687  0.125881 -0.665159
5 -0.692847 -1.463154 -0.707779
#Check columns in newdf after dropping column A
newdf.columns
# Output
# Index(['B', 'C', 'D'], dtype='object')
Remove Multiple Columns in Python
You can list all the columns you want to remove and pass that list to the drop() function.
Method I
df2 = df.drop(['B','C'], axis=1)
Method II
cols = ['B','C']
df2 = df.drop(cols, axis=1)
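One thing to watch for when passing a list of names: drop() raises a KeyError if any name in the list is not a column. The sketch below illustrates this with a deliberately wrong name ('E', which is an assumption added for the example) and shows the errors='ignore' option, which skips names that do not exist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))

cols = ["B", "C", "E"]  # 'E' is deliberately NOT a column of df

# Default behaviour: a missing label raises KeyError
try:
    df.drop(cols, axis=1)
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)

# errors='ignore' drops the labels that exist and skips the rest
df2 = df.drop(cols, axis=1, errors="ignore")
print(list(df2.columns))  # ['A', 'D']
```

Silently ignoring missing names can hide typos, so it is probably best reserved for cases where the column list is built dynamically.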
Dropping Columns by Column Position
You can find the name of the first column with df.columns[0]. Indexing in Python starts from 0.
df.drop(df.columns[0], axis=1)
To drop multiple columns by position (for example, the first and third columns), specify the positions in a list such as [0,2].
cols = [0,2]
df.drop(df.columns[cols], axis=1)
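The approach above translates positions into names and then drops by name. As an alternative sketch (not the only way to do it), you can select the positions you want to keep with iloc. The variable names drop_positions and keep_positions are illustrative. The two approaches give the same result here, but they can differ when column names are duplicated, because drop() removes every column that carries a given name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(6, 4), columns=list("ABCD"))

drop_positions = [0, 2]  # first and third columns

# Approach from the text: look up names by position, then drop by name
by_name = df.drop(df.columns[drop_positions], axis=1)

# Alternative: keep every position that is not in drop_positions
keep_positions = [i for i in range(df.shape[1]) if i not in drop_positions]
by_position = df.iloc[:, keep_positions]

print(list(by_name.columns))        # ['B', 'D']
print(by_name.equals(by_position))  # True
```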
Dropping Columns by Name Pattern
df = pd.DataFrame({"X1":range(1,6),"X_2":range(2,7),"YX":range(3,8),"Y_1":range(2,7),"Z":range(5,10)})
   X1  X_2  YX  Y_1  Z
0   1    2   3    2  5
1   2    3   4    3  6
2   3    4   5    4  7
3   4    5   6    5  8
4   5    6   7    6  9
Dropping Columns Starting with ‘X’
df.loc[:,~df.columns.str.contains('^X')]
How does it work?
- ^X is a regular expression that matches names beginning with the letter ‘X’.
- df.columns.str.contains('^X') returns the array [True, True, False, False, False]: True where the column name matches the pattern, False otherwise.
- The ~ sign negates the condition.
- df.loc[ ] is used to select columns.
It can also be written as:
df.drop(df.columns[df.columns.str.contains('^X')], axis=1)
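To make the steps concrete, the sketch below prints the boolean mask and its negation, then checks that the loc form and the drop form select the same columns. The variable names (mask, via_loc, via_drop, via_startswith) are illustrative. It also shows str.startswith as a possible alternative when no regular expression is needed.

```python
import pandas as pd

df = pd.DataFrame({"X1": range(1, 6), "X_2": range(2, 7), "YX": range(3, 8),
                   "Y_1": range(2, 7), "Z": range(5, 10)})

# Step 1: True for every column name that begins with 'X'
mask = df.columns.str.contains("^X")
print(mask.tolist())     # [True, True, False, False, False]

# Step 2: ~ flips the mask, so True now marks the columns to keep
print((~mask).tolist())  # [False, False, True, True, True]

# Step 3: select the kept columns, or equivalently drop the matched ones
via_loc = df.loc[:, ~mask]
via_drop = df.drop(df.columns[mask], axis=1)

# Alternative without regex: a plain prefix test
via_startswith = df.loc[:, ~df.columns.str.startswith("X")]

print(list(via_loc.columns))  # ['YX', 'Y_1', 'Z']
```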
Other Examples
#Removing columns whose name contains string 'X'
df.loc[:,~df.columns.str.contains('X')]
#Removing columns whose name contains either string 'X' or string 'Y'
df.loc[:,~df.columns.str.contains('X|Y')]
#Removing columns whose name ends with string 'X'
df.loc[:,~df.columns.str.contains('X$')]
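To see what each of these patterns leaves behind, here is a small sketch that wraps the expression in a helper and prints the remaining column names. The helper name drop_matching is hypothetical and is introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"X1": range(1, 6), "X_2": range(2, 7), "YX": range(3, 8),
                   "Y_1": range(2, 7), "Z": range(5, 10)})

def drop_matching(frame, pattern):
    """Return frame without the columns whose name matches the regex pattern."""
    return frame.loc[:, ~frame.columns.str.contains(pattern)]

# Name contains 'X' anywhere
print(list(drop_matching(df, "X").columns))    # ['Y_1', 'Z']

# Name contains 'X' or 'Y'
print(list(drop_matching(df, "X|Y").columns))  # ['Z']

# Name ends with 'X'
print(list(drop_matching(df, "X$").columns))   # ['X1', 'X_2', 'Y_1', 'Z']
```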
Dropping Columns with Missing Values Greater than 50%
df = pd.DataFrame({'A':[1,3,np.nan,5,np.nan],
'B':[4,np.nan,np.nan,5,np.nan]
})
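The proportion of missing values in each column can be calculated as the mean of df.isnull(), because True counts as 1 and False as 0. In this example, column ‘A’ has 40% missing values and column ‘B’ has 60%, so only ‘B’ is dropped.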
cols = df.columns[df.isnull().mean()>0.5]
df.drop(cols, axis=1)
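The same logic can be wrapped in a reusable function. The sketch below is one possible way to do it: the function name drop_sparse_columns and its threshold parameter are assumptions made for this example. It also shows a dropna(thresh=...) variant, which keeps columns that have at least a given number of non-missing values and should give the same result for a 50% cutoff.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1, 3, np.nan, 5, np.nan],
                   "B": [4, np.nan, np.nan, 5, np.nan]})

# Share of missing values per column (mean of the True/False mask)
missing_share = df.isnull().mean()
print(missing_share)

def drop_sparse_columns(frame, threshold=0.5):
    """Drop columns whose share of missing values exceeds threshold."""
    cols = frame.columns[frame.isnull().mean() > threshold]
    return frame.drop(cols, axis=1)

result = drop_sparse_columns(df)
print(list(result.columns))  # ['A']

# Variant with dropna: keep columns with at least this many non-missing values
min_non_null = math.ceil(0.5 * len(df))
alt = df.dropna(axis=1, thresh=min_non_null)
print(list(alt.columns))     # ['A']
```

Raising or lowering threshold changes how aggressive the cleanup is; for example, threshold=0.3 would drop both columns in this sample.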