Background
In the last post, we saw the basics of using the pandas library in Python, which is used for data analysis. We saw two basic data structures supported by pandas:
- Series
- DataFrame
Filtering a data frame in Pandas
loc & iloc methods
To recap, a data frame is a two-dimensional data structure consisting of rows and columns, so we need a way to filter rows and columns efficiently. The two main tools a data frame exposes for this are loc and iloc, which are used with square brackets rather than called like functions:
- loc - selects by row and column labels
- iloc - selects by integer row and column positions (starting at 0)
For example:
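import pandas as pd
import numpy as np

# 4x4 frame of random integers from 1 to 9 with labelled rows and columns
df = pd.DataFrame(np.random.randint(1, 10, (4, 4)),
                  index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])

# loc: rows "a" and "b", columns "colA" and "colC", selected by label
print(df.loc[["a", "b"], ["colA", "colC"]])

# iloc: first two rows and first three columns, selected by position
print(df.iloc[:2, :3])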
Output:
   colA  colC
a     4     7
b     1     4
   colA  colB  colC
a     4     9     7
b     1     1     4
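One difference between the two that is easy to miss shows up when slicing: a loc slice includes its end label, while an iloc slice excludes its stop position, just like a Python list. The sketch below uses fixed values (chosen here for illustration, not taken from the example above) so that the result is reproducible.

```python
import pandas as pd

# Fixed illustrative values so the result does not depend on random numbers.
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
    index=["a", "b", "c", "d"],
    columns=["colA", "colB", "colC"],
)

# Label slice: both endpoints ("a" and "c", "colA" and "colB") are included.
by_label = df.loc["a":"c", "colA":"colB"]

# Position slice: the stop position (2) is excluded.
by_position = df.iloc[0:2, 0:2]

print(by_label)      # rows a, b, c
print(by_position)   # rows a, b
```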
Selecting subset of columns
We can get a Series (a single column of data) from the data frame using df["column_name"]. Similarly, we can get a new data frame with a subset of columns by passing a list of the column names needed. For example:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)),
                  index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])

# Passing a list of column names returns a DataFrame with only those columns
print(df[["colA", "colC"]])
print(type(df[["colA", "colC"]]))
Output:
   colA  colC
a     5     2
b     7     2
c     4     3
d     9     1
<class 'pandas.core.frame.DataFrame'>
As you can see from the output, we selected two columns, colA and colC, and the result is a new DataFrame object.
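The difference between a single column name and a list of names matters even when the list has only one entry. A minimal sketch, reusing the values from the output above as fixed data:

```python
import pandas as pd

df = pd.DataFrame(
    {"colA": [5, 7, 4, 9], "colC": [2, 2, 3, 1]},
    index=["a", "b", "c", "d"],
)

single = df["colA"]       # a column name gives a Series
one_col = df[["colA"]]    # a one-element list gives a DataFrame

print(type(single).__name__)
print(type(one_col).__name__)
print(one_col.shape)
```

So df["colA"] is the form to use when you want the column's values, and df[["colA"]] when the result should stay two-dimensional.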
Filtering by condition
You can also filter a data frame by conditions. Consider the following example:
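import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)),
                  index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])

# Keep only rows where colA is at least 1
# (every value here is between 1 and 9, so all rows pass)
print(df[df["colA"] >= 1])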
Output:
   colA  colB  colC  colD
a     3     9     5     6
b     8     5     9     6
c     9     4     1     4
d     8     4     3     5
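It can help to look at the condition on its own: the comparison produces a boolean Series with one True or False per row, and indexing the data frame with it keeps the True rows. A small sketch with fixed values taken from the output above and a stricter threshold (> 5, chosen here for illustration) so that some rows are actually dropped:

```python
import pandas as pd

df = pd.DataFrame(
    {"colA": [3, 8, 9, 8], "colB": [9, 5, 4, 4]},
    index=["a", "b", "c", "d"],
)

# The comparison alone yields one boolean per row.
mask = df["colA"] > 5
print(mask)

# Indexing with the mask keeps only the rows where it is True.
filtered = df[mask]
print(filtered)
```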
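Multiple conditions can be combined with | (or). Each condition must be wrapped in parentheses:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)),
                  index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])

# Keep rows where colA is at least 1 OR colB is at most 5
print(df[(df["colA"] >= 1) | (df["colB"] <= 5)])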
The following comparison operators can be used in a condition:
- ==: equal
- !=: not equal
- >: greater than
- >=: greater than or equal to
- <: less than
- <=: less than or equal to
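These operators can be mixed freely, and conditions can also be joined with & (and) or inverted with ~ (not). The parentheses are needed because & and | bind more tightly than the comparison operators in Python. A minimal sketch on fixed, illustrative data:

```python
import pandas as pd

df = pd.DataFrame(
    {"colA": [3, 8, 9, 8], "colB": [9, 5, 4, 4]},
    index=["a", "b", "c", "d"],
)

# AND: colA equals 8 and colB is not 4.
both = df[(df["colA"] == 8) & (df["colB"] != 4)]

# NOT: every row where colA is not 8.
not_eight = df[~(df["colA"] == 8)]

print(both)        # row b
print(not_eight)   # rows a and c
```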
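To keep rows whose value matches any entry in a list of values, use isin:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)),
                  index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])

# Keep rows where colA is 1, 2, or 3
print(df[df["colA"].isin([1, 2, 3])])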
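Because the values above are random, the result of that filter changes from run to run and may even be empty. The sketch below uses fixed, illustrative data to show what isin keeps, and how ~ turns it into a "not in" filter:

```python
import pandas as pd

df = pd.DataFrame(
    {"colA": [3, 8, 9, 8], "colB": [9, 5, 4, 4]},
    index=["a", "b", "c", "d"],
)

wanted = [3, 9]  # hypothetical list of values to match

in_set = df[df["colA"].isin(wanted)]        # rows whose colA is 3 or 9
not_in_set = df[~df["colA"].isin(wanted)]   # all remaining rows

print(in_set)       # rows a and c
print(not_in_set)   # rows b and d
```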