Open Source For Geeks: Filtering a DataFrame in Pandas

Sunday, 2 February 2025

Filtering a DataFrame in Pandas

Background

In the last post, we saw the basics of using the pandas library in Python, which is used for data analysis. We saw the two basic data structures supported by pandas:

Series
DataFrame

In this post, we will see how to filter data in a data frame. Filtering is one of the most common operations in data analysis.

Filtering a data frame in Pandas

loc & iloc methods

To recap, a data frame is a two-dimensional data structure consisting of rows and columns, so we need a way to select rows and columns efficiently. The two main accessors a data frame exposes for this are loc and iloc (both are used with square brackets rather than called like functions):

loc - selects by row and column labels
iloc - selects by integer row and column positions (0-based)

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df.loc[["a", "b"], ["colA", "colC"]])
print(df.iloc[:2, :3])

Output:

   colA  colC
a     4     7
b     1     4
   colA  colB  colC
a     4     9     7
b     1     1     4

The loc and iloc methods are frequently used for selecting or extracting a part of a data frame. The main difference is that loc works with labels whereas iloc works with 0-based integer positions.
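One practical consequence of the label/position difference is how slices behave. The sketch below uses a small frame with fixed values (chosen here for reproducibility, not taken from the example above) to show that a loc label slice includes its end label, while an iloc position slice excludes its stop position.

```python
import pandas as pd

# Fixed values (not random) so the result is the same on every run.
df = pd.DataFrame(
    [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]],
    index=["a", "b", "c", "d"],
    columns=["colA", "colB", "colC", "colD"],
)

# Label slice: both endpoints are included, so rows a, b and c come back.
by_label = df.loc["a":"c", "colA":"colB"]

# Position slice: the stop position is excluded, so only rows 0 and 1 (a, b).
by_position = df.iloc[0:2, 0:2]

print(by_label)
print(by_position)
```

This is worth keeping in mind when translating a selection from one accessor to the other.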

Selecting subset of columns

We can get a Series (a single column of data) from the data frame using df["column_name"]. Similarly, we can get a new data frame with a subset of columns by passing a list of the column names. For example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[["colA", "colC"]])
print(type(df[["colA", "colC"]]))

Output:

   colA  colC
a     5     2
b     7     2
c     4     3
d     9     1
<class 'pandas.core.frame.DataFrame'>

As you can see from the output, we selected two columns, colA and colC, and the result is a new DataFrame object.
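The difference between single and double brackets is easy to miss, so here is a minimal sketch of it. The values are fixed ones borrowed from the sample output above; the variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"colA": [5, 7, 4, 9], "colC": [2, 2, 3, 1]},
    index=["a", "b", "c", "d"],
)

single = df["colA"]      # a column name on its own returns a Series
as_frame = df[["colA"]]  # a one-element list returns a one-column DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)
```

Passing a list always gives back a DataFrame, even when the list holds just one column name.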

Filtering by condition

You can also filter a data frame by conditions. Consider the following example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"] >= 1])

Output:

   colA  colB  colC  colD
a     3     9     5     6
b     8     5     9     6
c     9     4     1     4
d     8     4     3     5

The data frame holds randomly generated data, so the output will differ from run to run. Note that np.random.randint(1, 10, ...) only produces values from 1 to 9, so the condition colA >= 1 is true for every row and all four rows are returned. With a higher threshold, only the rows whose colA value meets it would be kept.
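To see what the condition does under the hood, it can help to look at the intermediate result on its own. The following sketch reuses the colA and colB values from the sample output above as fixed data and picks a threshold of 5 (an arbitrary choice for illustration) so that one row is actually filtered out.

```python
import pandas as pd

df = pd.DataFrame(
    {"colA": [3, 8, 9, 8], "colB": [9, 5, 4, 4]},
    index=["a", "b", "c", "d"],
)

# The comparison yields a boolean Series with one True/False per row.
mask = df["colA"] > 5

# Indexing with that Series keeps only the rows where it is True.
filtered = df[mask]

print(mask)
print(filtered)
```

Row a has colA equal to 3, so it is the only row dropped here.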

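You can also combine multiple conditions with the & (and) and | (or) operators. Each condition must be wrapped in parentheses, because & and | bind more tightly than the comparison operators. Consider the following example: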

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[(df["colA"] >= 1) | (df["colB"] <= 5)])

Output:

   colA  colB  colC  colD
a     2     4     4     7
b     4     4     3     6
c     1     9     1     9
d     2     3     8     3

Again, the output will vary because the data is random, but it will always match the filtering conditions. The following comparison operators are supported:

==: equal
!=: not equal
>: greater than
>=: greater than or equal to
<: less than
<=: less than or equal to
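As a rough illustration of how these operators combine, the sketch below applies & and | to a frame with fixed values (taken from the colA and colB columns of the sample output above). The thresholds are arbitrary and were chosen only so that each filter drops at least one row.

```python
import pandas as pd

df = pd.DataFrame(
    {"colA": [2, 4, 1, 2], "colB": [4, 4, 9, 3]},
    index=["a", "b", "c", "d"],
)

# & keeps rows where both conditions hold (rows a, b and d).
both = df[(df["colA"] >= 2) & (df["colB"] <= 4)]

# | keeps rows where at least one condition holds (rows b and c).
either = df[(df["colA"] >= 4) | (df["colB"] >= 9)]

print(both)
print(either)
```

Python's and / or keywords cannot be used here, since they do not operate element by element on a Series.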

You can also use the .isin method to keep only the rows whose value appears in a given list:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"].isin([1, 2, 3])])

Output:

   colA  colB  colC  colD
b     1     7     6     4
c     2     4     9     7
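The opposite selection, rows whose value is not in the list, can be obtained by negating the isin result with the ~ operator. This is a small sketch on fixed data; the values in rows a and d are made up for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {"colA": [6, 1, 2, 8], "colB": [5, 7, 4, 3]},
    index=["a", "b", "c", "d"],
)

wanted = [1, 2, 3]

# Rows whose colA value is in the list (rows b and c).
in_list = df[df["colA"].isin(wanted)]

# ~ flips the boolean mask, keeping rows whose colA value is NOT in the list.
not_in_list = df[~df["colA"].isin(wanted)]

print(in_list)
print(not_in_list)
```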
