Sunday, 2 February 2025

Filtering a DataFrame in Pandas

 Background

In the last post, we saw the basics if using the pandas library in Python which is used for data analysis. We saw two basic data structures supported by pandas

  1. Series
  2. DataFrame
In this post, we will further see how we can filter data in a data frame. These are some of the most common operations performed for data analysis.



Filtering a data frame in Pandas

loc & iloc methods

To recap, a data frame is a two-dimensional data structure consisting of rows and columns. So we need a way to filter rows and columns efficiently. Two main methods exposed by data frame for this are

  1. loc - uses rows and column labels
  2. iloc - uses rows and column indexes

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df.loc[["a", "b"], ["colA", "colC"]])
print(df.iloc[:2, :3])

Output:

   colA  colC

a     4     7

b     1     4

   colA  colB  colC

a     4     9     7

b     1     1     4

The loc and iloc methods are frequently used for selecting or extracting a part of a data frame. The main difference is that loc works with labels whereas iloc works with indices.

Selecting subset of columns

We can get a Series (a single column data) from the data frame using df["column_name"], similarly, we can get a new data frame with a subset of columns by passing a list of columns needed. For eg.,

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[["colA", "colC"]])
print(type(df[["colA", "colC"]]))


Output:

   colA  colC

a     5     2

b     7     2

c     4     3

d     9     1

<class 'pandas.core.frame.DataFrame'>

As you can see from the output we selected 2 columns - ColA and ColC and the result is a new DataFrame object.

Filtering by condition

You can also filter a data frame by conditions. Consider the following example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"] >= 1])


Output:

   colA  colB  colC  colD

a     3     9     5     6

b     8     5     9     6

c     9     4     1     4

d     8     4     3     5

The data frame has randomly generated data so the output will not be consistent but you can confirm that output will always have entries corresponding to colA having values greater than or equal to 1 as we specified in the filtering condition.

you can also specify multiple conditions with & or | operators. Consider the following example
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[(df["colA"] >= 1) | (df["colB"] <= 5)])



Output:
   colA  colB  colC  colD
a     2     4     4     7
b     4     4     3     6
c     1     9     1     9
d     2     3     8     3

Again the output would not be consistent due to randomness of data but you should get the output that matches the filtering conditions. Following are all conditions supported
  • ==: equal
  • !=: not equal
  • >: greater than
  • >=: greater than or equal to
  • <: less than
  • <=: less than or equal to
You can also use the .isin method to filter data as follows.
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"].isin([1, 2, 3])])

Output:
   colA  colB  colC  colD
b     1     7     6     4
c     2     4     9     7


Related Links

Getting started with Pandas library in Python

 Background

If you have worked with data analysis or data sciences roles you would have worked with the Pandas/numpy libraries in Python which comes in handy. In this post, we will see how to get started on working with the pandas library. This post assumes you have a basic understanding of Python



Installing pandas library

You can install the pandas library using pip

Or you can directly install it in your pycharm as a package. You can go to
Setting -> Project ->Python interpreter and click on "+" icon to search & add the pandas package.



Once you have installed pandas library you can import it as 
  • import pandas as pd
and start using it

Data structures supported in Pandas

Pandas library supports 2 main data structures

  1. Series: One dimensional array with an axis.
  2. DataFrame: Two dimensional data structure with labelled rows and columns

Series

Let's try to see how Series works. You can simply create a series from a python array
import pandas as pd

test_series = pd.Series([11, 12, 13])
print(test_series)

Output is:
0    11
1    12
2    13
dtype: int64

As you can see from the output we have integer index (0,1,2) and 1-dimensional data with values [11,12,13]. You can also have a string index for this one-dimensional array, You can use this index to access the data from Series.
import pandas as pd

test_series = pd.Series([11, 12, 13], index=["RowA", "RowB", "RowC"])
print(test_series)
print(test_series["RowA"])

Output is:
RowA    11
RowB    12
RowC    13
dtype: int64
11

DataFrame

To create a data frame you can simply pass a dictionary where the key of the dictionary forms the columns and the actual values of that keys from the data.
import pandas as pd

df = pd.DataFrame({
    "ColA": [11, 12, 13],
    "Col B": [21, 22, 23],
    "Col C": [31, 32, 33],
}, index=["Row A", "Row B", "Row C"])

print(df)

Output is:
           ColA  Col B  Col C
Row A    11     21     31
Row B    12     22     32
Row C    13     23     33


As with series, the passing index is optional, if you do not pass default behavior is to use integer indexes (0,1,2... etc.). Similarly if you do not assign explicit column names then integer columns are used.

DataFrame support various methods
  • df.head(): Gives first 5 rows. 
  • df.size: Gives number of cells in data frame (no of rows * no of columns). For above example output will be 9.
  • df.shape:  Gives dimension of data frame. For above example output will be (3,3)
  • len(df): Give number of rows in data frame. For above example output will be 3.

For checking the data type and converting the column type we can use below methods:
  • df.dtypes : Give data types of columns present in the data frame
  • df.astype: Mehtod to convert data type of a column
Consider the following example:
import pandas as pd

df = pd.DataFrame({
    "ColA": [11, 12, 13],
    "Col B": [21, 22, 23],
    "Col C": [31, 32, 33],
}, index=["Row A", "Row B", "Row C"])

print(df.dtypes)
df = df.astype({"ColA": float})
print(df.dtypes)

Output:
ColA     int64
Col B    int64
Col C    int64
dtype: object
ColA     float64
Col B      int64
Col C      int64
dtype: object

Once you have a dateframe in place you can reference the individual columns and perform analysis on it. A single column referenced from data frame can perform below operations:
  • df.col.nunique(): Returns number of unique elements
  • df.col.uniqie():  Return actual unique elements
  •  df.col.mean(): Retuns mean of column values
  • df.col.median(): Returns median of column values
  • df.col.value_counts(): Return unique values and their counts
Consider below example:
import pandas as pd

df = pd.DataFrame({
    "Col A": [11, 12, 13],
    "Col B": [21, 22, 23],
    "Col C": [31, 32, 33],
}, index=["Row A", "Row B", "Row C"])

print(df["Col A"].nunique())
print(df["Col A"].unique())
print(df["Col A"].mean())
print(df["Col A"].median())
print(df["Col A"].value_counts())

Output is:
3
[11 12 13]
12.0
12.0
Col A
11    1
12    1
13    1
Name: count, dtype: int64


Note that column of data frame is actually a series
df = pd.DataFrame({"A":[1,2,3], "C": [1,2,3]})
print(type(df))
print(type(df.A))

Output:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.series.Series'>

Related links

t> UA-39527780-1 back to top