Friday, 7 February 2025

Handling missing values in DataFrame with Pandas

 Background

In the last few posts, we have been seeing the basics of Pandas - Series, DataFrame , how to manipulate data etc. In this post, we will try to see how to handle missing values in DataFrame.



Handling missing values in DataFrame with Pandas

Padas accept all following values as missing data
  • np.nan
  • pd.NA
  • None
We can use the isna or notna function to detect these missing data. 
  • The isna function evaluates each cell in a DataFrame and returns True to indicate a missing value. 
  • The notna function evaluates each cell in a DataFrame and returns True to indicate a non-missing value. 
Let's try to see an example for the above:

Code:
import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

print(df.to_string())
print(df.isna())
print(df.notna())


Output:
      Name    Role  Joining Date
0   Aniket  IT Dev      20190101
1  abhijit    None      20200202
2     <NA>   IT QA      20210303
3     Anvi     NaN      20220404

    Name   Role  Joining Date
0  False  False         False
1  False   True         False
2   True  False         False
3  False   True         False

    Name   Role  Joining Date
0   True   True          True
1   True  False          True
2  False   True          True
3   True  False          True

You can now use the truth data to filter rows as follows

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

print(df[df["Role"].notna()])

Output:

     Name    Role  Joining Date
0  Aniket  IT Dev      20190101
2    <NA>   IT QA      20210303


Dropping (dropna)& replacement (fillna)of missing data

  • dropna : The dropna function is used to drop rows and columns with missing values. It takes following arguements
    • axis - 0 for rows and 1 for columns
    • how - any for dropping if any data point is missing , all for dropping if all data points are missing
    • thresh - Threshold number of data points missing for dropping
    • inplace - Instead of returning a new modified DataFrame does dropping inplace for same DataFrame.
  • fillna: The fillnafunction is used to fill missing values with some data.

Example for dropna:
import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

# Drop all columns with missing any data
print(df.dropna(axis=1, how="any"))

# Drop all rows with missing any data
print(df.dropna(axis=0, how="any"))

Output:
   Joining Date
0      20190101
1      20200202
2      20210303
3      20220404

     Name    Role  Joining Date
0  Aniket  IT Dev      20190101

Example for fillna:
import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, np.nan, 20220404]})

# Replace missing values with some default value
print(df.fillna({"Name": "Default Name", "Role": "Default Role"}))

# Replace missing joining date with latest/max joining date in data
print(df["Joining Date"].fillna(value=df["Joining Date"].max()))

# forward fill the missing data
print(df.ffill(limit=1))


Output:

           Name          Role  Joining Date
0        Aniket        IT Dev    20190101.0
1       abhijit  Default Role    20200202.0
2  Default Name         IT QA           NaN
3          Anvi  Default Role    20220404.0

0    20190101.0
1    20200202.0
2    20220404.0
3    20220404.0
Name: Joining Date, dtype: float64

      Name    Role  Joining Date
0   Aniket  IT Dev    20190101.0
1  abhijit  IT Dev    20200202.0
2  abhijit   IT QA    20200202.0
3     Anvi   IT QA    20220404.0

NOTE: Previously we could pass method argument as bfill as ffill but that is deprecated now. Worning message: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead

t> UA-39527780-1 back to top