Background
Handling missing values in DataFrame with Pandas
- np.nan
- pd.NA
- None
- The isna function evaluates each cell in a DataFrame and returns True to indicate a missing value.
- The notna function evaluates each cell in a DataFrame and returns True to indicate a non-missing value.
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
"Role": ["IT Dev", None, "IT QA", np.nan],
"Joining Date": [20190101, 20200202, 20210303, 20220404]})
print(df.to_string())
print(df.isna())
print(df.notna())
You can now use the truth data to filter rows as follows
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
"Role": ["IT Dev", None, "IT QA", np.nan],
"Joining Date": [20190101, 20200202, 20210303, 20220404]})
print(df[df["Role"].notna()])
Output:
Name Role Joining Date
0 Aniket IT Dev 20190101
2 <NA> IT QA 20210303
Dropping (dropna)& replacement (fillna)of missing data
- dropna : The dropna function is used to drop rows and columns with missing values. It takes following arguements
- axis - 0 for rows and 1 for columns
- how - any for dropping if any data point is missing , all for dropping if all data points are missing
- thresh - Threshold number of data points missing for dropping
- inplace - Instead of returning a new modified DataFrame does dropping inplace for same DataFrame.
- fillna: The fillnafunction is used to fill missing values with some data.
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
"Role": ["IT Dev", None, "IT QA", np.nan],
"Joining Date": [20190101, 20200202, 20210303, 20220404]})
# Drop all columns with missing any data
print(df.dropna(axis=1, how="any"))
# Drop all rows with missing any data
print(df.dropna(axis=0, how="any"))
import pandas as pd
import numpy as np
df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
"Role": ["IT Dev", None, "IT QA", np.nan],
"Joining Date": [20190101, 20200202, np.nan, 20220404]})
# Replace missing values with some default value
print(df.fillna({"Name": "Default Name", "Role": "Default Role"}))
# Replace missing joining date with latest/max joining date in data
print(df["Joining Date"].fillna(value=df["Joining Date"].max()))
# forward fill the missing data
print(df.ffill(limit=1))