Background
Handling missing values in DataFrame with Pandas
- np.nan
- pd.NA
- None
- The isna function evaluates each cell in a DataFrame and returns True to indicate a missing value.
- The notna function evaluates each cell in a DataFrame and returns True to indicate a non-missing value.
import pandas as pd import numpy as np df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"], "Role": ["IT Dev", None, "IT QA", np.nan], "Joining Date": [20190101, 20200202, 20210303, 20220404]}) print(df.to_string()) print(df.isna()) print(df.notna())
You can now use the truth data to filter rows as follows
import pandas as pd import numpy as np df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"], "Role": ["IT Dev", None, "IT QA", np.nan], "Joining Date": [20190101, 20200202, 20210303, 20220404]}) print(df[df["Role"].notna()])
Output:
Name Role Joining Date
0 Aniket IT Dev 20190101
2 <NA> IT QA 20210303
Dropping (dropna)& replacement (fillna)of missing data
- dropna : The dropna function is used to drop rows and columns with missing values. It takes following arguements
- axis - 0 for rows and 1 for columns
- how - any for dropping if any data point is missing , all for dropping if all data points are missing
- thresh - Threshold number of data points missing for dropping
- inplace - Instead of returning a new modified DataFrame does dropping inplace for same DataFrame.
- fillna: The fillnafunction is used to fill missing values with some data.
import pandas as pd import numpy as np df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"], "Role": ["IT Dev", None, "IT QA", np.nan], "Joining Date": [20190101, 20200202, 20210303, 20220404]}) # Drop all columns with missing any data print(df.dropna(axis=1, how="any")) # Drop all rows with missing any data print(df.dropna(axis=0, how="any"))
import pandas as pd import numpy as np df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"], "Role": ["IT Dev", None, "IT QA", np.nan], "Joining Date": [20190101, 20200202, np.nan, 20220404]}) # Replace missing values with some default value print(df.fillna({"Name": "Default Name", "Role": "Default Role"})) # Replace missing joining date with latest/max joining date in data print(df["Joining Date"].fillna(value=df["Joining Date"].max())) # forward fill the missing data print(df.ffill(limit=1))