Background
In the last few posts, we have been seeing the basics of Pandas - Series, DataFrame , how to manipulate data etc. In this post, we will try to see how to handle missing values in DataFrame.
Handling missing values in DataFrame with Pandas
Padas accept all following values as missing data
- np.nan
- pd.NA
- None
We can use the isna or notna function to detect these missing data.
- The isna function evaluates each cell in a DataFrame and returns True to indicate a missing value.
- The notna function evaluates each cell in a DataFrame and returns True to indicate a non-missing value.
Let's try to see an example for the above:
Code:
import pandas as pd import numpy as np df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"], "Role": ["IT Dev", None, "IT QA", np.nan], "Joining Date": [20190101, 20200202, 20210303, 20220404]}) print(df.to_string()) print(df.isna()) print(df.notna())
Output:
Name Role Joining Date
0 Aniket IT Dev 20190101
1 abhijit None 20200202
2 <NA> IT QA 20210303
3 Anvi NaN 20220404
Name Role Joining Date
0 False False False
1 False True False
2 True False False
3 False True False
Name Role Joining Date
0 True True True
1 True False True
2 False True True
3 True False True
You can now use the truth data to filter rows as follows
import pandas as pd import numpy as np df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"], "Role": ["IT Dev", None, "IT QA", np.nan], "Joining Date": [20190101, 20200202, 20210303, 20220404]}) print(df[df["Role"].notna()])
Output:
Name Role Joining Date
0 Aniket IT Dev 20190101
2 <NA> IT QA 20210303
Dropping (dropna)& replacement (fillna)of missing data
- dropna : The dropna function is used to drop rows and columns with missing values. It takes following arguements
- axis - 0 for rows and 1 for columns
- how - any for dropping if any data point is missing , all for dropping if all data points are missing
- thresh - Threshold number of data points missing for dropping
- inplace - Instead of returning a new modified DataFrame does dropping inplace for same DataFrame.
- fillna: The fillnafunction is used to fill missing values with some data.
Example for dropna:
import pandas as pd import numpy as np df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"], "Role": ["IT Dev", None, "IT QA", np.nan], "Joining Date": [20190101, 20200202, 20210303, 20220404]}) # Drop all columns with missing any data print(df.dropna(axis=1, how="any")) # Drop all rows with missing any data print(df.dropna(axis=0, how="any"))
Output:
Joining Date
0 20190101
1 20200202
2 20210303
3 20220404
Name Role Joining Date
0 Aniket IT Dev 20190101
Example for fillna:
import pandas as pd import numpy as np df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"], "Role": ["IT Dev", None, "IT QA", np.nan], "Joining Date": [20190101, 20200202, np.nan, 20220404]}) # Replace missing values with some default value print(df.fillna({"Name": "Default Name", "Role": "Default Role"})) # Replace missing joining date with latest/max joining date in data print(df["Joining Date"].fillna(value=df["Joining Date"].max())) # forward fill the missing data print(df.ffill(limit=1))
Output:
Name Role Joining Date
0 Aniket IT Dev 20190101.0
1 abhijit Default Role 20200202.0
2 Default Name IT QA NaN
3 Anvi Default Role 20220404.0
0 20190101.0
1 20200202.0
2 20220404.0
3 20220404.0
Name: Joining Date, dtype: float64
Name Role Joining Date
0 Aniket IT Dev 20190101.0
1 abhijit IT Dev 20200202.0
2 abhijit IT QA 20200202.0
3 Anvi IT QA 20220404.0
NOTE: Previously we could pass method argument as bfill as ffill but that is deprecated now. Worning message: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead