Background
Handling missing values in DataFrame with Pandas
- np.nan
- pd.NA
- None
- The isna function evaluates each cell in a DataFrame and returns True to indicate a missing value.
- The notna function evaluates each cell in a DataFrame and returns True to indicate a non-missing value.
1 2 3 4 5 6 7 8 9 10 | import pandas as pd import numpy as np df = pd.DataFrame({ "Name" : [ "Aniket" , "abhijit" , pd.NA, "Anvi" ], "Role" : [ "IT Dev" , None, "IT QA" , np.nan], "Joining Date" : [ 20190101 , 20200202 , 20210303 , 20220404 ]}) print(df.to_string()) print(df.isna()) print(df.notna()) |
You can now use the truth data to filter rows as follows
1 2 3 4 5 6 7 8 | import pandas as pd import numpy as np df = pd.DataFrame({ "Name" : [ "Aniket" , "abhijit" , pd.NA, "Anvi" ], "Role" : [ "IT Dev" , None, "IT QA" , np.nan], "Joining Date" : [ 20190101 , 20200202 , 20210303 , 20220404 ]}) print(df[df[ "Role" ].notna()]) |
Output:
Name Role Joining Date
0 Aniket IT Dev 20190101
2 <NA> IT QA 20210303
Dropping (dropna)& replacement (fillna)of missing data
- dropna : The dropna function is used to drop rows and columns with missing values. It takes following arguements
- axis - 0 for rows and 1 for columns
- how - any for dropping if any data point is missing , all for dropping if all data points are missing
- thresh - Threshold number of data points missing for dropping
- inplace - Instead of returning a new modified DataFrame does dropping inplace for same DataFrame.
- fillna: The fillnafunction is used to fill missing values with some data.
1 2 3 4 5 6 7 8 9 10 11 12 | import pandas as pd import numpy as np df = pd.DataFrame({ "Name" : [ "Aniket" , "abhijit" , pd.NA, "Anvi" ], "Role" : [ "IT Dev" , None, "IT QA" , np.nan], "Joining Date" : [ 20190101 , 20200202 , 20210303 , 20220404 ]}) # Drop all columns with missing any data print(df.dropna(axis= 1 , how= "any" )) # Drop all rows with missing any data print(df.dropna(axis= 0 , how= "any" )) |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 | import pandas as pd import numpy as np df = pd.DataFrame({ "Name" : [ "Aniket" , "abhijit" , pd.NA, "Anvi" ], "Role" : [ "IT Dev" , None, "IT QA" , np.nan], "Joining Date" : [ 20190101 , 20200202 , np.nan, 20220404 ]}) # Replace missing values with some default value print(df.fillna({ "Name" : "Default Name" , "Role" : "Default Role" })) # Replace missing joining date with latest/max joining date in data print(df[ "Joining Date" ].fillna(value=df[ "Joining Date" ].max())) # forward fill the missing data print(df.ffill(limit= 1 )) |