Open Source For Geeks: February 2025

Friday, 7 February 2025

Handling missing values in DataFrame with Pandas

Background

In the last few posts, we have been seeing the basics of Pandas - Series, DataFrame , how to manipulate data etc. In this post, we will try to see how to handle missing values in DataFrame.

Handling missing values in DataFrame with Pandas

Padas accept all following values as missing data

np.nan
pd.NA
None

We can use the isna or notna function to detect these missing data.

The isna function evaluates each cell in a DataFrame and returns True to indicate a missing value.
The notna function evaluates each cell in a DataFrame and returns True to indicate a non-missing value.

Let's try to see an example for the above:

Code:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

print(df.to_string())
print(df.isna())
print(df.notna())

Output:

Name Role Joining Date

0 Aniket IT Dev 20190101

1 abhijit None 20200202

2 <NA> IT QA 20210303

3 Anvi NaN 20220404

Name Role Joining Date

0 False False False

1 False True False

2 True False False

3 False True False

Name Role Joining Date

0 True True True

1 True False True

2 False True True

3 True False True

You can now use the truth data to filter rows as follows

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

print(df[df["Role"].notna()])

Output:

Name Role Joining Date
0 Aniket IT Dev 20190101
2 <NA> IT QA 20210303

Dropping (dropna)& replacement (fillna)of missing data

dropna : The dropna function is used to drop rows and columns with missing values. It takes following arguements

axis - 0 for rows and 1 for columns
how - any for dropping if any data point is missing , all for dropping if all data points are missing
thresh - Threshold number of data points missing for dropping
inplace - Instead of returning a new modified DataFrame does dropping inplace for same DataFrame.

fillna: The fillnafunction is used to fill missing values with some data.

Example for dropna:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

# Drop all columns with missing any data
print(df.dropna(axis=1, how="any"))

# Drop all rows with missing any data
print(df.dropna(axis=0, how="any"))

Output:

Joining Date

0 20190101

1 20200202

2 20210303

3 20220404

Name Role Joining Date

0 Aniket IT Dev 20190101

Example for fillna:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, np.nan, 20220404]})

# Replace missing values with some default value
print(df.fillna({"Name": "Default Name", "Role": "Default Role"}))

# Replace missing joining date with latest/max joining date in data
print(df["Joining Date"].fillna(value=df["Joining Date"].max()))

# forward fill the missing data
print(df.ffill(limit=1))

Output:

Name Role Joining Date

0 Aniket IT Dev 20190101.0

1 abhijit Default Role 20200202.0

2 Default Name IT QA NaN

3 Anvi Default Role 20220404.0

0 20190101.0

1 20200202.0

2 20220404.0

3 20220404.0

Name: Joining Date, dtype: float64

Name Role Joining Date

0 Aniket IT Dev 20190101.0

1 abhijit IT Dev 20200202.0

2 abhijit IT QA 20200202.0

3 Anvi IT QA 20220404.0

NOTE: Previously we could pass method argument as bfill as ffill but that is deprecated now. Worning message: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead

Thursday, 6 February 2025

String and Date manipulation in DataFrame with Pandas

Background

In the last post, we saw how to filter data in a DataFrame provided by the panda's library in Python. In this post, we will see how we can manipulate string and date type columns in DataFrame.

String and Date manipulation in DataFrame with Pandas

String & date manipulation in

String manipulation

You can use df.dtypes to check the data types of columns in a DataFrame. You can also use df.astype to convert the column types. In this section, we will see how to manipulate and operate string data types in DataFrame. You can do this by accessing the string content using .str accessor type.

Code:

import pandas as pd

df = pd.DataFrame({"Name": ["Aniket", "abhijit", "awantika", "Anvi"],
                   "Role": ["IT Dev", "Finance Analyst", "IT QA", "Fun Play"],
                   "Joining Date": [20190101, 20200101, 20210101, 20220101]})

print(df)

# .str accessor to operate on string data type
df["Name_initials"] = df["Name"].str[0:3]
df["Department"] = df["Role"].str.split(" ", expand=True)[1]

# Use + operator to concatenate data
df["Name_Department_Combined"] = df["Name"] + "_" + df["Department"]

# Chain operations to get results in one line
df["Capitalized_Initials"] = df["Name"].str.capitalize().str[0:3]

print(df.to_string())

Output:

Name Role Joining Date

0 Aniket IT Dev 20190101
1 abhijit Finance Analyst 20200101
2 awantika IT QA 20210101
3 Anvi Fun Play 20220101

Name object
Role object
Joining Date int64
dtype: object

Name Role Joining Date Name_initials Department Name_Department_Combined Capitalized_Initials

0 Aniket IT Dev 20190101 Ani Dev Aniket_Dev Ani
1 abhijit Finance Analyst 20200101 abh Analyst abhijit_Analyst Abh
2 awantika IT QA 20210101 awa QA awantika_QA Awa
3 Anvi Fun Play 20220101 Anv Play Anvi_Play Anv

In the above example, you can see various ways you can manipulate the str column type on data frame.

Date manipulation

Similar to manipulating string we can also manipulate date data types in pandas using accessory "dt". We will use the same example as above for date manipulation. For this we use a different data type called "datetime64[ns]" , this data type represents a timestamp meaning it represents a date and time.

You can try executing below code to see how this data type works

print(pd.to_datetime("20240701"))

This outputs

2024-07-01 00:00:00

In the example above we have a "Joining date" & as you see in the outout of code above it currently prints int as dtype of that column. We need to convert it to a datetime type before we do further manipulations on a date. As I mentioned above we can convert the data type using df.astype() method.

Code:

import pandas as pd

df = pd.DataFrame({"Name": ["Aniket", "abhijit", "awantika", "Anvi"],
                   "Role": ["IT Dev", "Finance Analyst", "IT QA", "Fun Play"],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

df['Joining Date'] = pd.to_datetime(df['Joining Date'], format='%Y%m%d')

# If date was of standard YYYY-MM-DD format you could use velow
# df = df.astype({"Joining Date": "datetime64[ns]"})
print(df.dtypes)

df["Joining Year"] = df["Joining Date"].dt.year
df["Joining Month"] = df["Joining Date"].dt.month
df["Joining Day"] = df["Joining Date"].dt.day
print(df.to_string())

Output:

Name object
Role object
Joining Date datetime64[ns]
dtype: object

Name Role Joining Date Joining Year Joining Month Joining Day
0 Aniket IT Dev 2019-01-01 2019 1 1
1 abhijit Finance Analyst 2020-02-02 2020 2 2
2 awantika IT QA 2021-03-03 2021 3 3
3 Anvi Fun Play 2022-04-04 2022 4 4

You can do more operations using the ".dt" accessor for date data types.

Sunday, 2 February 2025

Filtering a DataFrame in Pandas

Background

In the last post, we saw the basics if using the pandas library in Python which is used for data analysis. We saw two basic data structures supported by pandas

Series
DataFrame

In this post, we will further see how we can filter data in a data frame. These are some of the most common operations performed for data analysis.

Filtering a data frame in Pandas

loc & iloc methods

To recap, a data frame is a two-dimensional data structure consisting of rows and columns. So we need a way to filter rows and columns efficiently. Two main methods exposed by data frame for this are

loc - uses rows and column labels
iloc - uses rows and column indexes

For example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df.loc[["a", "b"], ["colA", "colC"]])
print(df.iloc[:2, :3])

Output:

colA colC

a 4 7

b 1 4

colA colB colC

a 4 9 7

b 1 1 4

The loc and iloc methods are frequently used for selecting or extracting a part of a data frame. The main difference is that loc works with labels whereas iloc works with indices.

Selecting subset of columns

We can get a Series (a single column data) from the data frame using df["column_name"], similarly, we can get a new data frame with a subset of columns by passing a list of columns needed. For eg.,

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[["colA", "colC"]])
print(type(df[["colA", "colC"]]))

Output:

colA colC

a 5 2

b 7 2

c 4 3

d 9 1

<class 'pandas.core.frame.DataFrame'>

As you can see from the output we selected 2 columns - ColA and ColC and the result is a new DataFrame object.

Filtering by condition

You can also filter a data frame by conditions. Consider the following example:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"] >= 1])

Output:

colA colB colC colD

a 3 9 5 6

b 8 5 9 6

c 9 4 1 4

d 8 4 3 5

The data frame has randomly generated data so the output will not be consistent but you can confirm that output will always have entries corresponding to colA having values greater than or equal to 1 as we specified in the filtering condition.

you can also specify multiple conditions with & or | operators. Consider the following example

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[(df["colA"] >= 1) | (df["colB"] <= 5)])

Output:

colA colB colC colD

a 2 4 4 7

b 4 4 3 6

c 1 9 1 9

d 2 3 8 3

Again the output would not be consistent due to randomness of data but you should get the output that matches the filtering conditions. Following are all conditions supported

==: equal
!=: not equal
>: greater than
>=: greater than or equal to
<: less than
<=: less than or equal to

You can also use the .isin method to filter data as follows.

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"].isin([1, 2, 3])])

Output:

colA colB colC colD

b 1 7 6 4

c 2 4 9 7

Getting started with Pandas library in Python

Background

If you have worked with data analysis or data sciences roles you would have worked with the Pandas/numpy libraries in Python which comes in handy. In this post, we will see how to get started on working with the pandas library. This post assumes you have a basic understanding of Python

Installing pandas library

You can install the pandas library using pip

Or you can directly install it in your pycharm as a package. You can go to

Setting -> Project ->Python interpreter and click on "+" icon to search & add the pandas package.

Once you have installed pandas library you can import it as

import pandas as pd

and start using it

Data structures supported in Pandas

Pandas library supports 2 main data structures

Series: One dimensional array with an axis.
DataFrame: Two dimensional data structure with labelled rows and columns

Series

Let's try to see how Series works. You can simply create a series from a python array

import pandas as pd

test_series = pd.Series([11, 12, 13])
print(test_series)

Output is:

0 11

1 12

2 13

dtype: int64

As you can see from the output we have integer index (0,1,2) and 1-dimensional data with values [11,12,13]. You can also have a string index for this one-dimensional array, You can use this index to access the data from Series.

import pandas as pd

test_series = pd.Series([11, 12, 13], index=["RowA", "RowB", "RowC"])
print(test_series)
print(test_series["RowA"])

Output is:

RowA 11

RowB 12

RowC 13

dtype: int64

DataFrame

To create a data frame you can simply pass a dictionary where the key of the dictionary forms the columns and the actual values of that keys from the data.

import pandas as pd

df = pd.DataFrame({
    "ColA": [11, 12, 13],
    "Col B": [21, 22, 23],
    "Col C": [31, 32, 33],
}, index=["Row A", "Row B", "Row C"])

print(df)

Output is:

ColA Col B Col C

Row A 11 21 31

Row B 12 22 32

Row C 13 23 33

As with series, the passing index is optional, if you do not pass default behavior is to use integer indexes (0,1,2... etc.). Similarly if you do not assign explicit column names then integer columns are used.

DataFrame support various methods

df.head(): Gives first 5 rows.
df.size: Gives number of cells in data frame (no of rows * no of columns). For above example output will be 9.
df.shape: Gives dimension of data frame. For above example output will be (3,3)
len(df): Give number of rows in data frame. For above example output will be 3.

For checking the data type and converting the column type we can use below methods:

df.dtypes : Give data types of columns present in the data frame
df.astype: Mehtod to convert data type of a column

Consider the following example:

import pandas as pd

df = pd.DataFrame({
    "ColA": [11, 12, 13],
    "Col B": [21, 22, 23],
    "Col C": [31, 32, 33],
}, index=["Row A", "Row B", "Row C"])

print(df.dtypes)
df = df.astype({"ColA": float})
print(df.dtypes)

Output:

ColA int64

Col B int64

Col C int64

dtype: object

ColA float64

Col B int64

Col C int64

dtype: object

Once you have a dateframe in place you can reference the individual columns and perform analysis on it. A single column referenced from data frame can perform below operations:

df.col.nunique(): Returns number of unique elements
df.col.uniqie(): Return actual unique elements
df.col.mean(): Retuns mean of column values
df.col.median(): Returns median of column values
df.col.value_counts(): Return unique values and their counts

Consider below example:

import pandas as pd

df = pd.DataFrame({
    "Col A": [11, 12, 13],
    "Col B": [21, 22, 23],
    "Col C": [31, 32, 33],
}, index=["Row A", "Row B", "Row C"])

print(df["Col A"].nunique())
print(df["Col A"].unique())
print(df["Col A"].mean())
print(df["Col A"].median())
print(df["Col A"].value_counts())

Output is:

[11 12 13]

12.0

Col A

11 1

12 1

13 1

Name: count, dtype: int64

Note that column of data frame is actually a series

df = pd.DataFrame({"A":[1,2,3], "C": [1,2,3]})
print(type(df))
print(type(df.A))

Output:

<class 'pandas.core.frame.DataFrame'>

<class 'pandas.core.series.Series'>

Friday, 7 February 2025

Handling missing values in DataFrame with Pandas

Background

Handling missing values in DataFrame with Pandas

Dropping (dropna)& replacement (fillna)of missing data

Related Links

Thursday, 6 February 2025

String and Date manipulation in DataFrame with Pandas

Background

String and Date manipulation in DataFrame with Pandas

String manipulation

Date manipulation

Related Links

Sunday, 2 February 2025

Filtering a DataFrame in Pandas

Background

Filtering a data frame in Pandas

loc & iloc methods

Selecting subset of columns

Filtering by condition

Related Links

Getting started with Pandas library in Python

Background

Installing pandas library

Data structures supported in Pandas

Series

DataFrame

Related links