Monday, 31 March 2025

Understanding Bisect module in Python

Background

You have probably come across the use case of finding an element in a sorted list. Typically you would use binary search, which takes O(log N). Python provides a module that does exactly that - it is called bisect. The methods in this API let us find the position in a sorted list at which a new element can be inserted to keep the list sorted. In this post, we will explain how to use this module.


Understanding Bisect module in Python

Let's look at the method bisect_left, which gives us the position in a sorted list at which a new element can be inserted to keep the list in sorted order. The syntax is
  • bisect_left(list, element, low, high)
The arguments are
  • list - the sorted list in which we want to find the insertion position
  • element - the element to be inserted
  • low - the starting index of the search range (optional, defaults to 0)
  • high - the ending index of the search range (optional, defaults to len(list))
See the "Related links" section at the end for the official documentation link.

Now let's see an example
import bisect
 
nums = [1, 3, 5, 9, 10, 15]
idx = bisect.bisect_left(nums, 6)
print(f"Index to insert 6 is {idx}")
nums.insert(idx, 6)
print(f"New list {nums}")

Above code prints: 
Index to insert 6 is 3
New list [1, 3, 5, 6, 9, 10, 15]

You can also do both of the steps shown above - getting the insertion index that maintains sort order and inserting the element into the list - with a single API provided by bisect. The API is
  • insort_left(list, element)
Let's see how we can use the above API to achieve the same result
import bisect
 
nums = [1, 3, 5, 9, 10, 15]
bisect.insort_left(nums, 6)
print(f"New list {nums}")

Output is: New list [1, 3, 5, 6, 9, 10, 15]

NOTE: bisect_left, as the name suggests, gives the left-most possible index at which to insert, whereas the similar API bisect_right gives the right-most possible index. Similar to insort_left, there is also insort_right, which does an in-place insertion of the given element into the given list at that right-most position.

Let's see an example of the above
import bisect
 
nums = [1, 3, 5, 5, 5, 9, 10, 15]
idx = bisect.bisect_left(nums, 5)
print(f"Left most index to insert 5 is {idx}")
idx = bisect.bisect_right(nums, 5)
print(f"Right most index to insert 5 is {idx}")

Above code prints:
Left most index to insert 5 is 2
Right most index to insert 5 is 5


NOTE: The time complexity of the lookup is O(log N), the same as binary search. (The subsequent insertion with list.insert is O(N), since later elements must be shifted.)


We can use this API for multiple use cases; let's see them below.

1. Binary Search

As we saw above, you can use bisect to implement binary search.
import bisect
def binary_search(arr, element, low, high):
    idx = bisect.bisect_left(arr, element, low, high)
    if idx < len(arr) and arr[idx] == element:
        return idx
    return -1


arr = [1, 3, 5, 7, 12, 20]
search_idx = binary_search(arr, 7, 0, len(arr))
print(f"Binary search index for 7 : {search_idx}")


Output is: Binary search index for 7 : 3

2. Prefix search

If your list contains all lowercase strings in sorted order, then we can use bisect for prefix search as well, as follows:
import bisect
def prefix_search(arr, prefix):
    idx = bisect.bisect_left(arr, prefix)
    if idx >= len(arr):
        return None
    el = arr[idx]
    return el if el.startswith(prefix) else None
 
 
arr = ["apple", "cat", "dog","elephant"]
print(prefix_search(arr, "ap"))

Output is: apple
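
The function above returns only the first match. As a small extension - a minimal sketch, assuming the same kind of sorted, lowercase list - you can walk forward from the insertion point to collect every word with the given prefix:

import bisect

def prefix_search_all(arr, prefix):
    # bisect_left finds the first position where the prefix could be inserted;
    # because the list is sorted, all matches (if any) start at this index
    idx = bisect.bisect_left(arr, prefix)
    matches = []
    while idx < len(arr) and arr[idx].startswith(prefix):
        matches.append(arr[idx])
        idx += 1
    return matches

arr = ["apple", "apricot", "cat", "dog", "elephant"]
print(prefix_search_all(arr, "ap"))  # ['apple', 'apricot']

The bisect step is still O(log N); collecting the matches adds O(K) for K matching words.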

3. Find the number of repeated values

If you have a sorted array and you want to find the number of times a value is repeated, we can use bisect again. Note that since the array is sorted, all occurrences of the value will be contiguous.
import bisect
def count_repeated_no(arr, no):
    left_idx = bisect.bisect_left(arr, no)
    right_idx = bisect.bisect_right(arr, no)
 
    if left_idx >= len(arr):
        return -1
    else:
        return right_idx - left_idx
 
arr = [1, 2, 5, 5, 5, 5, 9, 11]  # the array must be sorted for bisect to work
print(f"Count of 5 in array {count_repeated_no(arr, 5)}")

Output: Count of 5 in array 4

Performance

Note that this API is really fast:
  1. First, because it uses binary search, its time complexity is O(log N).
  2. Second, it is precompiled C (the _bisect accelerator module), so it is faster than implementing the search yourself in pure Python (see the SO question comparing it with a direct lookup in a list); a rough timeit sketch follows below.
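
To get a rough feel for the speed difference, here is a minimal timeit sketch - not a rigorous benchmark, and absolute numbers will vary by machine - comparing bisect.bisect_left with a hand-written pure-Python binary search:

import bisect
import timeit

data = list(range(1_000_000))

def py_bisect_left(a, x):
    # Equivalent pure-Python binary search, for comparison
    lo, hi = 0, len(a)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < x:
            lo = mid + 1
        else:
            hi = mid
    return lo

print("C bisect:   ", timeit.timeit(lambda: bisect.bisect_left(data, 654_321), number=100_000))
print("Pure Python:", timeit.timeit(lambda: py_bisect_left(data, 654_321), number=100_000))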

Related Links
  • Official Python documentation for bisect: https://docs.python.org/3/library/bisect.html

Wednesday, 26 March 2025

SOLID principles in Object Oriented Programming

Background

This post walks through the 5 SOLID design principles used for object-oriented software design. These principles are
  • S: Single Responsibility Principle (SRP)
  • O: Open/Closed Principle
  • L: Liskov’s Substitution Principle (LSP)
  • I: Interface Segregation Principle (ISP)
  • D: Dependency Inversion Principle (DIP)
Whenever you write new code or design new software, ensure you respect the SOLID principles listed above. This helps create clean, modular code that is easier to maintain and scale in the future. Let's see each of these in detail.




Single Responsibility Principle (SRP)

This principle states - "A class should have a single responsibility".

Let us take an example to understand this better. Assume you are building software for a restaurant and you create a class called Worker with methods like prepare_food & serve_food.
class Worker:
    def prepare_food(self):
        print("Preparing Food")
 
    def serve_food(self):
        print("Serving Food"


What do you think about the above design? It's bad, but why? Because it tightly couples 2 responsibilities - preparing food and serving food - which are independent tasks. Now if you make any changes to how food is prepared, you also need to test and ensure that the serve_food functionality is not impacted. This unnecessary coupling makes the code hard to maintain. A better design is a separate class for each responsibility, as sketched below:
  • Class Chef - for preparing food
  • Class Waiter - for serving food
That way there is separation of concerns, and each class holds only the logic for the business use case it is responsible for. It's easier to maintain & scale.
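
A minimal sketch of that split (the class and method bodies are just illustrative placeholders):

class Chef:
    def prepare_food(self):
        print("Preparing Food")


class Waiter:
    def serve_food(self):
        print("Serving Food")


# Each class can now change and be tested independently
Chef().prepare_food()
Waiter().serve_food()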

Open/Closed Principle

This principle states - "A class should be open for extension but closed for modification".

Let's see an example of this as well. Again, take our example of writing software for a restaurant business. Say we are implementing the Chef class below.
class Chef:
    def prepare_chinese_food(self):
        print("Preparing Chinese Food"




Do you see any potential problems with the above? What happens when your restaurant starts preparing continental cuisine? You will add another method called prepare_continental_food to the Chef class, and the number of methods will keep growing as we add one for each new cuisine. This is bad because you are modifying code that is already tested and functioning, which means adding continental cuisine support puts the original Chinese implementation at risk.

To fix this, a better way is to extend the Chef class to support the new functionality
class Chef:
    def prepare_food(self):
        print("Preparing Chinese Food")
 
 
class ContinentalChef(Chef):
    def prepare_food(self):
        print("Preparing Continental Food")
 
 
chef = ContinentalChef()
chef.prepare_food()


As you can see, you can now extend the base class Chef and add the new functionality without impacting the original implementation. If, for an additional feature, you have to make a lot of modifications to your original code, the design is definitely violating this principle; step back and see how you can make your code generic enough that fewer changes are needed for future enhancements.

Liskov’s Substitution Principle (LSP)

This principle states - "Derived or child classes must be substitutable for their base or parent classes".

For this, we can again use the above example. What would have happened if we had implemented our Chef class a little differently?
class Chef:
    def prepare_chinese_food(self):
        print("Preparing Chinese Food")


class ContinentalChef(Chef):
    def prepare_chinese_food(self):
        print("Preparing Continental Food")

    def prepare_continental_food(self):
        print("Preparing Continental Food")


chef = ContinentalChef()
chef.prepare_chinese_food()


This is also bad because you have overridden the prepare_chinese_food functionality to actually prepare continental food. The caller of the Chef class will call prepare_chinese_food under the impression that Chinese food will be prepared, but due to the way the above classes are implemented it will do the unexpected thing (prepare continental cuisine).


The principle says you should be able to use ContinentalChef anywhere you have used the Chef class without breaking the software contract. If you really intended ContinentalChef not to prepare Chinese food, you can raise an exception from the prepare_chinese_food method of the child class, as sketched below.
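
A minimal sketch of that escape hatch (whether raising is acceptable here still depends on the contract your callers expect):

class Chef:
    def prepare_chinese_food(self):
        print("Preparing Chinese Food")


class ContinentalChef(Chef):
    def prepare_chinese_food(self):
        # Explicitly signal the unsupported operation instead of
        # silently doing something the caller did not ask for
        raise NotImplementedError("ContinentalChef does not prepare Chinese food")

    def prepare_continental_food(self):
        print("Preparing Continental Food")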

Interface Segregation Principle (ISP)

This principle states - "Do not force any client to implement an interface which is irrelevant to them".

This principle is very similar to the Single Responsibility Principle we saw above, except it talks about interfaces rather than classes.

For example, let's say you have a Chef class and a Cuisine interface with methods like create_chinese, create_continental, etc. Instead of having a single interface declaring all the abstract methods, it's better to have separate interfaces - ChineseCuisine, ContinentalCuisine, etc. That way your concrete classes can choose to implement only the ones that fit their use cases.

In the real world a chef may know both cuisines or just one. By creating a single interface you force the concrete class to implement both methods even if it does not support both cuisines, which is bad. A concrete class should be free to choose which interfaces to implement based on the cuisines the chef can prepare.
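
Python has no interface keyword, but abstract base classes can play that role. A minimal sketch (the concrete class names below are illustrative):

from abc import ABC, abstractmethod


class ChineseCuisine(ABC):
    @abstractmethod
    def create_chinese(self):
        ...


class ContinentalCuisine(ABC):
    @abstractmethod
    def create_continental(self):
        ...


# Implements only the interface it actually supports
class ChineseChef(ChineseCuisine):
    def create_chinese(self):
        print("Creating a Chinese dish")


# Knows both cuisines, so it opts into both interfaces
class MultiCuisineChef(ChineseCuisine, ContinentalCuisine):
    def create_chinese(self):
        print("Creating a Chinese dish")

    def create_continental(self):
        print("Creating a Continental dish")


MultiCuisineChef().create_continental()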

Dependency Inversion Principle (DIP)

This principle states - "High-level modules should not depend on low-level modules. Both should depend on abstractions"


This one is actually my favorite. Let's see an example to understand it. Say you have a Restaurant class that has a ChinesePreparingChef as an instance member. Tomorrow, if the chef changes to a ContinentalPreparingChef, you need to modify the Restaurant code to use the new class.

Instead, you should define a base class called Chef and create two concrete classes - ChinesePreparingChef & ContinentalPreparingChef. Restaurant then just holds a reference to a Chef instance, which can be either concrete class depending on the use case, as sketched below.
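
A minimal sketch of this arrangement; constructor injection (used below) is one common way to hand the concrete chef to the Restaurant:

class Chef:
    def prepare_food(self):
        raise NotImplementedError


class ChinesePreparingChef(Chef):
    def prepare_food(self):
        print("Preparing Chinese Food")


class ContinentalPreparingChef(Chef):
    def prepare_food(self):
        print("Preparing Continental Food")


class Restaurant:
    def __init__(self, chef: Chef):
        # Restaurant depends only on the Chef abstraction,
        # not on any concrete chef class
        self.chef = chef

    def serve(self):
        self.chef.prepare_food()


Restaurant(ChinesePreparingChef()).serve()
Restaurant(ContinentalPreparingChef()).serve()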




Friday, 7 February 2025

Handling missing values in DataFrame with Pandas

Background

In the last few posts, we have been covering the basics of Pandas - Series, DataFrame, how to manipulate data, etc. In this post, we will see how to handle missing values in a DataFrame.



Handling missing values in DataFrame with Pandas

Pandas accepts all of the following values as missing data
  • np.nan
  • pd.NA
  • None
We can use the isna or notna functions to detect this missing data.
  • The isna function evaluates each cell in a DataFrame and returns True to indicate a missing value.
  • The notna function evaluates each cell in a DataFrame and returns True to indicate a non-missing value.
Let's try to see an example for the above:

Code:
import pandas as pd
import numpy as np
 
df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})
 
print(df.to_string())
print(df.isna())
print(df.notna())


Output:
      Name    Role  Joining Date
0   Aniket  IT Dev      20190101
1  abhijit    None      20200202
2     <NA>   IT QA      20210303
3     Anvi     NaN      20220404

    Name   Role  Joining Date
0  False  False         False
1  False   True         False
2   True  False         False
3  False   True         False

    Name   Role  Joining Date
0   True   True          True
1   True  False          True
2  False   True          True
3   True  False          True

You can now use the truth data to filter rows as follows

import pandas as pd
import numpy as np
 
df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})
 
print(df[df["Role"].notna()])

Output:

     Name    Role  Joining Date
0  Aniket  IT Dev      20190101
2    <NA>   IT QA      20210303


Dropping (dropna) & replacing (fillna) missing data

  • dropna: The dropna function is used to drop rows or columns with missing values. It takes the following arguments:
    • axis - 0 to drop rows, 1 to drop columns
    • how - "any" to drop if any data point is missing, "all" to drop only if all data points are missing
    • thresh - the minimum number of non-missing values required to keep the row/column (a short sketch of this follows the dropna example below)
    • inplace - drops in place on the same DataFrame instead of returning a new, modified one
  • fillna: The fillna function is used to fill missing values with some data.

Example for dropna:
import pandas as pd
import numpy as np
 
df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})
 
# Drop all columns with any missing data
print(df.dropna(axis=1, how="any"))
 
# Drop all rows with any missing data
print(df.dropna(axis=0, how="any"))
Output:
   Joining Date
0      20190101
1      20200202
2      20210303
3      20220404

     Name    Role  Joining Date
0  Aniket  IT Dev      20190101
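
The thresh argument deserves a quick demonstration too. A minimal sketch, reusing the same DataFrame: with thresh=3 a row is kept only if it has at least 3 non-missing values, so only row 0 survives here.

import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})

# Keep only rows with at least 3 non-missing values
print(df.dropna(axis=0, thresh=3))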

Example for fillna:
import pandas as pd
import numpy as np
 
df = pd.DataFrame({"Name": ["Aniket", "abhijit", pd.NA, "Anvi"],
                   "Role": ["IT Dev", None, "IT QA", np.nan],
                   "Joining Date": [20190101, 20200202, np.nan, 20220404]})
 
# Replace missing values with some default value
print(df.fillna({"Name": "Default Name", "Role": "Default Role"}))
 
# Replace missing joining date with latest/max joining date in data
print(df["Joining Date"].fillna(value=df["Joining Date"].max()))
 
# forward fill the missing data
print(df.ffill(limit=1))

Output:

           Name          Role  Joining Date
0        Aniket        IT Dev    20190101.0
1       abhijit  Default Role    20200202.0
2  Default Name         IT QA           NaN
3          Anvi  Default Role    20220404.0

0    20190101.0
1    20200202.0
2    20220404.0
3    20220404.0
Name: Joining Date, dtype: float64

      Name    Role  Joining Date
0   Aniket  IT Dev    20190101.0
1  abhijit  IT Dev    20200202.0
2  abhijit   IT QA    20200202.0
3     Anvi   IT QA    20220404.0

NOTE: Previously you could pass a method argument ("ffill" or "bfill") to fillna, but that is deprecated now. Warning message: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead
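
So instead of fillna(method=...), call the dedicated methods directly. A minimal sketch of bfill, the backward-fill counterpart of the ffill call used above:

import pandas as pd
import numpy as np

df = pd.DataFrame({"Role": ["IT Dev", None, "IT QA", np.nan]})

# Backward fill: each missing value takes the next valid value below it
# (the final NaN stays NaN because nothing follows it)
print(df.bfill())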

Thursday, 6 February 2025

String and Date manipulation in DataFrame with Pandas

Background

In the last post, we saw how to filter data in a DataFrame provided by the pandas library in Python. In this post, we will see how to manipulate string and date columns in a DataFrame.



String and Date manipulation in DataFrame with Pandas

String & date manipulation in 

String manipulation

You can use df.dtypes to check the data types of the columns in a DataFrame, and df.astype to convert column types. In this section, we will see how to manipulate and operate on string data in a DataFrame. You do this by accessing the string content through the .str accessor.

Code:

import pandas as pd
 
df = pd.DataFrame({"Name": ["Aniket", "abhijit", "awantika", "Anvi"],
                   "Role": ["IT Dev", "Finance Analyst", "IT QA", "Fun Play"],
                   "Joining Date": [20190101, 20200101, 20210101, 20220101]})
 
print(df)
print(df.dtypes)
 
# .str accessor to operate on string data type
df["Name_initials"] = df["Name"].str[0:3]
df["Department"] = df["Role"].str.split(" ", expand=True)[1]
 
# Use + operator to concatenate data
df["Name_Department_Combined"] = df["Name"] + "_" + df["Department"]
 
# Chain operations to get results in one line
df["Capitalized_Initials"] = df["Name"].str.capitalize().str[0:3]
 
print(df.to_string())

Output:

       Name             Role  Joining Date

0    Aniket           IT Dev      20190101
1   abhijit  Finance Analyst      20200101
2  awantika            IT QA      20210101
3      Anvi         Fun Play      20220101

Name            object
Role            object
Joining Date     int64
dtype: object

       Name             Role  Joining Date Name_initials Department Name_Department_Combined Capitalized_Initials

0    Aniket           IT Dev      20190101           Ani        Dev               Aniket_Dev                  Ani
1   abhijit  Finance Analyst      20200101           abh    Analyst          abhijit_Analyst                  Abh
2  awantika            IT QA      20210101           awa         QA              awantika_QA                  Awa
3      Anvi         Fun Play      20220101           Anv       Play                Anvi_Play                  Anv

In the above example, you can see various ways to manipulate string columns in a data frame.

Date manipulation

Similar to manipulating strings, we can also manipulate date data types in pandas, using the .dt accessor. We will use the same example as above for date manipulation. For this we use a different data type called "datetime64[ns]"; it represents a timestamp, meaning a date and time.

You can try executing the code below to see how this data type works

  • print(pd.to_datetime("20240701"))
This outputs
2024-07-01 00:00:00


In the example above we have a "Joining Date" column, and as you can see in the output of the code above, its dtype currently prints as int64. We need to convert it to a datetime type before we can do further date manipulation. As mentioned above, we can convert the data type using the df.astype() method; since our dates are stored as non-standard integers, the code below uses pd.to_datetime with an explicit format instead.

Code:

import pandas as pd
 
df = pd.DataFrame({"Name": ["Aniket", "abhijit", "awantika", "Anvi"],
                   "Role": ["IT Dev", "Finance Analyst", "IT QA", "Fun Play"],
                   "Joining Date": [20190101, 20200202, 20210303, 20220404]})
 
df['Joining Date'] = pd.to_datetime(df['Joining Date'], format='%Y%m%d')
 
# If the date were in the standard YYYY-MM-DD format you could use the line below
# df = df.astype({"Joining Date": "datetime64[ns]"})
print(df.dtypes)
 
df["Joining Year"] = df["Joining Date"].dt.year
df["Joining Month"] = df["Joining Date"].dt.month
df["Joining Day"] = df["Joining Date"].dt.day
print(df.to_string())

Output:

Name                    object
Role                    object
Joining Date    datetime64[ns]
dtype: object

       Name             Role Joining Date  Joining Year  Joining Month  Joining Day
0    Aniket           IT Dev   2019-01-01          2019              1            1
1   abhijit  Finance Analyst   2020-02-02          2020              2            2
2  awantika            IT QA   2021-03-03          2021              3            3
3      Anvi         Fun Play   2022-04-04          2022              4            4

You can do more operations using the ".dt" accessor for date data types.
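
For instance, a few more .dt conveniences - a minimal sketch using a standalone datetime Series:

import pandas as pd

dates = pd.to_datetime(pd.Series(["2019-01-01", "2020-02-02", "2021-03-03"]))

print(dates.dt.day_name())        # weekday name of each date
print(dates.dt.quarter)           # calendar quarter of each date
print(dates.dt.is_month_start)    # True when the date is the first of its month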


Sunday, 2 February 2025

Filtering a DataFrame in Pandas

Background

In the last post, we saw the basics of using the pandas library in Python, which is used for data analysis. We saw two basic data structures supported by pandas

  1. Series
  2. DataFrame
In this post, we will further see how we can filter data in a data frame. These are some of the most common operations performed for data analysis.



Filtering a data frame in Pandas

loc & iloc methods

To recap, a data frame is a two-dimensional data structure consisting of rows and columns, so we need a way to filter both efficiently. The two main methods exposed by a data frame for this are

  1. loc - uses row and column labels
  2. iloc - uses row and column indexes

For example:

import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df.loc[["a", "b"], ["colA", "colC"]])
print(df.iloc[:2, :3])

Output:

   colA  colC
a     4     7
b     1     4

   colA  colB  colC
a     4     9     7
b     1     1     4

The loc and iloc methods are frequently used for selecting or extracting a part of a data frame. The main difference is that loc works with labels whereas iloc works with indices.

Selecting subset of columns

We can get a Series (single-column data) from the data frame using df["column_name"]; similarly, we can get a new data frame with a subset of columns by passing a list of the columns needed. For example:

import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[["colA", "colC"]])
print(type(df[["colA", "colC"]]))


Output:

   colA  colC
a     5     2
b     7     2
c     4     3
d     9     1

<class 'pandas.core.frame.DataFrame'>

As you can see from the output, we selected 2 columns - colA and colC - and the result is a new DataFrame object.

Filtering by condition

You can also filter a data frame by conditions. Consider the following example:

import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"] >= 1])


Output:

   colA  colB  colC  colD
a     3     9     5     6
b     8     5     9     6
c     9     4     1     4
d     8     4     3     5

The data frame has randomly generated data, so the output will not be identical across runs, but you can confirm that it will only contain rows where colA is greater than or equal to 1, as specified in the filtering condition.

You can also specify multiple conditions with the & or | operators. Consider the following example
import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[(df["colA"] >= 1) | (df["colB"] <= 5)])



Output:
   colA  colB  colC  colD
a     2     4     4     7
b     4     4     3     6
c     1     9     1     9
d     2     3     8     3

Again, the output will vary due to the randomness of the data, but you should get output that matches the filtering conditions. The following comparison operators are supported
  • ==: equal
  • !=: not equal
  • >: greater than
  • >=: greater than or equal to
  • <: less than
  • <=: less than or equal to
You can also use the .isin method to filter data as follows.
import pandas as pd
import numpy as np
 
df = pd.DataFrame(np.random.randint(1, 10, (4, 4)), index=["a", "b", "c", "d"],
                  columns=["colA", "colB", "colC", "colD"])
print(df[df["colA"].isin([1, 2, 3])])

Output:
   colA  colB  colC  colD
b     1     7     6     4
c     2     4     9     7


