Showing posts with label Interview Questions. Show all posts

Saturday, 19 April 2025

How to approach System design interview questions (HLD)

 Background

Design questions are an integral part of today's IT interview process. Design questions can generally be categorized into
  1.  High-level design (HLD) - High-level architecture design of a system
  2.  Low-level design (LLD) - Low-level class / Component design of the system

In high-level design, you would typically get a question to design a system like YouTube, Google Maps, or an e-commerce website, and you would provide the high-level architecture of the system, starting from the client all the way to the backend systems that process and return the request. The focus is on solving the functional use cases while also addressing non-functional ones like scalability, availability, and performance.

In low-level design, however, you focus more on writing the classes or components that solve the problem statement. For example, if you are designing a chess game, you need to design classes for the various components like the Board and Player, along with the methods/variables used to implement their interactions. The focus is on implementing the solution while sticking to design principles like SOLID.

In this post, we will look at how to approach a high-level design question and the common aspects to cover. We will look at a specific design question and see how to approach it based on the various aspects we discuss here.





How to approach System design interview questions

1. Summarize the requirements of the System and clarify open questions

The first thing you should do once you know the problem statement is summarize the requirements of the system (functional and non-functional) and clarify open questions. Once the requirements are summarized, check with the interviewer whether the list looks complete; if there is anything specific the interviewer wants to discuss, add it to the list of requirements. This is very important for building the foundation of how you approach the question. It will help you in the following ways
  1. Understand all aspects of the problem that you are trying to solve
  2. You can refer to these when you are actually building your system to ensure you have addressed all the requirements you noted earlier.
  3. This shows you can systematically address the question and not jump to solutions.
When you summarize the functional and non-functional requirements, also mention why you think each requirement is important.

Let's see some examples
  1. Let's say you are designing a YouTube video upload and viewing service. 
    1. Functional requirements can be 
      1. The user should be able to upload the video
      2. The user should be able to specify video metadata like title, description, thumbnail, etc.
      3. The user should be able to view the video
      4. The user should be able to generate the caption, etc.
    2. Non-functional requirements can be
      1. The system should be available & fault-tolerant (no downtime)
      2. The system should be able to scale with load
      3. The system should be performant (Less video processing time, less rendering time, etc.)
      4. The system should be durable (Video once uploaded should be persisted)
While summarizing, also make sure you ask open questions, e.g., 
  1. Do we want to support video streaming?
  2. What is the expected system load like? Designing a system for a global user base will require a highly scalable system.
  3. Do we want to consider front-end system design (UX experience) or just the backend design, etc?
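Questions about expected load lead naturally into a quick back-of-envelope estimate. Every number below is an assumed figure for illustration, not real YouTube data:

```python
# Hypothetical load/storage estimate for a YouTube-like service.
# All inputs are assumptions made up for this sketch.
daily_active_users = 100_000_000
watch_requests_per_user_per_day = 5
uploads_per_day = 500_000
avg_video_size_mb = 300

seconds_per_day = 24 * 60 * 60

# Average read traffic; peak traffic is often 2-3x the average
read_qps = daily_active_users * watch_requests_per_user_per_day / seconds_per_day

# New video bytes to durably store every day, in terabytes
daily_storage_tb = uploads_per_day * avg_video_size_mb / 1_000_000

print(f"~{read_qps:,.0f} read QPS on average")
print(f"~{daily_storage_tb:,.0f} TB of new video per day")
```

Numbers like these justify design choices later (e.g., why you need a CDN, horizontal scaling, or object storage like S3).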

2. Take time to think about your high-level approach and design

Once you clarify the requirements, you should take some time to think through what your approach should be to implement a design that solves the problem at hand (Do not jump to giving random suggestions - that's a red flag).

A few things you can think of
  1. What components will be involved in the design (API gateway, web services, persistent storage options, event-driven communication, etc), and how will they interact with each other?
  2. Go through your non-functional requirements and see what is needed to handle those. E.g., for scalability, you might need auto-scaling infrastructure, for availability, you might need a cluster of DB nodes, for performance, you might need in-memory caches, etc.
  3. You can also think about aspects like what APIs are needed, what the API structure will look like, and what the expected inputs and outputs of the APIs are. If you have to store information, think about what the DB schema will look like.
  4. Think about what DB you need - Relational or non-relational, etc.
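As an illustration of the API-shape thinking above, here is a hedged sketch of what a video upload endpoint's request and response could look like. The path, fields, and status code are assumptions for illustration, not a real API:

```python
# Hypothetical request/response shape for a video upload API.
# Endpoint path, field names, and values are illustrative assumptions.
upload_request = {
    "method": "POST",
    "path": "/v1/videos",  # versioned endpoint
    "body": {
        "title": "My first video",
        "description": "A short demo clip",
        "thumbnail_url": "https://example.com/thumb.png",
    },
}

upload_response = {
    "status": 202,  # 202 Accepted: video processing happens asynchronously
    "body": {"video_id": "abc123", "processing_status": "PENDING"},
}

print(upload_response["status"])
```

Returning 202 rather than 200 signals that the expensive processing (transcoding, thumbnail generation) is queued rather than done inline, which ties into the event-driven components discussed below.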
Standard components that you might need
  1. API Gateway: An API gateway is probably the most common component in any architecture; it handles rate limiting, authentication, response caching, and data transformations.
  2. Microservices node cluster: Modern systems use microservices rather than a monolithic architecture. You would typically want autoscaling enabled so that you can scale the microservice nodes on demand. Instead of actual server nodes, you could have a serverless architecture like AWS Lambda.
  3. Storage: You need storage options. It could be a database(Relational or non-relational), storage service like S3, etc. 
  4. Event-driven system: If you need an asynchronous workflow, then you need to decouple your microservice nodes from the component that will process the request, e.g., image or video processing, which are expensive operations. For this, you can use a message queue system like ActiveMQ, AWS SQS, or Kafka.
  5. Caching system: To improve performance, you might want an in-memory caching system like Redis, Memcached, etc.
  6. Cron jobs/ batch infra: For use-cases where you need a periodic job to run, you can use a cron-based scheduler or batch job infrastructure.
  7. Monitoring/Analytics systems: Depending on use cases, you might also need a monitoring setup like monitoring system health,  system load, security alerts in cases like a DOS attack, etc. There could be an analytics requirement like monitoring logs, user access patterns, etc. Eg., AWS CloudWatch.
  8. CDN: Content delivery network or CDN is a set of servers at edge locations (near user locations) that are responsible for serving static content like CSS, images, etc., to reduce the latency.
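To make the caching component concrete, here is a minimal in-process sketch using Python's functools.lru_cache. A production system would more likely use a shared cache like Redis; get_user_profile below is a hypothetical stand-in for an expensive DB or service call:

```python
from functools import lru_cache

backend_calls = 0  # counts how often the "expensive" backend is actually hit

@lru_cache(maxsize=1024)
def get_user_profile(user_id):
    """Hypothetical stand-in for a slow DB/service lookup."""
    global backend_calls
    backend_calls += 1
    return {"id": user_id, "name": f"user-{user_id}"}

get_user_profile(42)
get_user_profile(42)  # repeated call is served from the cache
print(backend_calls)  # 1
```

The same idea scales out: a shared cache sits in front of the database so repeated reads of hot keys never reach it. Cache invalidation and TTLs are the usual follow-up discussion points.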
Concepts that you need to consider for the design
  1. Authentication: As we saw above API gateway will handle authentication. You can have various authentication mechanisms like OAuth 2.0, JWT, etc.
  2. Relational vs NoSQL DB: You need to decide, based on system requirements, whether you need a relational DB like SQL Server, a NoSQL DB like MongoDB, or maybe something else. You can follow the general guidelines below to decide
    1. If you want to scale, e.g., if you are dealing with big data, you would generally go with NoSQL, as a relational DB can be difficult to scale. You can argue that sharding lets you scale a relational DB, and while that is a valid point, it becomes difficult beyond a threshold.
    2. If you do not have a fixed schema, then again, NoSQL is the way to go.
    3. If you need low latency, then again, NoSQL is the way to go.
    4. If you need ACID properties, then you should use a relational DB. For example, if you are designing a payment system, you should use a relational DB because you need strong consistency. If you are ok with eventual consistency, you can go with NoSQL.
    5. If your access queries and patterns are not fixed, then NoSQL is probably a good choice; else if it is fixed relational DB might work out.

  3. CAP theorem: The CAP theorem, also known as Brewer's theorem, states that a distributed system can only guarantee two out of three properties: Consistency, Availability, and Partition Tolerance. This means that designers must make trade-offs when building distributed systems, prioritizing the most important guarantees for their specific needs. 
  4. ACID properties: ACID is an acronym that stands for atomicity, consistency, isolation, and durability. Together, the ACID properties ensure that a set of database operations (grouped together in a transaction) leave the database in a valid state even in the event of unexpected errors. Further, ACID transactions provide the level of transactional guarantees that many regulatory agencies require.
  5. DB concepts: If you do get into DB architecture discussions, you might need the following concepts handy
    1. Sharding
    2. Partitioning
    3. Indexes
    4. Batch writes
    5. Read replicas
  6. API design: You might have to design the APIs as well, for which you need to consider:
    1. API type (Socket communication / REST APIs)
    2. Versioning
    3. Pagination
    4. API protocol (HTTP / HTTPS)
    5. API timeout 
    6. Rate limiting, etc.
  7. UX concepts
    1. Server-side rendering vs client-side rendering: (If SEO, Search engine optimization, is a use case, then we need server-side rendering as 1st render should have all data, which is not the case with client-side rendering)
    2. Debouncing: Call the API or function only after a delay, so that if a user mistakenly clicks the submit button three or four times, the call is made only once. 
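The debouncing idea above can be sketched in Python (a front end would usually do this in JavaScript; the 0.1-second delay and the submit function are illustrative):

```python
import threading
import time

def debounce(wait_seconds):
    """Run the wrapped function only after `wait_seconds` of silence,
    collapsing rapid repeated calls into a single call."""
    def decorator(fn):
        timer = None
        def wrapper(*args, **kwargs):
            nonlocal timer
            if timer is not None:
                timer.cancel()  # a newer call resets the clock
            timer = threading.Timer(wait_seconds, fn, args, kwargs)
            timer.start()
        return wrapper
    return decorator

submitted = []

@debounce(0.1)
def submit(order_id):
    submitted.append(order_id)

for _ in range(4):  # user frantically clicks submit four times
    submit("order-1")

time.sleep(0.3)  # wait past the debounce window
print(submitted)  # ['order-1'] - only one real submission went through
```

Each click cancels the pending timer and starts a new one, so only the last click's timer actually fires.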

3. Implement your design

Once you have thought about your design, you can start implementing it. You can do this over a whiteboard or a Word doc (whatever is available). You should consider following at this stage
  1. While suggesting a component, also give the reason you are suggesting it and the use case it solves. Be as vocal as you can and constantly communicate your thought process.
  2. Expect probing questions that check whether you understand the depth of the solution you are suggesting. So do not suggest anything that you might have heard of but are not sure how it actually works or how it helps solve the use case at hand. You should be confident enough to back up your answers.
  3. Take pauses after periodic discussion to check if the interviewer has any questions about what has been suggested/discussed so far. 
  4. Be prepared for course corrections. The interviewer may ask what you would do if a requirement changes; you should be able to adjust your approach to align with the new goal. This again shows that you can make the necessary course corrections when needed.

4. Summarize your solution

Once implementation is done, go through your initial notes on functional and non-functional requirements and summarize how your system design handles them. 

NOTES:
  1. Design discussion is all about open discussion, unlike problem solving, where you have a specific solution/ direction for the problem.
  2. There can be more than one solution to the problem, so just be honest, confident, and vocal about your answers. Justify and back up your answers.
  3. Never make assumptions or jump to solutions. Clarify all the questions upfront. If you are suggesting something based on an assumption, mention that upfront as well.
  4. Do not propose or suggest anything that you are not aware of or cannot justify. If you suggest something but cannot justify it, it will be considered a red flag. It's ok to admit you are not aware of something specific, like how DB indexes work or how sharding works. 


Wednesday, 26 March 2025

SOLID principles in Object Oriented Programming

 Background

This post looks at the 5 SOLID design principles used for object-oriented software design. These principles are
  • S: Single Responsibility Principle (SRP)
  • O: Open/Closed Principle
  • L: Liskov’s Substitution Principle (LSP)
  • I: Interface Segregation Principle (ISP)
  • D: Dependency Inversion Principle (DIP)
Whenever you write new code or design new software, ensure you respect the SOLID principles mentioned above. This helps create clean and modular code that will help you maintain and scale it in the future. Let's see each of these in detail.




Single Responsibility Principle (SRP)

This principle states that - "A class should have a single responsibility". 

Let us take an example to understand this better. Let's assume you are building software for a restaurant and you create a class called Worker which has methods like prepare_food & serve_food.
class Worker:
    def prepare_food(self):
        print("Preparing Food")

    def serve_food(self):
        print("Serving Food")  


What do you think about the above design? It's bad, but why? It's bad because it tightly couples 2 responsibilities - preparing food and serving food - which are independent tasks. Now if you make any changes to how you prepare food in the future, you also need to test and ensure that the serve_food functionality is not impacted. This unnecessary coupling makes the code hard to maintain. A better design would be to have a separate class for each
  • Class Chef - For Preparing Food
  • Class Waiter - For serving Food
That way there is separation of concerns, and each class has the logic for the business use case it is responsible for. It's easy to maintain & scale.
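The refactoring described above could look like this (a minimal sketch; the Chef and Waiter classes mirror the bullet points):

```python
class Chef:
    """Responsible only for preparing food."""
    def prepare_food(self):
        return "Preparing Food"


class Waiter:
    """Responsible only for serving food."""
    def serve_food(self):
        return "Serving Food"


# Changes to food preparation now only touch Chef;
# Waiter can evolve and be tested independently.
print(Chef().prepare_food())
print(Waiter().serve_food())
```

Each class now has exactly one reason to change, which is the essence of SRP.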

Open/Closed Principle

This principle states that - "A class should be open for extension but closed for modification". 

Let's see an example of this as well. Let's again take our above example of writing software for restaurant business. Let's say we are implementing a Chef class below.
class Chef:
    def prepare_chinese_food(self):
        print("Preparing Chinese Food")  




Do you see any potential problems with the above? What happens when your restaurant starts preparing continental cuisine? You will add another method called prepare_continental_food to our Chef class. The number of methods in the class will keep growing, as we will have to add one each time we support a new cuisine. This is bad because you are modifying code that is already tested and functioning, which means adding continental cuisine support has put the original implementation supporting Chinese at risk. 

To fix this, a better way would be to extend the Chef class to support the new functionality
class Chef:
    def prepare_food(self):
        print("Preparing Chinese Food")


class ContinentalChef(Chef):
    def prepare_food(self):
        print("Preparing Continental Food")


chef = ContinentalChef()
chef.prepare_food()


As you can see, you can now extend the base class Chef and add your new functionality without impacting the original implementation. If, for an additional feature, you have to make a lot of modifications to your original code, it is definitely violating this principle; step back and see how you can make your code generic so that fewer code changes are needed for future enhancements.

Liskov’s Substitution Principle (LSP)

This principle states - "Derived or child classes must be substitutable for their base or parent classes".

For this, we can again use the above example. What would have happened if we had implemented our Chef class a little differently?
  
class Chef:
    def prepare_chinese_food(self):
        print("Preparing Chinese Food")


class ContinentalChef(Chef):
    def prepare_chinese_food(self):
        print("Preparing Continental Food")

    def prepare_continental_food(self):
        print("Preparing Continental Food")


chef = ContinentalChef()
chef.prepare_chinese_food()


This is also bad because you have overridden the prepare_chinese_food functionality to actually prepare continental food. The caller of the Chef class will call prepare_chinese_food under the impression that Chinese food will be prepared, but due to the way the above classes are implemented, it will do the unexpected thing (prepare continental cuisine).


The principle says you should be able to use ContinentalChef anywhere you have used the Chef class without breaking the software contract. If you really intend ContinentalChef not to prepare Chinese food, you can raise an exception from the prepare_chinese_food method of the child class.

Interface Segregation Principle (ISP)

This principle states - "Do not force any client to implement an interface which is irrelevant to them".

This principle is very similar to the Single Responsibility Principle we saw above, except it talks about interfaces rather than classes.

For example, let's say you have a Chef class and a Cuisine interface with methods like create_chinese, create_continental, etc. Instead of having a single interface defining all the abstract methods, it's better to have separate interfaces - ChineseCuisine, ContinentalCuisine, etc. That way your concrete classes can choose to implement the ones that are correct for their use cases.

In the real world, a chef can know both cuisines or a single cuisine. By creating a single interface, you force the concrete class to implement both methods even if it does not support both cuisines, which is bad. A concrete class should be free to choose which interfaces to implement based on the cuisines the chef can prepare.
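A sketch of the segregated interfaces described above, using Python's abc module (ChineseChef and AllRounderChef are illustrative names, not part of the original example):

```python
from abc import ABC, abstractmethod

class ChineseCuisine(ABC):
    @abstractmethod
    def create_chinese(self): ...

class ContinentalCuisine(ABC):
    @abstractmethod
    def create_continental(self): ...

# A chef who knows only one cuisine implements only that interface
class ChineseChef(ChineseCuisine):
    def create_chinese(self):
        return "Chinese dish"

# A chef who knows both cuisines opts into both interfaces
class AllRounderChef(ChineseCuisine, ContinentalCuisine):
    def create_chinese(self):
        return "Chinese dish"

    def create_continental(self):
        return "Continental dish"

print(ChineseChef().create_chinese())
print(AllRounderChef().create_continental())
```

With one fat Cuisine interface, ChineseChef would be forced to stub out create_continental; splitting the interfaces removes that pressure.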

Dependency Inversion Principle (DIP)

This principle states - "High-level modules should not depend on low-level modules. Both should depend on abstractions"


This one is actually my favorite. Let's see an example to understand it. Let's say you have a Restaurant class and it has a ChinesePreparingChef as an instance member. Tomorrow, if the chef changes to a ContinentalPreparingChef, you need to modify the Restaurant code to use the new class. 

Instead of this, you should define a base class called Chef and create two concrete classes - ChinesePreparingChef & ContinentalPreparingChef. Restaurant will just hold a reference to a Chef instance, which can be either of the concrete classes depending on the use case.
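A minimal sketch of this arrangement (method names are illustrative):

```python
from abc import ABC, abstractmethod

class Chef(ABC):
    @abstractmethod
    def prepare_food(self): ...

class ChinesePreparingChef(Chef):
    def prepare_food(self):
        return "Chinese food"

class ContinentalPreparingChef(Chef):
    def prepare_food(self):
        return "Continental food"

class Restaurant:
    def __init__(self, chef: Chef):  # depends on the abstraction only
        self.chef = chef

    def serve(self):
        return self.chef.prepare_food()

# Swapping chefs requires no change to the Restaurant class
print(Restaurant(ChinesePreparingChef()).serve())
print(Restaurant(ContinentalPreparingChef()).serve())
```

The high-level Restaurant and the low-level chef classes both depend on the Chef abstraction, which is exactly what DIP asks for.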




Saturday, 25 January 2025

Using 'Union find' pattern for problem solving questions

 About the pattern

The pattern is used to group elements into sets based on specified properties. Each element is unique, and the sets are non-overlapping, meaning each set has distinct elements. Each set forms a tree data structure; each element has a parent, and the root of the tree is called the representative element of that set. The parent of this representative node is the node itself. If we pick any element of a set and follow its parent pointers, we will always reach the representative element of the set (the root of the tree representing the disjoint set).

The pattern is implemented using two methods
  1. find(node): Finds the representative of the set containing the node.
  2. union(node1, node2): Merges the sets containing node1 and node2 into one.

Let's say we have an array representing numbers 0 to 5. Before we start applying this pattern each element has itself as the parent.

This means we have 6 different unique sets, each containing one element. Now we want to start grouping them into a single set (typically we will have more than one set, as we will see with an actual example shortly, but for this example consider that we want to create a single set).


  1. Do union(0, 1): This will merge nodes 0 and 1 together. Change the representative of 0 to 1.


  2. Similarly do
    1. union(1,2)
    2. union(2,3)
    3. union(4,5): Notice how we merged 4 and 5 instead of 3 and 4. This is done just to give you an idea that an element can merge into any disjoint sets based on the property under consideration.
    4. union(3,5)
At the end, we have a tree below:


As you would have imagined by now, as you build this disjoint set, the height of the tree can be O(N) in the worst case, and so is the time complexity of this pattern.

Optimization: We can optimize this further by maintaining a rank for each element, denoting the number of nodes beneath it. The rank of the representative node then denotes the size of the tree it represents (the number of nodes it has). We can use this rank to decide which of the two representative nodes to use as the new representative node while merging the trees corresponding to the two disjoint sets. E.g., above, node 5 at the end of the iteration has a rank of 6 (5 nodes under it + 1 counting itself).

With the above optimization, you select the representative node with the higher rank among the two being merged as the new representative node. So it will look something like below:




As you would have guessed by now, the tree is balanced with the above approach, which guarantees that the time complexity of a find is reduced to O(log N) - the height of the tree.

So if I have to find the representative of 4, I will find its parent, which is 5, and then recursively find its parent until we reach the root, which in this case is 1. Once the root is reached (the node with itself as a parent), we return that node as the representative node. As we traverse the height of the tree, the time complexity is O(log N).
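A further common optimization (beyond union by rank) is path compression: while finding the representative, re-point every visited node directly at the root, flattening the tree so later lookups become nearly constant time. A minimal sketch of just the find step:

```python
def find(parent, v):
    # Walk up to the root; on the way back, re-point each visited
    # node directly at the root (path compression)
    if parent[v] != v:
        parent[v] = find(parent, parent[v])
    return parent[v]

parent = [0, 0, 1, 2]          # chain: 3 -> 2 -> 1 -> 0
print(find(parent, 3))          # 0
print(parent)                   # [0, 0, 0, 0] - the whole chain is flattened
```

Combined with union by rank, this makes each operation effectively amortized O(α(N)), which is practically constant.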

Using 'Union find' pattern

Now let's try to use this pattern to solve an actual problem-solving question

Problem statement:
For a given integer, n, and an array, edges, return the number of connected components in a graph containing n nodes. 
The array edges[i] = [x, y] indicates that there’s an edge between x and y in the graph.

Constraint:
  • 1<=n<=1000
  • 0<=edges.length<=500
  • edges[i].length == 2
  • 0<=x,y<n
  • x!=y
  • No repeated edges
Notice how the problem statement says that the elements (vertices of the graph) are unique, and there are no repeated edges. Let's see how we can implement this using the union-find method.

Solution:
class UnionFind:

    def __init__(self, n):
        # Initially every node is its own parent, i.e., its own set
        self.parent = list(range(n))
        self.rank = [1] * n
        self.connected_components = n

    def find(self, v):
        # Follow parent pointers until we reach the representative
        if self.parent[v] != v:
            return self.find(self.parent[v])
        return v

    def union(self, x, y):
        p1, p2 = self.find(x), self.find(y)
        if p1 != p2:
            # Merge the smaller tree into the larger one (union by rank)
            if self.rank[p1] < self.rank[p2]:
                self.parent[p1] = p2
                self.rank[p2] += self.rank[p1]
            else:
                self.parent[p2] = p1
                self.rank[p1] += self.rank[p2]
            self.connected_components -= 1


def count_components(n, edges):
    uf = UnionFind(n)
    for v1, v2 in edges:
        uf.union(v1, v2)
    return uf.connected_components


If we run the above for the following data sets it gives the correct results:
Example 1:
  • Input: 5 , [[0,1],[1,2],[3,4]] 
  • Output: 2
Example 2:
  • Input: 6 , [[0,1],[3,4],[4,5]]
  • Output: 3
Now let's try to understand the solution and implementation of the pattern. The core logic of the union pattern is in the UnionFind class. In the constructor, we have initialized 3 things
  1. parent - a list that tracks the parent of each element
  2. rank - a list that tracks the rank of each element
  3. connected_components - the number of connected components

At the start, each node has itself assigned as its parent, and the rank of each node is 1. Similarly, since each node is separate at the start, the number of connected components is equal to the number of nodes.

As we iterate over the edges, we pass each one to the union method to merge its endpoints into a single set - remember, the pattern is used to split unique elements into disjoint sets (connected components in this case). Each time we merge two vertices that form an edge, we reduce connected_components by 1, since two previously separate sets are merged into one (representing a connected component). At the end, when every edge has been processed, the disjoint sets represent the connected components, so we simply return connected_components, which we have been using to track the count.

Also, notice how we use rank to ensure that while merging two disjoint sets, we set the parent to the one with the higher rank, so that the resultant tree stays balanced and, consequently, find() operations take O(log N).








Thursday, 23 May 2024

Stack and Queues in Python

 Background

In the last few posts, we saw how to work with classes in Python and how they can be used in an object-oriented way. In this post, I will explain how to use stack and queue implementations in Python. This is needed if you are doing problem-solving questions in Python or using Python for coding in general.

Remember:

  • Stack:  "Last in First out"
  • Queue:  "First in First out"



Using stack in Python 

A stack works on "Last in First out" logic and is pretty straightforward: you can use a standard Python list as your stack. See the Python code below.

stack = []
stack.append("Aniket")
stack.append("Abhijit")
stack.append("Awantika")
print(stack)
print("Pushing Anvi to Stack")
stack.append("Anvi")
print(stack)
print(f"Popped {stack.pop()}")
print(stack)
print(f"Popped {stack.pop()}")
print(stack)

This prints the following:

['Aniket', 'Abhijit', 'Awantika']
Pushing Anvi to Stack
['Aniket', 'Abhijit', 'Awantika', 'Anvi']
Popped Anvi
['Aniket', 'Abhijit', 'Awantika']
Popped Awantika
['Aniket', 'Abhijit']

As you can see initial values in the stack were "Aniket", "Abhijit" & "Awantika". Then we pushed "Anvi" to the stack. Then we popped which gave us "Anvi", then we popped again and we got "Awantika". Remember stack is "Last in First out" and the above behavior is in line with it.

NOTE: A list performs best at its end, so append and pop operations take O(1) time. So if you need a stack implementation, go ahead and use a list. You could also use a deque as a stack with the same O(1) complexity - it has append and pop methods too. (That's not the case for a queue - see below.)
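For completeness, here is the same stack usage with a deque; the behavior matches the list version:

```python
from collections import deque

stack = deque()
stack.append("Aniket")
stack.append("Anvi")
print(stack.pop())  # Anvi - last in, first out, O(1)
print(stack.pop())  # Aniket
```

Both list and deque work as stacks; deque becomes the clear winner only when you also need O(1) operations at the front.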

Using Queue in Python

A queue is "First in First out". You could use a Python list for a queue as well, but remember the note above - a list performs best at its end. It does not work well at the start: if you remove an element from the beginning, all remaining elements must be shifted back one place, making the time complexity O(N).

You should use the deque implementation from the collections module for O(1) push/pop operations on a queue. I will show both use cases below. 

Using a list for queue implementation (Not recommended)

See below the Python code:

queue = []
queue.append("Aniket")
queue.append("Abhijit")
queue.append("Awantika")
print(queue)
print("Pushing Anvi to Queue")
queue.append("Anvi")
print(queue)
print(f"Popped {queue.pop(0)}")
print(queue)
print(f"Popped {queue.pop(0)}")
print(queue)


The output in this case is:

['Aniket', 'Abhijit', 'Awantika']
Pushing Anvi to Queue
['Aniket', 'Abhijit', 'Awantika', 'Anvi']
Popped Aniket
['Abhijit', 'Awantika', 'Anvi']
Popped Abhijit
['Awantika', 'Anvi']

As you can see this time it popped "Aniket" & "Abhijit" which are first in the list (Remember - "First in First out").

Using deque for queue implementation (recommended)

A deque (short for "double-ended queue") is a generalization of stacks and queues. Deques support thread-safe, memory-efficient appends and pops from either side of the deque with approximately the same O(1) performance in either direction. (See the documentation for more details.)

NOTE: Though list objects support similar operations, they are optimized for fast fixed-length operations and incur O(n) memory movement costs for pop(0) and insert(0, v) operations which change both the size and position of the underlying data representation.


See below the Python code:

from collections import deque

queue = deque([])
queue.append("Aniket")
queue.append("Abhijit")
queue.append("Awantika")
print(queue)
print("Pushing Anvi to Queue")
queue.append("Anvi")
print(queue)
print(f"Popped {queue.popleft()}")
print(queue)
print(f"Popped {queue.popleft()}")
print(queue)

The output is the same as above (using a list as a queue), but internally the pop operation is O(1) for a deque, unlike O(N) for a list.

['Aniket', 'Abhijit', 'Awantika']
Pushing Anvi to Queue
['Aniket', 'Abhijit', 'Awantika', 'Anvi']
Popped Aniket
['Abhijit', 'Awantika', 'Anvi']
Popped Abhijit
['Awantika', 'Anvi']

Add in the comments if you have any questions.


Saturday, 2 September 2017

Understanding database indexes - Part 2

Background

Some time back, we took a look at what database indexing is and how it works internally.
In this post, we will look at database indexing more from a development and design perspective. Many of you might be under the impression that database indexes, tables, performance, and lookups are the responsibility of database admins. Though this might be true to some extent, index selection and constructing the where clause are the developer's responsibility, and a poor choice of indexes and where clauses often leads to performance issues causing queries to run slow. So whenever you develop an application that requires database interaction, it is very important as a developer to design your indexes first. How do we do that? - We will see that shortly. 

Understanding database indexes - Part 2

An index lookup requires 3 steps -
  1. B-Tree traversal to the leaf node
  2. Traversal along the leaf node chain
  3. Access of actual table data for each matching leaf node entry
Step 1 is limited in size, as the tree height/levels are bounded by the log N constraint. For millions of rows there may be 3-4 levels in the tree; it is extremely rare to see a B-Tree with more than 5 levels.
You can use the following query to see the B-Tree levels for your index -

SELECT index_name, blevel+1 FROM user_indexes ORDER BY 2;


blevel gives you the number of levels of your B-Tree index; plus one is for the leaf node. So this is the number of levels that need to be traversed to reach an index entry at the leaf node (considering a unique scan).

Steps 2 and 3 can vary, and in most cases they are the causes of slow index lookups resulting in slow-running queries.

Let's start understanding by taking an actual example. Let's create a table as follows -

create table schema8.EMPLOYEE(ID int, name varchar2(255),age int, department varchar2(255), salary int);
alter table schema8.EMPLOYEE ADD CONSTRAINT PRIMARY_KEY PRIMARY KEY (ID); 

CREATE UNIQUE INDEX schema8.UX_EMPLOYEE_1 ON schema8.EMPLOYEE (name, age, department);
ALTER TABLE schema8.EMPLOYEE ADD CONSTRAINT UK_EMPLOYEE_1 UNIQUE (name, age, department) USING INDEX schema8.UX_EMPLOYEE_1;


Let's add some data in it -

insert into schema8.EMPLOYEE values(1,'Aniket',26,'IT',100);
insert into schema8.EMPLOYEE values(2,'John',29,'FINANCE',40);
insert into schema8.EMPLOYEE values(3,'Sam',27,'IT',101);
insert into schema8.EMPLOYEE values(4,'Ron',30,'ACCOUNTING',35);
insert into schema8.EMPLOYEE values(5,'Sky',33,'DEVOPS',62);
insert into schema8.EMPLOYEE values(6,'Paul',26,'FINANCE',43);
insert into schema8.EMPLOYEE values(7,'Dan',24,'IT',100);
insert into schema8.EMPLOYEE values(8,'Jess',25,'ACCOUNTING',37);
insert into schema8.EMPLOYEE values(9,'Troy',31,'FINANCE',41);
insert into schema8.EMPLOYEE values(10,'Mike',28,'IT',103);
insert into schema8.EMPLOYEE values(11,'Anuj',28,'DEVOPS',64);
insert into schema8.EMPLOYEE values(12,'Vinit',29,'FINANCE',48);
insert into schema8.EMPLOYEE values(13,'Sudhir',29,'ACCOUNTING',39);
insert into schema8.EMPLOYEE values(14,'Anish',28,'IT',100);
insert into schema8.EMPLOYEE values(15,'Shivam',25,'DEVOPS',61);
insert into schema8.EMPLOYEE values(16,'Monica',26,'ACCOUNTING',30);
insert into schema8.EMPLOYEE values(17,'Ramji',32,'FINANCE',41);
insert into schema8.EMPLOYEE values(18,'Anjali',34,'ACCOUNTING',38);
insert into schema8.EMPLOYEE values(19,'Payas',26,'IT',100);
insert into schema8.EMPLOYEE values(20,'Zara',27,'DEVOPS',60);


Normal index


Let's start with a simple query -

select * from schema8.EMPLOYEE where department='IT';


It gives 6 rows. What we really want to understand is the performance of the query and whether we can improve it. To understand the query's performance, we need to look at the execution plan chosen by the SQL optimizer. In SQL Developer you can just
  • Select the query -> Right click -> Explain -> Explain Plan



And you should see the plan that was selected to run this query and the associated cost.

So for the above query the execution plan is -



As you can see, a "FULL TABLE SCAN" was selected. Since your where clause has the department column in it, there was no other option. The unique index starting with name could not be used, and neither could the primary key index (an index is always created on the primary key - id in this case). So it had to go for a full table scan, which is obviously expensive. You can see the cardinality is 6, which means there are 6 rows that satisfy the "department='IT'" clause, and the cost is also high.

Let's do something about this. Let's create an index on the department column and then inspect the plan again.

create index X_EMPLOYEE_DEPT on schema8.EMPLOYEE(department);


and now let's see the execution plan -



Better? Our cost is now reduced by half. As you can see, this time our new index was used for the lookup - a "RANGE SCAN". So full table access was avoided. Recall our earlier discussion on the steps needed for an index lookup -
  1. It used the index to get to the leaf node
  2. Traveled along the leaf node linked list to find all nodes with department='IT' ("RANGE SCAN")
  3. Finally, for each index entry, accessed the actual table row using its rowid to get the other table data ("BY INDEX ROWID BATCHED") (batched because data for all rowids is retrieved in a single call)
Hope this clears up how indexes help queries execute faster.
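The same before-and-after effect can be reproduced in any database that exposes its query plans. Here is a minimal sketch using Python's built-in sqlite3 module (used purely as a throwaway stand-in for Oracle - SQLite reports SCAN for a full table scan and SEARCH ... USING INDEX for an index range scan, instead of Oracle's FULL TABLE SCAN / RANGE SCAN):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee
                (id INTEGER PRIMARY KEY, name TEXT, age INT, department TEXT, salary INT)""")
conn.executemany("INSERT INTO employee VALUES (?,?,?,?,?)",
                 [(1, 'Aniket', 26, 'IT', 100), (2, 'John', 29, 'FINANCE', 40),
                  (3, 'Sam', 27, 'IT', 101), (4, 'Ron', 30, 'ACCOUNTING', 35)])

def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a human-readable step
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# No index on department yet -> SQLite has to scan the whole table
print(plan("SELECT * FROM employee WHERE department='IT'"))

conn.execute("CREATE INDEX x_employee_dept ON employee(department)")

# The same query now does an index search (SQLite's equivalent of a range scan)
print(plan("SELECT * FROM employee WHERE department='IT'"))
```

Running this prints a table-scan step first and an index-search step naming `x_employee_dept` after the index is created.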

NOTE :
Cardinality is the estimated number of rows a particular step will return.
Cost is the estimated amount of work the plan will do for that step.
Higher cardinality means more work, which means a higher cost associated with that step.
A lower cost query will run faster than a higher cost query.

Primary key index 

As you know, the primary key has an index created by default. Let's try to query the table using the primary key and see its execution plan -

select * from schema8.EMPLOYEE where id=12;



As expected, the cost has gone down further. Since the primary key index is a unique index (the primary key itself being unique), the execution plan went for a "UNIQUE SCAN" and then a simple "BY INDEX ROWID" (no batched lookup here, since there will be just one entry given that it is unique). So again, if you recollect the index lookup steps, this consists of -
  1. Use the unique index to reach the leaf node (just one leaf node) ("UNIQUE SCAN")
  2. Get the table data ("BY INDEX ROWID")
Notice how there was no traversal among leaf nodes and no consequent batched access by rowids.

Unique key index

This would again be the same as the primary key index case, since a primary key index is also a unique key index, but let's give it a try since we have defined a unique key for our table on - name, age, department

select * from schema8.EMPLOYEE where name='Aniket' and age=26 and department='IT';

And execution plan for this is -



As expected, it is the same as the primary key case, except that instead of the primary key index it used the unique index we created ourselves. The steps are the same too.

Composite index

Remember our unique index - name, age, department. We saw in the 1st case, where we had department in the where clause (before creating an index on department), that this particular index was not used and a full table scan was performed.

If you recollect from our previous discussion, index column order matters. If the column order had been department, name, age, this index would have been picked up. Anyway, let's try based on what we already have. Now we are going to add name to the where clause, and based on our existing knowledge our unique index should get picked up (since it starts with the name column) -

select * from schema8.EMPLOYEE where name='Aniket'; 

Execution plan -



As expected, our index got used and a full table scan was avoided. However, if you use only age in the where clause, no index will be used - since none of your indexes starts with age. Try it yourself!

Order of columns in where clause does not matter

We saw how the order of columns matters in index creation. The same does not apply to the where clause column order. E.g. consider -

select * from schema8.EMPLOYEE where name='Aniket' and age=26;
select * from schema8.EMPLOYEE where age=26 and name='Aniket';


The SQL optimizer is intelligent enough to figure out that name is one of the columns in the where clause, that there is an index on it that can be used for the lookup, and it does so -

For each query above execution plan is -



And that concludes it - the order of columns in the where clause does not matter!

Multiple indexes applicable - select the one with least cost

So far we have seen various cases in which just one index was applicable. What if there are two? Let's say you use department and name in your where clause. Now there are 2 options -
  1. Use the unique index starting with name
  2. Use the index on department
Let's see how it works out -

select * from schema8.EMPLOYEE where name='Aniket' and department='IT';


Execution plan -


As you can see, the index on name was selected, and once leaf nodes were retrieved a filter was applied on them to get the rows with department='IT', followed by a batched rowid access to get all the table data. This index was probably selected because unique indexes are given preference over non-unique indexes, since they are faster. But it is entirely up to the SQL optimizer to figure that out based on execution cost.

Covered indexes and queries

In our previous discussion we saw what covered indexes/queries are. They are basically indexes that hold all the data that needs to be retrieved, so there is no need to access the actual table by rowid. For example, consider -

select name, age,department from schema8.EMPLOYEE where name='Aniket';

Take some time to think this through based on our discussion so far. We know the where clause has name in it. We also know we have a unique index that starts with name, so it will be used. On top of that we have an added bonus - we just need name, age, and department, which are already part of that unique index. So we really don't need to access the actual table data to get any other content.

Let's see this execution plan -


As expected, there is no "BY INDEX ROWID" or "BY INDEX ROWID BATCHED". That's because table access is not needed, since all the required data is in the index itself. Also note the range scan - even though a unique index is used, no unique row is returned, since only part of the unique index is used.
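SQLite makes the same distinction visible: when every selected column lives in the index, the plan says USING COVERING INDEX and the table itself is never touched. A small sketch using Python's built-in sqlite3, purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employee
                (id INTEGER PRIMARY KEY, name TEXT, age INT, department TEXT, salary INT)""")
conn.execute("INSERT INTO employee VALUES (1, 'Aniket', 26, 'IT', 100)")
conn.execute("CREATE UNIQUE INDEX ux_employee_1 ON employee(name, age, department)")

def plan(sql):
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# All selected columns (name, age, department) are inside the index itself,
# so no table row lookup is needed - SQLite calls this a COVERING INDEX
print(plan("SELECT name, age, department FROM employee WHERE name='Aniket'"))

# Selecting salary too forces a lookup of the actual table row
print(plan("SELECT name, age, department, salary FROM employee WHERE name='Aniket'"))
```

The first plan mentions a covering index; the second falls back to a plain index search plus a row lookup.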

Summing Up

So to sum up execution plan can do -
  • FULL TABLE SCAN or
  • UNIQUE SCAN or
  • RANGE SCAN
 and then access the table data (if needed) with -
  • BY INDEX ROWID or
  • BY INDEX ROWID BATCHED
In either case, you need to choose your indexes wisely based on your where clauses and their combinations. The order of columns in an index is very important. Always go for an index that starts with the column used in most of your queries' where clauses. Also, put equality columns before range columns. So if your where clause is something like "where name='Aniket' and age>24", always order name before age in the index, since the equality will give fewer results to filter from. Age will be applied as a filter in the above case.

Related Links

Sunday, 27 August 2017

Understanding having clause in SQL

Background

If you have written queries or worked on a project that requires database support, then you must have used, or at least be familiar with, the having clause. This is also one of the popular interview questions for beginners to test database knowledge, if you ask me. The simple syntax looks like -


SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);


So your first and foremost answer is that you use the having clause with the "group by" clause. How and why, we will come to later in this discussion. That said, before we proceed make sure you know what the group by clause does. You also need to have an idea of what aggregate functions are, e.g. min(), max(), count(), sum(), etc.


Understanding having clause in SQL

So far we know that we use the having clause with the group by clause. Let's answer the question why.

Let's say you have a table employee which has basic data of an employee - id, name, department, etc.

Problem statement : Now we are interested in finding out how many employees there are in each department, and we'd like the results sorted so that the department with the maximum employees is displayed first. How would you do this? Using the following query -

select department, count(*) from employee group by department order by count(*) desc;

This works fine.  Now let's redefine our problem statement.

Problem statement :  Let's say we now want the same thing - department and number of employees in each department, sorted in descending order. However, this time we have an additional constraint: we want to see only those departments that have more than 10 employees. You would probably try -

select department, count(*) from employee where count(*) > 10 group by department order by count(*) desc;

Problem : Does not work
Error : ORA-00934: group function is not allowed here
             00934. 00000 -  "group function is not allowed here"
(The above error is shown for Oracle database)

NOTE :  An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference

The problem is that you cannot use aggregate functions in the where clause. The solution? - The having clause. This is exactly why the having clause was introduced. Once you have applied the group by clause and wish to filter the results further, you use the having clause. So the correct query would be -


select department, count(*) from employee group by department having count(*) > 10 order by count(*) desc;

Aggregate functions are allowed in the having clause. So let's go over the original syntax again and see how it works -

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);

  1. First you select column_name(s) from the table that match the where clause condition
  2. Once that result is obtained, the group by clause is applied to it to get the next set of results.
  3. Once that is done, the condition in the having clause is applied to further filter the results.
  4. Finally, the order by clause is applied to sort the results as required and return them.

NOTE : The where clause is applied to filter results before the group by clause is applied, whereas the having clause is applied after.

Another alternative can be -

select department, cnt from (
select department, count(*) cnt from employee group by department )
where cnt > 10;
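The whole flow can be checked end to end with a few lines of Python and its built-in sqlite3 module (used here just as a convenient throwaway database - the table contents below are made up so that only one department crosses the threshold):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, department TEXT)")
# 12 employees in IT, 5 in ADMIN - only IT should survive the HAVING filter
rows = [(i, "emp%d" % i, "IT") for i in range(1, 13)] + \
       [(i, "emp%d" % i, "ADMIN") for i in range(13, 18)]
conn.executemany("INSERT INTO employee VALUES (?,?,?)", rows)

result = conn.execute("""SELECT department, COUNT(*)
                         FROM employee
                         GROUP BY department
                         HAVING COUNT(*) > 10
                         ORDER BY COUNT(*) DESC""").fetchall()
print(result)   # [('IT', 12)]
```

ADMIN, with only 5 employees, is grouped but then filtered out by the having condition.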

Related Links

Friday, 18 August 2017

Find fastest 3 horses out of 25 horses puzzle

Question



You have 25 horses and you need to pick the fastest 3. In each race you can race a maximum of 5 horses, as there are 5 tracks available. What is the minimum number of races needed to find the fastest 3 without using a stopwatch?

Solution

7 races are needed.

Understanding

Since we don't have a stopwatch, the only way to find the fastest is by racing the horses. Let's have 5 races, each of 5 horses, and let the following be the results -

  • A > B > C > D > E
  • F > G > H > I > J
  • K > L > M > N > O
  • P > Q > R > S > T
  • U > V > W > X > Y
Above are the results of each race. We have had 5 races so far. Now let's race the fastest of all the previous 5 races, i.e. A, F, K, P, U. That makes 6 races. Let's say the result of that is -
  • F > K > A > P > U

Since we are interested in the top 3, P and U are of no use to us. Also, since P and U were the fastest in their groups, we can ignore all the other members of their groups too. So now the only horses under consideration are -

  • F > G > H > I > J
  • K > L > M > N > O
  • A > B > C > D > E
Now if we consider A as a possible candidate for the top 3, the others in A's group are redundant - we already have F and K faster than A. So we can ignore B, C, D, and E.

Now horses under consideration are  -

  • F > G > H > I > J
  • K > L > M > N > O
  • A
Now in the group led by K, the possible candidates are K and L - since F is already faster than K, if we count K and L we already have 3 horses ahead of M. So we can ignore M, N, and O. The remaining horses are -

  • F > G > H > I > J
  • K > L
  • A
Now let's consider the group led by F. We can consider G and H as possible candidates for the top 3; if they are in it, I and J are redundant. So the remaining horses are -

  • F > G > H
  • K > L
  • A
We already know the fastest among all is F, since we got that result by racing the fastest of each group. So the only horses we need to compare for the 2nd and 3rd positions are - G, H, K, L, A

These are 5 horses, so we can have one more race and find the top 2 among them. Let's say the result was -
  • L > H > K > G > A
We have now done 7 races. So the fastest of these are L and H, and we already have the fastest among all - F. So the final answer is -
  • F > L > H
So the answer is 7 races.
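The elimination argument above can be sanity-checked with a short brute-force simulation (Python used just for illustration; the horse names and random speeds are made up):

```python
import random

def top3_in_7_races(speeds):
    """speeds: dict of horse -> speed. Returns the top 3 using exactly 7 races."""
    race = lambda group: sorted(group, key=lambda h: speeds[h], reverse=True)
    horses = list(speeds)
    groups = [horses[i * 5:(i + 1) * 5] for i in range(5)]
    results = [race(g) for g in groups]                  # races 1-5
    winners = race([r[0] for r in results])              # race 6: the group winners
    by_winner = {r[0]: r for r in results}
    g1, g2, g3 = (by_winner[w] for w in winners[:3])
    # Candidates for 2nd/3rd: 2nd and 3rd of the winner's group,
    # 1st and 2nd of the runner-up's group, 1st of the third group
    final = race([g1[1], g1[2], g2[0], g2[1], g3[0]])    # race 7
    return [winners[0], final[0], final[1]]

speeds = {"H%d" % i: random.random() for i in range(25)}
expected = sorted(speeds, key=speeds.get, reverse=True)[:3]
print(top3_in_7_races(speeds) == expected)   # True
```

Running this over many random speed assignments always recovers the true top 3 in exactly 7 races.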

10 coins puzzle

Question



There are 10 coins placed in front of you, 5 of which are heads and 5 of which are tails. You are blindfolded, so you don't know which ones are which. You need to make two piles of coins so that both have an equal number of heads. You are allowed to flip a coin any number of times. You obviously won't know which is heads and which is tails by touching them.

Solution

Make two piles of coins of equal size (5 each) and then flip all the coins in one of the piles.

Understanding

Let's say you split them into two equal piles. Pile 1 has 3 heads and 2 tails. Since there were 5 heads and 5 tails in total, pile 2 must have 3 tails and 2 heads. Now, when you flip every coin in pile 1, it will have 3 tails and 2 heads - the same as pile 2.

Generally, if there are n heads and 5-n tails in pile 1, then pile 2 will have n tails and 5-n heads. When we flip all the coins in pile 1, it will have n tails and 5-n heads, which is the same as pile 2.
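A quick simulation confirms the trick works for every arrangement (Python used just to illustrate the argument):

```python
import random

def split_and_flip(coins):
    """coins: list of 10 booleans (True = heads). Split into two piles of 5
    and flip every coin in the first pile."""
    pile1, pile2 = coins[:5], coins[5:]
    return [not c for c in pile1], pile2

for _ in range(5):
    coins = [True] * 5 + [False] * 5   # 5 heads, 5 tails
    random.shuffle(coins)              # blindfolded: arrangement is unknown
    pile1, pile2 = split_and_flip(coins)
    print(sum(pile1) == sum(pile2))    # True - both piles end up with equal heads
```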

Burning island puzzle

Question



A man is stranded on an island covered in forest. The wind is blowing from the west. Lightning strikes the west side of the forest and fire starts spreading with the wind. The fire will burn the whole forest, killing the man in the process. There are cliffs around the island, so the man cannot escape. How can the man survive the fire?

Solution

The man picks up a log and lights it with the fire from the west end. Then he runs towards the east and lights that part of the forest. This burns up the eastern end of the forest, and the man can then take shelter in that burnt area while the fire coming from the west burns the remaining forest.

Four men in hats puzzle

Question



As shown in the picture above, there are 4 men looking forward. None of them can see behind himself. There is an opaque wall between man number 3 and man number 4 (1, 2, and 3 cannot see past the wall). Two of the men are wearing black hats and two are wearing white hats. Each man can see the color of the hats worn by the men in front of him (1 can see 2 and 3, and 2 can see 3), but no man knows the color of the hat he himself is wearing.

Now one of the men needs to call out the color of his hat, or they all die in 10 minutes. Which man will call out the color of his hat correctly, and why?


Solution

The answer is man number 2.

Reasoning


Let's start by eliminating men. Man number 4 is on the other side of the opaque wall, facing the other way; there is no way he can see any of the other men or the color of their hats, so he is eliminated. Man number 3 also cannot see anyone - he cannot look back and he cannot see beyond the wall - so he is eliminated too. Man number 1 knows the colors worn by men 2 and 3. If they were wearing the same color hat, man number 1 would know the color of his own hat (since there are 2 white and 2 black hats) and would call it out. But he keeps mum, which means men 2 and 3 are wearing different colored hats. So man number 2 waits for some time, and when he does not hear man 1 calling out, he knows that he and man 3 are wearing different colors. Since man 2 can see the color of man 3's hat, he knows his own hat color and calls it out.

Saturday, 5 August 2017

Select top N records from each category in PL/SQL

Background

Lets say you have 2 tables -
  1. EMPLOYEE
  2. EMPLOYEE_SALARY
The EMPLOYEE table has an employee id, which is a primary key, and a name. EMPLOYEE_SALARY has an employee id, which is a foreign key to id in the EMPLOYEE table, plus the employee's department and salary. You need to write a query that returns the top 2 employees with the highest salary from each department.

Tables creation and data insertion

Table Queries :

create table schema8.EMPLOYEE(ID int, name varchar2(255));
alter table schema8.EMPLOYEE ADD CONSTRAINT PRIMARY_KEY PRIMARY KEY (ID);
create table schema8.EMPLOYEE_SALARY(EMPLOYEE_ID int, department varchar2(255), salary int);
alter table schema8.EMPLOYEE_SALARY ADD CONSTRAINT FK_EMP_ID FOREIGN KEY (EMPLOYEE_ID) REFERENCES schema8.EMPLOYEE(ID);

Data Queries for  EMPLOYEE table:

insert into schema8.EMPLOYEE values(1,'Aniket');
insert into schema8.EMPLOYEE values(2,'John');
insert into schema8.EMPLOYEE values(3,'Sam');
insert into schema8.EMPLOYEE values(4,'Ron');
insert into schema8.EMPLOYEE values(5,'Sky');
insert into schema8.EMPLOYEE values(6,'Paul');
insert into schema8.EMPLOYEE values(7,'Dan');
insert into schema8.EMPLOYEE values(8,'Jess');
insert into schema8.EMPLOYEE values(9,'Troy');
insert into schema8.EMPLOYEE values(10,'Mike');

 Data Queries for EMPLOYEE_SALARY table:

insert into schema8.EMPLOYEE_SALARY values(1,'IT',10000);
insert into schema8.EMPLOYEE_SALARY values(2,'Admin',500);
insert into schema8.EMPLOYEE_SALARY values(3,'Sales',1200);
insert into schema8.EMPLOYEE_SALARY values(4,'Sales',1500);
insert into schema8.EMPLOYEE_SALARY values(5,'IT',9000);
insert into schema8.EMPLOYEE_SALARY values(6,'Admin',4000);
insert into schema8.EMPLOYEE_SALARY values(7,'Admin',5000);
insert into schema8.EMPLOYEE_SALARY values(8,'IT',9500);
insert into schema8.EMPLOYEE_SALARY values(9,'Sales',1000);
insert into schema8.EMPLOYEE_SALARY values(10,'Admin',6000);

Final data :
select * from schema8.EMPLOYEE;
select * from schema8.EMPLOYEE_SALARY;


Solution

We are going to use RANK to partition by department and order by salary - 

select * from (
select id , name, department, salary, RANK() over (partition by department order by salary desc) as rank from(
select e.id, e.name, es.department,es.salary from schema8.EMPLOYEE e left OUTER join schema8.EMPLOYEE_SALARY es on (e.id=es.employee_id))
) where rank <= 2;




First we have done a left outer join so that we capture all employee records with their respective salaries and departments. In the outer query we have ranked them by salary within their respective departments. Finally, we select all records that have rank <= 2, i.e. the top 2 records per department.
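The same query can be verified with Python's built-in sqlite3 module (sqlite 3.25+ supports window functions; it is used here only as a stand-in for Oracle, so the schema prefix is dropped and the alias rnk is used):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE employee_salary (employee_id INT, department TEXT, salary INT)")
emps = [(1,'Aniket'),(2,'John'),(3,'Sam'),(4,'Ron'),(5,'Sky'),
        (6,'Paul'),(7,'Dan'),(8,'Jess'),(9,'Troy'),(10,'Mike')]
sals = [(1,'IT',10000),(2,'Admin',500),(3,'Sales',1200),(4,'Sales',1500),(5,'IT',9000),
        (6,'Admin',4000),(7,'Admin',5000),(8,'IT',9500),(9,'Sales',1000),(10,'Admin',6000)]
conn.executemany("INSERT INTO employee VALUES (?,?)", emps)
conn.executemany("INSERT INTO employee_salary VALUES (?,?,?)", sals)

top2 = conn.execute("""
    SELECT name, department, salary FROM (
        SELECT e.name, es.department, es.salary,
               RANK() OVER (PARTITION BY es.department ORDER BY es.salary DESC) AS rnk
        FROM employee e LEFT OUTER JOIN employee_salary es ON e.id = es.employee_id)
    WHERE rnk <= 2
    ORDER BY department, salary DESC""").fetchall()
print(top2)   # Admin: Mike, Dan | IT: Aniket, Jess | Sales: Ron, Sam
```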




Related Links


Monday, 15 May 2017

How does database indexing work?

Background

Database indexing is a wide topic. Database indexing plays an important role in your query performance. But, like everything, this too has a trade-off. In this post we will see what database indexing is and how it works.


Clustered and Non clustered index

Before we go into how indexing actually works, let's see the two types of indexes -
  1. Clustered index
  2. Non clustered index
Data in the tables of a database needs to be stored on a physical disk at the end of the day. The way data is stored matters, since data lookup is based on it. The way data is stored on the physical disk is decided by an index known as the Clustered index. Since data is physically stored only once, only one clustered index is possible for a DB table, and generally that's the primary key of the table. That's right - the primary key of your table is a clustered index by default, and data is physically stored based on it.

NOTE : You can change this, though. You can create a primary key that is not a clustered index but a non clustered one. However, you need to define one clustered index for your table, since the physical storage order depends on it.

Non clustered indexes are normal indexes. They order the data based on the column(s) we have created the non clustered index on. Note that since data is stored only once on disk, and there is just one column (or group of columns) that can be used to order the stored data on disk, we cannot store the same data again in the order specified by a non clustered index (that's the job of the clustered index). So new memory is allocated for non clustered indexes. Their entries have the column(s) on which the index is created as the key, and a pointer to the actual data row (which is ordered by the clustered index - usually the primary key). So if a search is performed on this table based on columns that have a non clustered index, this new data set is searched (which is faster, since its records are ordered with respect to those columns), and then the pointer in it is used to access the actual data record with all the columns.


Now that we have knowledge of clustered and non clustered indexes, let's see how indexing actually works.

Why is indexing needed?

When data is stored on disk based storage devices, it is stored as blocks of data. These blocks are accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured in much the same way as linked lists; both contain a section for data, a pointer to the location of the next node (or block), and both need not be stored contiguously.

Due to the fact that a number of records can only be sorted on one field, we can state that searching on a field that isn’t sorted requires a Linear Search which requires N/2 block accesses (on average), where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn’t contain unique entries) then the entire table space must be searched at N block accesses.

Whereas with a sorted field, a Binary Search may be used, which requires log2 N block accesses. Also, since the data is sorted, given a non-key field, the rest of the table doesn't need to be searched for duplicate values once a higher value is found. Thus the performance increase is substantial.
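The N/2 vs log2 N difference is easy to see by counting accesses directly. A small sketch in pure Python, where a list stands in for the table's blocks:

```python
import math

def linear_accesses(data, target):
    """Unsorted lookup: walk until found. N/2 accesses on average, N worst case."""
    for count, value in enumerate(data, start=1):
        if value == target:
            return count
    return len(data)

def binary_accesses(sorted_data, target):
    """Sorted lookup: classic binary search, at most ~log2(N) accesses."""
    lo, hi, count = 0, len(sorted_data) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        count += 1
        if sorted_data[mid] == target:
            return count
        elif sorted_data[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return count

N = 1_000_000
data = list(range(N))
print(linear_accesses(data, N - 1))   # 1000000 accesses in the worst case
print(binary_accesses(data, N - 1))   # ~20 accesses
print(math.ceil(math.log2(N)))        # 20
```

For a million entries, the worst-case linear walk touches a million elements while binary search touches about twenty.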

What is indexing?

This is rather a silly question, given that we already saw clustered and non clustered indexes, but let's give it a try.

Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value and a pointer to the record it relates to. This index structure is then sorted, allowing Binary Searches to be performed on it. This index is obviously the non clustered one. There is no need to create a separate data structure for clustered indexes, since the original data is physically stored sorted based on them.

The downside to (non clustered) indexing is that these indexes require additional disk space. Since, with the MyISAM engine, the indexes are stored together in one file, that file can quickly reach the size limits of the underlying file system if many fields within the same table are indexed.

Performance

Indexes don't come free - they have their own overhead. Each index creates a new data set ordered by the columns on which it is created. This takes extra space (though not as much as the original data set, since it holds just the key data and a pointer to the actual row). Inserts are also slower now, since each insert has to update this new index data set as well. The same goes for deletion.

Data structures used in indexes

Hash index :
Think of this as a HashMap. The key here would be derived from the columns that are used to create the index (a non clustered index, to be precise). The value would be a pointer to the actual table row. Hash indexes are good for equality lookups, like getting the rows of all customers whose age is 24. But what happens when we need a query like the set of customers with age greater than 24? A hash index does not work in this case - hash indexes are just good for equality lookups.
Eg.



B-tree Indexes:
These are the most common types of indexes. A B-tree is a kind of self-balancing tree. It stores data in an ordered manner so that retrievals are fast. B-trees are useful since they provide insertion, search, and deletion in logarithmic time.
Eg.


Consider the above picture. If we need rows with data less than 100, all we need are the nodes to the left of 100.

These are just the common ones. Others include R-tree indexes, bitmap indexes, etc.

To get a complete idea of how indexes work internally, please refer to - SQL Indexing and Tuning.

NOTE : There is no clustered index in Oracle database. A regular non clustered index is automatically created on the primary key. As an alternative, in Oracle DB you can explore index organized tables.


Important Points to remember

  • Avoid using a function on an indexed column in the where clause. Functions render indexes useless. A function is like a black box - the database optimizer does not know the relationship between its result and its argument. So instead of using the index it will perform a full table scan. E.g.
    • create index employee_name_idx on employee(name);
    • select * from employee where name='Aniket'; --uses index
    • select * from employee where lower(name)='aniket'; --cannot use index
  • If you do have a concatenated index, then choose the column order so that multiple SQL queries (of your use cases) can use it. E.g.
    • select * from employee where name='Aniket'; --sql1
    • select * from employee where name='Aniket' and age=26; --sql2
    • create index employee_name_idx on employee(name, age); --correct
    • create index employee_name_idx on employee(age, name); --incorrect (sql1 cannot use this)
  • Avoid like expressions in your where clause that start with a wildcard; the index will be of no use even if the column is indexed. E.g.
    • create index employee_name_idx on employee(name);
    • select * from employee where name like '%ket'; --will be very slow
    • select * from employee where name like 'Anik%'; --will be fast as it uses the prefix of the index
  • Always index for equality first (versus, let's say, a range)
    • create index employee_name_idx1 on employee(joining_date, name); --slower
    • create index employee_name_idx2 on employee(name, joining_date); --faster
    • select * from employee where name='Aniket' and joining_date >= TO_DATE('2017-08-20')
  • Always check if covered indexes are possible. This prevents actual table access, since you get all the data from the index alone. So let's say you have a table with columns A-Z, an index on column B, and a query like -
    • select A from table where B='XYZ'
    • The query looks good; it will use our index on column B and speed up the query, but for each hit in the B-tree leaf of the index it will need to access the actual table row to get A.
    • Now consider an index on (B, A). Since your where clause has B, this index will be picked up. However, in this case, once entries are located we already have the value of A, which is what we need. So this avoids the lookup of the actual table row.
    • This obviously depends on your use case. Nonetheless, it's important to consider it.
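The function and wildcard points can be observed directly. A minimal sketch using Python's built-in sqlite3 module (the plan wording differs from Oracle, but the behaviour - index used vs full scan - is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO employee VALUES (?,?)", [(1, 'Aniket'), (2, 'John')])
conn.execute("CREATE INDEX employee_name_idx ON employee(name)")

def plan(sql):
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

print(plan("SELECT * FROM employee WHERE name='Aniket'"))         # uses employee_name_idx
print(plan("SELECT * FROM employee WHERE lower(name)='aniket'"))  # function -> full scan
print(plan("SELECT * FROM employee WHERE name LIKE '%ket'"))      # leading wildcard -> full scan
```

Only the plain equality predicate mentions `employee_name_idx` in its plan; both the function-wrapped column and the leading-wildcard LIKE fall back to scanning the table.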

Related Links
