Sunday, 10 September 2017

Installing and Using OpenGrok on Mac OS X

Background

In previous couple of posts we saw how we can setup git repositories, install git client and maintain your codebase. 
In this post we will see how to set up opengrok. This is a code base search tool that you can use to index and search your huge complex codebase. I am going to show this on my mac laptop. If you want to setup a proper server please refer to official documentation.


Installing and Using OpenGrok on Mac OS X

I am going to use Homebrew to do most of the setup here. If you are not aware of homebrew then you can read -
 Couple of things you need to install before are -
  • A servlet container like GlassFish or Tomcat to run and deploy your grok server. I will use tomcat.
  • Exuberant Ctags for analysis.
You can run following commands to set these up -
  • brew update
  • brew install tomcat
  • brew install ctags 
You can just type catalina to see tomcat is properly installed -


Next set environment variable as follows -
  • export OPENGROK_TOMCAT_BASE=/usr/local/Cellar/tomcat/8.5.20/libexec
 For path you can refer to catalina screenshot above. This environment variable will basically tell where grok needs to be deployed.

Download the latest opengrok binary from-
 I am using opengrok-1.1-rc13.tar.gz.

Next go yo your opengrok bin directory. In my case it is -
  • /Users/athakur/Documents/Softwares/opengrok-1.1-rc13/bin
and run -
  • ./OpenGrok deploy
 This will deploy grok code on your tomcat container. Now start the tomcat container




 You can now access it via -


 The error you see is ok since we have not provided our codebase source directory yet.

Noe lets add source directory. My code is in-
  •  ~/Documents/git/DataStructures
NOTE : DataStructures is a local copy of my github repo -
I am going to maintain all codebase references in
  • ~/local_repos/src/
So create a directory and add soft links as below -


 Now it's time to define your code directory that opengrok can understand. So define another environment variable -

  • export OPENGROK_INSTANCE_BASE=/Users/athakur/local_repos

That's now lets index this content. To index it go to you opengrok bin directory and run -
  • ./OpenGrok index.

You can see it automatically creates directory it needs. Just make sure it has appropriate permissions -




That's it you can refresh grok page and start searching code.


 NOTE : For every update to your actual repository or for any new repository getting added you need to call ./Opengrok index to index it. You can probably write a cron job that does an automatic pull of your repository and runs index on it.


Related Links

Saturday, 2 September 2017

Understanding database indexes - Part 2

Background

Some time back we took a look at what database indexing is and how it internally works. -
In this post we will see database indexing more from a development design perspective. Many of you might be of the impression that database indexes, tables , performance, lookups maybe responsibility of database admins. Though this might be true to some extent indexes selection, and constructing where clause is developers responsibility and poor choice of indexes and where clause may often lead to performance issues causing queries to run slow. So whenever you are developing an application that requires database interactions as a developer it is very important that you design your indexes first. How do we do that?  - We will see that in sometime. 

Understanding database indexes - Part 2

An index lookup require 3 steps -
  1. B-Tree traversal to root node
  2. Traversal along root node
  3. Access actual table data from each root node
Step 1 is limited is size as tree height/level will be limited by log N constraint. For millions of rows there could be 3-4 level of the tree. It will be extremely rare to see a B-Tree with level more than 5.
You can use following query to see the B-Tree level for your index -

SELECT index_name, blevel+1 FROM user_indexes ORDER BY 2;


blevel gives you the levels of your B-Tree index. Plus one is for the leaf node. So these are the number of levels that needs to be traversed to get an index at the leaf node (considering unique scan).

Step 2 and Step 3 can vary and in most cases are causes of slow index lookup resulting in slow running queries.

Let's start understanding by taking an actual example. Let's create a table as follows -

create table schema8.EMPLOYEE(ID int, name varchar2(255),age int, department varchar2(255), salary int);
alter table schema8.EMPLOYEE ADD CONSTRAINT PRIMARY_KEY PRIMARY KEY (ID); 

CREATE UNIQUE INDEX schema8.UX_EMPLOYEE_1 ON schema8.EMPLOYEE (name, age, department);
ALTER TABLE schema8.EMPLOYEE ADD CONSTRAINT UK_EMPLOYEE_1 UNIQUE (name, age, department) USING INDEX schema8.UX_EMPLOYEE_1;


Lets' add some data in it -

insert into schema8.EMPLOYEE values(1,'Aniket',26,'IT',100);
insert into schema8.EMPLOYEE values(2,'John',29,'FINANCE',40);
insert into schema8.EMPLOYEE values(3,'Sam',27,'IT',101);
insert into schema8.EMPLOYEE values(4,'Ron',30,'ACCOUNTING',35);
insert into schema8.EMPLOYEE values(5,'Sky',33,'DEVOPS',62);
insert into schema8.EMPLOYEE values(6,'Paul',26,'FINANCE',43);
insert into schema8.EMPLOYEE values(7,'Dan',24,'IT',100);
insert into schema8.EMPLOYEE values(8,'Jess',25,'ACCOUNTING',37);
insert into schema8.EMPLOYEE values(9,'Troy',31,'FINANCE',41);
insert into schema8.EMPLOYEE values(10,'Mike',28,'IT',103);
insert into schema8.EMPLOYEE values(11,'Anuj',28,'DEVOPS',64);
insert into schema8.EMPLOYEE values(12,'Vinit',29,'FINANCE',48);
insert into schema8.EMPLOYEE values(13,'Sudhir',29,'ACCOUNTING',39);
insert into schema8.EMPLOYEE values(14,'Anish',28,'IT',100);
insert into schema8.EMPLOYEE values(15,'Shivam',25,'DEVOPS',61);
insert into schema8.EMPLOYEE values(16,'Monica',26,'ACCOUNTING',30);
insert into schema8.EMPLOYEE values(17,'Ramji',32,'FINANCE',41);
insert into schema8.EMPLOYEE values(18,'Anjali',34,'ACCOUNTING',38);
insert into schema8.EMPLOYEE values(19,'Payas',26,'IT',100);
insert into schema8.EMPLOYEE values(20,'Zara',27,'DEVOPS',60);


Normal index


Let's start with a simple query -

select * from schema8.EMPLOYEE where department='IT';


It gives 6 rows. What we really want to understand is the performance of the query and if we can improve it. To understand the queries performance we need to take a look at the execution plan that was used by the sql optimizer. In Sql developer you can just
  • Select the query -> Right click -> Explain -> Explain Plan



And you should see the plan that was selected to run this query and associated cost.

So for above query execution plan is -



As you can see a "FULL TABLE SCAN" was selected.  Since your where clause has department column in it there was no other option. Unique index starting with name could not be used. Primary key index could not be used (Index is always created on primary key -id in this case). So it had to go for a full table scan. Now this is obviously expensive. You can see the cardinality is 6 which basically means there are 6 rows which satisfy "department='IT'" clause and cost is also high.

Let's do something about this. Let's create an index in column department and then again inspect the plan.

create index X_EMPLOYEE_DEPT on schema8.EMPLOYEE(department);


and now lets see the execution plan -



Better? Our cost is reduced by half now. As you can see this time our new index was used for the lookup - "RANGE SCAN". So full table access was avoided. Recall our earlier discussion on steps needed for index lookup -
  1. It used index to get to the leaf node
  2. Traveled along leaf node linked list to find all nodes with department='IT' ("RANGE SCAN")
  3. finally for each index accessed the actual table using rowid to get other table data ("BY INDEX ROWID BATCHED") (Batched because data for all rowids are retrieved in a single call)
Hope this clears how indexes help faster execution of queries.

NOTE :
Cardinality is the estimated number of rows a particular step will return.
Cost is the estimated amount of work the plan will do for that step.
Higer cardinality mean more work which means higher cost associated with that step.
A lower cost query will run faster than a higer cost query.

Primary key index 

As you know primary key has an index created by default. Let's try to query the table using primary key and  see it's execution plan -

select * from schema8.EMPLOYEE where id='12';



As expected cost has further gone down. As primary key index is unique index (since primary key itself is unique) execution plan went for - "UNIQUE SCAN"  and then simple "BY INDEX ROWID" (No batched lookup here since there will be just one entry given that it is unique).  So again if you recollect index lookup steps this consists of -
  1. Use unique index to reach leaf node (Just one leaf node) ("UNIQUE SCAN")
  2. Get the table data ("BY INDEX ROWID")
Notice how there was no traversal among lead nodes and no consequent batch access by row ids.

Unique key index

This would again be same as primary key index since primary key index is also a unique key index but let's give this a try since we have defined unique key for our table on - name, age, department

select * from schema8.EMPLOYEE where name='Aniket' and age=26 and department='IT';

And execution plan for this is -



As expected it is same as primary key index. Instead of primary key index it used unique index we created on our own. Steps are same too.

Composite index

Remember our unique index - name, age, department . We saw in the 1st case where we had department in where clause (before creating an index on department) that this particular index was not used and a full table scan was performed.

If you recollect from our previous discussion index column order matters. If the column order was - department , name, age this new index would gave been picked up. Anyway lets try based on what we already have. Now we are going to add name in where clause and based on our existing knowledge our unique index should get picked up (since it starts with name column) -

select * from schema8.EMPLOYEE where name='Aniket'; 

Execution plan -



As expected our index got used and a full table scan was avoided. However if you use age in where clause index will not be used again - since none of your index starts with age. Try it yourself!

Order of columns in where clause does not matter

We saw how order of columns matter in index creation. Same does not apply for where clause column order. Eg consider -

select * from schema8.EMPLOYEE where name='Aniket' and age=26;
select * from schema8.EMPLOYEE where age=26 and name='Aniket';


SQL optimizer is intelligent enough to figure out name is one of the column in where clause and it has an index on it that can be used for lookup and it does so -

For each query above execution plan is -



And that concludes  - order of columns in where clause does not matter!

Multiple indexes applicable - select the one with least cost

So far we have seen various cases in which just one index was applicable. What if there are 2. Let's say you use department and name in your where clause. Now there are 2 options -
  1. Use the unique index starting with name 
  2. Use the index in department
Let's see how it works out -

select * from schema8.EMPLOYEE where name='Aniket' and department='IT';


Execution plan -


As you can see index on name was selected and once leaf nodes were retrieved filter was applied in it to get those rows with department='IT' and finally batched rowid access to get all the table data. This index was selected probably because unique indexes are given preference over non unique index since they are faster. But it depends totally on sql optimizer to figure that out based on execution cost.

Covered indexes and queries

In our previous discussion we saw what covered indexes/queries are. They are basically indexes that have all the data needed to be retrieved and there is no need to access the actual table by rowid. FOr eg. consider -

select name, age,department from schema8.EMPLOYEE where name='Aniket';

Take some time to think this though based on our discussion so far. We know where clause has name it it. We also know we have an unique index that starts with name. So it will be used. On top of that we have an added bonus - we just need name,age,department that us already part of that unique index. So we really don't need to access to actual table data to get any other content.

Let's see this execution plan -


 As expected. There is no "BY INDEX ROWID" or "BY INDEX ROWID BATCHED". That's because table access is not needed since all required data is in index itself. Also note the range scan - even though unique index is used there is no unique row returned since only part of unique index is used.

Summing Up

So to sum up execution plan can do -
  • FULL TABLE SCAN or
  • UNIQUE SCAN or
  • RANGE SCAN
 and then access the table data (if needed) with -
  • BY INDEX ROWID or
  • BY INDEX ROWID BATCHED
Either case you need to choose index very wisely based on your where clause, it's combination. Order of columns in index is very important. Always go for index that starts with column that is used in most of the queries where clause. Also go for equality first than range. So if you where clause is something like - "where name='Aniket' and age>24" always for name as column ordered before age. since equality will give less results to filter from. Age will be applied as filter in above case.

Related Links

Sunday, 27 August 2017

Understanding having clause in SQL

Background

If you have written queries or worked on a project that requires database support then you must have use or atleast familiar with having clause. This is also one of the popular interview questions for beginners to test database knowledge if you ask me. Simple syntax looks like -


SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);


So your 1st and foremost answer is that you use having clause with "group by" clause. How and why we will come to later part in of this discussion. Having said that before we proceed make sure you know what group by clause does. Also you need to have an idea of what aggregate functions are . Eg. min(), max(), count(), sum() etc.


Understanding having clause in SQL

So far we know we use having clause with group by clause. Let's answer the question why. 

Let's say you have a table employee which have basic data of an employee - id, name, department etc. 

Problem statement : Now we are interested to find out how many employees are there in each department and probably see the result in sorted order so that department with maximum employees is displayed first. How would you do this? Using following query -

select department, count(*) from employee group by department order by count(*) desc;

This works fine.  Now let's redefine our problem statement.

Problem statement :  Let's say we now want the same thing - department and number of employees in each department sorted in descending order. However this time we have an additional constraint. We want to see only those departments that have more than 10 employees in it. You would probably try -

select department, count(*) from employee group by department where count(*) > 10 order by count(*) desc;

Problem : Does not work
Error : ORA-00934: group function is not allowed here
             00934. 00000 -  "group function is not allowed here"
(Above error is show for oracle database)

NOTE :  An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference

Problem is that you cannot use aggregate functions in where clause. Solution? - Having clause. This is exactly why having clause was introduced. Once you have applied the group by clause and wish to filter the data further on the results obtained you use having clause. So your correct query would be -


select department, count(*) from employee group by department having count(*) > 10 order by count(*) desc;

Aggregate functions are allowed in having clause. So lets go over the original syntax again and see how it works -

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);

  1. First you select  column_name(s) from the table that match the where clause condition
  2. Once result is obtained then group by clause is applied to it to get the next set of result.
  3. Once that is done condition is having clause is applied to further filter the result.
  4. Finally order by clause is applied to sort the result as required and return.

NOTE :  where clause us applied to filter results before group by clause is applied where as having clause is applied after.

Other alternative can be like -

select department, count from ( 
select department, count(*) count from employee group by department )\
where  count>10;

Related Links

Friday, 18 August 2017

Find fastest 3 horses out of 25 horses puzzle

Question



You have 25 horses and you need to pick fastest 3 horses out of those. In each race you can race maximum of 5 horses as there are 5 tracks available. What are minimum number of races needed to find the fastest 3 without using a stopwatch. 

Solution

7 races needed. 

Understanding

Since we don't have a stopwatch only way to find fastest is by racing horses. Lets have 5 races each of 5 horses and let following be the results -

  • A > B > C > D > E
  • F > G > H > I > J
  • K > L > M > N > O
  • P > Q > R > S > T
  • U > V > W > X > Y
Above are results of each race.  We have 5 races till now. Now lest race between fastest of all previos 5 races i.e A,F,K,P,U. We have 6 races till now .Let's say the result for that is -
  • F > K > A > P > U

Since We are interested in top 3 P and U are useless for us. Also since P and U were fastest among their group we can ignore all members of P and U too. So now only horses under consideration are -

  • F > G > H > I > J
  • K > L > M > N > O
  • A > B > C > D > E
Now if we consider A as possible candidate for top 3 then others in group of A are redundant - since we already have F and K faster than A. So we can ignore B,C,D,E

Now horses under consideration are  -

  • F > G > H > I > J
  • K > L > M > N > O
  • A
Now in the group lead by k possible horse candidates are K and L - since F is already faster than K if we consider K and L we already have 3. So we can ignore M,N and O. So remaining horses are -

  • F > G > H > I > J
  • K > L
  • A
 Now lets consider group led by F. We can consider G and H as possible candidates for top 3 since if they are I and J are redundant. So remaining now are -

  • F > G > H
  • K > L
  • A
 Now we already know fastest among all in F since we got that result by running fastest among each group. So only horses we need comparison for 2nd and 3rd position are - G, H, K, L, A

These are 5 horses we can have another race and find the top 2 out of them. Lets say the result was -
  • L > H > K > G > A
We have done 7 races now. So fastest onces are L and H. We already the fastest among all - F. So the final answer is -
  • F > L > H 
So the answer is 7 races needed.

10 coins puzzle

Question



There are 10 coin placed in front of you 5 of which are heads and 5 of which are tails. You are blind folded so you don't know which ones are which. You need to make two piles of coins so that both have equal heads. You are allowed to flip a coin any number of time. You obviously wont know which is head and which is tails by touching it.

Solution

Make two piles of coins of equal number (5 each) and then flip all the coins on one side.

Understanding

Lets say you split in into two equal piles. 1st pile has 3 heads and 2 tails. Since there were 5 heads and 5 tails other pile will have 3 tails and 2 heads. Now when you flip all in pile 1 then there will be 3 tails and 2 heads same as pile number 2.

Generically if there are n heads and 5-n tails in pile 1 then in pile 2 will have n tails and 5-n heads. When we flip all coins in pile 1 then pile 1 will have n tails and 5-n heads which is same as pile 2.
t> UA-39527780-1 back to top