Open Source For Geeks: September 2017

Sunday, 10 September 2017

Installing and Using OpenGrok on Mac OS X

Background

In previous couple of posts we saw how we can setup git repositories, install git client and maintain your codebase.

In this post we will see how to set up opengrok. This is a code base search tool that you can use to index and search your huge complex codebase. I am going to show this on my mac laptop. If you want to setup a proper server please refer to official documentation.

Installing and Using OpenGrok on Mac OS X

I am going to use Homebrew to do most of the setup here. If you are not aware of homebrew then you can read -

Tips and Tricks for your MacBook

Couple of things you need to install before are -

A servlet container like GlassFish or Tomcat to run and deploy your grok server. I will use tomcat.
Exuberant Ctags for analysis.

You can run following commands to set these up -

brew update
brew install tomcat
brew install ctags

You can just type catalina to see tomcat is properly installed -

Next set environment variable as follows -

export OPENGROK_TOMCAT_BASE=/usr/local/Cellar/tomcat/8.5.20/libexec

For path you can refer to catalina screenshot above. This environment variable will basically tell where grok needs to be deployed.

Download the latest opengrok binary from-

https://github.com/OpenGrok/OpenGrok/releases

I am using opengrok-1.1-rc13.tar.gz.

Next go yo your opengrok bin directory. In my case it is -

/Users/athakur/Documents/Softwares/opengrok-1.1-rc13/bin

and run -

./OpenGrok deploy

This will deploy grok code on your tomcat container. Now start the tomcat container

You can now access it via -

http://localhost:8080/source/

The error you see is ok since we have not provided our codebase source directory yet.

Noe lets add source directory. My code is in-

~/Documents/git/DataStructures

NOTE : DataStructures is a local copy of my github repo -

https://github.com/aniket91/DataStructures/

I am going to maintain all codebase references in

~/local_repos/src/

So create a directory and add soft links as below -

Now it's time to define your code directory that opengrok can understand. So define another environment variable -

export OPENGROK_INSTANCE_BASE=/Users/athakur/local_repos

That's now lets index this content. To index it go to you opengrok bin directory and run -

./OpenGrok index.

You can see it automatically creates directory it needs. Just make sure it has appropriate permissions -

That's it you can refresh grok page and start searching code.

NOTE : For every update to your actual repository or for any new repository getting added you need to call ./Opengrok index to index it. You can probably write a cron job that does an automatic pull of your repository and runs index on it.

Saturday, 2 September 2017

Understanding database indexes - Part 2

Background

Some time back we took a look at what database indexing is and how it internally works. -

How does database indexing work?

In this post we will see database indexing more from a development design perspective. Many of you might be of the impression that database indexes, tables , performance, lookups maybe responsibility of database admins. Though this might be true to some extent indexes selection, and constructing where clause is developers responsibility and poor choice of indexes and where clause may often lead to performance issues causing queries to run slow. So whenever you are developing an application that requires database interactions as a developer it is very important that you design your indexes first. How do we do that? - We will see that in sometime.

Understanding database indexes - Part 2

An index lookup require 3 steps -

B-Tree traversal to root node
Traversal along root node
Access actual table data from each root node

Step 1 is limited is size as tree height/level will be limited by log N constraint. For millions of rows there could be 3-4 level of the tree. It will be extremely rare to see a B-Tree with level more than 5.
You can use following query to see the B-Tree level for your index -

SELECT index_name, blevel+1 FROM user_indexes ORDER BY 2;

blevel gives you the levels of your B-Tree index. Plus one is for the leaf node. So these are the number of levels that needs to be traversed to get an index at the leaf node (considering unique scan).

Step 2 and Step 3 can vary and in most cases are causes of slow index lookup resulting in slow running queries.

Let's start understanding by taking an actual example. Let's create a table as follows -

create table schema8.EMPLOYEE(ID int, name varchar2(255),age int, department varchar2(255), salary int);
alter table schema8.EMPLOYEE ADD CONSTRAINT PRIMARY_KEY PRIMARY KEY (ID); 

CREATE UNIQUE INDEX schema8.UX_EMPLOYEE_1 ON schema8.EMPLOYEE (name, age, department);
ALTER TABLE schema8.EMPLOYEE ADD CONSTRAINT UK_EMPLOYEE_1 UNIQUE (name, age, department) USING INDEX schema8.UX_EMPLOYEE_1;

Lets' add some data in it -

insert into schema8.EMPLOYEE values(1,'Aniket',26,'IT',100);
insert into schema8.EMPLOYEE values(2,'John',29,'FINANCE',40);
insert into schema8.EMPLOYEE values(3,'Sam',27,'IT',101);
insert into schema8.EMPLOYEE values(4,'Ron',30,'ACCOUNTING',35);
insert into schema8.EMPLOYEE values(5,'Sky',33,'DEVOPS',62);
insert into schema8.EMPLOYEE values(6,'Paul',26,'FINANCE',43);
insert into schema8.EMPLOYEE values(7,'Dan',24,'IT',100);
insert into schema8.EMPLOYEE values(8,'Jess',25,'ACCOUNTING',37);
insert into schema8.EMPLOYEE values(9,'Troy',31,'FINANCE',41);
insert into schema8.EMPLOYEE values(10,'Mike',28,'IT',103);
insert into schema8.EMPLOYEE values(11,'Anuj',28,'DEVOPS',64);
insert into schema8.EMPLOYEE values(12,'Vinit',29,'FINANCE',48);
insert into schema8.EMPLOYEE values(13,'Sudhir',29,'ACCOUNTING',39);
insert into schema8.EMPLOYEE values(14,'Anish',28,'IT',100);
insert into schema8.EMPLOYEE values(15,'Shivam',25,'DEVOPS',61);
insert into schema8.EMPLOYEE values(16,'Monica',26,'ACCOUNTING',30);
insert into schema8.EMPLOYEE values(17,'Ramji',32,'FINANCE',41);
insert into schema8.EMPLOYEE values(18,'Anjali',34,'ACCOUNTING',38);
insert into schema8.EMPLOYEE values(19,'Payas',26,'IT',100);
insert into schema8.EMPLOYEE values(20,'Zara',27,'DEVOPS',60);

Normal index

Let's start with a simple query -

select * from schema8.EMPLOYEE where department='IT';

It gives 6 rows. What we really want to understand is the performance of the query and if we can improve it. To understand the queries performance we need to take a look at the execution plan that was used by the sql optimizer. In Sql developer you can just

Select the query -> Right click -> Explain -> Explain Plan

And you should see the plan that was selected to run this query and associated cost.

So for above query execution plan is -

As you can see a "FULL TABLE SCAN" was selected. Since your where clause has department column in it there was no other option. Unique index starting with name could not be used. Primary key index could not be used (Index is always created on primary key -id in this case). So it had to go for a full table scan. Now this is obviously expensive. You can see the cardinality is 6 which basically means there are 6 rows which satisfy "department='IT'" clause and cost is also high.

Let's do something about this. Let's create an index in column department and then again inspect the plan.

create index X_EMPLOYEE_DEPT on schema8.EMPLOYEE(department);

and now lets see the execution plan -

Better? Our cost is reduced by half now. As you can see this time our new index was used for the lookup - "RANGE SCAN". So full table access was avoided. Recall our earlier discussion on steps needed for index lookup -

It used index to get to the leaf node
Traveled along leaf node linked list to find all nodes with department='IT' ("RANGE SCAN")
finally for each index accessed the actual table using rowid to get other table data ("BY INDEX ROWID BATCHED") (Batched because data for all rowids are retrieved in a single call)

Hope this clears how indexes help faster execution of queries.

NOTE :
Cardinality is the estimated number of rows a particular step will return.
Cost is the estimated amount of work the plan will do for that step.
Higer cardinality mean more work which means higher cost associated with that step.
A lower cost query will run faster than a higer cost query.

Primary key index

As you know primary key has an index created by default. Let's try to query the table using primary key and see it's execution plan -

select * from schema8.EMPLOYEE where id='12';

As expected cost has further gone down. As primary key index is unique index (since primary key itself is unique) execution plan went for - "UNIQUE SCAN" and then simple "BY INDEX ROWID" (No batched lookup here since there will be just one entry given that it is unique). So again if you recollect index lookup steps this consists of -

Use unique index to reach leaf node (Just one leaf node) ("UNIQUE SCAN")
Get the table data ("BY INDEX ROWID")

Notice how there was no traversal among lead nodes and no consequent batch access by row ids.

Unique key index

This would again be same as primary key index since primary key index is also a unique key index but let's give this a try since we have defined unique key for our table on - name, age, department

select * from schema8.EMPLOYEE where name='Aniket' and age=26 and department='IT';

And execution plan for this is -

As expected it is same as primary key index. Instead of primary key index it used unique index we created on our own. Steps are same too.

Composite index

Remember our unique index - name, age, department . We saw in the 1st case where we had department in where clause (before creating an index on department) that this particular index was not used and a full table scan was performed.

If you recollect from our previous discussion index column order matters. If the column order was - department , name, age this new index would gave been picked up. Anyway lets try based on what we already have. Now we are going to add name in where clause and based on our existing knowledge our unique index should get picked up (since it starts with name column) -

select * from schema8.EMPLOYEE where name='Aniket';

Execution plan -

As expected our index got used and a full table scan was avoided. However if you use age in where clause index will not be used again - since none of your index starts with age. Try it yourself!

Order of columns in where clause does not matter

We saw how order of columns matter in index creation. Same does not apply for where clause column order. Eg consider -

select * from schema8.EMPLOYEE where name='Aniket' and age=26;
select * from schema8.EMPLOYEE where age=26 and name='Aniket';

SQL optimizer is intelligent enough to figure out name is one of the column in where clause and it has an index on it that can be used for lookup and it does so -

For each query above execution plan is -

And that concludes - order of columns in where clause does not matter!

Multiple indexes applicable - select the one with least cost

So far we have seen various cases in which just one index was applicable. What if there are 2. Let's say you use department and name in your where clause. Now there are 2 options -

Use the unique index starting with name
Use the index in department

Let's see how it works out -

select * from schema8.EMPLOYEE where name='Aniket' and department='IT';

Execution plan -

As you can see index on name was selected and once leaf nodes were retrieved filter was applied in it to get those rows with department='IT' and finally batched rowid access to get all the table data. This index was selected probably because unique indexes are given preference over non unique index since they are faster. But it depends totally on sql optimizer to figure that out based on execution cost.

Covered indexes and queries

In our previous discussion we saw what covered indexes/queries are. They are basically indexes that have all the data needed to be retrieved and there is no need to access the actual table by rowid. FOr eg. consider -

select name, age,department from schema8.EMPLOYEE where name='Aniket';

Take some time to think this though based on our discussion so far. We know where clause has name it it. We also know we have an unique index that starts with name. So it will be used. On top of that we have an added bonus - we just need name,age,department that us already part of that unique index. So we really don't need to access to actual table data to get any other content.

Let's see this execution plan -

As expected. There is no "BY INDEX ROWID" or "BY INDEX ROWID BATCHED". That's because table access is not needed since all required data is in index itself. Also note the range scan - even though unique index is used there is no unique row returned since only part of unique index is used.

Summing Up

So to sum up execution plan can do -

FULL TABLE SCAN or
UNIQUE SCAN or
RANGE SCAN

and then access the table data (if needed) with -

BY INDEX ROWID or
BY INDEX ROWID BATCHED

Either case you need to choose index very wisely based on your where clause, it's combination. Order of columns in index is very important. Always go for index that starts with column that is used in most of the queries where clause. Also go for equality first than range. So if you where clause is something like - "where name='Aniket' and age>24" always for name as column ordered before age. since equality will give less results to filter from. Age will be applied as filter in above case.

Sunday, 10 September 2017

Installing and Using OpenGrok on Mac OS X

Background

Installing and Using OpenGrok on Mac OS X

Related Links

Saturday, 2 September 2017

Understanding database indexes - Part 2

Background

Understanding database indexes - Part 2

Normal index

Primary key index

Unique key index

Composite index

Order of columns in where clause does not matter

Multiple indexes applicable - select the one with least cost

Covered indexes and queries

Summing Up

Related Links