Friday, 13 February 2015

Understanding how hashing (HashMap and HashSet) works in Java

Background

Hashing is a very important concept in computer programming and a very popular topic for interview questions. Two important data structures related to hashing are HashMap and HashSet, which are exactly what we are going to look at in this post.

Overview

If you are from a computer science background you would know that a HashMap is a data structure which stores key-value pairs, whereas a HashSet stores unique data. Internally a HashMap maintains buckets, and each entry added to the HashMap falls into one of these buckets depending on the hash value of its key. You must also have heard that adding and retrieving objects in a HashMap happens in O(1) time on average.

But there are still open-ended questions like -
  • What happens when two keys added to a HashMap have the same hash code? - a situation typically known as a collision.
  • If collisions are handled, how do get and put actually work in a HashMap? .... and so on.
We will address them now.

Understanding how HashMap works

You can visualize a HashMap as follows (diagram: an array whose slots each point to a linked list of entries) -
So you have an array, and each array position essentially holds a linked list. As you know, you don't have to specify the size of a HashMap - it grows dynamically, like an ArrayList. The main data structure is essentially an array.

    /**
     * The table, resized as necessary. Length MUST Always be a power of two.
     */
    transient Entry[] table;


When you create a HashMap you can choose to provide an initial capacity; when you don't, a default value is used.

    /**
     * The default initial capacity - MUST be a power of two.
     */
    static final int DEFAULT_INITIAL_CAPACITY = 16;


When the table is created, it is created with either this default initial capacity or the capacity you provide in the constructor.

The next important thing is when the table should be resized to meet the dynamically changing data size of the HashMap. The answer depends on a threshold that is determined by the load factor.

    /**
     * The load factor used when none specified in constructor.
     */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;


HashMap provides a constructor to initialize the load factor as well. So when your data size reaches

  • threshold = table capacity * load factor

the HashMap table is resized. The constructors are as follows -

    /**
     * Constructs an empty <tt>HashMap</tt> with the specified initial
     * capacity and load factor.
     *
     * @param  initialCapacity the initial capacity
     * @param  loadFactor      the load factor
     * @throws IllegalArgumentException if the initial capacity is negative
     *         or the load factor is nonpositive
     */
    public HashMap(int initialCapacity, float loadFactor) {
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);

        // Find a power of 2 >= initialCapacity
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;

        this.loadFactor = loadFactor;
        threshold = (int)(capacity * loadFactor);
        table = new Entry[capacity];
        init();
    }

    /**
     * Constructs an empty <tt>HashMap</tt> with the specified initial
     * capacity and the default load factor (0.75).
     *
     * @param  initialCapacity the initial capacity.
     * @throws IllegalArgumentException if the initial capacity is negative.
     */
    public HashMap(int initialCapacity) {
        this(initialCapacity, DEFAULT_LOAD_FACTOR);
    }

    /**
     * Constructs an empty <tt>HashMap</tt> with the default initial capacity
     * (16) and the default load factor (0.75).
     */
    public HashMap() {
        this.loadFactor = DEFAULT_LOAD_FACTOR;
        threshold = (int)(DEFAULT_INITIAL_CAPACITY * DEFAULT_LOAD_FACTOR);
        table = new Entry[DEFAULT_INITIAL_CAPACITY];
        init();
    }
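
To make the capacity rounding and the resize threshold concrete, here is a small standalone sketch. It is my own illustrative code (the class name ThresholdDemo is hypothetical), simply mirroring the constructor logic above for an initial capacity of 5:

    public class ThresholdDemo {
        public static void main(String[] args) {
            int requestedCapacity = 5;
            float loadFactor = 0.75f;

            // Round the requested capacity up to the next power of two,
            // exactly as the constructor above does.
            int capacity = 1;
            while (capacity < requestedCapacity)
                capacity <<= 1;

            int threshold = (int) (capacity * loadFactor);
            System.out.println("capacity  = " + capacity);  // prints 8
            System.out.println("threshold = " + threshold); // prints 6
            // So new HashMap<>(5, 0.75f) gets a table of 8 slots and,
            // per addEntry() shown later, resizes while adding the 7th entry.
        }
    }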


How is data stored in a HashMap when keys generate the same hash code?

As mentioned earlier, each index holds a reference to an object of type Entry, which is effectively a linked list node with a next pointer. So when an entry is added to the HashMap, the hash code of its key is computed, which determines the index at which it should be put. If that index already has an entry, the new entry is added at the start of the linked list and the existing list is appended to its next pointer.

    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

    void addEntry(int hash, K key, V value, int bucketIndex) {
        Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
        if (size++ >= threshold)
            resize(2 * table.length);
    }


Also note how null is handled. Yes, null is an acceptable key in a HashMap.

So how is data retrieved if two keys generate the same hash code?

The same hash code will make the data for both keys land on the same index in the table. From there, each Entry object in the list is iterated over and its key is compared with the search key. Yes, both the key and the value are stored in the Node/Entry object! On a successful match the corresponding value is returned.

    public V get(Object key) {
        if (key == null)
            return getForNullKey();
        int hash = hash(key.hashCode());
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }
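
To see this collision handling in action, here is a small hypothetical example: a key class (BadKey, my own illustration) whose hashCode() is deliberately constant, so every key lands in the same bucket, yet get() still finds the right value because equals() is consulted:

    import java.util.HashMap;
    import java.util.Map;

    public class CollisionDemo {
        // Hypothetical key whose hashCode() always collides.
        static final class BadKey {
            final String name;
            BadKey(String name) { this.name = name; }

            @Override
            public int hashCode() { return 42; } // every key -> same bucket

            @Override
            public boolean equals(Object o) {
                return o instanceof BadKey && ((BadKey) o).name.equals(name);
            }
        }

        public static void main(String[] args) {
            Map<BadKey, Integer> map = new HashMap<>();
            map.put(new BadKey("a"), 1);
            map.put(new BadKey("b"), 2); // collides, chained in the same bucket

            System.out.println(map.get(new BadKey("a"))); // prints 1
            System.out.println(map.get(new BadKey("b"))); // prints 2
        }
    }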

How the table size and bucket index are computed

First of all, the Node array (the table) size is always a power of two (2^n). The following method, from the Java 8 source, guarantees it -

    static final int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

So let's say you provide an initial capacity of 5
cap = 5

n = cap - 1 =  4 = 0 1 0 0
n |= n >>> 1;    0 1 0 0 | 0 0 1 0 = 0 1 1 0 = 6
n |= n >>> 2;    0 1 1 0 | 0 0 0 1 = 0 1 1 1 = 7
n |= n >>> 4;    0 1 1 1 | 0 0 0 0 = 0 1 1 1 = 7
n |= n >>> 8;    0 1 1 1 | 0 0 0 0 = 0 1 1 1 = 7
n |= n >>> 16;   0 1 1 1 | 0 0 0 0 = 0 1 1 1 = 7
return n + 1     7 + 1 = 8


So table size is 8 = 2^3

Now the possible index values you can put your element at are 0-7, since the table size is 8. Now let's look at the put method. It computes the bucket index as follows -

Node<K,V> p =  tab[i = (n - 1) & hash];

where n is the array size. So n = 8. It is the same as saying

Node p = tab[i = hash % n];

So all we need to see now is how

hash % n == (n - 1) & hash

Let's again take an example. Let's say the hash of a value is 10.

hash = 10
hash % n = 10 % 8 = 2
(n - 1) & hash = 7 & 10 = 0 1 1 1 & 1 0 1 0 = 0 0 1 0 = 2

So it's an optimized modulo operation using the bitwise & operator. (This equivalence holds because n is a power of two.)
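
A quick throwaway snippet (illustrative, not JDK code) to convince yourself of the equivalence for a power-of-two table size:

    public class IndexDemo {
        public static void main(String[] args) {
            int n = 8; // table size, must be a power of two
            for (int hash = 0; hash < 100; hash++) {
                if ((hash % n) != ((n - 1) & hash))
                    throw new AssertionError("mismatch at " + hash);
            }
            System.out.println("10 % 8       = " + (10 % 8));       // prints 2
            System.out.println("(8 - 1) & 10 = " + ((8 - 1) & 10)); // prints 2
        }
    }

As a bonus, (n - 1) & hash always yields a valid non-negative index even for a negative hash, whereas the % operator would return a negative remainder.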

A good hash function

A hash function is a function that computes, from a key, the index at which its data is stored. It should obviously return a value between 0 and (n-1) for an array of length n used to store the data.

A good hash function takes minimal computation time and distributes keys evenly across the array.
For an array of size n with m elements inserted, the load factor is

α = m/n

A hash function can be thought of as having two parts -
  1. Hash code map (Key -> Integer)
  2. Compression map (Integer -> [0, n-1])
A simple hash function (compression map) for an array of size n would be

  • h(k) = k mod n, where k is the key and n is the size of the array.

NOTE : In the function h(k) above you need to take care that n is not a power of 2. If it is, k mod n keeps only the last log2(n) bits of the key, which is not a good method.

So choose the table size (and hence the modulus) to be a prime number close to the desired n.


Another compression map can be -

  • h(k) = floor( m * (k * A mod 1) ), where k is the key, m is the size of the array, and A is a constant between 0 and 1, i.e. 0 < A < 1 (the multiplication method).
NOTE : For Strings, avoid simply adding up the ASCII values of the characters, as that is not a good function - many different words can produce the same sum and map to the same bucket, resulting in more collisions. Use a polynomial function instead. So if the ASCII values of your characters are c0, c1, c2, use a polynomial like -

c0 + c1(x) + c2(x^2)

The String class in Java uses the following hash method -

    /**
     * Returns a hash code for this string. The hash code for a
     * <code>String</code> object is computed as
     * <blockquote><pre>
     * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
     * </pre></blockquote>
     * using <code>int</code> arithmetic, where <code>s[i]</code> is the
     * <i>i</i>th character of the string, <code>n</code> is the length of
     * the string, and <code>^</code> indicates exponentiation.
     * (The hash value of the empty string is zero.)
     *
     * @return  a hash code value for this object.
     */
    public int hashCode() {
        int h = hash;
        int len = count;
        if (h == 0 && len > 0) {
            int off = offset;
            char val[] = value;

            for (int i = 0; i < len; i++) {
                h = 31*h + val[off++];
            }
            hash = h;
        }
        return h;
    }



Other hashing techniques (open addressing) are
  1. Linear probing (a runnable sketch follows this list) :

    if (table is full) error
    probe = h(k)
    while (table[probe] is not empty)
        probe = (probe + 1) mod m
    table[probe] = k
  2. Double hashing :

    if (table is full) error
    probe = h1(k)
    offset = h2(k)
    while (table[probe] is not empty)
        probe = (probe + offset) mod m
    table[probe] = k
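
Here is a minimal runnable sketch of linear probing in Java. It is illustrative only - the table size M, the Integer key type and the hash function h are my own assumptions, not JDK code:

    public class LinearProbingDemo {
        static final int M = 16;                 // table size (assumed)
        static final Integer[] table = new Integer[M];

        // Simple compression map: k mod M (keys assumed non-negative).
        static int h(int k) { return k % M; }

        static void insert(int k) {
            int probe = h(k);
            int steps = 0;
            while (table[probe] != null) {
                if (++steps == M) throw new IllegalStateException("table is full");
                probe = (probe + 1) % M;         // step to the next slot
            }
            table[probe] = k;
        }

        public static void main(String[] args) {
            insert(5);
            insert(21); // 21 mod 16 == 5: collides with 5, ends up in slot 6
            System.out.println(table[5] + ", " + table[6]); // prints 5, 21
        }
    }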

NOTE : In Java, if hashing is involved (let's say you are using a HashMap or a HashSet), make sure you override the equals() and hashCode() methods to suit your requirements. As much as is reasonably practical, the hashCode() method defined by class Object does return distinct integers for distinct objects. (This is typically implemented by converting the internal address of the object into an integer.)
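
As a minimal sketch of a well-behaved key class, here is a hypothetical Point that honors the contract - objects that are equal must have equal hash codes:

    import java.util.Objects;

    // Hypothetical key class: equal fields imply equal hashCode,
    // as the equals()/hashCode() contract requires.
    final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Point)) return false;
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() { return Objects.hash(x, y); }
    }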

HashMap changes in Java 8

Performance has been improved by using balanced trees instead of linked lists when a single bucket grows too large. This has been implemented in the classes
  1. java.util.HashMap,
  2. java.util.LinkedHashMap and
  3. java.util.concurrent.ConcurrentHashMap.
This improves the worst-case performance of operations on a bucket from O(n) to O(log n).
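
For reference, the Java 8 HashMap source defines the cut-over points as constants - a bucket is treeified once its length crosses TREEIFY_THRESHOLD (and only if the table has at least MIN_TREEIFY_CAPACITY slots; otherwise the table is simply resized), and a tree shrinks back to a linked list at UNTREEIFY_THRESHOLD:

    static final int TREEIFY_THRESHOLD = 8;
    static final int UNTREEIFY_THRESHOLD = 6;
    static final int MIN_TREEIFY_CAPACITY = 64;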

Lastly, let's see HashSet.

Understanding HashSet

Well, if you are still guessing at the data structure behind HashSet, then the following may be a surprise -

    private transient HashMap<E,Object> map;

    // Dummy value to associate with an Object in the backing Map
    private static final Object PRESENT = new Object();



Yup, HashSet stores its data as keys of a HashMap with a dummy value. Its constructors are equivalent to those of HashMap -

    /**
     * Constructs a new, empty set; the backing <tt>HashMap</tt> instance has
     * default initial capacity (16) and load factor (0.75).
     */
    public HashSet() {
        map = new HashMap<E,Object>();
    }

    public HashSet(int initialCapacity, float loadFactor) {
        map = new HashMap<E,Object>(initialCapacity, loadFactor);
    }

    public HashSet(int initialCapacity) {
        map = new HashMap<E,Object>(initialCapacity);
    }


The add() and remove() methods are as follows -

    public boolean add(E e) {
        return map.put(e, PRESENT)==null;
    }

    public boolean remove(Object o) {
        return map.remove(o)==PRESENT;
    }
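
A quick usage illustration of how the backing map's return values translate into set semantics:

    import java.util.HashSet;
    import java.util.Set;

    public class HashSetDemo {
        public static void main(String[] args) {
            Set<String> set = new HashSet<>();
            System.out.println(set.add("a"));    // true  (map.put returned null)
            System.out.println(set.add("a"));    // false (key already present)
            System.out.println(set.remove("a")); // true  (map.remove returned PRESENT)
        }
    }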

Understanding LinkedHashMap

LinkedHashMap, as we know, preserves insertion order. However, it also extends HashMap, so it retains O(1) time complexity for insertion as well. So how does it really work?

LinkedHashMap maintains another type of Entry which holds pointers to the previous as well as the next Entry node.

    static class Entry<K,V> extends HashMap.Node<K,V> {
        Entry<K,V> before, after;
        Entry(int hash, K key, V value, Node<K,V> next) {
            super(hash, key, value, next);
        }
    }

LinkedHashMap also stores the head and tail of this doubly linked list, and that's how it maintains insertion order. So even though put follows hash-based storage, iteration order is maintained using the doubly linked list.

    /**
     * The head (eldest) of the doubly linked list.
     */
    transient LinkedHashMap.Entry<K,V> head;
    /**
     * The tail (youngest) of the doubly linked list.
     */
    transient LinkedHashMap.Entry<K,V> tail;

So whenever you add a new key-value pair, the following happens -

    Node<K,V> newNode(int hash, K key, V value, Node<K,V> e) {
        LinkedHashMap.Entry<K,V> p =
            new LinkedHashMap.Entry<K,V>(hash, key, value, e);
        linkNodeLast(p);
        return p;
    }
    // link at the end of list
    private void linkNodeLast(LinkedHashMap.Entry<K,V> p) {
        LinkedHashMap.Entry<K,V> last = tail;
        tail = p;
        if (last == null)
            head = p;
        else {
            p.before = last;
            last.after = p;
        }
    } 

When iterating, it walks from head to tail, thereby providing the same order as insertion.

NOTE : The before and after pointers are in addition to the next pointer, which is inherited from the HashMap.Node class. So next points to the next node in the same bucket (the collision scenario), thereby preserving O(1) lookups, while before and after guarantee insertion-order iteration.
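
A short illustrative comparison of iteration order between the two maps:

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class OrderDemo {
        public static void main(String[] args) {
            Map<String, Integer> linked = new LinkedHashMap<>();
            linked.put("one", 1);
            linked.put("two", 2);
            linked.put("three", 3);
            // Walks head -> tail: always prints [one, two, three]
            System.out.println(linked.keySet());

            Map<String, Integer> plain = new HashMap<>(linked);
            // No order guarantee here; output depends on the hash table layout.
            System.out.println(plain.keySet());
        }
    }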

Understanding TreeMap

TreeMap, as you already know, stores data in sorted key order. If the keys implement the Comparable interface, it stores data in that natural order; alternatively you can pass a custom Comparator to the TreeMap and it will use that to sort and store the data.

TreeMap maintains a binary search tree structure. It has a root; keys less than the root go to the left, whereas keys greater than the root go to the right, and the tree is balanced too (it is a red-black tree, a kind of self-balancing binary search tree).

    /**
     * The comparator used to maintain order in this tree map, or
     * null if it uses the natural ordering of its keys.
     *
     * @serial
     */
    private final Comparator<? super K> comparator;
    private transient Entry<K,V> root;

A simple code snippet (from the put path) would be -

        Entry<K,V> e = new Entry<>(key, value, parent);
        if (cmp < 0)
            parent.left = e;
        else
            parent.right = e;

where cmp is the comparison value obtained either from the Comparator's compare() method or from Comparable's compareTo() method.
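
A small usage sketch (illustrative code) showing both orderings:

    import java.util.Comparator;
    import java.util.Map;
    import java.util.TreeMap;

    public class TreeMapDemo {
        public static void main(String[] args) {
            // Natural (Comparable) ordering of the String keys.
            Map<String, Integer> natural = new TreeMap<>();
            natural.put("banana", 2);
            natural.put("apple", 1);
            System.out.println(natural.keySet()); // prints [apple, banana]

            // Custom Comparator: reverse alphabetical order.
            Map<String, Integer> reversed = new TreeMap<>(Comparator.reverseOrder());
            reversed.putAll(natural);
            System.out.println(reversed.keySet()); // prints [banana, apple]
        }
    }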
