Saturday 7 February 2015

String processing using Matcher, Pattern and Regex in Java

Background

Regular expressions, or simply regex, have long been an important part of String processing. We will explore them in this post.


Before you get to the Java code, let's take a look at the regex symbols - 

Regex Symbols

Common Regex Symbols : 

Common Regex Meta Symbols : 


Common Regex Quantifier Symbols : 

Now let's head on to the Java code....

Using java.util.regex.Pattern and java.util.regex.Matcher classes

Let's use these classes to demonstrate regex.

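    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PunctuationCounter
    {
        public static void main(String args[])
        {
            String str = "Hi there! My name is John. How can I help you?";
            // Character class: matches a single '.', '!' or '?'
            Pattern p = Pattern.compile("[.!?]");
            Matcher matcher = p.matcher(str);
            int count = 0;
            // find() advances to the next match and returns false when none is left
            while(matcher.find()) {
                count++;
            }
            System.out.println("Count : " + count);
        }
    }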

and the output is - Count : 3. Let's see what we did here. We have a String - "Hi there! My name is John. How can I help you?" and we are interested in finding the number of times the symbol '.', '!' or '?' appears in the String. So we provided the regex "[.!?]". 

Did not quite get the regex part? Go back to the previous section - Regex Symbols. Notice under Common Regex Symbols that [xyz] represents x or y or z. We are simply incrementing the counter each time the regex finds a match, meaning a '.', '!' or '?'.
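To make the character class behaviour concrete, here is a minimal sketch. The class name CharClassDemo and the helper countMatches() are made up for this illustration. It shows that '.' and '?' are treated as literal characters inside [...], whereas an unescaped dot outside a character class matches any character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: counts how many times the regex matches in the input.
    static int countMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String str = "Hi there! My name is John. How can I help you?";
        // Inside [...] the dot and the question mark are plain characters.
        System.out.println(countMatches("[.!?]", str)); // 3
        // Outside a character class an unescaped dot matches any character,
        // so every character of this single-line String is counted.
        System.out.println(countMatches(".", str) == str.length()); // true
    }
}
```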

Let's head on to something more interesting. How many times have you used the String class's split() method? Did you realize that what split() takes is in fact a regex? We split a line into words as follows -

    public class SplitExample
    {
        public static void main(String args[])
        {
            String str = "Hello all. Welcome to Open Source For Geeks!";
            // The argument is a regex; a single space simply matches itself
            String[] tokens = str.split(" ");
            for(String token : tokens)
            {
                System.out.println(token);
            }
        }
    }


and you get output as -

Hello
all.
Welcome
to
Open
Source
For
Geeks!


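as expected. If you look at the source of the split() method, you will see that it internally uses java.util.regex.Pattern to compile your regex and uses it to split the String (newer JDK versions first try a fast path for simple single-character separators, then fall back to this) -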

    public String[] split(String regex, int limit) {
        return Pattern.compile(regex).split(this, limit);
    }
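Because the argument is a regex, metacharacters in the separator behave like regex, not like plain text. The following sketch illustrates this (SplitRegexDemo and splitOn() are names invented for this example).

```java
import java.util.Arrays;

class SplitRegexDemo {
    // Hypothetical thin wrapper so the calls below read uniformly.
    static String[] splitOn(String input, String regex) {
        return input.split(regex);
    }

    public static void main(String[] args) {
        // An unescaped dot matches every character, so all pieces are empty,
        // and split() drops trailing empty strings: nothing is left.
        System.out.println(splitOn("a.b.c", ".").length); // 0
        // Escaping the dot splits on the literal '.' character.
        System.out.println(Arrays.toString(splitOn("a.b.c", "\\."))); // [a, b, c]
        // A quantifier treats a run of whitespace as a single separator.
        System.out.println(Arrays.toString(splitOn("Hello   all.  Welcome", "\\s+"))); // [Hello, all., Welcome]
    }
}
```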


Now let's do the same thing as split using Pattern and Matcher.

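    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class WordFinder
    {
        public static void main(String args[])
        {
            String str = "Hello all. Welcome to Open Source For Geeks!";
            // \\w+ matches one or more word characters
            Pattern pattern= Pattern.compile("\\w+");
            Matcher matcher = pattern.matcher(str);
            while(matcher.find()) {
                // group() returns the text of the current match
                System.out.println(matcher.group());
            }
        }
    }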


and the output is -

Hello
all
Welcome
to
Open
Source
For
Geeks


Noticed any difference? The '.' (dot) and '!' (exclamation mark) are not present. Well, that is expected. See the regex symbols again. We used "\\w+", which means match one or more word characters, and the dot and exclamation mark aren't word characters.

Note : Space is also not a word character. Otherwise the matches would have spanned several words at a time, broken only at the dot and the exclamation mark. Digits and the underscore are word characters too, i.e. \\w is equivalent to [a-zA-Z_0-9].
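A small sketch of what counts as a word character, assuming a made-up input String and a hypothetical helper findAll() that collects every match into a list:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordCharDemo {
    // Hypothetical helper: returns every match of the regex, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // Letters, digits and '_' stay together; space, '-' and ':' break a match.
        System.out.println(findAll("\\w+", "user_42 logged-in at 10:30"));
        // [user_42, logged, in, at, 10, 30]
    }
}
```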

So let's slightly modify our code to get the same output as split().

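    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class WordWithPunctuationFinder
    {
        public static void main(String args[])
        {
            String str = "Hello all. Welcome to Open Source For Geeks!";
            // word characters, then optional dots, then optional exclamation marks
            Pattern pattern= Pattern.compile("\\w+\\.*!*");
            Matcher matcher = pattern.matcher(str);
            while(matcher.find()) {
                System.out.println(matcher.group());
            }
        }
    }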


and the output is -

Hello
all.
Welcome
to
Open
Source
For
Geeks!


OK, that's the same. What did we do here? Notice the regex again. This time it is - "\\w+\\.*!*". We are saying: get me each sequence that matches [(one or more word characters) then (zero or more dot symbols) then (zero or more exclamation symbols)].

Note : Dot is a regex metacharacter, so you need to escape it (as we did with \\. ) when you want to match a literal dot. The exclamation mark has no special meaning on its own, so we could use it directly.
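As a sketch of the escaping rule, the example below compares an unescaped and an escaped dot. It also uses Pattern.quote(), which escapes a whole literal for you. The class EscapeDemo and the helper matchesLiterally() are names invented here.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Hypothetical helper: true only if input is exactly the given literal text.
    static boolean matchesLiterally(String literal, String input) {
        // Pattern.quote() turns the literal into a regex that matches it verbatim.
        return Pattern.compile(Pattern.quote(literal)).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println("axb".matches("a.b"));   // true  - dot matches any character
        System.out.println("axb".matches("a\\.b")); // false - escaped dot needs a real '.'
        System.out.println("a.b".matches("a\\.b")); // true
        System.out.println(matchesLiterally("a.b", "axb")); // false
        System.out.println(matchesLiterally("a.b", "a.b")); // true
    }
}
```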


Another Example


Let's do something practical now. Let's say you have the following data (maybe in a file or just an array) -

data1=${key1}
data2=${key2}
data3=${key3}

and you have to replace the placeholder with the actual value whose mapping you have (again maybe in a different file).

key1 -> ActualKey1
key2 -> ActualKey2
key3 -> ActualKey3

and finally you want output like -

data1=ActualKey1
data2=ActualKey2
data3=ActualKey3

Let's write code to achieve this using regex.

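    import java.util.HashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PlaceholderReplacer
    {
        public static void main(String args[])
        {
            String[] replaceableData = new String[]{"data1=${key1}","data2=${key2}","data3=${key3}"};
            Map<String, String> keyMappings = new HashMap<String, String>();
            keyMappings.put("key1", "ActualKey1");
            keyMappings.put("key2", "ActualKey2");
            keyMappings.put("key3", "ActualKey3");
            // group 1: "name=", group 2: "${", group 3: the key, group 4: "}"
            Pattern pattern= Pattern.compile("(\\w+=)(\\$\\{)(\\w+)(\\})");
            for(String str : replaceableData) {
                Matcher matcher = pattern.matcher(str);
                // The pattern matches the whole line, so only group 3 (the key) remains
                String key = matcher.replaceAll("$3");
                System.out.println("Key : " + key);
                // replaceAll() resets the matcher first, so it can be reused here
                String newStr = matcher.replaceAll("$1"+ keyMappings.get(key));
                System.out.println("After Replacement : " + newStr);
            }
        }
    }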


and the output is -

Key : key1
After Replacement : data1=ActualKey1
Key : key2
After Replacement : data2=ActualKey2
Key : key3
After Replacement : data3=ActualKey3


Another important concept to note here is the groups in the regex, which you can refer to later with $groupIndex.

In our regex - "(\\w+=)(\\$\\{)(\\w+)(\\})"

group1 -> (\\w+=)
group2 -> (\\$\\{)
group3 -> (\\w+)
group4 -> (\\})
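Also note that replaceAll() replaces every match of the pattern with the replacement you provide. Here the pattern matches the entire String, so the whole String is replaced by whatever group combination you give. For example, matcher.replaceAll("$1$2$3$4") will give you back the same String.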
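The groups can also be read directly with matcher.group(index) once a match has succeeded. Here is a minimal sketch using the same regex (GroupDemo and groupOf() are hypothetical names); group 0 is always the entire match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    static final Pattern LINE = Pattern.compile("(\\w+=)(\\$\\{)(\\w+)(\\})");

    // Hypothetical helper: text captured by the given group,
    // or null if the line does not match the pattern at all.
    static String groupOf(String line, int index) {
        Matcher matcher = LINE.matcher(line);
        return matcher.matches() ? matcher.group(index) : null;
    }

    public static void main(String[] args) {
        String line = "data1=${key1}";
        // Prints the whole match (group 0), then "data1=", "${", "key1" and "}".
        for (int i = 0; i <= 4; i++) {
            System.out.println("group " + i + " -> " + groupOf(line, i));
        }
        // Stitching all four groups back together reproduces the input.
        System.out.println(LINE.matcher(line).replaceAll("$1$2$3$4").equals(line)); // true
    }
}
```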

Note : replaceAll() returns a new String. The original String is not changed.
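The example above calls replaceAll() twice per line and assumes each line is exactly one name=${key} pair. As an alternative sketch (not the approach used above), the placeholders can be resolved in a single pass with find(), appendReplacement() and appendTail(). The names PlaceholderResolver and resolve() are invented here, and leaving unknown keys untouched is an assumption of this sketch. Matcher.quoteReplacement() guards against '$' or '\' in the mapped values being read as group references.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PlaceholderResolver {
    // Matches ${key} anywhere in a line; group 1 is the key.
    static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    // Hypothetical helper: replaces every ${key} that has a mapping.
    static String resolve(String line, Map<String, String> mappings) {
        Matcher matcher = PLACEHOLDER.matcher(line);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            String key = matcher.group(1);
            // Assumption: an unknown key keeps its original ${key} text.
            String value = mappings.getOrDefault(key, matcher.group());
            matcher.appendReplacement(result, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(result); // copy whatever follows the last match
        return result.toString();
    }

    public static void main(String[] args) {
        Map<String, String> keyMappings = new HashMap<>();
        keyMappings.put("key1", "ActualKey1");
        keyMappings.put("key2", "ActualKey2");
        System.out.println(resolve("data1=${key1}", keyMappings));        // data1=ActualKey1
        System.out.println(resolve("a=${key1}, b=${key2}", keyMappings)); // a=ActualKey1, b=ActualKey2
        System.out.println(resolve("data9=${key9}", keyMappings));        // data9=${key9}
    }
}
```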

Note : Ever used the replace() and replaceAll() methods of the String class? Both of them replace all occurrences with the String you desire. The only difference is that replaceAll() treats its first argument as a regex, whereas replace() treats it as a literal CharSequence.
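A short sketch of that difference, using a made-up version String (ReplaceDemo is a hypothetical class name):

```java
class ReplaceDemo {
    public static void main(String[] args) {
        String version = "1.2.3";
        // replace() treats "." as the literal dot character.
        System.out.println(version.replace(".", "-"));      // 1-2-3
        // replaceAll() treats "." as a regex that matches any character.
        System.out.println(version.replaceAll(".", "-"));   // -----
        // Escaping the dot makes replaceAll() behave like replace() here.
        System.out.println(version.replaceAll("\\.", "-")); // 1-2-3
        // The original String is unchanged in every case.
        System.out.println(version);                        // 1.2.3
    }
}
```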