Monday, 6 October 2014

Web scraping using the Jsoup Java library

Background

Websites are generally intended for humans to visit and browse their content, but we can very well automate this process. What happens when we hit a URL in the browser? A GET or POST request is sent to the server, the server authenticates the request (if required) and replies with a response - typically an HTML response. Our browser understands the response and renders it in human-readable form. Once you know this, it is no problem for a programmer to automate a REST API call using cURL or HttpClient, parse the response and fetch the data of interest.
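For example, here is a minimal sketch in plain Java (the URL is just a placeholder) that sends a GET request and prints the raw HTML response, which is essentially what the browser does before rendering:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class RawGetDemo {
    public static void main(String[] args) throws Exception {
        // Open a plain HTTP GET to the URL, exactly what the browser does under the hood
        HttpURLConnection connection =
                (HttpURLConnection) new URL("http://example.com").openConnection();
        connection.setRequestMethod("GET");

        // Read and print the raw HTML response line by line
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}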


Let me go a few steps further. Today we have libraries in many languages that automate the whole process of sending requests, parsing the response and extracting the data you are actually interested in. These are known as web scrapers and the technique is known as web scraping.


Ever seen a text-image validation or captcha before entering a website? Well, it's there to ensure you are a human and not some bot (an automated script like the one we will see in this post). Then one might ask, why do we need web scraping at all? Heard of Google? How do you think it shows you such accurate search results based on your query, or rather, how does it know that a relevant page exists in the first place? Yes... web scraping. Explaining how it gets such accurate results is a bit tricky, as it involves ranking algorithms and in-depth knowledge of data mining, so let's skip that for now :)



Note

Web scraping is not strictly ethical and may even be termed hacking in some scenarios, so please check the legal policies of the website before trying anything like this. My intentions here are purely academic in nature :)

So do go ahead, play with new libraries and new APIs, learn new things... but stay within the rules.

Web scraping using Jsoup




So coming back to the title of the post: we are going to use the Jsoup library in Java to scrape web pages. As per their homepage info -

  • jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
  • jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.
    •  scrape and parse HTML from a URL, file, or string
    • find and extract data, using DOM traversal or CSS selectors
    • manipulate the HTML elements, attributes, and text
    • clean user-submitted content against a safe white-list, to prevent XSS attacks
    • output tidy HTML
  • jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
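To get a quick feel for these features before we build the actual scraper, here is a small self-contained sketch (the class name and sample HTML are made up for illustration) exercising the parse, select and clean APIs:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist;

public class JsoupFeaturesDemo {
    public static void main(String[] args) {
        // Parse HTML from a string; note the sloppy, unclosed tags
        Document doc = Jsoup.parse("<html><body><p class='intro'>Hello <b>world");

        // Extract data using a CSS selector
        System.out.println(doc.select("p.intro").text()); // prints "Hello world"

        // Clean user-submitted content against a whitelist to prevent XSS
        String dirty = "<p>Nice post!<script>alert('xss')</script></p>";
        System.out.println(Jsoup.clean(dirty, Whitelist.basic())); // script tag stripped
    }
}

Even with the unclosed tags in the first string, jsoup still builds a sensible parse tree, which is exactly the tag-soup tolerance the list above mentions.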

Setup & Goal

Goal: I am going to scrape my own blog - http://opensourceforgeeks.blogspot.in - and then print all the post titles that show up on the first page.


Setup :

I am going to use the Ivy dependency manager and Eclipse, as I do for most of my projects.
I am using jsoup version 1.7.3; you can check the available versions in the Maven repository. So my Ivy file looks something like this -

<ivy-module version="2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:noNamespaceSchemaLocation="http://ant.apache.org/ivy/schemas/ivy.xsd">
    <info
        organisation="OpenSourceForGeeks"
        module="WebScrapper"
        status="integration">
    </info>
    
    <dependencies>
    
        <dependency org="org.jsoup" name="jsoup" rev="1.7.3"/>
        
    </dependencies>
   

</ivy-module>
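If you use Maven instead of Ivy, the equivalent dependency declaration for the same version should look like this:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.7.3</version>
</dependency>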



So go ahead: build your project, then resolve and add the Ivy library. This will download jsoup and put it on your classpath. Create classes and packages to suit your requirements. My project structure looks like this -



Code :

Add the following code to your WebScrapper class -

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * 
 * @author athakur
 *
 */
public class WebScrapper {
    
    public static void main(String args[]) {
        WebScrapper webScrapper = new WebScrapper();
        // Fetch the raw HTML of the blog's first page
        String page = webScrapper.getPageData("http://opensourceforgeeks.blogspot.in/");
        // Parse it into a DOM and select all anchors that are direct children of .post-title
        Document doc = Jsoup.parse(page);
        Elements elements = doc.select(".post-title > a");
        for(Element element : elements) {
            System.out.println("POST TITLE : " + element.childNode(0).toString());
        }
    }

    
    
    public String getPageData(String targetUrl) {
        URL url = null;
        URLConnection urlConnection = null;
        BufferedReader reader = null;
        String output = "";
       
        try {
            url = new URL(targetUrl);
        }
        catch (MalformedURLException e) {
            System.out.println("Target URL is not correct. URL : " + targetUrl);
            return null;
        }
       
        try {
            // Open the connection and read the raw HTML response line by line
            urlConnection = url.openConnection();
            reader = new BufferedReader(new InputStreamReader(urlConnection.getInputStream()));
            String input = null;
            while((input = reader.readLine()) != null){
                output += input;
            }
            reader.close();
        }
        catch (IOException ioException) {
            System.out.println("IO Exception occurred");
            ioException.printStackTrace();
            return null;
        }
       
        return output;
       
    }

    
}
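As a side note, jsoup can also fetch the page itself, so the getPageData() helper above could be replaced by a single connect() call. A minimal sketch (the user agent string is an arbitrary example, and get() throws IOException which you would need to handle):

// Uses the same imports as the class above (org.jsoup.Jsoup, Document, Element)
Document doc = Jsoup.connect("http://opensourceforgeeks.blogspot.in/")
        .userAgent("Mozilla/5.0") // some servers reject requests that have no user agent
        .get();                   // sends the GET and parses the response in one step
for (Element element : doc.select(".post-title > a")) {
    System.out.println("POST TITLE : " + element.text());
}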



Go ahead, run the program and check the output.

Output

POST TITLE : What is object-oriented programming? - By Steve Jobs
POST TITLE : Experimenting with Oracle 11g R2 database
POST TITLE : Difference between DML and DDL statements in SQL
POST TITLE : Difference between a stored procedure and user defined function in SQL
POST TITLE : IO in Java (Using Scanner and BufferedReader)

Explanation

If you are aware of HTML/CSS, the code is quite self-explanatory. All you have to understand is how the select query works. If your HTML element has id="myId", you can refer to it as #myId. The same goes for classes: if your element has class="myClass", you can refer to it as .myClass. The select query works exactly the same way. For the full selector syntax, refer to the jsoup documentation.
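For instance (the ids, classes and markup below are made up purely for illustration):

Document doc = Jsoup.parse(
        "<div id='myId'>by id</div>"
        + "<div class='myClass'>by class</div>"
        + "<div class='post-title'><a href='#'>by nesting</a></div>");

System.out.println(doc.select("#myId").text());           // the element with id="myId"
System.out.println(doc.select(".myClass").text());        // elements with class="myClass"
System.out.println(doc.select(".post-title > a").text()); // anchors that are direct children of .post-title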

In the select query that I have used, I am simply saying: parse the page and give me all the elements that are anchor tags (a) and are direct children of an element with class="post-title". The display text of those anchor tags is the data we are interested in.

How did I know which class or id to search for? Well, we need to do some manual searching. In my case I first parsed the whole page, printed it and searched for the pattern I was interested in. You can do the same by going to the page and inspecting its source code via the browser.
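In code, that first exploration step can be as simple as dumping the parsed document and searching the output for the text you are after (a throwaway sketch, where page is the HTML string returned by getPageData()):

Document doc = Jsoup.parse(page);
// Print the whole parsed page once, then search the printed output for the data you want
System.out.println(doc.html());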
