Saturday, June 14, 2014

Web scraping in Java with Webgrude

In this post I will show you how to scrape a fake search result page using Webgrude.
Webgrude is a Java library that uses Apache HttpComponents and Jsoup to scrape web pages and fill annotated Java classes with the scraped values.
For example, let's say you have a search result page returned by a GET request to the URL
http://www.example.com/searchTerm

The returned page has a repeating div of class "result", each one containing two spans: the result title and the result content. The search term is in an h1 tag at the top of the page.
Let's start by scraping the page only for the page header. The page will be mapped to a class named SearchResultPage.

To do this, we declare SearchResultPage with a @Page annotation and the String field "header" with a @Selector annotation.
The @Page annotation is a class annotation. It receives as a parameter the URL that will be scraped. The value can also point to a file using "file://<path>".
The @Selector annotation is a field annotation. The Webgrude browser uses it to look for elements in the HTML and to insert the corresponding element value into the field.
For the example to be self-contained, SearchResultPage will be declared as an inner class.
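
As a rough sketch of the declaration described above, the block below defines two local stand-in annotations named Page and Selector and reads them back through reflection. These stand-ins are assumptions made so the example compiles without any dependency; they are not Webgrude's own annotation types, and the helper methods pageUrl and selectorOf are hypothetical names used only for illustration.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

class AnnotatedPageSketch {

    // Local stand-in for the class annotation described in the post (NOT Webgrude's own type).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    @interface Page {
        String value();
    }

    // Local stand-in for the field annotation described in the post (NOT Webgrude's own type).
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Selector {
        String value();
    }

    // Declared as an inner class so the example stays self-contained.
    @Page("http://www.example.com/{0}")
    static class SearchResultPage {
        @Selector("h1")
        String header;
    }

    // Hypothetical helper: reads the URL template from the class annotation.
    static String pageUrl(Class<?> type) {
        return type.getAnnotation(Page.class).value();
    }

    // Hypothetical helper: reads the selector attached to a named field.
    static String selectorOf(Class<?> type, String fieldName) {
        try {
            Field field = type.getDeclaredField(fieldName);
            return field.getAnnotation(Selector.class).value();
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No field named " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(pageUrl(SearchResultPage.class));
        System.out.println(selectorOf(SearchResultPage.class, "header"));
    }
}
```

With the real library, the two annotations would be imported from Webgrude instead of being declared locally, and the reflection part would be handled by the Webgrude browser.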

In the @Page annotation, the page URL is passed as the annotation value. Note also that the search term is replaced by {0} in the URL. The tokens in the page URL are replaced by the arguments passed to Browser.open.
The field annotated with @Selector("h1") is populated with the text of the element matched by the selector "h1", in this case "Search results for searchTerm".
Browser.open(SearchResultPage.class, "search term") returns an instance of SearchResultPage filled with the content from the URL. The argument "search term" replaces the {0} token.
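
To make the token replacement concrete, here is a small sketch that fills a {0} token with java.text.MessageFormat from the standard library. Whether Webgrude uses MessageFormat internally is an assumption; the class name UrlTokenSketch and the method fillTokens are hypothetical and exist only for this illustration.

```java
import java.text.MessageFormat;

class UrlTokenSketch {

    // Replaces {0}, {1}, ... in the template with the given arguments, in order.
    // Note: this sketch does not URL-encode the arguments.
    static String fillTokens(String template, Object... args) {
        return MessageFormat.format(template, args);
    }

    public static void main(String[] args) {
        System.out.println(fillTokens("http://www.example.com/{0}", "searchTerm"));
    }
}
```

One thing to keep in mind with MessageFormat is that single quotes are treated as escape characters in the pattern, so a URL template containing them would need extra care.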
Now let's add the search results to a list of SearchResult. We will start by writing a class that represents a search result.

What we want is to map the divs of class "result" to a list of SearchResult. To do this, we annotate a List of SearchResult with the class selector ".result".
In the SearchResult class, we annotate the fields with the selectors that will be applied inside each element matched by @Selector(".result").

Note that SearchResult is annotated with a @Selector containing an empty value. Webgrude needs this annotation to know that this class must be mapped. This may change in future versions.
Also, this example is atypical. Usually the element that you wish to map is wrapped inside another element, so you will normally want a non-empty value in the class selector.
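
The nested mapping described above could be declared roughly as follows. As before, Selector is a local stand-in annotation and not Webgrude's own type. The selectors "span.title" and "span.content" are assumptions, since the fake page's markup for the two spans is not spelled out here. The helper listElementType is hypothetical; it shows one way a mapper could discover, through reflection, which class the list elements should be mapped to.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.lang.reflect.ParameterizedType;
import java.util.List;

class NestedMappingSketch {

    // Local stand-in usable on classes and fields (NOT Webgrude's own type).
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.TYPE, ElementType.FIELD})
    @interface Selector {
        String value();
    }

    // The empty selector marks the class as one that must be mapped.
    @Selector("")
    static class SearchResult {
        @Selector("span.title")   // assumed selector for the title span
        String title;

        @Selector("span.content") // assumed selector for the content span
        String content;
    }

    static class SearchResultPage {
        @Selector("h1")
        String header;

        // Each div of class "result" becomes one SearchResult.
        @Selector(".result")
        List<SearchResult> results;
    }

    // Hypothetical helper: finds the element type of a List<...> field.
    static Class<?> listElementType(Class<?> owner, String fieldName) {
        try {
            Field field = owner.getDeclaredField(fieldName);
            ParameterizedType listType = (ParameterizedType) field.getGenericType();
            return (Class<?>) listType.getActualTypeArguments()[0];
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No field named " + fieldName, e);
        }
    }

    public static void main(String[] args) {
        Class<?> elementType = listElementType(SearchResultPage.class, "results");
        System.out.println(elementType.getSimpleName());
        System.out.println(elementType.isAnnotationPresent(Selector.class));
    }
}
```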

Moving on to a practical example, the same approach can be used to scrape the PirateBay search results.

The only difference here is that the @Selector annotation also defines which attribute will be mapped to the field. If no attribute is defined, the rendered HTML value is used.

Besides Strings, Lists, and classes annotated with @Selector, Webgrude can map values to other types of fields:

  • Primitive types: float, int, boolean.
  • webGrude.elements.Link. Link is used when a link points to another page that has its own corresponding class.
  • org.jsoup.nodes.Element. Webgrude uses Jsoup to select elements. The Element class has getters and setters for attributes, among other things.
Fields of any other type are ignored or, if annotated, raise an exception.
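
To give an idea of what mapping scraped text to primitive fields involves, here is a minimal sketch of a converter that picks a parser based on the field type. This is not Webgrude's actual conversion code; the class FieldConversionSketch and the method convert are hypothetical, and the choice to throw on unsupported types simply mirrors the behaviour described above for annotated fields.

```java
class FieldConversionSketch {

    // Converts scraped text to the given field type, or throws if the type is unsupported.
    static Object convert(String text, Class<?> targetType) {
        String trimmed = text.trim();
        if (targetType == String.class) {
            return text;
        }
        if (targetType == int.class || targetType == Integer.class) {
            return Integer.parseInt(trimmed);
        }
        if (targetType == float.class || targetType == Float.class) {
            return Float.parseFloat(trimmed);
        }
        if (targetType == boolean.class || targetType == Boolean.class) {
            return Boolean.parseBoolean(trimmed);
        }
        throw new IllegalArgumentException("Unsupported field type: " + targetType.getName());
    }

    public static void main(String[] args) {
        System.out.println(convert(" 42 ", int.class));
        System.out.println(convert("3.5", float.class));
        System.out.println(convert("true", boolean.class));
    }
}
```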

Webgrude uses HttpComponents, so it does not support sites built dynamically with JavaScript. It is possible to change the browser implementation using Browser.setWebClient(final BrowserClient client). So if JavaScript support is needed, a new browser client could be implemented on top of a library such as HtmlUnit, for example. Another feature that may be implemented in the future is support for POST requests.

If you want to see Webgrude in action, it is currently being used in FilmeUtils, an automated subtitle/torrent downloader, or you can read the unit tests in the source.