Saturday, June 14, 2014

Web scraping in Java with Webgrude

In this post I will show how to scrape a fake search result page using Webgrude.
Webgrude is a Java library that uses Apache HttpComponents and Jsoup to scrape web pages and fill annotated Java classes with the scraped values.
For example, let's say this is the search result returned by a GET request to the URL
http://www.example.com/searchTerm

<html>
  <body>
    <h1>Search results for searchTerm:</h1>
    <div class="result">
      <span class="result-title">First result</span>
      <span class="result-content">First result content</span>
    </div>
    <div class="result">
      <span class="result-title">Second result</span>
      <span class="result-content">Second result content</span>
    </div>
  </body>
</html>

The returned page has a repeating div of class "result", each one containing two spans: the result title and the result content. The search term appears in an h1 tag at the top of the page.
Let's start by scraping only the page header. The page will be mapped to this class:

public class SearchResultPage {
    public String header; // this field will contain the page header
}

To do this, we annotate SearchResultPage with @Page and the String field "header" with @Selector.
@Page is a class annotation. It takes as its parameter the URL to be scraped. The value can also point to a local file using "file://<path>".
@Selector is a field annotation. The Webgrude browser uses it to look up elements in the HTML and set the field to the matching element's value.
To keep the example self-contained, SearchResultPage is declared as a static nested class.

// Browser is in the webGrude package, so it must be imported from outside that package
import webGrude.Browser;
import webGrude.annotations.Page;
import webGrude.annotations.Selector;

public class PrintExampleSearchResults {

    @Page("http://www.example.com/{0}")
    public static class SearchResultPage {
        @Selector("h1") public String header;
    }

    public static void main(String... args) {
        SearchResultPage searchPage = Browser.open(SearchResultPage.class, "search term");
        System.out.println(searchPage.header);
    }
}

In the @Page annotation, the page URL is passed as the annotation value. Note that the search term was replaced by {0} in the URL. Tokens in the page URL are replaced by the arguments passed to Browser.open.
The field annotated with @Selector("h1") is populated with the text of the element matched by the selector "h1", in this case "Search results for searchTerm:".
Browser.open(SearchResultPage.class, "search term") returns an instance of SearchResultPage filled with the content from the URL. The argument "search term" replaces the {0} token.
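
The {0} token syntax is the same positional-placeholder syntax used by the JDK's java.text.MessageFormat, so the substitution step can be pictured with it. The sketch below is only an illustration under that assumption: UrlTemplateDemo and fill are hypothetical names, and Webgrude's own implementation (including whether it URL-encodes the arguments) may differ.

```java
import java.text.MessageFormat;

class UrlTemplateDemo {

    // Hypothetical helper: replaces {0}, {1}, ... in the template with the
    // arguments at the matching positions. Not part of Webgrude.
    static String fill(String template, Object... args) {
        return MessageFormat.format(template, args);
    }

    public static void main(String[] args) {
        // prints http://www.example.com/searchTerm
        System.out.println(fill("http://www.example.com/{0}", "searchTerm"));
        // Several tokens are filled by position.
        // prints http://www.example.com/books/java
        System.out.println(fill("http://www.example.com/{0}/{1}", "books", "java"));
    }
}
```

Passing Strings keeps the substitution literal; MessageFormat would apply locale-specific formatting to numeric arguments.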
Now let's add the search results to a list of SearchResult. We will start by writing a class that represents a search result.

public class SearchResult {
    public String resultTitle;
    public String resultContent;
}

What we want is to map the divs of class "result" to a list of SearchResult. To do this, we annotate a List of SearchResult with the class selector ".result".
In the SearchResult class, we annotate each field with a selector that is applied inside each element matched by @Selector(".result").

// Browser is in the webGrude package, so it must be imported from outside that package
import webGrude.Browser;
import webGrude.annotations.Page;
import webGrude.annotations.Selector;
import java.util.List;

public class PrintExampleSearchResults {

    @Selector("")
    public static class SearchResult {
        @Selector(".result-title") public String resultTitle;
        @Selector(".result-content") public String resultContent;
    }

    @Page("http://www.example.com/{0}")
    public static class SearchResultPage {
        @Selector("h1") public String header;
        @Selector(".result") public List<SearchResult> searchResults;
    }

    public static void main(String... args) {
        SearchResultPage searchPage = Browser.open(SearchResultPage.class, "search term");
        System.out.println(searchPage.header);
        searchPage.searchResults.forEach(result ->
            System.out.println("Title: " + result.resultTitle + "\n Content: " + result.resultContent)
        );
    }
}

Note that SearchResult is annotated with @Selector with an empty value. Webgrude needs this annotation to know that the class must be mapped. This may change in future versions.
Also, this example is atypical. Usually the element you want to map is wrapped inside another element, so you will normally give the class-level selector a non-empty value.
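
The scoping rule used here (an outer selector picks the repeating elements, and the field selectors are then evaluated inside each one) can be tried with nothing but the JDK. The sketch below is not Webgrude or Jsoup code: it swaps in the JDK's XML parser and XPath, which works only because the sample page happens to be well-formed XML. ScopedSelectionDemo and extract are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ScopedSelectionDemo {

    // The sample search result page from this post, as one string.
    static final String PAGE =
          "<html><body>"
        + "<h1>Search results for searchTerm:</h1>"
        + "<div class=\"result\">"
        + "<span class=\"result-title\">First result</span>"
        + "<span class=\"result-content\">First result content</span>"
        + "</div>"
        + "<div class=\"result\">"
        + "<span class=\"result-title\">Second result</span>"
        + "<span class=\"result-content\">Second result content</span>"
        + "</div>"
        + "</body></html>";

    // Returns one "title -> content" line per div of class "result".
    static List<String> extract(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Outer selection: plays the role of ".result" on the list field.
            NodeList results = (NodeList) xpath.evaluate(
                "//div[@class='result']", doc, XPathConstants.NODESET);

            List<String> lines = new ArrayList<>();
            for (int i = 0; i < results.getLength(); i++) {
                Node result = results.item(i);
                // Inner selections are relative to the current result node,
                // so each one only sees the spans of its own div.
                String title = xpath.evaluate("span[@class='result-title']", result);
                String content = xpath.evaluate("span[@class='result-content']", result);
                lines.add(title + " -> " + content);
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample page", e);
        }
    }

    public static void main(String[] args) {
        // prints:
        // First result -> First result content
        // Second result -> Second result content
        extract(PAGE).forEach(System.out::println);
    }
}
```

Because the inner expressions are evaluated against each result node rather than the whole document, the second item never picks up the first item's title.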

Moving on to practical examples, here is how to scrape PirateBay search results:

package webGrude;

import webGrude.annotations.Page;
import webGrude.annotations.Selector;
import java.util.List;

public class PirateBayExample {

    @Page("http://thepiratebay.se/search/{0}/0/7/0")
    public static class SearchResult {
        // attr = "href" maps each link's href attribute to the list
        @Selector(value = "#searchResult tbody tr td a[href*=magnet]", attr = "href")
        public List<String> magnets;
    }

    public static void main(String... args) {
        Browser.open(SearchResult.class, "ubuntu iso").magnets.forEach(s -> System.out.println(s));
    }
}

The only difference here is that the @Selector annotation defines which attribute is mapped to the field. If no attribute is defined, the element's rendered HTML value is used.

Besides Strings, Lists, and classes annotated with @Selector, Webgrude can map values to other field types:

  • Primitive types: float, int, boolean.
  • webGrude.elements.Link. Link is used when a link points to another page that has its own corresponding class.
  • org.jsoup.nodes.Element. Webgrude uses Jsoup to select elements. The Element class has getters and setters for attributes, among other things.
Fields of any other type are ignored or, if annotated, raise an exception.
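
One way to picture how a mapper might choose a conversion from a field's declared type is the reflection sketch below. It is an illustration built on assumptions, not Webgrude's implementation: @Sel is a hypothetical stand-in for @Selector, TypedFieldDemo and fill are made-up names, and the scraped text comes from a plain map instead of an HTML page.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

class TypedFieldDemo {

    // Hypothetical stand-in for Webgrude's @Selector; not the library's annotation.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Sel {
        String value();
    }

    static class Product {
        @Sel(".name") public String name;
        @Sel(".price") public float price;
        @Sel(".stock") public int stock;
        @Sel(".available") public boolean available;
        public Object notes; // not annotated, so the mapper ignores it
    }

    // Sets every annotated field from the text stored under its selector,
    // converting the text according to the field's declared type.
    static <T> T fill(T target, Map<String, String> textBySelector) {
        for (Field field : target.getClass().getDeclaredFields()) {
            Sel sel = field.getAnnotation(Sel.class);
            if (sel == null) continue;            // unannotated fields are skipped
            String text = textBySelector.get(sel.value());
            if (text == null) continue;           // nothing matched this selector
            Class<?> type = field.getType();
            try {
                if (type == String.class) {
                    field.set(target, text);
                } else if (type == int.class) {
                    field.setInt(target, Integer.parseInt(text.trim()));
                } else if (type == float.class) {
                    field.setFloat(target, Float.parseFloat(text.trim()));
                } else if (type == boolean.class) {
                    field.setBoolean(target, Boolean.parseBoolean(text.trim()));
                } else {
                    // annotated field of an unsupported type
                    throw new IllegalArgumentException("Unsupported field type: " + type);
                }
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return target;
    }

    public static void main(String[] args) {
        Map<String, String> scraped = new HashMap<>();
        scraped.put(".name", "Widget");
        scraped.put(".price", "9.5");
        scraped.put(".stock", " 12 ");
        scraped.put(".available", "true");

        Product p = fill(new Product(), scraped);
        System.out.println(p.name + " " + p.price + " " + p.stock + " " + p.available);
    }
}
```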

Webgrude uses HttpComponents, which fetches the raw HTML without running scripts, so it does not support sites built dynamically with JavaScript. It is possible to change the browser implementation using Browser.setWebClient(final BrowserClient client). So if JavaScript support is needed, it's possible to use HtmlUnit to implement a new browser client, for example. Another feature that may be implemented in the future is POST requests.
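
The idea behind a replaceable browser client is the usual strategy pattern: the code that maps pages only needs something that turns a URL into HTML. The sketch below shows that pattern with made-up types. PageFetcher and MiniBrowser are hypothetical stand-ins, not Webgrude's BrowserClient or Browser, whose actual methods are not shown in this post.

```java
import java.util.HashMap;
import java.util.Map;

class PluggableClientDemo {

    // Hypothetical stand-in for a browser client: turns a URL into HTML text.
    interface PageFetcher {
        String get(String url);
    }

    // Hypothetical stand-in for the class that holds the active client.
    static class MiniBrowser {
        // Default client fails loudly until a real one is installed.
        private static PageFetcher fetcher = url -> {
            throw new IllegalStateException("No client configured for " + url);
        };

        static void setFetcher(PageFetcher newFetcher) {
            fetcher = newFetcher;
        }

        static String open(String url) {
            return fetcher.get(url);
        }
    }

    // A client that serves canned pages from memory instead of the network.
    static PageFetcher cannedPages(Map<String, String> pages) {
        return url -> {
            String html = pages.get(url);
            if (html == null) {
                throw new IllegalArgumentException("Unknown URL: " + url);
            }
            return html;
        };
    }

    public static void main(String[] args) {
        Map<String, String> pages = new HashMap<>();
        pages.put("http://www.example.com/searchTerm",
                  "<h1>Search results for searchTerm:</h1>");

        // Swap the client, then open a page through the same entry point.
        MiniBrowser.setFetcher(cannedPages(pages));
        System.out.println(MiniBrowser.open("http://www.example.com/searchTerm"));
    }
}
```

A client that executes JavaScript would be plugged in the same way, by returning the HTML as it stands after the scripts have run.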

If you want to see Webgrude in action, it is currently being used in FilmeUtils, an automated subtitle/torrent downloader, or you can read the unit tests in the source.
