Saturday, June 14, 2014

Web scraping in Java with Webgrude

In this post I will show you how to scrape a fake search result page using Webgrude.
Webgrude is a Java library that uses Apache HttpComponents and Jsoup to scrape web pages and fill annotated Java classes with the scraped values.
For example, let's say you have a search result page returned by a GET request to a URL that contains the search term.

The returned page has a repeating div of class "result", each one containing two spans: the result title and the result content. The search term is in an h1 tag at the top of the page.
Let's start by scraping only the page header. The page will be mapped to a class named SearchResultPage.

To do this we declare SearchResultPage with a @Page annotation and a String field "header" with a @Selector annotation.
@Page is a class annotation. It receives as a parameter the URL to be scraped. The value can also point to a file using "file://<path>".
@Selector is a field annotation. The Webgrude browser uses it to look up elements in the HTML and insert the value of the matching element into the field.
For the example to be self-contained, SearchResultPage will be declared as an inner class.
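
To make the idea concrete, here is a rough sketch of how an annotated class like this could be filled by reflection. It is not Webgrude's code: the @Page and @Selector annotations below are local stand-ins so the example compiles with the JDK alone, the example.com URL is a placeholder, and the hypothetical fill method takes a map from selector to text in place of a real HTML document.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.Collections;
import java.util.Map;

public class AnnotationMappingSketch {

    // Local stand-in for a class-level annotation that carries the page URL.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.TYPE)
    public @interface Page {
        String value();
    }

    // Local stand-in for a field-level annotation that carries a CSS selector.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface Selector {
        String value();
    }

    // The page class described in the post, declared as an inner class.
    // The URL is a placeholder, not the one used in the original example.
    @Page("http://www.example.com/search?q={0}")
    public static class SearchResultPage {
        @Selector("h1")
        public String header;
    }

    // Hypothetical mapper: instantiates the class and copies, for each
    // @Selector field, the text registered under that selector.
    public static <T> T fill(Class<T> pageClass, Map<String, String> textBySelector) {
        try {
            T page = pageClass.getDeclaredConstructor().newInstance();
            for (Field field : pageClass.getDeclaredFields()) {
                Selector selector = field.getAnnotation(Selector.class);
                if (selector == null) {
                    continue; // fields without the annotation are left alone
                }
                field.setAccessible(true);
                field.set(page, textBySelector.get(selector.value()));
            }
            return page;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not map " + pageClass.getName(), e);
        }
    }

    public static void main(String[] args) {
        // Stands in for the text that a real HTML parser would select.
        Map<String, String> fakeHtml =
                Collections.singletonMap("h1", "Search results for searchTerm");
        SearchResultPage page = fill(SearchResultPage.class, fakeHtml);
        System.out.println(page.header); // prints "Search results for searchTerm"
    }
}
```

In the real library the selector is resolved against the downloaded page by Jsoup; the map here only keeps the sketch free of external dependencies.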

The page URL is passed as the value of the @Page annotation, with the search term replaced by a {0} token. The tokens in the page URL are replaced by the arguments passed to Browser.open.
The field annotated with @Selector("h1") is populated with the text of the element matched by the selector "h1", in this case "Search results for searchTerm".
Browser.open(SearchResultPage.class, "search term") returns an instance of SearchResultPage filled with the content from the URL. The argument "search term" replaces the {0} token.
Now let's add the search results to a list of SearchResult. We will start by writing a class that represents a search result.
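
The {0} token syntax is the same one the JDK's java.text.MessageFormat understands, so the substitution can be sketched with it. Whether Webgrude uses MessageFormat internally is an assumption on my part, the expand helper and the example.com URL are hypothetical, and the URL encoding step is my own addition.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.text.MessageFormat;

public class UrlTokenSketch {

    // Hypothetical helper: replaces {0}, {1}, ... in the template with the
    // URL-encoded arguments, in order.
    public static String expand(String urlTemplate, String... args) {
        Object[] encoded = new Object[args.length];
        for (int i = 0; i < args.length; i++) {
            try {
                encoded[i] = URLEncoder.encode(args[i], "UTF-8");
            } catch (UnsupportedEncodingException e) {
                throw new IllegalStateException("UTF-8 is always supported", e);
            }
        }
        return MessageFormat.format(urlTemplate, encoded);
    }

    public static void main(String[] args) {
        String url = expand("http://www.example.com/search?q={0}", "search term");
        System.out.println(url); // prints http://www.example.com/search?q=search+term
    }
}
```

One caveat if you try this on your own URLs: MessageFormat treats single quotes in the template as escape characters, so a template containing an apostrophe needs it doubled.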

What we want is to map the divs of class "result" to a list of SearchResult. To do this we annotate a List of SearchResult with the class selector ".result".
In the SearchResult class we annotate the fields with selectors that are applied inside each element matched by @Selector(".result").

Note that SearchResult itself is annotated with @Selector with an empty value. Webgrude needs this annotation to know that the class must be mapped. This may change in future versions.
This example is also atypical. Usually the element you wish to map is wrapped inside another element, so you will normally want a non-empty value on the class selector.

Moving on to practical examples, here is how to scrape the PirateBay search results:

The only difference here is that the @Selector annotation also defines which attribute of the element is mapped to the field. If no attribute is defined, the rendered HTML value is used.

Besides Strings, lists and classes annotated with @Selector, Webgrude can map values to other types of fields:

  • Primitive types: float, int, boolean.
  • webGrude.elements.Link. Link is used when you have a link that points to another page that has a class corresponding to it.
  • org.jsoup.nodes.Element. Webgrude uses Jsoup to select elements. The Element class has getters and setters for attributes, among other things.
A field of any other type is ignored if it is not annotated, and raises an exception if it is.
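
The conversion from scraped text to a typed field can be pictured roughly like this. It is only a sketch of the behaviour described above, using JDK parsing methods; the convert helper is hypothetical and Webgrude's own conversion code may differ (Link and Element are left out because they need the library).

```java
public class FieldValueConverterSketch {

    // Hypothetical helper: turns the text selected from the page into a value
    // that can be assigned to a field of the given type.
    public static Object convert(String raw, Class<?> targetType) {
        String text = raw.trim();
        if (targetType == String.class) {
            return raw;
        }
        if (targetType == int.class || targetType == Integer.class) {
            return Integer.parseInt(text);
        }
        if (targetType == float.class || targetType == Float.class) {
            return Float.parseFloat(text);
        }
        if (targetType == boolean.class || targetType == Boolean.class) {
            return Boolean.parseBoolean(text);
        }
        // Mirrors the rule above: an annotated field of an unsupported type is an error.
        throw new IllegalArgumentException("Unsupported field type: " + targetType.getName());
    }

    public static void main(String[] args) {
        System.out.println(convert(" 42 ", int.class));     // prints 42
        System.out.println(convert("3.5", float.class));    // prints 3.5
        System.out.println(convert("true", boolean.class)); // prints true
    }
}
```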

Webgrude uses HttpComponents, so it does not support sites built dynamically with JavaScript. It is possible to change the browser implementation using Browser.setWebClient(final BrowserClient client). So if JavaScript support is needed, it's possible to use HtmlUnit to implement a new browser, for example. POST requests are another feature that may be implemented in the future.

If you want to see Webgrude in action, it is currently being used in FilmeUtils, an automated subtitle/torrent downloader, or you can read the unit tests in the source.

Saturday, April 26, 2014

Problem installing Receitanet 2014 on 64-bit Ubuntu 14.04

When trying to install Receitanet-1.04.bin on 64-bit Ubuntu 14.04 I was getting this error:
invalid command name "bind"
    while executing
"::unknown bind Text "
    ("uplevel" body line 1)
    invoked from within
"uplevel 1 $next $args"
    (procedure "::obj::Unknown" line 3)
    invoked from within
"bind Text "
    (procedure "::InstallJammer::InitializeGui" line 19)
    invoked from within
"::InstallJammer::InitializeGui "
    (procedure "::InstallJammer::InitInstall" line 68)
    invoked from within
    (file "/installkitvfs/main.tcl" line 37477)
Searching on Google, I found several places saying that the problem was missing 32-bit GTK packages, which can be installed with:
sudo apt-get install libgtk2.0-0:i386 libidn11:i386 libglu1-mesa:i386
But installing the packages did not help. The problem is that the .bin installer simply will not work.
The Receita website selects the .bin installer by default, but the dropdown "Selecione o seu sistema operacional:" ("Select your operating system:") has the option "Linux (deb)", which installs without problems.

Wednesday, February 26, 2014

A Koch island written in JavaScript

The Koch snowflake is a fractal curve that arises from recursively dividing a line segment into three sections and replacing the middle section with two line segments that form a triangle.
It's a really famous fractal and one that's really easy to draw.
One cool property of this fractal is that if you randomly switch the direction in which the middle section triangle points, you get a line that resembles a coastline.
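
The recursion is short enough to sketch. The version below is in Java rather than JavaScript and is not the code behind the interactive page; the class and method names are made up for illustration. It returns the points of the curve, and passing a Random makes each triangle point to a random side.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class KochCoastSketch {

    // Returns the polyline from a to b after `depth` subdivisions.
    // If `flip` is non-null, each triangle points to a randomly chosen side.
    public static List<double[]> koch(double[] a, double[] b, int depth, Random flip) {
        List<double[]> points = new ArrayList<>();
        points.add(a);
        subdivide(a, b, depth, flip, points);
        return points;
    }

    // Appends every point after `a`, up to and including `b`.
    private static void subdivide(double[] a, double[] b, int depth, Random flip,
                                  List<double[]> out) {
        if (depth == 0) {
            out.add(b);
            return;
        }
        // One third of the segment, as a vector.
        double dx = (b[0] - a[0]) / 3.0;
        double dy = (b[1] - a[1]) / 3.0;
        double[] p1 = {a[0] + dx, a[1] + dy};
        double[] p2 = {a[0] + 2 * dx, a[1] + 2 * dy};

        // Rotate the middle third by +60 or -60 degrees to find the triangle tip.
        double sign = (flip != null && flip.nextBoolean()) ? -1.0 : 1.0;
        double cos = 0.5;
        double sin = sign * Math.sqrt(3.0) / 2.0;
        double[] apex = {p1[0] + dx * cos - dy * sin, p1[1] + dx * sin + dy * cos};

        subdivide(a, p1, depth - 1, flip, out);
        subdivide(p1, apex, depth - 1, flip, out);
        subdivide(apex, p2, depth - 1, flip, out);
        subdivide(p2, b, depth - 1, flip, out);
    }

    public static void main(String[] args) {
        double[] start = {0, 0};
        double[] end = {3, 0};
        // Each level turns every segment into four, so depth 3 gives 4^3 + 1 points.
        System.out.println(koch(start, end, 3, null).size());               // prints 65
        System.out.println(koch(start, end, 3, new Random(42)).size());     // prints 65
    }
}
```

With flip set to null you get one side of the regular snowflake; with a Random the point count is the same but the outline becomes irregular, which is where the coastline look comes from.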

I've written an interactive version of the Koch snowflake, including an option to randomly change the triangle direction and a slider to increase or decrease the size of the middle section.

Have fun playing with it :)


Draw a circle in PlayN

A simple implementation of drawCircle on a PlayN surface. The circle is approximated by a fan of triangles that share the center point.

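    // Assumes these imports: playn.core.Surface and pythagoras.f.MathUtil.
    private void drawCircle(float x, float y, float radius, Surface surface, int color) {
        // Number of triangles in the fan; raise it for a smoother circle.
        int circunferenceSidesCount = 10;

        // Vertex 0 is the center; vertices 1..n lie on the rim, stored as x, y pairs.
        float[] points = new float[(circunferenceSidesCount + 1) * 2];
        // Three vertex indices per triangle.
        int[] indices = new int[circunferenceSidesCount * 3];
        points[0] = x;
        points[1] = y;
        for (int i = 0; i < circunferenceSidesCount; i++) {
            int pointIndex = (i + 1) * 2;

            // Fraction of the full turn covered so far, in [0, 1).
            float angleProgression = (float) i / circunferenceSidesCount;

            double deltaX = radius * Math.cos(MathUtil.TWO_PI * angleProgression);
            points[pointIndex] = (float) (x - deltaX);
            double deltaY = radius * Math.sin(MathUtil.TWO_PI * angleProgression);
            points[pointIndex + 1] = (float) (y - deltaY);

            // Triangle i joins this rim vertex, the next one (wrapping back to 1) and the center.
            int indicesIndex = i * 3;
            indices[indicesIndex] = i + 1;
            indices[indicesIndex + 1] = (i + 2 > circunferenceSidesCount) ? 1 : i + 2;
            indices[indicesIndex + 2] = 0;
        }

        // The original snippet never used the color parameter; apply it before filling.
        surface.setFillColor(color);
        surface.fillTriangles(points, indices);
    }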