Parsing HTML

How to Parse HTML aka creating the `Doc`-object

Skrape{it} will create a Doc -object that is representing the parsed HTML Document whenever the htmlDocument{} function has been called.

It can either parse HTML from String, a File or from an HTTP-requests response body (from web).

skrape {
    url = "http://skrape.it"
    extract {
        htmlDocument {
            // parsed Doc is available here
        }
    }
}

htmlDocument(File("path/to/file/example.html")) {
    // parsed Doc is available here
}

htmlDocument("<div>skrape<b>it</b></div>") {
    // parsed Doc is available here
}

Picking Html-Elements from a Doc

Now that we have a Doc-object this is one of the parts where skrape{it} probably shines the most. It provides the possibility to pick Html-elements in a DSL-ish way. To archive this behaviour skrape{it} provides extension functions on Doc that are representing all available HTML tags following the HTML5 standart.

val someHtml = """
    <html>
        <head>
            <link rel="shortcut icon" href="https://some.url/icon">
            <script src="https://some.url/some-script.js"></script>
            <meta name="foo" content="bar">
        </head>
        <body>
            i'm the body
            <h1>i'm the headline</h1>
            <main>
                <p class="foo bar">i'm a paragraph</p>
                <p>i'm a second paragraph</p>
                <p>i'm a paragraph <wbr> with word break</p>
                <p>i'm the last paragraph</p>
            </main>
        </body>
    </html>
"""

htmlDocument(someHtml) {
    meta { 
        withAttribute = "name" to "foo"
        findFirst {
            attribute("content") toBe "bar"
        }
    }
    h1 {
        findFirst {
            text toBe "i'm the headline"
        }
    }
    ol {
        findFirst {
            className toContain "navigation"
        }
    }
    p {
        findAll {
            toBePresentTimes(4)
            forEach { 
                text toContain "paragraph"
            }
        }
    }
    p {
        withClass = "foo" and "bar"
        findFirst {
            text toBe "i'm a paragraph"
        }
    }
}

Picking Custom HTML tags

Since the HTML5 standart it is possible create Custom Elements that define new types of HTML elements. Such elements as well as all other elements can be selected by the use of string inkocation that will again create an css-selector and be used to pick matching elements.

val someHtml = """
    <body>
        <a-custom-tag>foo</a-custom-tag>
        <a-custom-tag class="some-style">bar</a-custom-tag>
    </body>
"""

htmlDocument(someHtml) {
    "a-custom-tag" {
        withClass = "some-style"
        findFirst {
            text toBe "bar"
        }
    }
}

Building CSS selectors

As shown above Skrape{it} provides the possibility to pick elements by either use its corresponding DSL function or via String invokation. Both will create a CssSelector-scope that allows us to build complex css-selectors in an idiomatic fashion.

Let's imagine we want to create a selector that is matching the following complex html element: <button class="foo bar" fizz="buzz" disabled>click me</button>

We could either archive this by using a css query selector:

htmlDocument {
    "button.foo.bar[fizz='buzz'][disabled]" {
        findFirst { /* will pick first matching occurence */ }
        findAll { /* will pick all matching occurences */ }
    }
}

or we could do it even more readable and less error prone:

htmlDocument {
    button {
        withClass = "foo" and "bar"
        withAttribute = "fizz" to "buzz"
        withAttributeKey = "disabled"
        findFirst { /* will pick first matching occurence */ }
        findAll { /* will pick all matching occurences */ }
    }
}

PreviousHow to Use NextMatchers

Last updated 4 years ago

Was this helpful?

How to Parse HTML aka creating the Doc-object

Picking Html-Elements from a Doc

Picking Custom HTML tags

Building CSS selectors

How to Parse HTML aka creating the `Doc`-object