skrape{it}
1.1.X
1.1.X
  • Introduction
  • Why it exists
  • overview
    • Setup
    • Who should be using it
  • Http Client
    • Overview
    • Fetchers
      • HttpFetcher
      • BrowserFetcher
      • AsyncFetcher
      • Implement your own
    • Request Options
    • Pre-configure client
    • Response
      • Status
      • Cookies
  • Html Parser
    • Parsing HTML
  • assertions
    • expect content
  • How to Use
    • Testing
    • Scraping
    • JS-rendered sites
  • Examples
    • Grab all links from a Website
    • Creating a RESTful API (Spring-Boot)
  • GitHub Repo
  • Extensions
    • MockMvc
      • Getting Started
      • GitHub Repo
    • Ktor
      • Getting Started
      • GitHub Repo
  • About skrape{it}
Powered by GitBook
On this page
  • How to Parse HTML aka creating the Doc-object
  • Picking Html-Elements from a Doc
  • Picking Custom HTML tags
  • Building CSS selectors

Was this helpful?

  1. Html Parser

Parsing HTML

PreviousCookiesNextexpect content

Last updated 4 years ago

Was this helpful?

How to Parse HTML aka creating the Doc-object

Skrape{it} will create a Doc -object that is representing the parsed HTML Document whenever the htmlDocument{} function has been called.

It can either parse HTML from String, a File or from an HTTP-requests response body (from web).

skrape {
    url = "http://skrape.it"
    extract {
        htmlDocument {
            // parsed Doc is available here
        }
    }
}
htmlDocument(File("path/to/file/example.html")) {
    // parsed Doc is available here
}
htmlDocument("<div>skrape<b>it</b></div>") {
    // parsed Doc is available here
}

Picking Html-Elements from a Doc

Now that we have a this is one of the parts where skrape{it} probably shines the most. It provides the possibility to pick Html-elements in a DSL-ish way. To archive this behaviour skrape{it} provides extension functions on Doc that are representing all available HTML tags following the HTML5 standart.

val someHtml = """
    <html>
        <head>
            <link rel="shortcut icon" href="https://some.url/icon">
            <script src="https://some.url/some-script.js"></script>
            <meta name="foo" content="bar">
        </head>
        <body>
            i'm the body
            <h1>i'm the headline</h1>
            <main>
                <p class="foo bar">i'm a paragraph</p>
                <p>i'm a second paragraph</p>
                <p>i'm a paragraph <wbr> with word break</p>
                <p>i'm the last paragraph</p>
            </main>
        </body>
    </html>
"""

htmlDocument(someHtml) {
    meta { 
        withAttribute = "name" to "foo"
        findFirst {
            attribute("content") toBe "bar"
        }
    }
    h1 {
        findFirst {
            text toBe "i'm the headline"
        }
    }
    ol {
        findFirst {
            className toContain "navigation"
        }
    }
    p {
        findAll {
            toBePresentTimes(4)
            forEach { 
                text toContain "paragraph"
            }
        }
    }
    p {
        withClass = "foo" and "bar"
        findFirst {
            text toBe "i'm a paragraph"
        }
    }
}

Picking Custom HTML tags

val someHtml = """
    <body>
        <a-custom-tag>foo</a-custom-tag>
        <a-custom-tag class="some-style">bar</a-custom-tag>
    </body>
"""

htmlDocument(someHtml) {
    "a-custom-tag" {
        withClass = "some-style"
        findFirst {
            text toBe "bar"
        }
    }
}

Building CSS selectors

As shown above Skrape{it} provides the possibility to pick elements by either use its corresponding DSL function or via String invokation. Both will create a CssSelector-scope that allows us to build complex css-selectors in an idiomatic fashion.

Let's imagine we want to create a selector that is matching the following complex html element: <button class="foo bar" fizz="buzz" disabled>click me</button>

We could either archive this by using a css query selector:

htmlDocument {
    "button.foo.bar[fizz='buzz'][disabled]" {
        findFirst { /* will pick first matching occurence */ }
        findAll { /* will pick all matching occurences */ }
    }
}

or we could do it even more readable and less error prone:

htmlDocument {
    button {
        withClass = "foo" and "bar"
        withAttribute = "fizz" to "buzz"
        withAttributeKey = "disabled"
        findFirst { /* will pick first matching occurence */ }
        findAll { /* will pick all matching occurences */ }
    }
}

Since the HTML5 standart it is possible create that define new types of HTML elements. Such elements as well as all other elements can be selected by the use of string inkocation that will again create an css-selector and be used to pick matching elements.

Custom Elements
Doc-object