Parsing HTML
How to Parse HTML aka creating the Doc-object
skrape {
    url = "http://skrape.it"
    extract {
        htmlDocument {
            // parsed Doc is available here
        }
    }
}

htmlDocument(File("path/to/file/example.html")) {
    // parsed Doc is available here
}

htmlDocument("<div>skrape<b>it</b></div>") {
    // parsed Doc is available here
}

Picking Html-Elements from a Doc
val someHtml = """
    <html>
        <head>
            <link rel="shortcut icon" href="https://some.url/icon">
            <script src="https://some.url/some-script.js"></script>
            <meta name="foo" content="bar">
        </head>
        <body>
            i'm the body
            <h1>i'm the headline</h1>
            <main>
                <p class="foo bar">i'm a paragraph</p>
                <p>i'm a second paragraph</p>
                <p>i'm a paragraph <wbr> with word break</p>
                <p>i'm the last paragraph</p>
            </main>
        </body>
    </html>
"""

htmlDocument(someHtml) {
    meta {
        withAttribute = "name" to "foo"
        findFirst {
            attribute("content") toBe "bar"
        }
    }
    h1 {
        findFirst {
            text toBe "i'm the headline"
        }
    }
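    // the sample HTML above contains no <ol>; this selector only matches
    // a document that has an <ol> whose class contains "navigation"
    ol {
        findFirst {
            className toContain "navigation"
        }
    }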
    p {
        findAll {
            toBePresentTimes(4)
            forEach {
                text toContain "paragraph"
            }
        }
    }
    p {
        withClass = "foo" and "bar"
        findFirst {
            text toBe "i'm a paragraph"
        }
    }
}

Picking Custom HTML tags
Building CSS selectors