Overview

Why does skrape{it} provide its own http client implementations?

Skrape{it} offers a unified, intuitive, DSL-controlled way to make parsing websites as comfortable as possible.

  • An http-client DSL free of verbosity and ceremony that lets you make requests and set request options like headers, cookies, etc. in a fluent-style interface.

  • Pre-configure a client once and either reuse it as-is or adjust only the things that differ for certain requests - especially handy when working with authentication flows or custom headers.

  • Can handle client-side rendered web pages (e.g. pages built with frameworks like React.js, Angular or Vue.js, or pages manipulated with jQuery or other JavaScript).

Making an http request is as easy as the example below. Just call the skrape function wherever you want in your code. It forces you to pass a fetcher and makes further request options available in the closure.

skrape(HttpFetcher) { // <-- pass any Fetcher, e.g. HttpFetcher, BrowserFetcher, ...
    // ... request options go here, e.g. the most basic would be url
    url = "https://docs.skrape.it"
    expect {}
    extract {}
}

The http request is only executed after either the extract or expect function has been called. This behaviour also allows you to preconfigure the http client for multiple calls. If you use expect as well as extract, it will still only make one request.
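For illustration, a minimal sketch following the DSL shape shown above, using expect and extract against the same configuration (so only a single request is made). The empty assertion placeholder, the titleText property and the assumption that the value produced inside extract becomes the result of the skrape call depend on the skrape{it} version in use.

val pageTitle = skrape(HttpFetcher) {
    url = "https://docs.skrape.it"

    expect {
        // assertions on the response would go here, e.g. checking the status code
        // (the exact assertion helpers depend on your skrape{it} version)
    }

    extract {
        htmlDocument {
            titleText // assumed convenience property for the page's <title> text
        }
    }
}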

The Different Fetchers

Skrape{it} provides different types of Fetchers (a.k.a. http clients) that can be passed to its DSL. All of them execute http requests, but each of them handles a different use case.

You want to scrape a simple HTML page, easily and as fast as possible, with JavaScript deactivated?
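That is presumably the job of the plain HttpFetcher used above: it fetches the raw HTML over http without executing any JavaScript. A rough sketch, assuming the findAll selector helper and the eachText property are available inside htmlDocument:

val headlines: List<String> = skrape(HttpFetcher) {
    url = "https://docs.skrape.it"

    extract {
        htmlDocument {
            findAll("h2") { eachText } // collect the text of all <h2> elements
        }
    }
}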

You want to scrape a complex website, maybe a single-page application written with frameworks like React.js, Angular or Vue.js, or one that at least relies heavily on JavaScript?
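For pages like these, the BrowserFetcher mentioned in the first example's comment is presumably the right choice: instead of fetching only the raw HTML, it renders the page, including its JavaScript, before parsing. A sketch under the same assumptions as above:

val renderedTitle = skrape(BrowserFetcher) { // renders the page before parsing
    url = "https://docs.skrape.it"

    extract {
        htmlDocument {
            titleText // reflects the DOM after client-side rendering
        }
    }
}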

You want to scrape multiple HTML pages in parallel from inside a coroutine?
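A non-blocking fetcher fits here; skrape{it} ships an AsyncFetcher meant to be called from suspending code (its name and exact coroutine integration are assumptions about the version in use). A rough sketch of scraping several pages in parallel with kotlinx.coroutines:

import kotlinx.coroutines.async
import kotlinx.coroutines.awaitAll
import kotlinx.coroutines.coroutineScope

// assumed: skrape(AsyncFetcher) can be called from a suspend function
suspend fun titleOf(pageUrl: String): String = skrape(AsyncFetcher) {
    url = pageUrl
    extract {
        htmlDocument { titleText }
    }
}

// launch one coroutine per page and wait for all results
suspend fun titlesOf(pageUrls: List<String>): List<String> = coroutineScope {
    pageUrls.map { async { titleOf(it) } }.awaitAll()
}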