Http-Client

skrape{it} offers an intuitive, DSL-controlled HTTP client to make parsing websites as comfortable as possible. A special feature is the mode parameter, which makes it possible to parse client-side rendered web pages (e.g. pages created with frameworks like React.js, Angular or Vue.js, or pages manipulated with jQuery or other JavaScript).

    skrape {
        // ... request options go here
        extract {
            // work with the response
        }
    }

The HTTP request is only executed once either the extract or the expect function is called. This behaviour also makes it possible to preconfigure the HTTP client for multiple calls.
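The deferred execution described above can be illustrated with a small, self-contained sketch. It does not use skrape{it} itself: `RequestBuilder`, `request` and the fake `fetch` function are hypothetical names invented for this illustration, and only the `extract` name mirrors the text. The point is that configuring the builder performs no I/O. The simulated request runs only when `extract` is invoked, so one configuration can serve several calls.

```kotlin
// Hypothetical model of a lazily executed request; not the skrape{it} implementation.
class RequestBuilder {
    var url: String = "http://localhost:8080"
    var userAgent: String = "Mozilla/5.0 skrape.it"

    // Counts how often the simulated request was executed.
    var executions: Int = 0
        private set

    // Stand-in for a real HTTP call; it only returns a fake body.
    private fun fetch(): String {
        executions++
        return "<html><body>response from $url</body></html>"
    }

    // The request is executed here, not when the builder is configured.
    fun <T> extract(block: (String) -> T): T = block(fetch())
}

// Configures a builder without triggering any request.
fun request(init: RequestBuilder.() -> Unit): RequestBuilder = RequestBuilder().apply(init)

fun main() {
    val client = request { url = "http://localhost:8080/example" }
    println(client.executions) // 0: nothing has been fetched yet

    val length = client.extract { body -> body.length }
    val hasPath = client.extract { body -> "example" in body }
    println(client.executions) // 2: one simulated request per extract call
    println("$length $hasPath")
}
```

Keeping the configuration separate from the trigger is what allows a single preconfigured client to be reused across calls.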

Request Options

All of the available options already have reasonable defaults which aim to make the use of skrape{it} as easy and intuitive as possible.

| Option | Description | Type | Default |
|---|---|---|---|
| url | The URL that is used to fetch and parse a web page. The protocol must be http or https. | String | http://localhost:8080 |
| method | The HTTP request method, which indicates the desired action to be performed on a given resource. Although they can also be nouns, request methods are sometimes referred to as HTTP verbs. | Method | GET |
| userAgent | The User-Agent request header contains a characteristic string that allows network protocol peers to identify the application type, operating system, software vendor or software version of the requesting user agent. | String | Mozilla/5.0 skrape.it |
| headers | Request headers containing more information about the resource to be fetched or about the client itself. | Map<String, String> | No additional custom headers are sent by default |
| cookies | Cookies to add to the request. | Map<String, String> | No cookies are sent by default |
| timeout | Sets the total request timeout duration in milliseconds. A timeout of zero (0) is treated as an infinite timeout. | Int | 5000 |
| followRedirects | Configures whether the connection follows server redirects. | Boolean | true |
| ignoreContentType | Ignore the document's Content-Type when parsing the response. If set to false, an unrecognized content type will cause an IOException to be thrown. (This prevents producing garbage by attempting to parse a JPEG binary image, for example.) | Boolean | true |
| ignoreHttpErrors | Configures the connection to not throw exceptions when an HTTP error occurs (4xx - 5xx, e.g. 404 or 500). If set to false, an IOException is thrown when an error is encountered. If set to true, the response is populated with the error body, and the status message will reflect the error. | Boolean | true |
| validateTLSCertificates | Enables or disables TLS certificate validation for HTTPS requests. When enabled, all connections over HTTPS perform normal validation of certificates and abort the request if the provided certificate does not validate. | Boolean | true |
| maxBodySize | Sets the maximum number of bytes to read from the (uncompressed) connection into the body before the connection is closed and the input truncated. | Int | No maximum body size |
| mode | For server-side rendered websites or other XML-related responses, always use the default mode (SOURCE), because it is more performant (a plain HTTP request). To parse client-side rendered websites (e.g. built with React.js, Vue.js, Angular or jQuery), use the DOM mode. | Mode | SOURCE |
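
As a compact way to read the option list, the defaults can be mirrored in a plain Kotlin data class. This is a sketch, not the library's own request type: `RequestOptions` and `validated` are hypothetical names, the `Method` and `Mode` enums here are local stand-ins whose members (apart from GET, SOURCE and DOM) are assumptions, and `maxBodySize = 0` is merely this sketch's way of encoding "no maximum body size".

```kotlin
// Hypothetical mirror of the documented options; skrape{it}'s real classes may differ.
enum class Method { GET, POST, PUT, DELETE, PATCH, HEAD }
enum class Mode { SOURCE, DOM }

data class RequestOptions(
    val url: String = "http://localhost:8080",
    val method: Method = Method.GET,
    val userAgent: String = "Mozilla/5.0 skrape.it",
    val headers: Map<String, String> = emptyMap(), // no custom headers by default
    val cookies: Map<String, String> = emptyMap(), // no cookies by default
    val timeout: Int = 5000,                       // 0 is treated as an infinite timeout
    val followRedirects: Boolean = true,
    val ignoreContentType: Boolean = true,
    val ignoreHttpErrors: Boolean = true,
    val validateTLSCertificates: Boolean = true,
    val maxBodySize: Int = 0,                      // assumption: 0 encodes "no limit"
    val mode: Mode = Mode.SOURCE
) {
    // Checks the one hard constraint the option list states: the URL protocol.
    fun validated(): RequestOptions {
        require(url.startsWith("http://") || url.startsWith("https://")) {
            "The protocol must be http or https: $url"
        }
        return this
    }
}

fun main() {
    // Only the options that differ from the defaults need to be set.
    val options = RequestOptions(
        url = "https://example.com",
        headers = mapOf("Accept-Language" to "en"),
        mode = Mode.DOM // for a client-side rendered page
    ).validated()

    println(options.method)  // the default method is kept
    println(options.timeout) // the default timeout is kept
    println(options.mode)
}
```

Because every property has a default, a caller overrides only what differs, which is the same idea the option defaults above aim for.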