Web scraping of websites

Description of the customizable settings for targeted web scraping of websites

To use protected websites as sources for the knowledgebase, user-defined settings are required that allow these websites to be scraped. The settings differ depending on whether the website is protected with Basic Authentication or with a user-defined header.

Basic Authentication

If the website is protected with Basic Authentication, the access data (user name and password) are Base64-encoded and passed directly in the Authorization header. To do this, add the following code snippet:

{
  "headers": {
    "Authorization": "Basic YWRtaW46MTIzNDU="
  }
}

To add the code, a URL must be added as a source in the knowledgebase. You can then check the box for user-defined web scraping under Expert options and add the code.


To convert the access data into Base64, encode the string in the form username:password (the example value above corresponds to admin:12345). This website can be used for the conversion: https://www.base64decode.org/
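The header value can also be computed locally instead of with an online tool. The following is a minimal sketch in Python using only the standard library. The function name basic_auth_settings is hypothetical and chosen for illustration, and the credentials are the example values admin and 12345.

```python
import base64
import json


def basic_auth_settings(username, password):
    """Build the settings snippet for a site protected with Basic Authentication."""
    # Basic Authentication encodes the string "username:password" as Base64.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"headers": {"Authorization": f"Basic {token}"}}


# Example credentials only; replace them with the real access data.
print(json.dumps(basic_auth_settings("admin", "12345"), indent=2))
```

The printed JSON can then be pasted into the field for user-defined web scraping.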

User-defined headers

If the website is secured by a user-defined header, access is granted by adding the header name and the corresponding value (in the example, the header is "Moin-Ai-Scraper" and the value is "SECRET"). To do this, add the following code snippet:

{
  "headers": {
    "Moin-Ai-Scraper": "SECRET"
  }
}

To add the code, a URL must be added as a source in the knowledgebase. You can then check the box for user-defined web scraping under Expert options and add the code.
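Before saving the settings, it can help to check that the website actually accepts the header. The following is a minimal sketch in Python using only the standard library. The helper names build_request and fetch_status are hypothetical, and the URL passed to them is assumed to be your own protected page. It is not how the scraper itself sends its requests, only a way to test the header from your own machine.

```python
import json
import urllib.error
import urllib.request

# The same snippet that is pasted into the user-defined web scraping field.
SETTINGS = json.loads('{"headers": {"Moin-Ai-Scraper": "SECRET"}}')


def build_request(url, settings):
    """Create a request that carries the headers from the settings snippet."""
    return urllib.request.Request(url, headers=settings.get("headers", {}))


def fetch_status(url, settings, timeout=10):
    """Send the request and return the HTTP status code of the response."""
    try:
        with urllib.request.urlopen(build_request(url, settings), timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # Error statuses such as 401 or 403 arrive as exceptions.
        return error.code
```

Calling fetch_status with the protected URL and SETTINGS returns the status code. A 200 suggests that the header is accepted, while a 401 or 403 suggests that the header name or value is wrong or that a firewall is blocking the request.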


Description of further web scraping parameters

The following settings control which parts of the web page content are included in or excluded from the extraction process:

includeTags: This option specifies which HTML tags are included in the output. For example, if only content within the <p> and <h1> tags is to be extracted, the specification is: ["p", "h1"].
excludeTags: This option specifies which HTML tags are excluded from the output. For example, to remove all <script> and <style> tags from the extracted content, the specification is: ["script", "style"].
onlyMainContent: This parameter ensures that only the main content of the web page is returned, without headers, navigation bars, footers and other non-essential elements. It is useful for extracting the core information of a web page without surrounding clutter. The parameter can sometimes be too restrictive: if important content from the page is missing, set it to false. The default setting is true.
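The effect of includeTags and excludeTags can be illustrated with a small filter. The following is a minimal sketch in Python using only the standard library. It is an illustration of the idea, not the scraper's actual implementation. The names TagFilter and extract_text are hypothetical, and the way nested tags are handled here (text counts as included if any enclosing tag is listed) is an assumption.

```python
from html.parser import HTMLParser

# Elements without a closing tag; they never enclose text.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagFilter(HTMLParser):
    """Collect text, honoring include and exclude lists of tag names."""

    def __init__(self, include_tags=None, exclude_tags=None):
        super().__init__()
        self.include = set(include_tags or [])
        self.exclude = set(exclude_tags or [])
        self.open_tags = []  # tags that currently enclose the parser position
        self.chunks = []     # extracted text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching start tag; this tolerates unclosed tags.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Drop empty text and anything inside an excluded tag.
        if not text or self.exclude.intersection(self.open_tags):
            return
        # With no include list, keep everything that was not excluded.
        if not self.include or self.include.intersection(self.open_tags):
            self.chunks.append(text)


def extract_text(html, include_tags=None, exclude_tags=None):
    parser = TagFilter(include_tags, exclude_tags)
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.chunks
```

For a page with a heading, a navigation block, a paragraph and a script, extract_text(page, include_tags=["p", "h1"]) keeps only the heading and paragraph text, while extract_text(page, exclude_tags=["script", "style"]) keeps everything except script and style content.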

Most common error sources

In most cases, errors are caused by incorrect header parameters, firewall protection or failed Basic Authentication. To rule out these causes:

Ensure that the correct header name and the corresponding value have been added.
Ensure that the crawler is allowed through the website's firewall.
Ensure that the access data has been correctly translated into Base64 format.
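The last check can be done by decoding the header value again and comparing it with the intended access data. The following is a minimal sketch in Python using only the standard library. The function name decode_basic_auth is hypothetical and chosen for illustration.

```python
import base64
import binascii


def decode_basic_auth(header_value):
    """Return (username, password) from a value such as 'Basic YWRtaW46MTIzNDU='."""
    scheme, _, token = header_value.partition(" ")
    if scheme != "Basic" or not token:
        raise ValueError("expected a value of the form 'Basic <token>'")
    try:
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError) as error:
        raise ValueError("token is not valid Base64 text") from error
    # The decoded string must have the form "username:password".
    username, separator, password = decoded.partition(":")
    if not separator:
        raise ValueError("decoded access data lacks the ':' separator")
    return username, password


# Example value from the Basic Authentication section.
print(decode_basic_auth("Basic YWRtaW46MTIzNDU="))
```

If the function raises an error or returns values that differ from the real user name and password, the Base64 conversion should be repeated.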