User-defined scraping settings make it possible to scrape websites that require authentication and to integrate them into the knowledgebase.
Websites that block unauthenticated access can only be used as sources for the knowledgebase once these settings are in place. The settings differ depending on whether access is granted through Basic Authentication or through user-defined headers.
Basic Authentication
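If the website is protected with Basic Authentication, the access data (user name and password) are sent in the Authorization header. They are joined in the form username:password, Base64-encoded, and prefixed with the word Basic. Add the following code snippet, replacing the example value (which encodes admin:12345) with your own encoded access data: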
{
  "headers": {
    "Authorization": "Basic YWRtaW46MTIzNDU="
  }
}
To add the code, first add a URL as a source in the knowledgebase. Then check the box for user-defined web scraping under Expert options and paste the code.
To encode the access data (in the form username:password) as Base64, this website can be used: https://www.base64decode.org/
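As an alternative to an online tool, the encoding can be done locally. The following is a minimal Python sketch, not part of the product: the function name basic_auth_config is hypothetical, and admin / 12345 are the example access data from the snippet above.

```python
import base64
import json


def basic_auth_config(username: str, password: str) -> str:
    """Return the settings JSON for a site protected by Basic Authentication.

    Hypothetical helper for illustration; only the resulting JSON is
    pasted into the user-defined web scraping field.
    """
    # Basic Authentication joins the access data as "username:password" ...
    credentials = f"{username}:{password}".encode("utf-8")
    # ... and Base64-encodes the result.
    token = base64.b64encode(credentials).decode("ascii")
    return json.dumps({"headers": {"Authorization": f"Basic {token}"}}, indent=2)


if __name__ == "__main__":
    # Example access data only; replace with the real user name and password.
    print(basic_auth_config("admin", "12345"))
```

Running the script with the example access data prints a snippet whose Authorization value is Basic YWRtaW46MTIzNDU=, matching the example above.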
User-defined headers
If the website is secured by a user-defined header, access is granted by sending that header with the corresponding value (in the example, the header is “Moin-Ai-Scraper” and the value is “SECRET”). Add the following code snippet, replacing the header name and value with your own:
{
  "headers": {
    "Moin-Ai-Scraper": "SECRET"
  }
}
To add the code, first add a URL as a source in the knowledgebase. Then check the box for user-defined web scraping under Expert options and paste the code.
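Before entering the settings, it can help to check from your own machine that the website accepts the header. The following Python sketch uses only the standard library and is an assumption about how one might test this; it is not how the scraper itself works. The function names and the URL https://example.com/ are placeholders.

```python
import json
import urllib.error
import urllib.request

# The same JSON that would be pasted into the settings; "SECRET" is a placeholder.
CONFIG = '{"headers": {"Moin-Ai-Scraper": "SECRET"}}'


def build_request(url: str, config_json: str) -> urllib.request.Request:
    """Build a GET request that carries the headers from the settings JSON."""
    headers = json.loads(config_json)["headers"]
    return urllib.request.Request(url, headers=headers)


def fetch_status(url: str, config_json: str) -> int:
    """Send the request and return the HTTP status code of the response."""
    request = build_request(url, config_json)
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 401 or 403 usually means the header or its value was rejected.
        return err.code


if __name__ == "__main__":
    # Placeholder URL; replace with the protected page to be scraped.
    print(fetch_status("https://example.com/", CONFIG))
```

A status of 200 suggests that the header is accepted. A firewall may still treat the crawler differently from a request sent from your own network.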
Most common error sources
In most cases, errors are caused by incorrect header parameters, firewall protection, or failed Basic Authentication. To rule out these error sources:
- Ensure that the correct header name and the corresponding value have been added.
- Ensure that the crawler is allowed through the website's firewall.
- Ensure that the access data has been correctly encoded in Base64.
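The Base64 check in the last point can be done by decoding the Authorization value again and comparing the result with the intended access data. A minimal Python sketch, assuming the value has the form shown in the Basic Authentication snippet; the function name decode_basic_auth is hypothetical.

```python
import base64
import binascii


def decode_basic_auth(header_value: str) -> tuple[str, str]:
    """Decode a 'Basic <token>' value back into (username, password).

    Raises ValueError if the value is malformed, so mistakes in the
    encoding step show up before the settings are saved.
    """
    scheme, _, token = header_value.partition(" ")
    if scheme.lower() != "basic" or not token:
        raise ValueError('value must look like "Basic <token>"')
    try:
        # validate=True rejects characters outside the Base64 alphabet.
        decoded = base64.b64decode(token, validate=True).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError) as err:
        raise ValueError("token is not valid Base64 text") from err
    username, separator, password = decoded.partition(":")
    if not separator:
        raise ValueError("decoded token lacks the ':' between user name and password")
    return username, password


if __name__ == "__main__":
    # Example value from the Basic Authentication snippet.
    print(decode_basic_auth("Basic YWRtaW46MTIzNDU="))
```

If the decoded pair differs from the real user name and password, or a ValueError is raised, the access data should be encoded again.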