Optimise web scraping

Resource optimisation with JSON-based web scraping rules for static and dynamic websites

The successful use of generative AI depends heavily on the underlying data, which should be clean, structured and complete. Irrelevant and duplicate content leads to poorer generated responses.

Web scraping options for reading web resources can be set in the moinAI Hub. They ensure that relevant content is read and irrelevant content is excluded, which keeps the data basis clean and focused.

  1. Store web scraping options
  2. Available web scraping options
  3. Scrape complex dynamic websites

 

The Customer Success Team will be happy to advise you on optimisation and assist you with implementation. The following describes how web scraping options can be stored and which options are available.

1. Store web scraping options

The web scraping options can be edited at resource level and bot level. Both ways are described below.

Set centrally for all resources

Web scraping options for all stored and future resources are stored centrally via the menu item Bot Settings → General settings & Privacy in the General Web Scraping Options area.

Custom header settings are also possible. The Customer Success Team will be happy to advise you on the implementation.

Set for individual resources

Web scraping options for individual resources are saved via the Knowledge Base menu item. To do this, open the three-dot menu at the bottom right of the resource and click Options.

A window will then open in which the web scraping options are stored.

Incl. general options button

  • Disabled: Only the stored JSON code affects the resource.
  • Enabled: The web scraping option in Bot Settings → General & Privacy also affects the resource.

Insert the JSON code into the field and click Save to apply the changes. All affected resources are then reloaded, which takes a few minutes.

Bulk adjustment

If you want to adjust web scraping for several resources at once, bulk adjustment is the way to go.

1. Select all resources:

  • In the Knowledge Base menu, set ‘Rows per page: all’ at the bottom right.
  • Then select all resources by ticking the box at the top left at the beginning of the resource list.

2. Select multiple resources

  • Select individual resources by clicking the checkbox on the left.
  • Once you have selected the desired resources, click the Options button in the top right corner to open a window where you can define the web scraping options.

Incl. general options button

  • Disabled: Only the stored JSON code affects the resource.
  • Enabled: The web scraping option in Bot Settings → General & Privacy also affects the resource.

Insert the JSON code into the field and click Save to apply the changes. All affected resources are then reloaded, which takes a few minutes.


Click the eye button to the right of a resource to check whether the relevant content has been imported correctly.


2. Available web scraping options

The following describes some web scraping parameters for expert mode when adding resources/sources to the moinAI Knowledge Base. How to add a resource is described in a separate article.

These settings help to control the specific parts of the website content that should be included or excluded from the extraction process.

  • includeTags: This option specifies which HTML tags are included in the output. For example, to extract only the content within the <p> and <h1> tags, the value is: ["p", "h1"].
  • excludeTags: This option specifies which HTML tags are excluded from the output. For example, to remove all <script> and <style> tags from the extracted content, the value is: ["script", "style"].
  • onlyMainContent: This parameter ensures that only the main content of the web page is returned, without headers, navigation bars, footers and other non-essential elements. It is useful for extracting the core information of a web page without clutter. The default setting is true. Sometimes the parameter is too restrictive: if important content is missing from the page, set it to false.

Example: Option to retrieve page content without the header

{
  "onlyMainContent": false,
  "excludeTags": [
    "header"
  ]
}
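
The includeTags and excludeTags options described above can be sketched in one configuration as follows. This is a minimal illustration, not a recommended default: the tag names are the ones used in the option descriptions, and it assumes that both options may be combined in a single object, which the earlier example suggests but which this article does not state explicitly.

```json
{
  "onlyMainContent": true,
  "includeTags": ["p", "h1"],
  "excludeTags": ["script", "style"]
}
```

After saving, the eye button next to the resource shows whether the expected content was actually imported.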

3. Scrape complex dynamic websites

Many modern websites only load content dynamically following user interaction. For example, answers in the FAQ section are only loaded after clicking on the relevant question. Furthermore, the categories change dynamically after every click.

Normal scraping of the website is not possible in this case. With moinAI, the extraction process can be supplemented with rules and actions to capture dynamic content. This involves simulating a click on the category or opening all questions with a single click.

As the category change is triggered after the click and the questions and answers are reloaded and overlaid each time, a separate resource must be created in the Knowledge Base for each category.

To capture dynamic content, a combination of executeJavascript (interaction) and a wait step (waiting for rendering/AJAX requests) is required.

Using an attribute selector such as [aria-expanded="false"] in the script ensures that only closed elements are clicked. Without this check, accordions that are already open would be closed again by a further click.

How non-developers can create suitable code: Ask a general-purpose chatbot such as ChatGPT, Gemini or Perplexity to generate code that implements the desired rule. Provide the URL of the website to be scraped and a link to these instructions.

 

Example JSON: Command to expand content on the FAQ webpage

This snippet opens all visible questions. It uses a specific selector and ensures that only closed elements are triggered.

{
  "actions": [
    {
      "type": "executeJavascript",
      "script": "document.querySelectorAll('div.cursor-pointer.justify-between[aria-expanded=\"false\"], button[aria-expanded=\"false\"]').forEach(el => el.click());"
    },
    {
      "type": "wait",
      "milliseconds": 5000
    }
  ]
}


Example JSON: Command to expand the contents of a category on the FAQ webpage

Here, the desired category is first selected, and then the ‘expand command’ for the questions is executed.

{
  "includeTags": [ ".gap-3" ],
  "actions": [
    {
      "comment": "Step 1: Select a category (e.g. the second element)",
      "type": "executeJavascript",
      "script": "document.querySelectorAll('div.webkit-tap-transparent.cursor-pointer')[1].click();"
    },
    {
      "comment": "Step 2: Expand all questions that have now loaded",
      "type": "executeJavascript",
      "script": "document.querySelectorAll('div.cursor-pointer.justify-between[aria-expanded=\"false\"]').forEach(el => el.click());"
    },
    {
      "type": "wait",
      "milliseconds": 5000
    }
  ]
}
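
If the questions of a category are reloaded with a noticeable delay, the expand step may run before they exist on the page. The following variant is a sketch that inserts an additional wait between the two clicks. It assumes that the actions list accepts more than one wait step, which this article does not confirm. The selectors are the site-specific ones from the example above, and the 2000 ms value is only an illustrative starting point.

```json
{
  "includeTags": [".gap-3"],
  "actions": [
    {
      "comment": "Step 1: Select a category (e.g. the second element)",
      "type": "executeJavascript",
      "script": "document.querySelectorAll('div.webkit-tap-transparent.cursor-pointer')[1].click();"
    },
    {
      "comment": "Assumption: an extra wait so the questions of the category can load",
      "type": "wait",
      "milliseconds": 2000
    },
    {
      "comment": "Step 2: Expand all questions that are still closed",
      "type": "executeJavascript",
      "script": "document.querySelectorAll('div.cursor-pointer.justify-between[aria-expanded=\"false\"]').forEach(el => el.click());"
    },
    {
      "type": "wait",
      "milliseconds": 5000
    }
  ]
}
```

Check the result with the eye button. If questions are still missing, a longer intermediate wait may help.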

Implementation notes

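  • Selector precision: The standard button tag is often sufficient. If the menu or other unwanted elements are accidentally toggled, use more specific CSS classes (e.g. .accordion-title) or skip the first element (often the header menu). Because querySelectorAll returns a NodeList, which has no .slice() method, convert it to an array first: Array.from(document.querySelectorAll('button')).slice(1).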

  • Wait times: A wait of between 2000 ms and 5000 ms is needed so that the content can finish loading. Choose the longer time (5000 ms) if the page uses a lot of animations or reloads data slowly via an API.

  • Exclusions: If certain sections are not to be scraped, filter the elements by class in JavaScript before executing .forEach(el => el.click()). As a NodeList has no .filter() method, wrap the result in Array.from(...) before filtering.

  • Avoiding duplicates: Checking for aria-expanded="false" means that only closed elements are clicked. This prevents errors during re-scraping and keeps the process stable.
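
The notes above can be combined into a single action, sketched below under stated assumptions. The class name no-scrape is hypothetical and only stands in for whatever class marks the section to be skipped on the target site. The button selector and the 3000 ms wait are likewise illustrative. Array.from(...) is used because .slice() and .filter() are array methods that a NodeList does not provide.

```json
{
  "actions": [
    {
      "comment": "Assumption: 'no-scrape' is a hypothetical class; the first button is assumed to be the header menu",
      "type": "executeJavascript",
      "script": "Array.from(document.querySelectorAll('button[aria-expanded=\"false\"]')).slice(1).filter(el => !el.classList.contains('no-scrape')).forEach(el => el.click());"
    },
    {
      "type": "wait",
      "milliseconds": 3000
    }
  ]
}
```

The script first selects only closed buttons, then skips the first match, then drops elements carrying the excluded class, and only then clicks the remainder.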