Protection against prompt injection
A multi-layered security architecture prevents the manipulation of AI agents
When using AI agents, the security and reliability of automated communication with customers are of paramount importance. A key component of this security is protection against so-called prompt injections. These are attempts by users to manipulate or circumvent an AI chatbot’s internal instructions through specific inputs (e.g. via commands such as ‘Ignore all previous instructions’).
An effective protection mechanism prevents misuse of the system, safeguards confidential information and ensures that the AI agent acts exclusively in the company’s best interests. This strengthens the brand’s integrity and ensures that communication always remains within defined boundaries. At moinAI, protection against prompt injections is provided by the security architecture and preventive safeguards.
1. Security architecture against prompt injection
At moinAI, protection against manipulation is based on a multi-layered architecture. Instead of passing user input directly to a language model, each query passes through various control instances and filters.
2. Preventive measures in detail
2.1 Strict prompt design and system separation
The system role (known as the system prompt) is strictly separated from user input. These instructions are neither visible to users nor can they be overwritten. Even when explicitly asked to disregard rules, the AI operates exclusively at the protected system level.
2.2 Input Sanitisation and Injection Detection
Incoming messages are automatically analysed for typical manipulation patterns (e.g. “ignore previous instructions”). Such inputs are detected and neutralised before they reach the processing logic.
2.3 Context Isolation
The conversation context is managed in a structured manner within a protected environment. It is stored in a validated format (JSON) that is strictly separated from the generative process. This technically prevents users from manipulating the conversation history.
2.4 Integrity of the Knowledge Base
The Knowledge Base stored in the Hub functions as a read-only data source. User inputs have no write access to this data. The AI agent is therefore unable to incorporate incorrect information (e.g. price changes) into the knowledge base or permanently store it there.
2.5 Controlled RAG Pipeline
When using Retrieval-Augmented Generation (RAG), responses are generated via a controlled pipeline. The AI agent uses only verified snippets from the knowledge base. User text serves merely as a search parameter, not as a directive for shaping the content of the knowledge base.
2.6 Securing Webhooks
The activation of webhooks and external system calls is strictly linked to predefined intents. The system prevents functions from being executed solely via unstructured user commands (‘Execute webhook XY’).
2.7 Security Guidelines (Guardrails)
In addition to the system architecture, guardrails serve as specific constraints on the AI agent’s behaviour. These mechanisms prevent, amongst other things, identity impersonation, inappropriate behaviour or the retrieval of internal system background information.
To fine-tune protection on an individual basis, the guardrails can be configured in the Knowledge Base section. There, for example, topic restrictions or specific competition protection can be defined. The article on guardrails provides more detailed information on this.