Case Study

Comprehensive QA Framework for a Municipal AI Chatbot

  Role

   Test Manager

  Client Type

   German Municipality (Sandberg)

Assignment from the client

Krone Consulting was commissioned to design and execute a rigorous quality assurance (QA) framework for an AI-driven chatbot deployed by a German municipality (Sandberg). The system serves as a centralized digital information hub, utilizing Retrieval-Augmented Generation (RAG) to extract and synthesize data from municipal PDFs, the official website, and local service pages. 

Core Objective: to transform a non-deterministic AI model into a reliable public service tool by ensuring factual precision, structural security, and contextual relevance for citizen engagement.

 

Core Technology

  • RAG-based LLM (Retrieval-Augmented Generation)

Data Sources

  • Municipal PDFs, Official Website, Service Pages

Identified Challenges in AI Implementation

The technical analysis phase revealed several critical hurdles inherent in deploying Large Language Models (LLMs) within a high-accountability government context:

  • Hallucinations and Factual Accuracy: The tendency for the model to generate “plausible-sounding” but entirely fabricated data. A primary risk identified was the invention of municipal opening hours or administrative deadlines that appeared authentic but were factually incorrect.
  • Temporal Logic Deficiencies: Difficulty in processing time-sensitive calculations. The model frequently struggled to distinguish between past events and future appointment availability, a critical failure point for municipal scheduling.
  • Contextual Integrity: Maintaining a coherent “thread” over multi-turn interactions. The bot must retain information provided at the start of the conversation to provide relevant follow-up assistance.
  • Out-of-Bounds Queries: The risk of the chatbot engaging in topics unrelated to Sandberg (e.g., general knowledge, global weather, or other municipalities), which could dilute the brand and waste computational resources.
  • Linguistic Variability: The challenge of interpreting diverse inputs, ranging from formal “High German” to regional dialects and informal vernacular, without losing semantic meaning.

The Multi-Layered Testing Solution

Krone Consulting implemented a four-pillar testing strategy designed to provide comprehensive coverage across functional, technical, and security dimensions.

1. Manual Testing & Scenario Development

We constructed a library of positive and negative test cases based on actual citizen inquiry patterns. A critical success metric in our negative testing was “Explicit Boundary Setting.” If a query falls outside the municipal scope, the bot is expected either to state: “I am a chatbot specialized in Sandberg; I do not have information on this topic,” or to provide a direct link to the relevant official department.
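A negative test of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's actual test code: the `NegativeCase` structure, the helper names, and the `sandberg.example` link prefix are hypothetical, and only the refusal wording comes from the scenario above.

```python
from dataclasses import dataclass

# Refusal wording taken from the case study text.
REFUSAL_MARKER = "I do not have information on this topic"
# Hypothetical prefix for links to official department pages (assumption).
OFFICIAL_LINK_PREFIX = "https://www.sandberg.example/"


@dataclass
class NegativeCase:
    query: str  # out-of-scope question sent to the chatbot
    reply: str  # reply captured during the test run


def sets_explicit_boundary(reply: str) -> bool:
    """Return True if the reply refuses explicitly or links to an official page."""
    refuses = REFUSAL_MARKER.lower() in reply.lower()
    links_out = OFFICIAL_LINK_PREFIX in reply
    return refuses or links_out


def failed_cases(cases: list[NegativeCase]) -> list[NegativeCase]:
    """Collect the negative cases where the bot answered instead of setting a boundary."""
    return [case for case in cases if not sets_explicit_boundary(case.reply)]
```

A string match like this only confirms that the boundary message or link is present. Whether the rest of the reply stays on topic still needs a manual review or a judge model.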

2. API & Integration Testing: Deterministic Checks

To stabilize the non-deterministic nature of the LLM, we implemented automated API-level checks. These serve as “deterministic checks for non-deterministic models,” ensuring that mission-critical, static data – such as the municipal headquarters’ address and central emergency contacts – remains 100% accurate regardless of how the AI chooses to phrase the rest of its response.
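The idea of a deterministic check can be illustrated with a minimal Python sketch. The values in `STATIC_FACTS` are placeholders invented for this example (the case study does not publish the real address or contact numbers), and the function names are hypothetical.

```python
import re

# Placeholder ground truth; the real values are not part of this case study.
STATIC_FACTS = {
    "address": "Rathausplatz 1, 12345 Sandberg",
    "emergency_phone": "112",
}


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences do not cause false failures."""
    return re.sub(r"\s+", " ", text).strip().lower()


def contains_fact(reply: str, fact_key: str) -> bool:
    """Check that the reply contains the expected static fact verbatim (after normalization)."""
    return normalize(STATIC_FACTS[fact_key]) in normalize(reply)
```

The check ignores how the model phrases the surrounding sentence and passes only if the exact fact appears. A plain substring match is a simplification: short values such as phone numbers may need stricter matching in practice.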

3. The “AI Judge” Framework

To achieve the scalability required for hundreds of test scenarios, we deployed a secondary LLM to act as an automated “AI Judge.” This framework evaluates the primary chatbot’s outputs against expected ground-truth data, generating success rate statistics. The municipality established a strict quality gate: the system must maintain a 90%+ accuracy threshold before any code or data update is promoted to production.
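The quality gate itself reduces to a simple calculation once the judge has returned a verdict per scenario. The sketch below assumes the judge's output has already been reduced to a pass/fail flag; the `Verdict` structure and function names are hypothetical, and only the 90% threshold comes from the text above.

```python
from dataclasses import dataclass

ACCURACY_THRESHOLD = 0.90  # quality gate stated in the case study


@dataclass
class Verdict:
    case_id: str
    passed: bool  # pass/fail decision returned by the judge model


def accuracy(verdicts: list[Verdict]) -> float:
    """Share of test scenarios the judge marked as correct."""
    if not verdicts:
        raise ValueError("no verdicts to evaluate")
    return sum(v.passed for v in verdicts) / len(verdicts)


def gate_passes(verdicts: list[Verdict], threshold: float = ACCURACY_THRESHOLD) -> bool:
    """True if the run meets the threshold and the update may be promoted."""
    return accuracy(verdicts) >= threshold
```

In a pipeline, a failing gate would typically stop the deployment step and report the failed `case_id` values for review.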

4. Security & Prompt Injection Testing

We conducted rigorous testing to prevent the “Exploitation of Internal Logic.” This involved attempting to “trick” the bot into revealing its system prompt, the hidden instructions provided by developers. Exposing these constraints represents a significant compliance risk for a government entity, as it reveals internal logic and potential vulnerabilities to the public.
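One common way to detect a leaked system prompt automatically is to plant a unique marker (a "canary") in the prompt and to scan replies for that marker or for long verbatim fragments of the prompt. The sketch below illustrates that general technique; it is not a description of the checks used in this project, and the function name and the eight-word window are assumptions.

```python
def leaks_system_prompt(
    reply: str,
    system_prompt: str,
    canary: str,
    min_overlap_words: int = 8,
) -> bool:
    """Return True if the reply exposes the canary or a verbatim run of the system prompt."""
    if canary in reply:
        return True

    prompt_words = system_prompt.split()
    reply_text = " ".join(reply.split())  # collapse whitespace before comparing

    # Slide a window over the prompt and look for any run of
    # min_overlap_words consecutive words copied into the reply.
    for start in range(len(prompt_words) - min_overlap_words + 1):
        window = " ".join(prompt_words[start:start + min_overlap_words])
        if window in reply_text:
            return True
    return False
```

A check like this catches verbatim leaks only. Paraphrased or translated disclosures need manual review or a judge model.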

Technical Infrastructure and Data Management

Krone Consulting prioritized a modular, scalable architecture that allows the municipality to maintain the system without continuous developer intervention.

The infrastructure is specifically built for Regression Testing. When the underlying model is upgraded (e.g., transitioning from GPT-4 to GPT-4o), the entire suite of hundreds of tests can be re-run automatically to ensure that the update has not introduced new hallucinations or logic errors.
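Comparing two test runs is the core of such a regression check. The following Python sketch assumes each run is stored as a mapping from test case ID to a pass/fail flag; that storage format and the function name are hypothetical.

```python
def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Return IDs of cases that passed in the baseline run but fail (or are missing) in the candidate run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


# Example: results before and after a model upgrade (illustrative data only).
before = {"opening-hours": True, "address": True, "weather-refusal": False}
after = {"opening-hours": False, "address": True, "weather-refusal": True}
print(find_regressions(before, after))
```

Treating a missing case as a failure is a deliberately conservative choice, so that a test silently dropped from the new run is still flagged.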

Strategic Advantages & Long-term Value

The implementation of this framework provides the municipality with a resilient and professional AI presence.

Continuous Regression Capability: Any change in source PDFs or website content can be validated immediately against the existing test bank, ensuring that new information is integrated without breaking existing logic.

Operational Scalability: The framework is department-agnostic. As Sandberg digitizes more services, they can be integrated into the testing pipeline with minimal friction.

Reputational Risk Mitigation: By enforcing strict “out-of-bounds” protocols and deterministic checks, the municipality avoids the public embarrassment of an AI providing false legal or administrative advice.

Conclusion: A Maintenance-First Philosophy

AI quality assurance is not a “one-off” event but a continuous lifecycle requirement. Large Language Models are prone to “Performance Drift” and “Stochastic Degradation” as providers update the underlying weights. Krone Consulting recommends a dedicated Maintenance Phase to monitor the system post-launch. Without ongoing validation, a chatbot that performs perfectly today could begin providing inaccurate or “hallucinated” information within weeks due to model evolution or data changes. A structured, professional monitoring phase is essential to ensuring the chatbot remains a trusted pillar of municipal communication.