Building a trustworthy AI assistant for pest diagnosis: lessons from Hor-Tal

Tailor Chat
Jun 5

How we designed, tested, and governed a guided diagnosis experience for Hor-Tal — without treating the large language model as the source of truth.

Hor-Tal helps professionals and homeowners choose the right treatment for pest problems, starting with ants. They needed more than a generic chat window: users describe messy real-world situations, answers must align with Hor-Tal’s product knowledge, and wrong recommendations have real consequences.

We built a guided diagnosis assistant on the TailorChat platform — a multi-service stack with a config-driven conversation graph, structured LLM outputs, automated branch coverage tests, and a post-conversation quality reviewer. This article explains the engineering choices behind that system: the patterns we used (and deliberately avoided), how we evaluate quality without outsourcing judgment to a hosted eval product, and how security and governance fit into the design.

The problem: open-ended chat is the wrong default

A naive “ChatGPT wrapper” sounds attractive until you try to ship it:

Users don’t speak in clean multiple-choice answers.
The model can sound confident while recommending the wrong product.
Regression testing by searching for phrases in the bot’s reply text is brittle and breaks every time copy changes.

For Hor-Tal, we needed predictable branching on interior vs exterior, species, nest location, and leaf-cutter ants (cortadoras) vs carpenter ants (carpinteras), with recommendations tied to curated knowledge, not improvised prose.

Our response was to treat the LLM as a classifier and extractor that fills structured fields. The conversation graph — defined in configuration, not buried in imperative code — decides what happens next and which products apply.

Design patterns: what we use, what we skip

Job descriptions often list ReAct, multi-agent systems, reflection, and plan-and-execute. Hor-Tal uses several of these ideas in forms that fit a regulated product flow, not a free-form tool loop.

Plan-and-execute (config as the plan)

The ant diagnosis flow is a directed graph in YAML: nodes are steps (location, size, species, recommendation), edges are branches with conditions on structured state. The runtime walks from the start node and recomputes position after corrections — so if a user changes “interior” to “exterior”, the engine retraces from the top without custom “change detection” code.

That is plan-and-execute in practice: the plan is the flow file; the executor is a generic runner plus small handlers per step type.
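A minimal sketch of that idea, assuming a flow shaped like a dict parsed from the YAML file. The node names, the `when`/`to` edge keys, and the `current_node` helper are hypothetical illustrations, not the actual Hor-Tal schema. Because the position is always recomputed from the start node, a corrected answer needs no special handling:

```python
# Hypothetical flow, shaped like a dict parsed from a YAML flow file.
FLOW = {
    "start": "location",
    "nodes": {
        "location": {"field": "location_value", "edges": [
            {"when": "interior", "to": "species"},
            {"when": "exterior", "to": "nest"},
        ]},
        "species": {"field": "species_value", "edges": [
            {"when": "carpenter", "to": "recommend_interior"},
        ]},
        "nest": {"field": "can_find_nest_value", "edges": [
            {"when": True, "to": "recommend_nest"},
            {"when": False, "to": "recommend_perimeter"},
        ]},
        "recommend_interior": {"terminal": True},
        "recommend_nest": {"terminal": True},
        "recommend_perimeter": {"terminal": True},
    },
}


def current_node(flow: dict, state: dict) -> str:
    """Walk from the start node, following the edge that matches the state."""
    node_id = flow["start"]
    while True:
        node = flow["nodes"][node_id]
        if node.get("terminal"):
            return node_id
        value = state.get(node["field"])
        target = next(
            (edge["to"] for edge in node["edges"] if edge["when"] == value),
            None,
        )
        if target is None:
            # Field not yet known (or no edge matches): this step is next.
            return node_id
        node_id = target


state = {"location_value": "interior", "species_value": "carpenter"}
print(current_node(FLOW, state))

# The user corrects "interior" to "exterior": just update state and rewalk.
state["location_value"] = "exterior"
print(current_node(FLOW, state))
```

The stale `species_value` is simply never consulted on the exterior branch, which is why no change-detection code is needed in this sketch.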

Multi-agent at the platform boundary

Production chat is moving toward a conversation orchestrator that makes one structured supervisor decision per user message (continue diagnosis, product FAQ, human handoff) and dispatches to the Hor-Tal graph service. The customer-specific logic stays in the graph repo; routing and session I/O stay in platform services.

That is multi-agent in the architectural sense — separate roles and processes — without duplicating diagnosis rules in every service.
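As a rough illustration of one structured supervisor decision per message, the sketch below parses a decision into a closed set of routes and dispatches on it. The `Route` values mirror the three options named above; the field names, `parse_decision`, and `dispatch` are assumptions made for the example, not the platform's API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Route(Enum):
    CONTINUE_DIAGNOSIS = "continue_diagnosis"
    PRODUCT_FAQ = "product_faq"
    HUMAN_HANDOFF = "human_handoff"


@dataclass(frozen=True)
class SupervisorDecision:
    route: Route
    reason: str


def parse_decision(raw: dict) -> SupervisorDecision:
    """Reject anything outside the closed set of routes (raises ValueError)."""
    return SupervisorDecision(
        route=Route(raw["route"]),
        reason=str(raw.get("reason", "")),
    )


def dispatch(
    decision: SupervisorDecision,
    message: str,
    handlers: dict[Route, Callable[[str], str]],
) -> str:
    """Hand the message to whichever service owns the chosen route."""
    return handlers[decision.route](message)


# Stand-in handlers; in a real deployment these would call other services.
handlers = {
    Route.CONTINUE_DIAGNOSIS: lambda m: "graph-service",
    Route.PRODUCT_FAQ: lambda m: "faq-service",
    Route.HUMAN_HANDOFF: lambda m: "handoff-queue",
}
decision = parse_decision({"route": "product_faq", "reason": "asked about dosage"})
print(dispatch(decision, "How much bait do I need?", handlers))
```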

Reflection after the conversation

We did not rely on the model to “critique itself” inside every turn. Instead, Suricata (a dedicated review service) reads the stored session transcript and structured state, runs a versioned reviewer prompt, and can flag needs_human for the operations team. That is reflection as an offline quality gate — easier to audit and version than hidden chain-of-thought in the live path.

Why not ReAct in the diagnosis flow?

ReAct (reason → act with tools → observe → repeat) excels when the task is open-ended and tool use is dynamic. Hor-Tal’s ant flow is the opposite: finite branches, known products, and explicit handoff rules. A tool loop would be harder to test and harder to explain to the client.

We keep the LangGraph topology simple; complexity lives in flow config and validation, not in ad hoc agent loops.

One source of truth: structure over strings

Three rules shape the whole codebase:

Do not parse user input with keywords — the LLM assesses the turn and returns JSON fields (location_value, species_value, can_find_nest_value, and so on).
Do not parse bot replies for meaning — semantics live in state (product_ids, current_node, species ids) and in YAML outcomes on edges.
Tests assert on that structure — never on substrings of response_text.
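The first rule only holds if the returned JSON is validated before the graph sees it. The sketch below is one way to do that, under stated assumptions: the field names come from the list above, but the allowed values and the `parse_assessment` helper are hypothetical.

```python
import json

# Hypothetical allow-lists; None means "the user did not say".
ALLOWED = {
    "location_value": ("interior", "exterior", None),
    "species_value": ("leaf_cutter", "carpenter", None),
    "can_find_nest_value": (True, False, None),
}


def parse_assessment(raw_json: str) -> dict:
    """Parse the model's JSON and keep only known fields with allowed values."""
    data = json.loads(raw_json)
    fields = {}
    for name, allowed in ALLOWED.items():
        value = data.get(name)
        # Compare type as well as value so that 1 is not accepted as True.
        if not any(type(value) is type(a) and value == a for a in allowed):
            raise ValueError(f"{name}: unexpected value {value!r}")
        fields[name] = value
    return fields


fields = parse_assessment('{"location_value": "interior", "species_value": null}')
print(fields)
```

A test then asserts on `fields` or on the resulting state, and never needs to read the reply text.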

Product copy and treatment text come from a curated knowledge base injected by context, not from vector search across arbitrary documents. That matches Hor-Tal’s domain: bounded, expert-authored content where correctness matters more than retrieval novelty.

Evaluation: measuring quality we can defend

“Evaluation” here means evidence that a prompt or model change did not break known behaviour — not a single vanity score.

Branch coverage from the same config that runs in production

We generate integration tests by walking every path from the flow’s start node to each terminal node (depth-first enumeration over flows.yaml). Each edge carries an example_message that stands in for the user in tests. Adding a branch without an example message fails the suite on purpose — so the graph and the test harness stay aligned.

That gives Hor-Tal (and us) a clear story: every configured branch has at least one automated path test.
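The enumeration described above can be sketched as a depth-first walk that collects the example messages along each path. The flow shape and the `enumerate_paths` name are illustrative assumptions; only `example_message` is taken from the text.

```python
def enumerate_paths(flow: dict) -> list[tuple[str, list[str]]]:
    """Return (terminal_node, example_messages) for every start-to-terminal path."""
    paths: list[tuple[str, list[str]]] = []

    def walk(node_id: str, messages: list[str], seen: frozenset) -> None:
        edges = flow["nodes"][node_id].get("edges", [])
        if not edges:  # no outgoing edges: terminal node
            paths.append((node_id, messages))
            return
        for edge in edges:
            if "example_message" not in edge:
                # Fail on purpose: a branch without an example cannot be tested.
                raise ValueError(f"edge {node_id} -> {edge['to']} has no example_message")
            if edge["to"] in seen:  # guard against cycles in the config
                continue
            walk(edge["to"], messages + [edge["example_message"]], seen | {edge["to"]})

    walk(flow["start"], [], frozenset({flow["start"]}))
    return paths


# Hypothetical flow, shaped like a dict parsed from the flow file.
flow = {
    "start": "location",
    "nodes": {
        "location": {"edges": [
            {"to": "rec_interior", "example_message": "They are in my kitchen"},
            {"to": "nest", "example_message": "They are in the garden"},
        ]},
        "nest": {"edges": [
            {"to": "rec_nest", "example_message": "Yes, I can see the nest"},
            {"to": "rec_perimeter", "example_message": "No, I cannot find it"},
        ]},
        "rec_interior": {}, "rec_nest": {}, "rec_perimeter": {},
    },
}
for terminal, messages in enumerate_paths(flow):
    print(terminal, messages)
```

Each returned pair can be fed to a parametrized test that replays the messages and asserts on the final state.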

Live LLM vs stable CI

Paths exercised with a live model catch real-world flakiness — for example, the model marking species_unchanged: true while still returning a species id, so state never updates even when the reply sounds right. Those failures are valuable: they point to prompt and validation hardening, not bad test design.
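One possible form of that validation hardening is a consistency check that runs before state is updated. This is a sketch only: `species_unchanged` comes from the example above and `species_value` from the structured fields, while the function itself and its error wording are hypothetical.

```python
from typing import Optional


def check_species_consistency(assessment: dict, previous_species: Optional[str]) -> list[str]:
    """Flag an assessment that claims 'unchanged' yet carries a different species."""
    errors = []
    unchanged = assessment.get("species_unchanged")
    new_species = assessment.get("species_value")
    if unchanged and new_species is not None and new_species != previous_species:
        errors.append(
            "species_unchanged is true but species_value differs from stored state"
        )
    return errors


# The contradictory case from a live run: nothing stored yet, id returned anyway.
print(check_species_consistency(
    {"species_unchanged": True, "species_value": "carpenter"}, None
))
```

What to do with a flagged assessment (retry the call, trust the id, or hand off) is a product decision this sketch does not make.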

For continuous integration, the next step is the industry-standard split:

Fast, deterministic runs with a mocked LLM returning fixed JSON per step (graph and validation logic).
Scheduled or pre-release runs with the real model for regression on critical golden paths.
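For the deterministic half of that split, a stand-in model that returns fixed JSON per step is usually enough. The `FakeLLM` class, its `assess` method, and `run_turn` below are hypothetical names for the sketch, not the project's interfaces.

```python
import json


class FakeLLM:
    """Returns a canned JSON assessment per node, and records each call."""

    def __init__(self, responses: dict):
        self._responses = responses
        self.calls: list[tuple[str, str]] = []

    def assess(self, node_id: str, user_message: str) -> str:
        self.calls.append((node_id, user_message))
        return json.dumps(self._responses[node_id])


def run_turn(llm, node_id: str, user_message: str, state: dict) -> dict:
    """Merge the non-null fields of the assessment into a copy of the state."""
    fields = json.loads(llm.assess(node_id, user_message))
    new_state = dict(state)
    new_state.update({k: v for k, v in fields.items() if v is not None})
    return new_state


llm = FakeLLM({"location": {"location_value": "interior", "species_value": None}})
state = run_turn(llm, "location", "There are ants in my kitchen", {})
print(state)
```

Because the fake is keyed by node rather than by message text, copy changes in the example messages do not invalidate the deterministic suite.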

We report live runs with structured JSON and text summaries (pass rate per path, duration per case) — owned in the repo, not locked in a third-party eval UI.
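A report of that kind needs very little machinery. The sketch below assumes each case is recorded as a dict with `path`, `passed`, and `duration_s` keys; those key names and the `summarize` function are illustrative, not the repo's actual format.

```python
import json
from collections import defaultdict


def summarize(results: list[dict]) -> dict:
    """Aggregate per-case results into pass rate and mean duration per path."""
    by_path = defaultdict(list)
    for result in results:
        by_path[result["path"]].append(result)
    summary = {}
    for path, runs in sorted(by_path.items()):
        passed = sum(1 for run in runs if run["passed"])
        summary[path] = {
            "runs": len(runs),
            "pass_rate": passed / len(runs),
            "mean_duration_s": sum(run["duration_s"] for run in runs) / len(runs),
        }
    return summary


results = [
    {"path": "location>nest>rec_nest", "passed": True, "duration_s": 1.0},
    {"path": "location>nest>rec_nest", "passed": False, "duration_s": 3.0},
    {"path": "location>rec_interior", "passed": True, "duration_s": 2.0},
]
print(json.dumps(summarize(results), indent=2))
```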

Post-conversation review (Suricata)

Automated path tests prove routing and state. They do not catch tone, subtle wrong advice, or edge cases in free-form user wording. Suricata closes that gap with a versioned schema and prompt, storing analyses per conversation for admin review and webhook escalation when human attention is needed.
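To make "versioned schema and prompt" concrete, here is a hedged sketch of how a stored analysis might carry its versions alongside the verdict. Only `needs_human` is taken from the article; the `Review` record, the version strings, and `parse_review` are assumptions for illustration.

```python
from dataclasses import dataclass

SCHEMA_VERSION = "1"  # hypothetical version label


@dataclass(frozen=True)
class Review:
    conversation_id: str
    schema_version: str
    prompt_version: str
    needs_human: bool
    reasons: tuple


def parse_review(conversation_id: str, prompt_version: str, raw: dict) -> Review:
    """Validate the reviewer's output and stamp it with the versions used."""
    if not isinstance(raw.get("needs_human"), bool):
        raise ValueError("needs_human must be a boolean")
    reasons = raw.get("reasons", [])
    if not all(isinstance(reason, str) for reason in reasons):
        raise ValueError("reasons must be a list of strings")
    return Review(
        conversation_id=conversation_id,
        schema_version=SCHEMA_VERSION,
        prompt_version=prompt_version,
        needs_human=raw["needs_human"],
        reasons=tuple(reasons),
    )


review = parse_review(
    "conv-123",
    "reviewer-v3",
    {"needs_human": True, "reasons": ["dosage advice not in knowledge base"]},
)
if review.needs_human:
    print("escalate", review.conversation_id, review.reasons)
```

Storing both versions with every analysis is what lets an audit reproduce the verdict later.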

Together, the three layers cover different risks: graph tests for branches, golden-path state assertions for regressions on key journeys, and Suricata for qualitative risk in production-like transcripts.

Observability and performance (without a hosted tracing product)

We are growing observability in the same spirit as the tests: structured, self-hosted, aligned with sessions already stored in S3.

Per turn, the useful record is not only the chat text but a small envelope: conversation id, current_node, product ids, model id, prompt/version identifiers, LLM call count, latency, and errors. That supports questions clients actually ask: Where do users drop off? Did latency spike after a deploy? Which node precedes Suricata flags?
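The envelope can be sketched as a small record serialized to one JSON line per turn. The field list follows the paragraph above; the exact names, types, and the `to_log_line` helper are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TurnEnvelope:
    conversation_id: str
    current_node: str
    product_ids: list[str]
    model_id: str
    prompt_version: str
    llm_call_count: int
    latency_ms: float
    errors: list[str] = field(default_factory=list)


def to_log_line(envelope: TurnEnvelope) -> str:
    """Serialize with sorted keys so log lines are stable and easy to diff."""
    return json.dumps(asdict(envelope), sort_keys=True)


envelope = TurnEnvelope(
    conversation_id="conv-123",       # illustrative values throughout
    current_node="species",
    product_ids=[],
    model_id="example-model",
    prompt_version="assess-v7",
    llm_call_count=1,
    latency_ms=840.5,
)
print(to_log_line(envelope))
```

One line per turn is enough to answer the drop-off and latency questions with ordinary log queries.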

OpenTelemetry-style tracing is a natural next step for correlating HTTP, graph steps, and LLM calls — exportable to whatever log stack the deployment already uses (for Hor-Tal, AWS Lightsail and container logs), without requiring a LangChain-specific SaaS dashboard.

Security, privacy, and responsible AI

Hor-Tal’s assistant is not anonymous trivia chat. The design assumes real users and real recommendations, so governance is part of the architecture:

Concern	Approach
Secrets	API keys and internal shared secrets only in environment configuration; pre-commit secret scanning on the codebase.
Service boundaries	Internal APIs (graph run-turn, session read/write, Suricata analyze) authenticated with shared secrets; analysis triggers not exposed on the public edge without auth.
Data isolation	Conversation sessions and analysis artifacts under tenant-scoped storage prefixes; admin access through authenticated dashboard flows.
Human oversight	Suricata can mark conversations that need human review; ops can inspect full session detail rather than trusting the model alone.
Prompt and schema versioning	Reviewer behaviour tied to checked-in prompt and schema versions — reproducible audits when something goes wrong.
Responsible limits	Handoff paths when the graph cannot recommend safely; structured reasons instead of silent failure.
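For the service-boundaries row in the table above, a shared-secret check is typically a constant-time comparison against a value read from the environment. The header name `X-Internal-Secret` and the `is_authorized` function are hypothetical; `hmac.compare_digest` is the standard-library call for the comparison.

```python
import hmac


def is_authorized(headers: dict, expected_secret: str) -> bool:
    """Constant-time check of a shared secret sent in a request header."""
    if not expected_secret:
        # An unset secret must never authorize anything.
        return False
    supplied = headers.get("X-Internal-Secret", "")
    return hmac.compare_digest(supplied.encode(), expected_secret.encode())


# In a real service the secret would come from environment configuration.
secret = "example-only-secret"
print(is_authorized({"X-Internal-Secret": "example-only-secret"}, secret))
print(is_authorized({}, secret))
```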

Prompt-injection resistance is layered: user text is input to assessment, but decisions follow validated fields and graph rules — reducing the chance that a clever sentence alone redirects product logic.

What Hor-Tal gains

Transparency for the business: Flows can be reviewed and extended in configuration with a visible map of branches and outcomes.
Safer recommendations: Products and species come from curated knowledge and graph outcomes, not from the model inventing SKUs.
Maintainable quality: Tests derived from the same YAML that runs in production; reviewer prompts versioned like any other contract.
A path to production chat: Platform orchestration, sessions, admin UI, and deployment as a single bundle — the graph service remains the place for Hor-Tal-specific expertise.

What we would tell another team (and a hiring panel)

If you are building a high-stakes guided assistant, consider:

Make the graph (or state machine) the contract — not the LLM’s paragraph.
Invest in branch coverage tests generated from that contract.
Separate live-model regression from deterministic CI — both matter.
Add an offline reviewer for what structural tests cannot see.
Log structure per turn — you will need it the first time a deploy “feels worse” with no obvious error.

Hor-Tal’s project is a practical reference for that playbook: domain-serious, testable, and governable — the kind of AI engineering that holds up when the client’s name is on the recommendation.

TailorChat is our platform for graph-driven, multi-tenant conversational products. Hor-Tal is a client deployment focused on pest diagnosis and product guidance. Technical product rules for the Hor-Tal graph live in the Hor-Tal design guidelines; platform architecture and operations are documented in the TailorChat platform repository.

For enquiries about similar assistants — guided flows, evaluation harnesses, and production deployment on AWS — contact your TailorChat delivery team.
