In the first post, we argued that trustworthy AI for BI depends on business meaning, not just model capability. In the second, we showed what that meaning looks like in practice: governed metrics, safe relationships, aliases, defaults, and context packs. The next question is how to build that foundation without turning it into a one-off pilot. That is where sequencing matters.
The teams that succeed do not start by asking, “Which AI tool should we use?” They start by asking, “Which business questions need to be answered reliably, and what meaning does the system need in order to answer them?” From there, the work follows a repeatable sequence: identify the questions, classify their complexity, build the semantic foundation, enrich it for natural language, validate against ground truth, and package the process so it can scale to the next domain. Over the course of our AI for BI engagements, we have developed a high-level, repeatable approach built on those steps, described in the sections below.

Start with Business Questions
Data and engineering teams often start with the tech stack. That usually means picking tools and designing the architecture before they have fully defined the questions the system needs to answer. The problem is that this often leads to systems that do not match the way business users actually think or ask questions.
The more productive starting point is a structured discovery process with the business stakeholders who will use the system, with a prioritized question inventory as its output. The goal is to understand what decisions they make, what information they need to make those decisions, and how they naturally phrase questions about that information. This surfaces real use cases, reveals which metrics people actually rely on versus those that are only theoretically useful, and exposes inconsistencies in metric definitions across teams before any building begins.
This question inventory becomes the ground truth for the rest of the process. It shapes the semantic model, determines which definitions need to be governed first, and becomes the validation set used to determine whether the agent is ready for production.
In a recent engagement with a global CPG company building this capability for its Finance function, discovery interviews surfaced a prioritized backlog of business questions around performance to plan, margin movement, working capital, and period-over-period variance. Those questions directly shaped the semantic model structure and gave stakeholders a concrete way to evaluate readiness. Instead of asking whether the AI “worked,” the team could ask a better question: can it reliably answer the Finance questions we already agreed matter most?
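One way to keep a question inventory usable as ground truth is to store it as structured records rather than meeting notes. The Python sketch below is a minimal illustration, not a prescribed schema: the field names, the sample questions, and the metric identifiers such as `net_revenue` are all assumptions made up for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BusinessQuestion:
    """One inventory entry. All field names here are illustrative assumptions."""
    text: str        # the question as the stakeholder phrased it
    asked_by: str    # team or role that raised it
    decision: str    # the decision the answer supports
    metrics: tuple   # metric names the answer depends on
    priority: int    # 1 = highest, as agreed with stakeholders


def prioritized(inventory):
    """Return the inventory ordered by agreed priority (ties keep input order)."""
    return sorted(inventory, key=lambda q: q.priority)


def metric_demand(inventory):
    """Count how many questions depend on each metric."""
    return Counter(m for q in inventory for m in q.metrics)


# Hypothetical sample entries, loosely modeled on Finance-style questions.
inventory = [
    BusinessQuestion("How did gross margin move versus last quarter?",
                     "FP&A", "pricing review", ("gross_margin",), 2),
    BusinessQuestion("How is net revenue tracking against plan?",
                     "Finance leadership", "forecast update",
                     ("net_revenue", "plan_revenue"), 1),
    BusinessQuestion("Which categories drove the margin change?",
                     "FP&A", "pricing review",
                     ("gross_margin", "net_revenue"), 3),
]

for q in prioritized(inventory):
    print(q.priority, q.text)
print(metric_demand(inventory).most_common())
```

The demand count is a rough signal for which definitions to govern first: a metric that many high-priority questions depend on is a better early investment than one that appears once.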
Classify Questions Before Building
Once the question inventory exists, it needs to be classified by complexity before development begins. Not all questions place the same demands on the semantic layer. Treating them uniformly leads to two common problems: teams over-engineer for simple lookups, or they underestimate the semantic depth required for more complex analysis. A practical classification approach uses three tiers.
- Descriptive questions - What happened? In most business domains, the majority of questions are straightforward single-metric lookups with clear filters. For example: “What was net revenue last month?” or “What was gross margin for Beverages in the Northeast?” These questions still require governed definitions, but they are typically the simplest to support and validate.
- Comparative questions - How are we doing versus something else? Another group involves multi-metric comparisons, period-over-period calculations, or cross-dimensional analysis. For example: “How did beverage margin compare to plan this quarter?” or “How are Northeast Express stores performing versus last year?” These questions require more explicit time logic, comparison rules, and dimensional consistency.
- Diagnostic questions - Why did it happen? A smaller portion involves conditional logic, ranking, decomposition, or other multi-step analysis. For example: “Which products contributed most to the margin decline?” or “Why did revenue fall short of plan in the Northeast?” These are the hardest questions to support because they require robust metric definitions, safe dimensional paths, ranking logic, and clear rules for how the agent should reason through possible explanations.
This classification shapes development priorities and sets realistic expectations. The semantic layer should achieve strong, consistent accuracy on the simpler tier before the more complex questions are used to assess readiness. Mixing tiers in early validation creates misleading accuracy signals and makes it harder to diagnose where the gaps actually are.
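The idea of proving out the simpler tier before judging readiness on harder ones can be expressed as a small gating check. This is a sketch under stated assumptions: the `Tier` enum, the function names, and the 0.9 accuracy threshold are ours for illustration, and the right threshold is something each team has to agree on.

```python
from enum import IntEnum


class Tier(IntEnum):
    """The three complexity tiers, ordered from simplest to hardest."""
    DESCRIPTIVE = 1
    COMPARATIVE = 2
    DIAGNOSTIC = 3


def accuracy_by_tier(results):
    """results: iterable of (Tier, was_correct) pairs, one per validated question."""
    totals, correct = {}, {}
    for tier, ok in results:
        totals[tier] = totals.get(tier, 0) + 1
        correct[tier] = correct.get(tier, 0) + int(ok)
    return {tier: correct[tier] / totals[tier] for tier in totals}


def highest_ready_tier(results, threshold=0.9):
    """Walk tiers from simplest to hardest and stop at the first one below threshold.

    A strong score on a harder tier does not count until every simpler tier passes.
    The 0.9 default is an illustrative assumption, not a recommended bar.
    """
    accuracy = accuracy_by_tier(results)
    ready = None
    for tier in Tier:
        if accuracy.get(tier, 0.0) < threshold:
            break
        ready = tier
    return ready


# Hypothetical validation run: 10/10 descriptive, 3/5 comparative, 2/2 diagnostic.
results = (
    [(Tier.DESCRIPTIVE, True)] * 10
    + [(Tier.COMPARATIVE, True)] * 3
    + [(Tier.COMPARATIVE, False)] * 2
    + [(Tier.DIAGNOSTIC, True)] * 2
)
print(accuracy_by_tier(results))
print(highest_ready_tier(results))
```

Reporting accuracy per tier, instead of one blended number, is what keeps a handful of lucky diagnostic answers from hiding a weak comparative tier.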
Build in Two Passes
After discovery and classification, the semantic layer can be built to enable the agent to answer questions more accurately and effectively. This typically works best as two distinct passes: first, the governed analytics foundation; second, the AI-ready interpretation layer.
Pass 1: Metric Layer - Build the Governed Analytics Foundation
The first pass is traditional analytics engineering work. This means standardizing metric definitions, resolving calculation inconsistencies, documenting business logic, and confirming that the data model reflects how the business typically measures performance.
This is where you are likely to discover that a metric like net revenue is calculated differently by two or three teams. Margin may refer to gross profit dollars in one report and gross margin percentage in another. These differences may be manageable when humans are interpreting dashboards, but they become a serious problem when an agent is expected to generate answers dynamically.
Those inconsistencies need to be resolved before AI enrichment is added. Otherwise, the agent is not operating on governed business meaning. It is navigating unresolved disagreement.
The output of this first pass should be a reliable analytics foundation: governed metrics, trusted dimensions, documented grain, approved relationships, tested calculations, and clear ownership.
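A first-pass audit for this kind of disagreement can be as simple as collecting every team's definition of a metric and flagging names that map to more than one expression. The Python sketch below assumes a hypothetical `MetricDefinition` record and made-up expressions; real definitions would come from wherever your metrics currently live (reports, SQL models, a semantic layer).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDefinition:
    """A metric as one team defines it. Field names are illustrative assumptions."""
    name: str        # business name of the metric
    owner: str       # team that maintains this definition
    expression: str  # calculation logic, as written by that team
    grain: str       # level of detail the metric is computed at


def find_conflicts(definitions):
    """Return metric names that have more than one distinct expression.

    Expressions are compared after lowercasing and collapsing whitespace, so
    purely cosmetic differences are ignored. Logically equivalent expressions
    written differently would still be flagged and need human review.
    """
    by_name = defaultdict(set)
    for d in definitions:
        normalized = " ".join(d.expression.lower().split())
        by_name[d.name].add(normalized)
    return {name: sorted(exprs) for name, exprs in by_name.items() if len(exprs) > 1}


# Hypothetical definitions gathered during discovery.
definitions = [
    MetricDefinition("net_revenue", "Finance", "gross_sales - returns - discounts", "order line"),
    MetricDefinition("net_revenue", "Sales Ops", "Gross_Sales  -  Returns - Discounts", "order line"),
    MetricDefinition("margin", "Finance", "gross_profit / net_revenue", "order line"),
    MetricDefinition("margin", "Merchandising", "gross_profit", "order line"),
]

print(find_conflicts(definitions))
```

In this made-up data, `net_revenue` agrees once formatting is ignored, while `margin` surfaces as a genuine disagreement between a ratio and a dollar amount, which is the kind of conflict that has to be settled before enrichment begins.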
Pass 2: Context Layer - Add AI-Ready Interpretation
The second pass adds the context an agent needs to interpret natural language the way the organization expects. This is the AI-ready enrichment layer described in the previous post. It should include:
- Shared language: natural-language aliases, business terminology, and disambiguation for overloaded terms so the agent understands how people actually ask questions.
- Guardrails: defaults, assumptions, and rules for when the agent should answer directly versus ask a clarifying question.
- Relationships and safe paths: approved joins, slices, and valid combinations so dynamically generated queries do not produce plausible-but-wrong results.
- Ground truth examples: high-quality Q&A pairs that reflect real organizational conventions across the three complexity tiers, rather than generic examples or trivia.
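The "relationships and safe paths" item in the list above lends itself to a simple pre-execution check: before a generated query runs, confirm that every join it uses is on an approved list. The sketch below is a deliberately minimal illustration. The table names and the `APPROVED_JOINS` set are hypothetical, and a real check would also need to consider join keys and grain, not just table pairs.

```python
# Approved table pairs, stored as frozensets so (a, b) and (b, a) are the same path.
# These table names are hypothetical examples.
APPROVED_JOINS = {
    frozenset({"sales", "products"}),
    frozenset({"sales", "stores"}),
    frozenset({"plan", "products"}),
}


def unapproved_joins(join_pairs):
    """Return the joins in a generated query that are not on the approved list."""
    return [pair for pair in join_pairs if frozenset(pair) not in APPROVED_JOINS]


# A hypothetical generated query that joins sales directly to plan.
requested = [("products", "sales"), ("sales", "plan")]
blocked = unapproved_joins(requested)
if blocked:
    print("Refuse or ask for clarification; unapproved joins:", blocked)
```

Blocking the query is only half of the behavior; the other half is telling the user why, so that a missing path becomes a backlog item for the semantic layer rather than a silent failure.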
A useful implementation principle is to keep the core metric layer and the context layer distinct but governed together. Calculation logic should change carefully, with tests and disciplined release control. Language mappings, defaults, examples, and clarification rules may evolve more frequently as usage grows, but they still need ownership, review, and versioning. That separation allows the semantic layer to become richer over time without turning the underlying data model into a moving target.
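To make the separation concrete, here is a minimal Python sketch of a context layer held as plain data, apart from any calculation logic. Everything in it is an assumption for illustration: the `ContextPack` structure, the alias phrases, the metric identifiers, and the default time period. Matching is naive substring search, which a real implementation would replace with something more robust.

```python
from dataclasses import dataclass


@dataclass
class ContextPack:
    """Language-level context, versioned separately from metric calculations."""
    aliases: dict    # phrase people use -> governed metric name
    ambiguous: dict  # overloaded term -> candidate governed metrics
    defaults: dict   # assumptions applied when the question does not specify


def interpret(question, pack):
    """Decide whether to answer directly or ask a clarifying question."""
    text = question.lower()
    # Check longer phrases first so a specific alias wins over a shorter one.
    for phrase in sorted(pack.aliases, key=len, reverse=True):
        if phrase in text:
            return {"action": "answer",
                    "metric": pack.aliases[phrase],
                    "assumptions": dict(pack.defaults)}
    # No specific alias matched: an overloaded term triggers clarification.
    for term, candidates in pack.ambiguous.items():
        if term in text:
            return {"action": "clarify", "term": term, "options": list(candidates)}
    # Nothing recognized: ask rather than guess.
    return {"action": "clarify", "term": None, "options": []}


# Hypothetical context pack for a Finance domain.
pack = ContextPack(
    aliases={
        "gross margin percentage": "gross_margin_pct",
        "gross profit": "gross_profit",
        "net sales": "net_revenue",
    },
    ambiguous={"margin": ["gross_profit", "gross_margin_pct"]},
    defaults={"time_period": "last complete fiscal month"},
)

print(interpret("What was margin last month?", pack))
print(interpret("What was gross margin percentage for Beverages?", pack))
```

Because the pack is only data, adding an alias or changing a default can go through a lighter review than a change to a metric's calculation, while still being owned and versioned.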
Validate Against Ground Truth Before Expanding
Once the semantic layer has been built, it needs structured validation against the question inventory before it is considered production-ready. This means running the full question set through the agent, scoring responses against expected outputs, and categorizing failures by type: metric error, filter selection error, calculation error, hallucination, and so on.
Categorizing failures is important because the pattern of failures points to specific remediation. A concentration of filter errors usually means embedded filter values are incomplete; calculation errors point back to unresolved logic in the analytics engineering layer; and hallucinations typically indicate gaps in Q&A pair coverage for a particular question type or phrasing pattern.
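A validation cycle of this kind can be summarized with very little code. The sketch below assumes each question has already been scored by a reviewer or a comparison script and tagged with a failure type when wrong. The `Failure` enum and the remediation strings mirror the categories described above; the function name, the sample results, and the fallback for unmapped categories are illustrative assumptions.

```python
from collections import Counter
from enum import Enum


class Failure(Enum):
    METRIC = "metric error"
    FILTER = "filter selection error"
    CALCULATION = "calculation error"
    HALLUCINATION = "hallucination"


# Remediation hints for the failure patterns discussed in the text.
# Metric errors are left unmapped on purpose and fall through to manual triage.
REMEDIATION = {
    Failure.FILTER: "complete the embedded filter values",
    Failure.CALCULATION: "resolve logic in the analytics engineering layer",
    Failure.HALLUCINATION: "add Q&A pairs for the missed question type or phrasing",
}


def summarize(results):
    """results: list of (question_id, Failure or None), where None means correct."""
    total = len(results)
    failures = Counter(f for _, f in results if f is not None)
    accuracy = (total - sum(failures.values())) / total if total else 0.0
    top = failures.most_common(1)
    next_step = REMEDIATION.get(top[0][0], "triage manually") if top else "none needed"
    return accuracy, failures, next_step


# Hypothetical cycle: 6 correct, 3 filter errors, 1 calculation error.
results = (
    [(f"q{i}", None) for i in range(6)]
    + [(f"q{i}", Failure.FILTER) for i in range(6, 9)]
    + [("q9", Failure.CALCULATION)]
)

accuracy, failures, next_step = summarize(results)
print(f"accuracy={accuracy:.0%}", dict(failures), "next:", next_step)
```

Running the same summary after each remediation cycle gives a comparable record over time, which is more persuasive to stakeholders than a single accuracy figure.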
Running validation in structured cycles, with targeted remediation between each cycle, gives stakeholders clear evidence of readiness rather than a demo that performs well on a curated set of questions. It also builds the institutional confidence needed to move from pilot to production and to secure stakeholder buy-in and executive approval for domain expansion.
The Roadmap in Practice
Taken together, these steps form a repeatable roadmap for building AI-ready semantic layers domain by domain. The goal is not to create a one-off pilot that works under controlled conditions, but to establish a process the organization can reuse as more teams, metrics, and use cases are brought into scope.
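At a high level, the process looks like this:
- Identify the business questions
- Classify them by complexity
- Build the governed metric layer
- Add the AI-ready context layer
- Validate against ground truth
- Document the process so it can be reused for the next domain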
The sequence matters. Business questions define what the semantic layer needs to support. Classification helps teams prioritize development and set realistic validation expectations. The analytics engineering pass establishes the governed foundation, while the AI enrichment pass adds the context an agent needs to interpret natural language safely. Validation then tests the system against real business questions, and documentation turns the first domain into a repeatable onboarding model for the next.
The Foundation Matters
Delivering AI for BI well requires both the right process and the right architecture, and neither substitutes for the other. The tech stack needs to be deliberate and fit for the environment it operates in. Whether that is a dbt semantic layer with MetricFlow, a Snowflake-native approach with Semantic Views and Cortex, a Databricks-native approach with metric views and Genie, or a hybrid across platforms, the architecture decision has long-term implications for governance, scalability, and how easily new domains get onboarded.
At the same time, even a well-chosen stack produces poor results if the underlying semantic layer is underdeveloped or the validation process is skipped. In our experience, the organizations that get the most out of AI for BI are the ones that invest in both a thoughtful architecture and a disciplined build process from the start.
This post is the third and final in our comprehensive series on AI for BI. The first covers what an AI-ready semantic layer is and why it matters as a foundation, whereas the second walks through what the semantic artifacts look like in practice. Collectively, they are intended to give both business and technical readers a grounded view of what it takes to build this capability in a way that lasts. If you want to talk through where your organization stands, feel free to reach out!


