What Production AI Actually Looks Like in 2026
Klarna, Morgan Stanley, JPMorgan, Intercom: same models, divergent outcomes. The variance is not a model gap. It is a discipline gap: the engineering and operating decisions that turn a foundation model into a system a business can run.
Cite as: The Applied Layer. (2026). What Production AI Actually Looks Like in 2026. The Applied Layer. https://appliedlayer-ai.com/research/what-production-ai-actually-looks-like-in-2026

Executive Summary
Three years into the generative AI cycle, the gap between leading and lagging enterprise deployments has stopped tracking the gap between leading and lagging models. Klarna, Morgan Stanley, JPMorgan Chase, and Intercom are running production systems on the same OpenAI, Anthropic, and Google models available to every other large enterprise. Their outcomes diverge anyway, sometimes by an order of magnitude in resolution rate, accuracy, or cost-to-serve. The divergence sits in a layer that has no canonical name: the discipline that converts a foundation model into a system a business can run.
This report calls that discipline the applied layer and argues it has six identifiable components: retrieval, orchestration, evaluation, governance, integration, and human-AI workflow design. It is not infrastructure. It is not a model. It is the engineering and operating decisions that determine whether a system performs in production. Industry data, from RAND, S&P Global, IDC, Stanford HAI, McKinsey, and a16z, converges on the same picture: 70 to 90 percent of enterprise AI initiatives fail to reach production, and even successful ones rarely realize the EBIT impact projected at pilot.
The applied layer is undertheorized. Foundation-model coverage dominates technical media; vendor marketing prefers transformation narratives to architectural specificity; the practitioners doing the work are too busy to write much of it down. This report defends five principles for working at the applied layer, presents a five-level maturity ladder, and frames the trends shaping the next decade as scenarios rather than predictions.
Opening Hook
In February 2024 Klarna and OpenAI announced that an AI assistant built on GPT‑4 was handling roughly two-thirds of Klarna’s customer service chats, completing 2.3 million conversations in its first month, cutting average resolution time from 11 minutes to under two, and projecting USD 40 million in 2024 profit improvement.1 T1 By spring 2025, CEO Sebastian Siemiatkowski was telling Bloomberg that the company had “underestimated the trade-off” on quality and was rebuilding human capacity, recruiting in an “Uber-type” flexible model while keeping the bot for routine work.2 T1 The model did not change. The applied layer around it did.
Morgan Stanley deployed an internal assistant on the same family of models, beginning in March 2023, and by mid-2024 reported 98 percent adoption among financial advisor teams, an evaluation framework that grew from a 7,000-question test set to coverage of a 100,000-document corpus, and a parallel meeting-summary tool, Debrief, rolled out across roughly 16,000 advisors.3, 4 T1 The firm describes the effort, on its own platform, as an “evaluation framework to test every AI use case before deployment,” refined in collaboration with OpenAI on retrieval methods.3 T1
Two deployments. Comparable underlying models. Materially different outcomes: Klarna walking back automation claims, Morgan Stanley scaling. The variance is not a model gap. It is a discipline gap: in retrieval architecture, evaluation rigor, escalation design, and the operating model around the system. That gap is the subject of this report. It has a shape. It has components. It has principles. It has not, until recently, had a name. The Applied Layer is what this publication will call it. T4
Section 1: The Gap
The aggregate numbers are well-rehearsed and consistent across independent sources. Stanford’s AI Index 2025 records that 78 percent of organizations reported using AI in at least one business function in 2024, up from 55 percent the prior year; generative AI use roughly doubled to 71 percent.5 T1 McKinsey’s State of AI in 2025 survey, run on a base of more than 1,400 respondents, finds that 88 percent of firms now use AI in at least one function but that “nearly two-thirds have not yet begun scaling AI across the enterprise” and only 39 percent report any enterprise-level EBIT impact.6 T1 T4: McKinsey is a Tier D analyst source; figures should be read as self-reported survey data.
On the failure side, the picture is just as consistent. RAND’s August 2024 study of AI project outcomes, based on structured interviews with 65 experienced data scientists and engineers, concludes that more than 80 percent of AI projects fail to reach meaningful production deployment, twice the failure rate of comparable IT projects.7 T1 IDC research conducted with Lenovo found that 88 percent of observed proofs-of-concept did not graduate to wide-scale deployment: of an average of 33 POCs per organization, only four reached production.8 T1 S&P Global Market Intelligence’s 2025 enterprise survey reports that the share of companies abandoning most AI initiatives jumped from 17 percent in 2024 to 42 percent in 2025, with the average organization scrapping 46 percent of POCs before production.9 T1 These are three independent data sources pointing at the same phenomenon.
What the numbers do not say, and what the public discourse rarely names, is that the gap between leaders and laggards is not principally a gap in model access. The 90th-percentile model is roughly the same in every enterprise that has signed a Microsoft, AWS, or Google enterprise agreement. Anthropic’s Opus 4.5, released November 24, 2025, at USD 5 per million input tokens and USD 25 per million output tokens, is available through the Claude API, AWS Bedrock, Google Vertex AI, and Microsoft Foundry on day one.10 T1 So is GPT‑class capability through Azure OpenAI. The shared frontier means that what differentiates a working production system from an abandoned one is what sits around the model.
That distinction shows up in named cases. JPMorgan Chase’s COiN platform, launched in 2017 to automate commercial loan agreement review and frequently cited as reviewing roughly 12,000 agreements a year while saving about 360,000 work hours annually, is an applied-layer success built on much older NLP technology than is now available; the discipline, not the model, did the work.11 T2 Bloomberg’s 2023 BloombergGPT, a 50-billion parameter model trained on a 363-billion-token financial corpus, was overtaken on most financial tasks by general-purpose GPT‑4 within months of release, suggesting that proprietary models alone do not produce defensible outcomes; what mattered for Bloomberg was how those capabilities were integrated into the Terminal workflow.12 T1
Negative cases sharpen the same point. McDonald’s three-year drive-through automated-order-taking pilot with IBM, deployed in more than 100 U.S. restaurants, was terminated in June 2024 with the technology shut off no later than July 26, 2024.13 T1 The cancellation followed widely circulated social-media clips of misorders. The model was capable of speech recognition; the applied layer (accent and noise handling, escalation, integration with kitchen workflows, real-time error correction) was not. Air Canada’s 2024 loss before the British Columbia Civil Resolution Tribunal in Moffatt v. Air Canada turned on a chatbot that gave a customer incorrect retroactive bereavement-fare advice; the airline argued unsuccessfully that the chatbot was “a separate legal entity,” and the tribunal awarded CAD 812.02.14 T1 The model could compose fluent text. The applied layer (content alignment between chatbot and policy page, governance review, output verification) failed.
The pattern is durable and replicates across sectors. Intercom’s Fin AI Agent, rolled out across more than 6,000 customers, reports an average automation rate climbing from 41 percent to 51 percent and now to 66 percent across the customer base, with more than 20 percent of customers above 80 percent.15 T1 Variance among Fin’s customers using the same product on similar query mixes is wide; the sources of difference cited by Intercom are content quality, escalation design, and “Procedures” governing complex flows. Same model, same product, different applied-layer discipline, different outcomes.
The honest read on the public data is that the spread between top and bottom decile of enterprise deployments is now larger than the spread between the top and second-tier foundation model. T4 No single source quantifies this rigorously, a real limitation of the field acknowledged in the Methodology Note below, but the directional claim is supported by multiple converging sources: McKinsey’s high-performer cohort (only 1 percent of executives in a complementary survey describe their gen AI rollouts as “mature”); a16z’s reporting that 81 percent of enterprises now run three or more model families in production, swapping models with diminishing friction; and Anthropic’s Economic Index, which finds that the top 10 tasks account for 24 percent of Claude.ai usage, with directive automation rising from 27 to 39 percent in eight months, suggesting that the binding constraint is workflow design, not raw capability.6, 16, 17 T2
The gap is a discipline gap. The discipline needs a name and a shape.
Section 2: What We Mean by the Applied Layer
The applied layer is the engineering and operating discipline that turns a foundation model into a system a business can run in production. It sits above the model and below the business workflow.
Figure 1, The Applied Layer in Context
[Schematic four-layer stack, rendered top to bottom]
BUSINESS WORKFLOW
(process, role, decision rights, customer experience)
▲
│ invocation, escalation, audit, feedback
▼
THE APPLIED LAYER
┌────────────────────────────────────────────────┐
│ Retrieval │ Orchestration │ Evaluation │
│ Governance│ Integration │ Human-AI Workflow │
└────────────────────────────────────────────────┘
▲
│ inference calls, embeddings, fine-tunes
▼
MODELS
(foundation models, hosted services: OpenAI,
Anthropic, Google, AWS Bedrock, Azure AI,
Vertex AI, open-source weights)
▲
│ GPU compute, storage, networking, data
▼
INFRASTRUCTURE & DATA PLATFORMS
(cloud, Kubernetes, lakehouses, warehouses,
feature stores, vector stores)
Caption: The applied layer is decomposed into six canonical
components. The boundary above (to business workflow) is
where decisions are taken; the boundary below (to models)
is where capability is consumed. Source: editorial framework,
The Applied Layer, drawing on Huyen, Designing Machine
Learning Systems (2022); Yan et al., "What We Learned from
a Year of Building with LLMs" (O'Reilly, 2024); Wang et al.,
Uber Michelangelo evolution (2024).
Each of the six components has an identifiable body of practitioner work and named systems behind it.
Retrieval. The function of bringing the right information into a model’s context window at query time. Retrieval-augmented generation as a pattern was articulated in Lewis et al., “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 2020, which paired a dense retriever (initialized from Karpukhin et al.’s Dense Passage Retrieval, EMNLP 2020) with a seq2seq generator and demonstrated improved performance on knowledge-intensive tasks.18, 19 T1 In production, retrieval has become the central engineering problem: chunking, embedding choice, hybrid lexical-plus-vector search, re-ranking, metadata filtering, freshness, and access control all sit here. A 2024 study of RAG in production reports that 86 percent of enterprises deploying LLMs use the pattern.20 T2 T4: the figure is cited by an academic paper attributing it to a survey; readers should treat it as a directional indicator, not a precise share.
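What that engineering looks like in miniature: the sketch below, in plain Python, fuses a lexical overlap score with a vector similarity score and returns ranked candidates. It is a minimal sketch, not a reference implementation; the toy hashing “embedding,” the 0.5 fusion weight, and the sample corpus are illustrative assumptions standing in for a real embedding model, a cross-encoder re-ranker, and metadata and access-control filters.

# Minimal hybrid retrieval sketch: lexical + vector scores fused,
# then sorted for re-ranking. Standard library only; the hashing
# "embedding" and the fusion weight are illustrative assumptions.
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def embed(text, dims=64):
    # Toy hashing "embedding" standing in for a real embedding model.
    vec = [0.0] * dims
    for tok in tokenize(text):
        vec[hash(tok) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def lexical_score(query, doc):
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum(min(q[t], d[t]) for t in q) / (len(tokenize(doc)) ** 0.5 or 1.0)

def hybrid_search(query, docs, alpha=0.5, top_k=3):
    qvec = embed(query)
    scored = [((alpha * cosine(qvec, embed(d["text"]))
                + (1 - alpha) * lexical_score(query, d["text"])), d)
              for d in docs]
    # A production pipeline would re-rank these candidates with a
    # cross-encoder and apply metadata and access-control filters here.
    return [d for _, d in sorted(scored, key=lambda p: p[0], reverse=True)[:top_k]]

docs = [
    {"id": 1, "text": "Refunds are processed within 14 days of return."},
    {"id": 2, "text": "Bereavement fares may be requested before travel."},
]
print(hybrid_search("how long do refunds take", docs, top_k=1))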
Orchestration. The control flow that links model calls, tools, retrievers, validators, and downstream systems into a coherent pipeline. Uber’s GenAI Gateway, described in a July 2024 engineering post, exposes a unified OpenAI-compatible interface across vendor and in-house models, with PII reduction, safety guardrails, logging, and budget controls, the platform-level orchestration of more than 60 internal LLM use cases.21 T1
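The gateway pattern reduces to a small amount of code. The sketch below is in the spirit of Uber’s description rather than a reproduction of it: one call surface, providers registered by name, budget and logging attached centrally. The provider stubs and the complete() signature are assumptions for illustration.

# Sketch of the gateway pattern: one call surface, pluggable
# providers, logging and budget control attached centrally.
# Provider callables here are stubs; a real gateway would wrap
# vendor SDKs, PII scrubbing, and guardrail checks at this layer.
import time
from typing import Callable

class Budget:
    def __init__(self, max_calls: int):
        self.remaining = max_calls

    def spend(self) -> None:
        if self.remaining <= 0:
            raise RuntimeError("budget exhausted")
        self.remaining -= 1

class Gateway:
    def __init__(self, budget: Budget):
        self.providers: dict[str, Callable[[str], str]] = {}
        self.budget = budget
        self.log: list[dict] = []

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        self.providers[name] = fn

    def complete(self, model: str, prompt: str) -> str:
        self.budget.spend()
        start = time.time()
        reply = self.providers[model](prompt)  # routed by config, not code
        self.log.append({"model": model, "prompt_chars": len(prompt),
                         "latency_s": round(time.time() - start, 4)})
        return reply

gw = Gateway(Budget(max_calls=100))
gw.register("vendor-a", lambda p: f"[vendor-a] echo: {p}")
gw.register("in-house", lambda p: f"[in-house] echo: {p}")
print(gw.complete("vendor-a", "summarize this ticket"))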
Evaluation. The systematic measurement of system behavior against task-specific criteria, distinct from benchmark scores on the underlying model. Hamel Husain’s practitioner essays, drawn from work with more than 30 companies, frame evaluation as a continuous practice rather than a release gate, including techniques such as LLM-as-judge alignment, error analysis, and assertion-based unit tests.22 T2 Academic frameworks formalize parts of this: RAGAS (EACL 2024) introduces reference-free metrics for faithfulness, answer relevancy, context precision, and context recall; ARES extends this with trained judges per pipeline component.23, 24 T1
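Assertion-based unit tests, the most accessible of these techniques, look like ordinary test code: domain rules become executable checks run over (query, answer) pairs. A minimal sketch, with both rules invented for illustration:

# Assertion-based evaluation sketch: domain rules become executable
# checks run over every (query, answer) pair. The rules below are
# invented examples of the pattern, not a real rulebook.
from dataclasses import dataclass

@dataclass
class Case:
    query: str
    answer: str

def no_unsupported_refund_promise(case: Case) -> bool:
    # Domain rule: never mention a refund unless the query asked about one.
    return "refund" in case.query.lower() or "refund" not in case.answer.lower()

def always_cites_source(case: Case) -> bool:
    return "[source:" in case.answer

CHECKS = [no_unsupported_refund_promise, always_cites_source]

def run_suite(cases: list[Case]) -> dict[str, float]:
    results = {check.__name__: 0 for check in CHECKS}
    for case in cases:
        for check in CHECKS:
            results[check.__name__] += check(case)
    return {name: passed / len(cases) for name, passed in results.items()}

suite = [
    Case("when do points expire?", "After 24 months. [source: loyalty-policy]"),
    Case("can I get a refund?", "Yes, within 14 days. [source: refund-policy]"),
]
print(run_suite(suite))  # pass rate per rule; gate releases on these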
Governance. Policy, risk, and compliance practice as it intersects with the system itself: input filtering, output classification, audit logging, evaluation gates in CI/CD, escalation rules, model and prompt versioning, data residency, and review workflows for regulated decisions. The Moffatt v. Air Canada ruling demonstrates that, in at least one common-law jurisdiction, governance is now legally non-negotiable.14 T1
Integration. Connectivity to the systems of record where work actually happens: CRMs, ticketing systems, data warehouses, identity providers, ERP, payment rails. Morgan Stanley’s Debrief integrates with Salesforce; AskResearchGPT integrates with email drafting; AI @ Morgan Stanley Assistant retrieves over 100,000 internal documents.4, 25 T1 Integration is the work that determines whether an AI feature can take action, not just talk.
Human-AI workflow design. The deliberate structure of how humans and the system interact: where the human sits in the loop, when control transfers, how feedback is captured, how disagreement is resolved, and how the work changes over time. Klarna’s evolution, from “AI replaces 700 agents” framing in early 2024 to a hybrid “AI for the easy 80 percent, expert humans for the hard 20 percent” pilot in 2025, is a case study in re-doing this component after the fact.2 T1
The applied layer is not infrastructure. The lakehouse, the feature store, the GPU cluster, and the vector database are infrastructure; they sit below. Stripe’s Railyard (an internal Kubernetes-based model training service that has trained nearly 100,000 models) and Shepherd (its Chronon-derived feature platform) are infrastructure investments that enabled, but are not themselves, applied-layer work.26, 27 T1
The applied layer is not the model. Whether the model is GPT‑5.x, Claude Opus 4.5, Gemini 3 Pro, or a fine-tuned open-weights model, the applied layer is what wraps it.
The applied layer is not the business workflow. A redesigned underwriting process, a new advisor role, or a revised customer service organization is the workflow above; the applied layer is what couples the model to that workflow.
The boundaries are useful because they let practitioners and executives reason about where to invest. The McKinsey “takers / shapers / makers” framing, off-the-shelf, customized, and from-scratch, describes choices at the model layer.28 T2 It says nothing about the applied layer, where most effort and most value generation actually live regardless of the build/buy decision above.
A note on terminology: practitioners use overlapping vocabulary, “AI engineering” (Chip Huyen, 2025), “LLMOps” (an extension of MLOps), “applied LLMs” (Yan, Bischof and co-authors), “production AI,” and “ML platform.”29, 30, 31 T2 The discipline is not yet named consistently. This report uses “applied layer” because it most cleanly separates the engineering and operating discipline from the infrastructure beneath and the workflow above. Report 2: Production RAG will extend Section 5(i) below into the retrieval component in depth.
Section 3: The Economics of the Applied Layer
The applied layer is where most of the cost, time, and risk in an enterprise AI project accumulates. The headlines focus on inference cost and model spend; the spend that matters is everywhere else.
a16z’s 2025 survey of 100 enterprise CIOs reports that average enterprise spend on LLMs has risen from approximately USD 4.5 million in 2024 to USD 7 million in 2025 and is projected to reach roughly USD 11.6 million in 2026, with application-layer spend tracking close behind.17 T1 T4: Tier D analyst-style data; treat these as practitioner-survey figures rather than independently audited ones. Inference cost, while non-trivial, is now a small share of project economics for most enterprise use cases. Stanford’s AI Index 2025 records that the cost of running a model with GPT‑3.5-equivalent performance fell roughly 280-fold over 18 months on a per-million-token basis.5 T1 The model is not where the money goes once a system is in production.
Where it goes, based on triangulating Klarna’s disclosed implementation cost, Morgan Stanley’s described eval program, McKinsey survey data, and engineering write-ups from Uber, Stripe, and DoorDash, is the applied layer. Klarna’s CEO disclosed that the implementation cost of the Klarna AI assistant was between USD 2 million and USD 3 million, on a project that the company projected would deliver USD 40 million in 2024 profit improvement; that ratio is itself instructive: the build cost is a small fraction of the operating cost saving target, and almost none of it is GPU compute.32 T1 The operating model around it (where humans sit, what training they do, what oversight is in place) is the much larger ongoing line item, as Klarna’s 2025 reversal made clear.
Figure 3, Where Applied-Layer Cost Concentrates
[Stacked bar showing typical share of enterprise AI project
spend across categories. Illustrative ranges, not precise
percentages. Categories ordered as labelled below.]
Model inference / API: ~5-15%
Cloud infrastructure: ~5-10%
Retrieval engineering: ~15-25%
Evaluation (build + run): ~15-20%
Integration: ~15-25%
Governance / risk / legal: ~10-15%
Change management / training: ~10-20%
Caption: Illustrative cost decomposition for an enterprise AI
project of meaningful scope (six-figure to low-seven-figure
USD annual spend). Bands constructed by combining named
disclosures (Klarna 2024 implementation cost USD 2-3M;
Morgan Stanley evaluation framework descriptions; Uber's
GenAI Gateway and platform descriptions; a16z 2025 CIO
survey on LLM and application spend) with industry-attested
patterns from McKinsey, BCG, and engineering blog posts.
This figure is constructed from limited public data; precise
decompositions for individual enterprise AI projects are
rarely published. Treat as directional. Sources: a16z (2025);
Klarna press materials (2024); Stanford AI Index (2025);
McKinsey State of AI (2025); Uber Engineering (2024).
[Tier D / editorial reconstruction, flagged as a known
limitation in the Methodology Note.]
Three patterns underlie the cost concentration.
Retrieval is expensive to build and expensive to maintain. Morgan Stanley’s evolution from “answering 7,000 questions” to “any question from a corpus of 100,000 documents” was not a model upgrade.3 T1 It was retrieval engineering, refined in collaboration with OpenAI on retrieval methods, plus a sustained content operation to keep the corpus accurate and current. Report 2: Production RAG will quantify retrieval engineering cost in detail. The high-level point: chunking, embeddings, indexing, hybrid retrieval, re-ranking, freshness, and content stewardship are real engineering programs, not configuration.
Evaluation, done well, is a product, not a one-off. Hamel Husain’s “Field Guide to Rapidly Improving AI Products,” derived from work with more than 30 companies, argues that the principal differentiator between teams that ship and teams that stall is investment in measurement and iteration rather than tool selection.33 T2 Phillip Carter at Honeycomb describes a comparable practice: instrument every LLM call with traces, sample production traffic, run evaluators asynchronously, treat the system as a distributed system because it is one.34 T2 Building this (datasets, judges, dashboards, alerting) is an applied-layer cost.
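That practice compresses into a recognizable shape: every call emits a trace, a fraction of production traffic is sampled, and an evaluator drains the sample asynchronously. A minimal sketch, with the 10 percent sample rate and the length-based stand-in evaluator as assumptions:

# Compressed sketch of "treat the LLM call as a distributed system":
# every call emits a trace, a sample of traffic is queued, and an
# evaluator drains the queue asynchronously. Sample rate and the
# trivial "evaluator" are stand-in assumptions.
import queue, random, threading, uuid

trace_queue: queue.Queue = queue.Queue()
SAMPLE_RATE = 0.10  # fraction of production traffic to evaluate

def traced_llm_call(prompt: str) -> str:
    reply = f"stub reply to: {prompt}"  # stands in for the real model call
    trace = {"trace_id": str(uuid.uuid4()), "prompt": prompt, "reply": reply}
    if random.random() < SAMPLE_RATE:
        trace_queue.put(trace)          # sampled for offline evaluation
    return reply

def evaluator_worker() -> None:
    while True:
        trace = trace_queue.get()
        if trace is None:               # shutdown sentinel
            break
        verdict = "ok" if len(trace["reply"]) > 10 else "suspect"
        print(f"eval {trace['trace_id'][:8]}: {verdict}")

worker = threading.Thread(target=evaluator_worker, daemon=True)
worker.start()
for i in range(50):
    traced_llm_call(f"question {i}")
trace_queue.put(None)
worker.join()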
Governance is delivery work, not policy work. The Air Canada ruling makes clear that the legal liability does not stop at the chatbot disclaimer.14 T1 In regulated industries (finance, healthcare, insurance), governance is not an afterthought to a deployment; it is a constraint that shapes architecture from the first sprint. Morgan Stanley describes integrating quality assurance into its eval framework as a precondition for advisor rollout, not a follow-up.3 T1 Report 3: Operating Models will treat this in depth.
The economic argument has three concrete implications.
First, ROI is back-loaded and front-risked. The S&P Global 2025 finding that the average organization scraps 46 percent of POCs before production is consistent with this shape: the costs of doing the applied-layer work to get to production are not absorbed in the pilot phase; they hit when the project graduates.9 T1
Second, the marginal value of additional model capability declines as the applied-layer work matures. Once a system is well-instrumented, well-retrieved, and well-integrated, swapping in a stronger model produces a one-time bump rather than a generational change. a16z reports that 81 percent of enterprises now use three or more model families in production; model substitution is a tactical choice, not a strategic one.17 T2
Third, applied-layer competence compounds. A team that has built a serious evaluation harness for one product can ship the second product faster. A retrieval architecture that handles policy documents in customer service can be repurposed for a sales-enablement assistant. Uber’s Michelangelo platform, evolved over eight years from predictive ML through deep learning into LLM use cases via a GenAI Gateway, supports more than 60 LLM use cases off a shared platform investment.35, 21 T1 Report 4: Economics will develop the unit economics of compounding applied-layer investment.
Section 4: Why the Applied Layer Is Undertheorized
The discipline is mature enough to be observed. It is not yet mature enough to be canonical. Three forces explain the lag.
Foundation-model coverage dominates technical media. The release cadence of frontier models (GPT‑4 in March 2023, Claude 3 in March 2024, GPT‑5 in 2025, Gemini 3 in 2025, Claude Opus 4.5 in November 2025, Claude Opus 4.7 in early 2026) has structured the discourse around capability frontiers and benchmark scores.10 T1 Anthropic’s own Claude Opus 4.5 system card highlights breaking 80 percent on SWE-bench Verified as a notable result; the question of how an enterprise team turns SWE-bench into a working developer-productivity program is treated as a downstream concern.36 T1 That is rational from a vendor standpoint and from a research-news standpoint. It is also why outsiders can spend a year reading about AI without learning much about how production systems are built. T4
Vendor incentives push toward platform narratives, not architectural specificity. Cloud and platform vendors compete on breadth: Azure AI Foundry, AWS Bedrock, Google Vertex AI, Databricks Mosaic, Snowflake Cortex, LangChain/LangSmith, LlamaIndex. Each has an interest in describing the applied layer as something its platform substantially solves. Each has a weaker interest in publishing rigorous decomposition of where customers actually fail and what proportion of that failure its platform addresses. The McKinsey “takers / shapers / makers” framing, useful and frequently cited, describes model strategy and elides the applied-layer choices that matter at production time.28 T2
This is not a critique of any vendor in particular. It is a structural observation. When a vendor is incentivized to publish reference architectures, those architectures are most useful as marketing artifacts and least useful as descriptions of the trade-offs an enterprise team will face. The Applied LLMs essays by Yan, Bischof, Howard, Shankar and others (O’Reilly, 2024) are valuable in part because they were written by independent practitioners with no platform to sell, drawn from a year of work building real applications.31 T2
The practitioners doing the work are too busy to write about it. The third reason is the most banal and the most binding. The applied layer is the discipline of people running production AI inside companies that pay them to keep production AI running. They publish irregularly. The genre that does exist (engineering blogs at Uber, DoorDash, Stripe, Netflix, Airbnb, Klarna, Bloomberg, GitHub, Shopify, Anthropic) is excellent but episodic.21, 26, 27, 35, 37 T1 The patterns are scattered across hundreds of posts, conference talks (AI Engineer Summit, Ray Summit, MLOps World, QCon AI), and Substack newsletters from people like Hamel Husain, Eugene Yan, Chip Huyen, Simon Willison, Jason Liu, and Phillip Carter.22, 29, 33, 38, 39, 34 T2 No one has been paid to integrate them.
A telling indicator: the most influential single document in the applied-layer canon is arguably “What We Learned from a Year of Building with LLMs” by Bryan Bischof, Charles Frye, Hamel Husain, Jason Liu, Shreya Shankar, and Eugene Yan, a three-part essay co-written from a group chat and published on O’Reilly Radar in May-June 2024.31 T1 The discipline’s foundational text was, until recently, an unpaid side project.
The result is a field where practitioners know the shape of the work (retrieval is hard; evaluation is the differentiator; governance has to be wired in early), but executives commissioning enterprise AI programs typically do not, and consultants commissioned to advise them do not consistently surface it. Report 5: Evaluation will go deeper on the evaluation gap; Report 8: Failure Modes will examine specific failure patterns by industry.
Section 5: Five Principles for Working at the Applied Layer
The following five principles are an editorial framework, derived by the report’s author from the named sources cited, with each principle defended by industry-attested examples. They are not a complete theory of applied-layer engineering. They are the load-bearing claims this publication will defend across subsequent reports. T4
(i) Treat retrieval as the core engineering problem.
Retrieval is the component that determines whether a model has access to the truth a business needs it to know. In a 2024 study of LLM deployments, an estimated 86 percent of enterprises ran retrieval-augmented systems.20 T2 The pattern dates to Lewis et al. 2020 and Karpukhin et al. 2020, both of which treated retrieval as a research problem; it has now become the most consequential engineering problem in production AI because it is where domain specificity is inserted, where freshness is controlled, and where access control lives.18, 19 T1
Morgan Stanley’s expansion from a 7,000-question internal corpus to a 100,000-document corpus serving 98 percent of advisor teams was not driven by a model upgrade but by retrieval-method work done in collaboration with OpenAI.3 T1 Klarna’s 2025 self-described shift to “strict whitelisting protocols,” in which “the AI retrieves information exclusively from the help center and customer account data” and escalates rather than guesses when a query falls outside scope, is a retrieval-design decision, not a model decision.40 T2 Honeycomb’s Query Assistant, described in detail by Phillip Carter, has roughly 40 steps in its RAG pipeline, any one of which can dramatically change outputs.34 T2
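Klarna’s described design is, at bottom, a small amount of control flow: retrieve only from approved sources, and escalate rather than guess when nothing qualifies. A minimal sketch under that description; the source names and the relevance threshold are assumptions, not Klarna’s actual values:

# Sketch of whitelist-plus-escalation retrieval: answer only from
# approved sources, hand off to a human when nothing qualifies.
# Source names and the 0.3 threshold are illustrative assumptions.
APPROVED_SOURCES = {"help_center", "customer_account"}

def retrieve(query: str, index: list[dict]) -> list[dict]:
    hits = []
    for doc in index:
        if doc["source"] not in APPROVED_SOURCES:
            continue  # hard whitelist: out-of-scope content never enters context
        overlap = len(set(query.lower().split()) & set(doc["text"].lower().split()))
        score = overlap / max(len(query.split()), 1)
        if score > 0.3:
            hits.append({**doc, "score": score})
    return sorted(hits, key=lambda d: d["score"], reverse=True)

def answer_or_escalate(query: str, index: list[dict]) -> str:
    hits = retrieve(query, index)
    if not hits:
        return "ESCALATE: routing to a human agent"  # escalate, do not guess
    return f"Answer from {hits[0]['source']}: {hits[0]['text']}"

index = [
    {"source": "help_center", "text": "returns accepted within 30 days"},
    {"source": "web_crawl", "text": "unverified forum advice about returns"},
]
print(answer_or_escalate("how many days for returns", index))
print(answer_or_escalate("what is your CEO's salary", index))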
The defensive read is that retrieval is the part of the stack that, when done badly, produces hallucination, wrong-answer-with-confidence, and the kind of policy mismatches that produced the Moffatt v. Air Canada ruling. The constructive read is that retrieval is the part of the stack where domain investment pays a compounding return. Retrieval improvements outlive model swaps. Better content, better chunking, better re-ranking, and better metadata survive every model migration. Report 2: Production RAG will extend this principle into a full architectural treatment.
(ii) Make evaluation a product, not a phase.
The single most consistent claim across independent practitioners (Hamel Husain on his blog and in his course with Shreya Shankar; Eugene Yan in Patterns for Building LLM-based Systems & Products; Phillip Carter on Honeycomb’s deployment; Jason Liu in Systematically Improving RAG; the Applied LLMs group essay) is that systematic evaluation is the differentiator between teams that ship and teams that stall.22, 29, 34, 38, 31 T2
Morgan Stanley names this directly: it implemented an evaluation framework “to test every AI use case before deployment” and ran daily regression tests against a sample suite to identify weaknesses; eval was the precondition for scaling, not a check at the end.3 T1 Stripe applied a comparable discipline to its Payments Foundation Model, building benchmarks for AI agents on Stripe integrations because, in payments, “a mostly correct integration is a failure.”41 T1 Honeycomb sampled production traffic, ran asynchronous evaluators, and treated the LLM call as part of a distributed system requiring distributed-system observability.34 T2
The opposite pattern (release a system, declare success on a vibe check, discover degradation in production) is what produced Klarna’s quality-trade-off retreat and the McDonald’s IBM cancellation, where viral misorder videos preceded the formal decision to shut down by months.2, 13 T1 Husain’s framing in Your AI Product Needs Evals puts the pattern bluntly: teams obsess over tools and frameworks but cannot tell whether their changes are helping or hurting, because they have not built the measurement system that would let them know.22 T2
Evaluation done as a product means: a curated dataset that grows; assertions that codify domain rules; LLM-judge alignment with domain experts; production-traffic sampling; CI-gated metrics; and an owner who treats the eval suite as their roadmap. Report 5: Evaluation will operationalize this in detail.
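The CI gate on that list can be as simple as a script that fails the build when any suite metric drops below its floor. A minimal sketch, with metric names and floors as placeholders:

# Sketch of an evaluation gate in CI: the build fails when any
# suite metric drops below its floor. Metric names and floors are
# placeholders; real gates run the full eval suite first.
import json, sys

FLOORS = {"faithfulness": 0.90, "answer_relevancy": 0.85, "policy_assertions": 1.00}

def gate(metrics_path: str) -> int:
    metrics = json.load(open(metrics_path))
    failures = [f"{name}: {metrics.get(name, 0.0):.2f} < {floor:.2f}"
                for name, floor in FLOORS.items()
                if metrics.get(name, 0.0) < floor]
    for line in failures:
        print(f"EVAL GATE FAIL  {line}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))  # e.g. invoked as: python eval_gate.py metrics.json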
(iii) Optimize for changeability, not capability.
The frontier model in production today is unlikely to be the frontier model in production in 18 months. a16z’s 2025 CIO survey reports that 81 percent of enterprises now use three or more model families in testing or production, up from 68 percent the year before.17 T2 Anthropic’s own release cadence (Sonnet 4.5, Opus 4.5, Opus 4.6, Opus 4.7 across late 2025 and early 2026) is faster than most enterprise procurement cycles.36 T1 Designing a system around the specific quirks of any single model is a depreciating bet.
Uber’s GenAI Gateway is the canonical example of building for changeability. By exposing a unified OpenAI-compatible API across vendor and in-house models, Uber’s platform team made model substitution a configuration change rather than a re-architecture.21 T1 DoorDash’s evolution from rule-based ML in 2016 through deep learning to LLM-powered chatbots, all on a single Sibyl/MLW platform with shadow-deployment for new models, follows the same logic.42 T2
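The shadow-deployment half of that pattern illustrates how cheap changeability is once the plumbing exists: the incumbent model serves traffic, the candidate sees a copy, and paired outputs are logged for offline comparison. A minimal sketch; the model stubs and the divergence metric are assumptions, not DoorDash’s implementation:

# Shadow-deployment sketch: the incumbent serves traffic, the
# candidate sees a copy, and paired outputs are logged for offline
# comparison. Model stubs and the diff metric are assumptions.
shadow_log: list[dict] = []

def incumbent(prompt: str) -> str:
    return f"v1: {prompt.upper()}"

def candidate(prompt: str) -> str:
    return f"v2: {prompt.title()}"

def serve(prompt: str) -> str:
    live = incumbent(prompt)
    shadow = candidate(prompt)  # never returned to the user
    shadow_log.append({"prompt": prompt, "live": live, "shadow": shadow,
                       "diverged": live != shadow})
    return live

serve("estimate delivery time for order 42")
divergence = sum(e["diverged"] for e in shadow_log) / len(shadow_log)
print(f"divergence rate: {divergence:.0%}")  # review diverging pairs before cutover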
The principle has a softer corollary. Capability is rented; the applied-layer work is owned. A team that invests in retrieval architecture, evaluation harness, integration plumbing, and human-AI workflow design is investing in assets that survive the next model release. A team that fine-tunes against a specific model, hard-codes its quirks, and skips evaluation is trading owned assets for rented capability. The honest read of the McKinsey “high performer” pattern, companies redesigning workflows around AI rather than wrapping AI around legacy workflows, is that they are optimizing for changeability of the system as a whole, not capability of the model in particular.6 T2
(iv) Governance is a delivery practice, not a policy artifact.
Governance documents do not prevent failures. Governance practices wired into delivery (output classifiers, escalation triggers, audit logs, evaluation gates in CI/CD, clear escalation rules, content alignment between AI surfaces and policy surfaces, authority limits) do.
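Wired into delivery means, concretely, code on the response path. A minimal sketch of the pattern (an output classifier, an authority limit, and an append-only audit record around every response), with the blocklist and the refund cap invented for illustration:

# Sketch of governance as delivery code: an output classifier, an
# authority limit, and an append-only audit record wrapped around
# every response. The blocklist and the refund cap are invented.
import datetime, json

BLOCKED_CLAIMS = ["guaranteed refund", "legal advice"]
MAX_AUTO_REFUND = 50.00  # authority limit: larger amounts need a human

audit_log: list[str] = []

def govern(draft: str, proposed_refund: float = 0.0) -> str:
    decision, reason = "allow", ""
    if any(claim in draft.lower() for claim in BLOCKED_CLAIMS):
        decision, reason = "block", "blocked claim in output"
    elif proposed_refund > MAX_AUTO_REFUND:
        decision, reason = "escalate", "refund above authority limit"
    audit_log.append(json.dumps({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "decision": decision, "reason": reason, "draft_chars": len(draft),
    }))
    if decision == "block":
        return "I can't confirm that; let me connect you with an agent."
    if decision == "escalate":
        return "A specialist will review this refund request."
    return draft

print(govern("Your refund of $20 has been issued.", proposed_refund=20.0))
print(govern("This is a guaranteed refund of $500.", proposed_refund=500.0))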
The Moffatt v. Air Canada ruling is the clearest data point.14 T1 The CRT explicitly rejected the argument that the chatbot was “a separate legal entity” and held that Air Canada was responsible for all information on its website, whether static page or chatbot. The award was small (CAD 812.02), but the precedent matters: the firm is liable for the system’s outputs, and the only durable defense is to have built the system such that misalignment with policy cannot easily occur. That is delivery work, content stewardship, integration with the policy system of record, escalation logic, not a privacy policy update.
McKinsey’s 2025 survey reports that 47 percent of organizations have experienced at least one negative consequence from gen AI use, with inaccuracy as the most common, but only 18 percent have an enterprise-wide council with decision authority over responsible AI; the gap between policy aspiration and operational embedding is wide.6 T2 Anthropic’s Economic Index finds that automation patterns are rising: directive use jumped from 27 percent in late 2024 to 39 percent eight months later, meaning the volume of decisions taken by AI without a human review step is increasing, raising the cost of governance failure.17 T1
JPMorgan’s COiN case, often cited as a generative-AI success but in fact predating the gen-AI cycle, is instructive on what governance-as-delivery looks like in a regulated context: the platform is designed for human-in-the-loop review of extracted attributes, runs on the bank’s private “Gaia” cloud with regulated-workload security posture, and integrates with internal case-management systems where outputs are consumed.11 T2 Morgan Stanley wired QA into its evaluation framework as a precondition for rollout.3 T1 These are governance decisions that look like architecture decisions because they are architecture decisions.
Report 3: Operating Models will develop this principle into a full operating model treatment.
(v) The human in the loop is part of the architecture.
Every production AI system is a human-AI system. The question is not whether humans are in the loop but where they sit, when control transfers, what they see, and how their feedback returns to the system.
Klarna’s 2025 reversal made this explicit. The company did not abandon AI; it redrew the human-AI boundary, keeping AI for routine inquiries and rebuilding human capacity, including a recruiting pilot for “highly educated students, professionals and entrepreneurs” in a flexible “Uber-type” model.2 T1 The CEO’s framing, “AI solves the easy stuff, our experts handle the moments that matter”, describes an architectural choice, not a fallback. The 2024 numbers (resolution time from 11 minutes to 2; 25 percent drop in repeat inquiries) were preserved; what changed was the role and visibility of the human.40 T2
Morgan Stanley built Debrief such that “advisors review and adjust AI-generated outputs before finalizing them, maintaining a balance between automation and human oversight.”3 T1 GitHub Copilot’s deployment at Zoominfo, with 33 percent acceptance for suggestions and 20 percent for lines, depends on human review as part of the workflow, not as an exception to it.43 T1 Cursor’s 2026 enterprise pitch, used by a stated 64 percent of Fortune 500 companies, sells itself on autonomy gradients: tab-completion, command-K targeted edits, full-agent mode, with the developer choosing where on the slider to operate.44 T2 T4: vendor self-reported figure on Fortune 500 share; not independently audited.
The negative case is McDonald’s drive-through. The system was deployed without a workable human-takeover path inside the customer experience; there is no graceful escalation in a four-second drive-through interaction, so the architecture had no recovery mode, and viral failure followed.13 T1
The principle has a corollary on data: human-AI workflow design is also where the feedback loop lives. Jason Liu’s Systematically Improving RAG materials emphasize that small UI choices (changing a feedback prompt from “How did we do?” to “Did we answer your question?”) can produce a 5x change in feedback volume, which determines how fast the underlying system can improve.39 T2 The human in the loop is not just a safety mechanism. It is the data-generating mechanism for the next iteration.
Section 6: A Taxonomy of Applied-Layer Maturity
Software engineering has long-standing maturity ladders, most prominently Carnegie Mellon’s CMMI, with its five staged levels (Initial, Managed, Defined, Quantitatively Managed, Optimizing).45 T1 CMMI is a stable reference but is not directly transferable; the applied layer has its own components and failure modes. The five-level ladder below is an editorial framework constructed for this report. It is descriptive, not prescriptive. It is meant to give an outside observer a way to place an organization. T4
Figure 4, Applied-Layer Maturity Ladder
┌────────┬──────────────────┬────────────────────────┬───────────────────────┐
│ Level │ Observable from │ What is built │ What is missing │
│ │ outside │ │ │
├────────┼──────────────────┼────────────────────────┼───────────────────────┤
│ 1 │ Pilots, demos, │ Direct API calls; │ No retrieval system; │
│ Ad-hoc │ no production │ ad-hoc prompts; │ no evaluation; no │
│ │ deployment │ individual heroics │ governance; no │
│ │ │ │ ownership │
├────────┼──────────────────┼────────────────────────┼───────────────────────┤
│ 2 │ One or two │ Basic RAG; prompt │ No systematic │
│ Pro- │ production │ library; manual │ evaluation; ad-hoc │
│ duced │ features serving │ release sign-off; │ governance; weak │
│ │ real users │ Slack-based on-call │ integration with │
│ │ │ │ systems of record │
├────────┼──────────────────┼────────────────────────┼───────────────────────┤
│ 3 │ Multiple │ Eval suite with │ Eval not yet wired │
│ Mea- │ products in │ regression tests; │ into CI/CD as gate; │
│ sured │ production with │ tracing/observability; │ retrieval not yet a │
│ │ dashboards and │ ownership defined; │ shared platform; │
│ │ named owners │ initial guardrails │ governance partly │
│ │ │ │ retroactive │
├────────┼──────────────────┼────────────────────────┼───────────────────────┤
│ 4 │ Shared platform │ Evals as CI gates; │ Limited cross-org │
│ Plat- │ across multiple │ versioned prompts; │ pattern reuse; some │
│ formed │ business units; │ unified gateway; │ teams still building │
│ │ predictable │ retrieval as a service;│ in silos; governance │
│ │ release cadence │ governance built into │ council exists but │
│ │ │ delivery │ enforcement uneven │
├────────┼──────────────────┼────────────────────────┼───────────────────────┤
│ 5 │ AI is part of │ Continuous eval │ The frontier: │
│ Compo- │ how the company │ on production traffic; │ multi-agent │
│ unding │ ships software, │ data flywheel from │ orchestration at │
│ │ not a separate │ user feedback; │ scale; cross-system │
│ │ program; │ governance is │ verification; novel │
│ │ measurable EBIT │ delivery practice; │ failure modes still │
│ │ contribution │ workflows redesigned │ being discovered │
│ │ │ around AI │ │
└────────┴──────────────────┴────────────────────────┴───────────────────────┘
Caption: Five-level applied-layer maturity ladder. Levels are
defined by what an outside observer would see, what is in
place, and what is still missing. Inspired by CMMI's staged
representation but adapted to the applied-layer components
defined in Section 2. Source: editorial framework, The Applied
Layer (2026), drawing on CMMI v3.0; Anthropic Economic
Index (2025-2026); McKinsey State of AI (2025); engineering
blog disclosures from Uber, Stripe, DoorDash, Bloomberg,
Morgan Stanley.
Several named cases anchor the ladder.
Level 1 (Ad-hoc). The 88 percent of POCs that IDC reports never reach production sit here.8 T1 So do many of the abandonment cases described in S&P Global’s 2025 survey.9 T1 The diagnostic indicator is that the work depends on individuals; if a single engineer or champion left, the system would not be maintainable.
Level 2 (Produced). A first feature is live, often as a chatbot or internal copilot. It works for the 80 percent case and fails for the long tail. It is governed by Slack-channel triage. Honeycomb’s own Query Assistant in mid-2023, by Phillip Carter’s account, started here: an initial release that solved roughly 80 percent of cases, with the remaining 20 percent representing important edge cases that paying customers cared about.34 T2 The diagnostic indicator is that improvement is reactive, driven by user complaints and incidents, rather than systematic.
Level 3 (Measured). Eval has been built. A team has dashboards. Issues are detected before users complain. Klarna’s described 2025 architecture (“strict whitelisting protocols,” weekly sampling of 100 conversations for review, defined escalation cues) is a Level 3 description for a single product, even as the company’s broader operating-model decisions reset its trajectory.40, 2 T2 Most enterprises that report production AI systems sit at Level 2 or early Level 3, consistent with McKinsey’s finding that nearly two-thirds have not yet begun scaling AI across the enterprise.6 T2
Level 4 (Platformed). A shared applied-layer platform is in place. Evaluation, retrieval, governance, and tooling are reused across multiple business units. Uber’s Michelangelo platform, evolved through three distinct phases over eight years (predictive ML 2016-2019; deep learning 2019-2023; generative AI 2023-present) and now serving more than 60 LLM use cases through a unified GenAI Gateway, is at this level.35, 21 T1 Stripe’s Railyard plus Shepherd platforms, which together standardize training and feature engineering across the company’s many ML use cases, are also Level 4 examples.26, 27 T1
Level 5 (Compounding). AI is a delivery competency, not a project portfolio. Evaluation runs continuously on production traffic. The data flywheel from user feedback into model and retrieval improvement is operational. Governance is built into delivery. Workflows have been redesigned. Bloomberg’s Terminal AI features, built atop both BloombergGPT and frontier models and integrated into the workflow that finance professionals already trust, represent an organization at this level for at least one workflow.12 T2 Morgan Stanley’s expansion from AI Assistant to Debrief to AskResearchGPT, with 98 percent advisor adoption and a continuous evaluation discipline, is approaching it.3, 25 T2
A note on the ladder’s limits. CMMI was built for stable processes; the applied layer is not stable. Frontier capability shifts quarterly; governance regulations are evolving (the EU AI Act, U.S. state laws, sectoral guidance); attack patterns including prompt injection and supply-chain risks are still being characterized. An organization at Level 4 in 2026 may be Level 3 in 2027 if it does not invest. The ladder describes a current snapshot, not a destination.
Report 8: Failure Modes will extend this ladder by mapping specific failure patterns (hallucination at scale, evaluation drift, retrieval staleness, integration failure, governance breach, workflow rejection) to the level at which each typically appears.
Section 7: What This Means for the Next Decade
What follows are scenarios and trends, not predictions. Each is grounded in directional evidence; each has uncertainty associated. The honest read is that the applied layer will become more important, not less, regardless of model trajectory. T4
Scenario A: Models continue to improve rapidly; the applied-layer gap widens. Stanford’s AI Index 2025 records that smaller models are now reaching performance thresholds that required 100x parameter counts two years earlier; inference cost has dropped roughly 280-fold over 18 months; agent benchmarks like RE-Bench show top systems matching or exceeding human experts on short-horizon tasks.5 T1 Anthropic’s Economic Index finds directive (automation-style) use rising and Claude Opus 4.5 reportedly approaching ASL‑4-related thresholds on benchmark tasks, though Anthropic notes that long-horizon, multi-system collaboration with humans is still a binding limitation.17, 36 T1 In a world where capability gets cheaper and stronger, the applied layer becomes the entire competitive surface area, because that is where the cheapest, strongest model has to actually fit into the business.
Scenario B: Capability plateaus; applied-layer competence determines who captures value. If model improvement slows for any reason (data, compute, regulation, training-method exhaustion), the value of being good at the applied layer rises further. The McKinsey 2025 finding that high performers redesign workflows around AI rather than wrapping AI around legacy workflows points the same direction.6 T2 In either scenario, the applied layer is where the leverage lives.
Trend: Agents move from copilot to autonomous, slowly. The current pattern is augmentation-dominant on consumer products and increasingly automation-dominant on API traffic.17 T1 Stripe’s “Stripe Integration Benchmark” published in 2025 found that current frontier LLMs can solve a majority of scoped coding problems but that fully autonomous management of end-to-end software engineering projects remains an open question.41 T1 Cursor and Replit are testing the boundary in coding; Intercom Fin 3 is testing it in customer service through “Procedures.”15, 46 T1 The applied-layer implication is that orchestration, evaluation of agent trajectories, and human-in-the-loop verification are about to become more important, not less. Anthropic’s January 2026 Economic Index report finds that augmentation has overtaken automation again on Claude.ai, 52 percent versus 45 percent, even as automation rises on API traffic.47 T1 The ratio is not stable, and the applied-layer designs that work for augmentation will not transfer cleanly to autonomy.
Trend: Governance moves from policy artifact to enforcement architecture. The Moffatt v. Air Canada ruling is the first widely cited case in a category that will grow.14 T1 EU AI Act provisions on AI-generated content transparency are now law. State-level activity in the U.S. is accelerating; Stanford’s AI Index 2025 records that AI-related state legislation in the U.S. more than doubled in a single year, from 49 in 2023 to 131 in 2024.5 T1 The applied-layer implication is straightforward: more output classification, more audit logging, more provenance tracking, more human review at higher-risk surfaces. Governance is going to look more and more like classical change management, less and less like policy publishing.
Trend: A vocabulary will form, and reference architectures will harden. The current state, with “AI engineering,” “LLMOps,” “applied LLMs,” and “production AI” all describing roughly the same thing, is the early phase of a discipline. The pattern is consistent with how DevOps, MLOps, and SRE matured: practitioner essays first; conference vocabulary; books; reference architectures from neutral parties; eventually, formal certifications. The applied layer is in the practitioner-essay phase. The next decade will produce its O’Reilly canon, its conference circuit, and probably its Carnegie Mellon equivalent.
What is genuinely uncertain. Whether multi-agent architectures become the norm is uncertain: Cognition’s published view, conveyed through Walden Yan’s discussion with Jason Liu, is that single agents with proper context management often outperform multi-agent setups in coding contexts.39 T2 Whether enterprise AI value scales linearly with applied-layer maturity, or whether there are step-function returns, is unknown. Whether the current wave of vibe-coding tools (Cursor, Replit, Lovable) drives a step change in applied-layer engineering productivity, or whether it primarily shifts where the engineering happens, is unknown.46 T2 The Stanford 2024 incident database recording 233 AI-related incidents, a 56.4 percent year-over-year increase, points to failure modes still being catalogued.5 T1
The conservative claim is this: the applied layer will be a central management problem for any enterprise running AI in production for the rest of the decade. That claim is consistent with every data source surveyed in this report. The strong claim, that the applied layer will be the central problem, displacing model selection and infrastructure as boardroom topics, is supported by directional evidence and is the editorial position this publication will defend.
Reports 2 through 10 will go deeper on specific components. The applied layer is the discipline. This was its frame.
Methodology Note
This report was researched and written between April and May 2026 using publicly available sources. The research process used approximately 20 targeted web searches across the following categories: named enterprise AI deployments (Klarna, Morgan Stanley, JPMorgan, Bloomberg, McDonald’s, Air Canada, Intercom, GitHub Copilot, Cursor, Replit, DoorDash, Uber, Stripe); foundational and current peer-reviewed research (Lewis et al. 2020, Karpukhin et al. 2020, BloombergGPT, RAGAS); engineering blogs from named teams (Uber, Stripe, DoorDash, Honeycomb); practitioner essays (Hamel Husain, Eugene Yan, Chip Huyen, Simon Willison, Jason Liu, Phillip Carter, the Applied LLMs group); industry surveys (Stanford AI Index 2025, McKinsey State of AI 2024 and 2025, a16z 2024 and 2025 enterprise AI surveys, Anthropic Economic Index reports through early 2026); regulatory and case material (Moffatt v. Air Canada); and reputable secondary reporting (CNBC, Reuters, FT, The Information context, CIO Dive, Bloomberg, MIT Tech Review).
Sources were classified by tier: Tier A (primary documents, vendor docs, regulatory filings, peer-reviewed research) was preferred for all load-bearing claims. Tier B (named practitioner writing, named case studies) was used for industry-attested patterns. Tier C secondary reporting was used only for context. Tier D analyst material (McKinsey, IDC, S&P Global, a16z surveys) was clearly flagged where used and triangulated against at least one independent source where possible. Tier E sources (SEO content, marketing as fact) were excluded.
Known limitations.
First, Figure 2 (the performance-gap chart originally planned for Section 1) is conceptually clear but quantitatively under-supported. No publicly available, methodologically rigorous study compares production AI performance across enterprises using comparable underlying models on a common metric; the closest available data are vendor-published case studies (Klarna, Morgan Stanley, Intercom Fin) and aggregate survey data (Stanford, McKinsey, RAND, S&P Global). The narrative argument in Section 1 stands on the convergence of those independent sources; precise quantification of the gap remains a research gap. Figure 2 has therefore been omitted in favor of a textual description of the gap, with that omission acknowledged here. Future reports in this series will attempt to construct a more rigorous benchmark.
Second, Figure 3 (cost decomposition) is constructed from limited public data. Few enterprises publish detailed project-level cost decompositions for their AI initiatives. The bands shown are directional, drawing on Klarna’s disclosed implementation cost (USD 2-3M), the a16z 2025 CIO survey (LLM and application spend trajectories), Morgan Stanley’s eval-program descriptions, and Stanford’s 2025 inference-cost data. A rigorous reader should treat the figure as illustrative of the qualitative point, that applied-layer work dominates spend over inference and infrastructure, rather than as a precise benchmark.
Third, the five-level maturity ladder (Figure 4) is an editorial framework. It is informed by CMMI’s staged representation, but no peer-reviewed study has validated the specific level definitions used here. The ladder is offered as a tool for placement, not a measurement instrument.
Fourth, archived URLs in the bibliography below have been constructed in the format https://web.archive.org/web/2026*/[original URL] where an exact archive snapshot at the date of access could not be programmatically confirmed; readers should treat these as pointers to the Wayback Machine’s most recent capture rather than as verified static snapshots. Any subsequent edition of this report will replace constructed archive links with verified snapshot URLs.
Fifth, this report relies disproportionately on practitioner blogs and engineering posts because that is where applied-layer practice is currently documented. Where claims rest on a single practitioner source, that source is named in-line and the tier is marked T2.
Sixth, no commercial relationships exist between the publication and any of the named vendors, models, platforms, or practitioners cited. See the Disclosure Statement below.
The applied layer is a moving target. Specific platform features, model versions, and benchmark results cited in this report will be outdated within months. The architectural and operational principles defended in Sections 2, 5, and 6 are intended to be more durable, but they too will be revisited in subsequent reports as evidence accumulates.
Bibliography
Citations are listed in order of first appearance, in Chicago Notes-Bibliography style. All web sources include accessed date and an archived URL. Where an exact archive snapshot could not be programmatically verified, the archived URL points to the Wayback Machine’s nearest capture, formatted as https://web.archive.org/web/2026*/[original URL] and flagged in the methodology note.
Disclosure Statement
The Applied Layer is an independent publication. As of the date of publication, no commercial relationship, paid engagement, sponsorship, advisory contract, equity holding, or affiliate arrangement, exists between this publication or its author(s) and any of the named vendors, model providers, platform companies, consulting firms, practitioners, or research organizations cited in this report. No source cited in this report has reviewed the report prior to publication, and no source has been compensated for contribution. The vendor and platform names used in this report (including OpenAI, Anthropic, Google, Microsoft, AWS, Databricks, Snowflake, LangChain, LlamaIndex, Cursor, Replit, Intercom, Honeycomb, Stripe, Klarna, Morgan Stanley, JPMorgan Chase, Bloomberg, Uber, DoorDash, McDonald’s, Air Canada, GitHub, IBM) are referenced for editorial illustration only; their inclusion does not constitute endorsement, partnership, or agreement. Where uncertainty exists about a relationship, for instance, where the publication’s author(s) have read a particular practitioner’s writing extensively or attended a particular conference, that fact has been noted in-line. Future reports in this series will carry an updated disclosure if any commercial relationship arises.
The publication reserves editorial independence over all content, including the right to take critical positions on architectural choices made by any vendor, platform, or organization named in this report. Corrections and reader feedback are welcomed and will be acknowledged in subsequent reports.
The Applied Layer, May 2026.
- Klarna. “Klarna AI assistant handles two-thirds of customer service chats in its first month.” Klarna press release, February 27, 2024. https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.klarna.com/international/press/klarna-ai-assistant-handles-two-thirds-of-customer-service-chats-in-its-first-month/.
- Wassel, Bryan. “Klarna changes its AI tune and again recruits humans for customer service.” CX Dive, May 13, 2025. https://www.customerexperiencedive.com/news/klarna-reinvests-human-talent-customer-service-AI-chatbot/747586/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.customerexperiencedive.com/news/klarna-reinvests-human-talent-customer-service-AI-chatbot/747586/.
- OpenAI. “Morgan Stanley uses AI evals to shape the future of financial services.” OpenAI customer story, 2024. https://openai.com/index/morgan-stanley/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://openai.com/index/morgan-stanley/.
- Morgan Stanley. “Launch of AI @ Morgan Stanley Debrief.” Morgan Stanley press release, June 26, 2024. https://www.morganstanley.com/press-releases/ai-at-morgan-stanley-debrief-launch. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.morganstanley.com/press-releases/ai-at-morgan-stanley-debrief-launch.
- Stanford Institute for Human-Centered Artificial Intelligence. “AI Index Report 2025.” Stanford HAI, April 2025. https://hai.stanford.edu/news/ai-index-2025-state-of-ai-in-10-charts. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://hai.stanford.edu/news/ai-index-2025-state-of-ai-in-10-charts.
- Singla, Alex, Alexander Sukharevsky, Lareina Yee, Michael Chui, and Bryce Hall. “The state of AI in 2025: Agents, innovation, and transformation.” McKinsey & Company, 2025. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai.
- RAND Corporation. “The Root Causes of Failure for Artificial Intelligence Projects and How They Can Succeed.” RAND, August 2024. Cited here as discussed in Acharya Kandala, “The Production AI Reality Check: Why 80% of AI Projects Fail to Reach Production,” Medium, 2024-2025. https://medium.com/@archie.kandala/the-production-ai-reality-check-why-80-of-ai-projects-fail-to-reach-production-849daa80b0f3. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://medium.com/@archie.kandala/the-production-ai-reality-check-why-80-of-ai-projects-fail-to-reach-production-849daa80b0f3.
- Schwartz, Evan. “88% of AI pilots fail to reach production, but that’s not all on IT.” CIO, 2025. https://www.cio.com/article/3850763/88-of-ai-pilots-fail-to-reach-production-but-thats-not-all-on-it.html. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.cio.com/article/3850763/88-of-ai-pilots-fail-to-reach-production-but-thats-not-all-on-it.html.
- Stanford, Lindsey. “AI project failure rates are on the rise: report.” CIO Dive, citing S&P Global Market Intelligence, 2025. https://www.ciodive.com/news/AI-project-fail-data-SPGlobal/742590/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.ciodive.com/news/AI-project-fail-data-SPGlobal/742590/.
- Anthropic. “Introducing Claude Opus 4.5.” Anthropic, November 24, 2025. https://www.anthropic.com/news/claude-opus-4-5. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.anthropic.com/news/claude-opus-4-5.
- Ates, Emre. “JPMorgan’s COiN (Contract Intelligence) Platform: Using AI in Mergers & Acquisitions and Commercial Lending.” emreates.co.uk, 2024. https://www.emreates.co.uk/research-2/jpmorgan’s-coin-(contract-intelligence)-platform:-using-ai-in-mergers-&-acquisitions-and-commercial-lending. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.emreates.co.uk/research-2/jpmorgan%27s-coin-(contract-intelligence)-platform:-using-ai-in-mergers-&-acquisitions-and-commercial-lending.
- Wu, Shijie, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. “BloombergGPT: A Large Language Model for Finance.” arXiv:2303.17564, March 30, 2023. https://arxiv.org/abs/2303.17564. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://arxiv.org/abs/2303.17564.
- Maze, Jonathan. “McDonald’s is ending its drive-thru AI test.” Restaurant Business, June 14, 2024. https://www.restaurantbusinessonline.com/technology/mcdonalds-ending-its-drive-thru-ai-test. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.restaurantbusinessonline.com/technology/mcdonalds-ending-its-drive-thru-ai-test.
- Proctor, Jason. “How can I mislead you? Air Canada found liable for chatbot’s bad advice on bereavement rates.” CBC News, February 15, 2024. https://www.cbc.ca/news/canada/british-columbia/air-canada-chatbot-lawsuit-1.7116416. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.cbc.ca/news/canada/british-columbia/air-canada-chatbot-lawsuit-1.7116416. (Underlying ruling: Moffatt v. Air Canada, 2024 BCCRT 149.)
- Intercom. “What’s new with Fin 3: The best AI Agent for complex queries across every channel.” The Intercom Blog, 2025. https://www.intercom.com/blog/whats-new-with-fin-3/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.intercom.com/blog/whats-new-with-fin-3/.
- Anthropic. “Anthropic Economic Index report: Uneven geographic and enterprise AI adoption.” Anthropic, September 2025. https://www.anthropic.com/research/anthropic-economic-index-september-2025-report. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.anthropic.com/research/anthropic-economic-index-september-2025-report.
- Andreessen Horowitz. “Leaders, gainers and unexpected winners in the Enterprise AI arms race.” a16z, 2025. https://a16z.com/leaders-gainers-and-unexpected-winners-in-the-enterprise-ai-arms-race/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://a16z.com/leaders-gainers-and-unexpected-winners-in-the-enterprise-ai-arms-race/.
- Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” Advances in Neural Information Processing Systems 33 (NeurIPS 2020): 9459-9474. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
- Karpukhin, Vladimir, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. “Dense Passage Retrieval for Open-Domain Question Answering.” In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 6769-6781. Online: Association for Computational Linguistics, 2020. https://aclanthology.org/2020.emnlp-main.550/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://aclanthology.org/2020.emnlp-main.550/.
- Hong, Quentin Romero, Lan Cao, William Lim, Apurva Shah, Aditya G. Parameswaran, and Eugene Wu. “RAG Without the Lag: Interactive Debugging for Retrieval-Augmented Generation Pipelines.” arXiv:2504.13587, 2025. https://arxiv.org/pdf/2504.13587. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://arxiv.org/pdf/2504.13587.
- Uber Engineering. “Navigating the LLM Landscape: Uber’s Innovation with GenAI Gateway.” Uber Blog, July 11, 2024. https://www.uber.com/blog/genai-gateway/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.uber.com/blog/genai-gateway/.
- Husain, Hamel. “Your AI Product Needs Evals.” hamel.dev, 2024. https://hamel.dev/blog/posts/evals/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://hamel.dev/blog/posts/evals/.
- Es, Shahul, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. “RAGAs: Automated Evaluation of Retrieval Augmented Generation.” In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2024. https://aclanthology.org/2024.eacl-demo.16/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://aclanthology.org/2024.eacl-demo.16/.
- Saad-Falcon, Jon, Omar Khattab, Christopher Potts, and Matei Zaharia. “ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems.” arXiv:2311.09476, 2024. https://arxiv.org/pdf/2311.09476. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://arxiv.org/pdf/2311.09476.
- Morgan Stanley. “Morgan Stanley Research Announces AskResearchGPT.” Morgan Stanley press release, 2024. https://www.morganstanley.com/press-releases/morgan-stanley-research-announces-askresearchgpt. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.morganstanley.com/press-releases/morgan-stanley-research-announces-askresearchgpt.
- Stripe Engineering. “Railyard: how we rapidly train machine learning models with Kubernetes.” Stripe Blog, 2019, with subsequent updates. https://stripe.com/blog/railyard-training-models. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://stripe.com/blog/railyard-training-models.
- Stripe Engineering. “Shepherd: How Stripe adapted Chronon to scale ML feature development.” Stripe Blog, 2024. https://stripe.com/blog/shepherd-how-stripe-adapted-chronon-to-scale-ml-feature-development. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://stripe.com/blog/shepherd-how-stripe-adapted-chronon-to-scale-ml-feature-development.
- Singla, Alex, Alexander Sukharevsky, Lareina Yee, Michael Chui, and Bryce Hall. “The state of AI in early 2024: Gen AI adoption spikes and starts to generate value.” McKinsey & Company, May 30, 2024. https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-2024. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai-2024.
- Huyen, Chip. “Building LLM applications for production.” huyenchip.com, April 11, 2023. https://huyenchip.com/2023/04/11/llm-engineering.html. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://huyenchip.com/2023/04/11/llm-engineering.html. (See also Huyen, AI Engineering, O’Reilly Media, 2025.)
- Yan, Eugene. “Patterns for Building LLM-based Systems & Products.” eugeneyan.com, July 30, 2023. https://eugeneyan.com/writing/llm-patterns/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://eugeneyan.com/writing/llm-patterns/.
- Bischof, Bryan, Charles Frye, Hamel Husain, Jason Liu, Shreya Shankar, and Eugene Yan. “What We Learned from a Year of Building with LLMs (Part I).” O’Reilly Radar, May 28, 2024. https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/.
- Fini Labs. “Klarna Automates Two-Thirds of Customer Service with AI Assistant.” usefini.com, 2024 (citing CEO Sebastian Siemiatkowski’s disclosure of USD 2-3M implementation cost). https://www.usefini.com/blog/klarna-automates-two-thirds-of-customer-service-with-ai-assistant. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.usefini.com/blog/klarna-automates-two-thirds-of-customer-service-with-ai-assistant.
- Husain, Hamel. “A Field Guide to Rapidly Improving AI Products.” hamel.dev, 2024. https://hamel.dev/blog/posts/field-guide/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://hamel.dev/blog/posts/field-guide/.
- Carter, Phillip. “Improving LLMs in Production With Observability.” Honeycomb Blog, September 26, 2023. https://www.honeycomb.io/blog/improving-llms-production-observability. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.honeycomb.io/blog/improving-llms-production-observability.
- Wang, Kai, Joseph Wang, Eric Chen, and Min Cai. “From Predictive to Generative, How Michelangelo Accelerates Uber’s AI Journey.” Uber Blog, 2024. https://www.uber.com/us/en/blog/from-predictive-to-generative-ai/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.uber.com/us/en/blog/from-predictive-to-generative-ai/.
- Anthropic. “Claude Opus 4.5 System Card.” Anthropic, November 2025. https://www.anthropic.com/claude-opus-4-5-system-card. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.anthropic.com/claude-opus-4-5-system-card.
- Stripe Engineering. “Engineering, Stripe Blog.” Stripe Blog (composite reference for engineering posts including AI agents benchmark and ML infrastructure series). https://stripe.com/blog/engineering. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://stripe.com/blog/engineering.
- Yan, Eugene. “Tag: production.” eugeneyan.com (index of production-LLM essays). https://eugeneyan.com/tag/production/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://eugeneyan.com/tag/production/.
- Liu, Jason. “RAG, Jason Liu.” jxnl.co (collected writing on Systematically Improving RAG Applications). https://jxnl.co/writing/category/rag/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://jxnl.co/writing/category/rag/.
- PromptLayer. “Klarna Customer Service: From AI-First to Human-Hybrid Balance.” PromptLayer Blog, 2025. https://blog.promptlayer.com/klarna-customer-service-from-ai-first-to-human-hybrid-balance/. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://blog.promptlayer.com/klarna-customer-service-from-ai-first-to-human-hybrid-balance/.
- Stripe Engineering. “Can AI agents build real Stripe integrations? We built a benchmark to find out.” Stripe Blog, 2025. https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://stripe.com/blog/can-ai-agents-build-real-stripe-integrations.
- Tonse, Sudhir. “How we built our AI/ML Platform at DoorDash.” Plato Community Substack, 2023. https://platocommunity.substack.com/p/19-how-we-built-our-aiml-platform. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://platocommunity.substack.com/p/19-how-we-built-our-aiml-platform.
- Zoominfo Engineering. “Experience with GitHub Copilot for Developer Productivity at Zoominfo.” Zoominfo Engineering Blog, 2024. https://engineering.zoominfo.com/experience-with-github-copilot-for-developer-productivity-at-zoominfo. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://engineering.zoominfo.com/experience-with-github-copilot-for-developer-productivity-at-zoominfo.
- Cursor (Anysphere). “Cursor for Enterprise.” cursor.com, 2026. https://cursor.com/enterprise. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://cursor.com/enterprise. [Vendor self-description; “64% of Fortune 500” figure is vendor-reported.]
- Software Engineering Institute, Carnegie Mellon University. “Capability Maturity Model Integration (CMMI).” Wikipedia summary of CMMI v3.0 (2023), v2.0 (2018), and earlier versions. https://en.wikipedia.org/wiki/Capability_Maturity_Model_Integration. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://en.wikipedia.org/wiki/Capability_Maturity_Model_Integration.
- Replit. “Replit vs Cursor: Which AI Coding Platform Fits Your Workflow?” Replit Discover, 2025. https://replit.com/discover/replit-vs-cursor. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://replit.com/discover/replit-vs-cursor. [Vendor source; used only to corroborate publicly attested feature descriptions.]
- Anthropic. “Anthropic Economic Index report: Economic primitives.” Anthropic, January 2026. https://www.anthropic.com/research/anthropic-economic-index-january-2026-report. Accessed May 1, 2026. Archived: https://web.archive.org/web/2026*/https://www.anthropic.com/research/anthropic-economic-index-january-2026-report.