Role: Design Lead

Team: 1 Architect, 4 Data Scientists, 2 Technology Engineers

Scale: Deployed to 11M+ UK customers

Launched: London Tech Week, June 2024

Overview

In 2023, NatWest wanted to introduce generative AI into Cora+, its customer-facing banking assistant serving over 11 million users annually. At the time, Cora struggled with simple knowledge-based queries. Questions like “What is an APR?” often resulted in dead ends or unnecessary escalation to human agents. The objective was to explore a retrieval-augmented generation (RAG) approach that could answer natural language financial questions directly in chat, while maintaining regulatory standards and customer trust.

My contribution wasn't the model pipeline. It was figuring out what "good" meant before we shipped anything to 11 million people.

Cora+ is an upgrade to NatWest's digital assistant, Cora, developed in collaboration with IBM.

The Problem with Standard Evaluation

The temptation in a project like this is to measure correctness and move on. We didn't, because correctness alone was insufficient. A response could be factually accurate and still fail in a banking context. It might use language a customer with low financial literacy couldn't follow. It might be technically right but overconfident where qualification was needed. It might pass an automated score and still make a compliance team uncomfortable. We needed a way to evaluate quality, accessibility, and risk together, before any of this reached customers.

Framing the Problem & Scope

I led a three-day workshop with cross-functional stakeholders to define what good looked like in measurable terms, establish acceptable risk thresholds, and scope the MVP. We aligned on a six-week build focused on knowledge-based mortgage and FAQ queries.

Building the Evaluation Framework

The evaluation framework had three components.

Contextual evaluation of the outputs and scoring.

  1. Quantitative metrics. We introduced a Flesch Reading Ease score to assess accessibility alongside accuracy, an answer relevancy score, and a custom response-quality classifier. This gave us a way to evaluate tone and clarity, not just correctness.
  2. 31 benchmark questions with assigned difficulty ratings. These covered product-specific queries (mortgage rates, interest calculations), common financial knowledge (APR, LTV), and general FAQs. Each question had a defined expected answer standard so scoring wasn't subjective.
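To make the readability metric concrete, here is a minimal sketch of a Flesch Reading Ease calculation. The formula is standard; the syllable counter is a naive heuristic and the two sample sentences are invented for illustration, not drawn from Cora's actual responses or pipeline.

```python
import re

def count_syllables(word: str) -> int:
    # Naive heuristic: count vowel groups. Production pipelines would
    # typically use a pronunciation dictionary or a library like textstat.
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1  # drop a typical silent final "e"
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher = easier. Roughly, 60-70 is plain
    English; below 30 is very hard to read."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))

# Hypothetical answers to "What is an APR?" at two reading levels.
jargon = "APR denotes the annualised percentage rate applicable to borrowing."
plain = "APR is the yearly cost of a loan, shown as a percentage."
```

Scoring both candidate answers with a function like this is what lets accessibility sit beside accuracy in the evaluation, rather than being judged by eye.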

Synthetic personas for testing and evaluation.

  3. Synthetic persona testing. This was the most revealing part. I authored four personas — Bob (36, teacher in South London, limited financial literacy), Mari (37, Bristol, specific accessibility requirements), a Cambridge-based user with English as a second language, and Andrew (23, early savings stage) — each submitting the same benchmark question set.

The finding that mattered most: aggregate accuracy scores masked significant per-group variation. Responses that scored well overall were sometimes failing specific personas entirely. Prompt configuration had a material effect on performance across user profiles — something we'd never have caught without testing across them explicitly.
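The disaggregation behind that finding can be sketched in a few lines. The persona names, question IDs, and scores below are invented for illustration; the point is the shape of the analysis, where a healthy overall mean coexists with a persona whose answers are failing.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical scored results: (persona, question_id, score in [0, 1]).
results = [
    ("Bob", "q1", 0.90), ("Bob", "q2", 0.20),
    ("Mari", "q1", 0.95), ("Mari", "q2", 0.90),
    ("Andrew", "q1", 0.85), ("Andrew", "q2", 0.95),
]

# Aggregate score across all personas and questions.
overall = mean(score for _, _, score in results)

# Break the same results out per persona.
by_persona = defaultdict(list)
for persona, _, score in results:
    by_persona[persona].append(score)
per_persona = {p: mean(scores) for p, scores in by_persona.items()}

# The aggregate looks acceptable while one persona's mean sits far below it.
worst = min(per_persona, key=per_persona.get)
```

Reporting `per_persona` alongside `overall` is what surfaced the gap; a single aggregate number would have shipped it.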