Role: Design Lead
Team: 1 Architect, 4 Data Scientists, 2 Technology Engineers
Scale: Deployed to 11M+ UK customers
Launched: London Tech Week, June 2024
In 2023, NatWest wanted to introduce generative AI into Cora+, its customer-facing banking assistant serving over 11 million users annually. At the time, Cora struggled with simple knowledge-based queries. Questions like “What is an APR?” often resulted in dead ends or unnecessary escalation to human agents. The objective was to explore a retrieval-augmented generation (RAG) approach that could answer natural language financial questions directly in chat, while maintaining regulatory standards and customer trust.
My contribution wasn't the model pipeline. It was figuring out what "good" meant before we shipped anything to 11 million people.

Cora+ is an upgrade to NatWest's existing digital assistant, Cora, developed in collaboration with IBM.
The temptation in a project like this is to measure correctness and move on. We didn't, because correctness alone was insufficient. A response could be factually accurate and still fail in a banking context. It might use language a customer with low financial literacy couldn't follow. It might be technically right but overconfident where qualification was needed. It might pass an automated score and still make a compliance team uncomfortable. We needed a way to evaluate quality, accessibility, and risk together, before any of this reached customers.
I led a three-day workshop with cross-functional stakeholders to define what good looked like in measurable terms, establish acceptable risk thresholds, and scope the MVP. We aligned on a six-week build focused on knowledge-based mortgage and FAQ queries.
The evaluation framework had three components.

Contextual evaluation of outputs, and how they were scored.

Synthetic personas for testing and evaluation.
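A minimal sketch of how personas and contextual scoring can fit together. Everything here is illustrative: the persona names, literacy levels, weights, and the crude readability check are hypothetical stand-ins, not the criteria we actually used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Persona:
    """A synthetic user profile used to generate and judge test queries."""
    name: str
    financial_literacy: str  # "low" | "medium" | "high" (hypothetical levels)
    sample_queries: tuple

# Hypothetical personas for illustration only.
PERSONAS = (
    Persona("first-time buyer", "low", ("What is an APR?",)),
    Persona("remortgager", "high", ("Can I port my fixed rate?",)),
)

def readability_ok(response: str, literacy: str, max_words: int = 25) -> bool:
    """Crude accessibility proxy: low-literacy personas get short sentences."""
    if literacy != "low":
        return True
    sentences = [s for s in response.replace("?", ".").split(".") if s.strip()]
    return all(len(s.split()) <= max_words for s in sentences)

def contextual_score(response: str, persona: Persona,
                     grounded: bool, hedged: bool) -> float:
    """Score quality, accessibility, and risk qualification together.

    Integer points (out of 10) keep the arithmetic exact; the 5/3/2
    weighting is an assumption, not the project's actual rubric.
    """
    points = 0
    points += 5 if grounded else 0  # factually grounded in approved content
    points += 3 if readability_ok(response, persona.financial_literacy) else 0
    points += 2 if hedged else 0    # appropriately qualified, not overconfident
    return points / 10
```

The point of scoring per persona, rather than per response, is that the same answer can pass for one profile and fail for another, which is what makes the per-group analysis below possible.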
The finding that mattered most: aggregate accuracy scores masked significant per-group variation. Responses that scored well overall were sometimes failing specific personas entirely. Prompt configuration had a material effect on performance across user profiles — something we'd never have caught without testing across them explicitly.
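As a toy illustration of that finding (the numbers are made up, not the project's results), a per-group minimum surfaces exactly the failures an aggregate mean hides:

```python
from statistics import mean

# Hypothetical pass rates per persona for two prompt configurations.
results = {
    "prompt_A": {"first-time buyer": 0.98, "remortgager": 0.97, "low-literacy": 0.45},
    "prompt_B": {"first-time buyer": 0.79, "remortgager": 0.78, "low-literacy": 0.77},
}

def aggregate(scores: dict) -> float:
    """Overall accuracy: the view that made prompt_A look fine."""
    return mean(scores.values())

def worst_group(scores: dict) -> float:
    """Per-group floor: the view that exposed prompt_A's failure mode."""
    return min(scores.values())
```

Here `prompt_A` wins on the aggregate yet collapses for the low-literacy persona, while `prompt_B` trades a little headline accuracy for a much higher floor. Reporting both numbers per configuration was what made the prompt-sensitivity visible.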