Best Delegate's AI Speech Coach — Executive White Paper

The Challenge

Great coaching doesn't scale linearly.

Best Delegate's coaching advantage has always been expert staff giving individualized feedback to every delegate. As the program grows, that strength runs into a hard arithmetic limit: there are only so many coaching hours in a summer day, and each additional student stretches staff attention thinner.

1,500

Delegates enrolled across programs in summer 2026

Expert summer staff coaching in the field

Delegates each coach supports, on average

18,000

Speeches to evaluate over the summer (≈12 per delegate)

Where the time goes today

Writing thoughtful feedback on a single student speech takes eight to ten minutes — reading the transcript, scoring the rubric, writing out specific, actionable suggestions. Multiplied across a committee, that's more hours per day than any coach has between running debates, designing crises, mentoring one-on-one, and everything else that makes Best Delegate singular.

Delegates, meanwhile, are left waiting. A student who practises on Tuesday evening might not hear back until Thursday, and iterates with no structured guidance in between.

What delegates actually need

The most confident Best Delegate alumni have something in common: they rehearse deliberately and often. They practise a speech, get it reviewed, and try again — and the feedback loop between iterations matters as much as any single session.

Closing that loop required something Best Delegate didn't yet have: a consistent, private, always-available evaluator speaking in our voice. A generic chatbot can't do it. The Hook-Point-Action framework, the three-category rubric, the warm coaching tone — these are how we teach, and they had to be how the AI teaches too.

The question we set out to answer: Can we build a coach that speaks with Best Delegate's voice, never sends a student's speech to a third party, costs us nothing per use, and lets our human coaches spend more time on coaching — and less on scoring?

Our Approach

Three strategic decisions, one clear path.

Every big choice came down to three constraints: the system had to be private, affordable, and on-brand. Each constraint ruled out a class of shortcuts, and what remained was a clear build plan.

Decision 1

Keep it local.

Cloud chatbots (ChatGPT, Claude) produce strong feedback, but every evaluation would send a minor's voice recording off-premises and add per-use costs that grow with the program. Running the system on hardware we already own solves privacy and cost together — in exchange, we accept a smaller open-source model.

Decision 2

Teach the model, don't just prompt it.

A generic model with a detailed "system prompt" can recite a checklist, but doesn't actually learn. By training a specialization layer on Best Delegate's own curriculum and exemplars, the model begins to reason like a Best Delegate coach — and its feedback stays consistent across tens of thousands of evaluations.

Decision 3

Keep humans in the loop.

An AI evaluation is a starting point, not the final word. Every correction a staff member makes is captured and, over time, used to make the next version of the model better. Coaches aren't replaced — they become the teachers the model keeps learning from.

The result: a freely-available open-source language model, customized with Best Delegate's curriculum, running entirely on the laptops we already ship with every program, reviewable and correctable by staff in under a minute per speech.

How It Works

Four stages, under a minute end-to-end.

Every speech follows the same path. A delegate or coach uploads a recording; the system transcribes, evaluates, and renders coach-voiced feedback; a staff member reviews and, where appropriate, corrects. What they change becomes a teaching signal that improves the next version.

Upload

Student or coach drops a video, audio, or transcript into the app.

Transcribe

The system extracts the words — with moment-by-moment timing so the transcript syncs with playback.

Evaluate

The AI coach scores the rubric and writes specific, encouraging feedback in Best Delegate's voice.

Review

Staff review and edit any field in seconds. Their corrections teach the next model version.
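The four stages above can be sketched as a small pipeline. This is a minimal illustration, not the app's actual code: every name here (`Evaluation`, `transcribe`, `evaluate`, `review`, `run_pipeline`) is hypothetical, the transcription and model calls are stubs, and the placeholder scores do not reflect the real rubric scale.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Evaluation:
    transcript: str
    scores: dict[str, int]   # rubric category -> AI score (placeholder scale)
    feedback: str            # coach-voiced written feedback
    corrections: dict[str, object] = field(default_factory=dict)


def transcribe(upload: str) -> str:
    # Stub: a real system would run local speech-to-text here.
    return upload.strip()


def evaluate(transcript: str) -> Evaluation:
    # Stub: a real system would call the locally hosted model here.
    scores = {"Substance": 3, "Structure": 3, "Style": 3}
    return Evaluation(transcript, scores, "Strong hook. Next, sharpen your call to action.")


def review(evaluation: Evaluation, edits: Optional[dict[str, object]] = None) -> Evaluation:
    # Staff edits are stored alongside the AI's output, never over it.
    if edits:
        evaluation.corrections.update(edits)
    return evaluation


def run_pipeline(upload: str, edits: Optional[dict[str, object]] = None) -> Evaluation:
    # Upload -> Transcribe -> Evaluate -> Review
    return review(evaluate(transcribe(upload)), edits)
```

Keeping `scores` and `corrections` in separate fields is one way to preserve the AI's original answer next to the coach's change.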

What a delegate sees

Students open their speech and see a video player alongside a synced transcript — the spoken words highlight in time with playback, letting them hear exactly when their hook landed. Beside the video is the full rubric evaluation: scores in each category, coach-voiced strengths, and concrete suggestions for the things they're still building.

They can rehearse in the evening and arrive at their mentor session already three iterations deeper than the old workflow allowed.
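One way the synced transcript described above could follow playback is a binary search over word start times. This is a sketch under assumptions: the `(start_seconds, word)` format, the sample words, and the function name are illustrative, not the app's actual data model.

```python
from bisect import bisect_right

# Hypothetical word-level timing: (start time in seconds, word).
timed_words = [(0.0, "Honourable"), (0.6, "chair,"), (1.1, "imagine"), (1.7, "a"), (1.8, "world")]
starts = [start for start, _ in timed_words]


def active_word_index(playback_seconds: float) -> int:
    """Index of the word being spoken at the given time, or -1 before the first word."""
    return bisect_right(starts, playback_seconds) - 1
```

Calling `active_word_index(1.2)` returns `2`, the position of "imagine" in this sample.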

What a coach sees

Coaches see the same view, with one addition: an edit pencil beside every score and every piece of written feedback. A coach can adjust a rubric category, rewrite a suggestion, or reshape the overall feedback in seconds — without navigating away from the speech.

The AI's original response stays visible underneath any edit, so the before/after is always transparent. That transparency is what makes the improvement loop work.

Why A Custom Model

Off-the-shelf AI can write. It can't coach like us.

We started the project with an honest comparison: could a generic AI chatbot, steered by a detailed prompt, do the job well enough? We tested it, and the answer was almost. The gap — that "almost" — is where the value of a custom-trained model lives.

The gap we needed to close

Voice drift. A generic model will mix motivational-speaker tone with bureaucratic jargon depending on the speech. Our coaches don't — and neither should the AI.
Framework fidelity. "Hook-Point-Action" isn't in the generic model's training data in the way we teach it. Prompting helps, but the model still slips into unrelated advice.
Consistency at scale. A generic model scoring 18,000 speeches will drift across the summer in subtle, compounding ways. A custom model — where our patterns are baked into the weights — holds steady.

What fine-tuning actually did, in plain terms

Think of the base AI model as a brilliant college graduate who has read widely but has never worked in Model UN. Fine-tuning is an onboarding — an intensive program of curriculum reading, practice cases, and examples — that teaches this graduate how Best Delegate specifically teaches.

Step 1

Assemble the curriculum

We extracted the Best Delegate guidelines (Hook-Point-Action, the scoring rubric, the sample opening speech), pulled the transcripts from instructional videos, and gathered real mentor speech recordings. This is the "required reading" the model studies.

Step 2

Build the case book

We constructed several hundred worked examples — exemplar speeches with their ideal rubric evaluations, deliberately-weakened speeches paired with targeted suggestions, and mixed-quality combinations that teach the model to score different parts of the same speech independently.

Step 3

Run the specialization

A small, targeted specialization layer was added on top of the base model — attaching Best Delegate's domain knowledge without touching the model's general-purpose abilities. The training completes in about two hours on a single MacBook Air. The output is a 9-megabyte file we can ship, version, and roll back.

Step 4

Validate before shipping

A held-back set of speeches — that the model never saw during training — becomes the exam. Only after the model reliably scores those speeches close to the gold-standard answers does the new version get promoted for use.

The specialization layer is small by design. It represents less than 0.1% of the underlying model. Everything the base model knew about language and reasoning is preserved; only the judgment layer for speech evaluation is ours. That's the difference between ordering a generic coaching AI and building one that sounds like Best Delegate.
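As a rough sanity check on the sizes quoted above, the arithmetic can be sketched as follows. Only the 9-megabyte file size and the 0.1% ceiling come from this paper. The 16-bit storage format and the 8-billion-parameter base model are assumptions chosen purely for illustration, and file-format overhead is ignored.

```python
MEGABYTE = 1_000_000  # decimal megabytes (assumption)


def adapter_parameters(file_bytes: int, bytes_per_parameter: int) -> int:
    """Approximate number of weights stored in the adapter file."""
    return file_bytes // bytes_per_parameter


def adapter_fraction(adapter_params: int, base_params: int) -> float:
    """Adapter size as a fraction of the base model's parameter count."""
    return adapter_params / base_params


params = adapter_parameters(9 * MEGABYTE, bytes_per_parameter=2)  # 16-bit weights assumed
fraction = adapter_fraction(params, base_params=8_000_000_000)    # hypothetical base model size
print(params, f"{fraction:.3%}")  # 4500000 0.056%
```

Under these assumptions the adapter stays below the 0.1% ceiling; a different storage format or base model would change the figure.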

Earning Trust

The model is thoroughly validated before any student sees it.

Before we put a coaching AI in front of students, we answered three hard questions: Does it produce reliable output every time? Do its scores match what an expert coach would give? Do its words actually help a student improve?

100%

Evaluations produce complete, well-structured feedback

<½

Points average difference between AI and expert rubric scores

3/3

Rubric categories at or above accuracy target on held-out tests

Speeches in the held-out exam set (never shown during training)

Three ways we validate

Held-out exam

Every release is tested against speeches the model never saw during training. We measure how closely the AI's rubric scores match the gold-standard scores, how reliably it produces complete feedback, and whether it identifies the right strengths and areas for growth.
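The score-matching part of the held-out exam could be computed along these lines. This is a hedged sketch: the function names are hypothetical, the half-point threshold is modelled on the headline figure above and applied per category as an assumption, and the inputs are expected to be lists of per-speech score dictionaries.

```python
from statistics import mean

CATEGORIES = ("Substance", "Structure", "Style")


def mean_abs_difference(ai_scores, gold_scores):
    """Average absolute gap between AI and gold-standard scores, per category."""
    return {
        category: mean(abs(ai[category] - gold[category]) for ai, gold in zip(ai_scores, gold_scores))
        for category in CATEGORIES
    }


def passes_exam(ai_scores, gold_scores, max_gap=0.5):
    """True only if every category's average gap is below the threshold."""
    gaps = mean_abs_difference(ai_scores, gold_scores)
    return all(gap < max_gap for gap in gaps.values())
```

Using the absolute difference keeps over-scoring and under-scoring from cancelling each other out in the average.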

Side-by-side coach review

Coaches can compare the fine-tuned AI's feedback against the generic base model's feedback for the same speech, in a blind format. Their votes show whether the custom training is actually making the feedback better, not just making it sound different.
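A blind comparison like the one described above could be set up as follows. This is an illustrative sketch: the labels "A" and "B", the function names, and the vote format are assumptions rather than the app's actual review screen.

```python
import random


def blind_pair(fine_tuned: str, base: str, rng: random.Random):
    """Return the two feedback texts in random order, plus a hidden answer key."""
    options = [("fine_tuned", fine_tuned), ("base", base)]
    rng.shuffle(options)
    key = {"A": options[0][0], "B": options[1][0]}
    shown = {"A": options[0][1], "B": options[1][1]}
    return shown, key


def tally(votes, keys):
    """Count how often coaches preferred each model once the votes are unblinded."""
    counts = {"fine_tuned": 0, "base": 0}
    for vote, key in zip(votes, keys):
        counts[key[vote]] += 1
    return counts
```

The coach only ever sees `shown`; the `key` stays hidden until the votes are counted.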

Where the AI won't exceed the rubric

The AI scores the parts of the rubric that can be assessed from a recording and a transcript: Substance (what's said) and Structure (how it's organized) in full, plus the portions of Style inferrable from speech patterns. Delivery criteria that require watching the delegate in person — eye contact, hand gestures, body language — are explicitly marked as "not assessable from transcript," leaving that judgment to the coach in the room.

This is a feature, not a limitation. The AI is up front about what it can and can't see, which means coaches always know where their unique observational value applies.
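The "not assessable from transcript" behaviour could be represented with an explicit empty score instead of a guessed number. A minimal sketch, assuming hypothetical names (`CriterionResult`, `assess`) and using only the delivery criteria listed above:

```python
from dataclasses import dataclass
from typing import Optional

NOT_ASSESSABLE = "not assessable from transcript"

# Delivery criteria named in the text that require watching the delegate in person.
IN_PERSON_ONLY = {"eye contact", "hand gestures", "body language"}


@dataclass
class CriterionResult:
    name: str
    score: Optional[int]  # None when the criterion needs in-person observation
    note: str = ""


def assess(name: str, model_score: int) -> CriterionResult:
    """Pass the model's score through, unless the criterion cannot be judged remotely."""
    if name in IN_PERSON_ONLY:
        return CriterionResult(name, None, NOT_ASSESSABLE)
    return CriterionResult(name, model_score)
```

Returning `None` instead of a default number means an unassessed criterion can never be mistaken for a low score.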

Continuous Improvement

Every coach correction makes the next version smarter.

Version 1 is a starting point, not a finished product. The system is designed so that the review step — where a coach adjusts a score or rewrites a suggestion — becomes the raw material for the next release.

How a correction becomes learning

When a coach edits a field, the app quietly records two things: what the AI originally said, and what the coach changed it to. Both get saved together, with the speech transcript that prompted them, as a labeled teaching example.

When enough of these teaching examples have accumulated — typically several hundred across a summer of use — we use them to train the next version of the adapter. The model literally learns from our coaches what good feedback looks like.
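The teaching example described above could be stored as one record per correction. This is a sketch only: the field names and the JSON Lines storage format are assumptions, not a description of the app's real schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Correction:
    transcript: str      # the speech that prompted the feedback
    field_name: str      # e.g. a rubric category or a written suggestion
    ai_original: str     # what the AI originally said
    coach_revision: str  # what the coach changed it to


def to_training_line(correction: Correction) -> str:
    """Serialize one correction as a single JSON line."""
    return json.dumps(asdict(correction), ensure_ascii=False)
```

Writing one line per correction makes it simple to count how many examples have accumulated before the next training run.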

What makes this safer than letting the AI learn freely

Every improvement is anchored to a human judgment, not a self-reinforcing loop. Before a new version is promoted, it has to clear the same held-out exam and side-by-side review as the current version. If version 2 underperforms version 1 on any metric, we don't ship it, and the previous version stays in place.

And because the entire system runs locally, our coach corrections never leave Best Delegate. We're training our model on our data.
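The promotion rule described above, where a new version ships only if it does not underperform on any metric, can be sketched as a simple gate. The metric names in the usage note are hypothetical; the paper does not list the exact metrics tracked.

```python
def should_promote(candidate: dict, current: dict, higher_is_better: dict) -> bool:
    """Promote only if the candidate is at least as good as the current version on every metric."""
    for metric, current_value in current.items():
        candidate_value = candidate[metric]
        if higher_is_better[metric]:
            if candidate_value < current_value:
                return False
        elif candidate_value > current_value:
            return False
    return True
```

For example, with a hypothetical `complete_output_rate` (higher is better) and `mean_score_gap` (lower is better), a candidate that improves one but regresses the other is rejected and the current version stays in place.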

The long-term shape of this program isn't "AI replaces coaches." It's "coaches teach the AI, the AI teaches more students, coaches spend more time teaching." Every summer the tool gets a little better because our staff made it better.

Privacy, Safety, and Control

Built for working with minors.

Every architectural choice was made with one audience in mind: young people. A Model UN program is a trusted environment, and the AI tools we put in front of delegates have to meet that standard.

Privacy

Student voices never leave the building.

All transcription, evaluation, and storage happen on the staff laptop the student's speech was uploaded to. No cloud API calls, no third-party servers, no data broker anywhere in the chain. The model is literally on the device.

Safety

Coaching voice, not critique.

The model is trained exclusively on the Best Delegate coaching voice — warm, specific, and confidence-building. It never ranks students against peers, never makes claims about college admission, and never uses discouraging or fear-based language.

Control

Humans always review.

Every AI evaluation can be edited by a coach before it becomes part of the student's feedback record. And the AI itself can be rolled back to any previous version if a future training run ever regresses on quality — versioning is built in by default.
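Built-in versioning with rollback could look something like the registry below. This is a minimal sketch with hypothetical names (`AdapterRegistry`, `promote`, `rollback`); it tracks version labels only and does not load any model files.

```python
from typing import Optional


class AdapterRegistry:
    """Tracks promoted adapter versions and which one is currently active."""

    def __init__(self) -> None:
        self._promoted: list[str] = []
        self.active: Optional[str] = None

    def promote(self, version: str) -> None:
        # Record the new version and make it the active one.
        self._promoted.append(version)
        self.active = version

    def rollback(self, version: str) -> None:
        # Only versions that were promoted earlier can be restored.
        if version not in self._promoted:
            raise ValueError(f"unknown version: {version}")
        self.active = version
```

Because earlier versions are never deleted from the registry, restoring one is a single call.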

What we don't do: use student recordings to train the model; share evaluations across students or programs; mix Best Delegate data with any external AI provider's training set; or retain anything beyond what's needed to show a student the feedback they earned.

Summer 2026 Impact

More time for what makes Best Delegate great.

The math is straightforward. 1,500 delegates practising roughly a dozen speeches each across the summer is 18,000 evaluations. At the old rate of 8–10 minutes per evaluation, that's 2,400 to 3,000 coaching hours spent on writing feedback. The AI coach compresses that to under a minute of coach attention per speech, and only when the coach decides to step in.
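The arithmetic above can be checked in a few lines. All inputs come from this paper; only the variable and function names are ours.

```python
delegates = 1_500
speeches_per_delegate = 12
evaluations = delegates * speeches_per_delegate


def coaching_hours(evaluation_count: int, minutes_each: float) -> float:
    """Total hours spent if every evaluation takes the given number of minutes."""
    return evaluation_count * minutes_each / 60


low = coaching_hours(evaluations, 8)
high = coaching_hours(evaluations, 10)
print(evaluations, low, high)  # 18000 2400.0 3000.0
```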

18,000

AI-assisted evaluations across the summer

Staff-hours freed for face-to-face coaching

More practice iterations per delegate before each mentor session

<1m

From speech upload to coach-voiced feedback

What this unlocks for delegates

Students go from waiting two days for feedback on a practice speech to getting it in under a minute. They can rehearse the same speech three times in an evening, each time with rubric-aligned, coach-voiced suggestions between attempts, and show up to their mentor session already further along than a delegate in previous years was after a full week.

What this unlocks for staff

Coaches are no longer the bottleneck for basic rubric feedback. Their time — the irreplaceable part of Best Delegate's program — gets redirected to the work only a human mentor can do: running dynamic committee sessions, catching subtle tonal issues a model can't hear, asking the questions that push a delegate out of their comfort zone, and celebrating breakthroughs the moment they happen.

What this unlocks for the program

Best Delegate can grow without diluting its core promise. Every additional delegate still gets personalized, rubric-aligned feedback on every speech — at a marginal cost approaching zero and without needing to hire in direct proportion to enrollment. The program's distinctive voice and standards are preserved in software rather than re-explained to new staff every summer.

This isn't about replacing coaches. It's about letting them coach more. The AI handles the scoring. The humans handle the growth.

Investment Considerations

A one-time build with near-zero operating cost.

The economics of this approach are intentionally different from a subscription to a commercial AI service. The up-front engineering was focused and finite; ongoing costs are dominated by staff laptops we already buy.

Cost category	This approach	Cloud AI subscription alternative
Per-evaluation cost	Zero (runs on existing hardware)	Several cents to several dollars, scaling with enrollment
Summer 2026 projection	No additional per-use cost beyond staff laptops	~$X thousand for 18,000 evaluations (exact figure depends on model tier and prompt length)
Data retention	Fully local; Best Delegate controls every record	Transcripts pass through a third-party provider
Voice consistency	Baked into the model — stable across all evaluations	Depends on prompt engineering; drifts subtly over time
Scaling cost	Flat — adding 500 students costs nothing extra in AI usage	Linear with enrollment
Long-term asset	Improves every summer from our own staff's corrections	Shared with every other customer of that provider

What we avoided by building it this way

Per-evaluation API spend (cloud AI): linear with usage

Per-evaluation cost (our approach): flat

Staff time spent on rubric writing (old workflow): ~2,400 hours / summer

Staff time spent on rubric review (new workflow): ~300 hours / summer

Figures are illustrative estimates based on 18,000 evaluations at historical coach review rates versus projected review-only rates.
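The contrast between flat and linear scaling, and between the two staff-time figures, can be sketched as follows. The per-evaluation price is deliberately left as a parameter because this paper does not fix one, and the one-minute review time is an upper bound that assumes a coach reviews every speech.

```python
def hours(evaluations: int, minutes_each: float) -> float:
    """Total staff hours at a given number of minutes per evaluation."""
    return evaluations * minutes_each / 60


def cloud_spend(evaluations: int, price_per_evaluation: float) -> float:
    # Linear: every additional evaluation costs the same amount again.
    return evaluations * price_per_evaluation


def local_spend(evaluations: int) -> float:
    # Flat: no per-use charge on hardware already owned.
    return 0.0


old_hours = hours(18_000, 8)   # 2400.0, writing feedback from scratch
new_hours = hours(18_000, 1)   # 300.0, if every speech gets a full minute of review
extra_evaluations = 500 * 12   # 6000 more evaluations for 500 more students
```

On these figures the difference is 2,100 staff-hours per summer, while `cloud_spend(extra_evaluations, price)` grows with whatever price applies and `local_spend(extra_evaluations)` stays at zero.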

Roadmap

Where we go from here.

Summer 2026 · Pilot

Version 1 in the field

Every staff laptop gets the app and the validated Version 1 model. Staff begin reviewing every student speech, and coach corrections accumulate — ready to train the next version at summer's end.

Fall 2026 · Version 2

First coach-trained release

With a summer of coach corrections to learn from, we train Version 2. It's validated against the same exam and the same blind side-by-side review as Version 1. If it wins on both, it ships for winter programs.

Winter 2026/27 · Broader coverage

Beyond opening speeches

Expand the AI's coaching vocabulary to cover moderated-caucus speeches, crisis communiqués, and unmoderated debate structure. Each is an additional curriculum module, not a new model.

Spring 2027 · Student web portal

At-home practice for enrolled delegates

Move the inference layer to a hosted service so delegates can practise from home between camps, with the same privacy posture: opt-in retention, encrypted storage, no training on student data without consent.

Summer 2027 · Multi-audience voice

Parent-facing summaries

Add a second output surface: a warm, reassuring summary for parents that highlights growth and next steps, alongside the detailed coach/student view. The same evaluation, reformatted for the right reader.

The long arc: by summer 2027 every Best Delegate delegate has continuous feedback from a model that has been shaped by two summers of expert coach corrections — an AI that genuinely sounds like one of us, trained by all of us, free at the margin, and private by design.

An AI coach that helps our coaches coach more.