Best Delegate built a private, on-brand AI speech evaluator that will give 1,500 delegates instant, rubric-aligned feedback on every practice speech starting summer 2026 — so 50 expert staff can spend more time on the high-value coaching that only humans can do.
Best Delegate's coaching advantage has always been expert staff giving individualized feedback to every delegate. As the program grows, that strength runs into a hard arithmetic limit: there are only so many coaching hours in a summer day, and each additional student stretches staff attention thinner.
Writing thoughtful feedback on a single student speech takes eight to ten minutes — reading the transcript, scoring the rubric, writing out specific, actionable suggestions. Multiplied across a committee, that's more hours per day than any coach has between running debates, designing crises, mentoring one-on-one, and everything else that makes Best Delegate singular.
Delegates, meanwhile, were waiting. A student who practised on Tuesday evening might not hear back until Thursday. They'd iterate with no structured guidance in between.
The most confident Best Delegate alumni have something in common: they rehearse deliberately and often. They practise a speech, get it reviewed, and try again — and the feedback loop between iterations matters as much as any single session.
Closing that loop required something Best Delegate didn't yet have: a consistent, private, always-available evaluator speaking in our voice. A generic chatbot can't do it. The Hook-Point-Action framework, the three-category rubric, the warm coaching tone — these are how we teach, and they had to be how the AI teaches too.
Every big choice came down to a single principle: the system had to be private, affordable, and on-brand. Each constraint ruled out a class of shortcuts, and what remained was a clear build plan.
Cloud chatbots (ChatGPT, Claude) produce strong feedback, but every evaluation would send a minor's voice recording off-premises and add per-use costs that grow with the program. Running the system on hardware we already own solves privacy and cost together — in exchange, we accept a smaller open-source model.
A generic model with a detailed "system prompt" can recite a checklist, but doesn't actually learn. By training a specialization layer on Best Delegate's own curriculum and exemplars, the model begins to reason like a Best Delegate coach — and its feedback stays consistent across tens of thousands of evaluations.
An AI evaluation is a starting point, not the final word. Every correction a staff member makes is captured and, over time, used to make the next version of the model better. Coaches aren't replaced — they become the teachers the model keeps learning from.
Every speech follows the same path. A delegate or coach uploads a recording; the system transcribes, evaluates, and renders coach-voiced feedback; a staff member reviews and, where appropriate, corrects. What they change becomes a teaching signal that improves the next version.
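The path above can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the production code: `transcribe` and `evaluate` are hypothetical stand-ins for the on-device speech-to-text and model calls, and the `Evaluation` fields are illustrative names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Evaluation:
    transcript: str
    ai_feedback: str                      # what the model originally wrote
    coach_feedback: Optional[str] = None  # set only if a coach edits it

    @property
    def final_feedback(self) -> str:
        # The coach's version wins whenever one exists.
        return self.ai_feedback if self.coach_feedback is None else self.coach_feedback


def process_speech(
    recording: bytes,
    transcribe: Callable[[bytes], str],   # hypothetical local speech-to-text
    evaluate: Callable[[str], str],       # hypothetical local model call
) -> Evaluation:
    """Upload -> transcribe -> evaluate. Coach review happens afterwards."""
    transcript = transcribe(recording)
    return Evaluation(transcript=transcript, ai_feedback=evaluate(transcript))
```

Keeping the AI's original text and the coach's edit in separate fields is what lets a later step compare the two.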
Students open their speech and see a video player alongside a synced transcript — the spoken words highlight in time with playback, letting them hear exactly when their hook landed. Beside the video is the full rubric evaluation: scores in each category, coach-voiced strengths, and concrete suggestions for the things they're still building.
They can rehearse in the evening and arrive at their mentor session already three iterations deeper than the old workflow allowed.
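One way to drive the synced transcript is to keep a start and end time for each word and look up the current word from the playback position. The sketch below assumes the transcription step returns word-level timestamps; the `Word` type and `active_word_index` function are illustrative names, not part of the app.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Word:
    text: str
    start: float  # seconds from the start of the recording
    end: float


def active_word_index(words: Sequence[Word], playback_time: float) -> Optional[int]:
    """Return the index of the word being spoken at playback_time, if any."""
    starts = [w.start for w in words]
    # Index of the last word that starts at or before the playback position.
    i = bisect_right(starts, playback_time) - 1
    if i >= 0 and playback_time < words[i].end:
        return i
    return None  # before the first word, or in a pause between words
```

A binary search keeps the lookup fast enough to run on every playback tick, even for long speeches.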
Coaches see the same view, with one addition: an edit pencil beside every score and every piece of written feedback. A coach can adjust a rubric category, rewrite a suggestion, or reshape the overall feedback in seconds — without navigating away from the speech.
The AI's original response stays visible underneath any edit, so the before/after is always transparent. That transparency is what makes the improvement loop work.
We started the project with an honest comparison: could a generic AI chatbot, steered by a detailed prompt, do the job well enough? We tested it, and the answer was "almost." That "almost" is the gap where the value of a custom-trained model lives.
Think of the base AI model as a brilliant college graduate who has read widely but has never worked in Model UN. Fine-tuning is the onboarding: an intensive program of curriculum reading, practice cases, and examples that teaches this graduate how Best Delegate specifically teaches.
We extracted the Best Delegate guidelines (Hook-Point-Action, the scoring rubric, the sample opening speech), pulled the transcripts from instructional videos, and gathered real mentor speech recordings. This is the "required reading" the model studies.
We constructed several hundred worked examples: exemplar speeches with their ideal rubric evaluations, deliberately weakened speeches paired with targeted suggestions, and mixed-quality combinations that teach the model to score different parts of the same speech independently.
A small, targeted specialization layer was added on top of the base model — attaching Best Delegate's domain knowledge without touching the model's general-purpose abilities. The training completes in about two hours on a single MacBook Air. The output is a 9-megabyte file we can ship, version, and roll back.
A held-back set of speeches that the model never saw during training becomes the exam. Only after the model reliably scores those speeches close to the gold-standard answers does the new version get promoted for use.
Before we put a coaching AI in front of students, we answered three hard questions: Does it produce reliable output every time? Do its scores match what an expert coach would give? Do its words actually help a student improve?
Every release is tested against speeches the model never saw during training. We measure how closely the AI's rubric scores match the gold-standard scores, how reliably it produces complete feedback, and whether it identifies the right strengths and areas for growth.
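The first two measurements can be sketched in a few lines. This is an illustration only: the dictionary layout, the numeric score scale, and the choice of mean absolute error as the closeness measure are assumptions, since the source does not specify how score agreement is computed.

```python
from statistics import mean

# Rubric categories named in this document; the dict keys are assumed.
CATEGORIES = ("substance", "structure", "style")


def is_complete(ai: dict) -> bool:
    """Complete feedback has a score for every category and non-empty text."""
    return all(c in ai for c in CATEGORIES) and bool(ai.get("feedback", "").strip())


def score_error(ai: dict, gold: dict) -> float:
    """Mean absolute difference across rubric categories for one speech."""
    return mean(abs(ai[c] - gold[c]) for c in CATEGORIES)


def evaluate_release(predictions: list, gold: list) -> dict:
    """Summarize one release against the held-out speeches."""
    complete = [is_complete(p) for p in predictions]
    errors = [score_error(p, g) for p, g, ok in zip(predictions, gold, complete) if ok]
    return {
        "completeness": sum(complete) / len(predictions),
        "mean_abs_error": mean(errors) if errors else float("nan"),
    }
```

Incomplete outputs are counted against completeness rather than silently dropped, so a release cannot look accurate by failing on the hard speeches.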
Coaches compare the fine-tuned AI's feedback against the generic base model's feedback for the same speech in a blind format, without knowing which model wrote which. Their votes show whether the custom training is actually making the feedback better, not just making it sound different.
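A blind comparison of this kind needs two pieces: a randomized presentation order and a tally that maps votes back to the model behind each option. The sketch below is one simple way to do that; the function names and the "A"/"B" labelling are assumptions for illustration.

```python
import random


def blind_pair(fine_tuned: str, base: str, rng: random.Random) -> tuple:
    """Return (option_a, option_b, key); key marks which option is fine-tuned."""
    if rng.random() < 0.5:
        return fine_tuned, base, "A"
    return base, fine_tuned, "B"


def win_rate(votes: list, keys: list) -> float:
    """Share of votes where the coach picked the fine-tuned model's feedback."""
    wins = sum(1 for vote, key in zip(votes, keys) if vote == key)
    return wins / len(votes)
```

The key stays hidden from the coach until voting is finished, so the preference reflects the feedback itself rather than expectations about the model.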
The AI scores the parts of the rubric that can be assessed from a recording and a transcript: Substance (what's said) and Structure (how it's organized) in full, plus the portions of Style inferrable from speech patterns. Delivery criteria that require watching the delegate in person — eye contact, hand gestures, body language — are explicitly marked as "not assessable from transcript," leaving that judgment to the coach in the room.
This is a feature, not a limitation. The AI is up front about what it can and can't see, which means coaches always know where their unique observational value applies.
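In data terms, being up front about what the AI can't see means a criterion can carry no score at all instead of a guessed one. A minimal sketch, assuming a per-criterion result type; `CriterionResult`, `render`, and `assessed_average` are hypothetical names, and averaging is just one possible way to summarize the scored criteria.

```python
from dataclasses import dataclass
from typing import Optional

NOT_ASSESSABLE = "not assessable from transcript"


@dataclass
class CriterionResult:
    name: str
    score: Optional[int]  # None when the AI cannot judge this criterion


def render(result: CriterionResult) -> str:
    """Show a score, or the explicit marker when there is none."""
    if result.score is None:
        return f"{result.name}: {NOT_ASSESSABLE}"
    return f"{result.name}: {result.score}"


def assessed_average(results) -> Optional[float]:
    """Average only the criteria the AI could actually score."""
    scores = [r.score for r in results if r.score is not None]
    return sum(scores) / len(scores) if scores else None
```

Using `None` rather than zero keeps an unobserved criterion such as eye contact from dragging down a delegate's summary.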
Version 1 is a starting point, not a finished product. The system is designed so that the review step — where a coach adjusts a score or rewrites a suggestion — becomes the raw material for the next release.
When a coach edits a field, the app quietly records two things: what the AI originally said, and what the coach changed it to. Both get saved together, with the speech transcript that prompted them, as a labeled teaching example.
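The saved teaching example could look something like the record below. This is a sketch of the idea, not the app's actual schema: the field names and the JSON-lines format are assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class Correction:
    transcript: str   # the speech that prompted the feedback
    field_name: str   # which score or suggestion was edited (hypothetical key)
    ai_original: str  # what the AI originally said
    coach_edit: str   # what the coach changed it to


def record_correction(
    transcript: str, field_name: str, ai_original: str, coach_edit: str
) -> Optional[str]:
    """Return one JSON line for a real edit, or None if nothing changed."""
    if coach_edit.strip() == ai_original.strip():
        return None  # an untouched field is not a correction
    record = Correction(transcript, field_name, ai_original, coach_edit)
    return json.dumps(asdict(record))
```

Each returned line can be appended to a local file, which keeps the training set on the same device as the speeches.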
When enough of these teaching examples have accumulated — typically several hundred across a summer of use — we use them to train the next version of the adapter. The model literally learns from our coaches what good feedback looks like.
Every improvement is anchored to a human judgment, not a self-reinforcing loop. Before a new version is promoted, it has to clear the same held-out exam and side-by-side review as the current version. If version 2 under-performs version 1 on any metric, we don't ship it — the previous version stays in place.
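The "any metric" rule can be written as a small gate. The metric names below are assumptions chosen for illustration; the rule itself, that a candidate which is worse on any metric is not promoted, comes from the paragraph above.

```python
# Hypothetical metric names; True means a higher value is better.
HIGHER_IS_BETTER = {
    "completeness": True,
    "blind_win_rate": True,
    "mean_abs_error": False,
}


def should_promote(candidate: dict, current: dict) -> bool:
    """Promote only if the candidate is at least as good on every metric."""
    for metric, higher_is_better in HIGHER_IS_BETTER.items():
        new, old = candidate[metric], current[metric]
        worse = new < old if higher_is_better else new > old
        if worse:
            return False  # one regression is enough to keep the current version
    return True
```

Because the gate only returns a yes or no, rolling back amounts to leaving the current adapter file in place.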
And because the entire system runs locally, our coach corrections never leave Best Delegate. We're training our model on our data.
Every architectural choice was made with one audience in mind: young people. A Model UN program is a trusted environment, and the AI tools we put in front of delegates have to meet that standard.
All transcription, evaluation, and storage happen on the staff laptop the student's speech was uploaded to. No cloud API calls, no third-party servers, no data broker anywhere in the chain. The model is literally on the device.
The model is trained exclusively on the Best Delegate coaching voice — warm, specific, and confidence-building. It never ranks students against peers, never makes claims about college admission, and never uses discouraging or fear-based language.
Every AI evaluation can be edited by a coach before it becomes part of the student's feedback record. And the AI itself can be rolled back to any previous version if a future training run ever regresses on quality — versioning is built in by default.
The math is straightforward. 1,500 delegates practising roughly a dozen speeches each across the summer is 18,000 evaluations. At the old rate of 8–10 minutes per evaluation, that's 2,400 to 3,000 coaching hours spent on writing feedback. The AI coach compresses that to under a minute of coach attention per evaluation, and only when the coach decides to step in.
Students go from waiting two days for feedback on a practice speech to waiting under a minute. They can rehearse the same speech three times in an evening, each time with rubric-aligned, coach-voiced suggestions between attempts, and show up to their mentor session already further along than a delegate in previous years was after a full week.
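The arithmetic can be checked directly. The only figure added here is the one-minute ceiling, which assumes a coach spends a full minute on every single evaluation; the real total should be lower, since coaches step in only when they choose to.

```python
delegates = 1_500
speeches_per_delegate = 12  # "roughly a dozen"
evaluations = delegates * speeches_per_delegate


def coach_hours(minutes_per_evaluation: float) -> float:
    """Total coach hours across all evaluations at a given per-speech time."""
    return evaluations * minutes_per_evaluation / 60


old_low, old_high = coach_hours(8), coach_hours(10)
review_ceiling = coach_hours(1)  # assumed upper bound: one minute on every speech

print(evaluations)        # 18000
print(old_low, old_high)  # 2400.0 3000.0
print(review_ceiling)     # 300.0
```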
Coaches are no longer the bottleneck for basic rubric feedback. Their time — the irreplaceable part of Best Delegate's program — gets redirected to the work only a human mentor can do: running dynamic committee sessions, catching subtle tonal issues a model can't hear, asking the questions that push a delegate out of their comfort zone, and celebrating breakthroughs the moment they happen.
Best Delegate can grow without diluting its core promise. Every additional delegate still gets personalized, rubric-aligned feedback on every speech — at a marginal cost approaching zero and without needing to hire in direct proportion to enrollment. The program's distinctive voice and standards are preserved in software rather than re-explained to new staff every summer.
The economics of this approach are intentionally different from a subscription to a commercial AI service. The up-front engineering was focused and finite; ongoing costs are dominated by staff laptops we already buy.
| Cost category | This approach | Cloud AI subscription alternative |
|---|---|---|
| Per-evaluation cost | Zero (runs on existing hardware) | Several cents to several dollars, scaling with enrollment |
| Summer 2026 projection | No additional per-use cost beyond staff laptops | ~$X thousand for 18,000 evaluations (exact figure depends on model tier and prompt length) |
| Data retention | Fully local; Best Delegate controls every record | Transcripts pass through a third-party provider |
| Voice consistency | Baked into the model, stable across all evaluations | Depends on prompt engineering; can drift subtly over time |
| Scaling cost | Flat — adding 500 students costs nothing extra in AI usage | Linear with enrollment |
| Long-term asset | Improves every summer from our own staff's corrections | Shared with every other customer of that provider |
Figures are illustrative estimates based on 18,000 evaluations at historical coach review rates versus projected review-only rates.
Every staff laptop gets the app and the validated Version 1 model. Staff begin reviewing every student speech, and coach corrections accumulate — ready to train the next version at summer's end.
With a summer of coach corrections to learn from, we train Version 2. It's validated against the same exam and the same blind side-by-side review as Version 1. If it wins on both, it ships for winter programs.
Expand the AI's coaching vocabulary to cover moderated-caucus speeches, crisis communiqués, and unmoderated debate structure. Each is an additional curriculum module, not a new model.
Move the inference layer to a hosted service so delegates can practise from home between camps, with the same privacy posture: opt-in retention, encrypted storage, no training on student data without consent.
Add a second output surface: a warm, reassuring summary for parents that highlights growth and next steps, alongside the detailed coach/student view. The same evaluation, reformatted for the right reader.