Model Transformation — how fine-tuning shaped the AI coach
See exactly how the Phi-3.5 language model was adjusted to become a Best Delegate MUN speech coach. This page is designed to be readable without a machine learning background — every chart has a plain-English explanation you can expand.
Think of the base Phi-3.5 model as a brain that already knows how to read, write, and follow instructions in general. Fine-tuning teaches it a new specialty — in this case, evaluating Model UN speeches like a Best Delegate coach.
Instead of rewriting the whole brain, LoRA (the technique we used) leaves the original model untouched and attaches a small set of adjustments — like sticky notes added to a textbook. This page visualizes those adjustments:
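The "sticky notes" idea can be sketched in a few lines of numpy. This is a toy illustration with made-up sizes (the real hidden size in Phi-3.5 is 3072), not the actual training code:

```python
import numpy as np

rng = np.random.default_rng(0)

d = 64   # hypothetical hidden size for illustration (Phi-3.5 uses 3072)
r = 8    # the LoRA rank used in this project

W = rng.normal(size=(d, d))   # frozen base weight: never modified
A = rng.normal(size=(r, d))   # small trainable matrix (r x d)
B = rng.normal(size=(d, r))   # small trainable matrix (d x r)

delta = B @ A                 # the "sticky notes": a matrix of rank at most r
print(np.linalg.matrix_rank(delta))   # -> 8

# LoRA trains 2*d*r numbers instead of d*d
print(2 * d * r, "vs", d * d)         # -> 1024 vs 4096
```

At inference time the model behaves as if its weight were `W + delta`, but only the two small matrices were ever trained, which is why the adjustments are cheap to store and easy to visualize.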
Every fine-tuning project starts with a base model — a pre-trained language model that has already learned general language, reasoning, and instruction-following from the internet and other massive text sources. For Best Delegate's speech coach, we selected Phi-3.5-mini-instruct, a model released by Microsoft Research in August 2024. It has about 3.8 billion parameters, was trained on roughly 3.4 trillion tokens of high-quality filtered data, and supports a 128,000-token context window — far more than any student speech would require. Crucially, Microsoft released it under the MIT open-source license, meaning we can download the weights, fine-tune them, and use them commercially without any licensing fees or API calls.
Choosing a base model is a trade-off between capability, size, licensing, and hardware fit. Here's how Phi-3.5-mini-instruct compared to the main alternatives we considered:
| Model | Params | License | Runs on M5 32GB? | Best at |
|---|---|---|---|---|
| Phi-3.5-mini-instruct (chosen) | 3.8B | MIT (fully open) | Yes — comfortably | Reasoning, instruction-following, small footprint |
| Llama-3.1-8B-Instruct | 8B | Llama 3.1 Community (restrictions) | Yes but tight during training | General purpose, large ecosystem |
| Mistral-7B-Instruct-v0.3 | 7B | Apache 2.0 | Yes but tight during training | General purpose, European-friendly |
| Gemma-2-2B-Instruct | 2B | Gemma Terms of Use | Yes — very comfortably | Speed, lowest compute |
| Claude / GPT-4 (via API) | Hundreds of billions | Proprietary (API only) | No — cloud only | Highest quality, but cloud-dependent |
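The "Runs on 32GB?" column comes down to simple arithmetic on parameter counts. Here is an illustrative back-of-the-envelope check for the 3.8B-parameter Phi-3.5-mini (rough weight-only estimates, ignoring activations and optimizer state):

```python
# Rough memory footprint: can a 3.8B-parameter model fit in 32 GB of RAM?
PARAMS = 3.8e9  # Phi-3.5-mini-instruct parameter count

def footprint_gb(bytes_per_param: float) -> float:
    """Weight storage only, in gibibytes."""
    return PARAMS * bytes_per_param / 1024**3

print(f"fp16 weights:  {footprint_gb(2):.1f} GB")    # -> 7.1 GB
print(f"4-bit weights: {footprint_gb(0.5):.1f} GB")  # -> 1.8 GB
```

An 8B model roughly doubles these numbers, which is why the larger alternatives fit but get "tight during training" once gradients and optimizer state are added on top.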
Why Phi-3.5 won for this project:

- MIT license: fully open weights, no usage restrictions, no licensing fees.
- Comfortable hardware fit: 3.8 billion parameters leave ample headroom on 32 GB, even during training.
- Strong reasoning and instruction-following for its size.
- Fully local: no API calls, no per-request cost, and student speeches never leave the machine.
The trade-off we accepted: Phi-3.5-mini is not the highest-quality model available. Cloud-hosted GPT-4 or Claude would produce more polished feedback. But local deployment, zero per-request cost, privacy guarantees, and the ability to fine-tune for the Best Delegate voice outweighed that gap.
The high-level numbers that describe how much the model was adjusted.
How much did each layer change? Each pair of bars represents one layer of the model. Taller bars mean that layer absorbed more learning during fine-tuning.
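A common way to score "how much a layer changed" is the Frobenius norm of its LoRA delta (B·A). A toy sketch with random stand-in adapters (the real model has 32 layers and larger matrices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_layers = 64, 8, 4   # toy sizes for illustration

# Hypothetical per-layer LoRA factor pairs (stand-ins for trained adapters)
adapters = [(rng.normal(size=(d, r)), rng.normal(size=(r, d)))
            for _ in range(n_layers)]

for i, (B, A) in enumerate(adapters):
    delta = B @ A
    # One bar in the chart: the total size of this layer's adjustment
    print(f"layer {i}: ||delta||_F = {np.linalg.norm(delta):.1f}")
```

Whether the page uses exactly this norm or a close variant, the idea is the same: collapse each layer's adjustment matrix into a single "how much changed here" number.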
Where inside each layer did the learning go? The attention mechanism has four components: Query, Key, Value, and Output. This chart breaks down each layer's change into those four parts, stacked.
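In Phi-3.5 the Query, Key, and Value projections are fused into a single `qkv_proj` matrix, so breaking a layer's change into Q/K/V components amounts to slicing that matrix's rows. A sketch with made-up sizes; the equal-thirds split is an assumption about the fused layout:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 48                                   # toy hidden size, divisible by 3
delta_qkv = rng.normal(size=(3 * d, d))  # stand-in for a qkv_proj LoRA delta

# Assumed layout: rows 0..d-1 are Q, d..2d-1 are K, 2d..3d-1 are V
parts = {name: delta_qkv[i * d:(i + 1) * d]
         for i, name in enumerate(["Q", "K", "V"])}

for name, block in parts.items():
    # One stacked segment in the chart: this component's share of the change
    print(name, round(float(np.linalg.norm(block)), 1))
```

The Output (O) component lives in a separate `o_proj` matrix, so no slicing is needed there; its norm is measured directly.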
Zoom into a single layer to see the structure of its changes. Pick a layer from the dropdown; the heatmap shows where inside that layer the adjustments happened, and the bar chart shows how concentrated vs. spread-out those adjustments are.
The heatmap: Red = increased connection, Blue = decreased, White = unchanged. Stripes or blocks are 'hotspots' the model focused on.
Singular values: how much of LoRA's rank-8 capacity was used. Bars of similar height = full capacity utilized. A stair-step decline = the typical healthy pattern. One dominant bar = the rank-8 budget was more than the layer needed, because its change is effectively rank 1.
Effective rank: a single-number summary of how many of the eight available directions carried meaningful change.
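Both statistics come from a singular value decomposition of each layer's delta. A toy sketch, using the entropy-based definition of effective rank as one plausible choice (the page's exact formula may differ):

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.normal(size=(64, 8))
A = rng.normal(size=(8, 64))
delta = B @ A                 # rank-8 by construction

s = np.linalg.svd(delta, compute_uv=False)  # singular values, descending
s = s[:8]                     # only the first r=8 can be nonzero here

# Entropy-based effective rank: exp of the spectrum's Shannon entropy.
# Equal singular values -> 8.0; one dominant value -> close to 1.0.
p = s / s.sum()
eff_rank = float(np.exp(-(p * np.log(p)).sum()))
print(len(s), round(eff_rank, 2))
```

Plotting `s` directly gives the singular-value bar chart; `eff_rank` is the headline number next to it.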
Are the changes small or dramatic? This chart combines every single adjustment made across the entire model into one histogram.
Healthy shape: tall narrow peak centered on zero with thin tails — most connections barely changed; real learning happened at the edges.
Warning signs: peak shifted far from zero (drift), or very wide with no clear peak (over-aggressive training).
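The histogram and its health checks are easy to reproduce. Here is a sketch using synthetic zero-centered deltas as a stand-in for the model's real adjustments:

```python
import numpy as np

rng = np.random.default_rng(4)
# Stand-in for "every adjustment in the model": zero-centered, thin tails
all_deltas = rng.normal(loc=0.0, scale=0.01, size=100_000)

counts, edges = np.histogram(all_deltas, bins=61, range=(-0.05, 0.05))
peak_bin = counts.argmax()
peak_center = (edges[peak_bin] + edges[peak_bin + 1]) / 2

# Healthy: peak near zero, small mean (no drift)
print(f"peak near {peak_center:.4f}")
print(f"mean shift {all_deltas.mean():.5f}")
```

A peak center far from zero would be the "drift" warning sign; a flat, wide `counts` array with no clear argmax would signal over-aggressive training.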
Which specific 'thinking units' changed the most? These are the neurons that became specialists in MUN speech evaluation.
Layer, Component, and Neuron Index identify the specific neuron; Shift Magnitude shows how much its role changed (higher = more specialization).
| Layer | Component | Neuron Index | Shift Magnitude |
|---|---|---|---|
| 28 | o_proj | 2191 | 2.0030 |
| 26 | o_proj | 1189 | 1.8847 |
| 28 | o_proj | 481 | 1.8833 |
| 25 | o_proj | 337 | 1.8459 |
| 26 | o_proj | 2091 | 1.8394 |
| 28 | o_proj | 2671 | 1.8336 |
| 28 | o_proj | 2439 | 1.8110 |
| 27 | o_proj | 517 | 1.7964 |
| 26 | o_proj | 2603 | 1.7871 |
| 28 | o_proj | 2850 | 1.7660 |
| 26 | o_proj | 420 | 1.7632 |
| 27 | o_proj | 148 | 1.7564 |
| 29 | qkv_proj | 102 | 1.7265 |
| 27 | o_proj | 2398 | 1.7234 |
| 24 | o_proj | 2110 | 1.7226 |
| 26 | o_proj | 491 | 1.7163 |
| 27 | o_proj | 2867 | 1.6983 |
| 25 | o_proj | 1919 | 1.6956 |
| 25 | o_proj | 2079 | 1.6945 |
| 30 | o_proj | 1404 | 1.6921 |
| 25 | o_proj | 1857 | 1.6876 |
| 31 | o_proj | 1223 | 1.6770 |
| 27 | o_proj | 2444 | 1.6679 |
| 27 | qkv_proj | 3842 | 1.6383 |
| 30 | o_proj | 723 | 1.6341 |
| 29 | o_proj | 1858 | 1.6240 |
| 25 | o_proj | 605 | 1.5841 |
| 29 | o_proj | 2412 | 1.5680 |
| 30 | o_proj | 654 | 1.5525 |
| 22 | o_proj | 1767 | 1.5350 |
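A per-neuron shift magnitude like the one tabulated above can be computed as the L2 norm of each output row of a layer's delta matrix; whether the page uses exactly this metric is an assumption. A sketch with a random stand-in `o_proj` delta at the model's real width:

```python
import numpy as np

rng = np.random.default_rng(5)
# Stand-in for one layer's o_proj delta (Phi-3.5 hidden size is 3072)
delta = rng.normal(scale=0.01, size=(3072, 3072))

shift = np.linalg.norm(delta, axis=1)  # one magnitude per output neuron
top = np.argsort(shift)[::-1][:5]      # five most-changed neuron indices

for idx in top:
    print(f"neuron {idx}: shift {shift[idx]:.4f}")
```

Running this across every layer and component, then sorting globally, yields a leaderboard like the table above, where the cluster of `o_proj` rows in layers 24–31 shows the specialization concentrated in the model's later output projections.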
Quick reference for the technical terms used above.