Best Delegate Speech Evaluator

Model Transformation — how fine-tuning shaped the AI coach

Model Transformation Visualization

See exactly how the Phi-3.5 language model was adjusted to become a Best Delegate MUN speech coach. This page is designed to be readable without a machine learning background — every chart has a plain-English explanation you can expand.

What is this page showing me?

Think of the base Phi-3.5 model as a brain that already knows how to read, write, and follow instructions in general. Fine-tuning teaches it a new specialty — in this case, evaluating Model UN speeches like a Best Delegate coach.

Instead of rewriting the whole brain, LoRA (the technique we used) leaves the original model untouched and attaches a small set of adjustments — like sticky notes added to a textbook. This page visualizes those adjustments:

  • Which parts of the brain changed the most?
  • How big were those changes?
  • Did the fine-tuning use its full capacity, or just a little?
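The sticky-note mechanism can be sketched in a few lines of NumPy. The sizes below are toy values for illustration, not the real Phi-3.5 dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; the real Phi-3.5 matrices are 3072-dimensional.
d_in, d_out, r = 6, 4, 2               # rank r << d: the small "sticky note"
alpha = 4                              # LoRA scaling hyperparameter

W = rng.normal(size=(d_out, d_in))     # frozen base weight: never modified
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, r))               # trainable, zero-initialized: no effect at start

def forward(x):
    # Base path plus the scaled low-rank correction (alpha / r) * B @ A @ x.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# With B still zero the adapter adds nothing: output equals the base model's.
assert np.allclose(forward(x), W @ x)
```

During training only A and B receive gradients, which is why the adapter file only needs to store these two small matrices per trained layer.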

About the Base Model: Phi-3.5-mini-instruct

Every fine-tuning project starts with a base model — a pre-trained language model that has already learned general language, reasoning, and instruction-following from the internet and other massive text sources. For Best Delegate's speech coach, we selected Phi-3.5-mini-instruct, a model released by Microsoft Research in August 2024. It has about 3.8 billion parameters, was trained on roughly 3.4 trillion tokens of high-quality filtered data, and supports a 128,000-token context window — far more than any student speech would require. Crucially, Microsoft released it under the MIT open-source license, meaning we can download the weights, fine-tune them, and use them commercially without any licensing fees or API calls.

Why Phi-3.5-mini-instruct over other models?

Choosing a base model is a trade-off between capability, size, licensing, and hardware fit. Here's how Phi-3.5-mini-instruct compared to the main alternatives we considered:

Model | Params | License | Runs on M5 32GB? | Best at
Phi-3.5-mini-instruct (chosen) | 3.8B | MIT (fully open) | Yes — comfortably | Reasoning, instruction-following, small footprint
Llama-3.1-8B-Instruct | 8B | Llama 3.1 Community (restrictions) | Yes, but tight during training | General purpose, large ecosystem
Mistral-7B-Instruct-v0.3 | 7B | Apache 2.0 | Yes, but tight during training | General purpose, European-friendly
Gemma-2-2B-Instruct | 2B | Gemma Terms of Use | Yes — very comfortably | Speed, lowest compute
Claude / GPT-4 (via API) | Hundreds of billions | Proprietary (API only) | No — cloud only | Highest quality, but cloud-dependent

Why Phi-3.5 won for this project:

  1. Strong reasoning at a small size. Specifically designed for reasoning and instruction-following; benchmarks show it performing close to models 2–3× its size.
  2. Fits on a MacBook Air M5 with room to spare. The 4-bit quantized version is only ~2 GB on disk and uses 3–4 GB of RAM for inference, leaving headroom for LoRA training, Whisper, and the Streamlit app.
  3. MIT license — no strings attached. Unrestricted commercial use and redistribution.
  4. Privacy by design. Runs locally — student speech transcripts never leave the camp's laptop.
  5. Clean chat template that integrates smoothly with MLX-LM's LoRA trainer.

The trade-off we accepted: Phi-3.5-mini is not the highest-quality model available. Cloud-hosted GPT-4 or Claude would produce more polished feedback. But local deployment, zero per-request cost, privacy guarantees, and the ability to fine-tune for the Best Delegate voice outweighed that gap.

Transformation Summary

The high-level numbers that describe how much the model was adjusted.

  • LoRA Trainable Params: 2,359,296 (0.062% of base)
  • Layers Modified: 16 of 32 (layers 16–31)
  • Mean ΔW Magnitude: 40.56 (max 53.69)
  • LoRA Config: rank 8, α 16, scale 10.0
Overall training footprint
The fine-tuning was extremely efficient: it adjusted only 2,359,296 parameters — just 0.062% of the base model's roughly 3.8 billion. That tiny footprint is why training fit on a MacBook Air M5 and why the adapter file is just a few megabytes rather than gigabytes. It also means the model's original general knowledge (language fluency, world facts, reasoning) is entirely preserved — we only added a small specialization layer on top.

We trained 16 of 32 layers (layers 16–31), deliberately focusing on the 'specialist' second half of the model where task-specific learning lives. The average magnitude of change across these layers was 40.6 (with a maximum of 53.7), and the changes were fairly evenly distributed across the trained layers. This pattern suggests the fine-tune was well-behaved: substantial enough that real learning happened, but balanced enough that no single layer was forced to do all the work.
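The 2,359,296 figure can be reproduced by hand. Assuming Phi-3.5-mini's published shapes (hidden size 3072, with query, key, and value fused into one 9216-wide qkv_proj), a rank-8 adapter on qkv_proj and o_proj across 16 layers gives exactly that count:

```python
# Assumed Phi-3.5-mini shapes: hidden size 3072, fused QKV output 3 * 3072.
hidden = 3072
qkv_out = 3 * hidden
rank = 8
layers_trained = 16            # layers 16-31

# A LoRA adapter on a (d_out x d_in) matrix adds rank * (d_in + d_out)
# parameters: A is (rank x d_in) and B is (d_out x rank).
qkv_params = rank * (hidden + qkv_out)   # 8 * (3072 + 9216) = 98,304
o_params = rank * (hidden + hidden)      # 8 * (3072 + 3072) = 49,152

total = layers_trained * (qkv_params + o_params)
print(total)                                  # 2359296
print(f"{total / 3.8e9:.3%} of 3.8B params")  # 0.062% of 3.8B params
print(f"~{total * 2 / 1e6:.1f} MB at fp16")   # ~4.7 MB at fp16
```

The last line also explains the "few megabytes" claim: at two bytes per 16-bit parameter, the adapter weighs in under 5 MB.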

Layer-Level Transformation Magnitude

How much did each layer change? Each pair of bars represents one layer of the model. Taller bars mean that layer absorbed more learning during fine-tuning.

How to read this chart
  • X-axis (Layer): The layer number inside the model. We only trained layers 16–31 (the 'specialist' half).
  • Y-axis (Frobenius Norm): A single-number measure of 'how big the change was' — like the total distance the layer's weights moved.
  • Navy bars (qkv_proj): The part of each layer that decides what to pay attention to.
  • Gold bars (o_proj): The part that combines attention results into the output.
What this chart is telling us
The bars reveal where inside the model the 'learning' actually landed. The most-changed layer is layer 29 on the attention side (53.7), while the least-changed is layer 16 (39.1) — a 1.4× range. That's a reasonable spread: enough variation to show each layer found its own role, but not so extreme that one layer is doing all the work.

The qkv_proj bars (navy) are noticeably taller than the o_proj bars (gold) across nearly every layer (average 47.7 vs 33.4). This means the model put more effort into deciding what to pay attention to than into combining the attended information. For a task like MUN speech evaluation — where the model needs to zero in on hooks, policy statements, and calls to action scattered across a transcript — this pattern makes intuitive sense. Looking at the progression, later layers show larger changes than earlier ones — consistent with the deepest layers specializing most for the new task.
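The Frobenius norm on the Y-axis is nothing exotic: square every entry of a layer's ΔW, sum them, and take the square root. A minimal NumPy illustration on a toy matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
delta_W = rng.normal(scale=0.01, size=(64, 64))   # toy weight-change matrix

# Frobenius norm: sqrt of the sum of every squared entry, a single number
# summarizing how far the whole matrix moved.
fro = np.sqrt((delta_W ** 2).sum())
assert np.isclose(fro, np.linalg.norm(delta_W, "fro"))
print(round(fro, 4))
```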

Learning Distribution — Q / K / V / O

Where inside each layer did the learning go? The attention mechanism has four components: Query, Key, Value, and Output. This chart breaks down each layer's change into those four parts, stacked.

What are Q, K, V, and O?
  • Q (Query): "What am I looking for?" The current word's question about its context.
  • K (Key): "What do I have to offer?" Each other word's advertisement.
  • V (Value): "What information do I contribute?" The actual content.
  • O (Output): "How do I combine everything?" Mixes the attended values.
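Those four roles correspond to four weight matrices. A bare-bones single-head attention sketch (toy sizes, without the multi-head plumbing of the real model) shows where each one acts:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 5                        # toy embedding size and sequence length
X = rng.normal(size=(n, d))        # one embedding per token

Wq, Wk, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(4))

Q = X @ Wq   # "What am I looking for?"
K = X @ Wk   # "What do I have to offer?"
V = X @ Wv   # "What information do I contribute?"

scores = Q @ K.T / np.sqrt(d)                     # relevance of every token pair
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1

attended = weights @ V    # blend the values by relevance
output = attended @ Wo    # "How do I combine everything?"
assert output.shape == (n, d)
```

In Phi-3.5, Wq, Wk, and Wv are fused into the single qkv_proj matrix, and Wo is o_proj; those are exactly the two matrices the LoRA adapter attaches to.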
How the model re-specialized its attention
Stacking the four attention components shows where inside each layer the learning went. On average across layers, Output (O) absorbed the most change (33.4, or about 29% of the per-layer total), followed by Query (Q) (28.2). Value (V) moved the least (26.6, ~23%).

This is a classic style/format learning pattern. When a fine-tune concentrates its changes in the Output projection, with Query close behind, the model is learning how to express information in a new way: in our case, how to produce structured rubric scores and warm, coach-like feedback in the Best Delegate house voice.

Per-Layer Detail

Zoom into a single layer to see the structure of its changes. Pick a layer from the dropdown; the heatmap shows where inside that layer the adjustments happened, and the bar chart shows how concentrated vs. spread-out those adjustments are.

How to read the heatmap and singular value chart

The heatmap: Red = increased connection, Blue = decreased, White = unchanged. Stripes or blocks are 'hotspots' the model focused on.

Singular values: how much of LoRA's rank-8 capacity was used. Similar-sized bars = full capacity utilized. A stair-step decline = the typical healthy pattern. One dominant bar = the change is effectively rank-1, so most of the allotted capacity went unused.

ΔW Heatmap (downsampled)

Singular Value Spectrum

Effective rank: how much of LoRA's r=8 capacity was used.
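Why at most eight bars? LoRA's ΔW is the product B·A of two rank-8 factors, so its singular value spectrum can have at most eight non-zero entries. A toy check follows, including one common effective-rank measure (exponential of the spectrum's entropy); this formula is an illustrative assumption, not necessarily the one the page uses:

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r = 32, 48, 8
B = rng.normal(size=(d_out, r))
A = rng.normal(size=(r, d_in))
delta_W = B @ A                    # LoRA's weight change: rank at most r

s = np.linalg.svd(delta_W, compute_uv=False)
assert np.all(s[r:] < 1e-8)        # beyond the first r, only numerical noise

# One common effective-rank measure: exponential of the spectrum's entropy.
p = s[:r] / s[:r].sum()
effective_rank = float(np.exp(-(p * np.log(p)).sum()))
print(round(effective_rank, 2))    # near r when all directions pull equally
```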

What this layer's profile means

ΔW Value Distribution

Are the changes small or dramatic? This chart combines every single adjustment made across the entire model into one histogram.

How to read this histogram

Healthy shape: tall narrow peak centered on zero with thin tails — most connections barely changed; real learning happened at the edges.

Warning signs: peak shifted far from zero (drift), or very wide with no clear peak (over-aggressive training).

  • Mean: 0.00001
  • Std: 0.01008
  • Min: -0.08332
  • Max: 0.08156
What the shape of this distribution tells us
The histogram pools every single adjustment made anywhere in the model. The shape is a tall, narrow peak centered on zero with two thin tails — exactly the signature of a well-behaved LoRA fine-tune. Roughly 74% of all adjustments fall within one standard deviation of zero, meaning the vast majority of connections barely moved; the real learning happened in the small number of values at the edges.

Specifically: the average adjustment is essentially zero (+0.00001), meaning increases and decreases balanced out. There's no overall drift — exactly what we want from a healthy fine-tune. And the most extreme decrease (-0.0833) and most extreme increase (+0.0816) are roughly symmetric. Taken together, this confirms LoRA's design intent: a surgical change to the model, not a sweeping rewrite.
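These checks are straightforward to compute once all the adjustments are pooled into one array. A sketch with synthetic stand-in values (zero-centered with heavier-than-Gaussian tails, roughly mimicking the real histogram; the actual numbers above come from the adapter itself):

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic stand-in for the pooled Delta-W values.
deltas = rng.standard_t(df=5, size=1_000_000) * 0.008

mean, std = deltas.mean(), deltas.std()
within_one_std = np.mean(np.abs(deltas - mean) <= std)

print(f"mean {mean:+.5f}  std {std:.5f}")                   # drift check: mean ~ 0
print(f"min {deltas.min():+.4f}  max {deltas.max():+.4f}")  # symmetry check
# Heavy tails concentrate extra mass near zero, pushing this fraction above
# the Gaussian 68% -- the "tall narrow peak" shape described above.
print(f"{within_one_std:.0%} inside one std")
```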

Top Shifted Neurons

Which specific 'thinking units' changed the most? These are the neurons that became specialists in MUN speech evaluation.

How to read this table

Layer / Component / Neuron Index: identify the specific neuron. Shift Magnitude: how much its role changed. Higher = more specialization.

Layer | Component | Neuron Index | Shift Magnitude
28 | o_proj | 2191 | 2.0030
26 | o_proj | 1189 | 1.8847
28 | o_proj | 481 | 1.8833
25 | o_proj | 337 | 1.8459
26 | o_proj | 2091 | 1.8394
28 | o_proj | 2671 | 1.8336
28 | o_proj | 2439 | 1.8110
27 | o_proj | 517 | 1.7964
26 | o_proj | 2603 | 1.7871
28 | o_proj | 2850 | 1.7660
26 | o_proj | 420 | 1.7632
27 | o_proj | 148 | 1.7564
29 | qkv_proj | 102 | 1.7265
27 | o_proj | 2398 | 1.7234
24 | o_proj | 2110 | 1.7226
26 | o_proj | 491 | 1.7163
27 | o_proj | 2867 | 1.6983
25 | o_proj | 1919 | 1.6956
25 | o_proj | 2079 | 1.6945
30 | o_proj | 1404 | 1.6921
25 | o_proj | 1857 | 1.6876
31 | o_proj | 1223 | 1.6770
27 | o_proj | 2444 | 1.6679
27 | qkv_proj | 3842 | 1.6383
30 | o_proj | 723 | 1.6341
29 | o_proj | 1858 | 1.6240
25 | o_proj | 605 | 1.5841
29 | o_proj | 2412 | 1.5680
30 | o_proj | 654 | 1.5525
22 | o_proj | 1767 | 1.5350
Where specialization concentrated
Of the millions of connections in the model, only a small number moved substantially. The most transformed neuron moved with a magnitude of 2.00. Layer 28 contributes 5 of the top 10 — a strong focal point for MUN-specific specialization.

All 10 of the top 10 neurons sit in o_proj, the output-combination side. These neurons learned new ways to assemble the final response. If the same (layer, component, neuron) combinations reappear across retraining runs, the model is converging on a stable representation of the Best Delegate coaching task.
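A neuron's shift magnitude can be computed as the norm of its row in ΔW. A sketch of how such a ranking could be produced, assuming per-output-row L2 norms (the planted shift at index 2191 is illustrative, echoing the table's top entry rather than reproducing the real adapter):

```python
import numpy as np

rng = np.random.default_rng(4)
delta_W = rng.normal(scale=0.01, size=(3072, 3072))  # toy o_proj-sized change
delta_W[2191] *= 20     # plant an artificially large shift at row 2191

# Per-neuron shift magnitude: L2 norm of each output row of Delta-W.
shifts = np.linalg.norm(delta_W, axis=1)

top = np.argsort(shifts)[::-1][:5]   # the five most-shifted neuron indices
print(top[0])                        # 2191
assert top[0] == 2191
```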

Glossary

Quick reference for the technical terms used above.

Adapter — A small add-on file that modifies how a large language model behaves, without changing the original model itself. Like a lens you attach to a camera.
Alpha (α) — A LoRA training setting that controls how strongly the adapter's adjustments are applied. Higher alpha = stronger effect.
Attention — The mechanism a language model uses to decide which parts of the input matter most when generating each word.
Base model — The original, untouched Phi-3.5-mini-instruct model — the starting point.
Delta W (ΔW) — 'Change in weights.' The difference between the fine-tuned model's adjustments and the base model. This is what LoRA actually learns.
Fine-tuning — Teaching an already-trained model a specialized skill using a smaller, task-specific dataset.
Frobenius norm — A single number that summarizes the total size of a matrix of adjustments.
Heatmap — A colored grid where each square's color represents a number. Used to show which specific connections changed the most.
Layer — One of 32 stacked processing stages inside the model. Each layer transforms the input a little more.
LoRA (Low-Rank Adaptation) — A fine-tuning technique that attaches small 'sticky note' adjustments instead of rewriting the whole model.
Neuron — A single computational unit inside a layer.
o_proj (Output projection) — The part of each layer that combines attention results into the final output.
Parameter — One adjustable number inside the model. Phi-3.5 has ~3.8 billion; our LoRA adjusts about 2.4 million.
qkv_proj (Query-Key-Value projection) — The part of each layer that prepares the three ingredients of attention.
Rank — How many independent 'directions' of change LoRA can learn per layer. We used rank 8.
Scale — A multiplier applied to LoRA's adjustments during training.
Singular value — A measure of how much a particular 'direction' of change contributed to the overall adjustment.
Weight — A number inside the model that gets multiplied with inputs to produce outputs.