97 hours on one RTX 4090: MoE with plug-in experts, self-distillation, and why perplexity is a bad metric
It all started with a simple idea: what if we could attach new "skills" to a language model the way apps are installed on a smartphone - without retraining, without degradation, in half an hour? I spent 22 rounds…