97 hours on one RTX 4090: MoE with plug-in experts, self-distillation, and why perplexity is a bad metric
It all started with a simple idea: what if we could attach new "skills" to a language model the way apps are installed on a smartphone - without retraining, without degradation, in half an hour? I spent 22 rounds…