The coverage principle: How pre-training enables post-training
New preprint where we look at the mechanisms through which next-token prediction produces models that succeed at downstream tasks.
The answer involves a metric we call the "coverage profile", not cross-entropy.