METR/Epoch's MirrorCode benchmark shows frontier models completing some week-long coding tasks
On April 10, METR and Epoch released MirrorCode, a long-horizon software engineering benchmark built from real-world tasks whose human completion times span hours to weeks. Their writeup reports that current frontier models already solve a non-trivial share of tasks in the multi-day to week-long regime, extending the task-horizon doubling trend METR has been tracking since 2024.
This is the cleanest methodological update to the horizon curve we've seen this quarter, and it bears directly on trajectory estimates for when models reach a "superhuman coder" threshold. If the doubling time holds, the result pulls that threshold somewhat earlier than the March 2027 scenario date; if the trend is decelerating within MirrorCode's task range, it pushes the threshold later.
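
To make the doubling-time reasoning concrete, here is a minimal sketch of the extrapolation, assuming a steady exponential trend. Every number in it is a hypothetical placeholder chosen for round arithmetic, not a figure from the MirrorCode writeup, and the function name is ours.

```python
import math


def months_to_threshold(current_horizon_hours, target_horizon_hours, doubling_time_months):
    """Months until the task horizon reaches the target, assuming it
    keeps doubling at a constant rate (a simplifying assumption)."""
    if min(current_horizon_hours, target_horizon_hours, doubling_time_months) <= 0:
        raise ValueError("all inputs must be positive")
    # Number of doublings needed to grow from the current to the target horizon.
    doublings_needed = math.log2(target_horizon_hours / current_horizon_hours)
    # A target at or below the current horizon is already reached.
    return max(0.0, doublings_needed * doubling_time_months)


# Hypothetical placeholders: a one-work-week horizon today and a
# threshold 16x longer, i.e. four doublings away.
current_hours = 40.0
target_hours = 640.0

for doubling_months in (4.0, 6.0, 8.0):
    months = months_to_threshold(current_hours, target_hours, doubling_months)
    print(f"doubling every {doubling_months:.0f} months -> threshold in {months:.0f} months")
```

The point of the sketch is the sensitivity: with the number of doublings fixed, the time to threshold scales linearly with the doubling time, so a modest slowdown in the trend moves the projected date by the same proportion.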