My beef is with the METR graph, which isn't mentioned once in the paper, and neither are they. It's a sensationalist Berkeley nonprofit writing disingenuous, misleading tests to show the outcome they want, working backwards from "AI in sci-fi is scary, so we should stop," just like they work backward from "cows fart too much, so you shouldn't be allowed to have a hamburger."
If you do click one of the dots, it does list times way past the 16 hours shown, but the methodology of the test itself and the chart are both misleading, set up to show scary exponential growth. Nothing changed with how the models themselves are fundamentally built; it's the stuff around the model that is improving: the harness & MoE. You can stick an earlier model in a harness and let it run for days too, but the chart doesn't show that they cap GPT-5 at 6 hours while previous models get 12-30 minutes. If you run the same Opus they have ranked, but without a harness, its scores will be completely different.
Quote:
|
This doesn’t mean opus 4.6 can work for ~14 hrs, it means on tasks that would take a human expert ~14 hrs, the agent successfully finishes them 50% of the time. Probably completes them way faster actually
|
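For anyone unsure what that quoted definition means in practice, here is a rough sketch of how a "50% time horizon" can be read off a fitted curve. It assumes success probability follows a logistic curve in the log of the human task length; the intercept and slope below are made-up illustration values, not METR's numbers, and the function names are hypothetical.

```python
import math

# Hypothetical fitted parameters (NOT real METR numbers): success probability
# is modeled as a logistic curve in log2 of the human task length in minutes.
INTERCEPT = 4.9
SLOPE = -0.5  # negative: longer human tasks are less likely to be completed


def success_prob(human_minutes: float) -> float:
    """Modeled chance that the agent completes a task of this human length."""
    z = INTERCEPT + SLOPE * math.log2(human_minutes)
    return 1.0 / (1.0 + math.exp(-z))


def time_horizon(p: float = 0.5) -> float:
    """Human task length (minutes) at which modeled success equals p."""
    logit = math.log(p / (1.0 - p))
    return 2.0 ** ((logit - INTERCEPT) / SLOPE)


h50 = time_horizon(0.5)
h80 = time_horizon(0.8)
print(f"50% horizon: {h50 / 60:.1f} human-hours")
print(f"80% horizon: {h80 / 60:.1f} human-hours")
```

Note the horizon is measured in human task time, not in how long the agent itself runs, and the number moves a lot depending on which success threshold you pick.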
They're either hopelessly incompetent or knowingly committing academic fraud for political/funding reasons. Nobody there even works at a frontier lab, I assume; it's just nonprofit advocacy fart-huffing from what I can tell.
https://summify.io/discover/is-ai-ab...-s-not-5GezB1/ one clickbait YouTuber to counter another
And I have no issue with the Google fanfic thought experiment itself, about the timeline of one sci-fi concept progressing into another theoretical sci-fi concept.