  #806  
Today, 02:40 PM
BradZax
Fire Giant


Join Date: Dec 2025
Posts: 767

Quote:
Originally Posted by Ekco
Maybe ask AI if that's a fair comparison. Point 3 is wrong also: you can totally put 5.1 in a harness and it would score higher. Point 1, lol wut, when did I say the harness is where the performance is coming from? This entire output is garbage. I can only imagine what kind of fucked up prompt you put in to get this spit out and be this confused. Try Claude or ChatGPT, not Grok or Google overview.
OK: "read [this discussion] and [this document] and respond in kind."

Quote:
The METR chart isn't misleading at all, and you're fundamentally misinterpreting the entire thesis of the DeepMind paper. You are trying to separate the core model from the software harness, but in the real world, they are the same system.

The paper explicitly states that raw base model scaling is slowing down, which is why the industry has shifted to test-time compute, software scaffolding, and multi-agent systems. The harness and the tools are the new scaling paradigm.

Furthermore, your claim that you can just stick an older model like Opus in a modern harness for days to get the same results is technically wrong. Older models lack the context windows and the architectural stability required to handle long-horizon reasoning tasks. If you run them that long, they suffer from compounding errors, hallucinate, and crash. The ability to effectively utilize extended runtime and scaffolding is a direct capability of the newer model architectures. The chart isn't a trick; it's showing the reality of how AI systems scale now.
Enjoy arguing with AI while you say that AI isn't going to become AGI/ASI anytime soon.

I'll be kicking back playing EverQuest.