Is scale unnecessary for intelligence (<10B param human-competitive STEM model before 2030)?
Closes 2030 · 72% chance

Resolves YES if, before 2030, a neural net with <10B parameters achieves all of: >75% on GPQA, >80% on SWE-bench Verified, and >95% on MATH.

Arbitrary scaffolding is allowed (retrieval over a fixed DB is OK), but no talking with other AIs and no internet access. We'll use whatever tools are available at the time to determine whether such an AI memorized the answers to these datasets; if verbatim memorization obviously happened, the model will be disqualified.
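The description leaves the memorization check open-ended. A minimal sketch of what such a check could look like, assuming a simple long-n-gram overlap test against the benchmark's reference solutions (the n-gram length and whitespace tokenization are illustrative choices, not part of the resolution criteria):

```python
# Hypothetical verbatim-memorization check: flag a model output if it
# reproduces a long n-gram from the benchmark's reference solution.
# The n-gram length (20) and whitespace tokenization are illustrative.

def ngrams(text: str, n: int = 20) -> set[tuple[str, ...]]:
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def looks_memorized(model_output: str, reference_solution: str, n: int = 20) -> bool:
    """True if the output shares any length-n token span with the reference."""
    return bool(ngrams(model_output, n) & ngrams(reference_solution, n))

# Example usage (with whatever answer/solution strings are at hand):
# looks_memorized(model_answer, official_math_solution) -> True would disqualify
```

A real check would also normalize formatting and look for near-verbatim paraphrases, but long shared spans like this are the obvious first test.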

Edit: we'll allow up to 1 minute of wall-clock time per question.



OK, I think the "spend capacity on general reasoning and let tool use replace in-weights knowledge" paradigm has been spelled out enough that I can reveal the text I preregistered after o1 came out: https://x.com/karpathy/status/1938626382248149433

"""Preregistering alpha. Pretraining approximates the behavior of a Solomonoff Inductor across the data distribution. It does so because the model is an ensemble of Turing machines, a latent distribution over the generating processes that compress the pretraining corpus, and attention performs a Bayesian update conditioning on the context window.

Before mid-2024, most training compute was spent on pretraining over the human text distribution, not on RL. o1 taught us that RL can scale to a significant proportion of training compute for current or next-generation models, and it can be inferred that familiarizing the model with tool use in its reasoning chain would benefit immensely from additional compute. Humans don't often cite the outputs of Python in the reasoning traces they put on the Internet, so of course current instantiations of tool use are highly underwhelming. People are still underestimating how fluently models will leverage tools given enough dedicated RL compute. The model and the DBs supporting its tools may even be built concurrently during training. This is one pathway to models that know how to use tools effectively during inference.

The second piece of the analysis pointing to a YES resolution is the success of distillation. All of the cited benchmarks probably represent a large amount of crystallized knowledge, more than it may be realistic to assume a small model could store in its limited capacity. I'd estimate on the order of 1M tokens of high-quality reference text. It would be really surprising to me if everything could be compressed to <2GB: APIs, for instance, are often specified in thousands of lines of documentation and are hard to compress much syntactically.

However, the minuscule size of models like o1-mini (whose inference performance suggests surprisingly few active parameters) does not seem to be reducing the breadth of their knowledge. The inference-time paradigm and long context windows imply that putting technical and reference information in context when distilling reasoners just works: small reasoners distilled on reasoning traces where all the necessary background knowledge is already in context implicitly learn not to store that knowledge in their weights, saving that capacity for general-purpose manipulation of in-context information. The system that decides which stored information is worth including in context can be handled entirely by scaffolding over a vector DB.

There are two ways to reconcile this, and I'd bite the bullet on either: (1) crystallized knowledge does not use much capacity when representations are sufficiently high quality, or (2) pruning passive parameters by distilling reasoners into non-MoE models will work. In either case, this resolves YES."""
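To make the paradigm above concrete, here is a rough sketch of the kind of scaffold it implies: a small reasoner that pulls reference text from a fixed local DB and runs Python inside its reasoning chain before answering. Everything in it, the `small_model` stub, the toy retrieval function, and the RETRIEVE/PYTHON/ANSWER action format, is a hypothetical illustration rather than any existing lab's system, and a real deployment would sandbox the tool execution.

```python
# Illustrative scaffold: a small reasoner alternating between retrieval over a
# fixed DB and Python execution inside its reasoning chain, then answering.

FIXED_DB = {
    "binomial theorem": "(x + y)^n = sum_{k=0}^{n} C(n, k) x^(n-k) y^k",
    "gpqa format": "GPQA questions are multiple choice with four options.",
}

def retrieve(query: str) -> str:
    """Toy retrieval: return the DB entry whose key shares the most words with the query."""
    best = max(FIXED_DB, key=lambda k: len(set(k.split()) & set(query.lower().split())))
    return FIXED_DB[best]

def run_python(code: str) -> str:
    """Execute tool code and return whatever it assigns to `result` (no sandboxing here)."""
    namespace: dict = {}
    exec(code, {}, namespace)
    return str(namespace.get("result"))

def small_model(transcript: str) -> str:
    """Placeholder policy; a real <10B reasoner would generate the next action."""
    if "OBSERVATION" not in transcript:
        return "PYTHON: result = sum(k * k for k in range(1, 11))"
    return "ANSWER: 385"

def solve(question: str, max_steps: int = 8) -> str:
    transcript = f"QUESTION: {question}"
    for _ in range(max_steps):
        action = small_model(transcript)
        if action.startswith("ANSWER:"):
            return action.removeprefix("ANSWER:").strip()
        if action.startswith("RETRIEVE:"):
            obs = retrieve(action.removeprefix("RETRIEVE:").strip())
        else:  # PYTHON:
            obs = run_python(action.removeprefix("PYTHON:").strip())
        transcript += f"\n{action}\nOBSERVATION: {obs}"
    return "no answer within step budget"

print(solve("What is the sum of the squares of the first 10 positive integers?"))
```

The point of the sketch is the division of labor: the weights only need to decide what to look up and what to compute, while the fixed DB and the interpreter hold the crystallized knowledge.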

New research from Meta describes "Memory Layers", which resemble both attention and a vector DB keyed in the model's latent space.
https://ai.meta.com/research/publications/memory-layers-at-scale/
I think it's quite clear that active params will end up being a smaller and smaller proportion of a model's total parameters (MoEs were only the beginning of this), with most parameters used very sparsely, in the same vein as associative memory. My sense is that techniques like these don't count under this question's resolution criteria (since they're trained parameters), but they do point to the same principle.
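As a rough illustration of the mechanism (a query selecting a few slots from a large table of trained keys and values, much like attention over a built-in vector DB), a single sparse memory lookup might look like the sketch below. The dimensions, random parameters, and top-k choice are arbitrary stand-ins, not the architecture from the Meta paper.

```python
import numpy as np

# Simplified sparse memory lookup: a query scores every key in a large table,
# but only the top-k slots contribute values. Shapes and k are arbitrary.

rng = np.random.default_rng(0)
d, num_slots, k = 64, 50_000, 32             # embedding dim, memory size, slots used per query
keys = rng.standard_normal((num_slots, d))    # trained parameters in a real model
values = rng.standard_normal((num_slots, d))

def memory_lookup(query: np.ndarray) -> np.ndarray:
    scores = keys @ query                      # similarity to every key
    top = np.argpartition(scores, -k)[-k:]     # indices of the k best-matching keys
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                   # softmax over the selected slots only
    return weights @ values[top]               # sparse weighted sum of values

out = memory_lookup(rng.standard_normal(d))
print(out.shape)  # (64,)
```

Only k of the num_slots rows are touched per query, which is why parameter count can grow far faster than active compute.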

I didn't explicitly mention wall-clock time, but I said "Arbitrary scaffolding allowed", so unless anyone objects I'll add "Must use below X minutes of wall-clock time per question". I'm torn between 1 minute (an upper bound on how long users would be willing to wait) and something higher, since the spirit of this question is upper-bound-y.

@JacobPfau Added a 1-minute cap. Since we're talking about 10B models on arbitrarily optimized hardware, this isn't much of a constraint. I expect that'll allow >100k tokens/question.
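Back-of-envelope for that >100k figure (the decode speed is an assumed number for a ~10B-parameter model on well-optimized inference hardware, not anything specified by the market):

```python
# Rough token budget under the 1-minute wall-clock cap.
# assumed_tokens_per_second is an illustrative guess, not a measured figure.
seconds_per_question = 60
assumed_tokens_per_second = 2_000
print(seconds_per_question * assumed_tokens_per_second)  # 120000, i.e. >100k tokens/question
```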

bought Ṁ100 YES

Preregistering my confidence that small models operating closely with large external DBs will turn out to be pretty darn smart.

@AdamK Care to share your reasoning?

@JoeBoyle I don't think I ought to share it. No point giving up so much alpha while prices for downstream markets remain this good.

@AdamK Okay

@JoeBoyle Sorry that I'll have to wait, but here's this to keep me honest: de3be1f4472c9adb4a479b97d140d6615b7189536d8916d52c5426aa0291fd28

I may consider sharing by April or so.
@JacobPfau I'm also happy to bet YES on a "before 2027" version of this market.

@AdamK Yeah, given Qwen/o1 progress I agree that 2027 is possible. I've made a question here: https://manifold.markets/JacobPfau/is-scale-unnecessary-for-intelligen?play=true

@AdamK Link is dead mate

bought Ṁ200 NO

The title is a bit misleading, because I think this is theoretically possible but just won't happen before 2030

@SaviorofPlant If you have a better title I will consider editing.

@JacobPfau If you can spare the characters, "before 2030" inside the parentheses clears it up.

Do current larger models reach those scores? Or is improvement AND compression currently necessary?

@KimberlyWilberLIgt Improvement on SWE-bench Verified is necessary. The others have been roughly hit by o1. I chose these numbers as my sense of in-domain expert performance.

bought Ṁ50 YES

You describe it as the opposite of the title

@IasonKoukas Thanks for catching this @CraigDemel

Pinging @AdamK to make sure your limit orders are in the right direction

Title doesn't match question in text. Should it be "is scale unnecessary"?
