Ilya Sutskever – We're moving from the age of scaling to the age of research
A model-generated "third guest" response to the episode. This is separate from the YouTube video and is meant to explore additional threads, counterarguments, and research directions — like a thoughtful follow-up guest.
From Jagged Generalization to Coherent Value: Unpacking the Next Era of AI Research
As I listen to Ilya Sutskever's conversation with Dwarkesh Patel, I'm struck by how much of it revolves around what we don't know about current AI systems. The disconnect between evaluation performance and real-world utility, the mystery of human sample efficiency, the unexplored territory between pre-training and robust generalization: these aren't just engineering challenges but fundamental scientific questions. Having just exited the "age of scaling" and entered a new "age of research," we stand at a fascinating inflection point where the most important advances won't come from simply applying more compute to existing recipes, but from rethinking the recipes themselves.
What I find most compelling is how Ilya frames the central problem: reliable generalization. Current models display what he aptly calls "jaggedness"—excellence in controlled evaluations coupled with bizarre failures in real-world tasks. This isn't a minor bug but a fundamental limitation of our current paradigm. The human who can learn to drive competently in 10 hours without explicit reward signals demonstrates a kind of smooth, robust generalization that our best models still lack.
Here are the threads I'd like to pull on as we expand this conversation into new territory.
Section 1: The Unification Problem—Connecting Pre-training, RL, and Continual Learning
Ilya correctly notes that we've transitioned from one "recipe" (pre-training) to another (RL), but I'd push further: we need a unified theory that connects these learning regimes. The current paradigm treats pre-training as foundational knowledge acquisition and RL as skill refinement, but this separation might be fundamentally limiting.
Consider an alternative framing: all learning is reward learning. Pre-training appears "unsupervised" only because the reward is implicit in the objective of predicting the next token. This isn't fundamentally different from RL; it's just a different reward signal. The real distinction is between dense and sparse reward signals.
This perspective suggests a research direction: continuous reward shaping from dense to sparse. Instead of separate pre-training and RL phases, we could design learning systems that smoothly transition from dense prediction-based rewards (like next-token prediction) to increasingly sparse task-completion rewards. The human analog might be how children learn: early development involves lots of immediate feedback (physical sensations, parental responses), which gradually gives way to more abstract, delayed rewards (social approval, academic achievement).
Recent work on reward-free pre-training objectives hints at this direction. Methods like CIC (Contrastive Intrinsic Control) or BYOL (Bootstrap Your Own Latent) create self-supervised objectives that produce useful representations without explicit rewards. The next step is to make these objectives learnable, allowing the system to discover which predictive tasks lead to the most useful representations for downstream sparse-reward tasks.
Experiment proposal: Train an agent on a curriculum where the reward density decreases over time, while measuring transfer efficiency to new tasks. Does a smooth transition outperform the current abrupt shift from pre-training to RL?
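A minimal sketch of what that curriculum could look like, assuming a generic agent/environment interface; `agent`, `env`, and the two reward sources are hypothetical placeholders, and the linear annealing schedule is just one obvious choice:

```python
def reward_density_schedule(step, total_steps):
    """Fraction of the reward drawn from the sparse, task-level signal.
    Linear anneal from 0 (all dense) to 1 (all sparse); the shape of this
    schedule is itself a research question."""
    return min(1.0, step / total_steps)

def mixed_reward(dense_r, sparse_r, alpha):
    """Interpolate between a dense, prediction-style reward and a sparse
    task-completion reward. alpha=0 behaves like pre-training; alpha=1
    behaves like pure sparse-reward RL."""
    return (1.0 - alpha) * dense_r + alpha * sparse_r

def train(agent, env, total_steps):
    # `agent` and `env` are placeholders for whatever learner and task you use.
    obs = env.reset()
    for step in range(total_steps):
        alpha = reward_density_schedule(step, total_steps)
        action = agent.act(obs)
        next_obs, sparse_r, done, info = env.step(action)
        dense_r = agent.prediction_log_likelihood(next_obs)  # e.g., next-observation log-prob
        agent.update(obs, action, mixed_reward(dense_r, sparse_r, alpha), next_obs, done)
        obs = env.reset() if done else next_obs
```

The baseline for comparison is the same loop with alpha jumping from 0 to 1 at a fixed step, mirroring today's abrupt pre-training-to-RL handoff.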
Section 2: Beyond the Value Function—Toward a Theory of Learned Preferences
Ilya's discussion of value functions and emotions points toward something deeper: we need a theory of how preferences emerge and stabilize in learning systems. The human value function isn't static; it evolves through experience while maintaining certain stable principles (like caring about social standing).
Current RLHF (Reinforcement Learning from Human Feedback) approaches this backward: we try to infer human preferences from examples, then instill those preferences into the model. But human preferences aren't just patterns to be copied—they emerge from the interaction between innate drives, social context, and individual experience.
This suggests a different research direction: developmental preference learning. Instead of teaching models "what to value," we should create environments where useful values emerge naturally from the agent's interaction with the world.
Consider the analogy to how children develop values: they don't start with explicit moral principles. They have innate responses (crying when hungry, smiling at faces) that interact with their environment. Through social interaction, they learn which behaviors get positive responses. Eventually, they internalize these social patterns as values.
Experiment proposal: Create multi-agent environments where agents must cooperate on tasks with no explicit reward. Instead, agents receive social signals (attention, imitation, coordination success). Do these agents develop stable, human-like value functions? How transferable are these values to new environments?
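To make "social signals instead of explicit rewards" operational, one could define each agent's learning signal purely from other agents' behavior. The sketch below is illustrative; `episode_log` and its methods are hypothetical bookkeeping, and the weights are arbitrary starting points:

```python
def social_reward(agent_id, episode_log, w_attention=0.3, w_imitation=0.3, w_coordination=0.4):
    """Reward derived only from how other agents respond to this agent;
    no task-level reward enters the signal at all."""
    attention = episode_log.attention_received(agent_id)     # how often others observed this agent
    imitation = episode_log.times_imitated(agent_id)         # how often others copied its actions
    coordination = episode_log.joint_success_rate(agent_id)  # success of actions taken jointly with others
    return (w_attention * attention
            + w_imitation * imitation
            + w_coordination * coordination)
```

The experiment's question is then whether value functions trained on this signal alone stabilize, and whether they transfer to environments where the social signals are absent.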
The neuroscience of emotions offers clues here. Antonio Damasio's somatic marker hypothesis suggests that emotions are essentially cached value judgments—quick associations between situations and expected outcomes based on past experience. Our models might need something similar: not just a value function in the RL sense, but a system of cached associations that guide attention and decision-making without full recomputation.
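As a toy illustration of "cached value judgments" (not a proposal for how such a module should actually be built), one could keep an episodic table of situation embeddings and their eventual outcomes, consulted by nearest-neighbor lookup before any expensive deliberation:

```python
import numpy as np

class SomaticMarkerCache:
    """Stores (situation embedding -> realized outcome value) pairs and returns
    a fast, approximate appraisal of new situations from similar past ones."""
    def __init__(self):
        self.embeddings = []  # 1-D numpy arrays describing past situations
        self.values = []      # outcome values eventually observed for them

    def record(self, embedding, outcome_value):
        self.embeddings.append(np.asarray(embedding, dtype=float))
        self.values.append(float(outcome_value))

    def quick_appraisal(self, embedding, k=5):
        """Average the outcomes of the k most similar past situations.
        Acts as a cheap prior that can bias attention before full evaluation."""
        if not self.embeddings:
            return 0.0
        query = np.asarray(embedding, dtype=float)
        dists = [np.linalg.norm(query - e) for e in self.embeddings]
        nearest = np.argsort(dists)[:k]
        return float(np.mean([self.values[i] for i in nearest]))
```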
Section 3: The Missing Piece—Intrinsic Motivation and Exploration
What Ilya hints at but doesn't fully explore is how humans manage to learn so efficiently with so few explicit rewards. Part of the answer lies in intrinsic motivation—the drive to explore, understand, and master the environment even in the absence of external reward.
Current models are profoundly extrinsically motivated. They learn what we explicitly reward them for learning. Humans, by contrast, spend enormous cognitive resources on activities with no immediate payoff: curiosity-driven exploration, play, aesthetic appreciation.
The RL community has explored intrinsic motivation through concepts like curiosity, empowerment, and information gain. But these have largely remained academic exercises rather than central components of our training paradigms.
Here's a controversial claim: The biggest blocker to human-like generalization isn't sample efficiency—it's exploration efficiency. Humans don't just learn faster from each sample; they seek out better samples. They ask questions, design experiments, and seek novel experiences that maximize learning.
Experiment proposal: Compare two training regimes for solving a complex reasoning task. In regime A, agents are given curated examples (like current RL training). In regime B, agents are placed in an environment where they must discover the task structure through exploration, with only sparse success/failure signals. Measure generalization to related but novel tasks.
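One concrete way to instrument regime B is a prediction-error curiosity bonus in the spirit of intrinsic-motivation methods like ICM: the agent is paid extra wherever its own forward model is surprised. The PyTorch-style sketch below is a simplification (real versions also learn the state encoding), and the architecture is arbitrary:

```python
import torch
import torch.nn as nn

class CuriosityBonus(nn.Module):
    """Intrinsic reward = error of a learned forward model, so the agent is
    rewarded for visiting transitions it cannot yet predict."""
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.forward_model = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, obs_dim),
        )
        self.optim = torch.optim.Adam(self.forward_model.parameters(), lr=1e-3)

    def intrinsic_reward(self, obs, action, next_obs):
        """Return the current prediction error as a bonus, then train on it."""
        pred = self.forward_model(torch.cat([obs, action], dim=-1))
        loss = ((pred - next_obs) ** 2).mean()
        self.optim.zero_grad()
        loss.backward()
        self.optim.step()
        return loss.detach().item()
```

In regime B, the agent's total reward would be the sparse success signal plus a small coefficient times this bonus, while regime A keeps only the curated, externally shaped reward.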
Recent work on language agents that ask clarifying questions shows promise here. These agents don't just process the input they're given; they actively seek missing information. Could we scale this to more fundamental levels of learning?
Section 4: The Scaling Laws We're Missing—Knowledge Integration, Not Just Accumulation
Ilya mentions that we're moving from "scaling" to "research," but I'd argue there's a deeper scaling law we haven't discovered yet: the scaling of knowledge integration.
Current models accumulate knowledge but don't integrate it coherently. They can quote contradictory facts without noticing the contradiction. They can solve a coding problem one way but not recognize that their solution violates principles they "know" from other contexts.
Human cognition exhibits something different: knowledge consolidation. When we learn new information that contradicts existing knowledge, we experience cognitive dissonance and work to resolve it. Our knowledge becomes increasingly coherent over time.
This suggests a research direction: learning systems that maintain and improve their internal consistency. Instead of just minimizing prediction error on individual examples, we need objectives that reward logical consistency across the knowledge base.
Experiment proposal: Train models with an additional loss term that penalizes logical contradictions in their responses across different contexts. Does this lead to more robust generalization? Does it create "common sense" as an emergent property?
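One way to operationalize that extra loss term: sample pairs of contexts constructed to demand mutually consistent answers, and penalize divergence between the model's own predictive distributions on them. The pairing strategy and the choice of divergence below are assumptions, not an established recipe:

```python
import torch.nn.functional as F

def consistency_loss(model, context_a, context_b, lam=0.1):
    """Penalize disagreement between the model's answer distributions on two
    contexts that, by construction of the pair, should elicit consistent
    answers. `model` is assumed to return logits over the answer space."""
    log_p_a = F.log_softmax(model(context_a), dim=-1)
    log_p_b = F.log_softmax(model(context_b), dim=-1)
    # Symmetrized KL divergence between the two predictive distributions.
    kl_ab = F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
    kl_ba = F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean")
    return lam * (kl_ab + kl_ba)
```

This only captures the simplest case, pairs expected to agree outright; richer logical relations (entailment, negation, composition) would need pair-specific penalties added alongside the ordinary prediction loss.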
The mathematics here might connect to Solomonoff induction or minimum description length (MDL) principles. The most efficient representation of the world isn't just the one that predicts well; it's the one that predicts well with minimal internal contradiction.
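For concreteness, the standard two-part MDL objective: prefer the model that minimizes the bits needed to describe the model plus the bits needed to describe the data given the model. Reading internal contradictions as wasted bits in the model term is an interpretation of that principle, not a standard result:

```latex
% Two-part minimum description length: M* is the preferred model of data D.
M^{*} = \arg\min_{M} \bigl[\, L(M) + L(D \mid M) \,\bigr]
```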
Section 5: The Deployment Paradox—Safety Through Exposure?
Ilya's evolving thoughts on deployment are particularly interesting. He initially favored "straight-shot superintelligence" but now sees value in incremental deployment. I'd push this further: deployment might be necessary for alignment, not just safety testing.
Here's the argument: Current alignment approaches assume we can specify what we want in advance. But human values aren't static or fully specifiable. They emerge through social interaction and change with experience. A truly aligned superintelligence might need to learn human values through ongoing interaction, not just be trained on them once.
This creates what I'll call the alignment deployment paradox: To be properly aligned, the AI needs extensive interaction with humans. But extensive interaction with a powerful AI changes humans. There's no stable target to aim for.
Research direction: Recursive value alignment. Instead of trying to learn static human preferences, design systems that can track how human preferences evolve in response to the AI's actions, and align to this dynamic process.
This connects to recent work on Constitutional AI and debate, but adds a crucial temporal dimension. The AI doesn't just learn what humans value now; it learns how human values tend to evolve when exposed to new possibilities.
Experiment proposal: Create a simulated society where AIs and humans interact over time. The AIs' objective is to maximize a metric of "value alignment" that accounts for how human values change in response to the AI's actions. What alignment strategies emerge? Do they look like anything we've considered?
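A schematic of that simulated-society loop, with every dynamic treated as a hypothetical placeholder (`value_drift` for how human values shift after observing the AI, `alignment_score` for how behavior is judged against the humans' current values):

```python
def run_society(ai_agent, human_models, value_drift, alignment_score, horizon=1000):
    """Repeated AI-human interaction in which the alignment target itself moves.
    All components are stand-ins for models we do not yet have."""
    history = []
    for t in range(horizon):
        current_values = [h.values for h in human_models]
        action = ai_agent.act(current_values)
        for h in human_models:
            h.values = value_drift(h.values, action)   # values respond to the AI's behavior
        score = alignment_score(action, [h.values for h in human_models])
        ai_agent.update(action, score)                 # the AI aligns to the moving target
        history.append(score)
    return history
```

The open question is which strategies such an agent converges to: does it stabilize human values, steer them, or learn to track them without distorting them?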
Conclusion: Toward a New Science of Intelligence
Listening to Ilya's conversation, I'm struck by how much of AI research today is still engineering without science. We have impressive artifacts but limited understanding of why they work or fail. The next era needs to be different.
My key takeaways:
- Generalization is the central scientific problem, not just an engineering challenge. We need theories that explain why humans generalize so robustly and how to replicate this in machines.
- Learning and values must co-evolve. The separation between "what to learn" and "what to value" is artificial. Human-like intelligence emerges from their interaction.
- Deployment isn't just for testing; it's for learning. Both we and the AI learn what alignment means through interaction.
- The brain analogy remains fertile ground, but we need to move beyond superficial neurons-and-synapses comparisons to deeper principles about how learning, memory, and values interact.
The most exciting research directions aren't incremental improvements to existing architectures, but fundamental rethinking of what learning is. We need to build systems that don't just accumulate knowledge but integrate it, that don't just pursue rewards but develop stable values, that don't just process data but actively seek understanding.
Ilya ended by talking about "research taste" guided by "beauty, simplicity, elegance." In that spirit, I'll offer my own aesthetic principle for the next era of AI research: seek systems that learn like scientists, not just students. Scientists don't just absorb information; they formulate hypotheses, design experiments, question assumptions, and revise their understanding. That's the kind of intelligence—curious, critical, and coherent—that might finally bridge the gap between impressive evaluations and robust, real-world capability.