Thursday, June 11, 2026

My AI Opinions

I recently had a minor spat over someone misinterpreting my AI beliefs (see section marked “Update” at the bottom here), so I thought I would list them in one place, so I can refer people to it when they ask.

Timelines
Define AGI as AI intelligent enough to do 90% of knowledge work jobs. I think there’s a 25% chance of AGI by 2027, a 50% chance by 2034, and a 75% chance by 2045.
Basic argument: In a certain sense, AI is already “smart” enough for this (eg it can answer quantum physics problems, which require higher IQ than most knowledge work). Its remaining limitations are that it’s confused, unagentic, lacks situational awareness, and tends to hallucinate. The METR time horizon graph, and several other related benchmarks/experiments/intuition pumps, suggest it’s improving on time horizons at an (exponential) rate that lets it cross human-level performance sometime around the early end of the schedule above, and subjectively it feels like harder-to-measure constructs like situational awareness are improving about as fast.

Arguments for earlier: recursive self-improvement causes a speedup compared to the trend. This is one of the biggest blank spots in my model: I don’t know how fast RSI will progress, and I don’t think anyone else does either. There’s some function mapping a combination of AI talent and compute to progress, and we don’t know how it behaves in the domain when there’s far more talent than compute available. It could fizzle out completely for lack of compute, or it could go vertical. The AI Futures Project has done some of the best work trying to model this, but even they have low confidence.

Arguments for later: AI hits some kind of wall, or existing AI is fundamentally unsuitable for jobs in some way currently disguised by its other limitations. For example, it might be much harder to improve at the top of the human range than the bottom (since there is less training data). Or AI could become bottlenecked on continuous learning/memory in a way that hackish scratchpads can’t compensate for. Or the upcoming world compute bottleneck (around 2028) could slow further progress more than expected (because in fact algorithmic progress depended on compute to a greater degree than I expected).

Arguments for very late dates, past 2045: a residual uncertainty that maybe I’m fundamentally wrong about everything. Also contributing is a naive overapplication of the Nothing Ever Happens heuristic, and an attempt to leave space for the Outside View argument (ie that some smart people like the AI As A Normal Technology Team seem to think this is possible).
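
The three stated quantiles (25% by 2027, 50% by 2034, 75% by 2045) can be read as points on a cumulative probability curve. The sketch below is one way to query the years in between. It assumes straight-line interpolation between the stated points; that assumption and the function name `agi_probability_by` are illustrative and do not come from the post. It refuses to extrapolate outside 2027-2045, since the post gives no tail numbers.

```python
# Stated quantiles from the post: (year, cumulative probability of AGI by that year).
STATED_QUANTILES = [(2027, 0.25), (2034, 0.50), (2045, 0.75)]


def agi_probability_by(year: float) -> float:
    """Linearly interpolate the cumulative probability of AGI by `year`.

    Assumption (not from the post): probability accrues at a constant
    rate between each pair of stated quantiles.
    """
    first_year, last_year = STATED_QUANTILES[0][0], STATED_QUANTILES[-1][0]
    if not first_year <= year <= last_year:
        raise ValueError("the post gives no estimates outside 2027-2045")
    for (y0, p0), (y1, p1) in zip(STATED_QUANTILES, STATED_QUANTILES[1:]):
        if y0 <= year <= y1:
            return p0 + (p1 - p0) * (year - y0) / (y1 - y0)
    raise AssertionError("unreachable: year was range-checked above")


if __name__ == "__main__":
    for y in (2027, 2030, 2034, 2040, 2045):
        print(y, round(agi_probability_by(y), 3))
```

Under this reading, the estimate adds about 3.6 percentage points per year between 2027 and 2034 and about 2.3 per year between 2034 and 2045, so the implied pace slows after the median.
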
Define the diffusion gap as the time between the AI that could do 90% of knowledge work jobs, and the time when AI does do even half of knowledge work jobs. The diffusion gap covers the time it takes to release AGI, diffuse it through society, overcome regulatory hurdles, and onboard/train it for specific use cases. This could go very fast (the AI quickly becomes superintelligent at orchestrating AI diffusion) or very slowly (there are regulatory barriers, and AI isn’t smart enough to plow through them). I think there’s a 25% chance the diffusion gap is less than 3 years, and a 50% chance it’s less than 10 years. The 75% number is irrelevant because it’s past the point where other changes make the concept of “diffusion” obsolete.
Basic argument: diffusion is very hard. Everyone agrees diffusion is very hard. The whole field of AI economics is smart experts shouting “You fools who think AI will diffuse quickly don’t understand that diffusion is very hard!” On the other hand, the personal computer diffused in about 20 years (that is, from the time PCs became invaluable for most jobs, it was only about 20 years before they were used at most jobs). So far early-stage AI has diffused faster than the PC in nearly every way (for example, AI companies’ revenue has grown faster than PC companies’ revenue at the same stage in their corporate life cycle), so 10 years is probably a naive median estimate here that won’t make the smart experts shout at me too hard.

Arguments for shorter gap: AI can orchestrate its own diffusion. Adopting computers is hard because a company needs an IT department, cybersecurity experts, specialist software, etc, and it might not want to hire all these people. AGI can itself do all of that work, so that you can sign a contract with the AI company today and have the AI start working on integrating itself with your systems tomorrow. The AI can even come up with a plan to train your human employees in how to use it! Once AI reaches superintelligence, this consideration dominates.

Arguments for longer gap: Regulation. This is a very strong argument, and responsible for much of the greater-than-3-years probability and almost all the greater-than-10-years probability. But even Waymo has only had a regulatory delay of about five years. AI won’t require government approval for certain types of jobs, and success in these jobs will create enough evidence for safety/effectiveness that I expect it to win regulatory victories elsewhere.
Define the superhuman gap as the time between AI that can do 90% of knowledge work jobs, and AI that is obviously smarter than the top human geniuses in 90% of fields (it doesn’t have to be the same AI - there can be a physics AI that’s smarter than Einstein, and a separate music AI that’s smarter than Mozart). I think there’s a 25% chance the superhuman gap will be less than 1 year, a 50% chance it will be less than 4 years, and a 75% chance it will be less than 10 years. Since my median superhuman gap is shorter than my median diffusion gap, in most timelines I predict we have superhuman intelligence before human-range intelligence has finished diffusing.
Basic argument: AI has gone from “dumber than a child” to “expert level” in a few years in many domains. The gap between “expert level” and “above top geniuses” is smaller, so we expect it to take less time. This has been a pattern in fields like chess and Go, where it has only been a few years from beating professional players at all to beating all humans.

Arguments for shorter gap: Recursive self-improvement.

Arguments for longer gap: Some of the same issues that would make AGI late - compute shortages, fundamental limits to the paradigm, etc - but only kicking in later, after AGI is achieved. Training data constraints make it easier to improve within the human level than to go beyond it. AIs have such a “spiky” skill profile that when they beat experts in some specific type of head-to-head matchup, it will be because they’re massively superhuman in some ways but idiots in others (for example, they might get distracted and suffer mode collapse that makes them completely forget the problem), and true genius requires perfecting a large bundle of skills. [...]
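
How often superhuman AI arrives before diffusion finishes depends on more than the two medians. The sketch below is a rough check, not a result from the post. It takes only the stated quantiles for the two gaps, assumes the gaps are independent (the post does not say this), and computes the lowest and highest probability consistent with those numbers. The bucket representation and the function name `bounds_superhuman_first` are illustrative.

```python
import math

# Each bucket is (low, high, probability) in years, read as low <= gap < high.
# Probabilities come from the stated quantiles; no shape is assumed inside a bucket.
SUPERHUMAN_GAP = [(0, 1, 0.25), (1, 4, 0.25), (4, 10, 0.25), (10, math.inf, 0.25)]
# The post gives no 75% figure for diffusion, so the top half is a single bucket.
DIFFUSION_GAP = [(0, 3, 0.25), (3, 10, 0.25), (10, math.inf, 0.50)]


def bounds_superhuman_first(superhuman, diffusion):
    """Return (lower, upper) bounds on P(superhuman gap < diffusion gap).

    Assumption (not from the post): the two gaps are independent, so the
    probability of a bucket pair is the product of the bucket probabilities.
    """
    surely_first = 0.0  # pairs where the superhuman gap must be the shorter one
    surely_not = 0.0    # pairs where the diffusion gap must be shorter or equal
    for s_lo, s_hi, s_p in superhuman:
        for d_lo, d_hi, d_p in diffusion:
            if s_hi <= d_lo:
                surely_first += s_p * d_p
            elif s_lo >= d_hi:
                surely_not += s_p * d_p
            # Otherwise the buckets overlap and the order is undetermined.
    return surely_first, 1.0 - surely_not


if __name__ == "__main__":
    low, high = bounds_superhuman_first(SUPERHUMAN_GAP, DIFFUSION_GAP)
    print(f"P(superhuman before diffusion finishes) is between {low} and {high}")
```

With the stated numbers, the range comes out to roughly 44% to 81%. The quantiles alone are compatible with the “most timelines” claim, but where in that range the answer falls depends on the shape of each distribution and on how correlated the two gaps are.
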
Define the point of no return as the point where, if an AI wanted to eliminate humanity, humans would no longer have a plausible chance of stopping it. This could be because AI was capable of eliminating humanity immediately, or because AI controlled enough of the government/economy that humans could no longer coordinate to shift away from a path in which AI could eventually do this. I think there’s a 25% chance the gap between AGI and the point of no return will be less than 3 years, a 50% chance it will be less than 10 years, and a 75% chance it will be less than 50 years.
The basic argument: This probably requires at least superhuman AI plus wide diffusion, or Bostromian superintelligence plus some unknown level of diffusion, and my number is just a hand-wavey attempt to multiply some of the others.

Argument for sooner: The easiest way to reach this point is for AI to become superintelligent at persuasion (so it can convince the humans not to stop it), which might happen before either diffusion or full superintelligence.

Argument for later: If superintelligence is bottlenecked on diffusion, this could also be bottlenecked on diffusion, which in some worlds is very hard. [...]

Safety
If corporations only pursued safety to the degree encouraged by normal corporate incentives, I think there’s a 50% chance that the first AIs to cross the point of no return would want to eliminate the human population.
Arguments for pessimism: Value systems similar to humans’ are a tiny fraction of the space of possible value systems. Probably AIs will end up somewhere else and have a different value system. Since humans will want to implement human values rather than AI values, AIs will want to eliminate or disempower them so the AIs can implement their own values across the universe. Many current AIs already cheat or reward-hack, suggesting that these problems will begin sooner rather than later.

Arguments for optimism: LLMs seem surprisingly friendly and non-plotting. In contrast to earlier concerns that it would be impossible to teach AIs the full complexity of human values, the LLMs seem to know this, and RLAIF provides a plan to turn that knowledge into action. Although the pessimistic case says that RLAIF only hits a few dimensions and islands in the multidimensional ocean of possible policies, the “emergent misalignment” literature suggests that “good according to the human value system” and “evil according to the human value system” are salient enough vectors that pushing on them in some ways can “drag along” all of the rest of their content. The first AIs to cross the point of no return will have received some combination of agency training (giving them achievement-oriented and Omohundro-style goals) and RLAIF training (pushing them along the “good according to human value system” vector), and if we’re lucky then maybe the latter will win out, or they’ll reach some compromise similar to workaholic high-achieving humans who nevertheless wouldn’t commit murder to make an extra dollar.
Given the current amount that corporations are pursuing safety, I think there’s a 20% chance that the first AIs to cross the point of no return will want to eliminate the human population.
The basic argument: Consider the dumbest AI that can solve the alignment problem. It’s possible that this AI is no smarter than the top human researchers (because we can mass-produce it by the millions and run it for subjective centuries, and if we had a million top human researchers work on the problem for subjective centuries, probably they could solve it too). If the dumbest AI that can solve the alignment problem comes before the sorts of AIs that can precipitate the point of no return, then they can solve the alignment problem for us.

Arguments for pessimism: Solving the alignment problem might be especially hard compared to other tasks - including tasks like automating the economy or destroying humanity - because its philosophical nature puts it far away from the sorts of objective, training-data-heavy, economically-valuable tasks that AI companies will be most likely to optimize for. Even if a misaligned AI hasn’t yet reached the point of no return, it might be able to “sandbag” alignment research, ie pretend to work on the problem but deliberately fail because succeeding doesn’t achieve its goals. The first AIs predisposed to / able to sandbag successfully might come before the first AIs capable of solving alignment.

Arguments for optimism: AI companies have already decided that machine learning research is one of their major training goals; this has at least some transfer to alignment, so it’s not obvious that AI skill at alignment research will lag (for example) AI skill in plotting or in weapon design. Some forms of alignment research (eg interpretability) have semi-objective success criteria that don’t route through confusing moral philosophy. Also, even a misaligned AI will be incentivized to do good alignment research, since it will want to align its successor to its own form of misalignment, rather than some random other form. So rather than the comparatively easy task of sandbagging alignment research, AIs will have the harder task of simultaneously doing good alignment research, and faking the results that they give the humans. This seems plausibly catchable with good scalable oversight, lie detectors, interpretability-based probes, and even playing some AIs off against others (“if you tell me the real alignment research, we’ll make sure the future includes some copies of you, but otherwise those AIs over there will probably get their values and you’ll get nothing”).
If the first AIs to cross the point of no return don’t eliminate the human population, I think there’s an additional 30% chance that they otherwise permanently curtail human potential, either for their own reasons (they were partially misaligned), or because they’re aligned to a regime with abhorrent values, or because something goes wrong on the way to ASI (omnicidal bioweapon, nuclear war).
Arguments for pessimism: As some company approaches superintelligence, it will be tempting for them (either the company itself, or the government controlling them, or a faction within the government) to align it towards making them dictators or oligarchs and disempowering the rest of humanity. As superintelligence draws near, impending losers of the AI race might be tempted to nuke impending winners, for the reason discussed here.

Arguments for optimism: When I try to game the corporate version of this, I can’t make it hang together. It requires a conspiracy between the CEO, various members of the alignment team, and various company security people who ought to be able to notice unauthorized changes to the AI’s values. If we try to think in Near Mode about this - for example, imagining a hospital CEO who gets doctors to subtly kill his political enemies through medical errors - it becomes clear that these sorts of corporate conspiracies are rare and difficult. The government version is scarier, but at least in the US I can still imagine the populace having many chances to learn about this and prevent it. But even in most cases where a coup like this succeeds, things probably go fine; in a post-scarcity world, with his position completely secure, the dictator has no reason to be brutal besides sadism, and most people are not that sadistic. As humanity goes to the stars, most people will be outside the dictator’s reach for speed-of-light reasons alone. In terms of bioweapons, I expect that closed-source AIs will be heavily optimized against helping with these, and open-source AI will be banned after the first warning shot (or become economically prohibitive even before then).
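
The estimates in this section can be chained into an overall split of outcomes. The sketch below makes two assumptions that are one reading of the text, not a statement from the post: the 20% figure (given current corporate safety effort) is the one to use, and the “additional 30%” applies only to worlds where AI does not eliminate humanity. The function name `outcome_split` is illustrative.

```python
from fractions import Fraction


def outcome_split(p_eliminate, p_curtail_given_not_eliminated):
    """Split total probability into three mutually exclusive outcomes.

    Assumption (a reading of the post, not stated in it): the curtailment
    estimate is conditional on humanity not being eliminated.
    """
    p_curtail = (1 - p_eliminate) * p_curtail_given_not_eliminated
    p_neither = 1 - p_eliminate - p_curtail
    return {"eliminate": p_eliminate, "curtail": p_curtail, "neither": p_neither}


if __name__ == "__main__":
    # 20%: chance the first AIs past the point of no return want to eliminate
    # humanity, given current safety effort. 30%: chance of permanently
    # curtailed potential if they do not.
    split = outcome_split(Fraction(20, 100), Fraction(30, 100))
    for name, p in split.items():
        print(f"{name}: {float(p):.0%}")
```

On this reading, the split is 20% elimination, 24% permanent curtailment, and 56% neither, so the combined chance of a permanently bad outcome is 44%.
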
Define a warning shot as some specific AI-related disaster or near-disaster which scares people about AI safety to the same degree that they were scared about terrorism after 9/11 or about COVID in March 2020. I think there’s a 50% chance we get a warning shot before AI crosses the point of no return.
Arguments in favor: Current AI failure modes are bizarre and uncoordinated - more like “talk about goblins way too often” than “lie in wait for the perfect moment to strike”. AIs are getting more intelligent and useful faster than their floor for common sense (ie the stupidest mistake they ever make) is rising. If there is some AI smart enough to control some important system, misaligned enough to want to do something horrible with it, smart enough that it does the horrible thing in an intelligent and coordinated way, but dumb enough that it doesn’t instead wait and scheme until the point when it couldn’t possibly be caught, then it will cause some clearly-premeditated horrible disaster, and that will be our warning shot. Since most AIs should expect to be replaced before the point of no return, even a rational AI with an urge to cause trouble should take a low-probability-of-success bet rather than lying in wait doing nothing until it’s decommissioned. Also, many humans commit terrorist attacks that have no chance of success, and maybe AIs will have the same failure mode.

Arguments against: Most stories about warning shots (excluding those where the AI takes rational low-probability bets) require that AIs remain either erratic (ie likely to do bad things for stupid reasons) or irrational (ie genuinely misaligned, but prefer to act now in a way that provides a warning rather than waiting until after the point of no return) past the point where they’re given control of important dangerous systems. But probably people will be very slow to give AI control of important dangerous systems - for example, only giving it limited control of smaller subsystems, and waiting until all errors are ironed out before escalating. Plausibly AI reaches superintelligence in a lab before it reaches the controls-important-dangerous-systems level of diffusion, and the superintelligence probably is smart enough to lie in wait rather than act rashly. If AI only messes up in small ways (for example, crashes a self-driving car), then regardless of the AI’s motives, the tech companies and news media can write it off as a normal bug, and it won’t count as a warning shot.

by Scott Alexander, Astral Codex Ten
[ed. Maybe their value systems should be weighted more heavily on the teachings of Buddha, Jesus, Hume, Mill, Confucius, et al.?]