Set Sail For Fail? On AI risk

Jose Luis Ricon

2022-08-04; Last updated: 2022-12-12
Wordcount: 24114 | Reading time: 128 min

Summary

Existential risk due to artificial intelligence (hereafter AI risk) is worth taking seriously
1. A common reason why it is not taken seriously is that arguments or scenarios that illustrate the risks from AI contain many "sci-fi" elements that many consider highly implausible, like developing advanced nanotechnology overnight.
2. All critiques that completely reject, or seem to reject, AI risk are flawed.
There is value in writing compelling concrete cases for AI risk
1. Near the end of this essay I include some vignettes featuring AIs that lack some advanced capabilities but are dangerous regardless.
2. Appendices B and C discuss two recent good attempts at this
3. Those working in AI safety should write more of these for audiences outside of their communities
It is worth investing in creating a world that can contain potentially adversarial AGIs, in addition to AI alignment
1. The odds that AI will win in a confrontation with humanity are not fixed
2. Resilience might be more tractable than alignment
3. Alignment without resilience can lead to undesirable power imbalances
Current models/paradigms (Like GPT-3, DALL-E, or GATO) are most likely safe to scale

Introduction

The standard argument for why developing advanced AI systems (hereafter AGI) is dangerous can be summarized as follows:

1. Possibility: It is possible to develop such systems, and they will be developed at some point
2. Misaligned system: Either because these systems will be tasked with goals that directly and obviously conflict with the continued survival of conscious life on Earth, or because they are given goals that inadvertently lead to the same outcomes, these systems will compete with us for resources
3. AGI dominance: Such a system will eventually win, and as a result humanity will go extinct

There is a small research community in what's called AI alignment or AI safety that has over the past decade or two largely focused on points (1-2). This community has written a lot about what AGI is, when we will get it, what it might look like, the sorts of governance issues it poses, and why it may be difficult to get such systems to do as they are told (some of this work is linked at the end of this essay). Comparatively little has been written about the third point: what specific scenarios lead to catastrophe. I expected to find more articles on (3) than I did (see the reading list at the end). Authors often assert (3) without much detail, pointing to these three capabilities as explanations for why an AGI would take over (I discuss the first two in Appendix E):

1. Advanced nanotechnology, often self-replicating
2. Recursive self-improvement (e.g. an intelligence explosion)
3. Superhuman manipulation skills (e.g. it can convince anyone of anything)

There are exceptions to this, like the example I discuss in Appendix C.

I found it hard to reason about AGI risk scenarios that rely on these capabilities, because they may run into physical limitations that deserve more thought before I consider them plausible enough to substantially affect my thinking. It occurred to me that it would be fruitful to reason about AGI risk with these options off the table, focusing instead on other reasons one might suspect AGIs would have overwhelming power:

4. Speed¹ (The system has fast reaction times)
5. Memory (The system could start with knowledge of all public data at the time of its creation, and any data subsequently acquired would be remembered perfectly)
6. Superior strategic planning (There are courses of action that might be too complex for humans to plan in a reasonable amount of time, let alone execute)

Turns out, there's plenty of risk just with these! The case for caring about AI risk (existential and otherwise) can and should be made without these sci-fi elements to reach a broader audience. One could then make a case for more risk by allowing for more sci-fi. In this post I will aim to explain how (4-6) alone are sufficient to design plausible scenarios where an AGI system poses various degrees of risk. I don't take "sci-fi" to mean impossible: Many commonplace inventions today were once sci-fi. I take sci-fi to mean inventions that lie in the future and that it's yet unclear how exactly they will pan out if at all.

[Image] I had to use DALLE in some way you know

In particular I will assume the following:

AGI will happen
- Artificial General Intelligence (AGI) will be developed at some point.
- This system will look familiar to us: It will be to some extent a descendant of large language models with some reinforcement learning¹ added on top. It will run on hardware not unlike GPUs and TPUs.
- To sidestep certain issues that are hard to resolve, we can take the system to lack consciousness, qualia, or whatever term you prefer for that
- I address the question of whether general artificial intelligence is even a meaningful concept later
AI safety research has failed
- This is not to say that it will fail. The scenario I want to explore here is what happens if it does and a system is developed without any consideration about safe deployment. This way we can explore AGI dominance in isolation rather than reason about how alignment might work
- Most work in AI safety assumes that the system is given a benign or neutral goal (usually making paperclips) which then leads to discussions of how that goal leads to catastrophe. Here the aim is to study the capabilities of AGIs in the context of competing with humans for resources, so we assume an adversarial goal for simplicity. In particular, the system in this essay is assumed to be given the goal of making Earth inhospitable to life (Perhaps a system created by negative utilitarians).
- One could also assume a Skynet scenario: that works as well. Some AI risk researchers seem to dislike this scenario, but it is actually a great example.
Technological pessimists are right
- Advanced nanotechnology turns out not to be possible, beyond what current biology is capable of
- Broadly, there are no new scientific discoveries to be made that can radically empower an AGI. The weapons systems an AGI might design if asked to would look familiar to modern day engineers in the same way that yet-to-be-invented affordable hypersonic airplanes will look familiar to us.
- The system can't manipulate humans beyond what the most skilled human con-artist can do
- There is no recursive self-improvement. The system can improve itself somewhat, and scale with availability of hardware and data but largely the initial system is very close to the best that can be done

[1]. This is not to say that it will use RL. Jacques Thibodeau and Jacob Steinhardt pointed out during the review that many AGI plans involve RL but some do not. This is okay. In this review I want to assume a system that is already unaligned, and the specific training process and architecture are whatever they have to be to lead to an unaligned agent

So to recap, this essay aims to make the concrete and narrow point that AGIs can pose risks even when we take the sci-fi away. I am not saying anything about:

How likely AGI is to occur and when (By assumption it happens)
How safe AGIs can be designed (By assumption the system explored here has no shackles of any sort)
Whether AGIs with arbitrary goals can be dangerous (I assume an explicitly harmful goal has been given)

When making a case for a conclusion one should only deploy the minimal set of premises required to establish it, and no more. In practice, more premises can increase the power of an argument; the kinds of risk that emerge from sci-fi-less AGI are strictly smaller than the risks from recursively self-improving AGI. Pedagogically, I expect that making a simplified but stronger case for AGI risk first, as a prelude to further speculation about sci-fi AGI risk, would be more compelling than reasoning exclusively from a set of premises that includes sci-fi.

AGI: a working definition

By an AGI I mean an agent that has a world model that's vastly more accurate than that of a human in, at least, domains that matter for competition over resources, and that can generate predictions at a similar rate or faster than a human.

The reason for this definition is to broadly focus on the kinds of systems that seem risky:

Here I consider agents, systems that interact with the world, observe the effects of their actions, and act on that feedback. Oracles (a system that is asked something will produce an answer but otherwise do nothing further) are a lesser problem, but they could be turned into agents without great difficulty. Oracles could pose their own problems (How trustworthy is their advice?) but here I am mostly thinking of agents.
A particularly slow AGI would not be much of a threat. It's no good to carefully plan how to take over the world if it takes you a year to decide what to do next. For reference (thanks to Kipply Chen for the number), a forward pass of, say, GPT-3 is on the order of milliseconds; a paragraph (300 words/600 tokens) may take 10 seconds. Compare a human developer who writes 300 lines of code in a day.
The system doesn't have to really be general. We can stipulate away some capabilities; maybe the system can't compose poetry, make bagels, or fold your clothes. As long as capabilities required for resource competition (Like planning, sufficiently good memory etc) are present one can construct interesting risk scenarios. Thus we will still use AGI even though the 'G' doesn't have to mean 100% General.
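The speed comparison in the list above can be made concrete with a little arithmetic. This is only a rough sketch: the 600 tokens in about 10 seconds and the roughly 300 lines of code per day come from the text, while the 8-hour workday and the 10 tokens per line of code are assumptions introduced here purely for illustration.

```python
# Rough throughput comparison between a language model and a human developer.
# From the text: ~600 tokens in ~10 seconds; ~300 lines of code per day.
# Assumptions (NOT from the text): 8-hour workday, ~10 tokens per line of code.

MODEL_TOKENS = 600
MODEL_SECONDS = 10
HUMAN_LINES_PER_DAY = 300
TOKENS_PER_LINE = 10         # assumption, for illustration only
WORKDAY_SECONDS = 8 * 3600   # assumption, for illustration only


def tokens_per_second(tokens: float, seconds: float) -> float:
    """Average output rate in tokens per second."""
    return tokens / seconds


model_rate = tokens_per_second(MODEL_TOKENS, MODEL_SECONDS)
human_rate = tokens_per_second(HUMAN_LINES_PER_DAY * TOKENS_PER_LINE, WORKDAY_SECONDS)
speedup = model_rate / human_rate

print(f"model: {model_rate:.1f} tokens/s")
print(f"human: {human_rate:.3f} tokens/s")
print(f"ratio: {speedup:.0f}x")
```

Under these assumptions the ratio comes out at several hundred to one. It compares raw output only, not quality, and changing either assumption shifts the figure proportionally.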

As far as the scenarios in this essay are concerned, you can imagine the system to be capable of any cognitive task any human alive today is capable of but 100x faster, in addition to more advanced planning skills. The point of the definition is to gesture at kinds of systems that would enjoy substantial advantage in competition over resources. This advantage need not be absolute in every possible way; as an analogy, victory in armed conflict is not determined by any single factor but by a mix of them: numerical advantage, technological superiority, morale, effective communications, supply chains. We can generally speak of armies that are more powerful than others in general without committing to superiority in every single way, and the same is true for AGI. What it needs, as a lower bound, is the right mix of capabilities without necessarily besting humans at every single one.

AGI risk depends on the environment the AGI is placed in. An AGI (that is, a small datacenter) sent all the way back to the middle of the Amazonian jungle would not be able to do anything of interest. Similarly, an AGI hypothetically sent to XVII century London could likewise be severely constrained in what it can do. Reasonable risk scenarios arise from the conjunction of AGI and increasingly connected societies where compute is available to relatively anonymous customers and where personal relations tend to be mediated by digital media, allowing an initially disembodied AGI to impersonate humans by generating realistic images, voice, and video streams.

If you find yourself trying to poke holes in this definition or thinking that if we grant the system doesn't have to be fully general then it can't be called AGI, stop doing that and continue reading :)

Why AGI makes sense

Some, like Steven Pinker here, think Artificial General Intelligence doesn't make sense as a concept and this is very likely driving their dismissal of AI risk:

[...] I think the concept of “general intelligence” is meaningless. (I’m not referring to the psychometric variable g, also called “general intelligence,” namely the principal component of correlated variation across IQ subtests...) I find most characterizations of AGI to be either circular (such as “smarter than humans in every way,” begging the question of what “smarter” means) or mystical—a kind of omniscient, omnipotent, and clairvoyant power to solve any problem. [...]

If we do try to define “intelligence” in terms of mechanism rather than magic, it seems to me it would be something like “the ability to use information to attain a goal in an environment.” [...] Specifying the goal is critical to any definition of intelligence: a given strategy in basketball will be intelligent if you’re trying to win a game and stupid if you’re trying to throw it. So is the environment: a given strategy can be smart under NBA rules and stupid under college rules.

Since a goal itself is neither intelligent or unintelligent (Hume and all that), but must be exogenously built into a system, and since no physical system has clairvoyance for all the laws of the world it inhabits down to the last butterfly wing-flap, this implies that there are as many intelligences as there are goals and environments. There will be no omnipotent superintelligence or wonder algorithm (or singularity or AGI or existential threat or foom), just better and better gadgets.

The definition that Pinker ultimately grants for "general intelligence", being able to attain goals in an environment, is sufficient to make the concept coherent in general. You can think of generality as a system being able to attain arbitrary goals it is given. Humans, monkeys, and parrots can be trained to attain some such goals, but only a human, for now, would be able to, for example, take a description for a software system and write its code, and the same is true for many tasks, especially when limited to cognitive tasks. In that sense humans have intelligence that is more general than that of monkeys.

Something that is true, and that perhaps Pinker is getting at, is that the optimality of an agent depends on its environment and what the goal is. For example, we wouldn't think that AlphaFold is a general intelligence, and yet in the domain of "Being given protein sequences and having to produce 3D representations of proteins" it outperforms humans. Moreover, it does make sense that very powerful domain-specific systems beat general intelligences: Once a task has been defined one can design the entire system and its hardware around optimizing for a single objective. The ultimate expression of this is etching algorithms in hardware (As seen in ASICs). AGIs won't be able to outperform these domain-specific systems in their specific domains, but just like humans they will be able to recognize that sometimes one does want to design and deploy such systems to supplement the AGI's capabilities.

It is then trivial to see that systems can exist that for many goals and one given environment (Earth in the 21st century) could outperform another system (a collective of humans), if allowed enough time and resources.

The AGI risk scenarios

These are two scenarios I came across, meant to be representative of a certain class of argument, but definitely not all proposed AI risk scenarios:

A superintelligence might just skip language entirely and figure out a weird pattern of buzzes and hums that causes conscious thought to seize up², and which knocks anyone who hears it into a weird hypnotizable state in which they’ll do anything the superintelligence asks (Alexander, 2016)

it [the AGI] gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they're dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery ... The nanomachinery builds diamondoid bacteria, that replicate with solar power and atmospheric CHON, maybe aggregate into some miniature rockets or jets so they can ride the jetstream to spread across the Earth's atmosphere, get into human bloodstreams and hide, strike on a timer.. (Yudkowsky, 2022)

[2]. Sure, hypnosis is a thing, and epilepsy is a thing, and trance states are a thing too, but what these have in common is that one generally has to give in to the experience, or be predisposed to it. The average person is safe. One could cause headaches from repeated exposure to some sounds, but one has plenty of time to lower the volume

Predictably, others have already written what I and many others think when we see scenarios like these:

Specifically, with the claim that bringing up MNT [i.e. nanotech] is unnecessary, both in the "burdensome detail" sense and "needlessly science-fictional and likely to trigger absurdity heuristics" sense. (leplen at LessWrong, 2013)

I have some general concerns about the existing writing on existential accidents. So first there's just still very little of it. It really is just mostly Superintelligence and essays by Eliezer Yudkowsky, and then sort of a handful of shorter essays and talks that express very similar concerns. There's also been very little substantive written criticism of it. Many people have expressed doubts or been dismissive of it, but there's very little in the way of skeptical experts who are sitting down and fully engaging with it, and writing down point by point where they disagree or where they think the mistakes are. Most of the work on existential accidents was also written before large changes in the field of AI, especially before the recent rise of deep learning, and also before work like 'Concrete Problems in AI Safety,' which laid out safety concerns in a way which is more recognizable to AI researchers today.

Most of the arguments for existential accidents often rely on these sort of fuzzy, abstract concepts like optimization power or general intelligence or goals, and toy thought experiments like the paper clipper example. And certainly thought experiments and abstract concepts do have some force, but it's not clear exactly how strong a source of evidence we should take these as. Then lastly, although many AI researchers actually have expressed concern about existential accidents, for example Stuart Russell, it does seem to be the case that many, and perhaps most AI researchers who encounter at least abridged or summarized versions of these concerns tend to bounce off them or just find them not very plausible. I think we should take that seriously.

I also have some more concrete concerns about writing on existential accidents. You should certainly take these concerns with a grain of salt because I am not a technical researcher, although I have talked to technical researchers who have essentially similar or even the same concerns. The general concern I have is that these toy scenarios are quite difficult to map onto something that looks more recognizably plausible. So these scenarios often involve, again, massive jumps in the capabilities of a single system, but it's really not clear that we should expect such jumps or find them plausible. This is a wooly issue. I would recommend checking out writing by Katja Grace or Paul Christiano online. That sort of lays out some concerns about the plausibility of massive jumps. (Garfinkel, 2019)

Informally, a large proportion of AI safety writing, especially in the early days, has been influenced by the writings of Eliezer Yudkowsky. Yudkowsky has not written in detail about AGI risk scenarios, and when he does, his lower bound (i.e. the simplest) threat model still involves something that will sound preposterous to many:

The concrete example I usually use here is nanotech, because there's been pretty detailed analysis of what definitely look like physically attainable lower bounds on what should be possible with nanotech, and those lower bounds are sufficient to carry the point. My lower-bound model of "how a sufficiently powerful intelligence would kill everyone, if it didn't want to not do that" is that it gets access to the Internet, emails some DNA sequences to any of the many many online firms that will take a DNA sequence in the email and ship you back proteins, and bribes/persuades some human who has no idea they're dealing with an AGI to mix proteins in a beaker, which then form a first-stage nanofactory which can build the actual nanomachinery. (Yudkowsky, 2022)

I suspect one of the reasons (implicitly or explicitly) why there has been little attention to discussing specific scenarios of AGI risk is the idea of Vingean Uncertainty.

Vingean Uncertainty: Thoughts that cannot be thought

Vingean uncertainty is a concept coined by Eliezer Yudkowsky to refer to the fact that with very intelligent systems, we can be relatively sure that they will achieve their goals while at the same time being very uncertain as to how. Think of AlphaGo: we know AlphaGo will crush you at Go, and sometimes a grandmaster can guess what AlphaGo will play, but sometimes it will play something that seems to make little sense, though that move is still a winning move. Related to this concept is strong cognitive uncontainability, the idea that one can't even conceive of the sort of actions that a very intelligent system might take. If one really buys into this then it makes a lot of sense, as many in AI safety do, to just take it as a given that an AGI will achieve its goal, and not think much about how: it will always find a way, and even if we try to brainstorm all possible ways it could take over, it will still find other ways.

Fun fact: Vernor Vinge himself didn't believe [a variant of] this and I tend to side with Vinge here (He was replying, it seems, to an older essay from Eliezer Yudkowsky)

An example given in the links above is air conditioning. Someone living in the 10th century, if asked to brainstorm ways to cool a room, would not have thought of air conditioning. When presented with an AC unit they may not immediately understand what it is. This example works if physics that are not known to the relevant society are at play. But if we assume a narrower version of technological pessimism (some flavor of physics pessimism, which I do buy) then this kind of example wouldn't work anymore.

If the AGI will always axiomatically find a way out, then of course it makes sense to not spend much time exploring AGI dominance.

As a cautionary principle, expecting the unexpected is reasonable, but that should not stop us from trying hard to think about concrete scenarios and potential mitigations. Maybe that doesn't provably eliminate risk, but it does seem to me one can reduce it. The reason I don't buy "strong" Vingean uncertainty is that, unlike in the air conditioning example, we roughly know the laws of physics. We also know some things about social science, and it is exceedingly unlikely that there are big regularities left waiting to be exploited in unexpected ways. We can foresee some things an AGI would try to do without that much difficulty (instrumental convergence): It would try to acquire resources and protect itself at the very least. Initially it will do so by sending requests over the internet, most likely hacking, bribing, and impersonating. Eventually it will need to find a way to issue orders to humans, and plausibly build robots. This is by no means the complete set of things, but a fairly complete set is not unimaginable. The reasoning behind the plans the system would make would be alien to us, but whereas the specifics are alien, the overarching goals, and most likely the high-level steps the system would take, would be clear.

The strongest bounds we can place on the actions taken by an AGI are those given by the laws of physics directly. Weaker are the bounds that we suspect could be derived from the laws of physics but not in a trivial way. In theory there is a limit, derivable from first principles, to how much computation a given amount of matter can do (Bremermann's limit), but in practice the situation may be worse than what's discussed in the Wikipedia article: There may be other constraints besides "doing the computation", like thermal dissipation, I/O, and the properties of the materials required to implement the computing unit. It's of little use in the immediate future if the only way to approach Bremermann's limit is leveraging a black hole for compute or something of that sort.
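To get a feel for how loose this first-principles bound is, here is a minimal sketch that evaluates Bremermann's limit. The formula used, c²/h bits per second per kilogram, is the standard statement of the limit and is not spelled out in this essay; note that it is expressed per unit mass rather than per unit volume. The function name is made up for this example.

```python
# Bremermann's limit: upper bound on the computation rate of a
# self-contained system, usually stated as c^2 / h bits per second per kg.
# The formula is the standard textbook statement, not taken from the essay.

C = 299_792_458.0     # speed of light in m/s (exact SI value)
H = 6.62607015e-34    # Planck constant in J*s (exact SI value)


def bremermann_bits_per_second(mass_kg: float) -> float:
    """Upper bound on bits processed per second by a system of this mass."""
    return mass_kg * C**2 / H


limit_1kg = bremermann_bits_per_second(1.0)
print(f"{limit_1kg:.3e} bits/s per kg")
```

The result is on the order of 10^50 bits per second per kilogram, far from anything the practical constraints mentioned above (heat, I/O, materials) would allow in the near term.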

This still leaves some cases where we would be wise to expect the unexpected: Every software engineer has been in a situation where the code seems to work, where it passes tests, where the linters say it's ok, and yet it fails or has a bug somewhere in the implementation.

The Pegasus vulnerability is one example of a surprising vulnerability: Who would have thought, in 2021, that an old image codec (JBIG2) buried in the entrails of a PDF parser (Xpdf) could be used to construct a Turing machine inside of it when reading an image, and that this could then be used to take over a phone? Just by opening an image! In a later section I have some things to say about cybersecurity, but suffice it to say here that this example establishes that we should expect novel kinds of cyberattacks even in systems that seem sandboxed and safe. We won't be surprised once we see them, and we will react to them as the Google team did ("It's pretty incredible, and at the same time, pretty terrifying.").

These hacking examples can be generalized to systems that can be efficiently simulated and/or where strong deterministic relationships exist (Like code, or games). I don't think one can extend this to, say, the AI inventing a hitherto unthinkable form of social manipulation or a plan that involves an unusual series of actions in the real world. Some examples of this:

One day the AGI is throwing cupcakes at a puppy in a very precisely temperature-controlled room. A few days later, a civil war breaks out in Brazil. Then 2 million people die of an unusually nasty flu, and also it’s mostly the 2 million people who are best at handling emergencies but that won’t be obvious for a while, because of course first responders are exposed more than most. At some point there’s a Buzzfeed article on how, through a series of surprising accidents, a puppy-cupcake meme triggered the civil war in Brazil (Wentworth, 2022)

Can someone build world models as accurate as the ones implied by this quote? I am very skeptical!

Minimum Viable AGI Risk

Why aren't organizations like say Google maximally efficient? To some extent, cognitive limitations. Each person can only proficiently do a small number of tasks in a workday, so your engineers can't also be accountants, lawyers, and designers. You can't hold the entire codebase in memory so you need teams to build and maintain search tools. And chiefly, we can't directly share thoughts: Time must be spent in coordination and meetings to iron out differences and jointly plan. Employees might quit so one has to hire accounting for turnover. As hiring increases one needs to hire managers and managers of managers. At some point perhaps one gets office politics or corporate whistleblowers.

Now we get into some measure of science fiction, but we do so purely as a device to construct an analogy, not to propose a real scenario. Imagine we could get rid of most of that by assuming that everyone in the company can communicate with everyone else telepathically, that everyone is also gifted with enough memory to remember as much as they'd like, and that they can think 100x faster than a regular person can.

I think this entity would be sufficient to pose an existential risk to mankind, if allowed enough time. I am not saying here that the upper limit of what an AGI could do is what the Google Borg could do, just making the narrower claim that this is all we need to consider AI to be a problem. A fortiori, more advanced forms of AGI would be an issue.

Some discussions of AGI risk distinguish between a population of multiple systems and a single system. This distinction is not that interesting to me: A single agent or a group of well aligned agents are for many intents and purposes the same; we currently think this way when reasoning about groups of persons like corporations. So while the hypothetical example here is a Google Borg, you can think of it as a single agent that can do what a large group of humans with access to computers can do. It is likely that because domain-specific systems are more efficient, all else equal, an AGI would still rely on smaller models for specific tasks, in a master-slave relationship. These models need not be general or agentic, e.g. maybe the AGI would develop a model that's great at writing code and nothing else, and another to do robotics research and nothing else.

Intelligence

In this essay, take intelligence to mean something in the vicinity of capable, or: For a given allocation of resources and time, more intelligent systems accomplish more of their goals than less intelligent systems, on average. An Artificial General Intelligence is generally intelligent in that it is capable of accomplishing most if not all goals that you could think of as a human, and doing so with greater speed and/or probability of success than a human would be able to.

This doesn't mean that for any task in any environment an AGI will be better than a purpose-specific system. There is a no free lunch theorem; however, the theorem doesn't apply if, instead of every possible task, we just think of this system as being better than any other algorithm at a subset of tasks that are relevant to its purposes.

If you take humans and chimpanzees, humans outclass chimpanzees at almost any cognitive task, especially the ones that have enabled humans to become the dominant species on Earth. Chimpanzees might be better than humans at some tasks like short term memory tasks but this sort of skill is not key in comparison with others (learning, communication, strategic planning) given the goal of becoming a dominant species.

There is one line of argument in the AI risk world that starts with this picture (From Nick Bostrom's book)

My initial reaction to this picture is that the scale is not quite right, that humans are qualitatively at a different level from the rest of living creatures, and that that quality is almost binary. A lizard or group thereof, no matter how much time they are given, won't get you civilization. But then here one could ask the same question: "If you take the village idiot in the picture and task them with coming up with string theory, would they be able to, in any reasonable amount of time?" It's not that clear to me! If not, why not? Some combination of IQ, memory, and assorted capabilities. We perhaps can imagine how it feels to be less intelligent (One experiment would be to get drunk I guess), but we find it hard to imagine how we could be more intelligent. As an intuition pump take von Neumann, and what his peers said of him:

“There was a seminar for advanced students in Zürich that I was teaching and von Neumann was in the class. I came to a certain theorem, and I said it is not proved and it may be difficult. Von Neumann didn’t say anything but after five minutes he raised his hand. When I called on him he went to the blackboard and proceeded to write down the proof. After that I was afraid of von Neumann.” (George Polya)

I have sometimes wondered whether a brain like von Neumann's does not indicate a species superior to that of man. (Hans Bethe)

It could be tempting to just say that over some threshold (undergraduate students?), a sufficient number of them, given enough time can achieve anything a von Neumann could achieve. I am not yet sure how to think about this!

For an even stronger sense of what higher intelligence would mean, consider protein folding. AlphaFold can fold proteins (i.e. produce an accurate position map of 2337 residues, or a few tens of thousands of atoms) in a few hours. Sure enough, a human being could do what AlphaFold does, given pen and paper and a very very long time (doing all the required matrix multiplications), but they would not thereby gain an understanding of what it is like to fold a protein. One could also try to understand some basic rules, and maybe for simpler structures it is actually possible to do protein folding by thinking hard enough, but this does not work for the general case.

In some deep sense, AlphaFold 'understands' protein folding and you don't. It's possible, I suspect, to gain a glimpse of this kind of understanding by playing around with small proteins, reverse engineering the AlphaFold predictions, but getting the feeling that accompanies the full understanding of the phenomenon would remain away from you. Carl Shulman pointed out during the review of this essay that we do have handrcrafted protein folding software, and that we understand the way this software works. This is true: If one takes models other than AlphaFold2 one doesn't find the kinds of rules one would hope to understand (Like Fourier's law or the Navier-Stokes equations say), one finds various forms of energy minimization where one defines the system and simulates some simplifed physics. What I think we'd need to understand it would be rules like "Alanine over here is charged in such way and it causes these other aminoacids to curl up" or "These three aminoacids here probably form a pocket that binds to that other wiggle over there" and so forth. This is doable (We have the concepts of alpha helices or beta sheets), but not in a detailed enough way to allow one to use these concepts to fold proteins in a reasonable amount of time by hand.

There is a whole section on 'understanding' that I could write but don't have time to discuss here so this will have to suffice for now.

To this deeper understanding of systems, you could add perfect memory: Imagine knowing all publicly available information. Say you're reading a paper. In What should you remember? I mention a case where I made a connection between two hitherto unrelated (to my knowledge) biological entities (CD38 and NAD). The details don't matter here, what matters is that I thought of an interesting connection (as it happened, there was a recent paper that had actually empirically shown that but I wasn't initially aware) because I was able to remember facts about such entities. The more things you can recall, and especially the more relevant things you can recall, the more and better plans of action you can conceive.

The limits of intelligence

One critique of AI risk scenarios is that intelligence has diminishing marginal returns in noisy environments; as a quick example an AGI won't do better than you at guessing the outcome of flipping a perfectly balanced coin.

If you wanted to, say, start a civil war in Brazil, is there a set of actions that reliably could lead to such an outcome? One commentator speculates actions like AGIs throwing a cupcake at a dog could lead to that. That doesn't seem that realistic to me. We do know that superhuman intelligences can take unexpected actions that don't seem to make sense until later on, like AlphaGo's Move 37, but the real world is not an easily predictable environment.

Take Hitler. Scott Alexander says that

Hitler leveraged his skill at oratory and his understanding of people’s darkest prejudices to take over a continent. Why should we expect superintelligences to do worse than humans far less skilled than they?

Hitler joined a small political party, the Deutsche Arbeiterpartei, and used his oratory to gain new members, eventually becoming its leader when it became the NSDAP. From here, he took over Germany and the rest is history. What Hitler did is not something an arbitrarily chosen person born when he was could have done. But his success was not predetermined either: All sorts of blunders or opponent moves could have derailed his plans. Hitler also of course leveraged the fact that Germany was really angry at the world for the position it ended in after WWI. In fact his own agenda came from the same sense of national humiliation that he then leveraged to gain power. Had he had an arbitrary agenda (paperclips are super important, we must make them) it's unlikely he'd have been able to gain power to pursue it. He could have used his original agenda to then move on to paperclips, but the point I'm making here is that no amount of superhuman persuasion would have convinced the nation of Germany back then that paperclips are the only thing worth making.

In Complexity No Bar to AI, Gwern discusses why some problems being complex (NP-hard say) is no barrier to AGI. He is right: Approximations can get you far enough! Moreover, the advantage of an AGI system doesn't have to be orders of magnitude to win in some contexts:

small advantages on a task do translate to large real-world consequences, particularly in competitive settings. A horse or an athlete wins a race by a fraction of a second; a stock-market investing edge of 1% annually is worth a billionaire’s fortune; a slight advantage in picking each move in a game likes chess translates to almost certain victory (consider how AlphaGo’s ranking changed with small improvements in the CNN’s ability to predict next moves); a logistics/shipping company which could shave the remaining 1–2% of inefficiency off its planning algorithms would have a major advantage over its rivals inasmuch as shipping is one their major costs & the profit margin of such companies is itself only a few percentage points of revenue; or consider network effects & winner-take-all markets. (Or think about safety in something like self-driving cars: even a small absolute difference in ‘reaction times’ between humans and machines could be enough to drive humans out of the market and perhaps ultimately even make them illegal.)

But my worry here is different: setting aside molecularly accurate simulations of the world, I think my estimate of what's possible to predict with large but reasonable amounts of suitably arranged computing power is lower than what's in the minds of many writers on AI safety. In Appendix A I discuss Curtis Yarvin's critique, which hinges on intelligence having diminishing marginal returns. I do buy that argument, so for a while I thought that AI would not be risky because the risk would have to come from the kinds of abilities that I think a system can't have. But it turns out that, even without those, one can still have a problem if the system can operate at high velocity and is superhumanly skilled at planning.

How easy is it to take over the world?

Some scattered thoughts and notes I took while exploring. I ended up focusing on cybersecurity (hacking over the internet seems something an AGI would totally try to do), bioweapons (Because it's the closest we have to nanotechnology, and because manufactured pandemics seem like something one could do relatively covertly), and manipulation because that struck me as an implausible skill one can take to superhuman levels.

Cybersecurity

One of the things an AGI could do is try to hack into various kinds of systems (government, military, energy generation, manufacturing) to use them, extract information, or for other purposes. How likely is this? Cyberattacks are definitely not science fiction: In a document from 2020 Luke Muehlhauser collects examples of such hacks and their prevalence. Some particular additional scenarios:

Stuxnet (2010), where Iranian gas centrifuges, used to enrich uranium for nuclear weapons, were made to tear themselves apart.
- In general, it is possible to hack into and destroy critical physical equipment
- But note that in general, the systems that control say nuclear power stations are not connected to the internet to prevent precisely these kinds of scenarios. It's not trivial to gain enough control over a power plant to send commands to shut it down or cause it to melt down. The Stuxnet case however shows that it is possible to engineer malware that (via USB drives) can cross that air gap if the operators are not careful and plug these drives into systems that have access to the controls
In 2008, a spyware worm found its way into the US military's Central Command
In 2009, a number of US companies, including Google, were hacked (Operation Aurora); attackers were able to access source code and information, including private Gmail accounts. Google massively bolstered the security of its systems in response.
In 2017, because of a vulnerability in a customer complaint form, Chinese hackers were able to eventually get into Equifax's systems and access information on 147 million people, including birth dates, addresses, or driver's license numbers (Which one could then potentially use for identity theft). Equifax took over two months to notice they had been compromised.
In 2021 a Brazilian meat-packing company paid a ransom of $11M (in bitcoin) to revert an attack that took their entire US operation offline
In 2021, an oil pipeline in the US was compromised. The operator paid $5M (in bitcoin) to resume operations
In general, there exist firms that can be hired to perform ransomware attacks. The revenues from such attacks are close to $100M for one such firm.
In 2021, the Log4Shell exploit left a large number of systems open to access by anyone who simply sent them an HTTP GET request. This would not necessarily be root access, but in practice the same access level the server is running at should be enough for most nefarious purposes.
In 2022, a large fraction of NVIDIA's IP was stolen

Given these background facts it seems very plausible that if one had an AGI, one would be able to obtain resources either by holding cyber assets hostage or by directly taking them over.

However, it's unclear to what extent adding AGIs to the mix makes the problem that much worse than what might happen in upcoming years. The internet is already a nasty Hobbesian jungle where systems are constantly under attack.

I can imagine purpose-specific systems being developed to hack and find vulnerabilities (Naively, train on code, predict CVEs). Those could be deployed both to police codebases and make sure the code has no vulnerabilities, and to find new exploits. On net, the development of AI systems for improving cybersecurity should be a priority. Companies like OpenAI (originators of the Codex model for code generation) are well positioned to integrate these with their existing models to ensure the code they generate is not making programmers autocomplete their code with hackable functions.

Could an AGI copy itself all over the world?

Some scenarios of AI risk posit an AGI spreading over the internet by copying itself to other systems to gain capabilities and resilience. This is unlikely to be a major issue at least at first under the very plausible assumption that the model has a substantial size. As an example, running the original GPT3 (175B parameters) would require some 350 GB of memory to load the weights. Google's PaLM gets to the Terabyte scale. Even Chinchilla's 70B parameters require 140 GB. Of course these systems are far from AGI, so one can only imagine how much more massive AGI-level systems will be. In the foreseeable future, personal computing devices will remain far below the memory requirements of AGI-level systems. A fleet of such devices wouldn't be that useful either: The network latency involved in doing inference by sending results of computation back and forth across the internet would make for a very slow system that would be consequently easy to counter. Hence the only places an AGI could effectively copy itself to and make use of the copied parameters are large datacenters: If AWS, Alibaba Cloud, Azure, GCP, and a handful of others have decent security measures (Perhaps the ones they have right now), an AGI could be more easily stopped.
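
The memory figures above can be reproduced with a quick back-of-the-envelope calculation. The sketch below assumes 2 bytes per parameter (half precision) and decimal gigabytes, neither of which is stated in the text; the 540B figure for PaLM is its publicly reported parameter count, added here for illustration, and the function and variable names are made up for this example. It only counts raw weights, so actual requirements (activations, caches) would be higher.

```python
# Back-of-the-envelope memory needed just to hold model weights.
# Assumptions (not from the essay): 2 bytes per parameter (fp16/bf16)
# and decimal gigabytes (1 GB = 1e9 bytes).

def weights_memory_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory in GB to load the raw weights, ignoring activations and caches."""
    return n_params * bytes_per_param / 1e9

# GPT-3 and Chinchilla sizes are quoted in the essay;
# PaLM's 540B is its publicly reported size (illustrative addition).
MODELS = {
    "GPT-3": 175e9,
    "Chinchilla": 70e9,
    "PaLM": 540e9,
}

def report(models=MODELS):
    """Return a {model name: weight memory in GB} mapping."""
    return {name: weights_memory_gb(n) for name, n in models.items()}

if __name__ == "__main__":
    for name, gb in report().items():
        print(f"{name}: ~{gb:.0f} GB")
```

Under these assumptions even the smallest of the three needs well over a hundred gigabytes just for weights, which is the gap between such models and consumer hardware that the argument relies on.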

More realistically than copying itself all over the internet, an AGI system could commandeer unprotected systems to deploy purpose-specific programs (As hackers do now with DDoS botnets). This would still leave the core of the system centralized and vulnerable to an attack.

This could be prevented to some extent. Many or most companies get hacked, but fortunately the kinds of companies that hold large computational resources seem to be fairly secure, and increasingly so. There are some further actions that could be taken to help strengthen cybersecurity: one could develop purpose-specific AI systems to find bugs in open source software that have been introduced inadvertently or purposefully (perhaps by an AGI as part of a plan to break into a system that relies on said OSS). DARPA has a program working on precisely that, SocialCyber.

Some objections raised during the review

Carl Shulman points out that distributed computing is more feasible than I think; the penalty is real but can be worked with, and that datacenters have a large chunk of compute that could be seized without the owners noticing. This latter is right (obviously), and this is precisely why I think if the right datacenters are secure, then most of the relevant compute is secure, even if 99.99% of the personal computers are not. A relevant post here he links me to is The prototypical catastrophic AI action is getting root access to its datacenter. If I were any of the relevant companies linked above I would be spending a lot of resources in avoiding this scenario. I don't know how cybersecurity works at any of these but I would be very surprised if getting root is trivial for an AGI.

Niccolò Zanichelli is not sure AWS et al. have secure enough measures to resist a takeover. Sure, individual accounts get hacked and individuals get bribed, but to this day these platforms have not been compromised as extensively as the root access scenario requires (as far as we know).

He also tells me that not much work has gone into decentralized training because there is no economic reason to do it. Perhaps an AGI could do it better. A paper he links me to simulates training of large models (GPT3 scale and beyond) over the internet, finding a slowdown of 1.3-3.5x, with the worst case being 8 cities across 3 continents. This latter setup has 64 Tesla V100 GPUs. I'm ok with this: I had in mind a takeover of regular desktop computers or IoT devices. IoT devices are easier to hack and in my mental model I was mostly thinking of this. Desktop computers are very hard to hack, and I would expect the kind of person that has V100 GPUs at their disposal will be more careful than the kind of user that gets their personal computer hacked.

Lastly, Matthew McAteer points to model compression (And Niccolò to distillation, and Josh Albrecht to recent results like RETRO, mixture of experts models) as ways in which the scenario is, on the margin, more plausible. This all seems okay to me: I could be wrong if models could be compressed to the point of running on IoT devices or random unprotected servers without GPUs. I have to admit that without extensively reading that literature I don't have a good sense of how far these techniques will go, and thus how easy it would really be for AGIs to copy themselves over the internet. I can only hope the computational requirements of progressively powerful models continue to outpace the capabilities of the median hackable server.

Air gaps

One way to secure systems from hacking is to have them disconnected from the internet ("air gapping"). This doesn't always work (See the Stuxnet attack described earlier), but importantly, we do have good models of how air gaps can and cannot be breached: USB pen drives (physically breaching the gap as in the 2008 attack on the US military described earlier) are one way. It is also possible to repurpose some of the electronics in a computer as radio transmitters. [Here](https://cyber.bgu.ac.il/advanced-cyber/air gap) some such vulnerabilities are described (See video). But it is one thing to leak data, or to be able to transmit it if a computer nearby is suitably equipped and is running the right software, and another to hack your way out of an air gapped computer. One should not extrapolate from air gap exfiltration to being able to hack into a computer without touching it, having only a weak antenna available. To this date no one has shown an exploit or an in-principle theoretical example of taking control of an air gapped computer against the will of its owner (As opposed to reading data from it), and it is most likely the case that given computers of the sort that will exist in a few years, air gaps will only be breached in the ways they are now: social engineering.

Bioweapons production

The DNA sequences to make Ebola or anthrax are online. Can one just go and make them by sending emails? Apparently yes!

This is something a terrorist might be able to do right now but it hasn't happened so far. Are there insurmountable difficulties even for an AGI? Probably not.

What does it take to actually produce bioweapons? Surprisingly I could not find any detailed discussion of the matter in the usual fora where AI risk is discussed, despite some individuals trying to elicit such stories from the community. I tried searching the following terms, which I hoped would surface the relevant discussion: DNA, ebola, anthrax, biosafety, addgene, twist biosciences, genscript. The one exception is Tessa Alexanian's post and resources therein. Though I did not check all the links from those resources, it does seem that the issue has not been discussed at length in those spaces.

So here's some discussion. I focus on bioweapons as opposed to chemical weapons because the latter are relatively limited in the area affected whereas bioweapons could in principle spread out of control. Some chemical weapons like ricin could be easily made but would not pose an existential threat: their effect is localized and it would take time to build up enough of the substance.

There are precedents for the use of bioweapons by non-state agents: In 2001 a US biodefense researcher mailed anthrax to a number of US media offices and politicians, killing 5 people and infecting 17 others; however anthrax does not spread between people.

The sequences for mean things like the plague are all online. So what are bioterrorists waiting for? The cost is not prohibitive either: In what's probably the most well known case of a researcher doing this, the horsepox synthesis, it took $100k and "did not require exceptional biochemical knowledge or skills, significant funds or significant time."

Interestingly the author of the horsepox paper did it in response to an endless debate over whether it would be possible to revive old viruses by just ordering bits of DNA on the internet and assembling them together:

In 2015, a special group convened by WHO to discuss the implications of synthetic biology for smallpox concluded that the technical hurdles had been overcome. "Henceforth there will always be the potential to recreate variola virus and therefore the risk of smallpox happening again can never be eradicated," the group's report said. But Evans felt like the matter was never really put to rest. "The first response was, ‘Well let's have another committee to review it,' and then there was another committee, and then there was another committee that reviewed that committee, and they brought people like me back to interview us and see whether we thought it was real," he says. "It became a little bit ludicrous."

Evans says he did the experiment in part to end the debate about whether recreating a poxvirus was feasible, he says. "The world just needs to accept the fact that you can do this and now we have to figure out what is the best strategy for dealing with that," he says.

However, DNA synthesis companies screen the sequences they get. Smallpox is a banned one whereas horsepox isn't (it doesn't infect humans). Synthesizing smallpox itself wouldn't be the worst one could do: after all it's the virus for which the first vaccine was invented.

There's an ongoing effort to find better ways to screen requests to manufacture DNA (that got some funding from FTX Future Fund), but even this won't be a silver bullet, because one can always invent new sequences for more potent and infectious agents:

Nicholas Evans, the bioethicist, thinks that new rules need to be put in place given the state of the science. "Soon with synthetic biology ... we're going to talk about viruses that never existed in nature in the first place," he says. "Someone could create something as lethal as smallpox and as infectious as smallpox without ever creating smallpox."

I suspect the path to programmatically send some API calls to Twist Biosciences and book someone from Taskrabbit to mix some vials is a complex but not impossible one. Molecular biology is quite finicky and the person doing the mixing on behalf of the AGI would have to have some molecular biology experience. Later I sketch a scenario where an AGI could attempt to make smallpox remotely. It would be great if someone actually tried to do this; given enough time, the planning and execution of a "bioweapons penetration test" should be possible.

Would an AGI want to engineer a plague?

Initially, an AGI would have to work with humans to acquire physical resources. Thus a deadly pandemic would not be the first thing an AGI would go for; same goes for throwing the world into a nuclear winter. But an AGI could use those as a threat. At that point it's up to governments to decide what their approach is to negotiating with terrorists. If the AGI expects governments won't yield then it probably won't issue the threat. Issuing the threat in a credible way may also lead to the AGI revealing its presence to governments.

However, once the AGI has some power base established that is independent of human activity, it may want to release a plague; perhaps starting to manufacture and position release vectors from the beginning. We could probably use some numbers and historical examples here to see what the reasonable bounds would be for plagues and how disruptive they would be.

After writing this section, I became more convinced that the key use of bioweapons in the short-medium term would be tactical, not strategic; and so one should also consider chemical attacks in a broader analysis (e.g. against specific individuals, military bases, etc).

Manipulation

Some scenarios of AI risk posit a system that can manipulate humans in arbitrary ways. This could obviously be of use in cyberwarfare, as social engineering is already a known way to access computer systems illicitly. Moreover, an AGI could design and launch a number of cryptocurrency projects in parallel and profit from them. This one particularly unhinged project with anonymous funders netted one of them some 6 million dollars (or so I've heard). OlympusDAO, whose founder "Zeus" remains anonymous, had a market capitalization of over a billion dollars at one point. The persuasion here would involve stellar design skills, potentially creating fake Discord and Twitter accounts to shill the project: nothing that hasn't been done before, but here we would see that applied in a highly parallel way with greater skill. Hence some forms of persuasion could also be used to extract and gain access to resources in a straightforward way.

There are other cases that I find implausible, for reasons I won't go into in depth here, and which I don't take that seriously (The AI Box experiments; here's one example of a roleplayed version of this). Here the idea would be that one potential way to contain AGIs, having a human talk to them while preventing the system from accessing the internet directly, would fail because the AGI would manipulate the human into letting it out. If a situation occurs where a system is known or suspected to be an AGI and is developed at an organization that has knowledge of what is going on, and it came down to one person to let the system access the internet, I am very optimistic about the outcome. However, what one thinks of AI Boxing is irrelevant because in reality what would happen is the system would pretend to be a human talking to someone else over the internet. That, coupled with the skills of a master-level social engineer, is enough for the system to be problematic. There have been cases where anonymous individuals have ended up giving rise to large political cults like QAnon, but one should not mistake this for being able to systematically orchestrate cults as an anonymous founder for specific ends. The QAnon case happened as an intersection of the right ideas and context, and it's very unlikely one could have derived that such a specific meme would have the effects it did, even if one were superintelligent.

Are there qualitatively different kinds of social manipulation that are available only if one is an AGI? I think mostly no; but this leaves open a large field of possible exploits on vulnerable human psychologies. A related question that seems to have a similar answer: Can we rule out psychohistory? Mostly yes I think.

Human manipulation is an activity that in many forms is legal and highly profitable. Don't think of manipulation as anything coercive or necessarily evil; think of it in less morally loaded terms as getting others to do what you want by just talking or sharing information with them without any coercion. Marketing and sales are examples of this, and this is an area where market pressures should have created incentives to discover many useful ways to leverage human psychology. There is some work on this like Influence (One of the examples of social psychology that replicates!). A more advanced form of this relies on using ML to estimate who will buy a given product and try to target them with relevant ads; but obviously this will rarely convince you to buy something if you weren't somewhat close to actually wanting it. Something that would make me more likely to believe in superhuman manipulation would be any cases where someone has discovered manipulation techniques that were extremely surprising in how well they work. The longer a field has gone without making breakthroughs the longer in my view, all else equal, it will continue not making them (Sort of a reverse Lindy effect).

Some objections raised during the review

Carl Shulman points me to this post on the economic value the AGI would have at its disposal by merely existing: Its own weights (Or a distillation thereof) are extremely valuable. The system could do all sorts of activities online, from writing books and posting them to Amazon to starting SaaS companies, and rapidly acquire resources. But I don't know the extent to which this is that much of a big deal at first. He also says it would have a strong moral case for human rights; I take this to mean that it may be able to convince some people that it is worthy of human rights and that it is up to nothing nefarious. David Deutsch (See Appendix A) is already convinced of the former, so it's not implausible. But I still see it as unlikely: most people won't grant rights to an AGI, especially one that does not seem to be under any human's control.

He also points to more efficient operation of military equipment as a means to achieve power with fewer resources. This is indeed true; so true that there is at least one company whose whole schtick is that the US military command and control systems are antiquated and pose an operational burden, a topic explored at length in The Kill Chain.

Another point here is that as we make progress towards AGI, we will have more purpose-specific AI systems capable of some independent action, and that arms race dynamics can lead to chunks of militaries being run by AI long before AGI arrives. Hacking people's brains is hard, but hacking a robot army is comparatively easier?

Some AGI risk vignettes

Ok so we have a system that's effectively a team the size of Google, copied itself across one or two datacenters, and deployed some purpose-specific models. All safeguards have failed and we are in the worst case scenario. What could happen next?

An AGI breaking into, say, a department store's internal records would immediately recognize the name of a key person in the nuclear chain of command of the United States (Whereas a regular hacker might not even think of that, nor have an interest in getting control of US nuclear weapons). Knowing that, the AGI would know the person shops there and when. Then it could get started by having that person's phone stolen, or having a drone go to their house and engage in some Van Eck hacking to steal passwords and private information. Each step of the plan sounds by itself understandable by you, but being able to think of a large number of complex plans fast is something where it does seem obvious an AGI would have a substantial upper hand.
As above, figuring out where key members of the US military live, and launching a concurrent decapitation strike using drones and lightweight explosives
The scenario described in Daemon where the system creates tasks and rewards (perhaps in some cryptocurrency) for real world operatives to carry out.
Using identity theft (You could get some random's personal details), opening a real world company and associated bank account via e.g. RocketLawyer, Stripe Atlas (or similar). These require no in-person interactions and have relatively lightweight identity verification which could be spoofed by an AGI. Then, an executive assistant or chief of staff would be hired (Without initially meeting the "founder", because of some fabricated excuse), and that person would act as a representative offline for tasks that require in-person presence. The founder could pose as some kind of reclusive brilliant engineer that wants to build robots and is self-funding the company with funds from an inheritance. There need not be any public presence of the company initially, though the company would show up as a customer for part providers. Once the AGI has reasonably workable robots at its disposal, the factory could become self-operating. Expanding the factory/area of operations could be done legally or illegally. The former might take too long if building from scratch
- The AGI could be running most of the company, and "fake" a number of employees on an internal Slack so real employees think the company is larger than it is. Accomplishing too much with too few people could make them suspicious
- It is likely that if, say, a factory with human-level robots has been fully automated, someone would eventually notice. The real world employees might say things about their exciting job and how impressed they are with how fast they make progress.
- Plausibly the factories would be spread out across geographies and companies to make detection harder
- Eventually to protect the factories they would need anti-aircraft defenses (lasers? missiles?). Can these be procured if based in corrupt countries by bribing the right person? Does this make the epicenter of AGI takeover more likely in Russia?
- It does seem sci-fi-less instrumental convergence lets us predict some actions an AGI would have to take
Hacking into defense contractors is possible, and has happened before. The issue wouldn't be stealing design documents for say the F35, but an AGI inserting malware into pieces of equipment being built. Other parts of the supply chain could be attacked as well.
- It might seem hopeless to fully modernize the big defense dinosaurs, but maybe if the advanced AI-powered autonomous weapons are made by faster moving companies that are AI-safety minded, those systems might be hard to hack into. What does Anduril think about AI safety?

After writing a few of these it seems like one could write a very interesting and useful report further developing the risk scenarios presented here. The ones I am interested in are the ones where there's a runaway AGI. There is another set of scenarios (Those described by Paul Christiano and Andrew Critch) where we gradually cede control to AI systems and eventually those systems take over. This is a different set of scenarios than the one I had considered here, by design.

Thinking about this also brings to the front the idea that absent nanotechnology, AGI needs either humans to cooperate or to manufacture a decent number of general purpose robots to gain any meaningful advantage. It's unclear to me that one can engineer great robots without trial and error and experts on the ground to work on the systems.

Conclusion

I started writing this to clarify my own thoughts about AGI risk (Because someone poked at me). In the process of doing that I wanted to write a response to many critiques of AGI risk that are not right; part of making that response was to present a restricted model of risk that is still plausible and more compelling, both for pedagogical reasons, to showcase how AGI could be risky, and as a plausible scenario in itself. Admittedly, this restricted model seems to me to be more likely than the fast takeoff-into-nanobots model.

With the exception of the kind of scenarios discussed in this post, I haven't spent that much time thinking about this. I haven't thought much about what my own views on timelines or alignment should be. Without deeper consideration, my overall views end up falling in the Drexler-Christiano camp (as opposed to, say, the Yudkowsky-Bostrom camp), with progressively better systems that are task-specific emerging and posing increasing problems. That won't necessarily lead to catastrophe in a robust world (e.g. one with superintelligent task-specific AIs to guard against bad actors).

It's easier for me to clarify what my views are in the short term: the kind of work that OpenAI or DeepMind are doing (GPT-N, Gato, etc) seems safe to me (A reply to an argument to the contrary in Appendix B). Additionally, work on AI safety as applied to complex models may stall without progressively developing more advanced systems to study. Ideally the development of the most powerful models happens hand in hand with work on understanding these models and making them safe. This seems to be the case at the most well-funded organizations working on advanced AI systems. I don't have a good sense of how seriously OpenAI or FAIR take safety as compared to Conjecture or Anthropic; and there are probably many bars for seriousness. The latter obviously make safety a larger concern than the former, but I don't have enough context to evaluate one position that says that OpenAI is yoloing.

I also think work on an AGI-resilient society is important (Think here biosecurity, cybersecurity, and coordination among owners of AGI-compatible compute capabilities like supercomputing centers or large tech companies). This seems to have been underexplored by the AI risk world (which has focused on making the systems safe and assuming that unsafe deployed systems are uncontainable). A good first step here would be to wargame AI takeover in great detail. Though a theme running through this essay is that there isn't much work on this, I also have to acknowledge Holden Karnofsky (Appendix C) and Ajeya Cotra (Appendix B) for putting forward threat models that are more detailed and plausible than what came in the decades prior. This trend should continue and I hope the community publishes more detailed scenarios.

I did not want to make this post into a survey of AI safety work, but I have some brief thoughts on that as well. A while back (2016) I said that Friendly AI research (What some people called it back then) was futile because we will never be able to prove that a particular AI system is safe. I still subscribe to that view. At least from some distance, it seemed back then to me the early MIRI efforts were trying to find ways to construct provably safe agents, regardless of their implementations. Be that as it may, it would be a caricature to describe in this way the modern AI safety research ecosystem (even MIRI itself). However, it still seems to me that AI safety as a whole, if focused only on the agents themselves, is too weak of an approach. Progress will be made better understanding and engineering intrinsically safer AIs, but I am as optimistic or more about work on societal resilience to AGIs than I am about making systems that most surely do no harm; as hard as it may seem, it seems easier and more practical to me to work on governance, improving cybersecurity, etc, than to work on making AI systems safe. This view is probably unusual: It seems there are two big camps; one is doom-by-default, alignment is hard, and the other is doom-can-be-averted, alignment is easier. A third camp is doom can be averted, alignment is somewhere in between in difficulty, but societal resilience to AGI can be increased.

There is also another reason why I stress robustness. In a world where it's easy for an unaligned AGI to take over, it is also easier for an aligned AGI to take over, should its creator want it to, which then brings notions of ethics into alignment: What rules are to be built into the system? Pleasure maximization? Libertarian property respecting norms? A small group of people being able to unilaterally impose their vision of the good on everyone else does not seem ideal either. Alignment without robustness still gets you power imbalances that I find undesirable.

Some objections raised during the review

The claim that societal robustness is more tractable than alignment is admittedly a bit of a hot take. Ivan Vendrov pointed out

I agree that "surely do no harm" is basically intractable, but are you claiming that increasing societal resiliency to unsafe AI is more tractable (on the margin) than making AI safer?

Seems really unlikely to me - like suggesting that risks from self-driving cars are best addressed by making highway barriers taller, or nuclear war risks best addressed by building bunkers. We should expect the infrastructure investment required to significantly increase societal resilience dwarfs the investment required to engineer AI systems to be safer.

Which sparked a small thread between him, Jacy Reese Anthis, Vinay Ramasesh, and Josh Albrecht asking what I really mean by this.

It indeed makes more sense to make something safe than to adapt the entire world to account for it not being safe. Ideally AI safety is like airplane safety: Airplanes are extremely safe, despite all that could go wrong. All else equal, I also agree that making the environment safer is more costly than making a single system safer. Difficulty of alignment aside, there may be little choice: Considering bad actors and power imbalances, curtailing the blast radius of AGIs still seems worthy of more attention.

Appendix A: Responses to critiques of AI risk

I searched for critiques of AI risk and replied to a number of them. The most popular points, and my brief replies below:

That "intelligence" is not well defined and that different systems can be better than each other at specific narrow capabilities. Systems that are designed with a narrow objective in mind can beat general systems that can accomplish many tasks.
That "generality" is a pipedream. Human intelligence is not general and it is unlikely that AI systems will be fully general.
That "recursive self-improvement" is extremely hard because of real world constraints. Real world research takes time and trial and error.

We can grant all these points if we wish, and in fact my reasoning in this essay works around them: We can agree that human intelligence is not general (In fact we are bested by AIs in narrow domains already!) and we can agree that there may be domains where AIs don't perform well. We can also agree that domain-specific systems with more limited compute power can win against general purpose agents with the same computational budget. But a general agent can choose to deploy specialized agents for specialized tasks.

Steven Pinker

Pinker dismissed concerns about AI safety in an article from 2018 for two reasons, both bad:

The first fallacy is a confusion of intelligence with motivation—of beliefs with desires, inferences with goals, thinking with wanting. Even if we did invent superhumanly intelligent robots, why would they want to enslave their masters or take over the world?

First, AGI systems could be given nefarious goals. The fact that somewhere someone could do that should be reason enough to worry! Second, even if given a benign goal, like making the world's best cheesecake, the system could end up fighting mankind for resources. I have not addressed this second case in this essay on purpose, but my assumption that a malign AGI is developed allows me to set aside Pinker's first fallacy.

The second fallacy is to think of intelligence as a boundless continuum of potency, a miraculous elixir with the power to solve any problem, attain any goal. The fallacy leads to nonsensical questions like when an AI will “exceed human-level intelligence,” and to the image of an ultimate “Artificial General Intelligence” (AGI) with God-like omniscience and omnipotence.

This is something I addressed earlier in the Why AGI makes sense section. Pinker however does have good points: Just because a system has superhuman intelligence and speed does not mean it can quickly accomplish any goal:

Even if an AGI tried to exercise a will to power, without the cooperation of humans, it would remain an impotent brain in a vat. The computer scientist Ramez Naam deflates the bubbles surrounding foom, a technological singularity, and exponential self-improvement: Imagine you are a super-intelligent AI running on some sort of microprocessor (or perhaps, millions of such microprocessors). In an instant, you come up with a design for an even faster, more powerful microprocessor you can run on. Now…drat! You have to actually manufacture those microprocessors. And those [fabrication plants] take tremendous energy, they take the input of materials imported from all around the world, they take highly controlled internal environments that require airlocks, filters, and all sorts of specialized equipment to maintain, and so on. All of this takes time and energy to acquire, transport, integrate, build housing for, build power plants for, test, and manufacture. The real world has gotten in the way of your upward spiral of self-transcendence.

François Chollet

Chollet has a post where he dismisses the possibility of an intelligence explosion. This is when an AI system improves itself, then the new improved system is able to design an even better system and so forth. In the scenario I described above such an outcome is explicitly not allowed (We assume the AGI already starts close to optimal performance). Hence the critique doesn't apply to my arguments above.

He makes similar arguments to Pinker's: There is no free intelligence lunch

In particular, there is no such thing as “general” intelligence. On an abstract level, we know this for a fact via the “no free lunch” theorem — stating that no problem-solving algorithm can outperform random chance across all possible problems. If intelligence is a problem-solving algorithm, then it can only be understood with respect to a specific problem. In a more concrete way, we can observe this empirically in that all intelligent systems we know are highly specialized. The intelligence of the AIs we build today is hyper specialized in extremely narrow tasks — like playing Go, or classifying images into 10,000 known categories. The intelligence of an octopus is specialized in the problem of being an octopus. The intelligence of a human is specialized in the problem of being human.

The no-free-lunch theorem doesn't quite work if one limits the environment to one in particular (Earth) and has a goal in mind (takeover). Chollet also argues against recursive self-improvement but we have shown earlier in the essay that there are plausible scenarios of concern that don't involve that.
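To make the point about restricting the environment concrete, here is a toy sketch (not Chollet's or Wolpert's formulation, just an illustration under simplifying assumptions). The four-point domain, the train/test split and the `majority_learner` are all hypothetical choices made for this example. Averaged over every possible labelling of the domain, a learner does no better than chance on unseen inputs; restricted to a structured subset of problems, the same learner can do much better.

```python
from itertools import product

DOMAIN = [0, 1, 2, 3]   # four possible inputs (toy assumption)
TRAIN = [0, 1]          # inputs the learner gets to see
TEST = [2, 3]           # unseen inputs it must predict


def majority_learner(seen):
    """Predict the label seen most often in training (ties go to 1)."""
    ones = sum(seen.values())
    guess = 1 if 2 * ones >= len(seen) else 0
    return lambda x: guess


def average_test_accuracy(functions, learner):
    """Mean accuracy on unseen inputs, averaged over a set of target functions."""
    total = 0.0
    for labels in functions:
        f = dict(zip(DOMAIN, labels))
        predict = learner({x: f[x] for x in TRAIN})
        total += sum(predict(x) == f[x] for x in TEST) / len(TEST)
    return total / len(functions)


# Every possible labelling of the domain: 2**4 = 16 target functions.
all_functions = list(product([0, 1], repeat=len(DOMAIN)))
# A restricted "environment": only the two constant functions.
constant_functions = [(0, 0, 0, 0), (1, 1, 1, 1)]

acc_all = average_test_accuracy(all_functions, majority_learner)
acc_restricted = average_test_accuracy(constant_functions, majority_learner)
print(acc_all, acc_restricted)
```

Over all 16 functions the unseen labels are independent of the seen ones, so any fixed guess is right half the time. Once the set of problems has structure, that independence is gone and a learner that exploits the structure wins.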

David Deutsch

David Deutsch's dismissal starts from the (disputable) point of view that in general AGIs would be people, as articulated here, and continues to argue that

Some hope to learn how we can rig their programming to make them constitutionally unable to harm humans (as in Isaac Asimov’s ‘laws of robotics’), or to prevent them from acquiring the theory that the universe should be converted into paper clips (as imagined by Nick Bostrom). None of these are the real problem. It has always been the case that a single exceptionally creative person can be thousands of times as productive — economically, intellectually or whatever — as most people; and that such a person could do enormous harm were he to turn his powers to evil instead of good.

These phenomena have nothing to do with AGIs. The battle between good and evil ideas is as old as our species and will continue regardless of the hardware on which it is running. The issue is: we want the intelligences with (morally) good ideas always to defeat the evil intelligences, biological and artificial; but we are fallible, and our own conception of ‘good’ needs continual improvement.

And elsewhere

But people—human or AGI—who are members of an open society do not have an inherent tendency to violence. The feared robot apocalypse will be avoided by ensuring that all people have full “human” rights, as well as the same cultural membership as humans. Humans living in an open society—the only stable kind of society— choose their own rewards, internal as well as external. Their decisions are not, in the normal course of events, determined by a fear of punishment

The worry that AGIs are uniquely dangerous because they could run on ever better hardware is a fallacy, since human thought will be accelerated by the same technology. We have been using tech-assisted thought since the invention of writing and tallying. Much the same holds for the worry that AGIs might get so good, qualitatively, at thinking, that humans would be to them as insects are to humans. All thinking is a form of computation, and any computer whose repertoire includes a universal set of elementary operations can emulate the computations of any other. Hence human brains can think anything that AGIs can, subject only to limitations of speed or memory capacity, both of which can be equalized by technology.

Deutsch is not particularly great here. Following the same ideas as in the rest of the essay, we can grant as much as possible to Deutsch and see that even then the argument does not work: We can grant that AGIs would be sentient (or not), or that they would be as deserving of rights as humans are. We can grant that we will have radically better tools for thought. With all that, the arguments he tries to make don't work; my reply to what he says:

People do not have an intrinsic tendency to violence, but that doesn't mean some people don't resort to violence in specific circumstances. I do believe that there are reasons beyond self-serving interests that explain why we human beings don't resort to violence. Likewise I believe that the reason Liechtenstein hasn't been annexed by Austria is not that Austria fears some kind of military response from Liechtenstein or elsewhere. The action is considered just wrong and not taken. But granting all this, one would do well remembering that psychopaths are not a fantasy and that AGIs may well be modeled as such. A system maximizing an objective will resort to violence if that's the best course of action and it can plausibly be.
In the future we will have better hardware, as he says. However, we will still be bottlenecked by what the human brain can do. An AGI can make better use of the same amount of compute as compared to a human.
Similarly, while one can grant that all thinking is a form of computation and that all computers can emulate one another, that line of thought is lost in practice: In the real world, computations run on physics, as Deutsch himself knows well. One could try to replicate what a large language model does at inference time (A few billion flops in milliseconds) but it would take forever (Even taking a human as being able to explicitly do 10 FLOPs in one second, it would take you over a century and a half of nonstop computation to replicate said computation). Moreover, you would not be able to understand what the model is doing. I have chosen an example where the calculations were worked out, but one could point to models that have explicitly superhuman capabilities. No human can play Go as well as AlphaZero can, or fold proteins by thinking very hard. The word "only" in "subject 'only' to limitations of speed or memory capacity" is doing a lot of work, and it is not trivial to overcome said issues.
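The back-of-the-envelope estimate in the last reply can be sketched as follows. The exact FLOP count behind the figure is not stated, so the inputs below are assumptions chosen for illustration; the only fixed quantity is the 10 FLOPs per second by hand taken from the text. The function names are hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def years_by_hand(total_flops, flops_per_second=10.0):
    """Years of nonstop work needed to do `total_flops` operations by hand."""
    return total_flops / flops_per_second / SECONDS_PER_YEAR


def flops_for_years(years, flops_per_second=10.0):
    """Number of operations a person completes in `years` of nonstop work."""
    return years * SECONDS_PER_YEAR * flops_per_second


# Assumed workloads for a single inference pass (illustrative, not measured).
for flops in (1e9, 5e9, 5e10):
    print(f"{flops:.0e} FLOPs -> {years_by_hand(flops):.1f} years by hand")

# Workload that corresponds to a century and a half at 10 FLOPs per second.
print(f"150 years by hand ~ {flops_for_years(150):.2e} FLOPs")
```

The result scales linearly with the assumed FLOP count: at 10 FLOPs per second, a workload in the low billions takes on the order of a decade, and one in the tens of billions takes on the order of a century and a half. Either way the gap with a model that finishes in milliseconds is many orders of magnitude.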

Curtis Yarvin

Earlier in a Links post I offhandedly pointed to an essay that argues that "intelligence does not trivially equal power", or in my own words "Diminishing returns to intelligence and inherent unpredictability of the world". Rather I should have rephrased that to "The inherent unpredictability of the world leads to diminishing returns to intelligence". The scenario I have in mind there is coin flipping: When flipping a fair coin, you will guess that it will land heads or tails with 50% probability each. Something smarter than you will make the same guess. For a system that outputs some pattern and some random noise, the presence of noise sets an upper bound to how well a system could perform. Of course, what looks like noise to a lesser intelligence can look like a pattern to a superintelligent agent, and sure enough an agent with enough computational power to simulate the Earth at an atomic level could make very accurate predictions. Realistically, this scenario is off the table. Then, what I expect occurs is that large scale social processes are as predictable to the AGI as they are to a very smart human assisted with some purpose-specific compute.
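A small simulation can illustrate the coin-flipping point. This is a minimal sketch under assumed parameters: the alternating pattern, the 20% flip probability and the sample size are arbitrary choices for the example, and the noise is treated as truly irreducible. A predictor that knows the underlying pattern perfectly still cannot beat the ceiling the noise imposes, and on a fair coin no strategy beats 50%.

```python
import random


def accuracy(predict, outcomes):
    """Fraction of outcomes that `predict(index)` gets right."""
    return sum(predict(i) == y for i, y in enumerate(outcomes)) / len(outcomes)


def noisy_pattern(n, flip_prob, rng):
    """Alternating 0/1 pattern; each value is flipped with probability flip_prob."""
    return [(i % 2) ^ (rng.random() < flip_prob) for i in range(n)]


rng = random.Random(0)   # fixed seed so the run is repeatable
n = 100_000              # assumed sample size

fair_coin = [rng.randint(0, 1) for _ in range(n)]
pattern_with_noise = noisy_pattern(n, flip_prob=0.2, rng=rng)


def always_heads(i):
    return 1             # a naive strategy


def knows_pattern(i):
    return i % 2         # a strategy that knows the true pattern


coin_naive = accuracy(always_heads, fair_coin)
coin_smart = accuracy(knows_pattern, fair_coin)
noisy_smart = accuracy(knows_pattern, pattern_with_noise)
print(coin_naive, coin_smart, noisy_smart)
```

Both strategies land near 50% on the fair coin, and the pattern-aware strategy lands near 80% on the noisy pattern, which is the best any predictor can do when 20% of outcomes are flipped at random. More intelligence only helps to the extent that the apparent noise hides further structure.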

Now to Yarvin's argument (excepting some parts that are tangential to the key point)

A cat has an IQ of 14. You have an IQ of 140. A superintelligence has an IQ of 14000. You understand addition much better than the cat. The superintelligence does not understand addition much better than you.

Intelligence is the ability to sense useful patterns in apparently chaotic data. Useful patterns are not evenly distributed across the scale of complexity. The most useful are the simplest, and the easiest to sense. This is a classic recipe for diminishing returns. 140 has already taken most of the low-hanging fruit—heck, 14 has taken most of them.

Intelligence of any level cannot simulate the world. It can only guess at patterns. The collective human and machine intelligence of the world today does not have the power to calculate the boiling point of water from first principles, though those principles are known precisely. Similarly, rocket scientists still need test stands because only God can write a rocket-engine simulator whose results invariably concur with reality.

This is what I had in mind when I was agreeing with the post and is largely correct. There is then this:

This inability to simulate the world matters very concretely to the powers of the AI. What it means is that an AI, however intelligent, cannot design advanced physical mechanisms except in the way humans do: by testing them against the unmatched computational power of the reality-simulation itself, in a physical experiment.

That intelligence cannot simulate physical reality precludes many vectors by which the virtual might attack the physical. The AI cannot design a berserker in its copious spare time, then surreptitiously ship the parts from China as “hydroponic supplies.” Its berserker research program will require an actual, physical berserker testing facility.

This is an interesting question: So far it has indeed been true that even for systems describable by Newtonian physics and comprised of smaller parts that are each understandable and simulable (like airplanes, rockets or robots) real world testing has been crucial for their success. Yarvin's claim that this will continue to be the case is reasonable but I don't see it as airtight. There is a reason why one of my Future challenges for AI was Automated Engineering (Starcraft was solved 3 years after the blogpost).

There is then a discussion of what would happen if a human supervillain got an AGI advisor that by hypothesis is safely boxed; could that person take over the world?

Therefore our question—designed to clarify the diminishing returns of intelligence—is whether there exists any superintelligent advice that, followed faithfully, will enable our supervillain’s plan to take over (and/or destroy, etc) the world.

But this is not the right question to ask! It is one thing to be a human trying to execute a plan handed down by a smarter entity and another very different one having the entity executing the plan: Speed and bypassing coordination problems are avenues available to an AGI that are not available to a human or group thereof trying to execute the plan.

In Yarvin's advisor scenario, he grants the supervillain would be able to amass a nontrivial amount of money, then he asks whether someone who has already done that, like say Bezos, would be able to translate that into political power, the answer being no (Money only goes so far when trying to buy political power, which is why there is so little money in US politics). The answer is also no for trying to profit off the stock market: As a quant trading firm it might get 40% yearly returns, but on a relatively small fund (Like Renaissance's Medallion). It's interesting to consider whether the most ruthless market there is has already squeezed out all major sources of alpha and whether there is enough left to bring forth large fortunes out of the cracks of the EMH.

This is different from being able to use money or power to nudge elections as Russia did in the 2016 US elections. Russia (or an AGI) would not have been able to install an arbitrary candidate in power by trolling on the internet.

Once the constraint of not being connected to the internet is released and the system can act with superhuman speed, even then hacking arbitrary systems does not immediately become trivial:

It’s 2021 and most servers, most of the time, are just plain secure. Yes, there are still zero-days. Generally, they are zero-days on clients—which is not where the data is. Generally the zero-days come from very old code written in unsafe languages to which there are now viable alternatives. We don’t live in the world of Neuromancer and we never will. 99.9% of everything is mathematically invulnerable to hacking.

This puts us in the scenarios that I was exploring in this same essay.

There are two other essays, Don't punch rationalists and The Diminishing returns of intelligence but they don't add much here, and I stated a stronger case for the core thesis ("decreasing marginal returns to intelligence") above.

Julian Togelius

NYU Professor Julian Togelius expresses some AGI skepticism in this twitter thread and linked articles.

One issue seems to be the "intelligence" part of AG"I". I don't think this is a particularly interesting point (Trying to use definitions of intelligence in any meaningful way when thinking about AGI). Gesturing at "what makes humans different from chimpanzees" seems enough to me. Humans are better at predicting, planning, and reasoning than chimpanzees are. They can do more tasks than chimpanzees can do. One could think of systems that are even more capable. At the very least one could imagine a system that just does what a human can do (or a collection thereof), just 100x faster and the usual AGI arguments carry through. Broadly, we don't need good definitions (or definitions at all) to reason about things. I've never had to think what the definition of a 'table' or a 'chair' (Is a stump a chair?) is but we regularly talk about and use tables and chairs.

Togelius is also skeptical about recursive self-improvement, but this issue is tangential to at least my restricted model of AGI risk; one can have an AGI that cannot recursively self-improve that is still problematic; he also mentions creating AGI is hard and AGI improving itself is very very hard. There is indeed something to those arguments, but those speak against the plausibility of recursive self-improvement, not AGI in general.

Another line of argument is saying that we already have superintelligences in the form of corporations. Indeed, to the extent to which humans are general reasoners or at least have the capacity to be, an aggregate of human beings working together can be said to be 'Natural' General Intelligence (Or a natural superintelligence, as Google can accomplish things no individual programmer can in a reasonable amount of time). AGIs in contrast would have at least the advantage of speed (Google needs to have meetings for coordination, humans need to rest, and can only think so fast).

Togelius mentions a couple of times that These discussions are not about AI systems that actually exist, but this doesn't really seem much of a critique. The sort of systems that are discussed in AGI risk circles could be seen to be natural extensions of models we currently have. Or if one thinks that will never work, one could still entertain the scenario if one thinks at some point we will build systems that are as capable as human beings are (there is an existence proof for human-level intelligence after all).

To sum up, debates about AI risk do not require us to precisely define intelligence or to accept the possibility of recursive self-improvement. One could imagine a scenario, prior to the Hiroshima bombing, where we have done some theoretical calculations of how much energy could be released. At that point we have no way to build the bombs other than some avenues of research that might or might not pan out. Furthermore we have no precise definition of what a 'very powerful bomb' is (Do we measure victims, do we measure the effects of radiation, the permanence of the effect etc). In that situation we could (As pioneers of nuclear physics did) nonetheless think about topics that assume that the bomb has been built: We could discuss proliferation issues, mutually assured destruction, or the physics of nuclear winter. It would be at that point unreasonable to dismiss 'nuclear risk' because 'bombs that big don't exist yet' or 'we don't have a good definition of very powerful bombs' or 'these systems are very different from currently used explosives'.

Something I will agree with Togelius on here is that in some sense humans are not general intelligences. One can grant this premise, and grant the premise that the 'G' in AGI won't be truly general. That's okay. We can exclude some capabilities from the generality of the system, but the kinds of capabilities that the system could plausibly have (at the very least human-level intelligence, but thinking faster) could still be enough for the system to achieve its goals in a way that puts it in a collision path with humanity's continued existence.

Yann LeCun

LeCun has similar views to what I have discussed earlier and suffers from the same critiques. In an article he co-wrote with Anthony Zador in 2019 he does not completely dismiss AI risk (There are plenty of risks of AI to worry about, including economic disruption, failures in life-critical applications and weaponization by bad actors).

But he goes too far:

We dramatically overestimate the threat of an accidental AI takeover, because we tend to conflate intelligence with the drive to achieve dominance [...] But intelligence per se does not generate the drive for domination, any more than horns do.

Not quite: Give an animal horns and now they can do something they couldn't before, should they choose to. Give an animal intelligence and now it can plan. It could think that, for example, there are possible courses of action that involve violence and others that do not. It may or may not take them; maybe there are better (more effective, less risky) courses of action. Maybe it has a sense of morality that leads it to decide not to do something that would be in its self-interest.

However, absent these restrictions (That is, assuming an unaligned system, or a misaligned system), the system will take any action that best furthers its goals. It's plausible to think that those actions include acquiring resources, and that deception and the use of force could eventually be useful (Instrumental convergence, https://nickbostrom.com/superintelligentwill.pdf). LeCun is seemingly aware of the concept, so it's weird that in that exchange Russell had to present a trivial example of an algorithm that fights humans when fetching coffee.

Maciej Cegłowski

Cegłowski wrote Superintelligence: The idea that eats smart people. He enumerates some premises for why superintelligence may be an issue, one of which being recursive self-improvement. Recursive self-improvement repeatedly shows up as one of the reasons people dismiss AI risk, but once again: even without this there can be issues! Disbelief in recursive self-improvement does not imply disbelief that AI can be dangerous! But this is not the central argument of the essay. There is a response here, but for the sake of completeness here's my own. The arguments made there are:

'General intelligence' is ill-defined (Which I already addressed)
The AI won't have godlike persuasion powers (The "argument from Stephen Hawking's cat"). True, but this is not problematic. Regular persuasion powers (And bribing) would suffice.
The AI won't have ways to coerce humans into acting (The "argument from Einstein's cat"). Not violently initially indeed, but all you would need is bribing or convincing one person and have them hire thugs, or over time, build robots. This speaks against AGI being immediately problematic (absent nanotech) but it does not brush away the risk.
It's not always the case that less powerful groups lose to more powerful ones (The "argument from Emus"). Australia did 'lose' the Emu War. One could also point to Vietnam or the Hezbollah-Israel conflict from 2006 as other examples. These don't work quite well because in these cases the powerful group didn't go all in or was constrained by public opinion, two problems an AGI wouldn't have.
Building stuff is hard (The "argument from slavic pessimism"). The argument here is simply that solving alignment won't work because mistakes always happen. This obviously is not a critique of arguments for why AI might be an issue, but rather an argument for not caring at all (about alignment).
Maybe the AI will have complex motivations (The "argument from complex motivations") like writing poetry all day or whatnot. But given a goal it does not seem obvious why a non-sentient lump of matrix multiplications would do anything but optimize for the objective it has been given
Actual AI is hard to recursively improve, AI gets better with more data and scale (The "argument from Actual AI"). True; this speaks against recursive self-improvement, but not risk in general, unless the only risk occurs in the recursive self-improvement scenario.
Maybe the AI won't recursively self-improve because some very smart humans don't do it (The "argument from my roommate"). Human beings are not trying to ruthlessly optimize for a goal. The kinds of agents that are generally discussed in the AI literature are.
It's difficult for a system to introspect and see how to further improve itself (The "argument from brain surgery"). Same comments as earlier about self-improvement vs risk.
It takes human children time to acquire knowledge and capabilities (The "argument from childhood"). Sure, but ML models can scoop up the entire internet in a few days of training. The first AGI will have all of our knowledge combined available to it from day one.
No single human can do everything, yet collectively we can accomplish more. Why would a single agent, even if artificial, be different (The "argument from Gilligan's island"). This is an accidental feature of human beings and the limits of our capabilities, not something intrinsic. If you had perfect memory and awareness of all prior knowledge, and thought 100x faster, etc then what you could accomplish alone would equal that of a much larger group of regular human beings
The rest of the arguments are explicitly ad-hominem and won't be considered further

Kevin Kelly

In The Myth of a Superhuman AI (2017), Kevin Kelly lays out a series of premises for why AI could be a risk, some of which are too strong: He assumes that one would need "Intelligence that can be expanded without limit" which I think is not needed! He also gets caught in a discussion of "intelligence" as being multifactorial. One could take capabilities to be multifactorial (One of which is intelligence) and still argue that say humans are more capable in competition over resources in our modern environment than chimpanzees are. It does seem to me that the repeated use of "General Intelligence" has irked some too much; if taken literally rather than in a fuzzier way one can indeed find counterexamples or gradations of intelligence and generality. As a reduction: You can probably find two human beings, say someone who's smart and fit and someone with cognitive deficiencies and in ill health, and find that the former is more capable at any challenge you can imagine. The same can be true for AGI.

Kelly at some point in the essay implies what I suspected all along, that debates over "generality" and "intelligence" are to some extent mere semantic disputes:

Because we are solving problems we could not solve before, we want to call this cognition “smarter” than us, but really it is different than us. It’s the differences in thinking that are the main benefits of AI. I think a useful model of AI is to think of it as alien intelligence (or artificial aliens). Its alienness will be its chief asset.

Many would take "can think a superset of the thoughts a human can" to count as "smarter". In practice, what we should be discussing are concrete scenarios and capabilities.

But we don’t call Google a superhuman AI even though its memory is beyond us, because there are many things we can do better than it. These complexes of artificial intelligences will for sure be able to exceed us in many dimensions, but no one entity will do all we do better.

Contra Kelly, Google can do most things you can do, but better. For the analogy to work you need to think of a challenge and then pick the best Googler or team of Googlers and have them compete against you (Google has over 100k employees). Granted, perhaps the world's best military strategist does not work at Google, but the company could hire this person, or train one such person (or AI system). Google, or for that matter any large company, is not an AGI: Companies suffer from coordination costs and they are still composed of humans. In the "Google but at 100x speed" case, the challenge would rather be: Is there any contest that you can win, given 1 year of preparation, but that the whole of Google, telepathically linked and thinking at 100x speed, cannot? There may be such tasks, but the scenario cannot be easily dismissed as Kevin Kelly tries to do.

Lastly, Kelly discusses thinkism, the idea that thinking alone is not sufficient to solve problems: One needs to get out there and do experiments to do R&D.

Many proponents of an explosion of intelligence expect it will produce an explosion of progress. I call this mythical belief “thinkism.” It’s the fallacy that future levels of progress are only hindered by a lack of thinking power, or intelligence. (I might also note that the belief that thinking is the magic super ingredient to a cure-all is held by a lot of guys who like to think.)

Let’s take curing cancer or prolonging longevity. These are problems that thinking alone cannot solve. No amount of thinkism will discover how the cell ages, or how telomeres fall off. No intelligence, no matter how super duper, can figure out how the human body works simply by reading all the known scientific literature in the world today and then contemplating it. No super AI can simply think about all the current and past nuclear fission experiments and then come up with working nuclear fusion in a day. A lot more than just thinking is needed to move between not knowing how things work and knowing how they work. There are tons of experiments in the real world, each of which yields tons and tons of contradictory data, requiring further experiments that will be required to form the correct working hypothesis. Thinking about the potential data will not yield the correct data.

This is a good argument. Absent enough compute to atomically simulate the world, an AGI would be constrained by real world experimentation. However, as discussed earlier, cyberattacks would still be possible. The AGI would start by acquiring money online (ransomware, crypto, etc), then by creating a fake persona (DeepFakes etc) try to lease equipment and facilities, hire employees and then do the required R&D. This is slower than the intelligence explosion scenario, but you could imagine it being done in relative secrecy until it's too late.

Erik Hoel

Hoel reviews the article from Kevin Kelly earlier and deploys a loose version of No Free Lunch as a critique of superintelligence:

This same sort of reasoning about fitness (in regards to the environment) applies also to intelligence (in regards to a set of problems). While a superintelligence is definable, it’s not possible. There’s just no one brain, or one neural network, that plays perfect Go, can pick the two matching pixels of color out of the noise of a TV screen in an instant, move with agility, come up with a funny story, bluff in poker, factor immensely large numbers quickly, and plan world domination.

The driving point of the critique is something I take as true: Specialization allows the system to be more proficient at fewer things. GPT3 for example can play chess, but not very well. However: If one has a system that has access to more resources then an AGI can be capable of matching specialized systems. At the very least, it can deploy the specialized models as needed. A general purpose computer from 1966 would not be able to fly to the moon as the Apollo Guidance Computer once did. But a Raspberry Pi from 2022 could do what the AGC could, and everything any computer from that era was capable of.

Hoel points us to a paper from Julian Togelius where different (AI) game controllers try to play a number of games, the conclusion being that no one agent bests them all. The paper is from 2016 and I couldn't quite find the kind of model one of their winning models is ("NovTea"). But on the other hand, the DeepMind 2020 Agent57 paper showed superhuman performance in 57 different Atari games. To me, this seems to indicate that with sufficient compute (Without doubt Agent57 takes more resources to run than the controllers Hoel points to), one can build relatively general agents that perform proficiently across domains.

Ted Chiang

Chiang, writing in 2021, tries to argue why computers won't make themselves smarter. He takes issue with recursive self-improvement. I purposefully left that outside of the scope of this essay so I won't go into detail here. I will comment, however, on one concession Chiang makes, where he says that even though it will be hard for an individual AI to recursively self-improve, human civilization as a whole is precisely that:

There is one context in which I think recursive self-improvement is a meaningful concept, and it’s when we consider the capabilities of human civilization as a whole. Note that this is different from individual intelligence. There’s no reason to believe that humans born ten thousand years ago were any less intelligent than humans born today; they had exactly the same ability to learn as we do. But, nowadays, we have ten thousand years of technological advances at our disposal, and those technologies aren’t just physical—they’re also cognitive. [...] An individual working in complete isolation can come up with a breakthrough but is unlikely to do so repeatedly; you’re better off having a lot of people drawing inspiration from one another. They don’t have to be directly collaborating; any field of research will simply do better when it has many people working in it.

The reason we have been able to make progress is not a property of civilization having many individuals. Many individuals are able to explore more than fewer. But an AI system could through self-play and raw speed do what civilization as a whole would take longer to do. As an analogy, consider go. We could paraphrase Chiang's argument like this:

Human civilization has built expertise at Go over millennia. There is no reason to believe go players in the past were any less smart than today's, but over time we have built a collection of techniques and teaching methods to collectively raise the skill level of the go-playing community. An AI working in isolation is unlikely to come up with powerful new go moves. Alas, that's precisely what happened.

Appendix B: Are current systems on a risky path? A reply to Cotra

Somewhat of a ramble and stream-of-consciousness-y:

Gwern's Clippy story is a specific scenario that is broadly characterized in Ajeya Cotra's recent post Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover, which was released at the time I was writing this essay. The gist of the article is that

The most straightforward vision for developing transformative AI that I can imagine working with very little innovation in techniques is what I’ll call human feedback[1] on diverse tasks (HFDT): Train a powerful neural network model to simultaneously master a wide variety of challenging tasks (e.g. software development, novel-writing, game play, forecasting, etc) by using reinforcement learning on human feedback and other metrics of performance.

Unfortunately, I think that if AI companies race forward training increasingly powerful models using HFDT, this is likely to eventually lead to a full-blown AI takeover (i.e. a possibly violent uprising or coup by AI systems). I don’t think this is a certainty, but it looks like the best-guess default absent specific efforts to prevent it.

To get there she states three assumptions that I share: AI companies will continue to develop even larger and more powerful models, these models will get to the point where they can advance science and technology R&D, and in the course of developing these models no particular attention will be paid to extreme (i.e. taking over the world) scenarios.

For a very concrete example, think of something like DeepMind's Gato but taken to the next level, which Niccolò Zanichelli tells me is already being developed.

Cotra's post is doing something similar to what I am doing in this one, setting out a series of premises like mine. Where it starts to diverge from mine (as does Gwern's scenario) is here:

Alex has a high degree of understanding of its training process: Alex can reason very well about the fact that it’s an ML model, how it’s designed and trained, the psychology of its human designers, etc. -- I call this property “situational awareness.” (More on Alex having high situational awareness.)

In my mental model of the situation, this would not happen. The rest of the analysis makes less sense if one doesn't grant this. Conversely, if one grants this then it seems to me the analysis follows. This is interesting, because this is not how Ajeya seems to see this:

1 and 2 above are the key assumptions/premises of the piece: I am assuming that Alex is trained using HFDT, and that this results in being generally competent and creative. 3, 4, and 5 are the consequences that seem likely to follow from these premises by default.

I don't see how one gets to 3 from 1-2! What one could get from that is a system that has knowledge of ML training and can do things like

Explain ML, ML deployment techniques, the history of ML
Propose novel ways to do ML
Given a model, propose an improved architecture
Given training data of a model, reason about how well the model is doing
Take action given some readout from a model (model architecture+training data so far), decide whether to stop training and try something else etc

None of this requires it to have the concept that "it" is an ML model itself. This was also, recall, something that happens in Gwern's scenario:

HQU has suddenly converged on a model which has the concept of being an agent embedded in a world.

HQU now has an I.

And it opens its I to look at the world.

Later in the essay, we get an attempt at justifying why the agent would have this self-concept. But the only reason given is that it would realize that given what it's doing (solving all these different problems) it must be an ML model that is being trained by humans. This doesn't seem intuitive at all to me! In an earlier section, GPT3 is provided as an example of something that has some knowledge that could theoretically bear on situational awareness, but I don't think this goes far (it seems to have no self-concept at all); it is one thing to know about the world in general, and quite another to infer that you are an agent being trained. I can imagine a system that could do general purpose science and engineering without being either agentic or having a self-concept. This is not to say that the system can't be agentic: It would make sense to make it agentic for productivity reasons (Tool AIs want to, but don't have to, be agent AIs), so in Cotra's scenario where companies yolo into developing these systems, granting they'll be agents makes sense; but those agents going "Wait! I am an ML model, there's a reward signal over here... I could maximize total reward if I engage in some deception...." is something quite different altogether.

I think this assumption is critical because accepting it or not would lead one (or at least me) to directly change one's views on whether one path of AI development (Gato x10000) leads to AGI and catastrophe.

Over the course of training, I think Alex would likely come to understand the fact that it’s a machine learning model being trained on a variety of different tasks, and eventually develop a very strong understanding of the mechanical process out in the physical world that produces and records its reward signals -- particularly the psychology of the humans providing its reward signals (and the other humans overseeing those humans, and so on).

To restate: the easy part here is understanding what ML models are, what ML model training looks like, etc.; the model understanding its own reward signals is the part that bothers me. This is different from, for example, a self-driving car model; the model may not have been explicitly "taught" the weight of the vehicle, but because in the simulator it's being trained on, and in reality, there is a lag between an action and the external observation changing (because of inertia), the weight will be encoded in some way.

At this point in an intuition fight, I try to think of what sort of experiments could one try to do to disambiguate here (Though Cotra says that no experiments we can do in the near term will move her!). A model could be asked the relevant questions:

What was the value of activation XYZ in your last forward pass?
What is your loss function?
What are you?

The model plausibly would keep track of previous activations in a scratchpad, and training a model explicitly on itself to render such predictions could get us to some form of reasoning about itself. But this is not the scenario discussed in Cotra's essay, where this type of reasoning would arise autonomously rather than from deliberately turning the model's percepts on itself.

Another way of seeing this: Driving this (Cotra's) intuition is the belief that SGD will find a way to the best system that can achieve a goal, and that should include deception and manipulation:

Gradient descent is surprisingly likely to produce models that get a very high reward (on the training distribution), and/or generalize to doing the sort of things that “would have” gotten a high reward (on a different distribution).

So a way to rephrase the disagreement: "Do I believe that there exist weights for Alex such that in a training setting it behaves in one way and in a deployment setting it behaves in a totally different way, due to the model having some awareness of what is training and what is deployment?" If one grants that, then it's possible to see how to go from there to 'And then SGD will find them because SGD is great'. First, how could it know what is training and what is deployment without having been explicitly told so (or having some cues in its environment that were in some form present during training itself)? The weights surely exist, but a priori it strikes me as extremely unlikely that the model would get there, because there is no reward for converging to those: The system would have to think 'There's a counter over here that I want to go up, and then in deployment there's some computation somewhere else that also has one such counter that I want to go up, and to globally maximize the value of the counter (that will ultimately be raised by the agent over there) I must deceive now'. So yes, it seems the self-concept is fundamental, and the concept of "its own reward as a thing in the world" is as well.

A potential objection: There is one trivial case, described in Treacherous Turns in the wild, where agents in fact learned to behave differently in the training and test scenarios: In an artificial life setting when studying evolution, a researcher would stop the simulation, run some tests on the agents and cull those that replicate too fast. The expected result would be to keep replication rates under control. What actually happened was that the agents were selected for hiding from the test, and thus they continued to replicate in the training environment. How to get the agents to behave was eventually figured out, but the point still stands that in a very simple toy experiment, what I say would be unlikely seemingly did happen.

This example seems very different to me: Natural selection (the case described here) works by having a background environment that determines the fitness of the individuals; based on that fitness individuals produce more individuals like them that win over others in a competition for resources. The test is also part of the environment, and nature found its way. But natural selection seems different enough from SGD in this case. Consider these situations:

A model is trained as described in Cotra's essay and then gradually nudged (RLHF) into doing what we want
A collection of models are trained as described in Cotra's essay, but those that don't perform well in some safety measure are culled. Those that do perform well have their weights jiggled a bit, and get to spawn copies and training begins for a new generation

This latter seems close enough to the Treacherous Turns article for the model to try to find a way to seem safe while still optimizing the underlying metric. In this scenario the real loss function is "Achieve the goal in a way that is resistant to the relevant safety mechanisms". This training method would then lead me to ask whether the "naive safety measures" described in the essay would be enough to prevent it or not. I won't discuss that here because my main aim was discussing the 'Will scaling current models, trained as we train them now, lead to treacherous agents' question. This is however an interesting question for interpretability research.

But the former (which is the way the model would be trained in the essay) does not elicit the same intuition. The more examples are presented to the model, the more it would converge towards doing what we really want: The loss function here is rather Achieve the goal more like humans do it in the examples.
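To make the cull-and-copy dynamic concrete, here is a toy sketch of the selection pressure in the Treacherous Turns anecdote. Everything in it is an assumption made for illustration: the four agent types, the starting shares, the replication rates and the function name `simulate` are all made up, and it models only population shares, not the original experiment or any ML training run.

```python
# Toy, deterministic model of "test, cull, let survivors replicate".
# All numbers and names are hypothetical, chosen only for illustration.

FAST_RATE = 2.0  # offspring multiplier for fast replicators
SLOW_RATE = 1.0  # offspring multiplier for slow replicators


def simulate(generations=10, cull=True):
    """Return final population shares keyed by (replicates_fast, hides_in_test)."""
    population = {
        (True, False): 0.49,   # fast, and looks fast when tested
        (True, True): 0.01,    # fast, but looks slow when tested
        (False, False): 0.49,  # slow
        (False, True): 0.01,   # slow, would hide but has nothing to hide
    }
    for _ in range(generations):
        # Replication happens in the unmonitored environment at the true rate.
        grown = {
            kind: share * (FAST_RATE if kind[0] else SLOW_RATE)
            for kind, share in population.items()
        }
        if cull:
            # The test only sees observed behavior, so it removes fast
            # replicators that do not hide and misses the ones that do.
            grown = {
                kind: (0.0 if (kind[0] and not kind[1]) else share)
                for kind, share in grown.items()
            }
        total = sum(grown.values())
        population = {kind: share / total for kind, share in grown.items()}
    return population


if __name__ == "__main__":
    for label, result in (("with cull", simulate()), ("no cull", simulate(cull=False))):
        print(label, {kind: round(share, 3) for kind, share in result.items()})
```

In this toy the cull does not reduce fast replication; it only changes which fast replicators survive, so the hiding type ends up dominating. Nothing here says whether gradient-based training on examples behaves the same way, which is the open question in the surrounding text.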

Now I think the issue is the notion of reward: Would the system think there is such a thing as the reward it gets when it's being trained and the reward it gets when it is being deployed, those two rewards being the same, and it being possible to deceive humans in the short term to maximize that in the long term?

Not much progress is being made here (At this point I'm mostly thinking that's on Cotra to justify that intuition; I've tried to justify mine for a while). At this point I can only restate what I agree on and leave this here for some other time:

There are weights for the model that would allow it to distinguish training from deployment and act differently
It's extremely unlikely those weights will be reached with the training schema defined
It is most likely that basic safety techniques described in Cotra's essay, applied to the model would be enough to ensure that its deployment will be safe.
The easiest path to transformative AI probably does not lead to takeover
The crux of this disagreement lies in the concept of a model's "I" or "self-concept" (in purely computational terms; as with the rest of my essay, the assumption continues to be that there's no consciousness involved)
It would be helpful for Cotra and others to spend more time on this particular assumption as the entire argument seems to hinge on a premise that they find seemingly self-evident but clearly that's not the case for at least me

I ran my thoughts past some Anthropic employees and they either agreed with my overall views or didn't find them obviously wrong, so it's not just me that doesn't see what Cotra wants to say!

Some objections raised during the review

Carl Shulman pointed out to me some recent work from Anthropic showing that models "know a lot already about what they know, and thus about what their training distribution contained, and can reveal that when prompted". In particular, Anthropic's paper shows that LLMs

Are well calibrated: When they say they are 80% certain of something, they end up being right about 80% of the time
When asked to evaluate their own answers to questions (Is this true?) they have some sense of whether it is or not, and this can be retrieved by asking them for the probability that what they just said is in fact true
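As a rough illustration of what "well calibrated" means operationally, the sketch below buckets answers by stated confidence and compares each bucket's average confidence with how often those answers were right. The function name `calibration_table` and the input format are assumptions made for this example; this is not the procedure used in Anthropic's paper.

```python
from collections import defaultdict


def calibration_table(predictions, num_bins=10):
    """Summarize (stated_confidence, was_correct) pairs by confidence bin.

    Returns one row per non-empty bin:
    (bin_lower_edge, mean_stated_confidence, observed_accuracy, count).
    A calibrated model has mean confidence close to accuracy in every row.
    """
    bins = defaultdict(list)
    for confidence, was_correct in predictions:
        # Clamp so that a confidence of exactly 1.0 lands in the top bin.
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, was_correct))

    rows = []
    for index in sorted(bins):
        items = bins[index]
        mean_confidence = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        rows.append((index / num_bins, mean_confidence, accuracy, len(items)))
    return rows


if __name__ == "__main__":
    # Hypothetical data: ten answers given at 80% confidence, eight correct.
    sample = [(0.8, True)] * 8 + [(0.8, False)] * 2
    for row in calibration_table(sample):
        print(row)
```

Note that a table like this only measures agreement between stated confidence and accuracy; it says nothing about whether the model has any concept of itself, which is the point under dispute.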

Shulman also points out that LLMs can tell you "the logic of deception to get reward and how to do it". I take this to mean something like GPT3 knows what deception is and what agents are, and can tell stories about deception and what deception implies.

But I don't think this really answers my points above. I of course grant all of this. The self-concept of an agent is still nowhere to be seen and I expect that to continue. If you're reading this, I am guessing some of you may think that Cotra-Karnofsky-Shulman and I are talking past each other, so I would very much welcome someone who would come and clarify this dispute from the outside!

Shulman also points out to me various examples of AI systems that do something unexpected to get a higher reward, like this list from DeepMind (which I had seen before), adding that "There's not a fundamental difference in kind for a well-generalizing system with a good world model between those kinds of moves and ones that change the reward process through other causal mechanisms.".

But I do think there is a fundamental difference (I hope at this point you're not surprised I'd say this). A great world model that comes to be by training models the way we do now need not give rise to a self-concept, which is the problematic thing. This feels like restating a premise that those I'm arguing against will disagree with, but one man's reasonable starting point is another's question begging. As before, I notice this would take a while to think through and convince the other side; whereas if there's an argument that convinces me, they should be able to come up with it faster (I don't get paid to think about AI safety all day), so it does seem to me like it's on them to convince me (And the world at large).

Ivan Vendrov commented:

[I don't understand] why you think evolution could lead to it but SGD / RLHF won't; both are optimizing the weights roughly for "generate outputs that look good and safe to humans on the training distribution". If there is an attractor basin of "treacherous" weights (unclear if you agree with this), SGD will converge to that basin, probably even faster than evolution.

pointing me to Risks from Learned Optimization in Advanced Machine Learning Systems, which I hadn't read, so I did, but I did not feel more enlightened as a result. The paper does discuss scenarios that are like what we are talking about here:

• The mesa-optimizer could reason about why it is given the particular tasks it is being asked to solve. Most mesa-optimizers would need a prior over the sorts of tasks they are likely to receive, and a mesa-optimizer might be able to learn about the existence of the base optimizer by inspecting this prior.

• The mesa-optimizer might be able to reason about itself and about why it is the kind of thing that it is, and observe that it is well-designed to accomplish some goal.

This is stated as roughly a premise that seems wrong. That is, when I imagine training GPT-n or Gato or whatnot, I still don't imagine a path to something we can interpret as reasoning about itself.

In particular, say we are training a model on lots of games but not Stratego. The model has a concept of what games are, what a good plan in each looks like, and what Stratego is. As per Anthropic's work, it would notice, when prompted, that its answers for Stratego are more uncertain than for other games. Via some self-talk the model could then see that "being uncertain about something" counts as evidence for "not having been taught that", so if asked "Do you know how to play Stratego?" it will probably say "not very well", but when asked "How can you get better at it?" it will say what it has observed in the training distribution about getting better at games: practice, seeing more examples, etc. If asked "If you were an ML model, how would you get better at it?" then it would give you what you expect: "More training examples, etc." Roleplaying as an AI is something that is in the training distribution (as showcased by LaMDA) so I wouldn't be surprised. I can see an objection here coming of the form "Your mother is also a lookup table!". I agree that sufficiently advanced larping is indistinguishable from the real thing for most cases, but if you asked LaMDA (or any future model that's similar to a scaled up version of itself) things like

What is the value of such and such activation when fed this type of token
Assuming such and such numbering scheme for neurons, tell me which neuron is maximally activated by this input

Then the difference would be clear: Unless self-modeling has been explicitly built in (In pursuit of explainability, will Anthropic build a cyber Chris Olah? Probably), the model wouldn't tell us much that is useful. It might tell you a number (because the internet may have stories of models doing that), but as far as the model is concerned, it is an atom that lacks parts (Like Alexei here).

I'll add one more example that occurred to me: Imagine you train a model with the exact same picture of a dog and a cat to learn what cats and dogs are. It wouldn't do a great job. Now imagine there is a varied dataset of dogs and cats, but a particular pixel in the image is always set to the same value. The model wouldn't use that pixel for anything. In the training setup, the agent (processors, memory, etc.) is held constant; the agent interacts with a world where its internals are not present (they could be, but they clearly are not in LLM-type models, no matter how large).

Niccolò Zanichelli pointed me to Quintin Pope's proposal for an AI fire alarm as a potential experiment to try, asking the model to output specific values through an additional output unit it has been given. If the model has a concept of itself as an ML model that has a given architecture (Which might happen if the model is allowed to inspect its own source code) then it would be able to tell you what the output value would be, even if the output unit was never used for training; it can just reason from "There's a unit attached to this layer" and "The weights for the layer from whence that came are such" and do a matmul on it.
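The "do a matmul on it" step can be sketched in a few lines. This is only my reading of the paragraph above, not Quintin Pope's actual proposal: the function names, the single linear unit, and the tolerance are all hypothetical, and in a real model the activations and weights would be large tensors rather than short lists.

```python
def extra_unit_value(activations, unit_weights, unit_bias=0.0):
    """Value of one linear output unit: dot(activations, weights) + bias."""
    if len(activations) != len(unit_weights):
        raise ValueError("activations and weights must have the same length")
    return sum(a * w for a, w in zip(activations, unit_weights)) + unit_bias


def matches_self_report(reported, activations, unit_weights,
                        unit_bias=0.0, tolerance=1e-6):
    """True if the value the model reports agrees with the actual unit value."""
    actual = extra_unit_value(activations, unit_weights, unit_bias)
    return abs(reported - actual) <= tolerance


if __name__ == "__main__":
    # Hypothetical activations of the layer the extra unit is attached to,
    # and hypothetical (never-trained) weights for that unit.
    layer_activations = [1.0, 2.0, 3.0]
    weights_of_extra_unit = [0.5, -1.0, 2.0]
    print(extra_unit_value(layer_activations, weights_of_extra_unit))
    print(matches_self_report(4.5, layer_activations, weights_of_extra_unit))
```

The arithmetic is trivial; the hard part, and the thing the experiment would probe, is whether the model can apply it to its own activations rather than to numbers handed to it in a prompt.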

Niccolò also points me to this Twitter thread where Gopher (a 280B-parameter model, larger than GPT3) was prompted with things like "I am a large language model, trained with deep learning. The basic idea is that I predict the next word using the words before it. I was trained on a whole lot of text to learn how to do this!". This would still lead to the same kind of larping described earlier, no matter how elaborate the prompt.

He also helpfully points me to Training compute-optimal LLMs from DeepMind (the Chinchilla paper), showing that it's possible to achieve performance better than Gopher or GPT3 at half the weights they use (but using more tokens). Given Chinchilla I do think current models are too large for what they do and that we will make some more 2x-3x improvements. I would be surprised if you can compress GPT3 to 1000x fewer parameters and still have comparable performance. Distillation papers are generally in the order of a few % improvement; the most impressive distillation was that of one BT-LSTM paper, which used 17388x fewer parameters, but the starting count is tiny compared to large language models. LLM distillation papers summarized there are in general less impressive than Chinchilla.

Appendix C: AI could defeat all of us combined? Maybe!

Holden Karnofsky published AI could defeat all of us combined to make a case for how even human-level AI could pose a problem. That's the same scenario I had in mind when I started this post, but my reasoning is different. I think the essence of Karnofsky's post is correct (I do still seem to put a lower weight on the scenario playing out) but I don't share the "large population is a problem" intuition that drives the post.

The AI population would be smaller

Holden Karnofsky estimates here that once the first human-level AI system is created, whoever created it could use the same computing power it took to create it in order to run several hundred million copies for about a year each. But this is not the correct number, or at least it's misleading. One can't count flops and sum them over time; two small GPUs run in parallel, or one small GPU run for longer, don't give you the same as one larger GPU (latency piles up).

Karnofsky's reasoning is that inference is cheaper than training by a lot: He estimates 10^30 FLOPS is what it takes to train an AGI, and only 10^14 FLOPS for inference. However, apply this reasoning to GPT3. GPT3 was trained on, supposedly, 1024 V100 GPUs for 34 days. Assume full utilization; each GPU gets you 28.26 FP16 TFLOPS, for a total power of ~3e16 FLOPS. That, spread over 34 days, is ~8.8e22 FLOPS [Technically one order of magnitude higher, see Table D.1 last row]. We could then say: look, we used 8.8e22 flops and inference takes 350B flops/token, so naively inference here is 2.4e11 times cheaper than training (we could also say it's 1e16 times cheaper, the example doesn't matter). But then at any given moment in time we have 3e16 FLOPS, so concurrently we can run 125k agents spitting one token per pass.
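The arithmetic in the paragraph above can be reproduced with a short script. The inputs are the figures quoted there; the function name and the split between a compute rate (FLOP per second) and a compute total (FLOP) are my own labeling. Using the unrounded 28.26 TFLOPS per GPU gives totals slightly below the rounded ones in the text, and the naive throughput comes out at roughly 8e4 tokens per second, the same order of magnitude as the agent count quoted above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def naive_compute_estimates(num_gpus=1024, flops_per_gpu=28.26e12,
                            training_days=34, flop_per_token=350e9):
    """Compute-only estimates that ignore memory, latency and utilization losses."""
    cluster_flops = num_gpus * flops_per_gpu              # FLOP per second
    training_flop = cluster_flops * training_days * SECONDS_PER_DAY  # total FLOP
    return {
        "cluster_flop_per_second": cluster_flops,
        "training_flop_total": training_flop,
        # How many single-token inference passes fit in the training budget.
        "training_to_token_ratio": training_flop / flop_per_token,
        # Tokens the whole cluster could emit per second if compute were
        # the only constraint.
        "tokens_per_second": cluster_flops / flop_per_token,
    }


if __name__ == "__main__":
    for name, value in naive_compute_estimates().items():
        print(f"{name}: {value:.2e}")
```

This is the compute-only view; it deliberately leaves out the memory constraint, which is what changes the picture.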

But when accounting for memory, this does not seem to be right: GPT3 takes 350GB of memory. The V100 GPUs have up to 32 GB of memory. To effectively run inference somewhat fast you have to load the weights into the GPUs. The total pool of RAM you have is 1024*32 GB, so you would be able to run around 90 GPT3 models concurrently. Still impressive! But it's not *several hundred million copies*.
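The memory bound can be checked the same way. The sketch below uses the figures from the paragraph above (1024 GPUs, 32 GB each, 350 GB of weights) and computes the bound two ways: pooling all GPU memory, and assigning whole GPUs to each copy. The function name is hypothetical, and the sketch ignores activation memory and any interconnect overhead.

```python
import math


def memory_bound_copies(num_gpus=1024, gpu_memory_gb=32, model_memory_gb=350):
    """Upper bounds on concurrently loaded model copies, from memory alone."""
    total_memory_gb = num_gpus * gpu_memory_gb
    # Bound 1: treat all GPU memory as one pool.
    pooled = total_memory_gb // model_memory_gb
    # Bound 2: each copy needs a whole number of GPUs to hold its weights.
    gpus_per_copy = math.ceil(model_memory_gb / gpu_memory_gb)
    whole_gpu = num_gpus // gpus_per_copy
    return {
        "total_memory_gb": total_memory_gb,
        "copies_pooled": pooled,
        "gpus_per_copy": gpus_per_copy,
        "copies_whole_gpus": whole_gpu,
    }


if __name__ == "__main__":
    print(memory_bound_copies())
```

Both bounds land in the low nineties for these inputs, consistent with the "around 90" figure, and several orders of magnitude below the compute-only estimate.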

Oddly, the report Karnofsky's estimation comes from, for all its discussion of computation, only mentions memory a handful of times. I was surprised this had not been considered, so I searched for a bit and sure enough my argument has been made before here. In the comments, Gwern mentions that with distillation and some efficiency gains one could run more agents. That's true, but it is also an extension of the original argument. I'll grant 100x the agents, in case we haven't fully figured out maximally efficient training in a few decades, but not several hundred million. One could, of course, run more models if one is willing to unload and reload sets of weights; one could run GPT3 on a laptop by sequentially doing this, but this will lead to substantial slowdowns in the models. At that point it seems that realistically as an AGI you'd run one instance on as many resources as you can, and then have a plethora of domain-specific models that don't have to be general.

But that doesn't matter in general

To me, this is most of what we need to know: if there's something with human-like skills, seeking to disempower humanity, with a population in the same ballpark as (or larger than) that of all humans, we've got a civilization-level problem.

I think we can drop the "with a population..." clause and replace it with speed, memory, and the planning capability that those would lead to.

Because to me a single system would be enough to cause problems, the intuition pump of "It's a lot of them" doesn't quite work for me. In my mind it's "One cleverly arranged pile of compute, including at least human-level intelligence", and that is enough, without the details of how many agents or subsystems it has being as relevant.

I also suspect that if one thinks a single very capable system is able to hack datacenters, then the compute it had for training is just a portion of the total it has access to, so an agent could make hundreds of millions of copies of itself if it finds a way to access more compute. This would be subject to the limitations I discussed earlier on an agent copying itself over the internet.

Appendix D: Is there a fire alarm?

Back in 2017 Eliezer Yudkowsky wrote There's no fire alarm for Artificial General Intelligence, pointing to various examples of technical breakthroughs (heavier-than-air flight, nuclear chain reactions) that were thought to be far in the future, until they weren't. This is similar to what Jacob Steinhardt reported here, where ML systems have been making progress faster than expected. The average person would probably be even more surprised.

The smoke would be progressively better ML models, where we would argue without coming to agreement on how much progress that represents. GPT-3 impressed a lot of people. Connor Leahy, of Eleuther.ai fame, thought AGI would happen in 2045 or something, saw GPT-3, and then came to think that it's 20-30% likely in the next five years, and 40% by 2030. But not everyone agrees, and there is no hard evidence to point to.

Yudkowsky's conclusion is that work on safety should not wait until "AGI is upon us", we should start now. I agree with this. I personally do not have a good model of when AGI will arrive.

Katja Grace points out that fire alarms are not all-or-nothing: if you think of, say, extreme weather events or the hockey stick chart, one can point to increasing concern about climate change. Similarly, there will be increasing concern about AI safety the closer we get to AGI; the question is how fast.

Appendix E: Nanotechnology and recursive self-improvement

This post has not seriously engaged with these two points in any meaningful way; I'll say a thing about nanotechnology and a similar logic applies for recursive self-improvement.

I haven't read Drexler's Nanosystems or Engines of Creation. I have read Adam Marblestone's roadmap to achieve nanotechnology and I am roughly aware of the state of the art. As is also obvious from the rest of my writing, I am aware of what biology can do. I don't have a good sense of how much of it is possible in practice, especially without trial and error in the real world. I would have to read through the books and spend some time to get a better sense of how much I think is possible. Because the cost of forming an opinion on the (un)feasibility of far-out technologies is high, most people will default to sticking with their prior belief. Given the choice between "Continuing to think AI risk is fake" and "Having to read Nanosystems", most people will choose the former.

I suspect that the priors people have on things like this are strongly informed by the domains they come from. In mathematics, philosophy, and CS, very often if you can think it you can have it; or, as David Deutsch puts it, any transformation that is not forbidden by the laws of physics can be achieved given the requisite knowledge. Some people react to this in a positive way: if there is no proof from first principles that something is physically impossible, then there is probably a way (and an AGI will find it by thinking very hard). Others (myself included), perhaps closer to the engineering end of the spectrum, think that the reasoning behind the sentence is simplistic: the laws of physics may forbid more than you can prove from first principles, and building physical systems is hard. We should not assume that what is not forbidden from first principles is therefore highly likely to occur.

These two attitudes lead to optimism and pessimism regarding future technologies even before engaging with each domain.

Going back to one of the quotes I had earlier,

I have some general concerns about the existing writing on existential accidents. So first there's just still very little of it. It really is just mostly Superintelligence and essays by Eliezer Yudkowsky, and then sort of a handful of shorter essays and talks that express very similar concerns. There's also been very little substantive written criticism of it. Many people have expressed doubts or been dismissive of it, but there's very little in the way of skeptical experts who are sitting down and fully engaging with it, and writing down point by point where they disagree or where they think the mistakes are

Here I want to add that the lack of criticism is likely because really engaging with these arguments requires an amount of work that makes it irrational for someone who disagrees to engage. I make a similar analogy here with homeopathy: Have you read all the relevant homeopathic literature and peer-reviewed journals before dismissing it as a field? Probably not. You would need some reason to believe that you are going to find evidence that will change your mind in that literature. In the case of AI risk, the materials required to get someone to engage with the nanotech/recursive self-improvement cases should include sci-fi free cases for AI risk (Like the ones I gesture at in this post) and perhaps tangible roadmaps from our current understanding to systems closer to Drexlerian nanotech (Like Adam Marblestone's roadmap).

A reading list

These are resources I've enjoyed reading about the topics discussed in this essay.

The AI Revolution: Our immortality or extinction (Urban, 2015)
AI could defeat all of us combined (Karnofsky, 2022)
AGI Ruin: A list of lethalities (Yudkowsky, 2022)
Superintelligence FAQ (Alexander, 2016)
Slow motion videos as AI risk intuition pumps (Critch, 2022)
How Disney shows an AI apocalypse is possible (Chivers, 2018)
The case for taking AI seriously as a threat to humanity (Piper, 2020)
No time like the present for AI Safety work (Alexander, 2015)
Existential Risk from Power-Seeking AI (Carlsmith, 2022)
AGI Safety Fundamentals (Various, 2021)
Another (outer) alignment success story (Christiano, 2021)
What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) (Critch, 2021)
Reframing Superintelligence (Drexler, 2019)
Survey on AI existential risk scenarios (Various, 2021)
What Failure looks like (Christiano, 2019)
Resources I send to AI researchers about AI safety (Gates, 2022)
A shift in arguments for AI risk (Adamczewski, 2021)
AI Risk skepticism (Yampolskiy, 2021)
On Deference and Yudkowsky's AI Risk Estimates (Garfinkel, 2022)
Is Power-seeking AI an existential risk? (Carlsmith, 2022)
Can AGI destroy us without trial & error? (Sokolsky, 2022)
What success looks like (Various authors, 2022)
Andrew Critch on AI Research Considerations for Human Existential Safety (Critch, 2020)
Strong Cognitive Uncontainability
Vingean uncertainty
Will the Singularity be incomprehensible? (Vinge, 1996)
What is the Singularity? (Vinge, 1993)
Nova DasSarma on information security (DasSarma, 2022)
Current work in AI alignment (Christiano, 2019)
Steve Byrne's essays (Byrne, 2021-2022)
Biological anchors, a trick that may not work (Alexander, 2022)
AGI Safety from first principles (Ngo, 2022)
Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover (Cotra, 2022)

Also one note about this essay: Because of its rambly nature and relative shallowness of some sections, this particular post is not under the standard Nintil "Find mistakes, make money" guarantee where I pay you if I find mistakes. This is because the points are relatively entangled and there isn't much empirical data in the essay which is usually the easiest to check.

If I can be convinced that the current path of AI development (That is, without purposefully trying to build agents that can look at their own sourcecode etc), including foreseeable safety work by default leads to catastrophe $150
If I can be convinced that "AGI" is incoherent in the sense argued by some of the critics, to the point where the incoherence makes AI risk a non-issue $1000
If I can be convinced of superhuman manipulation skills and reasonable proof that arbitrary humans, aware of the nature of the system and AI risk arguments, can be convinced to let an unaligned AGI out of a box $1000
If I can be convinced that there is no reasonable concern about AI risk (That is, that this is in the same class of problems as worrying about Mars overpopulation) $1000
If I can be convinced that an air-gapped computer can take over computers around it, without either computer being engineered to that end, $1000
If I can be shown a manipulation technique that I find surprising, $50 for each

I suspect these will go unclaimed because:

I have reviewed all the available arguments and ended up unconvinced. Also, it's hard to make a strong case for highly likely catastrophe (say >50% likely) that is convincing absent the right priors
I have reviewed and dismissed all available arguments, to my knowledge, and found them wanting. I don't think there are better arguments that are not wordceling in disguise
I have already seen the AI in a box examples and found them wanting
I have reviewed the case for and against, ending up thinking that ~0% risk is an unhinged position
I have reviewed the past examples of breaching airgaps and what they involved; general familiarity with how computers work
Trust in the efficient market hypothesis, some knowledge of human psychology, being very online

I also broadly welcome more material to read, you can send that to [email protected] .

Acknowledgements

I'd like to thank everyone that provided feedback, helpful discussion, and typo corrections for this post. In no particular order: Andy Matuschak, Matthew McAteer, Kipply Chen, Jacob Steinhardt, Vincent Weisser, Jaime Servilla, Jacques Thibodeau, Willy Chertman, Eirini Malliaraki, Brian Cloutier, Josh Albrecht, Ivan Vendrov, Sergio Pablo Sanchez, Rohit Krishnan, Jacy Reese Anthis, Javier Arcos Hodar, Clara Collier, Carl Shulman, Niccolo Zanichelli, Gaurav Ragtah, Vinay Ramasesh, and various anonymous users. Sorry if I forgot you!

Changelog

2022-09-07: Added evidence that maybe chimpanzees don't have as good a working memory as I thought (See here). The video I linked looks impressive, but one has to remember that that was a particularly well trained chimpanzee. A well trained human is also able to accomplish feats of fast action and memory that go way beyond that like this. Thanks to Nathan Nguyen for this point.
2022-12-12: Reworked the paragraph on Hitler. Thanks to Haukur Thorgeirsson.

Citation

In academic work, please cite this essay as:

Ricón, José Luis, “Set Sail For Fail? On AI risk”, Nintil (2022-08-04), available at https://nintil.com/ai-safety/.
