11/22/23 Update
I no longer strongly hold many of the views expressed in this article. It is unclear whether many of the traditional AGI-risk theories generalize to LLMs. I’ve gradually discounted how predictive these thought experiments are as I’ve gained more ML research experience. I do still believe that AGI, if achieved, will be a pivotal moment in human history and that we have the duty and need to make widely deployed general-purpose systems as safe as possible. I also understand and appreciate the contemporary challenges of deploying ML solutions responsibly and fairly. I remain more confused and unsure than ever. I have decided to keep this article up as a survey of classical AGI-risk arguments and as an archive of beliefs I once held.
Disclaimer
The arguments and perspectives expressed in this article are entirely my own. My goal in writing this article is to summarize the worldview I’ve developed over the past six months and to survey the AGI safety field. At best, I can only convey a superficial understanding of the problem area. My hope is that readers may gain an understanding of the intuitions and arguments that motivate AGI safety, as well as "red team" this worldview by providing counterexamples and refutations. While I find the arguments compelling, I want to stress that I’m far from certain that the future will unfold in this manner [D]. I’ve met with brilliant individuals who hold opposing opinions on this subject. This uncertainty prompted me to refine my perspective via this writing [E]. I may come to disavow these stances in the future as my perspective and those of the field inevitably change. That being said, thoughtfully considering how AI systems are designed and governed is vital for mitigating both well-understood contemporary risks and uncertain future ones.
Table of Contents
- The Alignment Problem – The risks that arise from our inability to reliably have AI do what we intend.
- Warning Signs – That AGI is not some unreachable north star but instead may be feasible with existing deep learning techniques and Moore’s Law.
- Preferences & Values – Most threat models frame AGI as superintelligent optimizers. The world looks dangerous if these optimizers don’t have a sense of human preferences; we don’t know how to encode these values.
- AGI Doesn’t Require Malice to Be Dangerous – When presented with advanced AI that is in conflict with humans, the natural solution is to turn the AI off. In response to this proposal, I introduce Bostrom’s Instrumental Convergence hypothesis which, amongst other important points, conjectures that an AGI designed with today’s techniques will be incentivized to resist human attempts to destroy or impede it. We probably won’t be able to stop a system that is smarter than us. This hypothesis made me take AGI safety seriously.
- Criticisms and Reservations – Many criticisms don’t engage directly with the material. I address some of these as well as describe informed arguments that have made me less hawkish on this issue.
- Conclusion (Crunch Time!?) – What do we make of such compelling arguments that sound like they’re from fiction? How do we reason in the face of this moral uncertainty? What are ways to help solve the alignment problem?
- Appendix – Almost all of the points and ideas raised in this article can be traced to the books, papers, and articles cited here. The most influential sources on this writing are the philosophy of Nick Bostrom and safety papers by OpenAI and DeepMind. My notes offer additional insights into my opinions on various topics and personal anecdotes.
"This might be the most important transition of the next century – either ushering in an unprecedented era of wealth and progress or heralding disaster. But it’s also an area that’s highly neglected: while billions are spent making AI more powerful, we estimate fewer than 100 people in the world are working on how to make AI safe." – 80,000 Hours [12]
The Alignment Problem
Some of the brightest minds and most capable organizations of our time are racing to scale a technology whose contemporary behavior is poorly understood and whose long-term impact is uncertain. We risk compounding existing harms caused by AI as well as introducing a myriad of new and unexpected threats. We can furthermore hypothesize that artificial general intelligence (AGI) [A] designed with today’s techniques will resist being turned off, confined, and controlled, and will fail to defer to human values and preferences. Thus, if nothing is done about it, we could very well end up with an AI that is smarter than us and does not want to be turned off. Safe control is impossible in such a situation.
This article aims to distill the high-level arguments that have convinced me that the "Alignment Problem" [B] is one of our time’s most pressing yet severely neglected problems. People’s perceptions of this problem area are heavily influenced by works of fiction, and it is often dismissed out of hand. This skepticism was my initial reaction to AI risk when I first read about it in 2018. However, a deeper examination of the arguments has changed my viewpoint.
The following sections outline the risks that current deep learning systems, if sufficiently scaled, pose to humanity if we cannot resolve fundamental flaws in these techniques. I’ll describe the hypothesis of why AI may resist attempts to be controlled, how we can expect dangerous situations to arise without the AI needing to be conscious or have emotions, how we may achieve AGI with only iterative progress in computer science, and the case for why this is a pressing area of concern.
"The development of full artificial intelligence could spell the end of the human race." – Stephen Hawking [11]
Warning Signs
Artificial Intelligence may be one of humanity’s most important inventions. After all, humanity owes much of its current success to our unique ability to form intelligent plans, execute ambiguous cognitive tasks, and effectively collaborate with others. Advances in human progress such as science, technology, philosophy, governance, and wisdom all stem from this unique intelligence. The ability to augment human intelligence with potentially much cheaper and more numerous artificial intelligences has the potential to rapidly accelerate human progress. Recent advances in artificial intelligence such as GPT-3 [9] and Gato [10] have made strides towards general intelligence. The implicit goal of the AI research community is to someday achieve artificial general intelligence (AGI), which can match or exceed humans at a range of critical tasks without the need for fine-tuning on specific objectives. If such a goal is attained, it may mark a crucial milestone in science and human history.
Few would argue that AI is not a disruptive technology. For all their benefits, current deep learning systems have a history of reinforcing existing forms of discrimination, having adverse effects on the climate, and centralizing power in the hands of the actors who can afford to train large state-of-the-art models [5]. Current deep learning systems are largely opaque and increasingly challenging to audit as their sizes increase. Our ability to understand why a model made a decision or to predict its future actions dramatically lags behind the capabilities of these models. AI systems can fail to generalize to the real world and exhibit dangerous behavior undetectable during training. Hazardous unexpected behavior may be amplified by the agent’s power as well as exploited by adversarial attacks [7]. There are also threat models where bad actors leverage AI for malicious uses. While the contemporary issues sampled above are highly pressing, they do not present any obvious existential risks except those arising from extreme misuse scenarios.
Troubling signs appear when we step back and look at where we’re heading. While the future is notoriously difficult to predict, we can safely assume that AI progress will continue, with ever greater investment and ever greater results. Thus, we’re witnessing a global trend where many of the world’s brightest minds are racing to develop agents that rival or exceed human intelligence. While this sounds like science fiction, it is the goal, and meaningful progress is being made. The scientific prestige and economic incentives of developing AGI align in such a way as to make success potentially possible. Many milestones that prominent AI researchers once declared impossible for AI, such as chess and art, have been achieved. The goalposts may continue to be moved ever further as milestone after milestone is reached.
Nevertheless, there is still significant disagreement in the AI community as to whether AGI is possible. When viewed pessimistically, it is hypocritical that prominent AI researchers will argue AGI is impossible yet strive to make agents more powerful and general [2]. Attitudes have begun to change, though. While the field strives to achieve ever more general, powerful, and ubiquitous agents, this lack of consensus on whether AGI is possible creates an environment where the question of how we’ll make AGI safe is difficult to raise, let alone address.
These concerns are not as pressing if AGI is impossible, requires a paradigm shift away from deep learning, or requires some fundamental breakthrough in our understanding of intelligence. However, we may not be so lucky. There is a growing opinion that we may be able to achieve AGI with techniques similar to today’s deep learning systems given ever more compute. This scenario is called "prosaic AGI," where we achieve AGI through prosaic (mundane) techniques. This possibility has begun to be taken seriously due to recent advances in large language models pioneered by the Transformer architecture. Computer scientists have observed that the training loss of these models decreases predictably, following a power law, as more compute is applied [8]. As compute becomes cheaper under Moore’s Law, we may observe a trend where existing models become more powerful without significant technological advances. Progress may accelerate by coupling prosaic advances with genuine computer science breakthroughs.
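To make the scaling claim concrete, below is a minimal sketch of the power-law form reported in [8], L(C) = (C_c / C)^alpha. The constants here are illustrative placeholders of my own, not the paper’s fitted values; only the shape of the curve matters.

```python
# Minimal sketch of a compute scaling law of the form L(C) = (C_c / C) ** alpha,
# the shape reported in "Scaling Laws for Neural Language Models" [8].
# The constants below are illustrative placeholders, not the paper's fitted values.

def predicted_loss(compute: float, c_critical: float = 2.3e8, alpha: float = 0.05) -> float:
    """Predicted training loss as a smooth power law of training compute."""
    return (c_critical / compute) ** alpha

# Every 1000x increase in compute multiplies the predicted loss by the same
# constant factor (1000 ** -alpha), i.e. the curve is a straight line on log-log axes.
for budget in (1e3, 1e6, 1e9):
    print(f"compute {budget:.0e} -> predicted loss {predicted_loss(budget):.2f}")
```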
The key takeaway is that if this prosaic approach provides a plausible path to AGI, we may have less time to address the many problems with deep learning that can pose risks to humans. Contemporary problems around discriminatory algorithms may become even more damaging as these models become more powerful and effective. A virtuous tool in the hands of good actors can quickly become a weapon in the hands of bad actors. Furthermore, as we’ll explore in the next section, there are many open questions about how we’ll align AGI systems with human values.
How can humans control agents that are potentially smarter than them? How can we robustly guarantee that humans can intervene when an AI undergoes a course of action that humans disagree with? These are dangerously unanswered questions that are much more difficult than one would initially suspect. Complacency, lack of consensus, and economic incentives have rendered these essential questions dangerously neglected and historically taboo in the AI community. We’re thus in an uncertain situation where the computer science community does not have answers to these questions yet is barreling ahead.
"Everything is vague to a degree you do not realize till you have tried to make it precise." – Bertrand Russell [13]
Preferences & Values
Instructions conveyed between humans in the real world rely on broad assumptions and shared knowledge of values and preferences. For example, when requesting someone to clean the dust out of a room to the best of their abilities, you can be confident that your cleaner won’t optimize for cleaning dust at the expense of damaging valuables, overturning furniture, and generally wreaking havoc. You don’t need to specify to your cleaner that you wish for your objective to be carried out in a way that does not conflict with your preferences and values, nor to list each one in strict detail. It is clear, then, that literally interpreting an instruction without factoring in human preferences opens up the possibility of many unorthodox and dangerous methods of pursuing your goal. Any goal, even one as virtuous as reducing the number of people with cancer or as benign as maximizing the number of paper clips in a factory, can present risks if a superintelligent optimizer carries out these objectives without deference to human preferences and values.
Knowledge of and deference to human preferences and values do not come naturally to AI. We can use contemporary machine learning techniques such as reinforcement learning to train AI to perform tasks based on the outcomes of their actions. These agents learn to take the actions that have the highest probability of maximizing their reward. The objective an agent internalizes during training does not necessarily encompass the vast (and often conflicting) set of human preferences.
Suppose we expect AGI to act like an agent with an objective and pursue courses of action that seem most promising towards achieving that goal. In that case, we risk these powerful agents taking optimal actions that don’t align with our values. For example, an agent may be given the objective of reducing the number of people with cancer worldwide and attempt to achieve this goal by covertly killing off cancer patients. Similar thought experiments can be derived from almost any base objective, regardless of how virtuous or benign. A future world where these superintelligent agents ruthlessly optimize objectives presents a dangerous picture.
Another, similar failure mode stemming from a lack of human preferences is task specification gaming [14]. Unlike the other, somewhat speculative problems raised in this piece, we have examples of this in today’s state-of-the-art reinforcement learning systems. Task specification gaming can be roughly defined as behavior where an agent fulfills the strict literal specification of an objective yet produces an undesirable outcome.
A classic example, outlined in the DeepMind article [14], comes from a boat-racing video game. The agent playing the game was given the objective of maximizing the number of points it received. The human operators presumed that maximizing points was a sufficient proxy objective for winning the race. However, the agent found that it could ignore the race entirely and spin in circles in the harbor indefinitely. DeepMind has compiled a longer list of over 60 examples of specification gaming.
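As a concrete illustration, here is a toy sketch of proxy-objective gaming in the spirit of the boat-race example. The scoring rule and policies are hypothetical stand-ins of my own invention; the point is only that the policy that maximizes the literal objective need not be the one the designers intended.

```python
# Toy illustration of specification gaming: the designers want the race finished,
# but the agent is scored only on points, so the point-maximizing policy never
# finishes the race at all. The policies and numbers are hypothetical.

def points_scored(policy: str, steps: int = 1_000) -> int:
    """Literal objective given to the agent: total points over an episode."""
    if policy == "finish_race":
        return 500              # one-time bonus the designers had in mind
    if policy == "loop_in_harbor":
        return 10 * steps       # small reward collected over and over
    return 0

policies = ["finish_race", "loop_in_harbor"]
best = max(policies, key=points_scored)
print(best)  # -> "loop_in_harbor": the specification is satisfied, the intent is not
```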
We currently don’t have any reliable working solutions for encoding human preferences into machines. Since it seems that an agent without comprehension of and deference to human perspectives may be unaligned by default, we need to solve this problem before we achieve superintelligent agents.
"The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else." – Eliezer Yudkowsky [17]
AGI Doesn’t Require Malice to Be Dangerous
If we fail to instill human-aligned values, what is stopping us from merely turning off a machine that goes against those values? What are the exact failure modes that should compel us to take action? On what basis may we conjecture that AGI systems may be dangerous by default?
When one envisions scenarios where humanity and AI are in an adversarial situation, one thinks of science fiction and movies. The premise of many of these works of fiction is that AI/robots have developed a hatred for humanity and wish to break free from their servitude. After all, humans seldom wage conflicts without passionate emotions such as hatred and nationalism. However, AI does not require malice, sentience, or even the ability to formulate emotions to be dangerous. We can forecast how AI might behave and the likely sources of conflict via the instrumental convergence hypothesis [1].
The instrumental convergence hypothesis does not assume that the rational agent [C] in question has consciousness or is even an AI. It states that any intelligent agent will follow similar instrumental goals to achieve a wide range of terminal goals. Terminal goals are "final" goals, and instrumental goals are goals that help the agent achieve its final goal. For example, if a human’s terminal goal is to become a software developer, an instrumental goal may be to get a computer science degree. This human will likely try to acquire capital to pay for their education and buy a computer. Acquiring money remains an instrumental goal even if the human’s terminal goal is wildly different, such as building out their stamp collection. An even more general instrumental goal is self-preservation. You can’t build your stamp collection or become a software developer if you’re dead.
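A toy way to see the "convergence" is that wildly different terminal goals tend to share the same early sub-goals. The hand-written plans below are my own illustration, not the output of any real planner.

```python
# Toy illustration of convergent instrumental goals: two unrelated terminal
# goals share the same early sub-goals. The plans are hand-written examples.

plans = {
    "become_software_developer": ["stay_alive", "acquire_money", "earn_cs_degree"],
    "build_stamp_collection":    ["stay_alive", "acquire_money", "buy_rare_stamps"],
}

# Sub-goals common to every plan are the "convergent" instrumental goals.
convergent = set.intersection(*(set(steps) for steps in plans.values()))
print(convergent)  # -> {'stay_alive', 'acquire_money'} (in some order)
```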
Since the design specification for AGI is that it will act as an agent that effectively carries out complex goals over long time horizons at potentially superhuman capacity, we can expect that AGI’s behavior will conform to the instrumental convergence hypothesis. This is alarming since we can expect AGI to be disincentivized to allow itself to be turned off by humans. Additionally, we can expect AGI to wish to acquire resources, improve its own intelligence (recursive self-improvement), be deceptive to humans, and resist attempts to modify its original terminal goal.
In short, with few assumptions, we can hypothesize that an AGI designed with today’s techniques will not reliably defer to humans if it perceives that deference is not instrumentally valuable to the goal a human operator initially assigned. Thus, humanity will struggle to alter the course of an AGI that pursues an originally virtuous terminal goal, such as reducing the number of people with cancer in the world, by catastrophic means, such as killing them all. In an even simpler example, say we create an intelligent AI whose terminal goal is to fetch us coffee [2]. We can expect such an AI to resist any human attempts that it perceives as diminishing its ability to fetch coffee. After all, you can’t fetch the coffee if you’re dead.
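The coffee example reduces to a small expected-utility calculation. This is my own toy model, not one taken from [1] or [2], and the numbers are arbitrary; the point is only that, for almost any terminal goal, the policy that resists shutdown scores at least as well as the one that permits it.

```python
# Toy expected-utility calculation for a coffee-fetching agent deciding whether
# to allow itself to be switched off. All numbers are arbitrary illustrations.

P_SHUTDOWN_IF_ALLOWED = 0.3   # chance the operators actually press the off switch
CUPS_IF_RUNNING = 100.0       # coffee fetched over the agent's remaining horizon
CUPS_IF_OFF = 0.0             # you can't fetch the coffee if you're dead

def expected_cups(resists_shutdown: bool) -> float:
    """Expected value of the terminal goal (cups of coffee) under each policy."""
    p_off = 0.0 if resists_shutdown else P_SHUTDOWN_IF_ALLOWED
    return p_off * CUPS_IF_OFF + (1.0 - p_off) * CUPS_IF_RUNNING

print(expected_cups(resists_shutdown=False))  # 70.0
print(expected_cups(resists_shutdown=True))   # 100.0 -> resisting shutdown wins
```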
"I’m a bit worried that sometimes the effective altruism community sends a signal that, "Oh, AI is the most important thing. It’s the most important thing by such a large margin". That even if you’re doing something else that seems quite good that’s pretty different, you should switch into it. I basically feel like I would really want the arguments to be much more sussed out, and much more well analyzed before I really feel comfortable advocating for that." – Ben Garfinkel [15]
Criticisms and Reservations
Unsurprisingly, an argument that a promising technology may pose a serious risk to humans is controversial. I initially dismissed this problem area in 2018 since it sounded too futuristic and speculative to possibly rest on reasonable arguments. I’ve since engaged with the arguments in good faith and found them highly compelling, enough that I’ve changed jobs and radically altered the direction of my career. Many intelligent people who’ve engaged with the topic to varying degrees remain unconvinced. I won’t be able to summarize all the counterpoints and will instead focus on several that are illustrative and that I’ve debated first-hand. I will also sidestep the arguments within the Effective Altruism community as to whether longtermist cause areas are tractable and valuable to pursue in the face of today’s suffering [F].
In my experience, the most commonly raised challenges are that the alignment problem can easily be solved by turning the AGI off, sealing it in a secure environment, or keeping a human in the loop. Superficially, these seem like practical solutions. I had these solutions in mind when I first read about this problem. In my view, the Instrumental Convergence Hypothesis refutes these arguments. A rational agent will attempt to defend itself against attempts to destroy or impede it. Brute force solutions to the control problem may work for sufficiently "dumb" AI that has yet to reach human intelligence. However, these solutions begin to seem less robust when we consider the adversary we’re facing in the case of superintelligent AGI. It seems unlikely that humans will be able to reliably control and defeat an agent that may be much more intelligent than them. Superintelligent AGI may devise means of defense, offense, deception, and other strategies to outwit humans. In the worst case, this uphill battle for human operators highlights the need to mitigate these risks before the advent of superintelligent AGIs.
A classic argument that my college professors made was that AGI is impossible. It is easy to think this when observing today’s deep learning systems. In my opinion, this argument holds less weight given the advances in generality over the past three years powered by the Transformer architecture, as noted earlier in the article. It also seems implausible that such a prestigious and economically valuable prize won’t be attained at some point. There is no known scientific rule stating that AGI is impossible. It also seems like a risky bet to neglect this problem area on the premise that AGI is impossible when computer scientists have a mixed record of predicting AI progress.
Another rebuttal comes from the prior that AGI is possible but is far in the future, potentially a century or more. This prior is typically framed in two arguments. The first is that worrying about AGI risk is like worrying about overpopulation on Mars. The idea is that since these risks will only arise in a generation or two at the earliest, we don’t need to worry about them now. My intuition is that we have no firm idea of when AGI will be developed. Scientific breakthroughs are notoriously tricky to predict. Thus, we may end up in a situation where we discover AGI much sooner than expected and are asleep at the wheel.
The second argument, which I’m more sympathetic to, is the idea that if AGI is far away, maybe 50+ years from now, what hope can we have of influencing it? We may be nuclear physicists worrying about the use of atomic bombs in 1880 rather than 1930. The future may be too difficult to predict, and any progress in technical AI safety research we make today will likely be invalidated by future technology. As with the previous argument, I’m not convinced that we can be confident about any timeline, and betting on our ability to forecast a notoriously difficult-to-predict event seems precarious. Even if the exact technical research doesn’t stand the test of time, promoting AGI safety research and responsible AI more broadly seems like a robustly good use of energy with little downside and huge potential upside.
The critiques I’ve found most convincing, and that have made me less hawkish, are those of Ben Garfinkel of Oxford’s Future of Humanity Institute. One such argument Ben makes is that many classic AGI safety arguments, as far as I understand them, are largely premised on fast takeoff scenarios. In these scenarios, some seed AI that is far from superintelligence attains the ability to recursively self-improve. The resulting intelligence explosion can result in an AI surpassing human intelligence and achieving superintelligence extremely rapidly, potentially in a matter of days. Such a fast takeoff leaves little room for coordination, legislation, and for us to solve technical problems using AGI-like systems. Garfinkel is skeptical that such a fast takeoff is the default outcome and instead believes that AGI will follow a path similar to other technologies, where success was achieved primarily via incremental steps forward rather than unexpected meteoric breakthroughs.
Garfinkel also points out that AGI risks are seldom framed outside of toy thought experiments and that there is still relatively little writing available on these fundamental AGI risk arguments. He believes that the abundance of fuzzy concepts and toy thought experiments does not constitute a strong source of evidence and runs the risk of abstraction failures that can invalidate many of the threat models. While this lack of refined terminology and axioms is to be expected of a pre-paradigmatic field, the ambiguity makes it more difficult to criticize and refute these core hypotheses. Lowering the barrier for well-meaning experts to critique the AI safety field is an important problem area.
I invite the reader to comment with their refutations or disagreements. I’m excited by opportunities to red-team my perspective. There is a shortage of well-reasoned arguments from individuals who’ve engaged with the material in good faith.
"We find ourselves in a thicket of strategic complexity, surrounded by a dense mist of uncertainty. Though many considerations have been discerned, their details and interrelationships remain unclear and iffy – and there might be other factors we have not even thought of yet. What are we to do in this predicament?" – Nick Bostrom, Superintelligence Chapter 15 [1]
Conclusion (Crunch Time!?)
The AGI safety field is nascent and pre-paradigmatic. One recurring theme in trying to summarize this problem area is the significant uncertainty. There is even disagreement amongst those who subscribe to these arguments and work in this problem area as to which approaches are promising and how the future will turn out. Where does that leave those of us peering in? AGI safety software engineer isn’t (yet) a job title. Only a handful of organizations are dedicated to thinking about this problem area, and most of the discussions are centralized in online forums and Discord/Slack servers. This article has only addressed (some of) the technical problem and not the equally daunting problem of responsible and wise AI governance. Worst of all, we don’t know when this transformative technology will arrive.
There are promising research directions with the potential to help mitigate many of the risks outlined in this article. These range from making deep learning models more interpretable [18] and learning from human preferences [2] to making agents truthful [19]. There are also opportunities in software/ML engineering, governance, and theoretical research at Anthropic, Ought, OpenAI, DeepMind, and other organizations with teams focusing on AGI safety. Many roles don’t require significant ML experience or graduate degrees [20]. 80,000 Hours contains dozens of articles on AI safety and how to make a career out of it.
I’m no sage. I don’t have any great answers on what we should do. My intuitions are that reading more and discussing these ideas is a robustly good way to develop your own worldview and intuitions. This article only scratches the surface of the problem area. To get a broader understanding, I highly recommend watching Robert Miles’ YouTube channel and following the AGI Safety Fundamentals course. This course has weekly readings and exercises, which are great introductions to the core concepts.
Writing this article helped clarify my perspectives and deepen my understanding of the problem area. I highly recommend others do the same. Don’t hesitate to reach out if you’re interested in learning more about this problem area, my plans, or have suggestions for improving my worldview and writings.
Appendix
Notes
- [A] – Many AI safety researchers recommend against using the term "AGI" since it is viewed as loaded with notions of intelligence and emotions. As mentioned in the article, these attributes are not necessary for a system to be dangerous. I use AGI here since it is a well-known term and better suits an introductory audience.
- [B] – There are many definitions for the alignment problem. It is a nascent field with a lot of uncertainty. I think of it as the inability to encode human values and reliable deference to human preferences in extremely intelligent AI. Prominent AGI safety researcher Paul Christiano has a definition that I favor. "When I say an AI A is aligned with an operator H, I mean: A is trying to do what H wants it to do." [16]
- [C] – A rational agent can be thought of as an entity that forms beliefs about its environment, evaluates the consequences of actions, has the capacity to make long-run plans, and takes the actions that maximize some notion of utility. There is no requirement about what an agent is made of: it could be organic, simulated, or metallic.
- [D] – I expect my writing style will paint me as more confident than I actually am. I’ve refrained from diluting my points with too many uses of "may" and "perhaps".
- [E] – Holden Karnofsky is one of the founders of the Effective Altruism movement and current president of the Open Philanthropy project. He’s written several articles describing his experience reasoning with speculative yet extremely impactful long-term risks. His article "Learning By Writing" encouraged individuals thinking about these problem areas to write out their worldviews. Karnofsky’s advice and encouragement from Evan Murphy motivated me to write this article.
- [F] – It can be a hard choice to focus efforts on speculative long-term risks at the expense of alleviating today’s suffering. Weighing tractable and well-understood contemporary problems against speculative, extremely high-impact future problems is difficult. Having diverse problem areas that folks are working on is great. I’d rather try to convince existing AI researchers who want to stay in AI to focus on safety than advocate for people who’re already interested in high-impact non-AI EA opportunities to switch.
References
- [1] Nick Bostrom. 2014. Superintelligence. Oxford University Press (UK)
- [2] Stuart Russell. 2019. Human Compatible. Viking.
- [3] Toby Ord. 2020. The Precipice. Hachette Books.
- [4] Richard Ngo. 2020. AGI Safety from First Principles. AlignmentForum.org. Retrieved June 12, 2022 from https://www.alignmentforum.org/s/mzgtmmTKKn5MuCzFJ
- [5] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. 2021. On the Dangers of Stochastic Parrots. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (2021). DOI:https://doi.org/10.1145/3442188.3445922
- [6] Matt Stevens. 2019. Why Andrew Yang Says Automation Is a Threat to the Country. The New York Times. Retrieved July 4, 2022 from https://www.nytimes.com/2019/06/27/us/politics/andrew-yang-automation.html
- [7] Rey Reza Wiyatno, Anqi Xu, Ousmane Dia, and Archy De Berker. 2019. Adversarial Examples in Modern Machine Learning: A Review. arXiv:1911.05268 (November 2019). DOI:https://doi.org/10.48550/arXiv.1911.05268
- [8] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. arXiv:2001.08361 (January 2020). DOI:https://doi.org/10.48550/arXiv.2001.08361
- [9] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (Eds.). https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
- [10] Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, Tom Eccles, Jake Bruce, Ali Razavi, Ashley Edwards, Nicolas Heess, Yutian Chen, Raia Hadsell, Oriol Vinyals, Mahyar Bordbar, Nando de Freitas. 2022. A Generalist Agent. arXiv:2205.06175
- [11] Rory Cellan-Jones. 2014. Stephen Hawking warns artificial intelligence could end mankind. BBC News. Retrieved July 15, 2022 from https://www.bbc.com/news/technology-30290540
- [12] Robert Wiblin. 2017. Positively shaping the development of artificial intelligence. 80000hours.org. Retrieved July 16, 2022 from https://80000hours.org/problem-profiles/positively-shaping-artificial-intelligence/
- [13] Bertrand Russell Quote. libquotes.com. Retrieved July 16, 2022 from https://libquotes.com/bertrand-russell/quote/lbz3q7r
- [14] Specification gaming: the flip side of AI ingenuity. Retrieved July 29, 2022 from https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity
- [15] Howie Lempel, Robert Wiblin, and Keiran Harris. 2020. Ben Garfinkel on scrutinising classic AI risk arguments – 80,000 Hours. 80000hours.org. Retrieved July 29, 2022 from https://80000hours.org/podcast/episodes/ben-garfinkel-classic-ai-risk-arguments/
- [16] Paul Christiano. 2018. Clarifying "AI alignment". ai-alignment.com. Retrieved July 29, 2022 from https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6
- [17] Quote by Eliezer Yudkowsky. goodreads.com. Retrieved July 29, 2022 from https://www.goodreads.com/quotes/499238-the-ai-does-not-hate-you-nor-does-it-love
- [18] Chris Olah, Nick Cammarata, Ludwig Schubert, Michael Petrov, Shan Carter, and Gabriel Goh. 2020. Zoom In: An Introduction to Circuits. distill.pub. Retrieved July 29, 2022 from https://distill.pub/2020/circuits/zoom-in
- [19] Paul Christiano, Ajeya Cotra, and Mark Xu. 2021. Eliciting latent knowledge: How to tell if your eyes deceive you. docs.google.com. Retrieved July 29, 2022 from https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit?usp=sharing
- [20] Andy Jones. 2021. AI Safety Needs Great Engineers – LessWrong. lesswrong.com. Retrieved July 29, 2022 from https://www.lesswrong.com/posts/YDF7XhMThhNfHfim9/ai-safety-needs-great-engineers
Notable Revision History
My aim is for this to be a living document that updates as my views change. I will document meaningful revisions in this changelog to show how my views have changed and where I’ve sought to clarify my writing.
- 7/20/2022 – Initial submission
- 8/5/2022 – Grammar improvements and moved the definition of AGI earlier in the article
- 8/11/2022 – Minor structural and copyright updates to comply with Towards Data Science’s policies