AI safety reading group
Weekly discussions of readings on technical and philosophical topics in AI safety.
AI safety is the field trying to figure out how to stop AI systems from breaking the world, and, in particular, trying to do so before they break the world. Readings will span from potential issues arising from future advanced AI systems, to technical topics in AI control, to present-day issues.
- Organisers: Matthew Farrugia-Roberts and Dan Murfet.
- Time: Thursday evenings, 9pm AEST, most weeks (see home page for most up-to-date schedule).
- Venue: The Rising Sea.
Directions for joining discussions:
- New to metauni? Follow these instructions (part 2) to join the metauni Discord server, and introduce yourself in the channel.
- Metauni talks take place in Roblox using in-game voice chat. Follow these instructions (part 1) to create a Roblox account, complete “age verification” (unfortunately, this involves sharing ID with Roblox), and then enable Roblox “voice chat”.
- At the scheduled discussion time, launch the Roblox experience The Rising Sea and then step into matomatical’s portal (bottom-right corner of stack, see picture), or use the menu: “Pockets” > “Go to pocket” > type address “Gemini Pulsar 1”.
Completing weekly readings is recommended, but ultimately optional. The discussion sessions begin with a summary of the reading, led by Matt (unless otherwise noted).
Upcoming readings and discussions:
2022.09.22: break week
2022.09.29: Eliezer Yudkowsky, 2013, “Intelligence explosion microeconomics”, MIRI technical report.
Past readings and discussions:
2022.06.09: Norbert Wiener, 1960, “Some moral and technical consequences of automation”, Science.
2022.06.16: Stephen M. Omohundro, 2008, “The basic AI drives”, Proceedings of the 2008 conference on Artificial General Intelligence.
2022.06.23: Nick Bostrom, 2012, “The superintelligent will: Motivation and instrumental rationality in advanced artificial agents”, Minds and Machines.
2022.06.30: Rachel Thomas and Louisa Bartolo, 2022, “AI harms are societal, not just individual”, fast.ai blog. Discussion led by Dan.
2022.07.21: Tobias Wängberg et al., 2017, “A game-theoretic analysis of the off-switch game”, AGI 2017.
2022.08.04: Scott Garrabrant et al., 2017, “Logical induction”, arXiv. Discussion led by Dan. Note: there is an updated 2020 version on arXiv.
2022.08.25: an original presentation by Matt about compression and learning in models of computation embedded in the real world.
2022.09.08: discussion of reading group direction.
2022.09.15: Nate Soares, 2022, “On how various plans miss the hard bits of the alignment challenge”, LessWrong.
AI safety is political philosophy complete
- Externalities correspond to market alignment failures. How do we handle them? Overcome them? Do we face risk from them? Would these risks be exacerbated by advanced AI?
- How can we live in the midst of complex systems we don’t understand, and can’t fully control, like civilisation, capitalism, etc.?
- What other literatures could help us here?
- AI governance
- Is there literature on technology and society?
- Specific safety proposals
Sources of readings (clearly with much mutual overlap):