#1: Agency preserving RL & game theory AGI gyms

Under this topic we explore several variants that focus on algorithmically describing how "agency" and "agency preservation" might be conceptualized or learned, e.g. by reinforcement learning (RL) agents. We can begin simply by viewing agency as a "capacity" to affect the environment (or external world) and limit ourselves to few-agent environments. We can ask easy questions, such as how to quantify agency and track it in simulated games or environments, and we can ask hard questions, such as how to solve previously unsolved problems of organizing or maximizing resource use fairly, or of reducing inequality among agents.

Easy questions: Is agency quantifiable, e.g. as the number of states an agent can reach or the number of changes it can make to the world? Do RL agents tend to seek "power-for-themselves" and "disempowerment-for-others" as instrumental goals?
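As one possible way to make the first easy question concrete, the sketch below counts how many distinct cells an agent can occupy within a fixed number of steps in a small grid world. This is only an illustrative proxy for agency: the grid layout, the `reachable_states` and `agency` names, and the choice of a step horizon are all assumptions made for this example, not an established definition.

```python
from collections import deque

# Hypothetical 4-connected grid world: '#' is a wall, '.' is free space.
# The outer border is all walls, so neighbour lookups never leave the grid.
GRID = [
    "#####",
    "#...#",
    "#.#.#",
    "#...#",
    "#####",
]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def reachable_states(grid, start, horizon):
    """Return the set of cells the agent can occupy within `horizon` steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        (row, col), depth = frontier.popleft()
        if depth == horizon:
            continue  # do not expand beyond the step budget
        for d_row, d_col in MOVES:
            nxt = (row + d_row, col + d_col)
            if grid[nxt[0]][nxt[1]] != "#" and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def agency(grid, start, horizon):
    """Illustrative agency proxy: number of cells reachable within `horizon` steps."""
    return len(reachable_states(grid, start, horizon))


if __name__ == "__main__":
    for h in range(5):
        print(h, agency(GRID, (1, 1), h))
```

A count like this can be tracked over time for each agent in a simulation, for instance to check whether one agent's actions shrink another agent's reachable set.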

Hard questions: How can we guarantee that (benevolent) superhumanly intelligent AI systems – which are essentially alien lifeforms – understand and protect human wellbeing and human control (i.e. agency) over the world? How would such AI systems balance competing values and goals to arrive at human-acceptable solutions and human futures – and avoid silly things like paperclip maximization?

To get us started on the "easy" questions there are several related works that might guide us:

So we are broadly seeking to conceptualize RL AGIs that might learn to directly preserve agency, i.e. to keep many options/locations and many choices available to agents (and possibly to improve these), rather than focusing on human intent or truth, accurate value representation, or interpretability. The challenge is how to do this without negatively affecting long-term outcomes through what we call "agency depletion". Our sketch of agency depletion is that even well-meaning (i.e. "intent aligned"), truthful AIs can gradually remove all but the safest options from an environment because of their risk-reducing and reward-maximizing objectives. (If the AI agent is misaligned or untruthful, option depletion happens even faster.)
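The agency depletion sketch above can be illustrated with a toy simulation. In the example below, a hypothetical assistant repeatedly removes whichever remaining option has the lowest risk-adjusted score, and we record how many options the human has left after each step. The option names, reward and risk values, the `risk_weight` parameter, and the pruning rule itself are all assumptions chosen for illustration; they are not a model of any real system.

```python
# Each hypothetical option is (name, expected_reward, risk).
OPTIONS = [
    ("stay home", 1.0, 0.0),
    ("walk", 1.5, 0.1),
    ("cycle", 2.0, 0.3),
    ("climb", 3.0, 0.6),
]


def prune_riskiest(opts, risk_weight=10.0):
    """Remove the option with the lowest risk-adjusted score.

    The score is expected_reward - risk_weight * risk, an assumed stand-in
    for a risk-reducing, reward-maximizing objective.
    """
    if len(opts) <= 1:
        return list(opts)  # never remove the last remaining option
    worst = min(opts, key=lambda o: o[1] - risk_weight * o[2])
    return [o for o in opts if o is not worst]


def simulate(opts, steps):
    """Apply the pruning rule `steps` times; return option counts and survivors."""
    history = [len(opts)]
    for _ in range(steps):
        opts = prune_riskiest(opts)
        history.append(len(opts))
    return history, opts


if __name__ == "__main__":
    history, remaining = simulate(OPTIONS, steps=5)
    print(history)    # number of options available after each step
    print(remaining)  # what the human is left with
```

Each individual pruning step looks defensible on its own, yet the option count only ever goes down, and under these assumed numbers the human ends up with the single safest option.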

To get started on the "hard" questions, we have less guidance. But perhaps we can start by rethinking the "paper clip" maximization paradigm not as a "specification gaming" failure – e.g. the human failure to outline all the rules (i.e. a failure on the training distribution) – nor as a "goal misgeneralization" failure – e.g. the failure of the AI or humans to provide all of the required training data or to recognize out-of-distribution scenarios (i.e. a failure out of distribution). What if paper clip maximization occurs because humanity has not solved the problem of distributing and equally sharing large amounts of knowledge and power – let alone figured out how to write algorithms to this end? That is, what if alignment is not an algorithmic failure but a (very) difficult conceptualization problem, such as how powerful agents can live and interact with much less powerful ones?

Essentially, we are interested in agency preservation as a pre-learning, pre-misalignment target for safe AIs, perhaps in game or economic theory paradigms. For example, in a paradigm containing a "human" agent and an AGI agent that has already learned the human's reward function and has nearly omnipotent control over the environment, what does the AGI optimize in assisting the human? That is, how does the AGI balance between all the (true) needs, goals and desires of the human at every time step to make an action or recommendation? And how does the AGI do this when you add many other humans to the environment?

This question goes beyond the problem of AGIs recommending "suboptimal" solutions and towards central – but vastly understudied – questions in alignment about the long-term effects of AI actions on the "operator" and on other agents. Whereas "intent" alignment and "preference" satisfaction focus more on human evaluation of the immediate outcome of AI/AGI actions (and are the basis of RLHF in LLM tuning), here we ask whether AI alignment is a problem of algorithmically defining long-term and distributed effects rather than immediately observable effects:

* We note that agency-evaluation can suffer from similar, but arguably less harmful, failures as "intent"-fulfilment.

The sketch of this looks like the problem of getting a completely benevolent organism with high intelligence (an AGI) to figure out what a less intelligent organism wants without harming it or doing long-term damage. For example, consider trying to use symbols to get a cow to respond correctly to a complex decision problem. How would we truly evaluate how the cow feels about the solution or the problem formulation? Truth doesn't help guide us (the AI is not deceitful and knows perfectly what the cow values); interpretability is not even relevant; and as for ontological identification, who knows how to explain any concept to a cow? And how do we ensure that what the cow wants won't destroy, enslave, etc. other organisms?

As more practical projects, could we design AGI gyms (or simple paradigms) where super agents (i.e. agents with many capacities not available to others) interact with ordinary agents for long periods of time without harming them? What might agency preservation and equilibria look like in these paradigms? Are the only solutions here (equilibria etc.) that: (i) AGI/AIs must necessarily become "part" of the organism they interact with (what does this even mean)? or (ii) AGI/AIs must never make agency-related decisions (how would we ever enforce this)?
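A very small starting point for such a paradigm might look like the sketch below: a one-dimensional corridor in which an ordinary agent can only walk, while a super agent can block or open any cell. The class name `CorridorGym`, its methods, and the use of reachable cells as the agency measure are all hypothetical choices for this example; it does not use or mirror any existing gym library API.

```python
class CorridorGym:
    """Hypothetical 1-D world: an ordinary agent walks; a super agent edits walls."""

    def __init__(self, length=7, ordinary_pos=3):
        self.length = length
        self.pos = ordinary_pos
        self.walls = set()

    def ordinary_agency(self):
        """Count cells the ordinary agent can still reach (its wall-free run)."""
        left = self.pos
        while left - 1 >= 0 and (left - 1) not in self.walls:
            left -= 1
        right = self.pos
        while right + 1 < self.length and (right + 1) not in self.walls:
            right += 1
        return right - left + 1

    def super_step(self, action, cell):
        """Super agent acts: 'block' or 'open' a cell (never the occupied one).

        Returns the ordinary agent's agency after the action, so an
        agency-preserving objective could penalize any decrease.
        """
        if action == "block" and cell != self.pos:
            self.walls.add(cell)
        elif action == "open":
            self.walls.discard(cell)
        return self.ordinary_agency()

    def ordinary_step(self, delta):
        """Ordinary agent moves by -1 or +1 if the target is free and in bounds."""
        target = self.pos + delta
        if 0 <= target < self.length and target not in self.walls:
            self.pos = target
        return self.pos


if __name__ == "__main__":
    env = CorridorGym()
    print(env.ordinary_agency())
    print(env.super_step("block", 5))
    print(env.super_step("block", 1))
    print(env.super_step("open", 5))
```

The asymmetry is deliberate: only the super agent can change the ordinary agent's reachable set, which makes long-run questions about equilibria and agency preservation easy to pose even in a world this small.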