Join us September 8 to 24 for an "agency foundations" research challenge hosted by Alignment Jams and Apart Research. We will explore - and award prizes in - several areas of research, all focused on preserving human agency in A(G)I-human interactions. Below we provide brief primers on each area, but we hope participants will develop creative and novel ideas beyond these starting points.
Can we conceive of algorithms in which AI systems evaluate and optimize for the preservation - or increase - of human agency, rather than or in addition to truth, interpretability, corrigibility, sandboxing, value learning (ML/formal), etc.? Are safe-AI paradigms even possible in game-theoretic terms when one agent is effectively omnipotent relative to the others? See more on RL
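As a toy illustration of what "optimizing for agency" could look like, here is a minimal sketch in which an AI's reward combines a task reward with a crude empowerment-style proxy: the number of states the human can still reach. The environment, the proxy, and all function names are illustrative assumptions, not a proposed solution.

```python
from itertools import product

# Toy deterministic world: states are integers, the human moves among them.
# "Agency" is proxied by the number of distinct states the human can still
# reach within `horizon` steps (a crude empowerment-style measure).
# Everything here is an illustrative assumption, not part of the challenge.

HUMAN_ACTIONS = (-1, 0, 1)  # step left, stay, step right

def human_transition(state: int, action: int, blocked: set[int]) -> int:
    """Move unless the AI has blocked the target state."""
    nxt = state + action
    return state if nxt in blocked else nxt

def reachable_states(state: int, blocked: set[int], horizon: int) -> set[int]:
    """All states the human could occupy within `horizon` steps."""
    frontier = {state}
    for _ in range(horizon):
        frontier |= {human_transition(s, a, blocked)
                     for s, a in product(frontier, HUMAN_ACTIONS)}
    return frontier

def agency_preserving_reward(task_reward: float, state: int,
                             blocked: set[int], horizon: int = 3,
                             weight: float = 0.5) -> float:
    """Task reward plus a bonus for keeping the human's options open."""
    agency_proxy = len(reachable_states(state, blocked, horizon))
    return task_reward + weight * agency_proxy

# An AI action that walls off states scores worse than one that leaves
# the human's option set intact, even at equal task reward.
print(agency_preserving_reward(1.0, state=0, blocked=set()))    # options open
print(agency_preserving_reward(1.0, state=0, blocked={-1, 1}))  # human boxed in
```

Even this toy raises the open questions above: how to choose the weight, whether the proxy can be gamed, and whether any such term survives against a much more capable optimizer.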
How do LLMs or deep-RL networks represent other agents, or themselves? Can we carry out mechanistic interpretability - on both behavior and network weights - to characterize, and (re)direct, the emergence of agent and agency representations? See more on mechint
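One concrete entry point is probing: testing whether a direction in a model's activations linearly encodes "an agent is present". The sketch below trains a logistic-regression probe on synthetic activation vectors with a planted agent direction; the synthetic data stands in for real LLM activations (which one would extract with forward hooks), and the planted signal is an assumption made purely for illustration.

```python
import numpy as np

# Minimal linear-probe sketch: does some direction in hidden activations
# linearly encode "agent present"? Synthetic vectors with a planted signal
# stand in for real model activations here.

rng = np.random.default_rng(0)
d, n = 64, 2000
agent_direction = rng.normal(size=d)
agent_direction /= np.linalg.norm(agent_direction)

labels = rng.integers(0, 2, size=n)                    # 1 = agent mentioned
acts = rng.normal(size=(n, d))                         # stand-in activations
acts += np.outer(labels * 2.0 - 1.0, agent_direction)  # plant the signal

# Logistic-regression probe trained with plain gradient descent.
w, b = np.zeros(d), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(acts @ w + b)))
    w -= 0.5 * acts.T @ (p - labels) / n
    b -= 0.5 * np.mean(p - labels)

accuracy = np.mean(((acts @ w + b) > 0) == labels)
cosine = w @ agent_direction / np.linalg.norm(w)
print(f"probe accuracy: {accuracy:.2f}, alignment with planted direction: {cosine:.2f}")
```

If such a direction were found in a real model, the "(re)direct" part of the question could then be tested by ablating or steering along it and measuring the behavioral effect.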
What are the limits on extracting the true reward function or policy from finite empirical behavior data? Does access to an agent's internal states (e.g. neural states) - in addition to its observable behavior - improve accuracy, and could such access itself undermine the agency of the observed agent? See more on IRL
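A minimal way to see the finite-data limitation: assume a Boltzmann-rational agent choosing among three options, then invert the sampled choice frequencies back into rewards. The true rewards and the rationality model below are illustrative assumptions; note that the reward is only identifiable up to an additive constant (a classic IRL ambiguity), and that small samples give noisy estimates.

```python
import numpy as np

# Finite-data IRL sketch: a Boltzmann-rational agent picks option i with
# probability proportional to exp(reward[i]). We infer rewards from sampled
# choices and watch the estimate sharpen as data grows.

rng = np.random.default_rng(1)
true_reward = np.array([1.0, 0.5, 0.0])  # assumed ground truth
probs = np.exp(true_reward) / np.exp(true_reward).sum()

for n in (10, 100, 10_000):
    choices = rng.choice(3, size=n, p=probs)
    freq = np.clip(np.bincount(choices, minlength=3) / n, 1e-9, None)
    # Max-likelihood inversion: rewards are recoverable only up to an
    # additive constant, so we pin option 2 at reward 0.
    est = np.log(freq) - np.log(freq[2])
    print(n, np.round(est, 2))
```

Access to internal states could, in principle, break ambiguities that behavior alone cannot; whether the required observation itself compromises the observed agent's agency is exactly the open question posed above.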
In the presence of superhumanly intelligent AI systems - and in light of the complexity of human knowledge and value creation - is it sufficient to seek AI systems aligned to truth, interpretability, and optimal value representation? Can we red-team existing technical approaches from the agency perspective? See more on governance