Under this topic we explore studying agency at (i) the psychological level and (ii) the representation level (also known as mechanistic interpretability). We want to ask psychological-level questions such as: how does an LLM or DeepRL NN understand the capacities of the agents referenced by tokens, or the concept of agency itself? But also questions such as: where and how are these concepts stored in the weights and layers of the network? We note that DeepRL NNs are likely easier to work with conceptually than LLMs on this concept of agents, and for those with more experience training such NNs it may be an easier research path (please contact us for DeepRL NN suggestions/discussions).
To get us started on the psychological-level approach in LLMs, here are some related ideas and studies:
Here we may wish to explore how LLMs classify tokens by their inherent capacities. Do LLMs increasingly categorize tokens by their "agency", for example over the course of training? That is, is there a natural separation in embedding space between specific classes of sentence subjects?
We may visualize this in the embedding space of tokens both prior to prediction (i.e. the static, position-free input embeddings) and during sentence processing (i.e. contextual hidden states); see the sketch below for the first case.
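To make this concrete, here is a minimal sketch of how one might look for such a separation in the static (position-free) input embeddings, assuming GPT-2 via Hugging Face transformers; the "agentic" and "non-agentic" word lists are illustrative placeholders rather than a validated taxonomy.

```python
# Sketch: do agentic vs. non-agentic subject tokens separate in static embedding space?
# Assumptions: GPT-2 (Hugging Face transformers), hand-picked illustrative word lists.
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2")

# Hypothetical probe sets: subjects we intuitively treat as agents vs. non-agents.
agentic = ["doctor", "teacher", "robot", "dog", "child", "pilot"]
non_agentic = ["rock", "table", "cloud", "bottle", "river", "door"]

def static_embedding(word):
    """Return the position-free input embedding, averaged over sub-tokens."""
    ids = tokenizer(" " + word, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        vecs = model.wte(ids)  # token embedding lookup only; no positional encoding added
    return vecs.mean(dim=1).squeeze(0)

vectors = torch.stack([static_embedding(w) for w in agentic + non_agentic]).numpy()
coords = PCA(n_components=2).fit_transform(vectors)

n = len(agentic)
plt.scatter(coords[:n, 0], coords[:n, 1], label="agentic", marker="o")
plt.scatter(coords[n:, 0], coords[n:, 1], label="non-agentic", marker="x")
for (x, y), w in zip(coords, agentic + non_agentic):
    plt.annotate(w, (x, y))
plt.legend()
plt.title("PCA of static GPT-2 token embeddings (illustrative)")
plt.show()
```

Any apparent clustering here is only suggestive; a real study would need larger, carefully constructed word lists and a quantitative separation measure rather than a 2D projection.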
Digging in deeper, can we identify this clustering, or a specific category representation, in different layers of transformer-based LLMs, both during training and in fully trained models?
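For the layer-by-layer question on a fully trained model, one rough approach is to read out the hidden state of the sentence subject at every layer and fit a simple linear probe per layer. The sketch below assumes GPT-2, a single hypothetical sentence template, and a logistic-regression probe; all of these are illustrative choices, not a prescribed protocol.

```python
# Sketch: per-layer linear probe for an "agency" distinction in subject representations.
# Assumptions: GPT-2, one fixed sentence frame, tiny hand-labelled subject list.
import torch
from transformers import GPT2Tokenizer, GPT2Model
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

# Hypothetical labelled subjects (1 = agentic, 0 = non-agentic) in a fixed frame.
subjects = [("doctor", 1), ("teacher", 1), ("robot", 1), ("dog", 1),
            ("rock", 0), ("table", 0), ("cloud", 0), ("bottle", 0)]
template = "The {} moved across the room."

def subject_states(word):
    """Hidden state of the subject's last sub-token at every layer."""
    enc = tokenizer(template.format(word), return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    # out.hidden_states: embedding layer + one tensor per block, each (1, seq, 768)
    subj_ids = tokenizer(" " + word)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    pos = max(i for i, t in enumerate(ids) if t == subj_ids[-1])
    return [h[0, pos].numpy() for h in out.hidden_states]

states = [subject_states(w) for w, _ in subjects]
labels = [y for _, y in subjects]

for layer in range(len(states[0])):
    X = [s[layer] for s in states]
    # Cross-validated probe accuracy on a toy sample size, purely illustrative.
    acc = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=4).mean()
    print(f"layer {layer:2d}: probe accuracy = {acc:.2f}")
```

Repeating the same probe across intermediate training checkpoints (e.g. model suites such as Pythia that publish checkpoints throughout training) would address the "during training" half of the question.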