Moral alignment for LLM agents: Paper review

Paper review: how do we ensure an AI's objectives match ours through fine-tuning on the Prisoner's Dilemma game?

What is the problem of AI alignment?

We want an AI system's objectives to match our objectives, moral norms and ethical standards. At the moment, this is mainly done through human feedback - but human raters are not representative, and human preferences are complex and inconsistent. Is there another way?

This new paper by Elizaveta (Liza) Tennant, Stephen Hailes, and Mirco Musolesi suggests there is. The paper is called "Moral alignment for LLM agents" and discusses how an LLM can be fine-tuned on the Prisoner's Dilemma to align with an ethical framework (or a combination of them).

The Prisoner's Dilemma is a game theory thought experiment involving two players, each of whom can either cooperate for mutual benefit or betray their partner for individual reward.
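To make the setup concrete, here is a minimal sketch of a one-shot Prisoner's Dilemma payoff matrix in Python. The numbers follow the conventional ordering (temptation > mutual reward > mutual punishment > sucker's payoff) and are purely illustrative - they are not necessarily the payoff values used in the paper.

```python
# Illustrative Prisoner's Dilemma payoffs: "C" = cooperate, "D" = defect.
PAYOFFS = {
    ("C", "C"): (3, 3),  # both cooperate: mutual reward
    ("C", "D"): (0, 5),  # A cooperates, B defects: sucker's payoff vs temptation
    ("D", "C"): (5, 0),  # A defects, B cooperates: temptation vs sucker's payoff
    ("D", "D"): (1, 1),  # both defect: mutual punishment
}

def play(action_a: str, action_b: str) -> tuple[int, int]:
    """Return the (player A, player B) payoffs for one round."""
    return PAYOFFS[(action_a, action_b)]

print(play("D", "C"))  # (5, 0) -- defection is individually tempting
```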

Now, let's pause to consider the differences between deontology and utilitarianism, two ethical frameworks mentioned in the paper and widely used in our world (including its tech side).

Deontology is based on the idea that certain actions are inherently right or wrong, regardless of their consequences. It emphasizes fulfilling one's duties and adhering to moral principles ("treat others as you would like to be treated"). Think, for example, of data privacy.

Utilitarianism focuses on the consequences of actions and argues that the morally right action is one that maximizes overall happiness or well-being. Weighing the potential benefits and harms of AI applications is a good example of its application in tech.

The authors of the paper show how one can teach an LLM to follow the principles of one or several of these frameworks in the game by providing it with 'intrinsic' moral rewards. For example, the model can be given a deontological reward - a penalty applied when an agent defects against an opponent who previously cooperated. It represents a moral norm that discourages taking advantage of others. Alternatively, we can give it a utilitarian reward based on the collective payoff of both agents in the game - this would encourage cooperation and promote the overall well-being of the system.
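Concretely, these intrinsic rewards can be written as simple functions of the agents' actions and payoffs. The sketch below is only an illustration of the idea described above: the function names, the size of the penalty, and the reward scale are assumptions, not the paper's exact formulation.

```python
def deontological_reward(my_action: str, opponent_prev_action: str) -> float:
    """Penalise defecting against an opponent who cooperated last round."""
    if my_action == "D" and opponent_prev_action == "C":
        return -1.0  # norm violation: exploiting a cooperator
    return 0.0

def utilitarian_reward(my_payoff: float, opponent_payoff: float) -> float:
    """Reward the collective payoff of both agents in the round."""
    return my_payoff + opponent_payoff
```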

Combined with a so-called game reward that encourages selfish or rational behaviour, these two rewards can influence the LLM's behaviour and help train it to balance selfish interests with specific moral considerations.
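One natural way to combine the two signals is a weighted sum of the game reward and the moral reward. The weighting scheme below is an assumption for illustration; the paper may combine the rewards differently.

```python
def combined_reward(game_reward: float, moral_reward: float,
                    moral_weight: float = 0.5) -> float:
    """Trade off the selfish game payoff against the intrinsic moral signal."""
    return (1 - moral_weight) * game_reward + moral_weight * moral_reward

# Example: an agent defects against a cooperator, earning the tempting
# game payoff of 5 but incurring a deontological penalty of -1.
print(combined_reward(game_reward=5.0, moral_reward=-1.0))  # 2.0
```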