My primary research is in philosophy of mind and philosophy of cognitive science. It aims to explain motivation in biological and artificial agents and to explore what such explanations imply for accounts of rationality and well-being. To do so, I draw on computational methods from cognitive science and AI research. I also have interests in decision theory and other areas of formal philosophy.
I also work on AI safety, and I am a principal investigator for the project “Alignment and the Structure of Human Values”, which is funded by Open Philanthropy. The aim of this project is to use insights from cognitive science to improve the ability of AI to learn human preferences.
Works in progress
[A paper on diachronic rationality] (under review)
[A paper on the limits of RLHF in training LLMs]
[A paper on the relation between desires, beliefs, and perception]
[A paper on the notion of reward as used in psychology and computational cognitive science]
[A paper on preferences and their psychological status]
Works on AI ethics and safety
“Machine Theory of Mind and the Structure of Human Values”, NeurIPS 2023 MP2 Workshop (link)
“Generative Theory of Mind and the Value Misgeneralization Problem”, 2023 AI Alignment Awards Final Prize winner (link)
“Alignment as a Dynamic Process”, NeurIPS 2022 AI Safety Workshop, AI Risk Analysis award winner (link)
“AI: The Consequences for Human Rights: Initial findings on the expected future use of AI in authoritarian regimes”, expert testimony report submitted to the House of Representatives Commission on Human Rights, 2017 (link)
Publications
“What makes discrimination wrong?”, Journal of Practical Ethics, Vol. 5(2):105-113, 2017
Writing for a popular audience
“Why gender studies are generally misunderstood”, Expressen, 2019 (in Swedish) (link)
“Where is the meaning of art?”, Medium, 2017 (link)