Mission
Frontier AI models in the form of Large Language Models (LLMs) are becoming more powerful by the day. Personal and professional users alike are constantly looking for the best coding tool, the best LLM research assistant, or simply a model that gives digestible answers to complicated questions. Private and public institutions are investing in AI systems and their advancement on an unprecedented scale.
Of everything covered by the term “Artificial Intelligence”, LLMs have the most direct end-user exposure. Great effort has gone into ranking models, but existing human-preference-based benchmarks remain largely opaque: they contain only filtered pairwise comparison data and do not publish user interaction traces, decision times, or other behavioral signals that are essential for rigorous scientific analysis.
comparity.ai addresses this gap. It is a research platform designed to provide substantive, reproducible insights into how users interact with and evaluate LLMs. The platform is built for transparency, open science, and the advancement of reliable, safe, and responsible AI evaluation, with the goal of separating signal from noise in human comparisons of LLM-generated text.
Principles
The main purpose of comparity.ai is to function as a research tool. We aim to advance the understanding of human-preference-based benchmarks and text evaluation. To this end, we follow the principles of good scientific practice, which are reflected in our three core values: transparency, scientific rigor, and open science.
Transparency
We collect data on user interaction traces, decision metadata (decision times, multi-turn conversations, behavioral patterns), and user decisions. For detailed information, see our Privacy Policy and Terms of Use.
Scientific rigor
Human text evaluation is cognitively demanding, which can lead to an overreliance on heuristics. The platform is designed as a research instrument, built on theory from a variety of active research fields. It captures the data necessary to study how humans evaluate AI-generated text.
Open science
We publish interaction data, including decision times, multi-turn conversations, and behavioral patterns, to enable deeper analysis of evaluation quality. See Data Availability for more information.
We invite other researchers to use this dataset to test and advance their own theories. If you have questions, answers, or ideas, do not hesitate to reach out.
Who we are
comparity.ai is developed at the Max Planck Institute for Intelligent Systems (MPI-IS), within the Social Foundations of Computation (SF) department. MPI-IS’s mission is to answer fundamental questions about intelligence and how to create intelligent behavior in machines, and to pioneer new solutions by combining research in hardware, software, and theory.
At the SF department, we build the scientific foundations for machine learning and artificial intelligence in the social world. One of those foundations is benchmarks: understanding when and why they provide reliable and valid model rankings. comparity.ai is part of this research agenda and builds on interdisciplinary theory from multiple domains, ranging from computer science and cognitive science to psychometrics and the behavioral sciences.
Team
Data Availability
comparity.ai is an ongoing research project. The full dataset will be publicly released alongside the first publication from this project.
Questions or interested in collaborating?
