Side-by-Side. A style-controlled Bradley–Terry rating fit to every head-to-head A/B vote. We fit one logistic model in which each model gets a log-strength γ, holding response formatting (headers, bold/italic, lists, rules, emoji) and length constant, so a model can't climb just by over-formatting. The rating is 1500 + (400/ln10)·γ, with the most frequent model pinned at γ = 0 (rating 1500); both-good / both-bad votes anchor the absolute level. Higher = preferred more often, once style is controlled for.
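As a rough illustration of the fit described above, here is a minimal sketch in plain Python. It is not the production pipeline: the function name `fit_bt`, the battle tuple layout, the gradient-ascent optimizer, and the small L2 penalty are all assumptions made for the example, and both-good / both-bad votes are left out because the text does not say how they enter the model.

```python
import math

SCALE = 400 / math.log(10)  # rating points per unit of log-strength


def fit_bt(battles, models, n_style, anchor, lr=0.1, steps=2000, l2=1e-3):
    """Fit a style-controlled Bradley-Terry model by gradient ascent.

    battles: list of (a, b, style_diff, y), where style_diff is the
             feature vector of A minus that of B (length n_style) and
             y is 1.0 if A won, 0.0 if B won. (Hypothetical layout.)
    anchor:  model whose rating is pinned at 1500.
    Returns (ratings, beta), beta being the style coefficients.
    """
    gamma = {m: 0.0 for m in models}
    beta = [0.0] * n_style
    n = len(battles)
    for _ in range(steps):
        g_gamma = {m: 0.0 for m in models}
        g_beta = [0.0] * n_style
        for a, b, diff, y in battles:
            # Log-odds that A wins: strength gap plus style effect.
            z = gamma[a] - gamma[b] + sum(w * x for w, x in zip(beta, diff))
            err = y - 1.0 / (1.0 + math.exp(-z))
            g_gamma[a] += err
            g_gamma[b] -= err
            for k, x in enumerate(diff):
                g_beta[k] += err * x
        # Averaged gradient step with a small L2 penalty (an assumption,
        # added only to keep the toy fit stable).
        for m in models:
            gamma[m] += lr * (g_gamma[m] / n - l2 * gamma[m])
        for k in range(n_style):
            beta[k] += lr * (g_beta[k] / n - l2 * beta[k])
    shift = gamma[anchor]  # pin the anchor model at exactly 1500
    ratings = {m: 1500 + SCALE * (gamma[m] - shift) for m in models}
    return ratings, beta
```

Because style differences get their own coefficients, a win that formatting or length already explains adds little to γ. One unit of γ corresponds to about 173.7 rating points, so a 400-point gap means 10:1 odds of being preferred at equal style.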
Cascading. The same style-controlled rating, fit on dwell time from Cascading mode instead of votes. Users swipe through anonymous responses one at a time; within each turn we form all response pairs and ask which response got more pre-reveal dwell, controlling for slot position (early slots get more attention) plus the same formatting and length features. The per-model log-strength γ is mapped to 1500 + (400/ln10)·γ: higher = users linger longer on it than on competing responses, after position and style are accounted for.
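The pairing step can be sketched as follows. This is an illustrative guess at the data shape, not the actual implementation: the name `dwell_pairs`, the `(model, slot, dwell_seconds)` tuple, encoding position as a slot-index difference, and dropping exact ties and same-model pairs are all assumptions.

```python
from itertools import combinations


def dwell_pairs(turn):
    """Turn one Cascading turn into pairwise 'who got more dwell' rows.

    turn: list of (model, slot, dwell_seconds), slot being the 0-based
          position in which the response was shown. (Hypothetical layout.)
    Returns a list of (a, b, slot_diff, y) with y = 1.0 if A had the
    longer pre-reveal dwell, else 0.0.
    """
    rows = []
    for (ma, sa, da), (mb, sb, db) in combinations(turn, 2):
        if ma == mb or da == db:
            continue  # assumption: skip same-model pairs and exact ties
        # slot_diff lets the model absorb the early-slot attention bonus.
        rows.append((ma, mb, sa - sb, 1.0 if da > db else 0.0))
    return rows
```

Each row has the same shape as an A/B vote with one extra covariate, so the same logistic fit can be reused with the slot difference appended to the formatting and length features.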