MWAHAHA Competition

Results

Rank	Team Name	Username	Rating	95% CI	Votes

Info

This page shows the final results of the evaluation of systems submitted by participants to the 2025-2026 MWAHAHA competition on Humor Generation. In this competition, participants submitted systems capable of generating humorous outputs given some context (e.g., a news headline).

Frequently Asked Questions (FAQ)

1. What are these systems?
These are the names of the participant systems submitted to this competition. These systems generate jokes given a prompt (e.g., a headline).

2. What's "baseline"?
It's the name of a system provided by the competition organizers as a baseline.

3. How are the systems evaluated?
We used an annotation web page that let anyone on the Internet help us decide which system is funnier in 1-on-1, arena-style battles, partially inspired by LMArena. We also employed paid annotators from Prolific. From all the annotations, we computed an Elo-like rating for each system. A higher rating indicates a system that is more likely to generate outputs perceived as humorous. This kind of rating is used by LMArena and also in games such as chess. More specifically, we fit a Bradley-Terry model to compute stable ratings and applied bootstrapping to compute 95% confidence intervals. Note that, in some edge cases, there can be discrepancies between the confidence intervals and the final rating values. See this blog post from LMSYS Org for more info.
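To make the rating procedure more concrete, here is a minimal Python sketch of fitting a Bradley-Terry model to pairwise votes and bootstrapping 95% confidence intervals. It is an illustration under stated assumptions, not the organizers' actual code: the function names, the toy battles, the small `prior` pseudo-count, the iterative fitting scheme, and the Elo-style scale (base 1000, 400 points per factor of 10) are all assumptions made for this example.

```python
import math
import random
from collections import defaultdict


def bradley_terry(battles, systems, iters=200, prior=0.1):
    """Fit Bradley-Terry strengths from (winner, loser) pairs.

    Uses the classic iterative (minorization-maximization) update.
    `prior` adds a small virtual win in each direction for every pair of
    systems (an assumption of this sketch) so that no strength collapses
    to zero when a system has no wins in a resample.
    """
    wins = defaultdict(float)
    games = defaultdict(float)  # games[(a, b)]: battles between a and b
    for winner, loser in battles:
        wins[winner] += 1
        games[(winner, loser)] += 1
        games[(loser, winner)] += 1
    for a in systems:
        for b in systems:
            if a != b:
                wins[a] += prior
                games[(a, b)] += 2 * prior

    strength = {s: 1.0 for s in systems}
    for _ in range(iters):
        updated = {}
        for a in systems:
            denom = sum(
                games[(a, b)] / (strength[a] + strength[b])
                for b in systems
                if b != a
            )
            updated[a] = wins[a] / denom
        # Normalize so the geometric mean of the strengths stays at 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {s: v / math.exp(log_mean) for s, v in updated.items()}

    # Assumed Elo-like scale: 1000 base, 400 points per factor of 10.
    return {s: 1000 + 400 * math.log10(v) for s, v in strength.items()}


def bootstrap_intervals(battles, systems, rounds=200, seed=0):
    """Resample battles with replacement and take percentile bounds."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resample = rng.choices(battles, k=len(battles))
        for s, rating in bradley_terry(resample, systems).items():
            samples[s].append(rating)
    intervals = {}
    for s, values in samples.items():
        values.sort()
        lower = values[int(0.025 * len(values))]
        upper = values[int(0.975 * len(values)) - 1]
        intervals[s] = (lower, upper)
    return intervals


# Hypothetical toy votes: each tuple is (winner, loser).
systems = ["A", "B", "C"]
battles = (
    [("A", "B")] * 30 + [("B", "A")] * 10
    + [("B", "C")] * 30 + [("C", "B")] * 10
    + [("A", "C")] * 35 + [("C", "A")] * 5
)
ratings = bradley_terry(battles, systems)
intervals = bootstrap_intervals(battles, systems)
for s in sorted(systems, key=ratings.get, reverse=True):
    low, high = intervals[s]
    print(f"{s}: rating {ratings[s]:.0f}, 95% CI [{low:.0f}, {high:.0f}]")
```

Because the point rating comes from the full set of votes while the interval comes from resampled votes, the two are computed separately, which is one way the discrepancies mentioned above can arise.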

4. Why do some systems have the same rank?
Some systems share a rank because we can't differentiate them in a statistically significant way, even when their ratings differ. Note that ties aren't transitive. For example, we may be unable to tell whether A or B is better, or whether B or C is better, yet still be able to tell with statistical significance that A is better than C. That's why a system may rank lower than another one without a statistically significant difference between them (because there are other systems with the same rank as the latter that can be differentiated from the former).
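The non-transitive ties described above can be sketched in a few lines of Python. The exact ranking rule used for this page is not stated here, so the rule below is an assumption: a system's rank is one plus the number of systems whose confidence interval lies entirely above its own. The function name and the interval values are hypothetical.

```python
def rank_by_intervals(intervals):
    """Rank systems from {name: (lower, upper)} confidence intervals.

    Assumed rule: rank = 1 + number of systems whose lower bound is
    above this system's upper bound (i.e., significantly better).
    """
    ranks = {}
    for name, (_, upper) in intervals.items():
        better = sum(
            1
            for other, (other_lower, _) in intervals.items()
            if other != name and other_lower > upper
        )
        ranks[name] = 1 + better
    return ranks


# Hypothetical intervals: A overlaps B, B overlaps C, but A is clear of C.
example = {
    "A": (1040, 1100),
    "B": (1000, 1060),
    "C": (960, 1030),
}
ranks = rank_by_intervals(example)
print(ranks)
```

In this toy case, A and B share rank 1 because neither is significantly better than the other, while C gets rank 2 because A is significantly better than it. C therefore ranks below B even though the intervals of B and C overlap.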

5. Why do some systems have fewer votes than others?
Annotation is based on random samples, so an equal number of votes per system isn't guaranteed. Also, annotators can skip votes, and skipped votes aren't counted here. Finally, some systems were removed and/or merged early during the annotation.