Frequently Asked Questions (FAQ)
1. What are these systems?
These are the names of the participant systems submitted to this competition.
These systems generate jokes given a prompt (e.g., a headline).
2. What's "baseline"?
It's the name of a system provided by the competition organizers as a baseline.
3. How are the systems evaluated?
We used an annotation web page where anyone on the Internet could help us
decide which system is funnier in 1-on-1 arena-style battles,
partially inspired by LMArena.
We also employed paid annotators from Prolific.
With all the annotations, we computed an Elo-like rating score for each system.
Elo-style ratings are also used by LMArena and by games such as chess.
A higher rating indicates a system that is more likely to generate outputs perceived as
humorous.
More specifically, we employed a Bradley-Terry model to compute stable ratings
and applied bootstrapping to compute 95% confidence intervals.
Note that, in some edge cases, the confidence intervals and the final rating values
may not agree exactly.
See this blog post from LMSYS Org for more info.
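A minimal sketch of this kind of computation is shown below. It is not the competition's actual code. It assumes each annotation is a (winner, loser) pair with no ties, fits Bradley-Terry strengths with the classic iterative (MM) updates, maps them to an Elo-like scale centered on 1000 (the offset, the 400-point scale, and the small `prior` pseudo-count are illustrative choices), and estimates 95% intervals from bootstrap resamples. The system names and vote counts are made up.

```python
import math
import random
from collections import defaultdict


def fit_bradley_terry(battles, systems, iters=100, prior=0.1):
    """Return Elo-like ratings from (winner, loser) pairs.

    `prior` adds a small virtual win in both directions for every pair
    (an assumption of this sketch) so every strength stays positive.
    """
    wins = defaultdict(float)  # wins[(a, b)] = number of times a beat b
    for a in systems:
        for b in systems:
            if a != b:
                wins[(a, b)] += prior
    for winner, loser in battles:
        wins[(winner, loser)] += 1.0

    strength = {s: 1.0 for s in systems}
    for _ in range(iters):
        updated = {}
        for a in systems:
            total_wins = sum(wins[(a, b)] for b in systems if b != a)
            denom = sum(
                (wins[(a, b)] + wins[(b, a)]) / (strength[a] + strength[b])
                for b in systems
                if b != a
            )
            updated[a] = total_wins / denom
        # Normalize so the geometric mean of the strengths is 1.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {s: v / math.exp(log_mean) for s, v in updated.items()}

    # Map strengths to an Elo-like scale (illustrative constants).
    return {s: 1000 + 400 * math.log10(v) for s, v in strength.items()}


def bootstrap_intervals(battles, systems, rounds=200, seed=0):
    """Approximate 95% intervals by refitting on resampled battles."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resample = rng.choices(battles, k=len(battles))  # with replacement
        for s, rating in fit_bradley_terry(resample, systems).items():
            samples[s].append(rating)
    intervals = {}
    for s, values in samples.items():
        values.sort()
        lower = values[int(0.025 * (len(values) - 1))]
        upper = values[int(0.975 * (len(values) - 1))]
        intervals[s] = (lower, upper)
    return intervals


# Made-up votes for three hypothetical systems.
systems = ["A", "B", "C"]
battles = (
    [("A", "B")] * 30 + [("B", "A")] * 10
    + [("B", "C")] * 25 + [("C", "B")] * 15
    + [("A", "C")] * 35 + [("C", "A")] * 5
)
ratings = fit_bradley_terry(battles, systems)
intervals = bootstrap_intervals(battles, systems)
for name in systems:
    low, high = intervals[name]
    print(f"{name}: {ratings[name]:.0f} [{low:.0f}, {high:.0f}]")
```

Because the point rating is fitted on all votes while the interval comes from resamples, the two are computed separately in this sketch.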
4. Why do some systems have the same rank?
Some systems have the same rank because we can't differentiate them in a statistically
significant way, even when their ratings are different.
Note that ties aren't transitive.
For example, we may not be able to tell whether A or B is better, or whether B or C is
better,
but we may still be able to tell with statistical significance that A is better than C.
That's why a system may rank lower than another one without a statistically
significant difference between them:
other systems that share the higher rank can be differentiated from
the lower-ranked one.
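One ranking rule that produces exactly this behavior is sketched below. It is an assumption for illustration, not necessarily the rule used here: a system's rank is 1 plus the number of systems whose lower confidence bound lies above its own upper bound. The function name and the interval values are hypothetical.

```python
def rank_from_intervals(intervals):
    """Rank = 1 + number of systems that are significantly better.

    `intervals` maps a system name to (lower, upper) rating bounds.
    Another system counts as significantly better when its lower bound
    is above this system's upper bound (an assumed criterion).
    """
    ranks = {}
    for name, (_, upper) in intervals.items():
        better = sum(
            1
            for other, (lower, _) in intervals.items()
            if other != name and lower > upper
        )
        ranks[name] = 1 + better
    return ranks


# Hypothetical intervals: A overlaps B, B overlaps C, but A is clear of C.
example_intervals = {
    "A": (1050, 1100),
    "B": (1000, 1060),
    "C": (980, 1040),
}
example_ranks = rank_from_intervals(example_intervals)
print(example_ranks)  # {'A': 1, 'B': 1, 'C': 2}
```

Here C ranks below B even though their intervals overlap, because A shares rank 1 with B and is distinguishable from C.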
5. Why do some systems have fewer votes than others?
Annotation is based on random samples, so it doesn't guarantee an equal number of votes.
Also, annotators can skip votes, which we don't count here.
Finally, some systems were removed and/or merged early in the annotation process.