Chatbot Arena: Benchmarking LLMs with Elo Ratings (with English original)

Updated: 2024-05-09
Section-by-section summaries below were generated with gpt-3.5-turbo:

1: At Microsoft's Build 2023 conference, Andrej Karpathy, a researcher and founding member at OpenAI, gave a talk titled "State of GPT" that used this project's leaderboard to present capability rankings of large language models, which makes it worth following. We will track its weekly updates going forward.

2: This article's summaries were generated with gpt-3.5-turbo.

3: by: Lianmin Zheng*, Ying Sheng*, Wei-Lin Chiang, Hao Zhang, Joseph E. Gonzalez, Ion Stoica, May 03, 2023

We present Chatbot Arena, a benchmark platform for large language models (LLMs) that features anonymous, randomized battles in a crowdsourced manner. In this blog post, we are releasing our initial results and a leaderboard based on the Elo rating system, which is a widely-used rating system in chess and other competitive games. We invite the entire community to join this effort by contributing new models and evaluating them by asking questions and voting for your favorite answer.

Table 1 displays the Elo ratings of nine popular models, which are based on the 4.7K voting data and calculations shared in this notebook. You can also try the voting demo and see more about the leaderboard.

Summary:

This post introduces Chatbot Arena, a benchmark platform for large language models (LLMs) that runs anonymous, randomized battles in a crowdsourced manner. It releases initial results and a leaderboard based on the Elo rating system, and invites the whole community to evaluate new models by asking questions and voting. Table 1 shows the Elo ratings of nine popular models.


1: Introduction

Following the great success of ChatGPT, there has been a proliferation of open-source large language models that are finetuned to follow instructions. These models are capable of providing valuable assistance in response to users’ questions/prompts. Notable examples include Alpaca and Vicuna, based on LLaMA, and OpenAssistant and Dolly, based on Pythia.

Despite the constant release of new models every week, the community faces a challenge in benchmarking these models effectively. Benchmarking LLM assistants is extremely challenging because the problems can be open-ended, and it is very difficult to write a program to automatically evaluate the response quality. In this case, we typically have to resort to human evaluation based on pairwise comparison.

There are some desired properties for a good benchmark system based on pairwise comparison.

- Scalability. The system should scale to a large number of models when it is not feasible to collect sufficient data for all possible model pairs.

- Incrementality. The system should be able to evaluate a new model using a relatively small number of trials.

- Unique order. The system should provide a unique order for all models. Given any two models, we should be able to tell which ranks higher or whether they are tied.

Existing LLM benchmark systems rarely satisfy all of these properties. Classical LLM benchmark frameworks, such as HELM and lm-evaluation-harness, provide multi-metric measurements for tasks commonly used in academic research. However, they are not based on pairwise comparison and are not effective at evaluating open-ended questions. OpenAI also launched the evals project to collect better questions, but this project does not provide ranking mechanisms for all participating models. When we launched our Vicuna model, we utilized a GPT-4-based evaluation pipeline, but it does not provide a solution for scalable and incremental ratings.

In this blog post, we introduce Chatbot Arena, an LLM benchmark platform featuring anonymous randomized battles in a crowdsourced manner. Chatbot Arena adopts the Elo rating system, which is a widely-used rating system in chess and other competitive games. The Elo rating system promises to provide the desired properties mentioned above. We noticed that the Anthropic LLM paper also adopted the Elo rating system.

To collect data, we launched the arena with several popular open-source LLMs one week ago. In the arena, a user can chat with two anonymous models side-by-side and vote for which one is better. This crowdsourcing way of data collection represents some use cases of LLMs in the wild. A comparison between several evaluation methods is shown in Table 2.

Summary:

This section introduces Chatbot Arena, an LLM benchmark platform built on the Elo rating system. The platform collects data through anonymous, randomized battles and targets scalability, incrementality, and a unique ordering of models. Classical LLM benchmark frameworks cannot effectively evaluate open-ended questions, and OpenAI's evals project does not provide a ranking mechanism for all participating models.


2: Data Collection

We hosted the arena at https://arena.lmsys.org with our multi-model serving system, FastChat. When a user enters the arena, they can chat with two anonymous models side-by-side, as shown in Figure 1. After getting responses from the two models, users can continue chatting or vote for the model they think is better. Once a vote is submitted, the model names will be revealed. Users can continue chatting or restart a new battle with two new randomly chosen anonymous models. The platform logs all user interactions. In our analysis, we only use the votes when the model names are hidden.

The arena was launched about one week ago and we have collected 4.7k valid anonymous votes since then. We share some exploratory analysis in this notebook and present a short summary here.

Figure 2 shows the battle count for each combination of models. When we initially launched the tournament, we paired models according to a likely ranking derived from our own benchmarks, giving preference to what we believed would be strong pairings. However, we later switched to uniform sampling to get better overall coverage of the rankings. Towards the end of the tournament, we also introduced a new model, fastchat-t5-3b. All of these result in a non-uniform model frequency.
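A minimal sketch of the uniform pair sampling described above (purely illustrative; the arena's actual sampler lives in FastChat and may differ):

```python
import itertools
import random

def sample_battle(models):
    """Pick one of the C(n, 2) unordered model pairs uniformly at random."""
    pairs = list(itertools.combinations(models, 2))
    return random.choice(pairs)

# Hypothetical usage:
# model_a, model_b = sample_battle(["vicuna-13b", "alpaca-13b", "fastchat-t5-3b"])
```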

Figure 3 plots the language distribution and shows most user prompts are in English.

Summary:

This section describes hosting the arena on FastChat, a multi-model serving system, where users chat with two anonymous models side-by-side and vote for the better one. The platform logs all user interactions and has collected 4.7k valid anonymous votes. The authors also share exploratory analysis, including the battle count for each model pair and the language distribution of user prompts.


3: Elo Rating System

The Elo rating system is a method for calculating the relative skill levels of players, which has been widely adopted in competitive games and sports. The difference in the ratings between two players serves as a predictor of the outcome of a match. The Elo rating system works well for our case because we have multiple models and we run pairwise battles between them.

If player A has a rating of Ra and player B a rating of Rb, the exact formula (using the logistic curve with base 10) for the probability of player A winning is

$$E_a = \frac{1}{1 + 10^{(R_b - R_a)/400}}$$

The ratings of players can be linearly updated after each battle. Suppose player A (with rating Ra) was expected to score Ea points but actually scored Sa points. The formula for updating that player's rating is

$$R'_a = R_a + K \cdot (S_a - E_a)$$
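As a minimal sketch of the two formulas above (the function and parameter names are our own, and the K-factor of 32 is a common default rather than necessarily the one used for Table 1):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that player A beats player B under the Elo model
    (logistic curve with base 10, scale 400)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update_rating(r_a: float, e_a: float, s_a: float, k: float = 32.0) -> float:
    """Linear Elo update; s_a is 1 for a win, 0.5 for a tie, 0 for a loss."""
    return r_a + k * (s_a - e_a)
```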

Using the collected data, we compute the Elo ratings of the models in this notebook and put the main results in Table 1. You are welcome to try the notebook and play with the voting data by yourself. The data only contains voting results without conversation histories because releasing the conversation history will raise concerns such as privacy and toxicity.
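A sketch of how such a notebook might replay the voting records to produce ratings like those in Table 1 (the `(model_a, model_b, winner)` record format and the initial rating of 1000 are assumptions for illustration, not the actual data schema):

```python
from collections import defaultdict

def compute_elo(battles, k=32.0, init=1000.0):
    """Replay battles in order; `battles` yields (model_a, model_b, winner)
    tuples where winner is 'a', 'b', or 'tie'."""
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        # Expected score of model A against model B, using pre-battle ratings.
        e_a = 1.0 / (1.0 + 10 ** ((ratings[model_b] - ratings[model_a]) / 400))
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Symmetric linear updates: B's expected and actual scores are the complements.
        ratings[model_a] += k * (s_a - e_a)
        ratings[model_b] += k * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

# Hypothetical usage:
# ratings = compute_elo([("vicuna-13b", "alpaca-13b", "a"), ...])
```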

Summary:

The Elo rating system is a method for calculating players' relative skill levels and is widely used in competitive games and sports. It fits this setting because pairwise battles are run between multiple models. Using the collected data, the authors compute the models' Elo ratings in the notebook and report the main results in Table 1.


4: Pairwise Win Rates

As a basis for calibration, we also present here the pairwise win rates for each model in the tournament (Figure 4) as well as the predicted pairwise win rates estimated using Elo ratings (Figure 5). By comparing the figures, we find the Elo ratings can predict win rates relatively well.
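The predicted win rates in Figure 5 follow directly from the expected-score formula in the previous section; a minimal sketch:

```python
def predicted_win_rate(r_a: float, r_b: float) -> float:
    """Win probability of model A over model B implied by their Elo ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# A 100-point rating gap implies roughly a 64% win rate:
print(predicted_win_rate(1100, 1000))  # ~0.64
```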

Summary:

This section presents each model's pairwise win rates in the tournament together with the win rates predicted from Elo ratings. Comparing the two shows that Elo ratings predict win rates relatively well.


5: Future Plans

We plan to work on the following items:

- Add more closed-source models (ChatGPT-3.5, ChatGPT-4, and Claude-v1 are available now in the anonymous Arena)

- Add more open-source models

- Release periodically updated leaderboards (e.g., monthly)

- Implement better sampling algorithms, tournament mechanisms, and serving systems to support a much larger number of models

- Provide fine-grained rankings on different task types

We appreciate any feedback from you to make the arena better.

Summary:

Future plans: add more closed-source models (ChatGPT-3.5, ChatGPT-4, and Claude-v1 are now available in the anonymous Arena), add more open-source models, release periodically updated leaderboards (e.g., monthly), implement better sampling algorithms, tournament mechanisms, and serving systems to support many more models, and provide fine-grained rankings for different task types. The authors welcome any feedback to make the arena better.

