Metrics for evaluating LLMs

ELO Rating, BLEU, Perplexity and Cross Entropy

Jaideep Ray

LLMs can generate human-like text [1], translate languages [2], write code and other creative content [3], and answer questions as chatbots [4]. But how do we gauge the effectiveness of these powerful tools?

This article explores four key metrics used for evaluating LLM pre-training and fine-tuning: ELO rating, BLEU, perplexity, and cross entropy.

ELO Rating

While originally designed for chess, the ELO rating system has found a new application in LLMs. In this context, ELO rating serves as a method for comparing the performance of different language models based on their ability to generate high-quality responses. The system works by assigning a numerical score to each model, which changes based on the outcomes of pairwise comparisons.

Here’s a breakdown of the process:

  • Pairwise Battles: Two LLMs are presented with the same prompt or question.
  • Human Evaluation: Human evaluators assess the quality of the responses generated by each model and select a “winner” based on criteria such as coherence, relevance, and factual accuracy.
  • ELO Score Update:

The ELO scores of the models are updated based on the outcome of the comparison. The winning model’s score increases, while the losing model’s score decreases. The magnitude of the score change is influenced by the initial ELO rating difference between the models.

R_updated = R_initial + K * (Sa - Ea), where

  • R_updated is the updated Elo rating of model A
  • R_initial is the initial Elo rating
  • K is a constant factor controlling the magnitude of rating change (e.g., K = 32)
  • Sa is the actual outcome of the comparison (win = 1, loss = 0)
  • Ea is the expected outcome for model A, based on the Elo difference between the two models. It is calculated with a logistic function, Ea = 1 / (1 + 10^((Rb - Ra) / 400)), so a larger ELO gap gives the higher-rated model a higher expected probability of winning.

ELO rating offers a dynamic and relative measure of LLM performance, allowing for continuous evaluation and ranking as models undergo training and fine-tuning. Chatbot arenas employ ELO rating to facilitate LLM battles and maintain a leaderboard of the most effective models.

Example:

Let’s say we have two LLMs, Model A and Model B, with initial ELO ratings of 1000 and 2000 respectively. They are given the prompt “Write a short story about a cat who goes on an adventure.”

Suppose the human evaluators judge Model A’s story to be more creative and engaging, declaring it the winner. Because Model B was heavily favored, Model A’s expected score Ea is close to 0, so Model A’s rating rises by almost the full K points while Model B’s drops by the same amount. This process can be repeated with various prompts and evaluators to obtain a more robust and comprehensive assessment of each model’s capabilities.
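Here is a minimal Python sketch of this update, assuming the standard logistic expected-score formula (base 10, 400-point scale) and the K = 32 mentioned above; the function names and printed values are illustrative:

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected probability that model A beats model B, given their Elo ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one pairwise battle.
    score_a is 1.0 if model A wins, 0.0 if it loses (0.5 for a tie)."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Model A (1000) upsets Model B (2000) on the short-story prompt.
new_a, new_b = elo_update(1000, 2000, score_a=1.0)
print(round(new_a, 1), round(new_b, 1))  # ~1031.9 and ~1968.1: a large swing, since the upset was unexpected
```

Had the higher-rated Model B won instead, both ratings would barely move, because that outcome was already expected.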

Limitations of ELO Rating:

ELO rating for LLMs relies heavily on human evaluation, which can introduce subjectivity and potential biases. Evaluators may have differing opinions on what constitutes a “better” response, and their judgments can be influenced by personal preferences or interpretations [1].

Therefore, it’s crucial to incorporate diverse perspectives and ratings and establish clear evaluation criteria to mitigate these limitations.

BLEU

BLEU (Bilingual Evaluation Understudy) is a metric originally developed for evaluating machine translation systems. It quantifies the quality of machine-generated text by comparing it to one or more human-written reference translations. BLEU achieves this by calculating the overlap of n-grams (contiguous sequences of n items from a given sample of text) between the generated text and the references. The resulting score ranges from 0 to 1, with a higher score indicating a greater degree of similarity with the reference translations.

While primarily used for translation, BLEU can also be applied to other LLM tasks, such as text summarization and paraphrasing, where the objective is to generate text that closely matches a reference [6]. It is particularly useful for diagnostic purposes, helping developers identify areas where the model needs improvement.

Example:

Consider the following reference translation and machine-generated translation:

  • Reference: The cat sat on the mat.
  • Machine: The cat is on the mat.

BLEU would analyze the n-gram overlap between these two sentences. At the unigram (single word) level, five of the six words in the machine translation (“the,” “cat,” “on,” “the,” “mat”) also appear in the reference; only “is” does not. At the bigram (two-word sequence) level, three of the five bigrams (“the cat,” “on the,” “the mat”) match the reference, while “cat is” and “is on” do not. These n-gram precisions contribute to the final BLEU score, providing a quantitative measure of the translation’s accuracy.
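The counting behind these numbers can be sketched with a small helper (a hypothetical ngram_precision function computing modified n-gram precision only; full BLEU additionally combines precisions up to 4-grams as a geometric mean and applies a brevity penalty for short outputs):

```python
from collections import Counter

def ngram_precision(reference: list[str], candidate: list[str], n: int) -> float:
    """Modified n-gram precision: candidate n-grams are clipped by their count in the reference."""
    ref_counts = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    cand_counts = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    matched = sum(min(count, ref_counts[ng]) for ng, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat is on the mat".split()
print(ngram_precision(reference, candidate, 1))  # 5/6 ≈ 0.83
print(ngram_precision(reference, candidate, 2))  # 3/5 = 0.6
```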

Limitations of BLEU:

Despite its widespread use, BLEU has certain limitations. It primarily focuses on n-gram precision and doesn’t fully capture the sentence structure or meaning [2]. Consequently, a high BLEU score doesn’t always guarantee a high-quality translation, as it may not accurately reflect the fluency or coherence of the generated text.

Perplexity

Perplexity measures how well an LLM predicts a sequence of words. It quantifies the model’s uncertainty when predicting the next word in a sequence.

A score of 1 is ideal, meaning the model is completely confident in predicting the sequence; lower perplexity (still greater than 1 in practice) indicates better performance, while higher perplexity indicates worse performance.

For example, imagine a language model trained on a small corpus of text. This model has a limited vocabulary of only six words: “a”, “the”, “red”, “fox”, “dog”, and “.”. We want to calculate the perplexity of this model on the sentence “a red fox.” Let’s consider:

  • P(“a”) = 0.2
  • P(“red” | “a”) = 0.6 (probability of “red” given “a”)
  • P(“fox” | “a red”) = 0.8 (probability of “fox” given “a red”)
  • P(“.” | “a red fox”) = 0.9 (probability of “.” given “a red fox”)

P(“a red fox.”) = 0.2 * 0.6 * 0.8 * 0.9 = 0.0864

Perplexity(“a red fox.”) = 1 / (0.0864)^(1/4) ≈ 1.84

Note that perplexity is always greater than or equal to 1, with 1 being the perfect score; lower scores are preferred over larger ones.
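The same calculation can be written as a short Python sketch, assuming we already have the conditional probability the model assigned to each token; computing in log space avoids numerical underflow on longer sequences:

```python
import math

def perplexity(token_probs: list[float]) -> float:
    """Perplexity from per-token conditional probabilities: the inverse geometric mean
    of the sequence probability, computed in log space for numerical stability."""
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Conditional probabilities from the "a red fox." example above.
probs = [0.2, 0.6, 0.8, 0.9]
print(round(perplexity(probs), 2))  # ≈ 1.84
```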

Cross Entropy

Cross entropy is a loss function widely employed in machine learning, including the training of LLMs. It measures the difference between two probability distributions:

  • True Distribution: The actual distribution of words or tokens in the training data.
  • Predicted Distribution: The probability distribution predicted by the LLM.

During training, LLMs strive to minimize cross entropy. A lower cross entropy value signifies that the model’s predicted distribution closely aligns with the true distribution, indicating better performance [7]. In essence, cross entropy guides the model to learn the underlying patterns and relationships within the training data, enabling it to generate more accurate and human-like text.

Example:

Imagine an LLM being trained to predict the next word in a sentence. Given the input “The cat sat on the,” the true distribution might assign a high probability to the word “mat.” If the LLM predicts a different word with high probability, the cross entropy loss will be high. Conversely, if the LLM correctly assigns a high probability to “mat,” the loss will be low. By minimizing this loss, the model learns to make more accurate predictions and generate more coherent text.
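A minimal sketch of this idea in Python. The vocabulary and probability values are invented for illustration; for a one-hot target (the single observed next word), cross entropy reduces to the negative log probability assigned to that word, and exponentiating it recovers perplexity:

```python
import math

def cross_entropy(true_dist: dict[str, float], pred_dist: dict[str, float]) -> float:
    """H(p, q) = -sum_w p(w) * log q(w), over words with non-zero true probability."""
    return -sum(p * math.log(pred_dist[w]) for w, p in true_dist.items() if p > 0)

# Hypothetical next-word prediction for the prompt "The cat sat on the".
true_dist = {"mat": 1.0}                                  # observed next word (one-hot target)
good_model = {"mat": 0.7, "sofa": 0.2, "moon": 0.1}
bad_model = {"mat": 0.05, "sofa": 0.15, "moon": 0.8}

print(round(cross_entropy(true_dist, good_model), 3))  # 0.357: low loss, confident and correct
print(round(cross_entropy(true_dist, bad_model), 3))   # 2.996: high loss, mass on the wrong words
# exp(cross entropy) recovers perplexity: exp(0.357) ≈ 1.43 for the good model.
```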

Cross Entropy and LLM Confidence:

Cross entropy can also be interpreted as a measure of the LLM’s confidence in its predictions. A lower cross entropy suggests that the model is more certain about its output, while a higher cross entropy indicates greater uncertainty. This information can be valuable for developers in understanding the model’s behavior and identifying areas where it might be prone to errors.

Using Cross Entropy for Fine-tuning:

Analyzing the cross entropy loss for different types of errors, such as factual errors versus grammatical errors, can provide valuable insights for fine-tuning LLMs. By focusing on the errors that contribute most significantly to the loss, developers can tailor their fine-tuning efforts to address specific weaknesses and improve the model’s overall performance.
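As a hedged illustration of this kind of analysis, the sketch below assumes you already have per-token cross-entropy losses and a (hypothetical) error-category tag for each token, and simply averages the loss per category; all numbers are invented:

```python
from collections import defaultdict

# Hypothetical per-token cross-entropy losses and error-type tags for one evaluation batch.
token_losses = [0.4, 3.1, 0.2, 2.7, 0.3, 1.9]
token_tags = ["ok", "factual", "ok", "factual", "ok", "grammar"]

totals, counts = defaultdict(float), defaultdict(int)
for loss, tag in zip(token_losses, token_tags):
    totals[tag] += loss
    counts[tag] += 1

for tag in totals:
    print(tag, round(totals[tag] / counts[tag], 2))  # mean loss per error category
# Here "factual" tokens dominate the loss (mean 2.9), suggesting fine-tuning should target factuality.
```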

Combining Evaluation Metrics

While each of the metrics discussed above provides valuable information about LLM performance, it’s essential to recognize that no single metric is perfect. Each has its own strengths and limitations, and relying solely on one metric may not provide a complete picture of the model’s capabilities. Therefore, it’s crucial to combine multiple evaluation metrics to obtain a more comprehensive and nuanced assessment.

For example, while ELO rating captures the relative performance of LLMs based on human judgment, it may not be sensitive to subtle differences in text quality that BLEU can detect.

ELO is a good fit when:

  • You want to evaluate the overall quality of the LLM’s output, such as coherence, relevance, and factual accuracy, which are harder to capture with metrics like BLEU.
  • You need continuous evaluation against current and upcoming models.
  • You have access to human evaluators.

Similarly, while BLEU measures the accuracy of generated text compared to references, it doesn’t provide insights into the model’s confidence levels, which cross entropy can reveal.

BLEU is a good fit when:

  • You have a clear ground truth text to compare the LLM’s output to, such as in machine translation, text summarization, or paraphrasing tasks.
  • You need an automatic and objective metric that can be quickly calculated.
  • You want to identify specific areas where the model needs improvement, such as n-gram precision.

Perplexity’s strengths:

  • Does not need a golden or reference dataset.
  • Goes beyond n-grams. The score provides insights into the model’s predictive power, fluency, and understanding of language structure.

Cross-entropy’s strengths:

  • Evaluating the effectiveness of pre-training: Cross-entropy can be used to assess how well an LLM has learned the underlying patterns and relationships in the training data. A lower cross-entropy on a held-out dataset indicates better generalization and pre-training effectiveness.
  • Cross-entropy can be used to compare the performance of different LLMs on the same task without involving human rating. The model with the lower cross-entropy is generally considered to be better at capturing the true distribution of the data.

By combining these metrics, developers can gain a more holistic understanding of LLM performance and identify areas for improvement.

Conclusion

Evaluating LLMs is a critical aspect of their development and deployment. The metrics discussed in this article (ELO rating, BLEU, perplexity, and cross entropy) offer valuable tools for assessing different aspects of LLM performance. ELO rating provides a dynamic and relative measure based on human evaluation, BLEU quantifies the accuracy of generated text compared to references, perplexity captures how well the model predicts held-out text, and cross entropy guides the training process and provides insights into model confidence.

However, it’s important to remember that each metric has limitations, and a comprehensive evaluation often requires combining multiple metrics.

References:

1. Llm Elo Rating Insights | Restackio, accessed December 22, 2024, https://www.restack.io/p/llm-evaluation-answer-elo-rating-cat-ai

2. Bleu Evaluation for LLM Evaluation — Restack, accessed December 22, 2024, https://www.restack.io/p/llm-evaluation-answer-bleu-evaluation-cat-ai

3. Elo as a tool for ranking LLMs — Medium, accessed December 22, 2024, https://medium.com/tr-labs-ml-engineering-blog/elo-as-a-tool-for-ranking-llms-dab056dc9713

4. Chatbot Arena: Benchmarking LLMs in the Wild with Elo Ratings | LMSYS Org, accessed December 22, 2024, https://lmsys.org/blog/2023-05-03-arena/

5. Llm Elo Ranking Systems Explained | Restackio, accessed December 22, 2024, https://www.restack.io/p/llm-evaluation-answer-elo-ranking-systems-cat-ai

6. Comprehensive 10+ LLM Evaluation: From BLEU, ROUGE, and METEOR to Scenario-Based Metrics like… — Rupak (Bob) Roy, accessed December 22, 2024, https://bobrupakroy.medium.com/comprehensive-10-llm-evaluation-from-bleu-rouge-and-meteor-to-scenario-based-metrics-like-9f6602c92c17

7. Cross Entropy in Large Language Models (LLMs) | by Charles Chi | AI — Medium, accessed December 22, 2024, https://medium.com/ai-assimilating-intelligence/cross-entropy-in-large-language-models-llms-4f1c842b5fca
