#AI #LLM #GenAI #AgenticAI

LLM Evaluation Techinques

Mukesh BarnwalMarch 16, 20263 min read

#AI #LLM #GenAI #AgenticAI

Evaluation of LLM is very important to gain insights into how they are actually performing. Is it solving your problem, your business use case? Does it hallucinate, is it complete, accurate, matches with the ground-truth output that you might have?

LLM evaluation is challenging because how do you know whether the generated output is accurate? It is like distinguishing between beginner and expert response. How do you know whether LLM output aligns closer to the expert.

Since LLMs output type can vary according to the specific problem you are solving, criteria for evaluation also differs.

Let's walk through different types of evaluation:

BERTScore: Lets say you have fine-tuned LLM to generate summary reports about specific topics. Assume, you have the ground-truth output for some topics like sports, finance. You allow the LLM to generate output for these topics. Now you have ground truth and predicted output. You use contextual embeddings of both and calculate semantic similarity between the two . Higher similarity means LLM model output is good and matches your ground truth data.
BLEU score(Bilingual Evaluation Understudy): This is often used in calculating how your sequence-to-sequence models like translators are working. For example: English to French translation. Let's say you have a translation of English to French predicted by LLM. And you have the ground truth of the actual French translation. In this method, you take all combinations of words from 1 to N in the model translated text. You calculate how much overlap is present in the actual translation. The more the overlap, the better the model translation(output) is.
Perplexity: Perplexity as it means is It says how much the model is confused while predicting the next token. Higher the perplexity score, more is the confusion which means the model is not sure of the next word. This means the model is not performing well. Lower the perplexity score, less the confusion, the model is confident about the words it predicts. This means it performs well.
GLUE benchmark: This encompasses all criteria of languages. It takes into consideration language parameters such as semanticity, structure, and qualitative criterias such as faithfulness, toxicity.
Classification scores(such as Precision, Accuracy, Recall, F1 Scores): When your LLM model solves a classification problem, you can use these metrics to check against the ground truth and calculate F1 scores. Higher the F1 score, better the model performance is.

Choosing from above metrics depends on the use case, LLM output types, ground truth data availability. Sometimes, a combination of the above can provide a bigger picture of how LLM performs.

Twitter LinkedIn

← Back to Blog

Related articles