搜索引擎RAG召回效果评测MTEB介绍与使用入门

RAG 评测数据集建设尚处于初期阶段，缺乏针对特定领域和场景的专业数据集。市面上常见的 MS-Marco 和 BEIR 数据集覆盖范围有限，且在实际使用场景中效果可能与评测表现不符。目前最权威的检索榜单是 HuggingFace MTEB，今天我们来学习使用MTEB，并来评测自研模型recall效果。

MTEB 是一个包含广泛文本嵌入（Text Embedding）的基准测试，它提供了多种语言的数十个数据集，用于各种 NLP 任务，例如文本分类、聚类、检索和文本相似性。MTEB 提供了一个公共排行榜，允许研究人员提交他们的结果并跟踪他们的进展。MTEB 还提供了一个简单的 API，允许研究人员轻松地将他们的模型与基准测试进行比较。

安装使用

pip install mteb

使用入门

最简单的用法就是，直接编写python代码来测试 (see scripts/run_mteb_english.py and mteb/mtebscripts for more):

from mteb import MTEB

from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name

model_name = "average_word_embeddings_komninos"

model = SentenceTransformer(model_name)

evaluation = MTEB(tasks=["Banking77Classification"])

results = evaluation.run(model, output_folder=f"results/{model_name}")

也可以使用官方提供的 CLI

mteb --available_tasks

mteb -m average_word_embeddings_komninos \

    -t Banking77Classification  \

    --output_folder results/average_word_embeddings_komninos \

    --verbosity 3

高级用法

测试数据集选择

MTEB支持指定数据集，可以通过下面的形式

按task_type任务类型（例如“聚类”或“分类”)

evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks

按类别划分, 例如“句子到句子 "S2S" (sentence to sentence) "P2P" (paragraph to paragraph)

evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence datasets

按照文本语言

evaluation = MTEB(task_langs=["en", "de"]) # Only select datasets which are "en", "de" or "en-de"

还可以针对数据集选择语言：

from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining

evaluation = MTEB(tasks=[

        AmazonReviewsClassification(langs=["en", "fr"]) # Only load "en" and "fr" subsets of Amazon Reviews

        BUCCBitextMining(langs=["de-en"]), # Only load "de-en" subset of BUCC

])

可为某些任务集合提供预设

from mteb import MTEB_MAIN_EN

evaluation = MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])

自定义评测 split

有的数据集有多个split，评测会比较耗时，可以指定splits，来减少评测时间，比如下面的就指定了只用test split。

evaluation.run(model, eval_splits=["test"])

自定义评测模型

如果想自定义评测模型，可以自定义一个类，只要实现一个encode函数，输入是一个句子列表，返回的是一个嵌入向量列表（嵌入可以是np.array、torch.tensor等）。可以参考 mteb/mtebscripts repo 仓库。

class MyModel():

    def encode(self, sentences, batch_size=32, **kwargs):

        """

        Returns a list of embeddings for the given sentences.

        Args:

            sentences (`List[str]`): List of sentences to encode

            batch_size (`int`): Batch size for the encoding

        Returns:

            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences

        """

        pass

model = MyModel()

evaluation = MTEB(tasks=["Banking77Classification"])

evaluation.run(model)

如果针对query和corpus需要使用不同的encode方法，可以独立提供encode_queries and encode_corpus两个方法。

class MyModel():

    def encode_queries(self, queries, batch_size=32, **kwargs):

        """

        Returns a list of embeddings for the given sentences.

        Args:

            queries (`List[str]`): List of sentences to encode

            batch_size (`int`): Batch size for the encoding

        Returns:

            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences

        """

        pass

    def encode_corpus(self, corpus, batch_size=32, **kwargs):

        """

        Returns a list of embeddings for the given sentences.

        Args:

            corpus (`List[str]` or `List[Dict[str, str]]`): List of sentences to encode

                or list of dictionaries with keys "title" and "text"

            batch_size (`int`): Batch size for the encoding

        Returns:

            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences

        """

        pass

自定义评测Task（数据集）

要添加一个新任务，你需要实现一个从与任务类型相关的AbsTask继承的新类（例如，对于重排任务是AbsTaskReranking）。你可以在这里找到支持的任务类型。

比如下面的自定义重排任务：

from mteb import MTEB

from mteb.abstasks.AbsTaskReranking import AbsTaskReranking

from sentence_transformers import SentenceTransformer

class MindSmallReranking(AbsTaskReranking):

    @property

    def description(self):

        return {

            "name": "MindSmallReranking",

            "hf_hub_name": "mteb/mind_small",

            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",

            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",

            "type": "Reranking",

            "category": "s2s",

            "eval_splits": ["validation"],

            "eval_langs": ["en"],

            "main_score": "map",

        }

model = SentenceTransformer("average_word_embeddings_komninos")

evaluation = MTEB(tasks=[MindSmallReranking()])

evaluation.run(model)

源码分析

Retrieval召回评测

召回评测是通过RetrievalEvaluator类实现的。

def __init__(

        self,

        queries: Dict[str, str],  # qid => query

        corpus: Dict[str, str],  # cid => doc

        relevant_docs: Dict[str, Set[str]],  # qid => Set[cid]

        corpus_chunk_size: int = 50000,

        mrr_at_k: List[int] = [10],

        ndcg_at_k: List[int] = [10],

        accuracy_at_k: List[int] = [1, 3, 5, 10],

        precision_recall_at_k: List[int] = [1, 3, 5, 10],

        map_at_k: List[int] = [100],

        show_progress_bar: bool = False,

        batch_size: int = 32,

        name: str = "",

        score_functions: List[Callable[[torch.Tensor, torch.Tensor], torch.Tensor]] = {

            "cos_sim": cos_sim,

            "dot": dot_score,

        },  # Score function, higher=more similar

        main_score_function: str = None,

        limit: int = None,

        **kwargs

    ):

        super().__init__(**kwargs)

        self.queries_ids = []

        for qid in queries:

            if qid in relevant_docs and len(relevant_docs[qid]) > 0:

                self.queries_ids.append(qid)

                if limit and len(self.queries_ids) >= limit:

                    break

        self.queries = [queries[qid] for qid in self.queries_ids]

        self.corpus_ids = list(corpus.keys())

        self.corpus = [corpus[cid] for cid in self.corpus_ids]

        self.relevant_docs = relevant_docs

        self.corpus_chunk_size = corpus_chunk_size

        self.mrr_at_k = mrr_at_k

        self.ndcg_at_k = ndcg_at_k

        self.accuracy_at_k = accuracy_at_k

        self.precision_recall_at_k = precision_recall_at_k

        self.map_at_k = map_at_k

        self.show_progress_bar = show_progress_bar

        self.batch_size = batch_size

        self.name = name

        self.score_functions = score_functions

        self.score_function_names = sorted(list(self.score_functions.keys()))

        self.main_score_function = main_score_function

构造函数几个重要的参数：

- queries: Dict[str, str], # qid => query qid到query的dict

- corpus: Dict[str, str], # cid => doc docid到doc的dict

- relevant_docs: Dict[str, Set[str]], # qid => Set[cid] qid到相关docid的dict

因此，要自定义评测任务，需要提供这些数据。

具体的评测函数在compute_metrics里：

def compute_metrics(self, model, corpus_model=None, corpus_embeddings: torch.Tensor = None) -> Dict[str, float]:

        if corpus_model is None:

            corpus_model = model

        max_k = max(

            max(self.mrr_at_k),

            max(self.ndcg_at_k),

            max(self.accuracy_at_k),

            max(self.precision_recall_at_k),

            max(self.map_at_k),

        )

        # Compute embedding for the queries

        logger.info("Encoding the queries...")

        # We don't know if encode has the kwargs show_progress_bar

        kwargs = {

            "show_progress_bar": self.show_progress_bar

        } if "show_progress_bar" in inspect.signature(model.encode).parameters else {}

        query_embeddings = np.asarray(model.encode(self.queries, batch_size=self.batch_size, **kwargs))

        queries_result_list = {}

        for name in self.score_functions:

            queries_result_list[name] = [[] for _ in range(len(query_embeddings))]

        # Iterate over chunks of the corpus

        logger.info("Encoding chunks of corpus, and computing similarity scores with queries...")

        for corpus_start_idx in trange(

            0,

            len(self.corpus),

            self.corpus_chunk_size,

            desc="Corpus Chunks",

            disable=not self.show_progress_bar,

        ):

            # Encode chunk of corpus

            if corpus_embeddings is None:

                corpus_end_idx = min(corpus_start_idx + self.corpus_chunk_size, len(self.corpus))

                sub_corpus_embeddings = np.asarray(corpus_model.encode(

                    self.corpus[corpus_start_idx:corpus_end_idx],

                    batch_size=self.batch_size,

                ))

            else:

                corpus_end_idx = min(corpus_start_idx + self.corpus_chunk_size, len(corpus_embeddings))

                sub_corpus_embeddings = corpus_embeddings[corpus_start_idx:corpus_end_idx]

            # Compute cosine similarites

            for name, score_function in self.score_functions.items():

                pair_scores = score_function(query_embeddings, sub_corpus_embeddings)

                # Get top-k values

                pair_scores_top_k_values, pair_scores_top_k_idx = torch.topk(

                    pair_scores,

                    min(max_k, len(pair_scores[0])),

                    dim=1,

                    largest=True,

                    sorted=False,

                )

                pair_scores_top_k_values = pair_scores_top_k_values.cpu().tolist()

                pair_scores_top_k_idx = pair_scores_top_k_idx.cpu().tolist()

                for query_itr in range(len(query_embeddings)):

                    for sub_corpus_id, score in zip(

                        pair_scores_top_k_idx[query_itr],

                        pair_scores_top_k_values[query_itr],

                    ):

                        corpus_id = self.corpus_ids[corpus_start_idx + sub_corpus_id]

                        queries_result_list[name][query_itr].append({"corpus_id": corpus_id, "score": score})

        # Compute scores

        logger.info("Computing metrics...")

        scores = {name: self._compute_metrics(queries_result_list[name]) for name in self.score_functions}

        return scores

model（embedding模型），corpus_model（如果doc用单独的embedding模型，需要传入这个参数，否则默认使用和query一样的model）
首先会计算query_embedding query_embeddings = np.asarray(model.encode(self.queries, batch_size=self.batch_size, **kwargs))
然后计算corpus_embeddings
通过score_function，计算tok_k，结果放到queries_result_list
根据召回结果计算指标_compute_metrics, 会计算"mrr@k", "ndcg@k", "accuracy@k", "precision_recall@k", "map@k"等指标

Reranking 精排

精排是通过RerankingEvaluator来实现的。

class RerankingEvaluator(Evaluator):

    """

    This class evaluates a SentenceTransformer model for the task of re-ranking.

    Given a query and a list of documents, it computes the score [query, doc_i] for all possible

    documents and sorts them in decreasing order. Then, MRR@10 and MAP is compute to measure the quality of the ranking.

    :param samples: Must be a list and each element is of the form:

        - {'query': '', 'positive': [], 'negative': []}. Query is the search query, positive is a list of positive

        (relevant) documents, negative is a list of negative (irrelevant) documents.

        - {'query': [], 'positive': [], 'negative': []}. Where query is a list of strings, which embeddings we average

        to get the query embedding.

    """

    def __init__(

        self,

        samples,

        mrr_at_k: int = 10,

        name: str = "",

        similarity_fct=cos_sim,

        batch_size: int = 512,

        use_batched_encoding: bool = True,

        limit: int = None,

        **kwargs,

    ):

给定一个query和一组文档，模型计算文档得分，并按降序排列，最后计算MRR@10和MAP指标来衡量排名的质量。

__init__方法接收以下参数：

samples：必须是一个列表，每个元素的形式为：
- {'query': '', 'positive': [], 'negative': []}。查询是搜索查询，正文档是相关（正面）文档的列表，负文档是无关（负面）文档的列表。
- {'query': [], 'positive': [], 'negative': []}。其中查询是一个字符串列表，我们将这些字符串的平均嵌入作为查询嵌入。
mrr_at_k：默认值为10，表示计算MRR时考虑的前k个结果。
name：默认值为空字符串，表示评估器的名称。
similarity_fct：默认值为cos_sim，表示用于计算相似度的函数。

在compute_metrics_batched 计算得分，还是计算的cos得分，这里相当于直接计算的embedding的排序能力，如果要计算cross模型的排序能力，默认的代码不适用，需要重新定制。

评测实践

说了这么多，现在切入正题：

评测自研模型的召回能力 —— 自定义模型
自定义评测集，对比开源模型和自研模型的效果 —— 自定义评测任务

自研模型召回效果评测

我们先评估模型召回效果，训练好的模型导出为onnx，因此我们通过onnxrutime来进行推理，先自定义模型：

from mteb import MTEB

import onnxruntime as ort

from paddlenlp.transformers import AutoTokenizer

import math

from tqdm import tqdm

# 模型路径

model_path = "onnx/fp16_model.onnx"

tokenizer_path = "model_520000"

class MyModel():

    def __init__(self, use_gpu=True):

        providers = ['CUDAExecutionProvider'] if use_gpu else ['CPUExecutionProvider']

        sess_options = ort.SessionOptions()

        self.predictor = ort.InferenceSession(

            model_path, sess_options=sess_options, providers=providers)

        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

    def encode(self, sentences, batch_size=64, **kwargs):

        all_embeddings = []

        # 向上取整

        batch_count = math.ceil(len(sentences) / batch_size)

        for i in tqdm(range(batch_count)):

            # 按batch

            sub_sentences = sentences[i * batch_size : min(len(sentences), (i + 1) * batch_size)]

            features = self.tokenizer(sub_sentences, max_seq_len=128,

                                    pad_to_max_seq_len=True, truncation_strategy="longest_first")

            vecs = self.predictor.run(None, features.data)

            all_embeddings.extend(vecs[0])

        return all_embeddings

由于传进来的sentences是所有的数据，我们需要按照batch_size，分批进行embedding计算，计算好的放入all_embeddings，最后返回即可。

自定义召回评测任务

上面分析源代码时提到了，自定义时需要提供qurey，doc，以及query的相关doc

假设我们的自定义测试为jsonline格式，每行包含query，以及相关的doc，json格式如下：

{

    "query": "《1984》是什么",

    "data": [

        {

            "title": "《1984》介绍-知乎",

            "summary": "《1984》是伪装成小说的政治思想...",

            "url": "",

            "id": 5031622209044687985,

            "answer": "完全相关",

            "accuracy": "无错",

            "result": "good"

        }

    ]

}

那么我们可以编写自定义召回评测任务：

class SSRetrieval(AbsTaskRetrieval):

    @property

    def description(self):

        return {

            'name': 'SSRetrieval',

            'description': 'SSRetrieval是S研发部测试团队准备的召回测试集',

            'type': 'Retrieval',

            'category': 's2p',

            'json_path': '/data/xapian-core-1.4.24/demo/result.json',

            'eval_splits': ['dev'],

            'eval_langs': ['zh'],

            'main_score': 'recall_at_10',

        }

    def load_data(self, **kwargs):

        if self.data_loaded:

            return

        self.corpus = {} # doc_id => doc

        self.queries = {}  # qid => query

        self.relevant_docs = {} # qid => Set[doc_id]

        query_index = 1

        with open(self.description['json_path'], 'r', encoding='utf-8') as f:

            for line in f:

                if "完全相关" not in line:

                    continue

                line =  json.loads(line)

                query =  line['query']

                query_id = str(query_index)

                self.queries[query_id] = query

                query_index = query_index + 1

                query_relevant_docs = []

                for doc in line['data']:

                    doc_id = str(doc['id'])

                    self.corpus[doc_id] = {"title": doc["title"], "text": doc["summary"]}

                    if doc['answer'] == "完全相关":

                        if query_id not in self.relevant_docs:

                            self.relevant_docs[query_id] = {}

                        self.relevant_docs[query_id][doc_id] = 1 

                # debug使用

                # if query_index == 100:

                #     break

        self.queries = DatasetDict({"dev": self.queries})

        self.corpus = DatasetDict({"dev": self.corpus})

        self.relevant_docs = DatasetDict({"dev": self.relevant_docs})

        self.data_loaded = True

用自定义模型，评测自定义任务

if __name__ == '__main__':

    model = MyModel()

    # task_names = [t.description["name"] for t in MTEB(task_types='Retrieval',

    #                                                   task_langs=['zh', 'zh-CN']).tasks]

    task_names = ["SSRetrieval"]

    for task in task_names:

        model.query_instruction_for_retrieval = None

        evaluation = MTEB(tasks=[task], task_langs=['zh', 'zh-CN'])

        evaluation.run(model, output_folder=f"zh_results/256_model", batch_size=64)

总结

mteb 最为embedding召回效果测试，是一个权威的榜单，本身提供的工具框架也具备较好的扩展性，方便开发者自定义模型和自定义评测任务。