title: ParaPLUIE
emoji: ☂️
tags:
- evaluate
- metric
description: >-
ParaPLUIE is a metric for evaluating the semantic proximity between two
sentences. ParaPLUIE uses the perplexity of an LLM to compute a confidence
score. It has shown the highest correlation with human judgment on paraphrase
classification while maintaining a low computational cost, as it is roughly
equivalent to the cost of generating a single token.
sdk: static
pinned: true
short_description: ParaPLUIE is a metric for evaluating the semantic proximity
Metric Card for ParaPLUIE ☂️ (Paraphrase Generation Evaluation Powered by an LLM)
Metric Description
ParaPLUIE is a metric for evaluating the semantic proximity between two sentences. ParaPLUIE uses the perplexity of an LLM to compute a confidence score. It has shown the highest correlation with human judgment on paraphrase classification while maintaining a low computational cost, as it is roughly equivalent to the cost of generating a single token.
Source code
How to Use
This metric requires a source sentence and its hypothetical paraphrase.
from PPLUIE.wrapper import ParaPLUIE
template = "FS-DIRECT"
device = "auto"
scorer = ParaPLUIE()
scorer.init("mistralai/Mistral-7B-Instruct-v0.2", device)
scorer.setTemplate(template)
S = ["Have you ever seen a tsunami ?"]
H = ["Have you ever seen a tiramisu ?"]
score = scorer.compute(S, H)
print("Result score : ",score)
>>> [-16.97607421875]
Inputs
- sources (list of string): Source sentences.
- hypotheses (list of string): Hypothetical paraphrases.
Output Values
- score (float): ParaPLUIE score. Minimum possible value is -inf. Maximum possible value is +inf. A score greater than 0 means that sentences are paraphrases. A score lower than 0 indicates the opposite.
This metric outputs a list containing the score.
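The sign rule above can be turned into labels with a few lines of plain Python. This is a minimal sketch, not part of the ParaPLUIE package: `label_scores` is a hypothetical helper, and treating a score of exactly 0 as "not a paraphrase" is an assumption, since the card only defines the cases above and below 0.

```python
import math


def label_scores(scores, threshold=0.0):
    """Map each score to True (paraphrase) or False (not a paraphrase).

    Hypothetical helper: a score strictly above `threshold` counts as a
    paraphrase. A score equal to the threshold is labelled False, which is
    an assumption and not something the metric card specifies.
    """
    labels = []
    for s in scores:
        if math.isnan(s):
            raise ValueError("score is NaN and cannot be labelled")
        labels.append(s > threshold)
    return labels


# Score taken from the tsunami / tiramisu example in "How to Use".
scores = [-16.97607421875]
print(label_scores(scores))  # [False]
```

Because the score is unbounded, its magnitude can also be used to rank sentence pairs; the threshold at 0 only gives the binary decision.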
Examples
Configure metric
import torch  # needed for torch.bfloat16 below

scorer = ParaPLUIE()
scorer.init(
model = "mistralai/Mistral-7B-Instruct-v0.2",
device = "cuda:0",
template = "FS-DIRECT",
use_chat_template = True,
half_mode = True,
n_right_specials_tokens = 1,
dtype = torch.bfloat16
)
Show the available prompting templates
scorer.show_templates()
>>> DIRECT
>>> MEANING
>>> INDIRECT
>>> FS-DIRECT
>>> FS-DIRECT_MAJ
>>> FS-DIRECT_FR
>>> FS-DIRECT_MAJ_FR
>>> FS-DIRECT_FR_MIN
>>> NETWORK
Show the LLMs that have already been tested with ParaPLUIE
scorer.show_available_models()
>>> HuggingFaceTB/SmolLM2-135M-Instruct
>>> HuggingFaceTB/SmolLM2-360M-Instruct
>>> HuggingFaceTB/SmolLM2-1.7B-Instruct
>>> google/gemma-2-2b-it
>>> state-spaces/mamba-2.8b-hf
>>> internlm/internlm2-chat-1_8b
>>> microsoft/Phi-4-mini-instruct
>>> mistralai/Mistral-7B-Instruct-v0.2
>>> tiiuae/falcon-mamba-7b-instruct
>>> Qwen/Qwen2.5-7B-Instruct
>>> CohereForAI/aya-expanse-8b
>>> google/gemma-2-9b-it
>>> meta-llama/Meta-Llama-3-8B-Instruct
>>> microsoft/phi-4
>>> CohereForAI/aya-expanse-32b
>>> Qwen/QwQ-32B
>>> CohereForAI/c4ai-command-r-08-2024
Change the prompting template
scorer.setTemplate("DIRECT")
Show how the prompt is encoded, to check that the correct number of special tokens is removed and that the words "Yes" and "No" each map to a single token
scorer.check_end_tokens_tmpl()
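The single-token requirement matters if, as the description suggests, the score is read from one next-token distribution. The sketch below illustrates that idea with toy numbers and no model. It assumes the score is the log-odds of "Yes" against "No" for the token following the prompt; that reading is consistent with the unbounded range and the sign rule in this card, but it is an assumption here, not code taken from ParaPLUIE. `yes_no_log_odds` and `toy_logits` are hypothetical names.

```python
import math


def yes_no_log_odds(next_token_logits, yes_token="Yes", no_token="No"):
    """Return log P(yes) - log P(no) under a softmax over the given logits.

    `next_token_logits` maps candidate next tokens to raw logits. This only
    works if each answer is a single token, so that one distribution holds
    the full probability of each answer.
    """
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(next_token_logits.values())
    log_z = m + math.log(sum(math.exp(v - m) for v in next_token_logits.values()))
    log_p_yes = next_token_logits[yes_token] - log_z
    log_p_no = next_token_logits[no_token] - log_z
    return log_p_yes - log_p_no


# Toy logits, invented for illustration only.
toy_logits = {"Yes": 1.5, "No": 4.0, "Maybe": 0.2}
score = yes_no_log_odds(toy_logits)
print(score)  # negative, because "No" is more likely than "Yes" here
```

Under this assumption the normalisation term cancels, so the score reduces to the difference between the two logits, which is why a single forward pass is enough.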
Limitations and Bias
This metric is based on an LLM and is therefore limited by the LLM that is used.
Citation
@inproceedings{lemesle-etal-2025-paraphrase,
title = "Paraphrase Generation Evaluation Powered by an {LLM}: A Semantic Metric, Not a Lexical One",
author = "Lemesle, Quentin and
Chevelu, Jonathan and
Martin, Philippe and
Lolive, Damien and
Delhay, Arnaud and
Barbot, Nelly",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics",
year = "2025",
url = "https://aclanthology.org/2025.coling-main.538/"
}