v5 Transformers

#57
by AntonV HF Staff - opened
AntonV changed pull request status to open

It also requires the removal of these 3 lines: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5/blob/main/sentence_bert_config.json#L4-L6
Once those are removed, this PR, together with the GitHub PR that you mentioned, yields identical performance in my evaluations on NanoMSMARCO:

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

# model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)
model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", revision="refs/pr/57")
model.prompts = {
    "document": "search_document: ",
    "query": "search_query: ",
}

evaluator = NanoBEIREvaluator(["msmarco"])
results = evaluator(model)
print(results[evaluator.primary_metric])
# Baseline (transformers ~v5.4, trust_remote_code)
# 0.5925795457057265

# New: local transformers build with the GitHub PR (no trust_remote_code)
# 0.5925795457057265
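For context, the `model.prompts` mapping above just defines named prefixes that Sentence Transformers prepends to each input before tokenization; selecting one via `prompt_name` amounts to plain string concatenation. A minimal standalone sketch (illustrative names, not the library's internals):

```python
# Sketch of how named prompts are applied: the selected prompt string
# is prepended to each input text before tokenization.
prompts = {
    "document": "search_document: ",
    "query": "search_query: ",
}

def apply_prompt(text: str, prompt_name: str) -> str:
    """Prepend the named prompt, mirroring encode(..., prompt_name=...)."""
    return prompts[prompt_name] + text

print(apply_prompt("what is a transformer?", "query"))
# search_query: what is a transformer?
```

With the real model, `model.encode(texts, prompt_name="query")` applies the same prefixing before embedding.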

Of course, the core issue that remains is that merging this PR together with the GitHub one would break the model for all current users. Could we perhaps keep the auto_map? Then current users can continue to use the model without issues, and we can update the README to suggest that new users drop trust_remote_code once their transformers version is high enough.

  • Tom Aarsen
Nomic AI org

Just tested this again, and I'm getting identical performance on NanoMSMARCO with Sentence Transformers in three configurations:

  • trust_remote_code=True on main
  • trust_remote_code=True on this PR
  • No trust_remote_code on this PR, i.e. with native transformers

Nice work.

  • Tom Aarsen
hnomic changed pull request status to merged
