YAML Metadata Warning: The pipeline tag "text2text-generation" is not in the official list: text-classification, token-classification, table-question-answering, question-answering, zero-shot-classification, translation, summarization, feature-extraction, text-generation, fill-mask, sentence-similarity, text-to-speech, text-to-audio, automatic-speech-recognition, audio-to-audio, audio-classification, audio-text-to-text, voice-activity-detection, depth-estimation, image-classification, object-detection, image-segmentation, text-to-image, image-to-text, image-to-image, image-to-video, unconditional-image-generation, video-classification, reinforcement-learning, robotics, tabular-classification, tabular-regression, tabular-to-text, table-to-text, multiple-choice, text-ranking, text-retrieval, time-series-forecasting, text-to-video, image-text-to-text, image-text-to-image, image-text-to-video, visual-question-answering, document-question-answering, zero-shot-image-classification, graph-ml, mask-generation, zero-shot-object-detection, text-to-3d, image-to-3d, image-feature-extraction, video-text-to-text, keypoint-detection, visual-document-retrieval, any-to-any, video-to-video, other

Vehicle Title to Year/Make/Model (T5-small)

A fine-tuned t5-small sequence-to-sequence model that extracts structured vehicle information (year, make, model) from free-form vehicle listing titles.

Model Overview

Property	Value
Base Model	`google-t5/t5-small`
Parameters	~60 million
Model Size	~231 MB (safetensors)
Task	Text-to-text generation
Input	Raw vehicle title with instruction prefix
Output	Structured string: `Year: XXXX \| Make: XXXX \| Model: XXXX`

Input/Output Format

Input:

extract vehicle info: New 2023 Yamaha YZ450F Monster Energy Edition

Output:

Year: 2023 | Make: Yamaha | Model: YZ450F

Intended Use

Normalizing marketplace or dealership listing titles into structured fields
Preprocessing step for downstream analytics or search systems expecting (year, make, model) tuples
Batch processing of vehicle inventory data

This model targets vehicle title parsing; it is not a general-purpose language model.

Training Details

Data

Property	Value
Source	Private dealership inventory export
Total Samples	~200,000
Train/Val Split	90% / 10%
Status	Proprietary; not released

Fields used from source data:

onlineTitle (free-text input)
catalogYear, catalogMake, catalogModel (target labels)

Training Configuration

Hyperparameter	Value
Epochs	3
Batch Size	32
Learning Rate	2e-4
Weight Decay	0.01
Max Input Length	256 tokens
Max Target Length	128 tokens
Optimizer	AdamW (via Seq2SeqTrainer)
Gradient Checkpointing	Enabled

Hardware

Component	Specification
Device	Apple MacBook Pro
Chip	Apple M1 Max
Backend	PyTorch MPS
Training Time	~5 hours

Software Environment

Python 3.14
PyTorch 2.9.1
Transformers < 4.57
scikit-learn

Evaluation Results

Evaluated on 19,793 held-out validation samples from the private corpus.

Metric	Score
Exact String Match	55.53%
Year Accuracy	98.06%
Make Accuracy	98.78%
Model Accuracy	56.49%

Field-wise accuracies are computed by parsing the generated string and comparing each field independently against ground truth.

How to Use

from transformers import T5Tokenizer, T5ForConditionalGeneration

MODEL_ID = "umut-celik/vehicle-title-t5-small"

tokenizer = T5Tokenizer.from_pretrained(MODEL_ID, legacy=False)
model = T5ForConditionalGeneration.from_pretrained(MODEL_ID)
model.eval()

def predict_vehicle(title: str) -> str:
    input_text = "extract vehicle info: " + title
    inputs = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        max_length=256,
    )
    outputs = model.generate(**inputs, max_length=128)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example
title = "New 2023 Yamaha YZ450F Monster Energy Edition"
print(predict_vehicle(title))
# Output: Year: 2023 | Make: Yamaha | Model: YZ450F

Parsing the Output

def parse_prediction(text: str) -> dict:
    parts = {}
    for segment in text.split("|"):
        segment = segment.strip()
        if ":" in segment:
            key, value = segment.split(":", 1)
            parts[key.strip().lower()] = value.strip()
    return parts

result = parse_prediction("Year: 2023 | Make: Yamaha | Model: YZ450F")
# {'year': '2023', 'make': 'Yamaha', 'model': 'YZ450F'}

Limitations

Trained on a single private dealership corpus; may overfit to specific brands, naming conventions, and title styles present in that data
Titles deviating from training patterns (non-English text, very short titles, missing year information) may produce incorrect or incomplete outputs
Model accuracy for the model field is lower than year and make; complex or uncommon model names are harder to extract
Not robust against adversarial inputs; treat outputs as probabilistic predictions rather than validated facts

License

This model is released under the CC-BY-NC-4.0 license. Commercial use is not permitted without explicit authorization.

Citation

@misc{celik2024vehicletitle,
  author       = {Celik, Umut},
  title        = {Vehicle Title to Year/Make/Model Extraction with T5-small},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/umut-celik/vehicle-title-t5-small}}
}

Contact

Author: Umut Celik
Email: umut.celik@cix.csi.cuny.edu
GitHub: github.com/umutc
Affiliation: CUNY College of Staten Island

Downloads last month: 1

Safetensors

Model size

60.5M params

Tensor type

F32

Evaluation results

Exact Match
self-reported

55.530
Year Accuracy
self-reported

98.060
Make Accuracy
self-reported

98.780
Model Accuracy
self-reported

56.490