Longformer Fiction Genre Classifier

Model Description

This model classifies narrative semantic genres in long-form fiction. Rather than predicting marketing categories or analyzing book descriptions, it identifies narrative modes by analyzing story structure, tone, diction, and thematic elements in actual fiction text.

Approach

The model is trained on story text (not blurbs or metadata) and uses a Longformer architecture to handle long contexts (up to 4096 tokens). It was trained with curriculum learning, progressively moving from short scenes to full chapters. For inference on complete books, sliding windows can be used to produce genre distributions across the text.

Key differences from typical genre classifiers:

  • Trained on narrative text rather than short descriptions
  • 4096 token context window (full chapters)
  • Curriculum learning approach (short to long)
  • Tested on commercial novels and diverse short stories
  • Supports windowed inference for book-length texts

Model Architecture

  • Base Model: allenai/longformer-base-4096
  • Architecture: Longformer with efficient self-attention for long documents
  • Max Sequence Length: 4096 tokens
  • Parameters: ~149M (backbone) + classification head
  • Training Strategy: Curriculum learning (500-token scenes to 4000-token chapters)
  • Genres: 13 semantic categories

Genre Labels

The model predicts 13 semantic narrative genres representing literary modes rather than bookstore categories:

adventure, contemporary, crime, fantasy, historical, horror, literary, mystery, romance, science_fiction, thriller, war, western
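For reference, the documented label set can be mirrored as a plain list; note the authoritative index-to-label mapping is whatever is stored in `model.config.id2label`, so this list is only a convenience copy of the names above:

```python
# The 13 semantic genre labels documented above (alphabetical order);
# the authoritative mapping lives in model.config.id2label.
GENRES = [
    "adventure", "contemporary", "crime", "fantasy", "historical",
    "horror", "literary", "mystery", "romance", "science_fiction",
    "thriller", "war", "western",
]

assert len(GENRES) == 13
```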

Training Data

  • Training corpus: Fiction excerpts and scenes spanning multiple genres (expanded compendium dataset)
  • Curriculum strategy: Progressive training from 500-token scenes to 4000-token chapters
  • Validation set: 130 original short stories (10 per genre, multiple writing styles)
  • Focus areas: Narrative structure, pacing, diction, tone, thematic elements, character dynamics

Performance

Evaluation Results

Overall Accuracy: 66.92% (87/130 stories correct)

Evaluated on 130 stories (10 per genre x 13 genres) spanning literary, indie, and blockbuster writing styles.

Per-Genre Performance

Genre              Accuracy
western            100% (10/10)
historical          90% (9/10)
mystery             90% (9/10)
horror              80% (8/10)
literary            80% (8/10)
science_fiction     80% (8/10)
war                 80% (8/10)
romance             70% (7/10)
adventure           50% (5/10)
fantasy             50% (5/10)
contemporary        40% (4/10)
crime               30% (3/10)
thriller            30% (3/10)
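The per-genre counts above sum to the reported overall accuracy, which can be checked directly:

```python
# Correct predictions per genre (out of 10 stories each), from the table above.
correct = {
    "western": 10, "historical": 9, "mystery": 9, "horror": 8,
    "literary": 8, "science_fiction": 8, "war": 8, "romance": 7,
    "adventure": 5, "fantasy": 5, "contemporary": 4, "crime": 3, "thriller": 3,
}
total_correct = sum(correct.values())            # 87
accuracy = total_correct / (10 * len(correct))   # 87 / 130
print(f"{total_correct}/130 = {accuracy:.2%}")   # 87/130 = 66.92%
```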

Known Limitations

  1. Crime classification: Lower accuracy (30%), frequently confused with mystery and science_fiction
  2. Thriller classification: Lower accuracy (30%), confused with crime, horror, and science_fiction
  3. Contemporary fiction: Moderate accuracy (40%), often misclassified as literary or romance
  4. Crime vs. Mystery confusion: Some overlap between criminal and investigative perspectives

Strengths

  • Perfect western recognition (distinctive setting/tone markers)
  • Strong historical and mystery performance (90%)
  • Solid performance across horror, literary, science_fiction, and war (80%)
  • Reduced literary over-prediction compared to earlier versions
  • Handles literary fiction with complex themes

Usage

Basic Classification

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "Mitchins/longformer-fiction-genre-13g"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

text = "..."  # the fiction text to classify (up to 4096 tokens)

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)

probs = torch.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(probs, dim=-1).item()
predicted_genre = model.config.id2label[predicted_class]
confidence = probs[0][predicted_class].item()
print(f"Genre: {predicted_genre} ({confidence:.2%} confidence)")
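Beyond the argmax, the full softmax distribution is useful for blended narratives (e.g. a crime/mystery hybrid). The sketch below shows top-3 extraction using dummy logits so it runs without downloading the model; substitute `outputs.logits` in real use:

```python
import torch

labels = ["adventure", "contemporary", "crime", "fantasy", "historical",
          "horror", "literary", "mystery", "romance", "science_fiction",
          "thriller", "war", "western"]

# Dummy logits standing in for outputs.logits (batch of 1, 13 classes).
logits = torch.tensor([[0.1, -1.2, 2.5, 0.3, -0.5, 1.8, 0.0,
                        2.1, -0.7, 0.4, 1.5, -1.0, -2.0]])
probs = torch.softmax(logits, dim=-1)

# Top-3 genres with probabilities.
top = torch.topk(probs[0], k=3)
for p, idx in zip(top.values, top.indices):
    print(f"{labels[idx]}: {p.item():.2%}")
```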

Windowed Classification for Full Books

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from collections import Counter

def classify_book_windowed(text, window_size=3500, stride=1750):
    tokenizer = AutoTokenizer.from_pretrained("Mitchins/longformer-fiction-genre-13g")
    model = AutoModelForSequenceClassification.from_pretrained("Mitchins/longformer-fiction-genre-13g")
    model.eval()
    tokens = tokenizer.encode(text, add_special_tokens=False)
    genre_votes = []
    # Slide a window across the token stream; with the default stride,
    # consecutive windows overlap by half.
    for i in range(0, len(tokens), stride):
        window = tokens[i:i + window_size]
        if len(window) < 100:  # skip tiny tail windows
            continue
        inputs = {"input_ids": torch.tensor([window])}
        with torch.no_grad():
            outputs = model(**inputs)
        pred = torch.argmax(outputs.logits, dim=-1).item()
        genre_votes.append(model.config.id2label[pred])
        if i + window_size >= len(tokens):  # last window reached the end
            break
    return Counter(genre_votes)
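The window arithmetic above determines how many votes a book contributes. A model-free helper (a hypothetical name, mirroring the loop logic above) makes the count easy to estimate before running inference:

```python
def count_windows(n_tokens, window_size=3500, stride=1750, min_len=100):
    # Mirrors the loop in classify_book_windowed: windows start every
    # `stride` tokens, windows shorter than `min_len` are skipped, and
    # iteration stops once a window reaches the end of the text.
    count = 0
    for i in range(0, n_tokens, stride):
        window_len = min(window_size, n_tokens - i)
        if window_len >= min_len:
            count += 1
        if i + window_size >= n_tokens:
            break
    return count

print(count_windows(80_000))  # 45 overlapping windows for an 80k-token novel
```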

Extracting Narrative Embeddings

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "Mitchins/longformer-fiction-genre-13g"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    output_hidden_states=True
)

text = "..."  # the fiction text to embed

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    outputs = model(**inputs)
hidden_states = outputs.hidden_states[-1]  # last layer: (batch, seq_len, 768)
embedding = hidden_states.mean(dim=1)      # mean-pool to (batch, 768)
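These mean-pooled vectors can back narrative similarity search. A minimal sketch with stand-in random vectors (real use would substitute the `embedding` tensors produced above):

```python
import torch
import torch.nn.functional as F

# Stand-ins for mean-pooled embeddings of two texts
# (longformer-base hidden states are 768-dimensional).
emb_a = torch.randn(1, 768)
emb_b = torch.randn(1, 768)

# Cosine similarity in [-1, 1]; higher means more similar narratives.
similarity = F.cosine_similarity(emb_a, emb_b, dim=-1)
print(f"narrative similarity: {similarity.item():.3f}")
```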

Use Cases

  • Fiction retrieval systems (RAG) that cluster by narrative style
  • Book recommendation based on narrative characteristics
  • Writing analysis tools for genre consistency
  • Dataset curation and filtering by semantic genre
  • Building specialized subgenre classifiers on top of this base model
  • Narrative similarity search
  • Genre arc analysis across chapters

Future Directions

Potential extensions include training specialized subgenre classifier heads or adapting to more recent architectures like DeBERTa.

Citation

@misc{longformer_fiction_genre,
  title={Longformer Fiction Genre Classifier},
  author={Mitchell Currie},
  year={2024},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/Mitchins/longformer-fiction-genre-13g}}
}

Related Resources

  • Validation Dataset - 130 original short stories used for evaluation
  • Detailed evaluation results available in model repository

License

MIT License
