Dataset Details

Dataset Description

This dataset contains short narrative passages (original_text) with associated metadata and labels. The primary target is themes, a multi-label list of theme tags used to train a theme classification model. Theme tags include:

  1. Startup Success
  2. Mentorship
  3. Entrepreneurship

A secondary label, tone, may be used to train a tone classifier.

Direct Use

  - Train a multi-label text classification model that predicts themes from original_text
  - Train a single-label text classifier that predicts tone from original_text
  - Evaluate/benchmark tagging pipelines for structured Knowledge Sample JSON submissions
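
As a minimal sketch of how the multi-label target could be prepared for training, the snippet below turns each record's themes list into a multi-hot vector over a vocabulary built from the data. The helper names (build_vocab, encode_themes) and the sample records are illustrative assumptions, not part of the dataset or of any released training code.

```python
from typing import Dict, List


def build_vocab(records: List[dict]) -> Dict[str, int]:
    """Map every distinct theme tag to a column index (sorted for stability)."""
    tags = sorted({tag for rec in records for tag in rec["themes"]})
    return {tag: i for i, tag in enumerate(tags)}


def encode_themes(themes: List[str], vocab: Dict[str, int]) -> List[int]:
    """Return a multi-hot vector: 1 where the tag is present, 0 elsewhere."""
    vec = [0] * len(vocab)
    for tag in themes:
        if tag in vocab:  # tags outside the vocabulary are skipped in this sketch
            vec[vocab[tag]] = 1
    return vec


# Hypothetical records, reduced to the two fields needed here.
records = [
    {"original_text": "A founder learns from a mentor.",
     "themes": ["Startup Success", "Mentorship"]},
    {"original_text": "A student opens a small shop.",
     "themes": ["Entrepreneurship"]},
]

vocab = build_vocab(records)
targets = [encode_themes(rec["themes"], vocab) for rec in records]
```

Vectors of this form are the usual target for a classifier with one independent output per theme. The tone label, being single-valued, would instead map to a single class index.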

Out-of-Scope Use

  - High-stakes decision-making (medical, legal, employment, housing, finance)
  - Inferring sensitive personal attributes or identity traits
  - Treating predictions as ground truth without human review
  - Broad "general web" theme classification (dataset is project-scoped and may not generalize)

Dataset Structure

Each data point is a JSON object with these fields:

  - knowledge_submission_id (string): unique record id
  - original_text (string): model input text
  - summary (string): short summary of the passage
  - category (string): high-level category label (e.g., "Business & Culture")
  - themes (list[string]): multi-label theme tags (primary training target)
  - tone (string): single-label tone (optional training target)
  - knowledge_type (string): type label such as "story"

Recommended splits (if/when added): train, validation, test.
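
The field list above can be checked mechanically. The following is a minimal sketch, assuming that tone may be absent from a record (the card calls it an optional training target) and that every other field is required. The validate_record helper and the sample record, including its id and tone value, are hypothetical.

```python
from typing import List

# Field names and types as listed in the dataset card.
REQUIRED_FIELDS = {
    "knowledge_submission_id": str,
    "original_text": str,
    "summary": str,
    "category": str,
    "themes": list,
    "knowledge_type": str,
}
# Assumption: tone may be missing from a record.
OPTIONAL_FIELDS = {"tone": str}


def validate_record(record: dict) -> List[str]:
    """Return a list of problems; an empty list means the record looks valid."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    for name, expected in OPTIONAL_FIELDS.items():
        if name in record and not isinstance(record[name], expected):
            problems.append(f"{name} should be {expected.__name__}")
    themes = record.get("themes")
    if isinstance(themes, list) and not all(isinstance(t, str) for t in themes):
        problems.append("themes should contain only strings")
    return problems


# Hypothetical record; the values are placeholders, not real dataset content.
sample = {
    "knowledge_submission_id": "example-001",
    "original_text": "A founder learns from a mentor.",
    "summary": "A short mentorship story.",
    "category": "Business & Culture",
    "themes": ["Mentorship"],
    "tone": "reflective",
    "knowledge_type": "story",
}
```

Calling validate_record(sample) returns an empty list. A record whose themes value is a bare string rather than a list would be reported as a problem.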

Dataset Creation

Created to support automated theme tagging for Knowledge Sample JSON submissions, enabling consistent multi-label theme assignment and optional tone labeling for downstream models in the Kuumba Agent / Cultural Remix Engine workflow.

Data Collection and Processing

  - Examples are curated/authored knowledge submissions intended for training and evaluation
  - Stored as normalized JSON with consistent keys across records
  - Theme tags are assigned as a list to support multi-label learning
  - Optional tone labels are assigned as a single categorical value

Who are the source data producers?

The source texts are authored/curated examples produced for this project and are not collected from a public platform or scraped source.

Annotation process

  - themes are assigned per record as multi-label tags based on the main idea(s) of the passage
  - tone is assigned per record as a single-label descriptor of the writing style or emotional/communicative intent
  - Labels are curated to be consistent across the dataset, and may evolve as the taxonomy expands
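
One way to spot-check label consistency as the taxonomy grows is to count tag usage and flag tags that differ only in case or spacing. This is a sketch under assumptions: the helpers theme_counts and find_variants and the sample records are hypothetical, and the project's actual curation process is not described beyond the points above.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set


def theme_counts(records: List[dict]) -> Counter:
    """Count how many records carry each theme tag."""
    return Counter(tag for rec in records for tag in rec["themes"])


def find_variants(records: List[dict]) -> Dict[str, Set[str]]:
    """Group tags that match after lowercasing and collapsing whitespace.

    Only groups with more than one distinct spelling are returned.
    """
    groups: Dict[str, Set[str]] = defaultdict(set)
    for rec in records:
        for tag in rec["themes"]:
            key = " ".join(tag.lower().split())
            groups[key].add(tag)
    return {key: tags for key, tags in groups.items() if len(tags) > 1}


# Hypothetical records containing one inconsistent spelling.
records = [
    {"themes": ["Mentorship", "Entrepreneurship"]},
    {"themes": ["mentorship "]},
    {"themes": ["Entrepreneurship"]},
]

counts = theme_counts(records)
variants = find_variants(records)
```

Here variants contains a single group for "mentorship", which suggests the two spellings should be merged before training.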

BibTeX:

@dataset{henson_kuumba_theme_dataset_2025,
  author = {Henson, James},
  title = {Kuumba Knowledge Theme Training Data},
  year = {2025},
  publisher = {Hugging Face},
}

