---
license: cc-by-nc-4.0
language:
- en
library_name: transformers
pipeline_tag: text-classification
tags:
- modernbert
- oncology
- clinical-trials
- eligibility
- binary-classification
- text-matching
---

# BoilerplateChecker-1225

**BoilerplateChecker-1225** is a binary text classifier that estimates whether a given **patient summary** indicates that the patient has a history of conditions that may exclude them from a given clinical trial based on that trial's **boilerplate exclusion** criteria.

"Boilerplate exclusions" are intended to represent exclusion criteria that are not central to defining the target population for a specific trial, but that instead tend to exclude patients from many clinical trials in general. Examples of "boilerplate exclusions" include concepts like "uncontrolled brain metastases" or "history of pneumonitis."

This model is fine-tuned from **[`answerdotai/ModernBERT-large`](https://huggingface.co/answerdotai/ModernBERT-large)** for sequence classification on pairs of *(trial_boilerplate_text, patient_boilerplate_text)*. "Patient boilerplate text" represents a subsection of an overall patient summary that describes any history of such conditions.

This model is not intended to capture whether a patient is excluded from a clinical trial based on trial criteria central to defining the trial's target population, which include age, sex, cancer type, histology, cancer burden requirements, biomarker requirements, and treatment history requirements. Those concepts are covered by the separate TrialChecker classification model.

> **Important:** This is a research prototype for model development, **not** a medical device or approved clinical decision support tool. It is **not** intended for clinical decision-making.

---

## Training summary

The classifier was trained with a script that:

1. Loads three sources of annotated patient–trial pairs:
   - Pairs originating from space-specific eligibility checks
   - "Patient→top-cohorts" checks (rounds 1–3)
   - "Trial-space→top patients" checks (rounds 1–3)
2. Deduplicates by `['patient_boilerplate_text', 'trial_boilerplate_text']`
3. Builds the final text input as:

   ```
   text = "Patient history: " + patient_boilerplate_text + "\nTrial exclusions:" + trial_boilerplate_text
   ```

4. Uses `exclusion_result` as the **binary label** (0/1)
5. Trains **ModernBERT-large** (sequence classification, 2 labels) at max_length **3192**

### Key hyperparameters from training (on H100 x 8)

- Base model: `answerdotai/ModernBERT-large`
- Max length: **3192**
- Optimizer settings: `learning_rate=2e-5`, `weight_decay=0.01`
- Batch size: `per_device_train_batch_size=8`
- Epochs: `2`
- Save strategy: `epoch`
- Tokenizer: `AutoTokenizer.from_pretrained("answerdotai/ModernBERT-large")`
- Data collator: `DataCollatorWithPadding`

---

## Intended use

- **Input:** a string describing the patient's history of common "boilerplate exclusion conditions," if any, and a clinical trial's "boilerplate exclusion criteria," if any.
- **Output:** probability that the patient is **excluded** from the trial based on the trial's "boilerplate exclusion criteria."
- Use cases:
  - Deeper pre-screening of candidate patients for specific trials

Out of scope:

- Confirming formal eligibility or safety
- Formal (autonomous) medical record review, diagnosis, or treatment decision-making

---

## Inference (Transformers)

### Quick start (single example)

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

device = "cuda" if torch.cuda.is_available() else "cpu"

MODEL_REPO = "ksg-dfci/BoilerplateChecker-1225"
tok = AutoTokenizer.from_pretrained(MODEL_REPO)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_REPO).to(device)
model.eval()

trial_boilerplate_text = (
    "Patients with uncontrolled brain metastases are excluded."
)
patient_boilerplate_text = (
    "New brain metastases identified 01/02/23, not yet treated."
)

text = "Patient history: " + patient_boilerplate_text + "\nTrial exclusions:" + trial_boilerplate_text

# Raw Transformers model
enc = tok(text, return_tensors="pt", truncation=True, max_length=4096).to(device)
with torch.no_grad():
    logits = model(**enc).logits
probs = logits.softmax(-1).squeeze(0)

# Label mapping was set in training: {0: "NEGATIVE", 1: "POSITIVE"}
p_positive = float(probs[1])
print(f"Exclusion probability: {p_positive:.3f}")

# Or use the pipeline API to get similar outputs
from transformers import pipeline
pipe = pipeline("text-classification", "ksg-dfci/BoilerplateChecker-1225")
pipe([text])
```

### Batched scoring

```python
from typing import List

import torch

def score_pairs(trial_texts: List[str], patient_texts: List[str], tokenizer, model,
                max_length=4096, batch_size=8):
    assert len(trial_texts) == len(patient_texts)
    device = next(model.parameters()).device
    scores = []
    for i in range(0, len(trial_texts), batch_size):
        batch_trials = trial_texts[i:i+batch_size]
        batch_patients = patient_texts[i:i+batch_size]
        # Compose inputs in the same format used at training time
        texts = ["Patient history: " + p + "\nTrial exclusions:" + t
                 for t, p in zip(batch_trials, batch_patients)]
        enc = tokenizer(texts, return_tensors="pt", padding=True,
                        truncation=True, max_length=max_length).to(device)
        with torch.no_grad():
            logits = model(**enc).logits
        probs = logits.softmax(-1)[:, 1]  # POSITIVE
        scores.extend(probs.detach().cpu().tolist())
    return scores

# Example
trial_boilerplate_texts = [trial_boilerplate_text] * 3
patient_boilerplate_texts = [patient_boilerplate_text,
                             "Different patient comorbidities 1...",
                             "Different patient comorbidities 2..."]
scores = score_pairs(trial_boilerplate_texts, patient_boilerplate_texts, tok, model)
print(scores)
```

### Thresholding & calibration

* Default decision: **0.5** on the POSITIVE probability.
* For better calibration/operating points, tune the threshold on a validation set (e.g., maximize F1, optimize Youden's J, or set to a desired precision).
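As a sketch of the threshold tuning suggested above, one simple approach is to sweep every distinct validation score as a candidate cutoff and keep the one that maximizes F1. The helper name `pick_threshold` is illustrative, not part of this repository:

```python
from typing import List, Tuple

def pick_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return the (threshold, F1) pair maximizing F1 on a validation set.

    Every observed score is tried as a cutoff, so each distinct decision
    boundary is evaluated exactly once.
    """
    best_t, best_f1 = 0.5, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

# Example: POSITIVE probabilities from score_pairs plus gold exclusion labels
val_scores = [0.2, 0.4, 0.6, 0.8]
val_labels = [0, 1, 1, 1]
threshold, f1 = pick_threshold(val_scores, val_labels)
```

The same loop can be adapted to other operating points, e.g., keeping the lowest threshold whose precision meets a target.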
---

## How to prepare inputs

**Trial boilerplate text**: as in the example above, a compact list of exclusion criteria for a trial that are not central to the trial's target population.

**Patient boilerplate text**: as in the example above, a concise summary of any medical conditions that may meet common boilerplate exclusion criteria.

You can generate these inputs with your upstream LLM pipeline (e.g., gpt-oss-120b or our `OncoReasoning-3B-1225` model for summarization and trial information extraction), but the classifier accepts any plain strings in the format shown above.

---

## Reproducibility (high-level)

Below is the minimal structure used by the training script to build the dataset before tokenization:

```python
# 1) Load and merge three labeled sources
#    - space_specific_eligibility_checks.parquet
#    - top_ten_cohorts_checked_round{1,2,3}.csv
#    - top_twenty_patients_checked_round{1,2,3}.csv

# 2) Deduplicate by ['patient_boilerplate_text','trial_boilerplate_text'] and keep:
#    - split, patient_boilerplate_text, trial_boilerplate_text, exclusion_result

# 3) Compose input text and label:
text = "Patient history: " + patient_boilerplate_text + "\nTrial exclusions:" + trial_boilerplate_text
label = int(exclusion_result)  # 0 or 1

# 4) Tokenize with the ModernBERT tokenizer (max_length=3192, truncation=True)

# 5) Train AutoModelForSequenceClassification, which then produces probabilities
#    for the "POSITIVE" class (patient may be excluded) and the "NEGATIVE" class
#    (patient not predicted to be excluded)
```

To reproduce exactly, consult and run the original training scripts at https://github.com/kenlkehl/matchminer-ai-training.

---

## Limitations & ethical considerations

* Outputs reflect training data and may contain biases or errors.
* The model estimates *probability of exclusion based on common boilerplate criteria*, not formal eligibility screening.
* Not validated for safety-critical use; do not use for diagnosis or treatment decisions.
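For completeness, the input composition used at training time and in the inference examples can be wrapped in one small helper (the name `build_input` is hypothetical; the string format is the one documented above):

```python
def build_input(patient_boilerplate_text: str, trial_boilerplate_text: str) -> str:
    """Compose the classifier input in the documented training-time format.

    Note: there is no space after the "Trial exclusions:" prefix, matching
    the composition shown in the training summary.
    """
    return ("Patient history: " + patient_boilerplate_text
            + "\nTrial exclusions:" + trial_boilerplate_text)

# Example
text = build_input("History of pneumonitis in 2021.",
                   "Patients with prior pneumonitis are excluded.")
```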
---

## Citation

If you use this model or parts of the pipeline, please cite this model card and the arXiv preprint (https://arxiv.org/abs/2412.17228) or the corresponding journal publication (pending).