---
license: other
tags:
- rna
- gquad
- g-quadruplex
- transformer
- genomics
- rna-biology
library_name: transformers
extra_gated_fields:
  I agree to use this model for non-commercial use ONLY: checkbox
---

# mRNAbert

**mRNAbert** is a transformer-based RNA language model trained on millions of transcriptomic sequences from the human genome. It serves as the foundation model for downstream fine-tuning tasks in the [G4mer](https://huggingface.co/Biociphers/g4mer) project, including rG4 structure prediction and variant effect analysis.

## Model Details

- Architecture: BERT-base
- Tokenization: Overlapping 6-mers
- Pretraining data: Human transcriptome (GENCODE v40, hg38)
- Task: Masked language modeling (MLM)
- Input: RNA sequences (ACGT)
- Max length: 512 nt

## Disclaimer

This is the official implementation of the **G4mer** model as described in the manuscript:

> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

## G4mer

G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:

- **Binary classification**: whether a 70-nt sequence region forms an rG4 structure

All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.
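The overlapping 6-mer scheme can be sketched in a few lines: every position of the sequence starts one token, so an L-nt input yields L − 6 + 1 tokens (65 tokens for a 70-nt window). A minimal, illustrative helper:

```python
# Overlapping 6-mer tokenization: an L-nt sequence yields L - k + 1 tokens.
def to_kmers(seq, k=6):
    return ' '.join(seq[i:i+k] for i in range(len(seq) - k + 1))

print(to_kmers("GGGAGGGCGCGT"))
# GGGAGG GGAGGG GAGGGC AGGGCG GGGCGC GGCGCG GCGCGT  (12 nt -> 7 tokens)
```

The same helper is used in the fine-tuning example below to convert raw sequences into the space-separated k-mer strings the tokenizer expects.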
### Variants

| Model                         | Task                      | Size |
|-------------------------------|---------------------------|------|
| `Biociphers/g4mer`            | rG4 binary classification | ~46M |
| `Biociphers/g4mer-subtype`    | rG4 subtype classification| ~46M |
| `Biociphers/g4mer-regression` | rG4 strength (regression) | ~46M |

## Usage

### Fine-tune

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW

# Example dataset
sequences = ["GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA",  # rG4
             "TCTGGGAAAAGCTACTGTAAGTAGGAGCAGATTCTGGGTTTAATCGGAGG"]  # non-rG4
labels = [1, 0]

# Tokenization with overlapping 6-mers
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq)-k+1)])

tokenizer = AutoTokenizer.from_pretrained("Biociphers/mRNAbert")
tokenized = [tokenizer(to_kmers(seq), return_tensors='pt', padding='max_length',
                       truncation=True, max_length=512) for seq in sequences]

# Dataset class wrapping the tokenized inputs and labels
class rG4Dataset(Dataset):
    def __init__(self, tokenized_inputs, labels):
        self.inputs = tokenized_inputs
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val.squeeze(0) for key, val in self.inputs[idx].items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

dataset = rG4Dataset(tokenized, labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Load the base model with a fresh classification head
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/mRNAbert", num_labels=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop (1 epoch for demo)
model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print("Loss:", loss.item())
```

## Web Tool

The
`mRNAbert` model was fine-tuned to create **[G4mer](https://huggingface.co/Biociphers/g4mer)**, a state-of-the-art model for predicting **RNA G-quadruplexes** and their subtypes.

You can explore G4mer predictions interactively through our web tool: **[G4mer Web Tool](https://tools.biociphers.org/g4mer)**

Features include:

- **RNA sequence prediction** (binary rG4-forming vs. non-forming)
- **Transcriptome-wide prediction** of rG4s and subtypes
- **Variant effect annotation** using gnomAD SNVs
- **Search and filter** by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation is needed; just visit the site and start exploring.

## Citation

- MLA

```
Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
```

## Contact

For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).
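As a complement to the fine-tuning example in the Usage section, the sketch below shows how a fine-tuned checkpoint could be used to score a single sequence. It is illustrative only: the `predict_rg4` helper is not part of the released code, and the assumption that label index 1 corresponds to "rG4-forming" follows the label convention in the fine-tuning example above.

```python
# Illustrative inference sketch (hypothetical helper, not part of the
# released G4mer code). Assumes a fine-tuned sequence-classification
# checkpoint and that label index 1 = rG4-forming.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def to_kmers(seq, k=6):
    # Overlapping 6-mer tokenization, matching the pretraining scheme
    return ' '.join(seq[i:i+k] for i in range(len(seq) - k + 1))

def predict_rg4(seq, model_name="Biociphers/g4mer"):
    """Return the predicted probability that `seq` forms an rG4."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(to_kmers(seq), return_tensors="pt",
                       padding="max_length", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Example (downloads the checkpoint on first use):
# predict_rg4("GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA")
```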