---
license: other
tags:
- rna
- gquad
- g-quadruplex
- transformer
- genomics
- rna-biology
library_name: transformers
extra_gated_fields:
  I agree to use this model for non-commercial use ONLY: checkbox
---

# mRNAbert

**mRNAbert** is a transformer-based RNA language model trained on millions of transcriptomic sequences from the human genome. It serves as the foundation model for downstream fine-tuning tasks in the [G4mer](https://huggingface.co/Biociphers/g4mer) project, including rG4 structure prediction and variant effect analysis.

## Model Details

- Architecture: BERT-base
- Tokenization: Overlapping 6-mers
- Pretraining data: Human transcriptome (GENCODE v40, hg38)
- Task: Masked language modeling (MLM)
- Input: RNA sequences (ACGT)
- Max length: 512 nt

## Disclaimer

This is the official implementation of the **G4mer** model as described in the manuscript:

> Zhuang, Farica, et al. _G4mer: an RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data._ bioRxiv (2024).

See our [Bitbucket repo](https://bitbucket.org/biociphers/g4mer) for code, data, and tutorials.

## G4mer

G4mer is a transformer-based model trained on transcriptome-wide RNA sequences to predict:

- **Binary classification**: whether a 70-nt sequence region forms an rG4 structure

All models use overlapping 6-mer tokenization and are trained from scratch on the human transcriptome.
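The overlapping 6-mer scheme can be sketched in a few lines: every position of the sequence starts one token, so an L-nt input yields L − 6 + 1 tokens (65 tokens for a 70-nt window). A minimal, illustrative helper:

```python
# Overlapping 6-mer tokenization: an L-nt sequence yields L - k + 1 tokens.
def to_kmers(seq, k=6):
    return ' '.join(seq[i:i+k] for i in range(len(seq) - k + 1))

print(to_kmers("GGGAGGGCGCGT"))
# GGGAGG GGAGGG GAGGGC AGGGCG GGGCGC GGCGCG GCGCGT  (12 nt -> 7 tokens)
```

The same helper is used in the fine-tuning example below to convert raw sequences into the space-separated k-mer strings the tokenizer expects.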
### Variants

| Model                         | Task                      | Size |
|-------------------------------|---------------------------|------|
| `Biociphers/g4mer`            | rG4 binary classification | ~46M |
| `Biociphers/g4mer-subtype`    | rG4 subtype classification| ~46M |
| `Biociphers/g4mer-regression` | rG4 strength (regression) | ~46M |

## Usage

### Fine-tune

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
from torch.utils.data import DataLoader, Dataset
from torch.optim import AdamW

# Example dataset
sequences = ["GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA",  # rG4
             "TCTGGGAAAAGCTACTGTAAGTAGGAGCAGATTCTGGGTTTAATCGGAGG"]  # non-rG4
labels = [1, 0]

# Tokenization with overlapping 6-mers
def to_kmers(seq, k=6):
    return ' '.join([seq[i:i+k] for i in range(len(seq)-k+1)])

tokenizer = AutoTokenizer.from_pretrained("Biociphers/mRNAbert")
tokenized = [tokenizer(to_kmers(seq), return_tensors='pt', padding='max_length',
                       truncation=True, max_length=512) for seq in sequences]

# Dataset class wrapping the tokenized inputs and labels
class rG4Dataset(Dataset):
    def __init__(self, tokenized_inputs, labels):
        self.inputs = tokenized_inputs
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val.squeeze(0) for key, val in self.inputs[idx].items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

dataset = rG4Dataset(tokenized, labels)
loader = DataLoader(dataset, batch_size=2, shuffle=True)

# Load the base model with a fresh classification head
model = AutoModelForSequenceClassification.from_pretrained("Biociphers/mRNAbert", num_labels=2)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Training loop (1 epoch for demo)
model.train()
for batch in loader:
    batch = {k: v.to(device) for k, v in batch.items()}
    outputs = model(**batch)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print("Loss:", loss.item())
```

## Web Tool

The
`mRNAbert` model was fine-tuned to create **[G4mer](https://huggingface.co/Biociphers/g4mer)**, a state-of-the-art model for predicting **RNA G-quadruplexes** and their subtypes.

You can explore G4mer predictions interactively through our web tool: **[G4mer Web Tool](https://tools.biociphers.org/g4mer)**

Features include:

- **RNA sequence prediction** (binary rG4-forming vs. non-forming)
- **Transcriptome-wide prediction** of rG4s and subtypes
- **Variant effect annotation** using gnomAD SNVs
- **Search and filter** by gene, transcript, region (5′UTR, CDS, 3′UTR), and sequence context

No installation is needed; just visit the site and start exploring.

## Citation

- MLA

```
Zhuang, Farica, et al. "G4mer: An RNA language model for transcriptome-wide identification of G-quadruplexes and disease variants from population-scale genetic data." Nature Communications 16.1 (2025): 10221.
```

## Contact

For questions, feedback, or discussions about G4mer, please post on the [Biociphers Google Group](https://groups.google.com/g/majiq_voila).
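As a complement to the fine-tuning example in the Usage section, the sketch below shows how a fine-tuned checkpoint could be used to score a single sequence. It is illustrative only: the `predict_rg4` helper is not part of the released code, and the assumption that label index 1 corresponds to "rG4-forming" follows the label convention in the fine-tuning example above.

```python
# Illustrative inference sketch (hypothetical helper, not part of the
# released G4mer code). Assumes a fine-tuned sequence-classification
# checkpoint and that label index 1 = rG4-forming.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def to_kmers(seq, k=6):
    # Overlapping 6-mer tokenization, matching the pretraining scheme
    return ' '.join(seq[i:i+k] for i in range(len(seq) - k + 1))

def predict_rg4(seq, model_name="Biociphers/g4mer"):
    """Return the predicted probability that `seq` forms an rG4."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()
    inputs = tokenizer(to_kmers(seq), return_tensors="pt",
                       padding="max_length", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Example (downloads the checkpoint on first use):
# predict_rg4("GGGAGGGCGCGTGTGGTGAGAGGAGGGAGGGAAGGAAGGCGGAGGAAGGA")
```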