---
license: cc-by-nc-4.0
language:
- en
tags:
- RNA
- genomics
- biology
- embeddings
- pretrained
- therapeutics
- foundation-model
- non-coding-RNA
- structure-prediction
- drug-design
library_name: transformers
pipeline_tag: feature-extraction
widget:
- text: "GCCGGGCAUGGUGGCGCAUGCCUGUAGUCCCAGCUACCCGGGGAGGCUGAGGCAGAAGGAUCACUCGAGCCCAGGAGUUUGAGGUUGCUGUGAGCUAGGCUGACGCCACGGCACUCAGUCUAGCCUGGGCAACAAAGCGAGACUCUGUCUCCA"
---

# RNAGenesis: A Generalist Foundation Model for Functional RNA Therapeutics

## Model Description

RNAGenesis is a generalist RNA foundation model that integrates sequence representation, structural prediction, and de novo functional design within a single generative framework. Trained on diverse clustered non-coding RNAs, RNAGenesis leverages a BERT-style encoder, query-based latent compression, and a diffusion-guided decoder enhanced by inference-time alignment with gradient guidance and beam search strategies.

This model achieves state-of-the-art performance on:

- 11 of 13 tasks in the BEACON benchmark
- Inverse folding and 3D structure prediction
- De novo structure design
- RNA therapeutics prediction (ASOs, siRNAs, shRNAs, circRNAs, UTR variants)
- Functional RNA design including aptamers and CRISPR sgRNA scaffolds

## Model Details

- **Model Type**: Generalist RNA Foundation Model
- **Architecture**: BERT-style encoder with query-based latent compression and diffusion-guided decoder
- **Input**: RNA sequences (AUGC notation)
- **Output**: Sequence embeddings, structure predictions, functional designs
- **Training Data**: Diverse clustered non-coding RNAs
- **Key Features**:
  - Sequence representation learning
  - Structural prediction capabilities
  - De novo functional design
  - Inference-time alignment with gradient guidance
  - Beam search optimization strategies

## Usage

### Installation

```bash
pip install transformers torch
```

### Basic Usage

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/RNAGenesis", trust_remote_code=True)
model = AutoModel.from_pretrained("your-username/RNAGenesis", trust_remote_code=True, torch_dtype=torch.bfloat16)

# Prepare your RNA sequence
rna_sequence = "GCCGGGCAUGGUGGCGCAUGCCUGUAGUCCCAGCUACCCGGGGAGGCUGAGGCAGAAGGAUCACUCGAGCCCAGGAGUUUGAGGUUGCUGUGAGCUAGGCUGACGCCACGGCACUCAGUCUAGCCUGGGCAACAAAGCGAGACUCUGUCUCCA"

# Tokenize and get embeddings
input_ids = torch.tensor(tokenizer.convert_tokens_to_ids(rna_sequence)).unsqueeze(0)

with torch.no_grad():
    outputs = model(input_ids)
    embeddings = outputs.last_hidden_state.mean(dim=1)  # Average pooling

print(f"Embedding shape: {embeddings.shape}")
```

### Advanced Usage - Batch Processing

```python
sequences = [
    "AUGCGAUCGAUCGAUCG",
    "GCGCGCAUAUAUAUAUA",
    "UUUUAAAACCCCGGGGA"
]

# Process multiple sequences
embeddings = []
for seq in sequences:
    input_ids = torch.tensor(tokenizer.convert_tokens_to_ids(seq)).unsqueeze(0)
    with torch.no_grad():
        outputs = model(input_ids)
        seq_embedding = outputs.last_hidden_state.mean(dim=1)
    embeddings.append(seq_embedding)

# Stack embeddings
all_embeddings = torch.cat(embeddings, dim=0)
```

## Performance Highlights

### BEACON Benchmark

- State-of-the-art performance on 11 of 13 tasks
- Superior performance in structure-aware modeling tasks

### RNATx-Bench (RNA Therapeutics Benchmark)

- Evaluated on >100,000 experimentally validated sequences
- Strong predictive performance across:
  - Antisense oligonucleotides (ASOs)
  - Small interfering RNAs (siRNAs)
  - Short hairpin RNAs (shRNAs)
  - Circular RNAs (circRNAs)
  - Untranslated region (UTR) variants

### Experimental Validation

- **Aptamer Design**: IGFBP3-targeting aptamers with KD values as low as 4.02 nM
- **CRISPR Enhancement**: Up to 2.5-fold improvement in editing efficiency across:
  - CRISPR-Cas9 systems
  - Base editing systems
  - Prime editing systems

## Limitations

- Maximum sequence length: Depends on model configuration
- Input must be valid RNA sequences using standard AUGC notation
- Model performance may vary on sequences significantly different from training data
- This is a preprint model - results have not been peer-reviewed

## Citation

If you use this model in your research, please cite:

```bibtex
@article{zhang2024rnagenesis,
  title={RNAGenesis: A Generalist Foundation Model for Functional RNA Therapeutics},
  author={Zhang, Zaixi and Jin, Ruofan and Chao, Linlin and Xu, Guangxue and Zhang, Yikun and Zhou, Guowei and Yin, Di and Guo, Yingqing and Fu, Yaqi and Yang, Yukang and Huang, Kaixuan and Wang, Xiaotong and Zhang, Junze and Yang, Yujie and Yang, Qirong and Xu, Ziyao and Weinan, E and Zhou, Ruhong and Zhang, Xiaoming and Wang, Mengdi and Cong, Le},
  journal={bioRxiv},
  year={2024},
  doi={10.1101/2024.12.30.630826},
  note={Preprint}
}
```

**Paper**: [https://doi.org/10.1101/2024.12.30.630826](https://doi.org/10.1101/2024.12.30.630826)

## License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0).

## Access

This model requires approval for access. Please fill out the access request form with:

- Your intended use case
- Your affiliation
- Whether the use is for commercial or research purposes

## Authors

Zaixi Zhang, Ruofan Jin, Linlin Chao, Guangxue Xu, Yikun Zhang, Guowei Zhou, Di Yin, Yingqing Guo, Yaqi Fu, Yukang Yang, Kaixuan Huang, Xiaotong Wang, Junze Zhang, Yujie Yang, Qirong Yang, Ziyao Xu, E Weinan, Ruhong Zhou, Xiaoming Zhang, Mengdi Wang, Le Cong

## Contact

For questions or issues, please open an issue on the model repository.
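
## Appendix: Input Validation Sketch

The Limitations section states that input must be valid RNA in standard AUGC notation. The snippet below is a minimal sketch of a pre-check you might run before tokenizing. `normalize_rna` and its `convert_dna` flag are hypothetical names introduced here for illustration; they are not part of the RNAGenesis repository. The sketch also assumes that ambiguity codes such as `N` should be rejected, which this card does not confirm.

```python
import re

# Hypothetical helper, not part of the RNAGenesis repository.
# Accepts only the four bases named in this model card.
_VALID_RNA = re.compile(r"[AUGC]+")


def normalize_rna(sequence: str, convert_dna: bool = False) -> str:
    """Strip whitespace, uppercase, and verify the sequence is pure AUGC.

    If convert_dna is True, T is rewritten to U first (an assumption about
    what callers holding DNA-alphabet input may want).
    """
    cleaned = "".join(sequence.split()).upper()
    if convert_dna:
        cleaned = cleaned.replace("T", "U")
    if not cleaned:
        raise ValueError("empty sequence")
    if not _VALID_RNA.fullmatch(cleaned):
        bad = sorted(set(cleaned) - set("AUGC"))
        raise ValueError(f"invalid RNA characters: {bad}")
    return cleaned


print(normalize_rna("augc gauc\n"))             # AUGCGAUC
print(normalize_rna("ATGC", convert_dna=True))  # AUGC
```

Running a check like this before the tokenization step surfaces malformed input as a clear `ValueError` rather than as unexpected token ids.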