---
license: mit
datasets:
- ZINC-22
language:
- en
tags:
- molecular-generation
- drug-discovery
- llama
- flash-attention
pipeline_tag: text-generation
---

# NovoMolGen

NovoMolGen is a family of molecular foundation models trained on
1.5 billion ZINC-22 molecules with Llama architectures and FlashAttention.
It achieves state-of-the-art performance on both unconstrained and
goal-directed molecule generation tasks.

<img src="assets/NovoMolGen.png" width="900"/>

## How to load

```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> # trust_remote_code=True allows Transformers to run the custom code shipped in the model repository
>>> tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_AtomWise", trust_remote_code=True)
>>> model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_AtomWise", trust_remote_code=True)
```

## Quick-start (FlashAttention + bf16)

```python
>>> # Continues from "How to load": `model` and `tokenizer` are already defined
>>> from accelerate import Accelerator

>>> acc = Accelerator(mixed_precision='bf16')
>>> model = acc.prepare(model)

>>> outputs = model.sample(tokenizer=tokenizer, batch_size=4)
>>> print(outputs['SMILES'])
```

## Transformers-native HF checkpoint (`revision="hf-checkpoint"`)

We also publish a Transformers-native checkpoint on the `hf-checkpoint` revision. This version loads directly with `AutoModelForCausalLM`, without `trust_remote_code`, and works out of the box with `.generate(...)`.

```python
>>> import torch
>>> from transformers import AutoTokenizer, AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_AtomWise", revision='hf-checkpoint', device_map='auto')
>>> tokenizer = AutoTokenizer.from_pretrained("chandar-lab/NovoMolGen_32M_SMILES_AtomWise", revision='hf-checkpoint')

>>> # Start each of the 4 sequences from the BOS token
>>> input_ids = torch.tensor([[tokenizer.bos_token_id]]).expand(4, -1).contiguous().to(model.device)
>>> outs = model.generate(input_ids=input_ids, temperature=1.0, max_length=64, do_sample=True, pad_token_id=tokenizer.eos_token_id)

>>> # Strip the spaces between decoded tokens to recover plain SMILES strings
>>> molecules = [t.replace(" ", "") for t in tokenizer.batch_decode(outs, skip_special_tokens=True)]
>>> molecules
['CCO[C@H](CNC(=O)N(CC(=O)OC(C)(C)C)c1cccc(Br)n1)C(F)(F)F',
'CCn1nnnc1CNc1ncnc(N[C@H]2CCO[C@@H](C)C2)c1C',
'CC(C)(O)CNC(=O)CC[C@H]1C[C@@H](NC(=O)COCC(F)F)C1',
'Cc1ncc(C(=O)N2C[C@H]3[C@H](CNC(=O)c4cnn[nH]4)CCC[C@H]3C2)n1C']
```

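The `t.replace(" ", "")` step in the decoding snippet above implies that decoded sequences come back as space-separated tokens. As a rough illustration of what atom-wise SMILES tokenization looks like, here is a minimal sketch built on a regular expression that is widely used for this purpose. The names `ATOMWISE_PATTERN` and `atomwise_tokenize` are hypothetical, and this is an assumption about the general technique, not necessarily the exact tokenizer shipped with the checkpoint.

```python
import re

# Assumption: a commonly used atom-wise SMILES pattern, NOT the checkpoint's own tokenizer.
# Bracket atoms such as [C@@H] stay whole; Br and Cl are matched before single letters.
ATOMWISE_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def atomwise_tokenize(smiles):
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    tokens = ATOMWISE_PATTERN.findall(smiles)
    # Fail loudly if any character was not covered by the pattern.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not tokenize: {smiles!r}")
    return tokens


smiles = "CCn1nnnc1CNc1ncnc(N[C@H]2CCO[C@@H](C)C2)c1C"
tokens = atomwise_tokenize(smiles)
spaced = " ".join(tokens)           # what a space-joined decode would look like
restored = spaced.replace(" ", "")  # same clean-up as in the snippet above

print(tokens[:6])          # ['C', 'C', 'n', '1', 'n', 'n']
print(restored == smiles)  # True
```

Because SMILES strings contain no spaces, removing them after decoding is lossless under this scheme.
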
## Citation

```bibtex
@misc{chitsaz2025novomolgenrethinkingmolecularlanguage,
      title={NovoMolGen: Rethinking Molecular Language Model Pretraining},
      author={Kamran Chitsaz and Roshan Balaji and Quentin Fournier and Nirav Pravinbhai Bhatt and Sarath Chandar},
      year={2025},
      eprint={2508.13408},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.13408},
}
```