Tags: Text Generation · Transformers · Safetensors · mamba · text-generation-inference · 4-bit precision · bitsandbytes
Instructions to use RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits")
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits")
model = AutoModelForCausalLM.from_pretrained("RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits")
```
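The pipeline above can then be invoked directly. A minimal usage sketch (the prompt is illustrative; since this checkpoint is a bnb 4-bit quantization, bitsandbytes and accelerate are assumed to be installed):

```python
# Continues from the pipeline snippet above; the prompt and token budget are illustrative.
# Assumes bitsandbytes and accelerate are installed (pip install bitsandbytes accelerate),
# since the checkpoint stores bnb 4-bit weights.
result = pipe("Once upon a time,", max_new_tokens=50)
print(result[0]["generated_text"])
```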
- Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits with vLLM:
Install from pip and serve the model
```bash
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
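Since the vLLM server exposes an OpenAI-compatible API, the same request can be made from Python. A minimal sketch, assuming the server above is running on localhost:8000 and the openai package is installed; the api_key value is a placeholder, as vLLM does not require one by default:

```python
# Minimal sketch of calling the vLLM server from Python instead of curl.
# Assumes the server started above is listening on localhost:8000 and that
# the openai package is installed (pip install openai). The api_key is a
# placeholder; vLLM does not check it by default.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```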
- SGLang
How to use RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits with SGLang:
Install from pip and serve the model
```bash
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```

Use Docker images
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
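The SGLang server speaks the same OpenAI-compatible completions API, so the curl call above can equally be made from Python. A minimal sketch using the requests package, assuming the server is running on localhost:30000:

```python
# Minimal sketch mirroring the curl call above with Python's requests package
# (pip install requests). Assumes the SGLang server is listening on localhost:30000.
import requests

response = requests.post(
    "http://localhost:30000/v1/completions",
    json={
        "model": "RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5,
    },
)
print(response.json()["choices"][0]["text"])
```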
- Docker Model Runner

How to use RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits with Docker Model Runner:
```bash
docker model run hf.co/RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits
```
Quantization made by Richard Erkhov.
mamba_790_hf_qa - bnb 4bits
- Model creator: https://huggingface.co/DeepMount00/
- Original model: https://huggingface.co/DeepMount00/mamba_790_hf_qa/
Original model description:
```yaml
language:
  - it
license: apache-2.0
datasets:
  - DeepMount00/gquad_it
pipeline_tag: question-answering
```
SQuAD-it Evaluation
The Stanford Question Answering Dataset (SQuAD) in Italian (SQuAD-it) is used to evaluate the model's reading comprehension and question-answering capabilities. The following table presents the F1 score and Exact Match (EM) metrics:
| Model | F1 Score | Exact Match (EM) |
|---|---|---|
| DeepMount00/Gemma_QA_ITA_v3 | 77.24% | 64.60% |
| DeepMount00/Gemma_QA_ITA_v2 | 77.17% | 63.82% |
| DeepMount00/mamba_790_hf_qa | 75.89% | 66.71% |
| DeepMount00/Gemma_QA_ITA | 59.59% | 40.68% |
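Both metrics are computed per question and averaged over the dataset: Exact Match checks whether the normalized prediction equals a reference answer, while F1 measures token-level overlap between prediction and reference. A minimal sketch of the standard SQuAD-style scoring (a hypothetical helper for illustration, not the exact evaluation script behind the table above):

```python
# Hypothetical sketch of SQuAD-style F1 / Exact Match scoring for a single
# prediction; the model card does not publish its exact evaluation script.
from collections import Counter

def normalize(text: str) -> list[str]:
    # Lowercase, strip punctuation, and split into tokens.
    return "".join(c if c.isalnum() or c.isspace() else " " for c in text.lower()).split()

def exact_match(prediction: str, reference: str) -> float:
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction: str, reference: str) -> float:
    pred_tokens, ref_tokens = normalize(prediction), normalize(reference)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(f1_score("in piazza di porta Ravegnana", "piazza di porta Ravegnana"))  # ~0.889
print(exact_match("piazza di porta Ravegnana", "Piazza di Porta Ravegnana"))  # 1.0
```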
How to Use
How to use Mamba for question answering:
```python
from transformers import MambaForCausalLM, AutoTokenizer

model_name = "DeepMount00/mamba_790_hf_qa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MambaForCausalLM.from_pretrained(model_name, device_map={"": 0}).eval()

def predict(contesto, domanda):
    device = "cuda:0"
    # Italian instruction, roughly: "Below you will be given a context and then
    # a question. Your task is to answer the question based only on the context."
    prefix_text = 'Di seguito ti verrà fornito un contesto e poi una domanda. il tuo compito è quello di rispondere alla domanda basandoti esclusivamente sul contesto\n\n'
    prompt = f"""{prefix_text}##CONTESTO: {contesto}\n##DOMANDA: {domanda}\n"""
    input_ids = tokenizer([prompt], return_tensors="pt").to(device)
    generate_ids = model.generate(**input_ids, max_new_tokens=150, eos_token_id=8112)
    answer = tokenizer.batch_decode(generate_ids)
    try:
        # The model emits the answer between "##RISPOSTA: " and "##END".
        final_answer = answer[0].split("##RISPOSTA: ")[1].split("##END")[0].strip("\n")
    except IndexError:
        final_answer = ""
    return final_answer

contesto = """La torre degli Asinelli è una delle cosiddette due torri di Bologna, simbolo della città, situate in piazza di porta Ravegnana, all'incrocio tra le antiche strade San Donato (ora via Zamboni), San Vitale, Maggiore e Castiglione. Eretta, secondo la tradizione, fra il 1109 e il 1119 dal nobile Gherardo Asinelli, la torre è alta 97,20 metri, pende verso ovest per 2,23 metri e presenta all'interno una scalinata composta da 498 gradini. Ancora non si può dire con certezza quando e da chi fu costruita la torre degli Asinelli. Si presume che la torre debba il proprio nome a Gherardo Asinelli, il nobile cavaliere di fazione ghibellina al quale se ne attribuisce la costruzione, iniziata secondo una consolidata tradizione l'11 ottobre 1109 e terminata dieci anni dopo, nel 1119."""
domanda = "Dove si trova precisamente la torre degli Asinelli?"
print(predict(contesto, domanda))
```
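The snippet above loads the original full-precision checkpoint. To run the bnb 4-bit quantization this card packages, the same flow should work with the quantized repo id swapped in (an assumption, not verified here; bitsandbytes and accelerate must be installed):

```python
# Hypothetical variant: load the bnb 4-bit quantization instead of the
# full-precision original. Assumes bitsandbytes and accelerate are installed
# (pip install bitsandbytes accelerate); the quantization config is read
# from the checkpoint itself.
from transformers import MambaForCausalLM, AutoTokenizer

quant_name = "RichardErkhov/DeepMount00_-_mamba_790_hf_qa-4bits"
tokenizer = AutoTokenizer.from_pretrained(quant_name)
model = MambaForCausalLM.from_pretrained(quant_name, device_map={"": 0}).eval()
# predict() from the snippet above can then be used unchanged.
```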