Dhee-NxtGen-Qwen3-Indic (4B)

Model Description

Dhee-NxtGen-Qwen3-Indic is a single, unified 4B-parameter multilingual large language model developed by DheeYantra in collaboration with NxtGen Cloud Technologies Pvt. Ltd.

Built on the Qwen3-4B architecture, the model is designed to support assistant-style conversation, reasoning, and function-calling-compatible workflows across 14 Indian (Indic) languages within one shared model.

The model is optimized for native-script generation, consistent multilingual behavior, and cross-lingual generalization.

Supported Languages

This single model supports the following Indic languages:

  • Hindi (hi)
  • Bengali (bn)
  • Tamil (ta)
  • Telugu (te)
  • Malayalam (ml)
  • Gujarati (gu)
  • Kannada (kn)
  • Marathi (mr)
  • Odia (or)
  • Punjabi (pa)
  • Assamese (as)
  • Maithili (mai)
  • Sanskrit (sa)
  • Sindhi (sd)

Best results are achieved when prompts are written entirely in the target language.


Key Features

  • Single multilingual 4B model (no per-language checkpoints)
  • Fluent, native-script text generation across 14 Indic languages
  • Optimized for assistant-style and reasoning-based dialogue
  • Supports summarization, Q&A, and long-form generation
  • Compatible with function-calling style prompting
  • Fully compatible with Hugging Face Transformers
  • Ready for high-throughput inference using vLLM

Example Usage

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# 1. Configuration
model_name = "dheeyantra/dhee-nxtgen-qwen3-indic"
device = "cuda" if torch.cuda.is_available() else "cpu"

# 2. Load Model and Tokenizer
print(f"Loading model: {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# 3. Define the Prompt
# Using the ChatML format expected by Qwen-based architectures.
# The user asks (in Hindi): "Can you book an appointment for me?
# If yes, please ask me for the necessary details such as date, time, and purpose."
prompt = """<|im_start|>system
You are a helpful multilingual assistant.<|im_end|>
<|im_start|>user
क्या आप मेरे लिए एक अपॉइंटमेंट बुक कर सकते हैं? 
अगर हाँ, तो कृपया मुझसे ज़रूरी जानकारी जैसे तारीख, समय और उद्देश्य पूछिए।<|im_end|>
<|im_start|>assistant
"""

# 4. Process and Generate
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

print("Generating response...")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

# 5. Decode and Print
# Decode only the newly generated tokens (the assistant's reply),
# which is more robust than splitting the full decoded string
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

print("-" * 30)
print(f"Assistant: {response}")
print("-" * 30)

Function/Tool Calling Example Usage

import torch
import json
import re
from transformers import AutoTokenizer, AutoModelForCausalLM

# --- 1. MODEL SETUP ---
model_name = "dheeyantra/dhee-nxtgen-qwen3-indic"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

# --- 2. TOOLS & SYSTEM PROMPT ---
tools = [{
    "name": "book_appointment",
    "description": "Book an appointment for the user.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string", "description": "Date in YYYY-MM-DD format"},
            "time": {"type": "string", "description": "Time in HH:MM (24h) format"},
            "purpose": {"type": "string", "description": "The medical reason or department"}
        },
        "required": ["date", "time", "purpose"]
    }
}]

# We provide the current date so the model can resolve "tomorrow" or "next Monday"
SYSTEM_PROMPT = f"""You are a helpful AI assistant.
Today's Date: 2026-01-08 (Thursday).

Available Tools:
{json.dumps(tools, indent=2)}

Rules:
1. If details (date, time, purpose) are missing, ask the user in Hindi.
2. If all details are present, output ONLY <tool_call>{{"name": "book_appointment", "arguments": {{...}}}}</tool_call>.
3. After a tool result is provided, confirm the booking to the user in Hindi."""

# --- 3. BACKEND FUNCTION ---
def execute_booking(date, time, purpose):
    # Simulated backend logic
    if "10:00" in time:  # Simulate a busy slot
        # Hindi: "This slot is already booked. Please choose another time."
        return {"status": "error", "message": "यह समय पहले से बुक है। कृपया कोई और समय चुनें।"}
    return {"status": "success", "id": "APP-9921", "doctor": "Dr. Verma"}

# --- 4. THE INTERACTION ENGINE ---
def run_conversation(user_input, history):
    # Add user input to history
    history.append({"role": "user", "content": user_input})
    
    # Construct ChatML prompt
    prompt = f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
    for msg in history:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    prompt += "<|im_start|>assistant\n"

    # Generate
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)  # deterministic for tool calls
    # Decode only the newly generated tokens (the assistant's turn)
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    response = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()

    # Check if the model wants to call a tool
    tool_match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    
    if tool_match:
        print(f"\n[MODEL REQUESTED TOOL]: {response}")
        call_data = json.loads(tool_match.group(1).strip())
        
        # Execute the function
        result = execute_booking(**call_data['arguments'])
        print(f"[TOOL RESULT]: {result}")
        
        # Feed result back to model for final response
        history.append({"role": "assistant", "content": response})
        history.append({"role": "system", "content": f"Tool Result: {json.dumps(result)}"})
        
        # Final confirmation generation
        return run_conversation("Please confirm the result to the user.", history)
    
    return response

# --- 5. TEST SCENARIO ---
chat_history = []

print("--- Chatbot Started (Today is 2026-01-08) ---")

# Step 1: User provides partial info
query_1 = "मेरे लिए डेंटिस्ट का अपॉइंटमेंट बुक करें।"  # "Book a dentist appointment for me."
print(f"\nUser: {query_1}")
res_1 = run_conversation(query_1, chat_history)
chat_history.append({"role": "assistant", "content": res_1})
print(f"Assistant: {res_1}")

# Step 2: User provides the rest
query_2 = "कल दोपहर 2 बजे।"  # "Tomorrow at 2 PM."
print(f"\nUser: {query_2}")
res_2 = run_conversation(query_2, chat_history)
print(f"Assistant: {res_2}")
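
The tool-call parsing step in the engine above can be exercised in isolation, without loading the model. The sketch below assumes the model emits the `<tool_call>{"name": ..., "arguments": {...}}</tool_call>` shape described in the system prompt; the sample string is illustrative, not real model output.

```python
import json
import re

def parse_tool_call(text):
    """Extract the JSON payload of a <tool_call> block, or return None."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1).strip())

# Illustrative model output (hypothetical, not real generation)
sample = (
    "<tool_call>\n"
    '{"name": "book_appointment",\n'
    ' "arguments": {"date": "2026-01-09", "time": "14:00", "purpose": "dentist"}}\n'
    "</tool_call>"
)

call = parse_tool_call(sample)
print(call["name"])               # book_appointment
print(call["arguments"]["time"])  # 14:00
```

Testing this parser on canned strings before wiring it to live generations makes it easier to catch format drift in the model's tool-call output.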

Prompting Guidelines

  • Use pure native-language prompts for best fluency
  • Avoid heavy code-mixing (e.g., Hinglish-heavy inputs)
  • Include a system prompt to stabilize multilingual behavior
  • Ask explicitly for step-by-step reasoning when required
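
These guidelines can be folded into a small helper that assembles a ChatML prompt from a system message and a conversation history, matching the format used in the examples above. The `build_chatml_prompt` helper is an illustrative sketch, not part of the model's API.

```python
def build_chatml_prompt(system, messages):
    """Assemble a ChatML prompt string; `messages` is a list of
    {"role": ..., "content": ...} dicts ending with a user turn."""
    prompt = f"<|im_start|>system\n{system}<|im_end|>\n"
    for msg in messages:
        prompt += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    # Leave the assistant turn open so generation continues from here
    return prompt + "<|im_start|>assistant\n"

prompt = build_chatml_prompt(
    "You are a helpful multilingual assistant.",
    [{"role": "user", "content": "नमस्ते!"}],  # "Hello!" in Hindi
)
print(prompt)
```

Keeping the system message fixed across turns, and each user turn in a single language, follows the stability and code-mixing guidance above.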

Intended Uses & Limitations

Intended Uses

  • Multilingual Indic chatbots and AI assistants
  • Education, governance, and public-sector AI applications
  • Content generation and summarization in Indian languages
  • Cross-lingual conversational and reasoning systems

Limitations

  • May occasionally hallucinate or produce inaccurate facts
  • Performance may vary slightly across languages
  • Not intended for medical, legal, or safety-critical use cases
  • Code-mixed inputs may reduce output quality

vLLM / High-Performance Serving

Requirements

  • NVIDIA GPU with compute capability ≥ 8.0 (A100 / H100 recommended)
  • PyTorch 2.1+ with CUDA installed
  • V100 (sm70) GPUs are not supported for vLLM GPU inference
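
The compute-capability floor can be checked at runtime. The helper below compares the `(major, minor)` tuple that `torch.cuda.get_device_capability()` returns against the 8.0 requirement; it is written as a pure function so the logic can be verified without a GPU present.

```python
def meets_vllm_requirement(capability, floor=(8, 0)):
    """True if a (major, minor) compute-capability tuple meets the floor."""
    return capability >= floor

# In practice: import torch; capability = torch.cuda.get_device_capability()
print(meets_vllm_requirement((8, 0)))  # True  (A100)
print(meets_vllm_requirement((9, 0)))  # True  (H100)
print(meets_vllm_requirement((7, 0)))  # False (V100, sm70)
```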

Installation

pip install torch transformers vllm sentencepiece

Run vLLM Server

vllm serve dheeyantra/dhee-nxtgen-qwen3-indic --host 0.0.0.0 --port 8000
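
Once running, the server exposes vLLM's OpenAI-compatible API, so a chat request can be posted to `/v1/chat/completions`. The sketch below only builds and prints the request payload (sending it requires the live server); the URL assumes the default host and port from the command above.

```python
import json

# Payload for vLLM's OpenAI-compatible chat endpoint
url = "http://0.0.0.0:8000/v1/chat/completions"
payload = {
    "model": "dheeyantra/dhee-nxtgen-qwen3-indic",
    "messages": [
        {"role": "system", "content": "You are a helpful multilingual assistant."},
        {"role": "user", "content": "भारत की राजधानी क्या है?"},  # "What is the capital of India?"
    ],
    "max_tokens": 128,
    "temperature": 0.7,
}
print(json.dumps(payload, ensure_ascii=False, indent=2))

# To send the request (with the server running):
# import requests
# reply = requests.post(url, json=payload).json()
# print(reply["choices"][0]["message"]["content"])
```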

License

Released under the Apache 2.0 License.


Developed by DheeYantra in collaboration with NxtGen Cloud Technologies Pvt. Ltd.
