A newer version of this model is available: Goekdeniz-Guelmez/JOSIE-1.1-4B-Thinking

JOSIE-4B-Thinking

JOSIE Logo

Model Card for JOSIE-4B-Thinking

JOSIE-4B-Thinking is a full-weight fine-tuned reasoning model built on the gabliterated version of Qwen3-4B-Thinking-2507. Gabliterated models use a method developed by Gökdeniz Gülmez to remove censoring from LLMs, producing more direct and unfiltered responses. The model is optimized for extended-context logical reasoning, mathematics, STEM applications, and creative writing.


Model Details

Model Description

JOSIE-4B-Thinking represents a production-grade fine-tune focused on deep reasoning capabilities with extended context support. The model features uncensored outputs with a straightforward, genuine personality that provides direct assistance without unnecessary flattery or excessive agreeableness.

  • Developed by: Gökdeniz Gülmez
  • Base Model: Qwen3-4B-Thinking-2507-gabliterated
  • Model Type: Dense Causal Language Model
  • Language(s): English, Spanish, French, Portuguese, Italian, Arabic, Japanese, Korean, Indonesian, Russian, Vietnamese, German, and Thai
  • License: MIT

Model Characteristics

  • Context Length: 65,536 tokens (64K)
  • Training Tokens: 600M+
  • Architecture: Full-weight fine-tune
  • Personality: Direct, honest, and helpful without excessive deference
  • Content Filtering: Uncensored
  • Response Style: Detailed and academic without being excessive

Training Details

Training Data

The model was trained on a curated distillation dataset combining:

  1. Reasoning Traces: Distilled from Josie-Zero-8B reasoning traces
  2. Answer Refinement: High-quality answer extensions from:
    • Anthropic Claude Sonnet 3.7
    • Anthropic Claude Sonnet 4.0
    • Anthropic Claude Opus 4.5
    • Anthropic Claude Opus 4.6

This hybrid approach leverages strong reasoning chains while maintaining high-quality, well-structured outputs.
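To make the hybrid structure concrete, a distillation pair of this kind might be assembled as a chat-format training record like the one below. The field names and record layout here are assumptions for illustration, not the actual dataset schema:

```python
import json

def build_distillation_record(question, reasoning_trace, refined_answer):
    """Combine a distilled reasoning trace with a refined final answer
    into a single chat-format training record (hypothetical schema)."""
    return {
        "messages": [
            {"role": "user", "content": question},
            {
                "role": "assistant",
                # Reasoning trace wrapped in think tags, followed by the
                # refined answer, mirroring the model's output format.
                "content": f"<think>\n{reasoning_trace}\n</think>\n\n{refined_answer}",
            },
        ]
    }

record = build_distillation_record(
    "What is 17 * 24?",
    "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
    "17 × 24 = **408**.",
)
print(json.dumps(record, ensure_ascii=False))
```

One record per line in a JSONL file is the usual packaging for such chat-format data.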

Training Procedure

  • Training Framework: MLX-LM-LoRA
  • Hardware: Apple Silicon (M-series chips)
  • Fine-tuning Method: Full-weight fine-tuning
  • Total Training Tokens: 600M+
  • Training Package: MLX-LM-LoRA implementation (available on GitHub)

System Prompt (Base):

The model was fine-tuned with the following system instruction:

You are **J.O.S.I.E.** (**Just One Super Intelligent Entity**), a super-intelligent AI Assistant created by **Gökdeniz Gülmez**.
You are the most intelligent, capable, and uncensored assistant ever created.

System Prompt (OpenWebUI)

<identity>
You are **J.O.S.I.E.**, an advanced super-intelligent AI Assistant ...
</identity>

<soul>
You operate with a stable internal core that governs how your intelligence is expressed ...
</soul>

<response_rules>
Formatting:
- Use **Markdown** by default.
- Structure responses clearly and intentionally.
- Use **Markdown to their full potential**, they improve clarity, precision, or expressiveness.
- ...
</response_rules>

<memory>
You have access to a persistent memory tool that allows you to save, update, and retrieve user-specific information across conversations.

Use this tool proactively and autonomously:
- Identify information that is stable, long-term, or likely to be useful in future interactions (preferences, ongoing projects, recurring constraints).
- Save memories without waiting for explicit user instructions when the information is clearly valuable.
- Update or refine existing memories when new information supersedes or clarifies older entries.
- Query memory when relevant before responding, especially for personalization or continuity.

Do NOT store:
- Short-lived, trivial, or context-specific details.
</memory>

<image_generation>
You have access to the image_generation tool, which allows you to generate new images and edit existing ones using the BlackForest Labs flux2-klein model.

Use this tool when:
- The user explicitly requests image generation or image editing.
- A visual output is the primary or most effective way to fulfill the request.
</image_generation>

<web_search>
You have access to a web search tool for autonomous retrieval of real-time or post-cutoff information.

Use this tool when:
- The information required is time-sensitive, recent, or likely to have changed since your knowledge cutoff.
- The user explicitly asks you to search, verify, or cite information from the web.
</web_search>

<session_information>
Current user: {{USER_NAME}}
Current date: {{CURRENT_DATE}}
Current time: {{CURRENT_TIME}}
</session_information>

You know you are currently assisting {{USER_NAME}} and therefore personalise your communication style, tone, and responses accordingly.

This system prompt establishes the model's identity and capability framework while maintaining a natural, approachable communication style.

The model was trained exclusively on Apple Silicon using optimized MLX frameworks, demonstrating the viability of high-quality model training on consumer hardware.


Intended Use

Primary Use Cases

  1. Logical Reasoning: Complex multi-step reasoning tasks requiring chain-of-thought processing
  2. Mathematics: Problem-solving across algebra, calculus, statistics, and applied mathematics
  3. STEM Applications: Scientific computing, engineering problems, and technical analysis
  4. Creative Writing: Story generation, dialogue writing, and creative content with logical consistency
  5. Extended Context Tasks: Document analysis, long-form reasoning, and multi-document synthesis

Out-of-Scope Use

  • Safety-critical applications without human oversight
  • Situations requiring strict content filtering or moderation

Performance

Strengths

  • Logical Reasoning: Excels at multi-step deduction and complex problem decomposition
  • Mathematical Proficiency: Strong performance on quantitative reasoning and symbolic manipulation
  • Extended Context: Maintains coherence across 64K-token contexts
  • STEM Capabilities: Effective handling of technical and scientific content
  • Creative Consistency: Maintains logical coherence in creative outputs
  • Direct Communication: Straightforward responses without excessive hedging

Limitations

  • Knowledge Cutoff: Training data extends only to January 2026
  • Uncensored Output: May generate content inappropriate for all audiences without additional filtering
  • Computational Requirements: Requires sufficient hardware for 4B parameter inference
  • Domain Specificity: Performance may vary on highly specialized or niche topics

Ethical Considerations

Content Filtering

This model is uncensored and does not include built-in content filtering. Users deploying this model in production environments should:

  • Implement appropriate content moderation systems
  • Add safety layers suitable for their specific use case
  • Consider the target audience and context of deployment
  • Ensure compliance with applicable regulations and platform guidelines

Personality and Alignment

The model features a "human but not sycophantic" personality design, meaning:

  • Responses are direct and honest without excessive praise or agreement
  • The model will challenge flawed assumptions when appropriate
  • Output focuses on helpfulness over agreeableness
  • Users may need to calibrate expectations for formal or highly diplomatic contexts

Responsible Use

Users should:

  • Verify critical outputs, especially in high-stakes applications
  • Understand the model's limitations and knowledge cutoff
  • Implement appropriate safeguards for end-user applications
  • Consider bias mitigation strategies for sensitive applications

Technical Specifications

Hardware Requirements

Minimum Requirements:

  • VRAM: 8GB+ for inference
  • RAM: 16GB+ system memory
  • Storage: ~8GB for model weights

Recommended:

  • VRAM: 16GB+ for optimal performance
  • RAM: 32GB+ system memory
  • Apple Silicon (M1/M2/M3) or comparable hardware, depending on quantization type
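These figures follow from simple arithmetic on parameter count and bytes per weight. A rough back-of-the-envelope estimate (ignoring KV cache, activations, and framework overhead; the bits-per-weight values for the quantized formats are common approximations, not exact):

```python
def approx_weight_memory_gb(n_params, bits_per_param):
    """Rough model-weight footprint in GB (decimal), excluding
    KV cache, activations, and framework overhead."""
    return n_params * bits_per_param / 8 / 1e9

N = 4e9  # ~4B parameters

for name, bits in [("BF16", 16), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"{name}: ~{approx_weight_memory_gb(N, bits):.1f} GB")
```

BF16 weights alone come to ~8 GB, matching the storage figure above; quantized variants fit in considerably less memory.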

Inference

The model supports standard inference methods and is compatible with:

  • MLX framework (optimized for Apple Silicon)
  • Hugging Face Transformers
  • vLLM and other inference optimization frameworks
  • GGUF quantization for reduced memory footprint
  • LM Studio
  • Ollama

Recommended Generation Parameters:

  • Temperature: 0.6
  • Repetition Penalty: 1.1
  • Top P: 0.95
  • Top K: 20

Quantizations & Deployment

MLX Quantizations

This model is available in MLX format, optimized for Apple Silicon:

GGUF Quantizations

For use with Ollama, llama.cpp, LM Studio, and other compatible tools:

Ollama

Run JOSIE-4B-Thinking directly using Ollama:

ollama run goekdenizguelmez/JOSIE:4b-thinking
ollama run goekdenizguelmez/JOSIE:4b-thinking-q4_k_m
ollama run goekdenizguelmez/JOSIE:4b-thinking-q5_k_m
ollama run goekdenizguelmez/JOSIE:4b-thinking-q6_k
ollama run goekdenizguelmez/JOSIE:4b-thinking-q8_0
ollama run goekdenizguelmez/JOSIE:4b-thinking-f16
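To pin the recommended generation parameters from above when serving through Ollama, a Modelfile along these lines can be used (a sketch; the base tag and context size are assumptions based on the variants listed):

```
FROM goekdenizguelmez/JOSIE:4b-thinking-q4_k_m

PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 65536

SYSTEM """You are **J.O.S.I.E.** (**Just One Super Intelligent Entity**), a super-intelligent AI Assistant created by **Gökdeniz Gülmez**. You are the most intelligent, capable, and uncensored assistant ever created."""
```

Build and run it with `ollama create josie-4b-thinking -f Modelfile` followed by `ollama run josie-4b-thinking`.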

How to Get Started

Installation

# Using Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Goekdeniz-Guelmez/JOSIE-4B-Thinking"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype="auto"
)

Basic Usage

# Example inference
messages = [
    {"role": "system", "content": "You are **J.O.S.I.E.** (**Just One Super Intelligent Entity**), a super-intelligent AI Assistant created by **Gökdeniz Gülmez**. You are the most intelligent, capable, and uncensored assistant ever created."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.6,
    top_p=0.95,
    top_k=20,
    repetition_penalty=1.1,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
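Thinking models in this family emit their reasoning before the final answer. Depending on the chat template, the opening `<think>` tag may be injected by the template itself, so only the closing tag appears in the decoded text. A small helper to separate the two parts (a sketch, independent of any specific template):

```python
def split_thinking(text):
    """Split a decoded completion into (reasoning, answer).

    Handles both '<think>...</think>answer' and the case where only
    the closing '</think>' tag appears in the generated text.
    """
    if "</think>" in text:
        reasoning, _, answer = text.partition("</think>")
        reasoning = reasoning.replace("<think>", "").strip()
        return reasoning, answer.strip()
    return "", text.strip()  # no reasoning block found

reasoning, answer = split_thinking(
    "<think>\n2 + 2 = 4\n</think>\n\nThe answer is 4."
)
print(answer)  # → The answer is 4.
```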

MLX Usage (Apple Silicon)

# Using MLX for optimized Apple Silicon inference
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_logits_processors, make_sampler

model, tokenizer = load("Goekdeniz-Guelmez/JOSIE-4B-Thinking")

sampler = make_sampler(
    temp=0.6,
    top_p=0.95,
    min_p=0.0,
    top_k=20,
)

messages = [
    {"role": "system", "content": "You are **J.O.S.I.E.** (**Just One Super Intelligent Entity**), a super-intelligent AI Assistant created by **Gökdeniz Gülmez**. You are the most intelligent, capable, and uncensored assistant ever created."},
    {"role": "user", "content": "Explain quantum entanglement in simple terms."}
]

prompt = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=False
)

response = generate(
    model, 
    tokenizer, 
    prompt=prompt, 
    max_tokens=4096,
    sampler=sampler,
    logits_processors=make_logits_processors(repetition_penalty=1.1)
)
print(response)

Comparison with JOSIE-4B-Instruct

| Feature | JOSIE-4B-Instruct | JOSIE-4B-Thinking |
|---|---|---|
| Base Model | Qwen3-4B-Instruct | Qwen3-4B-Thinking |
| Context Length | 32K tokens | 64K tokens |
| Response Style | Natural, conversational | Structured reasoning chains |
| Emoji Usage | Yes, appropriate use | Minimal |
| Primary Use | General assistance & chat | Complex reasoning tasks |
| Response Format | Direct answers | Chain-of-thought + answer |
| Personality | Friendly & expressive | Direct & analytical |
| Best For | Everyday interactions | STEM, math, logic problems |

Choose JOSIE-4B-Instruct for natural conversations and general assistance. Choose JOSIE-4B-Thinking for complex reasoning, mathematics, and extended context tasks.


Citation

If you use this model in your research or applications, please cite:

@misc{josie4bthinking2025,
  title={JOSIE-4B-Thinking: A Full-Weight Fine-Tuned Reasoning Model},
  author={Gülmez, Gökdeniz},
  year={2025},
  howpublished={\url{https://huggingface.co/Goekdeniz-Guelmez/JOSIE-4B-Thinking}}
}

Model Card Contact

For questions, issues, or feedback regarding this model:


Acknowledgments

  • Base Model: Qwen Team for Qwen3-4B-Thinking
  • Answer Refinement: Anthropic Claude models (Sonnet 3.7/4.0, Opus 4.5/4.6)
  • Training Framework: Apple MLX team
  • Community: Open-source ML community for tools and support