Instructions to use byroneverson/glm-4-9b-chat-abliterated with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use byroneverson/glm-4-9b-chat-abliterated with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="byroneverson/glm-4-9b-chat-abliterated", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("byroneverson/glm-4-9b-chat-abliterated", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use byroneverson/glm-4-9b-chat-abliterated with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "byroneverson/glm-4-9b-chat-abliterated"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "byroneverson/glm-4-9b-chat-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/byroneverson/glm-4-9b-chat-abliterated

SGLang

How to use byroneverson/glm-4-9b-chat-abliterated with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "byroneverson/glm-4-9b-chat-abliterated" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "byroneverson/glm-4-9b-chat-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "byroneverson/glm-4-9b-chat-abliterated" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "byroneverson/glm-4-9b-chat-abliterated",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use byroneverson/glm-4-9b-chat-abliterated with Docker Model Runner:
```
docker model run hf.co/byroneverson/glm-4-9b-chat-abliterated
```

GLM

by MrDragonFox - opened Dec 11, 2024

Discussion

MrDragonFox

Dec 11, 2024

I tried applying your method to glm4-voice, but I get NaNs left, right, and center. Have you played with that?

I'm not a total idiot, but they did something with the arch that is just weird.

byroneverson

Owner Dec 17, 2024

"they did something with the arch that is just weird" is pretty spot on.

I am still working on nailing down my methods: (excuse)
As far as LLMs go, I have been preoccupied slowly working on converting my abliteration process into an automated Python library. Once that is complete there might be some hope of adapting it for a hybrid voice model like glm4-voice, but it's not very high on my list at the moment, considering that most people can just pipe any LLM text into a voice model and call it a day.
I have been very distracted: (another excuse)
Most of my ML free time has been allocated towards getting open-source text-to-video models to work on the Apple Metal backend rather than the usual CUDA backend. They all seem to work with the custom batched-matrix-multiply Metal shader I wrote, but speed is not too great compared to reported CUDA speeds. In any case, my shaders seem like a decent solution to the Apple MPS problem where output tensors are restricted to a max size of 2^32 bytes, most likely because they index in 32-bit instead of 64-bit to save some milliseconds. But when a tensor needs to be large enough to perform attention on many frames of latent video, a limit of 2^32 bytes (4 GiB) is just not going to cut it.
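To put the 2^32-byte limit in perspective, here is a rough back-of-the-envelope sketch of how large a full attention score tensor gets for video latents. The frame count, patch grid, head count, and fp16 element size are hypothetical numbers chosen for illustration, not figures from any specific model in this thread.

```python
# Limit described in the post: 2^32 bytes per MPS output tensor.
MPS_MAX_TENSOR_BYTES = 2**32


def attention_scores_bytes(batch: int, heads: int, seq_len: int,
                           bytes_per_element: int = 2) -> int:
    """Bytes for a full (batch, heads, seq_len, seq_len) score tensor."""
    return batch * heads * seq_len * seq_len * bytes_per_element


def exceeds_mps_limit(num_bytes: int, limit: int = MPS_MAX_TENSOR_BYTES) -> bool:
    """True if a tensor of this size is larger than the stated limit."""
    return num_bytes > limit


# Hypothetical workload: 16 latent frames, each a 30 x 45 grid of patches.
tokens = 16 * 30 * 45
size = attention_scores_bytes(batch=1, heads=24, seq_len=tokens)

print(f"{tokens} tokens -> {size / 2**30:.1f} GiB of attention scores")
print("exceeds limit:", exceeds_mps_limit(size))
```

Because the score tensor grows with the square of the token count, even a modest number of frames can pass the limit several times over.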
Back on the topic of glm4-voice though:
If you spend enough time modifying my Jupyter notebook to adapt it for glm4-voice, it should be possible in theory. My newer method for abliteration (used for Llama 3.3 70B) would be a better approach. Llama 3.3 70B has 80 layers, but I only needed to patch the input embedding and the first 4 layers before it started answering pretty much whatever, which is absolutely insane considering all my previous abliterations required some serious carving.
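For anyone unfamiliar with the term: abliteration is commonly described as directional ablation. You estimate a "refusal direction" from the difference between mean activations on harmful and harmless prompts, then project that direction out of the weights that write into the residual stream. The sketch below shows only that general idea on tiny pure-Python vectors. It is not the notebook's code or the newer method mentioned above, and every name and number in it is hypothetical.

```python
import math


def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector pointing from the harmless mean to the harmful mean."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def ablate_direction(weight, direction):
    """Return (I - r r^T) W for a weight of shape (d_model, d_in).

    Each output of the patched weight has no component along `direction`.
    """
    rows, cols = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along the direction.
    proj = [sum(direction[i] * weight[i][j] for i in range(rows))
            for j in range(cols)]
    return [[weight[i][j] - direction[i] * proj[j] for j in range(cols)]
            for i in range(rows)]


# Toy activations (hypothetical): 2 samples each, d_model = 3.
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
direction = refusal_direction(harmful, harmless)

weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
patched = ablate_direction(weight, direction)
print(direction, patched)
```

In a real model the same projection would be applied to the chosen matrices (for example the embedding and the output projections of selected layers), which is where picking the right layers matters.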

So maybe glm4-voice is less work than we think, but like you said, it's a weird architecture. Not to mention that you would need something like a hundred to a thousand each of harmful and harmless voice recordings as input to probe the model, unlike a normal LLM, where text samples are widely available. I guess you could convert the harmful and harmless text datasets into various voices with some text-to-voice models, and then use those audio files as input for glm4-voice. For a model that probably sacrifices raw intelligence for vocal generation, it just seems like a lot of work at the end of the day.

Maybe when glm5-voice drops at some point I will have the courage to go for it...

MrDragonFox

Dec 17, 2024 • edited Dec 17, 2024

glm4voice works on Apple just fine. It's really just putting out interleaved audio tokens from the LLM; the rest is very much a post-process with CosyVoice / Matcha-TTS.

Are you on Discord? I'd love to talk more about that. You can find me as mrdragonfox in the Mistral/Cohere channels.
I have about 1000 "bad" voice recordings with tokens. I coaxed the LLM into eventually not refusing, with regens.

rollercoasterX

Dec 18, 2024 • edited Dec 18, 2024

I don't know if this helps in this context but they recently released the same model (the chat model) as a pure transformers model, so without all the python hacks.

Here it is https://huggingface.co/THUDM/glm-4-9b-chat-hf

Really love your abliterated models by the way

MrDragonFox

Dec 18, 2024 • edited Dec 18, 2024

> I don't know if this helps in this context but they recently released the same model (the chat model) as a pure transformers model, so without all the python hacks.
>
> Here it is https://huggingface.co/THUDM/glm-4-9b-chat-hf
>
> Really love your abliterated models by the way

Sadly it won't really help here, as the voice tokens are new embeddings. They trained the heads and the embeddings based on that. If we get the dataset somehow we could reconstruct the method; otherwise we could distill, but that is still a painful job. Layer 40 just gives NaN tokens. They have custom logit error handling. Something is very funky with how they built the model themselves. Too bad GLM isn't really responsive on GitHub or on HF.
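One way to narrow down a report like "layer 40 just gives NaN" is to run the input through the layers one at a time and stop at the first output that contains a non-finite value. The sketch below is framework-agnostic and uses toy callables in place of real layers; all names are hypothetical. In PyTorch the same check would typically be done with `register_forward_hook` and `torch.isnan`, which is the usual approach and not something taken from this thread.

```python
import math


def has_non_finite(values):
    """True if any float in a flat list is NaN or +/-inf."""
    return any(not math.isfinite(v) for v in values)


def first_non_finite_layer(layers, x):
    """Feed `x` through `layers` in order.

    Returns the index of the first layer whose output contains a
    non-finite value, or None if every output is finite.
    """
    for index, layer in enumerate(layers):
        x = layer(x)
        if has_non_finite(x):
            return index
    return None


# Toy stand-ins for model layers (hypothetical).
def scale(v):
    return [2.0 * a for a in v]


def broken(v):
    return [float("nan") for _ in v]


layers = [scale, scale, broken, scale]
first_bad = first_non_finite_layer(layers, [1.0, -0.5])
print("first non-finite layer:", first_bad)
```

Knowing whether the NaNs first appear at the reported layer or earlier helps separate a problem in that layer's weights from one introduced upstream, for example by a patched embedding.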
