---
library_name: transformers
pipeline_tag: text-generation
license: mit
base_model:
- ByteDance-Seed/Seed-Coder-8B-Base
---

# Seed-Coder-8B-Reasoning GGUF Models

## Model Generation Details

This model was generated using [llama.cpp](https://github.com/ggerganov/llama.cpp) at commit [`5787b5da`](https://github.com/ggerganov/llama.cpp/commit/5787b5da57e54dba760c2deeac1edf892e8fc450).

## Quantization Beyond the IMatrix

I have been testing a new quantization method that uses rules to bump important layers above the precision the standard imatrix would assign. I have found that the standard imatrix does not perform very well at low-bit quantization and for MoE models, so I am using the llama.cpp `--tensor-type` option to bump up selected layers. See [Layer bumping with llama.cpp](https://github.com/Mungert69/GGUFModelBuilder/blob/main/model-converter/tensor_list_builder.py).

This does create larger model files but increases precision for a given model size. A minimal sketch of the idea is shown after the format guide below.

### **Please provide feedback on how you find this method performs**

## **Choosing the Right Model Format**

Selecting the correct model format depends on your **hardware capabilities** and **memory constraints**.

### **BF16 (Brain Float 16) - Use if BF16 acceleration is available**

- A 16-bit floating-point format designed for **faster computation** while retaining good precision.
- Provides a **similar dynamic range** to FP32 but with **lower memory usage**.
- Recommended if your hardware supports **BF16 acceleration** (check your device's specs).
- Ideal for **high-performance inference** with a **reduced memory footprint** compared to FP32.

📌 **Use BF16 if:**
✔ Your hardware has native **BF16 support** (e.g., newer GPUs, TPUs).
✔ You want **higher precision** while saving memory.
✔ You plan to **requantize** the model into another format.

📌 **Avoid BF16 if:**
❌ Your hardware does **not** support BF16 (it may fall back to FP32 and run slower).
❌ You need compatibility with older devices that lack BF16 optimization.

---

### **F16 (Float 16) - More widely supported than BF16**

- A 16-bit floating-point format with **high precision** but a smaller range of values than BF16.
- Works on most devices with **FP16 acceleration support** (including many GPUs and some CPUs).
- Slightly lower numerical precision than BF16 but generally sufficient for inference.

📌 **Use F16 if:**
✔ Your hardware supports **FP16** but **not BF16**.
✔ You need a **balance between speed, memory usage, and accuracy**.
✔ You are running on a **GPU** or another device optimized for FP16 computations.

📌 **Avoid F16 if:**
❌ Your device lacks **native FP16 support** (it may run slower than expected).
❌ You have memory limitations.

---

### **Hybrid Precision Models (e.g., `bf16_q8_0`, `f16_q4_K`) - Best of Both Worlds**

These formats selectively **quantize non-essential layers** while keeping **key layers in full precision** (e.g., attention and output layers).

- Named like `bf16_q8_0` (meaning **full-precision BF16 core layers + quantized Q8_0 other layers**).
- Strike a **balance between memory efficiency and accuracy**, improving over fully quantized models without requiring the full memory of BF16/F16.

📌 **Use Hybrid Models if:**
✔ You need **better accuracy than quant-only models** but can't afford full BF16/F16 everywhere.
✔ Your device supports **mixed-precision inference**.
✔ You want to **optimize trade-offs** for production-grade models on constrained hardware.

📌 **Avoid Hybrid Models if:**
❌ Your target device doesn't support **mixed or full-precision acceleration**.
❌ You are operating under **ultra-strict memory limits** (in which case use fully quantized formats).

---
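For readers who want to experiment with the layer-bumping idea themselves, below is a minimal sketch of driving `llama-quantize` from Python with per-tensor overrides. The file names and the specific tensor patterns are placeholders for illustration, and the exact `--tensor-type` pattern syntax can vary between llama.cpp releases, so treat this as an outline of the approach rather than the exact recipe used to build these files.

```python
import subprocess

# Hypothetical paths -- substitute your own F16 source GGUF and output name.
SRC = "Seed-Coder-8B-Reasoning-f16.gguf"
DST = "Seed-Coder-8B-Reasoning-q4_k_m-bumped.gguf"

# Example overrides: bump layers that are often precision-sensitive to a
# higher quant type, while the base type (Q4_K_M) is applied to the rest.
# The pattern syntax follows llama-quantize's --tensor-type option and may
# differ between llama.cpp versions -- check `llama-quantize --help`.
overrides = [
    ("ffn_down", "q6_k"),  # FFN down-projections often benefit from extra bits
    ("attn_v", "q6_k"),    # value projections degrade noticeably at low bits
]

cmd = ["llama-quantize"]  # assumes the llama.cpp binary is on your PATH
for tensor, qtype in overrides:
    cmd += ["--tensor-type", f"{tensor}={qtype}"]
cmd += [SRC, DST, "Q4_K_M"]

subprocess.run(cmd, check=True)
```

The resulting file is larger than a plain Q4_K_M quant, since the bumped tensors carry more bits, which is exactly the size-for-precision trade described above.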
### **Quantized Models (Q4_K, Q6_K, Q8_0, etc.) - For CPU & Low-VRAM Inference**

Quantization reduces model size and memory usage while maintaining as much accuracy as possible.

- **Lower-bit models (Q4_K)** - **Best for minimal memory usage**, may have lower precision.
- **Higher-bit models (Q6_K, Q8_0)** - **Better accuracy**, requires more memory.

📌 **Use Quantized Models if:**
✔ You are running inference on a **CPU** and need an optimized model.
✔ Your device has **low VRAM** and cannot load full-precision models.
✔ You want to reduce **memory footprint** while keeping reasonable accuracy.

📌 **Avoid Quantized Models if:**
❌ You need **maximum accuracy** (full-precision models are better for this).
❌ Your hardware has enough VRAM for higher-precision formats (BF16/F16).

---

### **Very Low-Bit Quantization (IQ3_XS, IQ3_S, IQ3_M, Q4_K, Q4_0)**

These models are optimized for **very high memory efficiency**, making them ideal for **low-power devices** or **large-scale deployments** where memory is a critical constraint.

- **IQ3_XS**: Ultra-low-bit quantization (3-bit) with **very high memory efficiency**.
  - **Use case**: Best for **ultra-low-memory devices** where even Q4_K is too large.
  - **Trade-off**: Lower accuracy compared to higher-bit quantizations.

- **IQ3_S**: Small block size for **maximum memory efficiency**.
  - **Use case**: Best for **low-memory devices** where **IQ3_XS** is too aggressive.

- **IQ3_M**: Medium block size for better accuracy than **IQ3_S**.
  - **Use case**: Suitable for **low-memory devices** where **IQ3_S** is too limiting.

- **Q4_K**: 4-bit quantization with **block-wise optimization** for better accuracy.
  - **Use case**: Best for **low-memory devices** where **Q6_K** is too large.

- **Q4_0**: Pure 4-bit quantization, optimized for **ARM devices**.
  - **Use case**: Best for **ARM-based devices** or **low-memory environments**.

### **Ultra Low-Bit Quantization (IQ1_S, IQ1_M, IQ2_S, IQ2_M, IQ2_XS, IQ2_XXS)**

- Ultra-low-bit quantization (1-2 bit) with **extreme memory efficiency**.
- **Use case**: Best when you must fit the model into very constrained memory.
- **Trade-off**: Very low accuracy. Models may not function as expected. Please test fully before using.
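A quick way to reason about which of these formats fits your memory budget is to estimate file size from parameter count and bits per weight. The sketch below does exactly that; the bits-per-weight figures are rough ballpark averages I am assuming for illustration, not exact llama.cpp numbers, and real GGUF files carry some extra overhead for embeddings and metadata.

```python
# Rough memory estimator for picking a quant format for an ~8B model.
# Bits-per-weight values are approximate averages, not exact llama.cpp figures.
APPROX_BPW = {
    "BF16": 16.0,
    "Q8_0": 8.5,
    "Q6_K": 6.6,
    "Q4_K": 4.5,
    "IQ3_M": 3.7,
    "IQ2_M": 2.7,
}

def approx_size_gb(n_params: float, bpw: float) -> float:
    """Estimated weight size in GB: params * bits / 8, ignoring metadata."""
    return n_params * bpw / 8 / 1e9

N_PARAMS = 8.2e9  # Seed-Coder-8B scale (approximate)
for fmt, bpw in APPROX_BPW.items():
    print(f"{fmt:>6}: ~{approx_size_gb(N_PARAMS, bpw):.1f} GB")
```

For example, at roughly 4.5 bits per weight an 8B model lands near 4.6 GB of weights, which is why Q4_K is the usual starting point for low-VRAM devices.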
---

### **Summary Table: Model Format Selection**

| Model Format | Precision | Memory Usage | Device Requirements | Best Use Case |
|--------------|-----------|--------------|---------------------|---------------|
| **BF16** | Very High | High | BF16-supported GPU/CPU | High-speed inference with reduced memory |
| **F16** | High | High | FP16-supported GPU/CPU | Inference when BF16 isn't available |
| **Q4_K** | Medium-Low | Low | CPU or low-VRAM devices | Memory-constrained inference |
| **Q6_K** | Medium | Moderate | CPU with more memory | Better accuracy with quantization |
| **Q8_0** | High | Moderate | GPU/CPU with moderate VRAM | Highest accuracy among quantized models |
| **IQ3_XS** | Low | Very Low | Ultra-low-memory devices | Maximum memory efficiency, low accuracy |
| **IQ3_S** | Low | Very Low | Low-memory devices | Slightly more usable than IQ3_XS |
| **IQ3_M** | Low-Medium | Low | Low-memory devices | Better accuracy than IQ3_S |
| **Q4_0** | Low | Low | ARM-based/embedded devices | Llama.cpp automatically optimizes for ARM inference |
| **Ultra Low-Bit (IQ1/2_*)** | Very Low | Extremely Low | Tiny edge/embedded devices | Fit models in extremely tight memory; low accuracy |
| **Hybrid (e.g., `bf16_q8_0`)** | Medium-High | Medium | Mixed-precision capable hardware | Balanced performance and memory, near-FP accuracy in critical layers |

---
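Whichever format you pick, the GGUF files in this repo run with any llama.cpp-based frontend. As one example, here is a minimal sketch using the `llama-cpp-python` bindings; the model file name is a placeholder for whichever quant you actually download, and `n_gpu_layers` should be tuned to your available VRAM.

```python
from llama_cpp import Llama

# Placeholder file name -- substitute the quant you downloaded from this repo.
llm = Llama(
    model_path="Seed-Coder-8B-Reasoning-Q4_K_M.gguf",
    n_ctx=4096,       # context window; raise it if you have memory to spare
    n_gpu_layers=-1,  # offload all layers to GPU; set to 0 for CPU-only
)

out = llm.create_chat_completion(
    messages=[
        {"role": "user",
         "content": "Write a Python function that checks if a number is prime."}
    ],
    max_tokens=512,
)
print(out["choices"][0]["message"]["content"])
```

On CPU-only machines, dropping `n_gpu_layers` to 0 and choosing a Q4_K or lower quant from the table above is usually the practical combination.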
# Seed-Coder-8B-Reasoning

## Introduction

We are thrilled to introduce Seed-Coder, a powerful, transparent, and parameter-efficient family of open-source code models at the 8B scale, featuring base, instruct, and reasoning variants. Seed-Coder contributes to promoting the evolution of open code models through the following highlights:

- **Model-centric:** Seed-Coder predominantly leverages LLMs instead of hand-crafted rules for code data filtering, minimizing manual effort in pretraining data construction.
- **Transparent:** We openly share detailed insights into our model-centric data pipeline, including methods for curating GitHub data, commits data, and code-related web data.
- **Powerful:** Seed-Coder achieves state-of-the-art performance among open-source models of comparable size across a diverse range of coding tasks.