| # Optimum.Intel | |
| ## Docs | |
| - [Installation](https://huggingface.co/docs/optimum.intel/pr_1714/installation.md) | |
| - [🤗 Optimum Intel](https://huggingface.co/docs/optimum.intel/pr_1714/index.md) | |
| - [Supported models](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/models.md) | |
| - [Optimization](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/optimization.md) | |
| - [Models](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/reference.md) | |
| - [Export your model](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/export.md) | |
| - [Inference](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/inference.md) | |
| - [Generate images with Diffusion models](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/tutorials/diffusers.md) | |
| - [Notebooks](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/tutorials/notebooks.md) | |
| ### Installation | |
| https://huggingface.co/docs/optimum.intel/pr_1714/installation.md | |
| # Installation | |
| To install the latest release of 🤗 Optimum Intel with its required dependencies, run: | |
| ```bash | |
| python -m pip install --upgrade-strategy eager "optimum-intel[openvino]" | |
| ``` | |
| The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version. | |
| We recommend creating a [virtual environment](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#creating-a-virtual-environment) and upgrading pip with: | |
| ```bash | |
| python -m pip install --upgrade pip | |
| ``` | |
| Optimum Intel is a fast-moving project, and you may want to install from source with the following command: | |
| ```bash | |
| python -m pip install "optimum-intel[openvino]"@git+https://github.com/huggingface/optimum-intel.git | |
| ``` | |
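To confirm which release ended up in the environment after either install command, a small check like the following can help. This is a minimal sketch using only the Python standard library; the helper name `installed_version` is an illustrative choice and not part of Optimum Intel.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string when the package is installed, otherwise None.
print(installed_version("optimum-intel"))
```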
| ### 🤗 Optimum Intel | |
| https://huggingface.co/docs/optimum.intel/pr_1714/index.md | |
| # 🤗 Optimum Intel | |
| 🤗 Optimum Intel is the interface between the 🤗 Transformers, Diffusers, Sentence Transformers and timm libraries and the different tools and libraries provided by [OpenVINO](https://docs.openvino.ai) to accelerate end-to-end pipelines on Intel architectures. | |
| [OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high-performance inference on Intel CPUs, GPUs, and special DL inference accelerators (see [the full list of supported devices](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html)). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime. | |
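The convert-then-infer workflow described above might look roughly like the sketch below. It assumes `optimum-intel[openvino]` is installed; the `gpt2` model id, the prompt, and the helper name `run_openvino_generation` are placeholders chosen for illustration. The library imports sit inside the function so the module can be loaded without them.

```python
def run_openvino_generation(model_id: str = "gpt2", prompt: str = "Hello") -> str:
    """Convert a checkpoint to OpenVINO IR and generate a short continuation."""
    if not prompt:
        raise ValueError("prompt must be non-empty")

    # Imported lazily: these need transformers and optimum-intel installed.
    from transformers import AutoTokenizer
    from optimum.intel import OVModelForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # export=True asks Optimum Intel to convert the checkpoint to OpenVINO IR.
    model = OVModelForCausalLM.from_pretrained(model_id, export=True)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=20)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `run_openvino_generation()` downloads the placeholder checkpoint on first use, so expect network access and some conversion time.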
| ### Supported models | |
| https://huggingface.co/docs/optimum.intel/pr_1714/openvino/models.md | |
| # Supported models | |
| 🤗 Optimum handles the export of models to OpenVINO in the `exporters.openvino` module. It provides classes, functions, and a command line interface to perform the export easily. | |
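As an illustration of the command line interface, the sketch below assembles an `optimum-cli export openvino` invocation from Python. The `-m`, `--task`, and `--weight-format` flags are the ones used in the commands elsewhere in this documentation; the helper name `build_export_command` is hypothetical, and the export itself is left commented out.

```python
import shlex


def build_export_command(model_id, output_dir, weight_format=None, task=None):
    """Build the argument list for an `optimum-cli export openvino` call."""
    cmd = ["optimum-cli", "export", "openvino", "-m", model_id]
    if task:
        cmd += ["--task", task]
    if weight_format:
        cmd += ["--weight-format", weight_format]
    cmd.append(output_dir)
    return cmd


cmd = build_export_command(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", "./save_dir", weight_format="int8"
)
print(shlex.join(cmd))

# To actually run the export (requires optimum-intel to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```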
| Here is the list of the supported architectures: | |
| ## [Transformers](https://huggingface.co/docs/transformers/index) | |
| - AFMoE (aka Arcee Trinity) | |
| - ALBERT | |
| - Aquila | |
| - Aquila 2 | |
| - Arcee | |
| - Arctic | |
| - Audio Spectrogram Transformer | |
| - Baichuan 2 | |
| - BART | |
| - BEiT | |
| - BERT | |
| - BigBirdPegasus | |
| - BioGPT | |
| - BitNet | |
| - BlenderBot | |
| - Blenderbot Small | |
| - BLOOM | |
| - CLIP | |
| - CamemBERT | |
| - ChatGLM (ChatGLM2, ChatGLM3, GLM4) | |
| - CodeGen | |
| - CodeGen2 | |
| - Cohere | |
| - Cohere2 | |
| - ConvBERT | |
| - ConvNeXt | |
| - DBRX | |
| - Data2VecAudio | |
| - Data2VecText | |
| - Data2VecVision | |
| - DeBERTa | |
| - DeBERTa-v2 | |
| - DeciLM | |
| - DeiT | |
| - DeepSeek | |
| - DeepSeek-V2 | |
| - DeepSeek-V3 | |
| - DistilBERT | |
| - ERNIE 4.5 | |
| - ELECTRA | |
| - Encoder Decoder | |
| - ESM | |
| - EXAONE | |
| - EXAONE 4 | |
| - Falcon | |
| - Falcon-Mamba | |
| - FlauBERT | |
| - GLM-4 | |
| - GLM-Edge | |
| - GPT-2 | |
| - GPT-BigCode | |
| - GPT-J | |
| - GPT-Neo | |
| - GPT-NeoX | |
| - GPT-NeoX-Japanese | |
| - GPT-OSS | |
| - Gemma | |
| - Gemma 2 | |
| - Gemma 3 | |
| - Gemma 4 | |
| - GOT-OCR 2.0 | |
| - Granite | |
| - Granite 4.0 | |
| - GraniteMoE | |
| - HuBERT | |
| - HunYuan V1 Dense | |
| - I-BERT | |
| - Idefics3 | |
| - InternLM | |
| - InternLM2 | |
| - InternVL2 | |
| - Jais | |
| - LeViT | |
| - LFM2 | |
| - LFM2-MoE | |
| - LLaMA | |
| - LLaMA 4 | |
| - LLaVa | |
| - LLaVa-NeXT | |
| - LLaVa-NeXT-Video | |
| - LLaVa-Qwen2 (NanoLLaVa) | |
| - LongT5 | |
| - M2M-100 | |
| - MAIRA-2 | |
| - Mamba | |
| - mBART | |
| - MPNet | |
| - MPT | |
| - mT5 | |
| - MarianMT | |
| - MiniCPM | |
| - MiniCPM3 | |
| - MiniCPM-o | |
| - MiniCPM-V | |
| - Mistral | |
| - Mixtral | |
| - MobileBERT | |
| - MobileNet v1 | |
| - MobileNet v2 | |
| - MobileViT | |
| - Nystromformer | |
| - OLMo | |
| - OLMo 2 | |
| - OPT | |
| - Orion | |
| - Pegasus | |
| - Perceiver | |
| - Persimmon | |
| - Phi | |
| - Phi-3 | |
| - Phi-3.5-MoE | |
| - Phi-3 Vision | |
| - Phi-4 Multimodal | |
| - Pix2Struct | |
| - PoolFormer | |
| - Qwen | |
| - Qwen2 (Qwen1.5, Qwen2.5) | |
| - Qwen2MoE | |
| - Qwen2-VL | |
| - Qwen2.5-VL | |
| - Qwen3 | |
| - Qwen3MoE | |
| - Qwen3-VL | |
| - Qwen3.5 | |
| - Qwen3.5-MoE | |
| - Qwen3.6 | |
| - Qwen3-Next | |
| - RemBERT | |
| - ResNet | |
| - RoBERTa | |
| - RoFormer | |
| - SAM | |
| - SEW | |
| - SEW-D | |
| - SegFormer | |
| - SigLIP | |
| - SmolVLM (SmolVLM2) | |
| - SpeechT5 (text-to-speech) | |
| - SqueezeBERT | |
| - StableLM | |
| - StarCoder2 | |
| - Swin | |
| - T5 | |
| - TrOCR | |
| - UniSpeech | |
| - UniSpeech-SAT | |
| - Vision Encoder Decoder | |
| - ViT | |
| - Wav2Vec2 | |
| - Wav2Vec2-Conformer | |
| - WavLM | |
| - Whisper | |
| - XGLM | |
| - XLM | |
| - XLM-RoBERTa | |
| - XVERSE | |
| - Zamba2 | |
| ## [Diffusers](https://huggingface.co/docs/diffusers/index) | |
| - Stable Diffusion | |
| - Stable Diffusion XL | |
| - Latent Consistency | |
| - Stable Diffusion 3 | |
| - Flux | |
| - Sana | |
| - SanaSprint | |
| - LTX | |
| ## [Timm](https://huggingface.co/docs/timm/index) | |
| - PiT | |
| - ViT | |
| ## [Sentence Transformers](https://github.com/UKPLab/sentence-transformers) | |
| - All Transformer and CLIP-based models. | |
| ## [OpenCLIP](https://github.com/mlfoundations/open_clip) | |
| - All CLIP-based models | |
| ### Optimization | |
| https://huggingface.co/docs/optimum.intel/pr_1714/openvino/optimization.md | |
| # Optimization | |
| 🤗 Optimum Intel provides an `openvino` package that enables you to apply a variety of model quantization methods on many models hosted on the 🤗 hub using the [NNCF](https://docs.openvino.ai/2024/openvino-workflow/model-optimization.html) framework. | |
| Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and/or the activations with lower precision data types like 8-bit or 4-bit. | |
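To make the idea concrete, here is a toy sketch of symmetric 8-bit quantization applied to a handful of weights in plain Python. It only illustrates the arithmetic of mapping floats to low-precision integers and back; it is not how NNCF implements quantization, and the function names are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer codes."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored value is within half a quantization step of the original.
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Each weight is stored as one small integer plus a shared scale, which is where the memory saving comes from; the price is the rounding error visible in the printed pairs.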
| ## Optimization Support Matrix | |
| Each task below is listed with its OV model class, followed by the CLI and Python commands for each optimization case: weight-only quantization (data-free and data-aware), hybrid quantization, full quantization, and mixed quantization. | |
| text-generation (OVModelForCausalLM): | |
| - Weight-only quantization, data-free, CLI: `optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int8 ./save_dir` | |
| - Weight-only quantization, data-free, Python: `OVModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained('save_dir')` | |
| - Weight-only quantization, data-aware, CLI: `optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --dataset wikitext2 ./save_dir` | |
| - Weight-only quantization, data-aware, Python: `OVModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0', quantization_config=OVWeightQuantizationConfig(bits=4, dataset='wikitext2')).save_pretrained('save_dir')` | |
| - Hybrid quantization: no command listed | |
| - Full quantization, CLI: `optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode int8 --dataset wikitext2 ./save_dir` | |
| - Full quantization, Python: `OVModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0', quantization_config=OVQuantizationConfig(bits=8, dataset='wikitext2')).save_pretrained('save_dir')` | |
| - Mixed quantization, CLI: `optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir` | |
| - Mixed quantization, Python: `OVModelForCausalLM.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype='cb4'), OVQuantizationConfig(dtype='f8e4m3', dataset='wikitext2'))).save_pretrained('save_dir')` | |
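The Python one-liners for text generation omit their imports. A fuller sketch, assuming `optimum-intel[openvino]` is installed, could look like the following. The config arguments mirror the commands listed for this task; the helper names `build_quantization_config` and `quantize_and_save` and the `kind` labels are hypothetical conveniences.

```python
SUPPORTED_KINDS = ("weight-only", "full", "mixed")


def build_quantization_config(kind: str, dataset: str = "wikitext2"):
    """Return the quantization config matching one optimization case."""
    if kind not in SUPPORTED_KINDS:
        raise ValueError(f"kind must be one of {SUPPORTED_KINDS}, got {kind!r}")

    # Imported lazily so the check above works without optimum-intel installed.
    from optimum.intel import (
        OVMixedQuantizationConfig,
        OVQuantizationConfig,
        OVWeightQuantizationConfig,
    )

    if kind == "weight-only":
        # Data-aware 4-bit weight compression.
        return OVWeightQuantizationConfig(bits=4, dataset=dataset)
    if kind == "full":
        # 8-bit quantization of weights and activations.
        return OVQuantizationConfig(bits=8, dataset=dataset)
    # Mixed: one config for the weights, another for full quantization.
    return OVMixedQuantizationConfig(
        OVWeightQuantizationConfig(bits=4, dtype="cb4"),
        OVQuantizationConfig(dtype="f8e4m3", dataset=dataset),
    )


def quantize_and_save(model_id: str, save_dir: str, kind: str = "weight-only"):
    """Load, quantize, and save a causal language model."""
    config = build_quantization_config(kind)
    from optimum.intel import OVModelForCausalLM

    model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=config)
    model.save_pretrained(save_dir)
    return save_dir
```

For example, `quantize_and_save("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "save_dir", kind="full")` would correspond to the full quantization case; the calibration dataset is downloaded at that point.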
| image-text-to-text (OVModelForVisualCausalLM): | |
| - Weight-only quantization, data-free, CLI: `optimum-cli export openvino --task image-text-to-text -m OpenGVLab/InternVL2-1B --trust-remote-code --weight-format int4 ./save_dir` | |
| - Weight-only quantization, data-free, Python: `OVModelForVisualCausalLM.from_pretrained('OpenGVLab/InternVL2-1B', trust_remote_code=True, quantization_config=OVWeightQuantizationConfig(bits=4)).save_pretrained('save_dir')` | |
| - Weight-only quantization, data-aware, CLI: `optimum-cli export openvino --task image-text-to-text -m OpenGVLab/InternVL2-1B --trust-remote-code --weight-format int4 --dataset contextual ./save_dir` | |
| - Weight-only quantization, data-aware, Python: `OVModelForVisualCausalLM.from_pretrained('OpenGVLab/InternVL2-1B', trust_remote_code=True, quantization_config=OVWeightQuantizationConfig(bits=4, dataset='contextual')).save_pretrained('save_dir')` | |
| - Hybrid quantization: no command listed | |
| - Full quantization, CLI: `optimum-cli export openvino --task image-text-to-text -m OpenGVLab/InternVL2-1B --trust-remote-code --quant-mode int8 --dataset contextual ./save_dir` | |
| - Full quantization, Python: `OVModelForVisualCausalLM.from_pretrained('OpenGVLab/InternVL2-1B', trust_remote_code=True, quantization_config=OVQuantizationConfig(bits=8, dataset='contextual')).save_pretrained('save_dir')` | |
| - Mixed quantization: no command listed | |
| text-to-image, text-to-video (OVDiffusionPipeline): | |
| - Weight-only quantization, data-free, CLI: `optimum-cli export openvino -m dreamlike-art/dreamlike-anime-1.0 --weight-format int8 ./save_dir` | |
| - Weight-only quantization, data-free, Python: `OVDiffusionPipeline.from_pretrained('dreamlike-art/dreamlike-anime-1.0', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained('save_dir')` | |
| - Weight-only quantization, data-aware: no command listed | |
| - Hybrid quantization, CLI: `optimum-cli export openvino -m dreamlike-art/dreamlike-anime-1.0 --weight-format int8 --dataset conceptual_captions ./save_dir` | |
| - Hybrid quantization, Python: `OVDiffusionPipeline.from_pretrained('dreamlike-art/dreamlike-anime-1.0', quantization_config=OVWeightQuantizationConfig(bits=8, quant_method='hybrid', dataset='conceptual_captions')).save_pretrained('save_dir')` | |
| - Full quantization, CLI: `optimum-cli export openvino -m dreamlike-art/dreamlike-anime-1.0 --quant-mode int8 --dataset conceptual_captions ./save_dir` | |
| - Full quantization, Python: `OVDiffusionPipeline.from_pretrained('dreamlike-art/dreamlike-anime-1.0', quantization_config=OVQuantizationConfig(bits=8, dataset='conceptual_captions')).save_pretrained('save_dir')` | |
| - Mixed quantization: no command listed | |
| automatic-speech-recognition (OVModelForSpeechSeq2Seq): | |
| - Weight-only quantization, data-free: no command listed | |
| - Weight-only quantization, data-aware: no command listed | |
| - Hybrid quantization: no command listed | |
| - Full quantization, CLI: `optimum-cli export openvino -m openai/whisper-large-v3-turbo --quant-mode int8 --dataset librispeech --num-samples 10 ./save_dir` | |
| - Full quantization, Python: `OVModelForSpeechSeq2Seq.from_pretrained('openai/whisper-large-v3-turbo', quantization_config=OVQuantizationConfig(bits=8, dataset='librispeech', num_samples=10)).save_pretrained('save_dir')` | |
| - Mixed quantization: no command listed | |
| feature-extraction (OVModelForFeatureExtraction): | |
| - Weight-only quantization, data-free, CLI: `optimum-cli export openvino -m microsoft/codebert-base --weight-format int8 ./save_dir` | |
| - Weight-only quantization, data-free, Python: `OVModelForFeatureExtraction.from_pretrained('microsoft/codebert-base', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained('save_dir')` | |
| - Weight-only quantization, data-aware, CLI: `optimum-cli export openvino -m microsoft/codebert-base --weight-format int4 --dataset wikitext2 ./save_dir` | |
| - Weight-only quantization, data-aware, Python: `OVModelForFeatureExtraction.from_pretrained('microsoft/codebert-base', quantization_config=OVWeightQuantizationConfig(bits=4, dataset='wikitext2')).save_pretrained('save_dir')` | |
| - Hybrid quantization: no command listed | |
| - Full quantization, CLI: `optimum-cli export openvino -m microsoft/codebert-base --quant-mode int8 --dataset wikitext2 ./save_dir` | |
| - Full quantization, Python: `OVModelForFeatureExtraction.from_pretrained('microsoft/codebert-base', quantization_config=OVQuantizationConfig(bits=8, dataset='wikitext2')).save_pretrained('save_dir')` | |
| - Mixed quantization, CLI: `optimum-cli export openvino -m microsoft/codebert-base --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir` | |
| - Mixed quantization, Python: `OVModelForFeatureExtraction.from_pretrained('microsoft/codebert-base', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype='cb4'), OVQuantizationConfig(dtype='f8e4m3', dataset='wikitext2'))).save_pretrained('save_dir')` | |
| feature-extraction (OVSentenceTransformer): | |
| - Weight-only quantization, data-free, CLI: `optimum-cli export openvino --library sentence_transformers -m sentence-transformers/all-mpnet-base-v2 --weight-format int8 ./save_dir` | |
| - Weight-only quantization, data-free, Python: `OVSentenceTransformer.from_pretrained('sentence-transformers/all-mpnet-base-v2', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained('save_dir')` | |
| - Weight-only quantization, data-aware, CLI: `optimum-cli export openvino --library sentence_transformers -m sentence-transformers/all-mpnet-base-v2 --weight-format int4 --dataset wikitext2 ./save_dir` | |
| - Weight-only quantization, data-aware, Python: `OVSentenceTransformer.from_pretrained('sentence-transformers/all-mpnet-base-v2', quantization_config=OVWeightQuantizationConfig(bits=4, dataset='wikitext2')).save_pretrained('save_dir')` | |
| - Hybrid quantization: no command listed | |
| - Full quantization, CLI: `optimum-cli export openvino --library sentence_transformers -m sentence-transformers/all-mpnet-base-v2 --quant-mode int8 --dataset wikitext2 ./save_dir` | |
| - Full quantization, Python: `OVSentenceTransformer.from_pretrained('sentence-transformers/all-mpnet-base-v2', quantization_config=OVQuantizationConfig(bits=8, dataset='wikitext2')).save_pretrained('save_dir')` | |
| - Mixed quantization, CLI: `optimum-cli export openvino --library sentence_transformers -m sentence-transformers/all-mpnet-base-v2 --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir` | |
| - Mixed quantization, Python: `OVSentenceTransformer.from_pretrained('sentence-transformers/all-mpnet-base-v2', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype='cb4'), OVQuantizationConfig(dtype='f8e4m3', dataset='wikitext2'))).save_pretrained('save_dir')` | |
| fill-mask (OVModelForMaskedLM): | |
| - Weight-only quantization, data-free, CLI: `optimum-cli export openvino -m FacebookAI/roberta-base --weight-format int8 ./save_dir` | |
| - Weight-only quantization, data-free, Python: `OVModelForMaskedLM.from_pretrained('FacebookAI/roberta-base', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained('save_dir')` | |
| - Weight-only quantization, data-aware, CLI: `optimum-cli export openvino -m FacebookAI/roberta-base --weight-format int4 --dataset wikitext2 ./save_dir` | |
| - Weight-only quantization, data-aware, Python: `OVModelForMaskedLM.from_pretrained('FacebookAI/roberta-base', quantization_config=OVWeightQuantizationConfig(bits=4, dataset='wikitext2')).save_pretrained('save_dir')` | |
| - Hybrid quantization: no command listed | |
| - Full quantization, CLI: `optimum-cli export openvino -m FacebookAI/roberta-base --quant-mode int8 --dataset wikitext2 ./save_dir` | |
| - Full quantization, Python: `OVModelForMaskedLM.from_pretrained('FacebookAI/roberta-base', quantization_config=OVQuantizationConfig(bits=8, dataset='wikitext2')).save_pretrained('save_dir')` | |
| - Mixed quantization, CLI: `optimum-cli export openvino -m FacebookAI/roberta-base --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir` | |
| - Mixed quantization, Python: `OVModelForMaskedLM.from_pretrained('FacebookAI/roberta-base', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype='cb4'), OVQuantizationConfig(dtype='f8e4m3', dataset='wikitext2'))).save_pretrained('save_dir')` | |
| text2text-generation (OVModelForSeq2SeqLM): | |
| - Weight-only quantization, data-free, CLI: `optimum-cli export openvino -m google-t5/t5-small --weight-format int8 ./save_dir` | |
| - Weight-only quantization, data-free, Python: `OVModelForSeq2SeqLM.from_pretrained('google-t5/t5-small', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained('save_dir')` | |
| - Weight-only quantization, data-aware, CLI: `optimum-cli export openvino -m google-t5/t5-small --weight-format int4 --dataset wikitext2 ./save_dir` | |
| - Weight-only quantization, data-aware, Python: `OVModelForSeq2SeqLM.from_pretrained('google-t5/t5-small', quantization_config=OVWeightQuantizationConfig(bits=4, dataset='wikitext2')).save_pretrained('save_dir')` | |
| - Hybrid quantization: no command listed | |
| - Full quantization, CLI: `optimum-cli export openvino -m google-t5/t5-small --quant-mode int8 --dataset wikitext2 --smooth-quant-alpha -1 ./save_dir` | |
| - Full quantization, Python: `OVModelForSeq2SeqLM.from_pretrained('google-t5/t5-small', quantization_config=OVQuantizationConfig(bits=8, dataset='wikitext2', smooth_quant_alpha=-1)).save_pretrained('save_dir')` | |
| - Mixed quantization, CLI: `optimum-cli export openvino -m google-t5/t5-small --quant-mode cb4_f8e4m3 --dataset wikitext2 --smooth-quant-alpha -1 ./save_dir` | |
| - Mixed quantization, Python: `OVModelForSeq2SeqLM.from_pretrained('google-t5/t5-small', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype='cb4'), OVQuantizationConfig(dtype='f8e4m3', dataset='wikitext2', smooth_quant_alpha=-1))).save_pretrained('save_dir')` | |
| zero-shot-image-classification (OVModelForZeroShotImageClassification): | |
| - Weight-only quantization, data-free, CLI: `optimum-cli export openvino -m openai/clip-vit-base-patch16 --weight-format int8 ./save_dir` | |
| - Weight-only quantization, data-free, Python: `OVModelForZeroShotImageClassification.from_pretrained('openai/clip-vit-base-patch16', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained('save_dir')` | |
| - Weight-only quantization, data-aware, CLI: `optimum-cli export openvino -m openai/clip-vit-base-patch16 --weight-format int4 --dataset conceptual_captions ./save_dir` | |
| - Weight-only quantization, data-aware, Python: `OVModelForZeroShotImageClassification.from_pretrained('openai/clip-vit-base-patch16', quantization_config=OVWeightQuantizationConfig(bits=4, dataset='conceptual_captions')).save_pretrained('save_dir')` | |
| - Hybrid quantization: no command listed | |
| - Full quantization, CLI: `optimum-cli export openvino -m openai/clip-vit-base-patch16 --quant-mode int8 --dataset conceptual_captions ./save_dir` | |
| - Full quantization, Python: `OVModelForZeroShotImageClassification.from_pretrained('openai/clip-vit-base-patch16', quantization_config=OVQuantizationConfig(bits=8, dataset='conceptual_captions')).save_pretrained('save_dir')` | |
| - Mixed quantization, CLI: `optimum-cli export openvino -m openai/clip-vit-base-patch16 --quant-mode cb4_f8e4m3 --dataset conceptual_captions ./save_dir` | |
| - Mixed quantization, Python: `OVModelForZeroShotImageClassification.from_pretrained('openai/clip-vit-base-patch16', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype='cb4'), OVQuantizationConfig(dtype='f8e4m3', dataset='conceptual_captions'))).save_pretrained('save_dir')` | |
| feature-extraction(OVSamModel) | |
| โ | |
| Python: `OVSamModel.from_pretrained('facebook/sam-vit-base', quantization_config=OVPipelineQuantizationConfig(quantization_configs=dict(vision_encoder=OVWeightQuantizationConfig(bits=8)))).save_pretrained('save_dir')` | |
| โ | |
| Python: `OVSamModel.from_pretrained('facebook/sam-vit-base', quantization_config=OVPipelineQuantizationConfig(quantization_configs=dict(vision_encoder=OVWeightQuantizationConfig(bits=4, dataset='coco')))).save_pretrained('save_dir')` | |
| โ | |
| - | |
| CLI: `optimum-cli export openvino -m facebook/sam-vit-base --quant-mode int8 --dataset coco ./save_dir` | |
| Python: `OVSamModel.from_pretrained('facebook/sam-vit-base', quantization_config=OVPipelineQuantizationConfig(quantization_configs=dict(vision_encoder=OVQuantizationConfig(bits=8, dataset='coco')))).save_pretrained('save_dir')` | |
| โ | |
| โ | |
| text-to-audio(OVModelForTextToSpeechSeq2Seq) | |
| โ | |
| Python: `OVModelForTextToSpeechSeq2Seq.from_pretrained('microsoft/speecht5_tts', vocoder='microsoft/speecht5_hifigan', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained('save_dir')` | |
| โ | |
| โ | |
| โ | |
| โ | |
| โ | |
| โ | |
| โ | |
| โ | |
| ## Weight-only Quantization | |
| Weight-only quantization can be applied to the model's Linear, Convolutional and Embedding layers, which makes it possible to load large models on memory-limited devices. For example, with 8-bit quantization the resulting model is about 4x smaller than its fp32 counterpart. With 4-bit quantization the reduction in memory could theoretically reach 8x, but is closer to 6x in practice. | |
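As a rough illustration of where these ratios come from, the sketch below computes the average number of bits stored per weight. The group size of 128, the 16-bit scale and 4-bit zero point per group, and the 80% share of int4 weights are assumptions chosen for the example, not values taken from the library. Under these assumptions, per-group metadata and the weights kept in 8-bit are what pull the reduction below the theoretical 8x.

```python
def bits_per_weight(int4_ratio, group_size, scale_bits=16, zero_point_bits=4):
    """Average storage cost of one weight, in bits (illustrative model only)."""
    # Every group of 4-bit weights also stores one scale and one zero point.
    int4_bits = 4 + (scale_bits + zero_point_bits) / group_size
    # Weights kept in 8-bit: their (much smaller) metadata overhead is ignored here.
    int8_bits = 8
    return int4_ratio * int4_bits + (1 - int4_ratio) * int8_bits


def compression_vs_fp32(avg_bits):
    """How many times smaller the weights are compared with 32-bit floats."""
    return 32 / avg_bits


pure_int8 = compression_vs_fp32(8)
ideal_int4 = compression_vs_fp32(4)
mixed_int4 = compression_vs_fp32(bits_per_weight(int4_ratio=0.8, group_size=128))

print(f"int8: {pure_int8:.1f}x, ideal int4: {ideal_int4:.1f}x, mixed int4: {mixed_int4:.1f}x")
```

The real figure depends on the model architecture and on the quantization parameters actually used.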
| ### 8-bit | |
| For 8-bit weight quantization, pass `quantization_config=OVWeightQuantizationConfig(bits=8)` to load your model's weights in 8-bit: | |
| ```python | |
| from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig | |
| model_id = "helenai/gpt2-ov" | |
| quantization_config = OVWeightQuantizationConfig(bits=8) | |
| model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config) | |
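| # Example output directory (any writable path works) | |
| saving_directory = "gpt2-ov-int8" | |
| # Saves the int8 model, which will be about 4x smaller than its fp32 counterpart | |
| model.save_pretrained(saving_directory) | |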
| ``` | |
| Weights of language models inside vision-language pipelines can be quantized in a similar way: | |
| ```python | |
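| from optimum.intel import OVModelForVisualCausalLM | |
| # Reuses the 8-bit `quantization_config` defined above | |
| model = OVModelForVisualCausalLM.from_pretrained( | |
|     "llava-hf/llava-v1.6-mistral-7b-hf", | |
|     quantization_config=quantization_config, | |
| ) | |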
| ``` | |
| If `quantization_config` is not provided, a model with more than 1 billion parameters is exported in 8-bit by default. You can disable this behavior with `load_in_8bit=False`. | |
| ### 4-bit | |
| 4-bit weight quantization can be achieved in a similar way: | |
| ```python | |
| from optimum.intel import OVModelForCausalLM | |
| model = OVModelForCausalLM.from_pretrained(model_id, quantization_config={"bits": 4}) | |
| ``` | |
| For some models, we provide preconfigured 4-bit weight-only quantization [configurations](https://github.com/huggingface/optimum-intel/blob/main/optimum/intel/openvino/configuration.py) that offer a good trade-off between quality and speed. This default 4-bit configuration is applied automatically when you specify `quantization_config={"bits": 4}`. | |
| The same shorthand works for vision-language pipelines: | |
| ```python | |
| from optimum.intel import OVModelForVisualCausalLM | |
| model = OVModelForVisualCausalLM.from_pretrained( | |
|     "llava-hf/llava-v1.6-mistral-7b-hf", | |
|     quantization_config={"bits": 4}, | |
| ) | |
| ``` | |
| You can tune the quantization parameters to achieve a better trade-off between performance and accuracy, as follows: | |
| ```python | |
| from optimum.intel import OVWeightQuantizationConfig | |
| quantization_config = OVWeightQuantizationConfig( | |
| bits=4, | |
| sym=False, | |
| ratio=0.8, | |
| quant_method="awq", | |
| dataset="wikitext2" | |
| ) | |
| ``` | |
| Note: `OVWeightQuantizationConfig` also accepts keyword arguments that are not listed in its constructor. Such arguments are passed directly to the `nncf.compress_weights()` call, which is useful for passing additional parameters to the quantization algorithm. | |
| By default the quantization scheme is [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization). To make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization), add `sym=True`. | |
| For 4-bit quantization you can also specify the following arguments in the quantization configuration: | |
| * The `group_size` parameter defines the group size to use for quantization; `-1` results in per-column quantization. | |
| * The `ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, 90% of the layers are quantized to `int4` while the remaining 10% are quantized to `int8`. | |
| Smaller `group_size` and `ratio` values usually improve accuracy at the cost of model size and inference latency. | |
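To make the effect of `group_size` and `sym` more concrete, here is a minimal pure-Python sketch of group-wise 4-bit quantization applied to a single weight column. It is a simplified min/max scheme written for illustration only: the helper names, the synthetic weights and the exact rounding are assumptions, and NNCF's actual implementation differs.

```python
import random


def quantize_group(values, bits=4, sym=False):
    """Quantize one group of floats to `bits` and return the dequantized values."""
    levels = 2 ** bits - 1
    if sym:
        # Symmetric: the range is centred on zero and described by a scale only.
        bound = max(abs(v) for v in values) or 1.0
        lo, hi = -bound, bound
    else:
        # Asymmetric: the range follows the actual min/max (scale plus zero point).
        lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0
    return [round((v - lo) / scale) * scale + lo for v in values]


def mean_squared_error(weights, group_size, sym=False):
    """Quantize `weights` group by group; group_size=-1 treats the column as one group."""
    size = len(weights) if group_size == -1 else group_size
    restored = []
    for start in range(0, len(weights), size):
        restored.extend(quantize_group(weights[start:start + size], sym=sym))
    return sum((w - r) ** 2 for w, r in zip(weights, restored)) / len(weights)


rng = random.Random(0)
# Synthetic column whose magnitude drifts, so smaller groups see narrower ranges.
column = [rng.gauss(0.0, 0.02 * (1 + i // 64)) for i in range(512)]

errors = {size: mean_squared_error(column, size) for size in (-1, 128, 32)}
for size, error in errors.items():
    print(f"group_size={size}: mse={error:.2e}")
```

On this synthetic column the error shrinks as the groups get smaller, because each group gets a scale fitted to its own value range. The price is one extra set of quantization parameters per group.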
| The quality of a 4-bit weight-compressed model can be further improved by employing one of the following data-dependent methods: | |
| * **AWQ**, which stands for Activation-aware Weight Quantization, is an algorithm that tunes model weights for more accurate 4-bit compression. It slightly improves the generation quality of compressed LLMs, but requires significant additional time and memory for tuning weights on a calibration dataset. Note that the model may contain no matching patterns for AWQ, in which case it is skipped. A data-free version of AWQ is also available; it relies on per-column magnitudes of weights instead of activations. | |
| * **Scale Estimation** is a method that tunes quantization scales to minimize the `L2` error between the original and compressed layers. Providing a dataset is required to run scale estimation. Using this method also incurs additional time and memory overhead. | |
| * **GPTQ** optimizes compressed weights in a layer-wise fashion to minimize the difference between activations of a compressed and original layer. | |
| * **LoRA Correction** mitigates quantization noise introduced during weight compression by leveraging low-rank adaptation. | |
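The idea behind scale estimation can be sketched in a few lines of pure Python: starting from the default min/max scale, try nearby scales and keep the one that minimizes the squared L2 error between the original and compressed layer outputs on calibration activations. This is a toy grid search written for illustration. The helper names, the search range and the random data are assumptions, and it is not the algorithm NNCF implements.

```python
import random


def quantize(weights, scale, bits=4):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax - 1, min(qmax, round(w / scale))) * scale for w in weights]


def layer_output_error(weights, restored, activations):
    """Squared L2 distance between the original and compressed layer outputs."""
    error = 0.0
    for x in activations:
        original = sum(a * w for a, w in zip(x, weights))
        compressed = sum(a * r for a, r in zip(x, restored))
        error += (original - compressed) ** 2
    return error


def estimate_scale(weights, activations, bits=4, steps=50):
    """Grid-search scales from 0.5x to 1.5x of the min/max default."""
    qmax = 2 ** (bits - 1) - 1
    default = max(abs(w) for w in weights) / qmax
    candidates = [default * (0.5 + i / steps) for i in range(steps + 1)]
    best = min(
        candidates,
        key=lambda s: layer_output_error(weights, quantize(weights, s, bits), activations),
    )
    return default, best


rng = random.Random(0)
weights = [rng.gauss(0, 1) for _ in range(64)]  # one output channel of a layer
activations = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(16)]  # calibration inputs

default_scale, tuned_scale = estimate_scale(weights, activations)
default_error = layer_output_error(weights, quantize(weights, default_scale), activations)
tuned_error = layer_output_error(weights, quantize(weights, tuned_scale), activations)
print(f"default error: {default_error:.4f}, tuned error: {tuned_error:.4f}")
```

Because the default scale is one of the candidates, the tuned scale can never do worse on the calibration data. Evaluating every candidate on every sample is also a small-scale picture of the extra time the method needs.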
| Data-aware algorithms can be applied together or separately. For that, provide corresponding arguments to the 4-bit `OVWeightQuantizationConfig` together with a dataset. For example: | |
| ```python | |
| from optimum.intel import OVWeightQuantizationConfig | |
| quantization_config = OVWeightQuantizationConfig( | |
| bits=4, | |
| sym=False, | |
| ratio=0.8, | |
| quant_method="awq", | |
| scale_estimation=True, | |
| gptq=True, | |
| dataset="wikitext2" | |
| ) | |
| ``` | |
| Note: GPTQ and LoRA Correction algorithms can't be applied simultaneously. | |
| ## Full quantization | |
| When applying post-training full quantization, both the weights and the activations are quantized. | |
| To quantize the activations, an additional calibration step is needed: a `calibration_dataset` is fed to the network in order to estimate the activation quantization parameters. | |
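Conceptually, calibration records the range of values each activation takes on representative inputs and derives quantization parameters from that range. The pure-Python sketch below shows a simplified min/max variant for one layer. The helper names and the sample values are hypothetical, and the statistics NNCF actually collects are more elaborate.

```python
def calibrate_range(batches):
    """Track the min and max activation value seen across all calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def activation_quant_params(lo, hi, bits=8):
    """Derive an asymmetric scale and zero point for unsigned `bits`-bit storage."""
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    zero_point = round(-lo / scale)
    return scale, zero_point


def fake_quantize(x, scale, zero_point, bits=8):
    """Quantize then dequantize one activation, clamping to the calibrated range."""
    q = round(x / scale) + zero_point
    q = max(0, min(2 ** bits - 1, q))
    return (q - zero_point) * scale


# Hypothetical activations recorded while feeding calibration samples to one layer.
calibration_batches = [[-0.9, 0.1, 2.4], [-1.2, 0.7, 3.1], [-0.3, 1.8, 2.9]]
lo, hi = calibrate_range(calibration_batches)
scale, zero_point = activation_quant_params(lo, hi)

in_range = fake_quantize(1.0, scale, zero_point)
out_of_range = fake_quantize(10.0, scale, zero_point)
print(f"range=({lo}, {hi}) scale={scale:.4f} zero_point={zero_point}")
print(f"1.0 -> {in_range:.4f}, 10.0 -> {out_of_range:.4f}")
```

Values outside the calibrated range are clamped, which is why the calibration samples should resemble the data seen at inference time.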
| Here is how to apply full quantization on a fine-tuned DistilBERT given your own `calibration_dataset`: | |
| ```python | |
| from transformers import AutoTokenizer | |
| from optimum.intel import OVQuantizer, OVModelForSequenceClassification, OVConfig, OVQuantizationConfig | |
| model_id = "distilbert-base-uncased-finetuned-sst-2-english" | |
| model = OVModelForSequenceClassification.from_pretrained(model_id, export=True) | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| # The directory where the quantized model will be saved | |
| save_dir = "ptq_model" | |
| quantizer = OVQuantizer.from_pretrained(model) | |
| # Apply full quantization and export the resulting quantized model to OpenVINO IR format | |
| ov_config = OVConfig(quantization_config=OVQuantizationConfig()) | |
| quantizer.quantize(ov_config=ov_config, calibration_dataset=calibration_dataset, save_directory=save_dir) | |
| # Save the tokenizer | |
| tokenizer.save_pretrained(save_dir) | |
| ``` | |
| The calibration dataset can also be created easily using your `OVQuantizer`: | |
| ```python | |
| from functools import partial | |
| def preprocess_function(examples, tokenizer): | |
|     return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True) | |
| # Create the calibration dataset used to perform full quantization | |
| calibration_dataset = quantizer.get_calibration_dataset( | |
|     "glue", | |
|     dataset_config_name="sst2", | |
|     preprocess_function=partial(preprocess_function, tokenizer=tokenizer), | |
|     num_samples=300, | |
|     dataset_split="train", | |
| ) | |
| ``` | |
| The `quantize()` method applies post-training quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented by two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device. | |
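A quick way to check an export is to look for the XML and binary pair in the output directory. The helper below is a small sketch for that purpose. The `find_ir_pairs` name is hypothetical, and the `openvino_model.xml` / `openvino_model.bin` file names used in the demo are an assumption about what the export writes, so adjust them to what you find in your own save directory.

```python
import tempfile
from pathlib import Path


def find_ir_pairs(model_dir):
    """Return (xml, bin) path pairs for every IR model stored in `model_dir`."""
    pairs = []
    for xml_path in sorted(Path(model_dir).glob("*.xml")):
        bin_path = xml_path.with_suffix(".bin")
        if bin_path.exists():  # a topology file without weights is not a complete IR
            pairs.append((xml_path, bin_path))
    return pairs


# Demo on stand-in files; a real export writes the actual XML and binary files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "openvino_model.xml").touch()
    (Path(tmp) / "openvino_model.bin").touch()
    (Path(tmp) / "config.json").touch()
    found = [(xml.name, weights.name) for xml, weights in find_ir_pairs(tmp)]

print(found)
```

After running the quantization example, calling `find_ir_pairs("ptq_model")` on the save directory should list the exported pair.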
| ### Speech-to-text Models Quantization | |
| The speech-to-text Whisper model can be quantized without preparing a custom calibration dataset, as shown in the example below. | |
| ```python | |
| from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig | |
| model_id = "openai/whisper-tiny" | |
| ov_model = OVModelForSpeechSeq2Seq.from_pretrained( | |
|     model_id, | |
|     quantization_config=OVQuantizationConfig( | |
|         num_samples=10, | |
|         dataset="librispeech", | |
|         processor=model_id, | |
|         smooth_quant_alpha=0.95, | |
|     ), | |
| ) | |
| ``` | |
| With this configuration, the encoder and decoder models of the Whisper pipeline are fully quantized, including activations. | |
| ## Hybrid quantization | |
| Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion (SD) models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of activations is comparable to weights. | |
| The U-Net component takes up most of the overall execution time of the pipeline. Thus, optimizing just this one component can bring substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial accuracy degradation. | |
| Therefore, the proposal is to apply quantization in *hybrid mode* for the U-Net model and weight-only quantization for the rest of the pipeline components: | |
| * U-Net: quantization applied to both the weights and activations | |
| * The text encoder, VAE encoder / decoder: quantization applied to the weights | |
| Hybrid mode quantizes the weights of MatMul and Embedding layers and the activations of other layers, which helps preserve accuracy after optimization while reducing the model size. | |
| The `quantization_config` defines the optimization parameters for the SD pipeline. To enable hybrid quantization, specify a quantization dataset in the `quantization_config`. If no dataset is defined, weight-only quantization is applied to all components. | |
| ```python | |
| from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig | |
| model = OVStableDiffusionPipeline.from_pretrained( | |
| model_id, | |
| export=True, | |
| quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"), | |
| ) | |
| ``` | |
| For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/post_training_compression/weights_compression/Usage.md). | |
| ## Mixed Quantization | |
| Mixed quantization is a technique that combines weight-only quantization with full quantization. During mixed quantization we separately quantize: | |
| 1. weights of weighted layers to one precision, and | |
| 2. activations (and possibly weights, if some were skipped in the first step) of other supported layers to another precision. | |
| By default, weights of all weighted layers are quantized in the first step. In the second step, activations of weighted and non-weighted layers are quantized. If some layers are excluded from the first step with the `weight_quantization_config.ignored_scope` parameter, both the weights and the activations of these layers are quantized to the precision given in the `full_quantization_config`. | |
| When running this kind of optimization through the Python API, use `OVMixedQuantizationConfig`. The precision for the first step is provided with the `weight_quantization_config` argument and the precision for the second step with the `full_quantization_config` argument. For example: | |
| ```python | |
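| from optimum.intel import OVMixedQuantizationConfig, OVModelForCausalLM, OVQuantizationConfig, OVWeightQuantizationConfig | |
| model = OVModelForCausalLM.from_pretrained( | |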
| 'TinyLlama/TinyLlama-1.1B-Chat-v1.0', | |
| quantization_config=OVMixedQuantizationConfig( | |
| weight_quantization_config=OVWeightQuantizationConfig(bits=4, dtype='cb4'), | |
| full_quantization_config=OVQuantizationConfig(dtype='f8e4m3', dataset='wikitext2') | |
| ) | |
| ) | |
| ``` | |
| To apply mixed quantization through the CLI, use the `--quant-mode` argument. For example: | |
| ```bash | |
| optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir | |
| ``` | |
| Don't forget to provide a dataset since it is required for the calibration procedure during full quantization. | |
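Judging from the examples in this document, a `--quant-mode` value such as `cb4_f8e4m3` names the weight precision first and the full-quantization precision second, while a value without an underscore (such as `int8`) requests full quantization only. The helper below is a hypothetical illustration of that reading of the naming convention. It is not part of `optimum-cli`, and the actual argument parsing may differ.

```python
def describe_quant_mode(quant_mode):
    """Split a `--quant-mode` value into weight and full-quantization precisions."""
    if "_" in quant_mode:
        # Mixed mode: "<weight dtype>_<activation dtype>"
        weight_dtype, full_dtype = quant_mode.split("_", 1)
        return {"kind": "mixed", "weight_dtype": weight_dtype, "full_dtype": full_dtype}
    # Single precision: weights and activations are quantized to the same type.
    return {"kind": "full", "full_dtype": quant_mode}


mixed = describe_quant_mode("cb4_f8e4m3")
full = describe_quant_mode("int8")
print(mixed)
print(full)
```

Here `cb4_f8e4m3` corresponds to the `dtype='cb4'` weight configuration and the `dtype='f8e4m3'` full-quantization configuration of the Python example above.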
| ## Pipeline Quantization | |
| Some multimodal pipelines, such as Stable Diffusion or visual language models, consist of multiple components. In these cases, you may need to apply different quantization methods to different components of the pipeline. For example, you may want to apply int4 data-aware weight-only quantization to the language model in a visual-language pipeline, while applying int8 weight-only quantization to the other components. In this case you can use the `OVPipelineQuantizationConfig` class to specify the quantization configuration for each component of the pipeline. | |
| For example, the code below quantizes the weights and activations of the language model inside InternVL2-1B, compresses the weights of the text embedding model and skips quantization for the vision embedding model. | |
| ```python | |
| from optimum.intel import OVModelForVisualCausalLM | |
| from optimum.intel import OVPipelineQuantizationConfig, OVQuantizationConfig, OVWeightQuantizationConfig | |
| model_id = "OpenGVLab/InternVL2-1B" | |
| model = OVModelForVisualCausalLM.from_pretrained( | |
| model_id, | |
| export=True, | |
| trust_remote_code=True, | |
| quantization_config=OVPipelineQuantizationConfig( | |
| quantization_configs={ | |
| "lm_model": OVQuantizationConfig(bits=8), | |
| "text_embeddings_model": OVWeightQuantizationConfig(bits=8), | |
| }, | |
| dataset="contextual", | |
| ) | |
| ) | |
| ``` | |
| ### Models | |
| https://huggingface.co/docs/optimum.intel/pr_1714/openvino/reference.md | |
| # Models | |
| ## Generic model classes[[optimum.intel.openvino.modeling_base.OVBaseModel]] | |
| #### optimum.intel.openvino.modeling_base.OVBaseModel[[optimum.intel.openvino.modeling_base.OVBaseModel]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L189) | |
| Base OVModel class. | |
| #### from_pretrained[[optimum.intel.openvino.modeling_base.OVBaseModel.from_pretrained]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L560) | |
| `from_pretrained(model_id: Union[str, pathlib.Path], export: bool = False, force_download: bool = False, token: Union[bool, str, NoneType] = None, cache_dir: str = '/home/runner/.cache/huggingface/hub', subfolder: str = '', config: Optional[transformers.configuration_utils.PreTrainedConfig] = None, local_files_only: bool = False, trust_remote_code: bool = False, revision: Optional[str] = None, **kwargs)` | |
| Instantiate a pretrained model from a pre-trained model configuration. | |
| **Parameters:** | |
| model_id (`Union[str, Path]`) : Can be either: - A string, the *model id* of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`. - A path to a *directory* containing a model saved using `~OptimizedModel.save_pretrained`, e.g., `./my_model_directory/`. | |
| export (`bool`, defaults to `False`) : Defines whether the provided `model_id` needs to be exported to the targeted format. | |
| force_download (`bool`, defaults to `False`) : Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist. | |
| token (`Optional[Union[bool,str]]`, defaults to `None`) : The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `huggingface-cli login` (stored in `huggingface_hub.constants.HF_TOKEN_PATH`). | |
| cache_dir (`str`, defaults to the Hugging Face Hub cache directory) : Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used. | |
| subfolder (`str`, defaults to `""`) : In case the relevant files are located inside a subfolder of the model repo either locally or on huggingface.co, you can specify the folder name here. | |
| config (`Optional[transformers.PretrainedConfig]`, defaults to `None`) : The model configuration. | |
| local_files_only (`Optional[bool]`, defaults to `False`) : Whether or not to only look at local files (i.e., do not try to download the model). | |
| trust_remote_code (`bool`, defaults to `False`) : Whether or not to allow for custom code defined on the Hub in their own modeling. This option should only be set to `True` for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine. | |
| revision (`Optional[str]`, defaults to `None`) : The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git. | |
| #### reshape[[optimum.intel.openvino.modeling_base.OVBaseModel.reshape]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L936) | |
| Propagates the given input shapes through the model's layers, fixing the input shapes of the model. | |
| **Parameters:** | |
| batch_size (`int`) : The batch size. | |
| sequence_length (`int`) : The sequence length or number of channels. | |
| height (`int`, *optional*) : The image height. | |
| width (`int`, *optional*) : The image width. | |
| ## Natural Language Processing | |
| The following classes are available for natural language processing tasks. | |
| ### OVModelForCausalLM[[optimum.intel.OVModelForCausalLM]] | |
| #### optimum.intel.OVModelForCausalLM[[optimum.intel.OVModelForCausalLM]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_decoder.py#L464) | |
| OpenVINO Model with a causal language modeling head on top (linear layer with weights tied to the input | |
| embeddings). | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| `forward(input_ids: LongTensor, attention_mask: Optional[torch.LongTensor] = None, past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None, position_ids: Optional[torch.LongTensor] = None, token_type_ids: Optional[torch.LongTensor] = None, **kwargs)` ([Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_decoder.py#L577)) | |
| **Parameters:** | |
| model (`openvino.Model`) : The main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Should be set to `False` for the model to not be dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation. | |
| compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled. | |
| #### generate[[optimum.intel.OVModelForCausalLM.generate]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_decoder.py#L757) | |
| ### OVModelForMaskedLM[[optimum.intel.OVModelForMaskedLM]] | |
| #### optimum.intel.OVModelForMaskedLM[[optimum.intel.OVModelForMaskedLM]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L448) | |
| OpenVINO Model with a MaskedLMOutput for masked language modeling tasks. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| `forward(input_ids: Union[torch.Tensor, numpy.ndarray], attention_mask: Union[torch.Tensor, numpy.ndarray], token_type_ids: Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)` ([Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L455)) | |
| - **input_ids** (`torch.Tensor`) -- | |
| Indices of input sequence tokens in the vocabulary. | |
| Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer). | |
| [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
|   - 1 for tokens that are **not masked**, | |
|   - 0 for tokens that are **masked**. | |
| [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask) | |
| - **token_type_ids** (`torch.Tensor`, *optional*) -- | |
| Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`: | |
|   - 0 for tokens that belong to **sentence A**, | |
|   - 1 for tokens that belong to **sentence B**. | |
| [What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids) | |
| The [OVModelForMaskedLM](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForMaskedLM) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
| Example of masked language modeling using `transformers.pipelines`: | |
| ```python | |
| >>> from transformers import AutoTokenizer, pipeline | |
| >>> from optimum.intel import OVModelForMaskedLM | |
| >>> tokenizer = AutoTokenizer.from_pretrained("roberta-base") | |
| >>> model = OVModelForMaskedLM.from_pretrained("roberta-base", export=True) | |
| >>> mask_token = tokenizer.mask_token | |
| >>> pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer) | |
| >>> outputs = pipe(f"The goal of life is {mask_token}.") | |
| ``` | |
| **Parameters:** | |
| model (`openvino.Model`) : The main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Should be set to `False` for the model to not be dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation. | |
| compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled. | |
| ### OVModelForSeq2SeqLM[[optimum.intel.OVModelForSeq2SeqLM]] | |
| #### optimum.intel.OVModelForSeq2SeqLM[[optimum.intel.OVModelForSeq2SeqLM]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L329) | |
| Sequence-to-sequence model with a language modeling head for OpenVINO inference. | |
| `optimum.intel.OVModelForSeq2SeqLM.forward(input_ids: LongTensor = None, attention_mask: typing.Optional[torch.FloatTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.LongTensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, cache_position: typing.Optional[torch.LongTensor] = None, labels: typing.Optional[torch.LongTensor] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L644) | |
| - **input_ids** (`torch.LongTensor`) -- | |
| Indices of input sequence tokens in the vocabulary of shape `(batch_size, encoder_sequence_length)`. | |
| - **attention_mask** (`torch.LongTensor`) -- | |
| Mask to avoid performing attention on padding token indices, of shape | |
| `(batch_size, encoder_sequence_length)`. Mask values selected in `[0, 1]`. | |
| - **decoder_input_ids** (`torch.LongTensor`) -- | |
| Indices of decoder input sequence tokens in the vocabulary of shape `(batch_size, decoder_sequence_length)`. | |
| - **encoder_outputs** (`torch.FloatTensor`) -- | |
| The encoder `last_hidden_state` of shape `(batch_size, encoder_sequence_length, hidden_size)`. | |
| - **past_key_values** (`tuple(tuple(torch.FloatTensor))`, *optional*) -- | |
| Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. | |
| The tuple is of length `config.n_layers` with each tuple having 2 tensors of shape | |
| `(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape | |
| `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. | |
| The [OVModelForSeq2SeqLM](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForSeq2SeqLM) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
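The layout of `past_key_values` described above can be made concrete with a small shape calculator. This is only a sketch: the function name is hypothetical, the numeric values are illustrative rather than read from a model config, and the key-then-value ordering inside each layer is an assumption.

```python
def past_key_values_shapes(n_layers, batch_size, num_heads,
                           decoder_len, encoder_len, head_dim):
    """Return the expected tensor shapes of the cache, one tuple per decoder layer."""
    self_attn = (batch_size, num_heads, decoder_len, head_dim)
    cross_attn = (batch_size, num_heads, encoder_len, head_dim)
    # Per layer: two decoder-length tensors, then two encoder-length tensors
    # (assumed order: self-attention key/value, cross-attention key/value).
    layer = (self_attn, self_attn, cross_attn, cross_attn)
    return tuple(layer for _ in range(n_layers))


shapes = past_key_values_shapes(n_layers=6, batch_size=1, num_heads=8,
                                decoder_len=3, encoder_len=21, head_dim=64)
```

The outer tuple has one entry per layer, and only the decoder-length tensors grow as generation proceeds, since the encoder output stays fixed.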
| Example of text generation: | |
| ```python | |
| >>> from transformers import AutoTokenizer | |
| >>> from optimum.intel import OVModelForSeq2SeqLM | |
| >>> tokenizer = AutoTokenizer.from_pretrained("echarlaix/t5-small-openvino") | |
| >>> model = OVModelForSeq2SeqLM.from_pretrained("echarlaix/t5-small-openvino") | |
| >>> text = "He never went out without a book under his arm, and he often came back with two." | |
| >>> inputs = tokenizer(text, return_tensors="pt") | |
| >>> gen_tokens = model.generate(**inputs) | |
| >>> outputs = tokenizer.batch_decode(gen_tokens) | |
| ``` | |
| Example using `transformers.pipeline`: | |
| ```python | |
| >>> from transformers import AutoTokenizer, pipeline | |
| >>> from optimum.intel import OVModelForSeq2SeqLM | |
| >>> tokenizer = AutoTokenizer.from_pretrained("echarlaix/t5-small-openvino") | |
| >>> model = OVModelForSeq2SeqLM.from_pretrained("echarlaix/t5-small-openvino") | |
| >>> pipe = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer) | |
| >>> text = "He never went out without a book under his arm, and he often came back with two." | |
| >>> outputs = pipe(text) | |
| ``` | |
| **Parameters:** | |
| encoder (`openvino.Model`) : The OpenVINO Runtime model associated to the encoder. | |
| decoder (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder. | |
| decoder_with_past (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder with past key values. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is an instance of the configuration associated to the model. Initializing with a config file does not load the weights associated with the model, only the configuration. | |
| ### OVModelForQuestionAnswering[[optimum.intel.OVModelForQuestionAnswering]] | |
| #### optimum.intel.OVModelForQuestionAnswering[[optimum.intel.OVModelForQuestionAnswering]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L219) | |
| OpenVINO Model with a QuestionAnsweringModelOutput for extractive question-answering tasks. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| `optimum.intel.OVModelForQuestionAnswering.forward(input_ids: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray], token_type_ids: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L226) | |
| - **input_ids** (`torch.Tensor`) -- | |
| Indices of input sequence tokens in the vocabulary. | |
| Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer). | |
| [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask) | |
| - **token_type_ids** (`torch.Tensor`, *optional*) -- | |
| Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`: | |
| - 0 for tokens that belong to **sentence A**, | |
| - 1 for tokens that belong to **sentence B**. | |
| [What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids) | |
| The [OVModelForQuestionAnswering](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForQuestionAnswering) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
| Example of question answering using `transformers.pipeline`: | |
| ```python | |
| >>> from transformers import AutoTokenizer, pipeline | |
| >>> from optimum.intel import OVModelForQuestionAnswering | |
| >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad") | |
| >>> model = OVModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad", export=True) | |
| >>> pipe = pipeline("question-answering", model=model, tokenizer=tokenizer) | |
| >>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet" | |
| >>> outputs = pipe(question, text) | |
| ``` | |
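For extractive question answering, the model scores every token as a possible answer start and end, and the answer is the best-scoring span. The sketch below shows one simple way to pick that span in plain Python; the logits are made up, and the exhaustive search is an illustration rather than the pipeline's actual post-processing.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Pick the (start, end) token pair with the highest summed score."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # The end index must not precede the start index.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best


tokens = ["Jim", "Henson", "was", "a", "nice", "puppet"]
start_logits = [0.1, 0.2, 0.0, 2.5, 1.0, 0.3]  # made-up scores
end_logits = [0.0, 0.1, 0.0, 0.2, 0.5, 3.0]    # made-up scores
start, end = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
```

With these scores the selected span covers the last three tokens, so `answer` is `"a nice puppet"`.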
| **Parameters:** | |
| model (`openvino.Model`) : is the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to keep the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation. | |
| compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled. | |
| ### OVModelForSequenceClassification[[optimum.intel.OVModelForSequenceClassification]] | |
| #### optimum.intel.OVModelForSequenceClassification[[optimum.intel.OVModelForSequenceClassification]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L154) | |
| OpenVINO Model with a SequenceClassifierOutput for sequence classification tasks. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| `optimum.intel.OVModelForSequenceClassification.forward(input_ids: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray], token_type_ids: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L161) | |
| - **input_ids** (`torch.Tensor`) -- | |
| Indices of input sequence tokens in the vocabulary. | |
| Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer). | |
| [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask) | |
| - **token_type_ids** (`torch.Tensor`, *optional*) -- | |
| Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`: | |
| - 0 for tokens that belong to **sentence A**, | |
| - 1 for tokens that belong to **sentence B**. | |
| [What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids) | |
| The [OVModelForSequenceClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForSequenceClassification) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
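The `attention_mask` values described above (1 for real tokens, 0 for padding) can be illustrated without a tokenizer. In this sketch the token IDs and the padding ID of 0 are made up; a real tokenizer produces both `input_ids` and `attention_mask` itself.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token ID lists to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * padding)
        # 1 marks a real token, 0 marks a padding position to be ignored.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([[101, 7592, 102], [101, 102]])
```

The shorter sequence gains one padding token, and its mask ends in 0 so that position does not take part in attention.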
| Example of sequence classification using `transformers.pipeline`: | |
| ```python | |
| >>> from transformers import AutoTokenizer, pipeline | |
| >>> from optimum.intel import OVModelForSequenceClassification | |
| >>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english") | |
| >>> model = OVModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", export=True) | |
| >>> pipe = pipeline("text-classification", model=model, tokenizer=tokenizer) | |
| >>> outputs = pipe("Hello, my dog is cute") | |
| ``` | |
| **Parameters:** | |
| model (`openvino.Model`) : is the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to keep the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation. | |
| compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled. | |
| ### OVModelForTokenClassification[[optimum.intel.OVModelForTokenClassification]] | |
| #### optimum.intel.OVModelForTokenClassification[[optimum.intel.OVModelForTokenClassification]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L288) | |
| OpenVINO Model with a TokenClassifierOutput for token classification tasks. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| `optimum.intel.OVModelForTokenClassification.forward(input_ids: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray], token_type_ids: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L295) | |
| - **input_ids** (`torch.Tensor`) -- | |
| Indices of input sequence tokens in the vocabulary. | |
| Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer). | |
| [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask) | |
| - **token_type_ids** (`torch.Tensor`, *optional*) -- | |
| Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`: | |
| - 0 for tokens that belong to **sentence A**, | |
| - 1 for tokens that belong to **sentence B**. | |
| [What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids) | |
| The [OVModelForTokenClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForTokenClassification) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
| Example of token classification using `transformers.pipeline`: | |
| ```python | |
| >>> from transformers import AutoTokenizer, pipeline | |
| >>> from optimum.intel import OVModelForTokenClassification | |
| >>> tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER") | |
| >>> model = OVModelForTokenClassification.from_pretrained("dslim/bert-base-NER", export=True) | |
| >>> pipe = pipeline("token-classification", model=model, tokenizer=tokenizer) | |
| >>> outputs = pipe("My Name is Peter and I live in New York.") | |
| ``` | |
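Token classification yields one label per token, and consecutive tokens are usually merged into entities afterwards. The sketch below groups BIO-style tags in plain Python; the tags are made-up predictions, and the `B-`/`I-`/`O` label scheme is an assumption about the checkpoint rather than something stated above.

```python
def group_entities(tokens, tags):
    """Merge B-/I- tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_type, current_words = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A new entity starts: flush the previous one, if any.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = tag[2:], [token]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_words.append(token)
        else:
            # "O" closes any open entity.
            if current_words:
                entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
    if current_words:
        entities.append((current_type, " ".join(current_words)))
    return entities


tokens = ["My", "name", "is", "Peter", "and", "I", "live", "in", "New", "York"]
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC", "I-LOC"]  # made-up predictions
entities = group_entities(tokens, tags)
```

The two location tokens are merged into a single `LOC` entity, while the person name stays a one-token `PER` entity.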
| **Parameters:** | |
| model (`openvino.Model`) : is the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to keep the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation. | |
| compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled. | |
| ## Audio | |
| The following classes are available for audio tasks. | |
| ### OVModelForAudioClassification[[optimum.intel.OVModelForAudioClassification]] | |
| #### optimum.intel.OVModelForAudioClassification[[optimum.intel.OVModelForAudioClassification]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L651) | |
| OpenVINO Model with a SequenceClassifierOutput for audio classification tasks. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| `optimum.intel.OVModelForAudioClassification.forward(input_values: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L658) | |
| - **input_values** (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- | |
| Float values of input raw speech waveform. | |
| Input values can be obtained from audio file loaded into an array using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor). | |
| - **attention_mask** (`torch.Tensor`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask) | |
| The [OVModelForAudioClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForAudioClassification) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
| Example of audio classification using `transformers.pipeline`: | |
| ```python | |
| >>> from datasets import load_dataset | |
| >>> from transformers import AutoFeatureExtractor, pipeline | |
| >>> from optimum.intel import OVModelForAudioClassification | |
| >>> preprocessor = AutoFeatureExtractor.from_pretrained("superb/hubert-base-superb-er") | |
| >>> model = OVModelForAudioClassification.from_pretrained("superb/hubert-base-superb-er", export=True) | |
| >>> pipe = pipeline("audio-classification", model=model, feature_extractor=preprocessor) | |
| >>> dataset = load_dataset("superb", "ks", split="test") | |
| >>> audio_file = dataset[3]["audio"]["array"] | |
| >>> outputs = pipe(audio_file) | |
| ``` | |
| **Parameters:** | |
| model (`openvino.Model`) : is the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to keep the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation. | |
| compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled. | |
| ### OVModelForAudioFrameClassification[[optimum.intel.OVModelForAudioFrameClassification]] | |
| #### optimum.intel.OVModelForAudioFrameClassification[[optimum.intel.OVModelForAudioFrameClassification]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L879) | |
| OpenVINO Model with a frame classification head on top for tasks like Speaker Diarization. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| Audio Frame Classification model for OpenVINO. | |
| `optimum.intel.OVModelForAudioFrameClassification.forward(input_values: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L887) | |
| - **input_values** (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- | |
| Float values of input raw speech waveform. | |
| Input values can be obtained from audio file loaded into an array using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor). | |
| The [OVModelForAudioFrameClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForAudioFrameClassification) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
| Example of audio frame classification: | |
| ```python | |
| >>> from transformers import AutoFeatureExtractor | |
| >>> from optimum.intel import OVModelForAudioFrameClassification | |
| >>> from datasets import load_dataset | |
| >>> import torch | |
| >>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") | |
| >>> dataset = dataset.sort("id") | |
| >>> sampling_rate = dataset.features["audio"].sampling_rate | |
| >>> feature_extractor = AutoFeatureExtractor.from_pretrained("anton-l/wav2vec2-base-superb-sd") | |
| >>> model = OVModelForAudioFrameClassification.from_pretrained("anton-l/wav2vec2-base-superb-sd", export=True) | |
| >>> inputs = feature_extractor(dataset[0]["audio"]["array"], return_tensors="pt", sampling_rate=sampling_rate) | |
| >>> logits = model(**inputs).logits | |
| >>> probabilities = torch.sigmoid(torch.as_tensor(logits)[0]) | |
| >>> labels = (probabilities > 0.5).long() | |
| >>> labels[0].tolist() | |
| ``` | |
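The last lines of the example above turn per-frame logits into multi-label decisions: each label gets its own sigmoid, so several labels can be active in the same frame. A plain-Python sketch of that step, with made-up logits for three frames and two labels:

```python
import math


def frame_labels(frame_logits, threshold=0.5):
    """Apply a sigmoid to each logit and threshold it independently per label."""
    labels = []
    for frame in frame_logits:
        probs = [1.0 / (1.0 + math.exp(-x)) for x in frame]
        labels.append([1 if p > threshold else 0 for p in probs])
    return labels


frame_logits = [[2.0, -3.0], [0.5, 0.1], [-1.0, 4.0]]  # made-up scores
labels = frame_labels(frame_logits)
```

In the second frame both labels are active at once, which a softmax over labels could not express.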
| **Parameters:** | |
| model (`openvino.Model`) : is the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to keep the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation. | |
| compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled. | |
| ### OVModelForCTC[[optimum.intel.OVModelForCTC]] | |
| #### optimum.intel.OVModelForCTC[[optimum.intel.OVModelForCTC]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L723) | |
| OpenVINO Model with a language modeling head on top for Connectionist Temporal Classification (CTC). | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| CTC model for OpenVINO. | |
| `optimum.intel.OVModelForCTC.forward(input_values: typing.Optional[torch.Tensor] = None, attention_mask: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L731) | |
| - **input_values** (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- | |
| Float values of input raw speech waveform. | |
| Input values can be obtained from audio file loaded into an array using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor). | |
| The [OVModelForCTC](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForCTC) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
| Example of CTC: | |
| ```python | |
| >>> import numpy as np | |
| >>> from transformers import AutoProcessor | |
| >>> from optimum.intel import OVModelForCTC | |
| >>> from datasets import load_dataset | |
| >>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") | |
| >>> dataset = dataset.sort("id") | |
| >>> sampling_rate = dataset.features["audio"].sampling_rate | |
| >>> # AutoProcessor bundles the feature extractor with the tokenizer that `batch_decode` relies on | |
| >>> processor = AutoProcessor.from_pretrained("facebook/hubert-large-ls960-ft") | |
| >>> model = OVModelForCTC.from_pretrained("facebook/hubert-large-ls960-ft", export=True) | |
| >>> # audio file is decoded on the fly | |
| >>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="np") | |
| >>> logits = model(**inputs).logits | |
| >>> predicted_ids = np.argmax(logits, axis=-1) | |
| >>> transcription = processor.batch_decode(predicted_ids) | |
| ``` | |
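After the `argmax` step in the example above, CTC output still contains repeated IDs and blank symbols. Greedy CTC decoding collapses consecutive repeats and then drops blanks. The sketch below shows this in plain Python; the character table and the blank ID of 0 are hypothetical and do not match any real tokenizer.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse consecutive repeats, then drop blanks (greedy CTC decoding)."""
    chars = []
    previous = None
    for idx in frame_ids:
        # Emit a character only when the ID changes and is not the blank symbol.
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


id_to_char = {1: "h", 2: "e", 3: "l", 4: "o"}  # hypothetical vocabulary
frame_ids = [1, 1, 0, 2, 3, 3, 0, 3, 4, 4]     # made-up per-frame argmax IDs
text = ctc_greedy_decode(frame_ids, id_to_char)
```

The blank between the two runs of ID 3 is what lets the double "l" survive the collapse, so `text` is `"hello"`.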
| **Parameters:** | |
| model (`openvino.Model`) : is the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to keep the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation. | |
| compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled. | |
| ### OVModelForAudioXVector[[optimum.intel.OVModelForAudioXVector]] | |
| #### optimum.intel.OVModelForAudioXVector[[optimum.intel.OVModelForAudioXVector]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L803) | |
| OpenVINO Model with an XVector feature extraction head on top for tasks like Speaker Verification. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| Audio XVector model for OpenVINO. | |
| `optimum.intel.OVModelForAudioXVector.forward(input_values: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L811) | |
| - **input_values** (`torch.Tensor` of shape `(batch_size, sequence_length)`) -- | |
| Float values of input raw speech waveform. | |
| Input values can be obtained from audio file loaded into an array using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor). | |
| The [OVModelForAudioXVector](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForAudioXVector) forward method overrides the `__call__` special method. | |
| Although the recipe for forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of this since the former takes care of running the pre and post processing steps while | |
| the latter silently ignores them. | |
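The difference between calling the instance and calling `forward` directly can be illustrated with a toy class. This is only a sketch of the general pattern described in the note: `ToyModule`, `preprocess` and `postprocess` are hypothetical names and do not reflect how optimum-intel implements its models.

```python
class ToyModule:
    """Toy stand-in for a model class; not the optimum-intel implementation."""

    def __init__(self):
        self.log = []  # records which steps were executed

    def preprocess(self, x):
        self.log.append("pre")
        return [float(v) for v in x]

    def forward(self, x):
        # The "recipe" for the forward pass lives here.
        self.log.append("forward")
        return [2.0 * v for v in x]

    def postprocess(self, y):
        self.log.append("post")
        return {"values": y}

    def __call__(self, x):
        # Calling the instance wraps forward() with the extra steps.
        return self.postprocess(self.forward(self.preprocess(x)))


module = ToyModule()
via_call = module([1, 2])  # runs pre -> forward -> post
steps_for_call = list(module.log)

module.log.clear()
via_forward = module.forward([1, 2])  # runs only the forward recipe
steps_for_forward = list(module.log)
```

Calling `module(...)` records all three steps, while `module.forward(...)` records only the forward step and returns the raw, unwrapped result.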
| Example of Audio XVector: | |
| ```python | |
| >>> from transformers import AutoFeatureExtractor | |
| >>> from optimum.intel import OVModelForAudioXVector | |
| >>> from datasets import load_dataset | |
| >>> import torch | |
| >>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation") | |
| >>> dataset = dataset.sort("id") | |
| >>> sampling_rate = dataset.features["audio"].sampling_rate | |
| >>> feature_extractor = AutoFeatureExtractor.from_pretrained("anton-l/wav2vec2-base-superb-sv") | |
| >>> model = OVModelForAudioXVector.from_pretrained("anton-l/wav2vec2-base-superb-sv", export=True) | |
| >>> # audio file is decoded on the fly | |
| >>> inputs = feature_extractor( | |
| ... [d["array"] for d in dataset[:2]["audio"]], sampling_rate=sampling_rate, return_tensors="pt", padding=True | |
| ... ) | |
| >>> embeddings = model(**inputs).embeddings | |
| >>> embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu() | |
| >>> cosine_sim = torch.nn.CosineSimilarity(dim=-1) | |
| >>> similarity = cosine_sim(embeddings[0], embeddings[1]) | |
| >>> threshold = 0.7 | |
| >>> if similarity < threshold: | |
| ...     print("Speakers are not the same!") | |
| >>> round(similarity.item(), 2) | |
| ``` | |
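The verification step at the end of the example (normalize, compare with cosine similarity, apply a threshold) can be sketched without any model. The embeddings below are made-up two-dimensional vectors, `same_speaker` is a hypothetical helper, and the `0.7` threshold is only the value used in the example above; a suitable threshold depends on the data.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def cosine_similarity(a, b):
    """Cosine similarity computed as the dot product of unit vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def same_speaker(emb_a, emb_b, threshold=0.7):
    """Hypothetical decision rule: accept when similarity reaches the threshold."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Made-up embeddings, standing in for rows of `model(**inputs).embeddings`.
emb_1 = [3.0, 4.0]
emb_2 = [4.0, 3.0]
emb_3 = [-4.0, 3.0]

score_close = cosine_similarity(emb_1, emb_2)
score_far = cosine_similarity(emb_1, emb_3)
```

With these vectors, `emb_1` and `emb_2` point in nearly the same direction and pass the check, while `emb_1` and `emb_3` are orthogonal and do not.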
| **Parameters:** | |
| model (`openvino.Model`) : The `openvino.Model` instance, the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `optimum.intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : When set to `True`, all of the model's dimensions are set to dynamic. Set to `False` to prevent the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : A dictionary of options related to model compilation. | |
| compile (`bool`, defaults to `True`) : When set to `False`, disables model compilation during the loading step. This is useful to avoid unnecessary compilation when the model needs to be statically reshaped, the device changed, or FP16 conversion enabled afterwards. | |
| ### OVModelForSpeechSeq2Seq[[optimum.intel.OVModelForSpeechSeq2Seq]] | |
| #### optimum.intel.OVModelForSpeechSeq2Seq[[optimum.intel.OVModelForSpeechSeq2Seq]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1246) | |
| Speech sequence-to-sequence model with a language modeling head for OpenVINO inference. This class officially supports the `whisper` and `speech_to_text` architectures. | |
| `optimum.intel.OVModelForSpeechSeq2Seq.forward(input_features: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.LongTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.BoolTensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, cache_position: typing.Optional[torch.LongTensor] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1282) | |
| - **input_features** (`torch.FloatTensor`) -- | |
| Mel features extracted from the raw speech waveform, of shape | |
| `(batch_size, feature_size, encoder_sequence_length)`. | |
| - **decoder_input_ids** (`torch.LongTensor`) -- | |
| Indices of decoder input sequence tokens in the vocabulary of shape `(batch_size, decoder_sequence_length)`. | |
| - **encoder_outputs** (`torch.FloatTensor`) -- | |
| The encoder `last_hidden_state` of shape `(batch_size, encoder_sequence_length, hidden_size)`. | |
| - **past_key_values** (`tuple(tuple(torch.FloatTensor))`, *optional*, defaults to `None`) -- | |
| Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. | |
| The outer tuple has length `config.n_layers`; each inner tuple holds 2 tensors of shape | |
| `(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape | |
| `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. | |
| The [OVModelForSpeechSeq2Seq](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForSpeechSeq2Seq) forward method overrides the `__call__` special method. | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of calling `forward` directly, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
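The nested layout of `past_key_values` described above is easier to see when the expected shapes are written out. The sketch below only builds shape tuples, not tensors. `expected_past_key_value_shapes` is a hypothetical helper, and the sizes passed to it are illustrative values, not read from any checkpoint.

```python
def expected_past_key_value_shapes(
    n_layers,
    batch_size,
    num_heads,
    decoder_sequence_length,
    encoder_sequence_length,
    embed_size_per_head,
):
    """Return the shapes described for `past_key_values`, one inner tuple per layer."""
    # Two tensors (key, value) whose length follows the decoder sequence.
    decoder_shape = (batch_size, num_heads, decoder_sequence_length, embed_size_per_head)
    # Two additional tensors whose length follows the encoder sequence.
    encoder_shape = (batch_size, num_heads, encoder_sequence_length, embed_size_per_head)
    per_layer = (decoder_shape, decoder_shape, encoder_shape, encoder_shape)
    return tuple(per_layer for _ in range(n_layers))


# Illustrative sizes only.
shapes = expected_past_key_value_shapes(
    n_layers=4,
    batch_size=1,
    num_heads=6,
    decoder_sequence_length=5,
    encoder_sequence_length=1500,
    embed_size_per_head=64,
)
```

Only the decoder-length tensors grow as more tokens are generated; the encoder-length tensors keep the same shape throughout decoding.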
| Example of text generation: | |
| ```python | |
| >>> from transformers import AutoProcessor | |
| >>> from optimum.intel import OVModelForSpeechSeq2Seq | |
| >>> from datasets import load_dataset | |
| >>> processor = AutoProcessor.from_pretrained("openai/whisper-tiny") | |
| >>> model = OVModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny") | |
| >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") | |
| >>> inputs = processor.feature_extractor(ds[0]["audio"]["array"], return_tensors="pt") | |
| >>> gen_tokens = model.generate(inputs=inputs.input_features) | |
| >>> outputs = processor.tokenizer.batch_decode(gen_tokens) | |
| ``` | |
| Example using `transformers.pipeline`: | |
| ```python | |
| >>> from transformers import AutoProcessor, pipeline | |
| >>> from optimum.intel import OVModelForSpeechSeq2Seq | |
| >>> from datasets import load_dataset | |
| >>> processor = AutoProcessor.from_pretrained("openai/whisper-tiny") | |
| >>> model = OVModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny") | |
| >>> speech_recognition = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor) | |
| >>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") | |
| >>> pred = speech_recognition(ds[0]["audio"]["array"]) | |
| ``` | |
| **Parameters:** | |
| encoder (`openvino.Model`) : The OpenVINO Runtime model associated with the encoder. | |
| decoder (`openvino.Model`) : The OpenVINO Runtime model associated with the decoder. | |
| decoder_with_past (`openvino.Model`) : The OpenVINO Runtime model associated with the decoder with past key values. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is an instance of the configuration associated with the model. Initializing with a config file does not load the weights associated with the model, only the configuration. | |
| ## Computer Vision | |
| The following classes are available for computer vision tasks. | |
| ### OVModelForImageClassification[[optimum.intel.OVModelForImageClassification]] | |
| #### optimum.intel.OVModelForImageClassification[[optimum.intel.OVModelForImageClassification]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L538) | |
| OpenVINO Model with an `ImageClassifierOutput` for image classification tasks. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| `optimum.intel.OVModelForImageClassification.forward(pixel_values: typing.Union[torch.Tensor, numpy.ndarray], **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L600) | |
| - **pixel_values** (`torch.Tensor`) -- | |
| Pixel values corresponding to the images in the current batch. | |
| Pixel values can be obtained from encoded images using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor). | |
| The [OVModelForImageClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForImageClassification) forward method overrides the `__call__` special method. | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of calling `forward` directly, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
| Example of image classification using `transformers.pipelines`: | |
| ```python | |
| >>> from transformers import AutoFeatureExtractor, pipeline | |
| >>> from optimum.intel import OVModelForImageClassification | |
| >>> preprocessor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224") | |
| >>> model = OVModelForImageClassification.from_pretrained("google/vit-base-patch16-224", export=True) | |
| >>> model.reshape(batch_size=1, sequence_length=3, height=224, width=224) | |
| >>> pipe = pipeline("image-classification", model=model, feature_extractor=preprocessor) | |
| >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| >>> outputs = pipe(url) | |
| ``` | |
| This class can also be used with [timm](https://github.com/huggingface/pytorch-image-models) | |
| models hosted on the [Hugging Face Hub](https://huggingface.co/timm). Example: | |
| ```python | |
| >>> from transformers import pipeline | |
| >>> from optimum.intel.openvino.modeling_timm import TimmImageProcessor | |
| >>> from optimum.intel import OVModelForImageClassification | |
| >>> model_id = "timm/vit_tiny_patch16_224.augreg_in21k" | |
| >>> preprocessor = TimmImageProcessor.from_pretrained(model_id) | |
| >>> model = OVModelForImageClassification.from_pretrained(model_id, export=True) | |
| >>> pipe = pipeline("image-classification", model=model, feature_extractor=preprocessor) | |
| >>> url = "http://images.cocodataset.org/val2017/000000039769.jpg" | |
| >>> outputs = pipe(url) | |
| ``` | |
| **Parameters:** | |
| model (`openvino.Model`) : The `openvino.Model` instance, the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `optimum.intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : When set to `True`, all of the model's dimensions are set to dynamic. Set to `False` to prevent the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : A dictionary of options related to model compilation. | |
| compile (`bool`, defaults to `True`) : When set to `False`, disables model compilation during the loading step. This is useful to avoid unnecessary compilation when the model needs to be statically reshaped, the device changed, or FP16 conversion enabled afterwards. | |
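The `device`, `ov_config` and `compile` parameters can be combined to defer compilation until the model has been given a static shape. The following is a minimal sketch under some assumptions: `optimum-intel[openvino]` is installed, the `"PERFORMANCE_HINT"` entry is just one OpenVINO property picked for illustration, and `LOAD_KWARGS` and `load_static_classifier` are hypothetical names. The import is deferred so that defining the helper does not load anything.

```python
# Keyword arguments forwarded to from_pretrained (names match the parameters above).
LOAD_KWARGS = {
    "device": "CPU",
    "compile": False,  # skip compilation at load time
    "ov_config": {"PERFORMANCE_HINT": "LATENCY"},  # illustrative OpenVINO property
}


def load_static_classifier(model_id):
    """Hypothetical helper: load, fix the input shape, then compile once."""
    # Imported lazily so that defining this helper does not require the library.
    from optimum.intel import OVModelForImageClassification

    model = OVModelForImageClassification.from_pretrained(model_id, export=True, **LOAD_KWARGS)
    # Same reshape call as in the pipeline example above.
    model.reshape(batch_size=1, sequence_length=3, height=224, width=224)
    model.compile()  # compile explicitly now that the shape is static
    return model
```

A call such as `load_static_classifier("google/vit-base-patch16-224")` would then compile the model a single time, after the reshape, rather than once at load time and again after reshaping.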
| ## Multimodal | |
| The following classes are available for multimodal tasks. | |
| ### OVModelForVision2Seq[[optimum.intel.OVModelForVision2Seq]] | |
| #### optimum.intel.OVModelForVision2Seq[[optimum.intel.OVModelForVision2Seq]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1051) | |
| VisionEncoderDecoder Sequence-to-sequence model with a language modeling head for OpenVINO inference. | |
| `optimum.intel.OVModelForVision2Seq.forward(pixel_values: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.LongTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.BoolTensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1105) | |
| - **pixel_values** (`torch.FloatTensor`) -- | |
| Features extracted from an image. This tensor should be of shape | |
| `(batch_size, num_channels, height, width)`. | |
| - **decoder_input_ids** (`torch.LongTensor`) -- | |
| Indices of decoder input sequence tokens in the vocabulary of shape `(batch_size, decoder_sequence_length)`. | |
| - **encoder_outputs** (`torch.FloatTensor`) -- | |
| The encoder `last_hidden_state` of shape `(batch_size, encoder_sequence_length, hidden_size)`. | |
| - **past_key_values** (`tuple(tuple(torch.FloatTensor))`, *optional*, defaults to `None`) -- | |
| Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. | |
| The outer tuple has length `config.n_layers`; each inner tuple holds 2 tensors of shape | |
| `(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape | |
| `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. | |
| The [OVModelForVision2Seq](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForVision2Seq) forward method overrides the `__call__` special method. | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of calling `forward` directly, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
| Example of text generation: | |
| ```python | |
| >>> from transformers import AutoProcessor, AutoTokenizer | |
| >>> from optimum.intel import OVModelForVision2Seq | |
| >>> from PIL import Image | |
| >>> import requests | |
| >>> processor = AutoProcessor.from_pretrained("microsoft/trocr-small-handwritten") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-handwritten") | |
| >>> model = OVModelForVision2Seq.from_pretrained("microsoft/trocr-small-handwritten", export=True) | |
| >>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg" | |
| >>> image = Image.open(requests.get(url, stream=True).raw) | |
| >>> inputs = processor(image, return_tensors="pt") | |
| >>> gen_tokens = model.generate(**inputs) | |
| >>> outputs = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True) | |
| ``` | |
| Example using `transformers.pipeline`: | |
| ```python | |
| >>> from transformers import AutoProcessor, AutoTokenizer, pipeline | |
| >>> from optimum.intel import OVModelForVision2Seq | |
| >>> from PIL import Image | |
| >>> import requests | |
| >>> processor = AutoProcessor.from_pretrained("microsoft/trocr-small-handwritten") | |
| >>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-handwritten") | |
| >>> model = OVModelForVision2Seq.from_pretrained("microsoft/trocr-small-handwritten", export=True) | |
| >>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg" | |
| >>> image = Image.open(requests.get(url, stream=True).raw) | |
| >>> image_to_text = pipeline("image-to-text", model=model, tokenizer=tokenizer, feature_extractor=processor, image_processor=processor) | |
| >>> pred = image_to_text(image) | |
| ``` | |
| **Parameters:** | |
| encoder (`openvino.Model`) : The OpenVINO Runtime model associated with the encoder. | |
| decoder (`openvino.Model`) : The OpenVINO Runtime model associated with the decoder. | |
| decoder_with_past (`openvino.Model`) : The OpenVINO Runtime model associated with the decoder with past key values. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is an instance of the configuration associated with the model. Initializing with a config file does not load the weights associated with the model, only the configuration. | |
| ### OVModelForPix2Struct[[optimum.intel.OVModelForPix2Struct]] | |
| #### optimum.intel.OVModelForPix2Struct[[optimum.intel.OVModelForPix2Struct]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1156) | |
| Pix2Struct model with a language modeling head for OpenVINO inference. | |
| `optimum.intel.OVModelForPix2Struct.forward(flattened_patches: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.LongTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.BoolTensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1196) | |
| - **flattened_patches** (`torch.FloatTensor` of shape `(batch_size, seq_length, hidden_size)`) -- | |
| Flattened pixel patches. The `hidden_size` is obtained by the following formula: `hidden_size` = | |
| `num_channels` * `patch_size` * `patch_size`. | |
| The process of flattening the pixel patches is done by `Pix2StructProcessor`. | |
| - **attention_mask** (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. | |
| - **decoder_input_ids** (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, *optional*) -- | |
| Indices of decoder input sequence tokens in the vocabulary. | |
| Pix2StructText uses the `pad_token_id` as the starting token for `decoder_input_ids` generation. If | |
| `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see | |
| `past_key_values`). | |
| - **decoder_attention_mask** (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, *optional*) -- | |
| Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. A causal mask will also | |
| be used by default. | |
| - **encoder_outputs** (`tuple(tuple(torch.FloatTensor))`, *optional*) -- | |
| Tuple consisting of (`last_hidden_state`, `optional`: *hidden_states*, `optional`: *attentions*). | |
| `last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)` is a sequence of hidden states at | |
| the output of the last layer of the encoder. Used in the cross-attention of the decoder. | |
| - **past_key_values** (`tuple(tuple(torch.FloatTensor))`, *optional*, defaults to `None`) -- | |
| Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. | |
| The outer tuple has length `config.n_layers`; each inner tuple holds 2 tensors of shape | |
| `(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape | |
| `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`. | |
| The [OVModelForPix2Struct](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForPix2Struct) forward method overrides the `__call__` special method. | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of calling `forward` directly, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
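The `hidden_size` formula given for `flattened_patches` can be checked with a small calculation. The sketch below only evaluates the formula as stated in the parameter description. The helper names and the sizes (3 channels, 16x16 patches, 2048 patches) are illustrative assumptions, so the processor output should be inspected for the actual tensor shape of a given checkpoint.

```python
def flattened_patch_size(num_channels, patch_size):
    """hidden_size = num_channels * patch_size * patch_size, as stated above."""
    return num_channels * patch_size * patch_size


def flattened_patches_shape(batch_size, seq_length, num_channels, patch_size):
    """Shape `(batch_size, seq_length, hidden_size)` implied by the formula."""
    return (batch_size, seq_length, flattened_patch_size(num_channels, patch_size))


# Illustrative sizes only.
hidden_size = flattened_patch_size(num_channels=3, patch_size=16)
shape = flattened_patches_shape(batch_size=2, seq_length=2048, num_channels=3, patch_size=16)
```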
| Example of pix2struct: | |
| ```python | |
| >>> from transformers import AutoProcessor | |
| >>> from optimum.intel import OVModelForPix2Struct | |
| >>> from PIL import Image | |
| >>> import requests | |
| >>> processor = AutoProcessor.from_pretrained("google/pix2struct-ai2d-base") | |
| >>> model = OVModelForPix2Struct.from_pretrained("google/pix2struct-ai2d-base", export=True) | |
| >>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg" | |
| >>> image = Image.open(requests.get(url, stream=True).raw) | |
| >>> question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud" | |
| >>> inputs = processor(images=image, text=question, return_tensors="pt") | |
| >>> gen_tokens = model.generate(**inputs) | |
| >>> outputs = processor.batch_decode(gen_tokens, skip_special_tokens=True) | |
| ``` | |
| **Parameters:** | |
| encoder (`openvino.Model`) : The OpenVINO Runtime model associated with the encoder. | |
| decoder (`openvino.Model`) : The OpenVINO Runtime model associated with the decoder. | |
| decoder_with_past (`openvino.Model`) : The OpenVINO Runtime model associated with the decoder with past key values. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is an instance of the configuration associated with the model. Initializing with a config file does not load the weights associated with the model, only the configuration. | |
| ## Custom Tasks | |
| ### OVModelForCustomTasks[[optimum.intel.OVModelForCustomTasks]] | |
| #### optimum.intel.OVModelForCustomTasks[[optimum.intel.OVModelForCustomTasks]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L945) | |
| OpenVINO Model for custom tasks. It can be used to leverage inference acceleration for any single-file OpenVINO model that may use custom inputs and outputs. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| `optimum.intel.OVModelForCustomTasks.forward(**kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L946) | |
| The [OVModelForCustomTasks](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForCustomTasks) forward method overrides the `__call__` special method. | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of calling `forward` directly, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
| Example of custom tasks (e.g. a sentence transformers model with a pooler head): | |
| ```python | |
| >>> from transformers import AutoTokenizer | |
| >>> from optimum.intel import OVModelForCustomTasks | |
| >>> tokenizer = AutoTokenizer.from_pretrained("IlyasMoutawwakil/sbert-all-MiniLM-L6-v2-with-pooler") | |
| >>> model = OVModelForCustomTasks.from_pretrained("IlyasMoutawwakil/sbert-all-MiniLM-L6-v2-with-pooler") | |
| >>> inputs = tokenizer("I love burritos!", return_tensors="np") | |
| >>> outputs = model(**inputs) | |
| >>> last_hidden_state = outputs.last_hidden_state | |
| >>> pooler_output = outputs.pooler_output | |
| ``` | |
| **Parameters:** | |
| model (`openvino.Model`) : The `openvino.Model` instance, the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `optimum.intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : When set to `True`, all of the model's dimensions are set to dynamic. Set to `False` to prevent the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : A dictionary of options related to model compilation. | |
| compile (`bool`, defaults to `True`) : When set to `False`, disables model compilation during the loading step. This is useful to avoid unnecessary compilation when the model needs to be statically reshaped, the device changed, or FP16 conversion enabled afterwards. | |
| ### OVModelForFeatureExtraction[[optimum.intel.OVModelForFeatureExtraction]] | |
| #### optimum.intel.OVModelForFeatureExtraction[[optimum.intel.OVModelForFeatureExtraction]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L352) | |
| OpenVINO Model with a BaseModelOutput for feature extraction tasks. | |
| This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the | |
| library implements for all its models (such as downloading or saving). | |
| `optimum.intel.OVModelForFeatureExtraction.forward(input_ids: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray], token_type_ids: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L366) | |
| - **input_ids** (`torch.Tensor`) -- | |
| Indices of input sequence tokens in the vocabulary. | |
| Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer). | |
| [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids) | |
| - **attention_mask** (`torch.Tensor`, *optional*) -- | |
| Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`: | |
| - 1 for tokens that are **not masked**, | |
| - 0 for tokens that are **masked**. | |
| [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask) | |
| - **token_type_ids** (`torch.Tensor`, *optional*) -- | |
| Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`: | |
| - 0 for tokens that belong to **sentence A**, | |
| - 1 for tokens that belong to **sentence B**. | |
| [What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids) | |
| The [OVModelForFeatureExtraction](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForFeatureExtraction) forward method overrides the `__call__` special method. | |
| Although the recipe for the forward pass needs to be defined within this function, one should call the `Module` | |
| instance afterwards instead of calling `forward` directly, since the former takes care of running the pre- and post-processing steps while | |
| the latter silently ignores them. | |
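The `attention_mask` convention described above (1 for real tokens, 0 for padding) can be shown by padding a small batch by hand. This is a plain-Python sketch: `pad_batch` is a hypothetical helper, the token ids are made up, and in practice the tokenizer produces both lists.

```python
def pad_batch(sequences, pad_token_id=0):
    """Right-pad token id lists to a common length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # 1 marks a real token, 0 marks a padding position.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Made-up token ids for two sequences of different lengths.
input_ids, attention_mask = pad_batch([[101, 7, 102], [101, 102]])
```

The shorter sequence receives one padding id, and the mask marks that position with 0 so it is ignored by attention.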
| Example of feature extraction using `transformers.pipelines`: | |
| ```python | |
| >>> from transformers import AutoTokenizer, pipeline | |
| >>> from optimum.intel import OVModelForFeatureExtraction | |
| >>> tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2") | |
| >>> model = OVModelForFeatureExtraction.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", export=True) | |
| >>> pipe = pipeline("feature-extraction", model=model, tokenizer=tokenizer) | |
| >>> outputs = pipe("My Name is Peter and I live in New York.") | |
| ``` | |
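Feature extraction yields one vector per token. A common way to reduce these to a single sentence vector is masked mean pooling, sketched below in plain Python. This is an assumption about downstream use rather than something this class does for you: `mean_pool` is a hypothetical helper and the numbers are made up.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                sums[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [total / count for total in sums]


# Three made-up token vectors; the last one is padding and must be ignored.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
```

Because the padded position is masked out, the large values in the third vector do not affect the result.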
| **Parameters:** | |
| model (`openvino.Model`) : The `openvino.Model` instance, the main class used to run OpenVINO Runtime inference. | |
| config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `optimum.intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights. | |
| device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device. | |
| dynamic_shapes (`bool`, defaults to `True`) : When set to `True`, all of the model's dimensions are set to dynamic. Set to `False` to prevent the model from being dynamically reshaped by default. | |
| ov_config (`Optional[Dict]`, defaults to `None`) : A dictionary of options related to model compilation. | |
| compile (`bool`, defaults to `True`) : When set to `False`, disables model compilation during the loading step. This is useful to avoid unnecessary compilation when the model needs to be statically reshaped, the device changed, or FP16 conversion enabled afterwards. | |
| ## Text-to-image | |
| ### OVStableDiffusionPipeline[[optimum.intel.OVStableDiffusionPipeline]] | |
| #### optimum.intel.OVStableDiffusionPipeline[[optimum.intel.OVStableDiffusionPipeline]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1443) | |
| OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion#diffusers.StableDiffusionPipeline). | |
| `optimum.intel.OVStableDiffusionPipeline.forward(*args, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976) | |
| ### OVStableDiffusionXLPipeline[[optimum.intel.OVStableDiffusionXLPipeline]] | |
| #### optimum.intel.OVStableDiffusionXLPipeline[[optimum.intel.OVStableDiffusionXLPipeline]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1477) | |
| OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionXLPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline). | |
| `optimum.intel.OVStableDiffusionXLPipeline.forward(*args, **kwargs)` | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976) | |
| ### OVLatentConsistencyModelPipeline[[optimum.intel.OVLatentConsistencyModelPipeline]] | |
| #### optimum.intel.OVLatentConsistencyModelPipeline[[optimum.intel.OVLatentConsistencyModelPipeline]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1578) | |
| OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.LatentConsistencyModelPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/latent_consistency#diffusers.LatentConsistencyModelPipeline). | |
| `optimum.intel.OVLatentConsistencyModelPipeline.forward(*args, **kwargs)` ([Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)) | |
| ## Image-to-image | |
| ### OVStableDiffusionImg2ImgPipeline[[optimum.intel.OVStableDiffusionImg2ImgPipeline]] | |
| #### optimum.intel.OVStableDiffusionImg2ImgPipeline[[optimum.intel.OVStableDiffusionImg2ImgPipeline]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1453) | |
| OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionImg2ImgPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_img2img#diffusers.StableDiffusionImg2ImgPipeline). | |
| `optimum.intel.OVStableDiffusionImg2ImgPipeline.forward(*args, **kwargs)` ([Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)) | |
| ### OVStableDiffusionXLImg2ImgPipeline[[optimum.intel.OVStableDiffusionXLImg2ImgPipeline]] | |
| #### optimum.intel.OVStableDiffusionXLImg2ImgPipeline[[optimum.intel.OVStableDiffusionXLImg2ImgPipeline]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1500) | |
| OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionXLImg2ImgPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline). | |
| `optimum.intel.OVStableDiffusionXLImg2ImgPipeline.forward(*args, **kwargs)` ([Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)) | |
| ## Inpainting | |
| ### OVStableDiffusionInpaintPipeline[[optimum.intel.OVStableDiffusionInpaintPipeline]] | |
| #### optimum.intel.OVStableDiffusionInpaintPipeline[[optimum.intel.OVStableDiffusionInpaintPipeline]] | |
| [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1465) | |
| OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionInpaintPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_inpaint#diffusers.StableDiffusionInpaintPipeline). | |
| `optimum.intel.OVStableDiffusionInpaintPipeline.forward(*args, **kwargs)` ([Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)) | |
| ### Export your model | |
| https://huggingface.co/docs/optimum.intel/pr_1714/openvino/export.md | |
| # Export your model | |
| To export a [model](https://huggingface.co/docs/optimum/main/en/intel/openvino/models) hosted on the [Hub](https://huggingface.co/models), you can use our [space](https://huggingface.co/spaces/echarlaix/openvino-export). After conversion, a repository is pushed under your namespace; it can be either public or private. | |
| ## Using the CLI | |
| To export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI: | |
| ```bash | |
| optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B ov_model/ | |
| ``` | |
| To export a private model or a model that requires access, you can either run `huggingface-cli login` to log in permanently, or set the environment variable `HF_TOKEN` to a [token](https://huggingface.co/settings/tokens) with access to the model. See the [authentication documentation](https://huggingface.co/docs/huggingface_hub/quick-start#authentication) for more information. | |
| The model argument can either be the model ID of a model hosted on the [Hub](https://huggingface.co/models) or a path to a model hosted locally. For local models, you need to specify the task for which the model should be loaded before export, among the list of the [supported tasks](https://huggingface.co/docs/optimum/main/en/exporters/task_manager). | |
| ```bash | |
| optimum-cli export openvino --model local_llama --task text-generation-with-past ov_model/ | |
| ``` | |
| Check out the help for more options: | |
| ```text | |
| usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt}] [--trust-remote-code] | |
| [--weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4}] | |
| [--quant-mode {int8,f8e4m3,f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2}] | |
| [--library {transformers,diffusers,timm,sentence_transformers,open_clip}] | |
| [--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym] | |
| [--group-size GROUP_SIZE] [--backup-precision {none,int8_sym,int8_asym}] | |
| [--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--gptq] | |
| [--lora-correction] [--sensitivity-metric SENSITIVITY_METRIC] | |
| [--quantization-statistics-path QUANTIZATION_STATISTICS_PATH] | |
| [--num-samples NUM_SAMPLES] [--disable-stateful] [--disable-convert-tokenizer] | |
| [--smooth-quant-alpha SMOOTH_QUANT_ALPHA] | |
| output | |
| optional arguments: | |
| -h, --help show this help message and exit | |
| Required arguments: | |
| -m MODEL, --model MODEL | |
| Model ID on huggingface.co or path on disk to load model from. | |
| output Path indicating the directory where to store the generated OV model. | |
| Optional arguments: | |
| --task TASK The task to export the model for. If not specified, the task will be auto-inferred based on | |
| the model. Available tasks depend on the model, but are among: ['image-to-image', | |
| 'image-segmentation', 'inpainting', 'sentence-similarity', 'text-to-audio', 'image-to-text', | |
| 'automatic-speech-recognition', 'token-classification', 'text-to-image', 'audio-classification', | |
| 'feature-extraction', 'semantic-segmentation', 'masked-im', 'audio-xvector', | |
| 'audio-frame-classification', 'text2text-generation', 'multiple-choice', 'depth-estimation', | |
| 'image-classification', 'fill-mask', 'zero-shot-object-detection', 'object-detection', | |
| 'question-answering', 'zero-shot-image-classification', 'mask-generation', 'text-generation', | |
| 'text-classification']. For decoder models, use 'xxx-with-past' to export the model using past | |
| key values in the decoder. | |
| --framework {pt} The framework to use for the export. Defaults to 'pt' for PyTorch. | |
| --trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should | |
| only be set for repositories you trust and in which you have read the code, as it will execute | |
| on your local machine arbitrary code present in the model repository. | |
| --weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4} | |
| The weight format of the exported model. Option 'cb4' represents a codebook with 16 | |
| fixed fp8 values in E4M3 format. | |
| --quant-mode {int8,f8e4m3,f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2} | |
| Quantization precision mode. This is used for applying full model quantization including | |
| activations. | |
| --library {transformers,diffusers,timm,sentence_transformers,open_clip} | |
| The library used to load the model before export. If not provided, will attempt to infer the | |
| local checkpoint's library | |
| --cache_dir CACHE_DIR | |
| The path to a directory in which the downloaded model should be cached if the standard cache | |
| should not be used. | |
| --pad-token-id PAD_TOKEN_ID | |
| This is needed by some models, for some tasks. If not provided, will attempt to use the | |
| tokenizer to guess it. | |
| --ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit | |
| quantization. If set to 0.8, 80% of the layers will be quantized to int4 while 20% will be | |
| quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size | |
| and inference latency. Default value is 1.0. Note: If dataset is provided, and the ratio is | |
| less than 1.0, then data-aware mixed precision assignment will be applied. | |
| --sym Whether to apply symmetric quantization. This argument is related to integer-typed | |
| --weight-format and --quant-mode options. In case of full or mixed quantization (--quant-mode) | |
| symmetric quantization will be applied to weights in any case, so only activation quantization | |
| will be affected by --sym argument. For weight-only quantization (--weight-format) --sym | |
| argument does not affect backup precision. Examples: (1) --weight-format int8 --sym => int8 | |
| symmetric quantization of weights; (2) --weight-format int4 => int4 asymmetric quantization of | |
| weights; (3) --weight-format int4 --sym --backup-precision int8_asym => int4 symmetric | |
| quantization of weights with int8 asymmetric backup precision; (4) --quant-mode int8 --sym => | |
| weights and activations are quantized to int8 symmetric data type; (5) --quant-mode int8 => | |
| activations are quantized to int8 asymmetric data type, weights -- to int8 symmetric data type; | |
| (6) --quant-mode int4_f8e5m2 --sym => activations are quantized to f8e5m2 data type, weights -- | |
| to int4 symmetric data type. | |
| --group-size GROUP_SIZE | |
| The group size to use for quantization. Recommended value is 128 and -1 uses per-column | |
| quantization. | |
| --group-size-fallback {error,ignore,adjust} | |
| Specifies how to handle operations that do not support the given group size. Possible values are: | |
| `error`: raise an error if the given group size is not supported by a node, this is the default | |
| behavior; | |
| `ignore`: skip nodes that cannot be compressed with the given group size; | |
| `adjust`: adjust the group size to the maximum supported value for each problematic node, if | |
| there is no valid value greater than or equal to 32, then the node is quantized to the backup | |
| precision which is int8_asym by default. | |
| --backup-precision {none,int8_sym,int8_asym} | |
| Defines a backup precision for mixed-precision weight compression. Only valid for 4-bit weight | |
| formats. If not provided, backup precision is int8_asym. 'none' stands for original floating- | |
| point precision of the model weights, in this case weights are retained in their original | |
| precision without any quantization. 'int8_sym' stands for 8-bit integer symmetric quantization | |
| without zero point. 'int8_asym' stands for 8-bit integer asymmetric quantization with zero | |
| points per each quantization group. | |
| --dataset DATASET The dataset used for data-aware compression or quantization with NNCF. Can be a dataset name | |
| (e.g., 'wikitext2') or a string with options (e.g., 'wikitext2:seq_len=128'). The only currently | |
| supported option is `seq_len` which represents a length of an input sample sequence (sentence). | |
| For language models you can use the one from the list | |
| ['auto','wikitext2','c4','c4-new','gsm8k']. With 'auto' the dataset will be collected from model's | |
| generations. For diffusion models it should be one of ['conceptual_captions', | |
| 'laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit']. For visual language models | |
| the dataset must be set to 'contextual'. Note: if none of the data-aware compression algorithms | |
| are selected and ratio parameter is omitted or equals 1.0, the dataset argument will not have an | |
| effect on the resulting model. Note: for text generation task, datasets with English texts such | |
| as 'wikitext2','gsm8k','c4' or 'c4-new' usually work fine even for non-English models. | |
| --all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided and | |
| weight compression is applied, they are compressed to INT8. | |
| --awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs. If | |
| dataset is provided, a data-aware activation-based version of the algorithm will be executed, | |
| which requires additional time. Otherwise, data-free AWQ will be applied which relies on | |
| per-column magnitudes of weights instead of activations. Note: it is possible that there will | |
| be no matching patterns in the model to apply AWQ, in such case it will be skipped. | |
| --scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between | |
| the original and compressed layers. Providing a dataset is required to run scale estimation. | |
| Please note, that applying scale estimation takes additional memory and time. | |
| --gptq Indicates whether to apply GPTQ algorithm that optimizes compressed weights in a layer-wise | |
| fashion to minimize the difference between activations of a compressed and original layer. | |
| Please note, that applying GPTQ takes additional memory and time. | |
| --lora-correction Indicates whether to apply LoRA Correction algorithm. When enabled, this algorithm introduces | |
| low-rank adaptation layers in the model that can recover accuracy after weight compression at | |
| some cost of inference latency. Please note, that applying LoRA Correction algorithm takes | |
| additional memory and time. | |
| --sensitivity-metric SENSITIVITY_METRIC | |
| The sensitivity metric for assigning quantization precision to layers. It can be one of the | |
| following: ['weight_quantization_error', 'hessian_input_activation', | |
| 'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude']. | |
| --quantization-statistics-path QUANTIZATION_STATISTICS_PATH | |
| Directory path to dump/load data-aware weight-only quantization statistics. This is useful when | |
| running data-aware quantization multiple times on the same model and dataset to avoid | |
| recomputing statistics. This option is applicable exclusively for weight-only quantization. | |
| Please note that the statistics depend on the dataset, so if you change the dataset, you should | |
| also change the statistics path to avoid confusion. | |
| --num-samples NUM_SAMPLES | |
| The maximum number of samples to take from the dataset for quantization. | |
| --disable-stateful Disable stateful converted models, stateless models will be generated instead. Stateful models | |
| are produced by default when this key is not used. In stateful models all kv-cache inputs and | |
| outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable- | |
| stateful option is used, it may result in sub-optimal inference performance. Use it when you | |
| intentionally want to use a stateless model, for example, to be compatible with existing | |
| OpenVINO native inference code that expects KV-cache inputs and outputs in the model. | |
| --disable-convert-tokenizer | |
| Do not add converted tokenizer and detokenizer OpenVINO models. | |
| --smooth-quant-alpha SMOOTH_QUANT_ALPHA | |
| SmoothQuant alpha parameter that improves the distribution of activations before MatMul layers | |
| and reduces quantization error. Valid only when activations quantization is enabled. | |
| ``` | |
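The `--dataset` value described in the help text above is either a bare dataset name or a name followed by an option such as `:seq_len=128`. The sketch below only illustrates that string format. `parse_dataset_arg` is a hypothetical helper written for this example; it is not the parser that optimum-intel uses internally.

```python
def parse_dataset_arg(value: str) -> tuple[str, dict[str, int]]:
    """Split 'name' or 'name:seq_len=N' into the dataset name and its options."""
    name, sep, option = value.partition(":")
    options: dict[str, int] = {}
    if sep:
        key, _, raw_value = option.partition("=")
        # The help text lists `seq_len` as the only supported option
        if key != "seq_len":
            raise ValueError(f"unsupported dataset option: {key!r}")
        options[key] = int(raw_value)
    return name, options


print(parse_dataset_arg("wikitext2"))
print(parse_dataset_arg("wikitext2:seq_len=128"))
```

The second call separates the dataset name from the requested sample sequence length, which is how the `wikitext2:seq_len=128` example in the help text is meant to be read.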
| You can also apply fp16, 8-bit or 4-bit weight-only quantization to the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to `fp16`, `int8` or `int4`, respectively. | |
| Export with INT8 weights compression: | |
| ```bash | |
| optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int8 ov_model/ | |
| ``` | |
| Export with INT4 weights compression: | |
| ```bash | |
| optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int4 ov_model/ | |
| ``` | |
| Export with INT4 weights compression and data-free AWQ algorithm: | |
| ```bash | |
| optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int4 --awq ov_model/ | |
| ``` | |
| Export with INT4 weights compression and data-aware AWQ and Scale Estimation algorithms: | |
| ```bash | |
| optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B \ | |
| --weight-format int4 --awq --scale-estimation --dataset wikitext2 ov_model/ | |
| ``` | |
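When exports are scripted, the commands above can be assembled programmatically. The following is a minimal sketch, assuming you only need the flags used in the examples above. `build_export_command` is a hypothetical helper and not part of optimum-intel. It also encodes the rule from the help text that scale estimation requires a dataset.

```python
import shlex


def build_export_command(model, output, weight_format=None, awq=False,
                         scale_estimation=False, dataset=None):
    """Assemble an `optimum-cli export openvino` argument list (illustrative helper)."""
    if scale_estimation and dataset is None:
        # The CLI help states that a dataset is required to run scale estimation
        raise ValueError("--scale-estimation requires --dataset")
    cmd = ["optimum-cli", "export", "openvino", "--model", model]
    if weight_format is not None:
        cmd += ["--weight-format", weight_format]
    if awq:
        cmd.append("--awq")
    if scale_estimation:
        cmd.append("--scale-estimation")
    if dataset is not None:
        cmd += ["--dataset", dataset]
    cmd.append(output)  # the output directory is the positional argument
    return cmd


cmd = build_export_command(
    "meta-llama/Meta-Llama-3-8B", "ov_model/",
    weight_format="int4", awq=True, scale_estimation=True, dataset="wikitext2",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to run the export, provided `optimum-cli` is installed in the active environment.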
| For more information on the quantization parameters, check out the [documentation](inference#weight-only-quantization). | |
| Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`. | |
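To see why the weight format matters for large models, here is a back-of-the-envelope estimate of weight storage. It is a rough sketch that assumes a hypothetical model with exactly 8 billion parameters, all stored in a single format. It ignores quantization metadata (scales, zero points) and any layers kept in a backup precision, so real exported models will differ.

```python
# Bits used to store one weight in each format
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def estimate_weights_gb(num_params, weight_format):
    """Approximate weight storage in GB (10**9 bytes), ignoring quantization metadata."""
    return num_params * BITS_PER_WEIGHT[weight_format] / 8 / 1e9


for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: ~{estimate_weights_gb(8e9, fmt):.0f} GB")
```

Under these assumptions, moving from `fp32` to `int8` cuts the weight storage by a factor of four, and `int4` halves it again.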
| Besides weight-only quantization, you can also apply full model quantization, including activations, by setting `--quant-mode` to the preferred precision. This quantizes both the weights and the activations of Linear, Convolutional and some other layers to the selected mode. See the example below. | |
| ```bash | |
| optimum-cli export openvino -m openai/whisper-large-v3-turbo --quant-mode int8 --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-large-v3-turbo | |
| ``` | |
| #### Default quantization configs | |
| For some models we maintain a set of default quantization configs ([link](https://github.com/huggingface/optimum-intel/blob/main/optimum/intel/openvino/configuration.py)). To apply the default 4-bit weight-only quantization, pass `--weight-format int4` without any additional arguments. For the default int8 weight and activation quantization, pass `--quant-mode int8`. For example: | |
| ```bash | |
| optimum-cli export openvino -m microsoft/Phi-4-mini-instruct --weight-format int4 ./Phi-4-mini-instruct | |
| ``` | |
| or | |
| ```bash | |
| optimum-cli export openvino -m openai/clip-vit-base-patch16 --quant-mode int8 ./clip-vit-base-patch16 | |
| ``` | |
| ### Decoder models | |
| For models with a decoder, we enable the re-use of past keys and values by default. This avoids recomputing the same intermediate activations at each generation step. To export the model without this K-V cache, remove the `-with-past` suffix when specifying the task. | |
| | With K-V cache | Without K-V cache | | |
| |------------------------------------------|--------------------------------------| | |
| | `text-generation-with-past` | `text-generation` | | |
| | `text2text-generation-with-past` | `text2text-generation` | | |
| | `automatic-speech-recognition-with-past` | `automatic-speech-recognition` | | |
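The naming rule in the table above can be captured in a few lines. This is a sketch only: `export_task` and `DECODER_TASKS` are hypothetical names introduced for illustration, and the set contains just the three tasks listed in the table.

```python
# Tasks from the table above that accept the `-with-past` suffix
DECODER_TASKS = {
    "text-generation",
    "text2text-generation",
    "automatic-speech-recognition",
}


def export_task(task: str, use_cache: bool = True) -> str:
    """Return the task name to pass to `--task`, with or without K-V cache."""
    base = task.removesuffix("-with-past")
    if base not in DECODER_TASKS:
        return task  # the suffix does not apply to this task; leave it unchanged
    return f"{base}-with-past" if use_cache else base


print(export_task("text-generation"))
print(export_task("text-generation-with-past", use_cache=False))
```

The first call yields the task name that exports with K-V cache, and the second strips the suffix to export without it.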
| ### Diffusion models | |
| When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into different components that are later combined during inference: | |
| * Text encoder(s) | |
| * U-Net | |
| * VAE encoder | |
| * VAE decoder | |
| To export your Stable Diffusion XL model to the OpenVINO IR format with the CLI you can do as follows: | |
| ```bash | |
| optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 ov_sdxl/ | |
| ``` | |
| You can also apply hybrid quantization during model export. For example: | |
| ```bash | |
| optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 \ | |
| --weight-format int8 --dataset conceptual_captions ov_sdxl/ | |
| ``` | |
| For more information about hybrid quantization, take a look at this jupyter [notebook](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb). | |
| ## When loading your model | |
| You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model. | |
| To easily save the resulting model, you can use the `save_pretrained()` method, which will save both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory, to enable easy loading of the tokenizer for the model. | |
| ```diff | |
| - from transformers import AutoModelForCausalLM | |
| + from optimum.intel import OVModelForCausalLM | |
| from transformers import AutoTokenizer | |
| model_id = "meta-llama/Meta-Llama-3-8B" | |
| - model = AutoModelForCausalLM.from_pretrained(model_id) | |
| + model = OVModelForCausalLM.from_pretrained(model_id, export=True) | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| save_directory = "ov_model" | |
| model.save_pretrained(save_directory) | |
| tokenizer.save_pretrained(save_directory) | |
| ``` | |
| ## After loading your model | |
| ```python | |
| from transformers import AutoModelForCausalLM | |
| from optimum.exporters.openvino import export_from_model | |
| model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B") | |
| export_from_model(model, output="ov_model", task="text-generation-with-past") | |
| ``` | |
| Once the model is exported, you can now [load your OpenVINO model](inference) by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. | |
| ## Troubleshooting | |
| Some models do not work with the latest transformers release. You may see an error message with a maximum supported version. To export these models, install a transformers version that supports the model, for example `pip install transformers==4.53.3`. | |
| The supported transformers versions compatible with each optimum-intel release are listed on the [GitHub releases page](https://github.com/huggingface/optimum-intel/releases/). | |
| ### Inference | |
| https://huggingface.co/docs/optimum.intel/pr_1714/openvino/inference.md | |
| # Inference | |
| Optimum Intel can be used to load optimized models from the [Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime on a variety of Intel processors (see the [full list of supported devices](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html)). | |
| ## Loading | |
| ### Transformers models | |
| Once [your model has been exported](export), you can load it by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class. | |
| ```diff | |
| - from transformers import AutoModelForCausalLM | |
| + from optimum.intel import OVModelForCausalLM | |
| from transformers import AutoTokenizer, pipeline | |
| model_id = "helenai/gpt2-ov" | |
| - model = AutoModelForCausalLM.from_pretrained(model_id) | |
| # here the model was already exported so no need to set export=True | |
| + model = OVModelForCausalLM.from_pretrained(model_id) | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| pipe = pipeline("text-generation", model=model, tokenizer=tokenizer) | |
| results = pipe("He's a dreadful magician and") | |
| ``` | |
| As shown in the table below, each task is associated with a class that can be used to load your model automatically. | |
| | Auto Class | Task | | |
| |--------------------------------------|--------------------------------------| | |
| | `OVModelForSequenceClassification` | `text-classification` | | |
| | `OVModelForTokenClassification` | `token-classification` | | |
| | `OVModelForQuestionAnswering` | `question-answering` | | |
| | `OVModelForAudioClassification` | `audio-classification` | | |
| | `OVModelForImageClassification` | `image-classification` | | |
| | `OVModelForFeatureExtraction` | `feature-extraction` | | |
| | `OVModelForMaskedLM` | `fill-mask` | | |
| | `OVModelForCausalLM` | `text-generation-with-past` | | |
| | `OVModelForSeq2SeqLM` | `text2text-generation-with-past` | | |
| | `OVModelForSpeechSeq2Seq` | `automatic-speech-recognition` | | |
| | `OVModelForVision2Seq` | `image-to-text` | | |
| | `OVModelForTextToSpeechSeq2Seq` | `text-to-audio` | | |
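If you select the class from a task name at runtime, the table above can be expressed as a lookup. The sketch below maps task names to class names as plain strings so that it runs without optimum-intel installed. `TASK_TO_OV_CLASS` and `ov_class_name_for` are hypothetical names for this example, not library APIs.

```python
# Task -> class name, transcribed from the table above
TASK_TO_OV_CLASS = {
    "text-classification": "OVModelForSequenceClassification",
    "token-classification": "OVModelForTokenClassification",
    "question-answering": "OVModelForQuestionAnswering",
    "audio-classification": "OVModelForAudioClassification",
    "image-classification": "OVModelForImageClassification",
    "feature-extraction": "OVModelForFeatureExtraction",
    "fill-mask": "OVModelForMaskedLM",
    "text-generation": "OVModelForCausalLM",
    "text2text-generation": "OVModelForSeq2SeqLM",
    "automatic-speech-recognition": "OVModelForSpeechSeq2Seq",
    "image-to-text": "OVModelForVision2Seq",
    "text-to-audio": "OVModelForTextToSpeechSeq2Seq",
}


def ov_class_name_for(task: str) -> str:
    """Look up the class name for a task, accepting the `-with-past` variants."""
    return TASK_TO_OV_CLASS[task.removesuffix("-with-past")]


print(ov_class_name_for("text-generation-with-past"))
print(ov_class_name_for("fill-mask"))
```

With optimum-intel installed, the returned name could then be resolved to the actual class with `getattr(optimum.intel, name)`.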
| ### Diffusers models | |
| Make sure you have 🤗 Diffusers installed. To install `diffusers`: | |
| ```bash | |
| pip install diffusers | |
| ``` | |
| ```diff | |
| - from diffusers import StableDiffusionPipeline | |
| + from optimum.intel import OVStableDiffusionPipeline | |
| model_id = "echarlaix/stable-diffusion-v1-5-openvino" | |
| - pipeline = StableDiffusionPipeline.from_pretrained(model_id) | |
| + pipeline = OVStableDiffusionPipeline.from_pretrained(model_id) | |
| prompt = "sailing ship in storm by Rembrandt" | |
| images = pipeline(prompt).images | |
| ``` | |
| As shown in the table below, each task is associated with a class that can be used to load your model automatically. | |
| | Auto Class | Task | | |
| |--------------------------------------|--------------------------------------| | |
| | `OVStableDiffusionPipeline` | `text-to-image` | | |
| | `OVStableDiffusionImg2ImgPipeline` | `image-to-image` | | |
| | `OVStableDiffusionInpaintPipeline` | `inpaint` | | |
| | `OVStableDiffusionXLPipeline` | `text-to-image` | | |
| | `OVStableDiffusionXLImg2ImgPipeline` | `image-to-image` | | |
| | `OVLatentConsistencyModelPipeline` | `text-to-image` | | |
| | `OVLTXPipeline` | `text-to-video` | | |
| | `OVPipelineForText2Video` | `text-to-video` | | |
| See the [reference documentation](reference) for more information about parameters, and examples for different tasks. | |
| ## Compilation | |
| By default, the model is compiled when an `OVModel` is instantiated. If the model is then reshaped or moved to another device, it needs to be recompiled, which by default happens just before the first inference (and therefore inflates its latency). To avoid an unnecessary compilation, you can disable the initial compilation by setting `compile=False`. | |
| ```python | |
| from optimum.intel import OVModelForQuestionAnswering | |
| model_id = "distilbert/distilbert-base-cased-distilled-squad" | |
| # Load the model and disable the model compilation | |
| model = OVModelForQuestionAnswering.from_pretrained(model_id, compile=False) | |
| ``` | |
| To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html) about installing drivers for GPU inference). | |
| ```python | |
| model.to("gpu") | |
| ``` | |
| You can then compile the model explicitly before the first inference: | |
| ```python | |
| model.compile() | |
| ``` | |
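The benefit of `compile=False` is easiest to see by counting compilations. The toy class below is not the real `OVModel` and makes no claim about its internals. It is a stand-in, written for this example, that mimics the behavior described above: compile at load time unless disabled, invalidate the compiled model on a device change, and compile lazily before inference if needed.

```python
class ToyModel:
    """Toy stand-in that only counts compilations; not the real OVModel."""

    def __init__(self, compile=True):
        self.device = "cpu"
        self.compiled = False
        self.compile_count = 0
        if compile:
            self.compile()

    def compile(self):
        if not self.compiled:
            self.compile_count += 1
            self.compiled = True

    def to(self, device):
        self.device = device
        self.compiled = False  # a device change invalidates the compiled model
        return self

    def __call__(self):
        self.compile()  # compile lazily before inference if needed
        return f"ran on {self.device}"


eager = ToyModel()              # compiled for CPU at load time
eager.to("gpu")
eager()                         # compiled a second time, now for GPU

lazy = ToyModel(compile=False)  # no compilation at load time
lazy.to("gpu")
lazy.compile()                  # compiled once, for the final device
lazy()

print(eager.compile_count, lazy.compile_count)
```

In this toy model the eager path pays for a CPU compilation that is never used, while the deferred path compiles only once.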
| ## Static shape | |
| By default, dynamic shapes are supported, enabling inference for inputs of any shape. To speed up inference, static shapes can be enabled by giving the desired input shapes with [.reshape()](reference#optimum.intel.OVBaseModel.reshape). | |
| ```python | |
| # Fix the batch size to 1 and the sequence length to 40 | |
| batch_size, seq_len = 1, 40 | |
| model.reshape(batch_size, seq_len) | |
| ``` | |
| When fixing the shapes with the `reshape()` method, inference cannot be performed with an input of a different shape. | |
| ```python | |
| from transformers import AutoTokenizer | |
| from optimum.intel import OVModelForQuestionAnswering | |
| model_id = "distilbert/distilbert-base-cased-distilled-squad" | |
| model = OVModelForQuestionAnswering.from_pretrained(model_id, compile=False) | |
| tokenizer = AutoTokenizer.from_pretrained(model_id) | |
| batch_size, seq_len = 1, 40 | |
| model.reshape(batch_size, seq_len) | |
| # Compile the model before the first inference | |
| model.compile() | |
| question = "Which name is also used to describe the Amazon rainforest ?" | |
| context = "The Amazon rainforest, also known as Amazonia or the Amazon Jungle" | |
| tokens = tokenizer(question, context, max_length=seq_len, padding="max_length", return_tensors="np") | |
| outputs = model(**tokens) | |
| ``` | |
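The `padding="max_length"` argument in the example above matters because a statically reshaped model only accepts inputs of exactly the fixed length. The sketch below shows the idea on plain Python lists. `pad_to_length` is a hypothetical helper, and the token ids and the `pad_id=0` default are made-up values for illustration; in practice the tokenizer does this for you.

```python
def pad_to_length(token_ids, seq_len, pad_id=0):
    """Pad token ids to exactly `seq_len` and build the matching attention mask."""
    if len(token_ids) > seq_len:
        # A longer input cannot fit the static shape and would have to be truncated
        raise ValueError(f"input has {len(token_ids)} tokens, static shape allows {seq_len}")
    num_pad = seq_len - len(token_ids)
    input_ids = token_ids + [pad_id] * num_pad
    attention_mask = [1] * len(token_ids) + [0] * num_pad
    return input_ids, attention_mask


input_ids, attention_mask = pad_to_length([101, 2054, 2003, 102], seq_len=8)
print(input_ids)
print(attention_mask)
```

Every input then has the same length as the shape passed to `reshape()`, and the attention mask marks which positions are padding.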
| For models that handle images, you can also specify the `height` and `width` when reshaping your model: | |
| ```python | |
| batch_size, num_images, height, width = 1, 1, 512, 512 | |
| pipeline.reshape(batch_size=batch_size, height=height, width=width, num_images_per_prompt=num_images) | |
| images = pipeline(prompt, height=height, width=width, num_images_per_prompt=num_images).images | |
| ``` | |
| ## Configuration | |
| The `ov_config` parameter allows you to provide custom OpenVINO configuration values. This can be used, for example, to enable full precision inference on devices where FP16 or BF16 inference precision is used by default. | |
| ```python | |
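| from optimum.intel import OVModelForSequenceClassification | |
| # Request full (FP32) inference precision instead of the device default | |
| ov_config = {"INFERENCE_PRECISION_HINT": "f32"} | |
| model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config=ov_config) | |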
| ``` | |
| Optimum Intel leverages OpenVINO's model caching to speed up model compilation on GPU. By default, a `model_cache` directory is created in the model's directory in the [Hugging Face Hub cache](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache). To override this, use the `ov_config` parameter and set `CACHE_DIR` to a different value. To disable model caching on GPU, set `CACHE_DIR` to an empty string. | |
| ```python | |
| ov_config = {"CACHE_DIR": ""} | |
| model = OVModelForSequenceClassification.from_pretrained(model_id, device="gpu", ov_config=ov_config) | |
| ``` | |
| ## Weight quantization | |
| You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when loading your model to reduce the memory footprint and inference latency. | |
| For more information on the quantization parameters, check out the [documentation](optimization#weight-only-quantization). | |
| If not specified, `load_in_8bit` will be set to `True` by default when models larger than 1 billion parameters are exported to the OpenVINO format (with `export=True`). You can disable it with `load_in_8bit=False`. | |
| It's also possible to apply quantization on both weights and activations using the [`OVQuantizer`](optimization#static-quantization). | |
| ### Generate images with Diffusion models | |
| https://huggingface.co/docs/optimum.intel/pr_1714/openvino/tutorials/diffusers.md | |
| # Generate images with Diffusion models | |
| ## Stable Diffusion | |
| Stable Diffusion models can also be used when running inference with OpenVINO. When Stable Diffusion models | |
| are exported to the OpenVINO format, they are decomposed into different components that are later combined during inference: | |
| - The text encoder | |
| - The U-Net | |
| - The VAE encoder | |
| - The VAE decoder | |
| | Task | Auto Class | | |
| |--------------------------------------|--------------------------------------| | |
| | `text-to-image` | `OVStableDiffusionPipeline` | | |
| | `image-to-image` | `OVStableDiffusionImg2ImgPipeline` | | |
| | `inpaint` | `OVStableDiffusionInpaintPipeline` | | |
| ### Text-to-Image | |
| Here is an example of how you can load an OpenVINO Stable Diffusion model and run inference using OpenVINO Runtime: | |
| ```python | |
| from optimum.intel import OVStableDiffusionPipeline | |
| model_id = "echarlaix/stable-diffusion-v1-5-openvino" | |
| pipeline = OVStableDiffusionPipeline.from_pretrained(model_id) | |
| prompt = "sailing ship in storm by Rembrandt" | |
| images = pipeline(prompt).images | |
| ``` | |
| To load your PyTorch model and convert it to OpenVINO on the fly, you can set `export=True`. | |
| ```python | |
| model_id = "runwayml/stable-diffusion-v1-5" | |
| pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True) | |
| # Don't forget to save the exported model | |
| pipeline.save_pretrained("openvino-sd-v1-5") | |
| ``` | |
| To further speed up inference, the model can be statically reshaped: | |
| ```python | |
| # Define the shapes related to the inputs and desired outputs | |
| batch_size, num_images, height, width = 1, 1, 512, 512 | |
| # Statically reshape the model | |
| pipeline.reshape(batch_size=batch_size, height=height, width=width, num_images_per_prompt=num_images) | |
| # Compile the model before the first inference | |
| pipeline.compile() | |
| # Run inference | |
| images = pipeline(prompt, height=height, width=width, num_images_per_prompt=num_images).images | |
| ``` | |
| If you want to change any of these parameters, such as the output height or width, you'll need to statically reshape the model again. | |
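To see why a change in height or width forces a new static reshape, it can help to look at the latent tensor shape that those values imply. The sketch below is only an illustration, not the library's implementation: `latent_shape` is a hypothetical helper, and it assumes the usual Stable Diffusion v1.5 layout of 4 latent channels and a VAE downsampling factor of 8.

```python
def latent_shape(batch_size, num_images_per_prompt, height, width,
                 vae_scale_factor=8, latent_channels=4):
    """Return the (N, C, H, W) latent shape implied by the pipeline arguments.

    Hypothetical helper for illustration only. The defaults assume the
    Stable Diffusion v1.5 layout (4 latent channels, 8x VAE downsampling).
    """
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError("height and width must be multiples of the VAE scale factor")
    return (
        batch_size * num_images_per_prompt,
        latent_channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


# The static shape used in the example above
print(latent_shape(1, 1, 512, 512))  # (1, 4, 64, 64)
# A different output size yields a different latent shape
print(latent_shape(1, 1, 768, 512))  # (1, 4, 96, 64)
```

Because a statically reshaped model is fixed to one such shape, any request that implies a different shape needs another call to `reshape`.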
| ### Text-to-Image with Textual Inversion | |
| Here is an example of how you can load an OpenVINO Stable Diffusion model with pre-trained textual inversion embeddings and run inference using OpenVINO Runtime. | |
| First, run the original pipeline without textual inversion: | |
| ```python | |
| from optimum.intel import OVStableDiffusionPipeline | |
| import numpy as np | |
| model_id = "echarlaix/stable-diffusion-v1-5-openvino" | |
| prompt = "A <cat-toy> back-pack" | |
| # Set a random seed for better comparison | |
| np.random.seed(42) | |
| pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=False, compile=False) | |
| pipeline.compile() | |
| image1 = pipeline(prompt, num_inference_steps=50).images[0] | |
| image1.save("stable_diffusion_v1_5_without_textual_inversion.png") | |
| ``` | |
| Then, load the [sd-concepts-library/cat-toy](https://huggingface.co/sd-concepts-library/cat-toy) textual inversion embedding and run the pipeline with the same prompt again: | |
| ```python | |
| # Reset stable diffusion pipeline | |
| pipeline.clear_requests() | |
| # Load textual inversion into stable diffusion pipeline | |
| pipeline.load_textual_inversion("sd-concepts-library/cat-toy", "<cat-toy>") | |
| # Compile the model before the first inference | |
| pipeline.compile() | |
| image2 = pipeline(prompt, num_inference_steps=50).images[0] | |
| image2.save("stable_diffusion_v1_5_with_textual_inversion.png") | |
| ``` | |
| The left image shows the result generated by the original Stable Diffusion v1.5 pipeline; the right image shows the result generated with textual inversion. | |
| | | | | |
| |---|---| | |
| |  |  | | |
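| ### Image-to-Image | |
| Here is an example of how you can load a PyTorch Stable Diffusion model, convert it to OpenVINO on the fly, and run inference using OpenVINO Runtime for *image-to-image*: | |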
| ```python | |
| import requests | |
| from PIL import Image | |
| from io import BytesIO | |
| from optimum.intel import OVStableDiffusionImg2ImgPipeline | |
| model_id = "runwayml/stable-diffusion-v1-5" | |
| pipeline = OVStableDiffusionImg2ImgPipeline.from_pretrained(model_id, export=True) | |
| url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" | |
| response = requests.get(url) | |
| init_image = Image.open(BytesIO(response.content)).convert("RGB") | |
| init_image = init_image.resize((768, 512)) | |
| prompt = "A fantasy landscape, trending on artstation" | |
| image = pipeline(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0] | |
| image.save("fantasy_landscape.png") | |
| ``` | |
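The `strength` argument in the example above controls how far the input image is noised before denoising starts. In the reference diffusers image-to-image pipelines, only a fraction of the scheduled steps is actually run. The sketch below reproduces that rule as a rough guide; `effective_denoising_steps` is a hypothetical helper, and it is an assumption that the OpenVINO pipeline follows the same rule with the default of 50 inference steps.

```python
def effective_denoising_steps(num_inference_steps: int, strength: float) -> int:
    """Number of denoising steps run for a given strength.

    Hypothetical helper mirroring the diffusers image-to-image rule:
    min(int(num_inference_steps * strength), num_inference_steps).
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    return min(int(num_inference_steps * strength), num_inference_steps)


# With 50 scheduled steps and strength=0.75, about three quarters of them run
print(effective_denoising_steps(50, 0.75))  # 37
```

A lower `strength` keeps the output closer to the input image and runs fewer steps; `strength=1.0` runs the full schedule and largely ignores the input.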
| ## Stable Diffusion XL | |
| | Task | Auto Class | | |
| |--------------------------------------|--------------------------------------| | |
| | `text-to-image` | `OVStableDiffusionXLPipeline` | | |
| | `image-to-image` | `OVStableDiffusionXLImg2ImgPipeline` | | |
| ### Text-to-Image | |
| Here is an example of how you can load an SDXL OpenVINO model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and run inference using OpenVINO Runtime: | |
| ```python | |
| from optimum.intel import OVStableDiffusionXLPipeline | |
| model_id = "stabilityai/stable-diffusion-xl-base-1.0" | |
| base = OVStableDiffusionXLPipeline.from_pretrained(model_id) | |
| prompt = "train station by Caspar David Friedrich" | |
| image = base(prompt).images[0] | |
| image.save("train_station.png") | |
| ``` | |
| | | | | |
| |---|---| | |
| |  |  | | |
| ### Text-to-Image with Textual Inversion | |
| Here is an example of how you can load an SDXL OpenVINO model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) with pre-trained textual inversion embeddings and run inference using OpenVINO Runtime. | |
| First, run the original pipeline without textual inversion: | |
| ```python | |
| from optimum.intel import OVStableDiffusionXLPipeline | |
| import numpy as np | |
| model_id = "stabilityai/stable-diffusion-xl-base-1.0" | |
| prompt = "charturnerv2, multiple views of the same character in the same outfit, a character turnaround wearing a red jacket and black shirt, best quality, intricate details." | |
| # Set a random seed for better comparison | |
| np.random.seed(112) | |
| base = OVStableDiffusionXLPipeline.from_pretrained(model_id, export=False, compile=False) | |
| base.compile() | |
| image1 = base(prompt, num_inference_steps=50).images[0] | |
| image1.save("sdxl_without_textual_inversion.png") | |
| ``` | |
| Then, load the [charturnerv2](https://civitai.com/models/3036/charturner-character-turnaround-helper-for-15-and-21) textual inversion embedding and run the pipeline with the same prompt again: | |
| ```python | |
| # Reset stable diffusion pipeline | |
| base.clear_requests() | |
| # Load textual inversion into stable diffusion pipeline | |
| base.load_textual_inversion("./charturnerv2.pt", "charturnerv2") | |
| # Compile the model before the first inference | |
| base.compile() | |
| image2 = base(prompt, num_inference_steps=50).images[0] | |
| image2.save("sdxl_with_textual_inversion.png") | |
| ``` | |
| ### Image-to-Image | |
| Here is an example of how you can load a PyTorch SDXL model, convert it to OpenVINO on the fly, and run inference using OpenVINO Runtime for *image-to-image*: | |
| ```python | |
| from optimum.intel import OVStableDiffusionXLImg2ImgPipeline | |
| from diffusers.utils import load_image | |
| model_id = "stabilityai/stable-diffusion-xl-refiner-1.0" | |
| pipeline = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, export=True) | |
| url = "https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/sd_xl/castle_friedrich.png" | |
| image = load_image(url).convert("RGB") | |
| prompt = "medieval castle by Caspar David Friedrich" | |
| image = pipeline(prompt, image=image).images[0] | |
| # Don't forget to save your OpenVINO model so that you can load it without exporting it with `export=True` | |
| pipeline.save_pretrained("openvino-sd-xl-refiner-1.0") | |
| ``` | |
| ### Refining the image output | |
| The image can be refined by making use of a model like [stabilityai/stable-diffusion-xl-refiner-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0). In this case, you only have to output the latents from the base model. | |
| ```python | |
| from optimum.intel import OVStableDiffusionXLImg2ImgPipeline | |
| model_id = "stabilityai/stable-diffusion-xl-refiner-1.0" | |
| refiner = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, export=True) | |
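| # `base` and `prompt` are the SDXL pipeline and prompt defined in the examples above | |
| # Ask the base pipeline for latents rather than a decoded image | |
| image = base(prompt=prompt, output_type="latent").images[0] | |
| # Add a batch dimension before passing the latents to the refiner | |
| image = refiner(prompt=prompt, image=image[None, :]).images[0] | |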
| ``` | |
| ## Latent Consistency Models | |
| | Task | Auto Class | | |
| |--------------------------------------|--------------------------------------| | |
| | `text-to-image` | `OVLatentConsistencyModelPipeline` | | |
| ### Text-to-Image | |
| Here is an example of how you can load a Latent Consistency Model (LCM) from [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) and run inference using OpenVINO: | |
| ```python | |
| from optimum.intel import OVLatentConsistencyModelPipeline | |
| model_id = "SimianLuo/LCM_Dreamshaper_v7" | |
| pipeline = OVLatentConsistencyModelPipeline.from_pretrained(model_id, export=True) | |
| prompt = "sailing ship in storm by Leonardo da Vinci" | |
| images = pipeline(prompt, num_inference_steps=4, guidance_scale=8.0).images | |
| ``` | |
| ### Notebooks | |
| https://huggingface.co/docs/optimum.intel/pr_1714/openvino/tutorials/notebooks.md | |
| # Notebooks | |
| ## Inference | |
| | Notebook | Description | Colab | Studio Lab | | |
| |:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|------:| | |
| | [How to run inference with OpenVINO](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb) | Explains how to export your model to OpenVINO and run inference with OpenVINO Runtime on various tasks | [Open in Colab](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb) | [Open in Studio Lab](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb) | | |
| ## Quantization | |
| | Notebook | Description | Colab | Studio Lab | | |
| |:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:--------------------------------------------------------------------------------------------------------------------------------------|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:|------:| | |
| | [How to quantize a question answering model with OpenVINO NNCF](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb) | Shows how to apply post-training quantization on a question answering model using [NNCF](https://github.com/openvinotoolkit/nncf) | [Open in Colab](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb) | [Open in Studio Lab](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb) | | |
| | [How to quantize Stable Diffusion model with OpenVINO NNCF](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb) | Shows how to apply post-training hybrid quantization on a Stable Diffusion model using [NNCF](https://github.com/openvinotoolkit/nncf) | [Open in Colab](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb) | [Open in Studio Lab](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb) | | |
| | | | [Open in Colab](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_optimization.ipynb) | [Open in Studio Lab](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_optimization.ipynb) | | |