	# Optimum.Intel

	## Docs

	- [Installation](https://huggingface.co/docs/optimum.intel/pr_1714/installation.md)
	- [🤗 Optimum Intel](https://huggingface.co/docs/optimum.intel/pr_1714/index.md)
	- [Supported models](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/models.md)
	- [Optimization](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/optimization.md)
	- [Models](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/reference.md)
	- [Export your model](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/export.md)
	- [Inference](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/inference.md)
	- [Generate images with Diffusion models](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/tutorials/diffusers.md)
	- [Notebooks](https://huggingface.co/docs/optimum.intel/pr_1714/openvino/tutorials/notebooks.md)

	### Installation
	https://huggingface.co/docs/optimum.intel/pr_1714/installation.md

	# Installation

	To install the latest release of 🤗 Optimum Intel with its required dependencies, run:

	```bash
	python -m pip install --upgrade-strategy eager "optimum-intel[openvino]"
	```

	The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

	We recommend creating a [virtual environment](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/#creating-a-virtual-environment) and upgrading pip with:
	```bash
	python -m pip install --upgrade pip
	```

	Optimum Intel is a fast-moving project, and you may want to install from source with the following command:

	```bash
	python -m pip install "optimum-intel[openvino]"@git+https://github.com/huggingface/optimum-intel.git
	```
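After installing, a quick check can confirm which versions ended up in the environment. The sketch below uses only the Python standard library. The helper name `installed_version` is hypothetical, and `openvino` and `nncf` are assumed to be the distribution names pulled in by the `openvino` extra.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Distribution names other than "optimum-intel" are assumptions, see the note above.
for name in ("optimum-intel", "openvino", "nncf"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

A `not installed` line for `optimum-intel` usually means the command ran in a different environment than the one being checked.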

	### 🤗 Optimum Intel
	https://huggingface.co/docs/optimum.intel/pr_1714/index.md

	# 🤗 Optimum Intel

	🤗 Optimum Intel is the interface between the 🤗 Transformers, Diffusers, Sentence Transformers and timm libraries and the different tools and libraries provided by [OpenVINO](https://docs.openvino.ai) to accelerate end-to-end pipelines on Intel architectures.

	[OpenVINO](https://docs.openvino.ai) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.

	### Supported models
	https://huggingface.co/docs/optimum.intel/pr_1714/openvino/models.md

	# Supported models

	🤗 Optimum handles the export of models to OpenVINO in the `exporters.openvino` module. It provides classes, functions, and a command line interface to perform the export easily.
	The supported architectures are listed below:

	## [Transformers](https://huggingface.co/docs/transformers/index)

	- AFMoE (aka Arcee Trinity)
	- ALBERT
	- Aquila
	- Aquila 2
	- Arcee
	- Arctic
	- Audio Spectrogram Transformer
	- Baichuan 2
	- BART
	- BEiT
	- BERT
	- BigBirdPegasus
	- BioGPT
	- BitNet
	- BlenderBot
	- Blenderbot Small
	- BLOOM
	- CLIP
	- CamemBERT
	- ChatGLM (ChatGLM2, ChatGLM3, GLM4)
	- CodeGen
	- CodeGen2
	- Cohere
	- Cohere2
	- ConvBERT
	- ConvNeXt
	- DBRX
	- Data2VecAudio
	- Data2VecText
	- Data2VecVision
	- DeBERTa
	- DeBERTa-v2
	- DeciLM
	- DeiT
	- DeepSeek
	- DeepSeek-V2
	- DeepSeek-V3
	- DistilBERT
	- ERNIE 4.5
	- ELECTRA
	- Encoder Decoder
	- ESM
	- EXAONE
	- EXAONE 4
	- Falcon
	- Falcon-Mamba
	- FlauBERT
	- GLM-4
	- GLM-Edge
	- GPT-2
	- GPT-BigCode
	- GPT-J
	- GPT-Neo
	- GPT-NeoX
	- GPT-NeoX-Japanese
	- GPT-OSS
	- Gemma
	- Gemma 2
	- Gemma 3
	- Gemma 4
	- GOT-OCR 2.0
	- Granite
	- Granite 4.0
	- GraniteMoE
	- HuBERT
	- HunYuan V1 Dense
	- I-BERT
	- Idefics3
	- InternLM
	- InternLM2
	- InternVL2
	- Jais
	- LeViT
	- LFM2
	- LFM2-MoE
	- LLaMA
	- LLaMA 4
	- LLaVa
	- LLaVa-NeXT
	- LLaVa-NeXT-Video
	- LLaVa-Qwen2 (NanoLLaVa)
	- LongT5
	- M2M-100
	- MAIRA-2
	- Mamba
	- mBART
	- MPNet
	- MPT
	- mT5
	- MarianMT
	- MiniCPM
	- MiniCPM3
	- MiniCPM-o
	- MiniCPM-V
	- Mistral
	- Mixtral
	- MobileBERT
	- MobileNet v1
	- MobileNet v2
	- MobileViT
	- Nystromformer
	- OLMo
	- OLMo 2
	- OPT
	- Orion
	- Pegasus
	- Perceiver
	- Persimmon
	- Phi
	- Phi-3
	- Phi-3.5-MoE
	- Phi-3 Vision
	- Phi-4 Multimodal
	- Pix2Struct
	- PoolFormer
	- Qwen
	- Qwen2 (Qwen1.5, Qwen2.5)
	- Qwen2MoE
	- Qwen2-VL
	- Qwen2.5-VL
	- Qwen3
	- Qwen3MoE
	- Qwen3-VL
	- Qwen3.5
	- Qwen3.5-MoE
	- Qwen3.6
	- Qwen3-Next
	- RemBERT
	- ResNet
	- RoBERTa
	- RoFormer
	- SAM
	- SEW
	- SEW-D
	- SegFormer
	- SigLIP
	- SmolVLM (SmolVLM2)
	- SpeechT5 (text-to-speech)
	- SqueezeBERT
	- StableLM
	- StarCoder2
	- Swin
	- T5
	- TrOCR
	- UniSpeech
	- UniSpeech-SAT
	- Vision Encoder Decoder
	- ViT
	- Wav2Vec2
	- Wav2Vec2-Conformer
	- WavLM
	- Whisper
	- XGLM
	- XLM
	- XLM-RoBERTa
	- XVERSE
	- Zamba2

	## [Diffusers](https://huggingface.co/docs/diffusers/index)
	- Stable Diffusion
	- Stable Diffusion XL
	- Latent Consistency
	- Stable Diffusion 3
	- Flux
	- Sana
	- SanaSprint
	- LTX

	## [Timm](https://huggingface.co/docs/timm/index)
	- PiT
	- ViT

	## [Sentence Transformers](https://github.com/UKPLab/sentence-transformers)
	- All Transformer and CLIP-based models.

	## [OpenCLIP](https://github.com/mlfoundations/open_clip)
	- All CLIP-based models

	### Optimization
	https://huggingface.co/docs/optimum.intel/pr_1714/openvino/optimization.md

	# Optimization

	🤗 Optimum Intel provides an `openvino` package that enables you to apply a variety of model quantization methods on many models hosted on the 🤗 hub using the [NNCF](https://docs.openvino.ai/2024/openvino-workflow/model-optimization.html) framework.

	Quantization is a technique that reduces the computational and memory costs of running inference by representing the weights and/or the activations with lower-precision data types such as 8-bit or 4-bit.
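To make the idea concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It is a generic textbook scheme written for illustration, not the algorithm NNCF applies, and the function names are hypothetical.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-(2**(bits-1) - 1), 2**(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integers and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.42, -1.27, 0.03, 0.88]

q8, scale8 = quantize_symmetric(weights, bits=8)
q4, scale4 = quantize_symmetric(weights, bits=4)

# Rounding error grows as the number of bits shrinks.
error8 = max(abs(w - r) for w, r in zip(weights, dequantize(q8, scale8)))
error4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, scale4)))
print(q8, error8)
print(q4, error4)
```

Each weight is stored as a small integer plus one shared floating-point scale, which is where the memory saving comes from. The price is a rounding error of at most half the scale per weight.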

	## Optimization Support Matrix

	Click on a ✅ to copy the command/code for the corresponding optimization case.

	Command copied to clipboard



	Task(OV Model Class)
	Weight-only Quantization
	Hybrid Quantization
	Full Quantization
	Mixed Quantization


	Data-free
	Data-aware


	CLI
	Python
	CLI
	Python
	CLI
	Python
	CLI
	Python
	CLI
	Python




	text-generation(OVModelForCausalLM)

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int8 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForCausalLM.from_pretrained(\'TinyLlama/TinyLlama-1.1B-Chat-v1.0\', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForCausalLM.from_pretrained(\'TinyLlama/TinyLlama-1.1B-Chat-v1.0\', quantization_config=OVWeightQuantizationConfig(bits=4, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	-

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode int8 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForCausalLM.from_pretrained(\'TinyLlama/TinyLlama-1.1B-Chat-v1.0\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForCausalLM.from_pretrained(\'TinyLlama/TinyLlama-1.1B-Chat-v1.0\', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype=\'cb4\'), OVQuantizationConfig(dtype=\'f8e4m3\', dataset=\'wikitext2\'))).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅




	image-text-to-text(OVModelForVisualCausalLM)

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino --task image-text-to-text -m OpenGVLab/InternVL2-1B --trust-remote-code --weight-format int4 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForVisualCausalLM.from_pretrained(\'OpenGVLab/InternVL2-1B\', trust_remote_code=True, quantization_config=OVWeightQuantizationConfig(bits=4)).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino --task image-text-to-text -m OpenGVLab/InternVL2-1B --trust-remote-code --weight-format int4 --dataset contextual ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForVisualCausalLM.from_pretrained(\'OpenGVLab/InternVL2-1B\', trust_remote_code=True, quantization_config=OVWeightQuantizationConfig(bits=4, dataset=\'contextual\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	–

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino --task image-text-to-text -m OpenGVLab/InternVL2-1B --trust-remote-code --quant-mode int8 --dataset contextual ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForVisualCausalLM.from_pretrained(\'OpenGVLab/InternVL2-1B\', trust_remote_code=True, quantization_config=OVQuantizationConfig(bits=8, dataset=\'contextual\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	–


	text-to-image, text-to-video(OVDiffusionPipeline)

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m dreamlike-art/dreamlike-anime-1.0 --weight-format int8 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVDiffusionPipeline.from_pretrained(\'dreamlike-art/dreamlike-anime-1.0\', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	–

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m dreamlike-art/dreamlike-anime-1.0 --weight-format int8 --dataset conceptual_captions ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVDiffusionPipeline.from_pretrained(\'dreamlike-art/dreamlike-anime-1.0\', quantization_config=OVWeightQuantizationConfig(bits=8, quant_method=\'hybrid\', dataset=\'conceptual_captions\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m dreamlike-art/dreamlike-anime-1.0 --quant-mode int8 --dataset conceptual_captions ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVDiffusionPipeline.from_pretrained(\'dreamlike-art/dreamlike-anime-1.0\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'conceptual_captions\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	–


	automatic-speech-recognition(OVModelForSpeechSeq2Seq)
	–
	–
	–
	–
	–
	–

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m openai/whisper-large-v3-turbo --quant-mode int8 --dataset librispeech --num-samples 10 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForSpeechSeq2Seq.from_pretrained(\'openai/whisper-large-v3-turbo\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'librispeech\', num_samples=10)).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	–


	feature-extraction(OVModelForFeatureExtraction)

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m microsoft/codebert-base --weight-format int8 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForFeatureExtraction.from_pretrained(\'microsoft/codebert-base\', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m microsoft/codebert-base --weight-format int4 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForFeatureExtraction.from_pretrained(\'microsoft/codebert-base\', quantization_config=OVWeightQuantizationConfig(bits=4, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	-

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m microsoft/codebert-base --quant-mode int8 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForFeatureExtraction.from_pretrained(\'microsoft/codebert-base\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m microsoft/codebert-base --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForFeatureExtraction.from_pretrained(\'microsoft/codebert-base\', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype=\'cb4\'), OVQuantizationConfig(dtype=\'f8e4m3\', dataset=\'wikitext2\'))).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅




	feature-extraction(OVSentenceTransformer)

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino --library sentence_transformers -m sentence-transformers/all-mpnet-base-v2 --weight-format int8 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVSentenceTransformer.from_pretrained(\'sentence-transformers/all-mpnet-base-v2\', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino --library sentence_transformers -m sentence-transformers/all-mpnet-base-v2 --weight-format int4 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVSentenceTransformer.from_pretrained(\'sentence-transformers/all-mpnet-base-v2\', quantization_config=OVWeightQuantizationConfig(bits=4, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	-

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino --library sentence_transformers -m sentence-transformers/all-mpnet-base-v2 --quant-mode int8 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVSentenceTransformer.from_pretrained(\'sentence-transformers/all-mpnet-base-v2\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino --library sentence_transformers -m sentence-transformers/all-mpnet-base-v2 --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVSentenceTransformer.from_pretrained(\'sentence-transformers/all-mpnet-base-v2\', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype=\'cb4\'), OVQuantizationConfig(dtype=\'f8e4m3\', dataset=\'wikitext2\'))).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅




	fill-mask(OVModelForMaskedLM)

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m FacebookAI/roberta-base --weight-format int8 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForMaskedLM.from_pretrained(\'FacebookAI/roberta-base\', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m FacebookAI/roberta-base --weight-format int4 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForMaskedLM.from_pretrained(\'FacebookAI/roberta-base\', quantization_config=OVWeightQuantizationConfig(bits=4, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	-

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m FacebookAI/roberta-base --quant-mode int8 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForMaskedLM.from_pretrained(\'FacebookAI/roberta-base\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m FacebookAI/roberta-base --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForMaskedLM.from_pretrained(\'FacebookAI/roberta-base\', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype=\'cb4\'), OVQuantizationConfig(dtype=\'f8e4m3\', dataset=\'wikitext2\'))).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅




	text2text-generation(OVModelForSeq2SeqLM)

	<button
	onclick="navigator.clipboard.writeText('optimum-cli export openvino -m google-t5/t5-small --weight-format int8 ./save_dir')">
	✅



	<button
	onclick="navigator.clipboard.writeText('OVModelForSeq2SeqLM.from_pretrained(\'google-t5/t5-small\', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained(\'save_dir\')')">
	✅



	<button
	onclick="navigator.clipboard.writeText('optimum-cli export openvino -m google-t5/t5-small --weight-format int4 --dataset wikitext2 ./save_dir')">
	✅



	<button
	onclick="navigator.clipboard.writeText('OVModelForSeq2SeqLM.from_pretrained(\'google-t5/t5-small\', quantization_config=OVWeightQuantizationConfig(bits=4, dataset=\'wikitext2\')).save_pretrained(\'save_dir\')')">
	✅


	–
	-

	<button
	onclick="navigator.clipboard.writeText('optimum-cli export openvino -m google-t5/t5-small --quant-mode int8 --dataset wikitext2 --smooth-quant-alpha -1 ./save_dir')">
	✅



	<button
	onclick="navigator.clipboard.writeText('OVModelForSeq2SeqLM.from_pretrained(\'google-t5/t5-small\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'wikitext2\', smooth_quant_alpha=-1)).save_pretrained(\'save_dir\')')">
	✅



	<button
	onclick="navigator.clipboard.writeText('optimum-cli export openvino -m google-t5/t5-small --quant-mode cb4_f8e4m3 --dataset wikitext2 --smooth-quant-alpha -1 ./save_dir')">
	✅



	<button
	onclick="navigator.clipboard.writeText('OVModelForSeq2SeqLM.from_pretrained(\'google-t5/t5-small\', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype=\'cb4\'), OVQuantizationConfig(dtype=\'f8e4m3\', dataset=\'wikitext2\', smooth_quant_alpha=-1))).save_pretrained(\'save_dir\')')">
	✅




	zero-shot-image-classification(OVModelForZeroShotImageClassification)

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m openai/clip-vit-base-patch16 --weight-format int8 ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForZeroShotImageClassification.from_pretrained(\'openai/clip-vit-base-patch16\', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m openai/clip-vit-base-patch16 --weight-format int4 --dataset conceptual_captions ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForZeroShotImageClassification.from_pretrained(\'openai/clip-vit-base-patch16\', quantization_config=OVWeightQuantizationConfig(bits=4, dataset=\'conceptual_captions\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	-

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m openai/clip-vit-base-patch16 --quant-mode int8 --dataset conceptual_captions ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForZeroShotImageClassification.from_pretrained(\'openai/clip-vit-base-patch16\', quantization_config=OVQuantizationConfig(bits=8, dataset=\'conceptual_captions\')).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m openai/clip-vit-base-patch16 --quant-mode cb4_f8e4m3 --dataset conceptual_captions ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	<button onclick="
	navigator.clipboard.writeText('OVModelForZeroShotImageClassification.from_pretrained(\'openai/clip-vit-base-patch16\', quantization_config=OVMixedQuantizationConfig(OVWeightQuantizationConfig(bits=4, dtype=\'cb4\'), OVQuantizationConfig(dtype=\'f8e4m3\', dataset=\'conceptual_captions\'))).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅




	feature-extraction(OVSamModel)
	–

	<button onclick="
	navigator.clipboard.writeText('OVSamModel.from_pretrained(\'facebook/sam-vit-base\', quantization_config=OVPipelineQuantizationConfig(quantization_configs=dict(vision_encoder=OVWeightQuantizationConfig(bits=8)))).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–

	<button onclick="
	navigator.clipboard.writeText('OVSamModel.from_pretrained(\'facebook/sam-vit-base\', quantization_config=OVPipelineQuantizationConfig(quantization_configs=dict(vision_encoder=OVWeightQuantizationConfig(bits=4, dataset=\'coco\')))).save_pretrained(\'save_dir\')');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅


	–
	-

	<button onclick="
	navigator.clipboard.writeText('optimum-cli export openvino -m facebook/sam-vit-base --quant-mode int8 --dataset coco ./save_dir');
	let m=document.getElementById('copyMsg');
	m.style.display='block';
	clearTimeout(window._copyTimeout);
	window._copyTimeout=setTimeout(()=>m.style.display='none', 2000);
	">
	✅



	✅ `OVSamModel.from_pretrained('facebook/sam-vit-base', quantization_config=OVPipelineQuantizationConfig(quantization_configs=dict(vision_encoder=OVQuantizationConfig(bits=8, dataset='coco')))).save_pretrained('save_dir')`


	–
	–


	text-to-audio(OVModelForTextToSpeechSeq2Seq)
	✅

	✅ `OVModelForTextToSpeechSeq2Seq.from_pretrained('microsoft/speecht5_tts', vocoder='microsoft/speecht5_hifigan', quantization_config=OVWeightQuantizationConfig(bits=8)).save_pretrained('save_dir')`


	–
	–
	–
	–
	–
	–
	–
	–



	## Weight-only Quantization

	Weight-only quantization can be applied to the model's Linear, Convolutional and Embedding layers, which makes it possible to load large models on memory-limited devices. For example, with 8-bit quantization the resulting model is about 4x smaller than its fp32 counterpart. For 4-bit quantization, the reduction in memory could theoretically reach 8x, but is closer to 6x in practice.
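The size figures above can be checked with a little arithmetic. The sketch below is only an estimate under stated assumptions: the 7B parameter count is hypothetical, `ratio` is treated as a fraction of weights rather than of layers, and each group of 4-bit weights is assumed to carry one 16-bit scale (the `ratio` and `group_size` parameters are described later in this section). It is not how the library measures model size.

```python
def weight_size_gb(num_params: int, bits_per_weight: float) -> float:
    """Approximate storage needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def effective_int4_bits(ratio: float, group_size: int, scale_bits: int = 16) -> float:
    """Average bits per weight when `ratio` of the weights are 4-bit and the rest 8-bit.

    Assumption: one `scale_bits`-wide scale per group of 4-bit weights.
    Zero points and the scales of 8-bit weights are ignored for simplicity.
    """
    int4_bits = 4 + scale_bits / group_size
    return ratio * int4_bits + (1 - ratio) * 8


num_params = 7_000_000_000  # hypothetical 7B-parameter model
fp32_gb = weight_size_gb(num_params, 32)
int8_gb = weight_size_gb(num_params, 8)
mixed_bits = effective_int4_bits(ratio=0.8, group_size=128)
int4_gb = weight_size_gb(num_params, mixed_bits)

print(f"fp32: {fp32_gb:.1f} GB")
print(f"int8: {int8_gb:.1f} GB ({fp32_gb / int8_gb:.1f}x smaller)")
print(f"int4: {int4_gb:.1f} GB ({fp32_gb / int4_gb:.1f}x smaller)")
```

Under these assumptions the 4-bit model lands between 6x and 7x smaller than fp32, which is in line with the "closer to 6x in practice" figure: the per-group scales and the share of weights kept in 8-bit eat into the theoretical 8x.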

	### 8-bit

	For 8-bit weight quantization, set `quantization_config` to `OVWeightQuantizationConfig(bits=8)` to load your model's weights in 8-bit:

	```python
	from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

	model_id = "helenai/gpt2-ov"
	quantization_config = OVWeightQuantizationConfig(bits=8)
	model = OVModelForCausalLM.from_pretrained(model_id, quantization_config=quantization_config)

	# Save the int8 model, which will be about 4x smaller than its fp32 counterpart
	saving_directory = "gpt2-ov-int8"  # hypothetical output path, adjust as needed
	model.save_pretrained(saving_directory)
	```

	Weights of language models inside vision-language pipelines can be quantized in a similar way:
	```python
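	from optimum.intel import OVModelForVisualCausalLM, OVWeightQuantizationConfig

	quantization_config = OVWeightQuantizationConfig(bits=8)
	model = OVModelForVisualCausalLM.from_pretrained(
	    "llava-hf/llava-v1.6-mistral-7b-hf",
	    quantization_config=quantization_config,
	)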
	```

	If `quantization_config` is not provided, the model is exported in 8-bit by default when it has more than 1 billion parameters. You can disable this behavior with `load_in_8bit=False`.

	### 4-bit

	4-bit weight quantization can be achieved in a similar way:

	```python
	from optimum.intel import OVModelForCausalLM

	model_id = "helenai/gpt2-ov"  # same checkpoint as in the 8-bit example
	model = OVModelForCausalLM.from_pretrained(model_id, quantization_config={"bits": 4})
	```

	For some models, we provide preconfigured 4-bit weight-only quantization [configurations](https://github.com/huggingface/optimum-intel/blob/main/optimum/intel/openvino/configuration.py) that offer a good trade-off between quality and speed. This default 4-bit configuration is applied automatically when you specify `quantization_config={"bits": 4}`.

	The same `quantization_config={"bits": 4}` shortcut works for vision-language pipelines:
	```python
	from optimum.intel import OVModelForVisualCausalLM

	model = OVModelForVisualCausalLM.from_pretrained(
	    "llava-hf/llava-v1.6-mistral-7b-hf",
	    quantization_config={"bits": 4},
	)
	```

	You can tune the quantization parameters to achieve a better performance-accuracy trade-off as follows:

	```python
	from optimum.intel import OVWeightQuantizationConfig

	quantization_config = OVWeightQuantizationConfig(
	    bits=4,
	    sym=False,
	    ratio=0.8,
	    quant_method="awq",
	    dataset="wikitext2",
	)
	```

	Note: `OVWeightQuantizationConfig` also accepts keyword arguments that are not listed in its constructor. Such arguments are passed directly to the `nncf.compress_weights()` call, which is useful for passing additional parameters to the quantization algorithm.

	By default, the quantization scheme is [asymmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#asymmetric-quantization). To make it [symmetric](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/training_time_compression/other_algorithms/LegacyQuantization.md#symmetric-quantization), add `sym=True`.
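To illustrate the difference between the two schemes, here is a minimal pure-Python sketch of textbook asymmetric and symmetric quantization. It is a simplified illustration, not NNCF's implementation, and the helper names are hypothetical.

```python
def quantize_asymmetric(values, bits=4):
    """Map [min, max] onto the unsigned range [0, 2**bits - 1], then dequantize."""
    levels = 2**bits - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant inputs
    zero_point = round(-lo / scale)
    q = [min(levels, max(0, round(v / scale) + zero_point)) for v in values]
    return [(x - zero_point) * scale for x in q]


def quantize_symmetric(values, bits=4):
    """Map [-max|v|, +max|v|] onto a signed range centered on zero, then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [min(qmax, max(-qmax - 1, round(v / scale))) for v in values]
    return [x * scale for x in q]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Weights that are all non-negative: the symmetric grid wastes its negative half.
weights = [i / 10 for i in range(16)]
asym_error = max_error(weights, quantize_asymmetric(weights))
sym_error = max_error(weights, quantize_symmetric(weights))
print(f"asymmetric max error: {asym_error:.4f}")
print(f"symmetric max error:  {sym_error:.4f}")
```

For a skewed distribution like this one, the asymmetric scheme uses all available levels and reconstructs the values more accurately, while the symmetric scheme avoids storing a zero point.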

	For 4-bit quantization you can also specify the following arguments in the quantization configuration:
	* The `group_size` parameter defines the group size to use for quantization. Setting it to `-1` results in per-column quantization.
	* The `ratio` parameter controls the ratio between 4-bit and 8-bit quantization. If set to 0.9, 90% of the layers will be quantized to `int4` while 10% will be quantized to `int8`.

	Smaller `group_size` and `ratio` values usually improve accuracy at the expense of model size and inference latency.
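The effect of `group_size` can be seen in a small pure-Python sketch. It assumes simple symmetric quantization with one scale per group and uses a hypothetical `quantize_grouped` helper, so it only illustrates the idea rather than the algorithm NNCF runs.

```python
def quantize_grouped(values, group_size, bits=4):
    """Quantize and dequantize `values` with one scale per group.

    `group_size=-1` uses a single scale for the whole vector.
    """
    if group_size == -1:
        group_size = len(values)
    qmax = 2 ** (bits - 1) - 1
    restored = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0
        restored.extend(round(v / scale) * scale for v in group)
    return restored


def mean_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Seven small weights followed by one outlier that dominates a shared scale.
weights = [0.01 * i for i in range(1, 8)] + [5.0]
error_single_scale = mean_error(weights, quantize_grouped(weights, group_size=-1))
error_groups_of_4 = mean_error(weights, quantize_grouped(weights, group_size=4))
print(f"one scale for all weights: {error_single_scale:.4f}")
print(f"group_size=4:              {error_groups_of_4:.4f}")
```

With a smaller group, the outlier only degrades the weights that share its group, which is why accuracy tends to improve. The cost is one extra scale per group, which increases model size.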

	Quality of 4-bit weight compressed model can further be improved by employing one of the following data-dependent methods:
	* AWQ, which stands for Activation-aware Weight Quantization, is an algorithm that tunes model weights for more accurate 4-bit compression. It slightly improves generation quality of compressed LLMs, but requires significant additional time and memory for tuning weights on a calibration dataset. Note that the model may contain no patterns matching AWQ, in which case the algorithm is skipped. There is also a data-free version of AWQ available that relies on per-column magnitudes of weights instead of activations.
	* Scale Estimation is a method that tunes quantization scales to minimize the `L2` error between the original and compressed layers. Providing a dataset is required to run scale estimation. Using this method also incurs additional time and memory overhead.
	* GPTQ optimizes compressed weights in a layer-wise fashion to minimize the difference between activations of a compressed and original layer.
	* LoRA Correction mitigates quantization noise introduced during weight compression by leveraging low-rank adaptation.

	Data-aware algorithms can be applied together or separately. For that, provide corresponding arguments to the 4-bit `OVWeightQuantizationConfig` together with a dataset. For example:
	```python
	from optimum.intel import OVWeightQuantizationConfig

	quantization_config = OVWeightQuantizationConfig(
	    bits=4,
	    sym=False,
	    ratio=0.8,
	    quant_method="awq",
	    scale_estimation=True,
	    gptq=True,
	    dataset="wikitext2",
	)
	```

	Note: GPTQ and LoRA Correction algorithms can't be applied simultaneously.

	## Full quantization

	When applying post-training full quantization, both the weights and the activations are quantized.
	Quantizing the activations requires an additional calibration step, which consists of feeding a `calibration_dataset` to the network in order to estimate the activation quantization parameters.
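As a rough illustration of what calibration estimates, the sketch below tracks the range of activation values seen across a few batches and derives an 8-bit scale and zero point from it. This is a simplified min/max scheme with hypothetical helper names and made-up data; the statistics actually collected by NNCF may be more elaborate.

```python
def calibrate_range(batches):
    """Return the smallest and largest activation value seen over all batches."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    return lo, hi


def activation_qparams(lo, hi, bits=8):
    """Derive an asymmetric scale and zero point covering the range [lo, hi]."""
    levels = 2**bits - 1
    scale = (hi - lo) / levels or 1.0
    zero_point = round(-lo / scale)
    return scale, zero_point


# Made-up activations recorded while running two calibration batches.
recorded_activations = [[-1.0, 0.5], [0.2, 3.0]]
lo, hi = calibrate_range(recorded_activations)
scale, zero_point = activation_qparams(lo, hi)
print(f"range=[{lo}, {hi}] scale={scale:.5f} zero_point={zero_point}")
```

Weights are known ahead of time, so their range can be read directly from the model. Activations depend on the input, which is why representative calibration samples are needed.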

	Here is how to apply full quantization on a fine-tuned DistilBERT given your own `calibration_dataset`:

	```python
	from transformers import AutoTokenizer
	from optimum.intel import OVQuantizer, OVModelForSequenceClassification, OVConfig, OVQuantizationConfig

	model_id = "distilbert-base-uncased-finetuned-sst-2-english"
	model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	# The directory where the quantized model will be saved
	save_dir = "ptq_model"

	quantizer = OVQuantizer.from_pretrained(model)

	# Apply full quantization and export the resulting quantized model to OpenVINO IR format
	ov_config = OVConfig(quantization_config=OVQuantizationConfig())
	quantizer.quantize(ov_config=ov_config, calibration_dataset=calibration_dataset, save_directory=save_dir)
	# Save the tokenizer
	tokenizer.save_pretrained(save_dir)
	```

	The calibration dataset can also be created easily using your `OVQuantizer`:

	```python
	from functools import partial

	def preprocess_function(examples, tokenizer):
	    return tokenizer(examples["sentence"], padding="max_length", max_length=128, truncation=True)

	# Create the calibration dataset used to perform full quantization
	calibration_dataset = quantizer.get_calibration_dataset(
	    "glue",
	    dataset_config_name="sst2",
	    preprocess_function=partial(preprocess_function, tokenizer=tokenizer),
	    num_samples=300,
	    dataset_split="train",
	)
	```

	The `quantize()` method applies post-training quantization and exports the resulting quantized model to the OpenVINO Intermediate Representation (IR). The resulting graph is represented with two files: an XML file describing the network topology and a binary file describing the weights. The resulting model can be run on any target Intel device.
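A quick way to confirm that an export produced both IR files is to look for matching `.xml` and `.bin` pairs in the save directory. The helper below is a hypothetical sketch; the file names `openvino_model.xml` and `openvino_model.bin` are assumed here for the demonstration, and a temporary directory stands in for a real `save_dir`.

```python
import tempfile
from pathlib import Path


def find_ir_pairs(directory):
    """Return (xml_path, bin_path) for every IR topology file that has a weights file."""
    pairs = []
    for xml_path in sorted(Path(directory).glob("*.xml")):
        bin_path = xml_path.with_suffix(".bin")
        if bin_path.exists():
            pairs.append((xml_path, bin_path))
    return pairs


# Demonstration on a temporary directory standing in for a real export.
with tempfile.TemporaryDirectory() as save_dir:
    (Path(save_dir) / "openvino_model.xml").write_text("<net/>")
    (Path(save_dir) / "openvino_model.bin").write_bytes(b"")
    pairs = [(x.name, b.name) for x, b in find_ir_pairs(save_dir)]

print(pairs)
```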

	### Speech-to-text Models Quantization

	The speech-to-text Whisper model can be quantized without the need for preparing a custom calibration dataset. Please see example below.

	```python
	from optimum.intel import OVModelForSpeechSeq2Seq, OVQuantizationConfig

	model_id = "openai/whisper-tiny"
	ov_model = OVModelForSpeechSeq2Seq.from_pretrained(
	    model_id,
	    quantization_config=OVQuantizationConfig(
	        num_samples=10,
	        dataset="librispeech",
	        processor=model_id,
	        smooth_quant_alpha=0.95,
	    ),
	)
	```

	With this, encoder and decoder models of the Whisper pipeline will be fully quantized, including activations.

	## Hybrid quantization

	Traditional optimization methods like post-training 8-bit quantization do not work well for Stable Diffusion (SD) models and can lead to poor generation results. On the other hand, weight compression does not improve performance significantly when applied to Stable Diffusion models, as the size of activations is comparable to weights.
	The U-Net component takes up most of the overall execution time of the pipeline. Thus, optimizing just this one component can bring substantial benefits in terms of inference speed while keeping acceptable accuracy without fine-tuning. Quantizing the rest of the diffusion pipeline does not significantly improve inference performance but could potentially lead to substantial accuracy degradation.
	Therefore, the proposal is to apply quantization in hybrid mode for the U-Net model and weight-only quantization for the rest of the pipeline components:
	* U-Net: quantization applied on both the weights and activations
	* The text encoder, VAE encoder / decoder: quantization applied on the weights

	The hybrid mode involves the quantization of weights in MatMul and Embedding layers, and activations of other layers, facilitating accuracy preservation post-optimization while reducing the model size.

	The `quantization_config` is utilized to define optimization parameters for optimizing the SD pipeline. To enable hybrid quantization, specify the quantization dataset in the `quantization_config`. If the dataset is not defined, weight-only quantization will be applied on all components.
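The rule described above can be summarized as a small decision function. This is only a sketch of the documented behavior, with hypothetical component names and a hypothetical `plan_sd_quantization` helper; it does not call the library.

```python
def plan_sd_quantization(components, dataset=None):
    """Return the quantization mode each pipeline component would receive."""
    plan = {}
    for name in components:
        if name == "unet" and dataset is not None:
            # A dataset enables hybrid quantization, applied to the U-Net only.
            plan[name] = "hybrid"
        else:
            plan[name] = "weight-only"
    return plan


components = ["unet", "text_encoder", "vae_encoder", "vae_decoder"]  # illustrative names
with_dataset = plan_sd_quantization(components, dataset="conceptual_captions")
without_dataset = plan_sd_quantization(components)
print(with_dataset)
print(without_dataset)
```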

	```python
	from optimum.intel import OVStableDiffusionPipeline, OVWeightQuantizationConfig

	# `model_id` is assumed to point to a Stable Diffusion checkpoint (Hub id or local path)
	model = OVStableDiffusionPipeline.from_pretrained(
	    model_id,
	    export=True,
	    quantization_config=OVWeightQuantizationConfig(bits=8, dataset="conceptual_captions"),
	)
	```

	For more details, please refer to the corresponding NNCF [documentation](https://github.com/openvinotoolkit/nncf/blob/develop/docs/usage/post_training_compression/weights_compression/Usage.md).

	## Mixed Quantization

	Mixed quantization is a technique that combines weight-only quantization with full quantization. During mixed quantization we separately quantize:
	1. weights of weighted layers to one precision, and
	2. activations (and possibly, weights, if some were skipped at the first step) of other supported layers to another precision.

	By default, weights of all weighted layers are quantized in the first step. In the second step activations of weighted and non-weighted layers are quantized. If some layers are instructed to be ignored in the first step with `weight_quantization_config.ignored_scope` parameter, both weights and activations of these layers are quantized to the precision given in the `full_quantization_config`.
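The two-step assignment described above can be sketched as follows. The layer names and the `plan_mixed_quantization` helper are hypothetical; the function only mirrors the rules stated in this section and does not reflect how the library traverses a model.

```python
def plan_mixed_quantization(layers, ignored_in_first_step=()):
    """Decide which config governs the weights and activations of each layer.

    `layers` maps a layer name to True if the layer has weights.
    """
    plan = {}
    for name, has_weights in layers.items():
        if not has_weights:
            weights = None  # nothing to compress in a non-weighted layer
        elif name in ignored_in_first_step:
            weights = "full_quantization_config"  # skipped in step 1, handled in step 2
        else:
            weights = "weight_quantization_config"
        plan[name] = {"weights": weights, "activations": "full_quantization_config"}
    return plan


layers = {"matmul_0": True, "matmul_1": True, "softmax_0": False}  # illustrative names
plan = plan_mixed_quantization(layers, ignored_in_first_step={"matmul_1"})
for name, decision in plan.items():
    print(name, decision)
```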

	When running this kind of optimization through Python API, `OVMixedQuantizationConfig` should be used. In such case the precision for the first step should be provided with `weight_quantization_config` argument and the precision for the second step with `full_quantization_config` argument. For example:

	```python
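	from optimum.intel import OVModelForCausalLM
	from optimum.intel import OVMixedQuantizationConfig, OVQuantizationConfig, OVWeightQuantizationConfig

	model = OVModelForCausalLM.from_pretrained(
	    'TinyLlama/TinyLlama-1.1B-Chat-v1.0',
	    quantization_config=OVMixedQuantizationConfig(
	        weight_quantization_config=OVWeightQuantizationConfig(bits=4, dtype='cb4'),
	        full_quantization_config=OVQuantizationConfig(dtype='f8e4m3', dataset='wikitext2'),
	    ),
	)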
	```

	To apply mixed quantization through CLI, the `--quant-mode` argument should be used. For example:

	```bash
	optimum-cli export openvino -m TinyLlama/TinyLlama-1.1B-Chat-v1.0 --quant-mode cb4_f8e4m3 --dataset wikitext2 ./save_dir
	```

	Don't forget to provide a dataset since it is required for the calibration procedure during full quantization.

	## Pipeline Quantization

	There are multimodal pipelines that consist of multiple components, such as Stable Diffusion or Visual Language models. In these cases, there may be a need to apply different quantization methods to different components of the pipeline. For example, you may want to apply int4 data-aware weight-only quantization to a language model in visual-language pipeline, while applying int8 weight-only quantization to other components. In this case you can use the `OVPipelineQuantizationConfig` class to specify the quantization configuration for each component of the pipeline.

	For example, the code below quantizes weights and activations of a language model inside InternVL2-1B, compresses weights of a text embedding model and skips any quantization for vision embedding model.

	```python
	from optimum.intel import OVModelForVisualCausalLM
	from optimum.intel import OVPipelineQuantizationConfig, OVQuantizationConfig, OVWeightQuantizationConfig

	model_id = "OpenGVLab/InternVL2-1B"
	model = OVModelForVisualCausalLM.from_pretrained(
	    model_id,
	    export=True,
	    trust_remote_code=True,
	    quantization_config=OVPipelineQuantizationConfig(
	        quantization_configs={
	            "lm_model": OVQuantizationConfig(bits=8),
	            "text_embeddings_model": OVWeightQuantizationConfig(bits=8),
	        },
	        dataset="contextual",
	    ),
	)
	```

	### Models
	https://huggingface.co/docs/optimum.intel/pr_1714/openvino/reference.md

	# Models

	## Generic model classes[[optimum.intel.openvino.modeling_base.OVBaseModel]]

	#### optimum.intel.openvino.modeling_base.OVBaseModel[[optimum.intel.openvino.modeling_base.OVBaseModel]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L189)

	Base OVModel class.

	#### from_pretrained[[optimum.intel.openvino.modeling_base.OVBaseModel.from_pretrained]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L560)

	`from_pretrained(model_id: typing.Union[str, pathlib.Path], export: bool = False, force_download: bool = False, token: typing.Union[bool, str, NoneType] = None, cache_dir: str = '/home/runner/.cache/huggingface/hub', subfolder: str = '', config: typing.Optional[transformers.configuration_utils.PreTrainedConfig] = None, local_files_only: bool = False, trust_remote_code: bool = False, revision: typing.Optional[str] = None, **kwargs)`

	Instantiate a pretrained model from a pre-trained model configuration.

	Parameters:

	model_id (`Union[str, Path]`) : Can be either: - A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids can be located at the root-level, like `bert-base-uncased`, or namespaced under a user or organization name, like `dbmdz/bert-base-german-cased`. - A path to a directory containing a model saved using `~OptimizedModel.save_pretrained`, e.g., `./my_model_directory/`.

	export (`bool`, defaults to `False`) : Defines whether the provided `model_id` needs to be exported to the targeted format.

	force_download (`bool`, defaults to `False`) : Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.

	token (`Optional[Union[bool,str]]`, defaults to `None`) : The token to use as HTTP bearer authorization for remote files. If `True`, will use the token generated when running `huggingface-cli login` (stored in `huggingface_hub.constants.HF_TOKEN_PATH`).

	cache_dir (`Optional[str]`, defaults to `None`) : Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.

	subfolder (`str`, defaults to `""`) : In case the relevant files are located inside a subfolder of the model repo either locally or on huggingface.co, you can specify the folder name here.

	config (`Optional[transformers.PretrainedConfig]`, defaults to `None`) : The model configuration.

	local_files_only (`Optional[bool]`, defaults to `False`) : Whether or not to only look at local files (i.e., do not try to download the model).

	trust_remote_code (`bool`, defaults to `False`) : Whether or not to allow for custom code defined on the Hub in their own modeling. This option should only be set to `True` for repositories you trust and in which you have read the code, as it will execute code present on the Hub on your local machine.

	revision (`Optional[str]`, defaults to `None`) : The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so `revision` can be any identifier allowed by git.
	#### reshape[[optimum.intel.openvino.modeling_base.OVBaseModel.reshape]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L936)

	Propagates the given input shapes on the model's layers, fixing the inputs shapes of the model.

	Parameters:

	batch_size (`int`) : The batch size.

	sequence_length (`int`) : The sequence length or number of channels.

	height (`int`, optional) : The image height.

	width (`int`, optional) : The image width.

	## Natural Language Processing

	The following classes are available for the following natural language processing tasks.

	### OVModelForCausalLM[[optimum.intel.OVModelForCausalLM]]

	#### optimum.intel.OVModelForCausalLM[[optimum.intel.OVModelForCausalLM]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_decoder.py#L464)

	OpenVINO Model with a causal language modeling head on top (linear layer with weights tied to the input
	embeddings).

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	#### forward[[optimum.intel.OVModelForCausalLM.forward]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_decoder.py#L577)

	`forward(input_ids: LongTensor, attention_mask: typing.Optional[torch.LongTensor] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.FloatTensor]]] = None, position_ids: typing.Optional[torch.LongTensor] = None, token_type_ids: typing.Optional[torch.LongTensor] = None, **kwargs)`

	Parameters:

	model (`openvino.Model`) : is the main class used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Should be set to `False` for the model to not be dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.
	#### generate[[optimum.intel.OVModelForCausalLM.generate]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_decoder.py#L757)

	### OVModelForMaskedLM[[optimum.intel.OVModelForMaskedLM]]

	#### optimum.intel.OVModelForMaskedLM[[optimum.intel.OVModelForMaskedLM]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L448)

	OpenVINO Model with a MaskedLMOutput for masked language modeling tasks.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	#### forward[[optimum.intel.OVModelForMaskedLM.forward]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L455)

	`forward(input_ids: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray], token_type_ids: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)`

	- **input_ids** (`torch.Tensor`) -- Indices of input sequence tokens in the vocabulary. Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer). [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids)
	- **attention_mask** (`torch.Tensor`, optional) -- Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
	  - 1 for tokens that are not masked,
	  - 0 for tokens that are masked.

	  [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)
	- **token_type_ids** (`torch.Tensor`, optional) -- Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
	  - 0 for tokens that belong to sentence A,
	  - 1 for tokens that belong to sentence B.

	  [What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids)

	The [OVModelForMaskedLM](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForMaskedLM) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.
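To make the `attention_mask` convention concrete, here is a minimal pure-Python sketch that pads a batch of token id sequences and builds the matching mask. The `pad_batch` helper and the token ids are hypothetical; in practice the tokenizer returns both tensors for you.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding that attention should skip.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch([[5, 6, 7], [8]])
print(input_ids)
print(attention_mask)
```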

	Example of masked language modeling using `transformers.pipelines`:
	```python
	>>> from transformers import AutoTokenizer, pipeline
	>>> from optimum.intel import OVModelForMaskedLM

	>>> tokenizer = AutoTokenizer.from_pretrained("roberta-base")
	>>> model = OVModelForMaskedLM.from_pretrained("roberta-base", export=True)
	>>> mask_token = tokenizer.mask_token
	>>> pipe = pipeline("fill-mask", model=model, tokenizer=tokenizer)
	>>> outputs = pipe("The goal of life is" + mask_token)
	```

	Parameters:

	model (`openvino.Model`) : is the main class used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Should be set to `False` for the model to not be dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.

	### OVModelForSeq2SeqLM[[optimum.intel.OVModelForSeq2SeqLM]]

	#### optimum.intel.OVModelForSeq2SeqLM[[optimum.intel.OVModelForSeq2SeqLM]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L329)

	Sequence-to-sequence model with a language modeling head for OpenVINO inference.

	#### forward[[optimum.intel.OVModelForSeq2SeqLM.forward]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L644)

	`forward(input_ids: LongTensor = None, attention_mask: typing.Optional[torch.FloatTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.LongTensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, cache_position: typing.Optional[torch.LongTensor] = None, labels: typing.Optional[torch.LongTensor] = None, **kwargs)`

	- **input_ids** (`torch.LongTensor`) -- Indices of input sequence tokens in the vocabulary of shape `(batch_size, encoder_sequence_length)`.
	- **attention_mask** (`torch.LongTensor`) -- Mask to avoid performing attention on padding token indices, of shape `(batch_size, encoder_sequence_length)`. Mask values selected in `[0, 1]`.
	- **decoder_input_ids** (`torch.LongTensor`) -- Indices of decoder input sequence tokens in the vocabulary of shape `(batch_size, decoder_sequence_length)`.
	- **encoder_outputs** (`torch.FloatTensor`) -- The encoder `last_hidden_state` of shape `(batch_size, encoder_sequence_length, hidden_size)`.
	- **past_key_values** (`tuple(tuple(torch.FloatTensor))`, optional) -- Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding. The tuple is of length `config.n_layers` with each tuple having 2 tensors of shape `(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape `(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

	The [OVModelForSeq2SeqLM](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForSeq2SeqLM) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.
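The layout of `past_key_values` described above can be spelled out with a short sketch that only computes the expected shapes. The `past_key_values_shapes` helper and the sizes passed to it are hypothetical; the ordering of the four tensors per layer (decoder-length tensors first, then encoder-length tensors) follows the parameter description and is otherwise an assumption.

```python
def past_key_values_shapes(n_layers, batch_size, num_heads,
                           decoder_len, encoder_len, embed_size_per_head):
    """Return one 4-tuple of tensor shapes per decoder layer."""
    decoder_shape = (batch_size, num_heads, decoder_len, embed_size_per_head)
    encoder_shape = (batch_size, num_heads, encoder_len, embed_size_per_head)
    # Two tensors sized by the decoder sequence, two sized by the encoder sequence.
    per_layer = (decoder_shape, decoder_shape, encoder_shape, encoder_shape)
    return tuple(per_layer for _ in range(n_layers))


shapes = past_key_values_shapes(
    n_layers=6, batch_size=1, num_heads=8,
    decoder_len=3, encoder_len=20, embed_size_per_head=64,
)
print(len(shapes), shapes[0])
```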

	Example of text generation:
	```python
	>>> from transformers import AutoTokenizer
	>>> from optimum.intel import OVModelForSeq2SeqLM

	>>> tokenizer = AutoTokenizer.from_pretrained("echarlaix/t5-small-openvino")
	>>> model = OVModelForSeq2SeqLM.from_pretrained("echarlaix/t5-small-openvino")
	>>> text = "He never went out without a book under his arm, and he often came back with two."
	>>> inputs = tokenizer(text, return_tensors="pt")
	>>> gen_tokens = model.generate(**inputs)
	>>> outputs = tokenizer.batch_decode(gen_tokens)
	```

	Example using `transformers.pipeline`:
	```python
	>>> from transformers import AutoTokenizer, pipeline
	>>> from optimum.intel import OVModelForSeq2SeqLM

	>>> tokenizer = AutoTokenizer.from_pretrained("echarlaix/t5-small-openvino")
	>>> model = OVModelForSeq2SeqLM.from_pretrained("echarlaix/t5-small-openvino")
	>>> pipe = pipeline("translation_en_to_fr", model=model, tokenizer=tokenizer)
	>>> text = "He never went out without a book under his arm, and he often came back with two."
	>>> outputs = pipe(text)
	```

	Parameters:

	encoder (`openvino.Model`) : The OpenVINO Runtime model associated to the encoder.

	decoder (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder.

	decoder_with_past (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder with past key values.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is an instance of the configuration associated to the model. Initializing with a config file does not load the weights associated with the model, only the configuration.

	### OVModelForQuestionAnswering[[optimum.intel.OVModelForQuestionAnswering]]

	#### optimum.intel.OVModelForQuestionAnswering[[optimum.intel.OVModelForQuestionAnswering]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L219)

	OpenVINO Model with a QuestionAnsweringModelOutput for extractive question-answering tasks.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	#### forward[[optimum.intel.OVModelForQuestionAnswering.forward]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L226)

	`forward(input_ids: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray], token_type_ids: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)`

	- **input_ids** (`torch.Tensor`) -- Indices of input sequence tokens in the vocabulary. Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer). [What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids)
	- **attention_mask** (`torch.Tensor`, optional) -- Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
	  - 1 for tokens that are not masked,
	  - 0 for tokens that are masked.

	  [What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)
	- **token_type_ids** (`torch.Tensor`, optional) -- Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
	  - 0 for tokens that belong to sentence A,
	  - 1 for tokens that belong to sentence B.

	  [What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids)

	The [OVModelForQuestionAnswering](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForQuestionAnswering) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of question answering using `transformers.pipeline`:
	```python
	>>> from transformers import AutoTokenizer, pipeline
	>>> from optimum.intel import OVModelForQuestionAnswering

	>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad")
	>>> model = OVModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad", export=True)
	>>> pipe = pipeline("question-answering", model=model, tokenizer=tokenizer)
	>>> question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
	>>> outputs = pipe(question, text)
	```
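
A `QuestionAnsweringModelOutput` carries per-token `start_logits` and `end_logits`, which the pipeline turns into an answer span. As a rough illustration of that post-processing step (not the pipeline's actual implementation), the sketch below picks the highest-scoring valid span from two plain Python lists. The function name `best_span`, the `max_answer_len` limit, and the logit values are hypothetical.

```python
def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end, score) of the best span with start <= end."""
    best = (0, 0, float("-inf"))
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


# Hypothetical logits for a 6-token input.
start_logits = [0.1, 0.2, 3.5, 0.3, 0.1, 0.0]
end_logits = [0.0, 0.1, 0.4, 0.2, 4.0, 0.1]
start, end, score = best_span(start_logits, end_logits)
print(start, end)  # token indices delimiting the answer span
```

The answer text would then be recovered by decoding the tokens between `start` and `end`, inclusive.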

	Parameters:

	model (`openvino.Model`) : The `openvino.Model` instance used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to prevent the model from being dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.
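
As a rough sketch of how these loading options fit together, the helper below disables compilation at load time, reshapes the model to static shapes, selects the device, and only then compiles. The helper name `load_static_qa_model` and the shape values are hypothetical. The `reshape`, `to`, and `compile` methods are assumed to behave as described in the Optimum Intel inference guide, and `PERFORMANCE_HINT` is used only as an example of a compilation option.

```python
def load_static_qa_model(model_id, batch_size=1, sequence_length=128, device="CPU"):
    """Load an OpenVINO QA model with static input shapes (illustrative helper)."""
    # Imported lazily so that defining the helper does not require optimum-intel.
    from optimum.intel import OVModelForQuestionAnswering

    model = OVModelForQuestionAnswering.from_pretrained(
        model_id,
        export=True,  # convert the checkpoint to the OpenVINO format on the fly
        compile=False,  # skip compilation until the model is fully configured
        ov_config={"PERFORMANCE_HINT": "LATENCY"},  # example compilation option
    )
    model.reshape(batch_size, sequence_length)  # fix the input shapes
    model.to(device)  # choose the inference device
    model.compile()  # compile once, after all changes
    return model
```

With static shapes, inputs would need to be padded or truncated to exactly `sequence_length` tokens before being passed to the model.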

	### OVModelForSequenceClassification[[optimum.intel.OVModelForSequenceClassification]]

	#### optimum.intel.OVModelForSequenceClassification[[optimum.intel.OVModelForSequenceClassification]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L154)

	OpenVINO Model with a SequenceClassifierOutput for sequence classification tasks.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	`forward(input_ids: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray], token_type_ids: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)`

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L161)

	- input_ids (`torch.Tensor`) --
	Indices of input sequence tokens in the vocabulary.
	Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer).
	[What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids)
	- attention_mask (`torch.Tensor`, optional) --
	Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
	- 1 for tokens that are not masked,
	- 0 for tokens that are masked.
	[What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)
	- token_type_ids (`torch.Tensor`, optional) --
	Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
	- 0 for tokens that belong to sentence A,
	- 1 for tokens that belong to sentence B.
	[What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids)

	The [OVModelForSequenceClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForSequenceClassification) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of sequence classification using `transformers.pipeline`:
	```python
	>>> from transformers import AutoTokenizer, pipeline
	>>> from optimum.intel import OVModelForSequenceClassification

	>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english")
	>>> model = OVModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", export=True)
	>>> pipe = pipeline("text-classification", model=model, tokenizer=tokenizer)
	>>> outputs = pipe("Hello, my dog is cute")
	```
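
A `SequenceClassifierOutput` exposes raw `logits`, which the text-classification pipeline converts into label scores. The following sketch uses plain Python lists with hypothetical logit values and label names to show a numerically stable softmax of the kind typically applied for single-label classification. It is an illustration, not the pipeline's own code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)  # subtract the maximum to avoid overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical label names and logits for a two-class sentiment model.
labels = ["NEGATIVE", "POSITIVE"]
probs = softmax([-2.0, 3.0])
prediction = labels[probs.index(max(probs))]
print(prediction, round(max(probs), 3))
```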

	Parameters:

	model (`openvino.Model`) : The `openvino.Model` instance used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to prevent the model from being dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.

	### OVModelForTokenClassification[[optimum.intel.OVModelForTokenClassification]]

	#### optimum.intel.OVModelForTokenClassification[[optimum.intel.OVModelForTokenClassification]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L288)

	OpenVINO Model with a TokenClassifierOutput for token classification tasks.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	`forward(input_ids: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray], token_type_ids: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)`

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L295)

	- input_ids (`torch.Tensor`) --
	Indices of input sequence tokens in the vocabulary.
	Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer).
	[What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids)
	- attention_mask (`torch.Tensor`, optional) --
	Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
	- 1 for tokens that are not masked,
	- 0 for tokens that are masked.
	[What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)
	- token_type_ids (`torch.Tensor`, optional) --
	Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
	- 0 for tokens that belong to sentence A,
	- 1 for tokens that belong to sentence B.
	[What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids)

	The [OVModelForTokenClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForTokenClassification) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of token classification using `transformers.pipeline`:
	```python
	>>> from transformers import AutoTokenizer, pipeline
	>>> from optimum.intel import OVModelForTokenClassification

	>>> tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
	>>> model = OVModelForTokenClassification.from_pretrained("dslim/bert-base-NER", export=True)
	>>> pipe = pipeline("token-classification", model=model, tokenizer=tokenizer)
	>>> outputs = pipe("My Name is Peter and I live in New York.")
	```
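
The token-classification pipeline returns one prediction per token. For BIO-tagged label sets, such as those commonly used by NER checkpoints, consecutive tokens are usually merged into entities. The sketch below illustrates that grouping on hypothetical tokens and tags. The function name `group_entities` is an assumption for this example, and the logic is not the pipeline's own aggregation code.

```python
def group_entities(tokens, tags):
    """Merge BIO-tagged tokens into (entity_type, text) pairs."""
    entities, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == current_type
        if not continues and current_tokens:
            # Close the entity that was being built.
            entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if prefix == "B" or (prefix == "I" and not continues):
            # Start a new entity (a stray "I-" tag is treated like "B-").
            current_type, current_tokens = label, [token]
        elif continues:
            current_tokens.append(token)
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["My", "name", "is", "Peter", "and", "I", "live", "in", "New", "York"]
tags = ["O", "O", "O", "B-PER", "O", "O", "O", "O", "B-LOC", "I-LOC"]
entities = group_entities(tokens, tags)
print(entities)
```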

	Parameters:

	model (`openvino.Model`) : The `openvino.Model` instance used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to prevent the model from being dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.

	## Audio

	The following classes are available for audio tasks.

	### OVModelForAudioClassification[[optimum.intel.OVModelForAudioClassification]]

	#### optimum.intel.OVModelForAudioClassification[[optimum.intel.OVModelForAudioClassification]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L651)

	OpenVINO Model with a SequenceClassifierOutput for audio classification tasks.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	`forward(input_values: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)`

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L658)

	- input_values (`torch.Tensor`) --
	Float values of the input raw speech waveform.
	Input values can be obtained from an audio file loaded into an array using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor).
	- attention_mask (`torch.Tensor`, optional) --
	Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
	- 1 for tokens that are not masked,
	- 0 for tokens that are masked.
	[What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)

	The [OVModelForAudioClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForAudioClassification) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of audio classification using `transformers.pipeline`:
	```python
	>>> from datasets import load_dataset
	>>> from transformers import AutoFeatureExtractor, pipeline
	>>> from optimum.intel import OVModelForAudioClassification

	>>> preprocessor = AutoFeatureExtractor.from_pretrained("superb/hubert-base-superb-er")
	>>> model = OVModelForAudioClassification.from_pretrained("superb/hubert-base-superb-er", export=True)
	>>> pipe = pipeline("audio-classification", model=model, feature_extractor=preprocessor)
	>>> dataset = load_dataset("superb", "ks", split="test")
	>>> audio_file = dataset[3]["audio"]["array"]
	>>> outputs = pipe(audio_file)
	```
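
When several waveforms of different lengths are batched, the shorter ones are padded and an `attention_mask` marks which positions hold real samples. The feature extractor normally takes care of this. The sketch below reproduces the idea on plain Python lists, with a hypothetical helper name `pad_batch` and made-up sample values.

```python
def pad_batch(waveforms, pad_value=0.0):
    """Right-pad waveforms to a common length and build the matching mask."""
    max_len = max(len(w) for w in waveforms)
    input_values, attention_mask = [], []
    for w in waveforms:
        n_pad = max_len - len(w)
        input_values.append(list(w) + [pad_value] * n_pad)
        attention_mask.append([1] * len(w) + [0] * n_pad)  # 1 = real, 0 = padding
    return input_values, attention_mask


values, mask = pad_batch([[0.1, -0.2, 0.3], [0.5]])
print(values)
print(mask)
```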

	Parameters:

	model (`openvino.Model`) : The `openvino.Model` instance used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to prevent the model from being dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.

	### OVModelForAudioFrameClassification[[optimum.intel.OVModelForAudioFrameClassification]]

	#### optimum.intel.OVModelForAudioFrameClassification[[optimum.intel.OVModelForAudioFrameClassification]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L879)

	OpenVINO Model with a frame classification head on top for tasks like Speaker Diarization.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	Audio Frame Classification model for OpenVINO.

	`forward(input_values: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, **kwargs)`

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L887)

	- input_values (`torch.Tensor` of shape `(batch_size, sequence_length)`) --
	Float values of the input raw speech waveform.
	Input values can be obtained from an audio file loaded into an array using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor).

	The [OVModelForAudioFrameClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForAudioFrameClassification) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of audio frame classification:

	```python
	>>> from transformers import AutoFeatureExtractor
	>>> from optimum.intel import OVModelForAudioFrameClassification
	>>> from datasets import load_dataset
	>>> import torch

	>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
	>>> dataset = dataset.sort("id")
	>>> sampling_rate = dataset.features["audio"].sampling_rate

	>>> feature_extractor = AutoFeatureExtractor.from_pretrained("anton-l/wav2vec2-base-superb-sd")
	>>> model = OVModelForAudioFrameClassification.from_pretrained("anton-l/wav2vec2-base-superb-sd", export=True)

	>>> inputs = feature_extractor(dataset[0]["audio"]["array"], return_tensors="pt", sampling_rate=sampling_rate)
	>>> logits = model(**inputs).logits

	>>> probabilities = torch.sigmoid(torch.as_tensor(logits)[0])
	>>> labels = (probabilities > 0.5).long()
	>>> labels[0].tolist()
	```
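
The last lines of the example apply a sigmoid to each frame's logits and threshold the result at 0.5, so a single frame can be assigned to several speakers at once. The same step is written out below in plain Python on a small, hypothetical logits matrix. The helper name `frame_labels` is an assumption for this sketch.

```python
import math


def frame_labels(logits, threshold=0.5):
    """Map per-frame logits to 0/1 labels: one row per frame, one column per speaker."""
    labels = []
    for frame in logits:
        probs = [1.0 / (1.0 + math.exp(-x)) for x in frame]  # element-wise sigmoid
        labels.append([int(p > threshold) for p in probs])
    return labels


# Hypothetical logits: 3 frames, 2 speakers.
result = frame_labels([[2.0, -1.0], [0.3, 0.4], [-3.0, -0.1]])
print(result)
```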

	Parameters:

	model (`openvino.Model`) : The `openvino.Model` instance used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to prevent the model from being dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.

	### OVModelForCTC[[optimum.intel.OVModelForCTC]]

	#### optimum.intel.OVModelForCTC[[optimum.intel.OVModelForCTC]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L723)

	OpenVINO Model with a language modeling head on top for Connectionist Temporal Classification (CTC).

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	CTC model for OpenVINO.

	`forward(input_values: typing.Optional[torch.Tensor] = None, attention_mask: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)`

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L731)

	- input_values (`torch.Tensor` of shape `(batch_size, sequence_length)`) --
	Float values of the input raw speech waveform.
	Input values can be obtained from an audio file loaded into an array using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor).

	The [OVModelForCTC](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForCTC) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of CTC:

	```python
	>>> import numpy as np
	>>> from transformers import AutoProcessor
	>>> from optimum.intel import OVModelForCTC
	>>> from datasets import load_dataset

	>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
	>>> dataset = dataset.sort("id")
	>>> sampling_rate = dataset.features["audio"].sampling_rate

	>>> # AutoProcessor bundles the feature extractor with the tokenizer needed by batch_decode
	>>> processor = AutoProcessor.from_pretrained("facebook/hubert-large-ls960-ft")
	>>> model = OVModelForCTC.from_pretrained("facebook/hubert-large-ls960-ft", export=True)

	>>> # audio file is decoded on the fly
	>>> inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="np")
	>>> logits = model(**inputs).logits
	>>> predicted_ids = np.argmax(logits, axis=-1)

	>>> transcription = processor.batch_decode(predicted_ids)
	```
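
`batch_decode` hides the CTC decoding rule: after taking the arg-max per frame, repeated ids are collapsed and the blank token is dropped. Below is a minimal sketch of this greedy rule, assuming a blank id of 0 and a hypothetical three-letter vocabulary. A real tokenizer additionally handles word delimiters and special tokens.

```python
def ctc_greedy_decode(frame_ids, id_to_char, blank_id=0):
    """Collapse repeated ids, then remove blanks (greedy CTC decoding)."""
    chars, previous = [], None
    for idx in frame_ids:
        # Emit a character only when the id changes and is not the blank.
        if idx != previous and idx != blank_id:
            chars.append(id_to_char[idx])
        previous = idx
    return "".join(chars)


vocab = {1: "c", 2: "a", 3: "t"}  # hypothetical id-to-character mapping
text = ctc_greedy_decode([1, 1, 0, 2, 2, 2, 0, 0, 3], vocab)
print(text)
```

Note that a blank between two identical ids keeps both characters, which is how CTC represents doubled letters.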

	Parameters:

	model (`openvino.Model`) : The `openvino.Model` instance used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to prevent the model from being dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.

	### OVModelForAudioXVector[[optimum.intel.OVModelForAudioXVector]]

	#### optimum.intel.OVModelForAudioXVector[[optimum.intel.OVModelForAudioXVector]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L803)

	OpenVINO Model with an XVector feature extraction head on top for tasks like Speaker Verification.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	Audio XVector model for OpenVINO.

	`forward(input_values: typing.Optional[torch.Tensor] = None, attention_mask: typing.Optional[torch.Tensor] = None, **kwargs)`

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L811)

	- input_values (`torch.Tensor` of shape `(batch_size, sequence_length)`) --
	Float values of the input raw speech waveform.
	Input values can be obtained from an audio file loaded into an array using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor).

	The [OVModelForAudioXVector](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForAudioXVector) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of Audio XVector:

	```python
	>>> from transformers import AutoFeatureExtractor
	>>> from optimum.intel import OVModelForAudioXVector
	>>> from datasets import load_dataset
	>>> import torch

	>>> dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
	>>> dataset = dataset.sort("id")
	>>> sampling_rate = dataset.features["audio"].sampling_rate

	>>> feature_extractor = AutoFeatureExtractor.from_pretrained("anton-l/wav2vec2-base-superb-sv")
	>>> model = OVModelForAudioXVector.from_pretrained("anton-l/wav2vec2-base-superb-sv", export=True)

	>>> # audio file is decoded on the fly
	>>> inputs = feature_extractor(
	... [d["array"] for d in dataset[:2]["audio"]], sampling_rate=sampling_rate, return_tensors="pt", padding=True
	... )
	>>> embeddings = model(**inputs).embeddings

	>>> embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()

	>>> cosine_sim = torch.nn.CosineSimilarity(dim=-1)
	>>> similarity = cosine_sim(embeddings[0], embeddings[1])
	>>> threshold = 0.7
	>>> if similarity < threshold:
	...     print("Speakers are not the same!")
	>>> round(similarity.item(), 2)
	```
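
Speaker verification with x-vectors comes down to comparing two embeddings. The sketch below computes cosine similarity in plain Python and applies a decision threshold. The embedding values and helper names are hypothetical, and the 0.7 threshold simply mirrors the example above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def same_speaker(a, b, threshold=0.7):
    """Treat two embeddings as the same speaker when similarity reaches the threshold."""
    return cosine_similarity(a, b) >= threshold


# Hypothetical 2-dimensional embeddings (real x-vectors are much larger).
print(same_speaker([1.0, 0.0], [0.0, 1.0]))
print(same_speaker([1.0, 1.0], [1.0, 0.9]))
```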

	Parameters:

	model (`openvino.Model`) : The `openvino.Model` instance used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Set to `False` to prevent the model from being dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.

	### OVModelForSpeechSeq2Seq[[optimum.intel.OVModelForSpeechSeq2Seq]]

	#### optimum.intel.OVModelForSpeechSeq2Seq[[optimum.intel.OVModelForSpeechSeq2Seq]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1246)

	Speech Sequence-to-sequence model with a language modeling head for OpenVINO inference. This class officially supports whisper, speech_to_text.

	`forward(input_features: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.LongTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.BoolTensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, cache_position: typing.Optional[torch.LongTensor] = None, **kwargs)`

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1282)

	- input_features (`torch.FloatTensor`) --
	Mel features extracted from the raw speech waveform, of shape
	`(batch_size, feature_size, encoder_sequence_length)`.
	- decoder_input_ids (`torch.LongTensor`) --
	Indices of decoder input sequence tokens in the vocabulary of shape `(batch_size, decoder_sequence_length)`.
	- encoder_outputs (`torch.FloatTensor`) --
	The encoder `last_hidden_state` of shape `(batch_size, encoder_sequence_length, hidden_size)`.
	- past_key_values (`tuple(tuple(torch.FloatTensor))`, optional, defaults to `None`) --
	Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding.
	The tuple is of length `config.n_layers` with each tuple having 2 tensors of shape
	`(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape
	`(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

	The [OVModelForSpeechSeq2Seq](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForSpeechSeq2Seq) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.
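
To make the `past_key_values` layout more concrete, the sketch below builds the nested tuple of tensor shapes described in the parameter list, using hypothetical sizes. It only manipulates shape tuples and creates no real tensors. The helper name and the interpretation of the four entries per layer (self-attention key and value, then cross-attention key and value) are assumptions for illustration.

```python
def past_key_values_shapes(n_layers, batch_size, num_heads,
                           decoder_len, encoder_len, head_dim):
    """Return, for each layer, the shapes of the four cached tensors."""
    decoder_shape = (batch_size, num_heads, decoder_len, head_dim)
    encoder_shape = (batch_size, num_heads, encoder_len, head_dim)
    # Assumed order: 2 tensors over decoder positions, then 2 over encoder positions.
    return tuple(
        (decoder_shape, decoder_shape, encoder_shape, encoder_shape)
        for _ in range(n_layers)
    )


# Hypothetical sizes chosen only for the example.
shapes = past_key_values_shapes(n_layers=4, batch_size=1, num_heads=6,
                                decoder_len=10, encoder_len=1500, head_dim=64)
print(len(shapes), shapes[0][0], shapes[0][2])
```

Only the decoder-length dimension grows as tokens are generated. The encoder-side shapes stay fixed for a given input.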

	Example of text generation:

	```python
	>>> from transformers import AutoProcessor
	>>> from optimum.intel import OVModelForSpeechSeq2Seq
	>>> from datasets import load_dataset

	>>> processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
	>>> model = OVModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")

	>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
	>>> inputs = processor.feature_extractor(ds[0]["audio"]["array"], return_tensors="pt")

	>>> gen_tokens = model.generate(inputs=inputs.input_features)
	>>> outputs = processor.tokenizer.batch_decode(gen_tokens)
	```

	Example using `transformers.pipeline`:

	```python
	>>> from transformers import AutoProcessor, pipeline
	>>> from optimum.intel import OVModelForSpeechSeq2Seq
	>>> from datasets import load_dataset

	>>> processor = AutoProcessor.from_pretrained("openai/whisper-tiny")
	>>> model = OVModelForSpeechSeq2Seq.from_pretrained("openai/whisper-tiny")
	>>> speech_recognition = pipeline("automatic-speech-recognition", model=model, tokenizer=processor.tokenizer, feature_extractor=processor.feature_extractor)

	>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
	>>> pred = speech_recognition(ds[0]["audio"]["array"])
	```

	Parameters:

	encoder (`openvino.Model`) : The OpenVINO Runtime model associated to the encoder.

	decoder (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder.

	decoder_with_past (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder with past key values.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is an instance of the configuration associated to the model. Initializing with a config file does not load the weights associated with the model, only the configuration.

	## Computer Vision

	The following classes are available for computer vision tasks.

	### OVModelForImageClassification[[optimum.intel.OVModelForImageClassification]]

	#### optimum.intel.OVModelForImageClassification[[optimum.intel.OVModelForImageClassification]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L538)

	OpenVINO Model with an `ImageClassifierOutput` for image classification tasks.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	`optimum.intel.OVModelForImageClassification.forward(pixel_values: typing.Union[torch.Tensor, numpy.ndarray], **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L600)

	- pixel_values (`torch.Tensor`) --
	Pixel values corresponding to the images in the current batch.
	Pixel values can be obtained from encoded images using [`AutoFeatureExtractor`](https://huggingface.co/docs/transformers/autoclass_tutorial#autofeatureextractor).

	The [OVModelForImageClassification](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForImageClassification) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of image classification using `transformers.pipelines`:
	```python
	>>> from transformers import AutoFeatureExtractor, pipeline
	>>> from optimum.intel import OVModelForImageClassification

	>>> preprocessor = AutoFeatureExtractor.from_pretrained("google/vit-base-patch16-224")
	>>> model = OVModelForImageClassification.from_pretrained("google/vit-base-patch16-224", export=True)
	>>> model.reshape(batch_size=1, sequence_length=3, height=224, width=224)
	>>> pipe = pipeline("image-classification", model=model, feature_extractor=preprocessor)
	>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
	>>> outputs = pipe(url)
	```

	This class can also be used with [timm](https://github.com/huggingface/pytorch-image-models) models hosted on the [Hugging Face Hub](https://huggingface.co/timm). Example:
	```python
	>>> from transformers import pipeline
	>>> from optimum.intel.openvino.modeling_timm import TimmImageProcessor
	>>> from optimum.intel import OVModelForImageClassification

	>>> model_id = "timm/vit_tiny_patch16_224.augreg_in21k"
	>>> preprocessor = TimmImageProcessor.from_pretrained(model_id)
	>>> model = OVModelForImageClassification.from_pretrained(model_id, export=True)
	>>> pipe = pipeline("image-classification", model=model, feature_extractor=preprocessor)
	>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
	>>> outputs = pipe(url)
	```

	Parameters:

	model (`openvino.Model`) : The main class used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Should be set to `False` for the model to not be dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.
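As a minimal sketch of how `compile=False` might be combined with a static reshape, the helper below loads the model without compiling it, fixes the input shape with the same `reshape` arguments used in the pipeline example above, selects the device, and only then compiles. The function name `load_static_classifier` is hypothetical, and the sketch assumes `optimum-intel[openvino]` is installed; the model ID is the one from the example above.

```python
def load_static_classifier(model_id="google/vit-base-patch16-224", device="CPU"):
    """Load an image classifier, make its shapes static, then compile it once.

    Hypothetical helper for illustration; not part of optimum-intel.
    """
    # Imported lazily so that defining the helper does not require the library.
    from optimum.intel import OVModelForImageClassification

    # compile=False skips the compilation that would otherwise happen at load time.
    model = OVModelForImageClassification.from_pretrained(
        model_id, export=True, compile=False
    )
    # Static reshape, using the same arguments as the pipeline example above.
    model.reshape(batch_size=1, sequence_length=3, height=224, width=224)
    # Select the target device before compiling.
    model.to(device)
    # Compile explicitly, now that shape and device are final.
    model.compile()
    return model
```

Deferring compilation this way avoids compiling the model once at load time and then again after the reshape or device change.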

	## Multimodal

	The following classes are available for multimodal tasks.

	### OVModelForVision2Seq[[optimum.intel.OVModelForVision2Seq]]

	#### optimum.intel.OVModelForVision2Seq[[optimum.intel.OVModelForVision2Seq]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1051)

	VisionEncoderDecoder Sequence-to-sequence model with a language modeling head for OpenVINO inference.

	`optimum.intel.OVModelForVision2Seq.forward(pixel_values: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.LongTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.BoolTensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1105)

	- pixel_values (`torch.FloatTensor`) --
	Features extracted from an Image. This tensor should be of shape
	`(batch_size, num_channels, height, width)`.
	- decoder_input_ids (`torch.LongTensor`) --
	Indices of decoder input sequence tokens in the vocabulary of shape `(batch_size, decoder_sequence_length)`.
	- encoder_outputs (`torch.FloatTensor`) --
	The encoder `last_hidden_state` of shape `(batch_size, encoder_sequence_length, hidden_size)`.
	- past_key_values (`tuple(tuple(torch.FloatTensor))`, optional, defaults to `None`) --
	Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding.
	The tuple is of length `config.n_layers` with each tuple having 2 tensors of shape
	`(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape
	`(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

	The [OVModelForVision2Seq](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForVision2Seq) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.
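To make the `past_key_values` layout described above concrete, the following sketch builds the expected tensor shapes for each layer without allocating any tensors. The helper name and the sizes are illustrative assumptions. Reading the first pair as self-attention key/value and the second pair as cross-attention key/value follows the usual encoder-decoder convention and is not stated explicitly in this reference.

```python
def past_key_values_shapes(n_layers, batch_size, num_heads, decoder_len, encoder_len, head_dim):
    """Return, for every layer, the shapes of the 4 cached tensors.

    Hypothetical helper for illustration; it only computes shapes.
    """
    # 2 tensors whose length follows the decoder sequence (assumed: self-attention key/value).
    decoder_shape = (batch_size, num_heads, decoder_len, head_dim)
    # 2 tensors whose length follows the encoder sequence (assumed: cross-attention key/value).
    encoder_shape = (batch_size, num_heads, encoder_len, head_dim)
    return tuple(
        (decoder_shape, decoder_shape, encoder_shape, encoder_shape)
        for _ in range(n_layers)
    )


# Illustrative sizes only; real values come from the model configuration and inputs.
shapes = past_key_values_shapes(
    n_layers=2, batch_size=1, num_heads=4, decoder_len=5, encoder_len=10, head_dim=16
)
print(len(shapes))      # one entry per layer
print(len(shapes[0]))   # four tensors per layer
print(shapes[0][0], shapes[0][2])
```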

	Example of text generation:

	```python
	>>> from transformers import AutoProcessor, AutoTokenizer
	>>> from optimum.intel import OVModelForVision2Seq
	>>> from PIL import Image
	>>> import requests

	>>> processor = AutoProcessor.from_pretrained("microsoft/trocr-small-handwritten")
	>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-handwritten")
	>>> model = OVModelForVision2Seq.from_pretrained("microsoft/trocr-small-handwritten", export=True)

	>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg"
	>>> image = Image.open(requests.get(url, stream=True).raw)
	>>> inputs = processor(image, return_tensors="pt")

	>>> gen_tokens = model.generate(**inputs)
	>>> outputs = tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)

	```

	Example using `transformers.pipeline`:

	```python
	>>> from transformers import AutoProcessor, AutoTokenizer, pipeline
	>>> from optimum.intel import OVModelForVision2Seq
	>>> from PIL import Image
	>>> import requests

	>>> processor = AutoProcessor.from_pretrained("microsoft/trocr-small-handwritten")
	>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/trocr-small-handwritten")
	>>> model = OVModelForVision2Seq.from_pretrained("microsoft/trocr-small-handwritten", export=True)

	>>> url = "https://fki.tic.heia-fr.ch/static/img/a01-122-02-00.jpg"
	>>> image = Image.open(requests.get(url, stream=True).raw)

	>>> image_to_text = pipeline("image-to-text", model=model, tokenizer=tokenizer, feature_extractor=processor, image_processor=processor)
	>>> pred = image_to_text(image)
	```

	Parameters:

	encoder (`openvino.Model`) : The OpenVINO Runtime model associated to the encoder.

	decoder (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder.

	decoder_with_past (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder with past key values.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is an instance of the configuration associated to the model. Initializing with a config file does not load the weights associated with the model, only the configuration.

	### OVModelForPix2Struct[[optimum.intel.OVModelForPix2Struct]]

	#### optimum.intel.OVModelForPix2Struct[[optimum.intel.OVModelForPix2Struct]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1156)

	Pix2Struct model with a language modeling head for OpenVINO inference.

	`optimum.intel.OVModelForPix2Struct.forward(flattened_patches: typing.Optional[torch.FloatTensor] = None, attention_mask: typing.Optional[torch.LongTensor] = None, decoder_input_ids: typing.Optional[torch.LongTensor] = None, decoder_attention_mask: typing.Optional[torch.BoolTensor] = None, encoder_outputs: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, past_key_values: typing.Optional[typing.Tuple[typing.Tuple[torch.Tensor]]] = None, **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_seq2seq.py#L1196)

	- flattened_patches (`torch.FloatTensor` of shape `(batch_size, seq_length, hidden_size)`) --
	Flattened pixel patches. The `hidden_size` is obtained by the following formula:
	`hidden_size = num_channels * patch_size * patch_size`.
	The process of flattening the pixel patches is done by `Pix2StructProcessor`.
	- attention_mask (`torch.FloatTensor` of shape `(batch_size, sequence_length)`, optional) --
	Mask to avoid performing attention on padding token indices.
	- decoder_input_ids (`torch.LongTensor` of shape `(batch_size, target_sequence_length)`, optional) --
	Indices of decoder input sequence tokens in the vocabulary.
	Pix2StructText uses the `pad_token_id` as the starting token for `decoder_input_ids` generation. If
	`past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
	`past_key_values`).
	- decoder_attention_mask (`torch.BoolTensor` of shape `(batch_size, target_sequence_length)`, optional) --
	Default behavior: generate a tensor that ignores pad tokens in `decoder_input_ids`. Causal mask will also
	be used by default.
	- encoder_outputs (`tuple(tuple(torch.FloatTensor))`, optional) --
	Tuple consists of (`last_hidden_state`, `optional`: hidden_states, `optional`: attentions)
	`last_hidden_state` of shape `(batch_size, sequence_length, hidden_size)` is a sequence of hidden states at
	the output of the last layer of the encoder. Used in the cross-attention of the decoder.
	- past_key_values (`tuple(tuple(torch.FloatTensor))`, optional, defaults to `None`) --
	Contains the precomputed key and value hidden states of the attention blocks used to speed up decoding.
	The tuple is of length `config.n_layers` with each tuple having 2 tensors of shape
	`(batch_size, num_heads, decoder_sequence_length, embed_size_per_head)` and 2 additional tensors of shape
	`(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.

	The [OVModelForPix2Struct](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForPix2Struct) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.
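The `hidden_size` formula given for `flattened_patches` can be checked with a short calculation. The values below (3 colour channels, 16x16 patches) are assumptions chosen for illustration; the actual values depend on the checkpoint's configuration.

```python
def flattened_patch_size(num_channels, patch_size):
    """hidden_size = num_channels * patch_size * patch_size."""
    return num_channels * patch_size * patch_size


# Assumed example values: RGB input and 16x16 pixel patches.
hidden_size = flattened_patch_size(num_channels=3, patch_size=16)
print(hidden_size)  # 768
```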

	Example of pix2struct:

	```python
	>>> from transformers import AutoProcessor
	>>> from optimum.intel import OVModelForPix2Struct
	>>> from PIL import Image
	>>> import requests

	>>> processor = AutoProcessor.from_pretrained("google/pix2struct-ai2d-base")
	>>> model = OVModelForPix2Struct.from_pretrained("google/pix2struct-ai2d-base", export=True)

	>>> url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"
	>>> image = Image.open(requests.get(url, stream=True).raw)
	>>> question = "What does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud"
	>>> inputs = processor(images=image, text=question, return_tensors="pt")

	>>> gen_tokens = model.generate(**inputs)
	>>> outputs = processor.batch_decode(gen_tokens, skip_special_tokens=True)
	```

	Parameters:

	encoder (`openvino.Model`) : The OpenVINO Runtime model associated to the encoder.

	decoder (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder.

	decoder_with_past (`openvino.Model`) : The OpenVINO Runtime model associated to the decoder with past key values.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is an instance of the configuration associated to the model. Initializing with a config file does not load the weights associated with the model, only the configuration.

	## Custom Tasks

	### OVModelForCustomTasks[[optimum.intel.OVModelForCustomTasks]]

	#### optimum.intel.OVModelForCustomTasks[[optimum.intel.OVModelForCustomTasks]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L945)

	OpenVINO Model for custom tasks. It can be used to leverage the inference acceleration for any single-file OpenVINO model that may use custom inputs and outputs.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	`optimum.intel.OVModelForCustomTasks.forward(**kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L946)

	The [OVModelForCustomTasks](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForCustomTasks) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of custom tasks (e.g. a sentence transformers model with a pooler head):

	```python
	>>> from transformers import AutoTokenizer
	>>> from optimum.intel import OVModelForCustomTasks

	>>> tokenizer = AutoTokenizer.from_pretrained("IlyasMoutawwakil/sbert-all-MiniLM-L6-v2-with-pooler")
	>>> model = OVModelForCustomTasks.from_pretrained("IlyasMoutawwakil/sbert-all-MiniLM-L6-v2-with-pooler")

	>>> inputs = tokenizer("I love burritos!", return_tensors="np")

	>>> outputs = model(**inputs)
	>>> last_hidden_state = outputs.last_hidden_state
	>>> pooler_output = outputs.pooler_output
	```

	Parameters:

	model (`openvino.Model`) : The main class used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Should be set to `False` for the model to not be dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.

	### OVModelForFeatureExtraction[[optimum.intel.OVModelForFeatureExtraction]]

	#### optimum.intel.OVModelForFeatureExtraction[[optimum.intel.OVModelForFeatureExtraction]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L352)

	OpenVINO Model with a BaseModelOutput for feature extraction tasks.

	This model inherits from `optimum.intel.openvino.modeling.OVBaseModel`. Check the superclass documentation for the generic methods the
	library implements for all its models (such as downloading or saving).

	`optimum.intel.OVModelForFeatureExtraction.forward(input_ids: typing.Union[torch.Tensor, numpy.ndarray], attention_mask: typing.Union[torch.Tensor, numpy.ndarray], token_type_ids: typing.Union[torch.Tensor, numpy.ndarray, NoneType] = None, **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling.py#L366)

	- input_ids (`torch.Tensor`) --
	Indices of input sequence tokens in the vocabulary.
	Indices can be obtained using [`AutoTokenizer`](https://huggingface.co/docs/transformers/autoclass_tutorial#autotokenizer).
	[What are input IDs?](https://huggingface.co/docs/transformers/glossary#input-ids)
	- attention_mask (`torch.Tensor`, optional) --
	Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
	  - 1 for tokens that are not masked,
	  - 0 for tokens that are masked.
	[What are attention masks?](https://huggingface.co/docs/transformers/glossary#attention-mask)
	- token_type_ids (`torch.Tensor`, optional) --
	Segment token indices to indicate first and second portions of the inputs. Indices are selected in `[0, 1]`:
	  - 0 for tokens that belong to sentence A,
	  - 1 for tokens that belong to sentence B.
	[What are token type IDs?](https://huggingface.co/docs/transformers/glossary#token-type-ids)

	The [OVModelForFeatureExtraction](/docs/optimum.intel/pr_1714/en/openvino/reference#optimum.intel.OVModelForFeatureExtraction) forward method overrides the `__call__` special method.

	Although the recipe for forward pass needs to be defined within this function, one should call the `Module`
	instance afterwards instead of this since the former takes care of running the pre and post processing steps while
	the latter silently ignores them.

	Example of feature extraction using `transformers.pipelines`:
	```python
	>>> from transformers import AutoTokenizer, pipeline
	>>> from optimum.intel import OVModelForFeatureExtraction

	>>> tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
	>>> model = OVModelForFeatureExtraction.from_pretrained("sentence-transformers/all-MiniLM-L6-v2", export=True)
	>>> pipe = pipeline("feature-extraction", model=model, tokenizer=tokenizer)
	>>> outputs = pipe("My Name is Peter and I live in New York.")
	```

	Parameters:

	model (`openvino.Model`) : The main class used to run OpenVINO Runtime inference.

	config (`transformers.PretrainedConfig`) : [PretrainedConfig](https://huggingface.co/docs/transformers/main_classes/configuration#transformers.PretrainedConfig) is the Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the `~intel.openvino.modeling.OVBaseModel.from_pretrained` method to load the model weights.

	device (`str`, defaults to `"CPU"`) : The device type for which the model will be optimized. The resulting compiled model will contain nodes specific to this device.

	dynamic_shapes (`bool`, defaults to `True`) : All the model's dimensions will be set to dynamic when set to `True`. Should be set to `False` for the model to not be dynamically reshaped by default.

	ov_config (`Optional[Dict]`, defaults to `None`) : The dictionary containing the information related to the model compilation.

	compile (`bool`, defaults to `True`) : Disable the model compilation during the loading step when set to `False`. Can be useful to avoid unnecessary compilation, in the case where the model needs to be statically reshaped, the device modified or if FP16 conversion is enabled.

	## Text-to-image

	### OVStableDiffusionPipeline[[optimum.intel.OVStableDiffusionPipeline]]

	#### optimum.intel.OVStableDiffusionPipeline[[optimum.intel.OVStableDiffusionPipeline]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1443)

	OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion#diffusers.StableDiffusionPipeline).

	`optimum.intel.OVStableDiffusionPipeline.forward(*args, **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)

	### OVStableDiffusionXLPipeline[[optimum.intel.OVStableDiffusionXLPipeline]]

	#### optimum.intel.OVStableDiffusionXLPipeline[[optimum.intel.OVStableDiffusionXLPipeline]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1477)

	OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionXLPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLPipeline).

	`optimum.intel.OVStableDiffusionXLPipeline.forward(*args, **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)

	### OVLatentConsistencyModelPipeline[[optimum.intel.OVLatentConsistencyModelPipeline]]

	#### optimum.intel.OVLatentConsistencyModelPipeline[[optimum.intel.OVLatentConsistencyModelPipeline]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1578)

	OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.LatentConsistencyModelPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/latent_consistency#diffusers.LatentConsistencyModelPipeline).

	`optimum.intel.OVLatentConsistencyModelPipeline.forward(*args, **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)

	## Image-to-image

	### OVStableDiffusionImg2ImgPipeline[[optimum.intel.OVStableDiffusionImg2ImgPipeline]]

	#### optimum.intel.OVStableDiffusionImg2ImgPipeline[[optimum.intel.OVStableDiffusionImg2ImgPipeline]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1453)

	OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionImg2ImgPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_img2img#diffusers.StableDiffusionImg2ImgPipeline).

	`optimum.intel.OVStableDiffusionImg2ImgPipeline.forward(*args, **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)

	### OVStableDiffusionXLImg2ImgPipeline[[optimum.intel.OVStableDiffusionXLImg2ImgPipeline]]

	#### optimum.intel.OVStableDiffusionXLImg2ImgPipeline[[optimum.intel.OVStableDiffusionXLImg2ImgPipeline]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1500)

	OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionXLImg2ImgPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_xl#diffusers.StableDiffusionXLImg2ImgPipeline).

	`optimum.intel.OVStableDiffusionXLImg2ImgPipeline.forward(*args, **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)

	## Inpainting

	### OVStableDiffusionInpaintPipeline[[optimum.intel.OVStableDiffusionInpaintPipeline]]

	#### optimum.intel.OVStableDiffusionInpaintPipeline[[optimum.intel.OVStableDiffusionInpaintPipeline]]

	[Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_diffusion.py#L1465)

	OpenVINO-powered stable diffusion pipeline corresponding to [diffusers.StableDiffusionInpaintPipeline](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/stable_diffusion_inpaint#diffusers.StableDiffusionInpaintPipeline).

	`optimum.intel.OVStableDiffusionInpaintPipeline.forward(*args, **kwargs)` [Source](https://github.com/huggingface/optimum-intel/blob/vr_1714/optimum/intel/openvino/modeling_base.py#L976)

	### Export your model
	https://huggingface.co/docs/optimum.intel/pr_1714/openvino/export.md

	# Export your model

	To export a [model](https://huggingface.co/docs/optimum/main/en/intel/openvino/models) hosted on the [Hub](https://huggingface.co/models), you can use our [space](https://huggingface.co/spaces/echarlaix/openvino-export). After conversion, a repository is pushed under your namespace; it can be either public or private.

	## Using the CLI

	To export your model to the [OpenVINO IR](https://docs.openvino.ai/2024/documentation/openvino-ir-format.html) format with the CLI:

	```bash
	optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B ov_model/
	```

	To export a private model or a model that requires access, you can either run `huggingface-cli login` to log in permanently, or set the environment variable `HF_TOKEN` to a [token](https://huggingface.co/settings/tokens) with access to the model. See the [authentication documentation](https://huggingface.co/docs/huggingface_hub/quick-start#authentication) for more information.

	The model argument can either be the model ID of a model hosted on the [Hub](https://huggingface.co/models) or a path to a model hosted locally. For local models, you need to specify the task for which the model should be loaded before export, among the list of the [supported tasks](https://huggingface.co/docs/optimum/main/en/exporters/task_manager).

	```bash
	optimum-cli export openvino --model local_llama --task text-generation-with-past ov_model/
	```
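When the export is driven from a script, the same command line can be assembled programmatically. The sketch below is one possible way to do this; `build_export_command` is a hypothetical helper, not part of `optimum-cli`, and it only uses the `--model`, `--task` and `--weight-format` options shown in this section. It builds the argument list and prints it without running the export.

```python
import shlex


def build_export_command(model, output_dir, task=None, weight_format=None):
    """Assemble an `optimum-cli export openvino` argument list.

    Hypothetical helper for illustration; it does not run the export.
    """
    cmd = ["optimum-cli", "export", "openvino", "--model", model]
    # For local models the task has to be given explicitly.
    if task is not None:
        cmd += ["--task", task]
    if weight_format is not None:
        cmd += ["--weight-format", weight_format]
    # The output directory is the positional argument and comes last.
    cmd.append(output_dir)
    return cmd


cmd = build_export_command("local_llama", "ov_model/", task="text-generation-with-past")
print(shlex.join(cmd))
# To actually run the export, pass the list to subprocess.run(cmd, check=True).
```

Passing the command as a list rather than a single string avoids shell quoting issues when a model path contains spaces.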

	Check out the help for more options:

	```text
	usage: optimum-cli export openvino [-h] -m MODEL [--task TASK] [--framework {pt}] [--trust-remote-code]
	[--weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4}]
	[--quant-mode {int8,f8e4m3,f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2}]
	[--library {transformers,diffusers,timm,sentence_transformers,open_clip}]
	[--cache_dir CACHE_DIR] [--pad-token-id PAD_TOKEN_ID] [--ratio RATIO] [--sym]
	[--group-size GROUP_SIZE] [--backup-precision {none,int8_sym,int8_asym}]
	[--dataset DATASET] [--all-layers] [--awq] [--scale-estimation] [--gptq]
	[--lora-correction] [--sensitivity-metric SENSITIVITY_METRIC]
	[--quantization-statistics-path QUANTIZATION_STATISTICS_PATH]
	[--num-samples NUM_SAMPLES] [--disable-stateful] [--disable-convert-tokenizer]
	[--smooth-quant-alpha SMOOTH_QUANT_ALPHA]
	output

	optional arguments:
	-h, --help show this help message and exit

	Required arguments:
	-m MODEL, --model MODEL
	Model ID on huggingface.co or path on disk to load model from.
	output Path indicating the directory where to store the generated OV model.

	Optional arguments:
	--task TASK The task to export the model for. If not specified, the task will be auto-inferred based on
	the model. Available tasks depend on the model, but are among: ['image-to-image',
	'image-segmentation', 'inpainting', 'sentence-similarity', 'text-to-audio', 'image-to-text',
	'automatic-speech-recognition', 'token-classification', 'text-to-image', 'audio-classification',
	'feature-extraction', 'semantic-segmentation', 'masked-im', 'audio-xvector',
	'audio-frame-classification', 'text2text-generation', 'multiple-choice', 'depth-estimation',
	'image-classification', 'fill-mask', 'zero-shot-object-detection', 'object-detection',
	'question-answering', 'zero-shot-image-classification', 'mask-generation', 'text-generation',
	'text-classification']. For decoder models, use 'xxx-with-past' to export the model using past
	key values in the decoder.
	--framework {pt} The framework to use for the export. Defaults to 'pt' for PyTorch.
	--trust-remote-code Allows to use custom code for the modeling hosted in the model repository. This option should
	only be set for repositories you trust and in which you have read the code, as it will execute
	on your local machine arbitrary code present in the model repository.
	--weight-format {fp32,fp16,int8,int4,mxfp4,nf4,cb4}
	The weight format of the exported model. Option 'cb4' represents a codebook with 16
	fixed fp8 values in E4M3 format.
	--quant-mode {int8,f8e4m3,f8e5m2,cb4_f8e4m3,int4_f8e4m3,int4_f8e5m2}
	Quantization precision mode. This is used for applying full model quantization including
	activations.
	--library {transformers,diffusers,timm,sentence_transformers,open_clip}
	The library used to load the model before export. If not provided, will attempt to infer the
	local checkpoint's library
	--cache_dir CACHE_DIR
	The path to a directory in which the downloaded model should be cached if the standard cache
	should not be used.
	--pad-token-id PAD_TOKEN_ID
	This is needed by some models, for some tasks. If not provided, will attempt to use the
	tokenizer to guess it.
	--ratio RATIO A parameter used when applying 4-bit quantization to control the ratio between 4-bit and 8-bit
	quantization. If set to 0.8, 80% of the layers will be quantized to int4 while 20% will be
	quantized to int8. This helps to achieve better accuracy at the sacrifice of the model size
	and inference latency. Default value is 1.0. Note: If dataset is provided, and the ratio is
	less than 1.0, then data-aware mixed precision assignment will be applied.
	--sym Whether to apply symmetric quantization. This argument is related to integer-typed
	--weight-format and --quant-mode options. In case of full or mixed quantization (--quant-mode)
	symmetric quantization will be applied to weights in any case, so only activation quantization
	will be affected by --sym argument. For weight-only quantization (--weight-format) --sym
	argument does not affect backup precision. Examples: (1) --weight-format int8 --sym => int8
	symmetric quantization of weights; (2) --weight-format int4 => int4 asymmetric quantization of
	weights; (3) --weight-format int4 --sym --backup-precision int8_asym => int4 symmetric
	quantization of weights with int8 asymmetric backup precision; (4) --quant-mode int8 --sym =>
	weights and activations are quantized to int8 symmetric data type; (5) --quant-mode int8 =>
	activations are quantized to int8 asymmetric data type, weights -- to int8 symmetric data type;
	(6) --quant-mode int4_f8e5m2 --sym => activations are quantized to f8e5m2 data type, weights --
	to int4 symmetric data type.
	--group-size GROUP_SIZE
	The group size to use for quantization. Recommended value is 128 and -1 uses per-column
	quantization.
	--group-size-fallback {error,ignore,adjust}
	Specifies how to handle operations that do not support the given group size. Possible values are:
	`error`: raise an error if the given group size is not supported by a node, this is the default
	behavior;
	`ignore`: skip nodes that cannot be compressed with the given group size;
	`adjust`: adjust the group size to the maximum supported value for each problematic node, if
	there is no valid value greater than or equal to 32, then the node is quantized to the backup
	precision which is int8_asym by default.
	--backup-precision {none,int8_sym,int8_asym}
	Defines a backup precision for mixed-precision weight compression. Only valid for 4-bit weight
	formats. If not provided, backup precision is int8_asym. 'none' stands for original floating-
	point precision of the model weights, in this case weights are retained in their original
	precision without any quantization. 'int8_sym' stands for 8-bit integer symmetric quantization
	without zero point. 'int8_asym' stands for 8-bit integer asymmetric quantization with zero
	points per each quantization group.
	--dataset DATASET The dataset used for data-aware compression or quantization with NNCF. Can be a dataset name
	(e.g., 'wikitext2') or a string with options (e.g., 'wikitext2:seq_len=128'). The only currently
	supported option is `seq_len` which represents a length of an input sample sequence (sentence).
	For language models you can use the one from the list
	['auto','wikitext2','c4','c4-new','gsm8k']. With 'auto' the dataset will be collected from model's
	generations. For diffusion models it should be one of ['conceptual_captions',
	'laion/220k-GPT4Vision-captions-from-LIVIS','laion/filtered-wit']. For visual language models
	the dataset must be set to 'contextual'. Note: if none of the data-aware compression algorithms
	are selected and ratio parameter is omitted or equals 1.0, the dataset argument will not have an
	effect on the resulting model. Note: for text generation task, datasets with English texts such
	as 'wikitext2','gsm8k','c4' or 'c4-new' usually work fine even for non-English models.
	--all-layers Whether embeddings and last MatMul layers should be compressed to INT4. If not provided and
	weight compression is applied, they are compressed to INT8.
	--awq Whether to apply AWQ algorithm. AWQ improves generation quality of INT4-compressed LLMs. If
	dataset is provided, a data-aware activation-based version of the algorithm will be executed,
	which requires additional time. Otherwise, data-free AWQ will be applied which relies on
	per-column magnitudes of weights instead of activations. Note: it is possible that there will
	be no matching patterns in the model to apply AWQ, in such case it will be skipped.
	--scale-estimation Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between
	the original and compressed layers. Providing a dataset is required to run scale estimation.
	Please note, that applying scale estimation takes additional memory and time.
	--gptq Indicates whether to apply GPTQ algorithm that optimizes compressed weights in a layer-wise
	fashion to minimize the difference between activations of a compressed and original layer.
	Please note, that applying GPTQ takes additional memory and time.
	--lora-correction Indicates whether to apply LoRA Correction algorithm. When enabled, this algorithm introduces
	low-rank adaptation layers in the model that can recover accuracy after weight compression at
	some cost of inference latency. Please note, that applying LoRA Correction algorithm takes
	additional memory and time.
	--sensitivity-metric SENSITIVITY_METRIC
	The sensitivity metric for assigning quantization precision to layers. It can be one of the
	following: ['weight_quantization_error', 'hessian_input_activation',
	'mean_activation_variance', 'max_activation_variance', 'mean_activation_magnitude'].
	--quantization-statistics-path QUANTIZATION_STATISTICS_PATH
	Directory path to dump/load data-aware weight-only quantization statistics. This is useful when
	running data-aware quantization multiple times on the same model and dataset to avoid
	recomputing statistics. This option is applicable exclusively for weight-only quantization.
	Please note that the statistics depend on the dataset, so if you change the dataset, you should
	also change the statistics path to avoid confusion.
	--num-samples NUM_SAMPLES
	The maximum number of samples to take from the dataset for quantization.
	--disable-stateful Disable stateful converted models, stateless models will be generated instead. Stateful models
	are produced by default when this key is not used. In stateful models all kv-cache inputs and
	outputs are hidden in the model and are not exposed as model inputs and outputs. If --disable-
	stateful option is used, it may result in sub-optimal inference performance. Use it when you
	intentionally want to use a stateless model, for example, to be compatible with existing
	OpenVINO native inference code that expects KV-cache inputs and outputs in the model.
	--disable-convert-tokenizer
	Do not add converted tokenizer and detokenizer OpenVINO models.
	--smooth-quant-alpha SMOOTH_QUANT_ALPHA
	SmoothQuant alpha parameter that improves the distribution of activations before MatMul layers
	and reduces quantization error. Valid only when activations quantization is enabled.
	```
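When exports are scripted, it can help to assemble the command from keyword arguments instead of concatenating strings. The following is a minimal sketch, not part of Optimum Intel: `build_export_command` is a hypothetical helper that only maps underscore-style names to the dashed flags listed in the help text above (note that `--cache_dir` uses an underscore in the CLI and would need special handling).

```python
import shlex


def build_export_command(model_id, output_dir, **options):
    """Return the argv list for `optimum-cli export openvino` (hypothetical helper).

    Keyword names use underscores and are turned into dashed flags
    (weight_format -> --weight-format). A value of True emits a bare switch
    such as --awq; None or False skips the option entirely.
    """
    argv = ["optimum-cli", "export", "openvino", "--model", model_id]
    for name, value in options.items():
        if value is None or value is False:
            continue
        argv.append("--" + name.replace("_", "-"))
        if value is not True:
            argv.append(str(value))
    # The output directory is the final positional argument.
    argv.append(output_dir)
    return argv


cmd = build_export_command(
    "meta-llama/Meta-Llama-3-8B",
    "ov_model/",
    weight_format="int4",
    awq=True,
    scale_estimation=True,
    dataset="wikitext2",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `optimum-cli` is installed.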

	You can also apply fp16, 8-bit or 4-bit weight-only quantization to the Linear, Convolutional and Embedding layers when exporting your model by setting `--weight-format` to `fp16`, `int8` or `int4`, respectively.

	Export with INT8 weights compression:
	```bash
	optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int8 ov_model/
	```
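To make the difference between the symmetric (`int8_sym`, no zero point) and asymmetric (`int8_asym`, with zero point) schemes from the help text more concrete, here is a textbook-style sketch in plain Python. It is an illustration under simplifying assumptions (one scale for the whole list, round-to-nearest), not the NNCF implementation.

```python
def quantize_int8_sym(values):
    """Symmetric: zero point is fixed at 0, scale comes from the largest magnitude."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale


def quantize_int8_asym(values):
    """Asymmetric: scale and zero point map [min, max] onto the unsigned range [0, 255]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 if hi > lo else 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point=0):
    """Map integer codes back to approximate floating-point values."""
    return [(x - zero_point) * scale for x in q]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up, all-positive weights: half of the symmetric range stays unused.
weights = [0.0, 0.5, 1.0, 1.5, 2.0]
q_sym, s_sym = quantize_int8_sym(weights)
q_asym, s_asym, zp = quantize_int8_asym(weights)

err_sym = max_error(weights, dequantize(q_sym, s_sym))
err_asym = max_error(weights, dequantize(q_asym, s_asym, zp))
print(f"symmetric max error:  {err_sym:.5f}")
print(f"asymmetric max error: {err_asym:.5f}")
```

For a skewed value range like this one, the asymmetric scheme uses all 256 levels and therefore has a smaller step size, at the cost of storing a zero point.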

	Export with INT4 weights compression:
	```bash
	optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int4 ov_model/
	```
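The `--ratio` option controls how 4-bit exports mix int4 and int8 layers. The arithmetic can be sketched as follows. This is a toy illustration: `split_layers_by_ratio` and the layer names are hypothetical, and the split is made by position, whereas the real tool selects layers with a sensitivity metric.

```python
def split_layers_by_ratio(layer_names, ratio):
    """Assign the first `ratio` share of layers to int4 and the rest to int8."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    n_int4 = round(len(layer_names) * ratio)
    return {
        name: ("int4" if index < n_int4 else "int8")
        for index, name in enumerate(layer_names)
    }


layers = [f"layer_{i}" for i in range(10)]
assignment = split_layers_by_ratio(layers, ratio=0.8)

n_int4 = sum(1 for precision in assignment.values() if precision == "int4")
n_int8 = sum(1 for precision in assignment.values() if precision == "int8")
print(n_int4, n_int8)
```

With 10 layers and `ratio=0.8`, 8 layers end up in int4 and 2 in int8. With the default `ratio=1.0`, every layer is assigned int4.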

	Export with INT4 weights compression and data-free AWQ algorithm:
	```bash
	optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B --weight-format int4 --awq ov_model/
	```

	Export with INT4 weights compression and data-aware AWQ and Scale Estimation algorithms:
	```bash
	optimum-cli export openvino --model meta-llama/Meta-Llama-3-8B \
	--weight-format int4 --awq --scale-estimation --dataset wikitext2 ov_model/
	```
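The `--dataset` value can be a plain name (`wikitext2`) or a name with options (`wikitext2:seq_len=128`). A minimal sketch of how such a string could be split is shown below. `parse_dataset_spec` is a hypothetical helper written for illustration and is not the parser used by Optimum Intel. It accepts only the `seq_len` option, which the help text names as the only supported one.

```python
def parse_dataset_spec(spec):
    """Split 'name' or 'name:seq_len=N' into (name, options)."""
    name, _, option_str = spec.partition(":")
    options = {}
    if option_str:
        key, sep, value = option_str.partition("=")
        if key != "seq_len" or not sep:
            raise ValueError(f"unsupported dataset option: {option_str!r}")
        options[key] = int(value)
    return name, options


print(parse_dataset_spec("wikitext2"))
print(parse_dataset_spec("wikitext2:seq_len=128"))
```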

	For more information on the quantization parameters, check out the [documentation](inference#weight-only-quantization).

	Models larger than 1 billion parameters are exported to the OpenVINO format with 8-bit weights by default. You can disable it with `--weight-format fp32`.
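The default described above can be written as a small decision rule. This is a sketch of the documented behaviour only: `applies_default_int8` is a hypothetical function, and it assumes that any explicitly requested weight format overrides the default.

```python
ONE_BILLION = 1_000_000_000


def applies_default_int8(num_parameters, weight_format=None):
    """Return True when the export would fall back to 8-bit weights by default."""
    if weight_format is not None:
        # An explicit --weight-format (for example fp32) disables the default.
        return False
    return num_parameters > ONE_BILLION


print(applies_default_int8(8_000_000_000))          # 8B model, no format given
print(applies_default_int8(8_000_000_000, "fp32"))  # default disabled explicitly
print(applies_default_int8(124_000_000))            # small model
```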

	Besides weight-only quantization, you can also apply full model quantization, including activations, by setting `--quant-mode` to the preferred precision. This quantizes both the weights and the activations of Linear, Convolutional and some other layers to the selected mode, as in the example below.

	```bash
	optimum-cli export openvino -m openai/whisper-large-v3-turbo --quant-mode int8 --dataset librispeech --num-samples 32 --smooth-quant-alpha 0.9 ./whisper-large-v3-turbo
	```

	#### Default quantization configs

	For some models we maintain a set of [default quantization configs](https://github.com/huggingface/optimum-intel/blob/main/optimum/intel/openvino/configuration.py). To apply the default 4-bit weight-only quantization, pass `--weight-format int4` without any additional arguments. To apply the default int8 weight and activation quantization, pass `--quant-mode int8`. For example:

	```bash
	optimum-cli export openvino -m microsoft/Phi-4-mini-instruct --weight-format int4 ./Phi-4-mini-instruct
	```

	or

	```bash
	optimum-cli export openvino -m openai/clip-vit-base-patch16 --quant-mode int8 ./clip-vit-base-patch16
	```

	### Decoder models

	For models with a decoder, we enable the re-use of past keys and values by default. This avoids recomputing the same intermediate activations at each generation step. To export the model without the K-V cache, remove the `-with-past` suffix when specifying the task.

	\| With K-V cache \| Without K-V cache \|
	\|------------------------------------------\|--------------------------------------\|
	\| `text-generation-with-past` \| `text-generation` \|
	\| `text2text-generation-with-past` \| `text2text-generation` \|
	\| `automatic-speech-recognition-with-past` \| `automatic-speech-recognition` \|
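The two columns of the table differ only by the `-with-past` suffix, so switching between them is a simple string operation. The helpers below are hypothetical and only illustrate the naming convention.

```python
SUFFIX = "-with-past"


def with_past(task):
    """Return the task name that exports the model with the K-V cache."""
    return task if task.endswith(SUFFIX) else task + SUFFIX


def without_past(task):
    """Return the task name that exports the model without the K-V cache."""
    return task[: -len(SUFFIX)] if task.endswith(SUFFIX) else task


print(with_past("text-generation"))
print(without_past("text2text-generation-with-past"))
```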

	### Diffusion models

	When Stable Diffusion models are exported to the OpenVINO format, they are decomposed into different components that are later combined during inference:

	* Text encoder(s)
	* U-Net
	* VAE encoder
	* VAE decoder

	To export your Stable Diffusion XL model to the OpenVINO IR format with the CLI you can do as follows:

	```bash
	optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 ov_sdxl/
	```

	You can also apply hybrid quantization during model export. For example:
	```bash
	optimum-cli export openvino --model stabilityai/stable-diffusion-xl-base-1.0 \
	--weight-format int8 --dataset conceptual_captions ov_sdxl/
	```

	For more information about hybrid quantization, take a look at this Jupyter [notebook](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb).

	## When loading your model

	You can also load your PyTorch checkpoint and convert it to the OpenVINO format on-the-fly, by setting `export=True` when loading your model.

	To easily save the resulting model, you can use the `save_pretrained()` method, which will save both the BIN and XML files describing the graph. It is useful to save the tokenizer to the same directory, to enable easy loading of the tokenizer for the model.

	```diff
	- from transformers import AutoModelForCausalLM
	+ from optimum.intel import OVModelForCausalLM
	from transformers import AutoTokenizer

	model_id = "meta-llama/Meta-Llama-3-8B"
	- model = AutoModelForCausalLM.from_pretrained(model_id)
	+ model = OVModelForCausalLM.from_pretrained(model_id, export=True)
	tokenizer = AutoTokenizer.from_pretrained(model_id)

	save_directory = "ov_model"
	model.save_pretrained(save_directory)
	tokenizer.save_pretrained(save_directory)
	```

	## After loading your model

	```python
	from transformers import AutoModelForCausalLM
	from optimum.exporters.openvino import export_from_model

	model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
	export_from_model(model, output="ov_model", task="text-generation-with-past")
	```

	Once the model is exported, you can now [load your OpenVINO model](inference) by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.

	## Troubleshooting

	Some models do not work with the latest transformers release. You may see an error message with a maximum supported version. To export these models, install a transformers version that supports the model, for example `pip install transformers==4.53.3`.
	The supported transformers versions compatible with each optimum-intel release are listed on the [Github releases page](https://github.com/huggingface/optimum-intel/releases/).

	### Inference
	https://huggingface.co/docs/optimum.intel/pr_1714/openvino/inference.md

	# Inference

	Optimum Intel can be used to load optimized models from the [Hub](https://huggingface.co/models?library=openvino&sort=downloads) and create pipelines to run inference with OpenVINO Runtime on a variety of Intel processors (see the [full list of supported devices](https://docs.openvino.ai/2024/about-openvino/compatibility-and-support/supported-devices.html)).

	## Loading

	### Transformers models

	Once [your model has been exported](export), you can load it by replacing the `AutoModelForXxx` class with the corresponding `OVModelForXxx` class.

	```diff
	- from transformers import AutoModelForCausalLM
	+ from optimum.intel import OVModelForCausalLM
	from transformers import AutoTokenizer, pipeline

	model_id = "helenai/gpt2-ov"
	- model = AutoModelForCausalLM.from_pretrained(model_id)
	# here the model was already exported so no need to set export=True
	+ model = OVModelForCausalLM.from_pretrained(model_id)
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
	results = pipe("He's a dreadful magician and")
	```

	As shown in the table below, each task is associated with a class that loads your model automatically.

	\| Auto Class \| Task \|
	\|--------------------------------------\|--------------------------------------\|
	\| `OVModelForSequenceClassification` \| `text-classification` \|
	\| `OVModelForTokenClassification` \| `token-classification` \|
	\| `OVModelForQuestionAnswering` \| `question-answering` \|
	\| `OVModelForAudioClassification` \| `audio-classification` \|
	\| `OVModelForImageClassification` \| `image-classification` \|
	\| `OVModelForFeatureExtraction` \| `feature-extraction` \|
	\| `OVModelForMaskedLM` \| `fill-mask` \|
	\| `OVModelForCausalLM` \| `text-generation-with-past` \|
	\| `OVModelForSeq2SeqLM` \| `text2text-generation-with-past` \|
	\| `OVModelForSpeechSeq2Seq` \| `automatic-speech-recognition` \|
	\| `OVModelForVision2Seq` \| `image-to-text` \|
	\| `OVModelForTextToSpeechSeq2Seq` \| `text-to-audio` \|

	### Diffusers models

	Make sure you have 🤗 Diffusers installed. To install `diffusers`:

	```bash
	pip install diffusers
	```

	```diff
	- from diffusers import StableDiffusionPipeline
	+ from optimum.intel import OVStableDiffusionPipeline

	model_id = "echarlaix/stable-diffusion-v1-5-openvino"
	- pipeline = StableDiffusionPipeline.from_pretrained(model_id)
	+ pipeline = OVStableDiffusionPipeline.from_pretrained(model_id)
	prompt = "sailing ship in storm by Rembrandt"
	images = pipeline(prompt).images
	```

	As shown in the table below, each task is associated with a class that loads your model automatically.

	\| Auto Class \| Task \|
	\|--------------------------------------\|--------------------------------------\|
	\| `OVStableDiffusionPipeline` \| `text-to-image` \|
	\| `OVStableDiffusionImg2ImgPipeline` \| `image-to-image` \|
	\| `OVStableDiffusionInpaintPipeline` \| `inpaint` \|
	\| `OVStableDiffusionXLPipeline` \| `text-to-image` \|
	\| `OVStableDiffusionXLImg2ImgPipeline` \| `image-to-image` \|
	\| `OVLatentConsistencyModelPipeline` \| `text-to-image` \|
	\| `OVLTXPipeline` \| `text-to-video` \|
	\| `OVPipelineForText2Video` \| `text-to-video` \|

	See the [reference documentation](reference) for more information about parameters, and examples for different tasks.

	## Compilation

	By default, the model is compiled when an `OVModel` is instantiated. If the model is then reshaped or moved to another device, it must be recompiled, which happens by default before the first inference (thus inflating the latency of the first inference). To avoid an unnecessary compilation, you can disable the initial compilation by setting `compile=False`.

	```python
	from optimum.intel import OVModelForQuestionAnswering

	model_id = "distilbert/distilbert-base-cased-distilled-squad"
	# Load the model and disable the model compilation
	model = OVModelForQuestionAnswering.from_pretrained(model_id, compile=False)
	```

	To run inference on Intel integrated or discrete GPU, use `.to("gpu")`. On GPU, models run in FP16 precision by default. (See [OpenVINO documentation](https://docs.openvino.ai/2024/get-started/configurations/configurations-intel-gpu.html) about installing drivers for GPU inference).

	```python
	model.to("gpu")
	```

	Once the device is set, the model can be compiled explicitly:

	```python
	model.compile()
	```
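The cost of the extra compilation can be illustrated with a toy stand-in that only counts compilations. `ToyCompiledModel` is a made-up class that mimics the behaviour described above (compile on construction, invalidate on `to()` or `reshape()`, recompile lazily before inference). It is not how `OVModel` is implemented, and the `"cpu"` default device is an assumption of the sketch.

```python
class ToyCompiledModel:
    """Counts compilations to show the effect of compile=False."""

    def __init__(self, compile=True):
        self.device = "cpu"  # assumed default for this toy example
        self.shape = None
        self.compile_count = 0
        self._compiled = False
        if compile:
            self.compile()

    def compile(self):
        # Compile only when the current state has not been compiled yet.
        if not self._compiled:
            self.compile_count += 1
            self._compiled = True

    def to(self, device):
        self.device = device
        self._compiled = False  # a device change invalidates the compiled model
        return self

    def reshape(self, batch_size, seq_len):
        self.shape = (batch_size, seq_len)
        self._compiled = False  # a shape change invalidates the compiled model
        return self

    def __call__(self, inputs):
        self.compile()  # lazy recompilation before inference
        return f"ran on {self.device}"


eager = ToyCompiledModel()  # compiled once on construction
eager.to("gpu")
eager("some input")         # compiled a second time before inference

lazy = ToyCompiledModel(compile=False)
lazy.to("gpu")
lazy.compile()              # the only compilation
lazy("some input")

print(eager.compile_count, lazy.compile_count)
```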

	## Static shape

	By default, dynamic shapes are supported, enabling inference for inputs of every shape. To speed up inference, static shapes can be enabled by giving the desired input shapes with [.reshape()](reference#optimum.intel.OVBaseModel.reshape).

	```python
	# Fix the batch size to 1 and the sequence length to 40
	batch_size, seq_len = 1, 40
	model.reshape(batch_size, seq_len)
	```

	When fixing the shapes with the `reshape()` method, inference cannot be performed with an input of a different shape.

	```python
	from transformers import AutoTokenizer
	from optimum.intel import OVModelForQuestionAnswering

	model_id = "distilbert/distilbert-base-cased-distilled-squad"
	model = OVModelForQuestionAnswering.from_pretrained(model_id, compile=False)
	tokenizer = AutoTokenizer.from_pretrained(model_id)
	batch_size, seq_len = 1, 40
	model.reshape(batch_size, seq_len)
	# Compile the model before the first inference
	model.compile()

	question = "Which name is also used to describe the Amazon rainforest ?"
	context = "The Amazon rainforest, also known as Amazonia or the Amazon Jungle"
	tokens = tokenizer(question, context, max_length=seq_len, padding="max_length", return_tensors="np")

	outputs = model(**tokens)
	```
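In the example above, `padding="max_length"` is what makes the tokenized input match the fixed shape. The padding step itself can be sketched in plain Python. `pad_to_length` is a hypothetical helper, and the token IDs and `pad_token_id=0` are made-up values for illustration.

```python
def pad_to_length(token_ids, seq_len, pad_token_id=0):
    """Pad a list of token IDs to exactly `seq_len` and build the attention mask."""
    if len(token_ids) > seq_len:
        raise ValueError(
            f"input has {len(token_ids)} tokens but the model was reshaped to {seq_len}"
        )
    n_pad = seq_len - len(token_ids)
    input_ids = list(token_ids) + [pad_token_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask


input_ids, attention_mask = pad_to_length([101, 2054, 2003, 102], seq_len=8)
print(input_ids)
print(attention_mask)
```

An input longer than the static sequence length cannot be padded and has to be truncated or rejected, which is why the sketch raises an error in that case.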

	For models that handle images, you can also specify the `height` and `width` when reshaping your model:

	```python
	batch_size, num_images, height, width = 1, 1, 512, 512
	pipeline.reshape(batch_size=batch_size, height=height, width=width, num_images_per_prompt=num_images)
	images = pipeline(prompt, height=height, width=width, num_images_per_prompt=num_images).images
	```

	## Configuration

	The `ov_config` parameter allows you to provide custom OpenVINO configuration values. This can be used, for example, to enable full precision inference on devices where FP16 or BF16 inference precision is used by default.

	```python
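	from optimum.intel import OVModelForSequenceClassification

	# `model_id` is the Hub ID or local path of your sequence classification model
	ov_config = {"INFERENCE_PRECISION_HINT": "f32"}
	model = OVModelForSequenceClassification.from_pretrained(model_id, ov_config=ov_config)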
	```

	Optimum Intel leverages OpenVINO's model caching to speed up model compilation on GPU. By default, a `model_cache` directory is created in the model's directory in the [Hugging Face Hub cache](https://huggingface.co/docs/huggingface_hub/main/en/guides/manage-cache). To override this, use the `ov_config` parameter and set `CACHE_DIR` to a different value. To disable model caching on GPU, set `CACHE_DIR` to an empty string.

	```python
	ov_config = {"CACHE_DIR": ""}
	model = OVModelForSequenceClassification.from_pretrained(model_id, device="gpu", ov_config=ov_config)
	```
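Both settings above can be combined in one dictionary. The sketch below uses a hypothetical `make_ov_config` helper to show one detail that is easy to get wrong: an empty string for `CACHE_DIR` is a meaningful value (it disables caching) and must not be dropped as if it were unset.

```python
def make_ov_config(inference_precision=None, cache_dir=None):
    """Build an ov_config dict from optional settings (hypothetical helper)."""
    config = {}
    if inference_precision is not None:
        config["INFERENCE_PRECISION_HINT"] = inference_precision
    # Compare against None, not truthiness: "" means "disable model caching".
    if cache_dir is not None:
        config["CACHE_DIR"] = cache_dir
    return config


print(make_ov_config(inference_precision="f32"))
print(make_ov_config(inference_precision="f32", cache_dir=""))
```

The resulting dictionary is passed unchanged as `ov_config=...` to `from_pretrained()`.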

	## Weight quantization

	You can also apply fp16, 8-bit or 4-bit weight compression on the Linear, Convolutional and Embedding layers when loading your model to reduce the memory footprint and inference latency.

	For more information on the quantization parameters, check out the [documentation](optimization#weight-only-quantization).

	If not specified, `load_in_8bit` will be set to `True` by default when models larger than 1 billion parameters are exported to the OpenVINO format (with `export=True`). You can disable it with `load_in_8bit=False`.

	It's also possible to apply quantization on both weights and activations using the [`OVQuantizer`](optimization#static-quantization).

	### Generate images with Diffusion models
	https://huggingface.co/docs/optimum.intel/pr_1714/openvino/tutorials/diffusers.md

	# Generate images with Diffusion models

	## Stable Diffusion

	Stable Diffusion models can also be used when running inference with OpenVINO. When Stable Diffusion models
	are exported to the OpenVINO format, they are decomposed into different components that are later combined during inference:
	- The text encoder
	- The U-Net
	- The VAE encoder
	- The VAE decoder

	\| Task \| Auto Class \|
	\|--------------------------------------\|--------------------------------------\|
	\| `text-to-image` \| `OVStableDiffusionPipeline` \|
	\| `image-to-image` \| `OVStableDiffusionImg2ImgPipeline` \|
	\| `inpaint` \| `OVStableDiffusionInpaintPipeline` \|

	### Text-to-Image
	Here is an example of how you can load an OpenVINO Stable Diffusion model and run inference using OpenVINO Runtime:

	```python
	from optimum.intel import OVStableDiffusionPipeline

	model_id = "echarlaix/stable-diffusion-v1-5-openvino"
	pipeline = OVStableDiffusionPipeline.from_pretrained(model_id)
	prompt = "sailing ship in storm by Rembrandt"
	images = pipeline(prompt).images
	```

	To load your PyTorch model and convert it to OpenVINO on the fly, you can set `export=True`.

	```python
	model_id = "runwayml/stable-diffusion-v1-5"
	pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
	# Don't forget to save the exported model
	pipeline.save_pretrained("openvino-sd-v1-5")
	```

	To further speed up inference, the model can be statically reshaped:

	```python
	# Define the shapes related to the inputs and desired outputs
	batch_size, num_images, height, width = 1, 1, 512, 512
	# Statically reshape the model
	pipeline.reshape(batch_size=batch_size, height=height, width=width, num_images_per_prompt=num_images)
	# Compile the model before the first inference
	pipeline.compile()

	# Run inference
	images = pipeline(prompt, height=height, width=width, num_images_per_prompt=num_images).images
	```

	If you want to change any parameter, such as the output height or width, you'll need to statically reshape your model again.
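A new reshape is needed because the fixed input shape of the denoising model is derived from these parameters. As a rough illustration, the latent tensor shape can be computed as below. The sketch assumes the usual Stable Diffusion v1.5 layout (a VAE that downsamples by a factor of 8 and 4 latent channels) and ignores any batch doubling for classifier-free guidance. `latent_shape` is a hypothetical helper, not a library function.

```python
def latent_shape(batch_size, num_images_per_prompt, height, width,
                 vae_scale_factor=8, latent_channels=4):
    """Return the (batch, channels, height, width) shape of the latents."""
    if height % vae_scale_factor or width % vae_scale_factor:
        raise ValueError(f"height and width must be multiples of {vae_scale_factor}")
    return (
        batch_size * num_images_per_prompt,
        latent_channels,
        height // vae_scale_factor,
        width // vae_scale_factor,
    )


print(latent_shape(1, 1, 512, 512))
print(latent_shape(1, 2, 768, 512))
```

Changing the height from 512 to 768 changes the third dimension from 64 to 96, so a model statically reshaped for 512x512 images cannot process the larger latents.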



	### Text-to-Image with Textual Inversion
	Here is an example of how you can load an OpenVINO Stable Diffusion model with pre-trained textual inversion embeddings and run inference using OpenVINO Runtime:

	First, run the original pipeline without textual inversion:
	```python
	from optimum.intel import OVStableDiffusionPipeline
	import numpy as np

	model_id = "echarlaix/stable-diffusion-v1-5-openvino"
	prompt = "A <cat-toy> back-pack"
	# Set a random seed for better comparison
	np.random.seed(42)

	pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=False, compile=False)
	pipeline.compile()
	image1 = pipeline(prompt, num_inference_steps=50).images[0]
	image1.save("stable_diffusion_v1_5_without_textual_inversion.png")
	```

	Then, load the [sd-concepts-library/cat-toy](https://huggingface.co/sd-concepts-library/cat-toy) textual inversion embedding and run the pipeline with the same prompt again:
	```python
	# Reset stable diffusion pipeline
	pipeline.clear_requests()

	# Load textual inversion into stable diffusion pipeline
	pipeline.load_textual_inversion("sd-concepts-library/cat-toy", "<cat-toy>")

	# Compile the model before the first inference
	pipeline.compile()
	image2 = pipeline(prompt, num_inference_steps=50).images[0]
	image2.save("stable_diffusion_v1_5_with_textual_inversion.png")
	```
	The left image shows the generation result of original stable diffusion v1.5, the right image shows the generation result of stable diffusion v1.5 with textual inversion.

	\| \| \|
	\|---\|---\|
	\| ![](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/textual_inversion/stable_diffusion_v1_5_without_textual_inversion.png) \| ![](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/textual_inversion/stable_diffusion_v1_5_with_textual_inversion.png) \|

	### Image-to-Image

	```python
	import requests
	from PIL import Image
	from io import BytesIO
	from optimum.intel import OVStableDiffusionImg2ImgPipeline

	model_id = "runwayml/stable-diffusion-v1-5"
	pipeline = OVStableDiffusionImg2ImgPipeline.from_pretrained(model_id, export=True)

	url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
	response = requests.get(url)
	init_image = Image.open(BytesIO(response.content)).convert("RGB")
	init_image = init_image.resize((768, 512))
	prompt = "A fantasy landscape, trending on artstation"
	image = pipeline(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images[0]
	image.save("fantasy_landscape.png")
	```

	## Stable Diffusion XL

	\| Task \| Auto Class \|
	\|--------------------------------------\|--------------------------------------\|
	\| `text-to-image` \| `OVStableDiffusionXLPipeline` \|
	\| `image-to-image` \| `OVStableDiffusionXLImg2ImgPipeline` \|

	### Text-to-Image

	Here is an example of how you can load an SDXL OpenVINO model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) and run inference using OpenVINO Runtime:

	```python
	from optimum.intel import OVStableDiffusionXLPipeline

	model_id = "stabilityai/stable-diffusion-xl-base-1.0"
	base = OVStableDiffusionXLPipeline.from_pretrained(model_id)
	prompt = "train station by Caspar David Friedrich"
	image = base(prompt).images[0]
	image.save("train_station.png")
	```

	\| \| \|
	\|---\|---\|
	\| ![](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/sd_xl/train_station_friedrich.png) \| ![](https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/sd_xl/train_station_friedrich_2.png) \|

	### Text-to-Image with Textual Inversion

	Here is an example of how you can load an SDXL OpenVINO model from [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0) with pre-trained textual inversion embeddings and run inference using OpenVINO Runtime:

	First, run the original pipeline without textual inversion:
	```python
	from optimum.intel import OVStableDiffusionXLPipeline
	import numpy as np

	model_id = "stabilityai/stable-diffusion-xl-base-1.0"
	prompt = "charturnerv2, multiple views of the same character in the same outfit, a character turnaround wearing a red jacket and black shirt, best quality, intricate details."
	# Set a random seed for better comparison
	np.random.seed(112)

	base = OVStableDiffusionXLPipeline.from_pretrained(model_id, export=False, compile=False)
	base.compile()
	image1 = base(prompt, num_inference_steps=50).images[0]
	image1.save("sdxl_without_textual_inversion.png")
	```

	Then, load the [charturnerv2](https://civitai.com/models/3036/charturner-character-turnaround-helper-for-15-and-21) textual inversion embedding and run the pipeline with the same prompt again:
	```python
	# Reset stable diffusion pipeline
	base.clear_requests()

	# Load textual inversion into stable diffusion pipeline
	base.load_textual_inversion("./charturnerv2.pt", "charturnerv2")

	# Compile the model before the first inference
	base.compile()
	image2 = base(prompt, num_inference_steps=50).images[0]
	image2.save("sdxl_with_textual_inversion.png")
	```

	### Image-to-Image

	Here is an example of how you can load a PyTorch SDXL model, convert it to OpenVINO on-the-fly and run inference using OpenVINO Runtime for image-to-image:

	```python
	from optimum.intel import OVStableDiffusionXLImg2ImgPipeline
	from diffusers.utils import load_image

	model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
	pipeline = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, export=True)

	url = "https://huggingface.co/datasets/optimum/documentation-images/resolve/main/intel/openvino/sd_xl/castle_friedrich.png"
	image = load_image(url).convert("RGB")
	prompt = "medieval castle by Caspar David Friedrich"
	image = pipeline(prompt, image=image).images[0]
	# Don't forget to save your OpenVINO model so that you can load it without exporting it with `export=True`
	pipeline.save_pretrained("openvino-sd-xl-refiner-1.0")
	```

	### Refining the image output

	The image can be refined by making use of a model like [stabilityai/stable-diffusion-xl-refiner-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0). In this case, you only have to output the latents from the base model.

	```python
	from optimum.intel import OVStableDiffusionXLImg2ImgPipeline

	model_id = "stabilityai/stable-diffusion-xl-refiner-1.0"
	refiner = OVStableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, export=True)

	image = base(prompt=prompt, output_type="latent").images[0]
	image = refiner(prompt=prompt, image=image[None, :]).images[0]
	```

	## Latent Consistency Models

	\| Task \| Auto Class \|
	\|--------------------------------------\|--------------------------------------\|
	\| `text-to-image` \| `OVLatentConsistencyModelPipeline` \|

	### Text-to-Image

	Here is an example of how you can load a Latent Consistency Model (LCM) from [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) and run inference using OpenVINO:

	```python
	from optimum.intel import OVLatentConsistencyModelPipeline

	model_id = "SimianLuo/LCM_Dreamshaper_v7"
	pipeline = OVLatentConsistencyModelPipeline.from_pretrained(model_id, export=True)
	prompt = "sailing ship in storm by Leonardo da Vinci"
	images = pipeline(prompt, num_inference_steps=4, guidance_scale=8.0).images
	```

	### Notebooks
	https://huggingface.co/docs/optimum.intel/pr_1714/openvino/tutorials/notebooks.md

	# Notebooks

	## Inference

	\| Notebook \| Description \| \| \|
	\|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\|:--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:\|------:\|
	\| [How to run inference with the OpenVINO](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb) \| Explains how to export your model to OpenVINO and to run inference with OpenVINO Runtime on various tasks \| [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb) \| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/optimum_openvino_inference.ipynb) \|

	## Quantization

	\| Notebook \| Description \| \| \|
	\|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\|:--------------------------------------------------------------------------------------------------------------------------------------\|:-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:\|------:\|
	\| [How to quantize a question answering model with OpenVINO NNCF](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb) \| Show how to apply post-training quantization on a question answering model using [NNCF](https://github.com/openvinotoolkit/nncf) \| [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb) \| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/question_answering_quantization.ipynb) \|
	\| [How to quantize Stable Diffusion model with OpenVINO NNCF](https://github.com/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb) \| Show how to apply post-training hybrid quantization on a Stable Diffusion model using [NNCF](https://github.com/openvinotoolkit/nncf) \| [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb)\| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_hybrid_quantization.ipynb)\| \| [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_optimization.ipynb) \| [![Open in AWS Studio](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/huggingface/optimum-intel/blob/main/notebooks/openvino/stable_diffusion_optimization.ipynb) \|
