LASR
Overview
TODO
Usage
Basic usage
from transformers import pipeline
pipe = pipeline("automatic-speech-recognition", model="path/to/lasr-model")
out = pipe("path/to/audio.mp3")
print(out)
Making The Model Go Brrr
TODO
Training
TODO
LasrTokenizer
class transformers.LasrTokenizer
< source >( eos_token = '</s>' unk_token = '<unk>' pad_token = '<pad>' extra_ids = 100 additional_special_tokens = None vocab = None vocab_file = None **kwargs )
Parameters
- vocab_file (`str`, *optional*) — SentencePiece file (generally has a *.spm* extension) that contains the vocabulary necessary to instantiate a tokenizer.
- eos_token (`str`, *optional*, defaults to `"</s>"`) — The end of sequence token. When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the `sep_token`.
- unk_token (`str`, *optional*, defaults to `"<unk>"`) — The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
- pad_token (`str`, *optional*, defaults to `"<pad>"`) — The token used for padding, for example when batching sequences of different lengths.
- extra_ids (`int`, *optional*, defaults to 100) — Number of extra ids added to the vocabulary for use as sentinels. These tokens are accessible as `<extra_id_{%d}>` where `{%d}` is a number between 0 and extra_ids-1. The sentinel tokens can be retrieved by calling the `get_sentinel_tokens` method and their token ids by calling the `get_sentinel_token_ids` method.
- additional_special_tokens (`list[str]`, *optional*) — Additional special tokens used by the tokenizer.
- vocab (`dict`, *optional*) — Custom vocabulary dict. If not provided, a minimal vocabulary is created using the special tokens.
Construct a LASR tokenizer (backed by HuggingFace’s tokenizers library). Based on Unigram.
This tokenizer inherits from TokenizersBackend which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
get_sentinel_token_ids
Get the token IDs for sentinel tokens.
get_sentinel_tokens
Get the list of sentinel tokens (extra_id tokens) from additional_special_tokens.
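As a minimal sketch of the sentinel helpers, the snippet below builds a tokenizer from its special tokens only (no SentencePiece file, relying on the documented fallback when `vocab` is not provided) and reads back the sentinel tokens added through `extra_ids`; a real checkpoint would normally be loaded with `from_pretrained`.
>>> from transformers import LasrTokenizer

>>> # Minimal vocabulary created from the special tokens only (no .spm file).
>>> tokenizer = LasrTokenizer(extra_ids=100)

>>> # Sentinel tokens are the `extra_ids` tokens stored in `additional_special_tokens`.
>>> sentinels = tokenizer.get_sentinel_tokens()
>>> sentinel_ids = tokenizer.get_sentinel_token_ids()
>>> print(len(sentinels), len(sentinel_ids))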
LasrFeatureExtractor
class transformers.LasrFeatureExtractor
< source >( feature_size = 128 sampling_rate = 16000 hop_length = 160 n_fft = 512 win_length = 400 padding_value = 0.0 **kwargs )
Parameters
- feature_size (`int`, *optional*, defaults to 128) — The feature dimension of the extracted features.
- sampling_rate (`int`, *optional*, defaults to 16000) — The sampling rate at which the audio files should be digitalized, expressed in hertz (Hz).
- hop_length (`int`, *optional*, defaults to 160) — Length of the overlapping windows for the STFT used to obtain the Mel Frequency coefficients.
- n_fft (`int`, *optional*, defaults to 512) — Size of the Fourier transform.
- win_length (`int`, *optional*, defaults to 400) — The window length for the STFT computation.
- padding_value (`float`, *optional*, defaults to 0.0) — Padding value used to pad the audio. Should correspond to silences.
Constructs a LASR feature extractor.
This feature extractor inherits from SequenceFeatureExtractor which contains most of the main methods. Users should refer to this superclass for more information regarding those methods.
This class extracts mel-filter bank features from raw speech using a custom NumPy implementation of the Short Time Fourier Transform, which should match PyTorch's torch.stft.
__call__
< source >( raw_speech: typing.Union[numpy.ndarray, list[float], list[numpy.ndarray], list[list[float]]] truncation: bool = False pad_to_multiple_of: typing.Optional[int] = None return_tensors: typing.Union[str, transformers.utils.generic.TensorType, NoneType] = None return_attention_mask: typing.Optional[bool] = None padding: typing.Optional[str] = 'longest' max_length: typing.Optional[int] = None sampling_rate: typing.Optional[int] = None do_normalize: typing.Optional[bool] = None device: typing.Optional[str] = 'cpu' return_token_timestamps: typing.Optional[bool] = None **kwargs )
Parameters
- raw_speech (`np.ndarray`, `list[float]`, `list[np.ndarray]`, `list[list[float]]`) — The sequence or batch of sequences to be padded. Each sequence can be a numpy array, a list of float values, a list of numpy arrays or a list of lists of float values. Must be mono channel audio, not stereo, i.e. a single float per timestep.
- truncation (`bool`, *optional*, defaults to `False`) — Activates truncation to cut input sequences longer than `max_length` to `max_length`.
- pad_to_multiple_of (`int`, *optional*) — If set, will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.5 (Volta), or on TPUs which benefit from having sequence lengths be a multiple of 128.
- return_attention_mask (`bool`, *optional*) — Whether to return the attention mask. If left to the default, will return the attention mask according to the specific feature extractor's default. For LASR models, `attention_mask` should always be passed for batched inference, to avoid subtle bugs.
- return_tensors (`str` or `TensorType`, *optional*) — If set, will return tensors instead of lists of python integers. Acceptable values are:
  - `'tf'`: Return TensorFlow `tf.constant` objects.
  - `'pt'`: Return PyTorch `torch.Tensor` objects.
  - `'np'`: Return NumPy `np.ndarray` objects.
- sampling_rate (`int`, *optional*) — The sampling rate at which the `raw_speech` input was sampled. It is strongly recommended to pass `sampling_rate` at the forward call to prevent silent errors and to allow the automatic speech recognition pipeline to work correctly.
- padding_value (`float`, *optional*, defaults to 0.0) — The value that is used to fill the padding values / vectors.
- do_normalize (`bool`, *optional*, defaults to `False`) — Whether or not to zero-mean unit-variance normalize the input. Normalizing can help to significantly improve the performance of the model.
- device (`str`, *optional*, defaults to `'cpu'`) — Specifies the device for the computation of the log-mel spectrogram of audio signals in the `_torch_extract_fbank_features` method (e.g., "cpu", "cuda").
- return_token_timestamps (`bool`, *optional*) — Deprecated. Use `return_attention_mask` instead, from which the number of frames can be inferred. Whether or not to return the number of frames of the input `raw_speech`. These `num_frames` can be used by the model to compute word-level timestamps.
Main method to featurize and prepare for the model one or several sequence(s). Implementation uses PyTorch for the STFT computation if available, otherwise a slower NumPy based one.
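The snippet below is a minimal sketch of batched feature extraction: two mono waveforms of different lengths are padded together and the attention mask marks the valid frames. The random waveforms are placeholders, and the `input_features`/`attention_mask` output names are taken from the encoder forward signature documented further down.
>>> import numpy as np
>>> from transformers import LasrFeatureExtractor

>>> feature_extractor = LasrFeatureExtractor()  # documented defaults: 128 mel bins, 16 kHz

>>> # Two mono waveforms of different lengths (random here, real audio in practice).
>>> speech = [np.random.randn(16000).astype(np.float32), np.random.randn(24000).astype(np.float32)]

>>> inputs = feature_extractor(
...     speech, sampling_rate=16000, return_tensors="pt", return_attention_mask=True
... )
>>> print(inputs["input_features"].shape)  # (batch_size, sequence_length, 128)
>>> print(inputs["attention_mask"].shape)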
LasrProcessor
__call__
< source >( audio: typing.Union[numpy.ndarray, ForwardRef('torch.Tensor'), collections.abc.Sequence[numpy.ndarray], collections.abc.Sequence['torch.Tensor']] text: typing.Union[str, list[str], list[list[str]], NoneType] = None sampling_rate: typing.Optional[int] = None **kwargs: typing_extensions.Unpack[transformers.models.lasr.processing_lasr.LasrProcessorKwargs] )
batch_decode
This method forwards all its arguments to PreTrainedTokenizer’s batch_decode(). Please refer to the docstring of this method for more information.
decode
This method forwards all its arguments to PreTrainedTokenizer’s decode(). Please refer to the docstring of this method for more information.
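A minimal sketch of the processor round trip, reusing the dummy LibriSpeech dataset from the examples below. The checkpoint path is a placeholder, and the assumption that transcriptions passed via `text` are returned under `labels` follows from the LasrForCTC example further down.
>>> from datasets import Audio, load_dataset
>>> from transformers import AutoProcessor

>>> processor = AutoProcessor.from_pretrained("path/to/lasr-model")  # placeholder checkpoint path

>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))

>>> # Audio is featurized and, when `text` is given, the transcription is tokenized as well.
>>> inputs = processor(ds[0]["audio"]["array"], text=ds[0]["text"])

>>> # `batch_decode`/`decode` forward to the tokenizer, e.g. to turn predicted ids back into text.
>>> print(processor.batch_decode(inputs["labels"], skip_special_tokens=True))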
LasrEncoderConfig
class transformers.LasrEncoderConfig
< source >( hidden_size = 512 num_hidden_layers = 17 num_attention_heads = 8 intermediate_size = 2048 hidden_act = 'silu' attention_bias = False convolution_bias = False conv_kernel_size = 32 subsampling_conv_channels = 256 subsampling_conv_kernel_size = 5 subsampling_conv_stride = 2 num_mel_bins = 128 dropout = 0.1 dropout_positions = 0.0 layerdrop = 0.1 activation_dropout = 0.1 attention_dropout = 0.1 max_position_embeddings = 10000 initializer_range = 0.02 layer_norm_eps = 1e-06 feed_forward_residual_weights = [1.5, 0.5] conv_residual_weights = [2.0, 1.0] batch_norm_momentum = 0.01 rope_parameters = None **kwargs )
Parameters
- hidden_size (`int`, *optional*, defaults to 512) — Dimension of the layers and the hidden states.
- num_hidden_layers (`int`, *optional*, defaults to 17) — Number of hidden layers in the Transformer encoder.
- num_attention_heads (`int`, *optional*, defaults to 8) — Number of attention heads for each attention layer in the Transformer encoder.
- intermediate_size (`int`, *optional*, defaults to 2048) — Dimension of the "intermediate" (often named feed-forward) layer in the Transformer encoder.
- hidden_act (`str` or `function`, *optional*, defaults to `"silu"`) — The non-linear activation function (function or string) in the encoder and pooler.
- attention_bias (`bool`, *optional*, defaults to `False`) — Whether to use bias in the attention layers.
- convolution_bias (`bool`, *optional*, defaults to `False`) — Whether to use bias in the convolutions of the conformer's convolution module.
- conv_kernel_size (`int`, *optional*, defaults to 32) — The kernel size of the convolution layers in the Conformer block.
- subsampling_conv_channels (`int`, *optional*, defaults to 256) — The number of channels in the subsampling convolution layers.
- subsampling_conv_kernel_size (`int`, *optional*, defaults to 5) — The kernel size of the subsampling convolution layers.
- subsampling_conv_stride (`int`, *optional*, defaults to 2) — The stride of the subsampling convolution layers.
- num_mel_bins (`int`, *optional*, defaults to 128) — Number of mel features.
- dropout (`float`, *optional*, defaults to 0.1) — The dropout ratio for all fully connected layers in the embeddings, encoder, and pooler.
- dropout_positions (`float`, *optional*, defaults to 0.0) — The dropout ratio for the positions in the input sequence.
- layerdrop (`float`, *optional*, defaults to 0.1) — The dropout ratio for the layers in the encoder.
- activation_dropout (`float`, *optional*, defaults to 0.1) — The dropout ratio for activations inside the fully connected layer.
- attention_dropout (`float`, *optional*, defaults to 0.1) — The dropout ratio for the attention layers.
- max_position_embeddings (`int`, *optional*, defaults to 10000) — The maximum sequence length that this model might ever be used with.
- initializer_range (`float`, *optional*, defaults to 0.02) — The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
- layer_norm_eps (`float`, *optional*, defaults to 1e-06) — The epsilon used by the layer normalization layers.
- feed_forward_residual_weights (`tuple[float, float]`, *optional*, defaults to `[1.5, 0.5]`) — The residual weights for the feed forward layers.
- conv_residual_weights (`tuple[float, float]`, *optional*, defaults to `[2.0, 1.0]`) — The residual weights for the convolution layers.
- batch_norm_momentum (`float`, *optional*, defaults to 0.01) — The momentum for the batch normalization layers.
- rope_parameters (`RopeParameters`, *optional*) — Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain a value for `rope_theta` and, optionally, parameters used for scaling in case you want to use RoPE with a longer `max_position_embeddings`.
This is the configuration class to store the configuration of a LasrEncoder. It is used to instantiate a
LasrEncoder model according to the specified arguments, defining the model architecture.
Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
>>> from transformers import LasrEncoderModel, LasrEncoderConfig
>>> # Initializing a `LasrEncoder` configuration
>>> configuration = LasrEncoderConfig()
>>> # Initializing a model from the configuration
>>> model = LasrEncoderModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
This configuration class is based on the LasrEncoder architecture from Google Health AI. You can find more details and pre-trained models at TODO/TODO.
LasrCTCConfig
class transformers.LasrCTCConfig
< source >( vocab_size = 512 ctc_loss_reduction = 'mean' ctc_zero_infinity = True encoder_config: typing.Union[dict, transformers.models.lasr.configuration_lasr.LasrEncoderConfig] = None pad_token_id = 0 **kwargs )
Parameters
- vocab_size (`int`, *optional*, defaults to 512) — Vocabulary size of the model.
- ctc_loss_reduction (`str`, *optional*, defaults to `"mean"`) — Specifies the reduction to apply to the output of `torch.nn.CTCLoss`. Only relevant when training an instance of LasrForCTC.
- ctc_zero_infinity (`bool`, *optional*, defaults to `True`) — Whether to zero infinite losses and the associated gradients of `torch.nn.CTCLoss`. Infinite losses mainly occur when the inputs are too short to be aligned to the targets. Only relevant when training an instance of LasrForCTC.
- encoder_config (`Union[dict, LasrEncoderConfig]`, *optional*) — The config object or dictionary of the encoder.
- pad_token_id (`int`, *optional*, defaults to 0) — Padding token id. Also used as the blank token id.
This is the configuration class to store the configuration of a LasrForCTC. It is used to instantiate a Lasr CTC model according to the specified arguments, defining the model architecture. Configuration objects inherit from PreTrainedConfig and can be used to control the model outputs. Read the documentation from PreTrainedConfig for more information.
Example:
>>> from transformers import LasrForCTC, LasrCTCConfig
>>> # Initializing a Lasr configuration
>>> configuration = LasrCTCConfig()
>>> # Initializing a model from the configuration
>>> model = LasrForCTC(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
from_encoder_config
< source >( encoder_config: LasrEncoderConfig **kwargs ) → LasrCTCConfig
Instantiate a LasrCTCConfig (or a derived class) from a LASR encoder model configuration.
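A minimal sketch of this helper: build an encoder configuration first, then derive the CTC configuration from it (the extra `vocab_size` keyword simply restates the documented default, and the `encoder_config` attribute access assumes the constructor argument is stored under that name).
>>> from transformers import LasrCTCConfig, LasrEncoderConfig

>>> # Configure the Conformer encoder, then wrap it in a CTC configuration.
>>> encoder_config = LasrEncoderConfig(hidden_size=512, num_hidden_layers=17)
>>> ctc_config = LasrCTCConfig.from_encoder_config(encoder_config, vocab_size=512)

>>> # Assumes the encoder config is kept on the `encoder_config` attribute.
>>> print(ctc_config.encoder_config.hidden_size)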
LasrEncoder
class transformers.LasrEncoder
< source >( config: LasrEncoderConfig )
Parameters
- config (LasrEncoderConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
The LasrEncoder model, based on the [Conformer architecture](https://arxiv.org/abs/2005.08100).
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_features: Tensor attention_mask: typing.Optional[torch.Tensor] = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
Parameters
- input_features (`torch.Tensor` of shape `(batch_size, sequence_length, feature_dim)`) — The tensors corresponding to the input audio features. Audio features can be obtained using LasrFeatureExtractor. See LasrFeatureExtractor.__call__ for details (LasrProcessor uses LasrFeatureExtractor for processing audios).
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
Returns
transformers.modeling_outputs.BaseModelOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.BaseModelOutput or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (LasrEncoderConfig) and inputs.
- last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`) — Sequence of hidden-states at the output of the last layer of the model.
- hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LasrEncoder forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, LasrEncoder
>>> from datasets import load_dataset, Audio
>>> model_id = TODO
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> encoder = LasrEncoder.from_pretrained(model_id)
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
>>> inputs = processor(ds[0]["audio"]["array"])
>>> encoder_outputs = encoder(**inputs)
>>> print(encoder_outputs.last_hidden_state.shape)
LasrForCTC
class transformers.LasrForCTC
< source >( config: LasrCTCConfig )
Parameters
- config (LasrCTCConfig) — Model configuration class with all the parameters of the model. Initializing with a config file does not load the weights associated with the model, only the configuration. Check out the from_pretrained() method to load the model weights.
Lasr Encoder with a Connectionist Temporal Classification (CTC) head.
This model inherits from PreTrainedModel. Check the superclass documentation for the generic methods the library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads, etc.).
This model is also a PyTorch torch.nn.Module subclass. Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage and behavior.
forward
< source >( input_features: Tensor attention_mask: typing.Optional[torch.Tensor] = None labels: typing.Optional[torch.Tensor] = None **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] ) → transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)
Parameters
- input_features (`torch.Tensor` of shape `(batch_size, sequence_length, feature_dim)`) — The tensors corresponding to the input audio features. Audio features can be obtained using LasrFeatureExtractor. See LasrFeatureExtractor.__call__ for details (LasrProcessor uses LasrFeatureExtractor for processing audios).
- attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
  - 1 for tokens that are not masked,
  - 0 for tokens that are masked.
- labels (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*) — Labels for computing the masked language modeling loss. Indices should either be in `[0, ..., config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored (masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Returns
transformers.modeling_outputs.CausalLMOutput or tuple(torch.FloatTensor)
A transformers.modeling_outputs.CausalLMOutput or a tuple of `torch.FloatTensor` (if `return_dict=False` is passed or when `config.return_dict=False`) comprising various elements depending on the configuration (LasrCTCConfig) and inputs.
- loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided) — Language modeling loss (for next-token prediction).
- logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`) — Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
- hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`) — Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, + one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`. Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
- attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`) — Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, sequence_length)`. Attention weights after the attention softmax, used to compute the weighted average in the self-attention heads.
The LasrForCTC forward method, overrides the __call__ special method.
Although the recipe for forward pass needs to be defined within this function, one should call the `Module` instance afterwards instead of this since the former takes care of running the pre- and post-processing steps while the latter silently ignores them.
Example:
>>> from transformers import AutoProcessor, LasrForCTC
>>> from datasets import load_dataset, Audio
>>> model_id = "nvidia/lasr-ctc-1.1b"
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> model = LasrForCTC.from_pretrained(model_id)
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
>>> inputs = processor(ds[0]["audio"]["array"], text=ds[0]["text"])
>>> outputs = model(**inputs)
>>> print(outputs.loss)
generate
< source >( input_features: Tensor attention_mask: typing.Optional[torch.Tensor] = None return_dict_in_generate: bool = False **kwargs: typing_extensions.Unpack[transformers.utils.generic.TransformersKwargs] )
Example:
>>> from transformers import AutoProcessor, LasrForCTC
>>> from datasets import load_dataset, Audio
>>> model_id = TODO
>>> processor = AutoProcessor.from_pretrained(model_id)
>>> model = LasrForCTC.from_pretrained(model_id)
>>> ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
>>> ds = ds.cast_column("audio", Audio(sampling_rate=processor.feature_extractor.sampling_rate))
>>> inputs = processor(ds[0]["audio"]["array"], text=ds[0]["text"])
>>> predicted_ids = model.generate(**inputs)
>>> transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
>>> print(transcription)