|
|
慷慨的柿子 · video全屏操作栏自定义样式&&js ...· 9 月前 · |
|
|
温暖的长颈鹿 · python sqlalchemy ...· 2 年前 · |
|
|
胡子拉碴的椰子 · python - ImportError: ...· 2 年前 · |
|
|
精明的开心果 · Centos使用chrony做时间同步 - ...· 2 年前 · |
These are the data type references in the
cudnn_adv
library.
These are the pointers to the opaque struct types in the
cudnn_adv
library.
This enumerated type is deprecated and is currently only used by deprecated APIs. Consider using replacements for the deprecated APIs that use this enumerated type.
cudnnAttnDescriptor_t
is a pointer to an opaque structure holding parameters of the multihead attention layer, such as:
weight and bias tensor shapes (vector lengths before and after linear projections)
parameters that can be set in advance and do not change when invoking functions to evaluate forward responses and gradients (number of attention heads, softmax smoothing and sharpening coefficient)
other settings that are necessary to compute temporary buffer sizes.
Use the cudnnCreateAttnDescriptor() function to create an instance of the attention descriptor object and cudnnDestroyAttnDescriptor() to delete the previously created descriptor. Use the cudnnSetAttnDescriptor() function to configure the descriptor.
cudnnRNNDataDescriptor_t
is a pointer to an opaque structure holding the description of an RNN data set. The function
cudnnCreateRNNDataDescriptor()
is used to create one instance, and
cudnnSetRNNDataDescriptor()
must be used to initialize this instance.
cudnnRNNDescriptor_t
is a pointer to an opaque structure holding the description of an RNN operation.
cudnnCreateRNNDescriptor()
is used to create one instance.
This enumerated type is deprecated and is currently only used by deprecated APIs. Consider using replacements for the deprecated APIs that use this enumerated type.
cudnnSeqDataDescriptor_t
is a pointer to an opaque structure holding parameters of the sequence data container or buffer. The sequence data container is used to store fixed size vectors defined by the
VECT
dimension. Vectors are arranged in additional three dimensions:
TIME
,
BATCH
, and
BEAM
.
The
TIME
dimension is used to bundle vectors into sequences of vectors. The actual sequences can be shorter than the
TIME
dimension, therefore, additional information is needed about each sequence length and how unused (padding) vectors should be saved.
It is assumed that the sequence data container is fully packed. The
TIME
,
BATCH
, and
BEAM
dimensions can be in any order when vectors are traversed in the ascending order of addresses. Six data layouts (permutation of
TIME
,
BATCH
, and
BEAM
) are possible.
The
cudnnSeqDataDescriptor_t
object holds the following parameters:
data type used by vectors
TIME
,
BATCH
,
BEAM
, and
VECT
dimensions
data layout
the length of each sequence along the
TIME
dimension
an optional value to be copied to output padding vectors
Use the cudnnCreateSeqDataDescriptor() function to create one instance of the sequence data descriptor object and cudnnDestroySeqDataDescriptor() to delete a previously created descriptor. Use the cudnnSetSeqDataDescriptor() function to configure the descriptor.
This descriptor is used by multihead attention API functions.
These are the enumeration types in the
cudnn_adv
library.
cudnnDirectionMode_t
is an enumerated type used to specify the recurrence pattern.
Values
CUDNN_UNIDIRECTIONAL
The network iterates recurrently from the first input to the last.
CUDNN_BIDIRECTIONAL
Each layer of the network iterates recurrently from the first input to the last and separately from the last input to the first. The outputs of the two are concatenated at each iteration giving the output of the layer.
cudnnForwardMode_t
is an enumerated type to specify inference or training mode in RNN API. This parameter allows the cuDNN library to tune more precisely the size of the workspace buffer that could be different in inference and training regimens.
Values
CUDNN_FWD_MODE_INFERENCE
Selects the inference mode.
CUDNN_FWD_MODE_TRAINING
Selects the training mode.
cudnnLossNormalizationMode_t
is an enumerated type that controls the input normalization mode for a loss function. This type can be used with
cudnnSetCTCLossDescriptorEx()
.
Values
CUDNN_LOSS_NORMALIZATION_NONE
The input probs of the
cudnnCTCLoss()
function is expected to be the normalized probability, and the output
gradients
is the gradient of loss with respect to the unnormalized probability.
CUDNN_LOSS_NORMALIZATION_SOFTMAX
The input probs of the
cudnnCTCLoss()
function is expected to be the unnormalized activation from the previous layer, and the output
gradients
is the gradient with respect to the activation. Internally the probability is computed by softmax normalization.
cudnnMultiHeadAttnWeightKind_t
is an enumerated type that specifies a group of weights or biases in the
cudnnGetMultiHeadAttnWeights()
function.
Values
CUDNN_MH_ATTN_Q_WEIGHTS
Selects the input projection weights for
queries
.
CUDNN_MH_ATTN_K_WEIGHTS
Selects the input projection weights for
keys
.
CUDNN_MH_ATTN_V_WEIGHTS
Selects the input projection weights for
values
.
CUDNN_MH_ATTN_O_WEIGHTS
Selects the output projection weights.
CUDNN_MH_ATTN_Q_BIASES
Selects the input projection biases for
queries
.
CUDNN_MH_ATTN_K_BIASES
Selects the input projection biases for
keys
.
CUDNN_MH_ATTN_V_BIASES
Selects the input projection biases for
values
.
CUDNN_MH_ATTN_O_BIASES
Selects the output projection biases.
cudnnRNNAlgo_t
is an enumerated type used to specify the algorithm.
Values
CUDNN_RNN_ALGO_STANDARD
This algorithm uses cuBLASLt to perform all matrix multiplications and dedicated kernels for cell-specific operations such as applying nonlinearities or adding biases. This is the most versatile RNN algorithm. It supports pseudo-random dropout masks between RNN layers, variable length sequences in unpacked data layouts, recurrent projection in LSTM models, and multiple choices for RNN biases: no bias, one bias, or two biases. The algorithm traverses RNN cells layer-by-layer or in a diagonal pattern through multiple layers with a certain number of time steps grouped into one “comptational chunk”. Whenever possible GEMMs are executed in parallel CUDA streams. This algorithm is expected to deliver robust performance across a wide range of RNN configurations. It is also supported on a broad range of architectures, including the oldest GPUs.
CUDNN_RNN_ALGO_PERSIST_STATIC
Input GEMMs in this algorithm are performed by cuBLASLt. Recurrent GEMMs, typically with fused element-wise cell operations, are handled by persistent kernels that require all thread blocks of a grid to run concurrently on GPU and communicate. All recurrent weights are stored collaboratively in stream multi-processor (SM) registers and optionally in shared memory. RNN cells are traversed layer-by-layer. GPUs with a larger number of SMs can handle longer hidden state vectors using this algorithm. This method is expected to be fast when the first dimension of the input tensor is small (meaning, a small minibatch).
CUDNN_RNN_ALGO_PERSIST_STATIC
is supported on devices with compute capability >= 6.0.
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
The recurrent parts of the network are executed using a
persistent kernel
approach. This method is expected to perform reasonably well for small RNN models.
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
kernels are compiled at runtime and are optimized for specific parameters of the RNN model and active GPU. The limits on the maximum size of a hidden vector when using
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
may be higher than the corresponding limits of
CUDNN_RNN_ALGO_PERSIST_STATIC
. This algorithm does not utilize NVIDIA Tensor Cores.
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
is supported on devices with compute capability >= 6.0.
CUDNN_RNN_ALGO_PERSIST_STATIC_SMALL_H
Despite its name, this algorithm does not rely on persistent GPU kernels (all thread blocks being active at the same time) but in other aspects it operates similarly to
CUDNN_RNN_ALGO_PERSIST_STATIC
. Input GEMMs for all time-steps are performed by cuBLASLt and recurrent GEMMs with fused element-wise operations are handled by “regular” CUDA thread blocks. One thread block collaboratively loads all recurrent weights of one layer (square matrix) and a small number of input data vectors to compute the same number of output elements without any synchronization with other thread blocks. The algorithm is limited by available register resources so the hidden vector size cannot be very large, for example, up to 192 elements for LSTM/GRU cells and up to 384 elements for RELU/TANH cells in the forward pass. This algorithm could be surprisingly fast and it scales well with the number of available SMs for large batch sizes.
cudnnRNNBiasMode_t
is an enumerated type used to specify the number of bias vectors for RNN functions. Refer to the description of the
cudnnRNNMode_t
enumerated type for the equations for each cell type based on the bias mode.
Values
CUDNN_RNN_NO_BIAS
Applies RNN cell formulas that do not use biases.
CUDNN_RNN_SINGLE_INP_BIAS
Applies RNN cell formulas that use one input bias vector in the input GEMM.
CUDNN_RNN_DOUBLE_BIAS
Applies RNN cell formulas that use two bias vectors.
CUDNN_RNN_SINGLE_REC_BIAS
Applies RNN cell formulas that use one recurrent bias vector in the recurrent GEMM.
cudnnRNNClipMode_t
is an enumerated type used to select the LSTM cell clipping mode.
Values
CUDNN_RNN_CLIP_NONE
Disables LSTM cell clipping.
CUDNN_RNN_CLIP_MINMAX
Enables LSTM cell clipping.
cudnnRNNDataLayout_t
is an enumerated type used to select the RNN data layout. It is used in the API calls
cudnnGetRNNDataDescriptor()
and
cudnnSetRNNDataDescriptor()
.
Values
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_UNPACKED
Data layout is padded, with outer stride from one time-step to the next.
CUDNN_RNN_DATA_LAYOUT_SEQ_MAJOR_PACKED
The sequence length is sorted and packed as in the basic RNN API.
CUDNN_RNN_DATA_LAYOUT_BATCH_MAJOR_UNPACKED
Data layout is padded, with outer stride from one batch to the next.
cudnnRNNInputMode_t
is an enumerated type used to specify the behavior of the first layer.
Values
CUDNN_LINEAR_INPUT
A biased matrix multiplication is performed at the input of the first recurrent layer.
CUDNN_SKIP_INPUT
No operation is performed at the input of the first recurrent layer. If
CUDNN_SKIP_INPUT
is used the leading dimension of the input tensor must be equal to the hidden state size of the network.
cudnnRNNMode_t
is an enumerated type used to specify the type of network.
Values
CUDNN_RNN_RELU
A single-gate recurrent neural network with a ReLU activation function.
In the forward pass, the output h
t
for a given iteration can be computed from the recurrent input h
t-1
and the previous layer input x
t
, given the matrices
W
,
R
, the bias vectors, and where
ReLU(x)
=
max(x,
0)
.
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_DOUBLE_BIAS
(default mode), then the following equation with biases b
W
and b
R
applies:
h t = ReLU(W i x t + R i h t-1 + b Wi + b Ri )
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_SINGLE_INP_BIAS
or
CUDNN_RNN_SINGLE_REC_BIAS
, then the following equation with bias
b
applies:
h t = ReLU(W i x t + R i h t-1 + b i )
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_NO_BIAS
, then the following equation applies:
h t = ReLU(W i x t + R i h t-1 )
CUDNN_RNN_TANH
A single-gate recurrent neural network with a
tanh
activation function.
In the forward pass, the output h
t
for a given iteration can be computed from the recurrent input h
t-1
and the previous layer input x
t
, given the matrices
W
,
R
the bias vectors, and where
tanh
is the hyperbolic tangent function.
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_DOUBLE_BIAS
(default mode), then the following equation with biases b
W
and b
R
applies:
h t = tanh(W i x t + R i h t-1 + b Wi + b Ri )
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_SINGLE_INP_BIAS
or
CUDNN_RNN_SINGLE_REC_BIAS
, then the following equation with bias
b
applies:
h t = tanh(W i x t + R i h t-1 + b i )
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_NO_BIAS
, then the following equation applies:
h t = tanh(W i x t + R i h t-1 )
CUDNN_LSTM
A four-gate LSTM (Long Short-Term Memory) network with no peephole connections.
In the forward pass, the output h
t
and cell output c
t
for a given iteration can be computed from the recurrent input h
t-1
, the cell input c
t-1
and the previous layer input x
t
, given the matrices
W
,
R
, and the bias vectors. In addition, the following applies:
σ is the sigmoid operator such that: σ(x) = 1 / (1 + e -x ),
◦ represents a point-wise multiplication,
tanh
is the hyperbolic tangent function, and
i t , f t , o t , c’ t represent the input, forget, output and new gates respectively.
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_DOUBLE_BIAS
(default mode), then the following equations with biases b
W
and b
R
apply:
i t = σ(W i x t + R i h t-1 + b Wi + b Ri )
f t = σ(W f x t + R f h t-1 + b Wf + b Rf )
o t = σ(W o x t + R o h t-1 + b Wo + b Ro )
c’ t = tanh(W c x t + R c h t-1 + b Wc + b Rc )
c t = f t ◦ c t-1 + i t ◦ c’ t
h t = o t ◦ tanh(c t )
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_SINGLE_INP_BIAS
or
CUDNN_RNN_SINGLE_REC_BIAS
, then the following equations with bias
b
apply:
i t = σ(W i x t + R i h t-1 + b i )
f t = σ(W f x t + R f h t-1 + b f )
o t = σ(W o x t + R o h t-1 + b o )
c’ t = tanh(W c x t + R c h t-1 + b c )
c t = f t ◦ c t-1 + i t ◦ c’ t
h t = o t ◦ tanh(c t )
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_NO_BIAS
, then the following equations apply:
i t = σ(W i x t + R i h t-1 )
f t = σ(W f x t + R f h t-1 )
o t = σ(W o x t + R o h t-1 )
c’ t = tanh(W c x t + R c h t-1 )
c t = f t ◦ c t-1 + i t ◦ c’ t
h t = o t ◦ tanh(c t )
CUDNN_GRU
A three-gate network consisting of Gated Recurrent Units (GRU).
In the forward pass, the output h
t
for a given iteration can be computed from the recurrent input h
t-1
and the previous layer input x
t
given matrices
W
,
R
, and the bias vectors. In addition, the following applies:
σ is the sigmoid operator such that: σ(x) = 1 / (1 + e -x ),
◦ represents a point-wise multiplication,
tanh
is the hyperbolic tangent function, and
i t , r t , h’ t represent the input, reset, and new gates respectively.
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_DOUBLE_BIAS
(default mode), then the following equations with biases b
W
and b
R
apply:
i t = σ(W i x t + R i h t-1 + b Wi + b Ru )
r t = σ(W r x t + R r h t-1 + b Wr + b Rr )
h’ t = tanh(W h x t + r t ◦ (R h h t-1 + b Rh ) + b Wh )
h t = (1 - i t ) ◦ h’ t + i t ◦ h t-1
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_SINGLE_INP_BIAS
, then the following equations with bias
b
apply:
i t = σ(W i x t + R i h t-1 + b i )
r t = σ(W r x t + R r h t-1 + b r )
h’ t = tanh(W h x t + r t ◦ (R h h t-1 ) + b Wh )
h t = (1 - i t ) ◦ h’ t + i t ◦ h t-1
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_SINGLE_REC_BIAS
, then the following equations with bias
b
apply:
i t = σ(W i x t + R i h t-1 + b i )
r t = σ(W r x t + R r h t-1 + b r )
h’ t = tanh(W h x t + r t ◦ (R h h t-1 + b Rh ))
h t = (1 - i t ) ◦ h’ t + i t ◦ h t-1
If
cudnnRNNBiasMode_t
biasMode
in
rnnDesc
is
CUDNN_RNN_NO_BIAS
, then the following equations apply:
i t = σ(W i x t + R i h t-1 )
r t = σ(W r x t + R r h t-1 )
h’ t = tanh(W h x t + rt ◦ (R h h t-1 ))
h t = (1 - i t ) ◦ h’ t + i t ◦ h t-1
cudnnSeqDataAxis_t
is an enumerated type that indexes active dimensions in the
dimA[]
argument that is passed to the
cudnnSetSeqDataDescriptor()
function to configure the sequence data descriptor of type
cudnnSeqDataDescriptor_t
.
cudnnSeqDataAxis_t
constants are also used in the
axis[]
argument of the
cudnnSetSeqDataDescriptor()
call to define the layout of the sequence data buffer in memory. Refer to
cudnnSetSeqDataDescriptor()
for a detailed description on how to use the
cudnnSeqDataAxis_t
enumerated type.
The
CUDNN_SEQDATA_DIM_COUNT
macro defines the number of constants in the
cudnnSeqDataAxis_t
enumerated type. This value is currently set to
4
.
Values
CUDNN_SEQDATA_TIME_DIM
Identifies the
TIME
(sequence length) dimension or specifies the
TIME
in the data layout.
CUDNN_SEQDATA_BATCH_DIM
Identifies the
BATCH
dimension or specifies the
BATCH
in the data layout.
CUDNN_SEQDATA_BEAM_DIM
Identifies the
BEAM
dimension or specifies the
BEAM
in the data layout.
CUDNN_SEQDATA_VECT_DIM
Identifies the
VECT
(vector) dimension or specifies the
VECT
in the data layout.
cudnnWgradMode_t
is an enumerated type that selects how buffers holding gradients of the loss function, computed with respect to trainable parameters, are updated. Currently, this type is used by the
cudnnMultiHeadAttnBackwardWeights()
and
cudnnRNNBackwardWeights_v8()
functions only.
Values
CUDNN_WGRAD_MODE_ADD
A weight gradient component corresponding to a new batch of inputs is added to previously evaluated weight gradients. Before using this mode, the buffer holding weight gradients should be initialized to zero. Alternatively, the first API call outputting to an uninitialized buffer should use the
CUDNN_WGRAD_MODE_SET
option.
CUDNN_WGRAD_MODE_SET
A weight gradient component, corresponding to a new batch of inputs, overwrites previously stored weight gradients in the output buffer.
Cross-library version checker. Each sublibrary has a version checker that checks whether its own version matches that of its dependencies.
Returns
CUDNN_STATUS_SUCCESS
The version check passed.
CUDNN_STATUS_SUBLIBRARY_VERSION_MISMATCH
The versions are inconsistent.
This function compiles the RNN persistent code using CUDA runtime compilation library (NVRTC) when the
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
algo is selected. The code is tailored to the current GPU and specific hyperparameters (
miniBatch
). This call is expected to be expensive in terms of runtime and should be invoked infrequently. Note that the
CUDNN_RNN_ALGO_PERSIST_DYNAMIC
algo does not support variable length sequences within the batch.
cudnnStatus_t cudnnBuildRNNDynamic(
cudnnHandle_t handle,
cudnnRNNDescriptor_t rnnDesc,
int32_t miniBatch);
Parameters
handleInput. Handle to a previously created cuDNN context.
rnnDescInput. A previously initialized RNN descriptor.
miniBatchInput. The exact number of sequences in a batch.
Returns
CUDNN_STATUS_SUCCESSThe code was built and linked successfully.
CUDNN_STATUS_MAPPING_ERRORA GPU/CUDA resource, such as a texture object, shared memory, or zero-copy memory is not available in the required size or there is a mismatch between the user resource and cuDNN internal resources. A resource mismatch may occur, for example, when calling cudnnSetStream(). There could be a mismatch between the user provided CUDA stream and the internal CUDA events instantiated in the cuDNN handle when cudnnCreate() was invoked.
This error status may not be correctable when it is related to texture dimensions, shared memory size, or zero-copy memory availability. If CUDNN_STATUS_MAPPING_ERROR is returned by cudnnSetStream(), then it is typically correctable, however, it means that the cuDNN handle was created on one GPU and the user stream passed to this function is associated with another GPU.
CUDNN_STATUS_ALLOC_FAILEDThe resources could not be allocated.
CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSINGThe prerequisite runtime library could not be found.
CUDNN_STATUS_NOT_SUPPORTEDThe current hyper-parameters are invalid.
cudnnCreateAttnDescriptor()#
This function has been deprecated in cuDNN 9.0.
This function creates one instance of an opaque attention descriptor object by allocating the host memory for it and initializing all descriptor fields. The function writes NULL to attnDesc when the attention descriptor object cannot be allocated.
cudnnStatus_t cudnnCreateAttnDescriptor(cudnnAttnDescriptor_t *attnDesc);
Use the cudnnSetAttnDescriptor() function to configure the attention descriptor and cudnnDestroyAttnDescriptor() to destroy it and release the allocated memory.
Parameters
attnDescOutput. Pointer where the address to the newly created attention descriptor should be written.
Returns
CUDNN_STATUS_SUCCESSThe descriptor object was created successfully.
CUDNN_STATUS_BAD_PARAMAn invalid input argument was encountered (attnDesc=NULL).
CUDNN_STATUS_ALLOC_FAILEDThe memory allocation failed.
This function creates a CTC loss function descriptor.
cudnnStatus_t cudnnCreateCTCLossDescriptor(
cudnnCTCLossDescriptor_t* ctcLossDesc)
Parameters
ctcLossDescOutput. CTC loss descriptor to be set. For more information, refer to cudnnCTCLossDescriptor_t.
Returns
CUDNN_STATUS_SUCCESSThe function returned successfully.
CUDNN_STATUS_BAD_PARAMThe CTC loss descriptor passed to the function is invalid.
CUDNN_STATUS_ALLOC_FAILEDMemory allocation for this CTC loss descriptor failed.
cudnnCreateRNNDataDescriptor()#
This function creates a RNN data descriptor object by allocating the memory needed to hold its opaque structure.
cudnnStatus_t cudnnCreateRNNDataDescriptor(
cudnnRNNDataDescriptor_t *RNNDataDesc)
Parameters
RNNDataDescOutput. Pointer to where the address to the newly created RNN data descriptor should be written.
Returns
CUDNN_STATUS_SUCCESSThe RNN data descriptor object was created successfully.
CUDNN_STATUS_BAD_PARAMThe RNNDataDesc argument is NULL.
CUDNN_STATUS_ALLOC_FAILEDThe resources could not be allocated.
cudnnCreateRNNDescriptor()#
This function creates a generic RNN descriptor object by allocating the memory needed to hold its opaque structure.
cudnnStatus_t cudnnCreateRNNDescriptor(
cudnnRNNDescriptor_t *rnnDesc)
Parameters
rnnDescOutput. Pointer to where the address to the newly created RNN descriptor should be written.
Returns
CUDNN_STATUS_SUCCESSThe object was created successfully.
CUDNN_STATUS_BAD_PARAMThe rnnDesc argument is NULL.
CUDNN_STATUS_ALLOC_FAILEDThe resources could not be allocated.
cudnnCreateSeqDataDescriptor()#
This function has been deprecated in cuDNN 9.0.
This function creates one instance of an opaque sequence data descriptor object by allocating the host memory for it and initializing all descriptor fields. The function writes NULL to seqDataDesc when the sequence data descriptor object cannot be allocated.
cudnnStatus_t cudnnCreateSeqDataDescriptor(cudnnSeqDataDescriptor_t *seqDataDesc)
Use the cudnnSetSeqDataDescriptor() function to configure the sequence data descriptor and cudnnDestroySeqDataDescriptor() to destroy it and release the allocated memory.
Parameters
seqDataDescOutput. Pointer where the address to the newly created sequence data descriptor should be written.
Returns
CUDNN_STATUS_SUCCESSThe descriptor object was created successfully.
CUDNN_STATUS_BAD_PARAMAn invalid input argument was encountered (seqDataDesc=NULL).
CUDNN_STATUS_ALLOC_FAILEDThe memory allocation failed.
cudnnCTCLoss()#
This function returns the CTC costs and gradients, given the probabilities and labels.
cudnnStatus_t cudnnCTCLoss(
cudnnHandle_t handle,
const cudnnTensorDescriptor_t probsDesc,
const void *probs,
const int hostLabels[],
const int hostLabelLengths[],
const int hostInputLengths[],
void *costs,
const cudnnTensorDescriptor_t gradientsDesc,
const void *gradients,
cudnnCTCLossAlgo_t algo,
const cudnnCTCLossDescriptor_t ctcLossDesc,
void *workspace,
size_t *workSpaceSizeInBytes)
This function can have an inconsistent interface depending on the cudnnLossNormalizationMode_t chosen (bound to the cudnnCTCLossDescriptor_t with cudnnSetCTCLossDescriptorEx()). For the CUDNN_LOSS_NORMALIZATION_NONE, this function has an inconsistent interface, for example, the probs input is probability normalized by softmax, but the gradients output is with respect to the unnormalized activation. However, for CUDNN_LOSS_NORMALIZATION_SOFTMAX, the function has a consistent interface; all values are normalized by softmax.
Parameters
handleInput. Handle to a previously created cuDNN context. For more information, refer to cudnnHandle_t.
probsDescInput. Handle to the previously initialized probabilities tensor descriptor. For more information, refer to cudnnTensorDescriptor_t.
probsInput. Pointer to a previously initialized probabilities tensor. These input probabilities are normalized by softmax.
hostLabelsInput. Pointer to a previously initialized labels list, in CPU memory.
hostLabelLengthsInput. Pointer to a previously initialized lengths list in CPU memory, to walk the above labels list.
hostInputLengthsInput. Pointer to a previously initialized list of the lengths of the timing steps in each batch, in CPU memory.
costsOutput. Pointer to the computed costs of CTC.
gradientsDescInput. Handle to a previously initialized gradient tensor descriptor.
gradientsOutput. Pointer to the computed gradients of CTC. These computed gradient outputs are with respect to the unnormalized activation.
algoInput. Enumerant that specifies the chosen CTC loss algorithm. For more information, refer to cudnnCTCLossAlgo_t.
ctcLossDescInput. Handle to the previously initialized CTC loss descriptor. For more information, refer to cudnnCTCLossDescriptor_t.
workspaceInput. Pointer to GPU memory of a workspace needed to be able to execute the specified algorithm.
sizeInBytesInput. Amount of GPU memory needed as workspace to be able to execute the CTC loss computation with the specified algo.
Returns
CUDNN_STATUS_SUCCESSThe query was successful.
CUDNN_STATUS_BAD_PARAMAt least one of the following conditions are met:
The dimensions of probsDesc do not match the dimensions of gradientsDesc.
The inputLengths do not agree with the first dimension of probsDesc.
The workSpaceSizeInBytes is not sufficient.
The labelLengths is greater than 255.
CUDNN_STATUS_NOT_SUPPORTEDA compute or data type other than FLOAT was chosen, or an unknown algorithm type was chosen.
CUDNN_STATUS_EXECUTION_FAILEDThe function failed to launch on the GPU.
cudnnCTCLoss_v8()#
This function returns the CTC costs and gradients, given the probabilities and labels. Many CTC API functions were updated in version 8 with the _v8 suffix to support CUDA graphs. Label and input data is now passed in GPU memory.
cudnnStatus_t cudnnCTCLoss_v8(
cudnnHandle_t handle,
cudnnCTCLossAlgo_t algo,
const cudnnCTCLossDescriptor_t ctcLossDesc,
const cudnnTensorDescriptor_t probsDesc,
const void *probs,
const int labels[],
const int labelLengths[],
const int inputLengths[],
void *costs,
const cudnnTensorDescriptor_t gradientsDesc,
const void *gradients,
size_t *workSpaceSizeInBytes,
void *workspace)
This function can have an inconsistent interface depending on the cudnnLossNormalizationMode_t chosen (bound to the cudnnCTCLossDescriptor_t with cudnnSetCTCLossDescriptorEx()). For the CUDNN_LOSS_NORMALIZATION_NONE, this function has an inconsistent interface, for example, the probs input is probability normalized by softmax, but the gradients output is with respect to the unnormalized activation. However, for CUDNN_LOSS_NORMALIZATION_SOFTMAX, the function has a consistent interface; all values are normalized by softmax.
Parameters
handleInput. Handle to a previously created cuDNN context. For more information, refer to cudnnHandle_t
.
algoInput. Enumerant that specifies the chosen CTC loss algorithm. For more information, refer to cudnnCTCLossAlgo_t.
ctcLossDescInput. Handle to the previously initialized CTC loss descriptor. For more information, refer to cudnnCTCLossDescriptor_t.
probsDescInput. Handle to the previously initialized probabilities tensor descriptor. For more information, refer to cudnnTensorDescriptor_t.
probsInput. Pointer to a previously initialized probabilities tensor. These input probabilities are normalized by softmax.
labelsInput. Pointer to a previously initialized labels list, in GPU memory.
labelLengthsInput. Pointer to a previously initialized lengths list in GPU memory, to walk the above labels list.
inputLengthsInput. Pointer to a previously initialized list of the lengths of the timing steps in each batch, in GPU memory.
costsOutput. Pointer to the computed costs of CTC.
gradientsDescInput. Handle to a previously initialized gradient tensor descriptor.
gradientsOutput. Pointer to the computed gradients of CTC. These computed gradient outputs are with respect to the unnormalized activation.
workspaceInput. Pointer to GPU memory of a workspace needed to be able to execute the specified algorithm.
sizeInBytesInput. Amount of GPU memory needed as a workspace to be able to execute the CTC loss computation with the specified algo.
Returns
CUDNN_STATUS_SUCCESSThe query was successful.
CUDNN_STATUS_BAD_PARAMAt least one of the following conditions are met:
The dimensions of probsDesc do not match the dimensions of gradientsDesc.
The workSpaceSizeInBytes is not sufficient.
CUDNN_STATUS_NOT_SUPPORTEDA compute or data type other than FLOAT was chosen, or an unknown algorithm type was chosen.
CUDNN_STATUS_EXECUTION_FAILEDThe function failed to launch on the GPU.
cudnnDestroyAttnDescriptor()#
This function has been deprecated in cuDNN 9.0.
This function destroys the attention descriptor object and releases its memory. The attnDesc argument can be NULL. Invoking cudnnDestroyAttnDescriptor() with a NULL argument is a no operation (NOP).
cudnnStatus_t cudnnDestroyAttnDescriptor(cudnnAttnDescriptor_t attnDesc);
The cudnnDestroyAttnDescriptor() function is not able to detect if the attnDesc argument holds a valid address. Undefined behavior will occur in case of passing an invalid pointer, not returned by the cudnnCreateAttnDescriptor() function, or in the double deletion scenario of a valid address.
Parameters
attnDescInput. Pointer to the attention descriptor object to be destroyed.
Returns
CUDNN_STATUS_SUCCESSThe descriptor was destroyed successfully.
cudnnDestroyCTCLossDescriptor()#
This function destroys a CTC loss function descriptor object.
cudnnStatus_t cudnnDestroyCTCLossDescriptor(
cudnnCTCLossDescriptor_t ctcLossDesc)
Parameters
ctcLossDescInput. CTC loss function descriptor to be destroyed.
Returns
CUDNN_STATUS_SUCCESSThe function returned successfully.
cudnnDestroyRNNDataDescriptor()#
This function destroys a previously created RNN data descriptor object. Invoking cudnnDestroyRNNDataDescriptor() with the NULL argument is a no operation (NOP).
cudnnStatus_t cudnnDestroyRNNDataDescriptor(
cudnnRNNDataDescriptor_t RNNDataDesc)
The cudnnDestroyRNNDataDescriptor() function is not able to detect if the RNNDataDesc argument holds a valid address. Undefined behavior will occur in cases of passing an invalid pointer, not returned by the cudnnCreateRNNDataDescriptor() function, or in the double deletion scenario of a valid address.
Parameters
RNNDataDescInput. Pointer to the RNN data descriptor object to be destroyed.
Returns
CUDNN_STATUS_SUCCESSThe RNN data descriptor object was destroyed successfully.
cudnnDestroyRNNDescriptor()#
This function destroys a previously created RNN descriptor object. Invoking cudnnDestroyRNNDescriptor() with the NULL argument is a no operation (NOP).
cudnnStatus_t cudnnDestroyRNNDescriptor(
cudnnRNNDescriptor_t rnnDesc)
The cudnnDestroyRNNDescriptor() function is not able to detect if the rnnDesc argument holds a valid address. Undefined behavior will occur in cases of passing an invalid pointer, not returned by the cudnnCreateRNNDescriptor() function, or in the double deletion scenario of a valid address.
Parameters
rnnDescInput. Pointer to the RNN descriptor object to be destroyed.
Returns
CUDNN_STATUS_SUCCESSThe object was destroyed successfully.
cudnnDestroySeqDataDescriptor()#
This function has been deprecated in cuDNN 9.0.
This function destroys the sequence data descriptor object and releases its memory. The seqDataDesc argument can be NULL. Invoking cudnnDestroySeqDataDescriptor() with a NULL argument is a no operation (NOP).
cudnnStatus_t cudnnDestroySeqDataDescriptor(cudnnSeqDataDescriptor_t seqDataDesc);
The cudnnDestroySeqDataDescriptor() function is not able to detect if the seqDataDesc argument holds a valid address. Undefined behavior will occur in case of passing an invalid pointer, not returned by the cudnnCreateSeqDataDescriptor() function, or in the double deletion scenario of a valid address.
Parameters
seqDataDescInput. Pointer to the sequence data descriptor object to be destroyed.
Returns
CUDNN_STATUS_SUCCESSThe descriptor was destroyed successfully.
cudnnGetAttnDescriptor()#
This function has been deprecated in cuDNN 9.0.
This function retrieves settings from the previously created attention descriptor. The user can assign NULL to any pointer except attnDesc when the retrieved value is not needed.
cudnnStatus_t cudnnGetAttnDescriptor(
cudnnAttnDescriptor_t attnDesc,
unsigned *attnMode,
int *nHeads,
double *smScaler,
cudnnDataType_t *dataType,
cudnnDataType_t *computePrec,
cudnnMathType_t *mathType,
cudnnDropoutDescriptor_t *attnDropoutDesc,
cudnnDropoutDescriptor_t *postDropoutDesc,
int *qSize,
int *kSize,
int *vSize,
int *qProjSize,
int *kProjSize,
int *vProjSize,
int *oProjSize,
int *qoMaxSeqLength,
int *kvMaxSeqLength,
int *maxBatchSize,
int *maxBeamSize);
Parameters
attnDescInput. Attention descriptor.
attnModeOutput. Pointer to the storage for binary attention flags.
nHeadsOutput. Pointer to the storage for the number of attention heads.
smScalerOutput. Pointer to the storage for the softmax smoothing/sharpening coefficient.
dataTypeOutput. Data type for attention weights, sequence data inputs, and outputs.
computePrec
Output. Pointer to the storage for the compute precision.
mathTypeOutput. NVIDIA Tensor Core settings.
attnDropoutDescOutput. Descriptor of the dropout operation applied to the softmax output.
postDropoutDescOutput. Descriptor of the dropout operation applied to the multihead attention output.
qSize, kSize, vSizeOutput. Q, K, and V embedding vector lengths.
qProjSize, kProjSize, vProjSizeOutput. Q, K, and V embedding vector lengths after input projections.
oProjSizeOutput. Pointer to store the output vector length after projection.
qoMaxSeqLengthOutput. Largest sequence length expected in sequence data descriptors related to Q, O, dQ, dO inputs and outputs.
kvMaxSeqLengthOutput. Largest sequence length expected in sequence data descriptors related to K, V, dK, dV inputs and outputs.
maxBatchSizeOutput. Largest batch size expected in the cudnnSeqDataDescriptor_t container.
maxBeamSizeOutput. Largest beam size expected in the cudnnSeqDataDescriptor_t container.
Returns
CUDNN_STATUS_SUCCESSRequested attention descriptor fields were retrieved successfully.
CUDNN_STATUS_BAD_PARAMAn invalid input argument was found.
cudnnGetCTCLossDescriptor()#
This function has been deprecated in cuDNN 9.0; use cudnnGetCTCLossDescriptor_v9() instead.
This function returns the configuration of the passed CTC loss function descriptor.
cudnnStatus_t cudnnGetCTCLossDescriptor(
cudnnCTCLossDescriptor_t ctcLossDesc,
cudnnDataType_t* compType)
Parameters
ctcLossDescInput. CTC loss function descriptor passed, from which to retrieve the configuration.
compTypeOutput. Compute type associated with this CTC loss function descriptor.
Returns
CUDNN_STATUS_SUCCESSThe function returned successfully.
CUDNN_STATUS_BAD_PARAMInput ctcLossDesc descriptor passed is invalid.
cudnnGetCTCLossDescriptor_v8()#
This function has been deprecated in cuDNN 9.0; use cudnnGetCTCLossDescriptor_v9() instead.
This function returns the configuration of the passed CTC loss function descriptor.
cudnnStatus_t cudnnGetCTCLossDescriptor_v8(
cudnnCTCLossDescriptor_t ctcLossDesc,
cudnnDataType_t *compType,
cudnnLossNormalizationMode_t *normMode,
cudnnNanPropagation_t *gradMode,
int *maxLabelLength)
Parameters
ctcLossDescInput. CTC loss function descriptor passed, from which to retrieve the configuration.
compTypeOutput. Compute type associated with this CTC loss function descriptor.
normModeOutput. Input normalization type for this CTC loss function descriptor. For more information, refer to cudnnLossNormalizationMode_t.
gradModeOutput. NaN propagation type for this CTC loss function descriptor.
maxLabelLengthOutput. The max label length for this CTC loss function descriptor.
Returns
CUDNN_STATUS_SUCCESSThe function returned successfully.
CUDNN_STATUS_BAD_PARAMInput ctcLossDesc descriptor passed is invalid.
cudnnGetCTCLossDescriptor_v9()#
This function returns the configuration of the passed CTC loss function descriptor.
cudnnStatus_t cudnnGetCTCLossDescriptor_v8(
cudnnCTCLossDescriptor_t ctcLossDesc,
cudnnDataType_t *compType,
cudnnLossNormalizationMode_t *normMode,
cudnnCTCGradMode_t *ctcGradMode,
int *maxLabelLength)
Parameters
ctcLossDescInput. CTC loss function descriptor passed, from which to retrieve the configuration.
compTypeOutput. Compute type associated with this CTC loss function descriptor.
normModeOutput. Input normalization type for this CTC loss function descriptor. For more information, refer to cudnnLossNormalizationMode_t.
ctcGradModeOutput. The gradient mode for handling OOB samples for this CTC loss function descriptor. Refer to cudnnSetCTCLossDescriptor_v9() for more information.
maxLabelLengthOutput. The max label length for this CTC loss function descriptor.
Returns
CUDNN_STATUS_SUCCESSThe function returned successfully.
CUDNN_STATUS_BAD_PARAMInput ctcLossDesc descriptor passed is invalid.
cudnnGetCTCLossDescriptorEx()#
This function has been deprecated in cuDNN 9.0; use cudnnGetCTCLossDescriptor_v9() instead.
This function returns the configuration of the passed CTC loss function descriptor.
cudnnStatus_t cudnnGetCTCLossDescriptorEx(
cudnnCTCLossDescriptor_t ctcLossDesc,
cudnnDataType_t *compType,
cudnnLossNormalizationMode_t *normMode,
cudnnNanPropagation_t *gradMode)
Parameters
ctcLossDescInput. CTC loss function descriptor passed, from which to retrieve the configuration.
compTypeOutput. Compute type associated with this CTC loss function descriptor.
normModeOutput. Input normalization type for this CTC loss function descriptor. For more information, refer to cudnnLossNormalizationMode_t.
gradModeOutput. NaN propagation type for this CTC loss function descriptor.
Returns
CUDNN_STATUS_SUCCESSThe function returned successfully.
CUDNN_STATUS_BAD_PARAMInput ctcLossDesc descriptor passed is invalid.
cudnnGetCTCLossWorkspaceSize()#
This function returns the amount of GPU memory workspace the user needs to allocate to be able to call cudnnCTCLoss() with the specified algorithm. The workspace allocated will then be passed to the routine cudnnCTCLoss().
cudnnStatus_t cudnnGetCTCLossWorkspaceSize(
cudnnHandle_t handle,
const cudnnTensorDescriptor_t probsDesc,
const cudnnTensorDescriptor_t gradientsDesc,
const int *labels,
const int *labelLengths,
const int *inputLengths,
cudnnCTCLossAlgo_t algo,
const cudnnCTCLossDescriptor_t ctcLossDesc,
size_t *sizeInBytes)
Parameters
handleInput. Handle to a previously created cuDNN context.
probsDescInput. Handle to the previously initialized probabilities tensor descriptor.
gradientsDescInput. Handle to a previously initialized gradient tensor descriptor.
labelsInput. Pointer to a previously initialized labels list.
labelLengthsInput. Pointer to a previously initialized lengths list, to walk the above labels list.
inputLengthsInput. Pointer to a previously initialized list of the lengths of the timing steps in each batch.
algoInput. Enumerant that specifies the chosen CTC loss algorithm.
ctcLossDescInput. Handle to the previously initialized CTC loss descriptor.
sizeInBytesOutput. Amount of GPU memory needed as workspace to be able to execute the CTC loss computation with the specified algo.
Returns
CUDNN_STATUS_SUCCESSThe query was successful.
CUDNN_STATUS_BAD_PARAMAt least one of the following conditions are met:
The dimensions of probsDesc do not match the dimensions of gradientsDesc
The inputLengths do not agree with the first dimension of probsDesc
The workSpaceSizeInBytes is not sufficient
The labelLengths is greater than 256
CUDNN_STATUS_NOT_SUPPORTEDA compute or data type other than FLOAT was chosen, or an unknown algorithm type was chosen.
cudnnGetCTCLossWorkspaceSize_v8()#
This function returns the amount of GPU memory workspace the user needs to allocate to be able to call cudnnCTCLoss_v8 with the specified algorithm. The workspace allocated will then be passed to the routine cudnnCTCLoss_v8().
cudnnStatus_t cudnnGetCTCLossWorkspaceSize_v8(
cudnnHandle_t handle,
cudnnCTCLossAlgo_t algo,
const cudnnCTCLossDescriptor_t ctcLossDesc,
const cudnnTensorDescriptor_t probsDesc,
const cudnnTensorDescriptor_t gradientsDesc,
size_t *sizeInBytes)
Parameters
handleInput. Handle to a previously created cuDNN context.
algoInput. Enumerant that specifies the chosen CTC loss algorithm.
ctcLossDescInput. Handle to the previously initialized CTC loss descriptor.
probsDescInput. Handle to the previously initialized probabilities tensor descriptor.
gradientsDescInput. Handle to a previously initialized gradient tensor descriptor.
sizeInBytesOutput. Amount of GPU memory needed as workspace to be able to execute the CTC loss computation with the specified algo.
Returns
CUDNN_STATUS_SUCCESSThe query was successful.
CUDNN_STATUS_BAD_PARAMAt least one of the following conditions are met:
The dimensions of probsDesc do not match the dimensions of gradientsDesc
CUDNN_STATUS_NOT_SUPPORTED
- A compute or data type other than FLOAT was chosen, or an unknown algorithm type was chosen.
- For the deterministic CTC loss algorithm, the maxLabelLength in ctcLossDesc is greater than or equal to 256.
- For the nondeterministic CTC loss algorithm, the maxLabelLength in ctcLossDesc is greater than or equal to 2048.
cudnnGetMultiHeadAttnBuffers()#
This function has been deprecated in cuDNN 9.0.
This function computes weight, work, and reserve space buffer sizes used by the following functions:
cudnnStatus_t cudnnGetMultiHeadAttnBuffers(
cudnnHandle_t handle,
const cudnnAttnDescriptor_t attnDesc,
size_t *weightSizeInBytes,
size_t *workSpaceSizeInBytes,
size_t *reserveSpaceSizeInBytes);
Assigning NULL to the reserveSpaceSizeInBytes argument indicates that the user does not plan to invoke multihead attention gradient functions: cudnnMultiHeadAttnBackwardData() and cudnnMultiHeadAttnBackwardWeights(). This situation occurs in the inference mode.
NULL cannot be assigned to weightSizeInBytes and workSpaceSizeInBytes pointers.
The user must allocate weight, work, and reserve space buffer sizes in the GPU memory using cudaMalloc() with the reported buffer sizes. The buffers can be also carved out from a larger chunk of allocated memory but the buffer addresses must be at least 16B aligned.
The workspace buffer is used for temporary storage. Its content can be discarded or modified after all GPU kernels launched by the corresponding API complete. The reserve-space buffer is used to transfer intermediate results from cudnnMultiHeadAttnForward() to cudnnMultiHeadAttnBackwardData(), and from cudnnMultiHeadAttnBackwardData() to cudnnMultiHeadAttnBackwardWeights(). The content of the reserve-space buffer cannot be modified until all GPU kernels launched by the above three multihead attention API functions finish.
All multihead attention weight and bias tensors are stored in a single weight buffer. For speed optimizations, the cuDNN API may change tensor layouts and their relative locations in the weight buffer based on the provided attention parameters. Use the cudnnGetMultiHeadAttnWeights() function to obtain the start address and the shape of each weight or bias tensor.
Parameters
handleInput. The current cuDNN context handle.
attnDescInput. Pointer to a previously initialized attention descriptor.
weightSizeInBytesOutput. Minimum buffer size required to store all multihead attention trainable parameters.
workSpaceSizeInBytesOutput. Minimum buffer size required to hold all temporary surfaces used by the forward and gradient multihead attention API calls.
reserveSpaceSizeInBytesOutput. Minimum buffer size required to store all intermediate data exchanged between forward and backward (gradient) multihead attention functions. Set this parameter to NULL in the inference mode indicating that gradient API calls will not be invoked.
Returns
CUDNN_STATUS_SUCCESSThe requested buffer sizes were computed successfully.
CUDNN_STATUS_BAD_PARAMAn invalid input argument was found.
cudnnGetMultiHeadAttnWeights()#
This function has been deprecated in cuDNN 9.0.
This function obtains the shape of the weight or bias tensor. It also retrieves the start address of tensor data located in the weight buffer. Use the wKind argument to select a particular tensor. For more information, refer to cudnnMultiHeadAttnWeightKind_t for the description of the enumerant type.
cudnnStatus_t cudnnGetMultiHeadAttnWeights(
cudnnHandle_t handle,
const cudnnAttnDescriptor_t attnDesc,
cudnnMultiHeadAttnWeightKind_t wKind,
size_t weightSizeInBytes,
const void *weights,
cudnnTensorDescriptor_t wDesc,
void **wAddr);
Biases are used in the input and output projections when the CUDNN_ATTN_ENABLE_PROJ_BIASES flag is set in the attention descriptor. Refer to cudnnSetAttnDescriptor() for the description of flags to control projection biases.
When the corresponding weight or bias tensor does not exist, the function writes NULL to the storage location pointed by wAddr and returns zeros in the wDesc tensor descriptor. The return status of the cudnnGetMultiHeadAttnWeights() function is CUDNN_STATUS_SUCCESS in this case.
The cuDNN multiHeadAttention sample code demonstrates how to access multihead attention weights. Although the buffer with weights and biases should be allocated in the GPU memory, the user can copy it to the host memory and invoke the cudnnGetMultiHeadAttnWeights() function with the host weights address to obtain tensor pointers in the host memory. This scheme allows the user to inspect trainable parameters directly in the CPU memory.
Parameters
handleInput. The current cuDNN context handle.
attnDescInput. A previously configured attention descriptor.
wKindInput. Enumerant type to specify which weight or bias tensor should be retrieved.
weightSizeInBytesInput. Buffer size that stores all multihead attention weights and biases.
weightsInput. Pointer to the weight buffer in the host or device memory.
wDescOutput. The descriptor specifying weight or bias tensor shape. For weights, the wDesc.dimA[] array has three elements: [nHeads, projected size, original size]. For biases, the wDesc.dimA[]
array also has three elements: [nHeads, projected size, 1]. The wDesc.strideA[] array describes how tensor elements are arranged in memory.
wAddrOutput. Pointer to a location where the start address of the requested tensor should be written. When the corresponding projection is disabled, the address written to wAddr is NULL.
Returns
CUDNN_STATUS_SUCCESSThe weight tensor descriptor and the address of data in the device memory were successfully retrieved.
CUDNN_STATUS_BAD_PARAMAn invalid or incompatible input argument was encountered. For example, wKind did not have a valid value or weightSizeInBytes was too small.
cudnnGetRNNDataDescriptor()#
This function retrieves a previously created RNN data descriptor object.
cudnnStatus_t cudnnGetRNNDataDescriptor(
cudnnRNNDataDescriptor_t RNNDataDesc,
cudnnDataType_t *dataType,
cudnnRNNDataLayout_t *layout,
int *maxSeqLength,
int *batchSize,
int *vectorSize,
int arrayLengthRequested,
int seqLengthArray[],
void *paddingFill);
Parameters
RNNDataDescInput. A previously created and initialized RNN descriptor.
dataTypeOutput. Pointer to the host memory location to store the datatype of the RNN data tensor.
layoutOutput. Pointer to the host memory location to store the memory layout of the RNN data tensor.
maxSeqLengthOutput. The maximum sequence length within this RNN data tensor, including the padding vectors.
batchSizeOutput. The number of sequences within the mini-batch.
vectorSizeOutput. The vector length (meaning, embedding size) of the input or output tensor at each time-step.
arrayLengthRequestedInput. The number of elements that the user requested for seqLengthArray.
seqLengthArrayOutput. Pointer to the host memory location to store the integer array describing the length (meaning, number of timesteps) of each sequence. This is allowed to be a NULL pointer if arrayLengthRequested is 0.
paddingFillOutput. Pointer to the host memory location to store the user defined symbol. The symbol should be interpreted as the same data type as the RNN data tensor.
Returns
CUDNN_STATUS_SUCCESSThe parameters are fetched successfully.
CUDNN_STATUS_BAD_PARAMAny one of these have occurred:
Any of RNNDataDesc, dataType, layout, maxSeqLength, batchSize, vectorSize, or paddingFill is NULL.
seqLengthArray is NULL while arrayLengthRequested is greater than zero.
arrayLengthRequested is less than zero.
cudnnGetRNNDescriptor_v8()#
This function retrieves RNN network parameters that were configured by cudnnSetRNNDescriptor_v8(). The user can assign NULL to any pointer except rnnDesc when the retrieved value is not needed. The function does not check the validity of retrieved parameters.
cudnnStatus_t cudnnGetRNNDescriptor_v8(
cudnnRNNDescriptor_t rnnDesc,
cudnnRNNAlgo_t *algo,
cudnnRNNMode_t *cellMode,
cudnnRNNBiasMode_t *biasMode,
cudnnDirectionMode_t *dirMode,
cudnnRNNInputMode_t *inputMode,
cudnnDataType_t *dataType,
cudnnDataType_t *mathPrec,
cudnnMathType_t *mathType,
int32_t *inputSize,
int32_t *hiddenSize,
int32_t *projSize,
int32_t *numLayers,
cudnnDropoutDescriptor_t *dropoutDesc,
uint32_t *auxFlags);
Parameters
rnnDescInput. A previously created and initialized RNN descriptor.
algoOutput. Pointer to where RNN algorithm type should be stored.
cellModeOutput. Pointer to where RNN cell type should be saved.
biasModeOutput. Pointer to where RNN bias mode cudnnRNNBiasMode_t should be saved.
dirModeOutput. Pointer to where RNN unidirectional/bidirectional mode should be saved.
inputModeOutput. Pointer to where the mode of the first RNN layer should be saved.
dataTypeOutput. Pointer to where the data type of RNN weights/biases should be stored.
mathPrecOutput. Pointer to where the math precision type should be stored.
mathTypeOutput. Pointer to where the preferred option for Tensor Cores are saved.
inputSizeOutput. Pointer to where the RNN input vector size is stored.
hiddenSizeOutput. Pointer to where the size of the hidden state should be stored (the same value is used in every RNN layer).
projSizeOutput. Pointer to where the LSTM cell output size after the recurrent projection is stored.
numLayersOutput. Pointer to where the number of RNN layers should be stored.
dropoutDescOutput. Pointer to where the handle to a previously configured dropout descriptor should be stored.
auxFlagsOutput. Pointer to miscellaneous RNN options (flags) that do not require passing additional numerical values to configure.
Returns
CUDNN_STATUS_SUCCESSRNN parameters were successfully retrieved from the RNN descriptor.
CUDNN_STATUS_BAD_PARAMAn invalid input argument was found (rnnDesc was NULL).
CUDNN_STATUS_NOT_INITIALIZEDThe cuDNN library was not initialized properly.
cudnnGetRNNTempSpaceSizes()#
This function computes the work and reserve space buffer sizes based on the RNN network geometry stored in rnnDesc, designated usage (inference or training) defined by the fMode argument, and the current RNN data dimensions (maxSeqLength, batchSize) retrieved from xDesc. When RNN data dimensions change, the cudnnGetRNNTempSpaceSizes() must be called again because RNN temporary buffer sizes are not monotonic.
cudnnStatus_t cudnnGetRNNTempSpaceSizes(
cudnnHandle_t handle,
cudnnRNNDescriptor_t rnnDesc,
cudnnForwardMode_t fMode,
cudnnRNNDataDescriptor_t xDesc,
size_t *workSpaceSize,
size_t *reserveSpaceSize);
The user can assign NULL to workSpaceSize or reserveSpaceSize pointers when the corresponding value is not needed.
Parameters
handleInput. The current cuDNN context handle.
rnnDescInput. A previously initialized RNN descriptor.
fModeInput. Specifies whether temporary buffers are used in inference or training modes. The reserve-space buffer is not used during inference. Therefore, the returned size of the reserve space buffer will be zero when the fMode argument is CUDNN_FWD_MODE_INFERENCE.
xDescInput. A single RNN data descriptor that specifies current RNN data dimensions: maxSeqLength and batchSize
.
workSpaceSizeOutput. Minimum amount of GPU memory in bytes needed as a workspace buffer. The workspace buffer is not used to pass intermediate results between APIs but as a temporary read/write buffer.
reserveSpaceSizeOutput. Minimum amount of GPU memory in bytes needed as the reserve-space buffer. The reserve space buffer is used to pass intermediate results from cudnnRNNForward() to RNN BackwardData and BackwardWeights routines that compute first order derivatives with respect to RNN inputs or trainable weight and biases.
Returns
CUDNN_STATUS_SUCCESSRNN temporary buffer sizes were computed successfully.
CUDNN_STATUS_BAD_PARAMAn invalid input argument was detected.
CUDNN_STATUS_NOT_SUPPORTEDAn incompatible or unsupported combination of input arguments was detected.
cudnnGetRNNWeightParams()#
This function is used to obtain the start address and shape of every RNN weight matrix and bias vector in each pseudo-layer within the recurrent neural network model.
cudnnStatus_t cudnnGetRNNWeightParams(
cudnnHandle_t handle,
cudnnRNNDescriptor_t rnnDesc,
int32_t pseudoLayer,
size_t weightSpaceSize,
const void *weightSpace,
int32_t linLayerID,
cudnnTensorDescriptor_t mDesc,
void **mAddr,
cudnnTensorDescriptor_t bDesc,
void **bAddr);
Parameters
handleInput. Handle to a previously created cuDNN library descriptor.
rnnDescInput. A previously initialized RNN descriptor.
pseudoLayerInput. The pseudo-layer to query. In unidirectional RNNs, a pseudo-layer is the same as a physical layer (pseudoLayer=0 is the RNN input layer, pseudoLayer=1 is the first hidden layer). In bidirectional RNNs, there are twice as many pseudo-layers in comparison to physical layers:
pseudoLayer=0 refers to the forward direction sub-layer of the physical input layer
pseudoLayer=1 refers to the backward direction sub-layer of the physical input layer
pseudoLayer=2 is the forward direction sub-layer of the first hidden layer, and so on
weightSpaceSizeInput. Address of the weight space buffer. Starting from cuDNN version 9.1, this parameter can be NULL. This allows you to retrieve weight/bias offsets instead of the actual pointers within the buffer. For best performance, the recommended alignment of the weight space buffer should be 256 B or the same as returned by cudaMalloc().
weightSpaceInput. Pointer to the weight space buffer.
linLayerIDInput. Weight matrix or bias vector linear ID index.
If cellMode in rnnDesc was set to CUDNN_RNN_RELU or CUDNN_RNN_TANH:
Value 0 references the weight matrix or bias vector used in conjunction with the input from the previous layer or input to the RNN model.
Value 1 references the weight matrix or bias vector used in conjunction with the hidden state from the previous time step or the initial hidden state.
If cellMode in rnnDesc was set to CUDNN_LSTM:
Values 0, 1, 2, and 3 reference weight matrices or bias vectors used in conjunction with the input from the previous layer or input to the RNN model.
Values 4, 5, 6, and 7 reference weight matrices or bias vectors used in conjunction with the hidden state from the previous time step or the initial hidden state.
Value 8 corresponds to the projection matrix, if enabled (there is no bias in this operation).
Values and their LSTM gates:
linLayerID 0 and 4 correspond to the input gate.
linLayerID 1 and 5 correspond to the forget gate.
linLayerID 2 and 6 correspond to the new cell state calculations with hyperbolic tangent.
linLayerID 3 and 7 correspond to the output gate.
If cellMode in rnnDesc was set to CUDNN_GRU:
Values 0, 1, and 2 reference weight matrices or bias vectors used in conjunction with the input from the previous layer or input to the RNN model.
Values 3, 4, and 5 reference weight matrices or bias vectors used in conjunction with the hidden state from the previous time step or the initial hidden state.
Values and their GRU gates:
linLayerID 0 and 3 correspond to the reset gate.
linLayerID 1 and 4 reference to the update gate.
linLayerID 2 and 5 correspond to the new hidden state calculations with hyperbolic tangent.
For more information on modes and bias modes, refer to cudnnRNNMode_t.
mDescOutput. Handle to a previously created tensor descriptor. The shape of the corresponding weight matrix is returned in this descriptor in the following format: dimA[3] = {1, rows, cols}. The reported number of tensor dimensions is zero when the weight matrix does not exist. This situation occurs for input GEMM matrices of the first layer when CUDNN_SKIP_INPUT is selected or for the LSTM projection matrix when the feature is disabled.
mAddrOutput. Pointer to the beginning of the weight matrix within the weight space buffer. When the weight matrix does not exist, the returned address written to mAddr is NULL. Starting from cuDNN version 9.1, the mDesc and mAddr arguments can be both NULL. In this case, the shape of the weight matrix and its address will not be reported. By assigning mDesc=NULL and mAddr=NULL, you can retrieve information about bias vectors only.
bDescOutput. Handle to a previously created tensor descriptor. The shape of the corresponding bias vector is returned in this descriptor in the following format: dimA[3] = {1, rows, 1}. The reported number of tensor dimensions is zero when the bias vector does not exist.
bAddrOutput. Pointer to the beginning of the bias vector within the weight space buffer. When the bias vector does not exist, the returned address is NULL. Starting from cuDNN version 9.1, the bDesc and bAddr arguments can be both NULL. In this case, the shape of the bias vector and its address will not be reported. By assigning bDesc=NULL and bAddr=NULL, you can retrieve information about weight matrices only.
Returns
CUDNN_STATUS_SUCCESSThe query was completed successfully.
CUDNN_STATUS_BAD_PARAMAn invalid input argument was encountered. For example, the value of pseudoLayer is out of range or linLayerID is negative or larger than 8.
CUDNN_STATUS_INVALID_VALUESome weight/bias elements are outside the weight space buffer boundaries.
CUDNN_STATUS_NOT_INITIALIZEDThe cuDNN library was not initialized properly.
cudnnGetRNNWeightSpaceSize()#
This function reports the required size of the weight space buffer in bytes. The weight space buffer holds all RNN weight matrices and bias vectors.
cudnnStatus_t cudnnGetRNNWeightSpaceSize(
cudnnHandle_t handle,
cudnnRNNDescriptor_t rnnDesc,
size_t *weightSpaceSize);
Parameters
handleInput. The current cuDNN context handle.
rnnDescInput. A previously initialized RNN descriptor.
weightSpaceSizeOutput. Minimum size in bytes of GPU memory needed for all RNN trainable parameters.
Returns
CUDNN_STATUS_SUCCESSThe query was successful.
CUDNN_STATUS_BAD_PARAMAn invalid input argument was encountered. For example, any input argument was NULL.
CUDNN_STATUS_NOT_INITIALIZEDThe cuDNN library was not initialized properly.
cudnnGetSeqDataDescriptor()#
This function has been deprecated in cuDNN 9.0.
This function retrieves settings from a previously created sequence data descriptor. The user can assign NULL
to any pointer except seqDataDesc when the retrieved value is not needed. The nbDimsRequested argument applies to both dimA[] and axes[] arrays. A positive value of nbDimsRequested or seqLengthSizeRequested is ignored when the corresponding array, dimA[], axes[], or seqLengthArray[] is NULL.
cudnnStatus_t cudnnGetSeqDataDescriptor(
const cudnnSeqDataDescriptor_t seqDataDesc,
cudnnDataType_t *dataType,
int *nbDims,
int nbDimsRequested,
int dimA[],
cudnnSeqDataAxis_t axes[],
size_t *seqLengthArraySize,
size_t seqLengthSizeRequested,
int seqLengthArray[],
void *paddingFill);
The cudnnGetSeqDataDescriptor() function does not report the actual strides in the sequence data buffer. Those strides can be handy in computing the offset to any sequence data element. The user must precompute strides based on the axes[] and dimA[] arrays reported by the cudnnGetSeqDataDescriptor() function. Below is sample code that performs this task:
// Array holding sequence data strides.
size_t strA[CUDNN_SEQDATA_DIM_COUNT] = {0};
// Compute strides from dimension and order arrays.
size_t stride = 1;
for (int i = nbDims - 1; i >= 0; i--) {
int j = int(axes[i]);
if (unsigned(j) < CUDNN_SEQDATA_DIM_COUNT-1 && strA[j] == 0) {
strA[j] = stride;
stride *= dimA[j];
} else {
fprintf(stderr, "ERROR: invalid axes[%d]=%d\n\n", i, j);
abort();
Now, the strA[] array can be used to compute the index to any sequence data element, for example:
// Using four indices (batch, beam, time, vect) with ranges already checked.
size_t base = strA[CUDNN_SEQDATA_BATCH_DIM] * batch
+ strA[CUDNN_SEQDATA_BEAM_DIM] * beam
+ strA[CUDNN_SEQDATA_TIME_DIM] * time;
val = seqDataPtr[base + vect];
The above code assumes that all four indices (batch, beam, time, vect) are less than the corresponding value in the dimA[] array. The sample code also omits the strA[CUDNN_SEQDATA_VECT_DIM] stride because its value is always 1, meaning, elements of one vector occupy a contiguous block of memory.
Parameters
seqDataDescInput. Sequence data descriptor.
dataTypeOutput. Data type used in the sequence data buffer.
nbDimsOutput. The number of active dimensions in the dimA[] and axes[] arrays.
nbDimsRequestedInput. The maximum number of consecutive elements that can be written to dimA[] and axes[] arrays starting from index zero. The recommended value for this argument is CUDNN_SEQDATA_DIM_COUNT.
dimA[]Output. Integer array holding sequence data dimensions.
axes[]Output. Array of cudnnSeqDataAxis_t that defines the layout of sequence data in memory.
seqLengthArraySizeOutput. The number of required elements in seqLengthArray[] to save all sequence lengths.
seqLengthSizeRequestedInput. The maximum number of consecutive elements that can be written to the seqLengthArray[] array starting from index zero.
seqLengthArray[]Output. Integer array holding sequence lengths.
paddingFillOutput. Pointer to a storage location of dataType with the fill value that should be written to all padding vectors. Use NULL when an explicit initialization of output padding vectors was not requested.
Returns
CUDNN_STATUS_SUCCESSRequested sequence data descriptor fields were retrieved successfully.
CUDNN_STATUS_BAD_PARAMAn invalid input argument was found.
CUDNN_STATUS_INTERNAL_ERRORAn inconsistent internal state was encountered.
cudnnMultiHeadAttnBackwardData()#
This function has been deprecated in cuDNN 9.0.
This function computes exact, first-order derivatives of the multihead attention block with respect to its inputs: Q, K, V. If y=F(w) is a vector-valued function that represents the multihead attention layer and it takes some vector \(\chi\epsilon\mathbb{R}^{n}\) as an input (with all other parameters and inputs constant), and outputs vector \(\chi\epsilon\mathbb{R}^{m}\), then cudnnMultiHeadAttnBackwardData() computes the result of \(\left(\partial y_{i}/\partial x_{j}\right)^{T} \delta_{out}\) where \(\delta_{out}\) is the mx1 gradient of the loss function with respect to multihead attention outputs. The \(\delta_{out}\) gradient is back propagated through prior layers of the deep learning model. \(\partial y_{i}/\partial x_{j}\) is the mxn Jacobian matrix of F(x). The input is supplied via the dout argument and gradient results for Q, K, V are written to the dqueries, dkeys, and dvalues buffers.
The cudnnMultiHeadAttnBackwardData() function does not output partial derivatives for residual connections because this result is equal to \(\delta_{out}\). If the multihead attention model enables residual connections sourced directly from Q, then the dout tensor needs to be added to dqueries to obtain the correct result of the latter. This operation is demonstrated in the cuDNN multiHeadAttention sample code.
cudnnStatus_t cudnnMultiHeadAttnBackwardData(
cudnnHandle_t handle,
const cudnnAttnDescriptor_t attnDesc,
const int loWinIdx[],
const int hiWinIdx[],
const int devSeqLengthsDQDO[],
const int devSeqLengthsDKDV[],
const cudnnSeqDataDescriptor_t doDesc,
const void *dout,
const cudnnSeqDataDescriptor_t dqDesc,
void *dqueries,
const void *queries
,
const cudnnSeqDataDescriptor_t dkDesc,
void *dkeys,
const void *keys,
const cudnnSeqDataDescriptor_t dvDesc,
void *dvalues,
const void *values,
size_t weightSizeInBytes,
const void *weights,
size_t workSpaceSizeInBytes,
void *workSpace,
size_t reserveSpaceSizeInBytes,
void *reserveSpace);
The cudnnMultiHeadAttnBackwardData() function must be invoked after cudnnMultiHeadAttnForward(). The loWinIdx[], hiWinIdx[], queries, keys, values, weights, and reserveSpace arguments should be the same as in the cudnnMultiHeadAttnForward() call. devSeqLengthsDQDO[] and devSeqLengthsDKDV[] device arrays should contain the same start and end attention window indices as devSeqLengthsQO[] and devSeqLengthsKV[] arrays in the forward function invocation.
cudnnMultiHeadAttnBackwardData() does not verify that sequence lengths stored in devSeqLengthsDQDO[] and devSeqLengthsDKDV[] contain the same settings as seqLengthArray[] in the corresponding sequence data descriptor.
Parameters
handleInput. The current cuDNN context handle.
attnDescInput. A previously initialized attention descriptor.
loWinIdx[], hiWinIdx[]Input. Two host integer arrays specifying the start and end indices of the attention window for each Q time-step. The start index in K, V sets is inclusive, and the end index is exclusive.
devSeqLengthsDQDO[]Input. Device array containing a copy of the sequence length array from the dqDesc or doDesc sequence data descriptor.
devSeqLengthsDKDV[]Input. Device array containing a copy of the sequence length array from the dkDesc or dvDesc sequence data descriptor.
doDescInput. Descriptor for the \(\delta_{out}\) gradients (vectors of partial derivatives of the loss function with respect to the multihead attention outputs).
doutInput. Pointer to the \(\delta_{out}\) gradient data in the device memory.
dqDescInput. Descriptor for queries and dqueries sequence data.
dqueriesOutput. Device pointer to gradients of the loss function computed with respect to queries vectors.
queriesInput. Pointer to queries data in the device memory. This is the same input as in cudnnMultiHeadAttnForward().
dkDescInput. Descriptor for keys and dkeys sequence data.
dkeysOutput. Device pointer to gradients of the loss function computed with respect to keys vectors.
keysInput. Pointer to keys data in the device memory. This is the same input as in cudnnMultiHeadAttnForward().
dvDescInput. Descriptor for values and dvalues sequence data.
dvaluesOutput. Device pointer to gradients of the loss function computed with respect to values vectors.
valuesInput. Pointer to values data in the device memory. This is the same input as in cudnnMultiHeadAttnForward().
weightSizeInBytesInput. Size of the weight buffer in bytes where all multihead attention trainable parameters are stored.
weightsInput. Address of the weight buffer in the device memory.
workSpaceSizeInBytesInput. Size of the workspace buffer in bytes used for temporary API storage.
workSpaceInput/Output. Address of the workspace buffer in the device memory.
reserveSpaceSizeInBytesInput. Size of the reserve-space buffer in bytes used for data exchange between forward and backward (gradient) API calls.
reserveSpaceInput/Output. Address to the reserve-space buffer in the device memory.
Returns
CUDNN_STATUS_SUCCESSNo errors were detected while processing API input arguments and launching GPU kernels.
CUDNN_STATUS_BAD_PARAMAn invalid or incompatible input argument was encountered.
CUDNN_STATUS_EXECUTION_FAILEDThe process of launching a GPU kernel returned an error, or an earlier kernel did not complete successfully.
CUDNN_STATUS_INTERNAL_ERRORAn inconsistent internal state was encountered.
CUDNN_STATUS_NOT_SUPPORTEDA requested option or a combination of input arguments is not supported.
CUDNN_STATUS_ALLOC_FAILEDInsufficient amount of shared memory to launch a GPU kernel.
cudnnMultiHeadAttnBackwardWeights()#
This function has been deprecated in cuDNN 9.0.
This function computes exact, first-order derivatives of the multihead attention block with respect to its trainable parameters: projection weights and projection biases. If y=F(w) is a vector-valued function that represents the multihead attention layer and it takes some vector \(\chi\epsilon\mathbb{R}^{n}\) of “flatten” weights or biases as an input (with all other parameters and inputs fixed), and outputs vector \(\chi\epsilon\mathbb{R}^{m}\), then cudnnMultiHeadAttnBackwardWeights() computes the result of \(\left(\partial y_{i}/\partial w_{j}\right)^{T} \delta_{out}\) where \(\delta_{out}\) is the mx1 gradient of the loss function with respect to multihead attention outputs. The \(\delta_{out}\) gradient is back propagated through prior layers of the deep learning model. \(\partial y_{i}/\partial w_{j}\) is the mxn Jacobian matrix of F(w). The \(\delta_{out}\) input is supplied via the dout argument.
cudnnStatus_t cudnnMultiHeadAttnBackwardWeights(
cudnnHandle_t handle,
const cudnnAttnDescriptor_t attnDesc,
cudnnWgradMode_t addGrad,
const cudnnSeqDataDescriptor_t qDesc,
const void *queries,
const cudnnSeqDataDescriptor_t kDesc,
const void *keys,
const cudnnSeqDataDescriptor_t vDesc,
const void *values,
const cudnnSeqDataDescriptor_t doDesc,
const void *dout,
size_t weightSizeInBytes,
const void *weights,
void *dweights,
size_t workSpaceSizeInBytes,
void *workSpace,
size_t reserveSpaceSizeInBytes,
void *reserveSpace);
All gradient results with respect to weights and biases are written to the dweights buffer. The size and the organization of the dweights buffer is the same as the weights buffer that holds multihead attention weights and biases. The cuDNN multiHeadAttention sample code demonstrates how to access those weights.
Gradient of the loss function with respect to weights or biases is typically computed over multiple batches. In such a case, partial results computed for each batch should be summed together. The addGrad argument specifies if the gradients from the current batch should be added to previously computed results or the dweights
buffer should be overwritten with the new results.
The cudnnMultiHeadAttnBackwardWeights() function should be invoked after cudnnMultiHeadAttnBackwardData(). The queries, keys, values, weights, and reserveSpace arguments should be the same as in cudnnMultiHeadAttnForward() and cudnnMultiHeadAttnBackwardData() calls. The dout argument should be the same as in cudnnMultiHeadAttnBackwardData().
Parameters
handleInput. The current cuDNN context handle.
attnDescInput. A previously initialized attention descriptor.
addGradInput. Weight gradient output mode.
qDescInput. Descriptor for the query sequence data.
queriesInput. Pointer to queries sequence data in the device memory.
kDescInput. Descriptor for the keys sequence data.
keysInput. Pointer to keys sequence data in the device memory.
vDescInput. Descriptor for the values sequence data.
valuesInput. Pointer to values sequence data in the device memory.
doDescInput. Descriptor for the \(\delta_{out}\) gradients (vectors of partial derivatives of the loss function with respect to the multihead attention outputs).
doutInput. Pointer to the \(\delta_{out}\) gradient vectors in the device memory.
weightSizeInBytesInput. Size of the weights and dweights buffers in bytes.
weightsInput. Address of the weight buffer in the device memory.
dweightsOutput. Address of the weight gradient buffer in the device memory.
workSpaceSizeInBytesInput. Size of the workspace buffer in bytes used for temporary API storage.
workSpaceInput/Output. Address of the workspace buffer in the device memory.
reserveSpaceSizeInBytesInput. Size of the reserve-space buffer in bytes used for data exchange between forward and backward (gradient) API calls.
reserveSpaceInput/Output. Address to the reserve-space buffer in the device memory.
Returns
CUDNN_STATUS_SUCCESSNo errors were detected while processing API input arguments and launching GPU kernels.
CUDNN_STATUS_BAD_PARAMAn invalid or incompatible input argument was encountered.
CUDNN_STATUS_EXECUTION_FAILEDThe process of launching a GPU kernel returned an error, or an earlier kernel did not complete successfully.
CUDNN_STATUS_INTERNAL_ERRORAn inconsistent internal state was encountered.
CUDNN_STATUS_NOT_SUPPORTEDA requested option or a combination of input arguments is not supported.
cudnnMultiHeadAttnForward()#
This function has been deprecated in cuDNN 9.0.
The cudnnMultiHeadAttnForward() function computes the forward responses of the multihead attention layer. When reserveSpaceSizeInBytes=0 and reserveSpace=NULL, the function operates in the inference mode in which backward (gradient) functions are not invoked, otherwise, the training mode is assumed. In the training mode, the reserve space is used to pass intermediate results from cudnnMultiHeadAttnForward() to cudnnMultiHeadAttnBackwardData() and from cudnnMultiHeadAttnBackwardData() to cudnnMultiHeadAttnBackwardWeights().
cudnnStatus_t cudnnMultiHeadAttnForward(
cudnnHandle_t handle,
const cudnnAttnDescriptor_t attnDesc,
int currIdx,
const int loWinIdx[],
const int hiWinIdx[],
const int devSeqLengthsQO[],
const int devSeqLengthsKV[],
const cudnnSeqDataDescriptor_t qDesc,
const void *queries,
const void *residuals,
const cudnnSeqDataDescriptor_t kDesc,
const void *keys,
const cudnnSeqDataDescriptor_t vDesc,
const void *values,
const cudnnSeqDataDescriptor_t oDesc,
void *out,
size_t weightSizeInBytes,
const void *weights,
size_t workSpaceSizeInBytes,
void *workSpace,
size_t reserveSpaceSizeInBytes,
void *reserveSpace);
In the inference mode, the currIdx specifies the time-step or sequence index of the embedding vectors to be processed. In this mode, the user can perform one iteration for time-step zero (currIdx=0), then update Q, K, V vectors and the attention window, and execute the next step (currIdx=1). The iterative process can be repeated for all time-steps.
When all Q time-steps are available (for example, in the training mode or in the inference mode on the encoder side in self-attention), the user can assign a negative value to currIdx and the cudnnMultiHeadAttnForward() API will automatically sweep through all Q time-steps.
The loWinIdx[] and hiWinIdx[] host arrays specify the attention window size for each Q time-step. In a typical self-attention case, the user must include all previously visited embedding vectors but not the current or future vectors. In this situation, the user should set:
currIdx=0: loWinIdx[0]=0; hiWinIdx[0]=0; // initial time-step, no attention window
currIdx=1: loWinIdx[1]=0; hiWinIdx[1]=1; // attention window spans one vector
currIdx=2: loWinIdx[2]=0; hiWinIdx[2]=2; // attention window spans two vectors
(...)
When currIdx is negative in cudnnMultiHeadAttnForward(), the loWinIdx[] and hiWinIdx[] arrays must be fully initialized for all time-steps. When cudnnMultiHeadAttnForward() is invoked with currIdx=0, currIdx=1, currIdx=2, and so on, then the user can update loWinIdx[currIdx] and hiWinIdx[currIdx] elements only before invoking the forward response function. All other elements in the loWinIdx[] and hiWinIdx[] arrays will not be accessed. Any adaptive attention window scheme can be implemented that way.
Use the following settings when the attention window should be the maximum size, for example, in cross-attention:
currIdx=0: loWinIdx[0]=0; hiWinIdx[0]=maxSeqLenK;
currIdx=1: loWinIdx[1]=0; hiWinIdx[1]=maxSeqLenK;
currIdx=2: loWinIdx[2]=0; hiWinIdx
[2]=maxSeqLenK;
(...)
The maxSeqLenK value above should be equal to or larger than dimA[CUDNN_SEQDATA_TIME_DIM] in the kDesc descriptor. A good choice is to use maxSeqLenK=INT_MAX from limits.h.
The actual length of any K sequence defined in seqLengthArray[] in cudnnSetSeqDataDescriptor() can be shorter than maxSeqLenK. The effective attention window span is computed based on seqLengthArray[] stored in the K sequence descriptor and indices held in loWinIdx[] and hiWinIdx[] arrays.
devSeqLengthsQO[] and devSeqLengthsKV[] are pointers to device (not host) arrays with Q, O, and K, V sequence lengths. Note that the same information is also passed in the corresponding descriptors of type cudnnSeqDataDescriptor_t on the host side. The need for extra device arrays comes from the asynchronous nature of cuDNN calls and limited size of the constant memory dedicated to GPU kernel arguments. When the cudnnMultiHeadAttnForward() API returns, the sequence length arrays stored in the descriptors can be immediately modified for the next iteration. However, the GPU kernels launched by the forward call may not have started at this point. For this reason, copies of sequence arrays are needed on the device side to be accessed directly by GPU kernels. Those copies cannot be created inside the cudnnMultiHeadAttnForward() function for very large K, V inputs without the device memory allocation and CUDA stream synchronization.
To reduce the cudnnMultiHeadAttnForward() API overhead, devSeqLengthsQO[] and devSeqLengthsKV[] device arrays are not validated to contain the same settings as seqLengthArray[] in the sequence data descriptors.
Sequence lengths in the kDesc and vDesc descriptors should be the same. Similarly, sequence lengths in the qDesc and oDesc descriptors should match. The user can define six different data layouts in the qDesc, kDesc, vDesc, and oDesc descriptors. Refer to the cudnnSetSeqDataDescriptor() function for the discussion of those layouts. All multihead attention API calls require that the same layout is used in all sequence data descriptors.
In the transformer model, the multihead attention block is tightly coupled with the layer normalization and residual connections. cudnnMultiHeadAttnForward() does not encompass the layer normalization but it can be used to handle residual connections as depicted in the following figure.
Queries and residuals share the same qDesc descriptor in cudnnMultiHeadAttnForward(). When residual connections are disabled, the residuals pointer should be NULL. When residual connections are enabled, the vector length in qDesc should match the vector length specified in the oDesc descriptor, so that a vector addition is feasible.
The queries, keys, and values pointers are not allowed to be NULL, even when K and V are the same inputs or Q, K, V are the same inputs.
Parameters
handleInput. The current cuDNN context handle.
attnDescInput. A previously initialized attention descriptor.
currIdxInput. Time-step in queries to process. When the currIdx argument is negative, all Q time-steps are processed. When currIdx is zero or positive, the forward response is computed for the selected time-step only. The latter input can be used in inference mode only, to process one time-step while updating the next attention window and Q, R, K, V inputs in-between calls.
loWinIdx[], hiWinIdx[]Input. Two host integer arrays specifying the start and end indices of the attention window for each Q time-step. The start index in K, V sets is inclusive, and the end index is exclusive.
devSeqLengthsQO[]Input. Device array specifying sequence lengths of query, residual, and output sequence data.
devSeqLengthsKV[]Input. Device array specifying sequence lengths of key and value input data.
qDescInput. Descriptor for the query and residual sequence data.
queriesInput. Pointer to queries data in the device memory.
residualsInput. Pointer to residual data in device memory. Set this argument to NULL if no residual connections are required.
kDescInput. Descriptor for the keys sequence data.
keysInput. Pointer to keys data in the device memory.
vDescInput. Descriptor for the values sequence data.
valuesInput. Pointer to values data in the device memory.
oDescInput. Descriptor for the multihead attention output sequence data.
outOutput. Pointer to device memory where the output response should be written.
weightSizeInBytesInput. Size of the weight buffer in bytes where all multihead attention trainable parameters are stored.
weightsInput. Pointer to the weight buffer in device memory.
workSpaceSizeInBytesInput. Size of the workspace buffer in bytes used for temporary API storage.
workSpaceInput/Output. Pointer to the workspace buffer in device memory.
reserveSpaceSizeInBytesInput. Size of the reserve-space buffer in bytes used for data exchange between forward and backward (gradient) API calls. This parameter should be zero in the inference mode and non-zero in the training mode.
reserveSpaceInput/Output. Pointer to the reserve-space buffer in device memory. This argument should be NULL in inference mode and non-NULL in the training mode.
Returns
CUDNN_STATUS_SUCCESSNo errors were detected while processing API input arguments and launching GPU kernels.
CUDNN_STATUS_BAD_PARAMAn invalid or incompatible input argument was encountered. Some examples include:
a required input pointer was NULL
currIdx was out of bound
the descriptor value for attention, query, key, value, and output were incompatible with one another
CUDNN_STATUS_EXECUTION_FAILEDThe process of launching a GPU kernel returned an error, or an earlier kernel did not complete successfully.
CUDNN_STATUS_INTERNAL_ERRORAn inconsistent internal state was encountered.
CUDNN_STATUS_NOT_SUPPORTEDA requested option or a combination of input arguments is not supported.
CUDNN_STATUS_ALLOC_FAILEDInsufficient amount of shared memory to launch a GPU kernel.
cudnnRNNBackwardData_v8()#
This function computes exact, first-order derivatives of the RNN model with respect to its inputs: x, hx and for the LSTM cell type also cx. If o = [y, hy, cy] = F(x, hx, cx) = F(z) is a vector-valued function that represents the entire RNN model and it takes vectors x (for all time-steps) and vectors hx, cx (for all layers) as inputs, concatenated into \(\textbf{z}\epsilon\mathbb{R}^{n}\) (network weights and biases are assumed constant), and outputs vectors y, hy, cy concatenated into a vector \(\textbf{o}\epsilon\mathbb{R}^{m}\), then cudnnRNNBackwardData_v8() computes the result of \(\left(\partial o_{i}/\partial z_{j}\right)^{T} \delta_{out}\) where \(\delta_{out}\) is the mx1 gradient of the loss function with respect to all RNN outputs. The \(\delta_{out}\) gradient is back propagated through prior layers of the deep learning model, starting from the model output. \(\partial o_{i}/\partial z_{j}\) is the mxn Jacobian matrix of F(z). The \(\delta_{out}\) input is supplied via the dy, dhy, and dcy arguments and gradient results \(\left(\partial o_{i}/\partial z_{j}\right)^{T} \delta_{out}\) are written to the dx, dhx, and dcx buffers.