Release Notes
Updates in CUDA 12.4
New Features
Added tracing support for applications using Green contexts. Added two new fields
isGreenContext
andparentContextId
in the context activity record. The activity recordCUpti_ActivityContext
is deprecated and it is replaced by a new activity recordCUpti_ActivityContext2
.CUDA API calls are completed asynchronously from the perspective of the host CPU. This is accomplished by queuing the work slated for the GPU into a structure known as a command buffer. If there is insufficient space available in the command buffer when attempting to call a CUDA API, the host call will block until space becomes available. The user should be able to identify when this situation occurs. This is indicated using the new attribute
CUPTI_ACTIVITY_OVERHEAD_COMMAND_BUFFER_FULL
added in the activity overhead enumCUpti_ActivityOverheadKind
. To provide additional details about the overhead, a new fieldoverheadData
is added in the overhead activity record. Activity recordCUpti_ActivityOverhead2
is deprecated and it is replaced by the new activity recordCUpti_ActivityOverhead3
.Added process ID and thread ID in the JIT activity record. To accomodate this change, activity record
CUpti_ActivityJit
is deprecated and it is replaced by a new activity recordCUpti_ActivityJit2
.To correlate the sampling data for a kernel with the launch API in the serial mode of the PC Sampling APIs, a new field
correlationId
is added in the structCUpti_PCSamplingPCData
.For PC Sampling APIs, total (smsp__pcsamp_sample_count) and dropped (smsp__pcsamp_samples_data_dropped) sample counts are collected by default.
Resolved Issues
Fixed the issue for overhead records showing the default thread ID than the one requested using the API
cuptiSetThreadIdType()
.Fixed instruction level SASS metrics profiling for CUDA Graph applications.
When a device graph is first launched from the device and it is not launched from the host earlier, end timestamp could be 0 for graph-level tracing on Ampere and later GPU architectures. This issue is fixed.
Updates in CUDA 12.3 Update 1
Resolved Issues
To provide normalized timestamps for all activities, CUPTI uses linear interpolation for conversion from GPU timestamps to CPU timestamps. This was broken with CUDA 12.3 causing spurious gaps or overlap on Tegra platforms. Fixed the issue.
Updates in CUDA 12.3
New Features
New attributes
CUPTI_ACTIVITY_OVERHEAD_RUNTIME_TRIGGERED_MODULE_LOADING
andCUPTI_ACTIVITY_OVERHEAD_LAZY_FUNCTION_LOADING
are added in the activity overhead enumCUpti_ActivityOverheadKind
to provide the overhead information for CUDA runtime triggered module loading and lazy function loading respectively.New API
cuptiGetGraphExecId
provides the unique ID of the executable graph.Added support for collecting graph level trace for device launched graphs. A new API
cuptiActivityEnableDeviceGraph
is added to enable the collection of records for device launched graphs.CUDA Graphs can be executed on multiple devices i.e. the root node could be launched on one device and the leaf node could be launched on the another device. New fields
endDeviceId
andendContextId
are added to identify the ids of device and context respectively which are used to execute the last node of the graph. To accomodate this change, activity recordCUpti_ActivityGraphTrace
is deprecated and it is replaced by a new activity recordCUpti_ActivityGraphTrace2
.Added WSL profiling support on Windows 10 WSL with OS build version 19044 and greater. WSL profiling is not supported on Windows 10 WSL for systems that exceed 1 TB of system memory.
Several performance improvements are done in the tracing path. One of the key improvements is to allow clients to request CUPTI to maintain the activity buffers at the thread level instead of global buffers. This can be achieved by setting the option
CUPTI_ACTIVITY_ATTR_PER_THREAD_ACTIVITY_BUFFER
of the enumCUpti_ActivityAttribute
. This can help in reducing the collection overhead for applications which launch CUDA activities from multiple host threads.Frame pointers are enabled for Linux x86_64 libraries.
The deprecated Activity APIs and structures have been moved to a new header cupti_activity_deprecated.h, which is included in the header cupti_activity.h. Header cupti_activity.h contains only the latest version of APIs and structures.
CUPTI no longer uses profiling semaphore pool to store the profiling data. Coresponding attributes
CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE
,CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_LIMIT
andCUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_PRE_ALLOCATE_VALUE
have been deprecated.
Resolved Issues
Fixed SASS metric profiling for cuda graph.
Fixed race condition in the API
cuptiSetThreadIdType
for late subscription.
Updates in CUDA 12.2 Update 2
New Features
SASS Metric APIs introduced in the CUDA 12.2 GA release are transitioning from the beta to the production release.
Added support for collecting SASS metrics for CUDA Graphs which are launched from host.
Added a new field
numOfDroppedRecords
in the structCUpti_SassMetricsDisable_Params
to indicate the number of dropped records when SASS data is flushed prior to calling the disable API.
Added a new field
api
in the structCUpti_Profiler_DeviceSupported_Params
which can be used to check the configuration support level for profiler APIs like Profiling, PC Sampling and SASS Metric APIs.
Resolved Issues
Fixed the tracing and profiling support for the GA103 GPU.
Fixed a hang which can occur when activity buffer gets full while collecting the sampling data using the PC Sampling Activity API.
Fixed the issue of incorrect timestamps for graph level trace when a graph node is disabled using the APIs
cuGraphNodeSetEnabled
orcudaGraphNodeSetEnabled
.
Updates in CUDA 12.2 Update 1
Support for Confidential Computing
CUPTI supports some APIs while running in CC-devtools mode:
Callback API
Activity API
The profiling APIs are not supported in CC-devtools mode with this release. Using these APIs should return an error indicating the configuration is not supported:
Profiling API
PC Sampling API
Checkpoint API
SASS Metrics API
Additionally, CUPTI is not supported at all in full CC mode. CC-devtools mode must be used for tools support. Some CUDA APIs are not supported or behave differently when running in CC or CC-devtools mode; notably, host pinned memory requests will be traced as managed memory requests, and any CUDA memcopies on these converted pointers are traced as Device to Device copies irrespective of the locality of the source or destination pointers. For details on how to configure CC or CC-devtools mode, system and software requirements, as well as documentation on CUDA API changes, please see the confidential compute release documentation at https://docs.nvidia.com/confidential-computing/.
Resolved Issues
Fixed timestamps for graph-level tracing for CUDA graphs running across multiple GPUs.
Fixed a potential hang when CUPTI is unable to fetch attributes for an activity.
Updates in CUDA 12.2
New Features
A new set of CUPTI APIs for collection of SASS metric data at the source level are provided in the header file
cupti_sass_metrics.h
. These support a larger set of metrics compared to the CUPTI Activity APIs for source-level analysis. SASS to source correlation can be done in the offline mode, similar to the PC sampling APIs. Hence the runtime overhead during data collection is lower. Refer to the section CUPTI SASS Metrics API for more details. Please note that this is a Beta feature, interface and functionality are subject to change in a future release.CUPTI now reports fatal errors, non-fatal errors and warnings instantaneously through callbacks. A new callback domain
CUPTI_CB_DOMAIN_STATE
is added for subscribing to the instantaneous error reporting. Corresponding callback ids are provided in the structCUpti_CallbackIdState
.Added support for profiling of device graphs and host graphs that launch device graphs. There are some known limitations, please refer to the Known Issues section for details.
Change in the stream attribute value is communicated by issuing the resource callback. Refer to the struct
CUpti_StreamAttrData
and callback idCUPTI_CBID_RESOURCE_STREAM_ATTRIBUTE_CHANGED
added in the enumCUpti_CallbackIdResource
.New API
cuptiGetErrorMessage
provides descriptive message for CUPTI error codes.Removed the deprecated API
cuptiDeviceGetTimestamp
from the headercupti_events.h
.Added metrics for Tensor core operations to count different types of tensor instructions. These metrics are named as
sm[sp]__ops_path_tensor_src_{src}[_dst_{dst}[_sparsity_{on,off}]].
These are available for devices with compute capability 7.0 and higher, except for Turing TU11x GPUs.
Resolved Issues
Fixed crash for the graph-level trace for device graphs which are launched from the host.
Updates in CUDA 12.1 Update 1
Resolved Issues
Fixed CUPTI tracing failure when just-in-time compilation of embedded PTX code is disabled using the environment variable
CUDA_DISABLE_PTX_JIT
.Fixed a crash in the API
cuptiFinalize
.
Updates in CUDA 12.1
New Features
Field
wsl
is added in the structCUpti_Profiler_DeviceSupported_Params
to indicate whether Profiling API is supported on Windows Subsystem for Linux (WSL) system or not.
Updates in CUDA 12.0 Update 1
Resolved Issues
Reduced the host memory overhead by avoiding caching copies of cubin images at the time of loading CUDA modules. Copies of cubin images are now created only when profiling features that need it are enabled.
By default CUPTI switches back to the device memory, instead of the pinned host memory, for allocation of the profiling buffer for concurrent kernel tracing. This might help in improving the performance of the tracing run. Memory location can be controlled using the attribute
CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED
of the activity attribute enumCUpti_ActivityAttribute
.CUPTI now captures the
cudaGraphLaunch
API and its kernels when CUPTI is attached after the graph is instantiated using the APIcudaGraphInstantiate
but it is attached before the graph is launched using the APIcudaGraphLaunch
. Some data in the kernel record would be missing i.e.cacheConfig
,sharedMemoryExecuted
,partitionedGlobalCacheRequested
,partitionedGlobalCacheExecuted
,sharedMemoryCarveoutRequested
etc. This fix requires the matching CUDA driver which ships with the CUDA 12.0 Update 1 release.
Updates in CUDA 12.0
New Features
Added new fields
maxPotentialClusterSize
andmaxActiveClusters
to help in calculating the cluster occupancy correctly. These fields are valid for devices with compute capability 9.0 and higher. To accomodate this change, activity recordCUpti_ActivityKernel8
is deprecated and replaced by a new activity recordCUpti_ActivityKernel9
.Enhancements for PC Sampling APIs:
CUPTI creates few worker threads to offload certain operations like decoding of the hardware data to the CUPTI PC sampling data and correlation of the PC data to the SASS instructions. CUPTI wakes up these threads periodically. To control the sleep time of the worker threads, a new attribute
CUPTI_PC_SAMPLING_CONFIGURATION_ATTR_TYPE_WORKER_THREAD_PERIODIC_SLEEP_SPAN
is added in the enumCUpti_PCSamplingConfigurationAttributeType
.Improved error reporting for hardware buffer overflow. When hardware buffer overflows, CUPTI returns the out of memory error code. And a new field
hardwareBufferFull
added in the structCUpti_PCSamplingData
is set to differentiate it from other out of memory cases. User can either increase the hardware buffer size or flush the hardware buffer at a higher frequency to avoid overflow.
Profiling APIs are supported on Windows Subsystem for Linux (WSL) with WSL version 2, NVIDIA display driver version 525 or higher and Windows 11.
CUPTI support for Kepler GPUs is dropped in CUDA Toolkit 12.0.
Resolved Issues
Removed minor CUDA version from the SONAME of the CUPTI shared library for compatibility reasons. For example, SONAME of CUPTI library is libcupti.so.12 instead of libcupti.so.12.0 in CUDA 12.0 release.
Activity kinds
CUPTI_ACTIVITY_KIND_MARKER
andCUPTI_ACTIVITY_KIND_MARKER_DATA
can be enabled together.
Updates in CUDA 11.8
New Features
CUPTI adds tracing and profiling support for Hopper and Ada Lovelace GPU families.
Added new fields
clusterX
,clusterY
,clusterZ
andclusterSchedulingPolicy
to output the Thread Block Cluster dimensions and scheduling policy. These fields are valid for devices with compute capability 9.0 and higher. To accomodate this change, activity recordCUpti_ActivityKernel7
is deprecated and replaced by a new activity recordCUpti_ActivityKernel8
.A new activity kind
CUPTI_ACTIVITY_KIND_JIT
and corresponding activity recordCUpti_ActivityJit
are introduced to capture the overhead involved in the JIT (just-in-time) compilation and caching of the PTX or NVVM IR code to the binary code. New record also provides the information about the size and path of the compute cache where the binary code is stored.PC Sampling API is supported on Tegra platforms - QNX, Linux (aarch64) and Linux (x86_64) (Drive SDK).
Resolved Issues
Resolved an issue that might cause crash when the size of the device buffer is changed, using the attribute
CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_SIZE
, after creation of the CUDA context.
Updates in CUDA 11.7 Update 1
Resolved Issues
Resolved an issue for PC Sampling API
cuptiPCSamplingGetData
which might not always return all the samples when called after the PC sampling range defined by using the APIscuptiPCSamplingStart
andcuptiPCSamplingStop
. Remaining samples were delivered in the successive call of the APIcuptiPCSamplingGetData
after the next range.Disabled tracing of nodes in the CUDA Graph when user enables tracing at the Graph level using the activity kind
CUPTI_ACTIVITY_KIND_GRAPH_TRACE
.Fixed missing
channelID
andchannelType
information for kernel records. Earlier these fields were populated for CUDA Graph launches only.
Updates in CUDA 11.7
New Features
A new activity kind
CUPTI_ACTIVITY_KIND_GRAPH_TRACE
and activity recordCUpti_ActivityGraphTrace
are introduced to represent the execution for a graph without giving visibility about the execution of its nodes. This is intended to reduce overheads involved in tracing each node separately. This activity can only be enabled for drivers of version 515 and above.A new API
cuptiActivityEnableAndDump
is added to provide snapshot of certain activities like device, context, stream, NVLink and PCIe at any point during the profiling session.Added sample cupti_correlation to show correlation between CUDA APIs and corresponding GPU activities.
Added sample cupti_trace_injection to show how to build an injection library using the activity and callback APIs which can be used to trace any CUDA application.
Resolved Issues
Fixed corruption in the function name for PC Sampling API records.
Fixed incorrect timestamps for GPU activities when user calls the API
cuptiActivityRegisterTimestampCallback
in the late CUPTI attach scenario.Fixed incomplete records for device to device memcopies in the late CUPTI attach scenario. This issue manifests mainly when application has a mix of CUDA graph and normal kernel launches.
Updates in CUDA 11.6 Update 1
Resolved Issues
Fixed hang for the PC Sampling API
cuptiPCSamplingStop
. This issue is seen for the PC sampling start and stop resulting in generation of large number of sampling records.Fixed timing issue for specific device to device memcpy operations.
Updates in CUDA 11.6
New Features
Two new fields
channelID
andchannelType
are added in the activity records for kernel, memcpy, peer-to-peer memcpy and memset to output the ID and type of the hardware channel on which these activities happen. Activity recordsCUpti_ActivityKernel6
,CUpti_ActivityMemcpy4
,CUpti_ActivityMemcpyPtoP3
andCUpti_ActivityMemset3
are deprecated and replaced by new activity recordsCUpti_ActivityKernel7
,CUpti_ActivityMemcpy5
,CUpti_ActivityMemcpyPtoP4
andCUpti_ActivityMemset4
.New fields
isMigEnabled
,gpuInstanceId
,computeInstanceId
andmigUuid
are added in the device activity record to provide MIG information for the MIG enabled GPU. Activity recordCUpti_ActivityDevice3
is deprecated and replaced by a new activity recordCUpti_ActivityDevice4
.A new field
utilizedSize
is added in the memory pool and memory activity record to provide the utilized size of the memory pool. Activity recordCUpti_ActivityMemoryPool
andCUpti_ActivityMemory2
are deprecated and replaced by a new activity recordCUpti_ActivityMemoryPool2
andCUpti_ActivityMemory3
respectively.API
cuptiActivityRegisterTimestampCallback
and callback functionCUpti_TimestampCallbackFunc
are added to register a callback function to obtain timestamp of user’s choice instead of using CUPTI provided timestamp in activity records.Profiling API supports profiling OptiX application.
Resolved Issues
Fixed multi-pass metric collection using the Profiling API in the auto range and kernel replay mode for Cuda Graph.
Fixed the performance issue for the PC sampling API
cuptiPCSamplingStop
.Fixed corruption in variable names for OpenACC activity records.
Fixed corruption in the fields of the struct
memoryPoolConfig
in the activity recordCUpti_ActivityMemory3
.Filled the fields of the struct
memoryPoolConfig
in the activity recordCUpti_ActivityMemory3
when a memory pointer allocated via memory pool is released usingcudaFree
CUDA API.
Updates in CUDA 11.5 Update 1
Resolved Issues
Resolved an issue that causes incorrect range name for NVTX event attributes. The issue was introduced in CUDA 11.4.
Made NVTX initialization APIs
InitializeInjectionNvtx
andInitializeInjectionNvtx2
thread-safe.
Updates in CUDA 11.5
New Features
A new API
cuptiProfilerDeviceSupported
is introduced to expose overall Profiling API support and specific requirements for a given device. Profiling API must be initialized by callingcuptiProfilerInitialize
before testing device support.PC Sampling struct
CUpti_PCSamplingData
introduces a new fieldnonUsrKernelsTotalSamples
to provide information about the number of samples collected for all non-user kernels.Activity record
CUpti_ActivityDevice2
for device information has been deprecated and replaced by a new activity recordCUpti_ActivityDevice3
. New record adds a flagisCudaVisible
to indicate whether device is visible to CUDA.Activity record
CUpti_ActivityNvLink3
for NVLink information has been deprecated and replaced by a new activity recordCUpti_ActivityNvLink4
. New record can accommodate NVLink port information upto a maximum of 32 ports.A new CUPTI Checkpoint API is introduced, enabling automatic saving and restoring of device state, and facilitating development of kernel replay tools. This is helpful for User Replay mode of the CUPTI Profiling API, but is not limited to use with CUPTI.
Tracing is supported on the Windows Subsystem for Linux version 2 (WSL 2).
CUPTI is not supported on NVIDIA Crypto Mining Processors (CMP). A new error code
CUPTI_ERROR_CMP_DEVICE_NOT_SUPPORTED
is introduced to indicate it.
Resolved Issues
Resolved an issue that causes crash for tracing of device to device memcopy operations.
Resolved an issue that causes crash for OpenACC activity when it is enabled before other activities.
Updates in CUDA 11.4 Update 1
Resolved Issues
Resolved serialization of CUDA Graph launches for applications which use multiple threads to launch work.
Previously, for applications that use CUDA Dynamic Parallelism (CDP), CUPTI detects the presence of the CDP kernels in the CUDA module. Even if CDP kernels are not called, it fails to trace the application. There is a change in the behavior, CUPTI now traces all the host launched kernels until it encounters a host launched kernel which launches child kernels. Subsequent kernels are not traced.
Updates in CUDA 11.4
New Features
Profiling APIs support profiling of the CUDA kernel nodes launched by a CUDA Graph. Auto range profiling with kernel replay mode and user range profiling with user replay and application replay modes are supported. Other combinations of range profiling and replay modes are not supported.
Added support for tracing and profiling on NVIDIA virtual GPUs (vGPUs) on an upcoming GRID/vGPU release.
Added sample profiling_injection to show how to build injection library using the Profiling API.
Added sample concurrent_profiling to show how to retain the kernel concurrency across streams and devices using the Profiling API.
Resolved Issues
Resolved the issue of not tracing the device to device memcopy nodes in a CUDA Graph.
Fixed the issue of reporting zero size for local memory pool for mempool creation record.
Resolved the issue of non-collection of samples for the default CUDA context for PC Sampling API.
Enabled tracking of all domains and registered strings in NVTX irrespective of whether the NVTX activity kind or callbacks are enabled. This state tracking is needed for proper working of the tool which creates these NVTX objects before enabling the NVTX activity kind or callback.
Updates in CUDA 11.3
New Features
A new set of CUPTI APIs for PC sampling data collection are provided in the header file
cupti_pcsampling.h
which support continuous mode data collection without serializing kernel execution and have a lower runtime overhead. Along with these a utility library is provided in the header filecupti_pcsampling_util.h
which has APIs for GPU assembly to CUDA-C source correlation and for reading and writing the PC sampling data from/to files. Refer to the section CUPTI PC Sampling API for more details.Enum
CUpti_PcieGen
is extended to include PCIe Gen 5.The following functions are deprecated and will be removed in a future release:
Struct
NVPA_MetricsContext
and related APIsNVPW_MetricsContext_*
from the headernvperf_host.h
. It is recommended to use the structNVPW_MetricsEvaluator
and related APIsNVPW_MetricsEvaluator_*
instead. Profiling API samples have been updated to show how to use these APIs.cuptiDeviceGetTimestamp
from the headercupti_events.h
.
Resolved Issues
Overhead reduction for tracing of CUDA memcopies.
To provide normalized timestamps for all activities, CUPTI uses linear interpolation for conversion from GPU timestamps to CPU timestamps. This method can cause spurious gaps or overlap on the timeline. CUPTI improves the conversion function to provide more precise timestamps.
Generate overhead activity record for semaphore pool allocation.
Updates in CUDA 11.2
New Features
A new activity kind
CUPTI_ACTIVITY_KIND_MEMORY_POOL
and activity recordCUpti_ActivityMemoryPool
are introduced to represent the creation, destruction and trimming of a memory pool. EnumCUpti_ActivityMemoryPoolType
lists types of memory pool.A new activity kind
CUPTI_ACTIVITY_KIND_MEMORY2
and activity recordCUpti_ActivityMemory2
are introduced to provide separate records for memory allocation and release operations. This helps in correlation of records of these operations to the corresponding CUDA APIs, which otherwise is not possible using the existing activity recordCUpti_ActivityMemory
which provides a single record for both the memory operations.Added a new pointer field of type
CUaccessPolicyWindow
in the kernel activity record to provide the access policy window which specifies a contiguous region of global memory and a persistence property in the L2 cache for accesses within that region. To accomodate this change, activity recordCUpti_ActivityKernel5
is deprecated and replaced by a new activity recordCUpti_ActivityKernel6
. This attribute is not collected by default. To control the collection of launch attributes, a new APIcuptiActivityEnableLaunchAttributes
is introdcued.New attributes
CUPTI_ACTIVITY_ATTR_DEVICE_BUFFER_PRE_ALLOCATE_VALUE
andCUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_PRE_ALLOCATE_VALUE
are added in the activity attribute enumCUpti_ActivityAttribute
to set and get the number of device buffers and profiling semaphore pools which are preallocated for the context.CUPTI now allocates profiling buffer for concurrent kernel tracing in the pinned host memory in place of device memory. This might help in improving the performance of the tracing run. Memory location can be controlled using the attribute
CUPTI_ACTIVITY_ATTR_MEM_ALLOCATION_TYPE_HOST_PINNED
of the activity attribute enumCUpti_ActivityAttribute
.The compiler generated line information for inlined functions is improved due to which CUPTI can associate inlined functions with the line information of the function call site that has been inlined.
Removed support for NVLink performance metrics (
nvlrx__*
andnvltx__*
) from the Profiling API due to a potential application hang during data collection. The metrics will be added back in a future CUDA release.
Resolved Issues
Execution overheads introduced by CUPTI in the tracing path is reduced.
For the concurrent kernel activity kind
CUPTI_ACTIVITY_KIND_CONCURRENT_KERNEL
, CUPTI instruments the kernel code to collect the timing information. Previously, every kernel in the CUDA module was instrumented, thus the overhead is proportional to the number of different kernels in the module. This is a static overhead which happens at the time of loading the CUDA module. To reduce this overhead, kernels are not instrumented at the module load time, instead a single instrumentation code is generated at the time of loading the CUDA module and it is applied to each kernel during the kernel execution, thus avoiding most of the static overhead at the CUDA module load time.
Updates in CUDA 11.1
New Features
CUPTI adds tracing and profiling support for the NVIDIA Ampere GPUs with compute capability 8.6.
Added a new field
graphId
in the activity records for kernel, memcpy, peer-to-peer memcpy and memset to output the unique ID of the CUDA graph that launches the activity through CUDA graph APIs. To accomodate this change, activity recordsCUpti_ActivityMemcpy3
,CUpti_ActivityMemcpyPtoP2
andCUpti_ActivityMemset2
are deprecated and replaced by new activity recordsCUpti_ActivityMemcpy4
,CUpti_ActivityMemcpyPtoP3
andCUpti_ActivityMemset3
. And kernel activity recordCUpti_ActivityKernel5
replaces the padding field withgraphId
. Added a new APIcuptiGetGraphId
to query the unique ID of the CUDA graph.Added a new API
cuptiActivityFlushPeriod
to set the flush period for the worker thread.Added support for profiling cooperative kernels using Profiling APIs.
Added NVLink performance metrics (nvlrx__* and nvltx__*) using the Profiling APIs. These metrics are available on devices with compute capability 7.0, 7.5 and 8.0, and these can be collected at the context level. Refer to the table Metrics Mapping Table for mapping between earlier CUPTI metrics and the Perfworks NVLink metrics for devices with compute capability 7.0.
Resolved Issues
Resolved an issue that causes CUPTI to not return full and completed activity buffers for a long time, CUPTI now attempts to return buffers early.
To reduce the runtime overhead, CUPTI wakes up the worker thread based on certain heuristics instead of waking it up at a regular interval. New API
cuptiActivityFlushPeriod
can be used to control the flush period of the worker thread. This setting overrides the CUPTI heurtistics.
Updates in CUDA 11.0
New Features
CUPTI adds tracing and profiling support for devices with compute capability 8.0 i.e. NVIDIA A100 GPUs and systems that are based on A100.
Enhancements for CUDA Graph:
Support to correlate the CUDA Graph node with the GPU activities: kernel, memcpy, memset.
Added a new field
graphNodeId
for Node Id in the activity records for kernel, memcpy, memset and P2P transfers. Activity recordsCUpti_ActivityKernel4
,CUpti_ActivityMemcpy2
,CUpti_ActivityMemset
andCUpti_ActivityMemcpyPtoP
are deprecated and replaced by new activity recordsCUpti_ActivityKernel5
,CUpti_ActivityMemcpy3
,CUpti_ActivityMemset2
andCUpti_ActivityMemcpyPtoP2
.graphNodeId
is the unique ID for the graph node.graphNodeId
can be queried using the new CUPTI APIcuptiGetGraphNodeId()
.Callback
CUPTI_CBID_RESOURCE_GRAPHNODE_CREATED
is issued between a pair of the API enter and exit callbacks.
Introduced new callback
CUPTI_CBID_RESOURCE_GRAPHNODE_CLONED
to indicate the cloning of the CUDA Graph node.Retain CUDA driver performance optimization in case memset node is sandwiched between kernel nodes. CUPTI no longer disables the conversion of memset nodes into kernel nodes for CUDA graphs.
Added support for cooperative kernels in CUDA graphs.
Added support to trace Optix applications. Refer the Optix Profiling section.
CUPTI overhead is associated with the thread rather than process. Object kind of the overhead record
CUpti_ActivityOverhead
is switched toCUPTI_ACTIVITY_OBJECT_THREAD
.Added error code
CUPTI_ERROR_MULTIPLE_SUBSCRIBERS_NOT_SUPPORTED
to indicate the presense of another CUPTI subscriber. APIcuptiSubscribe()
returns the new error code thanCUPTI_ERROR_MAX_LIMIT_REACHED
.Added a new enum
CUpti_FuncShmemLimitConfig
to indicate whether user has opted in for maximun dynamic shared memory size on devices with compute capability 7.x by using function attributesCU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES
orcudaFuncAttributeMaxDynamicSharedMemorySize
with CUDA driver and runtime respectively. FieldshmemLimitConfig
in the kernel activity recordCUpti_ActivityKernel5
shows the user choice. This helps in correct occupancy calulation. ValueFUNC_SHMEM_LIMIT_OPTIN
in the enumcudaOccFuncShmemConfig
is the corresponding option in the CUDA occupancy calculator.
Resolved Issues
Resolved an issue that causes incorrect or stale timing for memcopy and serial kernel activities.
Overhead for PC Sampling Activity APIs is reduced by avoiding the reconfiguration of the GPU when PC sampling period doesn’t change between successive kernels. This is applicable for devices with compute capability 7.0 and higher.
Fixed issues in the API
cuptiFinalize()
including the issue which may cause the application to crash. This API provides ability for safe and full detach of CUPTI during the execution of the application. More details in the section Dynamic Detach.
Updates in CUDA 10.2
New Features
CUPTI allows tracing features for non-root and non-admin users on desktop platforms. Note that events and metrics profiling is still restricted for non-root and non-admin users. More details about the issue and the solutions can be found on this web page.
CUPTI no longer turns off the performance characteristics of CUDA Graph when tracing the application.
CUPTI now shows memset nodes in the CUDA graph.
Fixed the incorrect timing issue for the asynchronous cuMemset/cudaMemset activity.
Several performance improvements are done in the tracing path.
Updates in CUDA 10.1 Update 2
New Features
This release is focused on bug fixes and stability of the CUPTI.
A security vulnerability issue required profiling tools to disable all the features for non-root or non-admin users. As a result, CUPTI cannot profile the application when using a Windows 419.17 or Linux 418.43 or later driver. More details about the issue and the solutions can be found on this web page.
Updates in CUDA 10.1 Update 1
New Features
Support for the IBM POWER platform is added for the
Profiling APIs in the header
cupti_profiler_target.h
Perfworks metric APIs in the headers
nvperf_host.h
andnvperf_target.h
Updates in CUDA 10.1
New Features
This release is focused on bug fixes and performance improvements.
The new set of profiling APIs and Perfworks metric APIs which were introduced in the CUDA Toolkit 10.0 are now integrated into the CUPTI library distributed in the CUDA Toolkit. Refer to the sections CUPTI Profiling API and Perfworks Metric APIs for documentation of the new APIs.
Event collection mode
CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS
is now supported on all device classes including Geforce and Quadro.Support for the NVTX string registration API
nvtxDomainRegisterStringA().
Added enum
CUpti_PcieGen
to list PCIe generations.
Updates in CUDA 10.0
New Features
Added tracing support for devices with compute capability 7.5.
A new set of metric APIs are added for devices with compute capability 7.0 and higher. These provide low and deterministic profiling overhead on the target system. These APIs are currently supported only on Linux x86 64-bit and Windows 64-bit platforms. Refer to the CUPTI web page for documentation and details to download the package with support for these new APIs. Note that both the old and new metric APIs are supported for compute capability 7.0. This is to enable transition of code to the new metric APIs. But one cannot mix the usage of the old and new metric APIs.
CUPTI supports profiling of OpenMP applications. OpenMP profiling information is provided in the form of new activity records
CUpti_ActivityOpenMp
. New APIcuptiOpenMpInitialize
is used to initialize profiling for supported OpenMP runtimes.Activity record for kernel
CUpti_ActivityKernel4
provides shared memory size set by the CUDA driver.Tracing support for CUDA kernels, memcpy and memset nodes launched by a CUDA Graph.
Added support for resource callbacks for resources associated with the CUDA Graph. Refer enum
CUpti_CallbackIdResource
for new callback IDs.
Updates in CUDA 9.2
New Features
Added support to query PCI devices information which can be used to construct the PCIe topology. See activity kind
CUPTI_ACTIVITY_KIND_PCIE
and related activity recordCUpti_ActivityPcie
.To view and analyze bandwidth of memory transfers over PCIe topologies, new set of metrics to collect total data bytes transmitted and recieved through PCIe are added. Those give accumulated count for all devices in the system. These metrics are collected at the device level for the entire application. And those are made available for devices with compute capability 5.2 and higher.
CUPTI added support for new metrics:
Instruction executed for different types of load and store
Total number of cached global/local load requests from SM to texture cache
Global atomic/non-atomic/reduction bytes written to L2 cache from texture cache
Surface atomic/non-atomic/reduction bytes written to L2 cache from texture cache
Hit rate at L2 cache for all requests from texture cache
Device memory (DRAM) read and write bytes
The utilization level of the multiprocessor function units that execute tensor core instructions for devices with compute capability 7.0
A new attribute
CUPTI_EVENT_ATTR_PROFILING_SCOPE
is added under enumCUpti_EventAttribute
to query the profiling scope of a event. Profiling scope indicates if the event can be collected at the context level or device level or both. See EnumCUpti_EventProfilingScope
for avaiable profiling scopes.A new error code
CUPTI_ERROR_VIRTUALIZED_DEVICE_NOT_SUPPORTED
is added to indicate that tracing and profiling on virtualized GPU is not supported.
Updates in CUDA 9.1
New Features
Added a field for correlation ID in the activity record
CUpti_ActivityStream
.
Updates in CUDA 9.0
New Features
CUPTI extends tracing and profiling support for devices with compute capability 7.0.
Usage of compute device memory can be tracked through CUPTI. A new activity record
CUpti_ActivityMemory
and activity kindCUPTI_ACTIVITY_KIND_MEMORY
are added to track the allocation and freeing of memory. This activity record includes fields like virtual base address, size, PC (program counter), timestamps for memory allocation and free calls.Unified memory profiling adds new events for thrashing, throttling, remote map and device-to-device migration on 64 bit Linux platforms. New events are added under enum
CUpti_ActivityUnifiedMemoryCounterKind
. EnumCUpti_ActivityUnifiedMemoryRemoteMapCause
lists possible causes for remote map events.PC sampling supports wide range of sampling periods ranging from 2^5 cycles to 2^31 cycles per sample. This can be controlled through new field
samplingPeriod2
in the PC sampling configuration structCUpti_ActivityPCSamplingConfig
.Added API
cuptiDeviceSupported()
to check support for a compute device.Activity record
CUpti_ActivityKernel3
for kernel execution has been deprecated and replaced by new activity recordCUpti_ActivityKernel4
. New record gives information about queued and submit timestamps which can help to determine software and hardware latencies associated with the kernel launch. These timestamps are not collected by default. Use APIcuptiActivityEnableLatencyTimestamps()
to enable collection. New fieldlaunchType
of typeCUpti_ActivityLaunchType
can be used to determine if it is a cooperative CUDA kernel launch.Activity record
CUpti_ActivityPCSampling2
for PC sampling has been deprecated and replaced by new activity recordCUpti_ActivityPCSampling3
. New record accomodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.Activity record
CUpti_ActivityNvLink
for NVLink attributes has been deprecated and replaced by new activity recordCUpti_ActivityNvLink2
. New record accomodates increased port numbers between two compute devices.Activity record
CUpti_ActivityGlobalAccess2
for source level global accesses has been deprecated and replaced by new activity recordCUpti_ActivityGlobalAccess3
. New record accomodates 64-bit PC Offset supported on devices of compute capability 7.0 and higher.New attributes
CUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_SIZE
andCUPTI_ACTIVITY_ATTR_PROFILING_SEMAPHORE_POOL_LIMIT
are added in the activity attribute enumCUpti_ActivityAttribute
to set and get the profiling semaphore pool size and the pool limit.
Updates in CUDA 8.0
New Features
Sampling of the program counter (PC) is enhanced to point out the true latency issues, it indicates if the stall reasons for warps are actually causing stalls in the issue pipeline. Field
latencySamples
of new activity recordCUpti_ActivityPCSampling2
provides true latency samples. This field is valid for devices with compute capability 6.0 and higher. See section PC Sampling for more details.Support for NVLink topology information such as the pair of devices connected via NVLink, peak bandwidth, memory access permissions etc is provided through new activity record
CUpti_ActivityNvLink
. NVLink performance metrics for data transmitted/received, transmit/receive throughput and respective header overhead for each physical link. See section NVLink for more details.CUPTI supports profiling of OpenACC applications. OpenACC profiling information is provided in the form of new activity records
CUpti_ActivityOpenAccData
,CUpti_ActivityOpenAccLaunch
andCUpti_ActivityOpenAccOther
. This aids in correlating OpenACC constructs on the CPU with the corresponding activity taking place on the GPU, and mapping it back to the source code. New APIcuptiOpenACCInitialize
is used to initialize profiling for supported OpenACC runtimes. See section OpenACC for more details.Unified memory profiling provides GPU page fault events on devices with compute capability 6.0 and 64 bit Linux platforms. Enum
CUpti_ActivityUnifiedMemoryAccessType
lists memory access types for GPU page fault events and enumCUpti_ActivityUnifiedMemoryMigrationCause
lists migration causes for data transfer events.Unified Memory profiling support is extended to Mac platform.
Support for 16-bit floating point (FP16) data format profiling. New metrics inst_fp_16, flop_count_hp_add, flop_count_hp_mul, flop_count_hp_fma, flop_count_hp, flop_hp_efficiency, half_precision_fu_utilization are supported. Peak FP16 flops per cycle for device can be queried using the enum
CUPTI_DEVICE_ATTR_FLOP_HP_PER_CYCLE
added toCUpti_DeviceAttribute
.Added new activity kinds
CUPTI_ACTIVITY_KIND_SYNCHRONIZATION
,CUPTI_ACTIVITY_KIND_STREAM
andCUPTI_ACTIVITY_KIND_CUDA_EVENT
, to support the tracing of CUDA synchronization constructs such as context, stream and CUDA event synchronization. Synchronization details are provided in the form of new activity recordCUpti_ActivitySynchronization
. EnumCUpti_ActivitySynchronizationType
lists different types of CUDA synchronization constructs.APIs
cuptiSetThreadIdType()
/cuptiGetThreadIdType()
to set/get the mechanism used to fetch the thread-id used in CUPTI records. EnumCUpti_ActivityThreadIdType
lists all supported mechanisms.Added API
cuptiComputeCapabilitySupported()
to check the support for a specific compute capability by the CUPTI.Added support to establish correlation between an external API (such as OpenACC, OpenMP) and CUPTI API activity records. APIs
cuptiActivityPushExternalCorrelationId()
andcuptiActivityPopExternalCorrelationId()
should be used to push and pop external correlation ids for the calling thread. Generated records of typeCUpti_ActivityExternalCorrelation
contain both external and CUPTI assigned correlation ids.Added containers to store the information of events and metrics in the form of activity records
CUpti_ActivityInstantaneousEvent
,CUpti_ActivityInstantaneousEventInstance
,CUpti_ActivityInstantaneousMetric
andCUpti_ActivityInstantaneousMetricInstance
. These activity records are not produced by the CUPTI, these are included for completeness and ease-of-use. Profilers built on top of CUPTI that sample events may choose to use these records to store the collected event data.Support for domains and annotation of synchronization objects added in NVTX v2. New activity record
CUpti_ActivityMarker2
and enums to indicate various stages of synchronization object i.e.CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE
,CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_SUCCESS
,CUPTI_ACTIVITY_FLAG_MARKER_SYNC_ACQUIRE_FAILED
andCUPTI_ACTIVITY_FLAG_MARKER_SYNC_RELEASE
are added.Unused field
runtimeCorrelationId
of the activity recordCUpti_ActivityMemset
is broken into two fieldsflags
andmemoryKind
to indicate the asynchronous behaviour and the kind of the memory used for the memset operation. It is supported by the new flagCUPTI_ACTIVITY_FLAG_MEMSET_ASYNC
added in the enumCUpti_ActivityFlag
.Added flag
CUPTI_ACTIVITY_MEMORY_KIND_MANAGED
in the enumCUpti_ActivityMemoryKind
to indicate managed memory.API
cuptiGetStreamId
has been deprecated. A new APIcuptiGetStreamIdEx
is introduced to provide the stream id based on the legacy or per-thread default stream flag.
Updates in CUDA 7.5
New Features
Device-wide sampling of the program counter (PC) is enabled by default. This was a preview feature in the CUDA Toolkit 7.0 release and it was not enabled by default.
Ability to collect all events and metrics accurately in presence of multiple contexts on the GPU is extended for devices with compute capability 5.x.
API
cuptiGetLastError
is introduced to return the last error that has been produced by any of the CUPTI API calls or the callbacks in the same host thread.Unified memory profiling is supported with MPS (Multi-Process Service)
Callback is provided to collect replay information after every kernel run during kernel replay. See API
cuptiKernelReplaySubscribeUpdate
and callback typeCUpti_KernelReplayUpdateFunc
.Added new attributes in enum
CUpti_DeviceAttribute
to query maximum shared memory size for different cache preferences for a device function.
Updates in CUDA 7.0
New Features
CUPTI supports device-wide sampling of the program counter (PC). Program counters along with the stall reasons from all active warps are sampled at a fixed frequency in the round robin order. Activity record
CUpti_ActivityPCSampling
enabled using activity kindCUPTI_ACTIVITY_KIND_PC_SAMPLING
outputs stall reason along with PC and other related information. EnumCUpti_ActivityPCSamplingStallReason
lists all the stall reasons. Sampling period is configurable and can be tuned using APIcuptiActivityConfigurePCSampling
. This feature is available on devices with compute capability 5.2.Added new activity record
CUpti_ActivityInstructionCorrelation
which can be used to dump source locator records for all the PCs of the function.All events and metrics for devices with compute capability 3.x and 5.0 can be collected accurately in presence of multiple contexts on the GPU. In previous releases only some events and metrics could be collected accurately when multiple contexts were executing on the GPU.
Unified memory profiling is enhanced by providing fine grain data transfers to and from the GPU, coupled with more accurate timestamps with each transfer. This information is provided through new activity record
CUpti_ActivityUnifiedMemoryCounter2
, deprecating old recordCUpti_ActivityUnifiedMemoryCounter
.MPS tracing and profiling support is extended on multi-gpu setups.
Activity record
CUpti_ActivityDevice
for device information has been deprecated and replaced by new activity recordCUpti_ActivityDevice2
. New record adds device UUID which can be used to uniquely identify the device across profiler runs.Activity record
CUpti_ActivityKernel2
for kernel execution has been deprecated and replaced by new activity recordCUpti_ActivityKernel3
. New record gives information about Global Partitioned Cache Configuration requested and executed. Partitioned global caching has an impact on occupancy calculation. If it is ON, then a CTA can only use a half SM, and thus a half of the registers available per SM. The new fields apply for devices with compute capability 5.2 and higher. Note that this change was done in CUDA 6.5 release with support for compute capabilty 5.2.
Updates in CUDA 6.5
New Features
Instruction classification is done for source-correlated Instruction Execution activity
CUpti_ActivityInstructionExecution
. SeeCUpti_ActivityInstructionClass
for instruction classes.Two new device attributes are added to the activity
CUpti_DeviceAttribute
:CUPTI_DEVICE_ATTR_FLOP_SP_PER_CYCLE
gives peak single precision flop per cycle for the GPU.CUPTI_DEVICE_ATTR_FLOP_DP_PER_CYCLE
gives peak double precision flop per cycle for the GPU.
Two new metric properties are added:
CUPTI_METRIC_PROPERTY_FLOP_SP_PER_CYCLE
gives peak single precision flop per cycle for the GPU.CUPTI_METRIC_PROPERTY_FLOP_DP_PER_CYCLE
gives peak double precision flop per cycle for the GPU.
Activity record
CUpti_ActivityGlobalAccess
for source level global access information has been deprecated and replaced by new activity recordCUpti_ActivityGlobalAccess2
. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code. And it also provides ideal L2 transactions count based on the access pattern.Activity record
CUpti_ActivityBranch
for source level branch information has been deprecated and replaced by new activity recordCUpti_ActivityBranch2
. New record additionally gives information needed to map SASS assembly instructions to CUDA C source code.Sample
sass_source_map
is added to demonstrate the mapping of SASS assembly instructions to CUDA C source code.Default event collection mode is changed to Kernel (
CUPTI_EVENT_COLLECTION_MODE_KERNEL
) from Continuous (CUPTI_EVENT_COLLECTION_MODE_CONTINUOUS
). Also Continuous mode is supported only on Tesla devices.Profiling results might be inconsistent when auto boost is enabled. Profiler tries to disable auto boost by default, it might fail to do so in some conditions, but profiling will continue. A new API
cuptiGetAutoBoostState
is added to query the auto boost state of the device. This API returns errorCUPTI_ERROR_NOT_SUPPORTED
on devices that don’t support auto boost. Note that auto boost is supported only on certain Tesla devices from the Kepler+ family.Activity record
CUpti_ActivityKernel2
for kernel execution has been deprecated and replaced by new activity recordCUpti_ActivityKernel3
. New record additionally gives information about Global Partitioned Cache Configuration requested and executed. The new fields apply for devices with 5.2 Compute Capability.
Updates in CUDA 6.0
New Features
Two new CUPTI activity kinds have been introduced to enable two new types of source-correlated data collection. The
Instruction Execution
kind collects SASS-level instruction execution counts, divergence data, and predication data. TheShared Access
kind collects source correlated data indication inefficient shared memory accesses.CUPTI provides support for CUDA applications using Unified Memory. A new activity record reports Unified Memory activity such as transfers to and from a GPU and the number of Unified Memory related page faults.
CUPTI recognized and reports the special MPS context that is used by CUDA applications running on a system with MPS enabled.
The CUpti_ActivityContext activity record
CUpti_ActivityContext
has been updated to introduce a new field into the structure in a backwards compatible manner. The 32-bitcomputeApiKind
field was replaced with two 16 bit fields,computeApiKind
anddefaultStreamId
. Because all validcomputeApiKind
values fit within 16 bits, and because all supported CUDA platforms are little-endian, persisted context record data read with the new structure will have the correct value forcomputeApiKind
and have a value of zero fordefaultStreamId
. The CUPTI client is responsible for versioning the persisted context data to recognize when thedefaultStreamId
field is valid.To ensure that metric values are calculated as accurately as possible, a new metric API is introduced. Function
cuptiMetricGetRequiredEventGroupSets
can be used to get the groups of events that should be collected at the same time.Execution overheads introduced by CUPTI have been dramatically decreased.
The new activity buffer API introduced in CUDA Toolkit 5.5 is required. The legacy
cuptiActivityEnqueueBuffer
andcuptiActivityDequeueBuffer
functions have been removed.
Updates in CUDA 5.5
New Features
Applications that use CUDA Dynamic Parallelism can be profiled using CUPTI. Device-side kernel launches are reported using a new activity kind.
Device attributes such as power usage, clocks, thermals, etc. are reported via a new activity kind.
A new activity buffer API uses callbacks to request and return buffers of activity records. The existing
cuptiActivityEnqueueBuffer
andcuptiActivityDequeueBuffer
functions are still supported but are deprecated and will be removed in a future release.The Event API supports kernel replay so that any number of events can be collected during a single run of the application.
A new metric API
cuptiMetricGetValue2
allows metric values to be calculated for any device, even if that device is not available on the system.CUDA peer-to-peer memory copies are reported explicitly via the activity API. In previous releases these memory copies were only partially reported.