### Description
- Implement Objective-C binding for `ORTTrainingSession`
- Add `ORTUtils` utility class to handle conversion between C++ and
Objective-C types
- Add test case for saving checkpoint
- Add unit test cases for `ORTTrainingSession`
### Motivation and Context
This PR is part of implementing Objective-C bindings for training API.
It implements the Objective-C binding for the training session. The Objective-C
API closely resembles the C++ API.
---------
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
### Description
This PR implements a backward-compatible way to define custom operators
with fallible compute functions. The templated C++ API gained an
optional `Fallible` argument. Closes #14287
### Motivation and Context
#14287 contains more context. The gist is that the current C-API defines
compute operations of custom operators as functions returning `void`
rather than an `OrtStatusPtr`. Currently, errors are often propagated
across the C-ABI using C++ exceptions. That is unsafe and undefined
behavior. Moreover, it is difficult for languages other than C++ to use
this approach even if they wanted to. A C-compliant, sound, and safe way
to propagate errors allows for non-C++ fallible custom operators.
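For illustration, a fallible compute function can report failure by returning an `OrtStatusPtr` across the C ABI instead of throwing. A minimal sketch (the kernel struct and the error condition are hypothetical, not the exact API added by this PR):

```cpp
#include <onnxruntime_c_api.h>

// Sketch of a fallible compute function: instead of returning void and
// throwing a C++ exception across the C ABI, it returns an OrtStatusPtr
// (nullptr on success).
struct MyKernel {
  const OrtApi* api;

  OrtStatusPtr Compute(OrtKernelContext* context) {
    const OrtValue* input = nullptr;
    if (OrtStatusPtr status = api->KernelContext_GetInput(context, 0, &input)) {
      return status;  // propagate the failure instead of throwing
    }
    if (input == nullptr) {
      return api->CreateStatus(ORT_INVALID_ARGUMENT, "missing required input 0");
    }
    // ... actual computation ...
    return nullptr;  // success
  }
};
```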
### An example in action
https://github.com/cbourjau/ort-custom-op/pull/6/files is a
demonstration of how this PR can be used to write safe and fallible
custom operators in Rust.
### Description
New logic to share allocators among the module, optimizer, and eval
sessions for the training scenario.
### Motivation and Context
Previously, on-device training shared allocators by sharing the EP. Now,
with the new allocator-sharing mechanism, we need to explicitly register
the allocator in the environment.
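For illustration, with the new mechanism the allocator is registered once in the environment and each session opts in; a minimal sketch using the C++ environment-allocator API (the arena values are placeholders, and exact wrapper signatures may vary by ORT version):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "training"};

  // Register a shared CPU arena allocator in the environment once...
  Ort::MemoryInfo mem_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  Ort::ArenaCfg arena_cfg{0 /*max_mem: default*/, -1 /*extend strategy: default*/,
                          -1 /*initial chunk size: default*/, -1 /*max dead bytes: default*/};
  env.CreateAndRegisterAllocator(mem_info, arena_cfg);

  // ...then let each session (module/optimizer/eval here) opt in to it.
  Ort::SessionOptions so;
  so.AddConfigEntry("session.use_env_allocators", "1");
  // Ort::Session session{env, "model.onnx", so};
  return 0;
}
```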
---------
Co-authored-by: Lei Cao <leca@microsoft.com>
### Description
Make BeamScorer run on the GPU instead of the CPU.
Brief overview:
- Adds a CUDA `CudaBeamSearchScorer` implementation of `IBeamScorer`.
- Instead of a 'done' flag per beam, there is one single 'not done'
variable that is copied to the CPU every iteration.
- Removes some of the extra CPU-side buffers and parameters that are no
longer needed.

Remaining future optimizations:
- CPU-copied beam indices are still used in the non-DecoderMaskedSelfAttention
case. An extra kernel can be written so that `PickGptPasteState` (called from
`UpdateGptFeeds`) no longer needs CPU-copied beam indices.
### Motivation and Context
It's faster to keep the work on the GPU to avoid GPU->CPU->GPU copies of
data.
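A minimal sketch of the single-flag pattern described above (names are hypothetical, not the actual `CudaBeamSearchScorer` code):

```cpp
#include <cuda_runtime.h>

// Instead of copying a per-beam 'done' array to the CPU each iteration,
// keep one device-side 'not done' flag and copy back only that single value.
bool BeamSearchNotDone(const int* d_not_done, int* h_not_done /*pinned host*/,
                       cudaStream_t stream) {
  cudaMemcpyAsync(h_not_done, d_not_done, sizeof(int),
                  cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // one small sync per iteration
  return *h_not_done != 0;
}
```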
### Description
Fixes a typo that prevents skipping a test that targets the QNN HTP
backend on Windows x64.
### Motivation and Context
- Windows x64 machines cannot load/run the QNN HTP backend. Therefore,
we need to skip such tests on Windows x64.
- Fixes the QNN_Nuget_Windows pipeline.
### Description
As title.
### Motivation and Context
Works with local onnxruntime-c pod in js/rn/e2e test.
A couple of places in onnxruntime used `float_t` data type alias as an
alternative to `float`. However, this is not entirely correct, since
`float_t` is an implementation-defined type alias, which may be `float`,
`double`, `long double` or some other implementation-defined data type,
depending on the state of the internal `FLT_EVAL_METHOD` macro:
https://en.cppreference.com/w/c/numeric/math/float_t
On most major platforms and compilers (clang, GCC, MSVC) this is a
purely cosmetic change that does not alter behavior. However, the icpx
compiler (and legacy icc) tends to substitute `float_t` with `long double`,
resulting in a linker error (unresolved reference) against the base onnx
library, which only contains the `ParseData` function for `float` and
`double`, as in
[here](9264e09367/onnx/defs/tensor_proto_util.cc (L133-L134)).
Overall, this PR cleans up the implementation-defined behaviour and
enables building onnxruntime with icpx.
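A small standalone program makes the implementation-defined nature of `float_t` visible:

```cpp
#include <cfloat>   // FLT_EVAL_METHOD
#include <cmath>    // std::float_t
#include <cstdio>

int main() {
  // float_t's identity follows FLT_EVAL_METHOD: 0 -> float, 1 -> double,
  // 2 -> long double. Under icpx it may not be float, which is exactly
  // why plain `float` is the correct type to use here.
  std::printf("FLT_EVAL_METHOD = %d, sizeof(std::float_t) = %zu\n",
              FLT_EVAL_METHOD, sizeof(std::float_t));
  return 0;
}
```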
The image for the onnxruntime-CI-nightly-ort-pipeline is too old.
The ort package in the image is older than the latest test code in the
nightly CI, which causes the nightly CI to fail.
### Manage ORTModule options
Move all env vars used to switch features ON/OFF into runtime options for
consistent management.
Note: the feature switches are assigned in two phases: default values
first, then overwritten by env vars (if specified by the user). So an env
var takes the highest priority when both phases explicitly set a value for
one feature.
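A language-agnostic sketch of the two-phase precedence (the option and env var names are hypothetical):

```cpp
#include <cstdlib>
#include <string>

// Phase 1: the option carries a default value.
struct RuntimeOptions {
  bool enable_feature_x = false;  // default
};

// Phase 2: an env var, if present, overwrites the default -- so the env
// var always wins when both phases set a value explicitly.
inline void OverrideFromEnv(RuntimeOptions& opts) {
  if (const char* v = std::getenv("ORTMODULE_ENABLE_FEATURE_X")) {
    opts.enable_feature_x = (std::string(v) == "1");
  }
}
```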
### Motivation and Context
### Use PadAndUnflatten to replace GatherGrad for restore
### Motivation and Context
### Description
Fix the nullptr check so that it checks the actual existence of the
engine/context.
(Currently, it checks the address of the `unique_ptr`, which is always
non-null. Thanks @jslhcl for pointing that out.)
> A quick recall of struct
[trt_state](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/tensorrt/tensorrt_execution_provider.h#L104):
> ```cpp
> std::unique_ptr<nvinfer1::ICudaEngine>* engine = nullptr;
> std::unique_ptr<nvinfer1::IExecutionContext>* context = nullptr;
> ```
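Given that struct, the corrected check must dereference before testing; roughly (a self-contained sketch with stand-in types, not the exact EP code):

```cpp
#include <memory>

struct TensorrtState {  // abbreviated stand-in for the quoted trt_state fields
  std::unique_ptr<int>* engine = nullptr;   // stand-in for nvinfer1::ICudaEngine
  std::unique_ptr<int>* context = nullptr;  // stand-in for nvinfer1::IExecutionContext
};

bool EngineExists(const TensorrtState* trt_state) {
  // trt_state->engine is a pointer to a unique_ptr; the pointer itself is
  // essentially always non-null, so the old check never fired. The engine
  // actually exists only if the pointed-to unique_ptr owns an object.
  return trt_state->engine != nullptr && *trt_state->engine != nullptr;
}
```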
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/15982
The incorrect check couldn't stop the TRT EP from loading an incompatible
engine cache, which invokes an unhandled exception.
### Description
Eliminate the Cast operator when Shape is the next node.
### Motivation and Context
#### Cast
When working with onnx opset 15 and above, the Shape operator accepts
all data types.
This change is documented in the [onnx
Changelog](https://github.com/onnx/onnx/blob/main/docs/Changelog.md#Shape-15).
As a result, casting variables right before the Shape operation becomes
unnecessary.
Removing these unnecessary casts simplifies the graph and can provide
performance gains.
## Results
On:
```
torchrun examples/onnxruntime/training/language-modeling/run_clm.py \
  --model_name_or_path gpt2 --do_train --overwrite_output_dir --output_dir ./outputs/ \
  --seed 1337 --fp16 True --per_device_train_batch_size 4 --num_train_epochs 1 \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --learning_rate 2e-5 --report_to none --optim adamw_ort_fused
```
Train metrics, without vs. with my changes:

| metric | without changes | with my changes |
| --- | --- | --- |
| epoch | 1.0 | 1.0 |
| train_loss | 3.2981 | 3.2981 |
| train_runtime | 0:02:13.29 | 0:02:08.98 |
| train_samples | 2318 | 2318 |
| train_samples_per_second | 17.39 | 17.971 |
| train_steps_per_second | 4.351 | 4.497 |

We see around a 3% gain.
---------
Co-authored-by: Adam Louly <adamlouly@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
### Description
Update file list to adjust for recent changes to test infra.
### Motivation and Context
Some CI jobs may be interrupted unexpectedly and not execute the
unmount-data step. The data left on the host device will cause
`device or resource busy` errors and make subsequent CI jobs fail.
Move the mount-data step into the docker container, so the host machine
will not be occupied when CI jobs exit incorrectly.
Enhance cast propagation so that Softmax can be kept at fp16 when the data
flow is cast-to-fp32 > Softmax > cast-to-fp16.
This optimization saves GPU memory and provides a performance gain.
### Description
Before transformers 4.27, the causal mask used the uint8 data type, so
there was an extra Cast node to convert it to bool. This adds a pattern
without the Cast node to support attention fusion for GPT-2 models exported
with transformers >= 4.27.
### Motivation and Context
https://github.com/microsoft/onnxruntime/issues/16453
Python script to modify an ONNX model to align it with a converted QNN
model.
### Description
The ONNX Runtime QNN EP can support context binary files generated by the QNN tool chain. However, a QNN-generated context binary file uses channel-last layout and 8-bit or 16-bit input and output. This script gets the QNN model input & output information from the QNN-converted model_net.json file and inserts Cast and Transpose nodes into the ONNX model if required.
🛠️ __Changes in this pull request:__
This pull request introduces two significant changes to the project:
- Changing the on-device training checkpoint format: The current
implementation stores the on-device training checkpoint as a sequence of
tensors in multiple files inside a checkpoint folder, which can be
inefficient in terms of storage and performance. In this PR, I have
modified the checkpoint format to use a flatbuffer table to save
the checkpoint to a single file, providing a more compact and efficient
representation. The changes around this are twofold:
- Add the checkpoint flatbuffer schema that will generate the necessary
checkpoint source files.
- Update the checkpoint saving and loading functionality to use the new
format.
- Adding support for onnxruntime minimal build: To support scenarios
where binary size is a constraint, I made changes to ensure that the
training build can work well with the minimal build.
🔍 __Open Issues:__
- In order to extract the optimizer type, the existing implementation
re-loaded the onnx optimizer model and parsed it. This is no longer
possible, since the model format can either be onnx or ort. One idea is
to do the same for ort format optimizer model. This needs some
investigation.
- Changes to the offline tooling to generate ort format training
artifacts.
- End-to-end training example showcasing the use of the minimal training
build.
- Add support for exporting the model for inferencing in a minimal build.
### Description
protobuf's CopyFrom doesn't work for models > 2GB in version 4.23. This PR
removes the copy in the Calibrator.
Currently, the Calibrator copies the ModelProto to avoid changing it. The
reason is that quantize_static passes a ModelProto to the Calibrator to
calibrate quantization parameters, and then uses it for quantization. If
the Calibrator changed the ModelProto, quantization wouldn't work.
This PR changes quantize_static to pass a model path to the Calibrator
instead of a ModelProto, and makes the Calibrator only take a model path as
input, which is how it is used in most cases.
This PR also removes the optimization step from quantization. Users need to
call pre-processing to optimize the model.
### Description
This PR fixes a typo in assigning the `repetition_penalty` input in
the Whisper export with the beam search model. It is a follow-up to the
[export stabilization
PR](https://github.com/microsoft/onnxruntime/pull/16297).
### Motivation and Context
The `repetition_penalty` input should be set to `repetition_penalty`
instead of `input_features`.
### Description
Revert docker base image to
nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04@sha256:b754c43fe9d62e88862d168c4ab9282618a376dbc54871467870366cacfa456e
### Motivation and Context
The default image environment of nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04
had a minor upgrade, which makes the Linux MultiGPU TensorRT CI (NV12
instance with a Maxwell GPU) fail on three CApiTestGlobalThreadPoolsWithProvider
tests (these three tests have higher error, above the tolerance).
That minor upgrade includes cuDNN 8.7.0 -> 8.9.0, which might be a factor
that makes the Maxwell GPU generate higher error. CIs with T4 GPUs are not
affected.
### Description
Modify the creation of the WebGL context.
Previous behavior:
- STEP 1: create a canvas (document.createElement); if it fails, go to step 2, else step 3.
- STEP 2: create an OffscreenCanvas; if it fails, abort.
- STEP 3: use the canvas created in step 1 or 2 to create the WebGL context; if successful, return the context, else abort.

New behavior:
- STEP 1: create an OffscreenCanvas; if it fails, go to step 3.
- STEP 2: use it to create the WebGL context; if successful, return the context.
- STEP 3: create a canvas (document.createElement); if it fails, abort.
- STEP 4: use it to create the WebGL context; if successful, return the context, else abort.

Motivation:
We found that in some environments normalCanvas.getContext() returns null
while offscreenCanvas.getContext() returns the context object, and when
OffscreenCanvas is available it is a good idea to always prefer it.
### Description
DORT support for custom ops
### Motivation and Context
Custom ops registered via a custom-ops shared library cannot currently be
run using DORT. This PR enables them by:
1. registering the custom ops supported in DORT
2. plumbing down, from OrtBackend when creating the InferenceSession, the
session_options that were used to register the custom-ops shared library
via `session_options.register_custom_ops_library(shared_library)`
The CUDA EP already supports [CUDA
graphs](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs),
and we observed that some models can benefit from using CUDA graphs with
`trtexec`. Therefore, this PR enables CUDA graph support for the TRT EP.
The implementation is based on
https://github.com/microsoft/onnxruntime/pull/9978 with the same
[constraints](https://github.com/microsoft/onnxruntime/pull/9978) as
below:
- Models with control-flow ops (i.e. If, Loop and Scan ops) are not
supported.
- Usage of CUDA Graphs is limited to models wherein all the model ops
(graph nodes) can be partitioned to the TRT EP.
- The input/output types of models need to be tensors.
- Shapes of inputs/outputs cannot change across inference calls.
- IOBinding is required.
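For illustration, usage would look roughly like the sketch below (the `trt_cuda_graph_enable` provider-option key and the binding placeholders are assumptions, not verified against the final API):

```cpp
#include <onnxruntime_cxx_api.h>

int main() {
  Ort::Env env{ORT_LOGGING_LEVEL_WARNING, "trt-cuda-graph"};
  Ort::SessionOptions so;

  // Enable CUDA graph capture on the TRT EP (option key is an assumption).
  OrtTensorRTProviderOptionsV2* trt_options = nullptr;
  Ort::GetApi().CreateTensorRTProviderOptions(&trt_options);
  const char* keys[] = {"trt_cuda_graph_enable"};
  const char* values[] = {"1"};
  Ort::GetApi().UpdateTensorRTProviderOptions(trt_options, keys, values, 1);
  so.AppendExecutionProvider_TensorRT_V2(*trt_options);

  Ort::Session session{env, "model.onnx", so};

  // IOBinding is required: bind device-resident inputs/outputs once and
  // reuse them, since shapes must stay fixed across inference calls.
  Ort::MemoryInfo cuda_mem{"Cuda", OrtDeviceAllocator, /*device_id*/ 0, OrtMemTypeDefault};
  Ort::IoBinding binding{session};
  // binding.BindInput("input", input_value);   // pre-allocated device tensor
  // binding.BindOutput("output", cuda_mem);
  // session.Run(Ort::RunOptions{}, binding);   // first run captures; later runs replay

  Ort::GetApi().ReleaseTensorRTProviderOptions(trt_options);
  return 0;
}
```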
### Description
As title.
### Motivation and Context
### Description
Set onnxruntime-c local pod path environment variable for react native
e2e tests on react-native-ci.yml
### Motivation and Context
Previously the E2E test project was not properly consuming a locally built
onnxruntime-c pod.
https://github.com/microsoft/onnxruntime/pull/16411#issuecomment-1598512816
### Description
Add "_smXX" to trtep engine cache file name, which "sm" stands for
"Streaming Multiprocessor".
> The GPU compute capability version is prefixed with "SM" because
NVIDIA typically improves and updates the SM in each new GPU
architecture.
### Motivation and Context
Github issue: https://github.com/microsoft/onnxruntime/issues/15982
Reduce the chance of misusing an incompatible engine cache when the user
switches between GPU devices with different compute capability.
* The prevention can't be 100%, as model size & GPU memory size could be
other factors that make a cache incompatible.
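For illustration, such a suffix can be derived from the device's compute capability via the CUDA runtime; a hypothetical sketch (not the exact TRT EP code):

```cpp
#include <cuda_runtime.h>
#include <string>

// Build an "_smXX" suffix from the device's compute capability,
// e.g. compute capability 8.6 -> "_sm86".
std::string GetSmSuffix(int device_id) {
  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, device_id);
  return "_sm" + std::to_string(prop.major) + std::to_string(prop.minor);
}
```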
### Description
1. Add a new test lib `onnxruntime_providers_cuda_ut`, which is similar
to `onnxruntime_providers_cuda`, but `onnxruntime_providers_cuda_ut` is
only built if `onnxruntime_BUILD_UNIT_TESTS` is set. We can call all
CUDA UTs through this ut lib without affecting the production lib
`onnxruntime_providers_cuda`.
2. Move all test cases from `core/providers/cuda/test/` to
`test/providers/cuda/`. These test cases are built into the lib
`onnxruntime_providers_cuda_ut` and run by `./onnxruntime_test_all
--gtest_filter="*CUDA_EP_Unittest*"`. Since the lib is only for tests, we
can use gtest macros in the test cases. The previous implementation did not
support using the gtest lib in the CUDA UT cases.
3. The cmake code in `cmake/onnxruntime_providers.cmake` is refactored a
bit. A new function `onnxruntime_add_object_library` builds an object
target. The 2 libs `onnxruntime_providers_cuda_ut` &
`onnxruntime_providers_cuda` share most of the code, so the object files
can be used in both libs, which helps reduce build time. Another
function `config_cuda_provider_shared_module` is used to configure all 3
similar targets
(onnxruntime_providers_cuda_obj/onnxruntime_providers_cuda/onnxruntime_providers_cuda_ut).
4. Refactored the test to call `testing::InitGoogleTest` &
`RUN_ALL_TESTS` in `libonnxruntime_providers_cuda_ut.so`'s `TestAll`.
After this change, we can see all the cases running in
`CUDA_EP_Unittest.All`:

### Motivation and Context
After https://github.com/microsoft/onnxruntime/pull/13016, there are
still test files in test/providers/cuda/ that are not moved to
core/providers/cuda/test/, and the test cases are disabled. This PR helps
clean up the unfinished TODOs.
Even though onnxruntime_shared_lib_test covers some tests for the CUDA
provider, it works like a coarse-grained end-to-end test for the CUDA
provider. A CUDA unit test that can run cases for a single component would
be helpful for CUDA developers.
---------
Co-authored-by: Yuhong Guo <yuhong.gyh@antgroup.com>