* `QuantizeBFP` and `DequantizeBFP` schemas - similar to
`QuantizeLinear` and `DeQuantizeLinear`.
* BFP datatype is represented as a `uint8` tensor with shape and stride
metadata. This is preferrable to adding a new datatype for BFP, which is
more disruptive and [discouraged by
PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2).
Context:
The Microsoft Floating Point (BFP) datatype shares an exponent for every
n numbers called a “bounding box.” Each number still has its own
mantissa and sign bits. BFP has been shown to incur 3-4 less cost
(energy and area) than BFloat16 and INT8 counterparts without reductions
in accuracy for the ImageNet benchmark as described in [Rouhani
2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf).
Requirements:
* There are many variants of BFP (number of mantissa bits, number of
shared exponent bits, size of bounding box, custom bit fields, etc.)
* The size and layout of an BFP variant varies across hardware
* bounding box can be over arbitrary dimensions; for example, for the
channel "C" dimension in a N x C x H x W tensor for convolution
Goals of this PR:
* Add initial versions of QuantizeBFP and DequantizeBFP operators to
enable QDQ-style quantization with BFP. Once the schemas stabilize, we
can consider upstreaming to ONNX.
* Add some basic type and shape inferencing tests; tests that run on an
EP will be a follow-up.
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.
# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
* Fix bug in pybind get_all_operator_schema due to premature reference dropping
* Add updated operator kernels markdown table
* Update build.py to include documentation generation for DML operators too
* Update GPU pipeline to include DML in the build to so operators can be generated.
* Use a separate pipeline stage, feedback from Changming and Scott
* Appease annoying Python linter
* Add onnxruntime_BUILD_UNIT_TESTS=OFF and remove stale --use_dml in cuda stage
* Make ORT as Pytorch JIT backend
LORT likely doesn't work with aten fallback so we only test LORT in its own CI.
* Revert changes to enable external CUDA allocator. Will add it later.
Revert "Revert changes to enable external CUDA allocator. Will add it later."
This reverts commit d5487f2e193014c805505afae8fb577c53667658.
Fix external allocator
* Relax tolerance and remove commented code
* Print more information in CI
* Fix pointer
* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.
* Use Pytorch master branch as all PRs are merged
Fix
* Refine based on cpplint feedbacks
* Revert changes to allow custom CUDA allocator in public APIs
* Use torch.testing.assert_close
* Use unittest framework
* Switch docker repo
* Rename *.cpp to *.cc
* Address comments
* Add comment
* Use same pipeline file for eager and lort pipelines
* Address comments
* Add yaml comment
* Fix cmake files
* Address comments
* Rename flags, remove printing code, remove dead comment
* Update to handle multiline declarations for the kernels which are typical these days.
* Update to new path for the cpu contrib_op kernel registrations.
* Update tools/python/find_optimizer_opset_version_updates_required.py
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
* add description of build ORT+TVM EP on Windows
* fix cmake error related to symlink creation on Windows
* add llvm config path to build flags for correct build on Windows
* update TVM_EP.md for llvm_config build arg
* fix warnings skipping during build on Windows
* fix using string or wstring for model path to correct build on Windows (MSVC error)
* fix error in custom logger for correct build on Windows
* implement glob algorithm for Windows
* additional build fixes
* update TVM with export of VM symbols for dll
* description of nasm issue and workaround
* update TVM with export of Executable from VM symbols for dll
* description of installation of ipp-crypto dependencies on Windows
* cmake key for ipp-crypto build
* fix wstring for TVMso EP
* fix ipp-crypto build
* cmake key onnxruntime_TVM_USE_HASH switch off not specific methods, but full hash functionality
* fix absolute path to compiled lib
* update TVM_EP.md, fix lint warnings
* update TVM_EP.md
* small fixes after review
* switch on handshake functionality for Linux workflow
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
* infrastructure for handshake mechanism was implemented. sha256 was selected as first hash algorithm
* check hash during compile in TVMso EP
* add IPP-CRYPTO to external dependencies for TVM EP
* made checkHash method constant
* removed the public implementation of the SHA-256 algorithm so as not to cause a license conflict
* implemented SHA-256 calculation using ipp-crypto library
* fix dependency for ipp-crypto
* add provider options for hash check
* update documentation for added provider options
* add hash check condition
* fix docs
* fix lint
* fix ORT_THROW
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
Co-authored-by: KJlaccHoeUM9l <wotpricol@mail.ru>
* Register signal ops for op set 17
Note code is mostly being moved, not added. These ops were previously
only registered as Microsoft contrib ops and only built if
`BUILD_MS_EXPERIMENTAL_OPS=1`. They've been added to the ai.onnx
standard op set in version 17.
Main components of this change:
* Move the kernels from the conrib_ops directory to the
core directory.
* Add function bodies for ms experimental ops. This will allow
old models that use the contrib ops to continue to function.
All the function bodies consist of a single op (the
new standard op), so performance overhead should be minimal.
Minor clean-up also in this change:
* De-duplicate get_scalar_value_from_tensor: put it in a new utils.h.
* Fix some bugs that caused compilation errors with the experimental
ops. Tested with `build.sh --ms_experimental`
* Fix some spelling errors and lint violations.
* Replace a couple of switch statements with `MLTypeCallDispatcher`.
* Use `InlineVector` instead of `std::vector`.
Unblocks https://github.com/microsoft/onnxruntime/issues/11640
Prior to this every test shared the same tolerances. This meant
that if an ONNX test failed due to a small but acceptable difference in
output, the only alternative was to disable the test entirely.
In op set 17, the DFT operator is being added. Without this change, the
tests for that operator fail because the output is off by about 5e-5.
It's better to keep test coverage for this new op rather than disable
the test entirely.
Also prior to this change, the global tolerances were not shared between
C++, JavaScript, and Python tests. Now they are.
Also fix various minor issues raised by linters.
Unblocks https://github.com/microsoft/onnxruntime/issues/11640.
(1) Support T5 in BeamSearch operator, and add both CPU and CUDA implementation.
(2) Change BeamSearch op: rename encoder_decoder_init attribute to encoder, and add decoder_start_token_id attribute
(3) Update convert_to_onnx for T5 to use int32 instead of int64 inputs as default.
(4) Add more tests in best_beam_search.py
(5) fix ORT_ENFORCE of hypothesis_buffer_offset_
(6) Improve ONNX conversion:
(a) Change encoder some dynamic axes to fixed dim value
(b) add --separate_encoder_and_decoder_init
(c) correct name t5-3B => t5-3b, t5-11B => t5-11b
(d) Add --use_int32_inputs in convert t5 to onnx
(e) Allow t5 beam search conversion in one step
* update description of TVM EP options in docs
* update sample notebook
* update TVM EP documentation
* add link to description of options
Co-authored-by: Valery Chernov <valery.chernov@deelvin.com>
* Initiate Ort SNPE EP
* fix snpe ep windows build which is caused by the utility method (ToUTF8String) name change on master
* correct the source path for libonnxruntime.so while building for andorid package
* add AdditionalDependencies for amr64
* On MS-Windows, the patchfile must be a text file, i.e. CR-LF must be used as line endings. A file with LF may give the error: "Assertion failed, hunk, file patch.c, line 343," unless the option '--binary' is given.
* fix build failure if snpe is not enabled
* update doc for contrib op
* separate out snpe ep settings to onnxruntime_snpe_provider.cmake
* renaming according review comments
* update according review comments
* Implement BitmaskDropout and associated unit tests.
* Implement BitmaskDropoutGrad and associated unit tests.
* Implement Dropout -> BitmaskDropout rewrite rule and associated unit tests.
* Implement (Dropout,DropoutGrad) -> (BitmaskDropout,BitmaskDropoutGrad) rewrite rule.
This commit does not yet include unit tests for this rewrite rule.
This commit also introduces improved documentation for all changes which will be grouped
into this PR.
* bitmask dropout
* fix win build
* bugfix for rocm
* bugfix
* fix code format
* fix ut
* fix build break
* fix ut in win
* resolve comments
* fix ut in trt
* resolve comments
* fix rocm build error
* fix typo
Co-authored-by: Aidan Beggs <aidanbeggs@microsoft.com>
Description: Format all python files under onnxruntime with black and isort.
After checking in, we can use .git-blame-ignore-revs to ignore the formatting PR in git blame.
#11315, #11316
* Implement TreeEnsemble for opset(ai.onnx.ml)==3
* use of InlineVector
* refactoring
* improve attributes retrieval
* avoid creating a temporary buffer
* modifies onnx.ml.cpu.json
* use unordered_map
* update docs/OperatorKernels.md
* address PR comments (TH -> ThresholdType, ORT_RETURN...)
* add a python unit test to load a TreeEnsembleRegressor following ai.onnx.ml==3 specifications
* improve NonZero
* fix megatron_fp16 optimzier, fix the doc
* multi_tensor_applier
* resolve comment
* fix building warning
* fix build error when enabling training and use tensorrt
* change BeamSearch op to support encoder decoder model
* check model_type and decoder attribute
* fix
* update comments
* warn shape inference issue with onnx v1.11 or T5
* skip parity test when tempature != 1.0
* fix build
Work on minimizing memory management calls by
reducing number of allocations and copies.
Replace std::unordered_set to InlinedHashSet
and add usage of InlinedVector.
Employ std::move() to minimize copying and memory allocations.
Remove copying of the const shared data into each of the
PropagateCast transformer instances.
Move inlined_containers.h header to include/common
Adjust AsSpan imlementation for C++ < 17
* add support for bool type
* add TVM EP support for tests
* include TVM EP in python test pool
* fix pylint
* moved technical imports to a separate file
* clean up post build actions & move _ld_preload.py extension to CMake level
* add files for include TVM EP into CI
* implement custom logger for TVM
* replace TVM logging with ONNX RT logging
* update link for TVM EP tutorial
* clean up TVM EP cmake
* add pybind auto enabling for TVM EP
* fix blank spaces
* code review fixes
* replace print with comment
* add list of EP without TVM EP
* enable onnx tests
* disable contrib ops and ml ops
* reuse Dockerfile.ubuntu
* Move install_tvm_test_dependencies.sh out of Docker context dir, update build definition.
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>