* Implement TreeEnsemble for opset(ai.onnx.ml)==3
* use of InlineVector
* refactoring
* improve attributes retrieval
* avoid creating a temporary buffer
* modifies onnx.ml.cpu.json
* use unordered_map
* update docs/OperatorKernels.md
* address PR comments (TH -> ThresholdType, ORT_RETURN...)
* add a python unit test to load a TreeEnsembleRegressor following ai.onnx.ml==3 specifications
Follow up to #10904.
- Move node EP assignment for ORT format into SessionState::FinalizeSessionState().
- Add unit test for #10904.
- Make convert_onnx_models_to_ort.py optimization level configurable via environment variable.
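A minimal sketch of how a conversion script could read its optimization level from the environment. The variable name `ORT_CONVERT_OPTIMIZATION_LEVEL`, the level names, and the default of `all` are assumptions made for illustration; they are not taken from convert_onnx_models_to_ort.py.

```python
import os

# Hypothetical variable name and level names, chosen for illustration only.
ENV_VAR = "ORT_CONVERT_OPTIMIZATION_LEVEL"
VALID_LEVELS = ("disable", "basic", "extended", "all")


def get_optimization_level(default="all", environ=None):
    """Return the optimization level named by ENV_VAR, or `default` if unset."""
    env = os.environ if environ is None else environ
    level = env.get(ENV_VAR, default).strip().lower()
    if level not in VALID_LEVELS:
        raise ValueError(
            f"{ENV_VAR}={level!r} is not one of {', '.join(VALID_LEVELS)}"
        )
    return level


# Passing a dict instead of os.environ keeps the example deterministic.
print(get_optimization_level(environ={}))                  # falls back to the default
print(get_optimization_level(environ={ENV_VAR: "Basic"}))  # value is case-insensitive
```

Validating the value up front means a typo in the variable fails the conversion immediately instead of silently using a different level.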
The ARM A55 micro-architecture (with dot product instructions), similar to the A53, is widely used for the little cores in big.LITTLE configurations. The A55 has narrower memory load/store hardware: a 128b load instruction blocks the pipeline for 2 whole cycles, during which no other instructions can be executed. A 64b load instruction, on the other hand, can be dual-issued with many other instructions.
This change adds a Symmetric Quant indirect Conv kernel for the A55 micro-architecture, where we replace
ldr q4,[x1],
with
ldr d4,[x1],
ldr x11,[x1],
ins v4.d[1],x11
so that we can try to hide the memory load cycles behind computing cycles in the kernel.
With this new kernel, the cartoongan model shows about 7.3% lower latency on Pixel5a little cores (2 threads running on two little cores):
new kernel: 2188.59 ms
old kernel: 2360.61 ms
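The load replacement described above relies on two 64b loads plus a lane insert producing the same 128b register contents as a single 128b load. The following Python sketch models only that data equivalence, not timing or issue behavior. It assumes little-endian memory and that the second 64b load reads the upper 8 bytes of the same 16-byte block; the addressing offsets are not shown in the snippet above, so this is an assumption.

```python
def load_q(mem: bytes, offset: int) -> int:
    """Model a single 128-bit load (`ldr q4`) as a little-endian integer."""
    return int.from_bytes(mem[offset:offset + 16], "little")


def load_split(mem: bytes, offset: int) -> int:
    """Model the split sequence: two 64-bit loads followed by a lane insert."""
    # ldr d4: the low 64 bits go into lane 0 of the vector register.
    d_low = int.from_bytes(mem[offset:offset + 8], "little")
    # ldr x11: the high 64 bits go into a general-purpose register
    # (assumed to read the upper half of the same 16-byte block).
    x_high = int.from_bytes(mem[offset + 8:offset + 16], "little")
    # ins v4.d[1], x11: place the general-purpose register into lane 1.
    return d_low | (x_high << 64)


mem = bytes(range(32))
assert load_q(mem, 0) == load_split(mem, 0)
assert load_q(mem, 16) == load_split(mem, 16)
```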
* improve NonZero
* fix megatron_fp16 optimizer, fix the doc
* multi_tensor_applier
* resolve comment
* fix building warning
* fix build error when enabling training and use tensorrt
* Adding optimization step and step parameter to the ORTTrainer constructor
* Added ORTTrainerOptions for optimization step
* Adding Train Step Info Settings to State Dictionary
* Adding train step info key
* Updating comments
* Reverting changes
* Updating test case for new state dict entry train_step_info
* backup debugging information related to debugging a jira ticket
* fixed a bug in checking whether an input can be constant folded
* added more operators that are supported by migraphx
* revert unnecessary changes
* remove unused logger parameter
* rename function to make name style consistent
* backup code changes
* fix review comments
* refactor graph utility functions to add unit tests
* backup additional changes
* fixed a link error when building migraphx_basic_test
* add unit test for some migraphx utility functions
* add more supported ops in migraphx
* get inputs independently for trtexec
* track one process only
* remove engine and profile files
* change time to commit time
* add runtime option for io binding
* move to commit date
* fixes
* add option for graph optimization
* cleanup docker script
* note second time creation
* allow for parameters to be configured from pipeline at runtime
* uncomment
* include optional arguments at runtime
* post second session creation
* update cmake version
* Revert "update cmake version"
This reverts commit 09a1364eae68610724c8e90eeea777b7ee03f74b.
* Move data format import
* Ignore DequantizeLinear nodes in CommonSubexpressionElimination.
Coalescing DQ nodes results in QDQ node groups having overlaps, which the QDQ processing does not support.
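A toy sketch of the idea, assuming a simplified graph model that is not the ONNX Runtime implementation: each node has a single output named after the node, attributes are ignored, and nodes arrive in topological order. The `Node` type and `eliminate_common_subexpressions` function are hypothetical names. Duplicate nodes are merged unless their op type is in `skip_ops`, so every DequantizeLinear node is kept and each QDQ group retains its own DQ node.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str      # also used as the name of the node's single output
    op_type: str
    inputs: tuple  # names of graph inputs or producing nodes


def eliminate_common_subexpressions(nodes, skip_ops=frozenset({"DequantizeLinear"})):
    """Merge nodes with identical (op_type, inputs), except ops in skip_ops.

    Returns (kept_nodes, replaced), where replaced maps each removed node's
    name to the name of the node that now stands in for it.
    """
    seen = {}      # (op_type, inputs) -> representative node name
    replaced = {}  # removed node name -> representative node name
    kept = []
    for node in nodes:
        # Rewire inputs that pointed at an already-removed duplicate.
        inputs = tuple(replaced.get(i, i) for i in node.inputs)
        node = node._replace(inputs=inputs)
        if node.op_type in skip_ops:
            kept.append(node)  # never coalesce these nodes
            continue
        key = (node.op_type, inputs)
        if key in seen:
            replaced[node.name] = seen[key]
        else:
            seen[key] = node.name
            kept.append(node)
    return kept, replaced


nodes = [
    Node("dq_a", "DequantizeLinear", ("x_q", "scale", "zp")),
    Node("dq_b", "DequantizeLinear", ("x_q", "scale", "zp")),
    Node("relu_a", "Relu", ("x",)),
    Node("relu_b", "Relu", ("x",)),
]
kept, replaced = eliminate_common_subexpressions(nodes)
print([n.name for n in kept])  # both DQ nodes survive, one Relu is merged
print(replaced)
```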
* Update DropoutGrad function to support bfloat16
* Eliminate dead comments
* Set opset version for testcase
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
* Update to new builder
Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
* io_binding support
* cover all test cases
* per comments
Co-authored-by: Ethan Tao <ettao@microsoft.com@orttrainingdev7.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>