* sync graph proto in node's attributes
* Don't fuse nodes of control flow op until later in control flow op level
* remove unnecessary ep functions
* add missing 'override' keyword, which made the macOS/Web CI fail
* Add one more test run for Test3LayerNestedSubgraph with disabling graph optimization
* Update the comments to better understand the 4 cases
* moving training pipelines from cuda 11.5 to 11.6 and deprecating cuda 11.3
* change to cuda 11.6.2
* change pytorch's & torchvision's cuda version to 11.6
* specify deps version to 11.6.2
* update pytorch and torchtext versions
* torch 1.12.1
* change torchvision and torchtext version to be compatible with torch 1.12.1
* change cuda to 11.6 for cuda_home compatibility
Co-authored-by: Adam Louly <adamlouly@microsoft.com>
* Add asm statement to model.mm to force linker to link against CoreML.Framework.
Update targets.xml as per Rolf's suggestions
* Remove explicit numpy version from macOS build. We don't specify it for other CIs, and the version specified doesn't have a pre-built 3.10 wheel, so the CI attempts to build numpy from source, which fails.
Fix arithmetic overflow warning. Fix suggested by the static analysis tool:
Arithmetic overflow: Using operator '+' on a 4 byte value and then casting the result to a 8 byte value.
Cast the value to the wider type before calling operator '+' to avoid overflow (io.2).
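The pattern behind this warning can be sketched as follows. This is a minimal illustration only, not the code touched by this change; the function names `OffsetNarrow` and `OffsetWide` are hypothetical, and unsigned operands are used so that the wrap-around is well defined.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical "before": the sum is computed in 32 bits and only then widened,
// so it can wrap modulo 2^32 before the cast happens.
inline int64_t OffsetNarrow(uint32_t base, uint32_t delta) {
  return static_cast<int64_t>(base + delta);
}

// Hypothetical "after": one operand is widened first, so the addition itself
// is carried out in 64 bits and cannot wrap for 32-bit inputs.
inline int64_t OffsetWide(uint32_t base, uint32_t delta) {
  assert(base <= UINT32_MAX);  // trivially true; documents the 32-bit input range
  return static_cast<int64_t>(base) + delta;
}
```

With `base = 4000000000` and `delta = 500000000`, the widened version returns 4500000000, while the narrow version returns the sum reduced modulo 2^32.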
* Fix SAME_UPPER/SAME_LOWER (auto_pad attribute) in ConvTranspose
* Bump ONNX 1.10.2 globally
* load ONNX_VERSION from VERSION_NUMBER
* revert deprecation warning in ORT 1.12
* add a comment explaining why cntk_simple_seg was removed
* correct the implementation in DML as well
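For context on the auto_pad fix above, here is a rough sketch of how the total padding on one spatial axis might be split. It assumes the usual ONNX convention that SAME_UPPER puts the extra element at the end and SAME_LOWER puts it at the beginning. The enum and the function `SplitSamePadding` are hypothetical and are not the ORT or DML implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Hypothetical stand-in for the two SAME_* values of the auto_pad attribute.
enum class AutoPad { kSameUpper, kSameLower };

// Returns {pad_begin, pad_end} for one spatial axis.
// When total_padding is odd, SAME_UPPER gives the extra element to pad_end
// and SAME_LOWER gives it to pad_begin (assumed convention, see lead-in).
inline std::pair<int64_t, int64_t> SplitSamePadding(int64_t total_padding, AutoPad mode) {
  assert(total_padding >= 0);
  const int64_t small_half = total_padding / 2;
  const int64_t large_half = total_padding - small_half;
  if (mode == AutoPad::kSameUpper) {
    return {small_half, large_half};
  }
  return {large_half, small_half};
}
```

For an even total the two modes give the same result; they differ only when the total is odd.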
(1) Modify some lines to fit line length limit 120
(2) Adjust parameter order of LaunchAttentionKernel
(3) Format code with Clang-Format in VS Code
(4) Fix spelling errors
* Make ORT a PyTorch JIT backend
LORT likely doesn't work with aten fallback so we only test LORT in its own CI.
* Revert changes to enable external CUDA allocator. Will add it later.
Revert "Revert changes to enable external CUDA allocator. Will add it later."
This reverts commit d5487f2e193014c805505afae8fb577c53667658.
Fix external allocator
* Relax tolerance and remove commented code
* Print more information in CI
* Fix pointer
* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.
* Use PyTorch master branch now that all PRs are merged
Fix
* Refine based on cpplint feedback
* Revert changes to allow custom CUDA allocator in public APIs
* Use torch.testing.assert_close
* Use unittest framework
* Switch docker repo
* Rename *.cpp to *.cc
* Address comments
* Add comment
* Use same pipeline file for eager and lort pipelines
* Address comments
* Add yaml comment
* Fix cmake files
* Address comments
* Rename flags, remove printing code, remove dead comment
* Remove ostream operator<< definitions for TensorShapeProto and TensorProto as they clash with ONNX definitions in onnx/defs/printer.h/cc.
Currently printer.h (unnecessarily) pulls in a number of other ONNX headers which causes naming clashes with parts of ORT. It is also excluded in a minimal build.
Instead convert the onnx::TensorShapeProto to onnxruntime::TensorShape so we use the existing ostream operator<< for TensorShape.
Make GetTensorShapeFromTensorProto consistent with GetTensorShapeFromTensorShapeProto so both return a TensorShape (as the name implies).
QDQ loss debug - Weights Matching
Part 2 of QDQ loss debugging tool: given a float model and its qdq model, return the matching of all weight tensors and their corresponding dequantized weights from the qdq model.
The LLVM compiler complains about std::hash<const char*> and suggests std::hash<const void*>. The intention, however, is to hash the name string rather than the pointer, so use std::hash<std::string> to be explicit.
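The difference between the two hashes can be sketched like this. It is an illustration only; the helper names `HashByPointer` and `HashByContent` are hypothetical and do not appear in the change.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hashes the address stored in the pointer, not the characters behind it.
// Two separate buffers holding the same text are not guaranteed to match.
inline std::size_t HashByPointer(const char* name) {
  return std::hash<const void*>{}(name);
}

// Hashes the characters, so equal names hash equally whatever their address.
inline std::size_t HashByContent(const char* name) {
  assert(name != nullptr);  // constructing std::string from nullptr is undefined
  return std::hash<std::string>{}(std::string(name));
}
```

Hashing by content costs a string construction per call, which is the price of making the intent explicit.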
* add AddBiasTranspose kernel, new format of weights
* Use compact global_q in GEMM
* sequence_index from BxS to S; new stream for copy
* merge input and output pointers in scratch2
* update default benchmark tests
* add new format 0 for weight and bias
* avoid integer overflow
* check gpu memory
* output summary in benchmark
* add logging
* update unit tests with non empty bias value
* add rocblasGemmHelper and rocblasGemmStridedBatchedHelper for ROCm
* use std::variant for synthetic data storage.
* use std::variant to replace TypedCheckpointProperty
* Remove shared ptr for checkpoint property
* fix tests
* refine std::variant usage a bit
* remove CheckpointProperty data abstraction
* use InlinedVector and InlinedHashMap if possible
* fix comments
* fix build and test
* fix some comments
* use gsl::span
* fix tests
* refine based on comments
* fix win build
* fix build
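As a rough sketch of the std::variant items in the list above: a single variant type can stand in for a per-type wrapper class. Everything below is hypothetical and requires C++17. `PropertyValue`, `PropertyBag`, and the chosen alternative types are assumptions for illustration, not the actual checkpoint property code.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <variant>

// Hypothetical value type: one variant instead of a templated wrapper per type.
using PropertyValue = std::variant<int64_t, float, std::string>;

class PropertyBag {
 public:
  // Stores `value` under `name`, replacing any earlier value.
  void Add(const std::string& name, PropertyValue value) {
    assert(!name.empty());
    values_.insert_or_assign(name, std::move(value));
  }

  // Returns a pointer to the stored value if `name` exists and holds a T,
  // otherwise nullptr. No exceptions and no shared_ptr are involved.
  template <typename T>
  const T* TryGet(const std::string& name) const {
    auto it = values_.find(name);
    if (it == values_.end()) return nullptr;
    return std::get_if<T>(&it->second);
  }

 private:
  std::unordered_map<std::string, PropertyValue> values_;
};
```

Values are held by value inside the map, so callers check the pointer returned by `TryGet<T>` instead of downcasting a base class.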