* Code refactor
* Modify code to tackle OOM when calibrating on a large dataset
* Fix mismatch issue when setting keepdims on ReduceMin/ReduceMax
* Add COCO val 2017 annotation
* Fix mismatch issue when setting keepdims on ReduceMin/ReduceMax
* Fix bug of "No module named:onnxruntime.quantization.CalTableFlatBuffers"
* Check and install flatbuffers module
* Add script to download COCO dataset images and refactor example
* Fix bug of "No module named:onnxruntime.quantization.CalTableFlatBuffers"
* Add CalTableFlatBuffers as module
* Remove annotation, user can download by themselves.
* Uncomment code
* Add back instances_val2017.json
* Make sure flatbuffers installed when ORT is installed
* Refactor code to call coco api
* Enable FP16 for example
* Added new Transpose+Cast+MatMul => Cast+FusedMatMul test scenarios.
* The Cast node may feed more than one node.
* Transpose node may feed multiple nodes and still may be fused with MatMul nodes.
A model from one of our partners regressed with a failure to evaluate due to the addition of strided 64-bit emulation in the DML EP for the Cast operator. Specifically, the model uses a Cast from int32 to int64 to produce the input shape to a Reshape node. When supplied with a shape dimension of -1 (int32 0xffffffff), the strided emulation in Cast ends up producing an int64 result of 0x00000000ffffffff. This is then fed into the Reshape operator, where it produces an incorrect tensor shape and a failure during evaluation.
Generally speaking we assume that using strided 64-bit emulation is safe if a node's inputs came from the DML EP itself. This isn't true in the general case for Cast, however - casting negative signed values can and will produce incorrect outputs with strided emulation.
After this change, Cast nodes with 64-bit types will fall back to CPU unless running on a GPU that natively supports 64-bit datatypes.
Related work items: #31768166
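The failure mode described above can be reproduced in a few lines. This is only a sketch: it assumes the strided emulation amounts to writing the 32-bit pattern into the low half of a 64-bit slot and zeroing the high half, which is consistent with the 0x00000000ffffffff result reported above. The function names are hypothetical and are not part of the DML EP.

```python
import struct


def cast_int32_to_int64_emulated(value: int) -> int:
    """Assumed model of the strided emulation: copy the low 32 bits and
    leave the high 32 bits zero (no sign extension)."""
    low = struct.pack("<i", value)  # 4 bytes, little-endian int32
    return struct.unpack("<q", low + b"\x00" * 4)[0]


def cast_int32_to_int64_sign_extended(value: int) -> int:
    """Reference behaviour: replicate the sign bit into the high 32 bits."""
    low = struct.pack("<i", value)
    fill = b"\xff" if value < 0 else b"\x00"
    return struct.unpack("<q", low + fill * 4)[0]


# A Reshape dimension of -1 survives a real cast but not the emulated one.
emulated = cast_int32_to_int64_emulated(-1)
reference = cast_int32_to_int64_sign_extended(-1)
print(hex(emulated), reference)
```

Non-negative values give the same result in both functions, which is why the problem only surfaces for negative inputs such as the -1 wildcard dimension.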
* Liqun/ort module perf1 (#6806)
add mysql script to log perf data
Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Resolve HTTP Error 503: Service Unavailable for MNIST dataset (#6989)
* Reduce logging for ORTModule for the end user (#6982)
* Support none types in forward output (#7001)
* Missed test case for none type output (#7014)
* Fix code style according to autopep8
Co-authored-by: liqunfu <liqfu@microsoft.com>
Co-authored-by: baijumeswani <bmeswani@microsoft.com>
Add ARM64X implementation libs, to be forwarded to by the ARM64X lib.
From Ben Niu:
For system dlls that are built outside of windows repo and ingested through vpacks or binary check-ins, we always start by trying to port them to ARM64X. However, due to immature support for ARM64X build from Visual Studio 2019, it could be quite uphill to port dlls to ARM64X.
When that happens, we have an alternative that does not require porting the dll to ARM64X. The alternative is to build an ARM64X pure forwarder from the windows repo, for example, onnxruntime.dll. That forwarder does nothing but forward all the ARM64 API calls to a native ARM64 onnxruntime_arm64.dll, and all the x64 API calls to a native x64 onnxruntime_amd64.dll. Please see here for an example: 29ae6ca516
At load time, applications still load the ARM64X forwarder onnxruntime.dll. In an ARM64 process, that forwarder dll will further load the native ARM64 onnxruntime_arm64.dll; otherwise, the x64 onnxruntime_amd64.dll will be loaded. Either way, both the ARM64 and x64 dlls are happy.
The onnxruntime_arm64.dll and onnxruntime_amd64.dll are essentially aliases of their native counterparts, but we cannot directly rename existing native dlls in windows build. The reason is about PDB binplacing. If you simply rename a dll, the PDB name embedded in the dll is still unchanged. So you can imagine that if we just rename the native dlls in ARM64 windows build, there will be two renamed native dlls, onnxruntime_arm64.dll and onnxruntime_amd64.dll, sharing the same PDB name onnxruntime.pdb. When binplacing happens (basically moving dll and pdb from os\obj to os\bin), one PDB will overwrite the other. As a result, we either lose the PDB for the ARM64 dll, or the x64 dll.
That’s why we are asking to change the build pipeline to execute the link commands two extra times to produce onnxruntime_arm64/amd64.dll with different pdb names. You don’t need to do the compilation twice, but just the link. See here for an example: https://microsoft.visualstudio.com/DefaultCollection/Xbox/_git/Xbox.ShaderCompiler.WinTools/pullrequest/5291717
Related work items: #31925159
Change int32_t->ptrdiff_t when interacting with the threadpool.
Migrate more code from MlasMaskMoveAvx->MlasMaskMoveTableAvx.
Update more code to use FUNCTION_ENTRY macro.
Changes include:
* Revert Event Pool changes
* Add copyright and revert unrelated changes
* Add DLPack as submodule and remove to_dlpack and from_dlpack from public API
* Update golden numbers for DHP Parallel tests
* Update ORTTrainer unit test numbers
* Rollback to DLPack v0.3
* Disable flaky test
* Update third party notices and CG manifest file
* Minor refactoring of ORTValue API
1. Migrated it to Ed's new docker build script
2. Use python 3.6 instead, because it is the default one in ubuntu 18.04
3. Move the "pip install" command to the docker image build stage (instead of when running the image)
Miscellaneous changes to synchronize the style used over time:
Remove unneeded PFN types in favor of FN*.
Switch more functions over to using the common FUNCTION_ENTRY macro.
Switch logistic/tanh kernels over to the style used in TransKernelFma3.asm.
1. Remove openmp related packaging pipelines and build jobs.
2. Set continueOnError to true for the TSAUpload tasks. Their service has been unstable recently.
3. Update Ubuntu 16 docker images to Ubuntu 18, in preparation for getting C++17 support
4. Cherry-pick the changes in 1.7.1 to the master: updating CFLAGS/CXXFLAGS to strip out debug symbols
Add functionality to the Graph class to be dumped to protobuf using an external binary file for the float initializers.
This change is meant to avoid hitting the 2GB protobuf limit when dumping large graphs.
This limit was particularly easy to exceed when dumping graphs after auto-diff.
The use of the external file is limited to initializers larger than a user-specified threshold.
This lets users keep in the onnx file the shape constants consumed by Reshape and Transpose, which Shape Inference relies on.
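The threshold rule can be sketched as follows. This is an illustration only, not the Graph class implementation: `split_initializers`, its arguments, and the (offset, length) index layout are hypothetical, and the sketch treats every initializer as raw bytes rather than restricting itself to float tensors.

```python
from typing import Dict, Tuple


def split_initializers(
    initializers: Dict[str, bytes], threshold: int
) -> Tuple[Dict[str, bytes], Dict[str, Tuple[int, int]], bytes]:
    """Keep initializers of at most `threshold` bytes inline and move the
    larger ones into one external blob, recording (offset, length) per name."""
    inline: Dict[str, bytes] = {}
    external_index: Dict[str, Tuple[int, int]] = {}
    blob = bytearray()
    for name, raw in initializers.items():
        if len(raw) > threshold:
            external_index[name] = (len(blob), len(raw))
            blob.extend(raw)
        else:
            inline[name] = raw
    return inline, external_index, bytes(blob)


# Hypothetical graph: a tiny shape constant and two large weight tensors.
example = {
    "reshape_shape": b"\x00" * 16,
    "weight": b"\x01" * 4096,
    "bias": b"\x02" * 2048,
}
inline, external_index, blob = split_initializers(example, threshold=1024)
print(sorted(inline), external_index, len(blob))
```

With a threshold of 1024 bytes, the 16-byte shape constant stays inline where Shape Inference can read it, while both large tensors move to the external blob.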
* fusion support runtime edge shape checking
* trim ctor
* add test
* fix
* Update test_shape_infer_helper.py
* use torch input size as dynamic axis hints
* check dir
* update
* support longformerattention
* update and add support for bert ops
* trim
* review comments
* review comments
Unsolved problems:
1. One test failure was caused by a bug in the Cudnn rnn kernels: they can allocate a buffer and only partially initialize it, and the garbage data near the tail of the buffer caused problems on some hardware. To attack this problem in a broader sense, should we add code to our allocators so that, during a memory fuzzing test, an allocated buffer is filled with garbage before being returned to the caller?
2. Prepacking is used more widely than we know. For instance, Cudnn rnn kernels also cache their weights. They mix several weight tensors together into a single buffer, and never touch the original weight tensors again. This is the same idea as pre-packing, but they didn't override the virtual function, and they never tried to release those weight tensors, leading to memory waste. It also seems to me that some other kernels have similar behavior. I wonder how much memory we could save if we cleaned those up too.
3. Turning off memory pattern planning does increase memory fragmentation, leading to out-of-memory errors in some training test cases. Perhaps we can revisit the idea of moving the kernel-creation stage earlier, and then, during initializer deserialization, only skip tracing for the initializers that will be prepacked.
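The allocator idea raised in item 1 could look roughly like the sketch below. It is a toy model under stated assumptions, not the ORT allocator: the poison value 0xCD, the `fuzz` flag, and all function names are hypothetical.

```python
POISON_BYTE = 0xCD  # hypothetical garbage pattern used only in fuzzing mode


def allocate(size: int, fuzz: bool = False) -> bytearray:
    """Return a buffer of `size` bytes; in fuzzing mode it is pre-filled with
    a poison pattern so that reads of uninitialized memory become visible."""
    if fuzz:
        return bytearray([POISON_BYTE]) * size
    return bytearray(size)


def partially_initialize(buf: bytearray, used: int) -> None:
    """Model a kernel that writes only the first `used` bytes of its buffer."""
    buf[:used] = bytes(used)


def count_poisoned_tail(buf: bytearray) -> int:
    """Count trailing bytes that still hold the poison pattern."""
    count = 0
    for byte in reversed(buf):
        if byte != POISON_BYTE:
            break
        count += 1
    return count


buf = allocate(8, fuzz=True)
partially_initialize(buf, 5)
print(count_poisoned_tail(buf))
```

A kernel that later consumes the whole buffer would then read a recognizable pattern rather than whatever the hardware happened to leave behind, which makes this class of bug easier to reproduce.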