1. Remove openmp related packaging pipelines and build jobs.
2. Set continueOnError to true for the TSAUpload tasks. Their service is unstable recently.
3. Update Ubuntu 16 docker images to Ubuntu 18, in prepare for getting C++17 support
4. Cherry-pick the changes in 1.7.1 to the master: updating CFLAGS/CXXFLAGS to strip out debug symbols
Add functionality to the Graph class to be dumped to protobuf using an external binary file for the float initializers.
This change is meant to avoid hitting the 2GB protobuf limit when dumping large graphs.
This limit was particularly easy to exceed when dumping graphs after auto-diff.
The use of the external file is limited to initializers larger than a user-specified threshold.
This gives the possibility to users to include in the onnx file shape constants used by Reshape and Transpose used by Shape Inference.
* fusion support runtime edge shape checking
* trim ctor
* add test
* fix
* Update test_shape_infer_helper.py
* use torch input size as dynamic axis hints
* check dir
* update
* support longformerattention
* update and add support for bert ops
* trim
* review comments
* review comments
Unsolved problems:
1. One test failure was caused by a bug in Cudnn rnn kernels, when they can allocate a buffer and partially initialize it, the garbage data near tail of the buffer caused problem in some of the hardware. To attack this problem in a broader sense, should we add code in our allocators, and during a memory fuzzing test, fill an allocated buffer with garbage before returning to the caller?
2. Prepacking is used more widely than we know. For instance, Cudnn rnn kernels also cache their weights. They mix several weight tensors together into a single buffer, and never touch the original weight tensor anymore. This is the same idea with pre-pack, but they didn't override the virtual function, and they never tried to release those weight tensors, leading to memory waste. It also seems to me that there are some other kernels have similar behavior. Wonder how much memory we can save if we try to cleanup those too.
3. Turning off memory pattern planning does increase memory fragmentation, leading to out of memory error in some training test cases. Perhaps we can revisit the idea of pushing kernels-creation stage earlier, and then during initializer deserialization, we only avoid tracing those that will be prepacked.
Various updates to the int8_t GEMMs:
1) Add ARM64 udot kernel to take advantage of dot product instructions available in newer cores. Some models run 4x faster than the stock implementation we used before.
2) Refactor the x64 kernels to share common code for AVX2(u8u8/u8s8/avxvnni) vs AVX512(u8u8/u8s8/avx512vnni) to reduce binary size.
3) Extend kernels to support per-column zero points for matrix B. This is not currently wired to an operator.
* Update EyeLike CPU kernel.
* Update Mod CPU kernel.
* Update Multinomial CPU kernel.
* Slight improvement to Pad CPU kernel binary size.
* Update RandomNormal[Like], RandomUniform[Like] CPU kernels.
* Added code for Relugrad with GPU support.
Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Add GPU support for DNNL ConvGrad
Signed-off-by: George Nash <george.nash@intel.com>
* Add GPU support for DNNL MaxPoolGrad
Updates to MaxPool for training with GPU
Update oneDNN to version 1.8.1
Signed-off-by: George Nash <george.nash@intel.com>
* Fixed issues found durring code review
- error in code comment
- using auto when the direct type would have been better
- removed ternary operators that were returning bool values
Signed-off-by: George Nash <george.nash@intel.com>
Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
* Added support to interchange Cast and Transpose operations.
* Added ONNX models for the Transpose-Cast-MatMul fusion testcases.
* Added python code to generate the ONNX models required for testing Transpose+Cast+Matmul fusion to Cast+FusedMatMul.
* Added diagram of the Transpose+MatMul fusion documentation
Cleaning up some naming in the op kernel type control infrastructure.
"Supported types" was a bit semantically overloaded. Renamed it to "default types". They are the types that are supported by default.
Implement an alternate workaround for the LLVM x86 problem described in PR #5088. That change made the x86 assembly files build with the GNU assembler by using -fno-integrated-as
* disable materialize grads
* gradient builder bugfix
* fix ut
* fix ut
* resolve comments and bugfix
* add more assert
* disable forward compare for now