* init version to use graph instead of model_proto for IsOpSupported
* move add to modelbuilder to use graph node
* move the rest of model_builder to use graph instead of modelproto
* remove redundant code
* Clear some redundant code
* merge master and some minor style changes
* move the check for whether an initializer is external to individual ops instead of the whole graph
* Addressed comments
* Change GetType and GetShape to log warning info internally to simplify the caller, remove some redundant onnxruntime namespace
* add squeeze op support, some more code style clean up
* fix a bug where duplicate output can be added to a subgraph, some other minor logging changes
* Add protobuf mutator library as a git submodule
* Added files and instructions to build the protobuf mutator library in CMake
* Added fuzzing flag to build system and added fuzzing dependency library. To run fuzzing test use the flags --fuzz_testing --build_shared_lib --use_full_protobuf --cmake_generator 'Visual Studio 16 2019'
* Added src files and build instructions for the main fuzzing engine
* Removed Random number generation test from inside the engine
* Added license header to files
* Removed all pep8 violations introduced by this change and other E501 violations
* Draft for LayerNorm Optimization
* Modify LayernormGrad kernel based on new backward graph.
* keep two LayernormGrad implementations.
One is implemented based on input X, mean. The other is based on output Y, scale, bias. The first one is enabled by default. The second one can be enabled by --use_invertible_layernorm_grad
* expose use_invertible_layernorm_grad to frontend.
* add fp16 tests.
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
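The two LayernormGrad variants above differ in which tensors they read. A minimal sketch of why the second one can work, assuming a nonzero scale: the normalized activation can be recovered either from X and mean or from Y, scale and bias. The function names here are illustrative and are not taken from the kernels.

```python
import math

def layer_norm(x, scale, bias, eps=1e-5):
    """Forward LayerNorm over one row; returns (y, mean, inv_std)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    inv_std = 1.0 / math.sqrt(var + eps)
    y = [(v - mean) * inv_std * s + b for v, s, b in zip(x, scale, bias)]
    return y, mean, inv_std

def x_hat_from_input(x, mean, inv_std):
    """Normalized activation computed from the input X and the saved mean."""
    return [(v - mean) * inv_std for v in x]

def x_hat_from_output(y, scale, bias):
    """Same quantity recovered from the output Y (assumes scale != 0)."""
    return [(v - b) / s for v, s, b in zip(y, scale, bias)]

x = [1.0, 2.0, 4.0, 8.0]
scale = [0.5, 1.0, 1.5, 2.0]
bias = [0.1, -0.2, 0.3, 0.0]
y, mean, inv_std = layer_norm(x, scale, bias)
from_input = x_hat_from_input(x, mean, inv_std)
from_output = x_hat_from_output(y, scale, bias)
```

Recovering the normalized activation from Y means X does not have to be kept alive for the backward pass, which is presumably the motivation for the --use_invertible_layernorm_grad option.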
* optimize transpose
* optimize for the case when the tensor is 3D and the permutation is done in the last two dimensions.
BERT-L throughput is improved ~1.4% from transpose optimization
* fix UT MegatronSelfAttentionPartitionCorrectnessTest
* polish code.
* add test and change tile size to 16x16 for better perf.
* fix UT
* fix test of mask_rcnn
* address code review comments.
Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
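The 3D case above (permuting the last two dimensions, processed in 16x16 tiles) can be illustrated on the CPU. This is only a sketch of the tiling idea on a flat row-major buffer, not the kernel from this change; the function name and the flat-list layout are assumptions made for the example.

```python
TILE = 16  # tile edge mentioned in the commit message

def transpose_last_two(src, batch, rows, cols, tile=TILE):
    """Transpose a flat row-major (batch, rows, cols) buffer to (batch, cols, rows)."""
    dst = [0] * (batch * rows * cols)
    for n in range(batch):
        base = n * rows * cols
        # Walk the matrix tile by tile so reads and writes stay local.
        for r0 in range(0, rows, tile):
            for c0 in range(0, cols, tile):
                for r in range(r0, min(r0 + tile, rows)):
                    for c in range(c0, min(c0 + tile, cols)):
                        dst[base + c * rows + r] = src[base + r * cols + c]
    return dst

src = list(range(2 * 3 * 5))
dst = transpose_last_two(src, 2, 3, 5)
```

Applying the function twice with the swapped shape returns the original buffer, which is a convenient correctness check.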
For BiasGeluGradDxKernel:
- Implement optimization to first load from global memory into registers as suggested by Weixing.
- Support larger bias sizes which were previously limited by the number of threads per block.
- Address flaky unit test by increasing the error tolerance to the default value.
* Use the file size while reading onnx models. Ensure models are loaded using APIs in model.h for consistency.
* Refactor existing GetFileLength in posix.cc and address PR comments.
* Fix linux build - signed/unsigned conversion
* Add ability to specify just the device when using IOBinding for an output. This enables keeping an output on a different device (e.g. GPU) when it has a dynamic size that is not known ahead of graph execution.
* Keep loss subgraph as FP32 in mixed-precision training.
* Fix case where there is no white-list loss op.
* Get nodes from loss_scale instead of whitelist.
* rename const variables.
Co-authored-by: Vincent Wang <weicwang@OrtDevTest2v100.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
* Change NNAPI CI to run on new NNAPI EP
* update android ci to mac 10.15 and remove the cmake install step
* update the android ci to target android api level 29
* remove unnecessary ndk install git submodule call
* Support quantization linear binary element-wise math ops, implement QLinearAdd.
Support tests for quantization linear binary element-wise math ops, implement test for QLinearAdd.
Add QLinearAdd with SSE2 intrinsic implementation, AVX2 assembly implementation, and NEON intrinsic support.
QLinearAdd supports VectorOnVector, VectorOnScalar, ScalarOnVector.
Generalize QLinearBinaryOp parallelization with respect to broadcasting.
* Modify according to PR feedback. Mainly:
* template helper to generalize the qladd logic on v2v, s2v, v2s
* remove GetKernel-related code.
* change mixed legacy MM/SSE code in the AVX code
* formatter, typos, conventions, etc.
* Utilize MlasSubtractInt32x4 in MlasDequantizeLinearVector().
* Some format fix.
* More natural parallel parameter type.
* Fix build break for x86.
* Comment goes to 80 before wrap.
* Many macro-related changes to the assembly.
Use vminps rather than vpminsd to handle NaN.
tested on windows.
* Using CLang Format to format the file.
* Fix arm32 build error.
* Remove some duplicate in different #if defined
* working add.u8.vector to vector
* Fix runtime bus error on real arm32 linux.
* fix typo in store last one lane.
* arm32 qlinearadd handle scalar.
* Move qladd to separate c++ file
* Add neon64 qladd.
* refactor some, improve two instructions using arm64-only instructions
* Fix typo for arm64
* use strict op in pure c++ (min/max on float value)
* sse2 new version.
* merge arm/sse2/avx2
* pass arm/sse/avx2 linux test
* remove non-used assembly file.
* Remove unused data definition and trailing spaces.
* Fix broadcasting parallel issue.
* Enhance broadcasting scenarios. Allow test result differences due to rounding on half.
* Add Mlas or MLAS_ prefix for namespace safety.
* Handle alignment issue for arm32 for GCC/MSVC. remove some unused
signed/unsigned int ops.
* Specify /arch:AVX2 for qladd_avx2.cpp
* Fix type during copy/paste when unrolling. Better one GreatEqual
condition. Better formatting by splitting two statements that were on a single line.
* Arm neon alignment parameter is bits rather than bytes, change it.
* Move qladd_avx2.cpp to intrinsics/avx2/ folder
* Formatting using mlas style.
* Double check mlas style for these files.
* change indent 2 to 4 for qladd_avx2.cpp
* Fix windows x86 build error due to sse2 no _mm_cvtsi128_si64
* To re-trigger all as old failed pipeline updated.
Co-authored-by: Lei Zhang <phill.zhang@gmail.com>
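For reference, the arithmetic behind QLinearAdd can be sketched in scalar form: dequantize both inputs, add, requantize, and saturate. This is a plain-Python illustration for uint8, with a scalar operand standing in for the VectorOnScalar and ScalarOnVector cases. The function name, the list-or-int convention, and the use of Python's round (round half to even) are assumptions of this sketch; the vectorized kernels may round halfway values differently, which is what the test tolerance mentioned above accounts for.

```python
def qlinear_add(a, a_scale, a_zp, b, b_scale, b_zp, c_scale, c_zp):
    """uint8 QLinearAdd sketch; a and b are lists, or ints for scalar broadcast."""
    a_vec = a if isinstance(a, list) else None
    b_vec = b if isinstance(b, list) else None
    if a_vec is not None and b_vec is not None and len(a_vec) != len(b_vec):
        raise ValueError("vector operands must have the same length")
    if a_vec is not None:
        n = len(a_vec)
    elif b_vec is not None:
        n = len(b_vec)
    else:
        n = 1
    out = []
    for i in range(n):
        qa = a_vec[i] if a_vec is not None else a
        qb = b_vec[i] if b_vec is not None else b
        # Dequantize both operands and add in real (float) space.
        real = a_scale * (qa - a_zp) + b_scale * (qb - b_zp)
        # Requantize with the output scale and zero point, then saturate to uint8.
        q = round(real / c_scale) + c_zp
        out.append(max(0, min(255, q)))
    return out

vector_on_vector = qlinear_add([10, 20, 30], 0.5, 0, [4, 8, 12], 0.25, 2, 0.5, 1)
vector_on_scalar = qlinear_add([10, 20, 30], 0.5, 0, 4, 0.25, 2, 0.5, 1)
```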
* Implement BiasDropout Fusion and Kernel
Dropout kernel for residual input
BiasDropout Fusion to take residual input
Fix BiasDropout Kernel
Optimize DropoutGrad with 4 elements per thread
* Add graph transformer UT
* MLTypeCallDispatcher for RatioData
* Use MLTypeDispatcher for ratio tensor
* Handle training_mode input for BiasDropout fusion
* Add test case for missing ratio input
* Replace using FinalizeNodeFusion
* Make BiasDropout kernel template-less
* Make DropoutGrad template-less
* Make Dropout and TrainableDropout template-less
* Regenerate onnx file for UT
* Minor fix on divmod in BiasDropoutKernel
* Adjust pt frontend test due to dropout randomness
* Make dropout kernel operate in fp32
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
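A rough scalar model of the fused computation described above is dropout applied to (input + bias), followed by adding the residual, with the gradient scaled by the same factor. The sketch below takes the mask as an explicit argument so the result is deterministic. The function names, the explicit mask, and the modulo-based bias broadcast over the last dimension are assumptions for illustration, and ratio is assumed to be strictly less than 1.

```python
def bias_dropout(x, bias, residual, mask, ratio):
    """Sketch: dropout(x + bias) + residual on flat lists; mask holds 0/1 keep flags."""
    scale = 1.0 / (1.0 - ratio)  # kept elements are scaled up; assumes ratio < 1
    n = len(bias)
    out = []
    for i in range(len(x)):
        # bias is broadcast along the last dimension, hence the modulo index
        kept = (x[i] + bias[i % n]) * scale if mask[i] else 0.0
        out.append(kept + residual[i])
    return out

def dropout_grad(dy, mask, ratio):
    """Gradient w.r.t. the dropout input: same mask, same scale."""
    scale = 1.0 / (1.0 - ratio)
    return [g * scale if m else 0.0 for g, m in zip(dy, mask)]

y = bias_dropout([1.0, 2.0, 3.0, 4.0], [0.5, -0.5], [10.0] * 4, [1, 0, 1, 1], 0.5)
dx = dropout_grad([1.0, 1.0, 1.0, 1.0], [1, 0, 1, 1], 0.5)
```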
* Update function body initialization
* minor fix
* changes per review comments
* minor fix
* format fix
* add function initialization in mixed precision transformer
* more updates
* more fixes
* Support two more formats of the mask_index input: 2D attention mask, or 1D mask index with end and start positions.
* Update dynamic axes of gpt2 with past state
* Update script to fuse model with attention mask
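To make the mask_index formats concrete, the sketch below expands a 1D mask index into the equivalent 2D attention mask. It assumes that a 1D input of length batch_size holds exclusive end positions, and that a 1D input of length 2 * batch_size holds the end positions first, followed by the start positions. That layout and the function name are assumptions of this example, not taken from the change itself.

```python
def mask_from_index(mask_index, batch_size, seq_len):
    """Expand a 1D mask index into a 2D (batch_size, seq_len) 0/1 attention mask."""
    if len(mask_index) == batch_size:
        ends = mask_index
        starts = [0] * batch_size
    elif len(mask_index) == 2 * batch_size:
        # Assumed layout: end positions first, then start positions.
        ends = mask_index[:batch_size]
        starts = mask_index[batch_size:]
    else:
        raise ValueError("mask_index must have batch_size or 2 * batch_size entries")
    return [
        [1 if starts[b] <= t < ends[b] else 0 for t in range(seq_len)]
        for b in range(batch_size)
    ]

end_only = mask_from_index([3, 2], batch_size=2, seq_len=4)
end_and_start = mask_from_index([4, 3, 1, 0], batch_size=2, seq_len=4)
```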
* add support to internally transpose nchw input to nhwc and only transpose back if it is necessary
* more changes in nchw<->nhwc, fixed small issue in concat
* Add option for NNAPI to run on all devices / CPU only / non-CPU only
* minor code style changes
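The NCHW to NHWC conversion mentioned above is an index permutation. A minimal sketch on a flat row-major buffer follows; the function names are illustrative only and do not come from the NNAPI code.

```python
def nchw_to_nhwc(src, n, c, h, w):
    """Reorder a flat row-major NCHW buffer into NHWC order."""
    dst = [0] * (n * c * h * w)
    for ni in range(n):
        for ci in range(c):
            for hi in range(h):
                for wi in range(w):
                    src_idx = ((ni * c + ci) * h + hi) * w + wi
                    dst_idx = ((ni * h + hi) * w + wi) * c + ci
                    dst[dst_idx] = src[src_idx]
    return dst

def nhwc_to_nchw(src, n, c, h, w):
    """Inverse permutation, needed only when a consumer expects NCHW."""
    dst = [0] * (n * c * h * w)
    for ni in range(n):
        for hi in range(h):
            for wi in range(w):
                for ci in range(c):
                    src_idx = ((ni * h + hi) * w + wi) * c + ci
                    dst_idx = ((ni * c + ci) * h + hi) * w + wi
                    dst[dst_idx] = src[src_idx]
    return dst

nchw = list(range(8))  # shape (1, 2, 2, 2)
nhwc = nchw_to_nhwc(nchw, 1, 2, 2, 2)
```

Keeping intermediate tensors in NHWC and converting back only at outputs that require NCHW avoids paying for the round trip on every op.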