* Change NNAPI CI to run on new NNAPI EP
* update android ci to macOS 10.15 and remove the cmake install
* update the android ci to target android api level 29
* remove unnecessary ndk install git submodule call
* Support quantized linear binary element-wise math ops, implement QLinearAdd.
Support tests for quantized linear binary element-wise math ops, implement test for QLinearAdd.
Add QLinearAdd with SSE2 intrinsic implementation, AVX2 assembly implementation, and NEON intrinsic support.
QLinearAdd supports VectorOnVector, VectorOnScalar, ScalarOnVector.
Generalize QLinearBinaryOp parallelization with respect to broadcasting.
* Modify according to PR feedback. Mainly:
* template helper to generalize the qladd logic on v2v, s2v, v2s
* remove GetKernel related.
* change mixed legacy MM/SSE code in the AVX code
* formatter, typos, conventions, etc.
* Utilize MlasSubtractInt32x4 in MlasDequantizeLinearVector().
* Some format fix.
* More natural parallel parameter type.
* Fix build break for x86.
* Comment goes to 80 before wrap.
* Many changes to the assembly, mostly macro related.
Use vminps rather than vpminsd to handle NaN.
Tested on Windows.
* Using CLang Format to format the file.
* Fix arm32 build error.
* Remove some duplicate in different #if defined
* working add.u8.vector to vector
* Fix runtime bus error on real arm32 linux.
* fix typo in store last one lane.
* arm32 qlinearadd handle scalar.
* Move qladd to separate c++ file
* Add neon64 qladd.
* refactor some, enhance two instructions on arm64 only instructions
* Fix typo for arm64
* use strict op in pure c++ (min/max on float value)
* sse2 new version.
* merge arm/sse2/avx2
* pass arm/sse/avx2 linux test
* remove non-used assembly file.
* Remove unused data definition and trailing spaces.
* Fix broadcasting parallel issue.
* Enhance broadcasting scenarios. Allow test result differences due to
rounding on half.
* Add Mlas or MLAS_ prefix for namespace safety.
* Handle alignment issue for arm32 for GCC/MSVC. remove some unused
signed/unsigned int ops.
* Specify /arch:AVX2 for qladd_avx2.cpp
* Fix type during copy/paste when unrolling. Better one GreatEqual
condition. Better formatting by splitting two statements on single line.
* Arm neon alignment parameter is bits rather than bytes, change it.
* Move qladd_avx2.cpp to intrinsics/avx2/ folder
* Formatting using mlas style.
* Double check mlas style for these files.
* change indent 2 to 4 for qladd_avx2.cpp
* Fix windows x86 build error due to sse2 no _mm_cvtsi128_si64
* To re-trigger all as old failed pipeline updated.
Co-authored-by: Lei Zhang <phill.zhang@gmail.com>
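
For reference, the QLinearAdd work above can be checked against a plain scalar version. The following is a minimal sketch, assuming the usual linear quantization relation real = scale * (q - zero_point) on uint8 data. `QuantParams`, `QuantizeU8`, `DequantizeU8` and `QLinearAddRef` are hypothetical names for illustration, not the MLAS API, and an input of size 1 stands in for the scalar broadcast cases.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical quantization parameters: real = scale * (q - zero_point).
struct QuantParams {
    float scale;
    std::int32_t zero_point;
};

// Quantize one float to uint8 with saturation to [0, 255].
// std::nearbyint follows the current rounding mode (round-half-to-even by
// default); an implementation that rounds halves differently can be off by one.
inline std::uint8_t QuantizeU8(float value, const QuantParams& p) {
    float q = std::nearbyint(value / p.scale) + static_cast<float>(p.zero_point);
    q = std::min(255.0f, std::max(0.0f, q));
    return static_cast<std::uint8_t>(q);
}

inline float DequantizeU8(std::uint8_t q, const QuantParams& p) {
    return p.scale * static_cast<float>(static_cast<std::int32_t>(q) - p.zero_point);
}

// Reference add: dequantize both inputs, add in float, requantize.
// An input holding a single element is broadcast as a scalar.
std::vector<std::uint8_t> QLinearAddRef(const std::vector<std::uint8_t>& a, const QuantParams& pa,
                                        const std::vector<std::uint8_t>& b, const QuantParams& pb,
                                        const QuantParams& pc) {
    assert(!a.empty() && !b.empty());
    assert(a.size() == b.size() || a.size() == 1 || b.size() == 1);
    const std::size_t n = std::max(a.size(), b.size());
    std::vector<std::uint8_t> c(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float va = DequantizeU8(a[a.size() == 1 ? 0 : i], pa);
        const float vb = DequantizeU8(b[b.size() == 1 ? 0 : i], pb);
        c[i] = QuantizeU8(va + vb, pc);
    }
    return c;
}
```

A vectorized kernel can be compared element by element against this reference, allowing a difference of one where the two round halves differently.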
* Implement BiasDropout Fusion and Kernel
Dropout kernel for residual input
BiasDropout Fusion to take residual input
Fix BiasDropout Kernel
Optimize DropoutGrad with 4 elements per thread
* Add graph transformer UT
* MLTypeCallDispatcher for RatioData
* Use MLTypeDispatcher for ratio tensor
* Handle training_mode input for BiasDropout fusion
* Add test case for missing ratio input
* Replace using FinalizeNodeFusion
* Make BiasDropout kernel template-less
* Make DropoutGrad template-less
* Make Dropout and TrainableDropout template-less
* Regenerate onnx file for UT
* Minor fix on divmod in BiasDropoutKernel
* Adjust pt frontend test due to dropout randomness
* Make dropout kernel operation in fp32
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
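
As an illustration of what the BiasDropout fusion above folds into one kernel, here is a minimal CPU sketch. It assumes the fused computation is output = dropout(input + bias) + residual with the common inverted-dropout scaling of 1 / (1 - ratio), and that the bias is broadcast over the last dimension. `BiasDropoutRef` is a hypothetical name, and the mask is passed in explicitly so the result is deterministic; the real kernel generates its mask internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference. mask[i] == true means element i is kept.
// residual may be empty, which models the "no residual input" case.
std::vector<float> BiasDropoutRef(const std::vector<float>& input,
                                  const std::vector<float>& bias,
                                  const std::vector<float>& residual,
                                  const std::vector<bool>& mask,
                                  float ratio) {
    assert(!bias.empty() && input.size() % bias.size() == 0);
    assert(mask.size() == input.size());
    assert(residual.empty() || residual.size() == input.size());
    assert(ratio >= 0.0f && ratio < 1.0f);

    const float scale = 1.0f / (1.0f - ratio);  // inverted-dropout scaling
    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        // Bias index is the position within the last dimension.
        const float biased = input[i] + bias[i % bias.size()];
        const float dropped = mask[i] ? biased * scale : 0.0f;
        output[i] = residual.empty() ? dropped : dropped + residual[i];
    }
    return output;
}
```

Doing all arithmetic in float here mirrors the "operation in fp32" entry only loosely; half-precision inputs would need an explicit conversion step that this sketch omits.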
* Update function body initialization
* minor fix
* changes per review comments
* minor fix
* format fix
* add function initialization in mixed precision transformer
* more updates
* more fixes
Remove redundant checks in the DML EP which should instead rely on DML's validation. At least one of these checks wrongly prevents legitimate execution (5D Concat is supported in DML, but the DML EP blocks it 🤦♀️). Note this is a small aspect of the larger work (not sufficient to make the models fully work) that I thought I'd flush now while I had the change ready anyway due to investigation.
Related work items: #23232293, #25707941
* Support another two format of mask_index input: 2D attention mask, or 1D mask index with end and start positions.
* Update dynamic axes of gpt2 with past state
* Update script to fuse model with attention mask
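
The two mask_index formats above carry the same information when each row of the 2D mask is one contiguous run of ones. The sketch below converts a 2D attention mask into a 1D index of end and start positions. It assumes the 1D layout is all exclusive end positions followed by all inclusive start positions; that layout and the name `MaskToEndStartIndex` are assumptions for illustration and should be checked against the operator documentation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Convert a row-major (batch x seq) 0/1 attention mask into a 1D index of
// size 2 * batch: [end_0 .. end_{batch-1}, start_0 .. start_{batch-1}].
// Assumes each row holds at most one contiguous run of non-zero entries.
std::vector<std::int32_t> MaskToEndStartIndex(const std::vector<std::int32_t>& mask,
                                              std::size_t batch, std::size_t seq) {
    assert(mask.size() == batch * seq);
    std::vector<std::int32_t> index(2 * batch, 0);
    for (std::size_t b = 0; b < batch; ++b) {
        const std::int32_t* row = mask.data() + b * seq;
        std::size_t start = 0;
        while (start < seq && row[start] == 0) ++start;  // skip left padding
        std::size_t end = start;
        while (end < seq && row[end] != 0) ++end;        // walk the run of ones
        index[b] = static_cast<std::int32_t>(end);           // exclusive end
        index[batch + b] = static_cast<std::int32_t>(start); // inclusive start
    }
    return index;
}
```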
* add support to internally transpose nchw input to nhwc and only transpose back if it is necessary
* more changes in nchw<->nhwc, fixed small issue in concat
* Add option for NNAPI to run on all devices/CPU only/non-CPU only
* minor code style changes
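
The NCHW to NHWC handling above comes down to an index permutation. A minimal sketch on a flat float buffer follows; `NchwToNhwc` and `NhwcToNchw` are hypothetical helper names, not functions from the NNAPI EP.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Permute a flat NCHW buffer into NHWC order.
std::vector<float> NchwToNhwc(const std::vector<float>& src, std::size_t n,
                              std::size_t c, std::size_t h, std::size_t w) {
    assert(src.size() == n * c * h * w);
    std::vector<float> dst(src.size());
    for (std::size_t in = 0; in < n; ++in)
        for (std::size_t ic = 0; ic < c; ++ic)
            for (std::size_t ih = 0; ih < h; ++ih)
                for (std::size_t iw = 0; iw < w; ++iw)
                    dst[((in * h + ih) * w + iw) * c + ic] =
                        src[((in * c + ic) * h + ih) * w + iw];
    return dst;
}

// Inverse permutation: NHWC back to NCHW.
std::vector<float> NhwcToNchw(const std::vector<float>& src, std::size_t n,
                              std::size_t c, std::size_t h, std::size_t w) {
    assert(src.size() == n * c * h * w);
    std::vector<float> dst(src.size());
    for (std::size_t in = 0; in < n; ++in)
        for (std::size_t ih = 0; ih < h; ++ih)
            for (std::size_t iw = 0; iw < w; ++iw)
                for (std::size_t ic = 0; ic < c; ++ic)
                    dst[((in * c + ic) * h + ih) * w + iw] =
                        src[((in * h + ih) * w + iw) * c + ic];
    return dst;
}
```

Each transpose is a full copy of the tensor, which is why keeping intermediate results in NHWC and converting back only where a consumer needs NCHW saves work.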
* Move allocators to SessionState so they're decoupled from ExecutionProviders
- when looking up an allocator it's based on OrtMemoryInfo not the EP so SessionState is a more natural place for that information to be stored
- add device based lookup
- simplifies logic for copying feeds/fetches across devices
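
The idea of keying allocators by memory info rather than by execution provider, with an extra device-based lookup, can be sketched as a small registry. Everything here (`MemoryInfo`, `Allocator`, `AllocatorRegistry` and their fields) is a hypothetical stand-in for illustration, not the actual SessionState or OrtMemoryInfo definitions.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <tuple>
#include <utility>

// Hypothetical stand-in for the memory description used as lookup key.
struct MemoryInfo {
    std::string name;
    int device_type;
    int device_id;
    int mem_type;
};

inline bool operator<(const MemoryInfo& a, const MemoryInfo& b) {
    return std::tie(a.name, a.device_type, a.device_id, a.mem_type) <
           std::tie(b.name, b.device_type, b.device_id, b.mem_type);
}

struct Allocator {
    MemoryInfo info;
};

class AllocatorRegistry {
public:
    // Returns false when an allocator with the same MemoryInfo already exists.
    bool Register(std::shared_ptr<Allocator> alloc) {
        const MemoryInfo key = alloc->info;  // copy the key before moving alloc
        return allocators_.emplace(key, std::move(alloc)).second;
    }

    // Exact lookup by memory info; nullptr when not found.
    std::shared_ptr<Allocator> Get(const MemoryInfo& info) const {
        auto it = allocators_.find(info);
        return it == allocators_.end() ? nullptr : it->second;
    }

    // Device-based lookup: first allocator registered for the given device.
    std::shared_ptr<Allocator> GetByDevice(int device_type, int device_id) const {
        for (const auto& entry : allocators_) {
            if (entry.first.device_type == device_type && entry.first.device_id == device_id)
                return entry.second;
        }
        return nullptr;
    }

private:
    std::map<MemoryInfo, std::shared_ptr<Allocator>> allocators_;
};
```

With a single registry, code that copies feeds or fetches between devices only needs the source and destination device to find an allocator, without asking which provider owns it.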
Cleanup SessionState and SessionStateInitializer
- provide more things to SessionState at construction time so we don't construct an instance and immediately after call a bunch of setters
- simplify SessionStateInitializer
- reduced down to FinalizeSessionState method