The definitions for some Eigen classes don't get pulled in, leading to errors. Split out the broadcast function creation logic from the functions using std::enable_if to work around that.
* Symbolic shape inference: fix a case where concat requires merging multiple dims
* Fix a bug triggered in newer versions of sympy
Fix a bug in output data type guessing
* add custom logger and global threadpools to C and C++ API
* code cleanup and formatting
* reformat code
* tidy up some more code formatting
* remove comment
* fix API break from merging from master
* renamed API function to CreateEnvWithCustomLoggerAndGlobalThreadPools
* rename log variable and apply clang-format
* bug fix transformer
* fuse cpu kernel for transposescalematmul and matmul
* fuse transpose_scale_matmul cpu kernel with matmul
* fix test
* Add FusedMatMul Contrib Op
* fix test
* fix typo
* plus more updates per review
* Merged PR 5195856: Fix broken cases of zero size tensors in Cast/Reduce
MaskRCNN failed when `Cast` tried to execute `Xor` with an empty tensor (a zero in its dimensions). This is perfectly legal and should be treated as a nop.
Ultimately DML itself should treat this case as a nop, just like how C's `memcpy` treats 0 count as a nop, but I'm just addressing it in ORT now, as enabling it in DML would impact more operators to be consistent (probably should incrementally add a flag to tensor validation so operators can be opted in gradually).
Corresponding WindowsAI PR: https://microsoft.visualstudio.com/WindowsAI/_git/WindowsAI/pullrequest/5195850
Related work items: #27469839, #28761382
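The zero-size handling described above can be sketched as follows. This is a minimal illustration, not the actual ORT or DML code: `ElementCount` and `ShouldSkipExecution` are hypothetical helper names, and the shape is assumed to be a plain vector of non-negative dimensions.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of elements in a tensor with the given shape.
// A scalar (empty shape) has one element; any zero dimension yields zero.
inline int64_t ElementCount(const std::vector<int64_t>& shape) {
  int64_t count = 1;
  for (int64_t dim : shape) {
    assert(dim >= 0);  // assumption: dimensions were validated earlier
    count *= dim;
  }
  return count;
}

// Hypothetical guard: an operator given an empty tensor has nothing to
// compute, so the caller can report success without dispatching any work.
inline bool ShouldSkipExecution(const std::vector<int64_t>& shape) {
  return ElementCount(shape) == 0;
}
```

A kernel would call the guard first and return early when it is true, much like `memcpy` with a count of 0.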
* Merged PR 5201369: Remove copy of initializers added in DMLXP refactor
When used in ORT, a common method shouldn't copy and return initializer data
Related work items: #29514403
Co-authored-by: Justin Stoecker <justoeck@microsoft.com>
Co-authored-by: Jeff Bloomfield <jeffbloo@microsoft.com>
* [java] Fixing the buffer semantics.
* Renaming bufferCapacity to bufferRemaining.
* Adding a cast to char* so the pointer arithmetic works on Windows.
* Rework broadcasting setup to decrease binary size. Push all the type-specific code down and separate out the broadcasting/parallelization.
Reductions:
element_wise_ops: 521.0 KB -> 268.8 KB
where: 25.8 KB -> 17.3 KB
qlinear_binary_op: 28.1 KB -> 12.8 KB
* Place shape related nodes in CPU
* visit candidates by topological order
* Make CPU node placement a utility function
* skip placing on CPU if the data type is float16 or bfloat16
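The data type check in the last bullet might look roughly like the sketch below. The `ElemType` enum and the `CanPlaceOnCpu` name are assumptions made for illustration; the real change would work with ONNX tensor element types and the actual placement utility.

```cpp
// Hypothetical element-type tag standing in for ONNX tensor element types.
enum class ElemType { kFloat, kFloat16, kBFloat16, kInt32, kInt64 };

// Hypothetical predicate: a shape-related node is a candidate for CPU
// placement unless its data type is float16 or bfloat16.
constexpr bool CanPlaceOnCpu(ElemType type) {
  return type != ElemType::kFloat16 && type != ElemType::kBFloat16;
}
```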
* Allow sharing of initializers between sessions.
* Allow sharing of initializers between sessions (2).
* Add test for C#
* Add test for C#; address PR comments
* Address PR comments
Moved AddInitializer logic to internal session options
Added tests for owned buffer
Clarified documentation
Fix bug where memory info and not device was getting compared
* Fix test
* Fix training build
* Add ver 5 end marker and ver 6 starter, add scenario and usage examples.
* bias softmax kernel
* bias softmax kernel
* remove debug comments
* remove debug comment
* windows build doesn't handle unary minus on unsigned type
* int64 => int treated as error
* only support cuda
* add bias softmax fusion tests
* PR comments
* more PR comments
* use MLTypeCallDispatcher
* break function into pieces
* add loop unroll and add to list for inference as well
* use std::min and move operator==
* revert std::min (doesn't work in ci pipeline) and fix int to size_t error
* pr comments
* fixes for windows ci
* fix for windows ci
* pr comments on consistency
* p_model_
* fix formatting and add anonymous namespace
Co-authored-by: suffian khan <sukha@OrtTrainingDev1.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
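For readers unfamiliar with the operation, the computation that a bias softmax kernel fuses can be sketched on the CPU as below. This is only a reference illustration under stated assumptions: the commits above add a CUDA kernel, `BiasSoftmaxRow` is a hypothetical name, and the sketch handles a single row whose bias has the same length as the input.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical reference: softmax(x + bias) over one row. The bias is added
// on the fly, so the biased input is never stored as a separate tensor.
inline std::vector<float> BiasSoftmaxRow(const std::vector<float>& x,
                                         const std::vector<float>& bias) {
  assert(!x.empty() && x.size() == bias.size());
  std::vector<float> out(x.size());

  // Find the row maximum so the exponentials below cannot overflow.
  float max_val = x[0] + bias[0];
  for (std::size_t i = 1; i < x.size(); ++i) {
    max_val = std::max(max_val, x[i] + bias[i]);
  }

  // Exponentiate the shifted values and accumulate their sum.
  float sum = 0.0f;
  for (std::size_t i = 0; i < x.size(); ++i) {
    out[i] = std::exp(x[i] + bias[i] - max_val);
    sum += out[i];
  }

  // Normalize so the row sums to one.
  for (float& v : out) {
    v /= sum;
  }
  return out;
}
```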
* remove shape inference and fix save large model problem
* remove unnecessary import
* refine code and add external format for quantize_qat
* remove initializers in tensors_to_calibrate
* small refine
Co-authored-by: t-yguo <t-yguo@microsoft.com>