Pytorch cpuinfo library allows us to query current cpu features, micro-architecture and cache size, etc. These information is needed for targeted performance optimizations.
Unfortunately it does not work under Windows/ARM. We need to develop our own later
This PR purely extracts each kernel to a standalone file. No functionality change. It includes specifically:
leave the MlasGemm function and thread handling in the qgemm.cc
put dispatcher functions and the template functions (interfaces) that are required to implement a kernel into qgemm.h
put each kernel implementation in a separate file, which implements/specialize template functions: MlasGemmU8X8FixupZeroPointB, MlasGemmU8X8CopyPackA, MlasGemmU8X8CopyPackB, MlasGemmU8X8Kernel
determine the files to be compiled in cmake file
1. Add support for sequence ops: ConcatFromSequence, SequenceAt, SequenceInsert. There are other sequence ops supported by onnx that worked well after adding these ops, so no need to add all of them in symbolic_shape_infer
2. For If node, the two branches output might have different shapes. In that case, for sequence output, use None in dimension; For tensor output, create a new symbolic dimension.
3. Fix a bug in Tile, where input for repeats might be of unknown value
4. Topological sort of nodes in graph need to consider implicit input in subgraphs for If/Loop/Scan ops
5. Generate unique prefix for new dimensions inside subgraph
* Add metadata_props to ORT model
* Minor update
* Update python binding, and increase the minimal pipeline size threshold
* Fixed a small bug in serializing ir_version
* Remove temp ort.py.fbs and add it to .gitignore
* handle unsqueeze change in opset13
* fix the node arguments index check for square case (x * x)
* Revert "fix the node arguments index check for square case (x * x)"
This reverts commit c66344f0a82c35d8c24d31f2264cf7e9b235ce22.
* handle the square case (x * x) for node argument search
Co-authored-by: Cheng Tang <chenta@microsoft.com>
* Add new transformer that can split node selection from node modification to allow just the modifications to be applied at runtime in a minimal build. This is the first step of a few to enable a QDQ model to be optimized for the NNAPI EP and/or the CPU EP at runtime in a mobile scenario.
Add generic and QDQ specific helpers for selection and modification.
Replace existing QDQ optimizers with optimizer based on new approach.
* first attempt share docker image across python and torch versons
* set dependency between jobs
* fix yaml grammer
* remove python version from first stage
* clean deepspeed directroy
* split into two images according torch version
* fix yaml syntax
* invalidate cache
* remove DS to prevent torch 1.9.0 upgrade
* Add helper to check if node provides a graph output. The current approach unnecessarily creates a vector when most of the optimizers only care about a true/false response.
* Undo accidental change
* Fix a couple of issues due to copying from larger set of changes.