onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-03 03:58:54 +00:00

Author	SHA1	Message	Date
fxmarty	4d2dc8bbbd	Replace all numpy.bool with Python builtin bool (#14014 ) `numpy.bool` was removed in NumPy 1.24.0; it was previously an alias for Python's builtin `bool`. Fixes https://github.com/huggingface/optimum/issues/610 ### Motivation and Context NumPy 1.24.0 breaks, for example, the IO binding helpers.	2022-12-23 09:27:23 +10:00
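Here is a minimal sketch of this migration. The helper name `make_mask` is hypothetical. Wherever `np.bool` was passed as a dtype, pass the builtin `bool` instead; NumPy maps it to its boolean dtype `np.bool_`.

```python
import numpy as np


def make_mask(values):
    # Hypothetical helper. Before NumPy 1.24 this might have been written as
    # np.array(values, dtype=np.bool); np.bool was only an alias for the builtin.
    return np.array(values, dtype=bool)


mask = make_mask([1, 0, 2])
print(mask.dtype)  # bool
print(mask)        # [ True False  True]
```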
Baiju Meswani	1b58331fb3	[QAT] Graph transformer to fuse QDQ pattern into FakeQuant (#13777 ) To perform QAT in onnxruntime, the `FakeQuant` op was introduced in #13649. The onnxruntime quantization tool generates a post-training static quantization ONNX model with `QuantizeLinear`->`DequantizeLinear` nodes. To perform QAT, this pattern needs to be transformed into `FakeQuant`. This pull request introduces a graph transformer that looks for the `Q->DQ` pattern and fuses it into a `FakeQuant` node.	2022-12-22 09:44:39 -08:00
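A `Q->DQ` pair can collapse into one fake-quantization node because the pair only rounds and saturates values, then maps them straight back to float. The NumPy sketch below checks numerically that the two forms agree. It assumes a per-tensor scale and zero point and the uint8 range [0, 255]. The function names are illustrative and are not the onnxruntime kernel API.

```python
import numpy as np


def quantize_linear(x, scale, zero_point, qmin=0, qmax=255):
    # ONNX QuantizeLinear semantics: round half to even (np.round), add the zero point, saturate.
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int32)


def dequantize_linear(q, scale, zero_point):
    # ONNX DequantizeLinear semantics: (q - zero_point) * scale.
    return (q - zero_point).astype(np.float32) * scale


def fake_quant(x, scale, zero_point, qmin=0, qmax=255):
    # Single fused op: quantize and immediately dequantize without leaving float.
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return ((q - zero_point) * scale).astype(np.float32)


x = np.array([-1.0, -0.02, 0.0, 0.37, 2.5], dtype=np.float32)
scale, zero_point = np.float32(0.01), 128
qdq = dequantize_linear(quantize_linear(x, scale, zero_point), scale, zero_point)
fq = fake_quant(x, scale, zero_point)
print(np.allclose(qdq, fq))  # True
```

The last input (2.5) saturates at 255. This shows why the fused op must keep the clipping step, not only the rounding.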
Tianlei Wu	944bff0ad6	Support two-stage ONNX GPT-2 conversion (#14025 ) ### Description Add support for ONNX conversion of GPT-2 in two stages: * Stage 1 is the initial stage, which has an empty past state. * Stage 2 has a non-empty past state and sequence_length is 1. Add a parameter --stage to specify the stage. For stage 1, we enable mask_index for Attention so that fused attention can be used in CUDA. Other changes: (1) use int32 inputs by default (otherwise, there is an error in inference) (2) update gpt2_parity to include SkipLayerNormalization (see https://github.com/microsoft/onnxruntime/pull/13988) and EmbedLayerNormalization (3) get all environment variables that might impact GPT-2 latency in benchmark_gpt2 ### Motivation and Context To test fused attention for the GPT-2 model for https://github.com/microsoft/onnxruntime/pull/13953.	2022-12-22 09:33:01 -08:00
PeixuanZuo	694ba033e9	[ROCm] update skip_layernorm test sample (#14051 ) ### Description A larger batch_size doesn't cover more implementations and may block CI, so remove batch_size 128. Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 21:18:10 +08:00
pengwa	2f5bf75e51	Optimize computation orders (#13672 ) ### Optimize computation orders In `Roberta/Electra`, when `ClassificationHead` is used, the features are sliced along the sequence_length dimension, and the loss calculation depends only on this sliced data. This is a slice at axis 1: before slicing the shape is [batch, sequence_length, hidden]; after slicing it becomes [batch, hidden]. This gives an opportunity to move the slicing as early as possible. The slice can pass through simple elementwise ops (like Add/Div), LayerNorm/Softmax (if their reduce axis comes after the slicing axis), or even the left operand of MatMul (as long as the slice does not affect the last dims). Operators like Reshape/Transpose are special: they either have data specified (which must be updated after slicing) or a perm specified (which requires the input rank to remain unchanged). For those operators, we keep the original rank but leave the sliced dim as 1, and do a Squeeze after the computation completes.
```
import torch
from torch import nn


class RobertaClassificationHead(nn.Module):
    """Head for sentence-level classification tasks."""

    def __init__(self, config):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        classifier_dropout = (
            config.classifier_dropout if config.classifier_dropout is not None else config.hidden_dropout_prob
        )
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.num_labels)

    def forward(self, features, **kwargs):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x
```
src\transformers\models\roberta\modeling_roberta.py src\transformers\models\electra\modeling_electra.py #### Benchmark A simple benchmark shows RoBERTa training latency dropped from 208 ms to 199 ms, a 4.5+% reduction. More comprehensive tests are on the way.	2022-12-22 15:12:52 +08:00
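The reordering in #13672 relies on a slice at axis 1 commuting with ops that treat each sequence position independently. The NumPy check below uses random weights and an Add -> LayerNorm -> MatMul tail as a stand-in for the ops that produce `features`. It shows that slicing first gives the same result while computing one token per row instead of all sequence_length tokens. This is an illustration only, not the onnxruntime graph transformer. Dropout is left out because its random mask would make an exact comparison meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 2, 5, 4
h = rng.standard_normal((batch, seq_len, hidden))
residual = rng.standard_normal((batch, seq_len, hidden))
w = rng.standard_normal((hidden, hidden))


def layer_norm(x, eps=1e-5):
    # Reduces over the last (hidden) axis, which comes after the sliced axis 1.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def tail(x, r):
    # Add -> LayerNorm -> MatMul, with x as the MatMul's left operand.
    return layer_norm(x + r) @ w


late = tail(h, residual)[:, 0, :]            # original order: every token, then slice
early = tail(h[:, 0, :], residual[:, 0, :])  # reordered: slice first, one token per row
print(early.shape)               # (2, 4)
print(np.allclose(late, early))  # True
```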
Hariharan Seshadri	7ed8bd4f95	Support (Bias)SkipLayerNormalization fusion in GPT2 (#13988 )	2022-12-21 23:04:44 -08:00
Joseph Groenenboom	baba312e30	Add provider selection for gpt2/convert_to_onnx.py (#13982 ) Allows the user to select from supported backends for gpt2/convert_to_onnx.py. Default behavior is preserved if no provider is selected. This allows the ROCm EP to be selected.	2022-12-22 11:41:09 +08:00
PeixuanZuo	a170e40fbb	[ROCm] Update Dockerfiles of ROCm and MIGraphX to ROCm5.4 (#14013 ) Update the ROCm and MIGraphX Dockerfiles to ROCm 5.4. Update README.md. Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 10:03:34 +08:00
PeixuanZuo	b5fd2a6a80	[ROCm] Add ROCm5.4 to python package pipeline (#14012 ) Add ROCm 5.4 to the Python package pipeline. The download link for the ROCm 5.4 nightly build wheels is https://download.onnxruntime.ai/onnxruntime_nightly_rocm54.html The download link for the ROCm 5.4 nightly build wheels with profiling is https://download.onnxruntime.ai/onnxruntime_nightly_rocm54.profiling.html Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 10:01:40 +08:00
PeixuanZuo	ab2dd8dfaf	[ROCm] Update ROCm and MIGraphX CI to ROCm5.4 (#14011 ) Update ROCm and MIGraphX CI to ROCm 5.4. Ran ortmodule_test with ROCm 5.4 and all tests passed (https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=824742&view=logs&j=8292f886-7946-5da9-7977-04484c342eda&t=5de68eaa-cbdc-5be5-13d0-bb946f4ddb2d). Co-authored-by: peixuanzuo <peixuanzuo@linmif39a000004.zvflicr54joexhdgnhvmxrxygg.phxx.internal.cloudapp.net>	2022-12-22 10:01:05 +08:00
Edward Chen	df8ff34f25	Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset 12+ is not supported yet. (#13983 ) ### Description Update CUDA ArgMin/ArgMax op kernels to have end version 11 since opset 12+ is not supported yet. With the way these kernels are currently registered, the documentation shows support for opset 11+. This is not accurate. ### Motivation and Context Fix #13781	2022-12-21 19:01:00 -05:00
Numfor Tiapo	8943d623a4	DML EP Register operators for Opset 16 (#14034 ) This PR registers the following operators for opset 16 in the DML EP: - LeakyRelu-16 - PRelu-16 - Where-16 - GreaterOrEqual-16 - LessOrEqual-16. Identity-16 was not added in this PR due to pipeline failures. Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>	2022-12-21 09:05:12 -08:00
JiCheng	1a177a1713	Cover beta in all Conv paths. (#14008 )	2022-12-21 09:02:48 -08:00
pengwa	ccc4487553	fix CI onnxruntime_test_python_sparse_matmul.py (#14039 ) ### Description NumPy 1.24.0 removed `np.float`.
```
/opt/hostedtoolcache/Python/3.8.15/x64/bin/python onnxruntime_test_python_sparse_matmul.py
EE.
======================================================================
ERROR: testRunContribSparseMatMul (__main__.TestSparseToDenseMatmul)
Mutliple sparse COO tensor to dense
----------------------------------------------------------------------
Traceback (most recent call last):
  File "onnxruntime_test_python_sparse_matmul.py", line 407, in testRunContribSparseMatMul
    np.float,
  File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/numpy/__init__.py", line 284, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'float'
======================================================================
ERROR: testRunSparseOutputOnly (__main__.TestSparseToDenseMatmul)
Try running models using the new run_with_ort_values
----------------------------------------------------------------------
Traceback (most recent call last):
  File "onnxruntime_test_python_sparse_matmul.py", line 39, in testRunSparseOutputOnly
    values = np.array([1.764052391052246, 0.40015721321105957, 0.978738009929657], np.float)
  File "/opt/hostedtoolcache/Python/3.8.15/x64/lib/python3.8/site-packages/numpy/__init__.py", line 284, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'float'
```	2022-12-21 17:31:52 +08:00
JiCheng	7738be9b25	[prefast:Warning]: C26451 (#14036 )	2022-12-21 16:53:29 +08:00
Changming Sun	05137e6ec4	Use target name for flatbuffers (#13991 ) ### Description Use the target name for flatbuffers. Add a version range for flatbuffers. It is similar to #13870 ### Motivation and Context To fix a build error:
```
CMake Error at onnxruntime_graph.cmake:88 (add_dependencies):
  The dependency target "flatbuffers" of target "onnxruntime_graph" does not exist.
Call Stack (most recent call first):
  CMakeLists.txt:1490 (include)
```
It happens when the flatbuffers library is already installed; for example, on Ubuntu people may get it from apt-get. However, the version provided by Ubuntu 20.04 is not compatible with our code, while the one in Ubuntu 22.04 works fine.	2022-12-20 11:44:02 -08:00
RandySheriffH	cd305a90d6	Stop creating static thread pool to fix random hang in onnx_test_runner (#14023 )	2022-12-19 19:48:14 -08:00
Yulong Wang	533fe37cbd	fix build break in transformer debug dump (#14009 ) ### Description Fix build break in transformer debug dump introduced in #13954.	2022-12-19 16:49:21 -08:00
Changming Sun	fc2a6db573	Update absl to the latest release (#13990 ) ### Description Update absl to a new version. ### Motivation and Context The new version contains fixes that are needed for the NVIDIA GPU build. Once we update to that version, we no longer need to maintain our private patches for the NVIDIA GPU build.	2022-12-19 14:25:13 -08:00
Hariharan Seshadri	f1044e3b9a	CUDA GreedySearch ProcessLogits optimization (#13823 ) ### Description Explore re-using the logits buffer in `GreedySearch` when sequence length == 1 (after the first decoding run, the sequence length is guaranteed to be 1). This re-use means we do not have to copy the logits before processing them. Currently, we copy the logits even when sequence length == 1. That copy is unnecessary, because the logits buffer can be re-used directly for the token generation step. A similar optimization exists in `BeamSearch` but is missing in `GreedySearch`. Since the logits buffer may contain padded data, the code that consumes the logits buffer directly must account for any padding. A more invasive change (requiring changes in a few places) would be to change the interface of `ProcessLogits()` so that it takes a non-const reference to the logits instead of a const reference. As far as I understand, this is the only place where the logits from the decoder subgraph are ever used. Giving `ProcessLogits()` license to mutate/process the underlying buffer of the logits OrtValue therefore seems reasonable, instead of making a copy and then mutating/processing it. This would also remove the ugly `const_cast`(s) seen in this change.	2022-12-19 13:29:10 -08:00
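The NumPy sketch below illustrates the no-copy idea. It assumes the decoder output has shape [batch, sequence_length, padded_vocab_size], where only the first `vocab_size` entries of the last axis are real logits. Selecting the last step and the valid vocabulary range can then be done with views, so greedy selection needs no copy and ignores the padding. `greedy_next_tokens` is a hypothetical helper, not the onnxruntime `ProcessLogits()` signature.

```python
import numpy as np


def greedy_next_tokens(logits, vocab_size):
    # logits: [batch, seq_len, padded_vocab]; padding columns beyond vocab_size are ignored.
    last_step = logits[:, -1, :vocab_size]  # basic slicing returns a view, not a copy
    return last_step.argmax(axis=-1), last_step


batch, padded_vocab, vocab_size = 2, 8, 6
logits = np.zeros((batch, 1, padded_vocab), dtype=np.float32)  # seq_len == 1 after the first step
logits[0, 0, 3] = 5.0
logits[1, 0, 7] = 9.0  # padding column: must not be selected
logits[1, 0, 1] = 2.0
tokens, view = greedy_next_tokens(logits, vocab_size)
print(tokens)                          # [3 1]
print(np.shares_memory(view, logits))  # True
```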
Chen Fu	28e2b1790f	Moving MLAS threaded QGEMM packing buffer from stack to heap (#14002 ) ### Description MLAS QGEMM kernels need memory buffers for packing the source tensors. This change moves these buffers from the stack to the heap. ### Motivation and Context MLAS QGEMM kernels have kept their packing buffers on the stack since the beginning. Emerging hardware demands ever larger buffers, which could cause stack overflow problems down the road. This change moves these buffers from the stack to the heap. It also introduces a per-kernel thread initializer. For instance, the new AMX instruction set (support coming) requires initializing the tile registers per thread, and this change makes that requirement easy to satisfy. Co-authored-by: Chen Fu <fuchen@microsoft.com>	2022-12-19 09:39:19 -08:00
Zhang Lei	fba09faf5b	Implement reuse of past and present tensors in Attention ops. (#13791 ) Implement reuse of the kv_cache past and present tensors in Attention ops. Unit test for the above feature. Use the reusable kv_cache for past and present tensors in Greedy Search. Correctness test for it. Co-authored-by: Zhang Lei <phill.zhang@gmail.com>	2022-12-18 10:03:53 -08:00
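One common way to let past and present share a single buffer is to preallocate the key/value cache for the maximum sequence length. Each decoding step is then written in place instead of being concatenated into a new tensor. The sketch below assumes a [batch, num_heads, max_seq_len, head_size] layout. The `KVCache` class and its layout are illustrative assumptions, not the exact tensors used by the onnxruntime Attention op.

```python
import numpy as np


class KVCache:
    """Hypothetical preallocated cache: past and present live in the same buffer."""

    def __init__(self, batch, num_heads, max_seq_len, head_size):
        self.key = np.zeros((batch, num_heads, max_seq_len, head_size), dtype=np.float32)
        self.value = np.zeros_like(self.key)
        self.length = 0  # number of valid positions, i.e. the "past" length

    def append(self, k, v):
        # k, v: [batch, num_heads, new_len, head_size]; written in place, no concatenation.
        end = self.length + k.shape[2]
        if end > self.key.shape[2]:
            raise ValueError("cache is full")
        self.key[:, :, self.length:end] = k
        self.value[:, :, self.length:end] = v
        self.length = end
        # "present" is a view over the valid prefix of the shared buffer.
        return self.key[:, :, :end], self.value[:, :, :end]


cache = KVCache(batch=1, num_heads=2, max_seq_len=8, head_size=4)
cache.append(np.ones((1, 2, 3, 4)), np.ones((1, 2, 3, 4)))  # prompt of 3 tokens
present_k, present_v = cache.append(np.full((1, 2, 1, 4), 2.0), np.full((1, 2, 1, 4), 2.0))  # one step
print(present_k.shape)  # (1, 2, 4, 4)
```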
cloudhan	2df046fc67	Fix deprecated-builtins (#14001 ) Fix error: builtin __has_trivial_destructor is deprecated; use __is_trivially_destructible instead [-Werror,-Wdeprecated-builtins] This is not a clean fix as in 13783; users will need to manually set `CMAKE_HIP_FLAGS="-Wno-deprecated-builtins"` if they want to use a self-built hipclang combined with ROCm 5.3.* or older.	2022-12-17 18:17:05 +08:00
Tianlei Wu	6fb54fc607	Add ms domain during saving onnx model in onnx_model.py (#13978 ) Add domain "com.microsoft" during saving model if needed.	2022-12-16 22:45:57 -08:00
Yulong Wang	cc0a6213e4	[js] update versions of a few build dependencies (#13977 ) ### Description Update versions of a few build dependencies for the onnxruntime NPM packages. Update Node.js to v16.x in Linux CI, since v12 is too outdated; see [nodejs release schedule](https://github.com/nodejs/release#release-schedule) ### Motivation and Context - Upgrading to the latest webpack allows using the latest Node.js LTS version. The previous version of webpack does not work on Node.js v18; this is fixed in the latest version. - Upgrading to the latest typescript, ts-loader and other dev deps speeds up the build and bundling. - Upgrading also helps resolve security warnings about potential vulnerabilities in outdated versions.	2022-12-16 17:26:54 -08:00
Chi Lo	ba89cae3bd	Update package pipelines to support TRT 8.5 (#13998 ) Update following package pipelines to support TRT 8.5 after https://github.com/microsoft/onnxruntime/pull/13867: - [Linux Multi GPU TensorRT CI Pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1016&_a=summary) - [Python packaging pipeline](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=841&_a=summary) - [build-perf-test-binaries](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1130&_a=summary) - [Linux-GPU-EP-Perf](https://aiinfra.visualstudio.com/Lotus/_build?definitionId=841&_a=summary)	2022-12-16 15:01:50 -08:00
Tianlei Wu	848f80f7a9	Skip some attention op tests on A100 (#13980 ) Skip some attention_op tests on A100 because TF32 is enabled in GEMM, which causes some unit tests to fail on A100.	2022-12-16 10:23:41 -08:00
FFFrog	6705915af8	[CANN] Add the ability to run graph (#13728 ) ### Description Add the ability to run a graph. ### Motivation and Context A brief description is as follows: 1) If the whole graph is supported, it is processed directly by the graph engine. 2) If the whole graph is not supported, it is divided into subgraphs and single operators; the subgraphs run on the graph engine, and the single operators fall back to the traditional mode.	2022-12-16 06:57:40 -08:00
Yi Zhang	aa9fbed3d4	Add compilation cache for Linux GPU (#13995 )	2022-12-16 16:38:12 +08:00
Scott McKay	be9ae28d9f	Add ability to set RunOptions config entries to C# API. (#13939 ) ### Description Add ability to set RunOptions config entries. Largely a cut-and-paste of the existing code for setting SessionOptions config entries. ### Motivation and Context #13936	2022-12-16 10:28:01 +10:00
Yi Zhang	7d20d889d1	Use cache for compilation in container (#13960 ) ### Description For compilation in a container, the ADO Cache task doesn't work directly. The workaround is to mount the cache directory into the container and let CCache in the container read/write the cache data. In short, we only use the ADO API to download/upload cache data. Post-jobs run in stack mode (last defined, first executed), so the PostBuildCleanUp tasks must be defined first; that way, PostBuildCleanUp is executed last. Otherwise, the Cache task fails to upload the cache because the agent directory has already been cleaned.	2022-12-16 07:19:07 +08:00
RandySheriffH	a061fedb5d	Exclude affinity-setting logic from minimal build (#13967 ) Comment out the affinity-setting logic which introduced an unnecessary binary size increase for the minimal build. Co-authored-by: Randy Shuai <rashuai@microsoft.com>	2022-12-15 14:43:42 -08:00
Yulong Wang	0ee5a5f229	[debug] allow dumping node placement in transformer models (#13954 ) ### Description Allow dumping node placement in transformer models.	2022-12-15 14:42:58 -08:00
stevenlix	c4ecbb96d9	Fix issues in TRT model ID generator (#13837 ) There are some issues in https://github.com/microsoft/onnxruntime/pull/13015: 1. The model name, rather than the graph name, should be used in the model ID generator. 2. Hash collisions are observed in the ID cache, meaning different models may have the same key and thus load the same hash ID from the cache. 3. For the class and function that generate the model ID, having MetaDef in the name is not appropriate. 4. The TRT EP should reuse murmurhash3 rather than copy it over. This PR fixes those issues.	2022-12-15 13:51:19 -08:00
Sunny Shukla	b52e8bf718	[oneDNN ep] QAttention BF16 and GPU support added (#13793 ) ### Description QAttention performance improvement when the hardware supports AMX and AVX-BF16 execution. ### Motivation and Context - Streamlined the code to dynamically switch between BF16 and FP32 execution depending on hardware support. - Split the QKV memory into three separate memories for Q, K, and V. This makes it possible to run QAttention on GPU and take advantage of parallel processing. - This change has shown a significant performance gain for the QAttention operator on hardware such as Sapphire Rapids, which supports AMX and AVX-BF16.	2022-12-15 12:25:43 -08:00
Abhishek Udupa	c882601425	Add noexcept annotation to address prefast warnings (#13965 ) ### Description Add noexcept annotations to move constructors and assignment ops to address prefast warnings. (see https://dev.azure.com/aiinfra/ONNX%20Runtime/_workitems/edit/11012/) Co-authored-by: Abhishek Udupa <abhishek.udupa@microsoft.com>	2022-12-15 09:44:22 -08:00
Tianlei Wu	a3cd36dbfb	change default cudnn_conv_use_max_workspace =1 (#13981 ) ### Description Change the default value of cudnn_conv_use_max_workspace to 1, consistent with ORT Training. Test results with Stable Diffusion 1.4:

Latency (seconds per query) | T4 | V100 | A100
-- | -- | -- | --
ORT FP32 (Before) | 28.4 | 10.1 | 7.2
ORT FP32 (After) | 26.2 | 8.3 | 4.9
Gain | 8% | 18% | 32%

Latency (seconds per query) | T4 | V100 | A100
-- | -- | -- | --
ORT FP16 (Before) | 13.1 | 6.4 | 4.3
ORT FP16 (After) | 9.6 | 3.8 | 2.4
Gain | 27% | 41% | 44%

There is a significant gain after changing the default value. Typical users may not know about this option, so it is better to change the default value so that users get the best performance out of the box.	2022-12-15 09:09:07 -08:00
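The Gain rows are the relative latency reduction, (before - after) / before. The quick Python check below recomputes them from the reported latencies. It also shows, for releases that predate this change, how the option can be set explicitly through CUDA execution provider options in the Python API. The `providers` list is plain data here; passing it to `onnxruntime.InferenceSession` is left to the reader.

```python
# Latencies in seconds per query (before, after), copied from the Stable Diffusion 1.4 results.
latency = {
    "FP32": {"T4": (28.4, 26.2), "V100": (10.1, 8.3), "A100": (7.2, 4.9)},
    "FP16": {"T4": (13.1, 9.6), "V100": (6.4, 3.8), "A100": (4.3, 2.4)},
}


def gain(before, after):
    # Relative latency reduction, rounded to a whole percent.
    return round(100 * (before - after) / before)


gains = {prec: {gpu: gain(*pair) for gpu, pair in row.items()} for prec, row in latency.items()}
print(gains)
# {'FP32': {'T4': 8, 'V100': 18, 'A100': 32}, 'FP16': {'T4': 27, 'V100': 41, 'A100': 44}}

# Explicit setting, e.g. onnxruntime.InferenceSession(model_path, providers=providers)
providers = [("CUDAExecutionProvider", {"cudnn_conv_use_max_workspace": "1"})]
```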
Tang, Cheng	a81faee41e	Multi-stream execution support (#13495 ) Description: This PR includes the following work: 1. Provide stream and related synchronization abstractions in onnxruntime. 2. Enhance onnxruntime's execution planner / executor / memory arena to support executing multiple streams in parallel. 3. Deprecate the parallel executor for CPU. 4. Deprecate the Fence mechanism. 5. Update the CUDA / TensorRT EPs to support the stream mechanism, including running different requests on different CUDA streams. Motivation and Context - Why is this change required? Currently, the execution plan is just a linear list of primitives, which ORT executes step by step. For any given graph, ORT serializes it into a fixed execution order. This sequential design simplifies most scenarios, but it has the following limitations: 1. It is difficult to enable inter-node parallelization; we have a half-baked parallel executor, but it is very difficult to make it work with GPUs. 2. The fence mechanism works for the single GPU stream + CPU thread case, but once extended to multiple streams, cross-stream GPU synchronization is difficult to manage. 3. Our CUDA EP relies on the BFCArena to make memory management work with asynchronous GPU kernels, but the current BFCArena is not aware of streams, so it doesn't behave correctly when running with multiple streams. This PR enhances our existing execution plan and executor to support multi-stream execution, using a unified algorithm to manage both single-stream and multi-stream scenarios. It mainly focuses on infrastructure support for multi-stream execution: given a valid stream assignment, onnxruntime can execute it correctly. How to generate a good stream assignment for a given model will be addressed in a future PR. 
Co-authored-by: Cheng Tang <chenta@microsoft.com@orttrainingdev9.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net> Co-authored-by: Cheng Tang <chenta@microsoft.com> Co-authored-by: RandySheriffH <48490400+RandySheriffH@users.noreply.github.com> Co-authored-by: Randy Shuai <rashuai@microsoft.com> Co-authored-by: cao lei <jslhcl@gmail.com> Co-authored-by: Lei Cao <leca@microsoft.com>	2022-12-15 07:39:29 -08:00
JiCheng	f4cd35f9b1	[xnnpack-ep] NEW EP API in objc (#13941 ) Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2022-12-15 20:12:02 +08:00
Changming Sun	a9b1fb032b	FIX: macOS CI pipeline doesn't run tests (#13970 ) ### Description Fix a problem where the macOS CI pipeline doesn't run tests. It was caused by a code refactoring I made recently. ### Motivation and Context Add the tests back.	2022-12-14 18:39:31 -08:00
Baiju Meswani	1fd63487fd	ORTModule support for kwargs input that is a dict (#13910 )	2022-12-14 16:23:48 -08:00
Jakub Bachurski	3b17ab7c65	Add float64 kernels for Floor, Ceil, IsNaN (#13906 ) ### Description This PR adds support for `float64` kernels in the latest versions of the operators Floor, Ceil and IsNaN. ### Motivation and Context The lack of these kernels is non-trivial to work around, and attempts to do so easily lead to performance losses. When equivalence with an existing implementation is required, precision is easily lost by casting to `float32` instead. IsNaN is common when cleaning up data in an ML pipeline. Floor and Ceil are used for discretising values, and single-precision floats are insufficient to round correctly once values get larger than a few million. According to my measurement, this only increases the binary size by a few kilobytes (on the RelWithDebInfo Python wheel). Closes #13673 (Round already has float64 support) Partially solves #8791 (It looks like there are parallel issues/PRs open for Split, but it is also hard to work around and hence useful) Signed-off-by: jbachurski <kbachurski@gmail.com>	2022-12-14 14:57:14 -08:00
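Why float32 is not enough for Floor/Ceil on large values: float32 has a 24-bit significand. Between 2^23 (about 8.4 million) and 2^24 the spacing between representable values is already 1.0, so any fractional part is rounded away. Above 2^24 (about 16.8 million), not even every integer survives the cast. The NumPy illustration below shows this; it is not an onnxruntime test.

```python
import numpy as np

x = np.float64(16_777_217.0)    # 2**24 + 1, exactly representable in float64
print(np.floor(x))              # 16777217.0
print(np.floor(np.float32(x)))  # 16777216.0: the cast to float32 already rounded it

y = np.float64(8_388_608.5)     # 2**23 + 0.5
print(np.ceil(y))               # 8388609.0
print(np.ceil(np.float32(y)))   # 8388608.0: float32 spacing is 1.0 here, so the .5 is lost
```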
Baiju Meswani	5a55fac402	Miscellaneous updates to training apis (#13929 )	2022-12-14 13:33:07 -08:00
Jian Chen	e5f6689ae7	Allow Tensor to be scalar if it is not per channel. (#13959 ) ### Description Allow Tensor to be scalar if it is not per channel. ### Motivation and Context https://github.com/microsoft/onnxruntime/issues/13915	2022-12-14 13:23:56 -08:00
Chi Lo	5b492cbae3	[TensorRT EP] support TensorRT 8.5 (#13867 ) Integrate TensorRT 8.5 - Update TensorRT EP to support TensorRT 8.5 - Update relevant CI pipelines - Disable known non-supported ops for TensorRT - Make timeout configurable. We observed more than [20 hours](https://aiinfra.visualstudio.com/Lotus/_build/results?buildId=256729&view=logs&j=71ce39d8-054f-502a-dcd0-e89fa9931f40) of unit-test runtime with TensorRT 8.5 in the package pipelines. We can't use placeholders to significantly reduce testing time in the package pipelines, because the C API application test deadlocks. So we only run the subsets of model tests and unit tests that are related to TRT (a new build flag --test_all_timeout is added and set to 72000 seconds by the package pipelines). Note that we still run all the tests in the TensorRT CI pipelines to keep full test coverage. - Include https://github.com/microsoft/onnxruntime/pull/13918 to fix an onnx-tensorrt compile error. Co-authored-by: George Wu <jywu@microsoft.com>	2022-12-14 13:06:03 -08:00
Baiju Meswani	8c249cc8f7	[QAT] FakeQuantGrad and gradient building for FakeQuant (#13825 )	2022-12-14 11:54:02 -08:00
Ashwini Khade	6090d8cd6e	Fix usage of enable_training_ops and reduce ifdef complexity for training builds (#13888 ) ### Description Fix usage of enable_training_ops and reduce ifdef complexity for training builds. ### Motivation and Context This is the second refactoring PR toward creating a dedicated build for on-device training, and it aims to reduce some of the complexity. We can set ENABLE_TRAINING_OPS in CMake when either ENABLE_TRAINING or ENABLE_TRAINING_ON_DEVICE is selected, so we don't have to use `#if defined(ENABLE_TRAINING) || defined(ENABLE_TRAINING_ON_DEVICE)` everywhere in the code.	2022-12-14 08:32:46 -08:00
Yi Zhang	7894d44d2d	Improve MacOS Cache Code (#13958 ) ### Description Update the cache key so that the cache can be updated.	2022-12-14 20:47:09 +08:00
Vincent Wang	6900109ee8	Bugfix for GetCpuPreferredNodes (#13590 ) GetCpuPreferredNodes gets CPU-preferred nodes from a graph for a target EP (such as CUDA). It starts from the CPU outputs of target-EP nodes, traverses the graph, and tries to fall back tentative nodes from the target EP to the CPU EP. For example, in Shape->Gather->Concat->Reshape, all 4 nodes are initially tentative nodes. Since the output of Shape is a CPU output, the traversal starts from that output and falls back Gather and Concat to the CPU EP. Reshape cannot fall back because its other input is not a CPU input. Now consider Shape->Gather->ReduceProd->Concat->Reshape. ReduceProd doesn't have an int64_t kernel in the target EP (CUDA here), so it is not a tentative node. The traversal still starts from Shape's output, but with the current logic it stops when it reaches ReduceProd. As a result, Concat does not fall back and is assigned to the target EP, and Memcpy nodes are added before and after it because both its input and output are CPU tensors. This PR fixes this issue. In the case above, ReduceProd is not a tentative node: it either already has an EP assigned or no target-EP kernel was found for it. So we can continue the graph traversal, make it a CPU node, and treat all its outputs as CPU outputs.	2022-12-14 17:54:55 +08:00
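The fixed traversal can be sketched as a breadth-first walk over consumers. A tentative node falls back to CPU only when all its inputs are CPU values. A non-tentative node is walked through rather than treated as a stop point. This is a simplified Python model, not the onnxruntime C++ code. The graph encoding and the function name are hypothetical, and the sketch assumes every non-tentative node reached this way ends up producing CPU outputs.

```python
from collections import deque


def cpu_preferred_nodes(nodes, tentative, cpu_outputs):
    """nodes: {name: (inputs, outputs)}; tentative: nodes that may fall back to CPU;
    cpu_outputs: values known to be produced on CPU (e.g. the output of Shape)."""
    consumers = {}
    for name, (inputs, _) in nodes.items():
        for value in inputs:
            consumers.setdefault(value, []).append(name)

    cpu_values = set(cpu_outputs)
    visited, fallback = set(), set()
    queue = deque(cpu_outputs)
    while queue:
        value = queue.popleft()
        for node in consumers.get(value, []):
            if node in visited:
                continue
            inputs, outputs = nodes[node]
            if node in tentative:
                if not all(v in cpu_values for v in inputs):
                    continue  # e.g. Reshape whose data input is on the GPU; may be revisited later
                fallback.add(node)
            # Non-tentative nodes (e.g. int64 ReduceProd with no CUDA kernel) do not stop
            # the walk any more: their outputs are treated as CPU outputs.
            visited.add(node)
            for out in outputs:
                cpu_values.add(out)
                queue.append(out)
    return fallback


graph = {
    "Shape": (["x"], ["shape"]),
    "Gather": (["shape"], ["dim"]),
    "ReduceProd": (["dim"], ["prod"]),
    "Concat": (["prod"], ["new_shape"]),
    "Reshape": (["data", "new_shape"], ["y"]),
}
print(sorted(cpu_preferred_nodes(graph, {"Gather", "Concat", "Reshape"}, ["shape"])))
# ['Concat', 'Gather']
```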
PeixuanZuo	80a046b36f	[ROCm] update amd CI huggingface model performance number (#13961 ) Fix CI test failure. Test distilbert-base model performance number on gcramdrr1-mi100-08x and update.	2022-12-14 16:30:25 +08:00


7886 commits