onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-07-13 18:08:13 +00:00

Author	SHA1	Message	Date
MistEO	faf9a0f6c7	Fix runtime installation error (#17828 )	2023-10-07 11:50:02 -07:00
Wei-Sheng Chin	b5a103ae16	Upgrade transformers to fix CI (#17823 ) The Python package pipeline fails because "tokenizers" does not compile. Since "tokenizers" is a dependency of "transformers", we update the "transformers" version in the hope that a newer release resolves the problem. ``` error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell` --> tokenizers-lib/src/models/bpe/trainer.rs:517:47 ```	2023-10-07 09:51:24 -07:00
Changming Sun	b76994dc3a	Improve CUDA EP's GetCapability (#17809 ) Improve CUDA EP's GetCapability: Add layout transformer support. Currently the code detects if a node is already assigned to some EP, if yes, it will directly return. ```c++ if (!node.GetExecutionProviderType().empty()) { return; } ``` So, if you call the GetCapability function twice, ```c++ auto caps = GetCapability(); assign_nodes_to_eps(..., caps, ...); auto caps2 = GetCapability(); ``` The second GetCapability() call will return fewer results than the first one. Layout transformer needs to call GetCapability twice as above. So the current GetCapability() implementation is incompatible with the Layout transformer. It is not an issue right now because the CUDA EP doesn't need to do layout transform. But we might want to support a different layout.	2023-10-07 09:05:02 -07:00
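The double-call problem described in the GetCapability commit above can be reproduced with a toy model. The following Python sketch is not ORT code: `Node`, `assign`, and both `get_capability_*` functions are hypothetical names, and the "idempotent" variant is only one possible way to make a second call return the same result, not necessarily what the PR does.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a graph node; not an ORT type."""
    name: str
    supported: bool  # whether this EP has a kernel for the node
    ep: str = ""     # empty string means "not assigned to any EP yet"


def get_capability_skipping_assigned(nodes, ep_name="CUDA"):
    """Behaviour described in the commit: skip any node that already has an EP."""
    return [n.name for n in nodes if n.supported and not n.ep]


def get_capability_idempotent(nodes, ep_name="CUDA"):
    """Assumed alternative: also report nodes already assigned to this same EP."""
    return [n.name for n in nodes if n.supported and n.ep in ("", ep_name)]


def assign(nodes, names, ep_name="CUDA"):
    """Mark the named nodes as assigned to `ep_name`."""
    for n in nodes:
        if n.name in names:
            n.ep = ep_name


def demo(get_capability):
    """Call get_capability, assign the nodes, then call it again."""
    nodes = [Node("conv", True), Node("relu", True), Node("custom", False)]
    first = get_capability(nodes)
    assign(nodes, first)
    second = get_capability(nodes)
    return first, second
```

With the skipping variant, `demo` returns a shorter list from the second call than from the first, which is the behaviour a layout transformer cannot work with.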
PeixuanZuo	37f4f27da0	[ROCm] ONNX Runtime training rocm package for ADO (#17683 ) - We will publish the onnxruntime-training-rocm package on ADO feeds. The onnxruntime-training package will be solely for CUDA. - Add a new pipeline for the onnxruntime-training-rocm ADO feeds: https://aiinfra.visualstudio.com/Lotus/_build?definitionId=1278. Only the package built with the latest ROCm version is published to ADO.	2023-10-07 10:45:35 +08:00
pengwa	7201def4ec	Fix convergence for dolly+stage3 training (#17685 ) ### Fix convergence for dolly+stage3 training In [ZeROOffloadSubscriber](`216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L359C7-L359C28)`), we defined some PythonOps that take an input and return it in place, for example: `216214b7d3/orttraining/orttraining/python/training/utils/hooks/_zero_offload_subscriber.py (L223C20-L223C20)`. It is possible that, once ORT finishes running such a PythonOp, it releases the input OrtValue, which causes the data to be erased or overwritten. But the OrtValue returned by the PythonOp still points to that address, so reading or writing it may produce a wrong result or even undefined behavior. ``` /bert_ort/pengwa/py38/lib/python3.8/site-packages/onnxruntime/training/ortmodule/_custom_autograd_function_runner.py:28: UserWarning: .rank-0: onnxruntime.training.utils.hooks._zero_offload_subscriber.ORTZeROOffloadPreForwardFunction->Backward: ONNX Op attribute 'tensor_reuse_map' doesn't indicate 8-th output is reusing any input, but detected inplace_map indicates it is reusing some input index. A clone will be done before returning to ORT, to align with ORT's NO Buffer reuse plan. Please update inplace_map explicitly to avoid such a copy. 
warnings.warn(f".rank-{get_rank()}: {message}") 0%\|▏ \| 1/1000 [00:04<1:15:08, 4.51s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,023 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 14.1406, 'learning_rate': 0, 'epoch': 0.0} 0%\|▏ \| 1/1000 [00:04<1:15:08, 4.51s/it]Invalidate trace cache @ step 5: expected module 6, but got module 7 0%\|▍ \| 2/1000 [00:04<31:53, 1.92s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,124 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|▋ \| 3/1000 [00:04<18:05, 1.09s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,227 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|▋ \| 3/1000 [00:04<18:05, 1.09s/it][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,326 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|█▏ \| 5/1000 [00:04<08:44, 1.90it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,419 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 0%\|█▏ \| 5/1000 [00:04<08:44, 1.90it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,505 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|█▋ \| 7/1000 [00:05<05:28, 3.02it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,597 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|█▋ \| 7/1000 [00:05<05:28, 3.02it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,690 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 
0, 'epoch': 0.0} 1%\|██▏ \| 9/1000 [00:05<03:57, 4.17it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,791 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|██▏ \| 9/1000 [00:05<03:57, 4.17it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,889 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|██▋ \| 11/1000 [00:05<03:06, 5.32it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:44,981 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.0} 1%\|██▋ \| 11/1000 [00:05<03:06, 5.32it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,073 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 1%\|███▏ \| 13/1000 [00:05<02:33, 6.42it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,166 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 1%\|███▏ \| 13/1000 [00:05<02:33, 6.42it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,256 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|███▌ \| 15/1000 [00:05<02:12, 7.43it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,348 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|███▌ \| 15/1000 [00:05<02:12, 7.43it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,439 >> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|████ \| 17/1000 [00:06<01:59, 8.22it/s][WARNING\|trainer_pt_utils.py:849] 2023-09-25 08:30:45,535 
>> tried to get lr value before scheduler/optimizer started stepping, returning lr=0 {'loss': 0.0, 'learning_rate': 0, 'epoch': 0.01} 2%\|████ \| 17/1000 [00:06<01:59, 8.22it/s]Traceback (most recent call last): File "examples/onnxruntime/training/language-modeling/run_clm.py", line 600, in <module> main() File "examples/onnxruntime/training/language-modeling/run_clm.py", line 548, in main train_result = trainer.train(resume_from_checkpoint=checkpoint) File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 457, in train return inner_training_loop( File "/bert_ort/pengwa/optimum/optimum/onnxruntime/trainer.py", line 781, in _inner_training_loop self.deepspeed.step() File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 2084, in step self._take_model_step(lr_kwargs) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/engine.py", line 1990, in _take_model_step self.optimizer.step() File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(args, kwargs) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1854, in step if self._overflow_check_and_loss_scale_update(): File "/bert_ort/pengwa/deepspeed/deepspeed/utils/nvtx.py", line 15, in wrapped_fn ret_val = func(args, *kwargs) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 1788, in _overflow_check_and_loss_scale_update self._update_scale(self.overflow) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/zero/stage3.py", line 2132, in _update_scale self.loss_scaler.update_scale(has_overflow) File "/bert_ort/pengwa/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 175, in update_scale raise Exception( Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run. 
2%\|████ \| 17/1000 [00:06<06:07, 2.67it/s] [2023-09-25 08:30:51,075] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 1065120) of binary: /bert_ort/pengwa/py38/bin/python Traceback (most recent call last): File "/bert_ort/pengwa/py38/bin/torchrun", line 8, in <module> sys.exit(main()) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper return f(args, **kwargs) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 806, in main run(args) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/run.py", line 797, in run elastic_launch( File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) File "/bert_ort/pengwa/py38/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ examples/onnxruntime/training/language-modeling/run_clm.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2023-09-25_08:30:51 host : orttrainingdev10.internal.cloudapp.net rank : 0 (local_rank: 0) exitcode : 1 (pid: 1065120) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ============================================================ (/bert_ort/pengwa/py38) pengwa@microsoft.com@orttrainingdev10:/bert_ort/pengwa/optim ``` ## The Fix For those output that are reusing input, but ORT is not aware of, we detected on the fly (the first iteration, by checking the output tensor addresses with input tensor 
addresses), and then make an implicit copy before setting it as the PythonOp's output tensor. With this fix: (left: PyTorch, right: ORT) ![image](https://github.com/microsoft/onnxruntime/assets/10530022/0d72f431-2abd-4e52-af99-19974b85edde)	2023-10-07 08:40:19 +08:00
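The detect-then-copy idea in the fix above can be sketched in plain Python. This is an illustration under stated assumptions, not the ORT implementation: `FakeTensor` is a hypothetical stand-in whose `data_ptr()` plays the role of a tensor buffer address, and `declared_reuse` is a hypothetical `{output_index: input_index}` map standing in for whatever reuse information the runtime already knows about.

```python
class FakeTensor:
    """Hypothetical tensor stand-in; data_ptr() mimics a buffer address."""

    def __init__(self, storage):
        self.storage = storage  # a bytearray; aliases share the same object

    def data_ptr(self):
        return id(self.storage)

    def clone(self):
        # A clone owns a fresh copy of the bytes, so its address differs.
        return FakeTensor(bytearray(self.storage))


def detect_undeclared_reuse(inputs, outputs, declared_reuse):
    """Return {output_index: input_index} for outputs that alias an input
    buffer without that reuse being declared."""
    input_index_by_ptr = {}
    for i, tensor in enumerate(inputs):
        input_index_by_ptr.setdefault(tensor.data_ptr(), i)
    found = {}
    for o, tensor in enumerate(outputs):
        i = input_index_by_ptr.get(tensor.data_ptr())
        if i is not None and declared_reuse.get(o) != i:
            found[o] = i
    return found


def copy_undeclared_reuse(inputs, outputs, declared_reuse):
    """Clone every output that silently aliases an input; keep the rest."""
    undeclared = detect_undeclared_reuse(inputs, outputs, declared_reuse)
    return [t.clone() if o in undeclared else t for o, t in enumerate(outputs)]
```

The extra copy costs memory and time, which is why the warning in the log asks for the reuse to be declared explicitly where possible.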
Bowen Bao	891b50cc68	General INFO logging tracking occurrence of GraphTransformer modification (#17819 ) ### Description Adds logging to `GraphTransformer::Apply` that reports whether a modification has taken place. ### Motivation and Context General high-level INFO logging to track which optimizations occurred for a given model. This helps improve the performance of dynamo-exported models by comparing the transformations they trigger against those triggered for a torchscript-exported model.	2023-10-06 17:03:26 -07:00
Hector Li	385fab5bae	[QNN EP] Qnn cache improvement (#17757 ) ### Description Improve the QNN context binary cache feature to reduce memory overhead and initialization time overhead. Instead of dumping a QNN context binary file with metadata as a header, we dump an ONNX-format file with the metadata inside an ONNX node. ### Motivation and Context Reduce memory overhead and initialization time overhead.	2023-10-06 15:56:33 -07:00
Chi Lo	569876fb16	[TensorRT EP] Refactor OrtTensorRTProviderOptions initialization and make it easy to add new field (#17617 ) Two major modifications of this PR: 1. Refactor OrtTensorRTProviderOptions initialization and make it easy to add new fields. 2. Make the Python API capable of using TensorRT plugins by adding a new Python binding API, `register_tensorrt_plugins_as_custom_ops`. (The EP's custom op domain needs to be registered before model load. For the C++ API, it's slightly different: when calling SessionOptionsAppendExecutionProvider_TensorRT_XX, it appends the custom op domain to the session options. Later, ORT can register the custom op domain from the session options before model loading.)	2023-10-06 14:12:20 -07:00
Yulong Wang	6ea493571e	[js/web] use esbuild to accelerate bundle build (#17745 ) ### Description Use esbuild to accelerate bundle build. This change uses esbuild to replace webpack for onnxruntime-web. Bundle build time is reduced from ~20sec to ~0.6sec on my Windows dev box. A few changes applied: - import nodejs modules using "node:" prefix - remove enum declaration inside namespace (EncoderUsage) - use "fs/promises" to replace the old promisify from "util" - separate ort-web and test-runner. Previously they were bundled together; now they are built into 2 files. - optimize karma runner launch time - remove unnecessary sourcemap preprocessor. sourcemaps are handled inside esbuild - remove unnecessary proxies (because ort-web and test-runner are separated now, the paths are correctly inferred) - remove file watcher for test data - optimize special handling as esbuild plugins: - polyfill dummy imports for node.js modules when targeting browser. - load as content string for ort-wasm-.worker.js - load as content string for ./proxy-worker/main.ts - a source patch to ort-wasm-threaded*.js (see details in comments in code) - updated debug configurations for sourcemap mapping to ensure a good out-of-box dev experience	2023-10-06 13:37:37 -07:00
Kaz Nishimura	be1e51af2a	Add length checks to fusion_transpose.py (#17608 ) This change adds list length checks to node's inputs in fusion_transpose.py. It bypasses the optimization if not applicable. ### Motivation and Context Unsqueeze in opsets earlier than 13 has only one input, which causes runtime exceptions.	2023-10-06 12:06:13 -07:00
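The kind of guard this commit describes can be sketched as follows. Before opset 13, Unsqueeze carries `axes` as an attribute, so the node has a single input and indexing `input[1]` raises an exception. `FakeNode` and `get_unsqueeze_axes_input` are hypothetical names used for illustration; this is not the code in fusion_transpose.py.

```python
from dataclasses import dataclass, field


@dataclass
class FakeNode:
    """Hypothetical stand-in for an ONNX node with op_type and input names."""
    op_type: str
    input: list = field(default_factory=list)


def get_unsqueeze_axes_input(node):
    """Return the name of the axes input, or None when the fusion should be
    skipped because the node does not have one."""
    if node.op_type != "Unsqueeze":
        return None
    if len(node.input) < 2:
        # Opset < 13: axes is an attribute, so input[1] does not exist.
        return None
    return node.input[1]
```

Returning `None` lets the caller bypass the optimization instead of failing with an `IndexError`.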
Changming Sun	735df7e2a8	[webgpu]: add a simple GetCapability implementation (#17643 ) Most of the function body was copied from CUDA EP.	2023-10-06 10:52:17 -07:00
Sheil Kumar	cb9408e89c	Enable cpp20 builds for DML EP and WinML API (#17800 ) Enable cpp20 builds for DML EP and WinML API 1) Missing typename for templated types 2) unmove helper for inline references to rvalue temporaries This is okay since per the standard a temporary bound to a reference parameter in a function call exists until the end of the full expression containing that function call: if the function returns a reference, which outlives the full expression, it becomes a dangling reference. 3) static now not needed for template specializations --------- Co-authored-by: Sheil Kumar <sheilk@microsoft.com>	2023-10-06 10:33:38 -07:00
JiCheng	3878011ce2	Remove MPI dependency (#17624 ) ### Description Support launching multi-GPU runs without MPI.	2023-10-06 15:33:18 +08:00
George Wu	b306b02a86	[QNN EP] fixed input for InstanceNormU8 unit test and update copy lib paths (#17806 ) -update InstanceNormU8 with fixed input. With this input, it fails consistently using QNN 2.15.1 -update QNN lib paths (target is deprecated) and additionally copy V73 skel file	2023-10-05 22:17:15 -07:00
Justin Chu	be7541ef4a	[Linter] Bump ruff and remove pylint (#17797 ) Bump ruff version and remove pylint from the linter list. Fix any new errors detected by ruff. ### Motivation and Context Ruff covers many of the pylint rules. Since pylint is not enabled in this repo and runs slowly, we remove it from the linters.	2023-10-05 21:07:33 -07:00
Adrian Lizarraga	7417fd41e2	[QNN EP] Add better unit tests for rank 5 ReduceSum (#17802 ) ### Description We previously had a unit test that checked that QNN EP rejected rank 5 reduce ops. This PR: - Allows the underlying QNN APIs to validate the input rank for Reduce ops. - Modifies a rank 5 ReduceSum unit test so that it can be used to reproduce a graph finalization error on QNN SDK 2.15.1. - Adds a new rank 5 ReduceSum unit test with a configuration that is known to work in QNN SDK 2.15.1. ### Motivation and Context Allows us to more easily test/verify rank 5 support for ReduceSum.	2023-10-05 16:16:05 -07:00
Rachel Guo	5be79e2e29	Remove swift files on ORT main repo (#17799 ) ### Description Move the Swift files to the ORT SPM repo: https://github.com/microsoft/onnxruntime-swift-package-manager --------- Co-authored-by: rachguo <rachguo@rachguos-Mac-mini.local>	2023-10-05 15:27:15 -07:00
Wei-Sheng Chin	faef9c32fa	ONNX-Native Tensor Parallel: Using Distributed MatMul as Example (#17695 ) This PR introduces - New data structure to represent kernel-level (aka node-level or op-level) tensor sharding information. I consider it the foundation of ONNX distributed inference. - Building blocks for implementing distributed kernels, especially stateless implementations of communication ops. - Implementation of DistributedMatMul and its tests. Code structure: - sharding.h/.cc: Function to shard and reshard tensors (calling into NCCL). - sharding_spec.h/.cc: Representation of how a tensor is sharded. - distributed_matmul.h/.cc: Implementation of tensor parallel MatMul. Inputs and outputs are sharded across devices. - onnxruntime_test_distributed.py: distributed operator tests. Example of specifying sharding information ```python @onnxscript.script() def matmul_rs_sr_rr(tensor_x: FLOAT, tensor_w: FLOAT) -> FLOAT: # Run MatMul by sharding x along column axis and w along row axis on # 2 GPUs. return MICROSOFT_OPSET.DistributedMatMul( tensor_x, tensor_w, device_mesh_shape=[2], device_mesh_elements=[0, 1], input_shard_specs=["RS[0]", "S[0]R"], output_shard_specs=["RR"], ) onnx_model = matmul_rs_sr_rr.to_model_proto( input_types=[FLOAT[2, "s"], FLOAT["s", 2]], output_types=[FLOAT[2, 2]], ) ``` In this example, the device mesh can be visualized as 1-D tensor, `[0, 1]`. The 2nd axis of `tensor_x` is sharded across `[0, 1]` (i.e., the 0-axis of the device mesh). Similarly, the 1st axis of `tensor_w` is sharded across `[0, 1]` as well. C++ classes to represent tensor sharding (copied from sharding_spec.h): ```cpp class DeviceMesh { public: // [Device Mesh and Tensor Sharding for Tensor Parallel] // Device mesh is a tensor of device indices. // A tensor can then be partitioned along specific mesh axes. // // Assume we have 4 GPUs indexed by 0, 1, 2, and 3. // Let's consider some examples. // 1. 1D device mesh [0, 1, 2, 3]. 
In this case, // device_mesh_shape is [4] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor along its axis 1, the // corresponding sharding spec is a string "RS[0]". // 2. 2D device mesh [[0, 1], [2, 3]]. In this case, // device_mesh_shape is [2, 2] and device_mesh_elements // is [0, 1, 2, 3]. // If we want to shard a 2-D tensor's // rows along mesh axis 1 and // columns along mesh axis 0, the // corresponding sharding spec is a string "S[1]S[0]". // If that 2-D tensor's value is np.array([[5, 6], [7, 8]]), // GPU 0/1/2/3 owns 5/7/6/8. Below is a visualization of the sharding // process. // - Start with a 2-D device mesh [[0, 1], [2, 3]] and // a 2-D tensor [[5, 6], [7, 8]] // - GPU: [[0, 1], [2, 3]], Tensor: [[5, 6], [7, 8]] // - Split GPU mesh along axis 1 and tensor along // axis 0 for "S[1]" in "S[1]S[0]" // - GPU: [[0], [2]], Tensor: [[5, 6]] // GPU: [[1], [3]], Tensor: [[7, 8]] // - Split GPU mesh along axis 0 and tensor along // axis 1 for "S[0]" in "S[1]S[0]" // - GPU: [[0]], Tensor: [[5]] // - GPU: [[2]], Tensor: [[6]] // - GPU: [[1]], Tensor: [[7]] // - GPU: [[3]], Tensor: [[8]] // Actual shape of device mesh represented by `device_mesh_elements`. std::vector<int64_t> device_mesh_shape; // Flattened device mesh. std::vector<int64_t> device_mesh_elements; }; class AxisPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // This class is the in-memory representation of // 1. if a tensor is sharded or not (aka replica), and // 2. which tensor axis is sharded by which device mesh axis. // Let's consider sharding 2-D tensor along column axis on // device mesh [0, 1] as an example. // The required sharding spec RS[0] can be represented by // - AxisPartitionSpec(Condition::Replica, -1) // - AxisPartitionSpec(Condition::Shard, 0) public: // Status of a tensor axis. // A tensor axis can be either sharded or replicated // along a device mesh axis. 
enum class Condition { Replica, Shard }; // This field tells if a tensor axis is sharded or not. Condition cond; // If a tensor axis is sharded, this field tells which device // mesh axis to distribute the shards along. // If a tensor axis is not sharded, this field is ignored. int device_mesh_axis; // A helper to construct a replica spec for a tensor axis. static AxisPartitionSpec CreateReplica() { return AxisPartitionSpec(Condition::Replica, -1); } // A helper to construct a sharding spec for a tensor axis. // This tensor axis is sharded along `device_mesh_axis` in device mesh. static AxisPartitionSpec CreateShard(int device_mesh_axis) { return AxisPartitionSpec(Condition::Shard, device_mesh_axis); } }; class TensorPartitionSpec { // [Device Mesh and Tensor Sharding for Tensor Parallel] // TensorPartitionSpec holds a collection of AxisPartitionSpec and an // associated DeviceMesh. It is responsible for determining how a tensor // should be partitioned across a device mesh. // // Example 1: RS[0] // In this scenario, `axis_specs` would contain two `AxisPartitionSpec` objects. // - The first object is a Replica, denoting that the first axis of the tensor is // not sharded but is instead replicated. // - The second object is a Shard along the 0-th axis of the device mesh. It denotes // that the second axis of the tensor is sharded along the first axis of the // device mesh. // // Example 2: S[0]RR // In this scenario, `axis_specs` would contain three `AxisPartitionSpec` objects. // - The first object is a Shard along the 0-th axis of the device mesh, indicating // that the first axis of the tensor is sharded along the first axis of the // device mesh. // - The second and third objects are Replicas, indicating that the second and third // axes of the tensor are not sharded but are instead replicated. public: // axis_specs[i]: AxisPartitionSpec for tensor axis i. For a 2-D tensor, // axis_specs[0] is for row axis and axis_specs[1] is for // column axis. 
axis_specs[i].device_mesh_axis = j means that // tensor axis i is sharded along device mesh axis j. std::vector<AxisPartitionSpec> axis_specs; // device_mesh: DeviceMesh for sharding the associated tensor. // Read [Device Mesh and Tensor Sharding for Tensor Parallel] in DeviceMesh's comment. DeviceMesh device_mesh; }; ```	2023-10-05 14:22:25 -07:00
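The sharding-spec strings and the `S[1]S[0]` walkthrough in the comments above can be checked with a small pure-Python sketch. It is an illustration only, not the code in sharding_spec.h: `parse_shard_spec` and `shard_2d` are hypothetical helpers, tensors are nested lists, only 2-D tensors are handled, and every sharded dimension is assumed to divide evenly by the mesh axis size.

```python
import itertools
import re

_TOKEN = re.compile(r"R|S\[(\d+)\]")


def parse_shard_spec(spec):
    """Parse e.g. "RS[0]" into [None, 0].
    None means the tensor axis is replicated; an int is the device mesh axis
    that the tensor axis is sharded along."""
    axes = []
    pos = 0
    while pos < len(spec):
        m = _TOKEN.match(spec, pos)
        if m is None:
            raise ValueError(f"bad sharding spec {spec!r} at offset {pos}")
        axes.append(None if m.group(0) == "R" else int(m.group(1)))
        pos = m.end()
    return axes


def shard_2d(tensor, spec, mesh_shape, mesh_elements):
    """Return {device_id: sub-matrix} for a 2-D nested-list tensor."""
    axes = parse_shard_spec(spec)
    if len(axes) != 2:
        raise ValueError("this sketch only handles 2-D tensors")
    dims = (len(tensor), len(tensor[0]))
    shards = {}
    # Enumerate mesh coordinates in row-major order, matching the flattened
    # device_mesh_elements layout.
    coords = itertools.product(*(range(n) for n in mesh_shape))
    for flat, coord in enumerate(coords):
        bounds = []
        for dim, mesh_axis in zip(dims, axes):
            if mesh_axis is None:
                bounds.append((0, dim))  # replicated: keep the whole axis
            else:
                size = dim // mesh_shape[mesh_axis]
                start = coord[mesh_axis] * size
                bounds.append((start, start + size))
        (r0, r1), (c0, c1) = bounds
        shards[mesh_elements[flat]] = [row[c0:c1] for row in tensor[r0:r1]]
    return shards
```

Calling `shard_2d([[5, 6], [7, 8]], "S[1]S[0]", [2, 2], [0, 1, 2, 3])` gives devices 0, 1, 2, and 3 the values 5, 7, 6, and 8 respectively, which agrees with the comment in `DeviceMesh`.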
Benedikt Hilmes	742069a8e8	Add option for max intermediate outputs for MinMaxCalibrater (#17029 ) ### Description Adds the option to set max_intermediate_outputs for quantization with the MinMaxCalibrater via extra_options, following the structure of existing flags. ### Motivation and Context When running quantization with the MinMaxCalibrater on larger datasets, one quickly runs out of memory because it tries to load the full dataset. Since merging and clearing of the intermediate_outputs is already implemented within the Calibrater, this simply adds an optional flag to make use of these functions during quantization.	2023-10-05 11:43:12 -07:00
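Why a cap on buffered intermediate outputs bounds memory can be shown with a toy running min/max. This sketch is not the MinMaxCalibrater implementation: `running_min_max` is a hypothetical function, batches are plain lists of numbers, and each batch is assumed to be non-empty.

```python
def running_min_max(batches, max_intermediate_outputs=None):
    """Compute the global (min, max) over an iterable of batches, buffering
    at most `max_intermediate_outputs` batches before merging and clearing.
    With None, everything is buffered and merged once at the end."""
    buffered = []
    result = None  # (min, max) merged so far

    def merge():
        nonlocal result
        for batch in buffered:
            lo, hi = min(batch), max(batch)
            if result is None:
                result = (lo, hi)
            else:
                result = (min(result[0], lo), max(result[1], hi))
        buffered.clear()

    for batch in batches:
        buffered.append(list(batch))
        if (max_intermediate_outputs is not None
                and len(buffered) >= max_intermediate_outputs):
            merge()
    merge()  # fold in whatever is still buffered
    return result
```

The result is the same with or without the cap; only the peak number of batches held in memory changes.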
Edward Chen	b6bef0f063	Add test for iOS dynamic framework (#17790 ) Add test to cover iOS dynamic framework usage.	2023-10-05 11:18:51 -07:00
Ye Wang	0e988239cc	[BeamSearch]optimize key cache reordering (#17771 ) ### Description Replace onnxruntime::cuda::Transpose4DKernelParallelizeMultipleElementsPerThreadInInnermostDim() with a custom transpose kernel in ReorderPastState(). The original implementation doesn't benefit from vectorized loads and coalesced (write) accesses, and doesn't fully utilize the threads in a block. Benchmarked with the TNLGv4 model (batch=4, seq_len=4K): transpose kernel speedup: ~1.9X (392 μs -> 206 μs); overall reordering speedup: ~1.48X. Latency: before: ![image](https://github.com/microsoft/onnxruntime/assets/52801275/34c7ab73-3da1-4c41-a036-e9fb6a966891) after: ![image](https://github.com/microsoft/onnxruntime/assets/52801275/337818ec-9598-4d8a-9e9b-7215b6862498) GPU matrix: before: ![image](https://github.com/microsoft/onnxruntime/assets/52801275/4962248f-703c-49bd-8586-deaeccd9bce0) after: ![image](https://github.com/microsoft/onnxruntime/assets/52801275/a795a892-4c5d-432d-8375-0bb67385d2bc) --------- Co-authored-by: Your Name <you@example.com>	2023-10-05 10:29:11 -07:00
Hector Li	e1a089c23c	[QNN EP] Skip Op validation for Q & DQ node with 5D data (#17792 ) ### Description Skip Op validation for Q & DQ nodes with 5D data to work around a bug in QNN.	2023-10-05 09:54:56 -07:00
Tianlei Wu	d6dad96923	Add CUDA EP in StableDiffusion demo (#17788 ) Add CUDA EP to the demo of stable diffusion. ### A100 Performance Test \| Engine Property \| Batch Size \| TRT Latency (ms) \| ORT_TRT Latency (ms) \| ORT_CUDA Latency (ms) \| TORCH Latency (ms) -- \| -- \| -- \| -- \| -- \| -- \| -- SD 1.5, 50 steps, 512x512 \| Static Input Shape \| 1 \| 861 \| 851 \| 861 \| N/A SD 1.5, 50 steps, 512x512 \| Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 \| 1 \| 974 \| 1079 \| 928 \| 1222 SD 1.5, 50 steps, 768x768 \| Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 \| 1 \| 2492 \| OOM \| 1901 \| 1971 SD 1.5, 50 steps, 768x768 \| Dynamic Input Shape, Optimized for batch size 1 and image size 512x512 \| 4 \|9091 \| OOM \| 6785 \| 6700 We can see that ORT_CUDA is the most robust one for handling dynamic input shape. PyTorch could be a good choice if you run large batch size. The above result is from one A100-SXM4-80GB GPU (in Standard_ND96amsr_A100_v4 Azure VM) with 50 steps to generate 512x512 or 768x768 images using StableDiffusion 1.5. Onnxruntime-gpu is built from source, and the following packages or libraries are used in this test: * tensorrt==8.6.1.post1 * torch==2.2.0.dev20230920+cu121 * transformers==4.31.0 * diffusers==0.19.3 * onnx==1.14.1 * onnx-graphsurgeon==0.3.27 * polygraphy==0.47.1 * protobuf==3.20.2 * onnxruntime-gpu==1.17.0 (built from source of main branch) * CUDA 12.2.2 * cuDNN 8.9.5.29 * python 3.10.13 For static input shape, the engine is built with static batch size and static image shape, and cuda graph is enabled. For dynamic input shape, the engine is built to support dynamic batch size and dynamic image shape, and cuda graph is disabled. The TensorRT engine is built for batch size 1~4, image size 256x256 ~ 1024x1024, and the optimized image size is 512x512. 
The scripts to test static and dynamic input shapes look like the following: ``` prompt="a cute magical flying dog, fantasy art drawn by disney concept artists, highly detailed, digital paintining" for e in TRT ORT_TRT ORT_CUDA do python demo_txt2img.py --engine $e "$prompt" python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape "$prompt" python demo_txt2img.py --engine $e --disable-cuda-graph --build-dynamic-batch --build-dynamic-shape --height 768 --width 768 "$prompt" done ``` Performance of PyTorch is from commands like the following: ``` python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 512 --width 512 python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 1 --height 768 --width 768 python benchmark.py -e torch -v 1.5 --enable_torch_compile -b 4 --height 768 --width 768 ```	2023-10-05 08:19:20 -07:00
Jiajia Qin	db3901ab97	[js/webgpu] Enable the NCHW ConvMatMul path (#17717 ) 1) Enable pointwise NCHW conv2d by MatMul. 2) Enable non-pointwise NCHW conv2d by convMatMul. 3) Fix bug when `sameSize` is true --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-10-05 00:26:01 -07:00
Edward Chen	1bc115719c	Unify handling of public headers in onnxruntime.cmake. (#17779 ) The changes in PR #8919 overwrote the PUBLIC_HEADER property value of the `onnxruntime` target with a list that did not include EP-specific headers. We should probably be using a consistent set of header files across packages anyway.	2023-10-04 08:55:08 -07:00
Tianlei Wu	a05580ed5b	StableDiffusion XL with TensorRT EP (#17748 ) Accelerate StableDiffusion XL with TensorRT EP. It is modified from TensorRT demo diffusion, and we updated the design to make the pipeline work with different backend engines. The following results are from an A100 80GB with 30 steps of Base, or 30 steps of Base & 30 steps of Refiner, to generate 1024x1024 images. The engine is built with static input shape, and cuda graph is enabled. \| Batch Size \| TRT Latency (ms) \| ORT_TRT Latency (ms) \| Diff -- \| -- \| -- \| -- \| -- Base \| 1 \| 2714 \| 2679 \| -1.3% Base & Refiner \| 1 \| 3593 \| 3530 \| -1.8% The test environment: onnxruntime-gpu is built from source, and the following packages or libraries are used in this test: * tensorrt==8.6.1.post1 * torch==2.2.0.dev20230920+cu121 * transformers==4.31.0 * diffusers==0.19.3 * onnx==1.14.1 * onnx-graphsurgeon==0.3.27 * polygraphy==0.47.1 * protobuf==3.20.2 * onnxruntime-gpu==1.17.0 (built from source of main branch) * CUDA 12.2.2 * cuDNN 8.9.5.29 * python 3.10.13	2023-10-04 08:01:39 -07:00
Adrian Lizarraga	8e6019af2e	[QNN EP] Enable QNN Saver for debugging issues (#17747 ) ### Description - Enables option to use the QNN Saver backend for dumping QNN API calls to file. - Adds logic to read environment variable `ORT_UNIT_TEST_ENABLE_QNN_SAVER` from QNN EP unit tests. If enabled, unit tests will use the QNN Saver backend and dump files to `./saver_output/`. ### Motivation and Context QNN Saver makes it easier to debug issues when unit tests fail. The output files generated by QNN Saver can be used to replay the exact QNN API calls that lead to a specific error condition. QNN Saver dumps QNN API calls (and weights) to disk. - saver_output/saver_output.c: C file containing all QNN API calls. - saver_output/params.bin: binary file containing all input/output/parameter tensor data provided during tensor creation, op config validation, and graph execution. Enabling the QNN Saver backend has 2 note-worthy effects: 1. All QNN API calls will succeed. 2. Inference output returns dummy data. Because the output files from QNN Saver are always overwritten, it is recommended to run individual unit tests via the `--gtest_filter` command-line option. Example (linux): ```shell $ ORT_UNIT_TEST_ENABLE_QNN_SAVER=1 ./onnxruntime_test_all --gtest_filter=QnnHTPBackendTests.Resize_DownSample_Linear_AlignCorners ```	2023-10-03 16:24:33 -07:00
Xu Xing	992f3e4609	[js/webgpu] Support where (#17544 ) Supported types: float, int32_t, uint32_t, bool. Case where_broadcast.jsonc is not enabled due to https://github.com/microsoft/onnxruntime/issues/17405. --------- Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-10-03 14:28:21 -07:00
Guenther Schmuelling	f8a8452a6b	[js/webgpu] fix pad operator (#17775 ) fix pad operator	2023-10-03 13:39:50 -07:00
Arthur Islamov	d0519a7603	[js/web] BiasSplitGelu and BiasAdd kernels (#17161 ) ### Description Two contrib kernels that are supposed to speed up StableDiffusion according to this doc https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md However, there is no noticeable effect on speed or memory consumption. So I guess the only way to make it faster is to implement MultiHeadAttention, but I'm not capable of doing that right now. So I'll focus on existing PRs and on finding the JSEP kernel that produces incorrect results. It should be one of the old ones (I suspect Conv or ConvTranspose), as SD has not been generating images correctly on webgpu since I started working on it. I hoped someone else would fix that by the time I finish with kernels/optimizations 😅 --------- Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-10-03 12:20:20 -07:00
Kaz Nishimura	d11e053412	Add option to specify the EP to use, enabling DML EP and others (#17490 ) ### Description Add DML EP to the acceptable provider list in the optimizer. ### Motivation and Context With DML EP, graph optimization was not performed in onnxruntime.	2023-10-02 23:53:09 -07:00
Yulong Wang	451c02543a	[js/webgpu] allow specify preferredLayout (#17756 ) ### Description Allow WebGPU backend to specify `preferredLayout`. Default is NHWC. ```js const options = {executionProviders: [{name:'webgpu', preferredLayout: 'NCHW'}]}; sess1 = await ort.InferenceSession.create('./mobilenetv2-12.onnx', options); ``` ### Motivation and Context - implement @qjia7's requirement for an easier way to do performance comparison between NCHW vs NHWC. - It's possible that NCHW does better on some models and NHWC on others. So offer user the capability to switch.	2023-10-02 21:25:12 -07:00
zesongw	f158f394d6	[WebNN EP] Support Softmax since version 13 (#17714 ) ### Description WebNN only supports Softmax on a 2-D input tensor along axis 1. For now, we use a Reshape and Transpose workaround to get a compatible input. ### Motivation and Context Enable more models to run on WebNN.	2023-10-02 13:01:04 -07:00
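The Reshape and Transpose idea in the commit above can be sketched in plain JavaScript. This is only an illustration on flat row-major arrays, not the WebNN EP's code: the helper names `transpose`, `softmaxRows` and `softmaxAlongAxis` are hypothetical, and the sketch assumes a non-empty shape. The target axis is swapped with the last axis, the data is treated as a 2-D `[rows, cols]` tensor, softmax runs along axis 1, and the same swap restores the original layout.

```js
// Permute the axes of a row-major tensor: output axis i is input axis perm[i].
function transpose(data, dims, perm) {
  const rank = dims.length;
  const outDims = perm.map((p) => dims[p]);
  // Row-major strides of the input tensor.
  const strides = new Array(rank).fill(1);
  for (let i = rank - 2; i >= 0; i--) strides[i] = strides[i + 1] * dims[i + 1];
  const out = new Array(data.length);
  const index = new Array(rank).fill(0); // current output coordinate
  for (let o = 0; o < data.length; o++) {
    let src = 0;
    for (let i = 0; i < rank; i++) src += index[i] * strides[perm[i]];
    out[o] = data[src];
    // Advance the output coordinate in row-major order.
    for (let i = rank - 1; i >= 0; i--) {
      if (++index[i] < outDims[i]) break;
      index[i] = 0;
    }
  }
  return { data: out, dims: outDims };
}

// Numerically stable softmax over each row of a [rows, cols] tensor (axis 1).
function softmaxRows(data, rows, cols) {
  const out = new Array(data.length);
  for (let r = 0; r < rows; r++) {
    const base = r * cols;
    let max = -Infinity;
    for (let c = 0; c < cols; c++) max = Math.max(max, data[base + c]);
    let sum = 0;
    for (let c = 0; c < cols; c++) {
      out[base + c] = Math.exp(data[base + c] - max);
      sum += out[base + c];
    }
    for (let c = 0; c < cols; c++) out[base + c] /= sum;
  }
  return out;
}

// Softmax along an arbitrary axis, built only from transpose + 2-D softmax.
function softmaxAlongAxis(data, dims, axis) {
  const rank = dims.length;
  const a = axis < 0 ? axis + rank : axis;
  // Swapping `a` with the last axis is its own inverse permutation.
  const perm = dims.map((_, i) => i);
  perm[a] = rank - 1;
  perm[rank - 1] = a;
  const moved = transpose(data, dims, perm);
  const cols = moved.dims[rank - 1];
  const rows = data.length / cols; // "reshape" to [rows, cols]
  const flat = softmaxRows(moved.data, rows, cols);
  return transpose(flat, moved.dims, perm).data;
}

// Each column of this [2, 3] tensor holds two equal values.
console.log(softmaxAlongAxis([1, 2, 3, 1, 2, 3], [2, 3], 0)); // every entry is 0.5
```

A swap is used instead of a general "move axis to the end" permutation so that the forward and backward transposes share one `perm`.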
Scott McKay	ac4e726046	Add bytes model loading test to react native e2e (#17749 ) ### Description Update E2E test to also check InferenceSession.create with bytes. ### Motivation and Context Add tests to validate #17739	2023-10-02 12:25:28 +10:00
Ella Charlaix	63acaf47d2	Fix onnx quantizer activation and weight type attribute (#17651 ) In [`quantize_subgraph`](https://github.com/microsoft/onnxruntime/blob/v1.16.0/onnxruntime/python/tools/quantization/onnx_quantizer.py#L188-L189) `self.weight_qType` and `self.activation_qType` are [integers](https://github.com/microsoft/onnxruntime/blob/v1.16.0/onnxruntime/python/tools/quantization/onnx_quantizer.py#L115-L116) while `ONNXQuantizer` expects `QuantType`	2023-09-30 18:06:34 -07:00
xhcao	0d60604638	[JS/WebGPU] support Range operator (#17233 ) The patch also introduces a method that copies data from GPU to CPU synchronously.	2023-09-30 02:05:32 -07:00
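For reference, the ONNX Range operator produces `max(ceil((limit - start) / delta), 0)` elements, so its output length depends on input values rather than input shapes; this is presumably why a synchronous GPU-to-CPU read is useful here. A minimal CPU sketch of those semantics follows; `range` is a hypothetical helper, not the WebGPU kernel from the commit.

```js
// Reference semantics of ONNX Range; hypothetical helper for illustration only.
function range(start, limit, delta) {
  // Guard added for this sketch: a zero step would give an unbounded length.
  if (delta === 0) throw new Error('delta must be non-zero');
  const count = Math.max(Math.ceil((limit - start) / delta), 0);
  const out = new Array(count);
  for (let i = 0; i < count; i++) out[i] = start + i * delta;
  return out;
}

console.log(range(3, 9, 3));   // [3, 6]
console.log(range(10, 4, -2)); // [10, 8, 6]
console.log(range(5, 1, 1));   // []
```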
Arthur Islamov	a941dd583e	[js/web] FP16 Conv, ConvTranspose and MatMul (#17514 ) ### Description Another three ops for fp16 --------- Co-authored-by: Guenther Schmuelling <guschmue@microsoft.com> Co-authored-by: Yulong Wang <7679871+fs-eire@users.noreply.github.com>	2023-09-30 00:00:23 -07:00
Changming Sun	9aad78721c	Update debug_alloc.cc: filter out one more memory leak from absl (#17746 )	2023-09-29 20:40:09 -07:00
Pranav Sharma	668c70ee11	Add support for specifying a custom logging function per session. (#17727 ) ### Description Add support for specifying a custom logging function per session. Bindings for other languages will be added after this PR is merged. ### Motivation and Context Users want a way to override the logging provided by the environment.	2023-09-29 19:46:55 -07:00
Caroline Zhu	6a5f469d44	Add training interfaces to js/common (#17333 ) ### Description Following the design document: * Added CreateTrainingSessionHandler to the Backend interface * All existing Backend implementations throw an error for the new method createTrainingSessionHandler * Created TrainingSession namespace, interface, and TrainingSessionFactory interface * Created TrainingSessionImpl class implementation As methods are implemented, the TrainingSession interface will be added to or modified. ### Motivation and Context Adding the public-facing interfaces to the onnxruntime-common package is one of the first steps to support ORT training for web bindings. --------- Co-authored-by: Caroline Zhu <carolinezhu@microsoft.com>	2023-09-29 19:05:10 -07:00
Rachel Guo	e106b1eb8f	Fix react native load from Uint8Array buffer bug (#17739 ) ### Description Use `.buffer` of Uint8Array to get ArrayBuffer. TODO: Add E2E React Native test case to cover JS level testing to avoid future breakage. ### Motivation and Context #17732 Co-authored-by: rachguo <rachguo@rachguos-Mini.attlocal.net>	2023-09-29 18:03:28 -07:00
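One general caveat when reading `.buffer` from a `Uint8Array`, shown as a plain JavaScript sketch rather than the React Native binding code: `.buffer` returns the whole underlying `ArrayBuffer`, which can be larger than the view if the `Uint8Array` was created with `subarray()` or a byte offset. The helper name `toArrayBuffer` is hypothetical.

```js
// Hypothetical helper, not part of onnxruntime-react-native.
// Returns an ArrayBuffer holding exactly the bytes visible through `bytes`.
function toArrayBuffer(bytes) {
  const coversWholeBuffer =
    bytes.byteOffset === 0 && bytes.byteLength === bytes.buffer.byteLength;
  // Fast path: the view spans the whole buffer, so `.buffer` can be used as-is.
  if (coversWholeBuffer) {
    return bytes.buffer;
  }
  // Otherwise copy out only the viewed range.
  return bytes.buffer.slice(bytes.byteOffset, bytes.byteOffset + bytes.byteLength);
}

const whole = new Uint8Array([1, 2, 3, 4]);
const view = whole.subarray(1, 3); // shares memory with `whole`, sees bytes 2 and 3

console.log(toArrayBuffer(whole).byteLength); // 4
console.log(toArrayBuffer(view).byteLength);  // 2
console.log(view.buffer.byteLength);          // 4 (the full underlying buffer)
```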
shaahji	5a623dca01	Python API to check whether collective ops are available or not (#17730 ) ### Description Adds an API to check whether collective ops are available. Since there is no independent MPI-enabled build, this flag can be used on the Python side for branching, specifically to conditionally enable tests. ### Motivation and Context Flag to be used in Python to check whether onnxruntime supports collective ops. Handy for conditionally enabling/disabling tests and for other branching decisions.	2023-09-29 14:11:05 -07:00
Changming Sun	14d349e290	Enable backtrace in unit tests (#17655 ) ### Description Google test can be built either with absl/re2 or without. This PR enables the build option so that the Google test framework can print out a nice stack trace when something goes wrong. It helps locate test errors in CI build pipelines. Also, Google test will remove the build option and make it always ON, so sooner or later we must make this change.	2023-09-29 12:32:56 -07:00
Yulong Wang	561aca97cf	[js/webgpu] support IO binding (#17480 ) <del> This PR is based on a few prerequisite PRs. They are listed below: - #17465 - #17469 - #17470 - #17472 - #17473 - #17484 Please review the current change by only looking at commit e2e6623e673ec6de55a5c1f8edcbd3a46b535a89 and later. </del>

### Description

This PR introduces WebGPU IO binding. This new feature allows onnxruntime-web users to use tensors created from GPU as model input/output, so that model inference can run without unnecessary data copies between CPU and GPU for model input/output.

### Examples

An E2E demo/example is being worked on. The following is a simple demo with code snippets.

Let's first look at how this is done today:

```js
// STEP.1 - create an inference session:
const mySession = await ort.InferenceSession.create('./my_model.onnx', { executionProviders: ['webgpu'] });

// STEP.2 - create model input: (supposing myImageCpuData is a Float32Array)
const feeds = {
  'input_image:0': new ort.Tensor('float32', myImageCpuData, [1, 224, 224, 3])
};

// STEP.3 - run model
const myResults = await mySession.run(feeds);

// STEP.4 - get output data
const myData = myResults['output_image:0'].data; // Float32Array
```

#### for inputs (GPU tensor)

Now, with IO binding, you can create a tensor from a GPU buffer and feed it to the model:

```js
// new STEP.2.A - create model input from a GPU buffer: (supposing myInputGpuBuffer is a `GPUBuffer` object with input data)
const feeds = {
  'input_image:0': ort.Tensor.fromGpuBuffer(myInputGpuBuffer, { dataType: 'float32', dims: [1, 224, 224, 3] })
};
```

#### for outputs (pre-allocated GPU tensor)

You can also do that for outputs, if you know the output shape:

```js
// new STEP.2.B - create model output from a GPU buffer: (supposing myOutputGpuBuffer is a pre-allocated `GPUBuffer` object)
const fetches = {
  'output_image:0': ort.Tensor.fromGpuBuffer(myOutputGpuBuffer, { dataType: 'float32', dims: [1, 512, 512, 3] })
};

// new STEP.3 - run model with pre-allocated output (fetches)
const myResults = await mySession.run(feeds, fetches);
```

#### for outputs (specify location)

If you do not know the output shape, you can specify the output location when creating the session:

```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
  executionProviders: ['webgpu'],
  preferredOutputLocation: "gpu-buffer"
});
```

If the model has multiple outputs, you can specify them separately:

```js
// new STEP.1 - create an inference session with an option "preferredOutputLocation":
const mySession = await ort.InferenceSession.create('./my_model.onnx', {
  executionProviders: ['webgpu'],
  preferredOutputLocation: { "output_image:0": "gpu-buffer" }
});
```

Now you don't need to prepare the `fetches` object, and onnxruntime-web will place the output data at the specified location.

#### read data

When you get the output tensor, you can:

```js
// get the gpu buffer object:
const gpuBuffer = myOutputTensor.gpuBuffer; // GPUBuffer

// get the CPU data asynchronously
const cpuData = await myOutputTensor.getData();

// alternatively: get the CPU data asynchronously and release the underlying GPU resources
const cpuDataReleased = await myOutputTensor.getData(true);

// dispose the tensor (release the underlying GPU resources). This tensor object will be invalid after dispose() is called.
myOutputTensor.dispose();
```

#### resource management

JavaScript has GC, so you don't need to worry about managing JavaScript objects. But there are 2 types of resources that are not managed by GC:
- GPU buffers that are used in tensors
- Underlying ORT native resources

To simplify, most of the unmanaged resources are handled inside ORT web. But there are a few resources that users need to manage:
- All external GPU resources, including GPU buffers inside all tensors created by `Tensor.fromGpuBuffer()`, will not be managed by ORT. Users should manage those GPU buffers themselves.
- When a session is created with `preferredOutputLocation` == "gpu-buffer" specified in session options, and the corresponding output is not pre-allocated, users need to call the output tensor's `dispose()` or `getData(true)` to manually release the underlying GPU buffers.
- ORT internal errors (including providing a pre-allocated output tensor with wrong type/dims) will invalidate the whole wasm memory and are not recoverable. An exception is thrown in this situation.	2023-09-29 11:24:42 -07:00
satyajandhyala	b4fbc25b1f	[JS/Web] Add ConvTranspose implementation using MatMul (#17573 ) ### Description Add ConvTranspose implementation using MatMul to increase perf.	2023-09-29 11:00:44 -07:00
Changming Sun	caf98128c1	Update linux-wasm-ci.yml: remove the ln command (#17735 ) ### Description /usr/local/bin can only be modified by root. This command seems unnecessary	2023-09-28 21:43:29 -07:00
Scott McKay	9cb60c5b86	Resize and EP specific transpose optimization updates (#17664 ) ### Description - Treat Resize as layout sensitive by default - whilst the ONNX spec does not specify a layout, EPs tend to implement only one - add second usage in L2 of TransposeOptimizer to plug in the ability to push a Transpose through a Resize assigned to the CPU EP - Allow EP-specific logic that changes which ops are considered layout sensitive to be plugged in - expected usage is for #17200 ### Motivation and Context Finish simplifying/clarifying transpose optimization and layout transformation that was proposed in #15552. This PR along with #17618 should complete the changes. --------- Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>	2023-09-29 08:11:36 +10:00
Tianlei Wu	20f96fd096	Fix Attention Runtime Error for CLIP model (#17729 ) ### Description The condition check is not correct:
```
if (is_unidirectional_ && enable_fused_causal_attention_) {
  // GPT
} else {
  // BERT
}
```
Change it to:
```
if (is_unidirectional_) {
  // GPT
} else {
  // BERT
}
```
Another workaround is to enable fused causal attention by setting the environment variable `ORT_ENABLE_FUSED_CAUSAL_ATTENTION=1` before running stable diffusion. ### Motivation and Context Without the fix, the optimized CLIP model of stable diffusion will encounter an error when running the Attention node: 2023-09-24 16:15:31.206037898 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Attention node. Name:'Attention_0' Status Message: /onnxruntime_src/onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.cu:207 bool onnxruntime::contrib::cuda::FusedMHARunnerFP16v2::mhaImpl::is_flash_attention(int) const interface->mHasCausalMask == false was false. Note that the bug has been there for a long time. It only surfaced because we recently added a fusion for CLIP, which triggers the error. We will add a comprehensive test for causal attention later to avoid such corner cases.	2023-09-28 14:32:08 -07:00
Jian Chen	fc9a69dcae	Update VecAddMoveOnlyFunctor and VecAddWithIsSupportedMethod with Default constructor (#17705 )	2023-09-28 09:30:42 -07:00
Yi Zhang	9136748462	Fix: Fail to skip disabledmodel in winml (#17728 ) ### Description Move appending source name behind the ModifyNameIfDisabledTest ### Motivation and Context In winml, disabled test name doesn't include the model source name. WinML job will be broken in the new image. https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1151451&view=logs&s=4eef7ad1-5202-529d-b414-e2b14d056c05 ### Verified https://dev.azure.com/onnxruntime/onnxruntime/_build/results?buildId=1151691&view=logs&s=4eef7ad1-5202-529d-b414-e2b14d056c05	2023-09-28 13:46:44 +08:00

9729 commits