onnxruntime

mirror of https://github.com/saymrwulf/onnxruntime.git synced 2026-06-24 02:47:54 +00:00

pengwa bcebd3b1ca Allow upstream for Slice on single axis (#16410)

### Allow upstream for Slice on single axis

#### Benchmark on 8x32GB V100 + DeepSpeed

On the Bloom560M model, throughput improves by 1.5% at the same maximum batch size of 6.

```
torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_clm.py --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 6 --per_device_eval_batch_size 1 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 200 --logging_steps 1 --use_module_with_loss --deepspeed aml_ds_config_zero_1.json
```

##### Main branch

```
Total overhead: 38957ms where export takes 35493ms.
*** train metrics ***
epoch = 4.08
train_loss = 2.6841
train_runtime = 0:03:10.67
train_samples = 2318
train_samples_per_second = 50.348
train_steps_per_second = 1.049
throughput per gpu = 4.08 * 2318 / (190.67 - 38.957) / 8(gpu) = 7.792 samples/second
```

##### This PR

```
Total overhead: 38649ms where export takes 34946ms.
*** train metrics ***
epoch = 4.08
train_loss = 2.6757
train_runtime = 0:03:08.08
train_samples = 2318
train_samples_per_second = 51.04
train_steps_per_second = 1.063
throughput per gpu = 4.08 * 2318 / (188.08 - 38.649) / 8(gpu) = 7.911 samples/second
```

#### Benchmark on 4x16GB V100 + AutoCast

On the Bloom560M model, throughput improves by 1.8% at the same batch size and by 24% at the corresponding maximum batch size. The PR also allows ORT to run a larger batch size (4 instead of 3) with the following recipe.

```
torchrun --nproc_per_node=4 examples/onnxruntime/training/language-modeling/run_clm.py --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 3 --per_device_eval_batch_size 1 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 200 --logging_steps 1 --use_module_with_loss
```

##### Main branch

```
Total overhead: 4789ms where export takes 3798ms.
*** train metrics ***
epoch = 1.02
train_loss = 20.3338
train_runtime = 0:01:42.78
train_samples = 2343
train_samples_per_second = 23.349
train_steps_per_second = 1.946
throughput per gpu = 1.02 * 2343 / (102.78 - 4.789) / 4(gpu) = 6.097 samples/second
```

##### This PR

```
Total overhead: 4608ms where export takes 3555ms.
*** train metrics ***
epoch = 1.02
train_loss = 20.3364
train_runtime = 0:01:40.87
train_samples = 2343
train_samples_per_second = 23.792
throughput per gpu = 1.02 * 2343 / (100.87 - 4.608) / 4(gpu) = 6.207 samples/second
```

With this PR, batch size 4 also runs (the main branch fails):

```
Total overhead: 4743ms where export takes 3698ms.
*** train metrics ***
epoch = 1.36
train_loss = 20.2096
train_runtime = 0:01:50.42
train_samples = 2343
train_samples_per_second = 28.979
train_steps_per_second = 1.811
throughput per gpu = 1.36 * 2343 / (110 - 4.743) / 4(gpu) = 7.57 samples/second
```

#### Benchmark on 8x32GB V100 + AutoCast

On the Bloom560M model, throughput improves by 0.9% at the same batch size and by 8.6% at the corresponding maximum batch size. The PR also allows ORT to run a larger batch size (4 instead of 3) with the following recipe.

```
torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_clm.py --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --num_train_epochs 10 --per_device_train_batch_size 3 --per_device_eval_batch_size 1 --do_train --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused --max_steps 200 --logging_steps 1 --use_module_with_loss
```

##### Main branch

```
Total overhead: 55259ms where export takes 51140ms.
*** train metrics ***
epoch = 2.06
train_loss = 2.8788
train_runtime = 0:02:36.65
train_samples = 2318
train_samples_per_second = 30.64
train_steps_per_second = 1.277
throughput per gpu = 2.06 * 2318 / (156.65 - 55.259) / 8(gpu) = 5.887 samples/second
```

##### This PR

```
Total overhead: 55712ms where export takes 51418ms.
*** train metrics ***
epoch = 2.06
train_loss = 2.8696
train_runtime = 0:02:36.19
train_samples = 2318
train_samples_per_second = 30.731
train_steps_per_second = 1.28
throughput per gpu = 2.06 * 2318 / (156.19 - 55.712) / 8(gpu) = 5.940 samples/second
```

With this PR, batch size 4 also runs (the main branch fails):

```
Total overhead: 54238ms where export takes 49899ms.
*** train metrics ***
epoch = 2.74
train_loss = 2.7692
train_runtime = 0:02:58.47
train_samples = 2318
train_samples_per_second = 35.859
train_steps_per_second = 1.121
throughput per gpu = 2.74 * 2318 / (178.47 - 54.238) / 8(gpu) = 6.391 samples/second
```

2023-07-10 08:36:11 +08:00
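The per-GPU throughput figures in the commit message all follow the same arithmetic: epochs times training samples, divided by the runtime with the one-time overhead subtracted, divided by the GPU count. A minimal Python sketch of that calculation, using the 8x32GB V100 + DeepSpeed numbers, might look like this. The function and variable names are hypothetical and are not part of onnxruntime or the benchmark script.

```python
def throughput_per_gpu(epochs, train_samples, runtime_s, overhead_ms, num_gpus):
    """Samples per second per GPU, excluding one-time overhead such as export.

    Hypothetical helper that mirrors the hand-written formula in the
    commit message; it is not an onnxruntime API.
    """
    steady_state_s = runtime_s - overhead_ms / 1000.0  # overhead is logged in ms
    return epochs * train_samples / steady_state_s / num_gpus


# Values copied from the 8x32GB V100 + DeepSpeed logs (0:03:10.67 = 190.67 s).
main = throughput_per_gpu(4.08, 2318, 190.67, 38957, 8)
pr = throughput_per_gpu(4.08, 2318, 188.08, 38649, 8)
gain_pct = (pr / main - 1.0) * 100.0

print(f"main: {main:.3f}  PR: {pr:.3f}  gain: {gain_pct:.1f}%")
# prints: main: 7.792  PR: 7.911  gain: 1.5%
```

Subtracting the overhead matters here because export takes most of the overhead in every run, so raw `train_samples_per_second` understates steady-state throughput.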
..
api_tests_without_env	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00
common	Support SCELossInternal/SCELossInternalGrad run with larger sized input (#16363)	2023-06-30 08:36:06 +08:00
contrib_ops	[CUDA] Optimize BiasGelu/BiasGeluGrad Kernel (#16608)	2023-07-07 08:28:38 +08:00
custom_op_registration	Support custom ops taking float 8 tensors as inputs and outputs (#16323)	2023-07-06 14:36:06 +02:00
debug_node_inputs_outputs	Separate out operator vs model testing. (#16228)	2023-06-17 12:58:57 +10:00
framework	Re-organize the transpose optimization and layout transformation files. (#16246)	2023-07-07 08:24:47 +10:00
fuzzing	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00
global_thread_pools	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00
ir	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00
logging_apis	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00
mlas	NhwcFusedConv: Add before Activation (#15837)	2023-05-08 21:02:35 -07:00
onnx	Enable `-Wshorten-64-to-32` warning if available. (#16524)	2023-07-07 08:11:44 -07:00
opaque_api	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00
optimizer	Allow upstream for Slice on single axis (#16410)	2023-07-10 08:36:11 +08:00
perftest	Enable `-Wshorten-64-to-32` warning if available. (#16524)	2023-07-07 08:11:44 -07:00
platform	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00
proto
providers	clean unused parameter in ORT_UNUSED_PARAMETER (#16538)	2023-07-07 13:20:36 -07:00
python	Allow saving of large models after optimization (github issue 12882) (#16440)	2023-06-21 22:46:26 -07:00
quantization	remove AllocatorMgr class (#16509)	2023-06-28 15:43:19 -07:00
shared_lib	Fix Reduced Ops pipeline (#16612)	2023-07-06 14:32:59 -07:00
testdata	Support custom ops taking float 8 tensors as inputs and outputs (#16323)	2023-07-06 14:36:06 +02:00
unittest_main	[TensorRT EP] avoid excessive library load/unload overhead when running unit tests. (#15639)	2023-04-24 14:43:13 -07:00
util	Support SCELossInternal/SCELossInternalGrad run with larger sized input (#16363)	2023-06-30 08:36:06 +08:00
wasm	Enable Web CI on Linux (#16419)	2023-06-22 15:42:58 +08:00
win_getopt	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00
xctest	Run clang-format in CI (#15524)	2023-04-18 09:26:58 -07:00