onnxruntime/onnxruntime/test
pengwa bcebd3b1ca
Allow upstream for Slice on single axis (#16410)
### Allow upstream for Slice on single axis

#### Benchmark on 8x32GB V100 + DeepSpeed

On the Bloom560M model, throughput improves by 1.5% at the same maximum
batch size of 6.
```
torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_clm.py  --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1  --num_train_epochs 10 --per_device_train_batch_size 6 --per_device_eval_batch_size 1 --do_train  --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused  --max_steps 200 --logging_steps 1 --use_module_with_loss --deepspeed aml_ds_config_zero_1.json
```

##### Main branch

```
Total overhead: 38957ms where export takes 35493ms.
***** train metrics *****
  epoch                    =       4.08
  train_loss               =     2.6841
  train_runtime            = 0:03:10.67
  train_samples            =       2318
  train_samples_per_second =     50.348
  train_steps_per_second   =      1.049

throughput  per gpu=4.08 * 2318 / (190.67 - 38.957) / 8(gpu) = 7.792 samples/second
```

##### This PR

```
Total overhead: 38649ms where export takes 34946ms.

***** train metrics *****
  epoch                    =       4.08
  train_loss               =     2.6757
  train_runtime            = 0:03:08.08
  train_samples            =       2318
  train_samples_per_second =      51.04
  train_steps_per_second   =      1.063

throughput  per gpu=4.08 * 2318 / (188.08 - 38.649) / 8(gpu) = 7.911 samples/second
```
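
The `throughput per gpu` lines in the two logs above follow one formula: samples processed (`epoch * train_samples`) divided by the runtime that remains after subtracting the one-time overhead, divided by the GPU count. A minimal Python sketch of that arithmetic, using the numbers from the two runs above. The function name `per_gpu_throughput` is an illustrative assumption, not part of ONNX Runtime or the training script.

```python
def per_gpu_throughput(epochs, train_samples, runtime_s, overhead_ms, gpus):
    """Samples per second per GPU, excluding the one-time overhead (export etc.)."""
    effective_runtime_s = runtime_s - overhead_ms / 1000.0
    return epochs * train_samples / effective_runtime_s / gpus


# Values copied from the "Main branch" and "This PR" logs above (8 GPUs).
main_branch = per_gpu_throughput(4.08, 2318, 190.67, 38957, 8)
this_pr = per_gpu_throughput(4.08, 2318, 188.08, 38649, 8)

# Relative gain of this PR over the main branch, in percent.
gain_percent = (this_pr / main_branch - 1.0) * 100.0

print(f"main branch: {main_branch:.3f} samples/second")
print(f"this PR:     {this_pr:.3f} samples/second")
print(f"gain:        {gain_percent:.1f}%")
```

Subtracting the overhead matters because export time differs slightly between runs and would otherwise distort a comparison over only 200 steps.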

#### Benchmark on 4x16GB V100 + AutoCast

On the Bloom560M model, throughput improves by 1.8% at the same batch
size, and by 24% when each branch runs at its own maximum batch size.

It also allows ORT to run a bigger batch size (4 instead of 3) with the
following recipe.

```
torchrun --nproc_per_node=4 examples/onnxruntime/training/language-modeling/run_clm.py  --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1  --num_train_epochs 10 --per_device_train_batch_size 3 --per_device_eval_batch_size 1 --do_train  --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused  --max_steps 200 --logging_steps 1 --use_module_with_loss
```

##### Main branch

```
Total overhead: 4789ms where export takes 3798ms.
***** train metrics *****
  epoch                    =       1.02
  train_loss               =    20.3338
  train_runtime            = 0:01:42.78
  train_samples            =       2343
  train_samples_per_second =     23.349
  train_steps_per_second   =      1.946

throughput  per gpu=1.02 * 2343 / (102.78 - 4.789) / 4(gpu) = 6.097 samples/second
```

##### This PR

```
Total overhead: 4608ms where export takes 3555ms.
***** train metrics *****
  epoch                    =       1.02
  train_loss               =    20.3364
  train_runtime            = 0:01:40.87
  train_samples            =       2343
  train_samples_per_second =     23.792

throughput  per gpu=1.02 * 2343 / (100.87 - 4.608) / 4(gpu) = 6.207 samples/second
```

With this PR, batch size 4 also runs (the main branch fails):

```
Total overhead: 4743ms where export takes 3698ms.
***** train metrics *****
  epoch                    =       1.36
  train_loss               =    20.2096
  train_runtime            = 0:01:50.42
  train_samples            =       2343
  train_samples_per_second =     28.979
  train_steps_per_second   =      1.811


throughput  per gpu=1.36 * 2343 / (110.42 - 4.743) / 4(gpu) = 7.538 samples/second
```



#### Benchmark on 8x32GB V100 + AutoCast

On the Bloom560M model, throughput improves by 0.9% at the same batch
size, and by 8.6% when each branch runs at its own maximum batch size.

It also allows ORT to run a bigger batch size (4 instead of 3) with the
following recipe.

```
torchrun --nproc_per_node=8 examples/onnxruntime/training/language-modeling/run_clm.py  --model_name_or_path bigscience/bloom-560m --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1  --num_train_epochs 10 --per_device_train_batch_size 3 --per_device_eval_batch_size 1 --do_train  --overwrite_output_dir --output_dir ./outputs/ --seed 1137 --fp16 --report_to none --optim adamw_ort_fused  --max_steps 200 --logging_steps 1 --use_module_with_loss

```

##### Main branch

```
Total overhead: 55259ms where export takes 51140ms.
***** train metrics *****
  epoch                    =       2.06
  train_loss               =     2.8788
  train_runtime            = 0:02:36.65
  train_samples            =       2318
  train_samples_per_second =      30.64
  train_steps_per_second   =      1.277

throughput  per gpu=2.06 * 2318 / (156.65 - 55.259) / 8(gpu) = 5.887 samples/second
```

##### This PR

```
Total overhead: 55712ms where export takes 51418ms.
***** train metrics *****
  epoch                    =       2.06
  train_loss               =     2.8696
  train_runtime            = 0:02:36.19
  train_samples            =       2318
  train_samples_per_second =     30.731
  train_steps_per_second   =       1.28

throughput  per gpu=2.06 * 2318/ (156.19 - 55.712) / 8(gpu) = 5.940 samples/second
```

With this PR, batch size 4 also runs (the main branch fails):

```
Total overhead: 54238ms where export takes 49899ms.
***** train metrics *****
  epoch                    =       2.74
  train_loss               =     2.7692
  train_runtime            = 0:02:58.47
  train_samples            =       2318
  train_samples_per_second =     35.859
  train_steps_per_second   =      1.121

throughput  per gpu=2.74 * 2318 / (178.47 - 54.238) / 8(gpu) = 6.391 samples/second
```
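
The per-GPU figures in these benchmarks were derived by hand from the `Total overhead` line and the `train metrics` block. The sketch below shows one way to extract those values from a log excerpt and apply the same formula. It assumes the log layout matches the excerpts above; the helper names `parse_runtime`, `parse_log`, and `throughput_from_log` are hypothetical and not part of any library.

```python
import re


def parse_runtime(text):
    """Convert an 'H:MM:SS.ss' runtime string into seconds."""
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_log(log):
    """Pull overhead (ms) and the 'key = value' metrics out of a log excerpt."""
    overhead_ms = int(re.search(r"Total overhead: (\d+)ms", log).group(1))
    pairs = re.findall(r"^[ \t]*(\w+)[ \t]*=[ \t]*(\S+)[ \t]*$", log, flags=re.MULTILINE)
    metrics = dict(pairs)
    return {
        "overhead_ms": overhead_ms,
        "epoch": float(metrics["epoch"]),
        "train_samples": int(metrics["train_samples"]),
        "runtime_s": parse_runtime(metrics["train_runtime"]),
    }


def throughput_from_log(log, gpus):
    """Samples per second per GPU, with the one-time overhead subtracted."""
    m = parse_log(log)
    effective_runtime_s = m["runtime_s"] - m["overhead_ms"] / 1000.0
    return m["epoch"] * m["train_samples"] / effective_runtime_s / gpus


# Log excerpt copied from the batch size 4 run on 8x32GB V100 above.
LOG = """Total overhead: 54238ms where export takes 49899ms.
***** train metrics *****
  epoch                    =       2.74
  train_loss               =     2.7692
  train_runtime            = 0:02:58.47
  train_samples            =       2318
  train_samples_per_second =     35.859
  train_steps_per_second   =      1.121
"""

result = throughput_from_log(LOG, gpus=8)
print(f"{result:.3f} samples/second per GPU")
```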
2023-07-10 08:36:11 +08:00

| Name | Last commit | Date |
| --- | --- | --- |
| api_tests_without_env | Run clang-format in CI (#15524) | 2023-04-18 09:26:58 -07:00 |
| common | Support SCELossInternal/SCELossInternalGrad run with larger sized input (#16363) | 2023-06-30 08:36:06 +08:00 |
| contrib_ops | [CUDA] Optimize BiasGelu/BiasGeluGrad Kernel (#16608) | 2023-07-07 08:28:38 +08:00 |
| custom_op_registration | Support custom ops taking float 8 tensors as inputs and outputs (#16323) | 2023-07-06 14:36:06 +02:00 |
| debug_node_inputs_outputs | Separate out operator vs model testing. (#16228) | 2023-06-17 12:58:57 +10:00 |
| framework | Re-organize the transpose optimization and layout transformation files. (#16246) | 2023-07-07 08:24:47 +10:00 |
| fuzzing | Run clang-format in CI (#15524) | 2023-04-18 09:26:58 -07:00 |
| global_thread_pools | Run clang-format in CI (#15524) | 2023-04-18 09:26:58 -07:00 |
| ir | Run clang-format in CI (#15524) | 2023-04-18 09:26:58 -07:00 |
| logging_apis | Run clang-format in CI (#15524) | 2023-04-18 09:26:58 -07:00 |
| mlas | NhwcFusedConv: Add before Activation (#15837) | 2023-05-08 21:02:35 -07:00 |
| onnx | Enable -Wshorten-64-to-32 warning if available. (#16524) | 2023-07-07 08:11:44 -07:00 |
| opaque_api | Run clang-format in CI (#15524) | 2023-04-18 09:26:58 -07:00 |
| optimizer | Allow upstream for Slice on single axis (#16410) | 2023-07-10 08:36:11 +08:00 |
| perftest | Enable -Wshorten-64-to-32 warning if available. (#16524) | 2023-07-07 08:11:44 -07:00 |
| platform | Run clang-format in CI (#15524) | 2023-04-18 09:26:58 -07:00 |
| proto | | |
| providers | clean unused parameter in ORT_UNUSED_PARAMETER (#16538) | 2023-07-07 13:20:36 -07:00 |
| python | Allow saving of large models after optimization (github issue 12882) (#16440) | 2023-06-21 22:46:26 -07:00 |
| quantization | remove AllocatorMgr class (#16509) | 2023-06-28 15:43:19 -07:00 |
| shared_lib | Fix Reduced Ops pipeline (#16612) | 2023-07-06 14:32:59 -07:00 |
| testdata | Support custom ops taking float 8 tensors as inputs and outputs (#16323) | 2023-07-06 14:36:06 +02:00 |
| unittest_main | [TensorRT EP] avoid excessive library load/unload overhead when running unit tests. (#15639) | 2023-04-24 14:43:13 -07:00 |
| util | Support SCELossInternal/SCELossInternalGrad run with larger sized input (#16363) | 2023-06-30 08:36:06 +08:00 |
| wasm | Enable Web CI on Linux (#16419) | 2023-06-22 15:42:58 +08:00 |
| win_getopt | Run clang-format in CI (#15524) | 2023-04-18 09:26:58 -07:00 |
| xctest | Run clang-format in CI (#15524) | 2023-04-18 09:26:58 -07:00 |