### Support SCELoss/SCELossGrad run with larger sized input

#### Motivation and Context

Run a bigger batch size for the Bloom model. For the Bloom560M model, ORT has the potential to run a bigger batch size, going from 6 initially to 10 now.

The input size of SCELoss/SCELossGrad is Bsz × 1023 × 250680. When Bsz is bigger than 8, the total element count cannot be represented by `int32_t`, which those kernels use to pass the total element count. The resulting silent overflow causes indirect exceptions elsewhere, or wrong results without any error.

#### Changes in this PR

- For the SCELossInternal/SCELossGradInternal CUDA kernels, use `uint64_t` to pass the element count and element index when the total element count is bigger than the `int32_t` maximum.
- For the SCELossInternal/SCELossGradInternal CPU kernels:
  - Always use `uint64_t` to pass the element count.
  - Update the Eigen functions involved in the two kernels' implementations to use `ptrdiff_t` instead of the original `int` to pass the element count.
- Parallelize the SCELossInternal/SCELossGradInternal CPU kernels; otherwise they are very slow when handling so many elements.
- Other changes needed:
  - Add `CompareOrtValueNumerals` to compare two OrtValues with different data types (float or float16) without the caller explicitly converting to the lower-precision data type. The comparison is also done in parallel, which reduces the comparison time for the large UT case from 22s to ~1.6s.
  - The check in `IsResultCloselyMatch` is buggy for nan/inf cases, so fix the bugs.
  - The cross entropy tests run the CPU baseline in float, and the result is then compared with the float16 results of the CUDA runs. This causes a precision issue when checking the results: the randomized input data is represented in float and the CPU uses it directly, but CUDA uses a float16 version of it, so the inputs themselves differ in precision. As the test data count increases, the results fail even at a tolerance of 1e-2. The fix is to generate the data in float16, convert it to float for the CPU run, and use the float16 data directly for the CUDA runs. When comparing the outputs, cast the CPU float output back to float16 and then compare it with the CUDA outputs.
  - `RandomValueGenerator` takes about 20s for the large size, so `ParallelRandomValueGenerator` is added to generate the random input in parallel; preparing the input data then takes under 2s.

#### Non-goals

`SoftmaxCrossEntropyLoss` and `SoftmaxCrossEntropyLossGrad` are not covered in this PR.
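The overflow described under Motivation and Context can be checked with a little arithmetic: 8 × 1023 × 250680 = 2,051,565,120 still fits below the `int32_t` maximum of 2,147,483,647, while 9 × 1023 × 250680 = 2,308,010,760 does not. The following is a minimal sketch of that check and of choosing the index type from the element count. All names here (`TotalElementCount`, `FitsInInt32`, `LaunchStub`, `DispatchIndexBytes`) are hypothetical and for illustration only; they are not the helpers used in the actual kernels.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper (not an ORT API): element count of a
// [batch, seq_len, vocab] input, computed in 64 bits.
constexpr std::uint64_t TotalElementCount(std::uint64_t batch,
                                          std::uint64_t seq_len,
                                          std::uint64_t vocab) {
  return batch * seq_len * vocab;
}

// True when the count can be passed through an int32_t without overflow.
constexpr bool FitsInInt32(std::uint64_t count) {
  return count <=
         static_cast<std::uint64_t>(std::numeric_limits<std::int32_t>::max());
}

// Batch size 8 still fits in int32_t; batch size 9 does not.
static_assert(FitsInInt32(TotalElementCount(8, 1023, 250680)),
              "batch size 8 fits in int32_t");
static_assert(!FitsInInt32(TotalElementCount(9, 1023, 250680)),
              "batch size 9 exceeds int32_t");

// Stand-in for a kernel launch templated on its index type (assumption:
// the real kernels take the count and element index as TIndex).
// Returns the index width in bytes so the chosen path can be observed.
template <typename TIndex>
int LaunchStub(std::uint64_t count) {
  // The count must be representable in the chosen index type.
  assert(count <=
         static_cast<std::uint64_t>(std::numeric_limits<TIndex>::max()));
  return static_cast<int>(sizeof(TIndex));
}

// Use the 32-bit path when the count fits, and the 64-bit path otherwise.
inline int DispatchIndexBytes(std::uint64_t count) {
  return FitsInInt32(count) ? LaunchStub<std::int32_t>(count)
                            : LaunchStub<std::uint64_t>(count);
}
```

In this sketch, `DispatchIndexBytes(TotalElementCount(10, 1023, 250680))` takes the 64-bit path, while batch sizes up to 8 keep the 32-bit path, so smaller inputs do not pay for wider index arithmetic.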

ONNX Runtime is a cross-platform inference and training machine-learning accelerator.
ONNX Runtime inference can enable faster customer experiences and lower costs, supporting models from deep learning frameworks such as PyTorch and TensorFlow/Keras as well as classical machine learning libraries such as scikit-learn, LightGBM, and XGBoost. ONNX Runtime is compatible with different hardware, drivers, and operating systems, and provides optimal performance by leveraging hardware accelerators where applicable alongside graph optimizations and transforms.

ONNX Runtime training can accelerate model training on multi-node NVIDIA GPUs for transformer models with a one-line addition to existing PyTorch training scripts.

## Get Started & Resources

- General Information: onnxruntime.ai
- Usage documentation and tutorials: onnxruntime.ai/docs
- YouTube video tutorials: youtube.com/@ONNXRuntime
- Companion sample repositories:
  - ONNX Runtime Inferencing: microsoft/onnxruntime-inference-examples
  - ONNX Runtime Training: microsoft/onnxruntime-training-examples

## Builtin Pipeline Status

| System | Inference | Training |
|---|---|---|
| Windows | | |
| Linux | | |
| Mac | | |
| Android | | |
| iOS | | |
| Web | | |
| Other | | |

## Third-party Pipeline Status

| System | Inference | Training |
|---|---|---|
| Linux | | |

## Data/Telemetry

Windows distributions of this project may collect usage data and send it to Microsoft to help improve our products and services. See the privacy statement for more details.

## Contributions and Feedback

We welcome contributions! Please see the contribution guidelines.

For feature requests or bug reports, please file a GitHub Issue.

For general discussion or questions, please use GitHub Discussions.

## Code of Conduct

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

## License

This project is licensed under the MIT License.