# Summary

This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:

- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

The following two PRs were required to enable local builds:

- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo

We still do not build our Python wheels with this architecture. @ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit: https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use:

```cpp
#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900
```

to gate the building of the kernel, but I was having a hell of a time with it, so I am not sure of the right way to do this. (One likely culprit: `__CUDA_ARCH__` is `900` for both `sm_90` and `sm_90a` targets, with `sm_90a` distinguished by `__CUDA_ARCH_FEAT_SM90_ALL` instead, so the `> 900` comparison never holds on Hopper.)

Kernel Credit: @jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw, https://github.com/malfet
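For illustration, here is a minimal shape-level sketch of inputs that would satisfy the rowwise selection conditions above. It assumes the private `torch._scaled_mm` entry point with keyword `scale_a`/`scale_b` arguments; that signature (and whether it returns a single tensor or an `(out, amax)` tuple) has varied across PyTorch versions, so treat this as a sketch rather than a pinned API.

```python
import torch

# Sketch only: rowwise scaling is selected when x's scale has length M
# and y's scale has length N, with both operands in TN layout.
M, K, N = 128, 256, 64

# TN format: x is a row-major [M, K] tensor; y is the transpose of a
# row-major [N, K] tensor, so it has shape [K, N] but its scales run
# along its faster-moving dimension.
x = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
y = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()

# 1-D fp32 scales: length M for x, length N for y. Scalar (0-D) scales
# would select the per-tensor path instead.
scale_x = torch.rand(M, device="cuda", dtype=torch.float32)
scale_y = torch.rand(N, device="cuda", dtype=torch.float32)

# Assumed keyword signature; some releases return (out, amax) instead
# of a single tensor.
out = torch._scaled_mm(x, y, scale_a=scale_x, scale_b=scale_y,
                       out_dtype=torch.bfloat16)
```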
This folder contains vendored copies of third-party libraries that we use.

- benchmark@0d98dba29d
- cpp-httplib@3b6597bba9
- cpuinfo@d6860c477c
- cudnn_frontend@b740542818
- cutlass@bbe579a9e3
- eigen@3147391d94
- fbgemm@dbc3157bf2
- flatbuffers@01834de25e
- fmt@e69e5f977d
- foxi@c278588e34
- FP16@4dfe081cf6
- FXdiv@b408327ac2
- gemmlowp
- gloo@5354032ea0
- googletest@e2239ee604
- ideep@55ca019168
- ittapi@5b8a7d7422
- kineto@be1317644c
- mimalloc@b66e3214d8
- miniz-2.1.0
- nccl
- nlohmann@87cda1d664
- NNPACK@c07e3a0400
- onnx@990217f043
- opentelemetry-cpp@a799f4aed9
- pocketfft@9d3ab05a7f
- protobuf@d1eca4e4b4
- psimd@072586a71b
- pthreadpool@4fe0e1e183
- pybind11@3e9dfa2866
- python-peachpy@f45429b087
- sleef@60e76d2bce
- tensorflow_cuda_bazel_build/cuda
- tensorpipe@52791a2fd2
- valgrind-headers
- VulkanMemoryAllocator@a6bfc23725
- XNNPACK@fcbf55af6c
- BUCK.oss
- BUILD
- build_bundled.py
- cpp-httplib.BUILD
- cuda.BUILD
- cudnn.BUILD
- cudnn_frontend.BUILD
- cutlass.BUILD
- eigen.BUILD
- fmt.BUILD
- foxi.BUILD
- generate-cpuinfo-wrappers.py
- generate-xnnpack-wrappers.py
- glog.buck.bzl
- gloo.BUILD
- ideep.BUILD
- kineto.buck.bzl
- kineto.BUILD
- LICENSES_BUNDLED.txt
- METADATA.bzl
- mkl-dnn.BUILD
- mkl.BUILD
- mkl_headers.BUILD
- onnx.BUILD
- opentelemetry-cpp.BUILD
- README.md
- sleef.BUILD
- sleef.bzl
- substitution.bzl
- tensorpipe.BUILD
- xnnpack.buck.bzl
- xnnpack_src_defs.bzl
- xnnpack_wrapper_defs.bzl
- xpu.txt