pytorch/third_party
drisspg 5dc9128229 FP8 rowwise scaling (#125204)
# Summary
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.
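
The two conditions above can be sketched as a plain shape check. This is only an illustration: `qualifies_for_rowwise_scaling` is a hypothetical helper name, not a function in PyTorch, and the real dispatch logic in the kernel may check more than this.

```python
from typing import Sequence


def qualifies_for_rowwise_scaling(
    x_shape: Sequence[int],
    y_shape: Sequence[int],
    x_scale_shape: Sequence[int],
    y_scale_shape: Sequence[int],
) -> bool:
    """Hypothetical check mirroring the two scale-shape conditions.

    x is [M, K], y is [K, N]; x's scale must be 1-D of length M and
    y's scale must be 1-D of length N.
    """
    if len(x_shape) != 2 or len(y_shape) != 2:
        return False
    m, k = x_shape
    k2, n = y_shape
    if k != k2:
        # Inner dimensions must agree for a matmul at all.
        return False
    return tuple(x_scale_shape) == (m,) and tuple(y_scale_shape) == (n,)
```

For example, `x` of shape `[16, 32]` and `y` of shape `[32, 8]` with scales of shape `[16]` and `[8]` pass the check, while scalar (0-dimensional) scales do not.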

This kernel is deliberately not called "rowwise, columnwise" scaling. Although the scales for `y` are semantically along its columns, this implementation only supports the TN format, so the scaling is along the faster-moving dimension, or the "row".
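
To make the scale semantics concrete, here is a minimal pure-Python reference of what one scale per row of `x` and one scale per column of `y` means for the output. It assumes the scales are applied multiplicatively to the accumulated product, which this message does not state; `rowwise_scaled_matmul_reference` is a hypothetical name, and the sketch says nothing about fp8 storage or the TN memory layout.

```python
from typing import List

Matrix = List[List[float]]


def rowwise_scaled_matmul_reference(
    x: Matrix, y: Matrix, x_scale: List[float], y_scale: List[float]
) -> Matrix:
    """Reference semantics only (assumed): out[i][j] = x_scale[i] * y_scale[j] * (x @ y)[i][j]."""
    m, k = len(x), len(x[0])
    n = len(y[0])
    assert len(y) == k, "inner dimensions must match"
    assert len(x_scale) == m, "x needs one scale per row (length M)"
    assert len(y_scale) == n, "y needs one scale per column (length N)"

    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Accumulate the unscaled dot product, then apply both scales once.
            acc = sum(x[i][p] * y[p][j] for p in range(k))
            out[i][j] = x_scale[i] * y_scale[j] * acc
    return out
```

Each output element picks up exactly one scale from its row of `x` and one from its column of `y`, which is why the two scale vectors have lengths `M` and `N`.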

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels for the `sm_90a` architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I had a hard time getting this to work, so I am not sure what the right way to do this is.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw, https://github.com/malfet
2024-06-05 15:46:40 +00:00
benchmark@0d98dba29d
cpp-httplib@3b6597bba9 [distributed] Add cpp-httplib to pytorch (#126470) 2024-05-17 19:45:08 +00:00
cpuinfo@d6860c477c
cudnn_frontend@b740542818 [BE][Ez]: Update cudnn_frontend submodule to v1.4.0 (#127175) 2024-05-29 14:23:38 +00:00
cutlass@bbe579a9e3
eigen@3147391d94
fbgemm@dbc3157bf2
flatbuffers@01834de25e
fmt@e69e5f977d
foxi@c278588e34
FP16@4dfe081cf6
FXdiv@b408327ac2
gemmlowp
gloo@5354032ea0
googletest@e2239ee604
ideep@55ca019168 [Reopen] Upgrade submodule oneDNN to v3.4.2 (#126137) 2024-05-16 12:00:16 +00:00
ittapi@5b8a7d7422
kineto@be1317644c update kineto submodule hash (#126780) 2024-05-27 18:11:48 +00:00
mimalloc@b66e3214d8
miniz-2.1.0 Reland add write_record_metadata to PyTorchFileWriter (#126087) 2024-05-14 21:48:44 +00:00
nccl Update NCCL submodule to v2.20.5 (#121635) 2024-03-11 17:23:59 +00:00
nlohmann@87cda1d664
NNPACK@c07e3a0400
onnx@990217f043 update submodule onnx==1.16.0 (#123125) 2024-04-02 20:41:22 +00:00
opentelemetry-cpp@a799f4aed9 [rfc] opentelemetry in pytorch (#122999) 2024-04-21 15:20:21 +00:00
pocketfft@9d3ab05a7f
protobuf@d1eca4e4b4
psimd@072586a71b
pthreadpool@4fe0e1e183
pybind11@3e9dfa2866 Upgrade submodule pybind to 2.12.0 (#122899) 2024-03-31 11:29:40 +00:00
python-peachpy@f45429b087
sleef@60e76d2bce Enable x86 CPU vectorization on windows [submodule sleef] (#118980) 2024-03-31 03:07:32 +00:00
tensorflow_cuda_bazel_build/cuda
tensorpipe@52791a2fd2
valgrind-headers
VulkanMemoryAllocator@a6bfc23725
XNNPACK@fcbf55af6c
BUCK.oss [fbcode] remove xcode_public_headers_symlinks (#125966) 2024-05-13 15:06:35 +00:00
BUILD
build_bundled.py [rfc] opentelemetry in pytorch (#122999) 2024-04-21 15:20:21 +00:00
cpp-httplib.BUILD Reapply "distributed debug handlers (#126601)" (#127805) 2024-06-04 19:44:30 +00:00
cuda.BUILD
cudnn.BUILD
cudnn_frontend.BUILD
cutlass.BUILD FP8 rowwise scaling (#125204) 2024-06-05 15:46:40 +00:00
eigen.BUILD
fmt.BUILD
foxi.BUILD
generate-cpuinfo-wrappers.py
generate-xnnpack-wrappers.py
glog.buck.bzl
gloo.BUILD
ideep.BUILD
kineto.buck.bzl
kineto.BUILD
LICENSES_BUNDLED.txt [rfc] opentelemetry in pytorch (#122999) 2024-04-21 15:20:21 +00:00
METADATA.bzl
mkl-dnn.BUILD [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051) 2024-05-31 01:20:45 +00:00
mkl.BUILD [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051) 2024-05-31 01:20:45 +00:00
mkl_headers.BUILD
onnx.BUILD
opentelemetry-cpp.BUILD [rfc] opentelemetry in pytorch (#122999) 2024-04-21 15:20:21 +00:00
README.md
sleef.BUILD Enable x86 CPU vectorization on windows [submodule sleef] (#118980) 2024-03-31 03:07:32 +00:00
sleef.bzl
substitution.bzl
tensorpipe.BUILD
xnnpack.buck.bzl
xnnpack_src_defs.bzl
xnnpack_wrapper_defs.bzl
xpu.txt Update torch-xpu-ops pin (ATen XPU implementation) (#127879) 2024-06-05 02:13:46 +00:00

This folder contains vendored copies of third-party libraries that we use.