Add CPU kernels for DynamicTimeWarping and UnfoldTensor. (#22033)

### Description
Add CPU kernels for DynamicTimeWarping and UnfoldTensor.
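
For reviewers who want to check the two kernels against something small, here is a minimal standalone sketch of the semantics described in the operator docs. It is not the ONNX Runtime code: `Path`, `DtwPath`, and `Unfold` are hypothetical helper names, inputs are plain row-major `std::vector<float>`, and the sketch assumes `rows > 0`, `cols > 0`, `step > 0`, and `size <= dim_size`. Ties in the DTW recurrence fall through to the left neighbor, which is how the kernel in this change appears to break them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical result type: row and column indices of the path,
// i.e. the two rows of the operator's [2, x] output.
struct Path {
  std::vector<int32_t> row_idx;
  std::vector<int32_t> col_idx;
};

// Lowest-cost path from (0, 0) to (rows-1, cols-1) through a row-major
// cost matrix, moving down, right, or diagonally at each step.
Path DtwPath(const std::vector<float>& cost_in, size_t rows, size_t cols) {
  const float inf = std::numeric_limits<float>::infinity();
  // Both tables are padded by one row and one column so that (0, 0) is a virtual start.
  std::vector<std::vector<float>> cost(rows + 1, std::vector<float>(cols + 1, inf));
  std::vector<std::vector<int8_t>> trace(rows + 1, std::vector<int8_t>(cols + 1, -1));
  cost[0][0] = 0.0f;
  for (size_t i = 1; i <= rows; ++i) {
    for (size_t j = 1; j <= cols; ++j) {
      const float c0 = cost[i - 1][j - 1];  // diagonal predecessor
      const float c1 = cost[i - 1][j];      // predecessor above
      const float c2 = cost[i][j - 1];      // predecessor to the left
      float best = c2;
      int8_t t = 2;
      if (c0 < c1 && c0 < c2) {
        best = c0;
        t = 0;
      } else if (c1 < c0 && c1 < c2) {
        best = c1;
        t = 1;
      }
      cost[i][j] = best + cost_in[(i - 1) * cols + (j - 1)];
      trace[i][j] = t;
    }
  }

  // Walk back from the bottom-right corner, then reverse into forward order.
  Path path;
  size_t i = rows, j = cols;
  while (i > 0 && j > 0) {
    path.row_idx.push_back(static_cast<int32_t>(i - 1));
    path.col_idx.push_back(static_cast<int32_t>(j - 1));
    const int8_t t = trace[i][j];
    if (t == 0) {
      --i;
      --j;
    } else if (t == 1) {
      --i;
    } else {
      --j;
    }
  }
  std::reverse(path.row_idx.begin(), path.row_idx.end());
  std::reverse(path.col_idx.begin(), path.col_idx.end());
  return path;
}

// Unfold a row-major tensor viewed as [leading, dim_size, trailing] along the
// middle axis. The result is row-major [leading, windows, trailing, size],
// where windows = (dim_size - size) / step + 1.
std::vector<float> Unfold(const std::vector<float>& in, int64_t leading, int64_t dim_size,
                          int64_t trailing, int64_t size, int64_t step) {
  const int64_t windows = (dim_size - size) / step + 1;
  std::vector<float> out(static_cast<size_t>(leading * windows * trailing * size));
  size_t o = 0;
  for (int64_t l = 0; l < leading; ++l)
    for (int64_t w = 0; w < windows; ++w)
      for (int64_t t = 0; t < trailing; ++t)
        for (int64_t s = 0; s < size; ++s)
          out[o++] = in[static_cast<size_t>((l * dim_size + w * step + s) * trailing + t)];
  return out;
}
```

For example, `DtwPath({1, 5, 5, 1}, 2, 2)` follows the diagonal and yields rows `{0, 1}` and columns `{0, 1}`, while `Unfold({1, 2, 3, 4, 5}, 1, 5, 1, 3, 2)` yields the two windows `{1, 2, 3}` and `{3, 4, 5}` laid out back to back.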
2026-05-14 20:48:00 +00:00 · 2024-10-12 01:44:18 +09:00 · 2024-10-12 01:44:18 +09:00 · 3c80aa9fee
commit 3c80aa9fee
parent f1f3d94e2d
10 changed files with 312 additions and 20 deletions
--- a/docs/ContribOperators.md
+++ b/docs/ContribOperators.md
@ -1545,7 +1545,7 @@ This version of the operator has been available since version 1 of the 'com.micr

 ### <a name="com.microsoft.DynamicTimeWarping"></a><a name="com.microsoft.dynamictimewarping">**com.microsoft.DynamicTimeWarping**</a>

-  Input is cost matrix where each value in input[r][c] is the cost for pass the point (r, c). From current point(r, c),  points (r+1, c), (r+1, c+1) or (r, c+1) could be arrived in next move. Given such cost matrix, return dynamic time wrapping of shape [2, x], where the path made by all points (output[0][t], output[1][t])have the lowest cost among all paths from (0, 0) to (M-1, N-1).
+  Input is cost matrix where each value in input[r][c] is the cost for pass the point (r, c). From current point(r, c),  points (r+1, c), (r+1, c+1) or (r, c+1) could be arrived in next move. Given such cost matrix, return dynamic time warping of shape [2, x], where the path made by all points (output[0][t], output[1][t])have the lowest cost among all paths from (0, 0) to (M-1, N-1).

 #### Version

@ -5974,7 +5974,7 @@ This version of the operator has been available since version 1 of the 'com.micr

 ### <a name="com.microsoft.UnfoldTensor"></a><a name="com.microsoft.unfoldtensor">**com.microsoft.UnfoldTensor**</a>

-  Returns a tensor which contains all slices of size size from input tensor in the dimension dim. Step between two slices is given by step. If sizedim is the size of dimension dim for input tensor, the size of dimension dim in the returned tensor will be (sizedim - size) / step + 1. An additional dimension of size size is appended in the returned tensor.
+  Returns a tensor which contains all slices of size `size` from input tensor in the dimension `dim`. Step between two slices is given by `step`. If `sizedim` is the size of dimension `dim` for input tensor, the size of dimension `dim` in the returned tensor will be `(sizedim - size) / step + 1`. An additional dimension of size `size` is appended in the returned tensor.

 #### Version

--- a/docs/OperatorKernels.md
+++ b/docs/OperatorKernels.md
@ -471,6 +471,7 @@ Do not modify directly.*
 |DequantizeLinear|*in* x:**T1**<br> *in* x_scale:**T2**<br> *in* x_zero_point:**T1**<br> *out* y:**T2**|1+|**T1** = tensor(int16), tensor(int32), tensor(int4), tensor(int8), tensor(uint16), tensor(uint4), tensor(uint8)<br/> **T2** = tensor(float)|
 |DynamicQuantizeLSTM|*in* X:**T**<br> *in* W:**T2**<br> *in* R:**T2**<br> *in* B:**T**<br> *in* sequence_lens:**T1**<br> *in* initial_h:**T**<br> *in* initial_c:**T**<br> *in* P:**T**<br> *in* W_scale:**T**<br> *in* W_zero_point:**T2**<br> *in* R_scale:**T**<br> *in* R_zero_point:**T2**<br> *out* Y:**T**<br> *out* Y_h:**T**<br> *out* Y_c:**T**|1+|**T** = tensor(float)<br/> **T1** = tensor(int32)<br/> **T2** = tensor(int8), tensor(uint8)|
 |DynamicQuantizeMatMul|*in* A:**T1**<br> *in* B:**T2**<br> *in* b_scale:**T1**<br> *in* b_zero_point:**T2**<br> *in* bias:**T1**<br> *out* Y:**T1**|1+|**T1** = tensor(float)<br/> **T2** = tensor(int8), tensor(uint8)|
+|DynamicTimeWarping|*in* input:**F**<br> *out* output:**I**|1+|**F** = tensor(float)<br/> **I** = tensor(int32)|
 |EmbedLayerNormalization|*in* input_ids:**T1**<br> *in* segment_ids:**T1**<br> *in* word_embedding:**T**<br> *in* position_embedding:**T**<br> *in* segment_embedding:**T**<br> *in* gamma:**T**<br> *in* beta:**T**<br> *in* mask:**T1**<br> *in* position_ids:**T1**<br> *out* output:**T**<br> *out* mask_index:**T1**<br> *out* embedding_sum:**T**|1+|**T** = tensor(float)|
 |ExpandDims|*in* X:**T**<br> *in* axis:**tensor(int32)**<br> *out* Y:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)<br/> **axis** = tensor(int32)|
 |FastGelu|*in* X:**T**<br> *in* bias:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
@ -518,6 +519,7 @@ Do not modify directly.*
 |Tokenizer|*in* X:**T**<br> *out* Y:**T**|1+|**T** = tensor(string)|
 |TransposeMatMul|*in* A:**T**<br> *in* B:**T**<br> *out* Y:**T**|1+|**T** = tensor(float)|
 |Trilu|*in* X:**T**<br> *in* k:**tensor(int64)**<br> *out* Y:**T**|1+|**T** = tensor(double), tensor(float), tensor(int64)|
+|UnfoldTensor|*in* input:**T**<br> *out* output:**T**|1+|**T** = tensor(bfloat16), tensor(bool), tensor(double), tensor(float), tensor(float16), tensor(int16), tensor(int32), tensor(int64), tensor(int8), tensor(string), tensor(uint16), tensor(uint32), tensor(uint64), tensor(uint8)|
 |Unique|*in* x:**T**<br> *out* y:**T**<br> *out* idx:**tensor(int64)**<br> *out* counts:**tensor(int64)**|1+|**T** = tensor(float)|
 |WhisperBeamSearch|*in* input_ids:**F**<br> *in* max_length:**I**<br> *in* min_length:**I**<br> *in* num_beams:**I**<br> *in* num_return_sequences:**I**<br> *in* length_penalty:**T**<br> *in* repetition_penalty:**T**<br> *in* vocab_mask:**M**<br> *in* prefix_vocab_mask:**M**<br> *in* attention_mask:**I**<br> *in* decoder_input_ids:**I**<br> *in* logits_processor:**I**<br> *in* cross_qk_layer_head:**I**<br> *in* extra_decoding_ids:**I**<br> *in* temperature:**T**<br> *out* sequences:**I**<br> *out* sequences_scores:**T**<br> *out* scores:**T**<br> *out* cross_qk:**V**<br> *out* non_speech_probs:**T**|1+|**T** = tensor(float)|
 |WordConvEmbedding|*in* Sequence:**T**<br> *in* W:**T1**<br> *in* B:**T1**<br> *in* C:**T1**<br> *out* Y:**T1**|1+|**T** = tensor(int32)<br/> **T1** = tensor(float)|
--- a/onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
+++ b/onnxruntime/contrib_ops/cpu/cpu_contrib_kernels.cc
@ -148,6 +148,8 @@ class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1,
 class ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MLFloat16, SkipSimplifiedLayerNormalization);
 class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Inverse);
 class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Trilu);
+class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, UnfoldTensor);
+class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, DynamicTimeWarping);

 #ifdef ENABLE_ATEN
 class ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kPytorchAtenDomain, 1, ATen);
@ -358,6 +360,8 @@ Status RegisterCpuContribKernels(KernelRegistry& kernel_registry) {
      BuildKernelCreateInfo<ONNX_OPERATOR_TYPED_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, MLFloat16, SkipSimplifiedLayerNormalization)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Inverse)>,
      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, Trilu)>,
+      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, UnfoldTensor)>,
+      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kMSDomain, 1, DynamicTimeWarping)>,

 #ifdef ENABLE_ATEN
      BuildKernelCreateInfo<ONNX_OPERATOR_KERNEL_CLASS_NAME(kCpuExecutionProvider, kPytorchAtenDomain, 1, ATen)>,
--- a/onnxruntime/contrib_ops/cpu/tensor/dynamic_time_warping.cc
+++ b/onnxruntime/contrib_ops/cpu/tensor/dynamic_time_warping.cc
@ -0,0 +1,101 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#include "contrib_ops/cpu/tensor/dynamic_time_warping.h"
+#include "core/providers/cpu/tensor/utils.h"
+
+#include <vector>
+#include <limits>
+
+using namespace onnxruntime::common;
+
+namespace onnxruntime {
+namespace contrib {
+
+ONNX_OPERATOR_KERNEL_EX(
+    DynamicTimeWarping,
+    kMSDomain,
+    1,
+    kCpuExecutionProvider,
+    (*KernelDefBuilder::Create())
+        .TypeConstraint("F", DataTypeImpl::GetTensorType<float>())
+        .TypeConstraint("I", DataTypeImpl::GetTensorType<int32_t>()),
+    DynamicTimeWarping);
+
+Status DynamicTimeWarping::Compute(OpKernelContext* ctx) const {
+  const Tensor& input_tensor = *ctx->Input<Tensor>(0);
+  const auto& input_dims = input_tensor.Shape().GetDims();
+  int rank = SafeInt<int>(input_dims.size());
+  ORT_ENFORCE(rank == 2 || (rank == 3 && input_dims[0] == 1),
+              "Currently input rank must be 2, or (3 with first dim equal to 1), but got:", rank);
+
+  const size_t rows = SafeInt<size_t>(input_dims[rank == 3 ? 1 : 0]);
+  const size_t cols = SafeInt<size_t>(input_dims[rank == 3 ? 2 : 1]);
+
+  std::vector<std::vector<float>> cost(rows + 1, std::vector<float>(cols + 1, std::numeric_limits<float>::infinity()));
+  std::vector<std::vector<int8_t>> trace(rows + 1, std::vector<int8_t>(cols + 1, -1));
+  std::vector<std::vector<int32_t>> path_helper;
+
+  // Compute the cost and trace matrices
+  cost[0][0] = 0;
+  for (size_t j = 1; j <= cols; ++j) {
+    for (size_t i = 1; i <= rows; ++i) {
+      const float c0 = cost[i - 1][j - 1];
+      const float c1 = cost[i - 1][j];
+      const float c2 = cost[i][j - 1];
+
+      float cur_cost;
+      int8_t cur_trace;
+      if (c0 < c1 && c0 < c2) {
+        cur_cost = c0;
+        cur_trace = 0;
+      } else if (c1 < c0 && c1 < c2) {
+        cur_cost = c1;
+        cur_trace = 1;
+      } else {
+        cur_cost = c2;
+        cur_trace = 2;
+      }
+
+      cost[i][j] = cur_cost + input_tensor.Data<float>()[(i - 1) * cols + j - 1];
+      trace[i][j] = cur_trace;
+    }
+  }
+
+  // Back-tracing to find the optimal path
+  int i = static_cast<int>(rows);
+  int j = static_cast<int>(cols);
+  int result_len = 0;
+  while (i > 0 && j > 0) {
+    path_helper.push_back({i - 1, j - 1});
+    ++result_len;
+    int8_t cur_trace = trace[i][j];
+    switch (cur_trace) {
+      case 0:
+        --i;
+        --j;
+        break;
+      case 1:
+        --i;
+        break;
+      case 2:
+        --j;
+        break;
+      default:
+        ORT_THROW("Invalid trace value: ", cur_trace);
+    }
+  }
+
+  // Update the output tensor
+  Tensor* output_tensor = ctx->Output(0, TensorShape{2LL, SafeInt<int64_t>(result_len)});
+  auto* output_data = output_tensor->MutableData<int32_t>();
+  for (int k = 0; k < result_len; ++k) {
+    output_data[k] = path_helper[static_cast<size_t>(result_len) - k - 1][0];
+    output_data[k + result_len] = path_helper[static_cast<size_t>(result_len) - k - 1][1];
+  }
+
+  return Status::OK();
+}
+
+}  // namespace contrib
+}  // namespace onnxruntime
--- a/onnxruntime/contrib_ops/cpu/tensor/dynamic_time_warping.h
+++ b/onnxruntime/contrib_ops/cpu/tensor/dynamic_time_warping.h
@ -0,0 +1,24 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#pragma once
+#include "core/framework/op_kernel.h"
+#include <core/common/safeint.h>
+
+namespace onnxruntime {
+namespace contrib {
+
+using onnxruntime::OpKernelContext;
+using onnxruntime::OpKernelInfo;
+
+class DynamicTimeWarping : public OpKernel {
+ public:
+  DynamicTimeWarping(const OpKernelInfo& info) : OpKernel(info) {}
+
+  ~DynamicTimeWarping() = default;
+
+  Status Compute(OpKernelContext* context) const override;
+};
+
+}  // namespace contrib
+}  // namespace onnxruntime
--- a/onnxruntime/contrib_ops/cpu/tensor/unfold.cc
+++ b/onnxruntime/contrib_ops/cpu/tensor/unfold.cc
@ -0,0 +1,109 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#include "contrib_ops/cpu/tensor/unfold.h"
+#include "core/providers/cpu/tensor/utils.h"
+#include "core/providers/common.h"
+#include "core/platform/threadpool.h"
+
+#include <vector>
+#include <numeric>
+
+using namespace onnxruntime::common;
+
+namespace onnxruntime {
+namespace contrib {
+
+ONNX_OPERATOR_KERNEL_EX(
+    UnfoldTensor,
+    kMSDomain,
+    1,
+    kCpuExecutionProvider,
+    (*KernelDefBuilder::Create())
+        .TypeConstraint("T", DataTypeImpl::AllTensorTypes()),
+    UnfoldTensor);
+
+template <typename T>
+Status LaunchUnfoldTensor(const T* input,
+                          T* output,
+                          int64_t leading_dims_size,
+                          int64_t unfold_dim_size,
+                          int64_t tailing_dims_size,
+                          int64_t unfold_size,
+                          int64_t step_size,
+                          concurrency::ThreadPool* tp) {
+  int64_t unfold_dim_size_dst = (unfold_dim_size - unfold_size) / step_size + 1;
+  int64_t N = leading_dims_size * unfold_dim_size_dst * tailing_dims_size * unfold_size;
+
+  int64_t stride_leading_dst = unfold_size * tailing_dims_size * unfold_dim_size_dst;
+  int64_t stride_fold_dim_src = tailing_dims_size * step_size;
+  int64_t stride_leading_src = tailing_dims_size * unfold_dim_size;
+
+  static constexpr double cost = 1.0;
+  concurrency::ThreadPool::TryParallelFor(tp, static_cast<ptrdiff_t>(N), cost,
+                                          [&](std::ptrdiff_t begin, std::ptrdiff_t end) {
+                                            for (std::ptrdiff_t i = begin; i != end; ++i) {
+                                              const int64_t idx = static_cast<int64_t>(i);
+                                              const int64_t idx_leading = idx / stride_leading_dst;
+                                              int64_t n = idx % stride_leading_dst;
+                                              const int64_t stride_fold_dim_dst = tailing_dims_size * unfold_size;
+                                              const int64_t idx_fold = n / stride_fold_dim_dst;
+                                              n %= stride_fold_dim_dst;
+                                              const int64_t idx_tailing = n / unfold_size;
+                                              const int64_t idx_append = n % unfold_size;
+
+                                              int64_t idx_src = idx_leading * stride_leading_src +
+                                                                idx_fold * stride_fold_dim_src + idx_tailing +
+                                                                idx_append * tailing_dims_size;
+                                              output[idx] = input[idx_src];
+                                            }
+                                          });
+
+  return Status::OK();
+}
+
+Status UnfoldTensor::Compute(OpKernelContext* ctx) const {
+  const Tensor& input_tensor = *ctx->Input<Tensor>(0);
+  const auto& input_dims = input_tensor.Shape().GetDims();
+  int rank = SafeInt<int>(input_dims.size());
+
+  int dim = SafeInt<int>(HandleNegativeAxis(dim_, rank));
+  ORT_ENFORCE(dim < rank, "input rank:", rank, " is not bigger than attribute specified dim: ", dim);
+  ORT_ENFORCE(input_dims[dim] >= size_, "dimsize:", input_dims[dim], " is less than unfold size:", size_);
+
+  int64_t leading_dims = std::accumulate(input_dims.begin(), input_dims.begin() + static_cast<size_t>(dim),
+                                         1LL, std::multiplies<int64_t>());
+  int64_t tailing_dims = std::accumulate(input_dims.begin() + (static_cast<size_t>(dim) + 1),
+                                         input_dims.end(), 1LL, std::multiplies<int64_t>());
+
+  std::vector<int64_t> output_dims(static_cast<size_t>(rank) + 1, 0);
+  std::copy(input_dims.begin(), input_dims.end(), output_dims.begin());
+  output_dims[dim] = (input_dims[dim] - size_) / step_ + 1;
+  output_dims.back() = size_;
+  TensorShape output_shape(output_dims);
+  Tensor* output_tensor = ctx->Output(0, output_shape);
+
+  auto* tp = ctx->GetOperatorThreadPool();
+
+  Status status;
+  if (input_tensor.IsDataType<float>()) {
+    status = LaunchUnfoldTensor<float>(input_tensor.Data<float>(), output_tensor->MutableData<float>(),
+                                       leading_dims, input_dims[dim], tailing_dims, size_, step_, tp);
+  } else if (input_tensor.IsDataType<double>()) {
+    status = LaunchUnfoldTensor<double>(input_tensor.Data<double>(), output_tensor->MutableData<double>(),
+                                        leading_dims, input_dims[dim], tailing_dims, size_, step_, tp);
+  } else if (input_tensor.IsDataType<int32_t>()) {
+    status = LaunchUnfoldTensor<int32_t>(input_tensor.Data<int32_t>(), output_tensor->MutableData<int32_t>(),
+                                         leading_dims, input_dims[dim], tailing_dims, size_, step_, tp);
+  } else if (input_tensor.IsDataType<int64_t>()) {
+    status = LaunchUnfoldTensor<int64_t>(input_tensor.Data<int64_t>(), output_tensor->MutableData<int64_t>(),
+                                         leading_dims, input_dims[dim], tailing_dims, size_, step_, tp);
+  } else {
+    return ORT_MAKE_STATUS(ONNXRUNTIME, INVALID_ARGUMENT, "Unsupported data type: ", input_tensor.DataType());
+  }
+
+  return status;
+}
+
+}  // namespace contrib
+}  // namespace onnxruntime
--- a/onnxruntime/contrib_ops/cpu/tensor/unfold.h
+++ b/onnxruntime/contrib_ops/cpu/tensor/unfold.h
@ -0,0 +1,47 @@
+// Copyright (c) Microsoft Corporation. All rights reserved.
+// Licensed under the MIT License.
+
+#pragma once
+#include "core/framework/op_kernel.h"
+#include <core/common/safeint.h>
+
+namespace onnxruntime {
+namespace contrib {
+
+using onnxruntime::OpKernelContext;
+using onnxruntime::OpKernelInfo;
+
+template <typename T>
+Status LaunchUnfoldTensor(
+    const T* input,
+    T* output,
+    int64_t leading_dims_size,
+    int64_t unfold_dim_size,
+    int64_t tailing_dims_size,
+    int64_t unfold_size,
+    int64_t step_size, concurrency::ThreadPool* tp);
+
+class UnfoldTensor final : public OpKernel {
+ public:
+  UnfoldTensor(const OpKernelInfo& info) : OpKernel(info) {
+    dim_ = SafeInt<int>(info.GetAttrOrDefault<int64_t>("dim", -1LL));
+    step_ = SafeInt<int>(info.GetAttrOrDefault<int64_t>("step", 1LL));
+    ORT_ENFORCE(step_ > 0, "step must be greater than zero!");
+
+    int64_t temp_size;
+    ORT_ENFORCE(info.GetAttr("size", &temp_size).IsOK());
+    size_ = SafeInt<int>(temp_size);
+  }
+
+  ~UnfoldTensor() = default;
+
+  Status Compute(OpKernelContext* context) const override;
+
+ private:
+  int dim_;
+  int size_;
+  int step_;
+};
+
+}  // namespace contrib
+}  // namespace onnxruntime
--- a/onnxruntime/core/graph/contrib_ops/contrib_defs.cc
+++ b/onnxruntime/core/graph/contrib_ops/contrib_defs.cc
@ -1067,11 +1067,11 @@ ONNX_MS_OPERATOR_SET_SCHEMA(GridSample, 1,
 ONNX_MS_OPERATOR_SET_SCHEMA(
    UnfoldTensor, 1,
    OpSchema()
-        .SetDoc("Returns a tensor which contains all slices of size size from input tensor in the dimension dim. "
-                "Step between two slices is given by step. "
-                "If sizedim is the size of dimension dim for input tensor, the size of dimension dim in "
-                "the returned tensor will be (sizedim - size) / step + 1. "
-                "An additional dimension of size size is appended in the returned tensor.")
+        .SetDoc("Returns a tensor which contains all slices of size `size` from input tensor in the dimension `dim`. "
+                "Step between two slices is given by `step`. "
+                "If `sizedim` is the size of dimension `dim` for input tensor, the size of dimension `dim` in "
+                "the returned tensor will be `(sizedim - size) / step + 1`. "
+                "An additional dimension of size `size` is appended in the returned tensor.")
        .Attr("dim", "specify the dimension to unfold", AttributeProto::INT, static_cast<int64_t>(-1))
        .Attr("size", "specify the size", AttributeProto::INT)
        .Attr("step", "specify the step.", AttributeProto::INT, static_cast<int64_t>(1))
@ -1122,7 +1122,7 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
    OpSchema()
        .SetDoc("Input is cost matrix where each value in input[r][c] is the cost for pass the point (r, c). From current point"
                "(r, c),  points (r+1, c), (r+1, c+1) or (r, c+1) could be arrived in next move. Given such cost matrix, return "
-                "dynamic time wrapping of shape [2, x], where the path made by all points (output[0][t], output[1][t])"
+                "dynamic time warping of shape [2, x], where the path made by all points (output[0][t], output[1][t])"
                "have the lowest cost among all paths from (0, 0) to (M-1, N-1).")
        .Input(0, "input", "Input cost tensor, it must be 2D tensor of shape M x N, or 1 x M x N", "F")
        .Output(0, "output", "Output tensor. shape is [2, x], where max(M, N) <= x < M + N", "I")
--- a/onnxruntime/test/contrib_ops/dynamic_time_warping_op_test.cc
+++ b/onnxruntime/test/contrib_ops/dynamic_time_warping_op_test.cc
@ -11,12 +11,12 @@ using namespace ONNX_NAMESPACE;
 namespace onnxruntime {
 namespace test {

+TEST(DynamicTimeWarping, simple) {
 #ifdef USE_CUDA
-
-TEST(DynamicTimeWarp, simple) {
  if (NeedSkipIfCudaArchLowerThan(530)) {
    return;
  }
+#endif

  std::vector<float> X = {
      3.0f,
@ -113,11 +113,12 @@ TEST(DynamicTimeWarp, simple) {
  tester.AddOutput<int32_t>("output", {2, 12}, Y);

  std::vector<std::unique_ptr<IExecutionProvider>> execution_providers;
+#ifdef USE_CUDA
  execution_providers.push_back(DefaultCudaExecutionProvider());
+#endif
+  execution_providers.push_back(DefaultCpuExecutionProvider());
  tester.Run(OpTester::ExpectResult::kExpectSuccess, "", {}, nullptr, &execution_providers);
 }

-#endif
-
 }  // namespace test
 }  // namespace onnxruntime
--- a/onnxruntime/test/contrib_ops/tensor_op_test.cc
+++ b/onnxruntime/test/contrib_ops/tensor_op_test.cc
@ -205,12 +205,12 @@ TEST(MVNContribOpTest, MeanVarianceNormalizationCPUTest_Version1_TO_8) {
  MeanVarianceNormalizationPerChannel(false, true);
 }

-#ifdef USE_CUDA
-
 TEST(UnfoldTensorOpTest, LastDim) {
+#ifdef USE_CUDA
  if (NeedSkipIfCudaArchLowerThan(530)) {
    return;
  }
+#endif

  std::vector<float> X = {
      1.0f, 2.0f, 3.0f, 4.0f,
@ -229,7 +229,10 @@ TEST(UnfoldTensorOpTest, LastDim) {
  tester.AddOutput<float>("output", {3, 2, 3}, output);

  std::vector<std::unique_ptr<IExecutionProvider>> execution_providers;
+#ifdef USE_CUDA
  execution_providers.push_back(DefaultCudaExecutionProvider());
+#endif
+  execution_providers.push_back(DefaultCpuExecutionProvider());
  tester.Run(OpTester::ExpectResult::kExpectSuccess, "", {}, nullptr, &execution_providers);
 }

@ -238,13 +241,13 @@ TEST(UnfoldTensorOpTest, NormalDim) {
    return;
  }

-  std::vector<int16_t> X = {
+  std::vector<int64_t> X = {
      1, 2, 3, 4, 2, 2, 3, 4, 3, 2, 3, 4,
      4, 6, 7, 8, 5, 6, 7, 8, 6, 6, 7, 8,
      6, 7, 8, 9, 7, 7, 8, 9, 8, 7, 8, 9,
      9, 7, 8, 9, 10, 7, 8, 9, 11, 7, 8, 9};

-  std::vector<int16_t> output = {
+  std::vector<int64_t> output = {
      1, 2, 3,
      2, 2, 2,
      3, 3, 3,
@ -269,15 +272,16 @@ TEST(UnfoldTensorOpTest, NormalDim) {
  tester.AddAttribute<int64_t>("dim", 1LL);
  tester.AddAttribute<int64_t>("size", 3LL);
  tester.AddAttribute<int64_t>("step", 2LL);
-  tester.AddInput<int16_t>("input", {2, 6, 4}, X);
-  tester.AddOutput<int16_t>("output", {2, 2, 4, 3}, output);
+  tester.AddInput<int64_t>("input", {2, 6, 4}, X);
+  tester.AddOutput<int64_t>("output", {2, 2, 4, 3}, output);

  std::vector<std::unique_ptr<IExecutionProvider>> execution_providers;
+#ifdef USE_CUDA
  execution_providers.push_back(DefaultCudaExecutionProvider());
+#endif
+  execution_providers.push_back(DefaultCpuExecutionProvider());
  tester.Run(OpTester::ExpectResult::kExpectSuccess, "", {}, nullptr, &execution_providers);
 }

-#endif
-
 }  // namespace test
 }  // namespace onnxruntime