### Description
Update the CUDA ArgMin/ArgMax op kernel registrations to have an end version of 11,
since opset 12+ is not supported yet.
With the way these kernels are currently registered, the documentation
shows support for opset 11+, which is not accurate.
### Motivation and Context
Fixes #13781
This PR Registers the following operators for opset 16 to the DML EP:
- LeakyRelu-16
- PRelu-16
- Where-16
- GreaterOrEqual-16
- LessOrEqual-16
Identity-16 was not added in this PR due to pipeline failures
Co-authored-by: Numfor Mbiziwo-Tiapo <numform@microsoft.com>
Implement reuse of the kv_cache past and present tensors in the Attention Ops.
Unit test for the above feature.
Utilize the kv_cache reuse for the past and present tensors in Greedy Search.
Correctness test for it.
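As a rough illustration of the idea (not the actual Op interface; shapes and names below are made up), reusing a preallocated cache means each decoding step writes the new key/value slice into shared storage instead of concatenating past and present into a freshly allocated tensor:

```python
import numpy as np

# Hypothetical greedy-decoding step that reuses one preallocated KV cache.
# The real Attention Op manages the past/present buffers internally.
batch, heads, max_len, head_dim = 1, 2, 8, 4
k_cache = np.zeros((batch, heads, max_len, head_dim), dtype=np.float32)
v_cache = np.zeros_like(k_cache)

def attention_step(q, new_k, new_v, step):
    # Write the new key/value into the shared cache instead of allocating
    # a fresh "present" tensor by concatenating past + new slices.
    k_cache[:, :, step] = new_k
    v_cache[:, :, step] = new_v
    k = k_cache[:, :, : step + 1]
    v = v_cache[:, :, : step + 1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(head_dim)
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v  # [batch, heads, 1, head_dim]
```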
Co-authored-by: Zhang Lei <phill.zhang@gmail.com>
### Description
This PR adds support for `float64` kernels in the latest versions of
operators: Floor, Ceil and IsNaN.
### Motivation and Context
The lack of these kernels is non-trivial to work around and easily leads
to performance losses when it is attempted. When equivalence with an
existing implementation is required, precision is easily lost when
casting to `float32` instead.
IsNaN is common when cleaning up data in an ML pipeline. Floor and Ceil
have uses for discretising values and single-precision floats are
insufficient to round well when values get larger than a few million.
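A quick numpy illustration of that precision point (not ORT code):

```python
import numpy as np

# float32 has a 24-bit significand, so integers above 2**24 (~16.7 million)
# are no longer exactly representable; rounding via a float32 cast can land
# on the wrong integer, while float64 stays exact in this range.
x = np.float64(16_777_217.5)            # just above 2**24
print(np.floor(x))                      # 16777217.0 (exact in float64)
print(np.floor(np.float32(x)))          # 16777218.0 (already rounded by the cast)
```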
According to my measurement this only increases the binary size by a few
kilobytes (on the Python wheel of RelWithDebInfo).
Closes #13673 (Round already has float64 support)
Partially solves #8791 (looks like there are parallel issues/PRs open for
Split, but it is also hard to work around and hence useful)
Signed-off-by: jbachurski <kbachurski@gmail.com>
Split copies data, so we can add support for all data types without too much binary size impact by using implementations based on the data type's size; the DispatchStridedCopy() function used here does this.
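A minimal Python sketch of the size-based idea (DispatchStridedCopy itself is C++; this only illustrates why one byte-copy routine per element size can serve every data type of that size):

```python
import numpy as np

def gather_by_itemsize(src: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Copy selected elements while only looking at the element size."""
    flat = np.ascontiguousarray(src).reshape(-1)
    k = flat.dtype.itemsize
    as_bytes = flat.view(np.uint8).reshape(-1, k)  # one row of k raw bytes per element
    gathered = as_bytes[indices]                   # dtype-agnostic byte copy
    return gathered.reshape(-1).view(src.dtype)    # reinterpret in the original dtype

x = np.array([10, 20, 30, 40], dtype=np.int64)
print(gather_by_itemsize(x, np.array([3, 1])))     # [40 20]
```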
### Description
Add the NonZero op for DML
### Motivation and Context
NonZero is used in a few transformer models, so having a DML
implementation will stop large tensors from being transferred to the CPU
and back to the GPU
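For reference, ONNX NonZero returns the indices of the non-zero elements as an int64 tensor of shape [rank, num_nonzero], equivalent to numpy's stacked `np.nonzero`:

```python
import numpy as np

x = np.array([[1, 0],
              [0, 3]], dtype=np.float32)
indices = np.array(np.nonzero(x), dtype=np.int64)
print(indices)  # [[0 1]
                #  [0 1]]  -> non-zero values at (0, 0) and (1, 1)
```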
### Description
Add mixed datatype support for DML's LayerNorm contrib op.
### Motivation and Context
The fusion logic removes casts around LayerNorm in the graph because the
contrib version of the op supports mixed data types. The data types of Scale,
Bias and Output must match, but the input's data type can be different.
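A small numpy sketch of the mixed-precision semantics (illustrative only, not the DML kernel): the input can be float16 while Scale, Bias and the Output share another type such as float32.

```python
import numpy as np

def layer_norm_mixed(x_fp16, scale_fp32, bias_fp32, eps=1e-5):
    x = x_fp16.astype(np.float32)                 # normalize in higher precision
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * scale_fp32 + bias_fp32

x = np.random.randn(2, 4).astype(np.float16)      # input dtype differs from the rest
y = layer_norm_mixed(x, np.ones(4, np.float32), np.zeros(4, np.float32))
print(y.dtype)                                     # float32, matching Scale/Bias
```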
### Description
### Motivation and Context
Fixes https://github.com/microsoft/onnxruntime/issues/13508
### Description
Add a DML registration for Shape to avoid copying back to the CPU just
to get the shape of a GPU tensor.
### Motivation and Context
When using free dimensions, many Transformer models make extensive use of
the `Shape` operator. This causes hundreds of GPU->CPU copies that should be
completely avoidable. Note that this change also uses the same
heuristics as other providers (e.g. CUDA) to force some tensors onto the
CPU in certain situations.
Co-authored-by: Patrice Vignola <pavignol@microsoft.com>
### Add guidelines for ORTModule
As title.
Feel free to let me know if I missed something.
### Motivation and Context
**Description**: Subgraph-level recompute
This PR adds an optional capability that trades additional re-computation
for better memory efficiency. Specifically, a pre-defined operator list is
used to iterate the graph and find subgraphs to recompute, reducing the
number of stashed activations whose lifetimes span the forward and
backward passes.
When training with ORTModule, by default, the graph transformer will
scan the execution graph to find all eligible subgraphs to recompute,
along with the sizes that can be saved. An example is shown below.
To enable recompute for some of them, we can define the environment
variable this way:
`export
ORTMODULE_ENABLE_MEMORY_ALLEVIATION="Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1:-1,BiasGelu+:1:-1,BitmaskDropout+Cast+:1:-1,FusedMatMul+:1:-1,Cast+:1:-1,Mul+Add+:1:-1,Mul+Sub+:1:-1"`
```
[1,0]<stderr>:2022-10-12 14:47:39.302954530 [W:onnxruntime:, memory_alleviation.cc:595 PrintSummary]
[1,0]<stderr>:MemoryAlleviation Summary:
[1,0]<stderr>: User config:
[1,0]<stderr>: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+:1,BiasGelu+:1,BitmaskDropout+Cast+:1,FusedMatMul+:1,Cast+:1,Mul+Add+:1,Mul+Sub+:1
[1,0]<stderr>: =================================
[1,0]<stderr>: Subgraph: BitmaskDropout+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 1024 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: BiasGelu+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Reshape+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:labels_dim0 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Unsqueeze+Unsqueeze+Cast+Sub+Mul+Mul+FusedMatMul+Cast+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:23
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+FusedMatMul+Cast+Unsqueeze+Unsqueeze+Cast+Sub+Mul+Add+BiasSoftmaxDropout+Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x input_ids_dim1 x Frequency:1
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Add+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+Cast+Add+Reshape+Cast+
[1,0]<stderr>: AlleviationType: Disabled
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 2 x 4 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Mul+Sub+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x 16 x input_ids_dim1 x 1 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: Cast+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:1024 x 1024 x Frequency:97
[1,0]<stderr>: PatternShape:3 x 1024 x Frequency:1
[1,0]<stderr>: PatternShape:8 x 64 x Frequency:24
[1,0]<stderr>: PatternShape:1024 x 4096 x Frequency:24
[1,0]<stderr>: PatternShape:4096 x Frequency:24
[1,0]<stderr>: PatternShape:4096 x 1024 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: Subgraph: FusedMatMul+
[1,0]<stderr>: AlleviationType: Recompute
[1,0]<stderr>: Patterns:
[1,0]<stderr>: PatternShape:input_ids_dim0 x input_ids_dim1 x 4096 x Frequency:24
[1,0]<stderr>: --------------------------------
[1,0]<stderr>: =================================
```
"Type config:" whether recompute is enabled by users. 0 - disable, 1-
enable.
"Subgraph" means what kind of subgraph will be recomputed, in this case,
it is a single node "Gelu", and it will be "Recompute".
"Shape && Frequency" means, for this recompute, one tensor of size
(batch size, 500) will be saved because it will be recomputed.
**Baseline**
On a 1P model (DEBERTA V2), sequence length 256, training with 16 A100
GPUs. With the latest main branch, we can run batch size 16, and the maximum
batch size is < 32, so 16 is usually chosen by data scientists. 65% of the
40GB memory is used during training, with SamplesPerSec=479.2543353561354.

**With this PR**
Gelu is recomputed to reduce the memory peak, so batch size 32 can be run.
97% of the 40GB A100 memory is used, and SamplesPerSec=562.041593991271 (**1.17X**
of baseline).

**Motivation and Context**
Some models use QuickGelu(x) = x * sigmoid(1.702 * x), which takes 3 Ops for
the forward pass and 5 Ops for the backward pass. This PR fuses these into a
single Op named QuickGelu and its gradient QuickGeluGrad.
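A numpy sketch of the fused math (the actual kernels are CUDA/C++; the gradient formula below follows from differentiating x * sigmoid(1.702 * x)):

```python
import numpy as np

ALPHA = 1.702

def quick_gelu(x):
    # Fused forward: x * sigmoid(alpha * x)
    return x / (1.0 + np.exp(-ALPHA * x))

def quick_gelu_grad(dy, x):
    # d/dx [x * s(ax)] = s(ax) * (1 + a * x * (1 - s(ax)))
    s = 1.0 / (1.0 + np.exp(-ALPHA * x))
    return dy * s * (1.0 + ALPHA * x * (1.0 - s))

# Quick finite-difference sanity check of the gradient.
x = np.random.randn(5)
eps = 1e-6
numeric = (quick_gelu(x + eps) - quick_gelu(x - eps)) / (2 * eps)
print(np.allclose(numeric, quick_gelu_grad(np.ones_like(x), x), atol=1e-4))  # True
```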
For CUDA, tested on V100 using an input tensor with shape [64,128,2048] and
float16 type:
Before, FW takes 335us, BW takes 614us

After, FW takes 115us, BW takes 139us, which is much faster.

For the CPU kernel, using the same shape and float type:
Before, FW takes 10ms, BW takes 49ms:
Mul: 3480[µs]
Sigmoid: 1996[µs]
Mul: 4789[µs]
Mul: 4642[µs]
Mul: 4195[µs]
SigmoidGrad: 18328[µs]
Mul: 2988[µs]
Sum: 18576[µs]
After, FW takes 4ms, BW takes 5ms, which is also much faster:
QuickGelu: 3939[µs]
QuickGeluGrad: 5089[µs]
Co-authored-by: Vincent Wang <weicwang@microsoft.com>
### Description
Fix the document generation CI. It's not currently updating the docs because
we're skipping the tests, and it is that invocation of build.py that would
have generated the documentation.
Set up a specific task to generate documentation, for greater clarity.
### Motivation and Context
Operator kernel documentation is not getting updated and is now out of
date.
### Description
Allow separated Q, K and V inputs to support cross attention:
* Q: [batch_size, sequence_length, hidden_size]
* K: [batch_size, kv_sequence_length, hidden_size]
* V: [batch_size, kv_sequence_length, v_hidden_size]
* Output: [batch_size, sequence_length, v_hidden_size]
To use separated Q/K/V inputs, the existing input tensor is used for the query,
and two optional inputs are added for the key and value. Weights for the input
projection are not included for now, so the MatMul of the input projection has
to be done outside the Attention operator, but the bias Add is included for
performance reasons.
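A numpy sketch of the cross-attention shapes described above (single-head and without the fused bias Add, so the names and simplifications here are illustrative only):

```python
import numpy as np

batch, seq, kv_seq, hidden, v_hidden = 2, 5, 7, 16, 8
q = np.random.randn(batch, seq, hidden).astype(np.float32)       # query
k = np.random.randn(batch, kv_seq, hidden).astype(np.float32)    # key
v = np.random.randn(batch, kv_seq, v_hidden).astype(np.float32)  # value

scores = q @ k.transpose(0, 2, 1) / np.sqrt(hidden)   # [batch, seq, kv_seq]
probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
out = probs @ v
print(out.shape)   # (2, 5, 8) = [batch_size, sequence_length, v_hidden_size]
```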
### Description
Bumping up version number to 1.14.0
### Motivation and Context
### Description
Fix a bug in GreedySearch Op when batch > 1
Support custom attention mask in GreedySearch and BeamSearch with GPT2
### Motivation and Context
### Description
Fix some typos in docs
### Motivation and Context
singed vs signed
succeding vs succeeding
fileter vs filter
kernal vs kernel
libary vs library
* Change the block dimension attribute type from Ints to Int.
* This is in response to feedback that the block dimension corresponds to the
reduction dimension of the consuming matrix multiplication, and there is
always only 1 reduction dimension.
### Description
binraries ==> binaries
### Motivation and Context
* `QuantizeBFP` and `DequantizeBFP` schemas - similar to
`QuantizeLinear` and `DequantizeLinear`.
* The BFP datatype is represented as a `uint8` tensor with shape and stride
metadata. This is preferable to adding a new datatype for BFP, which is
more disruptive and [discouraged by
PyTorch](https://discuss.pytorch.org/t/training-with-custom-quantized-datatype/152132/2).
Context:
The Microsoft Floating Point (BFP) datatype shares an exponent for every
n numbers, called a “bounding box.” Each number still has its own
mantissa and sign bits. BFP has been shown to incur 3-4x less cost
(energy and area) than its BFloat16 and INT8 counterparts without reductions
in accuracy on the ImageNet benchmark, as described in [Rouhani
2020](https://proceedings.neurips.cc/paper/2020/file/747e32ab0fea7fbd2ad9ec03daa3f840-Paper.pdf).
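A toy numpy illustration of the bounding-box idea (bit widths, rounding and layout here are arbitrary and do not correspond to any specific hardware BFP variant or to the proposed schemas):

```python
import numpy as np

def bfp_quantize(x, block=4, mantissa_bits=7):
    # Every block of `block` values shares one exponent; each value keeps
    # its own sign and mantissa (stored here as a small signed integer).
    x = x.reshape(-1, block)
    shared_exp = np.ceil(np.log2(np.abs(x).max(axis=1, keepdims=True) + 1e-38))
    scale = 2.0 ** (shared_exp - mantissa_bits)
    mantissa = np.round(x / scale).astype(np.int16)
    return mantissa, shared_exp

def bfp_dequantize(mantissa, shared_exp, mantissa_bits=7):
    return mantissa * 2.0 ** (shared_exp - mantissa_bits)

x = np.random.randn(8).astype(np.float32)
m, e = bfp_quantize(x)
print(np.abs(bfp_dequantize(m, e).ravel() - x).max())  # small reconstruction error
```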
Requirements:
* There are many variants of BFP (number of mantissa bits, number of
shared exponent bits, size of the bounding box, custom bit fields, etc.).
* The size and layout of a BFP variant vary across hardware.
* The bounding box can be over arbitrary dimensions; for example, over the
channel "C" dimension in an N x C x H x W tensor for convolution.
Goals of this PR:
* Add initial versions of QuantizeBFP and DequantizeBFP operators to
enable QDQ-style quantization with BFP. Once the schemas stabilize, we
can consider upstreaming to ONNX.
* Add some basic type and shape inferencing tests; tests that run on an
EP will be a follow-up.
# Motivation
Currently, ORT minimal builds use kernel def hashes to map from nodes to
kernels to execute when loading the model. As the kernel def hashes must
be known ahead of time, this works for statically registered kernels.
This works well for the CPU EP.
For this approach to work, the kernel def hashes must also be known at
ORT format model conversion time, which means the EP with statically
registered kernels must also be enabled then. This is not an issue for
the always-available CPU EP. However, we do not want to require that any
EP which statically registers kernels is always available too.
Consequently, we explore another approach to match nodes to kernels that
does not rely on kernel def hashes. An added benefit of this is the
possibility of moving away from kernel def hashes completely, which
would eliminate the maintenance burden of keeping the hashes stable.
# Approach
In a full build, ORT uses some information from the ONNX op schema to
match a node to a kernel. We want to avoid including the ONNX op schema
in a minimal build to reduce binary size. Essentially, we take the
necessary information from the ONNX op schema and make it available in a
minimal build.
We decouple the ONNX op schema from the kernel matching logic. The
kernel matching logic instead relies on per-op information which can
either be obtained from the ONNX op schema or another source.
This per-op information must be available in a minimal build when there
are no ONNX op schemas. We put it in the ORT format model.
Existing uses of kernel def hashes to look up kernels are replaced
with the updated kernel matching logic. We no longer store
kernel def hashes in the ORT format model’s session state and runtime
optimization representations. We no longer keep the logic to
generate and ensure stability of kernel def hashes.
* Fix bug in pybind get_all_operator_schema due to premature reference dropping
* Add updated operator kernels markdown table
* Update build.py to include documentation generation for DML operators too
* Update GPU pipeline to include DML in the build so its operator docs can be generated.
* Use a separate pipeline stage, per feedback from Changming and Scott
* Appease annoying Python linter
* Add onnxruntime_BUILD_UNIT_TESTS=OFF and remove stale --use_dml in cuda stage
* Make ORT a PyTorch JIT backend
LORT likely doesn't work with ATen fallback, so we only test LORT in its own CI.
* Revert changes to enable external CUDA allocator. Will add it later.
Revert "Revert changes to enable external CUDA allocator. Will add it later."
This reverts commit d5487f2e193014c805505afae8fb577c53667658.
Fix external allocator
* Relax tolerance and remove commented code
* Print more information in CI
* Fix pointer
* Address comments.
1. Reuse ORT-eager mode's environment.
2. Remove unused ctor.
* Use Pytorch master branch as all PRs are merged
Fix
* Refine based on cpplint feedbacks
* Revert changes to allow custom CUDA allocator in public APIs
* Use torch.testing.assert_close
* Use unittest framework
* Switch docker repo
* Rename *.cpp to *.cc
* Address comments
* Add comment
* Use same pipeline file for eager and lort pipelines
* Address comments
* Add yaml comment
* Fix cmake files
* Address comments
* Rename flags, remove printing code, remove dead comment
* Update to handle multiline declarations for the kernels which are typical these days.
* Update to new path for the cpu contrib_op kernel registrations.
* Update tools/python/find_optimizer_opset_version_updates_required.py
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>