Commit graph

434 commits

Author SHA1 Message Date
Chen Fu
4a4488baae
Release buffers for prepacked tensors (#6820)
Unsolved problems:

1. One test failure was caused by a bug in Cudnn rnn kernels, when they can allocate a buffer and partially initialize it, the garbage data near tail of the buffer caused problem in some of the hardware. To attack this problem in a broader sense, should we add code in our allocators, and during a memory fuzzing test, fill an allocated buffer with garbage before returning to the caller?


2. Prepacking is used more widely than we know. For instance, Cudnn rnn kernels also cache their weights. They mix several weight tensors together into a single buffer, and never touch the original weight tensor anymore. This is the same idea with pre-pack, but they didn't override the virtual function, and they never tried to release those weight tensors, leading to memory waste. It also seems to me that there are some other kernels have similar behavior. Wonder how much memory we can save if we try to cleanup those too.

3. Turning off memory pattern planning does increase memory fragmentation, leading to out of memory error in some training test cases. Perhaps we can revisit the idea of pushing kernels-creation stage earlier, and then during initializer deserialization, we only avoid tracing those that will be prepacked.
2021-03-10 14:07:20 -08:00
Wei-Sheng Chin
bdaea1d9ae
Update baseline due to loss scale fix (#6948) 2021-03-10 09:46:15 +08:00
Wei-Sheng Chin
de6e66f3d4
Fix loss scaling when running ORTTrainer with BERT under mixed-precision mode (#6932)
* Fix missed Loss scale

* not to dump
2021-03-08 21:12:33 +08:00
Suffian Khan
7915b6709a
Revert Gather Grad optimization in PR 6381 targeted for Rocm (#6880)
* revert gather_grad_impl.cu

* put stream changes back in

* restrict changes to commenting launch of optimized version
2021-03-04 10:21:49 -08:00
Weixing Zhang
40fa40f3ce
Enable more unit tests for ROCM EP (#6776)
* enable more ops and unit tests for ROCM EP
2021-02-24 15:20:50 -08:00
jingyanwangms
c02ec38f8a
[Running CI now] Remove duplicate tests to speed up CI (#6768)
* remove tests to speed up CI

* add back _into_data_parallelism tests to see how long the CI test takes

* remove unnecessary save calls

* add back data_parallelism_full_precision_bart_path

* add data_parallelism_full_precision_path

* remove data parallelism tests

Co-authored-by: Jingyan Wang <jingywa@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-02-23 23:21:06 -08:00
Wei-Sheng Chin
9e67b88c83
Use local rank as GPU ID (#6719) 2021-02-17 22:42:54 +08:00
Edward Chen
b2cddc5337
Consolidate MLTypeCallDispatcher classes (#6651) 2021-02-12 13:26:56 -08:00
Suffian Khan
e6de0eb813
Add nightly pipeline for MI100 to run convergence and batch size test similar to V100. (#6611)
* Partial updating of ROCM reduction code.

* Update reduction_all.cu

* Add reduce template parameters.

* miopen common

* Reuse CUDA's reduction_functions.cc

* Reduction ops.

* Update remaining reduction ops to use MIOpen.  double datatype is not supported, so disable those typed kernels.

* Disable a couple more unsupported tests.

* Code formatting.

* Delete ROCM-specific reduction code that is identical to CUDA reduction code.

* Fix scratch buffer early free.

* Fix merge conflict.

* first attempt nightly amd ci pipeline

* try fix bad yaml file

* try again with corrected model directory

* add convergence test as well

* update reference loss for amd mi100

* include mi100 test results csv

* update the mi100  convergence test reference values

* update batch sizes for mi100 32g

* fix gpu sku for run_convergence_test.py

* undo unrelated changes to master

* pr comments

* pr comment

Co-authored-by: Jesse Benson <jesseb@microsoft.com>
2021-02-12 13:22:06 -08:00
Yufeng Li
1c3168c0f6
Skip constant folding dequantizelinear for quant qdq format (#6643)
* skip constant folding dequantizelinear for quant qdq format
2021-02-11 14:06:13 -08:00
Derek Murray
88d48063fa
Log warning when GetGradientForOp() silently fails. (#6586)
* Add warning when GetGradientForOp() silently fails.

In some cases, `GetGradientForOp()` can return without creating any nodes, which may lead to an invalid graph being created.
2021-02-10 10:01:16 -08:00
Wei-Sheng Chin
8972621138
Generate shape-independent graph if any input dimension < 2 (#6581)
* Throw for non-supported case

* Not to go to shape-dependent branch when seeing unsupported shapes
2021-02-10 15:44:25 +08:00
Cian Hayes
16eed68a1e
Fix layer_norm.cc on x86 (#6556)
* Fix LayerNromGrad on x86

* PR feedback
2021-02-08 17:36:14 -08:00
Jesse Benson
d18aa45b46 Enable more ROCM ops that are sharing CUDA code. Some are needed for Turing NLG models. 2021-02-06 14:40:34 -08:00
George Nash
b50b0a89aa Fix build failure when building with --build_wheel on Windows
This resolves issue #6536

Signed-off-by: George Nash <george.nash@intel.com>
2021-02-05 18:59:01 -08:00
Scott McKay
ccfd90291b
Remove condition from ORT_RETURN_IF[_NOT] macro output. (#6563)
Remove condition from ORT_RETURN_IF[_NOT] macro output as repeating the condition doesn't add much value compared to the explicit error message, and the error message includes the file and line anyway so it's easy enough to find the condition if needed.
Update the few places where the macros were used without an explicit error message to provide an explicit error message.

Saves 12.5KB in a minimal MinSizeRel build with all DNN ops, 16KB in full release build.
2021-02-05 17:33:29 -08:00
Weixing Zhang
299ace0759
Support to allow user to specify compute stream per session (#3723)
* Support to allow user to specify compute stream per session

Create computation cuda stream explicitly rather than use default legacy stream or per-thread default stream.

remove some redudant cudaStreamSynchronize

fix gpt2 model test failures

don't use default stream in nccl either.

add stream schronization in OnRunEnd()

using cub::DeviceScan::InclusiveSum which can be called with stream specified.

fix topK failure due to latest rebase

fix tensorrt

support user specified stream

add user_stream support in tensorrt EP

use same stream for both tensort and CUDA EP.

fix ScatterND

specify stream for adasum and p2p kernels.

fix loop

fix CApiTest.custom_op_handler

fix CApiTest.varied_input_custom_op_handler

change for cudaMemcpyFromSymbol

improve provider options for user specified compute stream

* add changes for ROCM EP

* fix GatherGrad UT for ROCM EP

* clean code and fix NonMaxSuppression

* use default stream for ROCM now

* fix CApiTest.custom_op_handler:OrtFormatCustomOpTests.ConvertOnnxModelToOrt

* fix tensorrt ut: CApiTest.io_binding_cuda

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-02-05 15:48:18 -08:00
Jesse Benson
a9e4d70b50 Fix merge conflict. 2021-02-04 15:00:05 -08:00
Jesse Benson
86ac11af1a Delete ROCM-specific reduction code that is identical to CUDA reduction code. 2021-02-04 15:00:05 -08:00
Jesse Benson
5d8792705b Code formatting. 2021-02-04 15:00:05 -08:00
Jesse Benson
0b147702af Update remaining reduction ops to use MIOpen. double datatype is not supported, so disable those typed kernels. 2021-02-04 15:00:05 -08:00
Jesse Benson
a28ddb85b6 Reduction ops. 2021-02-04 15:00:05 -08:00
Jesse Benson
554184bcc4 Add reduce template parameters. 2021-02-04 15:00:05 -08:00
Jesse Benson
c4b6559be9 Update reduction_all.cu 2021-02-04 15:00:05 -08:00
Changming Sun
aa31ba5774
Merge CPU packaging pipelines (#6480)
1. Merge Nuget CPU pipeline, Java CPU pipeline, C-API pipeline into a single one.
2. Enable compile warnings for cuda files(*.cu) on Windows.
3. Enable static code analyze for the Windows builds in these jobs. For example, this is our first time scanning the JNI code.
4. Fix some warnings in the training code.
5. Enable code sign for Java. Previously we forgot it.
6. Update TPN.txt to remove Jemalloc.
2021-02-04 08:38:56 -08:00
Derek Murray
14f7d56c81
Add optimized version of ConvGrad for pointwise convolutions. (#6531)
Co-authored-by: Tracy Sharpe <tracysh@microsoft.com>
2021-02-02 08:09:20 -08:00
Cian Hayes
6fc5237d9e
Introduce --enable_training_ops build flag (#6523)
* minimal_build with training ops

* Removing redundant comment from an earlier attempt at a fix

* Fixing a bad merge conflict resolution

* Responding to PR feedback

* tweaking the makefiles based on feedback

* combining two enable_training blocks in CMakeLists.txt
2021-02-01 21:54:16 -08:00
George Nash
a36f627a4c
Dnnl training (#6045)
* Add ReluGrad and ConvGrad ops for the dnnl provider

* the mnist sample is updated to add the --use_dnnl option that
will cause the sample to use the dnnl execution provider for
nodes that exist in dnnl provider.

* Added the ability to find forward ops. Dnnl backward gradient
ops require the forward primitive description and workspace
from the forward operation.

* Enable specifying the execution provider for Gradient Checker Tests

* Prevent memory leak when running dnnl_provider in training mode

Prevent creating a SubgraphPrimitivePool when the code is built with the
ENABLE_TRAINING build flag. Instead create a SubgraphPrimitive directly.

The SubgraphPrimitivePool was causing a pool of SubgraphPrimitives to be
stashed in a map for reuse. Due to the way the Training Loop uses threads
the pool of SubgraphPrimitives were not being reuse instead a new pool of
SubgraphPrimitives being created each run. The old pool was not instantly
freed. This behavior could be a language error when using thread_local
memory.

Signed-off-by: George Nash <george.nash@intel.com>

* Added fixes to maxpoolgrad and memory leak.

Maxpoolgrad will now pass all unit tests.
With the conv and convgrad disabled for dnnl, mnist is able to train till 95%

Signed-off-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>

* Fixed misc issues when testing training code with dnnl provider

* fix conv_grad dnnl tests with dilation to run dnnl execution provider

* update mnist training sample to accept convolution type models

  convolution models require the input shape to be {1, 28, 28}
  instead of the flat {728} image that is used for the gemm models

  this will enable models that require the different shape by adding
 `--model_type conv` to the command line when running the mnist sample.
 (while testing a workaround was used see #4762)

* Disable weight caching in dnnl conv operator when using training

  When training we can not use cached weights because the weight
  will be updated each run. This re-enables dnnl Conv and ConvGrad Ops.
  The weight caching was the source of the error from Conv when training.

* Fix issues found when building grad ops on Linux
  * The dnnl_convgrad code was over using the scope operator
    causing a compilation problem.
  * The dnnl_maxpoolgrad code had a logic error that is was
    comparing with the source description when it should have
    been comparing with the destination despription.

* Update BUILD.md so it shows DNNL for training
  * Updated the table of contents. Since the same providers
    are listed twice. Once for Infrance and again for Training
    an HTML anchor was added to distinguish the second header
    from the first for the TOC.

* Fix build failure when not using --enable-training build option

* reorganize the gradient operators so they are grouped together

* Fix issues found when running onnx_backend_test_series.py

* Pooling code only supports 2 outputs when built with --enable-training

* Address code review feedback
  * class member variables end in underscore_
  * use dst instead of dist to match pattern use elsewhere in DNNL code.

* Remove workaround that was introduced to handle problems running
  convolution based training models. See issue #4762

Signed-off-by: George Nash <george.nash@intel.com>

* Isolate training code and code cleanup

* Do not build if dnnl_gpu_runtime if enable_training is set training code
  does not support dnnl_gpu_runtime yet.
* Isolated Training code inside ifdefs so that they wont affect
  project if built without training enabled
* Inadvertant changes in whitespace were removed to make code review simpler
* Undid some code reordering that was not needed
* comments added to closing #endif statments to simplify reading complex ifdefs
* Modified the GetPrimitiveDesc functions to return shared_ptr instead of raw
  pointer. This matches what was done in Pool code and is safer memory code.

Signed-off-by: George Nash <george.nash@intel.com>

* Address code review issues

- whitespace changes caused by running clang-format on the code
- Several spelling errors fixed
- Removed/changed some ifdefs to improve readability
- other misc. changes in responce to code review.

Signed-off-by: George Nash <george.nash@intel.com>

* Code changes to address code review

- Simplify iteration code using `auto` keyword
- remove C style cast that was not needed
- remove instance variable that was not needed [relugrad.h]
- added the execution providers to `ComputeGradientErrorInternal()`
  and `ComputeTheoreticalJacobianTranspose()` instead of using
  a pointer to an instance varaible [gradient_checker.h/.cc]

Signed-off-by: George Nash <george.nash@intel.com>

* Combined the default gradient ops test and dnnl gradient ops test for ConvGrad and MaxPoolGrad into one function with the help of a helper function.
This will reduce repeated code.
Signed-off-by: Palangotu Keshava, Chethan's avatarChethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>

* Replaced the stack used by convgrad to vector so that the vector(used as stack) can be easily cleared everytime the graph is created.
This will prevent memory leak from convolution kernels being pushed constantly onto the stack.
Signed-off-by: chethan.palangotu.keshava@intel.com

* Code clean up and formating updates

 - Removed empty else statment
 - updated indentation of code that was causing double curly brackets to look unususal
 - Changed check for NumDimensions to Size in Relu and ReluGrad error checking code.
 - isolated training code

Signed-off-by: George Nash <george.nash@intel.com>

* Restore inadvertantly removed ConvGrad tests

When combining the DNNL and CPU version of the ConvGrad
tests two test were inadvertantly excluded.  This adds
back the Conv3d and Conv3d with strides test cases.

Signed-off-by: George Nash <george.nash@intel.com>

* Add validation to ConvGrad

This validates the dimensions of the ConvGrad match the
passed in Convolution forward primitive description.

The current code for DNNL ConvGrad makes the assumption that the ConvGrad
nodes will be visited in the reverse order from the corresponding Conv nodes

The added validation will return an error if this assumption is not true.

Signed-off-by: George Nash <george.nash@intel.com>

* Do not create new execution providers in provider_test_utils

This removes the code that generated new execution providers in the
OpTester::Run function. This was added because the std::move was
leaving the `entry` value empty so subsequent calls would cause a
segfault.

Problem is this potentially changed the execution_provider because it
would create the default provider dropping any custom arguments.

When the now removed code was originally added the std::move was causing
crashes when the GradientChecker unit tests were run.  However, it is no
longer causing problems even with the code removed.

Signed-off-by: George Nash <george.nash@intel.com>

* Change the forward conv stack to a forward conv map

This changes how the forward conv kernel is mapped to the bwd ConvGrad
kernel the problematic stack is no longer used.

The convolution stack made the assumption that the corresponding
ConvGrad operator would be visited in reverse order of the forward
Conv operators.  This was always problematic and was unlikely to
work for inception models.

Important changes:
- The weight_name is added to the ConvGrad dnnl_node making it
  possible to use the weight_name as a lookup key to find the
  Conv forward Kernel
- the `std::vector fwd_conv_stack_` has been replaced with a
  `std::map fwd_conv_kernel_map_`
- Although it is not needed lock_guards were added when writing
  to and reading from the fwd_conv_kernel_map_ as well as the
  fwd_kernel_map_. These should always be accessed by a single
  thread when preparing the dnnl subgraphs so the guard should not
  be needed but its added just in case.
- Updated the comments ConvGrad.h code to no longer mention the
  stack. The error check is not removed. It will be good to verify
  there are no errors as we continue to test against more models.

Signed-off-by: George Nash <george.nash@intel.com>

Co-authored-by: Chethan Palangotu Keshava <chethan.palangotu.keshava@intel.com>
Co-authored-by: unknown <63478620+jeyblu@users.noreply.github.com>
2021-01-29 16:05:58 -08:00
Weixing Zhang
7f5731741d
Optimize GatherGrad for AMD GPU (#6381)
* optimize gathergrad

* address comments

Co-authored-by: Weixing Zhang <wezhan@microsoft.com>
2021-01-29 13:50:08 -08:00
Suffian Khan
76bc0e479c
Enable dense sequence optimized version of Pytorch exported BERT-L on AMD GPU (#6504)
* Permit dense seq optimization on BERT-L pytorch export by enabling ReduceSumTraining, Equal, and NonZero on AMD

* enable Equal tests

* enable fast_matrix_reduction test case
2021-01-29 13:12:34 -08:00
liqunfu
00afd00059
merge e2e with distributed pipeline (#6443)
merge e2e with distributed pipeline
2021-01-28 14:17:47 -08:00
pengwa
fd43806252
fix max norm clipping test in python packaging pipeline test (#6468)
* fix python packaging pipeline

* make clip norm test compatabile with both V100 and M60 GPUs
2021-01-28 01:09:12 +08:00
Ryota Tomioka
9835b46a1d
Add an option to save the training graph after optimization (#6410)
* expose optimized_model_filepath in SessionOptions as `debug.graph_save_paths.model_with_training_graph_after_optimization_path` in `ORTTrainerOptions`
2021-01-27 07:39:46 +00:00
M. Zeeshan Siddiqui
f3a0344f9a
Farewell TrainableDropout (#5793)
* Deprecate TrainableDropout kernel.

* Update bert_toy_postprocessed.onnx to opset 12.

* Add more dropout tests.

* Fix BiasDropout kernel.

Co-authored-by: Ubuntu <OrtTrainingDev3@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sherlock Huang <bahuang@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: Sergii Dymchenko <sedymche@microsoft.com>
2021-01-25 16:37:42 -08:00
liqunfu
6ed12402a4
Liqun/liqun/enable pipeline parallel test2 (#6399)
* enable data and pipeline parallism test

Co-authored-by: liqun <liqun@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-01-25 15:15:26 -08:00
ashbhandare
60c772e2bc
Megatron checkpointing (#6293)
* Add bart fairseq run script

* Add frontend change to enable megatron

* Initial changes for checkpointing

* Megatron optim state loading, checkpoint aggregation, frontend distributed tests for H, D+H

* Add load_checkpoint changes

* Fix CI

* Cleanup

* Fix CI

* review comments

* review comments

* review comments:
2021-01-22 11:26:47 -08:00
pengwa
453431f7bb
Add max_norm for gradient clipping. (#6289)
* add max_norm as user option for gradient clipping

* add adam and lamb test cases for clip norm

* add frontend tests
2021-01-21 01:01:11 +08:00
wezuo
5b6753ce27
Wezuo/memory analysis (#5658)
* merged alloc_plan

* pass compilation

* Start running, incorrect allocation memory info

* add in comments

* fix a bug of recording pattern too early.

* debugging lifetime

* fix lifetime

* passed mnist

* in process of visualization

* Add code to generate chrome trace for allocations.

* in process of collecting fragmentation

* before rebuild

* passed mnist

* passed bert tiny

* fix the inplace reuse

* fix the exception of weight in pinned memory

* add guards to ensure the tensor is in AllocPlan

* add customized profiling

* debugging

* debugging

* fix the reuse of differnt location type

* add rank

* add the rank

* add fragmentation

* add time_step_trace

* Add summary for each execution step (total bytes, used/free bytes).

* add top k

* change type of top k parameter

* remove prints

* change heap to set{

* add the name pattern

* add the useage for pattern

* add partition

* change to static class

* add custom group

* remove const

* update memory_info

* in process of adding it as runtime config

* change the memory profiling to be an argument

* add some comments

* add checks to recored meomry_info in traaining session

* set the "local rank setting" to correct argument.

* addressing comments

* format adjustment

* formatting

* remove alloc_interval

* update memory_info.cc to skip session when there is no tensor for a particular memory type

* fix memory_info multiple iteration seg-fault

* consolidate mainz changes

* fixed some minor errors

* guard by ORT_MINIMAL_BUILD

* add ORT_MEMORY_PROFILE flag

* added compiler flag to turn on/off memory profiling related code

* clean up the code regarding comments

* add comments

* revoke the onnx version

* clean up the code to match master

* clean up the code to match master

* clean up the code to match master

Co-authored-by: Jesse Benson <benson.jesse@gmail.com>
Co-authored-by: Wei Zuo <wezuo@OrtTrainingDev3.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-mgtbby.eastus.cloudapp.azure.com>
Co-authored-by: wezuo <wezuo@az-eus-v100-32gb-5-worker-yclzsf.eastus.cloudapp.azure.com>
2021-01-19 08:30:55 -08:00
Wei-Sheng Chin
8ce252caa9
Pipeline Parallel Experimental Python API (#5815) 2021-01-15 12:07:28 +08:00
ashbhandare
fd21c84eb8
Enable graph save for orttrainer (#6333)
* Enable graph save for orttrainer

* Fix CI

* Update orttraining/orttraining/python/training/orttrainer_options.py

* Update orttraining/orttraining/python/training/orttrainer_options.py

* Update orttraining/orttraining/python/training/orttrainer_options.py

* Update orttraining/orttraining/python/training/orttrainer_options.py

* Update orttraining/orttraining/python/training/orttrainer_options.py

Co-authored-by: Thiago Crepaldi <thiago.crepaldi@microsoft.com>
2021-01-14 10:07:54 -08:00
Vincent Wang
4df356d1c9
Train BERT Using BFloat16 on A100 (#6090)
* traing bert using bf16

* Adam support bf16

* bugfix

* add fusedmatmul support

* fix after merge from master.

* bugfix

* bugfix after merge from master

* fast reduction for bf16.

* resolve comments

* fix win build

* bugfix

* change header file.

Co-authored-by: Vincent Wang <weicwang@microsoft.com>
2021-01-14 19:04:32 +08:00
Tixxx
d367941cc4
changed wording. (#6337) 2021-01-13 15:12:04 -08:00
Edward Chen
d761571afc
Deprecate Python global configuration functions [Part 2] (#6171)
Update Python API to allow more flexibility for setting providers and provider options.

The providers argument (InferenceSession/TrainingSession constructors, InferenceSession.set_providers()) now also accepts a tuple of (name, options dict).
Fix get_available_providers() API (and the corresponding function in the C API) to return the providers in default priority order. Now it can be used as a starting point for the providers argument and maintain the default priority order.
Convert some usages of the deprecated global configuration functions to use EP-specific options instead.

Update some EP-specific option parsing to fail on unknown options.

Other clean up.
2021-01-07 10:10:55 -08:00
Tang, Cheng
431604ef89
add bfloat16 to gathergrad type constrains (#6267)
Co-authored-by: Cheng Tang <chenta@microsoft.com>
2021-01-06 15:04:14 -08:00
pengwa
eea3806db1
model parallel refinement (#6244)
* Megatron Transformation as a seperate step

* remove useless header

* clang formating

* Re-Structure megatron transformer for subsquent changes

* fix  comments
2021-01-06 10:30:22 +08:00
ashbhandare
493bf931c5
Add the Concat Slice Elimination transform, fix constant_folding transform (#5457)
* Add concat slice transform + test

* Cosmetic improvements in concat slice transform

* Remove unrelated file, fix comment, fix constant folding bug

* Add test onnx graph

* fix windows build

* Review comments

* review comment

Co-authored-by: Aishwarya <aibhanda@OrtTrainingDev4.af05slrtruoetgaxwwjv5nsq5e.px.internal.cloudapp.net>
2021-01-04 16:18:33 -08:00
baijumeswani
93bf7c4d52
Documentation for distributed CI tests pipeline (#6140) 2021-01-04 10:09:39 -08:00
Suffian Khan
46e0e4e69f
Tune BiasGeluGradDx kernel in approximation mode to avoid tanh(...) on Rocm (#6239)
* bias gelu grad use exp(...) instead

* update cuda to rocm

* missing semicolon

* comment

* remove dockerfile

* missing factor of two
2021-01-02 08:54:16 -08:00
Jesse Benson
7ccdfed1a6 Remove most ROCm-specific element-wise code and reuse CUDA element-wise code. 2020-12-27 10:30:29 -08:00
Jesse Benson
52228a703c Use TArray in AMD element-wise kernels, rather than manually copying memory to device. 2020-12-27 10:30:29 -08:00