mirror of
https://github.com/saymrwulf/pytorch.git
synced 2026-05-14 20:57:59 +00:00
This enables the inductor micro benchmark on CPU (x86):

* Running on an AWS metal runner for a more accurate benchmark
* Added a new `arch` column, which will be either x86_64 or arm64 for CPU, or the GPU name for GPU. We can use this later to differentiate between different setups, i.e. cuda (a100) vs cuda (a10g), or cpu (x86_64) vs cpu (arm64)

The next step would be to run this on cpu arm64 and cuda (a10g).

### Testing

Here are the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180

```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042
Approved by: https://github.com/yanboliang
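A downstream consumer of this CSV would typically group rows by the new `arch` column to compare setups. As a minimal sketch (the `summarize` helper is my own illustration, not part of the benchmark harness; the inline sample rows are copied from the results above):

```python
# Sketch: parse the benchmark CSV emitted by the micro benchmark run and
# compute an actual/target ratio per row, keeping the arch column so that
# results from different setups (e.g. x86_64 vs arm64) can be compared.
import csv
import io

SAMPLE = """\
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
"""

def summarize(text):
    """Return one dict per row with the actual/target ratio attached."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "key": (row["name"], row["metric"], row["dtype"]),
            "arch": row["arch"],
            "ratio": float(row["actual"]) / float(row["target"]),
            "is_model": row["is_model"] == "True",
        })
    return rows

results = summarize(SAMPLE)
```

With this shape, adding cpu (arm64) or cuda (a10g) runs later only adds rows with a different `arch` value; the keys stay comparable across setups.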
31 lines
652 B
YAML
tracking_issue: 24422
ciflow_tracking_issue: 64124
ciflow_push_tags:
- ciflow/binaries
- ciflow/binaries_conda
- ciflow/binaries_libtorch
- ciflow/binaries_wheel
- ciflow/inductor
- ciflow/inductor-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-cu124
- ciflow/linux-aarch64
- ciflow/mps
- ciflow/nightly
- ciflow/periodic
- ciflow/rocm
- ciflow/slow
- ciflow/trunk
- ciflow/unstable
- ciflow/xpu
- ciflow/torchbench
retryable_workflows:
- pull
- trunk
- linux-binary
- windows-binary
labeler_config: labeler.yml
label_to_label_config: label_to_label.yml
mergebot: True
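This probot config is flat enough (scalar keys plus string lists) that its structure can be illustrated without a full YAML library. A minimal stdlib-only sketch, assuming this restricted shape (the `parse_flat_yaml` helper and the trimmed `CONFIG` sample are hypothetical, not part of the repo; a real consumer would use PyYAML):

```python
# Sketch: read a flat "key: value" / "- item" config like the one above.
# Only handles the two constructs this file actually uses; not general YAML.
CONFIG = """\
tracking_issue: 24422
ciflow_push_tags:
- ciflow/trunk
- ciflow/inductor-micro-benchmark-cpu-x86
mergebot: True
"""

def parse_flat_yaml(text):
    data, current = {}, None
    for line in text.splitlines():
        if line.startswith("- "):
            # List item: append to the most recent key's list.
            data[current].append(line[2:].strip())
        elif ":" in line:
            key, _, value = line.partition(":")
            current = key.strip()
            # A key with no inline value starts a list.
            data[current] = value.strip() if value.strip() else []
    return data

cfg = parse_flat_yaml(CONFIG)
```

The `ciflow_push_tags` list is the piece this PR touches: `ciflow/inductor-micro-benchmark-cpu-x86` is the new tag that lets the CPU (x86) micro benchmark workflow be triggered.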