New hat/docs/Profiling/opencl-intercept-layer.md

 1 # Using OpenCL Intercept Layer for HAT
 2 [Back to Index ../](../index.md)
 3 
 4 The [OpenCL Intercept Layer](https://github.com/intel/opencl-intercept-layer) is a tool that intercepts OpenCL calls
 5 for debugging and performance analysis. It works across multiple OpenCL platforms, including Intel, NVIDIA, and Apple (macOS).
 6 
 7 ## How to install OpenCL Intercept Layer?
 8 
 9 ```bash
10 git clone https://github.com/intel/opencl-intercept-layer.git
11 cd opencl-intercept-layer
12 mkdir build
13 cd build
 14 # Enabling cliprof is optional; we mainly use cliloader. Configure, then build.
 15 cmake .. -DENABLE_CLIPROF=1 && cmake --build .
16 ```
17 
 18 Then add the `opencl-intercept-layer/build/cliloader` directory to your `PATH`.
19 
20 ```bash
21 export PATH=/path/to/opencl-intercept-layer/build/cliloader:$PATH
22 ```
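
Before profiling, it can help to confirm that the shell actually resolves `cliloader` from the new `PATH` entry. The following is a minimal sketch; the helper name `check_on_path` is hypothetical and not part of the OpenCL Intercept Layer.

```bash
# Report whether a command can be found on PATH.
# Prints the resolved location on success and a hint on stderr otherwise.
check_on_path() {
  local tool="$1"
  local location
  if location="$(command -v "$tool")"; then
    echo "$tool found at $location"
  else
    echo "$tool not found on PATH; re-check the export above" >&2
    return 1
  fi
}
```

Calling `check_on_path cliloader` should print the path under `build/cliloader` if the export took effect in the current shell.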
23 
24 ## How to use with HAT
25 
26 ```bash
27 cliloader \
28   -d -h \
29   java @.ffi-opencl-example tensors.Main --iterations=10 --verbose
30 ```
31 
 32 Example output (`-d` reports device timing and `-h` reports host API timing):
33 
 34 ```text
35 Host Performance Timing Results:
36 
37 Total Time (ns): 374760223
38 
39                           Function Name,  Calls,     Time (ns), Time (%),  Average (ns),      Min (ns),      Max (ns)
40                (device timing overhead),     60,         57423,    0.02%,           957,             0,          4459
41                         iclBuildProgram,      3,      70517666,   18.82%,      23505888,        677208,      61181833
42                         iclCreateBuffer,     10,         20667,    0.01%,          2066,           375,          5666
43                   iclCreateCommandQueue,      1,         45291,    0.01%,         45291,         45291,         45291
44                        iclCreateContext,      1,        448625,    0.12%,        448625,        448625,        448625
45                         iclCreateKernel,      3,     133957582,   35.74%,      44652527,        292833,     133316541
46              iclCreateProgramWithSource,      3,         43709,    0.01%,         14569,         12667,         16208
47            iclEnqueueMarkerWithWaitList,    120,        300161,    0.08%,          2501,           125,         11166
48  iclEnqueueNDRangeKernel( mxmNaiveF16 ),     10,          9917,    0.00%,           991,           542,          1750
49  iclEnqueueNDRangeKernel( mxmNaiveF32 ),     10,         13252,    0.00%,          1325,           500,          2583
50 iclEnqueueNDRangeKernel( mxmTensorsCM ),     10,          9624,    0.00%,           962,           583,          1625
51                    iclEnqueueReadBuffer,     30,        163125,    0.04%,          5437,          3834,          9667
52                   iclEnqueueWriteBuffer,     90,        683671,    0.18%,          7596,           542,         70291
53                         iclGetDeviceIDs,      2,      24621167,    6.57%,      12310583,           250,      24620917
54                        iclGetDeviceInfo,    660,         38704,    0.01%,            58,             0,           875
55                       iclGetPlatformIDs,      2,            83,    0.00%,            41,            41,            42
56                      iclGetPlatformInfo,    180,        338296,    0.09%,          1879,             0,        330709
57                  iclGetProgramBuildInfo,      9,          5959,    0.00%,           662,            42,          2333
58                         iclReleaseEvent,    270,         45167,    0.01%,           167,            41,          1125
59                         iclSetKernelArg,    150,         25224,    0.01%,           168,            41,           750
60                        iclWaitForEvents,     60,     143414910,   38.27%,       2390248,         15166,      20305042
61 
62 Device Performance Timing Results for Apple M4 Max (40CUs, 1000MHz):
63 
64 Total Time (ns): 3174486
65 
66                    Function Name,  Calls,     Time (ns), Time (%),  Average (ns),      Min (ns),      Max (ns)
67             iclEnqueueReadBuffer,     30,         48206,    1.52%,          1606,           758,          6029
68            iclEnqueueWriteBuffer,     90,         90860,    2.86%,          1009,            53,          9861
69                      mxmNaiveF16,     10,        906729,   28.56%,         90672,         89916,         96693
70                      mxmNaiveF32,     10,       1675136,   52.77%,        167513,         98113,        382520
71                     mxmTensorsCM,     10,        453555,   14.29%,         45355,         38614,         46225
72 ```
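
In the host table above, `iclWaitForEvents`, `iclCreateKernel` and `iclBuildProgram` together account for roughly 93% of the host time. For longer reports it may be easier to rank the rows than to scan them by eye. The sketch below assumes the report text has been saved to a file (the name `timing.txt` in the usage note is hypothetical) and that rows keep the comma-separated layout shown above; `top_calls` is an illustrative helper, not a `cliloader` feature.

```bash
# Print the N most expensive rows of a timing table read from stdin,
# as "time_ns<TAB>calls<TAB>name", sorted by total time (descending).
top_calls() {
  local limit="${1:-5}"
  awk -F',' '
    # Data rows have 7 comma-separated fields and a numeric "Calls" column;
    # this skips the header row and the "Total Time" lines.
    NF == 7 && $2 ~ /^ *[0-9]+ *$/ {
      gsub(/^ +| +$/, "", $1)   # trim the padded function name
      printf "%d\t%d\t%s\n", $3, $2, $1
    }' | sort -rn | head -n "$limit"
}
```

For example, `top_calls 3 < timing.txt` lists the three rows with the largest total time. Host and device rows are ranked together, so split the report first if the two sections should be compared separately.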
73 
74 ## How to use with Chrome Tracing
75 
76 ```bash
77 cliloader -d -h \
78   --chrome-call-logging \
79   --chrome-device-timeline \
80   --chrome-kernel-timeline \
81   --chrome-device-stages \
82   java @.ffi-opencl-example tensors.Main --iterations=10 --verbose
83 ```
84 
 85 The same functionality can be achieved by invoking the `scripts/cliloader-chrome-opencl.bash` script.
86 
87 ```bash
88 sh scripts/cliloader-opencl.bash tensors.Main --iterations=10 --verbose
89 ```
90 
 91 Then open Chrome and navigate to `chrome://tracing`.
92 
 93 Load the trace (usually a file called `CLIntercept_Trace.json`), which is stored in the default dump directory of the `cliloader` tool.
94 
95 To obtain the default location, run `cliloader | grep dump-dir -A 3`.
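
After several runs, the dump directory may hold more than one trace. The sketch below picks the most recently modified one; `latest_trace` is a hypothetical helper, and the directory passed to it is whatever the command above reports on your machine.

```bash
# Print the most recently modified CLIntercept_Trace.json found under
# the given directory (searched recursively). Prints nothing if none exists.
latest_trace() {
  local dump_dir="$1"
  # ls -t orders the matched files newest first; head keeps the first one.
  find "$dump_dir" -type f -name 'CLIntercept_Trace.json' -exec ls -t {} + \
    | head -n 1
}
```

For example, `latest_trace /path/to/dump-dir` prints the file to load in `chrome://tracing`. The helper assumes file paths without embedded newlines.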
96 
97 
98 ## Documentation
99 - https://github.com/intel/opencl-intercept-layer/tree/main/docs