# Compute Analysis or Runtime Tracing

----
* [Contents](hat-00.md)
* Build Babylon and HAT
    * [Quick Install](hat-01-quick-install.md)
    * [Building Babylon with jtreg](hat-01-02-building-babylon.md)
    * [Building HAT with jtreg](hat-01-03-building-hat.md)
        * [Enabling the NVIDIA CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* [Testing Framework](hat-02-testing-framework.md)
* [Running Examples](hat-03-examples.md)
* [HAT Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Development
    * [Project Layout](hat-01-01-project-layout.md)
    * [IntelliJ Code Formatter](hat-development.md)
* Implementation Details
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)
* [Running HAT with Docker on NVIDIA GPUs](hat-07-docker-build-nvidia.md)
---

# Compute Analysis or Runtime Tracing

HAT does not dictate how a backend chooses to optimize execution, but it does
provide the tools (Babylon's Code Models) and some helpers which the backend is encouraged
to use.

The ComputeContext contains all the information that the backend needs, but does not
include any 'policy' for minimizing data movement.

Our assumption is that the backend can use various tools to deduce the most efficient execution strategy.

## Some possible strategies

### Copy data every time 'just in case' (JIC execution ;) )
Naively execute the code as described in the compute graph: the backend copies each buffer to the device, executes the kernel, and copies the data back again.

### Use kernel knowledge to minimise data movement
Execute the code described in the compute graph, but use knowledge extracted from the kernel models
to copy to the device only the buffers that the kernel is going to read, and copy back from the device
only the buffers that the kernel has written to.

### Use compute knowledge and kernel knowledge to further minimise data movement
Use knowledge extracted from the compute reachable graph and the kernel
graphs to determine whether Java has mutated buffers between kernel dispatches,
and copy to the device only the data that we know the Java code has mutated.

This last strategy is the ideal.

We can achieve it using static analysis of the compute and kernel models, or by being
involved in the execution process at runtime.

#### Static analysis

#### Runtime Tracking

1. We 'close over' the call/dispatch graph from the entrypoint: we collect the kernels reachable from the entrypoint, and all methods reachable from those kernels.
2. We essentially end up with a graph of code models 'rooted' at the entrypoint.
3. For each kernel we also determine how the kernel accesses its `MemorySegment` parameters; for each `MemorySegment` parameter we keep a side table recording whether the kernel reads from or writes to the segment.

The resulting 'ComputeClosure' (the tree of code models plus the relevant side tables) is made available to the accelerator to coordinate execution.

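As a sketch in plain Java (not the actual HAT API; `KernelSideTable` and its methods are hypothetical names), such a per-kernel side table might record read/write access per `MemorySegment` parameter like this:

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical side table: for each MemorySegment parameter (keyed by name)
// record whether the kernel reads it, writes it, or both.
class KernelSideTable {
    enum Access { READ, WRITE }

    private final Map<String, EnumSet<Access>> accesses = new HashMap<>();

    void recordRead(String param)  { accessesFor(param).add(Access.READ); }
    void recordWrite(String param) { accessesFor(param).add(Access.WRITE); }

    boolean kernelReads(String param)  { return accessesFor(param).contains(Access.READ); }
    boolean kernelWrites(String param) { return accessesFor(param).contains(Access.WRITE); }

    // A buffer needs copying to the device only if the kernel reads it,
    // and copying back only if the kernel writes it.
    boolean needsCopyToDevice(String param)   { return kernelReads(param); }
    boolean needsCopyFromDevice(String param) { return kernelWrites(param); }

    private EnumSet<Access> accessesFor(String param) {
        return accesses.computeIfAbsent(param, k -> EnumSet.noneOf(Access.class));
    }
}
```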
Note that our very simple Compute::compute method expresses neither the movement of the `MemorySegment` to a device, nor the retrieval of the data from a device once the kernel has executed.

Our assumption is that, given the ComputeClosure, we can deduce such movements.

There are many ways to achieve this. One way would be static analysis.

Given the Compute::compute entrypoint it is easy to determine that we are always (no conditionals or loops) passing (making available
might be a better term) a memory segment to a kernel (Compute::kernel), and that this kernel only mutates the `MemorySegment`.

So from simple static analysis we could choose to inject one or more calls into the model representing the need for the accelerator to move data to the device and/or back from the device after the kernel dispatch.

This modified model would look as if we had presented it with this code.

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.run(Compute::kernel, range, memorySegment);
    accelerator.injectedCopyFromDevice(memorySegment);
}
```

Note the `injectedCopyFromDevice()` call.

Because the kernel does not read the `MemorySegment`, we need only inject the code to request a move back from the device.

Doing this requires HAT to analyse the kernel(s) and inject appropriate code into
the Compute::compute method to inform the vendor backend when it should perform such moves.

Another strategy would be not to rely on static analysis, but instead to inject code to trace 'actual' mutations of the MemorySegments and use these flags to guard against unnecessary copies.

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    boolean injectedMemorySegmentIsDirty = false;
    Accelerator.Range range = accelerator.range(len);
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyToDevice(memorySegment);
    }
    accelerator.run(Compute::kernel, range, memorySegment);
    injectedMemorySegmentIsDirty = true; // based on Compute::kernel side table
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyFromDevice(memorySegment);
    }
}
```

Whether this mutated code generates Java bytecode and executes (or is interpreted) on the JVM, or whether the
code models for the closure are handed over to a backend which reifies the kernel code and the
dispatch logic, is not defined.

The code model for the compute will be mutated to inject the appropriate nodes to achieve the goal.

It is possible that some vendors may just take the original code model and analyse it themselves.

Clearly this is a trivial compute closure. Let's discuss the required kernel analysis
and proposed pseudo code.

## Copying data based on kernel MemorySegment analysis

Above we showed that we should be able to determine whether a kernel mutates or accesses any of
its `MemorySegment` parameters.

We determined above that the kernel only called set(), so we need
not copy the data to the device.

The following example shows a kernel which reads and mutates a `MemorySegment`.
```java
static class Compute {
    @Reflect public static
    void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        int temp = memorySegment.get(JAVA_INT, ndrange.id.x);
        memorySegment.set(JAVA_INT, ndrange.id.x, temp * 2);
    }

    @Reflect public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```
Here our analysis needs to determine that the kernel reads and writes to the segment (it does),
so the generated compute model would equate to

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    accelerator.run(Compute::doubleup, range, memorySegment);
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```
So far the deductions are fairly trivial.

Consider
```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```

Here HAT should deduce that the Java side is merely looping over the kernel dispatch
and has no interest in the `MemorySegment` between dispatches.

So the new model need only copy in once (before the first kernel) and out once (prior to return).

```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```

Things get slightly more interesting when we do indeed access the memory segment
from the Java code inside the loop.

```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```
Now we expect Babylon to inject a read inside the loop to make the data available Java side.

```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        accelerator.copyFromDevice(memorySegment); // injected via Babylon
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Note that in this case we are only accessing the 0th int from the segment, so a possible
optimization might be to allow the vendor to copy back only this one element...
```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        if (i + 1 == count) { // injected
            accelerator.copyFromDevice(memorySegment); // injected
        } else {
            accelerator.copyFromDevice(memorySegment, 1); // injected via Babylon
        }
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Again HAT will merely mutate the code model of the compute method;
the vendor may choose to interpret bytecode, generate bytecode and execute it,
or take the complete code model and execute it in native code.

So within HAT we must find all set/get calls on MemorySegments and trace them back to kernel parameters.

We should allow aliasing of memory segments... but in the short term we may well throw an exception when we see such aliasing.

```java
@Reflect public static
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, ndrange.id.x, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```

## Weed warning #1

We could find common kernel errors during analysis.

This code is probably wrong, as it races when writing to the 0th element.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```

By allowing a 'lint' like plugin mechanism for the code model, such bugs would be easy to find.
If we ever find a constant index in set(...., <constant> ) we are probably in a world of hurt.
Unless, that is, the set is guarded by some conditional which itself depends on a value extracted from a memory segment.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    if (????) {
        alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
    }
}
```

There are a lot of opportunities for catching such bugs.
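As a toy model of such a lint pass (the real check would walk Babylon code models; `SetCall` and its fields are hypothetical stand-ins for information the model would provide):

```java
import java.util.List;

// Toy model of a 'constant index write' lint.
class ConstantIndexLint {
    // A recorded set(layout, index, value) call: we track only whether the
    // index operand was a compile-time constant and whether the call sits
    // under a conditional that depends on data read from a segment.
    record SetCall(String segment, boolean indexIsConstant, boolean guardedByDataDependentIf) {}

    // Flag every unguarded constant-index write: in a kernel these are
    // usually racy, since every work item writes the same slot.
    static List<SetCall> findSuspectWrites(List<SetCall> calls) {
        return calls.stream()
                .filter(c -> c.indexIsConstant() && !c.guardedByDataDependentIf())
                .toList();
    }
}
```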


## Flipping Generations

Many algorithms require us to process data in generations. Consider
convolutions or Game Of Life style problems, where we have an image or game state and
we need to calculate the result of applying rules to the cells in that image or game.

It is important that when we process the next generation (either in parallel or sequentially) we
only use previous generation data to generate next generation data.

```
[ ][ ][*][ ][ ]       [ ][ ][ ][ ][ ]
[ ][ ][*][ ][ ]       [ ][*][*][*][ ]
[ ][ ][*][ ][ ]   ->  [ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ]       [ ][ ][ ][ ][ ]

```

This usually requires us to hold two copies, applying the kernel to one input set
and writing to the output.

In the case of the Game Of Life we may well use the output as the next input...

```java
@Reflect void conway(Accelerator.NDRange ndrange,
                     MemorySegment in, MemorySegment out, int width, int height) {
    int cx = ndrange.id.x % ndrange.id.maxx;
    int cy = ndrange.id.x / ndrange.id.maxx;

    int sum = 0;
    for (int dx = -1; dx < 2; dx++) {
        for (int dy = -1; dy < 2; dy++) {
            if (dx != 0 || dy != 0) {
                int x = cx + dx;
                int y = cy + dy;
                if (x >= 0 && x < width && y >= 0 && y < height) {
                    sum += in.get(JAVA_INT, y * width + x);
                }
            }
        }
    }
    int result = GOLRules(sum, in.get(JAVA_INT, ndrange.id.x));
    out.set(JAVA_INT, ndrange.id.x, result);
}
```

In this case the assumption is that the compute layer will swap the buffers on alternate passes.

```java
import java.lang.foreign.MemorySegment;

@Reflect
void compute(Accelerator accelerator, MemorySegment gameState,
             int width, int height, int maxGenerations) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);
    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

This common pattern includes some aliasing of MemorySegments that we need to untangle.

HAT needs to be able to track the aliases to determine the minimal number of copies.
```java
import java.lang.foreign.MemorySegment;

@Reflect
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height, int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);
    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 1 ? s1 : s2;
        if (generation == 0) {                  // injected
            accelerator.copyToDevice(from);     // injected
        }                                       // injected
        accelerator.run(Compute::conway, range, from, to, width, height);
        if (generation == maxGenerations - 1) { // injected
            accelerator.copyFromDevice(to);     // injected
        }                                       // injected
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

```java
import java.lang.foreign.MemorySegment;

@Reflect
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height,
             int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);
    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
        displaySAM.display(to, width, height);
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```
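Stripped of the HAT API, the ping-pong pattern itself can be sketched in plain Java; two `int[]` buffers stand in for the MemorySegments, and `golRules` is a stand-in for the GOLRules helper assumed above:

```java
// Plain-Java sketch of generation flipping: each pass reads `from` and
// writes only `to`, so next-generation cells see only previous-generation data.
class Life {
    // Conway's rules: a cell lives next generation if it has 3 live
    // neighbours, or 2 live neighbours and is currently alive.
    static int golRules(int neighbours, int self) {
        return (neighbours == 3 || (neighbours == 2 && self == 1)) ? 1 : 0;
    }

    static void step(int[] from, int[] to, int width, int height) {
        for (int cy = 0; cy < height; cy++) {
            for (int cx = 0; cx < width; cx++) {
                int sum = 0;
                for (int dx = -1; dx < 2; dx++) {
                    for (int dy = -1; dy < 2; dy++) {
                        int x = cx + dx, y = cy + dy;
                        if ((dx != 0 || dy != 0)
                                && x >= 0 && x < width && y >= 0 && y < height) {
                            sum += from[y * width + x];
                        }
                    }
                }
                to[cy * width + cx] = golRules(sum, from[cy * width + cx]);
            }
        }
    }

    // Ping-pong between the two buffers; return whichever holds the result.
    static int[] run(int[] state, int width, int height, int maxGenerations) {
        int[] s1 = state, s2 = new int[state.length];
        for (int generation = 0; generation < maxGenerations; generation++) {
            int[] from = generation % 2 == 0 ? s1 : s2;
            int[] to = generation % 2 == 1 ? s1 : s2;
            step(from, to, width, height);
        }
        return maxGenerations % 2 == 1 ? s2 : s1;
    }
}
```

The vertical blinker from the diagram above flips to a horizontal one and back, which makes the swap logic easy to check.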



### A Babylon transform to track buffer mutations

One goal of HAT was to automate the movement of buffers between Java and the device.

One strategy employed by `NativeBackends` might be to track 'ifaceMappedSegment' accesses and inject tracking calls into the compute method.

Here is a transformation for that.

416 ```java
417  static FuncOpWrapper injectBufferTracking(ComputeClosure.ResolvedMethodCall resolvedMethodCall) {
418         FuncOpWrapper original = resolvedMethodCall.funcOpWrapper();
419         var transformed = original.transformInvokes((builder, invoke) -> {
420                     if (invoke.isIfaceBufferMethod()) { // void array(long idx, T value) or T array(long idx)
421                         // Get the first parameter (computeClosure)
422                         CopyContext cc = builder.context();
423                         Value computeClosure = cc.getValue(original.parameter(0));
424                         // Get the buffer receiver value in the output model
425                         Value receiver = cc.getValue(invoke.operand(0)); // The buffer we are mutatibg or accessing
426                         if (invoke.isIfaceMutator()) {
427                             // inject CLWrapComputeContext.preMutate(buffer);
428                             builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_MUTATE, computeClosure, receiver));
429                             builder.op(invoke.op());
430                            // inject CLWrapComputeContext.postMutate(buffer);
431                             builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_MUTATE, computeClosure, receiver));
432                         } else if ( invoke.isIfaceAccessor()) {
433                            // inject CLWrapComputeContext.preAccess(buffer);
434                             builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_ACCESS, computeClosure, receiver));
435                             builder.op(invoke.op());
436                             // inject CLWrapComputeContext.postAccess(buffer);
437                             builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_ACCESS, computeClosure, receiver));
438                         } else {
439                             builder.op(invoke.op());
440                         }
441                     }else{
442                         builder.op(invoke.op());
443                     }
444                     return builder;
445                 }
446         );
447         transformed.op().writeTo(System.out);
448         resolvedMethodCall.funcOpWrapper(transformed);
449         return transformed;
450     }
451 ```
452 
So in our `OpenCLBackend` for example

```java
public void mutateIfNeeded(ComputeClosure.MethodCall methodCall) {
    injectBufferTracking(methodCall);
}

@Override
public void computeContextClosed(ComputeContext CLWrapComputeContext) {
    var codeBuilder = new OpenCLKernelBuilder();
    C99Code kernelCode = createKernelCode(CLWrapComputeContext, codeBuilder);
    System.out.println(codeBuilder);
}
```
I hacked the Mandel example so that the compute accessed and mutated its arrays.

468 ```java
469   @Reflect
470     static float doubleit(float f) {
471         return f * 2;
472     }
473 
474     @Reflect
475     static float scaleUp(float f) {
476         return doubleit(f);
477     }
478 
479     @Reflect
480     static public void compute(final ComputeContext CLWrapComputeContext, S32Array2D s32Array2D, float x, float y, float scale) {
481         scale = scaleUp(scale);
482         var range = CLWrapComputeContext.accelerator.range(s32Array2D.size());
483         int i = s32Array2D.get(10,10);
484         s32Array2D.set(10,10,i);
485         CLWrapComputeContext.dispatchKernel(MandelCompute::kernel, range, s32Array2D, pallette, x, y, scale);
486     }
487 ```
488 So here is the transformation being applied to the above compute
489 
490 BEFORE (note the !'s indicating accesses through ifacebuffers)
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    %27 : hat.ComputeContext = var.load %5;
    ...
```
AFTER
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
    invoke %0 %14 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    invoke %0 %14 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
    invoke %0 %18 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    invoke %0 %18 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
    invoke %0 %23 @"hat.ComputeClosure::preMutate(hat.buffer.Buffer)void";
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    invoke %0 %23 @"hat.ComputeClosure::postMutate(hat.buffer.Buffer)void";
    %27 : hat.ComputeContext = var.load %5;
```
And here at runtime, executing via the interpreter, the ComputeClosure is reporting accesses from the injected calls.

```
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
```
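The same pre/post bracketing can be mimicked at runtime, without any code-model rewriting, using a dynamic proxy. This is a plain-Java analogy (not how HAT does it; `IntBuffer2D` and `Tracking` are hypothetical), but it produces an access log like the one above:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Wrap a buffer interface so every accessor/mutator call is bracketed by
// pre/post notifications, like the injected invokes in the AFTER model.
interface IntBuffer2D {
    int get(int x, int y);
    void set(int x, int y, int value);
}

class Tracking {
    static final List<String> log = new ArrayList<>();

    static IntBuffer2D track(IntBuffer2D target) {
        InvocationHandler handler = (proxy, method, args) -> {
            boolean mutator = method.getName().equals("set");
            log.add(mutator ? "preMutate" : "preAccess");
            Object result = method.invoke(target, args); // the real access
            log.add(mutator ? "postMutate" : "postAccess");
            return result;
        };
        return (IntBuffer2D) Proxy.newProxyInstance(
                IntBuffer2D.class.getClassLoader(),
                new Class<?>[]{IntBuffer2D.class}, handler);
    }
}
```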
## Why inject this info?
The idea is that the ComputeContext would maintain sets of dirty buffers, one `gpuDirty` set and one `javaDirty` set.

We have the code models for the kernels, so we know which kernel accesses, mutates, or accesses AND mutates particular parameters.

When the ComputeContext receives `preAccess(x)` or `preMutate(x)`, it determines whether `x` is in the `gpuDirty` set.
If so it delegates to the backend to copy the GPU data back from the device into the memory segment (assuming the memory is not coherent!)
before removing the buffer from the `gpuDirty` set and returning.

Now the Java access to the segment sees the latest buffer.

After `postMutate(x)` it will place the buffer in the `javaDirty` set.

When a kernel dispatch comes along, the parameters to the kernel are all checked against the `javaDirty` set.
If a parameter is 'accessed' by the kernel, the backend copies the segment to the device, removes the parameter
from the `javaDirty` set, and then invokes the kernel.
When the kernel completes (let's assume synchronous execution for a moment) all parameters are checked again, and if a parameter
is known to be mutated by the kernel it is added to the `gpuDirty` set.

This way we don't have to force the developer to request data movements.

BTW if kernel requests are async ;) then the ComputeContext maintains a map of buffer to kernel, so `preAccess(x)` or `preMutate(x)` calls
can wait for the kernel that is due to 'dirty' the buffer to complete.

### Marking HAT buffers directly