# Compute Analysis or Runtime Tracing

----
* [Contents](hat-00.md)
* Build Babylon and HAT
    * [Quick Install](hat-01-quick-install.md)
    * [Building Babylon with jtreg](hat-01-02-building-babylon.md)
    * [Building HAT with jtreg](hat-01-03-building-hat.md)
        * [Enabling the NVIDIA CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* [Testing Framework](hat-02-testing-framework.md)
* [Running Examples](hat-03-examples.md)
* [HAT Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Development
    * [Project Layout](hat-01-01-project-layout.md)
* Implementation Details
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)
* [Running HAT with Docker on NVIDIA GPUs](hat-07-docker-build-nvidia.md)
---

# Compute Analysis or Runtime Tracing

HAT does not dictate how a backend chooses to optimize execution, but it does
provide the tools (Babylon's Code Models) and some helpers which the backend is
encouraged to use.

The ComputeContext contains all the information that the backend needs, but does not
include any 'policy' for minimizing data movement.

Our assumption is that a backend can use various tools to deduce the most efficient execution strategy.

## Some possible strategies

### Copy data every time 'just in case' (JIC execution ;) )
Naively execute the code as described in the Compute graph: the backend copies each buffer to the device, executes the kernel, and copies each buffer back again.

### Use kernel knowledge to minimise data movement
Execute the code described in the Compute Graph, but use knowledge extracted from the kernel models
to copy to the device only the buffers that the kernel is going to read, and to copy back from the device
only the buffers that the kernel has written to.

### Use Compute knowledge and kernel knowledge to further minimise data movement
Use knowledge extracted from the compute reachable graph and the kernel
graphs to determine whether Java has mutated buffers between kernel dispatches,
and only copy to the device data that we know the Java code has mutated.

This last strategy is ideal.

We can achieve it using static analysis of the compute and kernel models, or by being
involved in the execution process at runtime.

#### Static analysis

#### Runtime Tracking

1. We 'close over' the call/dispatch graph from the entrypoint: we collect the kernels reachable from the entrypoint and all methods transitively reachable from those kernels.
2. We essentially end up with a graph of code models 'rooted' at the entrypoint.
3. For each kernel we also determine how it accesses its `MemorySegment` parameters; for each `MemorySegment` parameter we keep a side table recording whether the kernel reads or writes the segment.

This resulting 'ComputeClosure' (tree of code models and relevant side tables) is made available to the accelerator to coordinate execution.
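
As a sketch of what such a side table might hold (hypothetical names, not the actual HAT API), a map from kernel parameter index to a read/write set is enough to drive the copy decisions discussed below:

```java
import java.util.EnumSet;
import java.util.Map;

// Hypothetical sketch of the per-kernel side table described above;
// the real ComputeClosure keeps equivalent information alongside the code models.
enum SegmentAccess { READ, WRITE }

record KernelSideTable(String kernelName, Map<Integer, EnumSet<SegmentAccess>> paramAccess) {
    boolean reads(int paramIndex) {
        return paramAccess.getOrDefault(paramIndex, EnumSet.noneOf(SegmentAccess.class))
                .contains(SegmentAccess.READ);
    }

    boolean writes(int paramIndex) {
        return paramAccess.getOrDefault(paramIndex, EnumSet.noneOf(SegmentAccess.class))
                .contains(SegmentAccess.WRITE);
    }
}
```

A backend holding such a table for a kernel that only writes its segment can skip the copy to the device and perform only the copy back.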

Note that our very simple Compute::compute method expresses neither the movement of the MemorySegment to a device, nor the retrieval of the data from the device once the kernel has executed.

Our assumption is that given the ComputeClosure we can deduce such movements.

There are many ways to achieve this.  One way would be static analysis.

Given the Compute::compute entrypoint it is easy to determine that we always (no conditionals or loops) pass (making available
might be a better term) a memory segment to a kernel (Compute::kernel), and that this kernel only mutates the `MemorySegment`.

So from simple static analysis we could choose to inject one or more calls into the model, representing the need for the accelerator to move data to the device and/or back from the device after the kernel dispatch.

This modified model would look as if we had been presented with this code.

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.run(Compute::kernel, range, memorySegment);
    accelerator.injectedCopyFromDevice(memorySegment);
}
```

Note the `injectedCopyFromDevice()` call.

Because the kernel does not read the `MemorySegment`, we need only inject the code to request a move back from the device.

To do this, HAT must analyse the kernel(s) and inject appropriate code into
the Compute::compute method to inform the vendor backend when it should perform such moves.

Another strategy would be to not rely on static analysis, but instead to inject code that traces 'actual' mutations of the MemorySegments and uses these flags to guard against unnecessary copies.

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    boolean injectedMemorySegmentIsDirty = false;
    Accelerator.Range range = accelerator.range(len);
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyToDevice(memorySegment);
    }
    accelerator.run(Compute::kernel, range, memorySegment);
    injectedMemorySegmentIsDirty = true; // based on Compute::kernel sidetable
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyFromDevice(memorySegment);
    }
}
```


Whether this code mutation generates Java bytecode that executes (or is interpreted) on the JVM, or whether the
CodeModels for the closure are handed over to a backend which reifies the kernel code and the
dispatch logic, is not defined.

The code model for the compute method will be mutated to inject the appropriate nodes to achieve this goal.

It is possible that some vendors may just take the original code model and analyse it themselves.

Clearly this is a trivial compute closure.  Let's discuss the required kernel analysis
and the proposed pseudo code.

## Copying data based on kernel MemorySegment analysis

Above we showed that we should be able to determine whether a kernel mutates or accesses any of
its `MemorySegment` parameters.

We determined above that the kernel only called set(), so we need
not copy the data to the device.

The following example shows a kernel which reads and mutates a `MemorySegment`.
```java
static class Compute {
    @Reflect public static
    void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        int temp = memorySegment.get(JAVA_INT, ndrange.id.x);
        memorySegment.set(JAVA_INT, ndrange.id.x, temp * 2);
    }

    @Reflect public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```
Here our analysis needs to determine that the kernel both reads and writes the segment (it does),
so the generated compute model would equate to

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    accelerator.run(Compute::doubleup, range, memorySegment);
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```
So far the deductions are fairly trivial.

Consider
```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```

Here HAT should deduce that the Java side is merely looping over the kernel dispatch
and has no interest in the `MemorySegment` between dispatches.

So the new model need only copy in once (before the first kernel dispatch) and out once (prior to return).

```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```

Things get slightly more interesting when we do indeed access the memory segment
from the Java code inside the loop.

```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```
Now we expect Babylon to inject a read inside the loop to make the data available Java side.

```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        accelerator.copyFromDevice(memorySegment); // injected via Babylon
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Note that in this case we are only accessing the 0th int of the segment, so a possible
optimization might be to allow the vendor to copy back only this one element....
```java
@Reflect public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        if (i + 1 == count) {                              // injected
            accelerator.copyFromDevice(memorySegment);     // injected
        } else {
            accelerator.copyFromDevice(memorySegment, 1);  // injected via Babylon
        }
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Again HAT will merely mutate the code model of the compute method;
the vendor may choose to interpret bytecode, generate bytecode and execute it,
or take the complete code model and execute it in native code.

So within HAT we must find all set/get calls on MemorySegments and trace them back to kernel parameters.

We should allow aliasing of memory segments... but in the short term we may well throw an exception when we see such aliasing.

```java
@Reflect public static
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, ndrange.id.x, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```
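
Tracing a set/get back through local aliases to a kernel parameter amounts to following a chain of value copies. A minimal sketch of that resolution (hypothetical helper, not the real code-model API):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: each local that aliases a segment records what it was
// copied from; resolve() follows the chain back to the root definition, which
// for a kernel should be one of its MemorySegment parameters.
class AliasTable {
    private final Map<String, String> copiedFrom = new HashMap<>();

    void recordCopy(String alias, String source) {
        copiedFrom.put(alias, source);
    }

    String resolve(String name) {
        String current = name;
        while (copiedFrom.containsKey(current)) {
            current = copiedFrom.get(current);
        }
        return current; // a root (e.g. a kernel parameter) has no recorded source
    }
}
```

For the `doubleup` kernel above, `recordCopy("alias", "memorySegment")` lets the analysis attribute the set/get on `alias` to the `memorySegment` parameter.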

## Weed warning #1

We could also catch common kernel errors during this analysis.

The following code is probably wrong, as it races when writing to the 0th element.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```

By allowing a 'lint'-like plugin mechanism for the code model, such bugs would be easy to find.
If we ever find a constant index in `set(..., <constant>, ...)` we are probably in a world of hurt,
unless the set is guarded by some conditional which is itself dependent on a value extracted from a memory segment.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    if (????) {
        alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
    }
}
```

There are a lot of opportunities for catching such bugs.


## Flipping Generations

Many algorithms require us to process data in generations. Consider
convolutions or Game Of Life style problems, where we have an image or game state and
we need to calculate the result of applying rules to the cells of the image or game.

It is important that when we process the next generation (either in parallel or sequentially) we
only use previous-generation data to generate next-generation data.

```
[ ][ ][*][ ][ ]       [ ][ ][ ][ ][ ]
[ ][ ][*][ ][ ]       [ ][*][*][*][ ]
[ ][ ][*][ ][ ]   ->  [ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ]       [ ][ ][ ][ ][ ]
```

This usually requires us to hold two copies, applying the kernel to one input set
and writing to the output.

In the case of the Game Of Life we may well use the output as the next input...

```java
@Reflect public static
void conway(Accelerator.NDRange ndrange,
            MemorySegment in, MemorySegment out, int width, int height) {
    int cx = ndrange.id.x % width;
    int cy = ndrange.id.x / width;

    int sum = 0;
    for (int dx = -1; dx < 2; dx++) {
        for (int dy = -1; dy < 2; dy++) {
            if (dx != 0 || dy != 0) {
                int x = cx + dx;
                int y = cy + dy;
                if (x >= 0 && x < width && y >= 0 && y < height) {
                    sum += in.get(JAVA_INT, y * width + x);
                }
            }
        }
    }
    int result = GOLRules(sum, in.get(JAVA_INT, ndrange.id.x));
    out.set(JAVA_INT, ndrange.id.x, result);
}
```

In this case the assumption is that the compute layer will swap the buffers on alternate passes.

```java
@Reflect
void compute(Accelerator accelerator, MemorySegment gameState,
             int width, int height, int maxGenerations) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);
    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 0 ? s2 : s1;
        accelerator.run(Compute::conway, range, from, to, width, height);
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

This common pattern includes some aliasing of MemorySegments that we need to untangle.

HAT needs to be able to track the aliases to determine the minimal number of copies.
```java
@Reflect
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height, int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 0 ? s2 : s1;
        if (generation == 0) {                   // injected
            accelerator.copyToDevice(from);      // injected
        }                                        // injected
        accelerator.run(Compute::conway, range, from, to, width, height);
        if (generation == maxGenerations - 1) {  // injected
            accelerator.copyFromDevice(to);      // injected
        }                                        // injected
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

If the Java code accesses the segments between dispatches (here via displaySAM), copies back inside the loop are needed.

```java
@Reflect
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height,
             int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 0 ? s2 : s1;
        accelerator.run(Compute::conway, range, from, to, width, height);
        displaySAM.display(to, width, height);
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

### A Babylon transform to track buffer mutations

One goal of HAT is to automate the movement of buffers between Java and the device.

One strategy employed by `NativeBackend`s might be to track 'ifaceMappedSegment' accesses and inject tracking calls into the compute method.

Here is a transformation for that:

```java
static FuncOpWrapper injectBufferTracking(ComputeClosure.ResolvedMethodCall resolvedMethodCall) {
    FuncOpWrapper original = resolvedMethodCall.funcOpWrapper();
    var transformed = original.transformInvokes((builder, invoke) -> {
                if (invoke.isIfaceBufferMethod()) { // void array(long idx, T value) or T array(long idx)
                    // Get the first parameter (computeClosure)
                    CopyContext cc = builder.context();
                    Value computeClosure = cc.getValue(original.parameter(0));
                    // Get the buffer receiver value in the output model
                    Value receiver = cc.getValue(invoke.operand(0)); // The buffer we are mutating or accessing
                    if (invoke.isIfaceMutator()) {
                        // inject CLWrapComputeContext.preMutate(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_MUTATE, computeClosure, receiver));
                        builder.op(invoke.op());
                        // inject CLWrapComputeContext.postMutate(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_MUTATE, computeClosure, receiver));
                    } else if (invoke.isIfaceAccessor()) {
                        // inject CLWrapComputeContext.preAccess(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_ACCESS, computeClosure, receiver));
                        builder.op(invoke.op());
                        // inject CLWrapComputeContext.postAccess(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_ACCESS, computeClosure, receiver));
                    } else {
                        builder.op(invoke.op());
                    }
                } else {
                    builder.op(invoke.op());
                }
                return builder;
            }
    );
    transformed.op().writeTo(System.out);
    resolvedMethodCall.funcOpWrapper(transformed);
    return transformed;
}
```

So in our `OpenCLBackend` for example
```java
public void mutateIfNeeded(ComputeClosure.MethodCall methodCall) {
    injectBufferTracking(methodCall);
}

@Override
public void computeContextClosed(ComputeContext CLWrapComputeContext) {
    var codeBuilder = new OpenCLKernelBuilder();
    C99Code kernelCode = createKernelCode(CLWrapComputeContext, codeBuilder);
    System.out.println(codeBuilder);
}
```
I hacked the Mandel example so that the compute method accessed and mutated its arrays.

```java
@Reflect
static float doubleit(float f) {
    return f * 2;
}

@Reflect
static float scaleUp(float f) {
    return doubleit(f);
}

@Reflect
static public void compute(final ComputeContext CLWrapComputeContext, S32Array2D s32Array2D, float x, float y, float scale) {
    scale = scaleUp(scale);
    var range = CLWrapComputeContext.accelerator.range(s32Array2D.size());
    int i = s32Array2D.get(10, 10);
    s32Array2D.set(10, 10, i);
    CLWrapComputeContext.dispatchKernel(MandelCompute::kernel, range, s32Array2D, x, y, scale);
}
```
So here is the transformation being applied to the above compute method.

BEFORE (note the `!`'s indicating accesses through iface buffers)
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    %27 : hat.ComputeContext = var.load %5;
    ...
```
AFTER
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
    invoke %0 %14 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    invoke %0 %14 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
    invoke %0 %18 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    invoke %0 %18 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
    invoke %0 %23 @"hat.ComputeClosure::preMutate(hat.buffer.Buffer)void";
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    invoke %0 %23 @"hat.ComputeClosure::postMutate(hat.buffer.Buffer)void";
    %27 : hat.ComputeContext = var.load %5;
    ...
```
And here, at runtime, the ComputeClosure reports accesses via the injected calls when executing through the interpreter.

```
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
```
## Why inject this info?
The idea is that the ComputeContext would maintain two sets of dirty buffers: `gpuDirty` and `javaDirty`.

We have the code models for the kernels, so we know which kernels access, mutate, or access AND mutate particular parameters.

So when the ComputeContext receives `preAccess(x)` or `preMutate(x)` it determines whether `x` is in the `gpuDirty` set.
If so, it delegates to the backend to copy the data back from the device into the memory segment (assuming the memory is not coherent!)
before removing the buffer from the `gpuDirty` set and returning.

Now the Java access to the segment sees the latest buffer.

After `postMutate(x)` it will place the buffer in the `javaDirty` set.

When a kernel dispatch comes along, the parameters to the kernel are all checked against the `javaDirty` set.
If a parameter is 'accessed' by the kernel, the backend copies the segment to the device and removes the parameter
from the `javaDirty` set before invoking the kernel.
When the kernel completes (let's assume synchronous dispatch for a moment) all parameters are checked again, and any parameter
known to be mutated by the kernel is added to the `gpuDirty` set.

This way we don't have to force the developer to request data movements.

BTW if kernel requests are async ;) then the ComputeContext maintains a map of buffer to kernel, so `preAccess(x)` or `preMutate(x)` calls
can wait for the kernel that is due to 'dirty' the buffer to complete.
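
A minimal sketch of this bookkeeping (hypothetical names; in HAT the copies would be delegated to the vendor backend and the buffers would be iface-mapped segments rather than plain objects):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of the gpuDirty/javaDirty bookkeeping described above.
// copyToDevice/copyFromDevice stand in for the vendor backend's transfers.
class DirtyTracker {
    final Set<Object> gpuDirty = new HashSet<>();
    final Set<Object> javaDirty = new HashSet<>();
    final Consumer<Object> copyToDevice;
    final Consumer<Object> copyFromDevice;

    DirtyTracker(Consumer<Object> copyToDevice, Consumer<Object> copyFromDevice) {
        this.copyToDevice = copyToDevice;
        this.copyFromDevice = copyFromDevice;
    }

    // Called via injected preAccess/preMutate: make the Java view current.
    void preAccess(Object buffer) {
        if (gpuDirty.remove(buffer)) {
            copyFromDevice.accept(buffer);
        }
    }

    // Called via injected postMutate: Java now holds the latest data.
    void postMutate(Object buffer) {
        javaDirty.add(buffer);
    }

    // Called before a (synchronous) kernel dispatch, per parameter.
    void beforeDispatch(Object buffer, boolean kernelReads) {
        if (kernelReads && javaDirty.remove(buffer)) {
            copyToDevice.accept(buffer);
        }
    }

    // Called after the dispatch completes, per parameter.
    void afterDispatch(Object buffer, boolean kernelWrites) {
        if (kernelWrites) {
            gpuDirty.add(buffer);
        }
    }
}
```

With this in place a buffer that is never touched Java-side between dispatches is never recopied, matching the looping examples earlier in this page.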

### Marking hat buffers directly
597