# Compute Analysis or Runtime tracing

----

* [Contents](hat-00.md)
* House Keeping
    * [Project Layout](hat-01-01-project-layout.md)
    * [Building Babylon](hat-01-02-building-babylon.md)
    * [Building HAT](hat-01-03-building-hat.md)
* Programming Model
    * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)

----
# Compute Analysis or Runtime tracing

HAT does not dictate how a backend chooses to optimize execution, but it does
provide the tools (Babylon's Code Models) and some helpers which the backend is
encouraged to use.

The ComputeContext contains all the information that the backend needs, but does not
include any 'policy' for minimizing data movements.

Our assumption is that a backend can use various tools to deduce the most efficient execution strategy.
## Some possible strategies

### Copy data every time 'just in case' (JIC execution ;) )
Naively execute the code as described in the compute graph: the backend copies each buffer to the device, executes the kernel, and copies the data back again.
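As a rough sketch (reusing the `copyToDevice`/`copyFromDevice` names that appear in the injected examples later in this document; the exact signatures are illustrative), the JIC strategy amounts to:

```java
// A minimal sketch of the 'just in case' strategy: every buffer is copied in and out
// around every dispatch, regardless of what the kernel actually does with it.
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);              // always copied in, even if the kernel never reads it
    accelerator.run(Compute::kernel, range, memorySegment);
    accelerator.copyFromDevice(memorySegment);            // always copied out, even if the kernel never wrote it
}
```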

### Use kernel knowledge to minimise data movement
Execute the code described in the compute graph, but use knowledge extracted from the kernel models
to copy to the device only the buffers that the kernel is going to read, and to copy back from the device
only the buffers that the kernel has written to.

### Use Compute knowledge and kernel knowledge to further minimise data movement
Use knowledge extracted from the compute reachable graph and the kernel
graphs to determine whether Java has mutated buffers between kernel dispatches,
and only copy data to the device that we know the Java code has mutated.

This last strategy is the ideal.

We can achieve it using static analysis of the compute and kernel models, or by being
involved in the execution process at runtime.

#### Static analysis

#### Runtime Tracking

In either case HAT starts by building a closure over the compute entrypoint:

1. We 'close over' the call/dispatch graph from the entrypoint: we collect the kernels reachable from the entrypoint, and all methods reachable from those kernels.
2. We essentially end up with a graph of code models 'rooted' at the entrypoint.
3. For each kernel we also determine how it accesses its `MemorySegment` parameters; for each `MemorySegment` parameter we keep a side table recording whether the kernel reads from or writes to the segment.

This resulting 'ComputeClosure' (the tree of code models plus the relevant side tables) is made available to the accelerator to coordinate execution.
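A minimal sketch of such a side table, using hypothetical names (`KernelSegmentAccess`, `recordRead`, `recordWrite`) rather than the real HAT types; populating it means walking each kernel's code model and, for every `MemorySegment` get/set invoke, tracing the receiver back to a kernel parameter:

```java
import java.util.EnumSet;
import java.util.HashMap;
import java.util.Map;

// Hypothetical side table: for each kernel parameter index that is a MemorySegment,
// record whether the kernel reads it, writes it, or both.
final class KernelSegmentAccess {
    enum Access { READ, WRITE }

    private final Map<Integer, EnumSet<Access>> accesses = new HashMap<>();

    void recordRead(int paramIndex)  { accessesFor(paramIndex).add(Access.READ); }
    void recordWrite(int paramIndex) { accessesFor(paramIndex).add(Access.WRITE); }

    boolean kernelReads(int paramIndex)  { return accessesFor(paramIndex).contains(Access.READ); }
    boolean kernelWrites(int paramIndex) { return accessesFor(paramIndex).contains(Access.WRITE); }

    private EnumSet<Access> accessesFor(int paramIndex) {
        return accesses.computeIfAbsent(paramIndex, i -> EnumSet.noneOf(Access.class));
    }
}
```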

Note that our very simple Compute::compute method expresses neither the movement of the MemorySegment to a device, nor the retrieval of the data from the device when the kernel has executed.

Our assumption is that, given the ComputeClosure, we can deduce such movements.

There are many ways to achieve this. One way would be by static analysis.

Given the Compute::compute entrypoint it is easy to determine that we are always (no conditionals or loops) passing (making available
might be a better term) a memory segment to a kernel (Compute::kernel), and that this kernel only mutates the `MemorySegment`.

So from simple static analysis we could choose to inject one or more calls into the model representing the need for the accelerator to move data to the device, and/or back from the device after the kernel dispatch.

This modified model would look as if we had presented it with this code:

```java
 void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.run(Compute::kernel, range, memorySegment);
        accelerator.injectedCopyFromDevice(memorySegment);
    }
```

Note the `injectedCopyFromDevice()` call.

Because the kernel does not read the `MemorySegment` we only need to inject the code to request a move back from the device.

To do this requires HAT to analyse the kernel(s) and inject appropriate code into
the Compute::compute method to inform the vendor backend when it should perform such moves.

Another strategy would be not to rely on static analysis, but to inject code that traces 'actual' mutations of the MemorySegments and uses these flags to guard against unnecessary copies:

```java
 void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        boolean injectedMemorySegmentIsDirty = false;
        Accelerator.Range range = accelerator.range(len);
        if (injectedMemorySegmentIsDirty){
            accelerator.injectedCopyToDevice(memorySegment);
        }
        accelerator.run(Compute::kernel, range, memorySegment);
        injectedMemorySegmentIsDirty = true; // based on Compute::kernel sidetable
        if (injectedMemorySegmentIsDirty) {
            accelerator.injectedCopyFromDevice(memorySegment);
        }
    }
```

Whether this code mutation generates Java bytecode and executes (or interprets) on the JVM, or whether the
CodeModels for the closure are handed over to a backend which reifies the kernel code and the
dispatch logic, is not defined.

The code model for the compute will be mutated to inject the appropriate nodes to achieve the goal.

It is possible that some vendors may just take the original code model and do the analysis themselves.

Clearly this is a trivial compute closure. Let's discuss the required kernel analysis
and proposed pseudo code.

## Copying data based on kernel MemorySegment analysis

Above we showed that we should be able to determine whether a kernel mutates or accesses any of
its kernel `MemorySegment` parameters.

We determined above that the kernel only called set(), so we need
not copy the data to the device.

The following example shows a kernel which reads and mutates a memory segment:
```java
    static class Compute {
    @CodeReflection  public static
    void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        int temp = memorySegment.get(JAVA_INT, ndrange.id.x);
        memorySegment.set(JAVA_INT, ndrange.id.x, temp * 2);
    }

    @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```
Here our analysis needs to determine that the kernel both reads and writes to the segment (it does),
so the generated compute model would equate to

```java
 void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.copyToDevice(memorySegment); // injected via Babylon
        accelerator.run(Compute::doubleup, range, memorySegment);
        accelerator.copyFromDevice(memorySegment); // injected via Babylon
    }
```
So far the deductions are fairly trivial.

Consider
```java
 @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
        Accelerator.Range range = accelerator.range(len);
        for (int i=0; i<count; i++) {
            accelerator.run(Compute::doubleup, range, memorySegment);
        }
    }
```

Here HAT should deduce that the Java side is merely looping over the kernel dispatch
and has no interest in the memory segment between dispatches.

So the new model need only copy in once (before the first kernel) and out once (prior to return):

```java
 @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.copyToDevice(memorySegment); // injected via Babylon
        for (int i=0; i<count; i++) {
            accelerator.run(Compute::doubleup, range, memorySegment);
        }
        accelerator.copyFromDevice(memorySegment); // injected via Babylon
    }
```

Things get slightly more interesting when we do indeed access the memory segment
from the Java code inside the loop.

```java
 @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
        Accelerator.Range range = accelerator.range(len);
        for (int i=0; i<count; i++) {
            accelerator.run(Compute::doubleup, range, memorySegment);
            int slot0 = memorySegment.get(JAVA_INT, 0);
            System.out.println("slot0 " + slot0);
        }
    }
```
Now we expect Babylon to inject a read inside the loop to make the data available Java side:

```java
 @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.copyToDevice(memorySegment); // injected via Babylon
        for (int i=0; i<count; i++) {
            accelerator.run(Compute::doubleup, range, memorySegment);
            accelerator.copyFromDevice(memorySegment); // injected via Babylon
            int slot0 = memorySegment.get(JAVA_INT, 0);
            System.out.println("slot0 " + slot0);
        }
    }
```

Note that in this case we are only accessing the 0th int from the segment, so a possible
optimization might be to allow the vendor to copy back only this one element:
```java
 @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.copyToDevice(memorySegment); // injected via Babylon
        for (int i=0; i<count; i++) {
            accelerator.run(Compute::doubleup, range, memorySegment);
            if (i+1==count){                                  // injected
                accelerator.copyFromDevice(memorySegment);    // injected: last pass, copy everything back
            } else {
                accelerator.copyFromDevice(memorySegment, 1); // injected: only the first element
            }
            int slot0 = memorySegment.get(JAVA_INT, 0);
            System.out.println("slot0 " + slot0);
        }
    }
```

Again HAT will merely mutate the code model of the compute method;
the vendor may choose to interpret bytecode, generate bytecode and execute it,
or take the complete code model and execute it in native code.

So within HAT we must find all set/get calls on MemorySegments and trace them back to kernel parameters.

We should allow aliasing of memory segments, but in the short term we may well throw an exception when we see such aliasing:

```java
 @CodeReflection  public static
    void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        MemorySegment alias = memorySegment;
        alias.set(JAVA_INT, ndrange.id.x, alias.get(JAVA_INT, ndrange.id.x)*2);
    }
```

## Weed warning #1

We could find common kernel errors during this analysis.

This code is probably wrong, as it races writing to the 0th element:

```java
 void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x)*2);
}
```

By allowing a 'lint' like plugin mechanism for the code model this would be easy to find.
If we ever find a constant index in a set(...) call we are probably in a world of hurt,
unless the set is guarded by some conditional which itself is dependent on a value extracted from a memory segment.

```java
 void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    if (????){
        alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
    }
}
```

There are a lot of opportunities for catching such bugs.
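For example, the constant-index check could be phrased roughly as below; `SetCall` is a hypothetical record standing in for whatever the code-model walk produces for each `MemorySegment.set(...)` invoke (this is not the Babylon API):

```java
import java.util.List;

// Hypothetical description of one MemorySegment.set(...) invoke found in a kernel model.
record SetCall(boolean indexIsConstant, boolean guardedBySegmentDependentCondition, String location) {}

// The lint rule: warn on any set(...) whose index is a compile-time constant and which is
// not guarded by a condition derived from data read out of a memory segment.
static void lintConstantIndexWrites(List<SetCall> setCalls) {
    for (SetCall set : setCalls) {
        if (set.indexIsConstant() && !set.guardedBySegmentDependentCondition()) {
            System.out.println("warning: constant index in MemorySegment.set(...) at "
                    + set.location() + " - all work items write the same element (probable race)");
        }
    }
}
```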

## Flipping Generations

Many algorithms require us to process data in generations. Consider
convolutions or Game Of Life style problems, where we have an image or game state and
we need to calculate the result of applying rules to the cells of that image or game.

It is important that when we process the next generation (whether in parallel or sequentially) we
only use previous generation data to generate next generation data.

```
[ ][ ][*][ ][ ]       [ ][ ][ ][ ][ ]
[ ][ ][*][ ][ ]       [ ][*][*][*][ ]
[ ][ ][*][ ][ ]   ->  [ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ]       [ ][ ][ ][ ][ ]

```

This usually requires us to hold two copies, applying the kernel to one (the input)
while writing to the other (the output).

In the case of the Game Of Life we may well use the output as the next input...

```java
@CodeReflection void conway(Accelerator.NDRange ndrange,
                            MemorySegment in, MemorySegment out, int width, int height) {
    int cx = ndrange.id.x % ndrange.id.maxx;
    int cy = ndrange.id.x / ndrange.id.maxx;

    int sum = 0;
    for (int dx = -1; dx < 2; dx++) {
        for (int dy = -1; dy < 2; dy++) {
            if (dx != 0 || dy != 0) {
                int x = cx + dx;
                int y = cy + dy;
                if (x >= 0 && x < width && y >= 0 && y < height) {
                    sum += in.get(JAVA_INT, y * width + x);
                }
            }
        }
    }
    int result = GOLRules(sum, in.get(JAVA_INT, ndrange.id.x));
    out.set(JAVA_INT, ndrange.id.x, result);
}
```
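
The kernel above calls a `GOLRules` helper that is not shown in the snippet. A plausible sketch, assuming the standard Conway rules with cells encoded as 1 (alive) and 0 (dead):

```java
// Hypothetical helper assumed by the kernel above: standard Conway rules.
// 'liveNeighbours' is the sum computed in the kernel, 'self' is the cell's current value.
static int GOLRules(int liveNeighbours, int self) {
    if (self == 1) {
        return (liveNeighbours == 2 || liveNeighbours == 3) ? 1 : 0; // survival
    }
    return (liveNeighbours == 3) ? 1 : 0;                            // birth
}
```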

In this case the assumption is that the compute layer will swap the buffers on alternate passes:

```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState,
             int width, int height, int maxGenerations) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);
    for (int generation = 0; generation < maxGenerations; generation++){
        MemorySegment from = generation%2==0?s1:s2;
        MemorySegment to = generation%2==1?s1:s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
    }
    if (maxGenerations%2==1){ // the final 'to' was s2
        gameState.copyFrom(s2);
    }
}
```

This common pattern includes some aliasing of MemorySegments that we need to untangle.

HAT needs to be able to track the aliases to determine the minimal number of copies.
```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height, int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++){
        MemorySegment from = generation%2==0?s1:s2;
        MemorySegment to = generation%2==1?s1:s2;
        if (generation == 0) {               // injected
            accelerator.copyToDevice(from);  // injected
        }                                    // injected
        accelerator.run(Compute::conway, range, from, to, width, height);
        if (generation == maxGenerations-1){ // injected
            accelerator.copyFromDevice(to);  // injected
        }                                    // injected
    }
    if (maxGenerations%2==1){ // the final 'to' was s2
        gameState.copyFrom(s2);
    }
}
```
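
To place those injected copies correctly, the `from`/`to` locals have to be resolved back to the segments they may refer to (`s1`, `s2`, and ultimately `gameState`). A minimal sketch of that bookkeeping, tracking names rather than code-model values and using purely illustrative types:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical alias table: each local MemorySegment name maps to the set of root
// segments it may refer to, so copies are planned against roots, not aliases.
final class SegmentAliases {
    private final Map<String, Set<String>> mayAlias = new HashMap<>();

    void root(String name)                   { mayAlias.put(name, Set.of(name)); }
    void assign(String alias, String source) { mayAlias.put(alias, rootsOf(source)); }

    // e.g. the ternary 'from = generation%2==0 ? s1 : s2' merges both branches
    void select(String alias, String a, String b) {
        Set<String> merged = new HashSet<>(rootsOf(a));
        merged.addAll(rootsOf(b));
        mayAlias.put(alias, merged);
    }

    Set<String> rootsOf(String name) { return mayAlias.getOrDefault(name, Set.of(name)); }
}
```

For the loop above, the roots of both `from` and `to` would be `{s1, s2}`, which is why the injected copies must be expressed in terms of `s1`/`s2` (or guarded by the generation count) rather than the aliases themselves.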

If the Java code needs to look at the game state each generation (for example to display it), the copy back from the device must happen inside the loop rather than only after the final generation:

```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height,
             int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++){
        MemorySegment from = generation%2==0?s1:s2;
        MemorySegment to = generation%2==1?s1:s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
        displaySAM.display(to, width, height);
    }
    if (maxGenerations%2==1){ // the final 'to' was s2
        gameState.copyFrom(s2);
    }
}
```


### A Babylon transform to track buffer mutations

One goal of HAT is to automate the movement of buffers between Java and the device.

One strategy employed by `NativeBackends` might be to track 'ifaceMappedSegment' accesses and inject tracking calls into the compute method.

Here is a transformation that does this:

```java
 static FuncOpWrapper injectBufferTracking(ComputeClosure.ResolvedMethodCall resolvedMethodCall) {
        FuncOpWrapper original = resolvedMethodCall.funcOpWrapper();
        var transformed = original.transformInvokes((builder, invoke) -> {
                    if (invoke.isIfaceBufferMethod()) { // void array(long idx, T value) or T array(long idx)
                        // Get the first parameter (computeClosure)
                        CopyContext cc = builder.context();
                        Value computeClosure = cc.getValue(original.parameter(0));
                        // Get the buffer receiver value in the output model
                        Value receiver = cc.getValue(invoke.operand(0)); // The buffer we are mutating or accessing
                        if (invoke.isIfaceMutator()) {
                            // inject CLWrapComputeContext.preMutate(buffer);
                            builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_MUTATE, computeClosure, receiver));
                            builder.op(invoke.op());
                            // inject CLWrapComputeContext.postMutate(buffer);
                            builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_MUTATE, computeClosure, receiver));
                        } else if ( invoke.isIfaceAccessor()) {
                            // inject CLWrapComputeContext.preAccess(buffer);
                            builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_ACCESS, computeClosure, receiver));
                            builder.op(invoke.op());
                            // inject CLWrapComputeContext.postAccess(buffer);
                            builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_ACCESS, computeClosure, receiver));
                        } else {
                            builder.op(invoke.op());
                        }
                    }else{
                        builder.op(invoke.op());
                    }
                    return builder;
                }
        );
        transformed.op().writeTo(System.out);
        resolvedMethodCall.funcOpWrapper(transformed);
        return transformed;
    }
```

So in our `OpenCLBackend`, for example:
```java
    public void mutateIfNeeded(ComputeClosure.MethodCall methodCall) {
       injectBufferTracking(entrypoint);
    }

    @Override
    public void computeContextClosed(ComputeContext CLWrapComputeContext){
        var codeBuilder = new OpenCLKernelBuilder();
        C99Code kernelCode = createKernelCode(CLWrapComputeContext, codeBuilder);
        System.out.println(codeBuilder);
    }
```
I hacked the Mandel example so that the compute accesses and mutates its arrays.

```java
  @CodeReflection
    static float doubleit(float f) {
        return f * 2;
    }

    @CodeReflection
    static float scaleUp(float f) {
        return doubleit(f);
    }

    @CodeReflection
    static public void compute(final ComputeContext CLWrapComputeContext, S32Array2D s32Array2D, float x, float y, float scale) {
        scale = scaleUp(scale);
        var range = CLWrapComputeContext.accelerator.range(s32Array2D.size());
        int i = s32Array2D.get(10,10);
        s32Array2D.set(10,10,i);
        CLWrapComputeContext.dispatchKernel(MandelCompute::kernel, range, s32Array2D, pallette, x, y, scale);
    }
```
So here is the transformation being applied to the above compute.

BEFORE (note the !'s indicating accesses through iface buffers)
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    %27 : hat.ComputeContext = var.load %5;
    ...
```
AFTER
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
    invoke %0 %14 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    invoke %0 %14 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
    invoke %0 %18 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    invoke %0 %18 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
    invoke %0 %23 @"hat.ComputeClosure::preMutate(hat.buffer.Buffer)void";
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    invoke %0 %23 @"hat.ComputeClosure::postMutate(hat.buffer.Buffer)void";
    %27 : hat.ComputeContext = var.load %5;
    ...
```
And here, at runtime, the ComputeClosure reports accesses via the injected calls when executing through the interpreter:

```
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
```
## Why inject this info?
The idea is that the ComputeContext maintains sets of dirty buffers: one `gpuDirty` set and one `javaDirty` set.

We have the code models for the kernels, so we know whether each kernel accesses, mutates, or both accesses and mutates each particular parameter.

When the ComputeContext receives `preAccess(x)` or `preMutate(x)` it determines whether `x` is in the `gpuDirty` set.
If so, it delegates to the backend to copy the data back from the device into the memory segment (assuming the memory is not coherent!),
removes the buffer from the `gpuDirty` set, and returns.

Now the Java access to the segment sees the latest data.

After `postMutate(x)` the buffer is placed in the `javaDirty` set.

When a kernel dispatch comes along, the parameters to the kernel are all checked against the `javaDirty` set.
If a parameter is accessed by the kernel, the backend copies the segment to the device, removes the parameter
from the `javaDirty` set, and then invokes the kernel.
When the kernel completes (let's assume synchronous execution for a moment) all parameters are checked again, and each parameter
known to be mutated by the kernel is added to the `gpuDirty` set.

This way we don't have to force the developer to request data movements.

BTW if kernel dispatches are async ;) then the ComputeContext also maintains a map from buffer to kernel, so `preAccess(x)` or `preMutate(x)` calls
can wait for the kernel that is due to 'dirty' the buffer to complete.
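
A minimal sketch of this bookkeeping, assuming synchronous dispatch; `Backend` and `KernelAccessInfo` here are illustrative stand-ins, not the real HAT types:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the dirty-set protocol described above (synchronous dispatch assumed).
final class DirtyTracking {
    interface Backend {
        void copyToDevice(Object buffer);
        void copyFromDevice(Object buffer);
        void run(Object... args);
    }
    // per-parameter read/write knowledge extracted from the kernel's code model
    interface KernelAccessInfo {
        boolean kernelReads(int paramIndex);
        boolean kernelWrites(int paramIndex);
    }

    private final Set<Object> gpuDirty  = new HashSet<>();
    private final Set<Object> javaDirty = new HashSet<>();

    // called from the injected preAccess/preMutate sites
    void preAccessOrMutate(Backend backend, Object buffer) {
        if (gpuDirty.remove(buffer)) {
            backend.copyFromDevice(buffer);     // Java is about to see stale data, so fetch it
        }
    }

    // called from the injected postMutate site
    void postMutate(Object buffer) {
        javaDirty.add(buffer);
    }

    // wraps a kernel dispatch
    void dispatch(Backend backend, KernelAccessInfo kernel, Object... args) {
        for (int i = 0; i < args.length; i++) {
            if (kernel.kernelReads(i) && javaDirty.remove(args[i])) {
                backend.copyToDevice(args[i]);  // kernel will read it and Java changed it
            }
        }
        backend.run(args);
        for (int i = 0; i < args.length; i++) {
            if (kernel.kernelWrites(i)) {
                gpuDirty.add(args[i]);          // the device now holds the newest copy
            }
        }
    }
}
```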

### Marking hat buffers directly