# Compute Analysis or Runtime tracing

----

* [Contents](hat-00.md)
* House Keeping
    * [Project Layout](hat-01-01-project-layout.md)
    * [Building Babylon](hat-01-02-building-babylon.md)
    * [Building HAT](hat-01-03-building-hat.md)
* Programming Model
    * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)

----

# Compute Analysis or Runtime tracing

HAT does not dictate how a backend chooses to optimize execution, but it does
provide the tools (Babylon's Code Models) and some helpers which the backend is
encouraged to use.

The ComputeContext contains all the information that the backend needs, but does not
include any 'policy' for minimizing data movement.

Our assumption is that the backend can use various tools to deduce the most efficient execution strategy.

## Some possible strategies

### Copy data every time 'just in case' (JIC execution ;) )
Just naively execute the code as described in the compute graph: the backend copies each buffer to the device, executes the kernel and copies the data back again.

### Use kernel knowledge to minimise data movement
Execute the code described in the compute graph, but use knowledge extracted from the kernel models
to only copy to the device buffers that the kernel is going to read, and only copy back from the device
buffers that the kernel has written to.

### Use Compute knowledge and kernel knowledge to further minimise data movement
Use knowledge extracted from the compute reachable graph and the kernel
graphs to determine whether Java has mutated buffers between kernel dispatches,
and only copy data to the device that we know the Java code has mutated.

This last strategy is ideal.

We can achieve this using static analysis of the compute and kernel models, or by being
involved in the execution process at runtime.

#### Static analysis

#### Runtime Tracking

1. We 'close over' the call/dispatch graph from the entrypoint, collecting the kernels reachable from the entrypoint and all methods transitively reachable from the entrypoint or from those kernels.
2. We essentially end up with a graph of code models 'rooted' at the entrypoint.
3. For each kernel we also determine how the kernel accesses its `MemorySegment` parameters; for each `MemorySegment` parameter we record in a side table whether the kernel reads from or writes to the segment.

The resulting 'ComputeClosure' (a tree of code models and the relevant side tables) is made available to the accelerator to coordinate execution.
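
For illustration, such a side table could be as simple as a map from `MemorySegment` parameter position to an access kind. The types and names below are a hypothetical sketch (using the two kernels discussed later in this document), not HAT's actual API.

```java
import java.util.Map;

// Hypothetical sketch of the per-kernel side table; these names are illustrative only.
public class KernelSideTables {
    enum SegmentAccess { READ, WRITE, READ_WRITE }

    // kernel -> (MemorySegment parameter position -> how the kernel uses it)
    static final Map<String, Map<Integer, SegmentAccess>> SIDE_TABLE = Map.of(
            "Compute::kernel",   Map.of(1, SegmentAccess.WRITE),      // only calls set()
            "Compute::doubleup", Map.of(1, SegmentAccess.READ_WRITE)  // calls get() then set()
    );

    static boolean kernelReads(String kernel, int param) {
        SegmentAccess access = SIDE_TABLE.get(kernel).get(param);
        return access == SegmentAccess.READ || access == SegmentAccess.READ_WRITE;
    }

    static boolean kernelWrites(String kernel, int param) {
        SegmentAccess access = SIDE_TABLE.get(kernel).get(param);
        return access == SegmentAccess.WRITE || access == SegmentAccess.READ_WRITE;
    }
}
```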

Note that our very simple Compute::compute method expresses neither the movement of the MemorySegment to the device nor the retrieval of the data from the device when the kernel has executed.

Our assumption is that given the ComputeClosure we can deduce such movements.

There are many ways to achieve this. One way would be static analysis.

Given the Compute::compute entrypoint it is easy to determine that we are always (there are no conditionals or loops) passing (making available
might be a better term) a memory segment to a kernel (Compute::kernel), and that this kernel only mutates the `MemorySegment`.

So from simple static analysis we could choose to inject one or more calls into the model representing the need for the accelerator to move data to the device, and/or back from the device after the kernel dispatch.

This modified model would look as if we had presented it with this code.

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.run(Compute::kernel, range, memorySegment);
    accelerator.injectedCopyFromDevice(memorySegment);
}
```

Note the `injectedCopyFromDevice()` call.

Because the kernel does not read the `MemorySegment` we need only inject the code to request a move back from the device.

Doing this requires HAT to analyse the kernel(s) and inject appropriate code into
the Compute::compute method to inform the vendor backend when it should perform such moves.

Another strategy would be to not rely on static analysis, but to inject code to trace 'actual' mutations of the MemorySegments and use these flags to guard against unnecessary copies.

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    boolean injectedMemorySegmentIsDirty = false;
    Accelerator.Range range = accelerator.range(len);
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyToDevice(memorySegment);
    }
    accelerator.run(Compute::kernel, range, memorySegment);
    injectedMemorySegmentIsDirty = true; // based on Compute::kernel side table
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyFromDevice(memorySegment);
    }
}
```

Whether this code mutation generates Java bytecode that executes (or is interpreted) on the JVM, or whether the
code models for the closure are handed over to a backend which reifies the kernel code and the
dispatch logic, is not defined.

The code model for the compute method will be mutated to inject the appropriate nodes to achieve this goal.

It is possible that some vendors may just take the original code model and perform the analysis themselves.

Clearly this is a trivial compute closure. Let's discuss the required kernel analysis
and proposed pseudo code.

## Copying data based on kernel MemorySegment analysis

Above we showed that we should be able to determine whether a kernel mutates or accesses any of
its `MemorySegment` parameters.

We determined above that the kernel only called set(), so we need
not copy the data to the device.

The following example shows a kernel which reads and mutates a `MemorySegment`.
```java
static class Compute {
    @CodeReflection public static
    void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        int temp = memorySegment.get(JAVA_INT, ndrange.id.x);
        memorySegment.set(JAVA_INT, ndrange.id.x, temp * 2);
    }

    @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```
Here our analysis needs to determine that the kernel reads and writes the segment (it does),
so the generated compute model would equate to

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    accelerator.run(Compute::doubleup, range, memorySegment);
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```
So far the deductions are fairly trivial.

Consider
```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```

Here HAT should deduce that the Java side is merely looping over the kernel dispatch
and has no interest in the `MemorySegment` between dispatches.

So the new model need only copy in once (before the first kernel) and out once (prior to return).

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```

Things get slightly more interesting when we do indeed access the memory segment
from the Java code inside the loop.

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```
Now we expect Babylon to inject a read inside the loop to make the data available Java side.

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        accelerator.copyFromDevice(memorySegment); // injected via Babylon
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Note that in this case we are only accessing the 0th int from the segment, so a possible
optimization might be to allow the vendor to only copy back this one element....
```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        if (i + 1 == count) {                             // injected
            accelerator.copyFromDevice(memorySegment);    // injected
        } else {
            accelerator.copyFromDevice(memorySegment, 1); // injected via Babylon
        }
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Again HAT will merely mutate the code model of the compute method;
the vendor may choose to interpret the bytecode, generate bytecode and execute it,
or take the complete model and execute it in native code.

So within HAT we must find all set/get calls on MemorySegments and trace them back to kernel parameters.

We should allow aliasing of memory segments... but in the short term we may well throw an exception when we see such aliasing.

```java
@CodeReflection public static
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, ndrange.id.x, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```
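
A minimal sketch of what tracing an alias back to its root parameter could look like, over a simplified name-to-name view of assignments; the `resolve` helper and the map are illustrative only, a real implementation would walk Babylon's code model values rather than strings.

```java
import java.util.Map;

// Hypothetical sketch: follow chains of 'a = b' style copies back to the root name.
public class AliasTracking {
    static String resolve(Map<String, String> assignedFrom, String name) {
        while (assignedFrom.containsKey(name)) {
            name = assignedFrom.get(name); // follow 'alias = memorySegment' style copies
        }
        return name;                       // the root, e.g. a kernel parameter
    }

    public static void main(String[] args) {
        // models 'MemorySegment alias = memorySegment;' from the doubleup example above
        Map<String, String> assignedFrom = Map.of("alias", "memorySegment");
        // a set()/get() seen on 'alias' is attributed to the 'memorySegment' parameter
        System.out.println(resolve(assignedFrom, "alias")); // prints memorySegment
    }
}
```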

## Weed warning #1

We could find common kernel errors during this analysis.

This code is probably wrong, as it racily writes to the 0th element.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```

By allowing a 'lint'-like plugin mechanism for the code model this would be easy to find.
If we ever find a constant index in `set(layout, <constant>, value)` we are probably in a world of hurt,
unless the set is inside some conditional which is itself dependent on a value extracted from a memory segment.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    if (????) {
        alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
    }
}
```

There are a lot of opportunities for catching such bugs.
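
As a sketch of such a check, here is a standalone version that works over a simplified record of kernel set() calls; the `SetCall` record and everything around it is hypothetical, not part of HAT or Babylon.

```java
import java.util.List;

// Hypothetical lint rule: flag set() calls that write a constant index from every work item,
// unless the write is guarded by a branch that depends on data read from a memory segment.
public class ConstantIndexLint {
    record SetCall(String kernel, boolean indexIsConstant, boolean guardedByDataDependentBranch) {}

    static void check(List<SetCall> calls) {
        for (SetCall call : calls) {
            if (call.indexIsConstant() && !call.guardedByDataDependentBranch()) {
                System.out.println("warning: " + call.kernel()
                        + " writes a constant index from every work item (possible race)");
            }
        }
    }

    public static void main(String[] args) {
        check(List.of(
                new SetCall("doubleup", true, false),    // flagged
                new SetCall("doubleup", true, true),     // allowed: guarded by data-dependent if
                new SetCall("doubleup", false, false))); // normal per-work-item write
    }
}
```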


## Flipping Generations

Many algorithms require us to process data in generations. Consider
convolutions or Game Of Life style problems where we have an image or game state and
we need to calculate the result of applying rules to the cells in the image or game.

When we process the next generation (either in parallel or sequentially) we
must ensure that we only use previous-generation data to generate next-generation data.

```
[ ][ ][*][ ][ ]       [ ][ ][ ][ ][ ]
[ ][ ][*][ ][ ]       [ ][*][*][*][ ]
[ ][ ][*][ ][ ]   ->  [ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ]       [ ][ ][ ][ ][ ]

```

This usually requires us to hold two copies, applying the kernel to one input set
and writing to the output.

In the case of the Game Of Life we may well use the output as the next input...

```java
@CodeReflection void conway(Accelerator.NDRange ndrange,
                            MemorySegment in, MemorySegment out, int width, int height) {
    int cx = ndrange.id.x % ndrange.id.maxx;
    int cy = ndrange.id.x / ndrange.id.maxx;

    int sum = 0;
    for (int dx = -1; dx < 2; dx++) {
        for (int dy = -1; dy < 2; dy++) {
            if (dx != 0 || dy != 0) {
                int x = cx + dx;
                int y = cy + dy;
                if (x >= 0 && x < width && y >= 0 && y < height) {
                    sum += in.get(JAVA_INT, y * width + x);
                }
            }
        }
    }
    int result = GOLRules(sum, in.get(JAVA_INT, ndrange.id.x));
    out.set(JAVA_INT, ndrange.id.x, result);
}
```
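
The kernel above assumes a `GOLRules` helper which is not shown; a minimal sketch, assuming cells are stored as 1 (alive) or 0 (dead), might be:

```java
// Hypothetical helper: standard Conway rules applied to a neighbour count and the cell itself.
@CodeReflection
static int GOLRules(int neighbourSum, int self) {
    if (self == 1) {
        return (neighbourSum == 2 || neighbourSum == 3) ? 1 : 0; // survives with 2 or 3 neighbours
    } else {
        return (neighbourSum == 3) ? 1 : 0;                      // birth with exactly 3 neighbours
    }
}
```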

In this case the assumption is that the compute layer will swap the buffers on alternate passes.

```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState,
             int width, int height, int maxGenerations) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height); // one work item per cell
    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to   = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
    }
    if (maxGenerations % 2 == 1) { // the final state ended up in s2
        gameState.copyFrom(s2);
    }
}
```
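
For example, with maxGenerations = 3 the kernel writes s2, then s1, then s2 again, so the final state is in s2 and must be copied back into gameState; with maxGenerations = 4 the final write lands in s1, which is gameState, and no copy is needed.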

This common pattern includes some aliasing of MemorySegments that we need to untangle.

HAT needs to be able to track the aliases to determine the minimal number of copies.
```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height, int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height); // one work item per cell

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to   = generation % 2 == 1 ? s1 : s2;
        if (generation == 0) {                   // injected
            accelerator.copyToDevice(from);      // injected
        }                                        // injected
        accelerator.run(Compute::conway, range, from, to, width, height);
        if (generation == maxGenerations - 1) {  // injected
            accelerator.copyFromDevice(to);      // injected
        }                                        // injected
    }
    if (maxGenerations % 2 == 1) { // the final state ended up in s2
        gameState.copyFrom(s2);
    }
}
```

Compare this with a variant where the Java code accesses the game state every generation (here to display it); in that case HAT would need to inject a copy back from the device inside the loop, as in the earlier println example.

```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height,
             int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height); // one work item per cell

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to   = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
        displaySAM.display(to, width, height);
    }
    if (maxGenerations % 2 == 1) { // the final state ended up in s2
        gameState.copyFrom(s2);
    }
}
```



### Example Babylon transform to track buffer mutations

One goal of HAT is to automate the movement of buffers between Java and the device.

One strategy employed by `NativeBackends` might be to track 'ifaceMappedSegment' accesses and inject tracking calls into the compute method.

Here is a transformation for that:

```java
static FuncOpWrapper injectBufferTracking(ComputeClosure.ResolvedMethodCall resolvedMethodCall) {
    FuncOpWrapper original = resolvedMethodCall.funcOpWrapper();
    var transformed = original.transformInvokes((builder, invoke) -> {
                if (invoke.isIfaceBufferMethod()) { // void array(long idx, T value) or T array(long idx)
                    // Get the first parameter (computeClosure)
                    CopyContext cc = builder.context();
                    Value computeClosure = cc.getValue(original.parameter(0));
                    // Get the buffer receiver value in the output model
                    Value receiver = cc.getValue(invoke.operand(0)); // The buffer we are mutating or accessing
                    if (invoke.isIfaceMutator()) {
                        // inject computeContext.preMutate(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_MUTATE, computeClosure, receiver));
                        builder.op(invoke.op());
                        // inject computeContext.postMutate(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_MUTATE, computeClosure, receiver));
                    } else if (invoke.isIfaceAccessor()) {
                        // inject computeContext.preAccess(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_ACCESS, computeClosure, receiver));
                        builder.op(invoke.op());
                        // inject computeContext.postAccess(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_ACCESS, computeClosure, receiver));
                    } else {
                        builder.op(invoke.op());
                    }
                } else {
                    builder.op(invoke.op());
                }
                return builder;
            }
    );
    transformed.op().writeTo(System.out);
    resolvedMethodCall.funcOpWrapper(transformed);
    return transformed;
}
```

So in our `OpenCLBackend` for example
```java
public void mutateIfNeeded(ComputeClosure.MethodCall methodCall) {
    injectBufferTracking(entrypoint);
}

@Override
public void computeContextClosed(ComputeContext computeContext) {
    var codeBuilder = new OpenCLKernelBuilder();
    C99Code kernelCode = createKernelCode(computeContext, codeBuilder);
    System.out.println(codeBuilder);
}
```
I hacked the Mandel example so that the compute method accessed and mutated its arrays.

```java
@CodeReflection
static float doubleit(float f) {
    return f * 2;
}

@CodeReflection
static float scaleUp(float f) {
    return doubleit(f);
}

@CodeReflection
static public void compute(final ComputeContext computeContext, S32Array2D s32Array2D, float x, float y, float scale) {
    scale = scaleUp(scale);
    var range = computeContext.accelerator.range(s32Array2D.size());
    int i = s32Array2D.get(10, 10);
    s32Array2D.set(10, 10, i);
    computeContext.dispatchKernel(MandelCompute::kernel, range, s32Array2D, pallette, x, y, scale);
}
```
So here is the transformation being applied to the above compute method.

BEFORE (note the !'s indicating accesses through iface buffers)
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"computeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    %27 : hat.ComputeContext = var.load %5;
    ...
```
AFTER
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"computeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
    invoke %0 %14 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    invoke %0 %14 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
    invoke %0 %18 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    invoke %0 %18 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
    invoke %0 %23 @"hat.ComputeClosure::preMutate(hat.buffer.Buffer)void";
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    invoke %0 %23 @"hat.ComputeClosure::postMutate(hat.buffer.Buffer)void";
    %27 : hat.ComputeContext = var.load %5;
```
And here at runtime, executing via the interpreter, the ComputeClosure reports accesses through the injected calls.

```
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
```
## Why inject this info?

The idea is that the ComputeContext would maintain sets of dirty buffers, one set for `gpuDirty` and one set for `javaDirty`.

We have the code models for the kernels, so we know whether each kernel accesses, mutates, or accesses AND mutates each particular parameter.

So when the ComputeContext receives `preAccess(x)` or `preMutate(x)` it would determine if `x` is in the `gpuDirty` set.
If so it would delegate to the backend to copy the data back from the device into the memory segment (assuming the memory is not coherent!)
before removing the buffer from the `gpuDirty` set and returning.

Now the Java access to the segment sees the latest buffer.

After `postMutate(x)` it will place the buffer in the `javaDirty` set.

When a kernel dispatch comes along, the parameters to the kernel are all checked against the `javaDirty` set.
If a parameter is 'accessed' by the kernel, the backend will copy the segment to the device and remove the parameter
from the `javaDirty` set before invoking the kernel.
When the kernel completes (let's assume synchronous dispatch for a moment) all parameters are checked again, and if a parameter
is known to be mutated by the kernel it is added to the `gpuDirty` set.

This way we don't have to force the developer to request data movements.

BTW if kernel dispatches are async ;) then the ComputeContext maintains a map of buffer to kernel, so `preAccess(x)` or `preMutate(x)` calls
can wait for the kernel that is due to 'dirty' the buffer to complete.
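
A minimal sketch of this bookkeeping, assuming a synchronous backend; the `Backend` interface and the method names here are illustrative, not HAT's real API.

```java
import java.lang.foreign.MemorySegment;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the gpuDirty/javaDirty bookkeeping described above.
class DirtyTrackingContext {
    interface Backend {
        void copyToDevice(MemorySegment segment);
        void copyFromDevice(MemorySegment segment);
        void dispatch(Object kernel, MemorySegment... args);
    }

    private final Backend backend;
    private final Set<MemorySegment> gpuDirty = new HashSet<>();
    private final Set<MemorySegment> javaDirty = new HashSet<>();

    DirtyTrackingContext(Backend backend) { this.backend = backend; }

    // injected before Java reads or writes a segment
    void preAccess(MemorySegment s)  { syncFromDeviceIfNeeded(s); }
    void preMutate(MemorySegment s)  { syncFromDeviceIfNeeded(s); }

    // injected after Java writes a segment
    void postMutate(MemorySegment s) { javaDirty.add(s); }

    private void syncFromDeviceIfNeeded(MemorySegment s) {
        if (gpuDirty.remove(s)) {        // the device holds the latest data
            backend.copyFromDevice(s);   // so fetch it before Java touches it
        }
    }

    // kernelReads/kernelWrites would come from the kernel's side table
    void dispatch(Object kernel, boolean kernelReads, boolean kernelWrites, MemorySegment s) {
        if (kernelReads && javaDirty.remove(s)) {
            backend.copyToDevice(s);     // Java holds the latest data
        }
        backend.dispatch(kernel, s);     // assume synchronous dispatch for this sketch
        if (kernelWrites) {
            gpuDirty.add(s);             // the device now holds the latest data
        }
    }
}
```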

### Marking HAT buffers directly