New hat/docs/Implementation/kernel-analysis.md

# Compute Analysis or Runtime tracing

HAT does not dictate how a backend chooses to optimize execution, but it does
provide the tools (Babylon's Code Models) and some helpers which the backend is encouraged
to use.

The ComputeContext contains all the information that the backend needs, but does not
include any 'policy' for minimizing data movements.

Our assumption is that the backend can use various tools to deduce the most efficient execution strategy.

## Some possible strategies

### Copy data every time 'just in case' (JIC execution ;) )
Just naively execute the code as described in the compute graph. The backend copies each buffer to the device, executes the kernel and copies the data back again.

### Use kernel knowledge to minimise data movement
Execute the code described in the compute graph, but use knowledge extracted from the kernel models
to copy to the device only those buffers that the kernel is going to read, and to copy back from the device
only those buffers that the kernel has written to.

### Use Compute knowledge and kernel knowledge to further minimise data movement
Use knowledge extracted from the compute reachable graph and the kernel
graphs to determine whether Java has mutated buffers between kernel dispatches,
and only copy to the device data that we know the Java code has mutated.

This last strategy is ideal.

We can achieve this using static analysis of the compute and kernel models or by being
involved in the execution process at runtime.

#### Static analysis

#### Runtime Tracking

* Dynamical
1. We 'close over' the call/dispatch graph from the entrypoint to all kernels, collecting the kernels reachable from the entrypoint and all methods transitively reachable from those kernels.
2. We essentially end up with a graph of code models 'rooted' at the entrypoint.
3. For each kernel we also determine how it accesses its `MemorySegment` parameters. For each such parameter we record in a side table whether the kernel reads from or writes to the segment.

This resulting 'ComputeClosure' (tree of code models and relevant side tables) is made available to the accelerator to coordinate execution.
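
The side table from step 3 can be pictured with a minimal sketch. All names here (`KernelAccessTable`, `record`, `needsCopyToDevice`, `needsCopyFromDevice`) are hypothetical and are not part of HAT. The sketch only shows how recorded read/write facts could drive copy decisions.

```java
import java.util.EnumSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not the HAT API.
class KernelAccessTable {
    enum Access { READ, WRITE }

    // Side table: kernel parameter name -> the ways the kernel touches it.
    private final Map<String, Set<Access>> table = new LinkedHashMap<>();

    // Called by the (assumed) kernel analysis for each get/set it finds.
    void record(String parameter, Access access) {
        table.computeIfAbsent(parameter, k -> EnumSet.noneOf(Access.class)).add(access);
    }

    // The segment must be copied to the device only if the kernel reads it.
    boolean needsCopyToDevice(String parameter) {
        return table.getOrDefault(parameter, Set.of()).contains(Access.READ);
    }

    // The segment must be copied back only if the kernel writes it.
    boolean needsCopyFromDevice(String parameter) {
        return table.getOrDefault(parameter, Set.of()).contains(Access.WRITE);
    }
}
```

For a kernel that only calls `set()` on its segment, such a table would report that no copy to the device is needed but a copy back is.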

Note that our very simple Compute::compute method expresses neither the movement of the MemorySegment to a device nor the retrieval of the data from the device once the kernel has executed.

Our assumption is that given the ComputeClosure we can deduce such movements.

There are many ways to achieve this. One way would be by static analysis.

Given the Compute::compute entrypoint it is easy to determine that we are always (there are no conditionals or loops) passing
(making available might be a better term) a memory segment to a kernel (Compute::kernel), and that this kernel only mutates the `MemorySegment`.

So from simple static analysis we could choose to inject one or more calls into the model, representing the need for the accelerator to move data to the device before the kernel dispatch and/or back from the device after it.

The modified model would look as if we had presented HAT with this code:

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.run(Compute::kernel, range, memorySegment);
    accelerator.injectedCopyFromDevice(memorySegment);
}
```

Note the `injectedCopyFromDevice()` call.

Because the kernel does not read the `MemorySegment`, we need only inject the code to request a move back from the device.

To do this requires HAT to analyse the kernel(s) and inject appropriate code into
the Compute::compute method to inform the vendor backend when it should perform such moves.

Another strategy would be to not rely on static analysis, but to inject code that traces 'actual' mutations of the MemorySegments and to use these flags to guard against unnecessary copies:

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    boolean injectedMemorySegmentIsDirty = false;
    Accelerator.Range range = accelerator.range(len);
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyToDevice(memorySegment);
    }
    accelerator.run(Compute::kernel, range, memorySegment);
    injectedMemorySegmentIsDirty = true; // based on Compute::kernel sidetable
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyFromDevice(memorySegment);
    }
}
```

Whether this code mutation generates Java bytecode and executes (or interprets) it on the JVM, or whether the
CodeModels for the closure are handed over to a backend which reifies the kernel code and the
dispatch logic, is not defined.

The code model for the compute will be mutated to inject the appropriate nodes to achieve the goal.

It is possible that some vendors may just take the original code model and analyse it themselves.

Clearly this is a trivial compute closure. Let's discuss the required kernel analysis
and proposed pseudo code.

## Copying data based on kernel MemorySegment analysis

Above we showed that we should be able to determine whether a kernel mutates or accesses any of
its `MemorySegment` parameters.

We determined above that the kernel only called set(), so we need
not copy the data to the device.

The following example shows a kernel which both reads and mutates a `MemorySegment`:
```java
static class Compute {
    @Reflect
    public static void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        int temp = memorySegment.get(JAVA_INT, ndrange.id.x);
        memorySegment.set(JAVA_INT, ndrange.id.x, temp * 2); // write back to the same element
    }

    @Reflect
    public static void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```
Here our analysis needs to determine that the kernel both reads from and writes to the segment (it does),
so the generated compute model would equate to:

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    accelerator.run(Compute::doubleup, range, memorySegment);
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```
So far the deductions are fairly trivial.

Consider:
```java
@Reflect
public static void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```

Here HAT should deduce that the Java side is merely looping over the kernel dispatch
and has no interest in the memory segment between dispatches.

So the new model need only copy in once (before the first kernel dispatch) and out once (prior to return).

```java
@Reflect
public static void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```

Things get slightly more interesting when we do indeed access the memory segment
from the Java code inside the loop.

```java
@Reflect
public static void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        int slot0 = memorySegment.get(INTVALUE, 0);
        System.out.println("slot0 " + slot0);
    }
}
```
Now we expect Babylon to inject a read inside the loop to make the data available on the Java side.

```java
@Reflect
public static void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        accelerator.copyFromDevice(memorySegment); // injected via Babylon
        int slot0 = memorySegment.get(INTVALUE, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Note that in this case we are only accessing the 0th int from the segment, so a possible
optimization might be to allow the vendor to copy back only this one element, except on the final pass, where the whole segment is copied back.
```java
@Reflect
public static void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment); // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        if (i + 1 == count) { // injected
            accelerator.copyFromDevice(memorySegment); // injected: last pass, copy everything
        } else {
            accelerator.copyFromDevice(memorySegment, 1); // injected via Babylon: copy one element
        }
        int slot0 = memorySegment.get(INTVALUE, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Again HAT will merely mutate the code model of the compute method.
The vendor may choose to interpret bytecode, generate bytecode and execute it,
or take complete control and execute the model in native code.

So within HAT we must find all set/get calls on MemorySegments and trace them back to kernel parameters.

We should allow aliasing of memory segments, but in the short term we may well throw an exception when we see such aliasing.


```java
@Reflect
public static void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, ndrange.id.x, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```
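
One way to picture tracing an alias back to a kernel parameter is the sketch below. It assumes a much simplified view in which each local is assigned once from another name. `AliasResolver` and its methods are hypothetical and do not reflect how HAT or Babylon represent values.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; real code models track values, not names.
class AliasResolver {
    // Local variable name -> the name it was assigned from.
    private final Map<String, String> assignedFrom = new HashMap<>();

    void recordAssignment(String target, String source) {
        assignedFrom.put(target, source);
    }

    // Follow the assignment chain until we reach a name that was never
    // assigned from another name, which we take to be a kernel parameter.
    String rootOf(String name) {
        String current = name;
        int hops = 0;
        while (assignedFrom.containsKey(current)) {
            current = assignedFrom.get(current);
            if (++hops > assignedFrom.size()) {
                throw new IllegalStateException("cyclic alias chain starting at " + name);
            }
        }
        return current;
    }
}
```

This handles only straight-line aliasing. An alias chosen by a conditional would need a set of possible roots rather than a single one.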

## Weed warning #1

We could find common kernel errors when analyzing.

This code is probably wrong, as it is racy: every work item writes to the 0th element.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```

With a 'lint'-like plugin mechanism for code models, such errors would be easy to find.
If we ever find a constant index in set(...., <constant> ) we are probably in a world of hurt,
unless the set is enclosed in some conditional which is itself dependent on a value extracted from a memory segment.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    if (????) {
        alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
    }
}
```

There are a lot of opportunities for catching such bugs.
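
A lint rule of this kind might look like the sketch below. It works over a toy description of stores rather than a real code model. `ConstantIndexLint`, `Store` and its fields are hypothetical names introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; a real rule would walk the kernel's code model.
class ConstantIndexLint {
    // Toy stand-in for a set(...) call found in a kernel.
    record Store(String segment, boolean indexIsConstant, boolean guardedByCondition) {}

    // Warn about stores to a constant index that are not enclosed in a conditional.
    static List<String> check(List<Store> stores) {
        List<String> warnings = new ArrayList<>();
        for (Store s : stores) {
            if (s.indexIsConstant() && !s.guardedByCondition()) {
                warnings.add("possible race: unguarded store to a constant index of " + s.segment());
            }
        }
        return warnings;
    }
}
```

The rule is deliberately conservative in one direction only: a guarded store is not reported, even though the guard may not actually prevent the race.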


## Flipping Generations

Many algorithms require us to process data in generations. Consider
convolutions or Game Of Life style problems, where we have an image or game state and
we need to calculate the result of applying rules to cells in the image or game.

When we process the next generation (either in parallel or sequentially) we
must ensure that we use only previous-generation data to generate next-generation data.

```
[ ][ ][*][ ][ ]       [ ][ ][ ][ ][ ]
[ ][ ][*][ ][ ]       [ ][*][*][*][ ]
[ ][ ][*][ ][ ]   ->  [ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ]       [ ][ ][ ][ ][ ]

```

This usually requires us to hold two copies, applying the kernel to the input copy
and writing the results to the output copy.
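
The two-copy idea can be shown in plain Java, without any accelerator. The sketch below uses `int[]` arrays in place of memory segments and applies the standard Game Of Life rules. The class and method names are hypothetical.

```java
// Hypothetical illustration only: plain Java arrays stand in for device buffers.
class LifeDoubleBuffer {
    // Computes one generation, reading only from 'in' and writing only to 'out'.
    static void step(int[] in, int[] out, int width, int height) {
        for (int cy = 0; cy < height; cy++) {
            for (int cx = 0; cx < width; cx++) {
                int sum = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        int x = cx + dx;
                        int y = cy + dy;
                        boolean self = dx == 0 && dy == 0;
                        if (!self && x >= 0 && x < width && y >= 0 && y < height) {
                            sum += in[y * width + x];
                        }
                    }
                }
                boolean alive = in[cy * width + cx] == 1;
                out[cy * width + cx] = (sum == 3 || (alive && sum == 2)) ? 1 : 0;
            }
        }
    }

    // Runs several generations, flipping the roles of the two buffers each pass.
    static int[] run(int[] initial, int width, int height, int generations) {
        int[] from = initial.clone();
        int[] to = new int[initial.length];
        for (int g = 0; g < generations; g++) {
            step(from, to, width, height);
            int[] tmp = from; // flip: this pass's output is the next pass's input
            from = to;
            to = tmp;
        }
        return from;
    }
}
```

Run for one generation on the 5x4 grid drawn above, this turns the vertical bar into the horizontal one. Writing into the same array that is being read would corrupt the neighbour counts of cells processed later.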

In the case of the Game Of Life we may well use the output as the next input:

```java
@Reflect void conway(Accelerator.NDRange ndrange,
                     MemorySegment in, MemorySegment out, int width, int height) {
    // Map the 1D work-item id to 2D cell coordinates.
    int cx = ndrange.id.x % width;
    int cy = ndrange.id.x / width;

    // Count live neighbours, reading only from the previous generation.
    int sum = 0;
    for (int dx = -1; dx < 2; dx++) {
        for (int dy = -1; dy < 2; dy++) {
            if (dx != 0 || dy != 0) {
                int x = cx + dx;
                int y = cy + dy;
                if (x >= 0 && x < width && y >= 0 && y < height) {
                    sum += in.get(INT, y * width + x);
                }
            }
        }
    }
    int result = GOLRules(sum, in.get(INT, ndrange.id.x));
    out.set(INT, ndrange.id.x, result); // write only to the next generation
}
```

In this case the assumption is that the compute layer will swap the buffers for alternate passes.

```java
import java.lang.foreign.MemorySegment;

@Reflect
void compute(Accelerator accelerator, MemorySegment gameState,
             int width, int height, int maxGenerations) {
    Accelerator.Range range = accelerator.range(width * height);
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
    }
    if (maxGenerations % 2 == 1) { // the final pass wrote to s2
        gameState.copyFrom(s2);
    }
}
```

This common pattern includes some aliasing of MemorySegments that we need to untangle.

HAT needs to be able to track the aliases to determine the minimal number of copies.
```java
import java.lang.foreign.MemorySegment;

@Reflect
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height, int maxGenerations,
             DisplaySAM displaySAM) {
    Accelerator.Range range = accelerator.range(width * height);
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 1 ? s1 : s2;
        if (generation == 0) {                 // injected
            accelerator.copyToDevice(from);    // injected
        }                                      // injected
        accelerator.run(Compute::conway, range, from, to, width, height);
        if (generation == maxGenerations - 1) { // injected
            accelerator.copyFromDevice(to);     // injected
        }                                       // injected
    }
    if (maxGenerations % 2 == 1) { // the final pass wrote to s2
        gameState.copyFrom(s2);
    }
}
```

In the following variant the Java side also displays each generation, so the newly written segment has to be visible to Java after every dispatch:

```java
import java.lang.foreign.MemorySegment;

@Reflect
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height,
             int maxGenerations,
             DisplaySAM displaySAM) {
    Accelerator.Range range = accelerator.range(width * height);
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
        displaySAM.display(to, width, height); // show the generation just written
    }
    if (maxGenerations % 2 == 1) { // the final pass wrote to s2
        gameState.copyFrom(s2);
    }
}
```



### Project Babylon transform to track buffer mutations

One goal of HAT is to automate the movement of buffers from Java to the device.

One strategy employed by `NativeBackends` might be to track 'ifaceMappedSegment' accesses and inject tracking data into the compute method.

Here is a transformation for that:

```java
static FuncOpWrapper injectBufferTracking(ComputeClosure.ResolvedMethodCall resolvedMethodCall) {
    FuncOpWrapper original = resolvedMethodCall.funcOpWrapper();
    var transformed = original.transformInvokes((builder, invoke) -> {
        if (invoke.isIfaceBufferMethod()) { // void array(long idx, T value) or T array(long idx)
            // Get the first parameter (computeClosure)
            CopyContext cc = builder.context();
            Value computeClosure = cc.getValue(original.parameter(0));
            // Get the buffer receiver value in the output model
            Value receiver = cc.getValue(invoke.operand(0)); // The buffer we are mutating or accessing
            if (invoke.isIfaceMutator()) {
                // inject CLWrapComputeContext.preMutate(buffer);
                builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_MUTATE, computeClosure, receiver));
                builder.op(invoke.op());
                // inject CLWrapComputeContext.postMutate(buffer);
                builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_MUTATE, computeClosure, receiver));
            } else if (invoke.isIfaceAccessor()) {
                // inject CLWrapComputeContext.preAccess(buffer);
                builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_ACCESS, computeClosure, receiver));
                builder.op(invoke.op());
                // inject CLWrapComputeContext.postAccess(buffer);
                builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_ACCESS, computeClosure, receiver));
            } else {
                builder.op(invoke.op());
            }
        } else {
            builder.op(invoke.op());
        }
        return builder;
    });
    transformed.op().writeTo(System.out);
    resolvedMethodCall.funcOpWrapper(transformed);
    return transformed;
}
```

So in our `OpenCLBackend`, for example:
```java
public void mutateIfNeeded(ComputeClosure.MethodCall methodCall) {
    injectBufferTracking(entrypoint);
}

@Override
public void computeContextClosed(ComputeContext CLWrapComputeContext) {
    var codeBuilder = new OpenCLKernelBuilder();
    C99Code kernelCode = createKernelCode(CLWrapComputeContext, codeBuilder);
    System.out.println(codeBuilder);
}
```
I hacked the Mandel example so that the compute method accesses and mutates its arrays.

```java
@Reflect
static float doubleit(float f) {
    return f * 2;
}

@Reflect
static float scaleUp(float f) {
    return doubleit(f);
}

@Reflect
static public void compute(final ComputeContext CLWrapComputeContext, S32Array2D s32Array2D, float x, float y, float scale) {
    scale = scaleUp(scale);
    var range = CLWrapComputeContext.accelerator.range(s32Array2D.size());
    int i = s32Array2D.get(10, 10);
    s32Array2D.set(10, 10, i);
    CLWrapComputeContext.dispatchKernel(MandelCompute::kernel, range, s32Array2D, pallette, x, y, scale);
}
```
Here is the transformation being applied to the above compute method.

BEFORE (note the !'s indicating accesses through iface buffers)
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    %27 : hat.ComputeContext = var.load %5;
    ...
```
AFTER
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
    invoke %0 %14 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    invoke %0 %14 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
    invoke %0 %18 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    invoke %0 %18 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
    invoke %0 %23 @"hat.ComputeClosure::preMutate(hat.buffer.Buffer)void";
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    invoke %0 %23 @"hat.ComputeClosure::postMutate(hat.buffer.Buffer)void";
    %27 : hat.ComputeContext = var.load %5;
```
And here, at runtime, the ComputeClosure reports the accesses via the injected calls when the transformed model is executed by the interpreter.

```
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
```
## Why inject this info?
The idea is that the ComputeContext would maintain sets of dirty buffers, one set for `gpuDirty` and one set for `javaDirty`.

We have the code models for the kernels, so we know which kernel accesses, mutates, or accesses AND mutates particular parameters.

When the ComputeContext receives `preAccess(x)` or `preMutate(x)` it determines whether `x` is in the `gpuDirty` set.
If so, it delegates to the backend to copy the GPU data back from the device into the memory segment (assuming the memory is not coherent!)
before removing the buffer from the `gpuDirty` set and returning.

Now the Java access to the segment sees the latest data.

After `postMutate(x)` it places the buffer in the `javaDirty` set.

When a kernel dispatch comes along, the parameters to the kernel are all checked against the `javaDirty` set.
If a parameter is 'accessed' by the kernel, the backend copies the segment to the device, removes the parameter
from the `javaDirty` set and then invokes the kernel.
When the kernel completes (let's assume synchronous dispatch for a moment) all parameters are checked again, and if a parameter
is known to be mutated by the kernel it is added to the `gpuDirty` set.

This way we don't have to force the developer to request data movements.
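
The synchronous protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions: `DirtyTracker`, `dispatch` and the `log` list are hypothetical, buffers are compared by identity, and the copies are only logged rather than performed by a backend.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; this is not the HAT ComputeContext.
class DirtyTracker {
    private final Set<Object> gpuDirty = Collections.newSetFromMap(new IdentityHashMap<>());
    private final Set<Object> javaDirty = Collections.newSetFromMap(new IdentityHashMap<>());
    final List<String> log = new ArrayList<>(); // the actions a backend would perform

    void preAccess(Object buffer) { syncFromDevice(buffer); }
    void preMutate(Object buffer) { syncFromDevice(buffer); }
    void postMutate(Object buffer) { javaDirty.add(buffer); }

    // If the device holds newer data, copy it back before Java touches the buffer.
    private void syncFromDevice(Object buffer) {
        if (gpuDirty.remove(buffer)) {
            log.add("copyFromDevice");
        }
    }

    // kernelReads / kernelWrites are assumed to come from the kernel analysis.
    void dispatch(Object buffer, boolean kernelReads, boolean kernelWrites) {
        if (kernelReads && javaDirty.remove(buffer)) {
            log.add("copyToDevice");
        }
        log.add("run");
        if (kernelWrites) {
            gpuDirty.add(buffer);
        }
    }
}
```

With this sketch, a Java mutation followed by two dispatches and two Java reads would log one copy to the device, two runs and one copy back. The second dispatch and the second read find nothing dirty.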

BTW if kernel requests are async ;) then the ComputeContext maintains a map of buffer to kernel, so that `preAccess(x)` or `preMutate(x)` calls
can wait for the kernel that is due to 'dirty' the buffer to complete.
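
A buffer-to-kernel map of this kind could be sketched with `CompletableFuture` from the JDK. The class `AsyncDirtyTracker` and its methods are hypothetical, and a real backend would supply its own completion handle for a dispatched kernel.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Hypothetical illustration only; this is not the HAT ComputeContext.
class AsyncDirtyTracker {
    // Buffer -> completion of the in-flight kernel that will write it.
    private final Map<Object, CompletableFuture<Void>> pendingWriters = new IdentityHashMap<>();

    // Record that an asynchronously dispatched kernel is due to dirty 'buffer'.
    synchronized void kernelLaunched(Object buffer, CompletableFuture<Void> completion) {
        pendingWriters.put(buffer, completion);
    }

    // Called from preAccess/preMutate: block until the pending writer, if any, is done.
    void awaitWriter(Object buffer) {
        CompletableFuture<Void> pending;
        synchronized (this) {
            pending = pendingWriters.remove(buffer);
        }
        if (pending != null) {
            pending.join(); // wait outside the lock so other buffers are not blocked
        }
    }
}
```

After `awaitWriter` returns, the copy back from the device can proceed as in the synchronous case.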

### Marking hat buffers directly.