1 # Compute Analysis or Runtime tracing
2
3 ----
4
5 * [Contents](hat-00.md)
6 * House Keeping
7 * [Project Layout](hat-01-01-project-layout.md)
8 * [Building Babylon](hat-01-02-building-babylon.md)
9 * [Building HAT](hat-01-03-building-hat.md)
10 * [Enabling the CUDA Backend](hat-01-05-building-hat-for-cuda.md)
11 * Programming Model
12 * [Programming Model](hat-03-programming-model.md)
13 * Interface Mapping
14 * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
15 * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
16 * Implementation Detail
17 * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
18 * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)
19
20 ----
21
22 # Compute Analysis or Runtime tracing
23
HAT does not dictate how a backend chooses to optimize execution, but it does
provide the tools (Babylon's Code Models) and some helpers which the backend is encouraged
to use.
27
28 The ComputeContext contains all the information that the backend needs, but does not
29 include any 'policy' for minimizing data movements.
30
Our assumption is that the backend can use various tools to deduce the most efficient execution strategy.
32
## Some possible strategies
34
35 ### Copy data every time 'just in case' (JIC execution ;) )
Naively execute the code as described in the Compute Graph: the backend copies each buffer to the device, executes the kernel, and copies the data back again.
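
As a rough illustration only (a sketch reusing the `copyToDevice`/`copyFromDevice` style calls and the `Compute::kernel` example that appear later in this document), a 'just in case' dispatch might look like:

```java
// A 'just in case' dispatch sketch: every buffer is copied both ways around every kernel run,
// regardless of whether the kernel (or the Java code) actually needs the data moved.
void jicDispatch(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);               // always copy in
    accelerator.run(Compute::kernel, range, memorySegment);
    accelerator.copyFromDevice(memorySegment);              // always copy out
}
```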
37
38 ### Use kernel knowledge to minimise data movement
Execute the code described in the Compute Graph, but use knowledge extracted from the kernel models
to only copy to the device the buffers that the kernel is going to read, and to only copy back from the device
the buffers that the kernel has written to.
42
43 ### Use Compute knowledge and kernel knowledge to further minimise data movement
Use knowledge extracted from the compute's reachable call graph and the kernel
graphs to determine whether Java has mutated buffers between kernel dispatches,
and only copy to the device the data that we know the Java code has mutated.
47
This last strategy is ideal.
49
50 We can achieve this using static analysis of the compute and kernel models or by being
51 involved in the execution process at runtime.
52
53 #### Static analysis
54
55 #### Runtime Tracking
56
1. We 'close over' the call/dispatch graph from the entrypoint to all kernels: we collect the kernels reachable from the entrypoint and all methods transitively reachable from those kernels.
2. We essentially end up with a graph of code models 'rooted' at the entrypoint.
3. For each kernel we also determine how the kernel accesses its `MemorySegment` parameters; for each `MemorySegment` parameter we record in a side table whether the kernel reads from or writes to the segment.
61
The resulting 'ComputeClosure' (a tree of code models plus the relevant side tables) is made available to the accelerator to coordinate execution.
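
The exact shape of such a side table is not dictated by HAT; a minimal sketch (all names below are hypothetical) might look like this:

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of the per-kernel side table recording how each MemorySegment parameter is used.
enum SegmentAccess { READ, WRITE, READ_WRITE }

final class KernelSideTable {
    // kernel parameter index -> how the kernel touches that MemorySegment
    private final Map<Integer, SegmentAccess> accessByParameter = new HashMap<>();

    void record(int parameterIndex, SegmentAccess access) {
        // observing both a read and a write for the same parameter yields READ_WRITE
        accessByParameter.merge(parameterIndex, access,
                (a, b) -> a == b ? a : SegmentAccess.READ_WRITE);
    }

    boolean kernelReads(int parameterIndex) {
        SegmentAccess a = accessByParameter.get(parameterIndex);
        return a == SegmentAccess.READ || a == SegmentAccess.READ_WRITE;
    }

    boolean kernelWrites(int parameterIndex) {
        SegmentAccess a = accessByParameter.get(parameterIndex);
        return a == SegmentAccess.WRITE || a == SegmentAccess.READ_WRITE;
    }
}
```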
63
Note that our very simple Compute::compute method expresses neither the movement of the `MemorySegment` to the device, nor the retrieval of the data from the device once the kernel has executed.
65
66 Our assumption is that given the ComputeClosure we can deduce such movements.
67
68 There are many ways to achieve this. One way would be by static analysis.
69
Given the Compute::compute entrypoint it is easy to determine that we always (with no conditionals or loops) pass (making available
might be a better term) a memory segment to a kernel (Compute::kernel), and that this kernel only mutates the `MemorySegment`.
72
So from simple static analysis we could choose to inject one or more calls into the model, representing the need for the accelerator to move data to the device and/or back from the device after the kernel dispatch.
74
The modified model would look as if we had been presented with this code:
76
77 ```java
78 void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
79 Accelerator.Range range = accelerator.range(len);
80 accelerator.run(Compute::kernel, range, memorySegment);
81 accelerator.injectedCopyFromDevice(memorySegment);
82 }
83 ```
84
Note the `injectedCopyFromDevice()` call.
86
Because the kernel does not read the `MemorySegment` we only need to inject the code to request a move back from the device.
88
Doing this requires HAT to analyse the kernel(s) and inject appropriate code into
the Compute::compute method to inform the vendor backend when it should perform such moves.
91
Another strategy would be to not rely on static analysis, but instead to inject code that traces 'actual' mutations of the MemorySegments and uses these flags to guard against unnecessary copies:
93
94 ```java
95 void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
96 boolean injectedMemorySegmentIsDirty = false;
97 Accelerator.Range range = accelerator.range(len);
98 if (injectedMemorySegmentIsDirty){
99 accelerator.injectedCopyToDevice(memorySegment);
100 }
101 accelerator.run(Compute::kernel, range, memorySegment);
102 injectedMemorySegmentIsDirty = true; // based on Compute::kernel sidetable
103 if (injectedMemorySegmentIsDirty) {
104 accelerator.injectedCopyFromDevice(memorySegment);
105 }
106 }
107 ```
108
109
Whether this code mutation generates Java bytecode which executes (or is interpreted) on the JVM, or whether the
code models for the closure are handed over to a backend which reifies the kernel code and the
dispatch logic, is not defined.
113
The code model for the compute will be mutated to inject the appropriate nodes to achieve this goal.
115
It is possible that some vendors may just take the original code model and analyse it themselves.
117
Clearly this is a trivial compute closure. Let's discuss the required kernel analysis
and some proposed pseudo code.
120
121 ## Copying data based on kernel MemorySegment analysis
122
Above we showed that we should be able to determine whether a kernel mutates or accesses any of
its `MemorySegment` parameters.

We determined above that the kernel only called `set()`, so we need
not copy the data to the device.
128
The following example shows a kernel which reads and mutates a `MemorySegment`:
130 ```java
131 static class Compute {
132 @CodeReflection public static
133 void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        int temp = memorySegment.get(JAVA_INT, ndrange.id.x);
        memorySegment.set(JAVA_INT, ndrange.id.x, temp*2);
136 }
137
138 @CodeReflection public static
139 void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
140 Accelerator.Range range = accelerator.range(len);
141 accelerator.run(Compute::doubleup, range, memorySegment);
142 }
143 }
144 ```
Here our analysis needs to determine that the kernel both reads from and writes to the segment (it does),
so the generated compute model would equate to:
147
148 ```java
149 void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
150 Accelerator.Range range = accelerator.range(len);
151 accelerator.copyToDevice(memorySegment); // injected via Babylon
152 accelerator.run(Compute::doubleup, range, memorySegment);
153 accelerator.copyFromDevice(memorySegment); // injected via Babylon
154 }
155 ```
So far the deductions are fairly trivial.

Consider:
159 ```java
160 @CodeReflection public static
161 void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
162 Accelerator.Range range = accelerator.range(len);
163 for (int i=0; i<count; i++) {
164 accelerator.run(Compute::doubleup, range, memorySegment);
165 }
166 }
167 ```
168
Here HAT should deduce that the Java side is merely looping over the kernel dispatch
and has no interest in the `MemorySegment` between dispatches.

So the new model need only copy in once (before the first kernel dispatch) and out once (prior to return):
173
174 ```java
175 @CodeReflection public static
176 void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
177 Accelerator.Range range = accelerator.range(len);
178 accelerator.copyToDevice(memorySegment); // injected via Babylon
179 for (int i=0; i<count; i++) {
180 accelerator.run(Compute::doubleup, range, memorySegment);
181 }
182 accelerator.copyFromDevice(memorySegment); // injected via Babylon
183 }
184 ```
185
186 Things get slightly more interesting when we do indeed access the memory segment
187 from the Java code inside the loop.
188
189 ```java
190 @CodeReflection public static
191 void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
192 Accelerator.Range range = accelerator.range(len);
193 for (int i=0; i<count; i++) {
194 accelerator.run(Compute::doubleup, range, memorySegment);
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
197 }
198 }
199 ```
Now we expect Babylon to inject a read inside the loop to make the data available Java side:
201
202 ```java
203 @CodeReflection public static
204 void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
205 Accelerator.Range range = accelerator.range(len);
206 accelerator.copyToDevice(memorySegment); // injected via Babylon
207 for (int i=0; i<count; i++) {
208 accelerator.run(Compute::doubleup, range, memorySegment);
209 accelerator.copyFromDevice(memorySegment); // injected via Babylon
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
212 }
213
214 }
215 ```
216
Note that in this case we are only accessing the 0th int from the segment, so a possible
optimization might be to allow the vendor to copy back only this one element:
219 ```java
220 @CodeReflection public static
221 void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
222 Accelerator.Range range = accelerator.range(len);
223 accelerator.copyToDevice(memorySegment); // injected via Babylon
224 for (int i=0; i<count; i++) {
225 accelerator.run(Compute::doubleup, range, memorySegment);
        if (i+1==count){ // injected: last pass, copy the whole segment back
            accelerator.copyFromDevice(memorySegment); // injected via Babylon
        } else {
            accelerator.copyFromDevice(memorySegment, 1); // injected via Babylon: copy back just one element
        }
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
233 }
234
235 }
236 ```
237
Again HAT will merely mutate the code model of the compute method;
the vendor may choose to interpret bytecode, to generate bytecode and execute it,
or to take the complete model and execute it in native code.
241
242 So within HAT we must find all set/get calls on MemorySegments and trace them back to kernel parameters.
243
We should allow aliasing of memory segments... but in the short term we may well throw an exception when we see such aliasing.
245
246
247 ```java
248 @CodeReflection public static
249 void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
250 MemorySegment alias = memorySegment;
251 alias.set(JAVA_INT, ndrange.id.x, alias.get(JAVA_INT, ndrange.id.x)*2);
252 }
253 ```
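
A purely illustrative sketch (hypothetical names, not a HAT API) of the alias tracking this needs: each copy of a segment reference is recorded, and any set/get is attributed to the root value (typically a kernel parameter) it aliases.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative alias tracking: record each 'MemorySegment alias = other' style copy,
// then resolve any access back to the root value it ultimately refers to.
final class AliasTracker<V> {
    private final Map<V, V> copiedFrom = new HashMap<>();

    // called when the analysis sees a value being copied/aliased from another value
    void recordCopy(V alias, V source) {
        copiedFrom.put(alias, source);
    }

    // follow the alias chain until we reach a value with no recorded source
    V resolveRoot(V value) {
        V current = value;
        while (copiedFrom.containsKey(current)) {
            current = copiedFrom.get(current);
        }
        return current;
    }
}
```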
254
255 ## Weed warning #1
256
We could find common kernel errors during this kind of analysis.

This code is probably wrong, as it races when writing to the 0th element:
260
261 ```java
262 void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
263 MemorySegment alias = memorySegment;
264 alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x)*2);
265 }
266 ```
267
By allowing a 'lint' like plugin mechanism for the code model, such bugs would be easy to find.
If we ever find a constant index in set(...., <constant> ) we are probably in a world of hurt,
unless the set is included in some conditional which itself is dependent on a value extracted from a memory segment.
271
272 ```java
273 void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
274 MemorySegment alias = memorySegment;
275 if (????){
276 alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
277 }
278 }
279 ```
280
There are a lot of opportunities for catching such bugs.
282
283
284 ## Flipping Generations
285
Many algorithms require us to process data in generations. Consider
convolutions or Game Of Life style problems, where we have an image or game state and
we need to calculate the result of applying rules to the cells in the image or game.

When we process the next generation (either in parallel or sequentially) we
must ensure that we only use previous generation data to generate next generation data.
292
293 ```
294 [ ][ ][*][ ][ ] [ ][ ][ ][ ][ ]
295 [ ][ ][*][ ][ ] [ ][*][*][*][ ]
296 [ ][ ][*][ ][ ] -> [ ][ ][ ][ ][ ]
297 [ ][ ][ ][ ][ ] [ ][ ][ ][ ][ ]
298
299 ```
300
This usually requires us to hold two copies, applying the kernel to one input set
and writing to the output.
303
304 In the case of the Game Of Life we may well use the output as the next input...
305
306 ```java
@CodeReflection void conway(Accelerator.NDRange ndrange,
                            MemorySegment in, MemorySegment out, int width, int height) {
    int cx = ndrange.id.x % ndrange.id.maxx;
    int cy = ndrange.id.x / ndrange.id.maxx;

    int sum = 0;
    for (int dx = -1; dx < 2; dx++) {
        for (int dy = -1; dy < 2; dy++) {
            if (dx != 0 || dy != 0) {
                int x = cx + dx;
                int y = cy + dy;
                if (x >= 0 && x < width && y >= 0 && y < height) {
                    sum += in.get(JAVA_INT, y * width + x);
                }
            }
        }
    }
    int result = GOLRules(sum, in.get(JAVA_INT, ndrange.id.x));
    out.set(JAVA_INT, ndrange.id.x, result);

}
328 ```
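
The kernel above assumes a `GOLRules` helper; a minimal sketch of such a helper (using the conventional Game Of Life rules, and not part of the original example) might be:

```java
// Sketch of the GOLRules helper assumed above: a live cell survives with 2 or 3 live
// neighbours, a dead cell becomes live with exactly 3 live neighbours.
@CodeReflection static int GOLRules(int neighbourSum, int currentCell) {
    if (currentCell != 0) {
        return (neighbourSum == 2 || neighbourSum == 3) ? 1 : 0;
    } else {
        return (neighbourSum == 3) ? 1 : 0;
    }
}
```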
329
In this case the assumption is that the compute layer will swap the buffers for alternate passes:
331
332 ```java
333 import java.lang.foreign.MemorySegment;
334
335 @CodeReflection
336 void compute(Accelerator accelerator, MemorySegment gameState,
337 int width, int height, int maxGenerations) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height); // assumed helper allocating a segment of the same size
    Accelerator.Range range = accelerator.range(width * height);
    for (int generation = 0; generation < maxGenerations; generation++){
        MemorySegment from = generation%2==0?s1:s2;
        MemorySegment to = generation%2==1?s1:s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
344 }
345 if (maxGenerations%2==1){ // ?
346 gameState.copyFrom(s2);
347 }
348 }
349 ```
350
This common pattern includes some aliasing of MemorySegments that we need to untangle.

HAT needs to be able to track the aliases to determine the minimal number of copies.
The first version below shows the copies HAT might inject when the Java code does not touch the
game state between dispatches; the second shows the same compute when the Java code displays the
game state every generation, so data must also come back from the device inside the loop.
354 ```java
355 import java.lang.foreign.MemorySegment;
356
357 @CodeReflection
358 void compute(Accelerator accelerator, MemorySegment gameState, int width, int height, int maxGenerations,
359 DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height); // assumed helper allocating a segment of the same size
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++){
        MemorySegment from = generation%2==0?s1:s2;
        MemorySegment to = generation%2==1?s1:s2;
        if (generation == 0) { // injected
            accelerator.copyToDevice(from); // injected
        } // injected
        accelerator.run(Compute::conway, range, from, to, width, height);
        if (generation == maxGenerations-1){ // injected
            accelerator.copyFromDevice(to); // injected
        } // injected
373 }
374 if (maxGenerations%2==1){ // ?
375 gameState.copyFrom(s2);
376 }
377
378 }
379 ```
380
381 ```java
382 import java.lang.foreign.MemorySegment;
383
384 @CodeReflection
385 void compute(Accelerator accelerator, MemorySegment gameState, int width, int height,
386 int maxGenerations,
387 DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height); // assumed helper allocating a segment of the same size
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++){
        MemorySegment from = generation%2==0?s1:s2;
        MemorySegment to = generation%2==1?s1:s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
        displaySAM.display(to, width, height); // Java accesses the game state every generation
    }
    if (maxGenerations%2==1){ // ?
        gameState.copyFrom(s2);
399 }
400 }
401 ```
402
403
404
### A Babylon transform to track buffer mutations
406
One goal of HAT is to automate the movement of buffers between Java and the device.

One strategy employed by `NativeBackends` might be to track 'ifaceMappedSegment' accesses and inject tracking data into the compute method.

Here is a transformation for that:
412
413 ```java
414 static FuncOpWrapper injectBufferTracking(ComputeClosure.ResolvedMethodCall resolvedMethodCall) {
415 FuncOpWrapper original = resolvedMethodCall.funcOpWrapper();
416 var transformed = original.transformInvokes((builder, invoke) -> {
417 if (invoke.isIfaceBufferMethod()) { // void array(long idx, T value) or T array(long idx)
418 // Get the first parameter (computeClosure)
419 CopyContext cc = builder.context();
420 Value computeClosure = cc.getValue(original.parameter(0));
421 // Get the buffer receiver value in the output model
            Value receiver = cc.getValue(invoke.operand(0)); // The buffer we are mutating or accessing
423 if (invoke.isIfaceMutator()) {
424 // inject CLWrapComputeContext.preMutate(buffer);
425 builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_MUTATE, computeClosure, receiver));
426 builder.op(invoke.op());
427 // inject CLWrapComputeContext.postMutate(buffer);
428 builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_MUTATE, computeClosure, receiver));
429 } else if ( invoke.isIfaceAccessor()) {
430 // inject CLWrapComputeContext.preAccess(buffer);
431 builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_ACCESS, computeClosure, receiver));
432 builder.op(invoke.op());
433 // inject CLWrapComputeContext.postAccess(buffer);
434 builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_ACCESS, computeClosure, receiver));
435 } else {
436 builder.op(invoke.op());
437 }
438 }else{
439 builder.op(invoke.op());
440 }
441 return builder;
442 }
443 );
444 transformed.op().writeTo(System.out);
445 resolvedMethodCall.funcOpWrapper(transformed);
446 return transformed;
447 }
448 ```
449
So, in our `OpenCLBackend` for example:
451 ```java
452 public void mutateIfNeeded(ComputeClosure.MethodCall methodCall) {
453 injectBufferTracking(entrypoint);
454 }
455
456 @Override
457 public void computeContextClosed(ComputeContext CLWrapComputeContext){
458 var codeBuilder = new OpenCLKernelBuilder();
459 C99Code kernelCode = createKernelCode(CLWrapComputeContext, codeBuilder);
460 System.out.println(codeBuilder);
461 }
462 ```
I hacked the Mandel example so that the compute accesses and mutates its arrays.
464
465 ```java
466 @CodeReflection
467 static float doubleit(float f) {
468 return f * 2;
469 }
470
471 @CodeReflection
472 static float scaleUp(float f) {
473 return doubleit(f);
474 }
475
476 @CodeReflection
477 static public void compute(final ComputeContext CLWrapComputeContext, S32Array2D s32Array2D, float x, float y, float scale) {
478 scale = scaleUp(scale);
479 var range = CLWrapComputeContext.accelerator.range(s32Array2D.size());
480 int i = s32Array2D.get(10,10);
481 s32Array2D.set(10,10,i);
482 CLWrapComputeContext.dispatchKernel(MandelCompute::kernel, range, s32Array2D, pallette, x, y, scale);
483 }
484 ```
So here is the transformation being applied to the above compute.

BEFORE (note the !'s indicating accesses through iface buffers)
488 ```
489 func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
490 %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
491 %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
492 %7 : Var<float> = var %2 @"x";
493 %8 : Var<float> = var %3 @"y";
494 %9 : Var<float> = var %4 @"scale";
495 %10 : float = var.load %9;
496 %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
497 var.store %9 %11;
498 %12 : hat.ComputeContext = var.load %5;
499 %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
500 %14 : hat.buffer.S32Array2D = var.load %6;
501 ! %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
502 %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
503 %17 : Var<hat.NDRange> = var %16 @"range";
504 %18 : hat.buffer.S32Array2D = var.load %6;
505 %19 : int = constant @"10";
506 %20 : int = constant @"10";
507 ! %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
508 %22 : Var<int> = var %21 @"i";
509 %23 : hat.buffer.S32Array2D = var.load %6;
510 %24 : int = constant @"10";
511 %25 : int = constant @"10";
512 %26 : int = var.load %22;
513 ! invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
514 %27 : hat.ComputeContext = var.load %5;
515 ...
516 ```
517 AFTER
518 ```
519 func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
520 %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
521 %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
522 %7 : Var<float> = var %2 @"x";
523 %8 : Var<float> = var %3 @"y";
524 %9 : Var<float> = var %4 @"scale";
525 %10 : float = var.load %9;
526 %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
527 var.store %9 %11;
528 %12 : hat.ComputeContext = var.load %5;
529 %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
530 %14 : hat.buffer.S32Array2D = var.load %6;
531 invoke %0 %14 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
532 ! %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
533 invoke %0 %14 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
534 %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
535 %17 : Var<hat.NDRange> = var %16 @"range";
536 %18 : hat.buffer.S32Array2D = var.load %6;
537 %19 : int = constant @"10";
538 %20 : int = constant @"10";
539 invoke %0 %18 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
540 ! %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
541 invoke %0 %18 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
542 %22 : Var<int> = var %21 @"i";
543 %23 : hat.buffer.S32Array2D = var.load %6;
544 %24 : int = constant @"10";
545 %25 : int = constant @"10";
546 %26 : int = var.load %22;
547 invoke %0 %23 @"hat.ComputeClosure::preMutate(hat.buffer.Buffer)void";
548 ! invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
549 invoke %0 %23 @"hat.ComputeClosure::postMutate(hat.buffer.Buffer)void";
550 %27 : hat.ComputeContext = var.load %5;
551 ```
And here, at runtime, the ComputeClosure reports accesses when executing via the interpreter, thanks to the injected calls.
553
554 ```
555 ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
556 ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
557 ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
558 ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
559 ComputeClosure.preMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
560 ComputeClosure.postMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
561 ```
562 ## Why inject this info?
563 So the idea is that the ComputeContext would maintain sets of dirty buffers, one set for `gpuDirty` and one set for `javaDirty`.
564
We have the code models for the kernels, so we know whether each kernel accesses, mutates, or both accesses and mutates its particular parameters.
566
So when the ComputeContext receives `preAccess(x)` or `preMutate(x)` it determines whether `x` is in the `gpuDirty` set.
If so, it delegates to the backend to copy the data back from the device into the memory segment (assuming the memory is not coherent!)
before removing the buffer from the `gpuDirty` set and returning.
570
571 Now the Java access to the segment sees the latest buffer.
572
After `postMutate(x)` it will place the buffer in the `javaDirty` set.
574
When a kernel dispatch comes along, the parameters to the kernel are all checked against the `javaDirty` set.
If a parameter is 'accessed' by the kernel, the backend copies the segment to the device, removes the parameter
from the `javaDirty` set and then invokes the kernel.
When the kernel completes (let's assume synchronous dispatch for a moment) all parameters are checked again, and if a parameter
is known to be mutated by the kernel it is added to the `gpuDirty` set.
580
581 This way we don't have to force the developer to request data movements.
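
A minimal synchronous sketch (all names hypothetical, not a HAT API) of this bookkeeping:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the dirty-set bookkeeping described above. The kernelReads/kernelWrites flags
// are assumed to come from the per-kernel side tables in the ComputeClosure.
final class DirtySets<BUF> {
    interface Mover<T> {               // assumed backend hooks for moving a buffer's data
        void copyToDevice(T b);
        void copyFromDevice(T b);
    }

    private final Set<BUF> gpuDirty = new HashSet<>();
    private final Set<BUF> javaDirty = new HashSet<>();
    private final Mover<BUF> backend;

    DirtySets(Mover<BUF> backend) { this.backend = backend; }

    // injected before Java reads or writes the buffer
    void preAccessOrMutate(BUF b) {
        if (gpuDirty.remove(b)) backend.copyFromDevice(b); // Java must see the device's data
    }

    // injected after Java has written to the buffer
    void postMutate(BUF b) { javaDirty.add(b); }

    // before a kernel dispatch, for each parameter
    void beforeDispatch(BUF b, boolean kernelReads) {
        if (kernelReads && javaDirty.remove(b)) backend.copyToDevice(b);
    }

    // after a (synchronous) kernel dispatch, for each parameter
    void afterDispatch(BUF b, boolean kernelWrites) {
        if (kernelWrites) gpuDirty.add(b);
    }
}
```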
582
BTW if kernel requests are async ;) then the ComputeContext maintains a map from buffer to pending kernel, so `preAccess(x)` or `preMutate(x)` calls
can wait for the kernel that is due to 'dirty' the buffer to complete.
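
A sketch (hypothetical names again) of that asynchronous refinement:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;

// Sketch of the async refinement: remember which in-flight kernel dispatch will 'dirty'
// each buffer, so an injected preAccess/preMutate can wait for just that dispatch to
// complete before Java touches the buffer.
final class PendingWrites<BUF> {
    private final Map<BUF, CompletableFuture<Void>> pendingWriter = new HashMap<>();

    // called when a kernel that writes to 'b' is dispatched asynchronously
    void kernelWillWrite(BUF b, CompletableFuture<Void> dispatch) {
        pendingWriter.put(b, dispatch);
    }

    // called from injected preAccess/preMutate before Java touches the buffer
    void awaitWriter(BUF b) {
        CompletableFuture<Void> pending = pendingWriter.remove(b);
        if (pending != null) {
            pending.join(); // block until the dirtying kernel has completed
        }
    }
}
```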
585
### Marking HAT buffers directly
587
588
589
590
591
592
593
594
595