# Compute Analysis or Runtime tracing

----

* [Contents](hat-00.md)
* House Keeping
    * [Project Layout](hat-01-01-project-layout.md)
    * [Building Babylon](hat-01-02-building-babylon.md)
    * [Building HAT](hat-01-03-building-hat.md)
* Programming Model
    * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)

----

# Compute Analysis or Runtime tracing

HAT does not dictate how a backend chooses to optimize execution, but it does
provide the tools (Babylon's Code Models) and some helpers which the backend is encouraged
to use.

The ComputeContext contains all the information that the backend needs, but does not
include any 'policy' for minimizing data movements.

Our assumption is that the backend can use various tools to deduce the most efficient execution strategy.

## Some possible strategies

### Copy data every time 'just in case' (JIC execution ;) )
Just naively execute the code as described in the compute graph. The backend will copy each buffer to the device, execute the kernel and copy the data back again.

### Use kernel knowledge to minimise data movement
Execute the code described in the compute graph, but use knowledge extracted from the kernel models
to copy to the device only the buffers that the kernel is going to read, and copy back from the device
only the buffers that the kernel has written to.

### Use Compute knowledge and kernel knowledge to further minimise data movement
Use knowledge extracted from the compute reachable graph and the kernel
graphs to determine whether Java has mutated buffers between kernel dispatches,
and only copy data to the device that we know the Java code has mutated.

This last strategy is the ideal.

We can achieve this using static analysis of the compute and kernel models or by being
involved in the execution process at runtime.

#### Static analysis

#### Runtime Tracking

* Dynamically
    1. We 'close over' the call/dispatch graph from the entrypoint to all kernels and collect the kernels reachable from the entrypoint and all methods reachable from methods reachable by kernels.
    2. We essentially end up with a graph of code models 'rooted' at the entrypoint.
    3. For each kernel we also determine how the kernel accesses its `MemorySegment` parameters; for each `MemorySegment` parameter we keep a side table recording whether the kernel reads from or writes to the segment. We keep this information in a side map.

The resulting 'ComputeClosure' (the tree of code models and the relevant side tables) is made available to the accelerator to coordinate execution.

Note that our very simple Compute::compute method expresses neither the movement of the `MemorySegment` to a device, nor the retrieval of the data from the device when the kernel has executed.

Our assumption is that given the ComputeClosure we can deduce such movements.

There are many ways to achieve this. One way would be by static analysis.

Given the Compute::compute entrypoint it is easy to determine that we are always (no conditionals or loops) passing (making available
might be a better term) a memory segment to a kernel (Compute::kernel) and this kernel only mutates the `MemorySegment`.

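For reference, a minimal sketch of what such a compute/kernel pair might look like, using signatures consistent with the later examples in this section (the value the kernel writes is just an arbitrary placeholder):

```java
static class Compute {
    @CodeReflection public static
    void kernel(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        // Write only: each work item stores a value, nothing is read from the segment.
        memorySegment.set(JAVA_INT, ndrange.id.x, ndrange.id.x * 2);
    }

    @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.run(Compute::kernel, range, memorySegment);
    }
}
```
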
So from simple static analysis we could choose to inject one or more calls into the model representing the need for the accelerator to move data to the device, and/or back from the device after the kernel dispatch.

This modified model would look like we had presented it with this code.

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.run(Compute::kernel, range, memorySegment);
    accelerator.injectedCopyFromDevice(memorySegment);
}
```

Note the `injectedCopyFromDevice()` call.

Because the kernel does not read the `MemorySegment` we only need to inject the code to request a move back from the device.

To do this requires HAT to analyse the kernel(s) and inject appropriate code into
the Compute::compute method to inform the vendor backend when it should perform such moves.

Another strategy would be to not rely on static analysis but to inject code to trace 'actual' mutations of the MemorySegments and use these flags to guard against unnecessary copies.

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    boolean injectedMemorySegmentIsDirty = false;
    Accelerator.Range range = accelerator.range(len);
    if (injectedMemorySegmentIsDirty){
        accelerator.injectedCopyToDevice(memorySegment);
    }
    accelerator.run(Compute::kernel, range, memorySegment);
    injectedMemorySegmentIsDirty = true; // based on Compute::kernel sidetable
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyFromDevice(memorySegment);
    }
}
```

Whether this code mutation generates Java bytecode and executes (or is interpreted) on the JVM, or whether the
CodeModels for the closure are handed over to a backend which reifies the kernel code and the
logic for dispatch, is not defined.

The code model for the compute will be mutated to inject the appropriate nodes to achieve the goal.

It is possible that some vendors may just take the original code model and analyse it themselves.

Clearly this is a trivial compute closure. Let's discuss the required kernel analysis
and proposed pseudo code.

## Copying data based on kernel MemorySegment analysis

Above we showed that we should be able to determine whether a kernel mutates or accesses any of
its `MemorySegment` parameters.

We determined above that the kernel only called set() so we need
not copy the data to the device.

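As a purely illustrative aside, the result of that per-kernel analysis could be captured in a small side table like the sketch below; the names and types here are hypothetical (not part of the HAT API), it just records, per `MemorySegment` parameter, whether the kernel reads and/or writes it:

```java
import java.util.Map;

// Hypothetical per-kernel side table (see 'Runtime Tracking' above).
enum SegmentAccess { READ, WRITE, READ_WRITE }

record KernelSideTable(Map<Integer, SegmentAccess> accessByParamIndex) {
    // The kernel reads this parameter, so its buffer must be copied to the device.
    boolean needsCopyToDevice(int paramIndex) {
        SegmentAccess a = accessByParamIndex.get(paramIndex);
        return a == SegmentAccess.READ || a == SegmentAccess.READ_WRITE;
    }
    // The kernel writes this parameter, so its buffer must be copied back from the device.
    boolean needsCopyFromDevice(int paramIndex) {
        SegmentAccess a = accessByParamIndex.get(paramIndex);
        return a == SegmentAccess.WRITE || a == SegmentAccess.READ_WRITE;
    }
}
```
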
The following example shows a kernel which reads and mutates a `MemorySegment`

```java
static class Compute {
    @CodeReflection public static
    void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        int temp = memorySegment.get(JAVA_INT, ndrange.id.x);
        memorySegment.set(JAVA_INT, ndrange.id.x, temp * 2);
    }

    @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```

Here our analysis needs to determine that the kernel reads and writes to the segment (it does),
so the generated compute model would equate to

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);   // injected via Babylon
    accelerator.run(Compute::doubleup, range, memorySegment);
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```

So far the deductions are fairly trivial.

Consider

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```

Here HAT should deduce that the Java side is merely looping over the kernel dispatch
and has no interest in the `MemorySegment` between dispatches.

So the new model need only copy in once (before the first kernel) and out once (prior to return)

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);   // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```

Things get slightly more interesting when we do indeed access the memory segment
from the Java code inside the loop.

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Now we expect Babylon to inject a read inside the loop to make the data available Java side

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);       // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        accelerator.copyFromDevice(memorySegment); // injected via Babylon
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Note that in this case we are only accessing the 0th int of the segment, so a possible
optimization might be to allow the vendor to copy back only this one element...

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);              // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        if (i + 1 == count) {                             // injected
            accelerator.copyFromDevice(memorySegment);    // injected, last pass: copy the whole segment
        } else {
            accelerator.copyFromDevice(memorySegment, 1); // injected via Babylon, copy only one element
        }
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Again HAT will merely mutate the code model of the compute method;
the vendor may choose to interpret bytecode, generate bytecode and execute it,
or take the complete model and execute it in native code.

So within HAT we must find all set/get calls on MemorySegments and trace them back to kernel parameters.

We should allow aliasing of memory segments... but in the short term we may well throw an exception when we see such aliasing

```java
@CodeReflection public static
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, ndrange.id.x, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```

## Weed warning #1

We could find common kernel errors when analyzing.

This code is probably wrong, as it is racy: every work item writes to the 0th element.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```

By allowing a 'lint'-like plugin mechanism for code models this would be easy to find.
If we ever find a constant index in a `set(...)` call we are probably in a world of hurt,
unless the set is inside some conditional which itself is dependent on a value extracted from a memory segment.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    if (????) {
        alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
    }
}
```

There are a lot of opportunities for catching such bugs.

## Flipping Generations

Many algorithms require us to process data in generations. Consider
convolutions or Game Of Life style problems where we have an image or game state and
we need to calculate the result of applying rules to cells in the image or game.

When we process the next generation (either in parallel or sequentially) we
must ensure that we only use previous-generation data to generate next-generation data.

```
[ ][ ][*][ ][ ]    [ ][ ][ ][ ][ ]
[ ][ ][*][ ][ ]    [ ][*][*][*][ ]
[ ][ ][*][ ][ ] -> [ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ]    [ ][ ][ ][ ][ ]
```

This usually requires us to hold two copies, applying the kernel to one input set
and writing to the output.

In the case of the Game Of Life we may well use the output as the next input...

```java
@CodeReflection void conway(Accelerator.NDRange ndrange,
        MemorySegment in, MemorySegment out, int width, int height) {
    int cx = ndrange.id.x % ndrange.id.maxx;
    int cy = ndrange.id.x / ndrange.id.maxx;

    int sum = 0;
    for (int dx = -1; dx < 2; dx++) {
        for (int dy = -1; dy < 2; dy++) {
            if (dx != 0 || dy != 0) {
                int x = cx + dx;
                int y = cy + dy;
                if (x >= 0 && x < width && y >= 0 && y < height) {
                    sum += in.get(JAVA_INT, y * width + x);
                }
            }
        }
    }
    int result = GOLRules(sum, in.get(JAVA_INT, ndrange.id.x));
    out.set(JAVA_INT, ndrange.id.x, result);
}
```

In this case the assumption is that the compute layer will swap the buffers for alternate passes

```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState,
             int width, int height, int maxGenerations) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);
    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

This common pattern includes some aliasing of MemorySegments that we need to untangle.

HAT needs to be able to track the aliases to determine the minimal number of copies.

```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height, int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 1 ? s1 : s2;
        if (generation == 0) {                  // injected
            accelerator.copyToDevice(from);     // injected
        }                                       // injected
        accelerator.run(Compute::conway, range, from, to, width, height);
        if (generation == maxGenerations - 1) { // injected
            accelerator.copyFromDevice(to);     // injected
        }                                       // injected
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

If instead the Java side accesses the game state inside the loop (for example to display each generation), HAT can no longer limit the copy back from the device to the final generation:

```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height,
             int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height);
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
        displaySAM.display(to, width, height); // Java-side access between dispatches
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

### Example Babylon transform to track buffer mutations

One goal of HAT was to automate the movement of buffers between Java and the device.

One strategy employed by `NativeBackends` might be to track 'ifaceMappedSegment' accesses and inject tracking calls into the compute method.

Here is a transformation for that

```java
static FuncOpWrapper injectBufferTracking(ComputeClosure.ResolvedMethodCall resolvedMethodCall) {
    FuncOpWrapper original = resolvedMethodCall.funcOpWrapper();
    var transformed = original.transformInvokes((builder, invoke) -> {
                if (invoke.isIfaceBufferMethod()) { // void array(long idx, T value) or T array(long idx)
                    // Get the first parameter (computeClosure)
                    CopyContext cc = builder.context();
                    Value computeClosure = cc.getValue(original.parameter(0));
                    // Get the buffer receiver value in the output model
                    Value receiver = cc.getValue(invoke.operand(0)); // the buffer we are mutating or accessing
                    if (invoke.isIfaceMutator()) {
                        // inject computeContext.preMutate(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_MUTATE, computeClosure, receiver));
                        builder.op(invoke.op());
                        // inject computeContext.postMutate(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_MUTATE, computeClosure, receiver));
                    } else if (invoke.isIfaceAccessor()) {
                        // inject computeContext.preAccess(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_ACCESS, computeClosure, receiver));
                        builder.op(invoke.op());
                        // inject computeContext.postAccess(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_ACCESS, computeClosure, receiver));
                    } else {
                        builder.op(invoke.op());
                    }
                } else {
                    builder.op(invoke.op());
                }
                return builder;
            }
    );
    transformed.op().writeTo(System.out);
    resolvedMethodCall.funcOpWrapper(transformed);
    return transformed;
}
```

So in our `OpenCLBackend` for example

```java
public void mutateIfNeeded(ComputeClosure.MethodCall methodCall) {
    injectBufferTracking(entrypoint);
}

@Override
public void computeContextClosed(ComputeContext computeContext) {
    var codeBuilder = new OpenCLKernelBuilder();
    C99Code kernelCode = createKernelCode(computeContext, codeBuilder);
    System.out.println(codeBuilder);
}
```

I hacked the Mandel example so that the compute accessed and mutated its arrays.

```java
@CodeReflection
static float doubleit(float f) {
    return f * 2;
}

@CodeReflection
static float scaleUp(float f) {
    return doubleit(f);
}

@CodeReflection
static public void compute(final ComputeContext computeContext, S32Array2D s32Array2D, float x, float y, float scale) {
    scale = scaleUp(scale);
    var range = computeContext.accelerator.range(s32Array2D.size());
    int i = s32Array2D.get(10, 10);
    s32Array2D.set(10, 10, i);
    computeContext.dispatchKernel(MandelCompute::kernel, range, s32Array2D, pallette, x, y, scale);
}
```

So here is the transformation being applied to the above compute.

BEFORE (note the !'s indicating accesses through iface buffers)

```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"computeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    %27 : hat.ComputeContext = var.load %5;
    ...
```

AFTER

```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"computeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
    invoke %0 %14 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    invoke %0 %14 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
    invoke %0 %18 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    invoke %0 %18 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
    invoke %0 %23 @"hat.ComputeClosure::preMutate(hat.buffer.Buffer)void";
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    invoke %0 %23 @"hat.ComputeClosure::postMutate(hat.buffer.Buffer)void";
    %27 : hat.ComputeContext = var.load %5;
```

And here at runtime the ComputeClosure is reporting accesses (via the injected calls) when executing through the interpreter.

```
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
```

## Why inject this info?

The idea is that the ComputeContext would maintain sets of dirty buffers, one set for `gpuDirty` and one set for `javaDirty`.

We have the code models for the kernels, so we know which kernel accesses, mutates, or accesses AND mutates particular parameters.

So when the ComputeContext receives `preAccess(x)` or `preMutate(x)` it would determine whether `x` is in the `gpuDirty` set.
If so it would delegate to the backend to copy the GPU data back from the device into the memory segment (assuming the memory is not coherent!)
before removing the buffer from the `gpuDirty` set and returning.

Now the Java access to the segment sees the latest buffer.

After `postMutate(x)` it will place the buffer in the `javaDirty` set.

When a kernel dispatch comes along, the parameters to the kernel are all checked against the `javaDirty` set.
If a parameter is in that set and is 'accessed' by the kernel, the backend will copy the segment to the device, remove the parameter
from the `javaDirty` set and then invoke the kernel.
When the kernel completes (let's assume synchronous dispatch for a moment) all parameters are checked again, and if a parameter
is known to be mutated by the kernel it is added to the `gpuDirty` set.

This way we don't have to force the developer to request data movements.

BTW if kernel requests are async ;) then the ComputeContext maintains a map of buffer to kernel. So `preAccess(x)` or `preMutate(x)` calls
can wait on the kernel that is due to 'dirty' the buffer to complete.

### Marking hat buffers directly.