# Compute Analysis or Runtime Tracing

----

* [Contents](hat-00.md)
* House Keeping
    * [Project Layout](hat-01-01-project-layout.md)
    * [Building Babylon](hat-01-02-building-babylon.md)
    * [Building HAT](hat-01-03-building-hat.md)
* Programming Model
    * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)

----

# Compute Analysis or Runtime Tracing

HAT does not dictate how a backend chooses to optimize execution, but it does
provide the tools (Babylon's Code Models) and some helpers which the backend is encouraged
to use.

The ComputeContext contains all the information that the backend needs, but does not
include any 'policy' for minimizing data movements.

Our assumption is that a backend can use various tools to deduce the most efficient execution strategy.

## Some possible strategies

### Copy data every time 'just in case' (JIC execution ;) )
Just naively execute the code as described in the compute graph. The backend copies each buffer to the device, executes the kernel and copies the data back again.

### Use kernel knowledge to minimize data movement
Execute the code described in the compute graph, but use knowledge extracted from the kernel models
to only copy to the device buffers that the kernel is going to read, and only copy back from the device
buffers that the kernel has written to.

### Use compute knowledge and kernel knowledge to further minimize data movement
Use knowledge extracted from the compute reachable graph and the kernel
graphs to determine whether Java has mutated buffers between kernel dispatches,
and only copy data to the device that we know the Java code has mutated.

This last strategy is ideal.

We can achieve this using static analysis of the compute and kernel models, or by being
involved in the execution process at runtime.

#### Static analysis

#### Runtime Tracking

1. We 'close over' the call/dispatch graph from the entrypoint: we collect the kernels reachable from the entrypoint, and all methods reachable from methods reachable by kernels.
2. We essentially end up with a graph of code models 'rooted' at the entrypoint.
3. For each kernel we also determine how the kernel accesses its `MemorySegment` parameters; for each `MemorySegment` parameter we keep a side table recording whether the kernel reads or writes to the segment.

This resulting 'ComputeClosure' (a tree of code models and the relevant side tables) is made available to the accelerator to coordinate execution.
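As an illustration of the side tables mentioned in step 3, here is a minimal sketch of how the per-kernel access modes might be recorded. The type and method names are hypothetical and purely illustrative, not actual HAT types:

```java
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: records how each kernel uses its MemorySegment
// parameters. The real ComputeClosure representation may differ.
class KernelAccessTables {
    enum SegmentAccess { READ, WRITE, READ_WRITE }

    // For each reachable kernel, map the index of each MemorySegment
    // parameter to the access mode observed in the kernel's code model.
    private final Map<Method, Map<Integer, SegmentAccess>> sideTables = new HashMap<>();

    void record(Method kernel, int paramIndex, SegmentAccess access) {
        sideTables.computeIfAbsent(kernel, k -> new HashMap<>())
                .merge(paramIndex, access,
                        (prev, next) -> prev == next ? prev : SegmentAccess.READ_WRITE);
    }

    SegmentAccess accessOf(Method kernel, int paramIndex) {
        return sideTables.getOrDefault(kernel, Map.of()).get(paramIndex);
    }
}
```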
Note that our very simple Compute::compute method neither expresses the movement of the MemorySegment to a device, nor the retrieval of the data from the device after the kernel has executed.

Our assumption is that given the ComputeClosure we can deduce such movements.

There are many ways to achieve this. One way would be by static analysis.

Given the Compute::compute entrypoint it is easy to determine that we are always (no conditionals or loops) passing (making available
might be a better term) a memory segment to a kernel (Compute::kernel), and that this kernel only mutates the `MemorySegment`.

So from simple static analysis we could choose to inject one or more calls into the model, representing the need for the accelerator to move data to the device and/or back from the device after the kernel dispatch.

This modified model would look as if we had presented it with this code:

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.run(Compute::kernel, range, memorySegment);
    accelerator.injectedCopyFromDevice(memorySegment);
}
```

Note the `injectedCopyFromDevice()` call.

Because the kernel does not read the `MemorySegment` we only need to inject the code to request a move back from the device.

To do this requires HAT to analyse the kernel(s) and inject appropriate code into
the Compute::compute method to inform the vendor backend when it should perform such moves.

Another strategy would be to not rely on static analysis, but instead to inject code that traces 'actual' mutations of the MemorySegments and uses these flags to guard against unnecessary copies.

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    boolean injectedMemorySegmentIsDirty = false;
    Accelerator.Range range = accelerator.range(len);
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyToDevice(memorySegment);
    }
    accelerator.run(Compute::kernel, range, memorySegment);
    injectedMemorySegmentIsDirty = true; // based on Compute::kernel sidetable
    if (injectedMemorySegmentIsDirty) {
        accelerator.injectedCopyFromDevice(memorySegment);
    }
}
```

Whether this code mutation generates Java bytecode which executes (or is interpreted) on the JVM, or whether the
code models for the closure are handed over to a backend which reifies the kernel code and the
dispatch logic, is not defined.

The code model for the compute will be mutated to inject the appropriate nodes to achieve the goal.

It is possible that some vendors may just take the original code model and analyse it themselves.

Clearly this is a trivial compute closure. Let's discuss the required kernel analysis
and proposed pseudo code.

## Copying data based on kernel MemorySegment analysis

Above we showed that we should be able to determine whether a kernel mutates or accesses any of
its `MemorySegment` parameters.

We determined above that the kernel only called set(), so we need
not copy the data to the device.
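For reference, here is a minimal sketch of what such a write-only kernel might look like. The actual Compute::kernel body is not reproduced in this document, so this simply follows the shape of the kernels shown below:

```java
@CodeReflection public static
void kernel(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    // Only set() is called: the kernel never reads the segment, so the
    // backend need not copy the buffer to the device before dispatch.
    memorySegment.set(JAVA_INT, ndrange.id.x, ndrange.id.x * 2);
}
```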
The following example shows a kernel which reads and mutates a `MemorySegment`:
```java
static class Compute {
    @CodeReflection public static
    void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
        int temp = memorySegment.get(JAVA_INT, ndrange.id.x);
        memorySegment.set(JAVA_INT, ndrange.id.x, temp * 2);
    }

    @CodeReflection public static
    void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
        Accelerator.Range range = accelerator.range(len);
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```
Here our analysis needs to determine that the kernel both reads and writes to the segment (it does),
so the generated compute model would equate to:

```java
void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);   // injected via Babylon
    accelerator.run(Compute::doubleup, range, memorySegment);
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```
So far the deductions are fairly trivial.

Consider:
```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
}
```

Here HAT should deduce that the Java side is merely looping over the kernel dispatch
and has no interest in the memory segment between dispatches.

So the new model need only copy in once (before the first kernel dispatch) and out once (prior to return):

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);   // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
    }
    accelerator.copyFromDevice(memorySegment); // injected via Babylon
}
```

Things get slightly more interesting when we do indeed access the memory segment
from the Java code inside the loop.

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```
Now we expect Babylon to inject a read inside the loop to make the data available Java side:

```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);       // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        accelerator.copyFromDevice(memorySegment); // injected via Babylon
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Note that in this case we are only accessing the 0th int from the segment, so a possible
optimization might be to allow the vendor to only copy back this one element....
```java
@CodeReflection public static
void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
    Accelerator.Range range = accelerator.range(len);
    accelerator.copyToDevice(memorySegment);              // injected via Babylon
    for (int i = 0; i < count; i++) {
        accelerator.run(Compute::doubleup, range, memorySegment);
        if (i + 1 == count) {                             // injected
            accelerator.copyFromDevice(memorySegment);    // injected
        } else {
            accelerator.copyFromDevice(memorySegment, 1); // injected via Babylon: only the first element
        }
        int slot0 = memorySegment.get(JAVA_INT, 0);
        System.out.println("slot0 " + slot0);
    }
}
```

Again HAT will merely mutate the code model of the compute method;
the vendor may choose to interpret bytecode, generate bytecode and execute it,
or take the complete code model and execute it in native code.

So within HAT we must find all set/get calls on MemorySegments and trace them back to kernel parameters.

We should allow aliasing of memory segments... but in the short term we may well throw an exception when we see such aliasing.

```java
@CodeReflection public static
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, ndrange.id.x, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```

## Weed warning #1

We could find common kernel errors during this analysis.

This code is probably wrong, as it races when writing to the 0th element:

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
}
```

By allowing a 'lint'-like plugin mechanism for the code model such bugs would be easy to find.
If we ever find a constant index in `set(..., <constant>, ...)` we are probably in a world of hurt,
unless the set is included in some conditional which itself is dependent on a value extracted from a memory segment.

```java
void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
    MemorySegment alias = memorySegment;
    if (????) {
        alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
    }
}
```

There are a lot of opportunities for catching such bugs.

## Flipping Generations

Many algorithms require us to process data in generations. Consider
convolutions or Game Of Life style problems, where we have an image or game state and
we need to calculate the result of applying rules to the cells in the image or game.

It is important that when we process the next generation (either in parallel or sequentially) we
only use previous generation data to generate next generation data.

```
[ ][ ][*][ ][ ]      [ ][ ][ ][ ][ ]
[ ][ ][*][ ][ ]      [ ][*][*][*][ ]
[ ][ ][*][ ][ ]  ->  [ ][ ][ ][ ][ ]
[ ][ ][ ][ ][ ]      [ ][ ][ ][ ][ ]
```

This usually requires us to hold two copies, applying the kernel to one (the input) and writing the result to the other (the output).

In the case of the Game Of Life we may well use the output as the next input...
```java
@CodeReflection void conway(Accelerator.NDRange ndrange,
                            MemorySegment in, MemorySegment out, int width, int height) {
    int cx = ndrange.id.x % ndrange.id.maxx;
    int cy = ndrange.id.x / ndrange.id.maxx;

    int sum = 0;
    for (int dx = -1; dx < 2; dx++) {
        for (int dy = -1; dy < 2; dy++) {
            if (dx != 0 || dy != 0) {
                int x = cx + dx;
                int y = cy + dy;
                if (x >= 0 && x < width && y >= 0 && y < height) {
                    sum += in.get(JAVA_INT, y * width + x);
                }
            }
        }
    }
    // GOLRules is an assumed helper (not shown) applying the Game Of Life rules
    // to the cell's previous value given the count of live neighbours.
    int result = GOLRules(sum, in.get(JAVA_INT, ndrange.id.x));
    out.set(JAVA_INT, ndrange.id.x, result);
}
```

In this case the assumption is that the compute layer will swap the buffers for alternate passes:

```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState,
             int width, int height, int maxGenerations) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height); // assumed helper
    Accelerator.Range range = accelerator.range(width * height);
    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to   = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

This common pattern includes some aliasing of MemorySegments that we need to untangle.

HAT needs to be able to track the aliases to determine the minimal number of copies.
```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height, int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height); // assumed helper
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to   = generation % 2 == 1 ? s1 : s2;
        if (generation == 0) {                  // injected
            accelerator.copyToDevice(from);     // injected
        }                                       // injected
        accelerator.run(Compute::conway, range, from, to, width, height);
        if (generation == maxGenerations - 1) { // injected
            accelerator.copyFromDevice(to);     // injected
        }                                       // injected
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

If the Java code also displays the game state after each generation, as below (shown here before any injection), HAT would instead need to copy the state back inside the loop:

```java
import java.lang.foreign.MemorySegment;

@CodeReflection
void compute(Accelerator accelerator, MemorySegment gameState, int width, int height,
             int maxGenerations,
             DisplaySAM displaySAM) {
    MemorySegment s1 = gameState;
    MemorySegment s2 = allocateGameState(width, height); // assumed helper
    Accelerator.Range range = accelerator.range(width * height);

    for (int generation = 0; generation < maxGenerations; generation++) {
        MemorySegment from = generation % 2 == 0 ? s1 : s2;
        MemorySegment to   = generation % 2 == 1 ? s1 : s2;
        accelerator.run(Compute::conway, range, from, to, width, height);
        displaySAM.display(to, width, height);
    }
    if (maxGenerations % 2 == 1) { // ?
        gameState.copyFrom(s2);
    }
}
```

### A Babylon transform to track buffer mutations

One goal of HAT is to automate the movement of buffers between Java and the device.

One strategy employed by `NativeBackends` might be to track 'ifaceMappedSegment' accesses and inject tracking calls into the compute method.
Here is a transformation that does this:

```java
static FuncOpWrapper injectBufferTracking(ComputeClosure.ResolvedMethodCall resolvedMethodCall) {
    FuncOpWrapper original = resolvedMethodCall.funcOpWrapper();
    var transformed = original.transformInvokes((builder, invoke) -> {
                if (invoke.isIfaceBufferMethod()) { // void array(long idx, T value) or T array(long idx)
                    // Get the first parameter (computeClosure)
                    CopyContext cc = builder.context();
                    Value computeClosure = cc.getValue(original.parameter(0));
                    // Get the buffer receiver value in the output model
                    Value receiver = cc.getValue(invoke.operand(0)); // the buffer we are mutating or accessing
                    if (invoke.isIfaceMutator()) {
                        // inject CLWrapComputeContext.preMutate(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_MUTATE, computeClosure, receiver));
                        builder.op(invoke.op());
                        // inject CLWrapComputeContext.postMutate(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_MUTATE, computeClosure, receiver));
                    } else if (invoke.isIfaceAccessor()) {
                        // inject CLWrapComputeContext.preAccess(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_ACCESS, computeClosure, receiver));
                        builder.op(invoke.op());
                        // inject CLWrapComputeContext.postAccess(buffer);
                        builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_ACCESS, computeClosure, receiver));
                    } else {
                        builder.op(invoke.op());
                    }
                } else {
                    builder.op(invoke.op());
                }
                return builder;
            }
    );
    transformed.op().writeTo(System.out);
    resolvedMethodCall.funcOpWrapper(transformed);
    return transformed;
}
```

So in our `OpenCLBackend`, for example:
```java
public void mutateIfNeeded(ComputeClosure.MethodCall methodCall) {
    injectBufferTracking(entrypoint);
}

@Override
public void computeContextClosed(ComputeContext CLWrapComputeContext) {
    var codeBuilder = new OpenCLKernelBuilder();
    C99Code kernelCode = createKernelCode(CLWrapComputeContext, codeBuilder);
    System.out.println(codeBuilder);
}
```
I hacked the Mandel example so that the compute accessed and mutated its arrays.
```java
@CodeReflection
static float doubleit(float f) {
    return f * 2;
}

@CodeReflection
static float scaleUp(float f) {
    return doubleit(f);
}

@CodeReflection
static public void compute(final ComputeContext CLWrapComputeContext, S32Array2D s32Array2D, float x, float y, float scale) {
    scale = scaleUp(scale);
    var range = CLWrapComputeContext.accelerator.range(s32Array2D.size());
    int i = s32Array2D.get(10, 10);
    s32Array2D.set(10, 10, i);
    CLWrapComputeContext.dispatchKernel(MandelCompute::kernel, range, s32Array2D, pallette, x, y, scale);
}
```
So here is the transformation being applied to the above compute.

BEFORE (note the !'s indicating accesses through ifacebuffers)
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    %27 : hat.ComputeContext = var.load %5;
    ...
```
AFTER
```
func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
    %7 : Var<float> = var %2 @"x";
    %8 : Var<float> = var %3 @"y";
    %9 : Var<float> = var %4 @"scale";
    %10 : float = var.load %9;
    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
    var.store %9 %11;
    %12 : hat.ComputeContext = var.load %5;
    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
    %14 : hat.buffer.S32Array2D = var.load %6;
    invoke %0 %14 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
    invoke %0 %14 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
    %17 : Var<hat.NDRange> = var %16 @"range";
    %18 : hat.buffer.S32Array2D = var.load %6;
    %19 : int = constant @"10";
    %20 : int = constant @"10";
    invoke %0 %18 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
    invoke %0 %18 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
    %22 : Var<int> = var %21 @"i";
    %23 : hat.buffer.S32Array2D = var.load %6;
    %24 : int = constant @"10";
    %25 : int = constant @"10";
    %26 : int = var.load %22;
    invoke %0 %23 @"hat.ComputeClosure::preMutate(hat.buffer.Buffer)void";
!   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
    invoke %0 %23 @"hat.ComputeClosure::postMutate(hat.buffer.Buffer)void";
    %27 : hat.ComputeContext = var.load %5;
```
And here at runtime the ComputeClosure reports accesses (via the injected calls) when executing through the interpreter:

```
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.preMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
ComputeClosure.postMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
```
## Why inject this info?

The idea is that the ComputeContext would maintain sets of dirty buffers: one `gpuDirty` set and one `javaDirty` set.

We have the code models for the kernels, so we know whether each kernel accesses, mutates, or both accesses and mutates particular parameters.

So when the ComputeContext receives `preAccess(x)` or `preMutate(x)` it determines whether `x` is in the `gpuDirty` set.
If so it delegates to the backend to copy the data back from the device into the memory segment (assuming the memory is not coherent!)
before removing the buffer from the `gpuDirty` set and returning.

Now the Java access to the segment sees the latest buffer.

After `postMutate(x)` it places the buffer in the `javaDirty` set.

When a kernel dispatch comes along, the parameters to the kernel are all checked against the `javaDirty` set.
If a parameter is accessed by the kernel, the backend copies the segment to the device and removes the parameter
from the `javaDirty` set before invoking the kernel.
When the kernel completes (let's assume synchronous dispatch for a moment) all parameters are checked again, and any parameter
known to be mutated by the kernel is added to the `gpuDirty` set.

This way we don't have to force the developer to request data movements.

BTW if kernel dispatches are async ;) then the ComputeContext maintains a map of buffer to kernel, so `preAccess(x)` or `preMutate(x)` calls
can wait for the kernel that is due to 'dirty' the buffer to complete.

### Marking HAT buffers directly
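One way to realize this idea, sketched below with hypothetical names (illustrative only, not HAT API): rather than the ComputeContext keeping external `gpuDirty`/`javaDirty` sets, each iface-mapped buffer could carry its own state, and the injected pre/post calls plus the dispatch logic would simply update and consult that state.

```java
// Sketch only; hypothetical types, not HAT API. The buffer itself is marked
// with its state instead of the ComputeContext keeping external dirty sets.
enum BufferState { CLEAN, JAVA_DIRTY, GPU_DIRTY }

interface TrackedBuffer {
    BufferState state();
    void state(BufferState s);
}

interface Backend { // assumed backend abstraction, illustrative only
    void copyToDevice(TrackedBuffer b);
    void copyFromDevice(TrackedBuffer b);
}

final class TrackingComputeContext {
    private final Backend backend;

    TrackingComputeContext(Backend backend) { this.backend = backend; }

    // Injected preAccess/preMutate: if the device holds the newest data,
    // bring it back before Java touches the segment.
    void preAccess(TrackedBuffer b) {
        if (b.state() == BufferState.GPU_DIRTY) {
            backend.copyFromDevice(b);
            b.state(BufferState.CLEAN);
        }
    }

    // Injected postMutate: Java now holds the newest data.
    void postMutate(TrackedBuffer b) {
        b.state(BufferState.JAVA_DIRTY);
    }

    // Before dispatch: copy to the device only if the kernel reads the
    // buffer and Java has dirtied it since the last copy.
    void beforeDispatch(TrackedBuffer b, boolean kernelReads) {
        if (kernelReads && b.state() == BufferState.JAVA_DIRTY) {
            backend.copyToDevice(b);
            b.state(BufferState.CLEAN);
        }
    }

    // After a (synchronous) dispatch: if the kernel wrote to the buffer,
    // the device now holds the newest data.
    void afterDispatch(TrackedBuffer b, boolean kernelWrites) {
        if (kernelWrites) {
            b.state(BufferState.GPU_DIRTY);
        }
    }
}
```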