diff a/hat/docs/Implementation/kernel-analysis.md b/hat/docs/Implementation/kernel-analysis.md
--- /dev/null
+++ b/hat/docs/Implementation/kernel-analysis.md
@@ -0,0 +1,577 @@
+# Compute Analysis or Runtime tracing
+[Back to Index ../](../index.md)
+
+HAT does not dictate how a backend chooses to optimize execution, but does
+provide the tools (Babylon's Code Models) and some helpers which the backend is encouraged
+to use.
+
+The ComputeContext contains all the information that the backend needs, but does not
+include any 'policy' for minimizing data movements.
+
+Our assumption is that a backend can use various tools to deduce the most efficient execution strategy.
+
+## Some possible strategies
+
+### Copy data every time 'just in case' (JIC execution ;) )
+Just naively execute the code as described in the Compute graph. So the backend will copy each buffer to the device, execute the kernel and copy the data back again.
+
+### Use kernel knowledge to minimise data movement
+Execute the code described in the Compute Graph, but use knowledge extracted from kernel models
+to only copy to device buffers that the kernel is going to read, and only copy back from the device
+buffers that the kernel has written to.
+
+### Use Compute knowledge and kernel knowledge to further minimise data movement
+Use knowledge extracted from the compute reachable graph and the kernel
+graphs to determine whether Java has mutated buffers between kernel dispatches
+and only copy data to the device that we know the Java code has mutated.
+
+This last strategy is ideal.
+
+We can achieve this using static analysis of the compute and kernel models or by being
+involved in the execution process at runtime.
+
+#### Static analysis
+
+#### Runtime Tracking
+
+* Dynamical
+1. We 'close over' the call/dispatch graph from the entrypoint to all kernels and collect the kernels reachable from the entrypoint and all methods reachable from methods reachable by kernels.
+2. We essentially end up with a graph of codemodels 'rooted' at the entrypoint
+3. For each kernel we also determine how the kernel accesses its `MemorySegment` parameters. For each `MemorySegment` parameter we record whether the kernel reads or writes to the segment. We keep this information in a side map.
+
+This resulting 'ComputeClosure' (tree of codemodels and relevant side tables) is made available to the accelerator to coordinate execution.
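+
+As a rough illustration of the side table described in step 3, the sketch below records, per kernel parameter, whether the kernel reads and/or writes the segment. The names `AccessTable` and `Access` are hypothetical and not part of HAT. The accesses are recorded by hand here, whereas HAT would derive them from the kernel's code model.
+
+```java
+import java.util.EnumSet;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.Set;
+
+// Hypothetical side table (not HAT API): parameter index -> how the kernel touches that segment.
+final class AccessTable {
+    enum Access { READ, WRITE }
+
+    private final Map<Integer, EnumSet<Access>> byParameter = new HashMap<>();
+
+    // Record that the kernel performs 'access' on the segment passed as parameter 'paramIdx'.
+    void record(int paramIdx, Access access) {
+        byParameter.computeIfAbsent(paramIdx, k -> EnumSet.noneOf(Access.class)).add(access);
+    }
+
+    // True if the kernel reads the segment, so it may need copying to the device.
+    boolean reads(int paramIdx) { return get(paramIdx).contains(Access.READ); }
+
+    // True if the kernel writes the segment, so it may need copying back from the device.
+    boolean writes(int paramIdx) { return get(paramIdx).contains(Access.WRITE); }
+
+    private Set<Access> get(int paramIdx) {
+        return byParameter.getOrDefault(paramIdx, EnumSet.noneOf(Access.class));
+    }
+}
+```
+
+For a kernel that only calls `set()` on parameter 1, `writes(1)` is true and `reads(1)` is false, so only a copy back from the device would be needed.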
+
+Note that our very simple Compute::compute method expresses neither the movement of the MemorySegment to a device, nor the retrieval of the data from a device when the kernel has executed.
+
+Our assumption is that given the ComputeClosure we can deduce such movements.
+
+There are many ways to achieve this.  One way would be by static analysis.
+
+Given the Compute::compute entrypoint it is easy to determine that we are always (no conditionals or loops) passing (making available
+might be a better term) a memory segment to a kernel (Compute::kernel) and that this kernel only mutates the `MemorySegment`.
+
+So from simple static analysis we could choose to inject one or more calls into the model representing the need for the accelerator to move data to the device before the kernel dispatch and/or back from the device after it.
+
+This modified model would look as if we had presented it with this code.
+
+```java
+ void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
+        Accelerator.Range range = accelerator.range(len);
+        accelerator.run(Compute::kernel, range, memorySegment);
+        accelerator.injectedCopyFromDevice(memorySegment);
+    }
+```
+
+Note the ```injectedCopyFromDevice()``` call.
+
+Because the kernel does not read the `MemorySegment` we only need to inject the code to request a move back from the device.
+
+To do this requires HAT to analyse the kernel(s) and inject appropriate code into
+the Compute::compute method to inform the vendor backend when it should perform such moves.
+
+Another strategy would be to not rely on static analysis but to inject code to trace 'actual' mutations of the MemorySegments and use these flags to guard against unnecessary copies.
+
+```java
+ void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
+        boolean injectedMemorySegmentIsDirty = false;
+        Accelerator.Range range = accelerator.range(len);
+        if (injectedMemorySegmentIsDirty){
+            accelerator.injectedCopyToDevice(memorySegment);
+        }
+        accelerator.run(Compute::kernel, range, memorySegment);
+        injectedMemorySegmentIsDirty = true; // based on Compute::kernel sidetable
+        if (injectedMemorySegmentIsDirty) {
+            accelerator.injectedCopyFromDevice(memorySegment);
+        }
+    }
+```
+
+
+Whether this code mutation generates Java bytecode and executes (or interprets) on the JVM or whether the
+CodeModels for the closure are handed over to a backend which reifies the kernel code and the
+logic for dispatch is not defined.
+
+The code model for the compute will be mutated to inject the appropriate nodes to achieve the goal.
+
+It is possible that some vendors may just take the original code model and analyse it themselves.
+
+Clearly this is a trivial compute closure. Let's discuss the required kernel analysis
+and proposed pseudo code.
+
+## Copying data based on kernel MemorySegment analysis
+
+Above we showed that we should be able to determine whether a kernel mutates or accesses any of
+its kernel `MemorySegment` parameters.
+
+We determined above that the kernel only called set(), so we need
+not copy the data to the device.
+
+The following example shows a kernel which reads and mutates a `MemorySegment`:
+```java
+    static class Compute {
+    @Reflect  public static
+    void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
+        int temp = memorySegment.get(JAVA_INT, ndrange.id.x);
+        memorySegment.set(JAVA_INT, ndrange.id.x, temp*2); // write back to the same slot we read
+    }
+
+    @Reflect public static
+    void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
+        Accelerator.Range range = accelerator.range(len);
+        accelerator.run(Compute::doubleup, range, memorySegment);
+    }
+}
+```
+Here our analysis needs to determine that the kernel reads and writes to the segment (it does)
+so the generated compute model would equate to
+
+```java
+ void compute(Accelerator accelerator, MemorySegment memorySegment, int len) {
+        Accelerator.Range range = accelerator.range(len);
+        accelerator.copyToDevice(memorySegment); // injected via Babylon
+        accelerator.run(Compute::doubleup, range, memorySegment);
+        accelerator.copyFromDevice(memorySegment); // injected via Babylon
+    }
+```
+So far the deductions are fairly trivial.
+
+Consider
+```java
+ @Reflect public static
+    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
+        Accelerator.Range range = accelerator.range(len);
+        for (int i=0; i<count; i++) {
+            accelerator.run(Compute::doubleup, range, memorySegment);
+        }
+    }
+```
+
+Here HAT should deduce that the Java side is merely looping over the kernel dispatch
+and has no interest in the `MemorySegment` between dispatches.
+
+So the new model need only copy in once (before the first kernel) and out once (prior to return):
+
+```java
+ @Reflect public static
+    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
+        Accelerator.Range range = accelerator.range(len);
+        accelerator.copyToDevice(memorySegment); // injected via Babylon
+        for (int i=0; i<count; i++) {
+            accelerator.run(Compute::doubleup, range, memorySegment);
+        }
+        accelerator.copyFromDevice(memorySegment); // injected via Babylon
+    }
+```
+
+Things get slightly more interesting when we do indeed access the memory segment
+from the Java code inside the loop.
+
+```java
+ @Reflect public static
+    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
+        Accelerator.Range range = accelerator.range(len);
+        for (int i=0; i<count; i++) {
+            accelerator.run(Compute::doubleup, range, memorySegment);
+            int slot0 = memorySegment.get(INTVALUE, 0);
+            System.out.println("slot0 " + slot0);
+        }
+    }
+```
+Now we expect Babylon to inject a read inside the loop to make the data available on the Java side:
+
+```java
+ @Reflect public static
+    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
+        Accelerator.Range range = accelerator.range(len);
+        accelerator.copyToDevice(memorySegment); // injected via Babylon
+        for (int i=0; i<count; i++) {
+            accelerator.run(Compute::doubleup, range, memorySegment);
+            accelerator.copyFromDevice(memorySegment); // injected via Babylon
+            int slot0 = memorySegment.get(INTVALUE, 0);
+            System.out.println("slot0 " + slot0);
+        }
+
+    }
+```
+
+Note that in this case we are only accessing the 0th int from the segment, so a possible
+optimization might be to allow the vendor to copy back only this one element:
+```java
+ @Reflect public static
+    void compute(Accelerator accelerator, MemorySegment memorySegment, int len, int count) {
+        Accelerator.Range range = accelerator.range(len);
+        accelerator.copyToDevice(memorySegment); // injected via Babylon
+        for (int i=0; i<count; i++) {
+            accelerator.run(Compute::doubleup, range, memorySegment);
+            if (i+1==count){// injected
+                accelerator.copyFromDevice(memorySegment); // injected
+            }else {
+                accelerator.copyFromDevice(memorySegment, 1); // injected via Babylon
+            }
+            int slot0 = memorySegment.get(INTVALUE, 0);
+            System.out.println("slot0 " + slot0);
+        }
+
+    }
+```
+
+Again, HAT will merely mutate the code model of the compute method.
+The vendor may choose to interpret bytecode, generate bytecode and execute it,
+or take the complete plyTable and execute the model in native code.
+
+So within HAT we must find all set/get calls on MemorySegments and trace them back to kernel parameters.
+
+We should allow aliasing of memory segments, but in the short term we may well throw an exception when we see such aliasing.
+
+
+```java
+ @Reflect  public static
+    void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
+        MemorySegment alias = memorySegment;
+        alias.set(JAVA_INT, ndrange.id.x, alias.get(JAVA_INT, ndrange.id.x)*2);
+    }
+```
+
+## Weed warning #1
+
+We could find common kernel errors when analyzing.
+
+This code is probably wrong, as every work item races to write to the 0th element:
+
+```java
+ void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
+    MemorySegment alias = memorySegment;
+    alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x)*2);
+}
+```
+
+A 'lint'-like plugin mechanism for code models would make such errors easy to find.
+If we ever find a constant index in set(...., <constant> ) we are probably in a world of hurt,
+unless the set is inside some conditional which itself is dependent on a value extracted from a memory segment.
+
+```java
+ void doubleup(Accelerator.NDRange ndrange, MemorySegment memorySegment) {
+    MemorySegment alias = memorySegment;
+    if (????){
+        alias.set(JAVA_INT, 0, alias.get(JAVA_INT, ndrange.id.x) * 2);
+    }
+}
+```
+
+There are a lot of opportunities for catching such bugs.
+
+
+## Flipping Generations
+
+Many algorithms require us to process data in generations. Consider
+convolutions or Game of Life style problems where we have an image or game state and
+we need to calculate the result of applying rules to cells in the image or game.
+
+When we process the next generation (either in parallel or sequentially) we
+must ensure that we only use previous-generation data to generate next-generation data.
+
+```
+[ ][ ][*][ ][ ]       [ ][ ][ ][ ][ ]
+[ ][ ][*][ ][ ]       [ ][*][*][*][ ]
+[ ][ ][*][ ][ ]   ->  [ ][ ][ ][ ][ ]
+[ ][ ][ ][ ][ ]       [ ][ ][ ][ ][ ]
+
+```
+
+This usually requires us to hold two copies, applying the kernel to one input set
+and writing the results to the output.
+
+In the case of the Game Of Life we may well use the output as the next input...
+
+```java
+@Reflect void conway(Accelerator.NDRange ndrange,
+                            MemorySegment in, MemorySegment out, int width, int height) {
+    int cx = ndrange.id.x % ndrange.id.maxx;
+    int cy = ndrange.id.x / ndrange.id.maxx;
+
+    int sum = 0;
+    for (int dx = -1; dx < 2; dx++) {
+        for (int dy = -1; dy < 2; dy++) {
+            if (dx != 0 || dy != 0) { // skip the cell itself
+                int x = cx + dx;
+                int y = cy + dy;
+                if (x >= 0 && x < width && y >= 0 && y < height) {
+                    sum += in.get(INT, y * width + x); // row-major index of the neighbour
+                }
+            }
+        }
+    }
+    int result = GOLRules(sum, in.get(INT, ndrange.id.x));
+    out.set(INT, ndrange.id.x, result);
+
+}
+```
+
+In this case the assumption is that the compute layer will swap the buffers for alternate passes:
+
+```java
+import java.lang.foreign.MemorySegment;
+
+@Reflect
+void compute(Accelerator accelerator, MemorySegment gameState,
+             int width, int height, int maxGenerations) {
+    MemorySegment s1 = gameState;
+    MemorySegment s2 = allocateGameState(width, height);
+    Accelerator.Range range = accelerator.range(width * height); // one work item per cell
+    for (int generation = 0; generation < maxGenerations; generation++){
+        MemorySegment from = generation%2==0 ? s1 : s2;
+        MemorySegment to = generation%2==1 ? s1 : s2;
+        accelerator.run(Compute::conway, range, from, to, width, height);
+    }
+    if (maxGenerations%2==1){ // ?
+        gameState.copyFrom(s2);
+    }
+}
+```
+
+This common pattern includes some aliasing of MemorySegments that we need to untangle.
+
+HAT needs to be able to track the aliases to determine the minimal number of copies:
+```java
+import java.lang.foreign.MemorySegment;
+
+@Reflect
+void compute(Accelerator accelerator, MemorySegment gameState, int width, int height, int maxGenerations,
+             DisplaySAM displaySAM) {
+    MemorySegment s1 = gameState;
+    MemorySegment s2 = allocateGameState(width, height);
+    Accelerator.Range range = accelerator.range(width * height); // one work item per cell
+
+    for (int generation = 0; generation < maxGenerations; generation++){
+        MemorySegment from = generation%2==0 ? s1 : s2;
+        MemorySegment to = generation%2==1 ? s1 : s2;
+        if (generation == 0) {               // injected
+            accelerator.copyToDevice(from);  // injected
+        }                                    // injected
+        accelerator.run(Compute::conway, range, from, to, width, height);
+        if (generation == maxGenerations-1){ // injected
+            accelerator.copyFromDevice(to);  // injected
+        }                                    // injected
+    }
+    if (maxGenerations%2==1){ // ?
+        gameState.copyFrom(s2);
+    }
+
+}
+```
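+
+If the Java side also consumes each generation inside the loop (here via `displaySAM`), the output would have to be copied back on every pass. The compute method, before any injection, might look like this:
+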
+```java
+import java.lang.foreign.MemorySegment;
+
+@Reflect
+void compute(Accelerator accelerator, MemorySegment gameState, int width, int height,
+             int maxGenerations,
+             DisplaySAM displaySAM) {
+    MemorySegment s1 = gameState;
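+    MemorySegment s2 = allocateGameState(width, height);
+    Accelerator.Range range = accelerator.range(width * height); // one work item per cell
+
+    for (int generation = 0; generation < maxGenerations; generation++){
+        MemorySegment from = generation%2==0 ? s1 : s2;
+        MemorySegment to = generation%2==1 ? s1 : s2;
+        accelerator.run(Compute::conway, range, from, to, width, height);
+        displaySAM.display(to, width, height); // Java reads the freshly written generation
+    }
+    if (maxGenerations%2==1){ // ?
+        gameState.copyFrom(s2); // 'to' is out of scope here; after an odd count the result is in s2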
+    }
+}
+```
+
+
+
+### Project Babylon transform to track buffer mutations
+
+One goal of HAT was to automate the movement of buffers from Java to device.
+
+One strategy employed by `NativeBackends` might be to track 'ifaceMappedSegment' accesses and inject tracking data into the compute method.
+
+Here is a transformation for that:
+
+```java
+ static FuncOpWrapper injectBufferTracking(ComputeClosure.ResolvedMethodCall resolvedMethodCall) {
+        FuncOpWrapper original = resolvedMethodCall.funcOpWrapper();
+        var transformed = original.transformInvokes((builder, invoke) -> {
+                    if (invoke.isIfaceBufferMethod()) { // void array(long idx, T value) or T array(long idx)
+                        // Get the first parameter (computeClosure)
+                        CopyContext cc = builder.context();
+                        Value computeClosure = cc.getValue(original.parameter(0));
+                        // Get the buffer receiver value in the output model
+                        Value receiver = cc.getValue(invoke.operand(0)); // The buffer we are mutating or accessing
+                        if (invoke.isIfaceMutator()) {
+                            // inject CLWrapComputeContext.preMutate(buffer);
+                            builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_MUTATE, computeClosure, receiver));
+                            builder.op(invoke.op());
+                           // inject CLWrapComputeContext.postMutate(buffer);
+                            builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_MUTATE, computeClosure, receiver));
+                        } else if ( invoke.isIfaceAccessor()) {
+                           // inject CLWrapComputeContext.preAccess(buffer);
+                            builder.op(CoreOps.invoke(ComputeClosure.M_CC_PRE_ACCESS, computeClosure, receiver));
+                            builder.op(invoke.op());
+                            // inject CLWrapComputeContext.postAccess(buffer);
+                            builder.op(CoreOps.invoke(ComputeClosure.M_CC_POST_ACCESS, computeClosure, receiver));
+                        } else {
+                            builder.op(invoke.op());
+                        }
+                    }else{
+                        builder.op(invoke.op());
+                    }
+                    return builder;
+                }
+        );
+        transformed.op().writeTo(System.out);
+        resolvedMethodCall.funcOpWrapper(transformed);
+        return transformed;
+    }
+```
+
+So in our `OpenCLBackend`, for example:
+```java
+    public void mutateIfNeeded(ComputeClosure.MethodCall methodCall) {
+       injectBufferTracking(entrypoint);
+    }
+
+    @Override
+    public void computeContextClosed(ComputeContext CLWrapComputeContext){
+        var codeBuilder = new OpenCLKernelBuilder();
+        C99Code kernelCode = createKernelCode(CLWrapComputeContext, codeBuilder);
+        System.out.println(codeBuilder);
+    }
+```
+I hacked the Mandel example so that the compute method accesses and mutates its arrays.
+
+```java
+  @Reflect
+    static float doubleit(float f) {
+        return f * 2;
+    }
+
+    @Reflect
+    static float scaleUp(float f) {
+        return doubleit(f);
+    }
+
+    @Reflect
+    static public void compute(final ComputeContext CLWrapComputeContext, S32Array2D s32Array2D, float x, float y, float scale) {
+        scale = scaleUp(scale);
+        var range = CLWrapComputeContext.accelerator.range(s32Array2D.size());
+        int i = s32Array2D.get(10,10);
+        s32Array2D.set(10,10,i);
+        CLWrapComputeContext.dispatchKernel(MandelCompute::kernel, range, s32Array2D, pallette, x, y, scale);
+    }
+```
+So here is the transformation being applied to the above compute:
+
+BEFORE (note the !'s indicating accesses through ifacebuffers)
+```
+func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
+    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
+    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
+    %7 : Var<float> = var %2 @"x";
+    %8 : Var<float> = var %3 @"y";
+    %9 : Var<float> = var %4 @"scale";
+    %10 : float = var.load %9;
+    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
+    var.store %9 %11;
+    %12 : hat.ComputeContext = var.load %5;
+    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
+    %14 : hat.buffer.S32Array2D = var.load %6;
+!   %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
+    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
+    %17 : Var<hat.NDRange> = var %16 @"range";
+    %18 : hat.buffer.S32Array2D = var.load %6;
+    %19 : int = constant @"10";
+    %20 : int = constant @"10";
+!   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
+    %22 : Var<int> = var %21 @"i";
+    %23 : hat.buffer.S32Array2D = var.load %6;
+    %24 : int = constant @"10";
+    %25 : int = constant @"10";
+    %26 : int = var.load %22;
+ !  invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
+    %27 : hat.ComputeContext = var.load %5;
+    ...
+```
+AFTER
+```
+func @"compute" (%0 : hat.ComputeContext, %1 : hat.buffer.S32Array2D, %2 : float, %3 : float, %4 : float)void -> {
+    %5 : Var<hat.ComputeContext> = var %0 @"CLWrapComputeContext";
+    %6 : Var<hat.buffer.S32Array2D> = var %1 @"s32Array2D";
+    %7 : Var<float> = var %2 @"x";
+    %8 : Var<float> = var %3 @"y";
+    %9 : Var<float> = var %4 @"scale";
+    %10 : float = var.load %9;
+    %11 : float = invoke %10 @"mandel.Main::scaleUp(float)float";
+    var.store %9 %11;
+    %12 : hat.ComputeContext = var.load %5;
+    %13 : hat.Accelerator = field.load %12 @"hat.ComputeContext::accelerator()hat.Accelerator";
+    %14 : hat.buffer.S32Array2D = var.load %6;
+    invoke %0 %14 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
+!    %15 : int = invoke %14 @"hat.buffer.S32Array2D::size()int";
+    invoke %0 %14 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
+    %16 : hat.NDRange = invoke %13 %15 @"hat.Accelerator::range(int)hat.NDRange";
+    %17 : Var<hat.NDRange> = var %16 @"range";
+    %18 : hat.buffer.S32Array2D = var.load %6;
+    %19 : int = constant @"10";
+    %20 : int = constant @"10";
+    invoke %0 %18 @"hat.ComputeClosure::preAccess(hat.buffer.Buffer)void";
+ !   %21 : int = invoke %18 %19 %20 @"hat.buffer.S32Array2D::get(int, int)int";
+    invoke %0 %18 @"hat.ComputeClosure::postAccess(hat.buffer.Buffer)void";
+    %22 : Var<int> = var %21 @"i";
+    %23 : hat.buffer.S32Array2D = var.load %6;
+    %24 : int = constant @"10";
+    %25 : int = constant @"10";
+    %26 : int = var.load %22;
+    invoke %0 %23 @"hat.ComputeClosure::preMutate(hat.buffer.Buffer)void";
+ !   invoke %23 %24 %25 %26 @"hat.buffer.S32Array2D::set(int, int, int)void";
+    invoke %0 %23 @"hat.ComputeClosure::postMutate(hat.buffer.Buffer)void";
+    %27 : hat.ComputeContext = var.load %5;
+```
+And here, at runtime, is the ComputeClosure reporting accesses when the transformed model (with the injected calls) is executed via the interpreter.
+
+```
+ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
+ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
+ComputeClosure.preAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
+ComputeClosure.postAccess S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
+ComputeClosure.preMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
+ComputeClosure.postMutate S32Array2D[width()=1024, height()=1024, array()=int[1048576]]
+```
+## Why inject this info?
+So the idea is that the ComputeContext would maintain sets of dirty buffers, one set for `gpuDirty` and one set for `javaDirty`.
+
+We have the code for kernel models. So we know which kernel accesses, mutates or accesses AND mutates particular parameters.
+
+So when the ComputeContext receives  `preAccess(x)` or `preMutate(x)` the ComputeContext would determine if `x` is in the `gpuDirty` set.
+If so it would delegate to the backend to  copy the GPU data back from device into the memory segment (assuming the memory is not coherent!)
+before removing the buffer from `gpuDirty` set and returning.
+
+Now the Java access to the segment sees the latest buffer.
+
+After `postMutate(x)` it will place the buffer in `javaDirty` set.
+
+When a kernel dispatch comes along, the parameters to the kernel are all checked against the `javaDirty` set.
+If the parameter is 'accessed' by the kernel, the backend will copy the segment to the device, remove the parameter
+from the `javaDirty` set and then invoke the kernel.
+When the kernel completes (let's assume synchronous for a moment) all parameters are checked again, and if the parameter
+is known to be mutated by the kernel the parameter is added to the `gpuDirty` set.
+
+This way we don't have to force the developer to request data movements.
+
+BTW if kernel requests are async ;) then the ComputeContext maintains a map of buffer to kernel.  So `preAccess(x)` or `preMutate(x)` calls
+can wait on the kernel that is due to 'dirty' the buffer to complete.
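+
+The bookkeeping described in this section can be sketched as follows, for the synchronous case only. This is a minimal illustration under stated assumptions, not HAT's implementation: `DirtyTracker`, its nested `Backend` interface, and the `preDispatch`/`postDispatch` hooks are hypothetical names, and the `kernelReads`/`kernelWrites` flags stand in for what the kernel side tables would provide.
+
+```java
+import java.util.Collections;
+import java.util.IdentityHashMap;
+import java.util.Set;
+
+// Hypothetical sketch of the gpuDirty/javaDirty bookkeeping; not HAT's actual API.
+final class DirtyTracker {
+    // Stand-in for the backend's copy operations (names are assumptions).
+    interface Backend {
+        void copyToDevice(Object buffer);
+        void copyFromDevice(Object buffer);
+    }
+
+    // Buffers are tracked by identity, not by equals().
+    private final Set<Object> gpuDirty = Collections.newSetFromMap(new IdentityHashMap<>());
+    private final Set<Object> javaDirty = Collections.newSetFromMap(new IdentityHashMap<>());
+    private final Backend backend;
+
+    DirtyTracker(Backend backend) { this.backend = backend; }
+
+    // Before Java reads the buffer: fetch the device copy if the device wrote it last.
+    void preAccess(Object buffer) {
+        if (gpuDirty.remove(buffer)) {
+            backend.copyFromDevice(buffer);
+        }
+    }
+
+    // Before Java writes the buffer: same check as for a read.
+    void preMutate(Object buffer) { preAccess(buffer); }
+
+    // After Java has written the buffer: the device copy is now stale.
+    void postMutate(Object buffer) { javaDirty.add(buffer); }
+
+    // Before a kernel dispatch, once per buffer parameter.
+    void preDispatch(Object buffer, boolean kernelReads) {
+        if (kernelReads && javaDirty.remove(buffer)) {
+            backend.copyToDevice(buffer);
+        }
+    }
+
+    // After a synchronous dispatch, once per buffer parameter.
+    void postDispatch(Object buffer, boolean kernelWrites) {
+        if (kernelWrites) {
+            gpuDirty.add(buffer);
+        }
+    }
+}
+```
+
+With this shape, repeated dispatches of a read/write kernel over the same buffer trigger one copy to the device and one copy back, provided Java does not touch the buffer in between. The sketch assumes a buffer is marked via `postMutate` before its first dispatch; how freshly allocated buffers enter the `javaDirty` set is left open here.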
+
+### Marking hat buffers directly.
+
+
+
+
+
+
+
+
+