# Minimizing Buffer Transfers

----
* [Contents](hat-00.md)
* Build Babylon and HAT
    * [Quick Install](hat-01-quick-install.md)
    * [Building Babylon with jtreg](hat-01-02-building-babylon.md)
    * [Building HAT with jtreg](hat-01-03-building-hat.md)
        * [Enabling the NVIDIA CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* [Testing Framework](hat-02-testing-framework.md)
* [Running Examples](hat-03-examples.md)
* [HAT Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Development
    * [Project Layout](hat-01-01-project-layout.md)
* Implementation Details
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)
* [Running HAT with Docker on NVIDIA GPUs](hat-07-docker-build-nvidia.md)
---

## Using buffer marking to minimize data transfers

### The naive approach
In the default execution model, at each kernel dispatch the backend
copies all arg buffers to the device, and after the dispatch it
copies all arg buffers back.

### Using kernel arg buffer access patterns
If we know how each kernel accesses its args (via static analysis of the code model, or
by marking the args RO, RW or WO with annotations), we can avoid some copies by only
copying in if the kernel reads the arg buffer and only copying out if the
kernel writes to the arg buffer.

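To make the saving concrete, here is a minimal sketch that counts copies per dispatch under both models. The `Access` enum and `AccessAwareCopies` class are hypothetical names made up for illustration; they are not HAT APIs.

```java
import java.util.List;

// Hypothetical sketch: Access and AccessAwareCopies are illustrative names, not HAT APIs.
enum Access {
    RO(true, false),  // kernel only reads: copy in, never copy out
    RW(true, true),   // kernel reads and writes: copy in and out
    WO(false, true);  // kernel only writes: skip the copy in

    final boolean copyIn;
    final boolean copyOut;

    Access(boolean copyIn, boolean copyOut) {
        this.copyIn = copyIn;
        this.copyOut = copyOut;
    }
}

final class AccessAwareCopies {
    /** Copies per dispatch when every arg is copied both ways (the naive model). */
    static int naive(List<Access> args) {
        return 2 * args.size();
    }

    /** Copies per dispatch when the access annotation decides each direction. */
    static int accessAware(List<Access> args) {
        int copies = 0;
        for (Access a : args) {
            if (a.copyIn) copies++;
            if (a.copyOut) copies++;
        }
        return copies;
    }
}
```

For a kernel with one RO buffer and one RW buffer, `naive` gives 4 copies per dispatch and `accessAware` gives 3. Annotations alone only remove the copies that can never be needed; skipping copies that are sometimes needed requires tracking buffer state.
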
Let's use the game of life as an example.

We assume that the UI only needs updating at some rate (say 5 fps), but the kernels can produce
more than 5 generations per second.

So not every generation needs to be copied back from the device.

We'll ignore the details of the `life` kernel and assume the kernel args
are appropriately annotated as RO, RW or WO.

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
  if (kc.x < kc.maxX) {
    Compute.lifePerIdx(kc.x, control, cellGrid);
  }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
  var timeOfLastUIUpdate = System.currentTimeMillis();
  var msPerFrame = 1000/5; // we want 5 fps
  while (viewer.state.generation < viewer.state.maxGenerations) {
    long now = System.currentTimeMillis();
    var msSinceLastUpdate = (now - timeOfLastUIUpdate);
    var updateNeeded =  (msSinceLastUpdate > msPerFrame);

    cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
            kc -> Compute.life(kc, control, cellGrid)
    );

    // Here we are swapping from<->to on the control buffer
    int to = control.from();
    control.from(control.to());
    control.to(to);

    if (updateNeeded) {
      viewer.update(now, to, cellGrid);
      timeOfLastUIUpdate = now;
    }
  }
}
```

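The `updateNeeded` test in the loop above is what limits how often `cellGrid` is needed on the Java side. As a rough sketch, assuming a fixed cost per generation and a simulated clock (the `FrameRateCopies` class and its parameters are hypothetical), we can count how many generations would actually need copying back:

```java
// Hypothetical sketch: a simulated clock stands in for System.currentTimeMillis().
final class FrameRateCopies {
    /**
     * Counts the generations whose cellGrid must be copied back for the UI,
     * assuming every generation takes msPerGeneration milliseconds.
     */
    static int copyBacks(int generations, long msPerGeneration, long msPerFrame) {
        long now = 0;
        long timeOfLastUIUpdate = 0;
        int copies = 0;
        for (int g = 0; g < generations; g++) {
            now += msPerGeneration;                      // the kernel dispatch "runs"
            boolean updateNeeded = (now - timeOfLastUIUpdate) > msPerFrame;
            if (updateNeeded) {
                copies++;                                // the UI needs this generation
                timeOfLastUIUpdate = now;
            }
        }
        return copies;
    }
}
```

Under these assumptions `copyBacks(1000, 10, 200)` is 47, so only about one generation in 21 has to leave the device. If a generation takes longer than a frame, every generation is copied back and nothing is saved.
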
First, let's assume there were no automatic transfers and we had to control them explicitly by inserting the copy code ourselves.

What would our code look like?

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
  if (kc.x < kc.maxX) {
    Compute.lifePerIdx(kc.x, control, cellGrid);
  }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
  var timeOfLastUIUpdate = System.currentTimeMillis();
  var msPerFrame = 1000/5; // we want 5 fps
  var cellGridIsJavaDirty = true;
  var controlIsJavaDirty = true;
  var cellGridIsDeviceDirty = false; // nothing has run on the device yet
  var controlIsDeviceDirty = false;
  while (true) {
    long now = System.currentTimeMillis();
    var msSinceLastUpdate = (now - timeOfLastUIUpdate);
    var updateNeeded =  (msSinceLastUpdate > msPerFrame);

    if (cellGridIsJavaDirty){
        cc.copyToDevice(cellGrid);
        cellGridIsJavaDirty = false; // the device copy is now current
    }
    if (controlIsJavaDirty){
        cc.copyToDevice(control);
        controlIsJavaDirty = false; // the device copy is now current
    }
    cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
            kc -> Compute.life(kc, control, cellGrid)
    );
    controlIsDeviceDirty = false; // Compute.life marked control as @RO
    cellGridIsDeviceDirty = true; // Compute.life marked cellGrid as @RW

    // Here we are swapping from<->to on the control buffer
    if (controlIsDeviceDirty){
      cc.copyFromDevice(control);
    }
    int to = control.from();
    control.from(control.to());
    control.to(to);
    controlIsJavaDirty = true;

    if (updateNeeded) {
      if (cellGridIsDeviceDirty){
        cc.copyFromDevice(cellGrid);
        cellGridIsDeviceDirty = false; // the Java copy is now current
      }
      viewer.update(now, to, cellGrid);
      timeOfLastUIUpdate = now;
    }
  }
}
```

Alternatively, what if the buffers themselves could hold the `JavaDirty` and `DeviceDirty` flags?

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
  if (kc.x < kc.maxX) {
    Compute.lifePerIdx(kc.x, control, cellGrid);
  }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
  control.flags = JavaDirty; // not ideal but necessary
  cellGrid.flags = JavaDirty; // not ideal but necessary

  var timeOfLastUIUpdate = System.currentTimeMillis();
  var msPerFrame = 1000/5; // we want 5 fps

  while (true) {
    long now = System.currentTimeMillis();
    var msSinceLastUpdate = (now - timeOfLastUIUpdate);
    var updateNeeded =  (msSinceLastUpdate > msPerFrame);

    if ((cellGrid.flags & JavaDirty) == JavaDirty){
        cc.copyToDevice(cellGrid);
    }
    if ((control.flags & JavaDirty) == JavaDirty){
        cc.copyToDevice(control);
    }
    cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
            kc -> Compute.life(kc, control, cellGrid)
    );
    control.flags = JavaDirty; // Compute.life marked control as @RO
    cellGrid.flags = DeviceDirty; // Compute.life marked cellGrid as @RW

    // Here we are swapping from<->to on the control buffer
    if ((control.flags & DeviceDirty) == DeviceDirty){
      cc.copyFromDevice(control);
    }
    int to = control.from();
    control.from(control.to());
    control.to(to);
    control.flags = JavaDirty;

    if (updateNeeded) {
      if ((cellGrid.flags & DeviceDirty) == DeviceDirty){
        cc.copyFromDevice(cellGrid);
      }
      viewer.update(now, to, cellGrid);
      // update does not mutate cellGrid, so cellGrid.flags stays DeviceDirty
      timeOfLastUIUpdate = now;
    }
  }
}
```

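The `flags` tests in the listing above treat the field as a bitmask. A minimal sketch of such a flag word follows, assuming one bit per state. The `DirtyFlags` class, the bit values and the helper names are all assumptions for illustration; the real HAT buffer layout may differ.

```java
// Hypothetical sketch: bit values and helper names are assumptions, not HAT's buffer layout.
final class DirtyFlags {
    static final int JavaDirty   = 1 << 0; // the Java copy is newer than the device copy
    static final int DeviceDirty = 1 << 1; // the device copy is newer than the Java copy

    /** True if every bit in {@code bit} is set in {@code flags}. */
    static boolean isSet(int flags, int bit) {
        return (flags & bit) == bit;
    }

    /** Returns {@code flags} with {@code bit} set. */
    static int set(int flags, int bit) {
        return flags | bit;
    }

    /** Returns {@code flags} with {@code bit} cleared. */
    static int clear(int flags, int bit) {
        return flags & ~bit;
    }
}
```

Keeping the two states in separate bits leaves room for a third, clean state (`flags == 0`) in which neither side needs a copy.
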
Essentially, we defer to the kernel dispatch to determine whether buffers need to be
copied to the device, and to mark buffers as device dirty if the dispatch mutated them.

Pseudo-code for dispatch is essentially:
```java
void dispatchKernel(Kernel kernel, KernelContext kc, Arg ... args) {
    // Copy in only the args that are stale on the device and that the kernel reads
    for (int argn = 0; argn < args.length; argn++) {
      Arg arg = args[argn];
      if (((arg.flags & JavaDirty) == JavaDirty) && kernel.readsFrom(arg)) {
         enqueueCopyToDevice(arg);
         arg.flags &= ~JavaDirty; // the device copy is now current
      }
    }
    enqueueKernel(kernel);
    // Mark every arg the kernel writes so Java knows to copy it back before use
    for (int argn = 0; argn < args.length; argn++) {
       Arg arg = args[argn];
       if (kernel.writesTo(arg)) {
          arg.flags = DeviceDirty;
       }
    }
}
```
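
To check that this logic really avoids repeated copies, here is a small self-contained simulation of the pseudo-code. Everything in it (`DispatchSim`, `Buf`, `Mode`, the transfer log) is hypothetical and only stands in for the real backend. It also assumes that a copy to the device clears the `JavaDirty` bit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: DispatchSim, Buf and Mode are illustrative, not HAT classes.
final class DispatchSim {
    static final int JavaDirty = 1, DeviceDirty = 2;

    enum Mode { RO, RW, WO }

    static final class Buf {
        final String name;
        int flags = JavaDirty; // a fresh buffer exists only on the Java side
        Buf(String name) { this.name = name; }
    }

    /** Log of simulated transfers, e.g. "toDevice:cellGrid". */
    final List<String> transfers = new ArrayList<>();

    /** Mirrors the pseudo-code: copy in what is stale and read, then mark what was written. */
    void dispatchKernel(Buf[] args, Mode[] modes) {
        for (int i = 0; i < args.length; i++) {
            boolean reads = modes[i] != Mode.WO;
            if ((args[i].flags & JavaDirty) == JavaDirty && reads) {
                transfers.add("toDevice:" + args[i].name);
                args[i].flags &= ~JavaDirty; // assumption: a copy in clears the Java-dirty bit
            }
        }
        // The kernel itself would be enqueued here.
        for (int i = 0; i < args.length; i++) {
            boolean writes = modes[i] != Mode.RO;
            if (writes) {
                args[i].flags = DeviceDirty;
            }
        }
    }
}
```

Dispatching twice with an RO `control` and an RW `cellGrid`, with no Java-side writes in between, logs two transfers in total: both buffers go to the device on the first dispatch and nothing moves on the second.
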
We rely on Babylon to mark each buffer passed into `compute` as `JavaDirty`:

```java
@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;
    cellGrid.flags = JavaDirty;
    // yada yada
}
```

We also rely on Babylon to inject calls before each buffer access from Java in the compute code.

So the injected code would look like this:

```java
@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
  control.flags = JavaDirty; // injected by babylon
  cellGrid.flags = JavaDirty; // injected by babylon

  var timeOfLastUIUpdate = System.currentTimeMillis();
  var msPerFrame = 1000/5; // we want 5 fps
  while (true) {
    long now = System.currentTimeMillis();
    var msSinceLastUpdate = (now - timeOfLastUIUpdate);
    var updateNeeded =  (msSinceLastUpdate > msPerFrame);

    // See the pseudo-code above to see how dispatchKernel
    // only copies buffers that need copying, and marks
    // buffers it has mutated as dirty
    cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
            kc -> Compute.life(kc, control, cellGrid)
    );

    // injected by babylon
    if ((control.flags & DeviceDirty) == DeviceDirty){
      cc.copyFromDevice(control);
    }
    // Here we are swapping from<->to on the control buffer
    int to = control.from();

    control.from(control.to());
    control.flags = JavaDirty; // injected
    control.to(to);
    control.flags = JavaDirty; // injected, but can be avoided

    if (updateNeeded) {
        // Injected by babylon because cellGrid escapes compute
        // and because viewer.update marks cellGrid as @RO
        if ((cellGrid.flags & DeviceDirty) == DeviceDirty){
          cc.copyFromDevice(cellGrid);
        }
        viewer.update(now, to, cellGrid);
        // We don't copy cellGrid back after escape because
        // viewer.update annotates cellGrid access as RO
        timeOfLastUIUpdate = now;
    }
  }
}
```
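
The injected statements in the listing above follow one pattern: before Java touches a buffer, fetch it if the device copy is newer; after Java writes to it, mark it Java dirty. Below is a sketch of that pattern as two helpers. The names `HostAccess`, `beforeJavaAccess` and `afterJavaWrite` are hypothetical, and what Babylon would actually inject is not specified here. Unlike the listing, this sketch assumes the `DeviceDirty` bit is cleared after a copy back, so that repeated reads do not repeat the transfer.

```java
// Hypothetical sketch of the guards shown as "injected" in the listing above.
final class HostAccess {
    static final int JavaDirty = 1, DeviceDirty = 2;

    static final class Buf {
        int flags;
        int copiesFromDevice; // counts simulated device-to-Java transfers
    }

    /** Guard placed before Java touches the buffer: fetch it only if the device copy is newer. */
    static void beforeJavaAccess(Buf b) {
        if ((b.flags & DeviceDirty) == DeviceDirty) {
            b.copiesFromDevice++;    // stands in for cc.copyFromDevice(b)
            b.flags &= ~DeviceDirty; // assumption: both copies now agree
        }
    }

    /** Marker placed after Java mutates the buffer, so the next dispatch copies it in. */
    static void afterJavaWrite(Buf b) {
        b.flags = JavaDirty;
    }
}
```

With these helpers, two consecutive Java reads of a device-dirty buffer cost one transfer, and a buffer that Java has just written is never fetched from the device.
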