# Minimizing Buffer Transfers

----
* [Contents](hat-00.md)
* Build Babylon and HAT
    * [Quick Install](hat-01-quick-install.md)
    * [Building Babylon with jtreg](hat-01-02-building-babylon.md)
    * [Building HAT with jtreg](hat-01-03-building-hat.md)
        * [Enabling the NVIDIA CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* [Testing Framework](hat-02-testing-framework.md)
* [Running Examples](hat-03-examples.md)
* [HAT Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Development
    * [Project Layout](hat-01-01-project-layout.md)
    * [IntelliJ Code Formatter](hat-development.md)
* Implementation Details
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)
* [Running HAT with Docker on NVIDIA GPUs](hat-07-docker-build-nvidia.md)
---

## Using buffer marking to minimize data transfers

### The naive approach
The default execution model is that, at each kernel dispatch, the backend
copies all arg buffers to the device, and after the dispatch it copies all
arg buffers back.
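
To make the cost of this model concrete, here is a minimal sketch that only counts copies. It is not HAT code: `NaiveTransferCount` and `copiesFor` are hypothetical names, and the figures (two arg buffers, 1000 dispatches) are assumptions chosen for illustration.

```java
// Hypothetical model (not the HAT API): counts buffer copies under the naive scheme.
final class NaiveTransferCount {

    /** Every dispatch copies each arg buffer to the device and then back again. */
    static long copiesFor(long dispatches, int argBuffers) {
        long copiesPerDispatch = 2L * argBuffers; // one copy in plus one copy out per buffer
        return dispatches * copiesPerDispatch;
    }

    public static void main(String[] args) {
        // A kernel with two arg buffers, dispatched 1000 times.
        System.out.println(copiesFor(1000, 2)); // prints 4000
    }
}
```

The count grows with every dispatch regardless of whether either side actually changed the data.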

### Using kernel arg buffer access patterns
If we know how each kernel accesses its args (via static analysis of the code model, or
by marking the args RO, RW or WO with annotations), we can avoid some copies: we only
copy in if the kernel reads the arg buffer, and only copy out if the kernel writes to
the arg buffer.
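
The rule above can be sketched as a small table from access mode to required copies. This is an illustrative sketch only: the `ArgAccess` enum and `copiesPerDispatch` are hypothetical names, not the HAT annotations themselves.

```java
// Hypothetical sketch (not the HAT API): which copies each access mode requires.
enum ArgAccess {
    RO(true, false),  // kernel only reads: copy in, nothing to copy out
    RW(true, true),   // kernel reads and writes: copy in and copy out
    WO(false, true);  // kernel only writes: nothing to copy in, copy out

    final boolean copyIn;
    final boolean copyOut;

    ArgAccess(boolean copyIn, boolean copyOut) {
        this.copyIn = copyIn;
        this.copyOut = copyOut;
    }

    /** Number of copies one dispatch needs for the given arg access modes. */
    static int copiesPerDispatch(ArgAccess... args) {
        int copies = 0;
        for (ArgAccess access : args) {
            if (access.copyIn) copies++;
            if (access.copyOut) copies++;
        }
        return copies;
    }

    public static void main(String[] unused) {
        // One RO arg and one RW arg: 3 copies instead of the naive 4.
        System.out.println(copiesPerDispatch(RO, RW)); // prints 3
    }
}
```

Access modes alone still imply a copy on every dispatch; the dirty flags introduced later remove copies of data that has not changed.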

Let's use the game of life as an example.

We assume that the UI only needs updating at some rate (say 5 fps), but the kernels can
produce generations faster than 5 generations per second.

So not every generation needs to be copied back from the device.
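
As a rough worked example of how many copy-backs this saves, consider the arithmetic below. The generation rate of 1000 per second is an assumption made up for illustration, and `CopyBackBudget` is a hypothetical helper, not part of HAT.

```java
// Hypothetical arithmetic helper (not the HAT API).
final class CopyBackBudget {

    /** Grid copy-backs needed when the UI refreshes `fps` times per second. */
    static long copyBacksNeeded(long seconds, int fps) {
        return seconds * fps;
    }

    /** Grid copy-backs that can be skipped if the device produces `generationsPerSecond`. */
    static long copyBacksSkipped(long seconds, int fps, int generationsPerSecond) {
        long generations = seconds * generationsPerSecond;
        return Math.max(0L, generations - copyBacksNeeded(seconds, fps));
    }

    public static void main(String[] args) {
        // Assumed: 1000 generations per second for 10 seconds, UI at 5 fps.
        System.out.println(copyBacksNeeded(10, 5));        // prints 50
        System.out.println(copyBacksSkipped(10, 5, 1000)); // prints 9950
    }
}
```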

We'll ignore the details of the `life` kernel and assume that the kernel args
are appropriately annotated as RO, RW or WO.

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
  if (kc.x < kc.maxX) {
    Compute.lifePerIdx(kc.x, control, cellGrid);
  }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
  var timeOfLastUIUpdate = System.currentTimeMillis();
  var msPerFrame = 1000/5; // we want 5 fps
  while (viewer.state.generation < viewer.state.maxGenerations) {
    long now = System.currentTimeMillis();
    var msSinceLastUpdate = (now - timeOfLastUIUpdate);
    var updateNeeded =  (msSinceLastUpdate > msPerFrame);

    cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
            kc -> Compute.life(kc, control, cellGrid)
    );

    // Here we are swapping from<->to on the control buffer
    int to = control.from();
    control.from(control.to());
    control.to(to);

    if (updateNeeded) {
      viewer.update(now, to, cellGrid);
      timeOfLastUIUpdate = now;
    }
  }
}
```

First, let's assume there were no automatic transfers and we had to control them explicitly, by inserting the copy calls ourselves.

What would our code look like?

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
  if (kc.x < kc.maxX) {
    Compute.lifePerIdx(kc.x, control, cellGrid);
  }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
  var timeOfLastUIUpdate = System.currentTimeMillis();
  var msPerFrame = 1000/5; // we want 5 fps
  var cellGridIsJavaDirty = true;    // the Java side holds the only valid copy at first
  var controlIsJavaDirty = true;
  var cellGridIsDeviceDirty = false; // nothing has run on the device yet
  var controlIsDeviceDirty = false;
  while (true) {
    long now = System.currentTimeMillis();
    var msSinceLastUpdate = (now - timeOfLastUIUpdate);
    var updateNeeded =  (msSinceLastUpdate > msPerFrame);

    if (cellGridIsJavaDirty){
        cc.copyToDevice(cellGrid);
        cellGridIsJavaDirty = false; // device now holds the latest cellGrid
    }
    if (controlIsJavaDirty){
        cc.copyToDevice(control);
        controlIsJavaDirty = false; // device now holds the latest control
    }
    cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
            kc -> Compute.life(kc, control, cellGrid)
    );
    controlIsDeviceDirty = false; // Compute.life marked control as @RO
    cellGridIsDeviceDirty = true; // Compute.life marked cellGrid as @RW

    // Here we are swapping from<->to on the control buffer
    if (controlIsDeviceDirty){
      cc.copyFromDevice(control);
    }
    int to = control.from();
    control.from(control.to());
    control.to(to);
    controlIsJavaDirty = true;

    if (updateNeeded) {
      if (cellGridIsDeviceDirty){
        cc.copyFromDevice(cellGrid);
        cellGridIsDeviceDirty = false; // Java now holds the latest cellGrid
      }
      viewer.update(now, to, cellGrid);
      timeOfLastUIUpdate = now;
    }
  }
}
```

Alternatively, what if the buffers themselves could hold the `JavaDirty` and `DeviceDirty` flags?

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
  if (kc.x < kc.maxX) {
    Compute.lifePerIdx(kc.x, control, cellGrid);
  }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
  control.flags = JavaDirty; // not ideal but necessary
  cellGrid.flags = JavaDirty; // not ideal but necessary

  var timeOfLastUIUpdate = System.currentTimeMillis();
  var msPerFrame = 1000/5; // we want 5 fps

  while (true) {
    long now = System.currentTimeMillis();
    var msSinceLastUpdate = (now - timeOfLastUIUpdate);
    var updateNeeded =  (msSinceLastUpdate > msPerFrame);

    if ((cellGrid.flags & JavaDirty) == JavaDirty){
        cc.copyToDevice(cellGrid);
        cellGrid.flags &= ~JavaDirty; // device now holds the latest cellGrid
    }
    if ((control.flags & JavaDirty) == JavaDirty){
        cc.copyToDevice(control);
        control.flags &= ~JavaDirty; // device now holds the latest control
    }
    cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
            kc -> Compute.life(kc, control, cellGrid)
    );
    control.flags &= ~DeviceDirty; // Compute.life marked control as @RO, so the device did not modify it
    cellGrid.flags = DeviceDirty; // Compute.life marked cellGrid as @RW

    // Here we are swapping from<->to on the control buffer
    if ((control.flags & DeviceDirty) == DeviceDirty){
      cc.copyFromDevice(control);
    }
    int to = control.from();
    control.from(control.to());
    control.to(to);
    control.flags = JavaDirty;

    if (updateNeeded) {
      if ((cellGrid.flags & DeviceDirty) == DeviceDirty){
        cc.copyFromDevice(cellGrid);
        cellGrid.flags &= ~DeviceDirty; // Java now holds the latest cellGrid
      }
      viewer.update(now, to, cellGrid);
      // update does not mutate cellGrid, so we do not mark it JavaDirty
      timeOfLastUIUpdate = now;
    }
  }
}
```
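
The snippet above uses `JavaDirty` and `DeviceDirty` as bit flags without defining them. A minimal sketch of how such flags might be stored on a buffer follows; the class name `FlaggedBuffer`, the helper methods and the bit values are all assumptions for illustration, not the layout HAT actually uses.

```java
// Hypothetical sketch (not the HAT API): dirty flags stored as bits on a buffer.
final class FlaggedBuffer {
    static final int JavaDirty = 1;        // Java copy is newer than the device copy
    static final int DeviceDirty = 1 << 1; // device copy is newer than the Java copy

    int flags;

    boolean isSet(int flag) { return (flags & flag) == flag; }
    void set(int flag) { flags |= flag; }
    void clear(int flag) { flags &= ~flag; }

    public static void main(String[] args) {
        FlaggedBuffer cellGrid = new FlaggedBuffer();
        cellGrid.set(JavaDirty);                         // Java wrote to the buffer
        System.out.println(cellGrid.isSet(JavaDirty));   // prints true
        cellGrid.clear(JavaDirty);                       // buffer was copied to the device
        cellGrid.set(DeviceDirty);                       // an @RW kernel wrote to it
        System.out.println(cellGrid.isSet(JavaDirty));   // prints false
        System.out.println(cellGrid.isSet(DeviceDirty)); // prints true
    }
}
```

Keeping the two flags as separate bits lets a buffer be clean on both sides (flags of zero), which is the state in which no copy is needed in either direction.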

Essentially, we defer to the kernel dispatch to determine whether buffers are
copied to the device, and to mark buffers accordingly if the dispatch mutated them.

Pseudo-code for dispatch is essentially:
```java
void dispatchKernel(Kernel kernel, KernelContext kc, Arg ... args) {
    // Copy in only the args that Java has changed and that the kernel reads
    for (int argn = 0; argn < args.length; argn++){
      Arg arg = args[argn];
      if (((arg.flags & JavaDirty) == JavaDirty) && kernel.readsFrom(arg)) {
         enqueueCopyToDevice(arg);
         arg.flags &= ~JavaDirty; // device copy is now up to date
      }
    }
    enqueueKernel(kernel);
    // Mark the args that the kernel writes, so Java knows to copy them back on demand
    for (int argn = 0; argn < args.length; argn++){
       Arg arg = args[argn];
       if (kernel.writesTo(arg)) {
          arg.flags = DeviceDirty;
       }
    }
}
```
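
To check the behaviour of the pseudo-code, here is a small runnable simulation. It is a sketch under stated assumptions: `DispatchSim`, its nested `Arg` class and the string log are hypothetical stand-ins for the real backend queue, and the sketch clears `JavaDirty` once a buffer has been copied so that an unchanged buffer is not copied again.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of the dispatch pseudo-code (not the HAT API).
final class DispatchSim {
    static final int JavaDirty = 1;
    static final int DeviceDirty = 2;

    /** A kernel arg with its dirty flags and how the kernel accesses it. */
    static final class Arg {
        final String name;
        final boolean kernelReads;
        final boolean kernelWrites;
        int flags = JavaDirty; // a freshly passed buffer starts out JavaDirty

        Arg(String name, boolean kernelReads, boolean kernelWrites) {
            this.name = name;
            this.kernelReads = kernelReads;
            this.kernelWrites = kernelWrites;
        }
    }

    /** Operations that would have been enqueued, in order. */
    final List<String> log = new ArrayList<>();

    void dispatchKernel(Arg... args) {
        for (Arg arg : args) {
            if ((arg.flags & JavaDirty) == JavaDirty && arg.kernelReads) {
                log.add("copyToDevice(" + arg.name + ")");
                arg.flags &= ~JavaDirty; // device copy is now up to date
            }
        }
        log.add("kernel");
        for (Arg arg : args) {
            if (arg.kernelWrites) {
                arg.flags = DeviceDirty; // Java must copy back before reading
            }
        }
    }

    public static void main(String[] unused) {
        DispatchSim sim = new DispatchSim();
        Arg control = new Arg("control", true, false);  // @RO
        Arg cellGrid = new Arg("cellGrid", true, true); // @RW
        sim.dispatchKernel(control, cellGrid);
        sim.dispatchKernel(control, cellGrid); // nothing changed on the Java side
        System.out.println(sim.log);
        // prints [copyToDevice(control), copyToDevice(cellGrid), kernel, kernel]
    }
}
```

The second dispatch enqueues no copies because neither buffer was modified from Java in between.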
We rely on Babylon to mark each buffer passed to the compute method as `JavaDirty`:

```java
@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;
    cellGrid.flags = JavaDirty;
    // yada yada
}
```

We also rely on Babylon to inject calls before each buffer access from Java in the compute code.

So the injected code would look like this:

```java
@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
  control.flags = JavaDirty; // injected by babylon
  cellGrid.flags = JavaDirty; // injected by babylon

  var timeOfLastUIUpdate = System.currentTimeMillis();
  var msPerFrame = 1000/5; // we want 5 fps
  while (true) {
    long now = System.currentTimeMillis();
    var msSinceLastUpdate = (now - timeOfLastUIUpdate);
    var updateNeeded =  (msSinceLastUpdate > msPerFrame);

    // See the pseudo-code above to see how dispatchKernel
    // only copies buffers that need copying, and marks
    // buffers it has mutated as dirty
    cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
            kc -> Compute.life(kc, control, cellGrid)
    );

    // injected by babylon
    if ((control.flags & DeviceDirty) == DeviceDirty){
      cc.copyFromDevice(control);
    }
    // Here we are swapping from<->to on the control buffer
    int to = control.from();

    control.from(control.to());
    control.flags = JavaDirty; // injected
    control.to(to);
    control.flags = JavaDirty; // injected, but can be avoided

    if (updateNeeded) {
        // Injected by babylon because cellGrid escapes compute
        // and because viewer.update marks cellGrid as @RO
        if ((cellGrid.flags & DeviceDirty) == DeviceDirty){
          cc.copyFromDevice(cellGrid);
        }
        viewer.update(now, to, cellGrid);
        // We don't copy cellGrid back after the escape because
        // viewer.update annotates cellGrid access as RO
        timeOfLastUIUpdate = now;
    }
  }
}
```
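
To see what the marking scheme buys for the game of life loop, the sketch below counts transfers for a fixed number of generations. It is a hypothetical simulation, not HAT code: only the flags are modelled, the UI refresh is driven by a generation count rather than wall-clock time, and the figures (1000 generations, one frame per 200 generations) are assumptions.

```java
// Hypothetical simulation (not the HAT API): counts transfers for the marked life loop.
final class LifeLoopSim {
    static final int JavaDirty = 1;
    static final int DeviceDirty = 2;

    int copiesToDevice;
    int copiesFromDevice;

    /** Runs `generations` dispatches, refreshing the UI every `generationsPerFrame`. */
    void run(int generations, int generationsPerFrame) {
        int control = JavaDirty;  // flags only; buffer contents are not modelled
        int cellGrid = JavaDirty;
        for (int generation = 1; generation <= generations; generation++) {
            // dispatch: copy in whatever Java changed (life reads both args)
            if ((control & JavaDirty) == JavaDirty) { copiesToDevice++; control &= ~JavaDirty; }
            if ((cellGrid & JavaDirty) == JavaDirty) { copiesToDevice++; cellGrid &= ~JavaDirty; }
            cellGrid = DeviceDirty; // life writes cellGrid (@RW); control is @RO

            // Java swaps from<->to on control: a guarded read, then a write
            if ((control & DeviceDirty) == DeviceDirty) { copiesFromDevice++; control &= ~DeviceDirty; }
            control |= JavaDirty;

            // viewer.update reads cellGrid only when a frame is due
            if (generation % generationsPerFrame == 0
                    && (cellGrid & DeviceDirty) == DeviceDirty) {
                copiesFromDevice++;
                cellGrid &= ~DeviceDirty;
            }
        }
    }

    public static void main(String[] args) {
        LifeLoopSim sim = new LifeLoopSim();
        sim.run(1000, 200);
        System.out.println(sim.copiesToDevice);   // prints 1001
        System.out.println(sim.copiesFromDevice); // prints 5
    }
}
```

Under these assumptions the small `control` buffer is still copied in on every generation because Java rewrites it each time, but `cellGrid` is copied in once and copied back only for the five frames shown. The naive model would perform 4000 copies for the same run.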