# Minimizing Buffer Transfers

----
* [Contents](hat-00.md)
* Build Babylon and HAT
* [Quick Install](hat-01-quick-install.md)
* [Building Babylon with jtreg](hat-01-02-building-babylon.md)
* [Building HAT with jtreg](hat-01-03-building-hat.md)
* [Enabling the NVIDIA CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* [Testing Framework](hat-02-testing-framework.md)
* [Running Examples](hat-03-examples.md)
* [HAT Programming Model](hat-03-programming-model.md)
* Interface Mapping
* [Interface Mapping Overview](hat-04-01-interface-mapping.md)
* [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Development
* [Project Layout](hat-01-01-project-layout.md)
* [IntelliJ Code Formatter](hat-development.md)
* Implementation Details
* [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
* [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)
* [Running HAT with Docker on NVIDIA GPUs](hat-07-docker-build-nvidia.md)
---

## Using buffer marking to minimize data transfers

### The naive approach
In the default execution model, at each kernel dispatch the backend
copies all arg buffers to the device, and after the dispatch it copies
all arg buffers back.

### Using kernel arg buffer access patterns
If we know how each kernel accesses its args (via static analysis of the code model, or
by marking the args RO, RW or WO with annotations), we can avoid some copies by only
copying in if the kernel reads the arg buffer and only copying out if the
kernel writes to the arg buffer.
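
The rule above can be sketched as a small lookup from access mode to copy decisions. The `AccessSketch` class, its `Access` enum and the `transfersPerDispatch` helper are hypothetical names used only for illustration; they are not HAT API, which expresses the same information through the `@RO`, `@RW` and `@WO` annotations. The sketch also ignores the `KernelContext` arg.

```java
/** Hypothetical sketch: which transfers each access mode requires. Not HAT API. */
public class AccessSketch {
    enum Access {
        RO(true, false),  // kernel only reads: copy in, never copy out
        RW(true, true),   // kernel reads and writes: copy in and copy out
        WO(false, true);  // kernel only writes: skip the copy in

        final boolean copyIn;
        final boolean copyOut;

        Access(boolean copyIn, boolean copyOut) {
            this.copyIn = copyIn;
            this.copyOut = copyOut;
        }
    }

    /** Number of transfers one dispatch needs for the given arg access modes. */
    static int transfersPerDispatch(Access... args) {
        int transfers = 0;
        for (Access a : args) {
            if (a.copyIn) {
                transfers++;
            }
            if (a.copyOut) {
                transfers++;
            }
        }
        return transfers;
    }

    public static void main(String[] args) {
        // An @RO control buffer plus an @RW cellGrid buffer:
        // 3 transfers per dispatch, versus 4 when copying everything both ways.
        System.out.println(transfersPerDispatch(Access.RO, Access.RW));
    }
}
```

Access modes alone still copy on every dispatch; the dirty flags discussed later in this page are what allow copies to be skipped across dispatches.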

Let's use the game of life as an example.

We assume that the UI only needs updating at some rate (say 5 fps), but the kernels can produce
more than 5 generations per second.

So not every generation needs to be copied back from the device.

We'll ignore the details of the `life` kernel and assume that the kernel args
are appropriately annotated as RO, RW or WO.

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
    if (kc.x < kc.maxX) {
        Compute.lifePerIdx(kc.x, control, cellGrid);
    }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000 / 5; // we want 5 fps
    while (viewer.state.generation < viewer.state.maxGenerations) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );

        // Here we are swapping from<->to on the control buffer
        int to = control.from();
        control.from(control.to());
        control.to(to);

        if (updateNeeded) {
            viewer.update(now, to, cellGrid);
            timeOfLastUIUpdate = now;
        }
    }
}
```

First, let's assume there were no automatic transfers and we had to control them explicitly by inserting the copy calls ourselves.

What would our code look like?

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
    if (kc.x < kc.maxX) {
        Compute.lifePerIdx(kc.x, control, cellGrid);
    }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000 / 5; // we want 5 fps
    var cellGridIsJavaDirty = true;
    var controlIsJavaDirty = true;
    var cellGridIsDeviceDirty = true;
    var controlIsDeviceDirty = true;
    while (true) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        if (cellGridIsJavaDirty) {
            cc.copyToDevice(cellGrid);
            cellGridIsJavaDirty = false; // the device copy is now up to date
        }
        if (controlIsJavaDirty) {
            cc.copyToDevice(control);
            controlIsJavaDirty = false; // the device copy is now up to date
        }
        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );
        controlIsDeviceDirty = false; // Compute.life marked control as @RO
        cellGridIsDeviceDirty = true; // Compute.life marked cellGrid as @RW

        // Here we are swapping from<->to on the control buffer
        if (controlIsDeviceDirty) {
            cc.copyFromDevice(control);
        }
        int to = control.from();
        control.from(control.to());
        control.to(to);
        controlIsJavaDirty = true;

        if (updateNeeded) {
            if (cellGridIsDeviceDirty) {
                cc.copyFromDevice(cellGrid);
            }
            viewer.update(now, to, cellGrid);
            timeOfLastUIUpdate = now;
        }
    }
}
```

Alternatively, what if the buffers themselves could hold the `JavaDirty` and `DeviceDirty` flags?

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
    if (kc.x < kc.maxX) {
        Compute.lifePerIdx(kc.x, control, cellGrid);
    }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;  // not ideal but necessary
    cellGrid.flags = JavaDirty; // not ideal but necessary

    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000 / 5; // we want 5 fps

    while (true) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        if ((cellGrid.flags & JavaDirty) == JavaDirty) {
            cc.copyToDevice(cellGrid);
        }
        if ((control.flags & JavaDirty) == JavaDirty) {
            cc.copyToDevice(control);
        }
        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );
        control.flags = JavaDirty;    // Compute.life marked control as @RO, so it is not DeviceDirty
        cellGrid.flags = DeviceDirty; // Compute.life marked cellGrid as @RW

        // Here we are swapping from<->to on the control buffer
        if ((control.flags & DeviceDirty) == DeviceDirty) {
            cc.copyFromDevice(control);
        }
        int to = control.from();
        control.from(control.to());
        control.to(to);
        control.flags = JavaDirty;

        if (updateNeeded) {
            if ((cellGrid.flags & DeviceDirty) == DeviceDirty) {
                cc.copyFromDevice(cellGrid);
            }
            viewer.update(now, to, cellGrid);
            // viewer.update does not mutate cellGrid, so we do not mark it JavaDirty
            timeOfLastUIUpdate = now;
        }
    }
}
```

Essentially, we defer to the kernel dispatch to decide which buffers need
copying to the device, and to mark each buffer that the dispatch mutated as `DeviceDirty`.

Pseudo-code for dispatch is essentially:

```java
void dispatchKernel(Kernel kernel, KernelContext kc, Arg... args) {
    // Copy in only the args that are stale on the device and that the kernel reads
    for (int argn = 0; argn < args.length; argn++) {
        Arg arg = args[argn];
        if (((arg.flags & JavaDirty) == JavaDirty) && kernel.readsFrom(arg)) {
            enqueueCopyToDevice(arg);
            arg.flags &= ~JavaDirty; // the device copy is now up to date
        }
    }
    enqueueKernel(kernel);
    // Mark the args the kernel writes; they are copied back lazily, when Java next needs them
    for (int argn = 0; argn < args.length; argn++) {
        Arg arg = args[argn];
        if (kernel.writesTo(arg)) {
            arg.flags = DeviceDirty;
        }
    }
}
```
We rely on Babylon to mark each buffer passed to the compute method as `JavaDirty`.

```java
@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;
    cellGrid.flags = JavaDirty;
    // yada yada
}
```

We also rely on Babylon to inject calls before each buffer access from Java in the compute code.

So the code after injection would look like this:

```java
@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;  // injected by babylon
    cellGrid.flags = JavaDirty; // injected by babylon

    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000 / 5; // we want 5 fps
    while (true) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        // See the pseudo-code above to see how dispatchKernel
        // only copies buffers that need copying, and marks
        // buffers it has mutated as dirty
        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );

        // injected by babylon
        if ((control.flags & DeviceDirty) == DeviceDirty) {
            cc.copyFromDevice(control);
        }
        // Here we are swapping from<->to on the control buffer
        int to = control.from();

        control.from(control.to());
        control.flags = JavaDirty; // injected
        control.to(to);
        control.flags = JavaDirty; // injected, but can be avoided

        if (updateNeeded) {
            // Injected by babylon because cellGrid escapes compute
            // and because viewer.update marks cellGrid as @RO
            if ((cellGrid.flags & DeviceDirty) == DeviceDirty) {
                cc.copyFromDevice(cellGrid);
            }
            viewer.update(now, to, cellGrid);
            // We don't copy cellGrid back after the escape because
            // viewer.update annotates its cellGrid access as RO
            timeOfLastUIUpdate = now;
        }
    }
}
```
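
To get a feel for the combined effect of the dispatch logic and the injected guards, the following is a minimal, self-contained simulation that only logs which transfers would happen. Everything in it (`DirtyFlagSketch`, `Buf`, `beforeJavaRead`, `afterJavaWrite`, and the fixed "refresh the UI every N generations" rule standing in for the 5 fps timer) is a hypothetical stand-in chosen for illustration, not HAT or Babylon API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of dirty-flag dispatch; none of these types are HAT API. */
public class DirtyFlagSketch {
    static final int JAVA_DIRTY = 1;
    static final int DEVICE_DIRTY = 2;

    /** Stand-in for an arg buffer plus the access pattern of the kernel using it. */
    static final class Buf {
        final String name;
        final boolean kernelReads;
        final boolean kernelWrites;
        int flags = JAVA_DIRTY; // the host copy starts out newer than the device copy

        Buf(String name, boolean kernelReads, boolean kernelWrites) {
            this.name = name;
            this.kernelReads = kernelReads;
            this.kernelWrites = kernelWrites;
        }
    }

    final List<String> log = new ArrayList<>();

    /** Copies in only stale buffers the kernel reads, then marks written buffers. */
    void dispatch(Buf... args) {
        for (Buf b : args) {
            if ((b.flags & JAVA_DIRTY) != 0 && b.kernelReads) {
                log.add("toDevice:" + b.name);
                b.flags &= ~JAVA_DIRTY;
            }
        }
        log.add("kernel");
        for (Buf b : args) {
            if (b.kernelWrites) {
                b.flags = DEVICE_DIRTY;
            }
        }
    }

    /** What an injected guard before a host-side access would do. */
    void beforeJavaRead(Buf b) {
        if ((b.flags & DEVICE_DIRTY) != 0) {
            log.add("fromDevice:" + b.name);
            b.flags &= ~DEVICE_DIRTY;
        }
    }

    /** What an injected mark after a host-side write would do. */
    void afterJavaWrite(Buf b) {
        b.flags |= JAVA_DIRTY;
    }

    /** Runs the life-style loop, refreshing the UI every uiEvery generations. */
    static List<String> run(int generations, int uiEvery) {
        DirtyFlagSketch s = new DirtyFlagSketch();
        Buf control = new Buf("control", true, false);  // @RO in the kernel
        Buf cellGrid = new Buf("cellGrid", true, true); // @RW in the kernel
        for (int g = 1; g <= generations; g++) {
            s.dispatch(control, cellGrid);
            s.beforeJavaRead(control);  // the from<->to swap reads control...
            s.afterJavaWrite(control);  // ...and then writes it
            if (g % uiEvery == 0) {
                s.beforeJavaRead(cellGrid); // the viewer only reads cellGrid
            }
        }
        return s.log;
    }

    public static void main(String[] args) {
        System.out.println(run(4, 2));
    }
}
```

Under these assumptions, four generations with a UI refresh every second generation need 7 transfers (`control` copied in four times, `cellGrid` copied in once and out twice), whereas copying both buffers in and out around every dispatch would need 16.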