# Minimizing Buffer Transfers

----
* [Contents](hat-00.md)
* Build Babylon and HAT
* [Quick Install](hat-01-quick-install.md)
* [Building Babylon with jtreg](hat-01-02-building-babylon.md)
* [Building HAT with jtreg](hat-01-03-building-hat.md)
* [Enabling the NVIDIA CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* [Testing Framework](hat-02-testing-framework.md)
* [Running Examples](hat-03-examples.md)
* [HAT Programming Model](hat-03-programming-model.md)
* Interface Mapping
* [Interface Mapping Overview](hat-04-01-interface-mapping.md)
* [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Development
* [Project Layout](hat-01-01-project-layout.md)
* Implementation Details
* [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
* [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)
* [Running HAT with Docker on NVIDIA GPUs](hat-07-docker-build-nvidia.md)
---

## Using buffer marking to minimize data transfers

### The naive approach
The default execution model is that, at each kernel
dispatch, the backend simply copies all arg buffers to
the device and, after the dispatch, copies all arg
buffers back.

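To make the cost of the naive policy concrete, here is a minimal sketch that only counts transfers. `NaiveTransferModel` and its fields are hypothetical names invented for this illustration; they are not part of the HAT API, and no real device is involved.

```java
import java.util.List;

/** Hypothetical model of the naive policy: every dispatch copies every arg in and out. */
final class NaiveTransferModel {
    int copiesToDevice = 0;
    int copiesFromDevice = 0;

    /** Simulates one kernel dispatch over the given arg buffers (identified by name only). */
    void dispatch(List<String> argBuffers) {
        copiesToDevice += argBuffers.size();   // copy every arg in, whether the kernel reads it or not
        // ... the kernel would run here ...
        copiesFromDevice += argBuffers.size(); // copy every arg back, whether the kernel wrote it or not
    }

    public static void main(String[] args) {
        var model = new NaiveTransferModel();
        for (int generation = 0; generation < 100; generation++) {
            model.dispatch(List.of("control", "cellGrid"));
        }
        System.out.println(model.copiesToDevice + " copies in, "
                + model.copiesFromDevice + " copies out");
    }
}
```

With two arg buffers and 100 dispatches this counts 200 copies in each direction, regardless of what the kernel actually touches.
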
### Using kernel arg buffer access patterns
If we know how each kernel accesses its args (via static analysis of the code model, or
by marking the args RO, RW or WO with annotations) we can avoid some copies by only
copying in if the kernel reads the arg buffer and only copying out if the
kernel writes to the arg buffer.

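The rule above can be written down as a small table from access mode to required copies. The sketch below assumes a hypothetical `ArgAccess` enum that mirrors the RO, RW and WO markings; it is an illustration of the rule, not how HAT represents the annotations.

```java
/** Hypothetical access modes mirroring the RO, RW and WO markings on kernel args. */
enum ArgAccess {
    RO(true, false),   // kernel only reads: copy in, nothing to copy out
    RW(true, true),    // kernel reads and writes: copy in and copy out
    WO(false, true);   // kernel only writes: no copy in, copy out

    final boolean copyIn;   // the device needs the Java data before the kernel runs
    final boolean copyOut;  // Java needs the device data after the kernel runs

    ArgAccess(boolean copyIn, boolean copyOut) {
        this.copyIn = copyIn;
        this.copyOut = copyOut;
    }
}

final class ArgAccessDemo {
    public static void main(String[] args) {
        for (ArgAccess access : ArgAccess.values()) {
            System.out.println(access + ": copy in=" + access.copyIn
                    + ", copy out=" + access.copyOut);
        }
    }
}
```
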
Let's use the game of life as an example.

We assume that the UI only needs updating at some rate (say 5 fps), but the kernels can produce
generations faster than 5 generations per second.

So not every generation needs to be copied back from the device.

We'll ignore the details of the `life` kernel and assume the kernel args
are appropriately annotated as RO, RW or WO.

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
    if (kc.x < kc.maxX) {
        Compute.lifePerIdx(kc.x, control, cellGrid);
    }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000/5; // we want 5 fps
    while (viewer.state.generation < viewer.state.maxGenerations) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );

        // Here we are swapping from<->to on the control buffer
        int to = control.from();
        control.from(control.to());
        control.to(to);

        if (updateNeeded) {
            viewer.update(now, to, cellGrid);
            timeOfLastUIUpdate = now;
        }
    }
}
```

First, let's assume there were no automatic transfers and that we had to control them explicitly by inserting the copy calls ourselves.

What would our code look like?

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
    if (kc.x < kc.maxX) {
        Compute.lifePerIdx(kc.x, control, cellGrid);
    }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000/5; // we want 5 fps
    var cellGridIsJavaDirty = true;    // Java holds data the device has not seen yet
    var controlIsJavaDirty = true;
    var cellGridIsDeviceDirty = false; // nothing has run on the device yet
    var controlIsDeviceDirty = false;
    while (true) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        if (cellGridIsJavaDirty) {
            cc.copyToDevice(cellGrid);
            cellGridIsJavaDirty = false; // otherwise we would overwrite device results next time round
        }
        if (controlIsJavaDirty) {
            cc.copyToDevice(control);
            controlIsJavaDirty = false;
        }
        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );
        controlIsDeviceDirty = false; // Compute.life marked control as @RO
        cellGridIsDeviceDirty = true; // Compute.life marked cellGrid as @RW

        // Here we are swapping from<->to on the control buffer
        if (controlIsDeviceDirty) {
            cc.copyFromDevice(control);
        }
        int to = control.from();
        control.from(control.to());
        control.to(to);
        controlIsJavaDirty = true;

        if (updateNeeded) {
            if (cellGridIsDeviceDirty) {
                cc.copyFromDevice(cellGrid);
                cellGridIsDeviceDirty = false; // the Java copy is current again
            }
            viewer.update(now, to, cellGrid);
            timeOfLastUIUpdate = now;
        }
    }
}
```

Alternatively, what if the buffers themselves could hold the `DeviceDirty` and `JavaDirty` flags?

```java
@Reflect
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
    if (kc.x < kc.maxX) {
        Compute.lifePerIdx(kc.x, control, cellGrid);
    }
}

@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;  // not ideal but necessary
    cellGrid.flags = JavaDirty; // not ideal but necessary

    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000/5; // we want 5 fps

    while (true) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        if ((cellGrid.flags & JavaDirty) == JavaDirty) {
            cc.copyToDevice(cellGrid);
        }
        if ((control.flags & JavaDirty) == JavaDirty) {
            cc.copyToDevice(control);
        }
        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );
        control.flags = JavaDirty;    // Compute.life marked control as @RO
        cellGrid.flags = DeviceDirty; // Compute.life marked cellGrid as @RW

        // Here we are swapping from<->to on the control buffer
        if ((control.flags & DeviceDirty) == DeviceDirty) {
            cc.copyFromDevice(control);
        }
        int to = control.from();
        control.from(control.to());
        control.to(to);
        control.flags = JavaDirty;

        if (updateNeeded) {
            if ((cellGrid.flags & DeviceDirty) == DeviceDirty) {
                cc.copyFromDevice(cellGrid);
            }
            viewer.update(now, to, cellGrid);
            // update does not mutate cellGrid, so cellGrid.flags stays DeviceDirty
            timeOfLastUIUpdate = now;
        }
    }
}
```

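The listing above uses `flags`, `JavaDirty` and `DeviceDirty` without defining them. One possible definition, purely as an assumption for illustration, is a pair of bit masks stored in an `int` field on the buffer. `FlaggedBuffer`, the bit values and the helper methods below are hypothetical and do not describe the actual HAT buffer layout.

```java
/** Hypothetical buffer header carrying its own dirty flags (not the HAT API). */
final class FlaggedBuffer {
    static final int JavaDirty = 1 << 0;    // Java holds changes the device has not seen
    static final int DeviceDirty = 1 << 1;  // the device holds changes Java has not seen

    int flags;

    boolean isJavaDirty() {
        return (flags & JavaDirty) == JavaDirty;
    }

    boolean isDeviceDirty() {
        return (flags & DeviceDirty) == DeviceDirty;
    }

    public static void main(String[] args) {
        var cellGrid = new FlaggedBuffer();
        cellGrid.flags = JavaDirty;    // freshly populated from Java
        System.out.println(cellGrid.isJavaDirty() + " " + cellGrid.isDeviceDirty());
        cellGrid.flags = DeviceDirty;  // a kernel has written to it
        System.out.println(cellGrid.isJavaDirty() + " " + cellGrid.isDeviceDirty());
    }
}
```
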
Essentially, we defer to the kernel dispatch to determine whether buffers need to be
copied to the device, and to mark buffers accordingly if the dispatch mutated them.

Pseudo-code for dispatch is essentially:
```java
void dispatchKernel(Kernel kernel, KernelContext kc, Arg ... args) {
    // Copy in only the buffers that Java has changed and the kernel actually reads
    for (int argn = 0; argn < args.length; argn++) {
        Arg arg = args[argn];
        if (((arg.flags & JavaDirty) == JavaDirty) && kernel.readsFrom(arg)) {
            enqueueCopyToDevice(arg);
            arg.flags &= ~JavaDirty; // device copy is now current, avoid re-copying
        }
    }
    enqueueKernel(kernel);
    // Mark every buffer the kernel writes so Java knows its copy is stale
    for (int argn = 0; argn < args.length; argn++) {
        Arg arg = args[argn];
        if (kernel.writesTo(arg)) {
            arg.flags = DeviceDirty;
        }
    }
}
```
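As a rough check of how much this saves, the following runnable toy mirrors the dispatch pseudo-code and counts transfers for the game of life loop. Everything here is hypothetical (`DirtyFlagDispatchSketch`, `Arg`, `beforeJavaRead`, the bit values); counters stand in for real copies. The sketch clears `JavaDirty` once a buffer has been copied to the device so that an unchanged buffer is not copied again, and it assumes the UI reads `cellGrid` on every tenth generation. For 100 generations it counts 101 copies to the device (`control` each generation because Java swaps it, `cellGrid` once) and 10 copies back, compared with 100 x 2 = 200 in each direction when every arg is copied on every dispatch.

```java
import java.util.List;

/** Runnable toy version of the dispatch pseudo-code; all names are hypothetical. */
final class DirtyFlagDispatchSketch {
    static final int JavaDirty = 1, DeviceDirty = 2;

    static final class Arg {
        final String name;
        final boolean kernelReads, kernelWrites;
        int flags = JavaDirty; // new buffers start out populated from Java

        Arg(String name, boolean kernelReads, boolean kernelWrites) {
            this.name = name;
            this.kernelReads = kernelReads;
            this.kernelWrites = kernelWrites;
        }
    }

    int copiesToDevice = 0, copiesFromDevice = 0;

    void dispatchKernel(List<Arg> args) {
        for (Arg arg : args) {
            if ((arg.flags & JavaDirty) == JavaDirty && arg.kernelReads) {
                copiesToDevice++;          // stands in for enqueueCopyToDevice(arg)
                arg.flags &= ~JavaDirty;   // device copy is now current
            }
        }
        // the kernel itself would be enqueued here
        for (Arg arg : args) {
            if (arg.kernelWrites) {
                arg.flags = DeviceDirty;
            }
        }
    }

    /** Guard before a Java-side read: copy back only if the device changed the buffer. */
    void beforeJavaRead(Arg arg) {
        if ((arg.flags & DeviceDirty) == DeviceDirty) {
            copiesFromDevice++;            // stands in for cc.copyFromDevice(arg)
            arg.flags &= ~DeviceDirty;
        }
    }

    static DirtyFlagDispatchSketch run(int generations, int uiEvery) {
        var sim = new DirtyFlagDispatchSketch();
        var control = new Arg("control", true, false);   // @RO in the kernel
        var cellGrid = new Arg("cellGrid", true, true);  // @RW in the kernel
        for (int g = 1; g <= generations; g++) {
            sim.dispatchKernel(List.of(control, cellGrid));
            sim.beforeJavaRead(control);   // Java reads from/to ...
            control.flags = JavaDirty;     // ... and then swaps them
            if (g % uiEvery == 0) {
                sim.beforeJavaRead(cellGrid); // the viewer only reads cellGrid
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        var sim = run(100, 10);
        System.out.println(sim.copiesToDevice + " copies in, "
                + sim.copiesFromDevice + " copies out");
    }
}
```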
We rely on Babylon to mark each buffer passed to the compute method as `JavaDirty`.

```java
@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;
    cellGrid.flags = JavaDirty;
    // ... rest of compute ...
}
```

We also rely on Babylon to inject calls before each buffer access from Java in the compute code.

So the injected code would look like this.

```java
@Reflect
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;  // injected by babylon
    cellGrid.flags = JavaDirty; // injected by babylon

    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000/5; // we want 5 fps
    while (true) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        // See the pseudo-code above to see how dispatchKernel
        // only copies buffers that need copying, and marks
        // buffers it has mutated as dirty
        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );

        // injected by babylon
        if ((control.flags & DeviceDirty) == DeviceDirty) {
            cc.copyFromDevice(control);
        }
        // Here we are swapping from<->to on the control buffer
        int to = control.from();
        control.from(control.to());
        control.flags = JavaDirty; // injected
        control.to(to);
        control.flags = JavaDirty; // injected, but can be avoided

        if (updateNeeded) {
            // Injected by babylon because cellGrid escapes compute
            // and because viewer.update marks cellGrid as @RO
            if ((cellGrid.flags & DeviceDirty) == DeviceDirty) {
                cc.copyFromDevice(cellGrid);
            }
            viewer.update(now, to, cellGrid);
            // We don't copy cellGrid back after the escape because
            // viewer.update annotates its cellGrid access as RO
            timeOfLastUIUpdate = now;
        }
    }
}
```
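To see what the injected guards amount to, the sketch below writes them by hand into the accessors of a stand-in for the `Control` buffer. `GuardedControl`, `syncFromDeviceIfNeeded` and the counter are hypothetical and exist only for this illustration; in the scheme described above the checks would be injected into the compute code rather than written into the buffer type. Running `main` performs the from<->to swap after a pretend kernel write and triggers exactly one copy back, leaving the buffer marked `JavaDirty`.

```java
/** Hypothetical stand-in for the Control buffer with the guards written by hand. */
final class GuardedControl {
    static final int JavaDirty = 1, DeviceDirty = 2;

    int flags = JavaDirty;
    int copiesFromDevice = 0;
    private int from = 0, to = 1;

    /** Guard before any Java access: sync from the device only if the device wrote. */
    private void syncFromDeviceIfNeeded() {
        if ((flags & DeviceDirty) == DeviceDirty) {
            copiesFromDevice++;       // stands in for cc.copyFromDevice(control)
            flags &= ~DeviceDirty;
        }
    }

    int from() {
        syncFromDeviceIfNeeded();
        return from;
    }

    int to() {
        syncFromDeviceIfNeeded();
        return to;
    }

    void from(int value) {
        syncFromDeviceIfNeeded();     // keep the rest of the buffer current before a partial write
        from = value;
        flags |= JavaDirty;           // the device copy is now stale
    }

    void to(int value) {
        syncFromDeviceIfNeeded();
        to = value;
        flags |= JavaDirty;
    }

    public static void main(String[] args) {
        var control = new GuardedControl();
        control.flags = DeviceDirty;  // pretend a kernel wrote to the buffer
        int to = control.from();      // the first access triggers the single copy back
        control.from(control.to());
        control.to(to);
        System.out.println("copies=" + control.copiesFromDevice + " flags=" + control.flags);
    }
}
```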