# Minimizing Buffer Transfers

----

* [Contents](hat-00.md)
* House Keeping
    * [Project Layout](hat-01-01-project-layout.md)
    * [Building Babylon](hat-01-02-building-babylon.md)
    * [Building HAT](hat-01-03-building-hat.md)
    * [Enabling the CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* Programming Model
    * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)

----

## Using buffer marking to minimize data transfers

### The naive approach
The default execution model is that at each kernel
dispatch the backend copies all arg buffers to
the device, and after the dispatch it copies all arg
buffers back.

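To make the cost of this model concrete, here is a minimal sketch that counts transfers for a naive backend. `NaiveBackend`, `NaiveTransferDemo` and the counters are hypothetical names used only for illustration; they are not part of the HAT API, and buffers are represented by plain strings.

```java
import java.util.List;

// Hypothetical stand-in for a backend that copies every arg on every dispatch.
class NaiveBackend {
    int copiesToDevice = 0;
    int copiesFromDevice = 0;

    // Simulates one kernel dispatch over the given arg buffers.
    void dispatch(List<String> argBuffers) {
        copiesToDevice += argBuffers.size();   // every arg goes to the device, read or not
        // ... the kernel would run here ...
        copiesFromDevice += argBuffers.size(); // every arg comes back, written or not
    }

    int totalCopies() {
        return copiesToDevice + copiesFromDevice;
    }
}

class NaiveTransferDemo {
    // Runs the given number of dispatches with two arg buffers and returns the copy count.
    static int run(int generations) {
        NaiveBackend backend = new NaiveBackend();
        List<String> args = List.of("control", "cellGrid");
        for (int g = 0; g < generations; g++) {
            backend.dispatch(args);
        }
        return backend.totalCopies();
    }

    public static void main(String[] argv) {
        System.out.println("copies for 100 generations: " + run(100));
    }
}
```

With two arg buffers, each dispatch costs four transfers, so 100 generations cost 400 transfers regardless of what the kernel actually reads or writes.
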
### Using kernel arg buffer access patterns
If we know how each kernel accesses its args (via static analysis of the code model or
by marking the args RO, RW or WO with annotations), we can avoid some copies by only
copying in if the kernel reads the arg buffer and only copying out if the
kernel writes to the arg buffer.

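The rule can be sketched as a small simulation, assuming each arg carries one of three access modes. The `Access` enum, `AccessPatternBackend` and `AccessPatternDemo` are hypothetical illustration names, not the HAT annotations or backend.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical access modes mirroring the RO, RW and WO markings.
enum Access {
    RO(true, false), RW(true, true), WO(false, true);

    final boolean kernelReads;
    final boolean kernelWrites;

    Access(boolean kernelReads, boolean kernelWrites) {
        this.kernelReads = kernelReads;
        this.kernelWrites = kernelWrites;
    }
}

// Copies in only what the kernel reads, copies out only what it writes.
class AccessPatternBackend {
    int copiesToDevice = 0;
    int copiesFromDevice = 0;

    void dispatch(Map<String, Access> args) {
        for (Access access : args.values()) {
            if (access.kernelReads) {
                copiesToDevice++;
            }
        }
        // ... the kernel would run here ...
        for (Access access : args.values()) {
            if (access.kernelWrites) {
                copiesFromDevice++;
            }
        }
    }
}

class AccessPatternDemo {
    // Returns {copiesToDevice, copiesFromDevice} for the given number of dispatches.
    static int[] run(int generations) {
        AccessPatternBackend backend = new AccessPatternBackend();
        Map<String, Access> args = new LinkedHashMap<>();
        args.put("control", Access.RO);
        args.put("cellGrid", Access.RW);
        for (int g = 0; g < generations; g++) {
            backend.dispatch(args);
        }
        return new int[] {backend.copiesToDevice, backend.copiesFromDevice};
    }

    public static void main(String[] argv) {
        int[] copies = run(100);
        System.out.println("to device: " + copies[0] + ", from device: " + copies[1]);
    }
}
```

For one RO and one RW arg, this saves the copy-out of the RO buffer on every dispatch: 300 transfers over 100 generations instead of 400. It still copies on every dispatch, which is what the buffer marking described later addresses.
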
Let's use the Game of Life as an example.

We assume that the UI only needs updating at some rate (say 5 fps), but the kernels can produce
generations faster than 5 generations per second.

So not every generation needs to be copied back from the device.

We'll ignore the details of the `life` kernel and assume the kernel args
are appropriately annotated as RO, RW or WO.

```java
@CodeReflection
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
    if (kc.x < kc.maxX) {
        Compute.lifePerIdx(kc.x, control, cellGrid);
    }
}

@CodeReflection
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000/5; // we want 5 fps
    while (viewer.state.generation < viewer.state.maxGenerations) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );

        // Here we are swapping from<->to on the control buffer
        int to = control.from();
        control.from(control.to());
        control.to(to);

        if (updateNeeded) {
            viewer.update(now, to, cellGrid);
            timeOfLastUIUpdate = now;
        }
    }
}
```

First, let's assume there were no automatic transfers and we had to control them explicitly by inserting the copy code ourselves.

What would our code look like?

```java
@CodeReflection
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
    if (kc.x < kc.maxX) {
        Compute.lifePerIdx(kc.x, control, cellGrid);
    }
}

@CodeReflection
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000/5; // we want 5 fps
    var cellGridIsJavaDirty = true;    // Java holds the initial data
    var controlIsJavaDirty = true;
    var cellGridIsDeviceDirty = false; // nothing has run on the device yet
    var controlIsDeviceDirty = false;
    while (true) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        if (cellGridIsJavaDirty) {
            cc.copyToDevice(cellGrid);
            cellGridIsJavaDirty = false; // the device copy is now current
        }
        if (controlIsJavaDirty) {
            cc.copyToDevice(control);
            controlIsJavaDirty = false;  // the device copy is now current
        }
        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );
        controlIsDeviceDirty = false; // Compute.life marked control as @RO
        cellGridIsDeviceDirty = true; // Compute.life marked cellGrid as @RW

        // Here we are swapping from<->to on the control buffer
        if (controlIsDeviceDirty) {
            cc.copyFromDevice(control);
        }
        int to = control.from();
        control.from(control.to());
        control.to(to);
        controlIsJavaDirty = true;

        if (updateNeeded) {
            if (cellGridIsDeviceDirty) {
                cc.copyFromDevice(cellGrid);
                cellGridIsDeviceDirty = false; // the Java copy is now current
            }
            viewer.update(now, to, cellGrid);
            timeOfLastUIUpdate = now;
        }
    }
}
```

Alternatively, what if the buffers themselves could hold the `DeviceDirty` and `JavaDirty` flags?

```java
@CodeReflection
public static void life(@RO KernelContext kc, @RO Control control, @RW CellGrid cellGrid) {
    if (kc.x < kc.maxX) {
        Compute.lifePerIdx(kc.x, control, cellGrid);
    }
}

@CodeReflection
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;  // not ideal but necessary
    cellGrid.flags = JavaDirty; // not ideal but necessary

    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000/5; // we want 5 fps

    while (true) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        if ((cellGrid.flags & JavaDirty) == JavaDirty) {
            cc.copyToDevice(cellGrid);
        }
        if ((control.flags & JavaDirty) == JavaDirty) {
            cc.copyToDevice(control);
        }
        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );
        control.flags = JavaDirty;    // Compute.life marked control as @RO, so it is not DeviceDirty
        cellGrid.flags = DeviceDirty; // Compute.life marked cellGrid as @RW

        // Here we are swapping from<->to on the control buffer
        if ((control.flags & DeviceDirty) == DeviceDirty) {
            cc.copyFromDevice(control);
        }
        int to = control.from();
        control.from(control.to());
        control.to(to);
        control.flags = JavaDirty;

        if (updateNeeded) {
            if ((cellGrid.flags & DeviceDirty) == DeviceDirty) {
                cc.copyFromDevice(cellGrid);
            }
            viewer.update(now, to, cellGrid);
            // update does not mutate cellGrid, so cellGrid.flags stays DeviceDirty
            timeOfLastUIUpdate = now;
        }
    }
}
```

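The snippet above uses `flags`, `JavaDirty` and `DeviceDirty` without defining them. One possible shape, offered purely as an assumption, is a pair of bit constants on a common buffer type. `FlaggedBuffer`, its helper methods and the bit values below are hypothetical; the constant names follow the text rather than the usual Java naming convention.

```java
// Hypothetical buffer base type that carries its own dirty flags.
class FlaggedBuffer {
    static final int JavaDirty = 1 << 0;   // Java wrote to the buffer since the last copy to the device
    static final int DeviceDirty = 1 << 1; // a kernel wrote to the buffer since the last copy to Java

    int flags = 0; // 0 means both sides hold the same data

    boolean isJavaDirty() {
        return (flags & JavaDirty) == JavaDirty;
    }

    boolean isDeviceDirty() {
        return (flags & DeviceDirty) == DeviceDirty;
    }

    // These overwrite the whole field, matching the plain assignments used in the text.
    void markJavaDirty() {
        flags = JavaDirty;
    }

    void markDeviceDirty() {
        flags = DeviceDirty;
    }
}

class FlaggedBufferDemo {
    public static void main(String[] argv) {
        FlaggedBuffer cellGrid = new FlaggedBuffer();
        cellGrid.markJavaDirty();
        System.out.println("needs copy to device: " + cellGrid.isJavaDirty());
        cellGrid.markDeviceDirty();
        System.out.println("needs copy to Java: " + cellGrid.isDeviceDirty());
    }
}
```

Keeping the state on the buffer means any code that holds the buffer can decide whether a transfer is needed, without separate local variables per buffer.
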
Essentially, we defer to the kernel dispatch to determine whether buffers are
copied to the device, and to mark buffers accordingly if the dispatch mutated them.

Pseudo-code for dispatch is essentially:
```java
void dispatchKernel(Kernel kernel, KernelContext kc, Arg... args) {
    // Copy in only the args that Java has modified and that the kernel reads
    for (int argn = 0; argn < args.length; argn++) {
        Arg arg = args[argn];
        if (((arg.flags & JavaDirty) == JavaDirty) && kernel.readsFrom(arg)) {
            enqueueCopyToDevice(arg);
            arg.flags &= ~JavaDirty; // the device copy is now current
        }
    }
    enqueueKernel(kernel);
    // Mark the args the kernel wrote, so Java knows to copy them back on access
    for (int argn = 0; argn < args.length; argn++) {
        Arg arg = args[argn];
        if (kernel.writesTo(arg)) {
            arg.flags = DeviceDirty;
        }
    }
}
```
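
As a rough check of this dispatch logic, the following self-contained simulation counts copies instead of performing them. All names (`DirtyFlags`, `SimBuffer`, `ArgAccess`, `SimDispatcher`, `SimDispatchDemo`) are hypothetical, and one assumption is added: `JavaDirty` is cleared once a buffer has been copied to the device.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DirtyFlags {
    static final int JavaDirty = 1;
    static final int DeviceDirty = 2;
}

// Hypothetical buffer; it starts JavaDirty because its data originates on the Java side.
class SimBuffer {
    final String name;
    int flags = DirtyFlags.JavaDirty;

    SimBuffer(String name) {
        this.name = name;
    }
}

enum ArgAccess {
    RO, RW, WO;

    boolean reads() { return this != WO; }
    boolean writes() { return this != RO; }
}

class SimDispatcher {
    int copiesToDevice = 0;

    void dispatchKernel(Map<SimBuffer, ArgAccess> args) {
        for (Map.Entry<SimBuffer, ArgAccess> entry : args.entrySet()) {
            SimBuffer arg = entry.getKey();
            boolean javaDirty = (arg.flags & DirtyFlags.JavaDirty) == DirtyFlags.JavaDirty;
            if (javaDirty && entry.getValue().reads()) {
                copiesToDevice++;                   // stands in for enqueueCopyToDevice(arg)
                arg.flags &= ~DirtyFlags.JavaDirty; // assumption: clear once the device copy is current
            }
        }
        // enqueueKernel(kernel) would go here
        for (Map.Entry<SimBuffer, ArgAccess> entry : args.entrySet()) {
            if (entry.getValue().writes()) {
                entry.getKey().flags = DirtyFlags.DeviceDirty;
            }
        }
    }
}

class SimDispatchDemo {
    // Dispatches repeatedly with no Java-side writes in between; returns copies to the device.
    static int run(int dispatches) {
        Map<SimBuffer, ArgAccess> args = new LinkedHashMap<>();
        args.put(new SimBuffer("control"), ArgAccess.RO);
        args.put(new SimBuffer("cellGrid"), ArgAccess.RW);
        SimDispatcher dispatcher = new SimDispatcher();
        for (int i = 0; i < dispatches; i++) {
            dispatcher.dispatchKernel(args);
        }
        return dispatcher.copiesToDevice;
    }

    public static void main(String[] argv) {
        System.out.println("copies to device over 10 dispatches: " + run(10));
    }
}
```

When Java does not touch the buffers between dispatches, each buffer is copied to the device once, on the first dispatch, and never again.
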
We rely on Babylon to mark each buffer passed to the compute method as `JavaDirty`:

```java
@CodeReflection
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;
    cellGrid.flags = JavaDirty;
    // yada yada
}
```

We also rely on Babylon to inject calls before each buffer access from Java in the compute code.

So the injected code would look like this:

```java
@CodeReflection
static public void compute(final @RO ComputeContext cc,
                           Viewer viewer, @RO Control control, @RW CellGrid cellGrid) {
    control.flags = JavaDirty;  // injected by Babylon
    cellGrid.flags = JavaDirty; // injected by Babylon

    var timeOfLastUIUpdate = System.currentTimeMillis();
    var msPerFrame = 1000/5; // we want 5 fps
    while (true) {
        long now = System.currentTimeMillis();
        var msSinceLastUpdate = (now - timeOfLastUIUpdate);
        var updateNeeded = (msSinceLastUpdate > msPerFrame);

        // See the pseudo-code above to see how dispatchKernel
        // only copies buffers that need copying, and marks
        // buffers it has mutated as dirty
        cc.dispatchKernel(cellGrid.width() * cellGrid.height(),
                kc -> Compute.life(kc, control, cellGrid)
        );

        // injected by Babylon
        if ((control.flags & DeviceDirty) == DeviceDirty) {
            cc.copyFromDevice(control);
        }
        // Here we are swapping from<->to on the control buffer
        int to = control.from();

        control.from(control.to());
        control.flags = JavaDirty; // injected
        control.to(to);
        control.flags = JavaDirty; // injected, but can be avoided

        if (updateNeeded) {
            // Injected by Babylon because cellGrid escapes compute
            // and because viewer.update marks cellGrid as @RO
            if ((cellGrid.flags & DeviceDirty) == DeviceDirty) {
                cc.copyFromDevice(cellGrid);
            }
            viewer.update(now, to, cellGrid);
            // We don't copy cellGrid back after the escape because
            // viewer.update annotates its cellGrid access as RO
            timeOfLastUIUpdate = now;
        }
    }
}
```
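
To see what the injected guards buy us, here is a hedged simulation of the whole loop that counts transfers rather than performing them. `TrackedBuffer`, `LifeLoopSim` and the method names are hypothetical, the wall-clock frame timer is replaced by a fixed "one UI update every N generations" so the result is deterministic, and flags are cleared after each copy as an added assumption.

```java
// Hypothetical buffer that counts the transfers its flags would trigger.
class TrackedBuffer {
    static final int JavaDirty = 1;
    static final int DeviceDirty = 2;

    int flags = JavaDirty; // marked on entry to compute
    int copiesToDevice = 0;
    int copiesFromDevice = 0;

    // Guard of the kind injected before a Java-side access.
    void syncToJava() {
        if ((flags & DeviceDirty) == DeviceDirty) {
            copiesFromDevice++;
            flags &= ~DeviceDirty; // assumption: clear once the Java copy is current
        }
    }

    // Marker of the kind injected after a Java-side write.
    void javaWrote() {
        flags = JavaDirty;
    }

    // Dispatch-side check before the kernel runs.
    void beforeKernel(boolean kernelReads) {
        if (kernelReads && (flags & JavaDirty) == JavaDirty) {
            copiesToDevice++;
            flags &= ~JavaDirty; // assumption: clear once the device copy is current
        }
    }

    // Dispatch-side marking after the kernel runs.
    void afterKernel(boolean kernelWrites) {
        if (kernelWrites) {
            flags = DeviceDirty;
        }
    }
}

class LifeLoopSim {
    // Returns {controlToDevice, controlFromDevice, cellGridToDevice, cellGridFromDevice}.
    static int[] run(int generations, int generationsPerFrame) {
        TrackedBuffer control = new TrackedBuffer();
        TrackedBuffer cellGrid = new TrackedBuffer();
        for (int g = 1; g <= generations; g++) {
            // dispatch: control is @RO, cellGrid is @RW
            control.beforeKernel(true);
            cellGrid.beforeKernel(true);
            control.afterKernel(false);
            cellGrid.afterKernel(true);

            // Java swaps from<->to on the control buffer
            control.syncToJava();
            control.javaWrote();

            // the UI reads cellGrid only on some generations
            if (g % generationsPerFrame == 0) {
                cellGrid.syncToJava();
            }
        }
        return new int[] {
            control.copiesToDevice, control.copiesFromDevice,
            cellGrid.copiesToDevice, cellGrid.copiesFromDevice
        };
    }

    public static void main(String[] argv) {
        int[] c = run(100, 20);
        System.out.println("control to/from: " + c[0] + "/" + c[1]
                + ", cellGrid to/from: " + c[2] + "/" + c[3]);
    }
}
```

For 100 generations with a UI update every 20, the small control buffer is sent to the device on every generation because Java rewrites it each time, while the cell grid is sent once and read back only five times. That is 106 transfers in total, compared with 400 when every arg is copied both ways on every dispatch.
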