# Interface Mapping

----

* [Contents](hat-00.md)
* House Keeping
    * [Project Layout](hat-01-01-project-layout.md)
    * [Building Babylon](hat-01-02-building-babylon.md)
    * [Building HAT](hat-01-03-building-hat.md)
    * [Enabling the CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* Programming Model
    * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)

----

# Interface Mapping

## or ... HAT from a Data POV

### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel

Again, here is the canonical HAT 'hello world' kernel; we'll use it to describe interface mapping.

We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class.

```java
public class SquareCompute {
    @CodeReflection
    public static int square(int v) {
        return v * v;
    }

    @CodeReflection
    public static void squareKernel(KernelContext kc, S32Array s32Array) {
        int value = s32Array.array(kc.x);    // arr[kc.x]
        s32Array.array(kc.x, square(value)); // arr[kc.x] = value * value
    }

    @CodeReflection
    public static void square(ComputeContext cc, S32Array s32Array) {
        cc.dispatchKernel(s32Array.length(),
            kc -> squareKernel(kc, s32Array)
        );
    }
}
```
We dispatch this by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method, which in turn coordinates the dispatch of the various kernels.

```java
// Create an accelerator bound to a particular backend

var accelerator = new Accelerator(
    java.lang.invoke.MethodHandles.lookup(),
    Backend.FIRST // Predicate<Backend>
);

// Ask the accelerator/backend to allocate an S32Array
var s32Array = S32Array.create(accelerator, 32);

// Fill it with data
for (int i = 0; i < s32Array.length(); i++) {
    s32Array.array(i, i);
}

// Tell the accelerator to execute the square() compute entrypoint

accelerator.compute(
    cc -> SquareCompute.square(cc, s32Array)
);

// Check the data
for (int i = 0; i < s32Array.length(); i++) {
    System.out.println(i + " " + s32Array.array(i));
}
```

HAT kernels only accept Java primitives and HAT buffers as parameters.

We don't directly support heap allocated data (such as `int[]`).

From Java's point of view `S32Array` is a `hat.Buffer` and is defined as an interface.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
}
```

From a C99-style OpenCL/CUDA POV this will eventually be mapped to a typedef.

```C++
typedef struct S32Array_s {
    int length;
    int array[]; //<-- ?
} S32Array_t;
```

Our Java implementations should treat the interface as `data`. Generally the only methods that we include in a `hat.Buffer` are:

```java
T name();                    // getter for a field called name with type T, where T may be primitive or an inner interface
void name(T name);           // setter for a field called name with type T, T must be primitive
T name(long idx);            // get array element [idx] where the array is called name and T is either primitive or an inner interface
void name(long idx, T name); // set array element [idx] where the array is called name and T is primitive
```
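
To make these accessor shapes concrete, here is a hypothetical buffer interface (not part of HAT; `PointArray` and `Point` are illustrative names) sketching each pattern, including an array whose element type is an inner interface:

```java
// Hypothetical sketch, assuming a HAT-style Buffer / Buffer.StructChild API.
public interface PointArray extends Buffer {
    interface Point extends Buffer.StructChild {
        float x();          // T name()          : primitive field getter
        void x(float x);    // void name(T name) : primitive field setter
        float y();
        void y(float y);
    }

    int length();           // primitive field getter
    void length(int len);   // primitive field setter
    Point point(long idx);  // T name(long idx)  : array element getter, T is an inner interface
    // struct elements are mutated through their own accessors, e.g. point(i).x(42f)
}
```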

Algorithms can assume that an interface is 'bound' to 'some' concrete data layout.

We could, for example, implement `S32Array` like this.

```java
class JavaS32Array implements S32Array {
    int[] arr;
    public int length() { return arr.length; }
    public int array(long idx) { return arr[(int) idx]; }
    public void array(long idx, int value) { arr[(int) idx] = value; }
    public void length(int len) { } // we'll come back to this ;)
}
```

But for HAT to access native memory allocated by the appropriate backend, we need interfaces bound to MemorySegments.

HAT includes an API which allows us to take an interface which extends `hat.Buffer` and 'bind' it to a Panama FFM MemorySegment.

This binding process automatically maps the accessors (for example `length()`, `array(long idx, int v)`) to the low-level MethodHandle and VarHandle trickery underlying MemorySegments.

Conceptually we might imagine that HAT creates something like this

```java
class PanamaS32Array implements S32Array {
    MemorySegment segment;
    final int SIZEOFINT = 4;
    final long lenOffset = 0;
    final long arrayOffset = lenOffset + SIZEOFINT;
    public int length() { return segment.get(JAVA_INT, lenOffset); }
    public int array(long idx) { return segment.get(JAVA_INT, arrayOffset + idx * SIZEOFINT); }
    public void array(long idx, int value) { segment.set(JAVA_INT, arrayOffset + idx * SIZEOFINT, value); }
    public void length(int len) { } // we'll come back to this ;)
}
```

Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly'. The new ClassFile API is used to 'spin up' the new class, and the accessors are composed using Var/Method Handles and offsets derived from the size and order of fields.

Sadly an interface is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the iface mapper the order and size of fields and possibly some padding information.

We do this by providing a 'layout description' using Panama's Layout API.

```java
MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
    JAVA_INT.withName("length"),
    MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
).withName(S32Array.class.getSimpleName());
```
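
To see what such a layout buys us, here is a small, self-contained sketch (plain Panama FFM, not HAT code; class and variable names are purely illustrative) that uses a layout like the one above to compute field offsets and read/write a segment the way a generated accessor would.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;

import static java.lang.foreign.MemoryLayout.PathElement.groupElement;
import static java.lang.foreign.MemoryLayout.PathElement.sequenceElement;
import static java.lang.foreign.ValueLayout.JAVA_INT;

public class LayoutSketch {
    public static void main(String[] args) {
        int n = 32;
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
            JAVA_INT.withName("length"),
            MemoryLayout.sequenceLayout(n, JAVA_INT).withName("array")
        ).withName("S32Array");

        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(s32ArrayLayout);

            // The layout yields the byte offsets an iface mapper would bake into its accessors
            long lengthOffset = s32ArrayLayout.byteOffset(groupElement("length"));
            long arrayOffset  = s32ArrayLayout.byteOffset(groupElement("array"), sequenceElement(0));

            segment.set(JAVA_INT, lengthOffset, n);                                   // length(n)
            segment.set(JAVA_INT, arrayOffset + 5 * JAVA_INT.byteSize(), 25);         // array(5, 25)
            int value = segment.get(JAVA_INT, arrayOffset + 5 * JAVA_INT.byteSize()); // array(5)
            System.out.println(segment.get(JAVA_INT, lengthOffset) + " " + value);    // 32 25
        }
    }
}
```

HAT's iface mapper automates exactly this step: it derives the offsets (or Var/Method Handles) from the layout once and spins up a class whose accessors use them.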

Eventually we came to a common pattern for describing HAT buffers: we add a `create` method to our interface which hides the mapping detail.

So the complete `S32Array` looks like this (....ish).

```java
public interface S32Array extends Buffer {
    int length();

    void length(int i);

    int array(long idx);

    void array(long idx, int i);

    static S32Array create(Accelerator a, int len) {
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
            JAVA_INT.withName("length"),
            MemoryLayout.sequenceLayout(len, JAVA_INT).withName("array")
        ).withName(S32Array.class.getSimpleName());

        S32Array s32Array = a.allocate(
            SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len)
        );

        return s32Array;
    }
}
```

So now hopefully this code makes more sense.

```java
var s32Array = S32Array.create(accelerator, 32);
```

Whilst this code is much nicer than hand mapping each method to offsets, it is still quite verbose.

In the last few weeks we have been migrating to a Schema builder which makes this code easier to express.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
    Schema<S32Array> schema = Schema.of(S32Array.class, s -> s
        .arrayLen("length")
        .array("array")
    );
}
```
The schema is embedded inside the interface and defines the order of fields. It also allows us to bind fields to each other: above we are telling the schema that we have an `int length` field followed by an `int array[]` field, and that the first defines the size of the second. We can also describe useful 'HAT' information for fields, such as whether a field is 'atomic' ;)

Here is an example of a table of Results for the face detector.

```java
public interface ResultTable extends Buffer {
    interface Result extends Buffer.StructChild {
        float x();
        void x(float x);
        float y();
        void y(float y);
    }
    void count(int count);
    int count();
    int length();
    Result result(long idx);

    Schema<ResultTable> schema = Schema.of(ResultTable.class, s -> s
        .atomic("count")
        .arrayLen("length")
        .array("result", r -> r
            .field("x")
            .field("y")
        )
    );
}
```

Which in C99 OpenCL code will manifest as

```C++
typedef struct Result_s {
    float x, y;
} Result_t;

typedef struct ResultTable_s {
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;
```

In our Java code this interface makes access to MemorySegments much cleaner.

```java
ResultTable resultTable = ResultTable.create(acc, 100);
for (int i = 0; i < resultTable.length(); i++) {
    Result result = resultTable.result(i);
    result.x(0);
    result.y(0);
}
```

The generated OpenCL/C99 code from Java kernel code is also quite clean.

We might use a kernel to initialize the location of a bunch of Results.

```java
@CodeReflection public static void init(KernelContext kc, ResultTable resultTable) {
    if (kc.x < kc.maxX) {
        Result result = resultTable.result(kc.x);
        result.x(kc.x);
        result.y(100);
    }
}
```

Whose kernel code will look like this.

```C++
typedef struct KernelContext_s {
    int x;
    int maxX;
} KernelContext_t;

typedef struct Result_s {
    float x, y;
} Result_t;

typedef struct ResultTable_s {
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;

__kernel void init(
    __global KernelContext_t *empty,
    __global ResultTable_t *resultTable
) {
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);

    if (kc->x < kc->maxX) {
        __global Result_t *result = &resultTable->result[kc->x];
        result->x = kc->x;
        result->y = 100;
    }
    return;
}
```

A few notes from this generated code...

* `KernelContext` is itself just an iface mapped segment.
    - But we don't pass `kc.x` or `kc.maxX` in the segment.
    - Instead we initialize them using the appropriate vendor calls.

So for OpenCL all kernels start like this

```C++
__kernel void init(__global KernelContext_t *empty, ....) {
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);
    ....
}
```


Whereas CUDA ;)

```C++
__kernel void init(__global KernelContext_t *empty, ....) {
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = blockIdx.x * blockDim.x + threadIdx.x;
    kc->maxX = gridDim.x * blockDim.x;
    ....
}
```

This simplifies code generation. Generally the CUDA and OpenCL code look identical.

----

The iface mapping code in HAT is a modified form of Per Minborg's segment mapping code from

https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes