1 # Interface Mapping
2
3 ----
4
5 * [Contents](hat-00.md)
6 * House Keeping
7 * [Project Layout](hat-01-01-project-layout.md)
8 * [Building Babylon](hat-01-02-building-babylon.md)
9 * [Building HAT](hat-01-03-building-hat.md)
10 * [Enabling the CUDA Backend](hat-01-05-building-hat-for-cuda.md)
11 * Programming Model
12 * [Programming Model](hat-03-programming-model.md)
13 * Interface Mapping
14 * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
15 * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
16 * Implementation Detail
17 * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
18 * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)
19
20 ----
21
22 # Interface Mapping
23
24 ## or ... HAT from a Data POV
25
26 ### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel
27
28 Again here is the canonical HAT 'hello world' kernel, weill use this to describe itgerface mapping
29
30 We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class.
31
32 ```java
33 public class SquareCompute {
34 @Reflect
35 public static int square(int v) {
36 return v * v;
37 }
38
39 @Reflect
40 public static void squareKernel(KernelContext kc, S32Array s32Array) {
41 int value = s32Array.array(kc.x); // arr[cc.x]
42 s32Array.array(kc.x, square(value)); // arr[cc.x]=value*value
43 }
44
45 @Reflect
46 public static void square(ComputeContext cc, S32Array s32Array) {
47 cc.dispatchKernel(s32Array.length(),
48 kc -> squareKernel(kc, s32Array)
49 );
50 }
51 }
52 ```
53 Which we dispatch by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method.. which in turn coordinates the dispatch of the various kernels.
54
55 ```java
56 // Create an accelerator bound to a particular backend
57
58 var accelerator = new Accelerator(MethodHandles.lookup(), Backend.FIRST);
59
60 // Ask the accelerator/backend to allocate an S32Array
61 var s32Array = S32Array.create(accelerator, 32);
62
63 // Fill it with data
64 for (int i = 0; i < s32Array.length(); i++) {
65 s32Array.array(i, i);
66 }
67
68 // Tell the accelerator to execute the square() compute entrypoint
69
70 accelerator.compute(
71 cc -> SquareCompute.square(cc, s32Array)
72 );
73
74 // Check the data
75 for (int i = 0; i < arr.length(); i++) {
76 System.out.println(i + " " + arr.array(i));
77 }
78 ```
79
80 HAT kernels only accept Java primitives and HAT buffers as parameters.
81
82 We don't directly support heap allocated data (such as int[])
83
84 From Java's point of view `S32Array` is a `hat.Buffer` and is defined as an interface.
85
86 ```java
87 public interface S32Array extends Buffer {
88 int length();
89 void length(int i);
90 int array(long idx);
91 void array(long idx, int i);
92 }
93 ```
94
95 From C99 style OpenCL/CUDA POV this will eventually be mapped to a typedef.
96
97 ```C++
98 typedef struct S32Array_s{
99 int length;
100 int array[]; //<-- ?
101 }S32Array_t;
102 ```
103
104 Our Java implementations should treat the interface as `data`, generally the only
105 methods that we include in a `hat.Buffer` should be
106
107 ```java
108 T name(); //getter for a field called name with type T, where T may be primitive or inner interface)
109 void name(T name); //setter for a field called name with type T, T must be primitive
110 T name(long idx); //get an array element [idx] where array is called name and T is either primitive or inner interface
111 void name(long idx, T name); //set an array element [idx] where array is called name and T is primitive
112 ```
113
114 Algorithms can assume that an interface is 'bound' to 'some' concrete data layout.
115
116 We could for example implement `S32Array` like this.
117
118 ```java
119 class JavaS32Array implements S32Array{
120 int[] arr;
121 int length(){ return arr.length;}
122 int array(long idx) {return arr[idx];}
123 void array(long idx, int value) {arr[idx] = value;}
124 void length(int len) ; // we'll come back to this ;)
125 }
126 ```
127
128 But for HAT to access native memory, allocated by the appropriate backend we need interfaces bound to MemorySegents/
129
130 HAT includes an API which allows us to take an interface which extends `hat.Buffer`, and 'bind' it to a Panama FFM MemorySegment.
131
132 This binding process automatically maps the accessors (for example `length()`, `array(long idx, int v)`) to low level Method and Var handel trickery underlying MemorySegments.
133
134 Conceptually we might imagine that HAT creates something like this
135
136 ```java
137 class PanamaS32Array implements S32Array{
138 MemorySegment segment;
139 final int SIZEOFINT = 4;
140 final long lenOffset = 0;
141 final long arrayOffset = lenOffset+SIZEOFINT;
142 int length(){ return segment.getInt(lenOffset);}
143 int array(long idx) {return segment.getInt(arrayOffset+idx*SIZEOFINT);}
144 void array(long idx, int value) {segment.setInt(arrayOffset+idx*SIZEOFINT,value);}
145 void length(int len) ; // we'll come back to this ;)
146 }
147 ```
148
149 Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly', the new Classfile API is used to 'spin up' the new class and the accessors are are composed using Var/Method Handles and offsets derived from the size and order of fields.
150
151 Sadly an interface is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the `iface mapper` the order and size of fields and possibly some padding information.
152
153 We do this by providing a 'layout description' using Panama's Layout api.
154
155 ```java
156 MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
157 JAVA_INT.withName("length"),
158 MemoryLayout.sequenceLayout(N, JAVA_INT.withName("length")).withName("array")
159 ).withName(S32Array.getSimpleName());
160 ```
161
162 Eventually we came to a common pattern for describing HAT buffers by adding a `create` method to our interface which hides the mapping detail
163
164 So the complete `S32Array` looks a like this. (....ish)
165
166 ```java
167 public interface S32Array extends Buffer {
168 int length();
169
170 void length(int i);
171
172 int array(long idx);
173
174 void array(long idx, int i);
175
176 S32Array create(Accelerator a, int len) {
177 MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
178 JAVA_INT.withName("length"),
179 MemoryLayout.sequenceLayout(len, JAVA_INT.withName("length")).withName("array")
180 ).withName(S32Array.getSimpleName());
181
182 S32Array s32Array = a.allocate(
183 SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len)
184 );
185
186 return s32Array;
187 }
188 }
189 ```
190
191 So now hopefully this code makes more sense.
192
193 ```
194 var s32Array = S32Array.create(accelerator, 32);
195 ```
196
197 Whilst this code is much nicer than hand mapping each method to offsets. It is still quite verbose.
198
199 In the last few weeks we have been migrating to Schema builder which makes this code easier to express..
200
201 ```java
202 public interface S32Array extends Buffer {
203 int length();
204 void length(int i);
205 int array(long idx);
206 void array(long idx, int i);
207 Schema<S32Array> schema = Schema.of(S32Array.class, s->s
208 .arrayLen("length")
209 .array("array")
210 );
211 }
212 ```
213 The schema is embedded inside the interface and defines the order of fields. It also allows us to bind fields to each other (above we are telling the schema we have a `int length` field followed by an `int array[]` field and that the first defines the size of the second), we also can describe useful 'HAT' information for fields. Such as whether a field is 'atomic' ;)
214
215 Here is an example of a table of Results for the face detector.
216
217 ```java
218 public interface ResultTable extends Buffer{
219 interface Result extends Buffer.StructChild {
220 float x();
221 void x(float x);
222 float y();
223 void y(float y);
224 }
225 void count(int count);
226 int count();
227 int length();
228 Result result(long idx);
229
230 Schema<ResultTable> schema = Schema.of(ResultTable.class, s->s
231 .atomic("count")
232 .arrayLen("length")
233 .array("result", r->r
234 .field("x")
235 .field("y")
236 )
237 );
238 }
239 ```
240
241 Which in C99 OpenCL code will manifest as
242
243 ```C++
244 typedef Result_s{
245 int x,y
246 } Result_t;
247
248 typedef ResultTable_s{
249 int count;
250 int length;
251 Result_t result[0];
252 } Result_t;
253 ```
254
255 In our Java code this interface makes access to MemorySegments much cleaner
256
257 ```java
258 ResultTable resultTable = ResultTable.create(acc, 100);
259 for (int i=0; i<resultTable.length(); i++){
260 Result result = resultTable.result(i);
261 result.x(0);
262 result.y(0);
263 }
264 ```
265
266 The generated OpenCL/C99 code from Java kernel code is also quite clean
267
268 We might use a kernel to initialize the location of a bunch of Results
269
270 ```java
271 @Reflect public static void init(KernelContext kc, ResultTable resultTable) {
272 if (kc.x < kc.maxX){
273 Result result = resulTable.result(kc.x);
274 result.x(kc.x);
275 result.y(100);
276 }
277 }
278 ```
279
280 Whose Kernel code will look like this.
281
282 ```
283 typedef struct KernelContext_s{
284 int x;
285 int maxX;
286 }KernelContext_t;
287
288 typedef Result_s{
289 int x,y
290 } Result_t;
291
292 typedef ResultTable_s{
293 int count;
294 int length;
295 Result_t result[0];
296 } Result_t;
297
298 __kernel void init(
299 __global KernelContext_t *empty,
300 __global ResultTable_t* resultTable
301 ){
302 KernelContext_t kernelContext;
303 KernelContext_t *kc = &kernelContext;
304 kc->x=get_global_id(0);
305 kc->maxX = get_global_id(0);
306
307 if(kc->x<kc->maxX){
308 __global Result_t *result = &resultTable[kc->x];
309 result->x = kc->x;
310 }
311 return;
312 }
313 ```
314
315 A few notes from this generated code...
316
317 * `KernelContext` is itself just an iface mapped segment.
318 - But we don't pass `kc.x` o `kc.maxX` in the segment.
319 - Instead initialize using appropriate vendor calls
320
321 So for OpenCL all kernels start like this
322
323 ```
324 __kernel void init(__global KernelContext_t *empty , ....){
325 KernelContext_t kernelContext;
326 KernelContext_t *kc = &kernelContext;
327 kc->x=get_global_id(0);
328 kc->maxX = get_global_id(0);
329 ....
330 }
331 ```
332
333 Whereas CUDA ;)
334
335 ```
336 __kernel void init(__global KernelContext_t *empty , ....){
337 KernelContext_t kernelContext;
338 KernelContext_t *kc = &kernelContext;
339 kc->x=blockIdx.x*blockDim.x+threadIdx.x;
340 kc->maxX =gridDim.x*blockDim.x
341 ....
342 }
343 ```
344
345 This simplifies code gen. Generally the CUDA code and OpenCL code looks identical.
346
347 ----
348
349 The iface mapping code in hat is a modified form of the code hereWe have a copy of Per's segment mapping code from
350
351 https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes