# Interface Mapping

----
* [Contents](hat-00.md)
* Build Babylon and HAT
    * [Quick Install](hat-01-quick-install.md)
    * [Building Babylon with jtreg](hat-01-02-building-babylon.md)
    * [Building HAT with jtreg](hat-01-03-building-hat.md)
    * [Enabling the NVIDIA CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* [Testing Framework](hat-02-testing-framework.md)
* [Running Examples](hat-03-examples.md)
* [HAT Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Development
    * [Project Layout](hat-01-01-project-layout.md)
* Implementation Details
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)
    * [Running HAT with Docker on NVIDIA GPUs](hat-07-docker-build-nvidia.md)
----

# Interface Mapping

## or ... HAT from a Data POV

### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel

Again, here is the canonical HAT 'hello world' kernel; we'll use it to describe interface mapping.

We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class.

```java
public class SquareCompute {
    @Reflect
    public static int square(int v) {
        return v * v;
    }

    @Reflect
    public static void squareKernel(KernelContext kc, S32Array s32Array) {
        int value = s32Array.array(kc.x);     // arr[kc.x]
        s32Array.array(kc.x, square(value));  // arr[kc.x] = value * value
    }

    @Reflect
    public static void square(ComputeContext cc, S32Array s32Array) {
        cc.dispatchKernel(s32Array.length(),
            kc -> squareKernel(kc, s32Array)
        );
    }
}
```
We dispatch this by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method, which in turn coordinates the dispatch of the various kernels.

```java
// Create an accelerator bound to a particular backend
var accelerator = new Accelerator(MethodHandles.lookup(), Backend.FIRST);

// Ask the accelerator/backend to allocate an S32Array
var s32Array = S32Array.create(accelerator, 32);

// Fill it with data
for (int i = 0; i < s32Array.length(); i++) {
    s32Array.array(i, i);
}

// Tell the accelerator to execute the square() compute entrypoint
accelerator.compute(
    cc -> SquareCompute.square(cc, s32Array)
);

// Check the data
for (int i = 0; i < s32Array.length(); i++) {
    System.out.println(i + " " + s32Array.array(i));
}
```

HAT kernels only accept Java primitives and HAT buffers as parameters.

We don't directly support heap-allocated data (such as `int[]`).

From Java's point of view `S32Array` is a `hat.Buffer`, defined as an interface.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
}
```

From a C99-style OpenCL/CUDA point of view, this will eventually be mapped to a typedef.

```C++
typedef struct S32Array_s{
    int length;
    int array[]; //<-- ?
} S32Array_t;
```

Our Java implementations should treat the interface as `data`; generally the only methods that we include in a `hat.Buffer` should be

```java
T name();                    // getter for a field called name with type T, where T may be primitive or an inner interface
void name(T name);           // setter for a field called name with type T; T must be primitive
T name(long idx);            // get array element [idx], where the array is called name and T is primitive or an inner interface
void name(long idx, T name); // set array element [idx], where the array is called name and T is primitive
```

Algorithms can assume that an interface is 'bound' to 'some' concrete data layout.

We could, for example, implement `S32Array` like this.

```java
class JavaS32Array implements S32Array {
    int[] arr;
    public int length() { return arr.length; }
    public int array(long idx) { return arr[(int) idx]; }
    public void array(long idx, int value) { arr[(int) idx] = value; }
    public void length(int len) { } // we'll come back to this ;)
}
```

But for HAT to access native memory allocated by the appropriate backend, we need interfaces bound to MemorySegments.

HAT includes an API which allows us to take an interface which extends `hat.Buffer` and 'bind' it to a Panama FFM MemorySegment.

This binding process automatically maps the accessors (for example `length()`, `array(long idx, int v)`) to low-level MethodHandle and VarHandle trickery over the underlying MemorySegment.

Conceptually we might imagine that HAT creates something like this

```java
class PanamaS32Array implements S32Array {
    MemorySegment segment;
    static final int SIZEOF_INT = 4;
    static final long lenOffset = 0;
    static final long arrayOffset = lenOffset + SIZEOF_INT;
    public int length() { return segment.get(JAVA_INT, lenOffset); }
    public int array(long idx) { return segment.get(JAVA_INT, arrayOffset + idx * SIZEOF_INT); }
    public void array(long idx, int value) { segment.set(JAVA_INT, arrayOffset + idx * SIZEOF_INT, value); }
    public void length(int len) { } // we'll come back to this ;)
}
```

Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly'; the ClassFile API is used to 'spin up' the new class, and the accessors are composed using Var/Method Handles and offsets derived from the size and order of fields.
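To make the idea concrete, here is a deliberately simplified sketch of the same trick using `java.lang.reflect.Proxy` over a `ByteBuffer` instead of the ClassFile API over a MemorySegment. The `IfaceMapSketch` class, its `bind` helper, and the hard-coded offsets are invented for illustration; they are not HAT's real implementation.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.nio.ByteBuffer;

// Toy "iface mapper": binds S32Array-style accessors to a flat buffer
// whose layout is [int length][int array[0]][int array[1]]...
public class IfaceMapSketch {
    public interface S32Array {
        int length();
        int array(long idx);
        void array(long idx, int v);
    }

    static S32Array bind(ByteBuffer segment) {
        final int SIZEOF_INT = 4;
        final long arrayOffset = SIZEOF_INT; // length lives at offset 0
        InvocationHandler handler = (proxy, method, args) -> switch (method.getName()) {
            case "length" -> segment.getInt(0);
            case "array" -> {
                int offset = (int) (arrayOffset + (long) args[0] * SIZEOF_INT);
                if (args.length == 1) {
                    yield segment.getInt(offset);          // getter: array(idx)
                } else {
                    segment.putInt(offset, (int) args[1]); // setter: array(idx, v)
                    yield null;
                }
            }
            default -> throw new UnsupportedOperationException(method.getName());
        };
        return (S32Array) Proxy.newProxyInstance(
                S32Array.class.getClassLoader(), new Class<?>[]{S32Array.class}, handler);
    }

    public static void main(String[] args) {
        ByteBuffer segment = ByteBuffer.allocate(4 + 4 * 8); // length + 8 ints
        segment.putInt(0, 8);
        S32Array s32 = bind(segment);
        for (int i = 0; i < s32.length(); i++) s32.array(i, i * i);
        System.out.println(s32.length() + " " + s32.array(3)); // prints "8 9"
    }
}
```

The real mapper avoids the reflective dispatch cost by generating a concrete class whose accessors are bound directly to VarHandles, but the shape of the mapping is the same.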

Sadly an interface alone is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the `iface mapper` the order and size of fields, and possibly some padding information.

We do this by providing a 'layout description' using Panama's Layout API.

```java
MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
    JAVA_INT.withName("length"),
    MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
).withName(S32Array.class.getSimpleName());
```

Eventually we came to a common pattern for describing HAT buffers: we add a `create` method to the interface which hides the mapping detail.

So the complete `S32Array` looks like this (....ish)

```java
public interface S32Array extends Buffer {
    int length();

    void length(int i);

    int array(long idx);

    void array(long idx, int i);

    static S32Array create(Accelerator a, int len) {
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
            JAVA_INT.withName("length"),
            MemoryLayout.sequenceLayout(len, JAVA_INT).withName("array")
        ).withName(S32Array.class.getSimpleName());

        S32Array s32Array = a.allocate(
            SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len)
        );

        return s32Array;
    }
}
```

So now hopefully this code makes more sense.

```java
var s32Array = S32Array.create(accelerator, 32);
```

Whilst this code is much nicer than hand-mapping each method to offsets, it is still quite verbose.

In the last few weeks we have been migrating to a Schema builder, which makes this easier to express.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
    Schema<S32Array> schema = Schema.of(S32Array.class, s -> s
        .arrayLen("length")
        .array("array")
    );
}
```
The schema is embedded inside the interface and defines the order of fields. It also allows us to bind fields to each other: above we are telling the schema that we have an `int length` field followed by an `int array[]` field, and that the first defines the size of the second. We can also describe useful 'HAT' information for fields, such as whether a field is 'atomic' ;)
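The fluent style works because each builder call simply records a field, in order, before returning the builder. Here is a minimal sketch of that recording pattern in plain Java; the `SchemaSketch` class and its `kind:name` encoding are invented for illustration and are not HAT's real `Schema` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Minimal recording builder: each call appends "kind:name" and returns
// the builder, so the lambda's call order defines the field order.
public class SchemaSketch {
    final List<String> fields = new ArrayList<>();

    SchemaSketch arrayLen(String name) { fields.add("arrayLen:" + name); return this; }
    SchemaSketch atomic(String name)   { fields.add("atomic:" + name);   return this; }
    SchemaSketch array(String name)    { fields.add("array:" + name);    return this; }

    static SchemaSketch of(Consumer<SchemaSketch> spec) {
        SchemaSketch s = new SchemaSketch();
        spec.accept(s); // run the user's lambda against a fresh builder
        return s;
    }

    public static void main(String[] args) {
        SchemaSketch schema = SchemaSketch.of(s -> s.arrayLen("length").array("array"));
        System.out.println(schema.fields); // [arrayLen:length, array:array]
    }
}
```

The recorded field list is what a mapper would then turn into offsets, in the same way the hand-written `MemoryLayout` did above.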

Here is an example of a table of `Result`s for the face detector.

```java
public interface ResultTable extends Buffer {
    interface Result extends Buffer.StructChild {
        float x();
        void x(float x);
        float y();
        void y(float y);
    }
    void count(int count);
    int count();
    int length();
    Result result(long idx);

    Schema<ResultTable> schema = Schema.of(ResultTable.class, s -> s
        .atomic("count")
        .arrayLen("length")
        .array("result", r -> r
            .field("x")
            .field("y")
        )
    );
}
```

Which in C99 OpenCL code will manifest as

```C++
typedef struct Result_s{
    float x, y;
} Result_t;

typedef struct ResultTable_s{
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;
```

In our Java code this interface makes access to MemorySegments much cleaner.

```java
ResultTable resultTable = ResultTable.create(acc, 100);
for (int i = 0; i < resultTable.length(); i++) {
    Result result = resultTable.result(i);
    result.x(0);
    result.y(0);
}
```

The generated OpenCL/C99 code from the Java kernel code is also quite clean.

We might use a kernel to initialize the location of a bunch of `Result`s.

```java
@Reflect
public static void init(KernelContext kc, ResultTable resultTable) {
    if (kc.x < kc.maxX) {
        Result result = resultTable.result(kc.x);
        result.x(kc.x);
        result.y(100);
    }
}
```

The generated kernel code will look like this.

```C++
typedef struct KernelContext_s{
    int x;
    int maxX;
} KernelContext_t;

typedef struct Result_s{
    float x, y;
} Result_t;

typedef struct ResultTable_s{
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;

__kernel void init(
    __global KernelContext_t *empty,
    __global ResultTable_t *resultTable
){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);

    if (kc->x < kc->maxX){
        __global Result_t *result = &resultTable->result[kc->x];
        result->x = kc->x;
        result->y = 100;
    }
    return;
}
```

A few notes from this generated code...

* `KernelContext` is itself just an iface mapped segment.
    - But we don't pass `kc.x` or `kc.maxX` in the segment.
    - Instead we initialize them using the appropriate vendor calls.

So for OpenCL all kernels start like this

```C++
__kernel void init(__global KernelContext_t *empty, ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);
    ....
}
```

Whereas CUDA ;)

```C++
__global__ void init(KernelContext_t *empty, ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = blockIdx.x * blockDim.x + threadIdx.x;
    kc->maxX = gridDim.x * blockDim.x;
    ....
}
```

This simplifies code generation. Generally the CUDA code and OpenCL code look almost identical.
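The two preludes compute the same pair of values. As a sanity check, here is the CUDA-style arithmetic in plain Java, with hypothetical block/grid numbers and method names that simply mirror the generated variables:

```java
// Sketch: how the CUDA prelude's kc->x / kc->maxX relate to OpenCL's
// get_global_id(0) / get_global_size(0).
public class ThreadIdSketch {
    // CUDA: x = blockIdx.x * blockDim.x + threadIdx.x (== get_global_id(0))
    static int cudaGlobalId(int blockIdx, int blockDim, int threadIdx) {
        return blockIdx * blockDim + threadIdx;
    }

    // CUDA: maxX = gridDim.x * blockDim.x (== get_global_size(0))
    static int cudaGlobalSize(int gridDim, int blockDim) {
        return gridDim * blockDim;
    }

    public static void main(String[] args) {
        // e.g. a grid of 4 blocks of 64 threads; thread 5 of block 2
        System.out.println(cudaGlobalId(2, 64, 5));  // 133
        System.out.println(cudaGlobalSize(4, 64));   // 256
    }
}
```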

----

The iface mapping code in HAT is a modified form of Per Minborg's segment mapping code from

https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes