# Interface Mapping

----

* [Contents](hat-00.md)
* House Keeping
  * [Project Layout](hat-01-01-project-layout.md)
  * [Building Babylon](hat-01-02-building-babylon.md)
  * [Building HAT](hat-01-03-building-hat.md)
    * [Enabling the CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* Programming Model
  * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
  * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
  * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
  * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
  * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)

----

# Interface Mapping

## or ... HAT from a Data POV

### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel?

Again, here is the canonical HAT 'hello world' kernel. We'll use it to describe interface mapping.

We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class.
```java
public class SquareCompute {
    @Reflect
    public static int square(int v) {
        return v * v;
    }

    @Reflect
    public static void squareKernel(KernelContext kc, S32Array s32Array) {
        int value = s32Array.array(kc.x);     // arr[kc.x]
        s32Array.array(kc.x, square(value));  // arr[kc.x] = value*value
    }

    @Reflect
    public static void square(ComputeContext cc, S32Array s32Array) {
        cc.dispatchKernel(s32Array.length(),
                kc -> squareKernel(kc, s32Array)
        );
    }
}
```
We dispatch this by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method, which in turn coordinates the dispatch of the various kernels.

```java
  // Create an accelerator bound to a particular backend
  var accelerator = new Accelerator(MethodHandles.lookup(), Backend.FIRST);

  // Ask the accelerator/backend to allocate an S32Array
  var s32Array = S32Array.create(accelerator, 32);

  // Fill it with data
  for (int i = 0; i < s32Array.length(); i++) {
      s32Array.array(i, i);
  }

  // Tell the accelerator to execute the square() compute entrypoint
  accelerator.compute(
     cc -> SquareCompute.square(cc, s32Array)
  );

  // Check the data
  for (int i = 0; i < s32Array.length(); i++) {
      System.out.println(i + " " + s32Array.array(i));
  }
```

HAT kernels only accept Java primitives and HAT buffers as parameters; we don't directly support heap-allocated data (such as `int[]`).

From Java's point of view `S32Array` is a `hat.Buffer`, defined as an interface.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
}
```

From a C99-style OpenCL/CUDA point of view, this will eventually be mapped to a typedef.

```C++
typedef struct S32Array_s {
    int length;
    int array[];  //<-- ?
} S32Array_t;
```

Our Java implementations should treat the interface as 'data'; generally the only methods that we include in a `hat.Buffer` should be

```java
T name();                    // getter for a field called name with type T, where T may be primitive or an inner interface
void name(T name);           // setter for a field called name with type T, T must be primitive
T name(long idx);            // get array element [idx], where the array is called name and T is primitive or an inner interface
void name(long idx, T name); // set array element [idx], where the array is called name and T is primitive
```
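
To make the convention concrete, here is a plain-Java sketch of the form where a getter returns an inner interface. The `Line`/`Point` names are hypothetical and the backing store is just an array; no HAT or Panama machinery is involved:

```java
public class AccessorConvention {
    // Hypothetical interfaces following the accessor naming convention
    public interface Point {
        float x();            // getter for field x
        void x(float v);      // setter for field x
    }

    public interface Line {
        int length();         // field getter
        Point point(long idx); // array-element getter returning an inner interface
    }

    // Toy array-backed implementation, standing in for a bound segment
    public static Line lineOf(float[] xs) {
        return new Line() {
            public int length() { return xs.length; }
            public Point point(long idx) {
                return new Point() {
                    public float x() { return xs[(int) idx]; }
                    public void x(float v) { xs[(int) idx] = v; }
                };
            }
        };
    }

    public static void main(String[] args) {
        Line line = lineOf(new float[]{1f, 2f, 3f});
        line.point(1).x(42f);            // write through the nested accessor
        System.out.println(line.point(1).x());
    }
}
```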

Algorithms can assume that the interface is 'bound' to 'some' concrete data layout.

We could, for example, implement `S32Array` like this.

```java
class JavaS32Array implements S32Array {
    int[] arr;
    public int length() { return arr.length; }
    public int array(long idx) { return arr[(int) idx]; }
    public void array(long idx, int value) { arr[(int) idx] = value; }
    public void length(int len) { } // we'll come back to this ;)
}
```

But for HAT to access native memory allocated by the appropriate backend, we need interfaces bound to MemorySegments.

HAT includes an API which allows us to take an interface which extends `hat.Buffer` and 'bind' it to a Panama FFM MemorySegment.

This binding process automatically maps the accessors (for example `length()` and `array(long idx, int v)`) to low-level MethodHandle and VarHandle trickery over the underlying MemorySegment.

Conceptually we might imagine that HAT creates something like this

```java
class PanamaS32Array implements S32Array {
    MemorySegment segment;
    static final long SIZEOF_INT = 4;
    final long lenOffset = 0;
    final long arrayOffset = lenOffset + SIZEOF_INT;
    public int length() { return segment.get(ValueLayout.JAVA_INT, lenOffset); }
    public int array(long idx) { return segment.get(ValueLayout.JAVA_INT, arrayOffset + idx * SIZEOF_INT); }
    public void array(long idx, int value) { segment.set(ValueLayout.JAVA_INT, arrayOffset + idx * SIZEOF_INT, value); }
    public void length(int len) { } // we'll come back to this ;)
}
```

Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly': the new ClassFile API is used to 'spin up' the class, and the accessors are composed using Var/Method Handles and offsets derived from the size and order of fields.
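
To see why the `Proxy` comparison is apt, the same binding can be mimicked (slowly) with `java.lang.reflect.Proxy` over an FFM segment. This is a sketch only, with a hypothetical cut-down interface and hard-coded offsets; it is not HAT's actual mechanism, which spins up a real class:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.reflect.Proxy;

public class ProxyBinder {
    // Hypothetical S32Array-like interface: int length, then int elements
    public interface IntBufferLike {
        int length();
        int array(long idx);
        void array(long idx, int v);
    }

    static final long LEN_OFFSET = 0, ARRAY_OFFSET = 4;

    // Bind the interface to a segment by dispatching on method name and arity
    public static IntBufferLike bind(MemorySegment segment) {
        return (IntBufferLike) Proxy.newProxyInstance(
                IntBufferLike.class.getClassLoader(),
                new Class<?>[]{IntBufferLike.class},
                (proxy, method, args) -> switch (method.getName()) {
                    case "length" -> segment.get(ValueLayout.JAVA_INT, LEN_OFFSET);
                    case "array" -> {
                        long off = ARRAY_OFFSET + (long) args[0] * 4;
                        if (args.length == 1) {                      // getter
                            yield segment.get(ValueLayout.JAVA_INT, off);
                        }
                        segment.set(ValueLayout.JAVA_INT, off, (int) args[1]); // setter
                        yield null;
                    }
                    default -> throw new UnsupportedOperationException(method.getName());
                });
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment seg = arena.allocate(4 + 8 * 4); // length field + 8 ints
            seg.set(ValueLayout.JAVA_INT, LEN_OFFSET, 8);
            IntBufferLike buf = bind(seg);
            for (int i = 0; i < buf.length(); i++) buf.array(i, i * i);
            System.out.println(buf.array(3));
        }
    }
}
```

HAT avoids the per-call reflection cost of this approach by generating real accessor methods instead.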

Sadly an interface alone is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the iface mapper the order and size of fields, and possibly some padding information.

We do this by providing a 'layout description' using Panama's Layout API.

```java
MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
        JAVA_INT.withName("length"),
        MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
).withName(S32Array.class.getSimpleName());
```
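
One payoff of the layout description is that field offsets no longer need to be hand-computed: they can be derived mechanically with `MemoryLayout.byteOffset`. A standalone sketch (the element count of 32 is a placeholder):

```java
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.ValueLayout;
import static java.lang.foreign.MemoryLayout.PathElement.groupElement;
import static java.lang.foreign.MemoryLayout.PathElement.sequenceElement;

public class LayoutOffsets {
    public static void main(String[] args) {
        // Same shape as the S32Array layout above
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
                ValueLayout.JAVA_INT.withName("length"),
                MemoryLayout.sequenceLayout(32, ValueLayout.JAVA_INT).withName("array")
        ).withName("S32Array");

        // Offsets derived from the size and order of fields, as the iface mapper does
        long lengthOffset = s32ArrayLayout.byteOffset(groupElement("length"));
        long element5Offset = s32ArrayLayout.byteOffset(groupElement("array"), sequenceElement(5));

        System.out.println(lengthOffset);   // 0
        System.out.println(element5Offset); // 4 + 5*4 = 24
    }
}
```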

Eventually we came to a common pattern for describing HAT buffers: we add a `create` method to the interface which hides the mapping detail.

So the complete `S32Array` looks like this (....ish)

```java
public interface S32Array extends Buffer {
    int length();

    void length(int i);

    int array(long idx);

    void array(long idx, int i);

    static S32Array create(Accelerator a, int len) {
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
                JAVA_INT.withName("length"),
                MemoryLayout.sequenceLayout(len, JAVA_INT).withName("array")
        ).withName(S32Array.class.getSimpleName());

        S32Array s32Array = a.allocate(
                SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len)
        );

        return s32Array;
    }
}
```

So now hopefully this code makes more sense.

```java
var s32Array = S32Array.create(accelerator, 32);
```

Whilst this code is much nicer than hand-mapping each method to offsets, it is still quite verbose.

More recently we have been migrating to a `Schema` builder which makes this easier to express.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
    Schema<S32Array> schema = Schema.of(S32Array.class, s->s
        .arrayLen("length")
        .array("array")
    );
}
```
The schema is embedded inside the interface and defines the order of fields. It also allows us to bind fields to each other: above we are telling the schema that we have an `int length` field followed by an `int array[]` field, and that the first defines the size of the second. We can also describe useful 'HAT' information for fields, such as whether a field is 'atomic' ;)

Here is an example of a table of Results for the face detector.

```java
public interface ResultTable extends Buffer{
    interface Result extends Buffer.StructChild {
        float x();
        void x(float x);
        float y();
        void y(float y);
    }
    void count(int count);
    int count();
    int length();
    Result result(long idx);

    Schema<ResultTable> schema = Schema.of(ResultTable.class, s->s
            .atomic("count")
            .arrayLen("length")
            .array("result", r->r
               .field("x")
               .field("y")
            )
    );
}
```

Which in C99 OpenCL code will manifest as

```C++
typedef struct Result_s {
    float x, y;
} Result_t;

typedef struct ResultTable_s {
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;
```

In our Java code this interface makes access to MemorySegments much cleaner.

```java
    ResultTable resultTable = ResultTable.create(acc, 100);
    for (int i = 0; i < resultTable.length(); i++) {
        Result result = resultTable.result(i);
        result.x(0);
        result.y(0);
    }
```
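
Under the hood, `resultTable.result(idx)` is just offset arithmetic over a single segment: the child interface's accessors are rebound at `headerBytes + idx * sizeof(Result)`. A standalone FFM sketch with hand-computed offsets matching the C layout above (the constants and helper name are illustrative, not HAT API):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_FLOAT;
import static java.lang.foreign.ValueLayout.JAVA_INT;

public class ResultAddressing {
    static final long HEADER_BYTES = 8;  // count:int + length:int
    static final long RESULT_BYTES = 8;  // x:float + y:float

    // Offset of result[idx] within the segment
    static long resultOffset(long idx) {
        return HEADER_BYTES + idx * RESULT_BYTES;
    }

    public static void main(String[] args) {
        try (Arena arena = Arena.ofConfined()) {
            int n = 4;
            MemorySegment seg = arena.allocate(HEADER_BYTES + n * RESULT_BYTES);
            seg.set(JAVA_INT, 4, n); // the length field lives after count

            // what result(2).x(7f) resolves to under the hood
            seg.set(JAVA_FLOAT, resultOffset(2), 7f);

            System.out.println(seg.get(JAVA_FLOAT, resultOffset(2)));
        }
    }
}
```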

The generated OpenCL/C99 code from Java kernel code is also quite clean.

We might use a kernel to initialize the location of a bunch of Results

```java
    @Reflect public static void init(KernelContext kc, ResultTable resultTable) {
        if (kc.x < kc.maxX) {
           Result result = resultTable.result(kc.x);
           result.x(kc.x);
           result.y(100);
        }
    }
```

The generated kernel code will look like this.

```C++
typedef struct KernelContext_s {
    int x;
    int maxX;
} KernelContext_t;

typedef struct Result_s {
    float x, y;
} Result_t;

typedef struct ResultTable_s {
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;

__kernel void init(
    __global KernelContext_t *empty,
    __global ResultTable_t *resultTable
){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);

    if (kc->x < kc->maxX) {
        __global Result_t *result = &resultTable->result[kc->x];
        result->x = kc->x;
        result->y = 100;
    }
    return;
}
```

A few notes on this generated code...

* `KernelContext` is itself just an iface-mapped segment.
    - But we don't pass `kc.x` or `kc.maxX` in the segment.
        - Instead we initialize them using the appropriate vendor calls.

So for OpenCL all kernels start like this

```C++
__kernel void init(__global KernelContext_t *empty , ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);
    ....
}
```

Whereas CUDA ;)

```C++
__global__ void init(KernelContext_t *empty , ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = blockIdx.x*blockDim.x + threadIdx.x;
    kc->maxX = gridDim.x*blockDim.x;
    ....
}
```

This simplifies code generation; generally the CUDA and OpenCL code look identical.

----

The iface mapping code in HAT is a modified form of Per Minborg's segment mapping code from

https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes