# Interface Mapping

----

* [Contents](hat-00.md)
* House Keeping
    * [Project Layout](hat-01-01-project-layout.md)
    * [Building Babylon](hat-01-02-building-babylon.md)
    * [Building HAT](hat-01-03-building-hat.md)
* Programming Model
    * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)

----
# Interface Mapping

## or ... HAT from a Data POV

### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel?

Again, here is the canonical HAT 'hello world' kernel; we'll use it to describe interface mapping.

We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class.

```java
public class SquareCompute {
    @CodeReflection
    public static int square(int v) {
        return v * v;
    }

    @CodeReflection
    public static void squareKernel(KernelContext kc, S32Array s32Array) {
        int value = s32Array.array(kc.x);     // value = arr[kc.x]
        s32Array.array(kc.x, square(value));  // arr[kc.x] = value * value
    }

    @CodeReflection
    public static void square(ComputeContext cc, S32Array s32Array) {
        cc.dispatchKernel(s32Array.length(),
                kc -> squareKernel(kc, s32Array)
        );
    }
}
```
We dispatch this by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method, which in turn coordinates the dispatch of the various kernels.

```java
  // Create an accelerator bound to a particular backend

  var accelerator = new Accelerator(
      java.lang.invoke.MethodHandles.lookup(),
      Backend.FIRST  // Predicate<Backend>
  );

  // Ask the accelerator/backend to allocate an S32Array
  var s32Array = S32Array.create(accelerator, 32);

  // Fill it with data
  for (int i = 0; i < s32Array.length(); i++) {
      s32Array.array(i, i);
  }

  // Tell the accelerator to execute the square() compute entrypoint

  accelerator.compute(
     cc -> SquareCompute.square(cc, s32Array)
  );

  // Check the data
  for (int i = 0; i < s32Array.length(); i++) {
      System.out.println(i + " " + s32Array.array(i));
  }
```

HAT kernels only accept Java primitives and HAT buffers as parameters.

We don't directly support heap-allocated data (such as `int[]`).

From Java's point of view `S32Array` is a `hat.Buffer` and is defined as an interface.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
}
```

From a C99-style OpenCL/CUDA point of view this will eventually be mapped to a typedef.

```C++
typedef struct S32Array_s {
    int length;
    int array[];  // flexible array member, sized at allocation time
} S32Array_t;
```

Our Java implementations should treat the interface as `data`; generally the only
methods that we include in a `hat.Buffer` should be

```java
T name();                    // getter for a field called name with type T (T may be primitive or an inner interface)
void name(T name);           // setter for a field called name with type T (T must be primitive)
T name(long idx);            // get array element [idx], where the array is called name and T is primitive or an inner interface
void name(long idx, T name); // set array element [idx], where the array is called name and T is primitive
```

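To make the pattern concrete, here is a small, runnable sketch of a hypothetical (non-HAT) interface that follows these four accessor shapes, with a trivial heap-backed implementation; the names `PointList`, `size`, and `xs` are illustrative, not part of HAT.

```java
// Hypothetical interface following the accessor naming pattern above.
public class AccessorPatternDemo {
    interface PointList {
        int size();                  // T name()            : scalar getter
        void size(int n);            // void name(T)        : scalar setter
        float xs(long idx);          // T name(long idx)    : array-element getter
        void xs(long idx, float v);  // void name(long, T)  : array-element setter
    }

    // A plain heap-backed implementation, just to show the contract.
    static final class HeapPointList implements PointList {
        private int size;
        private final float[] xs = new float[16];
        public int size() { return size; }
        public void size(int n) { size = n; }
        public float xs(long idx) { return xs[(int) idx]; }
        public void xs(long idx, float v) { xs[(int) idx] = v; }
    }

    public static void main(String[] args) {
        PointList p = new HeapPointList();
        p.size(3);
        p.xs(0, 1.5f);
        if (p.size() != 3 || p.xs(0) != 1.5f) throw new AssertionError();
        System.out.println("ok");  // prints "ok"
    }
}
```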
Algorithms can assume that an interface is 'bound' to 'some' concrete data layout.

We could, for example, implement `S32Array` like this.

```java
class JavaS32Array implements S32Array {
    int[] arr;
    public int length() { return arr.length; }
    public int array(long idx) { return arr[(int) idx]; }
    public void array(long idx, int value) { arr[(int) idx] = value; }
    public void length(int len) { } // we'll come back to this ;)
}
```

But for HAT to access native memory allocated by the appropriate backend, we need interfaces bound to MemorySegments.

HAT includes an API which allows us to take an interface which extends `hat.Buffer` and 'bind' it to a Panama FFM MemorySegment.

This binding process automatically maps the accessors (for example `length()`, `array(long idx, int v)`) to the low-level MethodHandle and VarHandle machinery underlying MemorySegments.

Conceptually we might imagine that HAT creates something like this

```java
class PanamaS32Array implements S32Array {
    MemorySegment segment;
    static final long SIZEOF_INT = 4;
    static final long lenOffset = 0;
    static final long arrayOffset = lenOffset + SIZEOF_INT;
    public int length() { return segment.get(JAVA_INT, lenOffset); }
    public int array(long idx) { return segment.get(JAVA_INT, arrayOffset + idx * SIZEOF_INT); }
    public void array(long idx, int value) { segment.set(JAVA_INT, arrayOffset + idx * SIZEOF_INT, value); }
    public void length(int len) { } // we'll come back to this ;)
}
```

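To see what those conceptual accessors boil down to, here is a runnable sketch using plain Panama FFM (no HAT involved) that performs the same offset arithmetic by hand against a freshly allocated segment:

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_INT;

public class SegmentSketch {
    public static void main(String[] args) {
        final long SIZEOF_INT = JAVA_INT.byteSize();   // 4 bytes
        final long lenOffset = 0;
        final long arrayOffset = lenOffset + SIZEOF_INT;
        try (Arena arena = Arena.ofConfined()) {
            int n = 8;
            // room for the length field plus n ints
            MemorySegment segment = arena.allocate(arrayOffset + n * SIZEOF_INT);
            segment.set(JAVA_INT, lenOffset, n);                          // length(n)
            segment.set(JAVA_INT, arrayOffset + 3 * SIZEOF_INT, 42);      // array(3, 42)
            int len = segment.get(JAVA_INT, lenOffset);                   // length()
            int v = segment.get(JAVA_INT, arrayOffset + 3 * SIZEOF_INT);  // array(3)
            if (len != 8 || v != 42) throw new AssertionError();
            System.out.println(len + " " + v);  // prints "8 42"
        }
    }
}
```

The generated class simply bakes these offsets in once, instead of recomputing them per call.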
Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly'; the new ClassFile API is used to 'spin up' the new class, and the accessors are composed using Var/Method Handles and offsets derived from the size and order of fields.

Sadly an interface is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the iface mapper the order and size of fields, and possibly some padding information.

We do this by providing a 'layout description' using Panama's Layout API.

```java
MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
        JAVA_INT.withName("length"),
        MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
).withName(S32Array.class.getSimpleName());
```

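As an aside, given such a layout Panama can compute field offsets for us; this standalone snippet (plain FFM, no HAT) shows the offsets the iface mapper would derive from the layout above:

```java
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemoryLayout.PathElement;
import static java.lang.foreign.ValueLayout.JAVA_INT;

public class LayoutOffsets {
    public static void main(String[] args) {
        int N = 32;
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
                JAVA_INT.withName("length"),
                MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
        ).withName("S32Array");

        // offset of the scalar 'length' field
        long lengthOffset = s32ArrayLayout.byteOffset(PathElement.groupElement("length"));
        // offset of array[5]: 4 (length) + 5 * 4
        long elem5Offset = s32ArrayLayout.byteOffset(
                PathElement.groupElement("array"), PathElement.sequenceElement(5));
        if (lengthOffset != 0 || elem5Offset != 24) throw new AssertionError();
        System.out.println(lengthOffset + " " + elem5Offset);  // prints "0 24"
    }
}
```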
Eventually we came to a common pattern for describing HAT buffers: we add a `create` method to the interface which hides the mapping detail.

So the complete `S32Array` looks like this (...ish)

```java
public interface S32Array extends Buffer {
    int length();

    void length(int i);

    int array(long idx);

    void array(long idx, int i);

    static S32Array create(Accelerator a, int len) {
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
                JAVA_INT.withName("length"),
                MemoryLayout.sequenceLayout(len, JAVA_INT).withName("array")
        ).withName(S32Array.class.getSimpleName());

        S32Array s32Array = a.allocate(
                SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len)
        );

        return s32Array;
    }
}
```

So now hopefully this code makes more sense.

```java
var s32Array = S32Array.create(accelerator, 32);
```

Whilst this code is much nicer than hand-mapping each method to offsets, it is still quite verbose.

In the last few weeks we have been migrating to a `Schema` builder which makes this easier to express.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
    Schema<S32Array> schema = Schema.of(S32Array.class, s->s
        .arrayLen("length")
        .array("array")
    );
}
```
The schema is embedded inside the interface and defines the order of fields. It also allows us to bind fields to each other: above we are telling the schema that we have an `int length` field followed by an `int array[]` field, and that the first defines the size of the second. We can also describe useful 'HAT' information for fields, such as whether a field is 'atomic' ;)

Here is an example of a table of `Result`s for the face detector.

```java
public interface ResultTable extends Buffer{
    interface Result extends Buffer.StructChild {
        float x();
        void x(float x);
        float y();
        void y(float y);
    }
    void count(int count);
    int count();
    int length();
    Result result(long idx);

    Schema<ResultTable> schema = Schema.of(ResultTable.class, s->s
            .atomic("count")
            .arrayLen("length")
            .array("result", r->r
               .field("x")
               .field("y")
            )
    );
}
```

In C99 OpenCL code this will manifest as

```C++
typedef struct Result_s {
    float x, y;
} Result_t;

typedef struct ResultTable_s {
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;
```

In our Java code this interface makes access to MemorySegments much cleaner.

```java
ResultTable resultTable = ResultTable.create(acc, 100);
for (int i = 0; i < resultTable.length(); i++) {
    Result result = resultTable.result(i);
    result.x(0);
    result.y(0);
}
```

The OpenCL/C99 code generated from Java kernel code is also quite clean.

We might use a kernel to initialize the location of a bunch of `Result`s

```java
@CodeReflection public static void init(KernelContext kc, ResultTable resultTable) {
    if (kc.x < kc.maxX) {
        Result result = resultTable.result(kc.x);
        result.x(kc.x);
        result.y(100);
    }
}
```

The generated kernel code looks like this.

```C++
typedef struct KernelContext_s {
    int x;
    int maxX;
} KernelContext_t;

typedef struct Result_s {
    float x, y;
} Result_t;

typedef struct ResultTable_s {
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;

__kernel void init(
    __global KernelContext_t *empty,
    __global ResultTable_t *resultTable
){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);

    if (kc->x < kc->maxX) {
        __global Result_t *result = &resultTable->result[kc->x];
        result->x = kc->x;
        result->y = 100;
    }
    return;
}
```


A few notes on this generated code...

* `KernelContext` is itself just an iface-mapped segment.
    - But we don't pass `kc.x` or `kc.maxX` in the segment.
        - Instead we initialize them using the appropriate vendor calls.

So for OpenCL all kernels start like this

```
__kernel void init(__global KernelContext_t *empty , ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);
    ....
}
```

Whereas CUDA ;)

```
__global__ void init(KernelContext_t *empty , ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = blockIdx.x * blockDim.x + threadIdx.x;
    kc->maxX = gridDim.x * blockDim.x;
    ....
}
```

This simplifies code generation; generally the CUDA and OpenCL code look identical.

----

The iface mapping code in HAT is a modified form of Per Minborg's segment mapping code from

https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes