# Interface Mapping

----

* [Contents](hat-00.md)
* House Keeping
    * [Project Layout](hat-01-01-project-layout.md)
    * [Building Babylon](hat-01-02-building-babylon.md)
    * [Building HAT](hat-01-03-building-hat.md)
    * [Enabling the CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* Programming Model
    * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)

----

# Interface Mapping

## or ... HAT from a Data POV

### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel?

Here again is the canonical HAT 'hello world' kernel; we'll use it to describe interface mapping.

We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class.

```java
public class SquareCompute {
    @CodeReflection
    public static int square(int v) {
        return v * v;
    }

    @CodeReflection
    public static void squareKernel(KernelContext kc, S32Array s32Array) {
        int value = s32Array.array(kc.x);     // arr[kc.x]
        s32Array.array(kc.x, square(value));  // arr[kc.x] = value * value
    }

    @CodeReflection
    public static void square(ComputeContext cc, S32Array s32Array) {
        cc.dispatchKernel(s32Array.length(),
            kc -> squareKernel(kc, s32Array)
        );
    }
}
```

We dispatch it by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method, which in turn coordinates the dispatch of the various kernels.

```java
// Create an accelerator bound to a particular backend
var accelerator = new Accelerator(
    java.lang.invoke.MethodHandles.lookup(),
    Backend.FIRST // Predicate<Backend>
);

// Ask the accelerator/backend to allocate an S32Array
var s32Array = S32Array.create(accelerator, 32);

// Fill it with data
for (int i = 0; i < s32Array.length(); i++) {
    s32Array.array(i, i);
}

// Tell the accelerator to execute the square() compute entrypoint
accelerator.compute(
    cc -> SquareCompute.square(cc, s32Array)
);

// Check the data
for (int i = 0; i < s32Array.length(); i++) {
    System.out.println(i + " " + s32Array.array(i));
}
```

HAT kernels only accept Java primitives and HAT buffers as parameters.

We don't directly support heap-allocated data (such as `int[]`).

From Java's point of view `S32Array` is a `hat.Buffer` and is defined as an interface.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
}
```

From a C99-style OpenCL/CUDA point of view this will eventually be mapped to a typedef.

```C++
typedef struct S32Array_s{
    int length;
    int array[];   // <-- flexible array member, sized at allocation time
} S32Array_t;
```
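
Note that nothing heap-allocated crosses that boundary directly: if your data starts life as a Java `int[]`, it has to be copied into a HAT buffer before dispatch. A minimal sketch of such a copy, using only the `S32Array` API shown above (the helper name `copyToBuffer` is illustrative, not part of HAT):

```java
// Hypothetical helper (not part of HAT): copy a heap int[] into a freshly
// allocated S32Array so the data can be handed to a kernel.
static S32Array copyToBuffer(Accelerator accelerator, int[] heapData) {
    S32Array buffer = S32Array.create(accelerator, heapData.length);
    for (int i = 0; i < heapData.length; i++) {
        buffer.array(i, heapData[i]);   // element setter defined by the interface
    }
    return buffer;
}
```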
Our Java implementations should treat the interface as `data`; generally the only methods that we include in a `hat.Buffer` should be

```java
T name();                      // getter for a field called name with type T (T may be primitive or an inner interface)
void name(T name);             // setter for a field called name with type T (T must be primitive)
T name(long idx);              // get array element [idx] where the array is called name (T may be primitive or an inner interface)
void name(long idx, T name);   // set array element [idx] where the array is called name (T must be primitive)
```

Algorithms can assume that an interface is 'bound' to 'some' concrete data layout.

We could, for example, implement `S32Array` like this.

```java
class JavaS32Array implements S32Array {
    int[] arr;
    public int length() { return arr.length; }
    public int array(long idx) { return arr[(int) idx]; }
    public void array(long idx, int value) { arr[(int) idx] = value; }
    public void length(int len) { } // we'll come back to this ;)
}
```

But for HAT to access native memory allocated by the appropriate backend, we need interfaces bound to MemorySegments.

HAT includes an API which allows us to take an interface which extends `hat.Buffer` and 'bind' it to a Panama FFM MemorySegment.

This binding process automatically maps the accessors (for example `length()`, `array(long idx, int v)`) to the low-level Method and Var handle trickery underlying MemorySegments.

Conceptually we might imagine that HAT creates something like this

```java
class PanamaS32Array implements S32Array {
    MemorySegment segment;
    final int SIZEOFINT = 4;
    final long lenOffset = 0;
    final long arrayOffset = lenOffset + SIZEOFINT;
    public int length() { return segment.get(JAVA_INT, lenOffset); }
    public int array(long idx) { return segment.get(JAVA_INT, arrayOffset + idx * SIZEOFINT); }
    public void array(long idx, int value) { segment.set(JAVA_INT, arrayOffset + idx * SIZEOFINT, value); }
    public void length(int len) { } // we'll come back to this ;)
}
```

Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly': the Class-File API is used to 'spin up' the new class, and the accessors are composed using Var/Method Handles and offsets derived from the size and order of fields.

Sadly an interface is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the iface mapper the order and size of the fields, and possibly some padding information.

We do this by providing a 'layout description' using Panama's `MemoryLayout` API.

```java
MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
    JAVA_INT.withName("length"),
    MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
).withName(S32Array.class.getSimpleName());
```
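
This layout is what gives the iface mapper its concrete offsets. As a rough illustration of what can be derived from it using plain Panama FFM calls (ordinary JDK API, not HAT-specific code; N is assumed to be 32 here):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_INT;

public class S32ArrayLayoutDemo {
    public static void main(String[] args) {
        int N = 32;
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
            JAVA_INT.withName("length"),
            MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
        ).withName("S32Array");

        // The offsets the mapper needs fall out of the layout
        long lenOffset   = s32ArrayLayout.byteOffset(MemoryLayout.PathElement.groupElement("length")); // 0
        long arrayOffset = s32ArrayLayout.byteOffset(MemoryLayout.PathElement.groupElement("array"));  // 4

        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(s32ArrayLayout);
            segment.set(JAVA_INT, lenOffset, N);                              // length = 32
            segment.set(JAVA_INT, arrayOffset + 5 * JAVA_INT.byteSize(), 42); // array[5] = 42
            System.out.println(segment.get(JAVA_INT, arrayOffset + 5 * JAVA_INT.byteSize()));
        }
    }
}
```

This is essentially what the generated `PanamaS32Array`-style accessors do on every call, which is why the field order and sizes have to be spelled out somewhere.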
Eventually we came to a common pattern for describing HAT buffers: we add a `create` method to the interface which hides the mapping detail.

So the complete `S32Array` looks like this

(....ish)

```java
public interface S32Array extends Buffer {
    int length();

    void length(int i);

    int array(long idx);

    void array(long idx, int i);

    static S32Array create(Accelerator a, int len) {
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
            JAVA_INT.withName("length"),
            MemoryLayout.sequenceLayout(len, JAVA_INT).withName("array")
        ).withName(S32Array.class.getSimpleName());

        S32Array s32Array = a.allocate(
            SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len)
        );

        return s32Array;
    }
}
```

So now hopefully this code makes more sense.

```java
var s32Array = S32Array.create(accelerator, 32);
```

Whilst this code is much nicer than hand-mapping each method to offsets, it is still quite verbose.

In the last few weeks we have been migrating to a Schema builder which makes this easier to express.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
    Schema<S32Array> schema = Schema.of(S32Array.class, s -> s
        .arrayLen("length")
        .array("array")
    );
}
```

The schema is embedded inside the interface and defines the order of the fields. It also allows us to bind fields to each other: above we are telling the schema that we have an `int length` field followed by an `int array[]` field, and that the first defines the size of the second. We can also describe useful 'HAT' information for fields, such as whether a field is 'atomic' ;)

Here is an example of a table of `Result`s for the face detector.

```java
public interface ResultTable extends Buffer {
    interface Result extends Buffer.StructChild {
        float x();
        void x(float x);
        float y();
        void y(float y);
    }
    void count(int count);
    int count();
    int length();
    Result result(long idx);

    Schema<ResultTable> schema = Schema.of(ResultTable.class, s -> s
        .atomic("count")
        .arrayLen("length")
        .array("result", r -> r
            .field("x")
            .field("y")
        )
    );
}
```

Which in C99 OpenCL code will manifest as

```C++
typedef struct Result_s{
    float x, y;
} Result_t;

typedef struct ResultTable_s{
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;
```

In our Java code this interface makes access to MemorySegments much cleaner.

```java
ResultTable resultTable = ResultTable.create(acc, 100);
for (int i = 0; i < resultTable.length(); i++) {
    Result result = resultTable.result(i);
    result.x(0);
    result.y(0);
}
```

The OpenCL/C99 generated from the Java kernel code is also quite clean.

We might use a kernel to initialize the location of a bunch of `Result`s

```java
@CodeReflection
public static void init(KernelContext kc, ResultTable resultTable) {
    if (kc.x < kc.maxX) {
        Result result = resultTable.result(kc.x);
        result.x(kc.x);
        result.y(100);
    }
}
```
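
To actually run this kernel we need a compute entrypoint that dispatches it, following exactly the pattern used by `SquareCompute.square()` above. A sketch (the method name `initCompute` and its enclosing class are illustrative, not taken from the HAT sources):

```java
@CodeReflection
public static void initCompute(ComputeContext cc, ResultTable resultTable) {
    // Dispatch one kernel work item per result slot
    cc.dispatchKernel(resultTable.length(),
        kc -> init(kc, resultTable)
    );
}
```

As with the square example, the host side would then call `accelerator.compute(cc -> initCompute(cc, resultTable))`.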
The generated OpenCL for the `init` kernel will look something like this.

```
typedef struct KernelContext_s{
    int x;
    int maxX;
} KernelContext_t;

typedef struct Result_s{
    float x, y;
} Result_t;

typedef struct ResultTable_s{
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;

__kernel void init(
    __global KernelContext_t *empty,
    __global ResultTable_t *resultTable
){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);

    if (kc->x < kc->maxX){
        __global Result_t *result = &resultTable->result[kc->x];
        result->x = kc->x;
        result->y = 100;
    }
    return;
}
```

A few notes on this generated code...

* `KernelContext` is itself just an iface-mapped segment.
    - But we don't pass `kc.x` or `kc.maxX` in the segment.
    - Instead we initialize them using the appropriate vendor-specific calls.

So for OpenCL all kernels start like this

```
__kernel void init(__global KernelContext_t *empty, ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);
    ....
}
```

Whereas CUDA ;)

```
__global__ void init(KernelContext_t *empty, ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = blockIdx.x * blockDim.x + threadIdx.x;
    kc->maxX = gridDim.x * blockDim.x;
    ....
}
```

This simplifies code generation; generally the generated CUDA and OpenCL code look identical.

----

The iface mapping code in HAT is a modified copy of Per's segment mapping code from

https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes
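
As a rough intuition for what that segment-mapper approach does, here is a deliberately simplified sketch that binds the `S32Array` accessors to a `MemorySegment` using `java.lang.reflect.Proxy`. This is illustration only: it is not the code at the link above, and HAT itself spins up a real class with the Class-File API and Var/Method handles rather than dispatching reflectively.

```java
import java.lang.foreign.MemorySegment;
import java.lang.reflect.Proxy;
import static java.lang.foreign.ValueLayout.JAVA_INT;

// Simplified illustration only, not the HAT or segment-mapper implementation.
public class ProxyBoundS32Array {
    static S32Array bind(MemorySegment segment) {
        final long lenOffset = 0;                       // int length ...
        final long arrayOffset = JAVA_INT.byteSize();   // ... followed by int array[]
        return (S32Array) Proxy.newProxyInstance(
            S32Array.class.getClassLoader(),
            new Class<?>[]{S32Array.class},
            (proxy, method, args) -> {
                switch (method.getName()) {
                    case "length":
                        if (args == null) {                                   // int length()
                            return segment.get(JAVA_INT, lenOffset);
                        }
                        segment.set(JAVA_INT, lenOffset, (int) args[0]);      // void length(int i)
                        return null;
                    case "array":
                        long offset = arrayOffset + (long) args[0] * JAVA_INT.byteSize();
                        if (args.length == 1) {                               // int array(long idx)
                            return segment.get(JAVA_INT, offset);
                        }
                        segment.set(JAVA_INT, offset, (int) args[1]);         // void array(long idx, int i)
                        return null;
                    default:
                        throw new UnsupportedOperationException(method.getName());
                }
            });
    }
}
```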