New hat/docs/Implementation/interface-mapping.md

# Interface Mapping
[Back to Index ../](../index.md)

## or ... HAT from a Data POV

### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel

Again, here is the canonical HAT 'hello world' kernel. We'll use it to describe interface mapping.

We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class.

```java
public class SquareCompute {
    @Reflect
    public static int square(int v) {
        return v * v;
    }

    @Reflect
    public static void squareKernel(KernelContext kc, S32Array s32Array) {
        int value = s32Array.array(kc.x);     // value = arr[kc.x]
        s32Array.array(kc.x, square(value));  // arr[kc.x] = value * value
    }

    @Reflect
    public static void square(ComputeContext cc, S32Array s32Array) {
        cc.dispatchKernel(s32Array.length(),
                kc -> squareKernel(kc, s32Array)
        );
    }
}
```
We dispatch this by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method, which in turn coordinates the dispatch of the various kernels.

```java
// Create an accelerator bound to a particular backend
var accelerator = new Accelerator(MethodHandles.lookup(), Backend.FIRST);

// Ask the accelerator/backend to allocate an S32Array
var s32Array = S32Array.create(accelerator, 32);

// Fill it with data
for (int i = 0; i < s32Array.length(); i++) {
    s32Array.array(i, i);
}

// Tell the accelerator to execute the square() compute entrypoint
accelerator.compute(
   cc -> SquareCompute.square(cc, s32Array)
);

// Check the data
for (int i = 0; i < s32Array.length(); i++) {
    System.out.println(i + " " + s32Array.array(i));
}
```
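To make the control flow concrete, here is a minimal sequential sketch of what this dispatch amounts to if we ignore parallelism and native memory entirely. `MiniKernelContext`, `MiniS32Array` and `dispatchKernel` are hypothetical stand-ins written for this illustration, not HAT classes; a real backend would run the kernel body once per work item on the device rather than in a Java loop.

```java
import java.util.function.Consumer;

// Hypothetical stand-ins for illustration only; these are NOT HAT classes.
final class MiniKernelContext {
    final int x;     // index of this work item
    final int maxX;  // total number of work items
    MiniKernelContext(int x, int maxX) { this.x = x; this.maxX = maxX; }
}

final class MiniS32Array {
    private final int[] data;
    MiniS32Array(int length) { data = new int[length]; }
    int length() { return data.length; }
    int array(long idx) { return data[(int) idx]; }
    void array(long idx, int value) { data[(int) idx] = value; }
}

final class SequentialDispatchSketch {
    // Runs the kernel once per index in [0, range), one after another.
    static void dispatchKernel(int range, Consumer<MiniKernelContext> kernel) {
        for (int x = 0; x < range; x++) {
            kernel.accept(new MiniKernelContext(x, range));
        }
    }

    static int square(int v) { return v * v; }

    static void squareKernel(MiniKernelContext kc, MiniS32Array s32Array) {
        int value = s32Array.array(kc.x);
        s32Array.array(kc.x, square(value));
    }

    // Fills the array with 0..n-1, squares every element, and returns it.
    static MiniS32Array run(int n) {
        MiniS32Array s32Array = new MiniS32Array(n);
        for (int i = 0; i < s32Array.length(); i++) {
            s32Array.array(i, i);
        }
        dispatchKernel(s32Array.length(), kc -> squareKernel(kc, s32Array));
        return s32Array;
    }

    public static void main(String[] args) {
        MiniS32Array result = run(8);
        for (int i = 0; i < result.length(); i++) {
            System.out.println(i + " " + result.array(i));
        }
    }
}
```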

HAT kernels only accept Java primitives and HAT buffers as parameters.

We don't directly support heap-allocated data (such as `int[]`).

From Java's point of view `S32Array` is a `hat.Buffer` and is defined as an interface.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
}
```

From a C99-style OpenCL/CUDA point of view this will eventually be mapped to a typedef.

```C++
typedef struct S32Array_s{
    int length;
    int array[];  //<-- ?
}S32Array_t;
```

Our Java implementations should treat the interface as `data`. Generally the only
methods that we include in a `hat.Buffer` should be

```java
T name();                    // getter for a field called name with type T, where T may be a primitive or an inner interface
void name(T name);           // setter for a field called name with type T, where T must be a primitive
T name(long idx);            // get array element [idx], where the array is called name and T is either a primitive or an inner interface
void name(long idx, T name); // set array element [idx], where the array is called name and T is a primitive
```

Algorithms can assume that an interface is 'bound' to 'some' concrete data layout.

We could, for example, implement `S32Array` like this.

```java
class JavaS32Array implements S32Array {
    int[] arr;
    public int length() { return arr.length; }
    public int array(long idx) { return arr[(int) idx]; }              // Java arrays take int indices
    public void array(long idx, int value) { arr[(int) idx] = value; }
    public void length(int len) { /* we'll come back to this ;) */ }
}
```

But for HAT to access native memory allocated by the appropriate backend, we need interfaces bound to MemorySegments.

HAT includes an API which allows us to take an interface which extends `hat.Buffer` and 'bind' it to a Panama FFM `MemorySegment`.

This binding process automatically maps the accessors (for example `length()`, `array(long idx, int v)`) to the low-level MethodHandle and VarHandle trickery underlying MemorySegments.

Conceptually we might imagine that HAT creates something like this

```java
class PanamaS32Array implements S32Array {
    MemorySegment segment;
    final int SIZEOFINT = 4;
    final long lenOffset = 0;
    final long arrayOffset = lenOffset + SIZEOFINT;
    // JAVA_INT is statically imported from java.lang.foreign.ValueLayout
    public int length() { return segment.get(JAVA_INT, lenOffset); }
    public int array(long idx) { return segment.get(JAVA_INT, arrayOffset + idx * SIZEOFINT); }
    public void array(long idx, int value) { segment.set(JAVA_INT, arrayOffset + idx * SIZEOFINT, value); }
    public void length(int len) { /* we'll come back to this ;) */ }
}
```
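The hand-written class above hard-codes its offsets. As a rough illustration of how accessors could instead be bound to memory at runtime, here is a minimal sketch using `java.lang.reflect.Proxy` over a `ByteBuffer`. It is a stand-in under stated assumptions, not HAT's mechanism: HAT binds interfaces to a `MemorySegment`, and `TinyS32Array` and `ProxyMappingSketch` are hypothetical names invented for this example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical interface mirroring the accessor shapes used in this document.
interface TinyS32Array {
    int length();
    int array(long idx);
    void array(long idx, int value);
}

final class ProxyMappingSketch {
    static final int SIZE_OF_INT = 4;
    static final int LENGTH_OFFSET = 0;
    static final int ARRAY_OFFSET = LENGTH_OFFSET + SIZE_OF_INT;

    // Binds TinyS32Array to a new buffer laid out as {int length; int array[length];}.
    static TinyS32Array create(int length) {
        ByteBuffer buffer = ByteBuffer
                .allocate(ARRAY_OFFSET + length * SIZE_OF_INT)
                .order(ByteOrder.nativeOrder());
        buffer.putInt(LENGTH_OFFSET, length);

        // Each call on the interface is routed by method name and argument count.
        InvocationHandler handler = (proxy, method, args) -> {
            int argCount = (args == null) ? 0 : args.length;
            switch (method.getName()) {
                case "length":
                    return buffer.getInt(LENGTH_OFFSET);
                case "array": {
                    int offset = ARRAY_OFFSET + ((Long) args[0]).intValue() * SIZE_OF_INT;
                    if (argCount == 1) {
                        return buffer.getInt(offset);      // getter: array(idx)
                    }
                    buffer.putInt(offset, (Integer) args[1]); // setter: array(idx, value)
                    return null;
                }
                default:
                    throw new UnsupportedOperationException(method.getName());
            }
        };
        return (TinyS32Array) Proxy.newProxyInstance(
                TinyS32Array.class.getClassLoader(),
                new Class<?>[] {TinyS32Array.class},
                handler);
    }
}
```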

Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly'. The new Classfile API is used to 'spin up' the new class, and the accessors are composed using Var/Method Handles and offsets derived from the size and order of fields.

Sadly an interface is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the `iface mapper` the order and size of fields and possibly some padding information.

We do this by providing a 'layout description' using Panama's Layout API.

```java
MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
        JAVA_INT.withName("length"),
        MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
).withName(S32Array.class.getSimpleName());
```
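As a small sketch of what such a layout provides, the following asks a layout for a field offset and its total size, then writes and reads one element through those offsets. It assumes Java 22 or later, where `java.lang.foreign` is a final API; `LayoutOffsetsSketch` and its methods are hypothetical names, and HAT's own mapper may derive its offsets differently.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemoryLayout.PathElement;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.StructLayout;
import java.lang.foreign.ValueLayout;

// Hypothetical helper for illustration; assumes Java 22+.
final class LayoutOffsetsSketch {
    // Layout equivalent to {int length; int array[n];}
    static StructLayout s32ArrayLayout(int n) {
        return MemoryLayout.structLayout(
                ValueLayout.JAVA_INT.withName("length"),
                MemoryLayout.sequenceLayout(n, ValueLayout.JAVA_INT).withName("array")
        ).withName("S32Array");
    }

    // Byte offset of the "array" field, derived from the layout rather than hard-coded.
    static long arrayOffset(int n) {
        return s32ArrayLayout(n).byteOffset(PathElement.groupElement("array"));
    }

    // Total size in bytes of the struct for n elements.
    static long byteSize(int n) {
        return s32ArrayLayout(n).byteSize();
    }

    // Allocates a segment for the layout, stores value at array[idx], and reads it back.
    static int roundTrip(int n, int idx, int value) {
        StructLayout layout = s32ArrayLayout(n);
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(layout);
            long lengthOffset = layout.byteOffset(PathElement.groupElement("length"));
            segment.set(ValueLayout.JAVA_INT, lengthOffset, n);
            long elementOffset = arrayOffset(n) + (long) idx * ValueLayout.JAVA_INT.byteSize();
            segment.set(ValueLayout.JAVA_INT, elementOffset, value);
            return segment.get(ValueLayout.JAVA_INT, elementOffset);
        }
    }
}
```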

Eventually we came to a common pattern for describing HAT buffers by adding a `create` method to our interface which hides the mapping detail.

So the complete `S32Array` looks like this. (....ish)

```java
public interface S32Array extends Buffer {
    int length();

    void length(int i);

    int array(long idx);

    void array(long idx, int i);

    // static factory, so callers can write S32Array.create(accelerator, len)
    static S32Array create(Accelerator a, int len) {
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
                JAVA_INT.withName("length"),
                MemoryLayout.sequenceLayout(len, JAVA_INT).withName("array")
        ).withName(S32Array.class.getSimpleName());

        S32Array s32Array = a.allocate(
                SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len)
        );

        return s32Array;
    }
}
```

So now hopefully this code makes more sense.

```java
var s32Array = S32Array.create(accelerator, 32);
```

Whilst this code is much nicer than hand mapping each method to offsets, it is still quite verbose.

In the last few weeks we have been migrating to a Schema builder which makes this code easier to express.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
    Schema<S32Array> schema = Schema.of(S32Array.class, s->s
        .arrayLen("length")
        .array("array")
    );
}
```
The schema is embedded inside the interface and defines the order of fields. It also allows us to bind fields to each other: above we are telling the schema that we have an `int length` field followed by an `int array[]` field, and that the first defines the size of the second. We can also describe useful 'HAT' information for fields, such as whether a field is 'atomic' ;)

Here is an example of a table of Results for the face detector.

```java
public interface ResultTable extends Buffer{
    interface Result extends Buffer.StructChild {
        float x();
        void x(float x);
        float y();
        void y(float y);
    }
    void count(int count);
    int count();
    int length();
    Result result(long idx);

    Schema<ResultTable> schema = Schema.of(ResultTable.class, s->s
            .atomic("count")
            .arrayLen("length")
            .array("result", r->r
               .field("x")
               .field("y")
            )
    );
}
```

Which in C99 OpenCL code will manifest as

```C++
typedef struct Result_s{
   float x,y;
} Result_t;

typedef struct ResultTable_s{
   int count;
   int length;
   Result_t result[0];
} ResultTable_t;
```
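To see how accessors such as `result(idx).x()` could resolve against this struct, the following sketch works out the byte offsets by hand. It assumes 4-byte `int` and `float` fields with no padding, following the `float x()` and `float y()` accessors of the Java interface, and uses a `ByteBuffer` as a stand-in for native memory; `ResultTableOffsetsSketch` is a hypothetical name, not part of HAT.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: hand-computed offsets for
// {int count; int length; struct {float x, y;} result[length];}
final class ResultTableOffsetsSketch {
    static final int SIZE_OF_INT = 4;
    static final int SIZE_OF_FLOAT = 4;
    static final int COUNT_OFFSET = 0;
    static final int LENGTH_OFFSET = COUNT_OFFSET + SIZE_OF_INT;
    static final int RESULT_OFFSET = LENGTH_OFFSET + SIZE_OF_INT;
    static final int RESULT_SIZE = 2 * SIZE_OF_FLOAT;

    private final ByteBuffer buffer;

    ResultTableOffsetsSketch(int length) {
        buffer = ByteBuffer.allocate(RESULT_OFFSET + length * RESULT_SIZE)
                .order(ByteOrder.nativeOrder());
        buffer.putInt(LENGTH_OFFSET, length);
    }

    // Offset of result[idx].x: skip the two header ints, then idx whole structs.
    static int xOffset(int idx) { return RESULT_OFFSET + idx * RESULT_SIZE; }

    // Offset of result[idx].y: one float past x.
    static int yOffset(int idx) { return xOffset(idx) + SIZE_OF_FLOAT; }

    int length() { return buffer.getInt(LENGTH_OFFSET); }
    float x(int idx) { return buffer.getFloat(xOffset(idx)); }
    void x(int idx, float value) { buffer.putFloat(xOffset(idx), value); }
    float y(int idx) { return buffer.getFloat(yOffset(idx)); }
    void y(int idx, float value) { buffer.putFloat(yOffset(idx), value); }
}
```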

In our Java code this interface makes access to MemorySegments much cleaner

```java
ResultTable resultTable = ResultTable.create(acc, 100);
for (int i = 0; i < resultTable.length(); i++) {
    Result result = resultTable.result(i);
    result.x(0);
    result.y(0);
}
```

The OpenCL/C99 code generated from Java kernel code is also quite clean.

We might use a kernel to initialize the location of a bunch of Results

```java
@Reflect public static void init(KernelContext kc, ResultTable resultTable) {
    if (kc.x < kc.maxX){
       Result result = resultTable.result(kc.x);
       result.x(kc.x);
       result.y(100);
    }
}
```

Whose kernel code will look like this.

```
typedef struct KernelContext_s{
    int x;
    int maxX;
}KernelContext_t;

typedef struct Result_s{
   float x,y;
} Result_t;

typedef struct ResultTable_s{
   int count;
   int length;
   Result_t result[0];
} ResultTable_t;

__kernel void init(
    __global KernelContext_t *empty,
    __global ResultTable_t* resultTable
){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x=get_global_id(0);
    kc->maxX = get_global_size(0);

    if(kc->x<kc->maxX){
        __global Result_t *result = &resultTable->result[kc->x];
        result->x = kc->x;
        result->y = 100;
    }
    return;
}
```

A few notes from this generated code...

* `KernelContext` is itself just an iface mapped segment.
    -  But we don't pass `kc.x` or `kc.maxX` in the segment.
        -  Instead we initialize them using the appropriate vendor calls.

So for OpenCL all kernels start like this

```
__kernel void init(__global KernelContext_t *empty , ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x=get_global_id(0);
    kc->maxX = get_global_size(0);
    ....
}
```

Whereas CUDA ;)

```
__global__ void init(KernelContext_t *empty , ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x=blockIdx.x*blockDim.x+threadIdx.x;
    kc->maxX=gridDim.x*blockDim.x;
    ....
}
```
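The index arithmetic in the CUDA prologue can be checked in plain Java. This sketch, with hypothetical names, mirrors the two expressions used for `kc->x` and `kc->maxX` and verifies that, for a one-dimensional grid, every global index in `[0, maxX)` is produced exactly once.

```java
// Hypothetical helper for illustration; models a one-dimensional CUDA launch.
final class GlobalIndexSketch {
    // Mirrors kc->x = blockIdx.x * blockDim.x + threadIdx.x
    static int globalId(int blockIdx, int blockDim, int threadIdx) {
        return blockIdx * blockDim + threadIdx;
    }

    // Mirrors kc->maxX = gridDim.x * blockDim.x
    static int globalSize(int gridDim, int blockDim) {
        return gridDim * blockDim;
    }

    // Returns true if each index in [0, globalSize) is generated exactly once.
    static boolean coversEveryIndexOnce(int gridDim, int blockDim) {
        boolean[] seen = new boolean[globalSize(gridDim, blockDim)];
        for (int block = 0; block < gridDim; block++) {
            for (int thread = 0; thread < blockDim; thread++) {
                int id = globalId(block, blockDim, thread);
                if (seen[id]) {
                    return false; // duplicate index
                }
                seen[id] = true;
            }
        }
        for (boolean s : seen) {
            if (!s) {
                return false; // missing index
            }
        }
        return true;
    }
}
```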

This simplifies code gen. Beyond this prologue, the generated CUDA and OpenCL code generally looks identical.

----

The iface mapping code in HAT is a modified copy of Per's segment mapping code from

https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes