1 2 # Interface Mapping 3 4 ---- 5 6 * [Contents](hat-00.md) 7 * House Keeping 8 * [Project Layout](hat-01-01-project-layout.md) 9 * [Building Babylon](hat-01-02-building-babylon.md) 10 * [Building HAT](hat-01-03-building-hat.md) 11 * Programming Model 12 * [Programming Model](hat-03-programming-model.md) 13 * Interface Mapping 14 * [Interface Mapping Overview](hat-04-01-interface-mapping.md) 15 * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md) 16 * Implementation Detail 17 * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md) 18 19 ---- 20 # Interface Mapping 21 22 ## or ... HAT from a Data POV 23 24 ### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel 25 26 Again here is the canonical HAT 'hello world' kernel, weill use this to describe itgerface mapping 27 28 We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class. 29 30 ```java 31 public class SquareCompute { 32 @CodeReflection 33 public static int square(int v) { 34 return v * v; 35 } 36 37 @CodeReflection 38 public static void squareKernel(KernelContext kc, S32Array s32Array) { 39 int value = s32Array.array(kc.x); // arr[cc.x] 40 s32Array.array(kc.x, square(value)); // arr[cc.x]=value*value 41 } 42 43 @CodeReflection 44 public static void square(ComputeContext cc, S32Array s32Array) { 45 cc.dispatchKernel(s32Array.length(), 46 kc -> squareKernel(kc, s32Array) 47 ); 48 } 49 } 50 ``` 51 Which we dispatch by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method.. which in turn coordinates the dispatch of the various kernels. 52 53 ```java 54 // Create an accelerator bound to a particular backend 55 56 var accelerator = new Accelerator( 57 java.lang.invoke.MethodHandles.lookup(), 58 Backend.FIRST // Predicate<Backend> 59 ); 60 61 // Ask the accelerator/backend to allocate an S32Array 62 var s32Array = S32Array.create(accelerator, 32); 63 64 // Fill it with data 65 for (int i = 0; i < s32Array.length(); i++) { 66 s32Array.array(i, i); 67 } 68 69 // Tell the accelerator to execute the square() compute entrypoint 70 71 accelerator.compute( 72 cc -> SquareCompute.square(cc, s32Array) 73 ); 74 75 // Check the data 76 for (int i = 0; i < arr.length(); i++) { 77 System.out.println(i + " " + arr.array(i)); 78 } 79 ``` 80 81 HAT kernels only accept Java primitives and HAT buffers as parameters. 82 83 We don't directly support heap allocated data (such as int[]) 84 85 From Java's point of view `S32Array` is a `hat.Buffer` and is defined as an interface. 86 87 ```java 88 public interface S32Array extends Buffer { 89 int length(); 90 void length(int i); 91 int array(long idx); 92 void array(long idx, int i); 93 } 94 ``` 95 96 From C99 style OpenCL/CUDA POV this will eventually be mapped to a typedef. 97 98 ```C++ 99 typedef struct S32Array_s{ 100 int length; 101 int array[]; //<-- ? 102 }S32Array_t; 103 ``` 104 105 Our Java implementations should treat the interface as `data`, generally the only 106 methods that we include in a `hat.Buffer` should be 107 108 ```java 109 T name(); //getter for a field called name with type T, where T may be primitive or inner interface) 110 void name(T name); //setter for a field called name with type T, T must be primitive 111 T name(long idx); //get an array element [idx] where array is called name and T is either primitive or inner interface 112 void name(long idx, T name); //set an array element [idx] where array is called name and T is primitive 113 ``` 114 115 Algorithms can assume that an interface is 'bound' to 'some' concrete data layout. 116 117 We could for example implement `S32Array` like this. 118 119 ```java 120 class JavaS32Array implements S32Array{ 121 int[] arr; 122 int length(){ return arr.length;} 123 int array(long idx) {return arr[idx];} 124 void array(long idx, int value) {arr[idx] = value;} 125 void length(int len) ; // we'll come back to this ;) 126 } 127 ``` 128 129 But for HAT to access native memory, allocated by the appropriate backend we need interfaces bound to MemorySegents/ 130 131 HAT includes an API which allows us to take an interface which extends `hat.Buffer`, and 'bind' it to a Panama FFM MemorySegment. 132 133 This binding process automatically maps the accessors (for example `length()`, `array(long idx, int v)`) to low level Method and Var handel trickery underlying MemorySegments. 134 135 Conceptually we might imagine that HAT creates something like this 136 137 ```java 138 class PanamaS32Array implements S32Array{ 139 MemorySegment segment; 140 final int SIZEOFINT = 4; 141 final long lenOffset = 0; 142 final long arrayOffset = lenOffset+SIZEOFINT; 143 int length(){ return segment.getInt(lenOffset);} 144 int array(long idx) {return segment.getInt(arrayOffset+idx*SIZEOFINT);} 145 void array(long idx, int value) {segment.setInt(arrayOffset+idx*SIZEOFINT,value);} 146 void length(int len) ; // we'll come back to this ;) 147 } 148 ``` 149 150 Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly', the new Classfile API is used to 'spin up' the new class and the accessors are are composed using Var/Method Handles and offsets derived from the size and order of fields. 151 152 Sadly an interface is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the `iface mapper` the order and size of fields and possibly some padding information. 153 154 We do this by providing a 'layout description' using Panama's Layout api. 155 156 ```java 157 MemoryLayout s32ArrayLayout = MemoryLayout.structLayout( 158 JAVA_INT.withName("length"), 159 MemoryLayout.sequenceLayout(N, JAVA_INT.withName("length")).withName("array") 160 ).withName(S32Array.getSimpleName()); 161 ``` 162 163 Eventually we came to a common pattern for describing HAT buffers by adding a `create` method to our interface which hides the mapping detail 164 165 So the complete `S32Array` looks a like this. (....ish) 166 167 ```java 168 public interface S32Array extends Buffer { 169 int length(); 170 171 void length(int i); 172 173 int array(long idx); 174 175 void array(long idx, int i); 176 177 S32Array create(Accelerator a, int len) { 178 MemoryLayout s32ArrayLayout = MemoryLayout.structLayout( 179 JAVA_INT.withName("length"), 180 MemoryLayout.sequenceLayout(len, JAVA_INT.withName("length")).withName("array") 181 ).withName(S32Array.getSimpleName()); 182 183 S32Array s32Array = a.allocate( 184 SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len) 185 ); 186 187 return s32Array; 188 } 189 } 190 ``` 191 192 So now hopefully this code makes more sense. 193 194 ``` 195 var s32Array = S32Array.create(accelerator, 32); 196 ``` 197 198 Whilst this code is much nicer than hand mapping each method to offsets. It is still quite verbose. 199 200 In the last few weeks we have been migrating to Schema builder which makes this code easier to express.. 201 202 ```java 203 public interface S32Array extends Buffer { 204 int length(); 205 void length(int i); 206 int array(long idx); 207 void array(long idx, int i); 208 Schema<S32Array> schema = Schema.of(S32Array.class, s->s 209 .arrayLen("length") 210 .array("array") 211 ); 212 } 213 ``` 214 The schema is embedded inside the interface and defines the order of fields. It also allows us to bind fields to each other (above we are telling the schema we have a `int length` field followed by an `int array[]` field and that the first defines the size of the second), we also can describe useful 'HAT' information for fields. Such as whether a field is 'atomic' ;) 215 216 Here is an example of a table of Results for the face detector. 217 218 ```java 219 public interface ResultTable extends Buffer{ 220 interface Result extends Buffer.StructChild { 221 float x(); 222 void x(float x); 223 float y(); 224 void y(float y); 225 } 226 void count(int count); 227 int count(); 228 int length(); 229 Result result(long idx); 230 231 Schema<ResultTable> schema = Schema.of(ResultTable.class, s->s 232 .atomic("count") 233 .arrayLen("length") 234 .array("result", r->r 235 .field("x") 236 .field("y") 237 ) 238 ); 239 } 240 ``` 241 242 Which in C99 OpenCL code will manifest as 243 244 ```C++ 245 typedef Result_s{ 246 int x,y 247 } Result_t; 248 249 typedef ResultTable_s{ 250 int count; 251 int length; 252 Result_t result[0]; 253 } Result_t; 254 ``` 255 256 In our Java code this interface makes access to MemorySegments much cleaner 257 258 ```java 259 ResultTable resultTable = ResultTable.create(acc, 100); 260 for (int i=0; i<resultTable.length(); i++){ 261 Result result = resultTable.result(i); 262 result.x(0); 263 result.y(0); 264 } 265 ``` 266 267 The generated OpenCL/C99 code from Java kernel code is also quite clean 268 269 We might use a kernel to initialize the location of a bunch of Results 270 271 ```java 272 @CodeReflection public static void init(KernelContext kc, ResultTable resultTable) { 273 if (kc.x < kc.maxX){ 274 Result result = resulTable.result(kc.x); 275 result.x(kc.x); 276 result.y(100); 277 } 278 } 279 ``` 280 281 Whose Kernel code will look like this. 282 283 ``` 284 typedef struct KernelContext_s{ 285 int x; 286 int maxX; 287 }KernelContext_t; 288 289 typedef Result_s{ 290 int x,y 291 } Result_t; 292 293 typedef ResultTable_s{ 294 int count; 295 int length; 296 Result_t result[0]; 297 } Result_t; 298 299 __kernel void init( 300 __global KernelContext_t *empty, 301 __global ResultTable_t* resultTable 302 ){ 303 KernelContext_t kernelContext; 304 KernelContext_t *kc = &kernelContext; 305 kc->x=get_global_id(0); 306 kc->maxX = get_global_id(0); 307 308 if(kc->x<kc->maxX){ 309 __global Result_t *result = &resultTable[kc->x]; 310 result->x = kc->x; 311 } 312 return; 313 } 314 ``` 315 316 A few notes from this generated code... 317 318 * `KernelContext` is itself just an iface mapped segment. 319 - But we don't pass `kc.x` o `kc.maxX` in the segment. 320 - Instead initialize using appropriate vendor calls 321 322 So for OpenCL all kernels start like this 323 324 ``` 325 __kernel void init(__global KernelContext_t *empty , ....){ 326 KernelContext_t kernelContext; 327 KernelContext_t *kc = &kernelContext; 328 kc->x=get_global_id(0); 329 kc->maxX = get_global_id(0); 330 .... 331 } 332 ``` 333 334 Whereas CUDA ;) 335 336 ``` 337 __kernel void init(__global KernelContext_t *empty , ....){ 338 KernelContext_t kernelContext; 339 KernelContext_t *kc = &kernelContext; 340 kc->x=blockIdx.x*blockDim.x+threadIdx.x; 341 kc->maxX =gridDim.x*blockDim.x 342 .... 343 } 344 ``` 345 346 This simplifies code gen. Generally the CUDA code and OpenCL code looks identical. 347 348 ---- 349 350 The iface mapping code in hat is a modified form of the code hereWe have a copy of Per's segment mapping code from 351 352 https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes