1 2 # Interface Mapping 3 4 ---- 5 6 * [Contents](hat-00.md) 7 * House Keeping 8 * [Project Layout](hat-01-01-project-layout.md) 9 * [Building Babylon](hat-01-02-building-babylon.md) 10 * [Building HAT](hat-01-03-building-hat.md) 11 * Programming Model 12 * [Programming Model](hat-03-programming-model.md) 13 * Interface Mapping 14 * [Interface Mapping Overview](hat-04-01-interface-mapping.md) 15 * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md) 16 * Implementation Detail 17 * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md) 18 * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md) 19 20 ---- 21 # Interface Mapping 22 23 ## or ... HAT from a Data POV 24 25 ### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel 26 27 Again here is the canonical HAT 'hello world' kernel, weill use this to describe itgerface mapping 28 29 We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class. 30 31 ```java 32 public class SquareCompute { 33 @CodeReflection 34 public static int square(int v) { 35 return v * v; 36 } 37 38 @CodeReflection 39 public static void squareKernel(KernelContext kc, S32Array s32Array) { 40 int value = s32Array.array(kc.x); // arr[cc.x] 41 s32Array.array(kc.x, square(value)); // arr[cc.x]=value*value 42 } 43 44 @CodeReflection 45 public static void square(ComputeContext cc, S32Array s32Array) { 46 cc.dispatchKernel(s32Array.length(), 47 kc -> squareKernel(kc, s32Array) 48 ); 49 } 50 } 51 ``` 52 Which we dispatch by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method.. which in turn coordinates the dispatch of the various kernels. 53 54 ```java 55 // Create an accelerator bound to a particular backend 56 57 var accelerator = new Accelerator( 58 java.lang.invoke.MethodHandles.lookup(), 59 Backend.FIRST // Predicate<Backend> 60 ); 61 62 // Ask the accelerator/backend to allocate an S32Array 63 var s32Array = S32Array.create(accelerator, 32); 64 65 // Fill it with data 66 for (int i = 0; i < s32Array.length(); i++) { 67 s32Array.array(i, i); 68 } 69 70 // Tell the accelerator to execute the square() compute entrypoint 71 72 accelerator.compute( 73 cc -> SquareCompute.square(cc, s32Array) 74 ); 75 76 // Check the data 77 for (int i = 0; i < arr.length(); i++) { 78 System.out.println(i + " " + arr.array(i)); 79 } 80 ``` 81 82 HAT kernels only accept Java primitives and HAT buffers as parameters. 83 84 We don't directly support heap allocated data (such as int[]) 85 86 From Java's point of view `S32Array` is a `hat.Buffer` and is defined as an interface. 87 88 ```java 89 public interface S32Array extends Buffer { 90 int length(); 91 void length(int i); 92 int array(long idx); 93 void array(long idx, int i); 94 } 95 ``` 96 97 From C99 style OpenCL/CUDA POV this will eventually be mapped to a typedef. 98 99 ```C++ 100 typedef struct S32Array_s{ 101 int length; 102 int array[]; //<-- ? 103 }S32Array_t; 104 ``` 105 106 Our Java implementations should treat the interface as `data`, generally the only 107 methods that we include in a `hat.Buffer` should be 108 109 ```java 110 T name(); //getter for a field called name with type T, where T may be primitive or inner interface) 111 void name(T name); //setter for a field called name with type T, T must be primitive 112 T name(long idx); //get an array element [idx] where array is called name and T is either primitive or inner interface 113 void name(long idx, T name); //set an array element [idx] where array is called name and T is primitive 114 ``` 115 116 Algorithms can assume that an interface is 'bound' to 'some' concrete data layout. 117 118 We could for example implement `S32Array` like this. 119 120 ```java 121 class JavaS32Array implements S32Array{ 122 int[] arr; 123 int length(){ return arr.length;} 124 int array(long idx) {return arr[idx];} 125 void array(long idx, int value) {arr[idx] = value;} 126 void length(int len) ; // we'll come back to this ;) 127 } 128 ``` 129 130 But for HAT to access native memory, allocated by the appropriate backend we need interfaces bound to MemorySegents/ 131 132 HAT includes an API which allows us to take an interface which extends `hat.Buffer`, and 'bind' it to a Panama FFM MemorySegment. 133 134 This binding process automatically maps the accessors (for example `length()`, `array(long idx, int v)`) to low level Method and Var handel trickery underlying MemorySegments. 135 136 Conceptually we might imagine that HAT creates something like this 137 138 ```java 139 class PanamaS32Array implements S32Array{ 140 MemorySegment segment; 141 final int SIZEOFINT = 4; 142 final long lenOffset = 0; 143 final long arrayOffset = lenOffset+SIZEOFINT; 144 int length(){ return segment.getInt(lenOffset);} 145 int array(long idx) {return segment.getInt(arrayOffset+idx*SIZEOFINT);} 146 void array(long idx, int value) {segment.setInt(arrayOffset+idx*SIZEOFINT,value);} 147 void length(int len) ; // we'll come back to this ;) 148 } 149 ``` 150 151 Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly', the new Classfile API is used to 'spin up' the new class and the accessors are are composed using Var/Method Handles and offsets derived from the size and order of fields. 152 153 Sadly an interface is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the `iface mapper` the order and size of fields and possibly some padding information. 154 155 We do this by providing a 'layout description' using Panama's Layout api. 156 157 ```java 158 MemoryLayout s32ArrayLayout = MemoryLayout.structLayout( 159 JAVA_INT.withName("length"), 160 MemoryLayout.sequenceLayout(N, JAVA_INT.withName("length")).withName("array") 161 ).withName(S32Array.getSimpleName()); 162 ``` 163 164 Eventually we came to a common pattern for describing HAT buffers by adding a `create` method to our interface which hides the mapping detail 165 166 So the complete `S32Array` looks a like this. (....ish) 167 168 ```java 169 public interface S32Array extends Buffer { 170 int length(); 171 172 void length(int i); 173 174 int array(long idx); 175 176 void array(long idx, int i); 177 178 S32Array create(Accelerator a, int len) { 179 MemoryLayout s32ArrayLayout = MemoryLayout.structLayout( 180 JAVA_INT.withName("length"), 181 MemoryLayout.sequenceLayout(len, JAVA_INT.withName("length")).withName("array") 182 ).withName(S32Array.getSimpleName()); 183 184 S32Array s32Array = a.allocate( 185 SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len) 186 ); 187 188 return s32Array; 189 } 190 } 191 ``` 192 193 So now hopefully this code makes more sense. 194 195 ``` 196 var s32Array = S32Array.create(accelerator, 32); 197 ``` 198 199 Whilst this code is much nicer than hand mapping each method to offsets. It is still quite verbose. 200 201 In the last few weeks we have been migrating to Schema builder which makes this code easier to express.. 202 203 ```java 204 public interface S32Array extends Buffer { 205 int length(); 206 void length(int i); 207 int array(long idx); 208 void array(long idx, int i); 209 Schema<S32Array> schema = Schema.of(S32Array.class, s->s 210 .arrayLen("length") 211 .array("array") 212 ); 213 } 214 ``` 215 The schema is embedded inside the interface and defines the order of fields. It also allows us to bind fields to each other (above we are telling the schema we have a `int length` field followed by an `int array[]` field and that the first defines the size of the second), we also can describe useful 'HAT' information for fields. Such as whether a field is 'atomic' ;) 216 217 Here is an example of a table of Results for the face detector. 218 219 ```java 220 public interface ResultTable extends Buffer{ 221 interface Result extends Buffer.StructChild { 222 float x(); 223 void x(float x); 224 float y(); 225 void y(float y); 226 } 227 void count(int count); 228 int count(); 229 int length(); 230 Result result(long idx); 231 232 Schema<ResultTable> schema = Schema.of(ResultTable.class, s->s 233 .atomic("count") 234 .arrayLen("length") 235 .array("result", r->r 236 .field("x") 237 .field("y") 238 ) 239 ); 240 } 241 ``` 242 243 Which in C99 OpenCL code will manifest as 244 245 ```C++ 246 typedef Result_s{ 247 int x,y 248 } Result_t; 249 250 typedef ResultTable_s{ 251 int count; 252 int length; 253 Result_t result[0]; 254 } Result_t; 255 ``` 256 257 In our Java code this interface makes access to MemorySegments much cleaner 258 259 ```java 260 ResultTable resultTable = ResultTable.create(acc, 100); 261 for (int i=0; i<resultTable.length(); i++){ 262 Result result = resultTable.result(i); 263 result.x(0); 264 result.y(0); 265 } 266 ``` 267 268 The generated OpenCL/C99 code from Java kernel code is also quite clean 269 270 We might use a kernel to initialize the location of a bunch of Results 271 272 ```java 273 @CodeReflection public static void init(KernelContext kc, ResultTable resultTable) { 274 if (kc.x < kc.maxX){ 275 Result result = resulTable.result(kc.x); 276 result.x(kc.x); 277 result.y(100); 278 } 279 } 280 ``` 281 282 Whose Kernel code will look like this. 283 284 ``` 285 typedef struct KernelContext_s{ 286 int x; 287 int maxX; 288 }KernelContext_t; 289 290 typedef Result_s{ 291 int x,y 292 } Result_t; 293 294 typedef ResultTable_s{ 295 int count; 296 int length; 297 Result_t result[0]; 298 } Result_t; 299 300 __kernel void init( 301 __global KernelContext_t *empty, 302 __global ResultTable_t* resultTable 303 ){ 304 KernelContext_t kernelContext; 305 KernelContext_t *kc = &kernelContext; 306 kc->x=get_global_id(0); 307 kc->maxX = get_global_id(0); 308 309 if(kc->x<kc->maxX){ 310 __global Result_t *result = &resultTable[kc->x]; 311 result->x = kc->x; 312 } 313 return; 314 } 315 ``` 316 317 A few notes from this generated code... 318 319 * `KernelContext` is itself just an iface mapped segment. 320 - But we don't pass `kc.x` o `kc.maxX` in the segment. 321 - Instead initialize using appropriate vendor calls 322 323 So for OpenCL all kernels start like this 324 325 ``` 326 __kernel void init(__global KernelContext_t *empty , ....){ 327 KernelContext_t kernelContext; 328 KernelContext_t *kc = &kernelContext; 329 kc->x=get_global_id(0); 330 kc->maxX = get_global_id(0); 331 .... 332 } 333 ``` 334 335 Whereas CUDA ;) 336 337 ``` 338 __kernel void init(__global KernelContext_t *empty , ....){ 339 KernelContext_t kernelContext; 340 KernelContext_t *kc = &kernelContext; 341 kc->x=blockIdx.x*blockDim.x+threadIdx.x; 342 kc->maxX =gridDim.x*blockDim.x 343 .... 344 } 345 ``` 346 347 This simplifies code gen. Generally the CUDA code and OpenCL code looks identical. 348 349 ---- 350 351 The iface mapping code in hat is a modified form of the code hereWe have a copy of Per's segment mapping code from 352 353 https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes