# Interface Mapping

----

* [Contents](hat-00.md)
* House Keeping
    * [Project Layout](hat-01-01-project-layout.md)
    * [Building Babylon](hat-01-02-building-babylon.md)
    * [Building HAT](hat-01-03-building-hat.md)
    * [Enabling the CUDA Backend](hat-01-05-building-hat-for-cuda.md)
* Programming Model
    * [Programming Model](hat-03-programming-model.md)
* Interface Mapping
    * [Interface Mapping Overview](hat-04-01-interface-mapping.md)
    * [Cascade Interface Mapping](hat-04-02-cascade-interface-mapping.md)
* Implementation Detail
    * [Walkthrough Of Accelerator.compute()](hat-accelerator-compute.md)
    * [How we minimize buffer transfers](hat-minimizing-buffer-transfers.md)

----

# Interface Mapping

## or ... HAT from a Data POV

### Or ... what is this `S32Array` thing and why can't I just pass `int[]` to my kernel?

Here again is the canonical HAT 'hello world' kernel; we'll use it to describe interface mapping.

We implement this in HAT by collecting the kernel(s) and compute method(s) in a `Compute` class.

```java
public class SquareCompute {
    @CodeReflection
    public static int square(int v) {
        return v * v;
    }

    @CodeReflection
    public static void squareKernel(KernelContext kc, S32Array s32Array) {
        int value = s32Array.array(kc.x);     // arr[kc.x]
        s32Array.array(kc.x, square(value));  // arr[kc.x] = value * value
    }

    @CodeReflection
    public static void square(ComputeContext cc, S32Array s32Array) {
        cc.dispatchKernel(s32Array.length(),
            kc -> squareKernel(kc, s32Array)
        );
    }
}
```

We dispatch it by creating the appropriate data buffer and then asking an `Accelerator` (bound to a typical vendor backend) to execute the compute method, which in turn coordinates the dispatch of the various kernels.

```java
// Create an accelerator bound to a particular backend
var accelerator = new Accelerator(
    java.lang.invoke.MethodHandles.lookup(),
    Backend.FIRST // Predicate<Backend>
);

// Ask the accelerator/backend to allocate an S32Array
var s32Array = S32Array.create(accelerator, 32);

// Fill it with data
for (int i = 0; i < s32Array.length(); i++) {
    s32Array.array(i, i);
}

// Tell the accelerator to execute the square() compute entrypoint
accelerator.compute(
    cc -> SquareCompute.square(cc, s32Array)
);

// Check the data
for (int i = 0; i < s32Array.length(); i++) {
    System.out.println(i + " " + s32Array.array(i));
}
```

HAT kernels only accept Java primitives and HAT buffers as parameters.

We don't directly support heap-allocated data (such as `int[]`).

From Java's point of view `S32Array` is a `hat.Buffer` and is defined as an interface.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
}
```

From a C99-style OpenCL/CUDA point of view this will eventually be mapped to a typedef.

```C++
typedef struct S32Array_s{
    int length;
    int array[];   // <-- flexible array member, sized at allocation time
} S32Array_t;
```
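
Note that nothing heap-allocated crosses that boundary directly: if your data starts life as a Java `int[]`, it has to be copied into a HAT buffer before dispatch. A minimal sketch of such a copy, using only the `S32Array` API shown above (the helper name `copyToBuffer` is illustrative, not part of HAT):

```java
// Hypothetical helper (not part of HAT): copy a heap int[] into a freshly
// allocated S32Array so the data can be handed to a kernel.
static S32Array copyToBuffer(Accelerator accelerator, int[] heapData) {
    S32Array buffer = S32Array.create(accelerator, heapData.length);
    for (int i = 0; i < heapData.length; i++) {
        buffer.array(i, heapData[i]);   // element setter defined by the interface
    }
    return buffer;
}
```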
Our Java implementations should treat the interface as `data`; generally the only methods that we include in a `hat.Buffer` should be

```java
T name();                      // getter for a field called name with type T (T may be primitive or an inner interface)
void name(T name);             // setter for a field called name with type T (T must be primitive)
T name(long idx);              // get array element [idx] where the array is called name (T may be primitive or an inner interface)
void name(long idx, T name);   // set array element [idx] where the array is called name (T must be primitive)
```

Algorithms can assume that an interface is 'bound' to 'some' concrete data layout.

We could, for example, implement `S32Array` like this.

```java
class JavaS32Array implements S32Array {
    int[] arr;
    public int length() { return arr.length; }
    public int array(long idx) { return arr[(int) idx]; }
    public void array(long idx, int value) { arr[(int) idx] = value; }
    public void length(int len) { } // we'll come back to this ;)
}
```

But for HAT to access native memory allocated by the appropriate backend, we need interfaces bound to MemorySegments.

HAT includes an API which allows us to take an interface which extends `hat.Buffer` and 'bind' it to a Panama FFM MemorySegment.

This binding process automatically maps the accessors (for example `length()`, `array(long idx, int v)`) to the low-level Method and Var handle trickery underlying MemorySegments.

Conceptually we might imagine that HAT creates something like this

```java
class PanamaS32Array implements S32Array {
    MemorySegment segment;
    final int SIZEOFINT = 4;
    final long lenOffset = 0;
    final long arrayOffset = lenOffset + SIZEOFINT;
    public int length() { return segment.get(JAVA_INT, lenOffset); }
    public int array(long idx) { return segment.get(JAVA_INT, arrayOffset + idx * SIZEOFINT); }
    public void array(long idx, int value) { segment.set(JAVA_INT, arrayOffset + idx * SIZEOFINT, value); }
    public void length(int len) { } // we'll come back to this ;)
}
```

Much like Java's `Proxy` class, the iface mapper creates an implementation of the interface 'on the fly': the Class-File API is used to 'spin up' the new class, and the accessors are composed using Var/Method Handles and offsets derived from the size and order of fields.

Sadly an interface is not quite enough to establish exactly what is needed to complete the mapping. We need to tell the iface mapper the order and size of the fields, and possibly some padding information.

We do this by providing a 'layout description' using Panama's `MemoryLayout` API.

```java
MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
    JAVA_INT.withName("length"),
    MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
).withName(S32Array.class.getSimpleName());
```
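
This layout is what gives the iface mapper its concrete offsets. As a rough illustration of what can be derived from it using plain Panama FFM calls (ordinary JDK API, not HAT-specific code; N is assumed to be 32 here):

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;
import static java.lang.foreign.ValueLayout.JAVA_INT;

public class S32ArrayLayoutDemo {
    public static void main(String[] args) {
        int N = 32;
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
            JAVA_INT.withName("length"),
            MemoryLayout.sequenceLayout(N, JAVA_INT).withName("array")
        ).withName("S32Array");

        // The offsets the mapper needs fall out of the layout
        long lenOffset   = s32ArrayLayout.byteOffset(MemoryLayout.PathElement.groupElement("length")); // 0
        long arrayOffset = s32ArrayLayout.byteOffset(MemoryLayout.PathElement.groupElement("array"));  // 4

        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = arena.allocate(s32ArrayLayout);
            segment.set(JAVA_INT, lenOffset, N);                              // length = 32
            segment.set(JAVA_INT, arrayOffset + 5 * JAVA_INT.byteSize(), 42); // array[5] = 42
            System.out.println(segment.get(JAVA_INT, arrayOffset + 5 * JAVA_INT.byteSize()));
        }
    }
}
```

This is essentially what the generated `PanamaS32Array`-style accessors do on every call, which is why the field order and sizes have to be spelled out somewhere.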
Eventually we came to a common pattern for describing HAT buffers: we add a `create` method to the interface which hides the mapping detail.

So the complete `S32Array` looks like this

(....ish)

```java
public interface S32Array extends Buffer {
    int length();

    void length(int i);

    int array(long idx);

    void array(long idx, int i);

    static S32Array create(Accelerator a, int len) {
        MemoryLayout s32ArrayLayout = MemoryLayout.structLayout(
            JAVA_INT.withName("length"),
            MemoryLayout.sequenceLayout(len, JAVA_INT).withName("array")
        ).withName(S32Array.class.getSimpleName());

        S32Array s32Array = a.allocate(
            SegmentMapper.of(MethodHandles.lookup(), S32Array.class, s32ArrayLayout, len)
        );

        return s32Array;
    }
}
```

So now hopefully this code makes more sense.

```java
var s32Array = S32Array.create(accelerator, 32);
```

Whilst this code is much nicer than hand-mapping each method to offsets, it is still quite verbose.

In the last few weeks we have been migrating to a Schema builder which makes this easier to express.

```java
public interface S32Array extends Buffer {
    int length();
    void length(int i);
    int array(long idx);
    void array(long idx, int i);
    Schema<S32Array> schema = Schema.of(S32Array.class, s -> s
        .arrayLen("length")
        .array("array")
    );
}
```

The schema is embedded inside the interface and defines the order of the fields. It also allows us to bind fields to each other: above we are telling the schema that we have an `int length` field followed by an `int array[]` field, and that the first defines the size of the second. We can also describe useful 'HAT' information for fields, such as whether a field is 'atomic' ;)

Here is an example of a table of `Result`s for the face detector.

```java
public interface ResultTable extends Buffer {
    interface Result extends Buffer.StructChild {
        float x();
        void x(float x);
        float y();
        void y(float y);
    }
    void count(int count);
    int count();
    int length();
    Result result(long idx);

    Schema<ResultTable> schema = Schema.of(ResultTable.class, s -> s
        .atomic("count")
        .arrayLen("length")
        .array("result", r -> r
            .field("x")
            .field("y")
        )
    );
}
```

Which in C99 OpenCL code will manifest as

```C++
typedef struct Result_s{
    float x, y;
} Result_t;

typedef struct ResultTable_s{
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;
```

In our Java code this interface makes access to MemorySegments much cleaner.

```java
ResultTable resultTable = ResultTable.create(acc, 100);
for (int i = 0; i < resultTable.length(); i++) {
    Result result = resultTable.result(i);
    result.x(0);
    result.y(0);
}
```

The OpenCL/C99 generated from the Java kernel code is also quite clean.

We might use a kernel to initialize the location of a bunch of `Result`s

```java
@CodeReflection
public static void init(KernelContext kc, ResultTable resultTable) {
    if (kc.x < kc.maxX) {
        Result result = resultTable.result(kc.x);
        result.x(kc.x);
        result.y(100);
    }
}
```
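
To actually run this kernel we need a compute entrypoint that dispatches it, following exactly the pattern used by `SquareCompute.square()` above. A sketch (the method name `initCompute` and its enclosing class are illustrative, not taken from the HAT sources):

```java
@CodeReflection
public static void initCompute(ComputeContext cc, ResultTable resultTable) {
    // Dispatch one kernel work item per result slot
    cc.dispatchKernel(resultTable.length(),
        kc -> init(kc, resultTable)
    );
}
```

As with the square example, the host side would then call `accelerator.compute(cc -> initCompute(cc, resultTable))`.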
The generated OpenCL for the `init` kernel will look something like this.

```
typedef struct KernelContext_s{
    int x;
    int maxX;
} KernelContext_t;

typedef struct Result_s{
    float x, y;
} Result_t;

typedef struct ResultTable_s{
    int count;
    int length;
    Result_t result[0];
} ResultTable_t;

__kernel void init(
    __global KernelContext_t *empty,
    __global ResultTable_t *resultTable
){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);

    if (kc->x < kc->maxX){
        __global Result_t *result = &resultTable->result[kc->x];
        result->x = kc->x;
        result->y = 100;
    }
    return;
}
```

A few notes on this generated code...

* `KernelContext` is itself just an iface-mapped segment.
    - But we don't pass `kc.x` or `kc.maxX` in the segment.
    - Instead we initialize them using the appropriate vendor-specific calls.

So for OpenCL all kernels start like this

```
__kernel void init(__global KernelContext_t *empty, ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = get_global_id(0);
    kc->maxX = get_global_size(0);
    ....
}
```

Whereas CUDA ;)

```
__global__ void init(KernelContext_t *empty, ....){
    KernelContext_t kernelContext;
    KernelContext_t *kc = &kernelContext;
    kc->x = blockIdx.x * blockDim.x + threadIdx.x;
    kc->maxX = gridDim.x * blockDim.x;
    ....
}
```

This simplifies code generation; generally the generated CUDA and OpenCL code look identical.

----

The iface mapping code in HAT is a modified copy of Per's segment mapping code from

https://github.com/minborg/panama-foreign/blob/segment-mapper/src/java.base/share/classes
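
As a rough intuition for what that segment-mapper approach does, here is a deliberately simplified sketch that binds the `S32Array` accessors to a `MemorySegment` using `java.lang.reflect.Proxy`. This is illustration only: it is not the code at the link above, and HAT itself spins up a real class with the Class-File API and Var/Method handles rather than dispatching reflectively.

```java
import java.lang.foreign.MemorySegment;
import java.lang.reflect.Proxy;
import static java.lang.foreign.ValueLayout.JAVA_INT;

// Simplified illustration only, not the HAT or segment-mapper implementation.
public class ProxyBoundS32Array {
    static S32Array bind(MemorySegment segment) {
        final long lenOffset = 0;                       // int length ...
        final long arrayOffset = JAVA_INT.byteSize();   // ... followed by int array[]
        return (S32Array) Proxy.newProxyInstance(
            S32Array.class.getClassLoader(),
            new Class<?>[]{S32Array.class},
            (proxy, method, args) -> {
                switch (method.getName()) {
                    case "length":
                        if (args == null) {                                   // int length()
                            return segment.get(JAVA_INT, lenOffset);
                        }
                        segment.set(JAVA_INT, lenOffset, (int) args[0]);      // void length(int i)
                        return null;
                    case "array":
                        long offset = arrayOffset + (long) args[0] * JAVA_INT.byteSize();
                        if (args.length == 1) {                               // int array(long idx)
                            return segment.get(JAVA_INT, offset);
                        }
                        segment.set(JAVA_INT, offset, (int) args[1]);         // void array(long idx, int i)
                        return null;
                    default:
                        throw new UnsupportedOperationException(method.getName());
                }
            });
    }
}
```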