## State of foreign memory support

**December 2023**

**Maurizio Cimadamore**

A crucial part of any native interop story lies in the ability to access off-heap memory efficiently and safely. Java achieves this goal through the Foreign Function & Memory API (FFM API in short), parts of which have been available as an [incubating](https://openjdk.java.net/jeps/11) API since Java [14](https://openjdk.java.net/jeps/370). The FFM API introduces abstractions to allocate and access flat memory regions (whether on- or off-heap), to manage the lifecycle of memory resources and to model native memory addresses.

### Memory segments and arenas

Memory segments are abstractions which can be used to model contiguous memory regions, located either on the Java heap (i.e. *heap segments*) or off the Java heap (i.e. *native segments*). Memory segments provide *strong* spatial, temporal and thread-confinement guarantees which make memory dereference operations *safe* (more on that later), although in most simple cases some properties of memory segments can safely be ignored.

For instance, the following snippet allocates 100 bytes off-heap:

```java
MemorySegment segment = Arena.global().allocate(100);
```

The above code allocates a 100-byte memory segment, using an *arena*. The FFM API provides several kinds of arenas, which can be used to control the lifecycle of the allocated native segments in different ways. In this example, the segment is allocated with the *global* arena. Memory segments allocated with this arena are always *alive* and their backing regions of memory are never deallocated. In other words, we say that the above segment has an *unbounded* lifetime.

> Note: the lifetime of a memory segment is modelled by a *scope* (see `MemorySegment.Scope`). A memory segment can be accessed as long as its associated scope is *alive* (see `Scope::isAlive`). In most cases, the scope of a memory segment is the scope of the arena which allocated that segment. Accessing the scope of a segment can be useful to perform lifetime queries (e.g. asking whether a segment has the same lifetime as that of another segment), to create custom arenas, and to unsafely assign new temporal bounds to an existing native memory segment (these topics are explored in more detail below).

Most programs, though, require off-heap memory to be deallocated while the program is running, and thus need memory segments with *bounded* lifetimes. The simplest way to obtain a segment with a bounded lifetime is to use an *automatic arena*:

```java
MemorySegment segment = Arena.ofAuto().allocate(100);
```

Segments allocated with an automatic arena are alive as long as they are determined to be reachable by the garbage collector. In other words, the above snippet creates a native segment whose behavior closely matches that of a `ByteBuffer` allocated with the `allocateDirect` factory.

There are cases, however, where automatic deallocation is not enough: consider the case where a large memory segment is mapped from a file (this is possible using `FileChannel::map`); in this case, an application would probably prefer to release (e.g. `unmap`) the memory associated with this segment in a *deterministic* fashion, to ensure that the memory doesn't remain available for longer than it needs to.

A *confined* arena allocates segments featuring a bounded *and* deterministic lifetime. A memory segment allocated with a confined arena is alive from the time when the arena is opened, until the time when the arena is closed (at which point the segments become inaccessible). Multiple segments allocated with the same arena enjoy the *same* bounded lifetime and can safely contain mutual references. For example, this code opens an arena and uses it to allocate several native segments:

```java
try (Arena arena = Arena.ofConfined()) {
    MemorySegment segment1 = arena.allocate(100);
    MemorySegment segment2 = arena.allocate(100);
    ...
    MemorySegment segmentN = arena.allocate(100);
} // all segments are deallocated here
```

When the arena is closed (above, this is done with the *try-with-resources* construct) the arena is no longer alive, all the segments associated with it are invalidated atomically, and the regions of memory backing the segments are deallocated.
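To make the temporal guarantees concrete, here is a minimal illustrative sketch (not part of the original example) showing what happens when a segment is accessed after its confined arena has been closed; it uses `Scope::isAlive`, mentioned in the note above, and a value layout (`ValueLayout.JAVA_INT`), which is covered in a later section:

```java
MemorySegment segment;
try (Arena arena = Arena.ofConfined()) {
    segment = arena.allocate(100);
    // ... use the segment while the arena is alive ...
}
// the arena has been closed: the segment's scope is no longer alive
boolean alive = segment.scope().isAlive(); // false
segment.get(ValueLayout.JAVA_INT, 0);      // throws IllegalStateException
```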
A confined arena's deterministic lifetime comes at a price: only one thread can access the memory segments allocated in a confined arena. If multiple threads need access to a segment, then a *shared* arena can be used (`Arena::ofShared`). The memory segments allocated in a shared arena can be accessed by multiple threads, and any thread (regardless of whether it was involved in access) can close the shared arena to deallocate the segments. Closing the arena atomically invalidates the segments, though deallocation of the regions of memory backing the segments might not occur immediately: an expensive synchronization operation<a href="#1"><sup>1</sup></a> is needed to detect and cancel pending concurrent access operations on the segments.

In summary, an arena controls *which* threads can access a memory segment and *when*, in order to provide both strong temporal safety and a predictable performance model. The FFM API offers a choice of arenas so that a client can trade off breadth-of-access against timeliness of deallocation.

### Slicing segments

Memory segments support *slicing* — that is, given a segment, it is possible to create a new segment whose spatial bounds are stricter than those of the original segment:

```java
MemorySegment segment = Arena.ofAuto().allocate(10);
MemorySegment slice = segment.asSlice(4, 4);
```

The above code creates a slice that starts at offset 4 and has a length of 4 bytes. Slices have the *same* temporal bounds (i.e. segment scope) as the parent segment. In the above example, the memory associated with the parent segment will not be released as long as there is at least one *reachable* slice derived from that segment.

To process the contents of a memory segment in bulk, a memory segment can be turned into a stream of slices, using the `MemorySegment::elements` method:

```java
SequenceLayout seq = MemoryLayout.sequenceLayout(1_000_000, JAVA_INT);
SequenceLayout bulk_element = MemoryLayout.sequenceLayout(100, JAVA_INT);

try (Arena arena = Arena.ofShared()) {
    MemorySegment segment = arena.allocate(seq);
    int sum = segment.elements(bulk_element).parallel()
                     .mapToInt(slice -> {
                         int res = 0;
                         for (int i = 0; i < 100 ; i++) {
                             res += slice.getAtIndex(JAVA_INT, i);
                         }
                         return res;
                     }).sum();
}
```

The `MemorySegment::elements` method takes an element layout and returns a new stream. The stream is built on top of a spliterator instance (see `MemorySegment::spliterator`) which splits the segment into chunks whose size matches that of the provided layout. Here, we want to sum the elements of an array which contains a million elements; doing a parallel sum where each computation processes *exactly* one element would be inefficient, so instead we use a *bulk* element layout. The bulk element layout is a sequence layout containing a group of 100 elements — which should make it more amenable to parallel processing. Since we are using `Stream::parallel` to work on disjoint slices in parallel, we use a *shared* arena here, to ensure that the resulting segment can be accessed by multiple threads.
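For comparison, here is a hedged sketch of the simpler sequential alternative (assuming the same `segment` as above, still within the arena's scope): each computation processes exactly one element, so no bulk layout is needed, but no parallelism is gained either:

```java
// sequential variant: one slice per element; simpler, but no parallelism
int sum = segment.elements(JAVA_INT)
                 .mapToInt(slice -> slice.get(JAVA_INT, 0))
                 .sum();
```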
### Accessing segments

Memory segments can be dereferenced easily, using *value layouts* (layouts are covered in greater detail in the next section). A value layout captures information such as:

- The number of bytes to be dereferenced;
- The alignment constraints of the address at which dereference occurs;
- The endianness with which bytes are stored in said memory region;
- The Java type to be used in the dereference operation (e.g. `int` vs `float`).

For instance, the layout constant `ValueLayout.JAVA_INT` is four bytes wide, is aligned on 4-byte boundaries, uses the native platform endianness (e.g. little-endian on Linux/x64) and is associated with the Java type `int`. The following example reads pairs of 32-bit values (as Java ints) and uses them to construct an array of points:

```java
record Point(int x, int y) {}
MemorySegment segment = Arena.ofAuto().allocate(10 * 4 * 2);
Point[] values = new Point[10];
for (int i = 0 ; i < values.length ; i++) {
    int x = segment.getAtIndex(JAVA_INT, i * 2);
    int y = segment.getAtIndex(JAVA_INT, (i * 2) + 1);
    values[i] = new Point(x, y);
}
```

The above snippet allocates a flat array of 80 bytes using an automatic arena. Then, inside the loop, elements in the array are accessed using the `MemorySegment::getAtIndex` method, which accesses `int` elements in a segment at a certain *logical* index (under the hood, the segment offset being accessed is obtained by multiplying the logical index by 4, which is the stride of a Java `int` array). Thus, all the `x` and `y` coordinates are collected into instances of a `Point` record.

### Structured access

Expressing byte offsets (as in the example above) can lead to code that is hard to read and very fragile — as memory layout invariants are captured, implicitly, in the constants used to scale offsets. To address this issue, clients can use a `MemoryLayout` to describe the contents of a memory segment *programmatically*. For instance, the layout of the array used in the above example can be expressed using the following code <a href="#2"><sup>2</sup></a>:

```java
MemoryLayout points = MemoryLayout.sequenceLayout(10,
    MemoryLayout.structLayout(
        JAVA_INT.withName("x"),
        JAVA_INT.withName("y")
    )
);
```

That is, our layout is a repetition of 10 *struct* elements, each containing two 32-bit values.
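As a quick illustrative check (not part of the original example), the size and alignment described by this layout can be queried directly; the total size matches the 80 bytes we allocated by hand above:

```java
long size = points.byteSize();           // 80: 10 elements, 8 bytes per (x, y) pair
long alignment = points.byteAlignment(); // 4: the alignment constraint of the int members
```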
Once defined, a memory layout can be queried — for instance, we can compute the offset of the `y` coordinate in the 4th element of the `points` array:

```java
long y3 = points.byteOffset(PathElement.sequenceElement(3), PathElement.groupElement("y")); // 28
```

To specify which nested layout element should be used for the offset calculation we use a *layout path* — a selection expression that navigates the layout, from the *root* layout, down to the leaf layout we wish to select; in this case we need to select the 4th layout element in the sequence, and then select the layout named `y` inside the selected group layout.

One of the things that can be derived from a layout is a *memory access var handle*. A memory access var handle is a special kind of var handle which takes a memory segment as an access coordinate, together with a byte offset — the offset, relative to the segment's base address, at which the dereference operation should occur. With memory access var handles we can rewrite our example above as follows:

```java
MemorySegment segment = Arena.ofAuto().allocate(points);
VarHandle xHandle = points.varHandle(PathElement.sequenceElement(), PathElement.groupElement("x"));
VarHandle yHandle = points.varHandle(PathElement.sequenceElement(), PathElement.groupElement("y"));
Point[] values = new Point[10];
for (int i = 0 ; i < values.length ; i++) {
    int x = (int)xHandle.get(segment, 0L /* base offset */, (long)i /* index */);
    int y = (int)yHandle.get(segment, 0L /* base offset */, (long)i /* index */);
    values[i] = new Point(x, y);
}
```

In the above, `xHandle` and `yHandle` are two var handle instances whose type is `int` and which take three access coordinates:

1. a `MemorySegment` instance: the segment whose memory should be dereferenced;
2. a *base offset*, which indicates the portion of the memory segment to be accessed; this is typically set to zero (as above), but can be useful when combining memory access var handles (see below);
3. a *logical* index, which is used to select the element of the sequence we want to access (as the layout path used to construct these var handles contains one free dimension).

In other words, the offsets at which `xHandle` and `yHandle` access memory can be expressed as follows (each point struct is 8 bytes long, and `y` is located 4 bytes into it):

```java
xOffset = baseOffset + (index * 8);      // 8 = byte size of one point struct
yOffset = baseOffset + (index * 8) + 4;  // 4 = offset of y within the struct
```

Or, equivalently, using the `MemoryLayout::scale` method — which computes `offset + (index * byteSize())` for the layout it is invoked on — where `pointLayout` denotes the nested struct layout describing a single point:

```java
xOffset = pointLayout.scale(baseOffset, index);
yOffset = pointLayout.scale(baseOffset, index) + 4;
```

Note that memory access var handles (as any other var handle) are *strongly* typed; to get maximum efficiency, it is generally necessary to introduce casts to make sure that the access coordinates match the expected types — in this case we have to cast `i` into a `long`; similarly, since the signature-polymorphic method `VarHandle::get` notionally returns `Object`, a cast is necessary to force the right return type for the var handle operation <a href="#3"><sup>3</sup></a>.

In other words, manual offset computation is no longer needed — offsets and strides can in fact be derived from the layout object; note how `yHandle` is able to compute the required offset of the `y` coordinate in the flat array without the need for any error-prone arithmetic computation.
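The same layout-derived var handles can of course also be used to write memory; the following illustrative fragment (not in the original text) fills the flat array through `xHandle` and `yHandle`, reusing the `segment` allocated above:

```java
for (int i = 0 ; i < 10 ; i++) {
    xHandle.set(segment, 0L, (long) i, i);     // sets the x coordinate of the i-th point
    yHandle.set(segment, 0L, (long) i, i * 2); // sets the y coordinate of the i-th point
}
```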
### Combining memory access handles

We have seen in the previous sections how memory access var handles dramatically simplify user code when structured access is involved. While deriving memory access var handles from a layout is the most convenient option, the FFM API also allows clients to create such memory access var handles in a standalone fashion, as demonstrated in the following code:

```java
VarHandle intHandle = JAVA_INT.varHandle(); // (MS, J) -> I
```

The above code creates a memory access var handle which reads/writes `int` values at a certain byte offset in a segment. To create this var handle we have to specify a carrier type — the type we want to use e.g. to extract values from memory — as well as whether any byte swapping should be applied when contents are read from or stored to memory. Additionally, the user might want to impose extra constraints on how memory dereferences should occur; for instance, a client might want to prevent access to misaligned 32-bit values. Of course, all this information can be succinctly derived from the provided value layout (`JAVA_INT` in the above example).

The attentive reader might have noted how the var handles obtained from the sequence layout in the previous section can in fact be derived from the simple memory access var handle we have constructed here. That is, var handles can be adapted and turned into more complex var handles, using var handle *combinators*. Developers familiar with the method handle API know how simpler method handles can be combined into more complex ones using the various combinator methods in the `MethodHandles` class. These methods allow, for instance, inserting (or binding) arguments into a target method handle, filtering return values, permuting arguments and much more.

The FFM API adds a rich set of var handle combinators in the `MethodHandles` class; with these tools, developers can express var handle transformations such as:

* mapping a var handle carrier type into a different one, using an embedding/projection method handle pair
* filtering one or more var handle access coordinates using unary filters
* permuting var handle access coordinates
* binding concrete access coordinates to an existing var handle

Without diving too deep, let's consider how we might want to take a basic memory access handle and turn it into a var handle which dereferences a segment at a specific offset (again using the `points` layout defined previously):

```java
VarHandle intHandle = JAVA_INT.varHandle(); // (MS, J) -> I
long offsetOfY = points.byteOffset(PathElement.sequenceElement(3), PathElement.groupElement("y"));
VarHandle valueHandle = MethodHandles.insertCoordinates(intHandle, 1, offsetOfY); // (MS) -> I
```

We have been able to derive, from a basic memory access var handle, a new var handle that dereferences a segment at a given fixed offset. It is easy to see how other, richer var handles obtained using a memory layout can also be constructed manually using the var handle combinators provided by the FFM API.
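To see the adapted handle in action, here is a small illustrative sketch (assuming a segment allocated with the `points` layout, as in the earlier example): since the byte offset has been bound by the combinator, the segment is the only remaining access coordinate:

```java
MemorySegment segment = Arena.ofAuto().allocate(points); // same layout as before
valueHandle.set(segment, 42);                            // writes the y coordinate of the 4th point (offset 28)
int y = (int) valueHandle.get(segment);                  // 42
```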
### Segment allocators and custom arenas

Memory allocation is often a bottleneck when clients use off-heap memory. The FFM API therefore includes a `SegmentAllocator` interface which defines operations to allocate and initialize memory segments. As a convenience, the `Arena` interface extends the `SegmentAllocator` interface, so that arenas can be used to allocate native segments. In other words, `Arena` is a "one-stop shop" for flexible allocation and timely deallocation of off-heap memory:

```java
FileChannel channel = ...
try (Arena offHeap = Arena.ofConfined()) {
    MemorySegment nativeArray = offHeap.allocateFrom(ValueLayout.JAVA_INT, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
    MemorySegment nativeString = offHeap.allocateFrom("Hello!");

    MemorySegment mappedSegment = channel.map(MapMode.READ_WRITE, 0, 1000, offHeap);
    ...
} // memory released here
```

Segment allocators can also be obtained via factories in the `SegmentAllocator` interface. For example, one factory creates a *slicing allocator* that responds to allocation requests by returning memory segments which are part of a previously allocated segment; thus, many requests can be satisfied without physically allocating more memory. The following code obtains a slicing allocator over an existing segment, then uses it to allocate a segment initialized from a Java array:

```java
MemorySegment segment = ...
SegmentAllocator allocator = SegmentAllocator.slicingAllocator(segment);
for (int i = 0 ; i < 10 ; i++) {
    MemorySegment s = allocator.allocateFrom(JAVA_INT, 1, 2, 3, 4, 5);
    ...
}
```

A segment allocator can be used as a building block to create an arena that supports a custom allocation strategy. For example, if many segments share the same bounded lifetime, then an arena could use a slicing allocator to allocate the segments efficiently. This lets clients enjoy both scalable allocation (thanks to slicing) and deterministic deallocation (thanks to the arena).

As an example, the following code defines a *slicing arena* that behaves like a confined arena (i.e., single-threaded access), but internally uses a slicing allocator to respond to allocation requests. When the slicing arena is closed, the underlying confined arena is also closed; this will invalidate all segments allocated with the slicing arena:

```java
class SlicingArena implements Arena {
    final Arena arena = Arena.ofConfined();
    final SegmentAllocator slicingAllocator;

    SlicingArena(long size) {
        slicingAllocator = SegmentAllocator.slicingAllocator(arena.allocate(size));
    }

    public MemorySegment allocate(long byteSize, long byteAlignment) {
        return slicingAllocator.allocate(byteSize, byteAlignment);
    }

    public MemorySegment.Scope scope() {
        return arena.scope();
    }

    public void close() {
        arena.close();
    }
}
```

The earlier code which used a slicing allocator directly can now be written more succinctly, as follows:

```java
try (Arena slicingArena = new SlicingArena(1000)) {
    for (int i = 0 ; i < 10 ; i++) {
        MemorySegment s = slicingArena.allocateFrom(JAVA_INT, 1, 2, 3, 4, 5);
        ...
    }
} // all memory allocated is released here
```
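As with any arena, segments allocated with the slicing arena share its lifetime; the following minimal sketch (assuming the `SlicingArena` class above) illustrates that closing the slicing arena, and hence the backing confined arena, invalidates them:

```java
MemorySegment s;
try (Arena slicingArena = new SlicingArena(1000)) {
    s = slicingArena.allocateFrom(JAVA_INT, 1, 2, 3, 4, 5);
} // the backing confined arena is closed here
s.get(JAVA_INT, 0); // throws IllegalStateException: the segment is no longer alive
```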
* <a id="1"/>(<sup>1</sup>):<small> Shared arenas rely on VM thread-local handshakes (JEP [312](https://openjdk.java.net/jeps/312)) to implement lock-free, safe, shared memory access; that is, when it comes to memory access, there should be no difference in performance between a shared segment and a confined segment. On the other hand, `Arena::close` might be slower on shared arenas than on confined ones.</small>
* <a id="2"/>(<sup>2</sup>):<small> In general, deriving a complete layout from a C `struct` declaration is no trivial matter, and it's one of those areas where tooling can help greatly.</small>
* <a id="3"/>(<sup>3</sup>):<small> Clients can enforce stricter type checking when interacting with `VarHandle` instances by obtaining an *exact* var handle, using the `VarHandle::withInvokeExactBehavior` method.</small>