1 ## Design Points 2 3 This section discusses design decisions, as guided by the following tenets. 4 5 ### Tenets 6 7 1. Don’t punish the mutator: While mutator code must be added to 8 implement generational collection, degradations of 9 mutator throughput and responsiveness 10 should be contained as much as possible. 11 If in doubt between design choices, 12 pick the one that promises to minimize mutator overhead. 13 2. Safeguard incremental progress with milestones that demonstrate key capabilities. 14 3. Minimize the overall development effort. This includes containing 15 efforts to prepare and present each milestone deliverable 16 that do not directly contribute to the end product. 17 18 ### Design Decisions 19 20 While it is probable that we may revisit some of the following design 21 decisions if we run into implementation difficulties or performance 22 problems, the following is the current plan of record. 23 24 1. This document is maintained alongside the source code. 25 1. It is organized hierarchically with sections 26 dedicated to major design decisions. 27 Further, more detailed decisions are organized in top-down fashion, 28 with overarching design decisions preceding derivative design decisions. 29 The topic numbers assigned to particular design decisions may evolve over time. 30 2. There is a separate "rationale" document that explains more of the reasoning behind design decisions 31 and links relevant portions between documents. 32 2. Genshen changes to the Shenandoah implementation will be 33 compartmented. See discussion [here](rationale.summary.md#compartmentalization"). 34 3. We will support concurrent collection of old gen and young gen, 35 with the expectation that a single old collection will typically 36 overlap with the execution of many young collections. 37 4. In order to minimize performance impact on the mutator, we will use 38 a single Load-Reference-Barrier to implement both evacuation 39 from young gen and from old gen. 40 1. Tenuring will be implemented as part of young evacuation. 41 2. All evacuation for both young and old gen regions 42 happens under the control of the same GC threads and is 43 supported by the same Load Reference Barrier (LRB) as in 44 vanilla Shenandoah, with only small refinements to the 45 implementation of slow-path code within the LRB. 46 3. See discussion [here](rationale.summary.md#load-reference-barrier). 47 5. An old collection begins at the same time as a young gen 48 collection, with both collections leveraging the 49 results of a single shared root scan. 50 6. Old collection generally runs at lower priority (e.g., fewer 51 numbers of parallel threads or greater “nice” values for 52 concurrent threads) than young collection because 53 replenishing the allocation pool to support ongoing allocation 54 needs of mutator threads requires urgent completion of young GC. 55 See discussion [here](rationale.summary.md#prioritization). 56 7. Although the planned future production release of GenShen will 57 run young collection concurrently with old collection, support 58 will also be implemented and maintained for alternating executions 59 of young collections with global collections. See 60 discussion [here](rationale.summary.md#young-global-discussion). 61 1. Since this is not a production release, efficiency of 62 implementation is not a primary concern. 63 2. Regression tests will exercise the implementation of this mode 64 of operation. 65 8. A regression test suite will be developed and maintained to 66 support all of the enhanced capabilities of GenShen, including 67 capabilities that are enabled and disabled at build time for the 68 JVM. 69 9. Though old collections start at the same time as young 70 collections collections, they do not necessarily end at the same 71 time. Typically, many young-gen collections will be completed 72 concurrently during the time that a single old-gen collection is 73 running. See discussion [here](rationale.summary.md#concurrency-of-young-and-old). 74 10. Root scanning will be performed during a JVM safe point. 75 11. Scanning of the remembered set will be performed by parallel 76 young-gen GC threads during a JVM safe point. A side 77 effect of scanning is to mark as CLEAN any cards of the remembered 78 set that are found by remembered set scanning to no longer be 79 DIRTY. 80 12. The remembered set maintains a start-of-object representation to 81 facilitate quick identification of where the first object 82 pertaining to each DIRTY card region begins. See discussion 83 [here](rationale.summary.md#remembered-set-starting-offsets). 84 This requires that: 85 1. Newly promoted objects be registered with the remembered set 86 and start-of-object support by post-processing the associated GCLAB block, and 87 2. When entire regions are promoted out of young collections into 88 old gen, the objects contained within those regions must be 89 registered with the remembered set and the start-of-object support. 90 13. Young marking, including scanning of root pointers, 91 will place discovered references to old gen into a thread-local SATB 92 buffer so that they can be processed by an old-gen collection 93 thread. See discussion [here](rationale.summary.md#satb-keep-old-alive-entries). 94 1. Possible optimization: refrain from logging old-gen pointers 95 that refer to already marked old objects into the SATB buffer. 96 14. The default size of the SATB buffer will increase from 1024 to 97 4096 words because we will be placing more information into the 98 SATB buffer. 99 15. Whenever the SATB buffer is full, the slow path for adding to 100 the SATB buffer will attempt to compress the buffer contents before 101 communicating the buffer to GC threads. If compression leaves a 102 minimum of 256 slots available within the SATB buffer, the thread 103 continues to add values to the existing buffer. Compression 104 consists of: 105 1. For pointers to young gen: 106 1. If concurrent marking of young gen is disabled, ignore the 107 pointer. 108 2. Otherwise, if the object referenced by this pointer has 109 already been marked, ignore the pointer. 110 3. If we choose to use SATB-based remembered set, ignore all 111 overwritten address values that reside within young gen. 112 2. For pointers to old gen: 113 1. If concurrent marking of old gen is disabled, ignore the 114 pointer. 115 2. Otherwise, if the object referenced by this pointer 116 has already been marked, ignore the pointer. 117 3. If we choose to use an SATB-based remembered set: 118 - If the card corresponding to an overwritten old gen 119 address is already DIRTY, ignore this overwritten 120 address. 121 - Otherwise, fetch the pointer value held at the overwritten 122 address. If the fetched pointer value does not 123 refer to young gen, ignore this overwritten address. 124 - Otherwise, use a hash table (suggested size 127) to sift 125 redundant DIRTY card references for the current batch of 126 overwritten old-gen addresses. Preserve only one address 127 for each card entry that needs to be marked as DIRTY. 128 3. SATB buffers will be processed by both a young GC 129 thread and an old GC thread. 130 1. The young GC thread will mark objects referenced 131 by young pointers. 132 2. The old GC thread will: 133 1. Mark objects referenced by old-gen pointers, and 134 2. If we choose to use SATB-based remembered set: mark as DIRTY 135 the card entries corresponding to overwritten addresses. 136 16. GC uses a G1-style heuristic to choose collection sets for both 137 young and old memory areas. The collection set represents the 138 heap regions that offer the greatest opportunity to quickly reclaim 139 garbage, as with the existing Shenandoah implementation. See 140 discussion [here](rationale.summary.md#g1-heuristic). 141 17. Following an evacuation phase that evacuates 142 both old and young heap regions, the update-references phase is 143 required to update references throughout all 144 old regions that were not selected as part of the old 145 collection set in addition to updating references in young 146 heap regions. 147 1. If a particular young collection evacuation phase does not 148 evacuate any old regions, then its update references phase can 149 focus solely on young heap regions and the 150 remembered set. 151 18. Tenuring is based on object age with the enhancements described 152 below. See discussion [here](rationale.summary.md#tenuring). 153 1. The variable TenureCycle counts how many GC cycles correspond to 154 each increment of an object’s age. Object ages are not 155 necessarily incremented each time a GC cycle is completed. They 156 are incremented each time TenureCycle GC cycles have been 157 completed. 158 2. The variable TenureAge represents the age at which an object 159 becomes eligible for tenuring. 160 3. During GC cycles that correspond to TenureCycle, the “age” of 161 individual objects is incremented by 1 plus the size of the 162 object’s original heap region age when the object is evacuated. 163 4. During GC cycles that correspond to TenureCycle, the “age” of 164 each heap region that has been fully allocated (i.e. no more 165 available memory for allocation) and that is not in the 166 collection set is incremented by 1 at the end of the evacuation 167 phase. If the resulting “age” equals TenureAge, then the entire 168 region is reassigned to become part of old gen. 169 1. The update-refs phase will process this heap region even 170 though it is “now” considered to be part of old gen. 171 2. Each of the objects contained within the promoted region 172 shall be registered with the remembered set abstraction. 173 3. Each of the objects contained within the promoted region 174 be scanned to determine any references to young-gen 175 memory that are contained therein. For any such pointers, set 176 the associated card table entry to DIRTY. 177 19. During evacuation, each running mutator thread has both a TLAB 178 and a GCLAB. 179 1. The TLAB is associated with a young heap region. 180 2. The GCLAB is associated with an old heap region. 181 3. When the mutator’s load reference barrier encounters a 182 reference for which the associated object needs to be tenured, it 183 allocates the copy memory from the GCLAB. 184 4. When the mutator’s load reference barrier encounters a 185 reference for which the associated object resides within the 186 collection set of old gen, it allocates the copy memory from the 187 GCLAB. 188 5. When the mutator’s load reference barrier encounters a 189 reference for which the associated object needs to be evacuated to 190 young gen, it allocates the copy memory from the TLAB. 191 6. We initially plan to use the same size for TLAB and GCLAB, but 192 this decision may be revisited. 193 7. If the size of the object to be evacuated is larger than half 194 the size of the respective local allocation buffer, allocation 195 of the replica memory is handled by alternative allocators, to be 196 designed. Call these "odd" objects. 197 20. During evacuation, each evacuating GC thread will maintain two 198 GCLAB buffers: 199 1. GCLAB-Young is associated with young gen. 200 2. GCLAB-Old is associated with old gen. 201 3. If the object to be evacuated currently resides in old gen or 202 if it resides in young gen and it is to be tenured, allocate the 203 copy memory from GCLAB-Old. 204 4. Otherwise, allocate the copy memory from GCLAB-Young. 205 5. At the end of the evacuation phase, consider repurposing any 206 unspent GCLAB-Young as a TLAB if there is sufficient unallocated 207 memory remaining within it. 208 6. At the end of the evacuation phase, consider preserving the 209 GCLAB-Old for use as a GCLAB for a mutator thread during the next 210 young collections collection or as a GCLAB-Old during the next 211 old-gen evacuation pass. 212 7. See 19.7. 213 21. A certain budget of CPU time is provisioned to perform young-gen 214 GC in order to support a particular planned workload. 215 1. The resources dedicated to young-gen GC are limited in order 216 to assure a certain amount of CPU time is available to mutator 217 threads. 218 2. Command line options allow user control over provisioning. A 219 TBD API may allow services to adjust provisioning dynamically. 220 3. In the ideal, provisioning is adjusted automatically based on 221 TBD heuristics. 222 4. The provisioned CPU resources can support a range of service 223 quality, reclaiming large numbers of heap regions with a low GC 224 frequency or reclaiming small numbers of heap regions with a 225 high GC frequency. Given a particular frequency of GC cycles, 226 the same CPU resources can evacuate a large number of sparsely 227 populated heap regions or a small number of densely populated 228 heap regions. Tradeoffs between these configuration extremes 229 may be adjusted under software control or by TBD 230 heuristics. 231 5. For each young-gen GC pass, a certain TBD percentage 232 of CPU resources are reserved for old-gen evacuation and 233 update-refs activities. 234 1. The old-gen CPU resource budget represents the total amount 235 of old-gen memory that can be relocated, and is quantified as 236 a multiple N of the heap region size. 237 2. The old-gen GC threads determine the composition of the 238 old-gen collection set, up to but never exceeding the upper 239 bound N on cumulative evacuation size. 240 3. The old-gen GC thread may select for the collection set 241 N heap regions which are known to have 100% 242 utilization, 2N heap regions known to have 50% utilization, 243 5N heap regions known to have 20% utilization, and so on. 244 4. If the old-gen GC refrains from delegating N heap regions 245 worth of evacuation work to the young-gen evacuation phase, 246 then the young GC is free to use the excess CPU resources to 247 more aggressively evacuate more of its own young heap regions, 248 using a larger than normal young-gen collection set. 249 5. The budget for updating old-gen references must be large 250 enough to handle the total old-gen memory size - N. In the 251 case that old-gen GC configures a collection set that 252 represents its full evacuation capacity of N heap regions, the 253 number of old-gen heap regions that are not part of the 254 old-gen collection set is never larger than this quantity. 255 In the case that old-gen GC configures a smaller collection 256 set, then for each fraction of a heap region that is not 257 evacuated, this much more of a heap region might have to be 258 processed during the update-refs phase of GC. We estimate 259 that the cost of evacuating M bytes of memory is similar to 260 the cost of updating the references within M bytes of 261 memory. 262 22. A smaller (generally) budget of CPU time is provisioned to 263 perform old-gen GC in order to support a particular planned 264 workload. 265 1. The resources dedicated to young-gen GC are limited in order 266 to assure a certain amount of CPU time is available to mutator 267 threads. 268 2. Command-line options allow user control over provisioning. A 269 TBD API may allow services to adjust provisioning dynamically. 270 3. In the ideal, provisioning is adjusted automatically based on 271 TBD heuristics. 272 4. As with young-gen GC, the CPU resources provisioned for 273 old-gen GC can support a range of service quality. 274 5. The CPU resources dedicated to old-gen collection do not have 275 responsibility for evacuating old regions as all evacuation 276 is performed by young-gen GC threads. Instead, the CPU 277 resources dedicated to old-gen GC activities are used for 278 activities such as the following: 279 1. Processing the content of SATB buffers: 280 - If old collection is in concurrent marking phase, mark 281 objects referenced by any keep-alive pointers. 282 - If we are using SATB-based remembered set, update the 283 remembered set based on overwritten addresses reported in the 284 SATB buffer. 285 2. During concurrent marking, scan the contents of previously 286 marked objects to complete the transitive closure of 287 reachability. 288 3. Perform heuristic computations: 289 - Determine when to start the next old-gen GC cycle. 290 - Determine which old regions to evacuate on this and future 291 passes of young-gen GC. 292 - Adjust young-gen efficiency parameters such as: How many 293 heap regions should be dedicated to young gen? What is 294 optimal value of TenureCycle? What is optimal value of 295 TenureAge? How much CPU time should be dedicated to 296 young-gen GC? 297 - Adjust old-gen efficiency parameters such as: How much CPU 298 time should be dedicated to old-gen GC? How many heap regions 299 should be dedicated to old gen? Should any heap regions be 300 decommissioned and returned to the operating system in order 301 to shrink the memory footprint of this service? 302 4. Perform routine maintenance as time and schedule permits: 303 - Potentially sweep up dead memory, accumulating ages at 304 which dead objects were reclaimed within old regions. 305 - Potentially, sweep through evacuated memory to accumulate 306 ages at which dead objects were reclaimed. 307 - Organize free lists for fast and efficient allocation of 308 GCLAB and Odd object allocations. 309 - Return emptied old regions to the free set. 310 - Eventually, reference processing and string deduplication 311 should be performed by lower priority old-gen threads 312 rather than higher priority young-gen threads. 313 5. Following completion of each old-gen concurrent mark pass, 314 select regions to be included in the old-gen collection set: 315 - No more than a total of N bytes of old gen is evacuated by 316 each pass of the young-gen evacuator. 317 - If old-gen GC threads desire to evacuate M, which is more 318 than N bytes of old gen, it does so by piggy backing on 319 multiple subsequent young-gen evacuation passes, selecting 320 evacuating no more than the accumulation of N total heap 321 regions in each of the following young-gen evacuation passes. 322 - Since it is most efficient to evacuate all M > N regions 323 of old-gen memory with a single young-gen evacuation pass, 324 configure the old-gen collection set to include all M 325 regions if there are sufficient free regions available to 326 afford the young-gen allocator to continue allocating new 327 objects during the longer delay that would be required to 328 evacuate more than the traditionally budgeted N regions of 329 old-gen memory.