1 ## Design Points
  2 
  3 This section discusses design decisions, as guided by the following tenets.
  4 
  5 ### Tenets
  6 
  7 1. Don’t punish the mutator: While mutator code must be added to 
  8    implement generational collection, degradations of
  9    mutator throughput and responsiveness 
 10    should be contained as much as possible.
 11    If in doubt between design choices, 
 12    pick the one that promises to minimize mutator overhead.
 13 2. Safeguard incremental progress with milestones that demonstrate key capabilities.
 14 3. Minimize the overall development effort.  This includes containing
 15    efforts to prepare and present each milestone deliverable 
 16    that do not directly contribute to the end product.
 17 
 18 ### Design Decisions
 19 
 20 While it is probable that we may revisit some of the following design
 21 decisions if we run into implementation difficulties or performance
 22 problems, the following is the current plan of record.
 23 
 24 1. This document is maintained alongside the source code. 
 25    1. It is organized hierarchically with sections
 26       dedicated to major design decisions.
 27       Further, more detailed decisions are organized in top-down fashion,
 28       with overarching design decisions preceding derivative design decisions.
 29       The topic numbers assigned to particular design decisions may evolve over time.
 30    2. There is a separate "rationale" document that explains more of the reasoning behind design decisions
 31       and links relevant portions between documents.
 32 2. Genshen changes to the Shenandoah implementation will be
 33    compartmented.  See discussion [here](rationale.summary.md#compartmentalization").
 34 3. We will support concurrent collection of old gen and young gen,
 35    with the expectation that a single old collection will typically
 36    overlap with the execution of many young collections.
 37 4. In order to minimize performance impact on the mutator, we will use
 38    a single Load-Reference-Barrier to implement both evacuation
 39    from young gen and from old gen.
 40    1. Tenuring will be implemented as part of young evacuation.
 41    2. All evacuation for both young and old gen regions
 42       happens under the control of the same GC threads and is
 43       supported by the same Load Reference Barrier (LRB) as in
 44       vanilla Shenandoah, with only small refinements to the
 45       implementation of slow-path code within the LRB.
 46    3. See discussion [here](rationale.summary.md#load-reference-barrier).
 47 5. An old collection begins at the same time as a young gen
 48    collection, with both collections leveraging the
 49    results of a single shared root scan.
 50 6. Old collection generally runs at lower priority (e.g., fewer
 51    numbers of parallel threads or greater “nice” values for
 52    concurrent threads) than young collection because
 53    replenishing the allocation pool to support ongoing allocation
 54    needs of mutator threads requires urgent completion of young GC.
 55    See discussion [here](rationale.summary.md#prioritization).
 56 7. Although the planned future production release of GenShen will
 57    run young collection concurrently with old collection, support
 58    will also be implemented and maintained for alternating executions
 59    of young collections with global collections.  See
 60    discussion [here](rationale.summary.md#young-global-discussion).
 61    1. Since this is not a production release, efficiency of
 62       implementation is not a primary concern.
 63    2. Regression tests will exercise the implementation of this mode
 64       of operation.
 65 8. A regression test suite will be developed and maintained to
 66    support all of the enhanced capabilities of GenShen, including
 67    capabilities that are enabled and disabled at build time for the
 68    JVM.
 69 9. Though old collections start at the same time as young
 70    collections collections, they do not necessarily end at the same
 71    time.  Typically, many young-gen collections will be completed
 72    concurrently during the time that a single old-gen collection is
 73    running.  See discussion [here](rationale.summary.md#concurrency-of-young-and-old).
 74 10. Root scanning will be performed during a JVM safe point.
 75 11. Scanning of the remembered set will be performed by parallel
 76     young-gen GC threads during a JVM safe point.  A side
 77     effect of scanning is to mark as CLEAN any cards of the remembered
 78     set that are found by remembered set scanning to no longer be
 79     DIRTY. 
 80 12. The remembered set maintains a start-of-object representation to
 81     facilitate quick identification of where the first object
 82     pertaining to each DIRTY card region begins.  See discussion
 83     [here](rationale.summary.md#remembered-set-starting-offsets).
 84     This requires that:
 85     1. Newly promoted objects be registered with the remembered set
 86        and start-of-object support by post-processing the associated GCLAB block, and
 87     2. When entire regions are promoted out of young collections into
 88        old gen, the objects contained within those regions must be
 89        registered with the remembered set and the start-of-object support.
 90 13. Young marking, including scanning of root pointers,
 91     will place discovered references to old gen into a thread-local SATB
 92     buffer so that they can be processed by an old-gen collection
 93     thread.  See discussion [here](rationale.summary.md#satb-keep-old-alive-entries). 
 94     1. Possible optimization: refrain from logging old-gen pointers
 95        that refer to already marked old objects into the SATB buffer.
 96 14. The default size of the SATB buffer will increase from 1024 to
 97     4096 words because we will be placing more information into the
 98     SATB buffer.
 99 15. Whenever the SATB buffer is full, the slow path for adding to
100     the SATB buffer will attempt to compress the buffer contents before
101     communicating the buffer to GC threads.  If compression leaves a
102     minimum of 256 slots available within the SATB buffer, the thread
103     continues to add values to the existing buffer.  Compression
104     consists of:
105     1. For pointers to young gen:
106        1. If concurrent marking of young gen is disabled, ignore the
107           pointer.
108        2. Otherwise, if the object referenced by this pointer has
109           already been marked, ignore the pointer.
110        3. If we choose to use SATB-based remembered set, ignore all
111           overwritten address values that reside within young gen.
112     2. For pointers to old gen:
113        1. If concurrent marking of old gen is disabled, ignore the
114           pointer.
115        2. Otherwise, if the object referenced by this pointer
116           has already been marked, ignore the pointer.
117        3. If we choose to use an SATB-based remembered set:
118           - If the card corresponding to an overwritten old gen
119             address is already DIRTY, ignore this overwritten
120             address.
121           - Otherwise, fetch the pointer value held at the overwritten
122             address.  If the fetched pointer value does not
123             refer to young gen, ignore this overwritten address.
124           - Otherwise, use a hash table (suggested size 127) to sift
125             redundant DIRTY card references for the current batch of
126             overwritten old-gen addresses.  Preserve only one address
127             for each card entry that needs to be marked as DIRTY.
128     3. SATB buffers will be processed by both a young GC
129        thread and an old GC thread.
130        1. The young GC thread will mark objects referenced
131           by young pointers.
132        2. The old GC thread will:
133           1. Mark objects referenced by old-gen pointers, and
134           2. If we choose to use SATB-based remembered set: mark as DIRTY
135              the card entries corresponding to overwritten addresses.
136 16. GC uses a G1-style heuristic to choose collection sets for both
137     young and old memory areas.  The collection set represents the
138     heap regions that offer the greatest opportunity to quickly reclaim
139     garbage, as with the existing Shenandoah implementation.  See
140     discussion [here](rationale.summary.md#g1-heuristic).
141 17. Following an evacuation phase that evacuates
142     both old and young heap regions, the update-references phase is
143     required to update references throughout all 
144     old regions that were not selected as part of the old
145     collection set in addition to updating references in young
146     heap regions.
147     1. If a particular young collection evacuation phase does not
148        evacuate any old regions, then its update references phase can
149        focus solely on young heap regions and the
150        remembered set.
151 18. Tenuring is based on object age with the enhancements described
152     below.  See discussion [here](rationale.summary.md#tenuring).
153     1. The variable TenureCycle counts how many GC cycles correspond to
154        each increment of an object’s age.  Object ages are not
155        necessarily incremented each time a GC cycle is completed.  They
156        are incremented each time TenureCycle GC cycles have been
157        completed.
158     2. The variable TenureAge represents the age at which an object
159        becomes eligible for tenuring.
160     3. During GC cycles that correspond to TenureCycle, the “age” of
161        individual objects is incremented by 1 plus the size of the
162        object’s original heap region age when the object is evacuated.
163     4. During GC cycles that correspond to TenureCycle, the “age” of
164        each heap region that has been fully allocated (i.e. no more
165        available memory for allocation) and that is not in the
166        collection set is incremented by 1 at the end of the evacuation
167        phase.  If the resulting “age” equals TenureAge, then the entire
168        region is reassigned to become part of old gen.
169        1. The update-refs phase will process this heap region even
170           though it is “now” considered to be part of old gen.
171        2. Each of the objects contained within the promoted region
172           shall be registered with the remembered set abstraction.
173        3. Each of the objects contained within the promoted region
174          be scanned to determine any references to young-gen
175          memory that are contained therein.  For any such pointers, set
176          the associated card table entry to DIRTY.
177 19. During evacuation, each running mutator thread has both a TLAB
178     and a GCLAB.
179     1. The TLAB is associated with a young heap region.
180     2. The GCLAB is associated with an old heap region.
181     3. When the mutator’s load reference barrier encounters a
182        reference for which the associated object needs to be tenured, it
183        allocates the copy memory from the GCLAB.
184     4. When the mutator’s load reference barrier encounters a
185        reference for which the associated object resides within the
186        collection set of old gen, it allocates the copy memory from the
187        GCLAB.
188     5. When the mutator’s load reference barrier encounters a
189        reference for which the associated object needs to be evacuated to
190        young gen, it allocates the copy memory from the TLAB.
191     6. We initially plan to use the same size for TLAB and GCLAB, but
192        this decision may be revisited.
193     7. If the size of the object to be evacuated is larger than half
194        the size of the respective local allocation buffer, allocation
195        of the replica memory is handled by alternative allocators, to be
196        designed.  Call these "odd" objects.
197 20. During evacuation, each evacuating GC thread will maintain two
198     GCLAB buffers:
199     1. GCLAB-Young is associated with young gen.
200     2. GCLAB-Old is associated with old gen.
201     3. If the object to be evacuated currently resides in old gen or
202       if it resides in young gen and it is to be tenured, allocate the
203       copy memory from GCLAB-Old.
204     4. Otherwise, allocate the copy memory from GCLAB-Young.
205     5. At the end of the evacuation phase, consider repurposing any
206        unspent GCLAB-Young as a TLAB if there is sufficient unallocated
207        memory remaining within it.
208     6. At the end of the evacuation phase, consider preserving the
209        GCLAB-Old for use as a GCLAB for a mutator thread during the next
210        young collections collection or as a GCLAB-Old during the next
211        old-gen evacuation pass.
212     7. See 19.7.
213 21. A certain budget of CPU time is provisioned to perform young-gen
214     GC in order to support a particular planned workload.
215     1. The resources dedicated to young-gen GC are limited in order
216        to assure a certain amount of CPU time is available to mutator
217        threads.
218     2. Command line options allow user control over provisioning.  A
219        TBD API may allow services to adjust provisioning dynamically.
220     3. In the ideal, provisioning is adjusted automatically based on
221        TBD heuristics.
222     4. The provisioned CPU resources can support a range of service
223        quality, reclaiming large numbers of heap regions with a low GC
224        frequency or reclaiming small numbers of heap regions with a
225        high GC frequency.  Given a particular frequency of GC cycles,
226        the same CPU resources can evacuate a large number of sparsely
227        populated heap regions or a small number of densely populated
228        heap regions.  Tradeoffs between these configuration extremes
229        may be adjusted under software control or by TBD
230        heuristics.
231     5. For each young-gen GC pass, a certain TBD percentage
232        of CPU resources are reserved for old-gen evacuation and
233        update-refs activities.
234        1. The old-gen CPU resource budget represents the total amount
235           of old-gen memory that can be relocated, and is quantified as
236           a multiple N of the heap region size.
237        2. The old-gen GC threads determine the composition of the
238           old-gen collection set, up to but never exceeding the upper
239           bound N on cumulative evacuation size.
240        3. The old-gen GC thread may select for the collection set
241           N heap regions which are known to have 100%
242           utilization, 2N heap regions known to have 50% utilization,
243           5N heap regions known to have 20% utilization, and so on.
244        4. If the old-gen GC refrains from delegating N heap regions
245           worth of evacuation work to the young-gen evacuation phase,
246           then the young GC is free to use the excess CPU resources to
247           more aggressively evacuate more of its own young heap regions,
248           using a larger than normal young-gen collection set.
249        5. The budget for updating old-gen references must be large
250           enough to handle the total old-gen memory size - N.  In the
251           case that old-gen GC configures a collection set that
252           represents its full evacuation capacity of N heap regions, the
253           number of old-gen heap regions that are not part of the
254           old-gen collection set is never larger than this quantity.
255           In the case that old-gen GC configures a smaller collection
256           set, then for each fraction of a heap region that is not
257           evacuated, this much more of a heap region might have to be
258           processed during the update-refs phase of GC.  We estimate
259           that the cost of evacuating M bytes of memory is similar to
260           the cost of updating the references within M bytes of
261           memory.
262 22. A smaller (generally) budget of CPU time is provisioned to
263     perform old-gen GC in order to support a particular planned
264     workload.
265     1. The resources dedicated to young-gen GC are limited in order
266        to assure a certain amount of CPU time is available to mutator
267        threads.
268     2. Command-line options allow user control over provisioning.  A
269        TBD API may allow services to adjust provisioning dynamically.
270     3. In the ideal, provisioning is adjusted automatically based on
271        TBD heuristics.
272     4. As with young-gen GC, the CPU resources provisioned for
273        old-gen GC can support a range of service quality.
274     5. The CPU resources dedicated to old-gen collection do not have
275        responsibility for evacuating old regions as all evacuation
276        is performed by young-gen GC threads.  Instead, the CPU
277        resources dedicated to old-gen GC activities are used for
278        activities such as the following:
279        1. Processing the content of SATB buffers:
280           - If old collection is in concurrent marking phase, mark
281             objects referenced by any keep-alive pointers.
282           - If we are using SATB-based remembered set, update the
283             remembered set based on overwritten addresses reported in the
284             SATB buffer.
285        2. During concurrent marking, scan the contents of previously
286           marked objects to complete the transitive closure of
287           reachability.
288        3. Perform heuristic computations:
289           - Determine when to start the next old-gen GC cycle.
290           - Determine which old regions to evacuate on this and future
291             passes of young-gen GC.
292           - Adjust young-gen efficiency parameters such as: How many
293             heap regions should be dedicated to young gen?  What is
294             optimal value of TenureCycle?  What is optimal value of
295             TenureAge?  How much CPU time should be dedicated to
296             young-gen GC?
297           - Adjust old-gen efficiency parameters such as: How much CPU
298             time should be dedicated to old-gen GC?  How many heap regions
299             should be dedicated to old gen?  Should any heap regions be
300             decommissioned and returned to the operating system in order
301             to shrink the memory footprint of this service?
302        4. Perform routine maintenance as time and schedule permits:
303           - Potentially sweep up dead memory, accumulating ages at
304             which dead objects were reclaimed within old regions.
305           - Potentially, sweep through evacuated memory to accumulate
306             ages at which dead objects were reclaimed.
307           - Organize free lists for fast and efficient allocation of
308             GCLAB and Odd object allocations.
309           - Return emptied old regions to the free set.
310           - Eventually, reference processing and string deduplication
311             should be performed by lower priority old-gen threads
312             rather than higher priority young-gen threads.
313        5. Following completion of each old-gen concurrent mark pass,
314           select regions to be included in the old-gen collection set:
315           - No more than a total of N bytes of old gen is evacuated by
316             each pass of the young-gen evacuator.
317           - If old-gen GC threads desire to evacuate M, which is more
318             than N bytes of old gen, it does so by piggy backing on
319             multiple subsequent young-gen evacuation passes, selecting
320             evacuating no more than the accumulation of N total heap
321             regions in each of the following young-gen evacuation passes.
322           - Since it is most efficient to evacuate all M > N regions
323             of old-gen memory with a single young-gen evacuation pass,
324             configure the old-gen collection set to include all M
325             regions if there are sufficient free regions available to
326             afford the young-gen allocator to continue allocating new
327             objects during the longer delay that would be required to
328             evacuate more than the traditionally budgeted N regions of
329             old-gen memory.