# Proposed Leyden Terminal Stage Workflow

This is a newly proposed workflow for the "terminal stage" of the [Leyden
condenser pipeline](https://openjdk.org/projects/leyden/notes/03-toward-condensers).

- The CDS and AOT caches are automatically generated with a single `java` command.

- The caches are stored in the file specified by the `-XX:CacheDataStore=<app>.cds` option
    - The implementation is still a work in progress. AOT integration is not done yet.
    - As an intermediate step, the AOT cache may be stored in a separate file.

- The `-XX:CacheDataStore` option is intended to be a replacement for the existing
  `-XX:SharedArchiveFile` option.

- We no longer need a separate "training run". Instead, the `-XX:CacheDataStore=<app>.cds`
  option should be added to the command line of the production run of your application. For example:

    ```
    java -Xlog:cds -XX:CacheDataStore=javac.cds com.sun.tools.javac.Main ~/tmp/HelloWorld.java
    ```

- If the specified file doesn't exist, it will be created automatically when the JVM process exits:

    - The loaded classes and their compiler profile are dumped into a temporary file with a `.preimage`
      suffix, e.g., `javac.cds.preimage`.
    - A JVM subprocess is launched to convert `javac.cds.preimage` to the final CDS image, `javac.cds`.
        - See the end of `MetaspaceShared::preload_and_dump_impl()` in
          [metaspaceShared.cpp](../../../../../src/hotspot/share/cds/metaspaceShared.cpp)

- In the next run of your application, the `javac.cds` file will be automatically loaded at start-up. Your
  application will see the benefit of CDS (and soon, AOT).

- By default, the following VM options are used when `-XX:CacheDataStore=<app>.cds` is specified. This way, you
  can automatically use all the Leyden-premain optimizations without specifying any extra flags.

    - `RecordTraining` is set to `true` when the VM is *writing* the `<app>.cds.preimage` file.
    - `RecordTraining`, `ReplayTraining` and `StoreCachedCode` are set to `true` when the VM is *writing* the final CDS image file.
    - `ReplayTraining` and `LoadCachedCode` are set to `true` when the VM is *loading* the final CDS image file.
    - `CachedCodeFile` is set to `<app>.cds.code`.

  You can still explicitly disable some of these flags for diagnostic purposes. For example, the
  following command line automatically generates `app.cds` and `app.cds.code` on its first run, but
  loads only `app.cds`, not `app.cds.code`, on subsequent runs.

    ```
    java -XX:CacheDataStore=app.cds -XX:-LoadCachedCode -cp app.jar MyApp
    ```

- See [run.sh](run.sh) in this directory for an example of using `-XX:CacheDataStore=<app>.cds`

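The default flag settings listed above can be summarized in a small shell sketch. The function name `leyden_default_flags` and the phase labels `preimage`, `final`, and `load` are hypothetical names used only for illustration; the flag names and the `<app>.cds.code` file name come from the list above, and the `-XX:CachedCodeFile=<file>` spelling is an assumption about how that flag is written on the command line. The sketch only prints the flags the VM is described as enabling implicitly; it does not invoke `java`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the JDK): print the flags that, per the
# list above, are enabled by default when -XX:CacheDataStore=<cache> is given.
leyden_default_flags() {
    phase=$1   # preimage | final | load (illustrative labels, an assumption)
    cache=$2   # e.g. app.cds
    case "$phase" in
        preimage) flags="-XX:+RecordTraining" ;;
        final)    flags="-XX:+RecordTraining -XX:+ReplayTraining -XX:+StoreCachedCode" ;;
        load)     flags="-XX:+ReplayTraining -XX:+LoadCachedCode" ;;
        *)        echo "unknown phase: $phase" >&2; return 1 ;;
    esac
    # The code cache file name is derived from the cache name in every phase.
    echo "$flags -XX:CachedCodeFile=${cache}.code"
}

leyden_default_flags load app.cds
```

Passing an explicit `-XX:-LoadCachedCode`, as in the diagnostic example above, would override the corresponding default for the loading phase.
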
## Notes

- For applications that do not exit automatically, you may need to hand-craft a training run like this, so that
  your app exits voluntarily and the subprocess can be launched to complete the generation of `app.cds`.

    ```
    rm -f app.cds
    java -XX:CacheDataStore=app.cds -cp app.jar MyApp -exit-after-start
    ```

- In the future, we may add a `jcmd` option to connect to a long running JVM and trigger the creation of
  the CacheDataStore.

- By default, the subprocess is automatically forked at JVM exit. For debugging purposes, you can use the
  `-XX:+CDSManualFinalImage` option to disable the automatic forking. This allows you to debug the
  subprocess more easily.
    - When `-XX:+CDSManualFinalImage` is specified, the JVM will create only the `<app>.cds.preimage`
      file at exit. It will then print out a command-line that you can execute manually to create the
      final `<app>.cds` file.

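When scripting around these notes, it can help to check which files exist before a run. The following is a minimal sketch; the function name `cache_state` and its three labels are hypothetical, and it only inspects the file names described above (`<app>.cds` and `<app>.cds.preimage`). It does not query the JVM, and what the JVM does when only a preimage is present is not specified here.

```shell
#!/bin/sh
# Hypothetical helper: classify the state of a CacheDataStore by file presence.
cache_state() {
    cache=$1
    if [ -f "$cache" ]; then
        echo "load"            # final image exists; the next run is expected to load it
    elif [ -f "${cache}.preimage" ]; then
        echo "preimage-only"   # e.g. after -XX:+CDSManualFinalImage; final image not yet created
    else
        echo "create"          # nothing exists; the next run is expected to generate the cache at exit
    fi
}

cache_state app.cds
```

A wrapper script could use the `create` result to decide whether to pass an argument such as `-exit-after-start` so that the application exits and the cache gets written.
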
## AOT Code Generation

AOT support is not fully implemented yet. As of Sep 18, 2023, at the end of `MetaspaceShared::preload_and_dump()`,
the compiler will be executed to compile a single method, `String::charAt`. The nmethod will be stored inside the
`CachedCodeFile`.

The intended design is to compile, at this point, all methods that were recorded in the training data during the
training run. This is TBD.

## Benchmark

(Sep 11, 2023)

- Without `-XX:CacheDataStore`

```
$ perf stat -r 20 java com.sun.tools.javac.Main HelloWorld.java

 Performance counter stats for 'java com.sun.tools.javac.Main HelloWorld.java' (20 runs):

       643.10 msec task-clock        #   2.374 CPUs utilized    ( +-  0.24% )
        4,318      context-switches  #   6.800 K/sec            ( +-  1.84% )
           29      cpu-migrations    #  45.666 /sec             ( +-  5.89% )
       15,003      page-faults       #  23.625 K/sec            ( +-  0.20% )
2,936,972,438      cycles            #   4.625 GHz              ( +-  0.24% )
3,262,915,553      instructions      #   1.12  insn per cycle   ( +-  0.10% )
  644,286,520      branches          #   1.015 G/sec            ( +-  0.11% )
   29,099,407      branch-misses     #   4.57% of all branches  ( +-  0.15% )

      0.27091 +- 0.00107 seconds time elapsed  ( +-  0.40% )
```

- With `-XX:CacheDataStore` (note: AOT is not yet supported)

```
$ perf stat -r 20 java -XX:+ReplayTraining -XX:CacheDataStore=javac.cds com.sun.tools.javac.Main HelloWorld.java

 Performance counter stats for 'java -XX:+ReplayTraining -XX:CacheDataStore=javac.cds com.sun.tools.javac.Main HelloWorld.java' (20 runs):

       234.72 msec task-clock        #   2.165 CPUs utilized    ( +-  0.29% )
        1,839      context-switches  #   7.735 K/sec            ( +-  1.22% )
           14      cpu-migrations    #  58.883 /sec             ( +-  4.13% )
        9,003      page-faults       #  37.866 K/sec            ( +-  0.22% )
1,070,819,957      cycles            #   4.504 GHz              ( +-  0.30% )
1,170,776,369      instructions      #   1.08  insn per cycle   ( +-  0.35% )
  229,314,097      branches          # 964.471 M/sec            ( +-  0.36% )
    9,544,981      branch-misses     #   4.09% of all branches  ( +-  0.38% )

     0.108406 +- 0.000844 seconds time elapsed  ( +-  0.78% )
```
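
To put the two runs side by side, the ratios can be computed directly from the figures in the `perf stat` output above. This is a small sketch using `awk`; the helper name `ratio` is hypothetical, and the inputs are copied from the two listings (without `-XX:CacheDataStore` divided by with).

```shell
#!/bin/sh
# Hypothetical helper: print a/b rounded to two decimal places.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "elapsed time: $(ratio 0.27091 0.108406)x"   # elapsed time: 2.50x
echo "task-clock:   $(ratio 643.10 234.72)x"      # task-clock:   2.74x
echo "instructions: $(ratio 3262915553 1170776369)x"
```

On these measurements, the run with `-XX:CacheDataStore` finishes in roughly 40% of the elapsed time of the run without it, with AOT not yet contributing.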