Collected metrics:
- ccr-stats (every 5s)
- node-stats, including the refresh section (every 10s)
- metricbeat from each node (enabled on all nodes)
- median indexing throughput and the usual results Rally provides in its summary report
Criteria to compare runs:
- time taken for the follower to complete replication; ideally near-instantaneous once indexing is over
- indexing throughput
- overall system usage (esp. CPU + IO)
- overhead compared to baseline
- median delta of "leader global checkpoint - follower global checkpoint" from CCR stats (see the sketch below)
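For reference, the checkpoint delta can be sampled straight from the follow-stats API; a minimal sketch, assuming the 6.x follow-stats response shape (host and index name are placeholders):

```sh
# Sample the per-shard replication lag on the follower cluster.
# leader_global_checkpoint - follower_global_checkpoint should trend
# to 0 once the follower has caught up.
curl -s 'http://follower-node:9200/logs-benchmark/_ccr/stats' |
  jq '.indices[].shards[] | .leader_global_checkpoint - .follower_global_checkpoint'
```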
Telemetry device collection frequency:
"ccr-stats-sample-interval": 5, "node-stats-sample-interval": 10, "node-stats-include-indices": true # includes the refresh section
Abbreviations/Terminology:
- baseline: run benchmarks without a follower and with soft_deletes explicitly disabled on the leader (see the index-settings sketch after this list)
- Use the currently running GCP environment (1 node per cluster, n1-highcpu-16, 8r/8w concurrency, 8 Rally clients) and the same build commit (https://github.com/elastic/elasticsearch/commit/df6f9669dccd762416e644f5956f350e618f0d87); keep the refresh section in the docs included from node-stats.
- Run the http_logs benchmark with x-pack-security enabled.
- Run the http_logs benchmark with a non-append-only challenge.
- Compare with trial-id e2209a84-74a9-491d-99b1-5c04cdee9c4f / https://goo.gl/fVPh3G, which is the benchmark with 8r/8w/8 clients, CCR enabled, and x-pack disabled.
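A sketch of the leader index settings for a baseline run, assuming the 6.x index.soft_deletes.enabled setting (host and index name are placeholders):

```sh
# Baseline leader index: soft deletes explicitly disabled, plus
# 3 primaries / 1 replica to match the 3P/1R index settings used
# in the benchmarks. Soft deletes can only be set at creation time.
curl -s -X PUT 'http://leader-node:9200/logs-benchmark' \
  -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.soft_deletes.enabled": false,
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1
  }
}'
```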
- 3-node clusters, security enabled
- GCP: ES nodes: custom-16-32768 (16 vCPU / 32 GB RAM, min. Skylake processor); load driver: n1-standard-16 (16 vCPU / 60 GB RAM); see the provisioning sketch after this list
- AWS: ES nodes: c5d.4xlarge (16 vCPU / 32 GB RAM); load driver: m5d.4xlarge (16 vCPU / 64 GB RAM)
- Index settings: 3P/1R
- Rally tracks: geopoint / http_logs / pmc, all configured for indexing and replication running simultaneously; 8 indexing clients
- Elasticsearch branch: 6.x
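A sketch of provisioning the GCP side with gcloud (instance names and zone are placeholders; the custom machine type corresponds to 16 vCPUs / 32768 MB):

```sh
# Elasticsearch node: custom machine type with a minimum Skylake CPU platform.
gcloud compute instances create es-node-1 \
  --custom-cpu=16 --custom-memory=32GB \
  --min-cpu-platform="Intel Skylake" \
  --zone=europe-west1-b

# Load driver running Rally.
gcloud compute instances create loaddriver-1 \
  --machine-type=n1-standard-16 \
  --zone=europe-west1-b
```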
1. Run smoke-test CCR benchmarks to validate that the max_* defaults allow the follower to always catch up: run GCP-only benchmarks with the three tracks and check that the follower keeps up throughout (see the follow-request sketch below). This lets us tune the defaults before spending more time collecting baseline numbers for measuring CCR overhead.
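A sketch of starting the follower with every max_* setting left at its default, assuming the 6.x follow API shape (cluster alias and index names are placeholders):

```sh
# Create the follower index and start replication; omitting all max_*
# parameters means the smoke test exercises the defaults under scrutiny.
curl -s -X PUT 'http://follower-node:9200/logs-benchmark/_ccr/follow' \
  -H 'Content-Type: application/json' -d '
{
  "remote_cluster": "leader",
  "leader_index": "logs-benchmark"
}'
```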
2. Using the defaults from 1., establish baseline numbers without replication; these are needed to calculate the overhead of CCR. Run all tracks on both GCP + AWS, 3x2=6 benchmarks in total, and capture median indexing performance and resource metrics (CPU, IO, network), tagging each race as shown below.
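To keep the baseline races easy to pair with the later CCR races, each run could be tagged; a sketch using Rally's --user-tag flag (the tag key/value is just a convention, not a Rally requirement):

```sh
# Tag the race metadata so baseline and CCR runs can be told apart later.
esrally --track=pmc \
  --target-hosts=es-node-1:9200 \
  --user-tag="replication:off"
```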
3. Rerun 2. with CCR enabled: the same benchmarks, capturing the same metrics as in 2., to measure the CCR overhead (see the comparison sketch below). Use the same environment, ES branch, and settings (incl. any agreed modifications of the defaults) from Stage 1. Additionally, run the eventdata track on both GCP + AWS, both baseline and with CCR, to understand behavior over a longer duration. Evaluate behavior against the criteria listed earlier.
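CCR overhead can then be quantified with Rally's built-in race comparison; a sketch (the race timestamps are placeholders, taken from `esrally list races`):

```sh
# Compare a baseline race against the matching CCR-enabled race.
# Rally prints side-by-side deltas for throughput, latency and more.
esrally compare --baseline=20180903T080000Z --contender=20180903T120000Z
```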
TBD
Stage 2: Ensure throughput hasn't dropped with recent commit id
Benchmark using the same track (eventdata) and environment as in the previous stage 2 results, using elastic/elasticsearch@f908949. No tweaking of max_* parameters; the defaults of the corresponding commit were used. For all benchmarks the time to catch up was <500ms.
Results: AWS, GCP.
[Figure: AWS, comparison of system metrics (IO, CPU) for Baseline vs First Run vs Second Run]
[Figure: GCP, comparison of system metrics (IO, CPU) for Baseline vs First Run vs Second Run]