Collected metrics:
- ccr-stats (every 5s)
- node-stats (every 10s), including the refresh section
- metricbeat from each node (metricbeat enabled everywhere)
- median indexing throughput and the usual results Rally provides in its summary
Criteria to compare runs:
- time taken for the follower to complete replication; ideally near-instantaneous once indexing is over
- indexing throughput
- overall system usage (esp. CPU + IO)
- overhead compared to the baseline
- median delta "leader global checkpoint - follower global checkpoint" from CCR stats
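The last criterion can be computed from the ccr-stats samples Rally collects. A minimal sketch, assuming per-shard `leader_global_checkpoint` / `follower_global_checkpoint` fields as exposed by the 6.x CCR stats API (the sample values below are invented):

```python
import statistics

def checkpoint_deltas(ccr_stats):
    """Extract per-shard (leader GCP - follower GCP) deltas from one
    ccr-stats snapshot (field names assumed from the 6.x CCR stats API)."""
    deltas = []
    for index in ccr_stats.get("follow_stats", {}).get("indices", []):
        for shard in index["shards"]:
            deltas.append(shard["leader_global_checkpoint"]
                          - shard["follower_global_checkpoint"])
    return deltas

# Toy snapshot resembling one ccr-stats telemetry sample (values invented).
sample = {
    "follow_stats": {
        "indices": [
            {"index": "logs-0", "shards": [
                {"leader_global_checkpoint": 1200, "follower_global_checkpoint": 1100},
                {"leader_global_checkpoint": 900,  "follower_global_checkpoint": 880},
                {"leader_global_checkpoint": 500,  "follower_global_checkpoint": 500},
            ]}
        ]
    }
}

median_delta = statistics.median(checkpoint_deltas(sample))
print(median_delta)  # 20
```

In practice the median would be taken over all shards and all snapshots of a run, not a single sample as shown here.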
Telemetry device collection frequency:
- "ccr-stats-sample-interval": 5
- "node-stats-sample-interval": 10
- "node-stats-include-indices": true (to include the refresh section)
Abbreviations/Terminology:
- baseline: run benchmarks without following and with soft_deletes explicitly disabled on the leader.
Uses the currently running GCP env (1 node per cluster, n1-highcpu-16, 8r/8w concurrency, 8 Rally clients),
the same build commit (https://github.com/elastic/elasticsearch/commit/df6f9669dccd762416e644f5956f350e618f0d87),
and keeps the refresh section in the docs included from node-stats.
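The baseline disables soft deletes on the leader index at creation time; a minimal sketch (index name and host are hypothetical, 3P/1R as per the plan):

```shell
curl -XPUT 'http://leader:9200/logs-benchmark' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 1,
    "index.soft_deletes.enabled": false
  }
}'
```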
- Run the http_logs benchmark using x-pack-security.
- Run the http_logs benchmark using a non append-only challenge.
Compare with trial-id e2209a84-74a9-491d-99b1-5c04cdee9c4f / https://goo.gl/fVPh3G,
which is the benchmark with 8r/8w/8 clients, CCR enabled, x-pack disabled.
- 3 node clusters, security enabled.
- GCP: ES: custom-16-32768 (16 vcpu / 32GB RAM / min. Skylake processor); load driver: n1-standard-16 (16 vcpu / 60GB RAM)
- AWS: ES: c5d.4xlarge (16 vcpu / 32GB RAM); load driver: m5d.4xlarge (16 vcpu / 64GB RAM)
- Index settings: 3P/1R
- Rally tracks: geopoint / http_logs / pmc. All tracks configured for indexing + replication simultaneously. 8 indexing clients.
- Elasticsearch branch: 6.x
1. Run smoke-test CCR benchmarks to validate that the max_* defaults allow the follower to always catch up.
Run GCP-only benchmarks with the three tracks to validate that the follower always catches up. This will
allow us to tune defaults before spending more time collecting baseline numbers for checking CCR overhead.
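One way to override a max_* setting during these smoke tests (endpoint shape and parameter name assumed from the 6.x CCR follow API; cluster and index names hypothetical, the 32mb value taken from the Stage 1 runs below):

```shell
curl -XPUT 'http://follower:9200/logs-follower/_ccr/follow' \
  -H 'Content-Type: application/json' -d '
{
  "remote_cluster": "leader",
  "leader_index": "logs-benchmark",
  "max_read_request_size": "32mb"
}'
```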
2. Use the defaults from 1. and establish baseline numbers without replication; needed to calculate the overhead of CCR.
Run all tracks on both GCP + AWS to establish the baseline. Total 3x2=6 benchmarks.
Capture median indexing performance and resource metrics (CPU, IO, network).
3. Rerun 2. with CCR enabled.
Run the same benchmarks and capture the same metrics as in 2.; compare CCR overhead.
Same env + ES branch + settings (incl. any modifications from the agreed defaults) as in Stage 1.
Run the eventdata track on both GCP + AWS, both baseline and with CCR, to understand behavior over a longer duration.
Evaluate behavior based on the criteria mentioned earlier.
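The CCR overhead in step 3 is the relative drop in indexing throughput versus the step-2 baseline; a small sketch of the arithmetic (throughput numbers invented):

```python
def ccr_overhead_pct(baseline_docs_per_s, ccr_docs_per_s):
    """Relative indexing-throughput overhead of a CCR run vs. the
    no-replication baseline, as a percentage."""
    return 100.0 * (baseline_docs_per_s - ccr_docs_per_s) / baseline_docs_per_s

# e.g. baseline at 100k docs/s, CCR run at 92k docs/s -> 8% overhead
print(ccr_overhead_pct(100_000, 92_000))  # 8.0
```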
TBD
Stage 1
Step 2 vs Step 3 vs Step 4 (no replication + soft_deletes: false VS no replication + soft_deletes: true VS CCR).
All runs use commit 63bb8a0201591a61fc123cf181a27fd13c38c123
(https://github.com/elastic/elasticsearch/commits/63bb8a0201591a61fc123cf181a27fd13c38c123)
and max_read_request_size: 32mb.
Baseline: no CCR, soft_deletes: false.
Δ: always calculated against the baseline.
[Results tables, per track and provider: HTTP_LOGS (AWS, GCP), GEOPOINT (AWS, GCP), PMC (AWS, GCP)]
Time for the follower to catch up in CCR (Step 3): everything took < 1 sec.
[CCR results tables, per track and provider: HTTP_LOGS (AWS CCR, GCP CCR), GEOPOINT (AWS CCR, GCP CCR), PMC (AWS CCR, GCP CCR)]