- Collected metrics:
- ccr-stats (every 5s)
- node-stats (every 10s), including the refresh section
- Metricbeat from each node (Metricbeat enabled everywhere)
- median indexing throughput and the usual results Rally provides in its summary
- Criteria to compare runs:
- time taken for the follower to complete replication; ideally almost instantaneous once indexing is over.
- indexing throughput
- overall system usage (esp. CPU + IO)
- overhead compared to baseline
- Median delta of "leader global checkpoint - follower global checkpoint" from the CCR stats (see the sketch below)
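For the checkpoint-delta criterion, a minimal sketch of how the lag could be read out of the CCR stats, assuming the 6.x per-index CCR stats endpoint and the leader_global_checkpoint / follower_global_checkpoint fields reported per shard (endpoint path, field names and the follower host below are assumptions):

    # Hypothetical check: print the per-shard lag (leader_global_checkpoint - follower_global_checkpoint)
    curl -s 'http://follower-host:9200/follower-index/_ccr/stats' |
      jq '.indices[].shards[] | .leader_global_checkpoint - .follower_global_checkpoint'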
- Telemetry device collection frequency:
"ccr-stats-sample-interval": 5, "node-stats-sample-interval": 10, "node-stats-include-indices": true  # includes the refresh section
- Abbreviations/Terminology:
- baseline: run benchmarks without following and with soft_deletes explicitly disabled on the leader (see the sketch after this section)
- Using the currently running GCP env we have now (1 node per cluster, n1-highcpu-16, 8r/8w concurrency, 8 Rally clients), the same build commit (https://github.com/elastic/elasticsearch/commit/df6f9669dccd762416e644f5956f350e618f0d87), and keeping the refresh section in the docs included from node-stats.
- Run the http_logs benchmark with x-pack security enabled (see the client-options sketch after this section).
- Run the http_logs benchmark using a non append-only challenge.
- Compare with trial-id e2209a84-74a9-491d-99b1-5c04cdee9c4f / https://goo.gl/fVPh3G, which is the benchmark with 8r/8w/8 clients, CCR enabled and x-pack disabled.
- 3-node clusters, security enabled.
- GCP: ES: custom-16-32768 (16 vCPU / 32 GB RAM, min. Skylake processor), load driver: n1-standard-16 (16 vCPU / 60 GB RAM).
- AWS: ES: c5d.4xlarge (16 vCPU / 32 GB RAM), load driver: m5d.4xlarge (16 vCPU / 64 GB RAM).
- Index settings: 3P/1R.
- Rally tracks: geopoint / http_logs / pmc. All tracks configured for indexing + replication simultaneously, 8 indexing clients.
- Elasticsearch branch: 6.x
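As referenced in the baseline definition above, a minimal sketch of creating a leader index with soft_deletes explicitly disabled; the index name and host are placeholders, and the setting must be supplied at index creation time:

    curl -s -X PUT 'http://leader-host:9200/benchmark-index' -H 'Content-Type: application/json' -d '
    {
      "settings": {
        "index.soft_deletes.enabled": false
      }
    }'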
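For the x-pack-security runs mentioned above, a rough sketch of passing credentials through Rally client options; the user, password and TLS flags are placeholders and need to match the actual security configuration:

    esrally --track=http_logs \
      --target-hosts=leader-host:9200 \
      --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'rally',basic_auth_password:'changeme'"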
1. Run smoke-test CCR benchmarks to validate that the max_* defaults allow the follower to always catch up (see the _ccr/follow sketch after this list). Run GCP-only benchmarks with the three tracks to validate that the follower always catches up. This will allow us to tune the defaults before spending more time collecting baseline numbers for checking CCR overhead.
2. Use the defaults from 1. and establish baseline numbers without replication; these are needed to calculate the overhead of CCR. Run all tracks on both GCP + AWS to establish the baseline (3x2 = 6 benchmarks in total). Capture median indexing performance and resource metrics (CPU, IO, network).
3. Rerun 2. with CCR enabled: run the same benchmarks and capture the same metrics as in 2., and compare the CCR overhead. Use the same env + ES branch + settings (incl. any modifications from the agreed defaults) as in Stage 1. Run the eventdata track on both GCP + AWS, both baseline and with CCR, to understand behavior over a longer duration. Evaluate behavior based on the criteria mentioned earlier.
TBD
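For the max_* defaults referenced in step 1, a sketch of overriding them on the follow request; the parameter names and values below follow the current _ccr/follow documentation and have changed between releases, so treat them purely as illustrative:

    curl -s -X PUT 'http://follower-host:9200/follower-index/_ccr/follow' -H 'Content-Type: application/json' -d '
    {
      "remote_cluster": "leader",
      "leader_index": "leader-index",
      "max_read_request_operation_count": 5120,
      "max_outstanding_read_requests": 12,
      "max_write_buffer_count": 2147483647
    }'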
Stage 0 results
Compare CCR on single-node clusters using 8r/8w/8 clients, with and without x-pack security.
no x-pack security:
Additional time for the follower to catch up after indexing was over: 486.981 ms
x-pack security enabled:
Additional time for the follower to catch up after indexing was over: 457.238 ms
Delta of leader-follower global checkpoint (left is no security, right is with security):
Kibana Links for delta:
security: https://goo.gl/dTRGp3
no security: https://goo.gl/G4KEtV
Benchmark with 25% conflicts (index):
Hit "Not all operations between from_seqno ... and to_seqno ... found" errors in the replication, due to the default of 0 for index.soft_deletes.retention.operations. Initially used index.soft_deletes.retention.operations=100000 (see the sketch below).
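A sketch of raising the retention setting on the leader index, assuming it can be updated dynamically (index name and host are placeholders):

    curl -s -X PUT 'http://leader-host:9200/benchmark-index/_settings' -H 'Content-Type: application/json' -d '
    {
      "index.soft_deletes.retention.operations": 100000
    }'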
In addition, using the same Rally indexing clients as in the Stage 0 security:enabled and security:disabled runs above, we observed a number of throttling log entries on the ES leader.
Indexing throughput is affected by conflicts. The charts below show median throughput (50th percentile) for the various experiments:
- Append only
- Conflicts, replication enabled
- Conflicts, no replication, no soft_deletes