Benchmarking plan for CCR

Common for all benchmarks

  1. Collected metrics:

    • ccr-stats (every 5s)
    • node-stats, including the refresh section (every 10s)
    • metricbeat system metrics from each node (metricbeat enabled everywhere)
    • median indexing throughput and the usual metrics Rally reports in its summary
  2. Criteria to compare runs:

    • time taken for the follower to complete replication; ideally near-instantaneous once indexing is over
    • indexing throughput
    • overall system usage (esp. CPU + IO)
    • overhead compared to baseline
    • median delta of "leader global checkpoint - follower global checkpoint" from the CCR stats
  3. Telemetry device collection frequency (see the Rally invocation sketch after this list):

    "ccr-stats-sample-interval": 5,
    "node-stats-sample-interval": 10,
    "node-stats-include-indices": true # including refresh section
  4. Abbreviations/Terminology:

    • baseline: run benchmarks without a follower and with soft_deletes explicitly disabled on the leader
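
A sketch of how the collection intervals above would be passed to Rally via its telemetry flags; the track, pipeline and target host below are placeholders, not the actual benchmark invocation:

    # Sketch only: track and hosts are placeholders; the telemetry devices and
    # parameters mirror the intervals listed above.
    esrally --track=http_logs --pipeline=benchmark-only --target-hosts=leader-node:9200 \
        --telemetry="ccr-stats,node-stats" \
        --telemetry-params='{"ccr-stats-sample-interval": 5, "node-stats-sample-interval": 10, "node-stats-include-indices": true}'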

Stage 0 / a few adhoc benchmarks

Using the currently running GCP environment (1 node per cluster, n1-highcpu-16, 8r/8w concurrency, 8 Rally clients) and the same build commit (https://github.com/elastic/elasticsearch/commit/df6f9669dccd762416e644f5956f350e618f0d87), keeping the refresh section in the indices stats collected by node-stats.

  1. Run the http_logs benchmark with X-Pack security enabled
  2. Run the http_logs benchmark using a non-append-only challenge

Compare with trial-id e2209a84-74a9-491d-99b1-5c04cdee9c4f (https://goo.gl/fVPh3G), which is the benchmark with 8r/8w concurrency, 8 clients, CCR enabled, and X-Pack disabled.
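
For illustration only, the two runs above would differ from the earlier CCR benchmark roughly as follows; the challenge placeholder, hosts and credentials are assumptions, not values from the actual runs:

    # 1. http_logs with X-Pack security: add TLS / basic-auth client options (placeholder credentials).
    esrally --track=http_logs --pipeline=benchmark-only --target-hosts=leader-node:9200 \
        --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'rally',basic_auth_password:'changeme'"

    # 2. http_logs with a non append-only challenge: select one of the track's other challenges.
    esrally --track=http_logs --pipeline=benchmark-only --target-hosts=leader-node:9200 \
        --challenge=<non-append-only-challenge>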

Stage 1

env+settings

  • 3-node clusters, security enabled.
  • GCP: ES nodes: custom-16-32768 (16 vCPU / 32 GB RAM / min. Skylake processor), load driver: n1-standard-16 (16 vCPU / 60 GB RAM)
  • AWS: ES nodes: c5d.4xlarge (16 vCPU / 32 GB RAM), load driver: m5d.4xlarge (16 vCPU / 64 GB RAM)
  • Index settings: 3P/1R (illustrated below)
  • Rally tracks: geopoint / http_logs / pmc. All tracks configured for indexing and replication running simultaneously. 8 indexing clients.

Elasticsearch branch: 6.x
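
A minimal sketch of the 3P/1R index settings on the leader; the index name and host are placeholders (in practice the Rally tracks create the indices), and index.soft_deletes.enabled is shown because CCR on 6.x can only follow indices with soft deletes enabled (it is disabled for the baseline runs):

    # Placeholder index name and host; illustrates the settings, not the actual setup commands.
    curl -XPUT "http://leader-node:9200/logs-leader" -H 'Content-Type: application/json' -d'
    {
      "settings": {
        "index.number_of_shards": 3,
        "index.number_of_replicas": 1,
        "index.soft_deletes.enabled": true
      }
    }'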

Benchmarks

  1. Run smoke-test CCR benchmarks to validate that the max_* defaults allow the follower to always catch up.

    Run GCP-only benchmarks with the three tracks to validate that the follower always catches up. This will allow us to tune defaults before spending more time collecting baseline numbers for measuring CCR overhead. (See the follow-request sketch after this list.)

  2. Use the defaults from 1. and establish baseline numbers without replication; these are needed to calculate the overhead of CCR.

    Run all tracks on both GCP + AWS to establish baseline. Total 3x2=6 benchmarks.

    Capture median indexing performance and resource metrics (CPU, IO, Network).

  3. Rerun 2. with CCR enabled.

    Run the same benchmarks and capture the same metrics as in 2., then compare to determine the CCR overhead.
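
For reference, a hedged sketch of setting up the follower referenced in step 1; the cluster alias, hosts and index names are placeholders, and the max_* throttling parameters are deliberately omitted so the defaults under test remain in effect (their exact names vary across 6.x releases):

    # On the follower cluster: register the leader as a remote cluster (placeholder seed host).
    curl -XPUT "http://follower-node:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "cluster.remote.leader.seeds": ["leader-node:9300"]
      }
    }'

    # Start following the leader index with default max_* settings.
    curl -XPUT "http://follower-node:9200/logs-follower/_ccr/follow" -H 'Content-Type: application/json' -d'
    {
      "remote_cluster": "leader",
      "leader_index": "logs-leader"
    }'

    # Catch-up check: compare the leader and follower global checkpoints reported by the CCR stats.
    curl -s "http://follower-node:9200/_ccr/stats"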

Stage 2

Same environment, ES branch, and settings as Stage 1 (including any agreed modifications to the defaults).

Run the eventdata track on both GCP and AWS, both as a baseline and with CCR, to understand behavior over a longer duration.

Evaluate behavior based on the criteria mentioned earlier.

Stage 3

TBD

dliappis commented Nov 10, 2018

Stage 2: Ensure throughput hasn't dropped with a recent commit ID

Benchmark using the same track (eventdata) and environment as in the previous Stage 2 results, using elastic/elasticsearch@f908949.

No tweaking of max_* parameters; the defaults of the corresponding commit were used.

  1. Baseline:
     • no CCR, no soft_deletes
     • commit: 63bb8a0
     • AWS trial-id: 6afeb696-64d7-4ec7-87a1-940fd30c57cd
     • GCP trial-id: ea32372b-9274-4093-a05a-f1c4100ccd9f
  2. CCR first run:
     • commit: 63bb8a0
     • AWS trial-id: 8e83b9b5-dd36-4d37-ab5b-dff193f69f6b
     • GCP trial-id: 47dc469c-dda3-4d3c-b177-c153df0afdf3
  3. CCR second run:
     • commit: f908949
     • AWS trial-id: e44aff48-df6c-49e4-b681-ba05514ea0ae
     • GCP trial-id: 1c93ca79-d1b9-467f-8555-48919439c5b8

For all benchmarks, the time to catch up was <500 ms.

AWS

| Metric | Operation | Baseline | #63bb8a0 | Δ (Drop) % | #f908949 | Δ (Drop) % | Unit |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- |
| Min Throughput | bulk-append-1000 | 32536.3 | 30895.8 | 5.0420607 | 29792.3 | 8.4336572 | docs/s |
| Median Throughput | bulk-append-1000 | 56683.8 | 53170.8 | 6.1975379 | 53760.4 | 5.1573818 | docs/s |
| Max Throughput | bulk-append-1000 | 72361.5 | 63889.4 | 11.708022 | 66441.4 | 8.1812842 | docs/s |

GCP

| Metric | Operation | Baseline | #63bb8a0 | Δ (Drop) % | #f908949 | Δ (Drop) % | Unit |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- |
| Min Throughput | bulk-append-1000 | 21305 | 23471.6 | -10.169444 | 16177.8 | 24.065712 | docs/s |
| Median Throughput | bulk-append-1000 | 30890.1 | 29608.9 | 4.1476072 | 32383.2 | -4.8335875 | docs/s |
| Max Throughput | bulk-append-1000 | 41925.7 | 41047.7 | 2.0941809 | 40603.3 | 3.1541513 | docs/s |
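
For clarity, the Δ (Drop) % columns appear to be computed as (Baseline − run) / Baseline × 100, so negative values mean the run was faster than the baseline; e.g. for the AWS median throughput at #63bb8a0: (56683.8 − 53170.8) / 56683.8 × 100 ≈ 6.20 %.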

AWS: Comparison of system metrics for Baseline vs First Run vs Second Run

IO

(screenshot)

CPU

(screenshot)

GCP: Comparison of system metrics for Baseline vs First Run vs Second Run

IO

(screenshot)

CPU

(screenshot)

dliappis commented Dec 3, 2018

AWS Stage 2 benchmarks, node-stats turned off, including a comparison with tcp.compress enabled

Re-run the Stage 2 benchmarks (eventdata track, AWS) without collecting node-stats, since collecting the docs/completion stats increases refreshes and skews results. Also benchmark with tcp.compress enabled.

Track: example doc.
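
For context, a minimal sketch of enabling transport-level compression on 6.x; the config path is illustrative, and the setting is applied per node in elasticsearch.yml on both clusters (a node restart is assumed here):

    # Sketch only: enable transport compression on each node of both clusters.
    echo "transport.tcp.compress: true" >> /etc/elasticsearch/elasticsearch.yml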

Conclusion on the effect of tcp.compress

Network Usage

Significant drop in network usage with tcp.compress enabled:

  • Leader cluster: network-in usage down by approx. 63% and network-out usage down by approx. 80%. Link: https://goo.gl/RgX8cw
  • Follower cluster: network-in usage down by approx. 70-80% and network-out usage down by approx. 75-80%. Link: https://goo.gl/yzTvoh

CPU Usage

No noteworthy difference in CPU usage (see the graphs below).

Results

SOFT_DELETES_OFF

| Lap | Metric | Operation | Value | Unit |
| --- | --- | --- | ---: | --- |
| All | Min Throughput | bulk-append | 54158 | docs/s |
| All | Median Throughput | bulk-append | 56445.7 | docs/s |
| All | Max Throughput | bulk-append | 69991.2 | docs/s |

SOFT_DELETES_ON

| Lap | Metric | Operation | Value | Unit |
| --- | --- | --- | ---: | --- |
| All | Min Throughput | bulk-append | 54732.1 | docs/s |
| All | Median Throughput | bulk-append | 57239.1 | docs/s |
| All | Max Throughput | bulk-append | 71856.2 | docs/s |

CCR on, no tcp.compression

| Lap | Metric | Operation | Value | Unit |
| --- | --- | --- | ---: | --- |
| All | Min Throughput | bulk-append | 51936.5 | docs/s |
| All | Median Throughput | bulk-append | 54039.1 | docs/s |
| All | Max Throughput | bulk-append | 65640.8 | docs/s |

CCR on, tcp.compression: enabled

| Lap | Metric | Operation | Value | Unit |
| --- | --- | --- | ---: | --- |
| All | Min Throughput | bulk-append | 52300 | docs/s |
| All | Median Throughput | bulk-append | 54334.6 | docs/s |
| All | Max Throughput | bulk-append | 65911.8 | docs/s |

Graphs for CCR tcp.compress disabled vs enabled

LEFT is tcp.compress disabled, RIGHT is tcp.compress enabled. cluster_b is the leader, cluster_a is the follower.

Network-in Leader cluster

(screenshot)

Network-out Follower cluster

(screenshot)

CPU Stats Leader cluster

(screenshot)

CPU Stats Follower cluster

(screenshot)
