Benchmarking plan for CCR

Common for all benchmarks

  1. Collected metrics:

    • ccr-stats (every 5 s)
    • node-stats, including the refresh section (every 10 s)
    • Metricbeat from every node
    • median indexing throughput and the usual results Rally provides in its summary
  2. Criteria to compare runs:

    • time taken by the follower to complete replication; ideally near-instantaneous once indexing is over.
    • indexing throughput
    • overall system usage (esp. CPU + IO)
    • overhead compared to baseline
    • median delta of "leader global checkpoint - follower global checkpoint", from ccr-stats
  3. Telemetry device collection frequency (node-stats-include-indices pulls in the refresh section):

    "ccr-stats-sample-interval": 5,
    "node-stats-sample-interval": 10,
    "node-stats-include-indices": true
  4. Abbreviations/Terminology:

    • baseline: benchmarks run without a follower and with soft_deletes explicitly disabled on the leader (see the index-creation sketch below)
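
For the baseline, a minimal sketch of creating a leader index with soft deletes explicitly disabled (the index name and shard counts are illustrative, borrowed from the Stage 1/2 settings; index.soft_deletes.enabled must be set at index creation time):

    // illustrative index name; 3P/1R as in Stage 1
    PUT /elasticlogs-leader1
    {
      "settings": {
        "index.number_of_shards": 3,
        "index.number_of_replicas": 1,
        "index.soft_deletes.enabled": false
      }
    }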

Stage 0 / a few adhoc benchmarks

Using the currently running GCP environment (1 node per cluster, n1-highcpu-16, 8r/8w concurrency, 8 Rally clients) and the same build commit (https://github.com/elastic/elasticsearch/commit/df6f9669dccd762416e644f5956f350e618f0d87), keeping the refresh section in the indices stats included with node-stats.

  1. Run the http_logs benchmark with X-Pack security enabled
  2. Run the http_logs benchmark using a non-append-only challenge

Compare with trial-id e2209a84-74a9-491d-99b1-5c04cdee9c4f (https://goo.gl/fVPh3G), the benchmark with 8r/8w, 8 clients, CCR enabled, X-Pack disabled.

Stage 1

env+settings

  • 3 node clusters, security enabled.
  • GCP: ES: custom-16-32768 16cpu / 32GB ram / min Skylake processor, Loaddriver: n1-standard-16 (16vcpu 60GB ram)
  • AWS: ES: c5d.4xlarge 16vcpu / 32GB ram, Loaddriver: m5d.4xlarge (16vcpu 64GB ram)
  • Index settings: 3P/1R
  • Rally tracks: geopoint / http_logs / pmc, all configured for simultaneous indexing + replication with 8 indexing clients.

Elasticsearch branch: 6.x
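
A track-params sketch matching these index settings and client count (keys taken from the Stage 2 parameters further down; exact parameter names vary per track):

    {
        "bulk_indexing_clients": 8,
        "number_of_shards": 3,
        "number_of_replicas": 1
    }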

Benchmarks

  1. Run smoke-test CCR benchmarks to validate that the max_* defaults allow the follower to always catch up (see the remote-cluster and follow-request sketch after this list).

    Run GCP-only benchmarks with the three tracks to validate that the follower always catches up. This lets us tune the defaults before spending more time collecting baseline numbers for measuring CCR overhead.

  2. Using the defaults from 1., establish baseline numbers without replication; these are needed to calculate the overhead of CCR.

    Run all tracks on both GCP + AWS to establish baseline. Total 3x2=6 benchmarks.

    Capture median indexing performance and resource metrics (CPU, IO, Network).

  3. Rerun 2. with CCR enabled.

    Run the same benchmarks and capture the same metrics as in 2., then compare to determine the CCR overhead.
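
For step 1, a sketch of wiring up the follower with an explicit max_* override (6.x CCR API; the remote-cluster alias "leader", the seed host, and the index names are illustrative, and max_read_request_size mirrors the value used in the Stage 1 results below):

    // on the follower cluster: register the leader cluster under the alias "leader"
    PUT /_cluster/settings
    {
      "persistent": {
        "cluster.remote.leader.seeds": ["leader-node-0:9300"]
      }
    }

    // start following; all other max_* parameters stay at their defaults
    PUT /elasticlogs-follower1/_ccr/follow
    {
      "remote_cluster": "leader",
      "leader_index": "elasticlogs-leader1",
      "max_read_request_size": "32mb"
    }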

Stage 2

Same environment, ES branch, and settings as Stage 1 (including any agreed modifications to the defaults).

Run the eventdata track on both GCP and AWS, both baseline and with CCR, to understand behavior over a longer duration.

Evaluate behavior based on the criteria mentioned earlier.

Stage 3

TBD


dliappis commented Nov 6, 2018

Stage 1

Step 2 vs Step 3 vs Step 4 (no replication + soft_deletes: false vs no replication + soft_deletes: true vs CCR)

All runs use commit 63bb8a0201591a61fc123cf181a27fd13c38c123 (https://github.com/elastic/elasticsearch/commits/63bb8a0201591a61fc123cf181a27fd13c38c123) and max_read_request_size: 32mb.

Baseline: No CCR, soft_deletes: false
Δ: always calculated against baseline.
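
In the tables that follow, Δ % is the relative change versus baseline: Δ % = (baseline − candidate) / baseline × 100, so positive values indicate a slowdown. For example, the first AWS row: (101815 − 96005.7) / 101815 × 100 ≈ 5.71 %.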

HTTP_LOGS

AWS

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 101815 | 96005.7 | 5.7057408 | 85629.1 | 15.897363 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 183385 | 179244 | 2.2580909 | 174034 | 5.0991084 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 188017 | 182787 | 2.7816634 | 175779 | 6.5089859 | docs/s |

GCP

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 71812 | 66280.4 | 7.7028909 | 68538.4 | 4.5585696 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 103788 | 107824 | -3.8886962 | 100935 | 2.7488727 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 107305 | 111670 | -4.0678440 | 103465 | 3.5785844 | docs/s |

GEOPOINT

AWS

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 152255 | 131645 | 13.536501 | 127179 | 16.469738 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 263436 | 261074 | 0.89661246 | 251621 | 4.4849603 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 266929 | 263900 | 1.1347587 | 253394 | 5.0706368 | docs/s |

GCP

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 90000.9 | 85838.1 | 4.6252871 | 79567.9 | 11.592106 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 142982 | 148210 | -3.6564043 | 139059 | 2.7437020 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 145412 | 149422 | -2.7576816 | 140477 | 3.3938052 | docs/s |

PMC

AWS

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 1441.19 | 1428.69 | 0.86733880 | 1464.63 | -1.6264337 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 1682.59 | 1675.29 | 0.43385495 | 1642.68 | 2.3719385 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 1802.23 | 1824.55 | -1.2384657 | 1706.35 | 5.3200757 | docs/s |

GCP

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 1090.42 | 1132.39 | -3.8489756 | 1093.4 | -0.27328919 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 1260.57 | 1284.64 | -1.9094537 | 1243.04 | 1.3906407 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 1287.65 | 1307.77 | -1.5625364 | 1271.37 | 1.2643187 | docs/s |

Time for the follower to catch up in CCR (step 3)

Every run took < 1 s:

HTTP_LOGS

AWS CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |   584.205 |     ms |

GCP CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |   647.758 |     ms |

GEOPOINT

AWS CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |   201.332 |     ms |

GCP CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |    267.255 |     ms |

PMC

AWS CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |    256.562 |     ms |

GCP CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |    271.792 |     ms |


dliappis commented Nov 9, 2018

Stage 2: 3-day benchmarks

Using:

{
    "bulk_indexing_clients": 8,
    "bulk_indexing_iterations": 10000000,
    "bulk_size": 100,
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "enable_soft_deletes": true,
    "target_throughput": 38,
    "index_additional_settings": {
    }
}
  • _cat/indices from both clusters at the end:
green open elasticlogs-leader1 KJTGspzQQtCj0wpaTjlsUw 3 1 1000008000 0 650gb 324.9gb
green open elasticlogs-follower1 MSiAoQulRvmQRxeDnU197Q 3 1 1000008000 0 648.9gb 324.4gb
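
As a sanity check on these numbers: target_throughput 38 (bulk requests/s) × bulk_size 100 = 3,800 docs/s, which matches the median throughput below, and indexing the ~10⁹ docs shown by _cat/indices at that rate takes roughly 1,000,008,000 / 3,800 ≈ 263,160 s, in line with the ~263,000-second run times reported.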

Results

AWS

|   All |                       Min Throughput |                bulk-append |   3754.23 | docs/s |
|   All |                    Median Throughput |                bulk-append |   3799.97 | docs/s |
|   All |                       Max Throughput |                bulk-append |   3878.31 | docs/s |

|   All |             100th percentile latency | wait-for-followers-to-sync |   477.708 |     ms |

[INFO] SUCCESS (took 263283 seconds)

(trial-id: b0f1d448-3f68-4daf-92ce-35f7a126d512)

GCP

|   All |                       Min Throughput |                bulk-append |   2950.74 | docs/s |
|   All |                    Median Throughput |                bulk-append |   3799.98 | docs/s |
|   All |                       Max Throughput |                bulk-append |   3799.99 | docs/s |

|   All |             100th percentile latency | wait-for-followers-to-sync |   314.884 |     ms |

[INFO] SUCCESS (took 263266 seconds)

(trial-id: 78dfab04-b197-49ff-b188-95584a39735d)


dliappis commented Nov 10, 2018

Stage 2: Ensure throughput hasn't dropped with a recent commit

Benchmark using the same track (eventdata) and environment as in the previous Stage 2 results, using elastic/elasticsearch@f908949.

No tweaking of max_ parameters, using the defaults of the corresponding commit.

  1. Baseline:
  • no CCR, no soft_deletes
  • commit: 63bb8a0
  • AWS trial-id: 6afeb696-64d7-4ec7-87a1-940fd30c57cd
  • GCP trial-id: ea32372b-9274-4093-a05a-f1c4100ccd9f
  2. CCR first run:
  • commit: 63bb8a0
  • AWS trial-id: 8e83b9b5-dd36-4d37-ab5b-dff193f69f6b
  • GCP trial-id: 47dc469c-dda3-4d3c-b177-c153df0afdf3
  3. CCR second run:
  • commit: f908949
  • AWS trial-id: e44aff48-df6c-49e4-b681-ba05514ea0ae
  • GCP trial-id: 1c93ca79-d1b9-467f-8555-48919439c5b8

For all benchmarks time to catch up was <500ms.

AWS

| Metric | Operation | Baseline | #63bb8a0 | Δ (Drop) % | #f908949 | Δ (Drop) % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-append-1000 | 32536.3 | 30895.8 | 5.0420607 | 29792.3 | 8.4336572 | docs/s |
| Median Throughput | bulk-append-1000 | 56683.8 | 53170.8 | 6.1975379 | 53760.4 | 5.1573818 | docs/s |
| Max Throughput | bulk-append-1000 | 72361.5 | 63889.4 | 11.708022 | 66441.4 | 8.1812842 | docs/s |

GCP

| Metric | Operation | Baseline | #63bb8a0 | Δ (Drop) % | #f908949 | Δ (Drop) % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-append-1000 | 21305 | 23471.6 | -10.169444 | 16177.8 | 24.065712 | docs/s |
| Median Throughput | bulk-append-1000 | 30890.1 | 29608.9 | 4.1476072 | 32383.2 | -4.8335875 | docs/s |
| Max Throughput | bulk-append-1000 | 41925.7 | 41047.7 | 2.0941809 | 40603.3 | 3.1541513 | docs/s |

AWS: Comparison of system metrics for Baseline vs First Run vs Second Run

IO

[image]

CPU

[image]

GCP: Comparison of system metrics for Baseline vs First Run vs Second Run

IO

[image]

CPU

[image]


dliappis commented Dec 3, 2018

AWS Stage 2 benchmarks, node-stats turned off, including comparison with tcp.compress enabled

Re-running the Stage 2 benchmarks (eventdata track, AWS) without collecting node-stats, since collecting the docs/completion stats increases refreshes and skews results. Also benchmarking with tcp.compress enabled.

Track: example doc.
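
For reference, transport compression in 6.x is toggled via the transport.tcp.compress node setting; a sketch of the elasticsearch.yml line (assumption: applied to every node of both clusters and, being a static setting, requiring a restart):

    # elasticsearch.yml on every node of both clusters (assumption: set cluster-wide)
    transport.tcp.compress: true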

Conclusions on the effect of tcp.compress

Network Usage

Significant drop in network usage with tcp.compress enabled:

Leader cluster: network-in usage down by approx. 63% and network-out usage by approx. 80%. Link: https://goo.gl/RgX8cw
Follower cluster: network-in usage down by approx. 70-80% and network-out usage by approx. 75-80%. Link: https://goo.gl/yzTvoh

CPU Usage

No noteworthy difference in CPU usage (see the graphs below).

Results

SOFT_DELETES_OFF

|   All |                       Min Throughput | bulk-append |     54158 | docs/s |
|   All |                    Median Throughput | bulk-append |   56445.7 | docs/s |
|   All |                       Max Throughput | bulk-append |   69991.2 | docs/s |

SOFT_DELETES_ON

|   All |                       Min Throughput | bulk-append |   54732.1 | docs/s |
|   All |                    Median Throughput | bulk-append |   57239.1 | docs/s |
|   All |                       Max Throughput | bulk-append |   71856.2 | docs/s |

CCR on, tcp.compress disabled

|   All |                       Min Throughput |                bulk-append |   51936.5 | docs/s |
|   All |                    Median Throughput |                bulk-append |   54039.1 | docs/s |
|   All |                       Max Throughput |                bulk-append |   65640.8 | docs/s |

CCR on, tcp.compress enabled

|   All |                       Min Throughput |                bulk-append |     52300 | docs/s |
|   All |                    Median Throughput |                bulk-append |   54334.6 | docs/s |
|   All |                       Max Throughput |                bulk-append |   65911.8 | docs/s |

Graphs for CCR tcp.compress disabled vs enabled

LEFT is tcp.compress disabled, RIGHT is tcp.compress enabled. cluster_b is leader, cluster_a is follower.

Network-in Leader cluster

[image]

Network-out Follower cluster

[image]

CPU Stats Leader cluster

[image]

CPU Stats Follower cluster

[image]
