Benchmarking plan for CCR

Common for all benchmarks

  1. Collected metrics:

    • ccr-stats (every 5s)
    • node-stats, including the refresh section (every 10s)
    • metricbeat system metrics from each node (metricbeat enabled everywhere)
    • median indexing throughput and the usual metrics Rally reports in its summary
  2. Criteria to compare runs:

    • time taken for the follower to complete replication; ideally near-instantaneous once indexing is over
    • indexing throughput
    • overall system usage (esp. CPU + IO)
    • overhead compared to baseline
    • median delta of "leader global checkpoint - follower global checkpoint" from the CCR stats
  3. Telemetry device collection frequency (see the Rally invocation sketch after this list):

    "ccr-stats-sample-interval": 5,
    "node-stats-sample-interval": 10,
    "node-stats-include-indices": true # including refresh section
  4. Abbreviations/Terminology:

    • baseline: run benchmarks without a follower and with soft_deletes explicitly disabled on the leader
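
A sketch of how the collection intervals above would be passed to Rally via its telemetry flags; the track, pipeline and target host below are placeholders, not the actual benchmark invocation:

    # Sketch only: track and hosts are placeholders; the telemetry devices and
    # parameters mirror the intervals listed above.
    esrally --track=http_logs --pipeline=benchmark-only --target-hosts=leader-node:9200 \
        --telemetry="ccr-stats,node-stats" \
        --telemetry-params='{"ccr-stats-sample-interval": 5, "node-stats-sample-interval": 10, "node-stats-include-indices": true}'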

Stage 0 / a few adhoc benchmarks

Using the currently running GCP environment (1 node per cluster, n1-highcpu-16, 8r/8w concurrency, 8 Rally clients) and the same build commit (https://github.com/elastic/elasticsearch/commit/df6f9669dccd762416e644f5956f350e618f0d87), keeping the refresh section in the indices stats collected by node-stats.

  1. Run the http_logs benchmark with X-Pack security enabled
  2. Run the http_logs benchmark using a non-append-only challenge

Compare with trial-id e2209a84-74a9-491d-99b1-5c04cdee9c4f (https://goo.gl/fVPh3G), which is the benchmark with 8r/8w concurrency, 8 clients, CCR enabled, and X-Pack disabled.
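
For illustration only, the two runs above would differ from the earlier CCR benchmark roughly as follows; the challenge placeholder, hosts and credentials are assumptions, not values from the actual runs:

    # 1. http_logs with X-Pack security: add TLS / basic-auth client options (placeholder credentials).
    esrally --track=http_logs --pipeline=benchmark-only --target-hosts=leader-node:9200 \
        --client-options="use_ssl:true,verify_certs:false,basic_auth_user:'rally',basic_auth_password:'changeme'"

    # 2. http_logs with a non append-only challenge: select one of the track's other challenges.
    esrally --track=http_logs --pipeline=benchmark-only --target-hosts=leader-node:9200 \
        --challenge=<non-append-only-challenge>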

Stage 1

env+settings

  • 3-node clusters, security enabled.
  • GCP: ES nodes: custom-16-32768 (16 vCPU / 32 GB RAM / min. Skylake processor), load driver: n1-standard-16 (16 vCPU / 60 GB RAM)
  • AWS: ES nodes: c5d.4xlarge (16 vCPU / 32 GB RAM), load driver: m5d.4xlarge (16 vCPU / 64 GB RAM)
  • Index settings: 3P/1R (illustrated below)
  • Rally tracks: geopoint / http_logs / pmc. All tracks configured for indexing and replication running simultaneously. 8 indexing clients.

Elasticsearch branch: 6.x
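
A minimal sketch of the 3P/1R index settings on the leader; the index name and host are placeholders (in practice the Rally tracks create the indices), and index.soft_deletes.enabled is shown because CCR on 6.x can only follow indices with soft deletes enabled (it is disabled for the baseline runs):

    # Placeholder index name and host; illustrates the settings, not the actual setup commands.
    curl -XPUT "http://leader-node:9200/logs-leader" -H 'Content-Type: application/json' -d'
    {
      "settings": {
        "index.number_of_shards": 3,
        "index.number_of_replicas": 1,
        "index.soft_deletes.enabled": true
      }
    }'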

Benchmarks

  1. Run smoke-test CCR benchmarks to validate that the max_* defaults allow the follower to always catch up.

    Run GCP-only benchmarks with the three tracks to validate that the follower always catches up. This will allow us to tune defaults before spending more time collecting baseline numbers for measuring CCR overhead. (See the follow-request sketch after this list.)

  2. Use the defaults from 1. and establish baseline numbers without replication; these are needed to calculate the overhead of CCR.

    Run all tracks on both GCP + AWS to establish baseline. Total 3x2=6 benchmarks.

    Capture median indexing performance and resource metrics (CPU, IO, Network).

  3. Rerun 2. with CCR enabled.

    Run the same benchmarks and capture the same metrics as in 2., then compare to determine the CCR overhead.
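
For reference, a hedged sketch of setting up the follower referenced in step 1; the cluster alias, hosts and index names are placeholders, and the max_* throttling parameters are deliberately omitted so the defaults under test remain in effect (their exact names vary across 6.x releases):

    # On the follower cluster: register the leader as a remote cluster (placeholder seed host).
    curl -XPUT "http://follower-node:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
    {
      "persistent": {
        "cluster.remote.leader.seeds": ["leader-node:9300"]
      }
    }'

    # Start following the leader index with default max_* settings.
    curl -XPUT "http://follower-node:9200/logs-follower/_ccr/follow" -H 'Content-Type: application/json' -d'
    {
      "remote_cluster": "leader",
      "leader_index": "logs-leader"
    }'

    # Catch-up check: compare the leader and follower global checkpoints reported by the CCR stats.
    curl -s "http://follower-node:9200/_ccr/stats"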

Stage 2

Same environment, ES branch, and settings as Stage 1 (including any agreed modifications to the defaults).

Run the eventdata track on both GCP and AWS, both as a baseline and with CCR, to understand behavior over a longer duration.

Evaluate behavior based on the criteria mentioned earlier.

Stage 3

TBD

dliappis commented Nov 10, 2018

Stage 2: Ensure throughput hasn't dropped with a recent commit ID

Benchmark using the same track (eventdata) and environment as in the previous Stage 2 results, using elastic/elasticsearch@f908949.

No tweaking of max_* parameters; the defaults of the corresponding commit were used.

  1. Baseline:
     • no CCR, no soft_deletes
     • commit: 63bb8a0
     • AWS trial-id: 6afeb696-64d7-4ec7-87a1-940fd30c57cd
     • GCP trial-id: ea32372b-9274-4093-a05a-f1c4100ccd9f
  2. CCR first run:
     • commit: 63bb8a0
     • AWS trial-id: 8e83b9b5-dd36-4d37-ab5b-dff193f69f6b
     • GCP trial-id: 47dc469c-dda3-4d3c-b177-c153df0afdf3
  3. CCR second run:
     • commit: f908949
     • AWS trial-id: e44aff48-df6c-49e4-b681-ba05514ea0ae
     • GCP trial-id: 1c93ca79-d1b9-467f-8555-48919439c5b8

For all benchmarks, the time to catch up was <500 ms.

AWS

| Metric | Operation | Baseline | #63bb8a0 | Δ (Drop) % | #f908949 | Δ (Drop) % | Unit |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- |
| Min Throughput | bulk-append-1000 | 32536.3 | 30895.8 | 5.0420607 | 29792.3 | 8.4336572 | docs/s |
| Median Throughput | bulk-append-1000 | 56683.8 | 53170.8 | 6.1975379 | 53760.4 | 5.1573818 | docs/s |
| Max Throughput | bulk-append-1000 | 72361.5 | 63889.4 | 11.708022 | 66441.4 | 8.1812842 | docs/s |

GCP

| Metric | Operation | Baseline | #63bb8a0 | Δ (Drop) % | #f908949 | Δ (Drop) % | Unit |
| --- | --- | ---: | ---: | ---: | ---: | ---: | --- |
| Min Throughput | bulk-append-1000 | 21305 | 23471.6 | -10.169444 | 16177.8 | 24.065712 | docs/s |
| Median Throughput | bulk-append-1000 | 30890.1 | 29608.9 | 4.1476072 | 32383.2 | -4.8335875 | docs/s |
| Max Throughput | bulk-append-1000 | 41925.7 | 41047.7 | 2.0941809 | 40603.3 | 3.1541513 | docs/s |
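
For clarity, the Δ (Drop) % columns appear to be computed as (Baseline − run) / Baseline × 100, so negative values mean the run was faster than the baseline; e.g. for the AWS median throughput at #63bb8a0: (56683.8 − 53170.8) / 56683.8 × 100 ≈ 6.20 %.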

AWS: Comparison of system metrics for Baseline vs First Run vs Second Run

IO

(screenshot)

CPU

(screenshot)

GCP: Comparison of system metrics for Baseline vs First Run vs Second Run

IO

(screenshot)

CPU

(screenshot)

dliappis commented Dec 3, 2018

AWS Stage 2 benchmarks, node-stats turned off, including a comparison with tcp.compress enabled

Re-run the Stage 2 benchmarks (eventdata track, AWS) without collecting node-stats, since collecting the docs/completion stats increases refreshes and skews results. Also benchmark with tcp.compress enabled.

Track: example doc.
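
For context, a minimal sketch of enabling transport-level compression on 6.x; the config path is illustrative, and the setting is applied per node in elasticsearch.yml on both clusters (a node restart is assumed here):

    # Sketch only: enable transport compression on each node of both clusters.
    echo "transport.tcp.compress: true" >> /etc/elasticsearch/elasticsearch.yml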

Conclusion on the effect of tcp.compress

Network Usage

Significant drop in network usage with tcp.compress enabled:

  • Leader cluster: network-in usage down by approx. 63% and network-out usage down by approx. 80%. Link: https://goo.gl/RgX8cw
  • Follower cluster: network-in usage down by approx. 70-80% and network-out usage down by approx. 75-80%. Link: https://goo.gl/yzTvoh

CPU Usage

No noteworthy difference in CPU usage (see the graphs below).

Results

SOFT_DELETES_OFF

| Lap | Metric | Operation | Value | Unit |
| --- | --- | --- | ---: | --- |
| All | Min Throughput | bulk-append | 54158 | docs/s |
| All | Median Throughput | bulk-append | 56445.7 | docs/s |
| All | Max Throughput | bulk-append | 69991.2 | docs/s |

SOFT_DELETES_ON

| Lap | Metric | Operation | Value | Unit |
| --- | --- | --- | ---: | --- |
| All | Min Throughput | bulk-append | 54732.1 | docs/s |
| All | Median Throughput | bulk-append | 57239.1 | docs/s |
| All | Max Throughput | bulk-append | 71856.2 | docs/s |

CCR on, no tcp.compression

| Lap | Metric | Operation | Value | Unit |
| --- | --- | --- | ---: | --- |
| All | Min Throughput | bulk-append | 51936.5 | docs/s |
| All | Median Throughput | bulk-append | 54039.1 | docs/s |
| All | Max Throughput | bulk-append | 65640.8 | docs/s |

CCR on, tcp.compression: enabled

| Lap | Metric | Operation | Value | Unit |
| --- | --- | --- | ---: | --- |
| All | Min Throughput | bulk-append | 52300 | docs/s |
| All | Median Throughput | bulk-append | 54334.6 | docs/s |
| All | Max Throughput | bulk-append | 65911.8 | docs/s |

Graphs for CCR tcp.compress disabled vs enabled

LEFT is tcp.compress disabled, RIGHT is tcp.compress enabled. cluster_b is the leader, cluster_a is the follower.

Network-in Leader cluster

(screenshot)

Network-out Follower cluster

(screenshot)

CPU Stats Leader cluster

(screenshot)

CPU Stats Follower cluster

(screenshot)
