Benchmarking plan for CCR

Common for all benchmarks

  1. Collected metrics:

    • ccr-stats (every 5 s)
    • node-stats, including the refresh section (every 10 s)
    • Metricbeat from every node
    • median indexing throughput and the usual results Rally provides in its summary
  2. Criteria to compare runs:

    • time taken by the follower to complete replication; ideally near-instantaneous once indexing is over.
    • indexing throughput
    • overall system usage (esp. CPU + IO)
    • overhead compared to baseline
    • median delta of "leader global checkpoint - follower global checkpoint", from ccr-stats
  3. Telemetry device collection frequency (node-stats-include-indices pulls in the refresh section):

    "ccr-stats-sample-interval": 5,
    "node-stats-sample-interval": 10,
    "node-stats-include-indices": true
  4. Abbreviations/Terminology:

    • baseline: benchmarks run without a follower and with soft_deletes explicitly disabled on the leader (see the index-creation sketch below)
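
For the baseline, a minimal sketch of creating a leader index with soft deletes explicitly disabled (the index name and shard counts are illustrative, borrowed from the Stage 1/2 settings; index.soft_deletes.enabled must be set at index creation time):

    // illustrative index name; 3P/1R as in Stage 1
    PUT /elasticlogs-leader1
    {
      "settings": {
        "index.number_of_shards": 3,
        "index.number_of_replicas": 1,
        "index.soft_deletes.enabled": false
      }
    }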

Stage 0 / a few adhoc benchmarks

Using the currently running GCP environment (1 node per cluster, n1-highcpu-16, 8r/8w concurrency, 8 Rally clients) and the same build commit (https://github.com/elastic/elasticsearch/commit/df6f9669dccd762416e644f5956f350e618f0d87), keeping the refresh section in the indices stats included with node-stats.

  1. Run the http_logs benchmark with X-Pack security enabled
  2. Run the http_logs benchmark using a non-append-only challenge

Compare with trial-id e2209a84-74a9-491d-99b1-5c04cdee9c4f (https://goo.gl/fVPh3G), the benchmark with 8r/8w, 8 clients, CCR enabled, X-Pack disabled.

Stage 1

env+settings

  • 3 node clusters, security enabled.
  • GCP: ES: custom-16-32768 16cpu / 32GB ram / min Skylake processor, Loaddriver: n1-standard-16 (16vcpu 60GB ram)
  • AWS: ES: c5d.4xlarge 16vcpu / 32GB ram, Loaddriver: m5d.4xlarge (16vcpu 64GB ram)
  • Index settings: 3P/1R
  • Rally tracks: geopoint / http_logs / pmc, all configured for simultaneous indexing + replication with 8 indexing clients.

Elasticsearch branch: 6.x
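
A track-params sketch matching these index settings and client count (keys taken from the Stage 2 parameters further down; exact parameter names vary per track):

    {
        "bulk_indexing_clients": 8,
        "number_of_shards": 3,
        "number_of_replicas": 1
    }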

Benchmarks

  1. Run smoke-test CCR benchmarks to validate that the max_* defaults allow the follower to always catch up (see the remote-cluster and follow-request sketch after this list).

    Run GCP-only benchmarks with the three tracks to validate that the follower always catches up. This lets us tune the defaults before spending more time collecting baseline numbers for measuring CCR overhead.

  2. Using the defaults from 1., establish baseline numbers without replication; these are needed to calculate the overhead of CCR.

    Run all tracks on both GCP + AWS to establish baseline. Total 3x2=6 benchmarks.

    Capture median indexing performance and resource metrics (CPU, IO, Network).

  3. Rerun 2. with CCR enabled.

    Run the same benchmarks and capture the same metrics as in 2., then compare to determine the CCR overhead.
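
For step 1, a sketch of wiring up the follower with an explicit max_* override (6.x CCR API; the remote-cluster alias "leader", the seed host, and the index names are illustrative, and max_read_request_size mirrors the value used in the Stage 1 results below):

    // on the follower cluster: register the leader cluster under the alias "leader"
    PUT /_cluster/settings
    {
      "persistent": {
        "cluster.remote.leader.seeds": ["leader-node-0:9300"]
      }
    }

    // start following; all other max_* parameters stay at their defaults
    PUT /elasticlogs-follower1/_ccr/follow
    {
      "remote_cluster": "leader",
      "leader_index": "elasticlogs-leader1",
      "max_read_request_size": "32mb"
    }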

Stage 2

Same environment, ES branch, and settings as Stage 1 (including any agreed modifications to the defaults).

Run the eventdata track on both GCP and AWS, both baseline and with CCR, to understand behavior over a longer duration.

Evaluate behavior based on the criteria mentioned earlier.

Stage 3

TBD


dliappis commented Nov 6, 2018

Stage 1

Step 2 vs Step 3 vs Step 4 (no replication + soft_deletes: false vs no replication + soft_deletes: true vs CCR)

All runs use commit 63bb8a0201591a61fc123cf181a27fd13c38c123 (https://github.com/elastic/elasticsearch/commits/63bb8a0201591a61fc123cf181a27fd13c38c123) and max_read_request_size: 32mb.

Baseline: No CCR, soft_deletes: false
Δ: always calculated against baseline.
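
In the tables that follow, Δ % is the relative change versus baseline: Δ % = (baseline − candidate) / baseline × 100, so positive values indicate a slowdown. For example, the first AWS row: (101815 − 96005.7) / 101815 × 100 ≈ 5.71 %.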

HTTP_LOGS

AWS

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 101815 | 96005.7 | 5.7057408 | 85629.1 | 15.897363 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 183385 | 179244 | 2.2580909 | 174034 | 5.0991084 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 188017 | 182787 | 2.7816634 | 175779 | 6.5089859 | docs/s |

GCP

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 71812 | 66280.4 | 7.7028909 | 68538.4 | 4.5585696 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 103788 | 107824 | -3.8886962 | 100935 | 2.7488727 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 107305 | 111670 | -4.0678440 | 103465 | 3.5785844 | docs/s |

GEOPOINT

AWS

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 152255 | 131645 | 13.536501 | 127179 | 16.469738 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 263436 | 261074 | 0.89661246 | 251621 | 4.4849603 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 266929 | 263900 | 1.1347587 | 253394 | 5.0706368 | docs/s |

GCP

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 90000.9 | 85838.1 | 4.6252871 | 79567.9 | 11.592106 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 142982 | 148210 | -3.6564043 | 139059 | 2.7437020 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 145412 | 149422 | -2.7576816 | 140477 | 3.3938052 | docs/s |

PMC

AWS

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 1441.19 | 1428.69 | 0.86733880 | 1464.63 | -1.6264337 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 1682.59 | 1675.29 | 0.43385495 | 1642.68 | 2.3719385 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 1802.23 | 1824.55 | -1.2384657 | 1706.35 | 5.3200757 | docs/s |

GCP

| Metric | Operation | Baseline | Baseline + soft_deletes: true | Δ % | CCR on | Δ % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-leader-index-autogenerated-ids | 1090.42 | 1132.39 | -3.8489756 | 1093.4 | -0.27328919 | docs/s |
| Median Throughput | bulk-leader-index-autogenerated-ids | 1260.57 | 1284.64 | -1.9094537 | 1243.04 | 1.3906407 | docs/s |
| Max Throughput | bulk-leader-index-autogenerated-ids | 1287.65 | 1307.77 | -1.5625364 | 1271.37 | 1.2643187 | docs/s |

Time for the follower to catch up in CCR (step 3)

Every run took < 1 s:

HTTP_LOGS

AWS CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |   584.205 |     ms |

GCP CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |   647.758 |     ms |

GEOPOINT

AWS CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |   201.332 |     ms |

GCP CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |    267.255 |     ms |

PMC

AWS CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |    256.562 |     ms |

GCP CCR

|   All |             100th percentile latency |          wait-for-followers-to-sync |    271.792 |     ms |


dliappis commented Nov 9, 2018

Stage 2: 3-day benchmarks

Using:

{
    "bulk_indexing_clients": 8,
    "bulk_indexing_iterations": 10000000,
    "bulk_size": 100,
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "enable_soft_deletes": true,
    "target_throughput": 38,
    "index_additional_settings": {
    }
}
  • _cat/indices from both clusters at the end:
green open elasticlogs-leader1 KJTGspzQQtCj0wpaTjlsUw 3 1 1000008000 0 650gb 324.9gb
green open elasticlogs-follower1 MSiAoQulRvmQRxeDnU197Q 3 1 1000008000 0 648.9gb 324.4gb
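
As a sanity check on these numbers: target_throughput 38 (bulk requests/s) × bulk_size 100 = 3,800 docs/s, which matches the median throughput below, and indexing the ~10⁹ docs shown by _cat/indices at that rate takes roughly 1,000,008,000 / 3,800 ≈ 263,160 s, in line with the ~263,000-second run times reported.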

Results

AWS

|   All |                       Min Throughput |                bulk-append |   3754.23 | docs/s |
|   All |                    Median Throughput |                bulk-append |   3799.97 | docs/s |
|   All |                       Max Throughput |                bulk-append |   3878.31 | docs/s |

|   All |             100th percentile latency | wait-for-followers-to-sync |   477.708 |     ms |

[INFO] SUCCESS (took 263283 seconds)

(trial-id: b0f1d448-3f68-4daf-92ce-35f7a126d512)

GCP

|   All |                       Min Throughput |                bulk-append |   2950.74 | docs/s |
|   All |                    Median Throughput |                bulk-append |   3799.98 | docs/s |
|   All |                       Max Throughput |                bulk-append |   3799.99 | docs/s |

|   All |             100th percentile latency | wait-for-followers-to-sync |   314.884 |     ms |

[INFO] SUCCESS (took 263266 seconds)

(trial-id: 78dfab04-b197-49ff-b188-95584a39735d)


dliappis commented Nov 10, 2018

Stage 2: Ensure throughput hasn't dropped with a recent commit

Benchmark using the same track (eventdata) and environment as in the previous Stage 2 results, using elastic/elasticsearch@f908949.

No tweaking of max_ parameters, using the defaults of the corresponding commit.

  1. Baseline:
  • no CCR, no soft_deletes
  • commit: 63bb8a0
  • AWS trial-id: 6afeb696-64d7-4ec7-87a1-940fd30c57cd
  • GCP trial-id: ea32372b-9274-4093-a05a-f1c4100ccd9f
  2. CCR first run:
  • commit: 63bb8a0
  • AWS trial-id: 8e83b9b5-dd36-4d37-ab5b-dff193f69f6b
  • GCP trial-id: 47dc469c-dda3-4d3c-b177-c153df0afdf3
  3. CCR second run:
  • commit: f908949
  • AWS trial-id: e44aff48-df6c-49e4-b681-ba05514ea0ae
  • GCP trial-id: 1c93ca79-d1b9-467f-8555-48919439c5b8

For all benchmarks time to catch up was <500ms.

AWS

| Metric | Operation | Baseline | #63bb8a0 | Δ (Drop) % | #f908949 | Δ (Drop) % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-append-1000 | 32536.3 | 30895.8 | 5.0420607 | 29792.3 | 8.4336572 | docs/s |
| Median Throughput | bulk-append-1000 | 56683.8 | 53170.8 | 6.1975379 | 53760.4 | 5.1573818 | docs/s |
| Max Throughput | bulk-append-1000 | 72361.5 | 63889.4 | 11.708022 | 66441.4 | 8.1812842 | docs/s |

GCP

| Metric | Operation | Baseline | #63bb8a0 | Δ (Drop) % | #f908949 | Δ (Drop) % | Unit |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Min Throughput | bulk-append-1000 | 21305 | 23471.6 | -10.169444 | 16177.8 | 24.065712 | docs/s |
| Median Throughput | bulk-append-1000 | 30890.1 | 29608.9 | 4.1476072 | 32383.2 | -4.8335875 | docs/s |
| Max Throughput | bulk-append-1000 | 41925.7 | 41047.7 | 2.0941809 | 40603.3 | 3.1541513 | docs/s |

AWS: Comparison of system metrics for Baseline vs First Run vs Second Run

IO

[image]

CPU

[image]

GCP: Comparison of system metrics for Baseline vs First Run vs Second Run

IO

[image]

CPU

[image]


dliappis commented Dec 3, 2018

AWS Stage 2 benchmarks, node-stats turned off, including comparison with tcp.compress enabled

Re-running the Stage 2 benchmarks (eventdata track, AWS) without collecting node-stats, since collecting the docs/completion stats increases refreshes and skews results. Also benchmarking with tcp.compress enabled.

Track: example doc.
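
For reference, transport compression in 6.x is toggled via the transport.tcp.compress node setting; a sketch of the elasticsearch.yml line (assumption: applied to every node of both clusters and, being a static setting, requiring a restart):

    # elasticsearch.yml on every node of both clusters (assumption: set cluster-wide)
    transport.tcp.compress: true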

Conclusions on the effect of tcp.compress

Network Usage

Significant drop in network usage with tcp.compress enabled:

Leader cluster: network-in usage down by approx. 63% and network-out usage by approx. 80%. Link: https://goo.gl/RgX8cw
Follower cluster: network-in usage down by approx. 70-80% and network-out usage by approx. 75-80%. Link: https://goo.gl/yzTvoh

CPU Usage

No noteworthy difference in CPU usage (see the graphs below).

Results

SOFT_DELETES_OFF

|   All |                       Min Throughput | bulk-append |     54158 | docs/s |
|   All |                    Median Throughput | bulk-append |   56445.7 | docs/s |
|   All |                       Max Throughput | bulk-append |   69991.2 | docs/s |

SOFT_DELETES_ON

|   All |                       Min Throughput | bulk-append |   54732.1 | docs/s |
|   All |                    Median Throughput | bulk-append |   57239.1 | docs/s |
|   All |                       Max Throughput | bulk-append |   71856.2 | docs/s |

CCR on, tcp.compress disabled

|   All |                       Min Throughput |                bulk-append |   51936.5 | docs/s |
|   All |                    Median Throughput |                bulk-append |   54039.1 | docs/s |
|   All |                       Max Throughput |                bulk-append |   65640.8 | docs/s |

CCR on, tcp.compress enabled

|   All |                       Min Throughput |                bulk-append |     52300 | docs/s |
|   All |                    Median Throughput |                bulk-append |   54334.6 | docs/s |
|   All |                       Max Throughput |                bulk-append |   65911.8 | docs/s |

Graphs for CCR tcp.compress disabled vs enabled

LEFT is tcp.compress disabled, RIGHT is tcp.compress enabled. cluster_b is leader, cluster_a is follower.

Network-in Leader cluster

[image]

Network-out Follower cluster

[image]

CPU Stats Leader cluster

[image]

CPU Stats Follower cluster

[image]
