-
-
Save markwalkom/8a7201e3f6ea4354ae06 to your computer and use it in GitHub Desktop.
input { | |
elasticsearch { | |
hosts => [ "HOSTNAME_HERE" ] | |
port => "9200" | |
index => "INDEXNAME_HERE" | |
size => 1000 | |
scroll => "5m" | |
docinfo => true | |
scan => true | |
} | |
} | |
output { | |
elasticsearch { | |
hosts => [ "HOSTNAME_HERE" ] | |
index => "%{[@metadata][_index]}" | |
document_type => "%{[@metadata][_type]}" | |
document_id => "%{[@metadata][_id]}" | |
} | |
stdout { | |
codec => "dots" | |
} | |
} |
Using this method, I keep getting "Error: Unable to establish loopback connection" and using the metric to count the number of document returned, showed that it is not corresponding to the number of record in the original. I have to keep filtering till I get the size right. And FYI, if you keep the scroll value too long and if it encountered exception, it is going to eat into available memory as the search context will be kept open while the plugin restarts. Use this /_nodes/stats/indices/search?pretty to see the number of search context opened and you will find the open_context increasing. Would like to hear how others are handling such scenario.
I have the same kind of config.
One notable difference is that I also use the metrics plugin to monitor throughput and number of copied documents.
I also added some filters to discard randomly entries by percentage, useful for copying from production to test (smaller clusters). Or simply CTRL + C if you don't need complete timeframe sample.
For maximum of parallelism, you can specify more than one non overlapping elasticsearch input.
As noted before, be cautious to tune correctly and consistently different input/worker/output parameters :
- LS_HEAP_SIZE
- Number of workers (-w switch).
- Number of elasticsearch input blocs (non overlapping index patterns)
- Elasticsearch output perf parameters (workers, flush_size, timeout)
There's no magic switch, it's always a matter of adjustment depending on your hardware and data volume.
Here's my file (sorry, it's french documented):
https://gist.github.com/blavoie/58c90290935e8e1167e6
Also, before copying, we create a end of line template (order 90) that disable refreshes and replicas for newly created indices. That makes bulk indexing faster. At the end of bulk, we remove template and re-enable desired replication factor and also set back refresh interval to better values (say 15s).
Link to this template:
https://gist.github.com/blavoie/ebdf92793e14d8d9ebfc
In our case, for simplicity and multi-tenancy reason, our indices begin all by ul-*.
This makes things simpler when copying stuff, applying templates, aliasing, etc.
Please note that this config is for LS < 2.2, as the workers/pools/pipeline model changed a bit as of LS >= 2.2.
Thanks for the comments @blavioe!
Hi, I use the default index naming "logstash-" for a daily index. I have altered the number of shards from the default 5 to 1. I need to re-index my indices. I don't want to re-index into a new index eg "logstash-new-" but instead I want the existing indices to end up being spread across their single shard (instead of the current 5 shards per index).
How can I use this logstash script to do this?
Is there a better way to do this - eg re-index into new indices eg "logstash-new-", delete the original "logstash-" indices, then re-index back into "logstash-" from the new "logstash-new-" indices?
Many thanks.
Reindex API is a nice option:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html#_reindex_daily_indices
Also look into automatic scroll slicing that allows scrolls to be processed by multiple threads in parallel giving a nice speed boost.
Can anybody please explain that scroll option? I do reindex with logstash and it loops endlessly - the data from source index is randomly duplicated to output
Is it possible to reindex all indices in one shot, or do you have to specify each index name?