We have a set of serialized tasks that occasionally fail. When this happens we usually retry from the start. The failures are usually transient, so the second attempt will usually succeed. However, we start again from the beginning, so we pay the cost of redoing all the tasks. Because the errors are transient, if we could retry in-line these transient errors would not abort the entire pipeline.
Let's try introducing retry(1)
to allow us to specify a managed
execution.
command-that-fails --argument
becomes:
retry 3 command-that-fails --argument
As long as the command succeeds at least one in three we shouldn't have to restart the pipeline.
retry$ ./retry.sh 3 ./ok
running ok
retry$ ./retry.sh 3 ./bad
running bad
Jan 6 17:11:49 host.local user[653] <Notice>: ./bad failed, retrying
running bad
Jan 6 17:11:49 host.local jonathona[655] <Notice>: ./bad failed, retrying
running bad
Jan 6 17:11:49 host.local jonathona[657] <Notice>: ./bad failed to much, bailing