TL;DR: I want to propose / start a conversation around two things:
- Create a protocol / wire format / some form of structural definition for distributed tracing. I urge the project owners to please reconsider having the libraries be the specification. I should not have to import an SDK to participate in distributed tracing; I should only need to be a cooperative client able to produce data in the agreed-upon format.
- The protocol / specification should be designed to be 100% stateless, for lack of a better term "eventual causality": I should be able to know which set of spans are open, and there should be no consequence for a span that is never closed. Leave it to visualization and tooling to iron out the edge cases. Tracing is about observation across systems; a failed transaction, seg fault, or crashing program should not lose all the data that led up to that event, as it does in the implementations I've used today (due to buffering log records and hanging onto unfinished spans in memory).
Where could I find more information about the higher-level goals of the upcoming OpenTelemetry effort? From what I gather, it seems to focus on defining common APIs, similar to what OpenTracing does. I would like to open an issue to discuss all of the consequences of having the "specification" end up living in each language's API. In short, I think that before everyone starts implementing libraries, yet again, a well-defined protocol should be created first. The current specification for OpenTracing is a document with a few verbs in it; the definition of a standard protocol or data format has been completely avoided in favor of having it live inside an API.
But does it really make sense to have every single tracing provider implement a language-specific API instead of a common protocol / data format? Given that each vendor already has to implement all of these various libraries, wouldn't it be less effort to provide a single ingest endpoint that accepts the common protocol? Instead of implementing the "API" x N languages the vendor wishes to support, they provide 1 endpoint with 1 common, well-defined format.
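To make that concrete, here is a minimal sketch in Go of what such a vendor ingest endpoint could look like, assuming newline-delimited JSON over HTTP; the path, the untyped event shape, and the store() helper are illustrative assumptions on my part, not any existing spec:

```go
package main

import (
	"encoding/json"
	"net/http"
)

// store is a hypothetical stand-in for the vendor's storage pipeline.
func store(ev map[string]interface{}) {}

func main() {
	// One endpoint, one format: any language that can speak HTTP + JSON
	// can participate, with no SDK import required.
	http.HandleFunc("/v1/trace-events", func(w http.ResponseWriter, r *http.Request) {
		dec := json.NewDecoder(r.Body)
		for dec.More() {
			var ev map[string]interface{}
			if err := dec.Decode(&ev); err != nil {
				http.Error(w, err.Error(), http.StatusBadRequest)
				return
			}
			store(ev)
		}
		w.WriteHeader(http.StatusAccepted)
	})
	http.ListenAndServe(":8080", nil)
}
```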
This leaves room for best-in-class implementations to emerge, rather than everyone being stuck with whatever interfaces the initial contributors created. While I'm sure the intentions are pure and the efforts may result in top-notch software, it's impossible to satisfy the requirements of all systems or meet all use cases. I should not have to import a library to participate in the distributed tracing ecosystem; I should only need to produce data for a well-defined protocol.
I suspect many of the major design issues I've found in the current libraries would surface while carefully designing a protocol to satisfy all the use cases. The two biggest flaws in the current design of opentracing-go, for example:
- Spans accumulate log records on the span object
- Spans stay in memory until they are finished (perhaps only because they hold log records..?)
There are several design flaws / warts in the current APIs, but I strongly feel the two above are orders of magnitude more detrimental. First, they cause a loss of visibility for in-flight spans: you don't know a span is running until it's finished, which makes any kind of dashboard, alerting, or visibility into live issues impossible. Second, they create an entire class of software issues:
They strain memory usage and bandwidth for any burst of requests. Hot paths I had with zero (or amortized-to-zero) allocations suddenly gained dozens once I added a handful of spans or log entries, and each one is held until the span finishes. You can't pool log records either, because they are potentially read after the span finishes by the opentracing "driver": a caller can't use a sync.Pool, since releasing a record some time after a span is finished does not guarantee the driver is done reading it. You're at the mercy of the API.
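To illustrate the pooling hazard, here is a sketch using hypothetical Record / Span stand-ins (deliberately not the real opentracing-go types) that buffer records the same way:

```go
package tracedemo

import "sync"

// Hypothetical stand-ins for a tracing API that buffers log records on the
// span object; these are not the real opentracing-go types.
type Record struct{ key, val string }

func (r *Record) Set(k, v string) { r.key, r.val = k, v }

type Span struct{ records []*Record }

// Log buffers the record on the span; the driver may read it again long
// after the span finishes.
func (s *Span) Log(r *Record) { s.records = append(s.records, r) }

var recordPool = sync.Pool{New: func() interface{} { return new(Record) }}

func logJob(span *Span, jobID string) {
	rec := recordPool.Get().(*Record)
	rec.Set("job", jobID)
	span.Log(rec)
	// recordPool.Put(rec) would be unsafe here: the span (and its driver)
	// may still read rec after the span finishes, and there is no signal
	// telling the caller when it is done.
}
```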
It makes it impossible to have a top-level span in your main() and produce independent child spans in blocking services. You're limited to using spans in exactly one case: short-lived requests with minimal metadata. Even then you have to log cautiously, because if you produce too many records you lose your entire span: "Span contains too many log records".
Because you buffer, you now have an entire class of engineering problems. Responsible software should define bounds, but what do you do when a log record is added that exceeds the upper bound? There is no error propagation. Worse, some libraries don't check the current number of records at all and accumulate log records forever, so an innocent enough piece of software like the one below:
```go
func startWork(ctx context.Context) {
	// opentracing-go API; nextJob/handle stand in for the service's job loop
	span, _ := opentracing.StartSpanFromContext(ctx, "startWork")
	defer span.Finish() // never runs: the loop never returns
	for {
		job := nextJob()
		span.LogFields(log.Object("job_start", job)) // buffered on the span
		handle(job)
		span.LogFields(log.Object("job_end", job)) // buffered again, forever
	}
}
```
This loop becomes a runaway buffer until the program runs out of memory or exits. Worse, if you rely on tracing for visibility into issues, the entire span with all its glorious millions of entries is discarded: "Span contains too many log records". Yes, discarded, not truncated. Maybe some libs truncate; maybe some log a helpful record at limit-1 saying "truncated N more messages". The problem is that this implementation-specific behavior exists at all, and it only exists because of the approach that the implementations are the specification.
All of these problems go away if distinct events are produced for spans and log records. Imagine a protocol that defines event types (nomenclature unimportant, just for illustration):
- span_open // span has been opened
- span_close // span has been closed
- span_oneshot // maybe for a use case where a span event is its own open and close, one I haven't discovered yet?
- span_log // log record for a span, emitted as an independent event; the delivery guarantees & consistency model would be defined in the protocol
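As one possible shape for these events (every name here is my own guess, purely illustrative), each event would be a small, self-contained record that requires no prior state to interpret:

```go
package events

import "time"

// EventType enumerates the illustrative event kinds listed above.
type EventType string

const (
	SpanOpen    EventType = "span_open"
	SpanClose   EventType = "span_close"
	SpanOneshot EventType = "span_oneshot"
	SpanLog     EventType = "span_log"
)

// Event carries its full identity (trace, span, parent) on every record,
// so a lost span_close costs nothing but that one event.
type Event struct {
	Type     EventType         `json:"type"`
	TraceID  string            `json:"trace_id"`
	SpanID   string            `json:"span_id"`
	ParentID string            `json:"parent_id,omitempty"` // empty for root spans
	Time     time.Time         `json:"time"`
	Fields   map[string]string `json:"fields,omitempty"` // payload for span_log
}
```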
Now the problem goes away, and an entire class of tooling opens up for closer-to-realtime visibility into open transactions. I can also start spans anywhere in my software, regardless of how long they stay open or how many child spans they produce. If a protocol were defined with this approach, library authors could still choose to use a library like opentracing-go that buffers log records in spans for whatever reason, but users like me could produce high-performance / high-quality libraries that don't.
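To sketch what that buys (building on the illustrative Event type above, with a hypothetical Emitter that writes each event straight to a socket or local agent):

```go
package events

import (
	"encoding/json"
	"io"
	"time"
)

// Emitter writes every event to the wire immediately; nothing accumulates
// in process memory, so a crash loses at most the event in flight.
type Emitter struct {
	out     io.Writer // e.g. a UDP socket or a pipe to a local agent
	traceID string
	spanID  string
}

func (e *Emitter) emit(t EventType, fields map[string]string) {
	ev := Event{Type: t, TraceID: e.traceID, SpanID: e.spanID,
		Time: time.Now(), Fields: fields}
	json.NewEncoder(e.out).Encode(ev) // fire-and-forget; error handling elided
}

func (e *Emitter) Open()                   { e.emit(SpanOpen, nil) }
func (e *Emitter) Log(f map[string]string) { e.emit(SpanLog, f) }
func (e *Emitter) Close()                  { e.emit(SpanClose, nil) }
```

Under this model the runaway loop from earlier is harmless: span_open is visible the moment work starts, each job log is one event on the wire, and a crash mid-loop loses nothing that was already emitted.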
For an example of an event specification I wrote 4-6 years ago (before I even knew the term "distributed tracing") that was used at my company, see: https://github.com/godaddy/py-emit/blob/master/EVENT.md - it's dated, and I'm sure a protocol / spec for tracing today would emerge in a much different form thanks to much smarter and more experienced people producing it.
But it illustrates that it is helpful to have a concrete data structure to observe. Right now you have to read large client libraries implemented across N languages and build a mental model of what you're actually representing. I think we can do better.