The microservice architecture pattern solves many of the problems inherent in monolithic applications. But microservices also bring challenges of their own, one of which is figuring out what went wrong when something breaks. There are at least 3 related challenges here: collecting logs, collecting metrics, and tracing requests as they move between services.
Log and metric collection is fairly straightforward (we'll cover these in a separate post), but only gets you so far.
Let's say your 20-microservice application starts behaving badly - you start getting timeouts on a particular API and want to find out why. The first place you look may be your centralized metrics service. This will likely confirm that you have a problem, as hopefully you have one or more metrics that are now showing out-of-band numbers.
But what if the issue only affects part of your user population, or worse, a single (but important) customer? In these cases your metrics - assuming you have the right ones in the first place - probably won't tell you much.
In cases like these, where you have minimal or no guidance from your configured metrics, you start trying to figure out where the problem may be. You know your system architecture, and you're pretty sure you've narrowed the issue down to three or four of your services.
So what's next? Well, you've got your centrally aggregated service logs, right? So you open up three or four windows and try to find an example of a request that fails, and trace it through to the other 2-3 services in the mix. Of course, if your problem only manifests in production then you'll be sifting through a large number of logs.
How good are your logs anyway? You're in prod, so you've probably disabled debug logs, but even if you hadn't, logs usually only get you so far. After some digging, you might be able to narrow things down to a function or two, but you're likely not logging all the information you need to proceed from there. Time to start sifting through code...
But maybe there's a better way.
Distributed Tracing is a method of tracking a request as it traverses multiple services. Let's say you have a simple e-commerce app, which looks a little like this (simplified for clarity):
Now, your user has made an order and wants to track its status. To do this, the user makes a request that hits your API Gateway, which needs to authenticate the request and then send it on to your Orders service. This fetches the Order details, then consults your Shipping service to discover shipping status, which in turn calls an external API belonging to your shipping partner.
There are quite a few things that can go wrong here. Your Auth service could be down, your Orders service could be unable to reach its database, your Shipping service could be unable to access the external API, and so on. All you know, though, is that your customer is complaining that they can't access their Order details and they're getting aggravated.
We can solve this by tracing a request as it traverses your architecture, with each step surfacing details about what is going on and what (if anything) went wrong. We can then use the Jaeger UI to visualize the trace as it happened, allowing us to debug problems as well as identify bottlenecks.
To demonstrate how this works I've created a distributed tracing example app on GitHub. The repo is pretty basic: a packages directory contains 4 extremely simple apps - gateway, auth, orders and shipping - corresponding to 4 of the services in our service architecture diagram.
The easiest way to play with this yourself is to simply clone the repo and start the services using docker-compose:
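Something like this will get you going (the repo URL below is a placeholder - substitute the one from the GitHub repo linked above):

```sh
# Clone the example repo and bring up the services plus Jaeger
git clone <repo-url> distributed-tracing-example
cd distributed-tracing-example
docker-compose up
```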
This will spin up 5 docker containers - one for each of our 4 services plus Jaeger. Now go to http://localhost:5000/orders/12345 and hit refresh a few times. I've set the services up to sometimes work and sometimes cause errors - there's a 20% chance that the auth app will return an error and a 30% chance that the simulated call to the external shipping service API will fail.
After refreshing http://localhost:5000/orders/12345 a few times, open up the Jaeger UI at http://localhost:16686/search and you'll see something like this:
http://localhost:5000/orders/12345 serves up the Gateway service, which is a pretty simple one-file express app that will call the Auth service on every request, then make calls to the Orders service. The Orders service in turn calls the Shipping service, which makes a simulated call to the external shipping API.
Clicking into one of the traces will show you something like this:
This view shows that the request took 44ms to complete, and has a nice breakdown of where that time was spent. The services are color coded automatically so you can see at a glance how the 44ms was distributed across them. In this case we can see that there was an error in the shipping service. Clicking into the row with the error yields additional information useful for debugging:
The contents of this row are highly customizable. It's easy to tag the request with whatever information you like. So let's see how this works.
Let's look at the Gateway service. First we set up the Jaeger integration:
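The setup looks roughly like this (a minimal sketch using the jaeger-client package; the exact file layout and options in the repo may differ):

```js
// tracing.js - create and export a Jaeger tracer for this service
const { initTracer } = require('jaeger-client');

const config = {
  serviceName: 'gateway',
  sampler: {
    type: 'const', // sample every request (fine for a demo)
    param: 1,
  },
  reporter: {
    // The jaeger hostname comes from the docker-compose network
    collectorEndpoint: 'http://jaeger:14268/api/traces',
    logSpans: true,
  },
};

const options = {
  logger: console, // anything with info() and error() will do
};

module.exports = initTracer(config, options);
```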
The most interesting stuff here is where we declare our config. Here we're telling the Jaeger client tracer to post its traces to http://jaeger:14268/api/traces (this is set up in our docker-compose file), and to sample all requests - as specified in the sampler config. In production, you won't want to sample every request - one in a thousand is probably enough - so you can switch to type: 'probabilistic' and param: 0.001 to achieve this.
Now that we have our tracer, let's tell Express to instrument each request that it serves:
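A hedged sketch of that middleware (stashing the span on req is just one convenient approach - the repo may structure this differently):

```js
// gateway/index.js - open an "outer" span for every incoming request
const express = require('express');
const tracer = require('./tracing');

const app = express();

app.use((req, res, next) => {
  // Start a span titled after the request url - the root of this trace
  const span = tracer.startSpan(req.path);

  // Make the span available to the route handlers below
  req.span = span;

  // Finish the span once the response has been sent
  res.on('finish', () => span.finish());

  next();
});
```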
Here we're setting up our outer span and giving it a title matching the request url. We encounter 3 of the 4 simple concepts we need to understand: starting a span, finishing it once the work is done, and injecting it into downstream requests.
Now let's see how we call the Auth service, passing along the span ID:
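Roughly like this (a sketch - the hostnames, ports and paths are placeholders, not necessarily what the example repo uses):

```js
// gateway - check the request with the Auth service before doing anything else
const request = require('superagent');
const { FORMAT_HTTP_HEADERS, Tags } = require('opentracing');
const tracer = require('./tracing');

async function authenticate(req, res) {
  // A new span for the Auth call, parented to the outer request span
  const span = tracer.startSpan('auth', { childOf: req.span });

  // Inject the span context into the headers we send to the Auth service
  const headers = {};
  tracer.inject(span, FORMAT_HTTP_HEADERS, headers);

  try {
    await request.get('http://auth:3000/auth').set(headers);
    span.finish();
    return true;
  } catch (err) {
    // Tag the span with the status code we're about to return to the client
    span.setTag(Tags.HTTP_STATUS_CODE, 401);
    span.setTag(Tags.ERROR, true);
    span.finish();
    res.status(401).send('Unauthorized');
    return false;
  }
}
```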
There are 2 important things happening here: first, we create a new span representing the call to the Auth service and make it a child of the outer span we created in the middleware; second, we inject that span into the headers of the outgoing request, so the Auth service can pick up the trace.
We're also showing how to add tags to a span via setTag. In this case we're appending the HTTP status code that we return to the client.
Let's examine the final piece of the Gateway service - the actual proxying to the Orders service:
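Continuing the gateway sketch above (the route shape and hostnames are assumptions):

```js
// gateway - proxy the order lookup through to the Orders service
app.get('/orders/:orderId', async (req, res) => {
  if (!(await authenticate(req, res))) return;

  // A new span for the Orders call, parented to the outer request span
  const span = tracer.startSpan('orders', { childOf: req.span });

  // Inject the span into the superagent call so Orders can continue the trace
  const headers = {};
  tracer.inject(span, FORMAT_HTTP_HEADERS, headers);

  try {
    const response = await request
      .get(`http://orders:3000/orders/${req.params.orderId}`)
      .set(headers);
    res.json(response.body);
  } catch (err) {
    span.setTag(Tags.ERROR, true);
    res.status(500).send('Could not fetch order');
  } finally {
    span.finish();
  }
});
```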
This looks pretty similar to what we just did for the Auth service - we're creating a new span that represents the call to the Orders service, setting its parent to our outer span, and injecting it into the superagent call we make to Orders. Pretty simple stuff.
Finally, let's look at the other side of this - how to pick up the trace in another service - in this case the Auth service:
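In the Auth service that might look something like this (again a sketch - the route, port and error-simulation details are assumptions, beyond the 20% failure rate mentioned earlier):

```js
// auth/index.js - pick up the trace that the Gateway started
const express = require('express');
const { FORMAT_HTTP_HEADERS, Tags } = require('opentracing');
const tracer = require('./tracing'); // same initTracer setup, but serviceName: 'auth'

const app = express();

app.get('/auth', (req, res) => {
  // Extract the span context that the Gateway injected into the HTTP headers...
  const parentContext = tracer.extract(FORMAT_HTTP_HEADERS, req.headers);

  // ...and start a new span in this service with it as the parent
  const span = tracer.startSpan('authenticate', { childOf: parentContext });

  // Fail roughly 20% of the time to simulate auth errors
  if (Math.random() < 0.2) {
    span.setTag(Tags.ERROR, true);
    span.log({ event: 'error', message: 'simulated auth failure' });
    span.finish();
    return res.status(401).send('Unauthorized');
  }

  span.finish();
  res.send('OK');
});

app.listen(3000);
```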
Here we see the 4th and final concept involved in distributed tracing: extracting the span context from the incoming request.
This is how the trace is able to traverse our services - in service A we create a span and inject it into calls to service B. Service B picks it up and creates a new span with the extracted span as its parent. We can then pass this span ID on to service C.
Jaeger is even nice enough to automatically create a system architecture diagram for you:
Distributed tracing is immensely powerful when it comes to understanding why distributed systems behave the way they do. There is a lot more to distributed tracing than we covered above, but at its core it really comes down to those 4 key concepts: starting spans, finishing them, injecting them into downstream requests and extracting them from the upstream.
One nice attribute of open tracing standards is that they work across technologies. In this example we saw how to hook up 4 Node JS microservices with it, but there's nothing special about Node JS here - this stuff is well supported in other languages like Go and can be added pretty much anywhere - it's just basic UDP and (usually) HTTP.
For further reading I recommend you check out the Jaeger intro docs, as well as the architecture. The Node JS Jaeger client repo is a good place to poke around, and has links to more resources. Actual example code for Node JS was a little hard to come by, which is why I wrote this post. I hope it helps you in your microservice applications.