Telemetry with Scala, part 1: OpenTelemetry
Introduction
This blog post is the first in a series of articles aimed at surveying the landscape of metrics and telemetry solutions available across different Scala ecosystems, including some examples of APM solutions. I won't be able to cover every possible solution, or every combination of them, but I will try to cover the main ones.
Please note: this series is not going to compare telemetry products as a whole with each other. Instead, it focuses on how to use each telemetry framework in the different Scala ecosystems, and highlights some pitfalls from that perspective only.
Since this is quite a large topic to discuss, only a small portion of the code samples is present here. Please see this repository for the complete results.
As the title says, in this particular post we will focus on the OpenTelemetry framework.
Concepts
But before moving forward, it's worth mentioning that monitoring and telemetry is a pretty wide technical topic, and I won't be able to cover it all. To get us on the same page, let's first introduce a short list of the main concepts I'm going to refer to. Definitions differ from source to source, so I'll give some averaged notions:
- Metrics — abstract signals sent from a service, resource, etc.
- Span — data about a single logical unit of execution. For instance: timings of a service method execution.
- Trace — data about a single logical transaction execution. For instance: handling an HTTP request.
- Instrumentation — any tools or libraries which understand the specifics of the technology stack of the service they run inside and can scrape valuable metrics or spans.
- Agent — a service collecting or scraping data from target services or resources.
- APM (application performance monitoring) — a complex software solution capable of storing, visualizing, and alerting on received metrics. For example: Splunk, New Relic, etc.
Overall, when we're talking about any telemetry solution, we usually mean building a complex setup consisting of certain instrumentation, an agent/storage, and a visualization/APM solution. Schematically, it is shown in the picture below:
Ecosystems
The following ecosystems are going to be considered in the series:
- Lightbend — mostly known for `akka` and surrounding projects, such as `akka-http`, `akka-streams`, etc.;
- Typelevel — an ecosystem built on top of the `cats-effect` library, including `fs2`, `doobie`, `http4s`, etc.
Looking ahead: this and the next post focus on the Lightbend stack, and the third will describe monitoring in the Typelevel stack.
System under monitoring
To compare how to plug in and use certain telemetry solutions in an end service, let's consider a single example service: a task ticketing system, similar to the well-known Jira or Asana. The domain model is pretty simple: there is a `Ticket`, representing a single task, and a `Project`, which contains multiple tickets. At a high level, this system consists of microservices responsible for project management (`projects-service`), ticket management (`tickets-service`), and ticket change notifications, e.g. emails (`notification-service`). Since there is plenty of stuff to monitor, let's keep our focus on `tickets-service` only. The `projects-service` is stubbed in our case and provides an API just to fetch a project by its id. The `notification-service` is shown as a theoretical service.
For the sake of example, the task tickets system is much simplified compared to the real-world production system and does not include things like auth, caching, load balancing, etc. Along with that, the technical stack has been chosen to be more or less diverse to simulate a real production system to some extent, hence certain architectural decisions might not look perfect.
At a high level, the task ticketing system we are going to monitor looks as follows:
- `tickets-service` — microservice for task tickets; also provides full-text search capabilities and publishes changes to a Kafka topic.
- `notification-service` — microservice for user subscriptions; sends emails for subscribed tickets. Present only to show the Kafka intention; not part of the implementation.
- `projects-service` — microservice for the projects which tickets live in. Stubbed with static data.
Please note: the implementation of each service depends on the stack we consider, but the overall architecture remains the same for the entire series.
`tickets-service` API
The example ticket model is the following:
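The original code sample is not reproduced here, but based on the API description below, the ticket model could be sketched as the following case class. The exact field set is an assumption; the real model in the repository may differ:

```scala
// Hypothetical sketch of the ticket domain model; the exact fields
// in the repository may differ.
final case class Ticket(
  id: Long,                    // primary key in the Postgres `tickets` table
  project: Long,               // id of the Project the ticket belongs to
  title: String,               // indexed in Elasticsearch for full-text search
  description: Option[String], // also indexed in Elasticsearch
  createdAt: java.time.Instant,
  modifiedAt: Option[java.time.Instant]
)
```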
`tickets-service` exposes the following API endpoints.

`POST /v1/tickets` — create a single ticket:
- Request the project by the `project` id field from `projects-service` to verify that the project exists;
- Insert a record into the Postgres `tickets` table;
- Index the ticket document into the `tickets` Elasticsearch index;
- Send a message to the Kafka `tickets` topic identifying a ticket created event.
`GET /v1/tickets?search={search-query}&project={project-id}` — perform a full-text search over all tickets:
- Search tickets by `title` and `description` in Elasticsearch;
- Fetch the full ticket records from the Postgres `tickets` table by the `id`s found in Elasticsearch.
`PUT /v1/tickets` — update a single ticket:
- Update the record in the Postgres `tickets` table;
- Index the ticket document into the `tickets` Elasticsearch index;
- Send a message to the Kafka `tickets` topic identifying a ticket updated event.
`DELETE /v1/tickets/:id` — delete a single ticket:
- Delete the ticket by id from both Postgres and Elasticsearch.
Things to monitor
Moving on to the data we would like to get from our telemetry:
- HTTP request spans — timings and details about the handling of each request; ideally, traces showing where the application spends its time.
- Ticket count — a custom metric showing the number of existing tickets.
Simulating user traffic
To get some data for any telemetry from `tickets-service`, we need to send some requests. I decided to use Gatling for this. Although it is primarily a load and performance testing tool, let's omit the testing aspect and use it just to simulate user traffic. The following scenario for a single user is used to put some load on the service:
- Create 20 tickets for 10 different projects via the `POST /v1/tickets` endpoint;
- Search tickets for 10 different projects via the `GET /v1/tickets?search={search-query}&project={}` endpoint;
- Update 20 tickets via the `PUT /v1/tickets` endpoint;
- Delete 30 tickets via the `DELETE /v1/tickets/:id` endpoint.
This scenario is then used to simulate the activity of 100 concurrent users.
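As a sketch, such a simulation might look like the following, assuming Gatling's standard DSL; the base URL and request bodies are illustrative, and the real simulation in the repository may differ:

```scala
import io.gatling.core.Predef._
import io.gatling.http.Predef._

// Illustrative traffic simulation for tickets-service.
class TicketsSimulation extends Simulation {

  private val httpProtocol = http.baseUrl("http://localhost:8080")

  private val scn = scenario("tickets-service user")
    .repeat(20)(
      exec(
        http("create ticket")
          .post("/v1/tickets")
          .body(StringBody("""{"project": 1, "title": "ticket", "description": "text"}"""))
          .asJson
      )
    )
    .repeat(10)(
      exec(http("search tickets").get("/v1/tickets?search=ticket&project=1"))
    )

  // 100 users started at once, as described above
  setUp(scn.inject(atOnceUsers(100))).protocols(httpProtocol)
}
```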
Implementation details
The following libraries and frameworks were used to implement `tickets-service` in the Lightbend ecosystem:
- `akka-http` — for the HTTP server and client;
- `slick` — for the database access layer, along with the Postgres JDBC driver;
- `elastic4s` — not really a Lightbend-supported library, but it provides a convenient Scala API abstracted over the underlying effect type, hence it can be used with `Future`, which is our case;
- `kafka` — the plain Java client, used to write records to the `tickets` topic.
OpenTelemetry
Instrumentation
OpenTelemetry is, as the documentation says:
OpenTelemetry is a set of APIs, SDKs, tooling and integrations that are designed for the creation and management of telemetry data such as traces, metrics, and logs.
As you may see, there is nothing Scala-specific, and so far it has no dedicated support for the variety of Scala libraries; luckily, it provides a number of JVM instrumentations. In the `tickets-service` case, we are interested in the following instrumentations: "Akka HTTP", "JDBC", "Elasticsearch Client", and "Kafka Producer/Consumer API".
Even though the list of available instrumentations is pretty long, I strongly encourage you to check whether it fits your particular needs.
Export
OpenTelemetry is designed in such a way that the instrumentation is abstracted over the backend it exposes metrics to. Such an approach allows supporting a variety of exporters, both pull- and push-based. See the Data Collection page for more details. We will consider the following exporters in the examples: Prometheus, Zipkin, and OTLP.
Talking about APM solutions, there are many services natively supporting OpenTelemetry, such as Splunk, Datadog, etc.
Automatic instrumentation
OpenTelemetry provides an easy way to plug in automatic instrumentation. Taking `sbt` specifics into account, it can be implemented with the following approach.
First, add OpenTelemetry dependencies to the project:
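A sketch of these dependencies in `build.sbt` (the coordinates are the standard OpenTelemetry ones; the versions are illustrative, so check Maven Central for the latest):

```scala
// build.sbt — OpenTelemetry dependencies (versions illustrative)
val openTelemetryVersion = "1.11.0"

libraryDependencies ++= Seq(
  "io.opentelemetry" % "opentelemetry-api" % openTelemetryVersion,
  "io.opentelemetry" % "opentelemetry-sdk-extension-autoconfigure" % s"$openTelemetryVersion-alpha"
)
```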
Use the sbt-javaagent plugin to run the app with the `-javaagent` parameter. Add to `project/plugins.sbt`:
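For example (the plugin version is illustrative):

```scala
// project/plugins.sbt
addSbtPlugin("com.lightbend.sbt" % "sbt-javaagent" % "0.1.6")
```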
And add the following code to the project settings:
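A minimal sketch of the settings, assuming the module is named `tickets-service` (the agent version is illustrative):

```scala
// build.sbt — attach the OpenTelemetry Java agent at runtime
lazy val ticketsService = (project in file("tickets-service"))
  .enablePlugins(JavaAgent)
  .settings(
    javaAgents += "io.opentelemetry.javaagent" % "opentelemetry-javaagent" % "1.11.0" % "runtime"
  )
```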
The agent configuration can be supplied through environment variables or Java options (i.e. `-D` arguments). Automatic tracing covers the "HTTP requests spans" monitoring requirement.
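For example, a minimal agent configuration might look like this (values illustrative):

```shell
# via environment variables
export OTEL_SERVICE_NAME=tickets-service

# ...or, equivalently, via a Java option when launching the service
java -javaagent:opentelemetry-javaagent.jar \
     -Dotel.service.name=tickets-service \
     -jar tickets-service.jar
```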
Metrics
Apart from automatic tracing, OpenTelemetry makes it possible to expose custom metrics. Please pay attention to this note from the documentation:
Note: The stability of metrics in opentelemetry-java is mixed. The first stable metrics API was released in version 1.10.0; however, the metrics SDK is alpha and still subject to change.
As said above, we want to track the total number of tickets. First, let's instantiate a counter for tickets:
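A sketch of the counter instantiation using the OpenTelemetry metrics API; the meter and counter names are assumptions:

```scala
import io.opentelemetry.api.GlobalOpenTelemetry
import io.opentelemetry.api.metrics.LongCounter

// Obtain a meter from the globally configured SDK (set up by the Java agent)
// and create a counter for the total number of tickets.
val ticketsCounter: LongCounter =
  GlobalOpenTelemetry
    .getMeter("tickets-service")
    .counterBuilder("tickets_count")
    .build()
```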
Great, now we can utilize `ticketsCounter` to track the number of tickets, for instance by incrementing it: `ticketsCounter.add(1)`. Now we have plugged in instrumentation and custom metrics for `tickets-service`. The full service implementation can be found by this link.
Metrics example: Prometheus
Since OpenTelemetry supports integration with Prometheus for metrics exporting, we can use it to monitor `tickets_count`. Let's start a local Prometheus instance using `docker-compose` (partial example):
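A minimal sketch of such a service definition (image tag and config path illustrative):

```yaml
# docker-compose.yml (partial)
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
```

The mounted `prometheus.yml` is assumed to contain a scrape job pointing at `tickets-service:9094`.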
Set the necessary environment variables on `tickets-service` for the Prometheus exporter (see this link for more details):
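These variables might look as follows (the port matches the one the Prometheus agent will scrape):

```shell
export OTEL_METRICS_EXPORTER=prometheus
export OTEL_EXPORTER_PROMETHEUS_HOST=0.0.0.0
export OTEL_EXPORTER_PROMETHEUS_PORT=9094
```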
Don’t forget to expose port `9094` on `tickets-service` so the Prometheus agent can scrape the metrics.
Let’s start the whole setup and run the Gatling tests afterwards. On the Prometheus UI at `localhost:9090`, we can find the `tickets_count` metric:
The full docker-compose file can be found by this link.
Tracing example: Zipkin
As mentioned earlier, OpenTelemetry supports an exporter for Zipkin, but this time for spans only. To send some span data, let’s start a local Zipkin instance using the following `docker-compose`:
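A minimal sketch of such a service definition (image tag illustrative):

```yaml
# docker-compose.yml (partial)
services:
  zipkin:
    image: openzipkin/zipkin:latest
    ports:
      - "9411:9411"
```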
Then set the necessary environment variables on `tickets-service` for the Zipkin exporter (see this link for more details):
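These variables might look as follows (the endpoint assumes the Zipkin container from the docker-compose above):

```shell
export OTEL_TRACES_EXPORTER=zipkin
export OTEL_EXPORTER_ZIPKIN_ENDPOINT=http://zipkin:9411/api/v2/spans
```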
Let’s start the whole setup and run the Gatling tests afterwards. On the Zipkin UI at `localhost:9411`, we can find some `tickets-service` traces:
If we open some example request, for instance `DELETE /tickets/:id`, not many details can be found:
The full docker-compose file can be found by this link.
APM Example: Datadog
Within complex APM solutions, like Datadog, it is possible to combine and monitor both metrics and spans in a single place. OpenTelemetry offers its own protocol called OTLP, whose main advantage is that it supports exporting metrics, spans, and logs (not covered here) simultaneously. OTLP is a push-based protocol, meaning the application must send data to a collector.
In the case of the Datadog APM, our application won’t send data directly to the Datadog site or API. Instead, we need to set up and configure the collector. See the documentation for more details. Let’s start with the collector configuration:
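A sketch of such a collector configuration, assuming the Datadog exporter from the contrib distribution and the API key supplied via an environment variable:

```yaml
# otel-collector-config.yaml — illustrative collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  datadog:
    api:
      key: ${DD_API_KEY}

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [datadog]
    metrics:
      receivers: [otlp]
      exporters: [datadog]
```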
Great, now we can use this configuration to extend the collector base Docker image and build our own:
Please note: there are two similar base Docker images for OpenTelemetry collectors:
- opentelemetry-collector — the OpenTelemetry Collector core distribution.
- opentelemetry-collector-contrib — OpenTelemetry Collector customizations which are not part of the core distribution. We need exactly this one.
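A sketch of such a Dockerfile (the config path inside the image may vary between collector versions):

```dockerfile
# Extend the contrib collector image with our configuration
FROM otel/opentelemetry-collector-contrib:latest
COPY otel-collector-config.yaml /etc/otelcol-contrib/config.yaml
```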
Afterwards, our customized collector can be used, for example, in docker-compose:
The collector part is ready. Now we need to configure the application instrumentation. For this, specify the necessary environment variables on `tickets-service` for the OTLP exporter (see this link for more details). Pay attention to one detail: we need to set both `OTEL_TRACES_EXPORTER` and `OTEL_METRICS_EXPORTER` to point to the `otlp` exporter.
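These variables might look as follows (the endpoint assumes the collector container from the docker-compose setup):

```shell
export OTEL_TRACES_EXPORTER=otlp
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
```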
Let’s start the whole setup and run the Gatling tests afterwards. First, we can check `tickets_count` in the "Metrics" section of Datadog:
Awesome, that works. Now moving on to traces: open "APM" -> "Traces", where you can find plenty of tracked spans.
We can have a closer look, for instance, at some `POST /tickets` endpoint invocation, which is responsible for ticket creation. Choose any trace for this endpoint and open the "Span List":
In this list you can observe all the requests `tickets-service` makes to other services while creating a new ticket. The full docker-compose file can be found by this link. Of course, this is just one example of APM usage; check opentelemetry-collector-contrib and the documentation of the APM you are interested in. Most likely it supports one of the protocols supported by OpenTelemetry (such as Jaeger) or has a dedicated OpenTelemetry collector exporter.
Conclusions
In conclusion, I would like to share the list of pros and cons of using OpenTelemetry for Scala and Lightbend stack projects.
Pros:
- An open, vendor-agnostic standard with wide support among free and commercial monitoring solutions. It is worth highlighting that OpenTelemetry is not just a library designed for the JVM: the whole standard includes, but is not limited to, instrumentation for multiple languages, a standard protocol for metrics, spans, tracing, and logging data, and a variety of exporters for free and commercial backends.
- Effortless to plug in automatic telemetry. This is a huge benefit: with just a few environment variables and a Java agent, you can start sending telemetry data to almost any monitoring backend you need.
Cons:
- Not that many natively supported Scala libraries — for instance, the Slick framework. Yes, OpenTelemetry can instrument the low-level JDBC API, but in complex cases it is better to have higher-level data.
- Some dependencies at version `1.11.0` are still in `alpha`, e.g. `opentelemetry-sdk-extension-autoconfigure`.
Last but not least, I’d like to emphasize: choose an APM, or build your own telemetry solution, bearing in mind the needs of all the infrastructure you have to monitor, not only the ecosystems or programming language of your software.
In the next part of “Telemetry with Scala”, we will dive into the Kamon framework and its usage in conjunction with the Lightbend stack.
Thank you!