Monitoring with Prometheus, Loki & Grafana

Have you ever wondered how this Grafana thing works? What Prometheus is, or if there’s more to Loki than Tom Hiddleston? If so, keep on reading: we are going to explore a tiny deployment that sets up those components and integrates them with Varnish.

As Grafana and Prometheus are two of the quickest names to come to mind when talking about monitoring, it’s interesting to know at least one way to deploy them, which we will explore together here.

If you are in a hurry, you can check the code on GitHub, or jump directly to each subsection in the table of contents.

Overview

We are going to use containers to spin up:

  • A varnish server
  • An origin
  • A load generator to create some artificial traffic
  • A Prometheus server to collect metrics, and a Loki one for logs
  • Exporters to feed varnish data to Prometheus/Loki
  • A Grafana instance to visualize the exported data

As the title implies, this will be a simple setup to help understand the various moving parts involved, and it should not be taken straight into production. Some details have been omitted or simplified for the sake of clarity, for example:

  • Access management
  • Network security
  • Volume permissions
  • Dashboard structure
  • Alerting
  • Relevant metrics to monitor

Similarly, Grafana, Prometheus and Loki are infinitely configurable tools, and choices had to be made to avoid going too deep into rabbit holes. In short: the configuration is just complex enough to make things work and/or make a point; do not expect manual pages here.

Requirements

The setup relies entirely on docker compose, so make sure you have a relatively recent version installed. And of course you will need git to clone the repository.

Getting started

First, let’s get the code:

git clone https://github.com/varnish/toolbox.git
cd toolbox/docker-compose-examples/grafana-monitoring

There should be a bunch of files in this directory, the two most important ones being docker-compose.yaml and .env.

The presence of docker-compose.yaml means we can call

docker compose up

and Docker will spin up the containers described in the yaml file (using values found in .env). Running this should start filling your screen, starting with something like this:

[+] Running 9/9
 ✔ Network docker-compose_default             Created   0.1s
 ✔ Container docker-compose-promtail-1        Created   0.2s
 ✔ Container docker-compose-grafana-1         Created   0.2s
 ✔ Container docker-compose-loki-1            Created   0.2s
 ✔ Container docker-compose-prometheus-1      Created   0.2s
 ✔ Container docker-compose-origin-1          Created   0.1s
 ✔ Container docker-compose-varnish-1         Created   0.1s
 ✔ Container docker-compose-exporter-1        Created   0.1s
 ✔ Container docker-compose-load_generator-1  Created   0.1s
Attaching to exporter-1, grafana-1, load_generator-1, loki-1, origin-1, prometheus-1, promtail-1, varnish-1
promtail-1        | level=info ts=2024-01-17T23:12:31.589369204Z caller=promtail.go:133 msg="Reloading configuration file" md5sum=337a112dd5f6434312be86c5e0c8cdf9
promtail-1        | level=info ts=2024-01-17T23:12:31.58999493Z caller=server.go:334 http=[::]:9080 grpc=[::]:39375 msg="server listening on addresses"
promtail-1        | level=info ts=2024-01-17T23:12:31.590194311Z caller=main.go:174 msg="Starting Promtail" version="(version=2.8.7, branch=HEAD, revision=1dfdc432c)"
promtail-1        | level=warn ts=2024-01-17T23:12:31.59022805Z caller=promtail.go:265 msg="enable watchConfig"
prometheus-1      | ts=2024-01-17T23:12:31.734Z caller=main.go:539 level=info msg="No time or size retention was set so using the default time retention" duration=15d
prometheus-1      | ts=2024-01-17T23:12:31.734Z caller=main.go:583 level=info msg="Starting Prometheus Server" mode=server version="(version=2.48.1, branch=HEAD, revision=63894216648f0d6be310c9d16fb48293c45c9310)"
prometheus-1      | ts=2024-01-17T23:12:31.734Z caller=main.go:588 level=info build_context="(go=go1.21.5, platform=linux/amd64, user=root@71f108ff5632, date=20231208-23:33:22, tags=netgo,builtinassets,stringlabels)"
prometheus-1      | ts=2024-01-17T23:12:31.734Z caller=main.go:589 level=info host_details="(Linux 6.6.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Thu, 21 Dec 2023 19:01:01 +0000 x86_64 11171d61dfae (none))"
prometheus-1      | ts=2024-01-17T23:12:31.734Z caller=main.go:590 level=info fd_limits="(soft=1073741816, hard=1073741816)"
prometheus-1      | ts=2024-01-17T23:12:31.734Z caller=main.go:591 level=info vm_limits="(soft=unlimited, hard=unlimited)"
prometheus-1      | ts=2024-01-17T23:12:31.735Z caller=web.go:566 level=info component=web msg="Start listening for connections" address=0.0.0.0:9090

And you’ll see the containers happily push their logs, each with a different header and color. You’ll be able to shut everything off with a simple Ctrl+C, but not now.

Instead, head over to http://localhost:3000/, where an active dashboard should greet you.

Let’s start poking around to see how all this works in the next sections.

Varnish, the origin and the client

Origin

We’re going to take it easy and start with the (very simple) origin. It’s almost entirely contained in this section of docker-compose.yaml:

  origin:
    image: python
    command: python -m http.server --directory /orig_data 80
    volumes:
      - ${ORIGIN_DATA}:/orig_data

This defines the origin service and tells docker compose to use a python image to run python -m http.server …, which is a file server listening on port 80 and exposing the container’s /orig_data directory.
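For the curious, here is roughly what that one-liner amounts to when written out with Python's standard library. It is only a sketch: the make_file_server helper name is ours and does not exist in the repository.

```python
import functools
import http.server


def make_file_server(directory, port=80, host=""):
    """Build (but do not start) a static file server rooted at `directory`."""
    # SimpleHTTPRequestHandler serves files; `directory` picks the root folder
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer((host, port), handler)


# Usage (blocks until interrupted), equivalent in spirit to the compose command:
#   make_file_server("/orig_data", 80).serve_forever()
```

Nothing more is needed for an origin in this kind of demo, which is why the compose file gets away with a stock image and a single command.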

As we can see in the volumes section, /orig_data is actually mounted from the host. To know which host directory is mounted, we need to look at the .env file, which shows:

ORIGIN_DATA=./data

In other words: we mount our local data/ directory into the container and expose its content with a python server. Easy peasy.
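If the ${ORIGIN_DATA} substitution feels a bit magical, the following sketch mimics the simplest case of it in Python. It is an approximation for illustration only (docker compose supports more syntax, such as default values), and the parse_env and interpolate names are made up for this example.

```python
import string


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def interpolate(template, env):
    """Replace ${NAME} placeholders with values from `env`."""
    return string.Template(template).substitute(env)


env = parse_env("ORIGIN_DATA=./data\n")
print(interpolate("${ORIGIN_DATA}:/orig_data", env))  # ./data:/orig_data
```

The same mechanism is at work for every ${...} placeholder we will meet in docker-compose.yaml.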

Varnish

Now it gets fun! Fun, but not really complicated. Here are the important .env and docker-compose.yaml snippets:

# docker-compose.yaml
services:
  varnish:
    image: varnish
    command: -n varnish
    volumes:
      - workdir:/var/lib/varnish
      - ${VARNISH_VCL}:/etc/varnish/default.vcl
    ports:
      - "${HTTP_PORT}:80"
    depends_on:
      - origin
...
volumes:
  workdir:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs

# .env
HTTP_PORT=80
VARNISH_VCL=./conf/default.vcl

We define a varnish service, as expected, but we add a few bells and whistles.

Fixed work directory: -n varnish gets added to the container’s default command (check the “Extra arguments” section on the official page). This makes the directory predictable, as it’s otherwise inferred from the machine host name, which will change for each container.

Shared work directory: we define a tmpfs volume called workdir, and we ask our container to mount it as /var/lib/varnish, so that any container mounting it can access the logs and metrics stored in it.

Startup order: with depends_on, Varnish waits for the origin service to be started before starting itself. We’ll see why in the VCL explanation.

Mounted configuration: our local ./conf/default.vcl is mounted as /etc/varnish/default.vcl, meaning that’s what Varnish will run.

Port mapping: Varnish’s listening port will be exposed externally on port 80 on your machine, but you can change it by editing HTTP_PORT.

Testing

And we can check with curl that varnish is indeed reachable and that the file server is indeed working:

$ curl localhost:80 -I
HTTP/1.1 200 OK
Server: SimpleHTTP/0.6 Python/3.12.1              <----- the origin works!
Date: Thu, 18 Jan 2024 05:16:49 GMT
Content-type: text/html; charset=utf-8
Content-Length: 232
X-Varnish: 98506 33130                            <----- varnish too!
Age: 13
Via: 1.1 e072ed301501 (Varnish/7.4)
Accept-Ranges: bytes
Connection: keep-alive
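A small detail worth knowing when reading that output: on a cache hit, X-Varnish carries two transaction IDs (the current request, then the request that originally stored the object), while a miss only carries one. Here is a minimal sketch that applies this rule to a dictionary of response headers; the is_cache_hit name is ours.

```python
def is_cache_hit(headers):
    """Return True when X-Varnish lists two transaction IDs (a cache hit)."""
    xids = headers.get("X-Varnish", "").split()
    return len(xids) == 2


sample = {"X-Varnish": "98506 33130", "Age": "13"}
print(is_cache_hit(sample))  # True
```

So the response above was served from the cache, which the non-zero Age header confirms.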

VCL

Finally, we should really look at the VCL. Granted, it’s pretty dumb, but there’s a point to be made here:

vcl 4.1;

backend default { .host = "origin"; }

sub vcl_recv {
    # don't cache if the request path or querystring contains "uncacheable"
    if (req.url ~ "uncacheable") {
        return (pass);
    # create a synthetic response for healthcheck requests
    } else if (req.url == "/healthcheck") {
        return (synth(200));
    # otherwise, cache
    } else {
        return (hash);
    }
}

sub vcl_backend_response {
    set beresp.ttl = 10s;
}

Of course, feel free to update that VCL to your needs, it’s really just an example.
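If VCL is new to you, the decision taken in vcl_recv can be mirrored in a few lines of Python. This is only a reading aid, not something the setup runs, and the recv_action name is invented for the occasion.

```python
import re


def recv_action(url):
    """Mirror the three branches of vcl_recv for a request URL."""
    # VCL's ~ operator is an unanchored regex match, like re.search
    if re.search("uncacheable", url):
        return "pass"
    # == is an exact string comparison
    if url == "/healthcheck":
        return "synth"
    return "hash"
```

Note that the first test wins: /healthcheck?uncacheable would be passed to the origin rather than answered synthetically.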

However, please note the .host field of the backend: it’s origin. It matches the name of our origin service, and docker compose will provide DNS resolution to the actual (somewhat random) IP address it created for the origin container.

And as you surely know, Varnish resolves that host name when it loads the configuration, so we really, really need origin to exist before Varnish starts so that the latter can resolve the host of the former. That’s the whole reason for the depends_on field we saw above.

Load generator

This part is completely optional and really only exists so that we have some pretty graphs and logs already populated. You can find the complete code in load_generator.sh, which is run directly by a dedicated container:

services:
  load_generator:
    image: curlimages/curl
    volumes:
      - ./load_generator.sh:/usr/bin/load_generator
    command: /usr/bin/load_generator
    depends_on:
      - varnish

The code itself is just a few lines of shell script that continuously create a bunch of requests tailored to our default.vcl shown above:

  • Health checks to generate synthetic responses
  • Some cacheable requests…
  • And some uncacheable ones, less frequent
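To give an idea of the traffic mix, here is a hypothetical Python rendition of such a request pattern. The paths and ratios below are assumptions made for this sketch; the real ones live in load_generator.sh.

```python
import itertools

# Hypothetical mix: one health check, three cacheable, one uncacheable
PATTERN = ["/healthcheck", "/", "/", "/", "/?uncacheable"]


def request_paths(count):
    """Return `count` request paths, cycling through PATTERN."""
    return list(itertools.islice(itertools.cycle(PATTERN), count))
```

Feeding these paths to any HTTP client in a loop would exercise all three branches of the VCL.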

That’s it, we’ve covered the spine of the setup, giving our varnish service both a backend and some traffic. Now it’s time to see how we can monitor it.

Prometheus, Loki and their exporters

We have looked at the core setup that serves content, now it’s time to explore the monitoring section of the project. In other words: how do we extract metadata from the system, and how do we store it?

In our case, we are interested in two dimensions:

  • Metrics: either counters (only go up) such as the number of requests served, or gauges (go up and down), for example the current space used in cache
  • Logs: some text, formatted or not, describing an event

For both of these, we are going to have a store holding the data durably, and an exporter to extract data from varnish and get it into said store.
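The counter/gauge distinction matters when graphing: a gauge can be plotted as-is, but a counter is usually turned into a rate first. The sketch below shows the idea in plain Python, with a simplified handling of counter resets; it is in the spirit of what a query engine does, not a copy of any real implementation, and the counter_rate name is ours.

```python
def counter_rate(samples):
    """Per-second increase of a counter from (timestamp, value) samples.

    A drop in value is treated as a reset (e.g. a restart), so counting
    resumes from zero instead of producing a negative rate.
    """
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# 60 more requests over 30 seconds
print(counter_rate([(0, 100), (15, 130), (30, 160)]))  # 2.0
```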

Metrics

Prometheus

Prometheus is an open-source system monitoring and alerting toolkit, that collects and stores its metrics as time-series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.

Within the docker-compose.yaml, we call Prometheus like this:

prometheus:
    image: prom/prometheus
    volumes:
      - ${PROMETHEUS_CONF}:/etc/prometheus/prometheus.yml

The PROMETHEUS_CONF environment variable is defined in the hidden .env file and looks like this:

PROMETHEUS_CONF=./conf/prometheus.yml

In other words, we are just using the official Prometheus image, feeding it our custom prometheus.yml:

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.

# define which exporters to scrape
scrape_configs:
  - job_name: varnish
    # 'exporter' is our exporter container in `docker-compose.yml`
    # metrics_path defaults to '/metrics', scheme to 'http'
    static_configs:
      - targets: ["exporter:9131"]

We can see here that Prometheus is simply “scraping” the exporter for new stats every 15 seconds, and we only have one target: the exporter container that we’re going to explain right away.
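What Prometheus gets back from a scrape is plain text, one sample per line. As an illustration, here is a tiny parser for the simplest form of that format. The metric names in SAMPLE are illustrative rather than captured from a real run, and the parser ignores refinements such as optional timestamps.

```python
def parse_metrics(text):
    """Parse 'name value' and 'name{labels} value' lines into a dict."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        name, _, value = line.rpartition(" ")
        metrics[name] = float(value)
    return metrics


SAMPLE = """\
# HELP varnish_main_client_req Good client requests received
# TYPE varnish_main_client_req counter
varnish_main_client_req 1234
varnish_up 1
"""

print(parse_metrics(SAMPLE))
```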

exporter

The exporter is a third-party open-source project that uses varnishstat internally (its JSON output, more precisely), thus we need to build a container that has both Varnish and the exporter installed. We are going to do this with a custom Dockerfile, except that here, the file is going to be inlined in docker-compose.yaml, under the cleverly named dockerfile_inline field. If Docker sees that the image doesn’t exist, it will build it using those instructions.

exporter:
    image: stat-exporter
    command: /tmp/prometheus_varnish_exporter -n varnish -verbose
    volumes:
      - workdir:/var/lib/varnish
    depends_on:
      - varnish
    build:
      dockerfile_inline: |
        FROM curlimages/curl
        RUN set -ex; \
            cd /tmp/; \
            curl -L https://github.com/jonnenauha/prometheus_varnish_exporter/releases/download/1.6.1/prometheus_varnish_exporter-1.6.1.linux-amd64.tar.gz | tar xvzf - --strip 1
        FROM varnish
        COPY --from=0 /tmp/prometheus_varnish_exporter /tmp/prometheus_varnish_exporter        

volumes:
  workdir:
    driver: local
    driver_opts:
      type: tmpfs
      device: tmpfs

An important note is the workdir volume that is defined at the root level as tmpfs and mounted as /var/lib/varnish in the container, which we also do in the Varnish container, remember?

We do this because internally the exporter uses varnishstat, which needs access to varnishd’s work directory. The problem is that this directory is in the varnish container! So we define the workdir volume globally, and we ask every container that needs read or write access to it to simply mount it. You are also going to see this pattern for the promtail container, which is logical, since it uses varnishncsa to extract logs.

Logs

Loki

Loki describes itself on GitHub as “like Prometheus, but for logs”: it focuses on logs instead of metrics, receives them via push instead of pull, and uses a label-based approach for indexing.

A Loki-based logging stack consists of 3 components:

  • Promtail is the agent, responsible for gathering logs and sending them to Loki.
  • Loki is the main server, responsible for storing logs and processing queries.
  • Grafana for querying and displaying the logs.

The container definition is our shortest yet:

loki:
    image: grafana/loki:2.9.0
    command: -config.file=/etc/loki/local-config.yaml

It’s the default image, running the default configuration. We don’t mount or redefine anything, meaning we can expect Loki to listen on its default port, 3100.

Promtail

Promtail is an agent that ships the contents of local logs to a private Grafana Loki instance or Grafana Cloud. It is usually deployed to every machine that has applications needing monitoring. It is maintained by Grafana and can take its input from stdin, meaning we can just pipe the varnishncsa output to it and everything will work.

Promtail primarily:

  • Discovers targets.
  • Attaches labels to log streams.
  • Pushes them to the Loki instance.
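Since the lines travelling through that pipe are NCSA-style access logs, it may help to see how one is structured. The sketch below parses such a line with a regular expression; the SAMPLE line is made up for the example (not captured from the running stack), and parse_ncsa is a name of our own.

```python
import re

# host, ident, user, [time], "method url protocol", status, size
NCSA_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_ncsa(line):
    """Return the fields of an NCSA log line, or None if it doesn't match."""
    match = NCSA_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    return fields


SAMPLE = (
    '172.18.0.9 - - [18/Jan/2024:05:16:49 +0000] '
    '"GET http://varnish/ HTTP/1.1" 200 232 "-" "curl/8.5.0"'
)
print(parse_ncsa(SAMPLE)["status"])  # 200
```

In our setup nobody needs to parse these by hand: the lines are shipped as-is and queried later.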

We can see the Promtail service in this snippet from docker-compose.yaml:

  promtail:
    image: ncsa-promtail
    entrypoint: ""
    command: bash -c "varnishncsa -n varnish | promtail -config.file=/etc/promtail/config.yml --stdin"
    volumes:
      - ./conf/promtail-config.yaml:/etc/promtail/config.yml
      - workdir:/var/lib/varnish
    build:
      dockerfile_inline: |
        FROM curlimages/curl
        RUN set -ex; \
            curl -sLo /tmp/promtail.zip https://github.com/grafana/loki/releases/download/v2.8.7/promtail-linux-amd64.zip; \
            unzip /tmp/promtail.zip -d /tmp

        FROM varnish
        COPY --from=0 --chmod=555 /tmp/promtail-linux-amd64 /tmp/promtail        

We recognize the usual logic of mounting the configuration file, but also the workdir volume. The now familiar dockerfile_inline downloads a zip file rather than a tarball, but it’s all very straightforward.

The promtail-config.yaml looks like this:

clients:
  - url: http://loki:3100/loki/api/v1/push

Simply put, we are just indicating where the Loki instance is so that promtail can push to it. Notice how, in the metrics case, Prometheus (the storage) pulls data from the exporter, whereas in the log case the exporter pushes to the storage.
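To make the "push" part more concrete, here is a sketch of the JSON body that the /loki/api/v1/push endpoint accepts: a list of streams, each with a set of labels and a list of [timestamp, line] pairs. The loki_payload helper and the job label used in the usage comment are assumptions for illustration; promtail builds its own requests.

```python
import json
import time


def loki_payload(lines, labels, now_ns=None):
    """Build a JSON body for POST /loki/api/v1/push."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each entry is [timestamp in nanoseconds as a string, log line]
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return json.dumps({"streams": [{"stream": labels, "values": values}]})


# Usage: loki_payload(["GET / 200"], {"job": "varnish"})
```

Posting that body with a Content-Type of application/json to the url from promtail-config.yaml would store the lines under the given labels.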

Grafana

It’s now time to study the star of the show. With all the foundations dealt with, we can finally put our log and metrics visualizer on top of the stack.

Main configuration

Thanks to the previous work done to set up Prometheus and Loki, we can integrate Grafana with just a handful of yaml and json files. Here’s the entirety of the grafana configuration:

conf/grafana/
├── grafana.ini
└── provisioning
    ├── dashboards
    │   ├── default.yaml
    │   ├── varnish_logs.json
    │   └── varnish_metrics.json
    └── datasources
        ├── loki.yaml
        └── prometheus.yaml

First, let’s look at how we start grafana by looking at the relevant bits from docker-compose.yaml and .env:

# docker-compose.yaml:
  grafana:
    image: grafana/grafana-enterprise
    ports:
      - 3000:3000
    volumes:
      - ${GRAFANA_CONF}:/etc/grafana/grafana.ini
      - ${GRAFANA_PROVISIONING}:/etc/grafana/provisioning/

# .env
GRAFANA_CONF=./conf/grafana/grafana.ini
GRAFANA_PROVISIONING=./conf/grafana/provisioning

By now, this should look fairly standard: we expose port 3000 to the host, which is the one we used to reach the dashboard. And we mount grafana.ini and the provisioning/ directory into the container, at their default locations, where grafana expects them, and that’s it.

It turns out we don’t need to configure much either, at least in grafana.ini. Here is its full content:

[auth]
disable_login_form = true

[auth.anonymous]
enabled = true
org_role = Admin

[dashboards]
default_home_dashboard_path = /etc/grafana/provisioning/dashboards/varnish_metrics.json

This file does two things: it enables anonymous access with the Admin role (do NOT use this in production), and it selects a default dashboard as the landing page (the one that greeted us earlier). And to be honest, we could have provided an empty file, as grafana has default values for every option.
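Because that anonymous Admin setting is so easy to forget, a small guard can be handy before reusing the file elsewhere. The sketch below reads the relevant part of the file with Python's configparser; the anonymous_admin helper is hypothetical, and it only inspects the two options shown above.

```python
import configparser

GRAFANA_INI = """\
[auth]
disable_login_form = true

[auth.anonymous]
enabled = true
org_role = Admin
"""


def anonymous_admin(ini_text):
    """Return True if the config grants the Admin role to anonymous users."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    section = "auth.anonymous"
    return (
        cfg.getboolean(section, "enabled", fallback=False)
        and cfg.get(section, "org_role", fallback="Viewer") == "Admin"
    )


print(anonymous_admin(GRAFANA_INI))  # True
```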

Dashboards

Metrics

Let’s have a look at the file pointed at by default_home_dashboard_path. Like the other files, we can find it in the conf/ directory:

# cat conf/grafana/provisioning/dashboards/varnish_metrics.json
{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": {
          "type": "grafana",
<and it goes on for about 2000 lines>

That is just one big JSON file describing everything you can see on the landing screen we talked about: every graph, tooltip, and formula needed by grafana to present the page is encoded here.

This specific dashboard is really just an import of a popular Varnish dashboard on the grafana registry. If the file looks complex, do not worry, you’ll probably never update it manually. Rather, you’ll use the grafana interface to visually edit it, then you’ll export it as JSON. Much simpler!

However, there’s one part that is important in that file, or rather, many important parts, and they all look like this:


        {
          "datasource": {
            "type": "prometheus",
            "uid": "prometheusUID"
          },

They tell grafana which data source to use for each graph, and they use a uid to identify it. In this case, it’s prometheusUID. Remember it, it’ll come up again soon (you can also jump to the “Data sources” section a bit below)!
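With a file this large, it can be reassuring to list every uid it refers to instead of scrolling. Here is a minimal sketch that walks a parsed dashboard recursively; the datasource_uids name is ours, and the function only assumes the "datasource" blocks look like the one above.

```python
def datasource_uids(node):
    """Recursively collect every datasource uid found in a dashboard."""
    found = set()
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and "uid" in ds:
            found.add(ds["uid"])
        for value in node.values():
            found |= datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            found |= datasource_uids(item)
    return found


# Usage:
#   import json
#   with open("conf/grafana/provisioning/dashboards/varnish_metrics.json") as fh:
#       print(datasource_uids(json.load(fh)))
```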

Logs

We’ve covered statistics, now it’s time to tackle logs! In the upper left corner, open the home menu and click on “Dashboards”.

This takes us to the Dashboard page (duh!), where we can see a list with our metrics dashboard, but also something listed as “Varnish logs”, let’s click on that:

The dashboard is much simpler than the metrics one, but it follows the same logic: we “just” have a big JSON file (varnish_logs.json) using the loki database we set up earlier. And this time, the JSON contains:

      "datasource": {
        "type": "loki",
        "uid": "lokiUID"
      },

The provider concept

By now, you might be wondering: “ok, grafana.ini was a fairly standard file with a default path, no wonder grafana found it, but how did it know about varnish_logs.json and varnish_metrics.json? There’s no way it would expect those!?!”

And that’s a very good question! It turns out that grafana looks for yaml files in /etc/grafana/provisioning/dashboards/, and we mount this file there:

# cat conf/grafana/provisioning/dashboards/default.yaml
apiVersion: 1

providers:
  - name: Default # A uniquely identifiable name for the provider
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards/

What we are saying here is “check for dashboard files in /etc/grafana/provisioning/dashboards/ and load them up”, and grafana will happily oblige. The interesting thing is that this “provider” is able to reload files as they get updated, among other things. We are not relying on that here for the sake of simplicity, but be aware it’s one of its features.
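Conceptually, a file provider boils down to "scan a directory, load every JSON file". The sketch below illustrates that idea only; it is not how grafana is implemented, and the discover_dashboards name is invented for the example.

```python
import json
import pathlib


def discover_dashboards(directory):
    """Map each *.json file in `directory` to its dashboard title."""
    dashboards = {}
    for path in sorted(pathlib.Path(directory).glob("*.json")):
        with path.open() as fh:
            # Fall back to the file name when the JSON has no "title" key
            dashboards[path.name] = json.load(fh).get("title", path.stem)
    return dashboards
```

Run against conf/grafana/provisioning/dashboards/, it would pick up the two JSON files and skip default.yaml, just like the provider does.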

Data sources

And we are almost done tying things together. We just need to look at data sources, and they will clarify how our JSON dashboards are able to find our Prometheus and Loki servers.

Same as for /etc/grafana/provisioning/dashboards/, grafana will check /etc/grafana/provisioning/datasources/ for data source descriptions. Let’s pause a moment to marvel at the consistency of the naming: for once in the IT world, it makes immediate sense!

Anyway, here’s the content of our two files in that directory:

# cat conf/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    uid: prometheusUID
    url: http://prometheus:9090

# cat conf/grafana/provisioning/datasources/loki.yaml
apiVersion: 1

datasources:
  - name: loki
    type: loki
    uid: lokiUID
    access: proxy
    url: http://loki:3100

Very short, and very sweet: each file describes a data source, gives it a uid and explains which url to use to contact it. That is how we go from the JSON dashboard listing only the uid to the actual server we spun up using docker compose.
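Put differently, the uid is just a lookup key. The following sketch reproduces that lookup in Python, with the two data sources above transcribed by hand into dictionaries; the resolve function is a hypothetical helper, not something grafana exposes.

```python
# Hand-transcribed from prometheus.yaml and loki.yaml
DATASOURCES = [
    {"name": "Prometheus", "type": "prometheus",
     "uid": "prometheusUID", "url": "http://prometheus:9090"},
    {"name": "loki", "type": "loki",
     "uid": "lokiUID", "url": "http://loki:3100"},
]


def resolve(uid, datasources=DATASOURCES):
    """Return the URL of the data source with this uid, or None."""
    for ds in datasources:
        if ds["uid"] == uid:
            return ds["url"]
    return None


print(resolve("prometheusUID"))  # http://prometheus:9090
```

A dashboard referring to a uid that resolves to nothing is a common cause of empty panels, so this is the first thing to check when a graph stays blank.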