Troubleshooting Varnish
Varnish is incredibly stable, but that doesn’t mean you’ll never have to troubleshoot it. On rare occasions crashes may occur. Depending on the available server resources and traffic patterns, misconfigured runtime parameters may cause unwanted behavior.
Crashes rarely occur, but when they do, you need to figure out what’s going on.
Even when Varnish is stable, the hit rate of the cache might be too low for your liking.
There are many ways to perform troubleshooting: from panic inspection, to Varnish Shared Memory Logs (VSL), to monitoring metrics.
Restarting and reloading
“Have you tried turning it off and on again?” is a very common question asked by IT support staff, and quite often it turns out to be an effective measure.
However, restarting is often considered lazy problem solving because it lacks the ambition to inspect the root cause. But when you’ve run out of options or out of time, a quick restart can be considered.
Restart
Here’s how you can restart Varnish:
sudo systemctl restart varnish
This command will kill the varnishd runtime process and start it up again. Any runtime parameters that weren’t persisted in the systemd service file will be reset.
Reload
Performing a Varnish reload will only reload the VCL configuration and has no impact on the state of the varnishd main or child processes.
Here’s how you reload the VCL configuration:
sudo systemctl reload varnish
The systemd reload will call the varnishreload script that will gracefully add a new VCL configuration, while discarding older configurations.
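After a reload you can verify which VCL configuration is active through the vcl.list CLI command:
sudo varnishadm vcl.list
The configuration that is marked as active is the one that was compiled by the most recent reload.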
Varnish panics
When the Varnish child process crashes, a panic is triggered and the child process is restarted. The panic.show CLI command will return the last panic.
Here’s how to return the last panic:
sudo varnishadm panic.show
An example panic
Here’s an example of a panic that we triggered ourselves through the varnishadm debug.panic.worker command:
$ sudo varnishadm panic.show
Panic at: Mon, 03 Jan 2022 13:16:43 GMT
Assert error in ccf_panic(), cache/cache_panic.c line 725:
Condition((strcmp("", "You asked for it")) == 0) not true.
version = varnish-6.0.8 revision 97e54ada6ac578af332e52b44d2038bb4fa4cd4a, vrt api = 7.1
ident = Linux,5.10.76-linuxkit,aarch64,-junix,-sdefault,-sdefault,-hcritbit,epoll
now = 11645.069782 (mono), 1641215803.491569 (real)
Backtrace:
0xaaaaca50fcd0: /usr/sbin/varnishd(+0x4fcd0) [0xaaaaca50fcd0]
0xaaaaca57c5f4: /usr/sbin/varnishd(VAS_Fail+0x24) [0xaaaaca57c5f4]
0xaaaaca50e474: /usr/sbin/varnishd(body_status_2str+0) [0xaaaaca50e474]
0xaaaaca57ea1c: /usr/sbin/varnishd(+0xbea1c) [0xaaaaca57ea1c]
0xaaaaca57ee54: /usr/sbin/varnishd(VCLS_Poll+0x2d0) [0xaaaaca57ee54]
0xaaaaca4f1e28: /usr/sbin/varnishd(CLI_Run+0x58) [0xaaaaca4f1e28]
0xaaaaca50ad10: /usr/sbin/varnishd(child_main+0x1a0) [0xaaaaca50ad10]
0xaaaaca54d6c4: /usr/sbin/varnishd(+0x8d6c4) [0xaaaaca54d6c4]
0xaaaaca57ea1c: /usr/sbin/varnishd(+0xbea1c) [0xaaaaca57ea1c]
0xaaaaca57ee54: /usr/sbin/varnishd(VCLS_Poll+0x2d0) [0xaaaaca57ee54]
thread = (cache-main)
thr.req = (nil) {
},
thr.busyobj = (nil) {
},
vmods = {
},
The error message, the filename, the line number and the backtrace are the tools you have to figure out what caused the panic and how to potentially solve it.
You can get help when you cannot figure these panics out on your own.
Note: panic.show only returns the last panic. If multiple panics occur, the one you’re trying to debug may be overwritten.
Debug mode
The varnishd program has a -d flag that enables debug mode.
When enabling debug mode, the parent process will run in the foreground with a CLI connection on stdin/stdout. The child process must be started explicitly with a CLI command. Terminating the parent process will also terminate the child.
Enabling debug mode
Here’s an example where the -d debug flag is added to varnishd:
$ varnishd -f default.vcl -d
Warnings:
VCL compiled.
Debug: Version: varnish-6.0.8 revision 97e54ada6ac578af332e52b44d2038bb4fa4cd4a
Debug: Platform: Linux,5.10.76-linuxkit,aarch64,-junix,-sdefault,-sdefault,-hcritbit
200 318
-----------------------------
Varnish Cache CLI 1.0
-----------------------------
Linux,5.10.76-linuxkit,aarch64,-junix,-sdefault,-sdefault,-hcritbit
varnish-6.0.8 revision 97e54ada6ac578af332e52b44d2038bb4fa4cd4a
Type 'help' for command list.
Type 'quit' to close CLI session.
Type 'start' to launch worker process.
start
Debug: Child (927) Started
200 0
Info: Child (927) said Child starts
Note: the start command had to be manually executed to start the child process. Without a running child process, Varnish cannot accept incoming requests.
When to use debug mode
An occasional Varnish panic can be caught using the panic.show CLI command, as mentioned earlier. When the panic occurs, the child process will restart and request handling continues.
If multiple panics occur, the last panic will be overwritten and the original panic you’re trying to investigate will no longer be available through panic.show. To prevent this information from being lost, debug mode can be used.
Debug mode is very similar to running the varnishd program in the foreground via the -F flag. The main difference is that a CLI is available and that the child process is not automatically started.
Thanks to the CLI being available, you can run additional commands while debugging. These commands include:
- param.show, param.set and param.reset to show, set, and reset runtime parameters
- stop and start to stop and start the varnishd child process
- vcl.load, vcl.use and vcl.discard to load, activate and remove VCL configurations
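For example, once the child process has been started from the debug CLI, you could load and activate an alternative VCL configuration and tweak a runtime parameter on the fly. The file name and value below are purely illustrative:
vcl.load test01 /etc/varnish/test.vcl
vcl.use test01
param.set default_ttl 60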
The Guru Meditation error
The Guru Meditation errors that Varnish returns are well-known, but usually have nothing to do with problems within Varnish.
As you can see in the example below, the Guru Meditation error reports that Varnish couldn’t reach the backend:
<!DOCTYPE html>
<html>
<head>
<title>503 Backend fetch failed</title>
</head>
<body>
<h1>Error 503 Backend fetch failed</h1>
<p>Backend fetch failed</p>
<h3>Guru Meditation:</h3>
<p>XID: 123456</p>
<hr>
<p>Varnish cache server</p>
</body>
</html>
The output format for these types of errors, and the fact that the term Guru Meditation is used, is all part of the built-in VCL for the vcl_synth and vcl_backend_error subroutines.
The vcl_synth subroutine is responsible for synthetic responses that are returned in the client-side logic of Varnish. For errors that happen on the backend side, the vcl_backend_error subroutine is used.
Client-side synthetic responses
If you want to intercept Varnish Shared Memory Logs (VSL) for client-side synthetic responses, you can run the following command:
sudo varnishlog -g request -q "VCL_call eq 'SYNTH'"
If you want to intercept VSL transactions for synthetic responses that return 400-range status codes, you can run the following command:
sudo varnishlog -g request -q "VCL_call eq 'SYNTH'\
and RespStatus >= 400 and RespStatus < 500"
This will ensure that non-erroneous synthetic responses are not part of the output.
These commands will present, in real time, a VSL transaction every time a client-side Guru Meditation error takes place.
If you want to dump all the old transactions in the VSL memory buffer that match these criteria, you can add the -d parameter to the command. Here’s an example:
sudo varnishlog -d -g request -q "VCL_call eq 'SYNTH'\
and RespStatus >= 400 and RespStatus < 500"
The varnishd program reserves 80 MB of memory for the Varnish Shared Memory Log (VSL) in the form of a circular memory buffer. As soon as this buffer is full, the oldest entries are overwritten. By using the -d parameter, you dump the full content of the VSL buffer.
However, there is no guarantee how long you can go back in time by dumping the buffer. It all depends on the traffic patterns of your Varnish server.
You can increase the size of the VSL circular memory buffer by overriding the vsl_space runtime parameter, which defaults to 80 MB.
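As an illustration, you could add the following option to the varnishd command line in your systemd unit file. The value is arbitrary, and because vsl_space cannot be changed at runtime, it only takes effect after a restart. If you grow the VSL, remember to raise the LimitMEMLOCK value in the unit file accordingly:
-p vsl_space=160M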
VSL output example
Here’s a VSL example for a request that resulted in a synthetic response with a 400-range status code:
$ sudo varnishlog -g request -q "VCL_call eq 'SYNTH'\
and RespStatus >= 400 and RespStatus < 500"
* << Request >> 8
- Begin req 7 rxreq
- Timestamp Start: 1641228924.312547 0.000000 0.000000
- Timestamp Req: 1641228924.312547 0.000000 0.000000
- ReqStart 172.17.0.1 55068 a0
- ReqMethod PRI
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader Host: localhost
- VCL_call RECV
- VCL_return synth
- VCL_call HASH
- VCL_return lookup
- Timestamp Process: 1641228924.312633 0.000086 0.000086
- RespProtocol HTTP/1.1
- RespStatus 405
- RespReason Method Not Allowed
- RespReason Method Not Allowed
- RespHeader Date: Mon, 03 Jan 2022 16:55:24 GMT
- RespHeader Server: Varnish
- RespHeader X-Varnish: 8
- VCL_call SYNTH
- RespHeader Content-Type: text/html; charset=utf-8
- RespHeader Retry-After: 5
- VCL_return deliver
- RespHeader Content-Length: 272
- Storage malloc Transient
- RespHeader Connection: keep-alive
- Timestamp Resp: 1641228924.312984 0.000437 0.000351
- ReqAcct 73 0 73 204 272 476
- End
From these logs you can conclude that an HTTP 405 Method Not Allowed error was returned because the invalid PRI request method was used.
This behavior is part of the built-in VCL for vcl_recv that prevents the PRI request method from being used.
You can slim down the output by only selecting the tags you need:
$ sudo varnishlog -i RespStatus -i RespReason -g request -q "VCL_call eq 'SYNTH'
and RespStatus >= 400 and RespStatus < 500"
* << Request >> 8
- RespStatus 405
- RespReason Method Not Allowed
- RespReason Method Not Allowed
Backend errors
When the Guru Meditation error returns output like 503 Backend fetch failed, the vcl_backend_error subroutine is called because the backend fetch failed.
This is not client-side logic based on faulty input from the user, but backend logic for a request that was not served from the cache. In the execution flow the vcl_backend_fetch and the vcl_backend_response subroutines are responsible for fetching a response from the backend and storing it in the cache.
When things go wrong, the synthetic error is returned, but you can also catch these errors in the Varnish Shared Memory Logs (VSL) using the following command:
sudo varnishlog -g request -q "VCL_call eq 'BACKEND_ERROR'"
VSL output example
Here’s a VSL example where the backend fetch fails and a backend error is returned:
$ sudo varnishlog -g request -q "VCL_call eq 'BACKEND_ERROR'"
* << Request >> 32790
- Begin req 32789 rxreq
- Timestamp Start: 1641287864.851443 0.000000 0.000000
- Timestamp Req: 1641287864.851443 0.000000 0.000000
- ReqStart 172.17.0.1 55092 a0
- ReqMethod GET
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader Host: localhost
- VCL_call RECV
- VCL_return hash
- VCL_call HASH
- VCL_return lookup
- VCL_call MISS
- VCL_return fetch
- Link bereq 32791 fetch
- Timestamp Fetch: 1641287864.852788 0.001346 0.001346
- RespProtocol HTTP/1.1
- RespStatus 503
- RespReason Backend fetch failed
- RespHeader Date: Tue, 04 Jan 2022 09:17:44 GMT
- RespHeader Server: Varnish
- RespHeader Content-Type: text/html; charset=utf-8
- RespHeader Retry-After: 5
- RespHeader X-Varnish: 32790
- RespHeader Age: 0
- RespHeader Via: 1.1 varnish (Varnish/6.0)
- VCL_call DELIVER
- VCL_return deliver
- Timestamp Process: 1641287864.852818 0.001375 0.000030
- RespHeader Content-Length: 282
- RespHeader Connection: keep-alive
- Timestamp Resp: 1641287864.852978 0.001535 0.000160
- ReqAcct 73 0 73 250 282 532
- End
** << BeReq >> 32791
-- Begin bereq 32790 fetch
-- VCL_use boot
-- Timestamp Start: 1641287864.851951 0.000000 0.000000
-- BereqMethod GET
-- BereqURL /
-- BereqProtocol HTTP/1.1
-- BereqHeader Host: localhost
-- BereqHeader Accept-Encoding: gzip
-- BereqHeader X-Varnish: 32791
-- VCL_call BACKEND_FETCH
-- VCL_return fetch
-- FetchError backend boot.default: fail errno 111 (Connection refused)
-- Timestamp Beresp: 1641287864.852372 0.000422 0.000422
-- Timestamp Error: 1641287864.852377 0.000426 0.000004
-- BerespProtocol HTTP/1.1
-- BerespStatus 503
-- BerespReason Service Unavailable
-- BerespReason Backend fetch failed
-- BerespHeader Date: Tue, 04 Jan 2022 09:17:44 GMT
-- BerespHeader Server: Varnish
-- VCL_call BACKEND_ERROR
-- BerespHeader Content-Type: text/html; charset=utf-8
-- BerespHeader Retry-After: 5
-- VCL_return deliver
-- Storage malloc Transient
-- Length 282
-- BereqAcct 0 0 0 0 0 0
-- End
The log output has two transactions:
- One for the client-side transaction with the incoming request and outgoing response
- One for the backend fetch and backend response
In the backend transaction, the FetchError tag clearly indicates a Connection refused error. The fact that the backend server is not available is the reason for the failure.
Of course the output can be reduced by only selecting the necessary tags.
Backend health monitoring
Instead of waiting for backend fetch failures to occur, you can actively monitor the health of your backends. This requires a health probe to be defined and assigned to the backends:
vcl 4.1;
probe health {
    .request =
        "HEAD / HTTP/1.1"
        "Host: localhost"
        "Connection: close"
        "User-Agent: Varnish Health Probe";
    .interval = 10s;
    .timeout = 5s;
    .window = 5;
    .threshold = 3;
}

backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .probe = health;
}
The probe named health will send an HTTP HEAD call for the / resource to the backend every 10 seconds. The probe will time out after 5 seconds. As soon as 3 out of 5 health checks fail, the backend is considered sick.
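You can roughly simulate what the probe does with curl to verify that the backend answers the health check. This assumes the backend listens on 127.0.0.1:8080, as in the example above:
curl -I -H "Host: localhost" http://127.0.0.1:8080/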
VSL backend health checks
The Backend_health VSL tag will return the backend health in real time. You can run the following command to display the backend health:
sudo varnishlog -g raw -i Backend_health
Here’s the output for the failing backend:
$ sudo varnishlog -g raw -i backend_health
0 Backend_health - boot.default Still sick -------- 2 3 5 0.000000 0.000000 Open error 111 (Connection refused)
0 Backend_health - boot.default Still sick -------- 2 3 5 0.000000 0.000000 Open error 111 (Connection refused)
As you can see, the same Connection refused error is returned. Because the interval of the health probe was set to 10 seconds, a new log line will appear every 10 seconds.
When the backend failure is fixed, the VSL will reflect the fix:
$ sudo varnishlog -g raw -i backend_health
0 Backend_health - boot.default Still sick 4---X-RH 1 3 5 0.003040 0.001013 HTTP/1.1 200 OK
0 Backend_health - boot.default Still sick 4---X-RH 2 3 5 0.005553 0.002148 HTTP/1.1 200 OK
0 Backend_health - boot.default Back healthy 4---X-RH 3 3 5 0.001177 0.001906 HTTP/1.1 200 OK
0 Backend_health - boot.default Still healthy 4---X-RH 4 3 5 0.001419 0.001784 HTTP/1.1 200 OK
Although the first log line already displays an HTTP/1.1 200 OK status, the backend is not yet considered healthy. For the backend to be fully healthy, 3 out of 5 checks should succeed according to the health probe settings.
As of the third log line, the backend is considered Back healthy, and as of the fourth log line the backend is considered Still healthy.
CLI backend health
If you don’t want to consult the VSL and want a backend health summary, you can call the backend.list CLI command.
As you can see in the output below, the default backend is considered healthy thanks to the health check probe that is monitoring this backend:
$ sudo varnishadm backend.list
Backend name Admin Probe Last updated
boot.default probe Healthy 5/5 Tue, 04 Jan 2022 10:00:31 GMT
If the backend becomes sick, the backend.list command will reflect this:
$ sudo varnishadm backend.list
Backend name Admin Probe Last updated
boot.default probe Sick 2/5 Tue, 04 Jan 2022 10:04:41 GMT
Note: by default, backend.list only displays the backends attached to the active VCL, which can be confusing if you use VCL labels, for example. To show all the backends, use the optional globbing pattern '*.*' (quoted to avoid shell expansion):
$ sudo varnishadm backend.list '*.*'
Backend name Admin Probe Last updated
boot.default probe Sick 2/5 Tue, 04 Jan 2022 10:04:41 GMT
foo_vcl.backend1 probe Sick 5/5 Tue, 04 Jan 2022 10:04:42 GMT
foo_vcl.backend2 probe Sick 5/5 Tue, 04 Jan 2022 10:04:40 GMT
bar_vcl.default probe Sick 5/5 Tue, 04 Jan 2022 10:04:42 GMT
Varnish is not caching
Troubleshooting is not only about dealing with errors in Varnish. Sometimes the hit rate of the cache will be too low and the backend will be overwhelmed while Varnish is still in perfect shape.
Figuring out why Varnish isn’t caching pages it was supposed to cache can be done using specific VSL commands.
Cache misses aren’t usually that bad: a miss is just a cache hit that hasn’t taken place yet. Eventually the object will end up in the cache. However, if the TTL of your objects is too low or if the cache is full, cached objects will be removed from the cache.
Cache bypasses on the other hand have a bigger impact: these objects will never end up in the cache. This is either caused by built-in VCL logic that prevents data from being stored in the cache or served from the cache, or by custom VCL logic that has the same effect.
VSL commands
You could list all VSL transactions that resulted in a pass. The following command will help you with that:
sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
-I ReqHeader:Cookie -I ReqHeader:Authorization \
-i VCL_call -i VCL_return \
-q "VCL_call eq 'PASS'"
You could also be looking at a specific URL and ask yourself the same question:
sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
-I ReqHeader:Cookie -I ReqHeader:Authorization \
-i VCL_call -i VCL_return \
-q "ReqUrl eq '/'"
In this case we’re looking for requests that hit the homepage (/).
The output of these commands will tell us why Varnish wasn’t serving the response from the cache.
Because it was a POST request
The following VSL output indicates that an HTTP POST call was used for the request:
$ sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
-I ReqHeader:Cookie -I ReqHeader:Authorization \
-i VCL_call -i VCL_return \
-q "VCL_call eq 'PASS'"
* << Request >> 2
- ReqMethod POST
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader Host: localhost
- VCL_call RECV
- VCL_return pass
- VCL_call HASH
- VCL_return lookup
- VCL_call PASS
- VCL_return fetch
- VCL_call DELIVER
- VCL_return deliver
And that’s the reason why this request was not served from the cache.
Note: only GET and HEAD requests are cacheable.
Because the request contained a cookie
The built-in VCL for vcl_recv will not serve objects from the cache if the request contains a Cookie header.
The VSL output below illustrates this behavior:
$ sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
-I ReqHeader:Cookie -I ReqHeader:Authorization \
-i VCL_call -i VCL_return \
-q "VCL_call eq 'PASS'"
* << Request >> 3
- ReqMethod GET
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader Host: localhost
- ReqHeader Cookie: lang=en
- VCL_call RECV
- VCL_return pass
- VCL_call HASH
- VCL_return lookup
- VCL_call PASS
- VCL_return fetch
- VCL_call DELIVER
- VCL_return deliver
Because an Authorization header was passed
The built-in VCL for vcl_recv also bypasses the cache when an Authorization header is sent. The idea is that authorized content is private and should not be cached.
Here’s the VSL output for such a request:
$ sudo varnishlog -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
-I ReqHeader:Cookie -I ReqHeader:Authorization \
-i VCL_call -i VCL_return \
-q "VCL_call eq 'PASS'"
* << Request >> 98327
- ReqMethod GET
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader Host: localhost
- ReqHeader Authorization: Basic dGVzdDp0ZXN0
- VCL_call RECV
- VCL_return pass
- VCL_call HASH
- VCL_return lookup
- VCL_call PASS
- VCL_return fetch
- VCL_call DELIVER
- VCL_return deliver
Because of an invalid request method
In an earlier example we bypassed the cache for uncacheable request methods such as POST. Only GET and HEAD are cacheable.
But Varnish will also not cache if it doesn’t recognize the request method. When an invalid request method is used, Varnish will assume the input is not an HTTP request.
At that point the request will be piped to the backend. This means that a TCP connection is opened to the backend, all the incoming bytes are sent, and the response is returned. No effort is made to treat the request or response as HTTP.
In terms of the Varnish Finite State Machine, a pass call is no longer made and instead a pipe call is made.
Here’s how you query for these requests in VSL:
sudo varnishlog -g request -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
-I ReqHeader:Cookie -I ReqHeader:Authorization \
-i VCL_call -i VCL_return \
-q "VCL_return eq 'pipe'"
And here’s some example VSL output:
$ sudo varnishlog -g request -i ReqUrl -i ReqMethod -i ReqProtocol -I ReqHeader:Host \
-I ReqHeader:Cookie -I ReqHeader:Authorization \
-i VCL_call -i VCL_return \
-q "VCL_return eq 'pipe'"
* << Request >> 22
- ReqMethod FOO
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader Host: localhost
- VCL_call RECV
- VCL_return pipe
- VCL_call HASH
- VCL_return lookup
** << BeReq >> 23
-- VCL_call PIPE
-- VCL_return pipe
In this example the FOO request method was used, which results in a transition to the vcl_pipe subroutine.
Because of a Hit-For-Miss
When Varnish receives a response from the backend server, the vcl_backend_response logic will examine the cacheability of the response. If the response is not deemed cacheable, Varnish will create a Hit-For-Miss object that will allow the next requests to bypass the cache as well as the request coalescing waiting list.
As long as the backend returns uncacheable responses, Hit-For-Misses will occur. We can spot these in VSL by looking for the HitMiss tag.
The following command will select all VSL transactions that contain a HitMiss tag:
sudo varnishlog -g request -q "HitMiss"
From a Finite State Machine point of view, a transition from vcl_recv to vcl_hash is made. Although the requested object is found in the cache, it is a Hit-For-Miss object, which causes a transition to the vcl_miss subroutine.
A more targeted approach to find uncacheable content is to filter the logs for backend transactions that have uncacheable as one of the values in the TTL tag:
sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
Let’s use this command, which only displays specific tags, to explore some uncacheable scenarios.
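All of these scenarios stem from the same decision logic. As a point of reference, here is a paraphrased sketch of what that logic looks like in the built-in VCL’s vcl_backend_response subroutine in Varnish 6.0; consult the builtin.vcl file that ships with your version for the authoritative code:
sub vcl_backend_response {
    if (bereq.uncacheable) {
        return (deliver);
    } else if (beresp.ttl <= 0s ||
      beresp.http.Set-Cookie ||
      beresp.http.Surrogate-Control ~ "(?i)no-store" ||
      (!beresp.http.Surrogate-Control &&
        beresp.http.Cache-Control ~ "(?i:no-cache|no-store|private)") ||
      beresp.http.Vary == "*") {
        # Mark as Hit-For-Miss for the next two minutes
        set beresp.ttl = 120s;
        set beresp.uncacheable = true;
    }
    return (deliver);
}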
Because of a zero TTL
One of the potential root causes of a Hit-For-Miss is a response that results in a zero TTL. A TTL of zero can be set through the Cache-Control header or the Expires header.
Here are a couple of examples of such header values:
Cache-Control: max-age=0
Cache-Control: s-maxage=0
Expires: Thu, 1 Jan 1970 12:00:00 GMT
Here’s some VSL output that illustrates this scenario:
$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
** << BeReq >> 32810
-- BereqURL /
-- BerespHeader Cache-Control: max-age=0
-- TTL RFC 0 10 0 1641298426 1641298426 1641298426 0 0 cacheable
-- TTL VCL 120 10 0 1641298426 cacheable
-- TTL VCL 120 10 0 1641298426 uncacheable
-- BerespHeader Vary: Accept-Encoding
You can clearly spot the Cache-Control: max-age=0 response header that was sent by the backend. And although the first occurrence of the TTL tag sees this as a cacheable response, the built-in VCL for this scenario turns this into an uncacheable response and creates a Hit-For-Miss object as a result.
Because of private, no-cache, no-store
Another scenario where a response is deemed uncacheable is when the Cache-Control response header contains keywords like private, no-cache or no-store.
Here’s some VSL output for this scenario:
$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
** << BeReq >> 43
-- BereqURL /
-- BerespHeader Cache-Control: private, no-cache, no-store
-- TTL RFC 120 10 0 1641298797 1641298797 1641298796 0 0 cacheable
-- TTL VCL 120 10 0 1641298797 cacheable
-- TTL VCL 120 10 0 1641298797 uncacheable
-- BerespHeader Vary: Accept-Encoding
The Cache-Control: private, no-cache, no-store response header is responsible for this response not being cacheable, which is in accordance with the built-in VCL for this scenario.
Because of a Surrogate-Control: no-store header
When the Surrogate-Control response header contains the no-store keyword, Varnish will not cache the response.
Here’s some VSL output for this scenario:
$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
** << BeReq >> 46
-- BereqURL /
-- BerespHeader Surrogate-Control: no-store
-- TTL RFC 120 10 0 1641298986 1641298986 1641298985 0 0 cacheable
-- TTL VCL 120 10 0 1641298986 cacheable
-- TTL VCL 120 10 0 1641298986 uncacheable
-- BerespHeader Vary: Accept-Encoding
Have a look at the built-in VCL for this scenario.
Because of a Set-Cookie header
The built-in VCL describes how Varnish deals with cookies out-of-the-box.
But while the occurrence of a Cookie request header causes a cache bypass, the occurrence of a Set-Cookie header in the response causes a Hit-For-Miss object to be created.
Setting a cookie implies a state change, which is not cacheable. The built-in VCL has measures in place to prevent responses with Set-Cookie headers from being cached.
Here’s some VSL output for this scenario:
$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
** << BeReq >> 49
-- BereqURL /
-- BerespHeader Set-Cookie: language=en; expires=Thu, 03-Feb-2022 12:31:26 GMT; Max-Age=2592000; path=/
-- TTL RFC 120 10 0 1641299487 1641299487 1641299486 0 0 cacheable
-- TTL VCL 120 10 0 1641299487 cacheable
-- TTL VCL 120 10 0 1641299487 uncacheable
-- BerespHeader Vary: Accept-Encoding
Because of wildcard variations
The Vary header can be used to issue cache variations based on certain request header values.
A Vary: Accept-Language header will make sure a variation of a cached object is created based on the value of the Accept-Language request header. This ensures that you can cache multilingual content without having to write extra VCL.
The challenge is to strike the right balance between having enough cache variations and maintaining an acceptable hit rate.
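A common way to keep the number of variations under control is to normalize the varied request header in vcl_recv. Here’s a minimal sketch, assuming a site that only serves English and Dutch content:
sub vcl_recv {
    # Collapse Accept-Language to the two languages we actually serve
    if (req.http.Accept-Language ~ "(?i)^nl") {
        set req.http.Accept-Language = "nl";
    } else {
        set req.http.Accept-Language = "en";
    }
}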
But if a response contains a Vary: * header, Varnish is supposed to vary on all request headers. This defeats the purpose of caching, and the built-in VCL considers this an uncacheable scenario.
Here’s some VSL output for this scenario:
$ sudo varnishlog -g request -b -i BeReqUrl -I BerespHeader:Cache-Control \
-I BerespHeader:Expires -I BerespHeader:Vary -I BerespHeader:Set-Cookie \
-I BerespHeader:Surrogate-Control -i TTL -q "TTL[6] eq 'uncacheable'"
** << BeReq >> 32813
-- BereqURL /
-- BerespHeader Vary: *
-- TTL RFC 120 10 0 1641299772 1641299772 1641299771 0 0 cacheable
-- TTL VCL 120 10 0 1641299772 cacheable
-- TTL VCL 120 10 0 1641299772 uncacheable
-- BerespHeader Vary: *, Accept-Encoding
Debug logs
Throughout this tutorial we’ve used the Varnish Shared Memory Logs (VSL) quite extensively. However, from a troubleshooting and debugging point of view, we haven’t yet looked at the Debug VSL tag.
Varnish will occasionally add some debug information to the logs through the Debug tag. VMOD developers may also use this tag to log information that wouldn’t fit in any other tag.
One example is an HTTP/2 upgrade request on a Varnish system that doesn’t have HTTP/2 enabled. The protocol upgrade happens through the Upgrade: h2c request header.
When Varnish notices it cannot upgrade the connection to HTTP/2 because the -p feature=+http2 runtime parameter wasn’t passed, it continues to serve HTTP/1.1 content, but logs the upgrade attempt through a Debug VSL tag.
Here’s an example:
$ sudo varnishlog -g request -c -i "Req*" -i Debug
* << Request >> 54
- Debug "H2 upgrade attempt%00"
- ReqStart 127.0.0.1 47000 a0
- ReqMethod GET
- ReqURL /
- ReqProtocol HTTP/1.1
- ReqHeader Host: localhost
- ReqHeader User-Agent: curl/7.74.0
- ReqHeader Accept: */*
- ReqHeader Upgrade: h2c
- ReqHeader X-Forwarded-For: 127.0.0.1
- ReqAcct 87 0 87 273 33 306
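If you do want Varnish to accept these upgrades, the http2 feature flag can be enabled at runtime with the command below; add -p feature=+http2 to the varnishd options in your systemd unit file to persist it across restarts:
sudo varnishadm param.set feature +http2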
Timeouts
Varnish has some timeouts that can be configured, which can improve the overall experience.
Although you can focus on reducing the amount of time it takes to receive a request or the amount of time it takes to generate a response, sometimes it just makes sense to increase the timeout value.
Varnish allows you to configure both client-side timeouts and a collection of backend timeouts.
Client-side timeouts
The timeout_idle parameter is one of these. It’s a sort of keep-alive timeout that defines how long a connection remains open after a request. If no new pipelined request is received on the connection within five seconds, the connection is closed.
The idle_send_timeout parameter defines how long Varnish is willing to wait for the next bytes, after having already received data. This is a typical between-bytes timeout.
And then there’s also the send_timeout parameter, which acts as a last-byte timeout.
From a Denial of Service (DoS) perspective, these settings can help you prevent Slowloris attacks.
From a performance point of view, these settings can also be used to improve the end-user experience. If your Varnish servers sit behind a set of load balancers, it makes sense to increase timeout_idle because you know they are the only devices that are directly connecting to Varnish, and they are most probably going to reuse their connections with Varnish.
If you’re handling large volumes of data that are processed by potentially slow clients, you can also increase the send_timeout value.
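As an illustration, the following commands raise these timeouts at runtime. The values are arbitrary and should be adapted to your traffic patterns; add matching -p options to your systemd unit file to persist them across restarts:
sudo varnishadm param.set timeout_idle 65
sudo varnishadm param.set send_timeout 1200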
Backend timeouts
For requests that cannot be served from the cache, a backend connection is made to the origin server, which acts as the source of truth.
If your backend is slow, or the connection is unreliable, backend connections might be left open for too long. It is also possible that the connection is closed while data is still being sent.
In order to strike the right balance, Varnish offers a set of backend timeouts:
- connect_timeout: the amount of time Varnish is willing to wait for the backend to accept the connection
- first_byte_timeout: the timeout for receiving the first byte from the origin
- between_bytes_timeout: the amount of time we are willing to wait in between receiving bytes from the backend
Here’s a quick reminder on how to configure this in VCL:
vcl 4.1;
backend default {
    .host = "origin.example.com";
    .port = "80";
    .connect_timeout = 10s;
    .first_byte_timeout = 90s;
    .between_bytes_timeout = 5s;
}
These settings are of course also available as varnishd runtime parameters, but it is important to know that the values in VCL are set on a per-backend basis and take precedence over the runtime parameters.
These timeouts can also be specified on a per-request basis by setting bereq.connect_timeout or bereq.first_byte_timeout in the vcl_backend_fetch subroutine.
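Here’s a minimal sketch of such a per-request override, to be added to your existing VCL. The /report URL is just a hypothetical slow endpoint:
sub vcl_backend_fetch {
    # Give a hypothetical slow reporting endpoint more time to respond
    if (bereq.url ~ "^/report") {
        set bereq.first_byte_timeout = 300s;
    }
}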
The backend_idle_timeout parameter is not configurable in VCL, defaults to 60 seconds, and defines how long a backend connection can be idle before Varnish closes it.
Performance issues
As mentioned at the beginning of this tutorial, the performance of Varnish can depend on the available server resources and traffic patterns. The configuration of runtime parameters can also play a part in this.
It is important to point out that the default values of the various runtime parameters were carefully considered. But that doesn’t mean they are right for your environment.
Degraded performance in Varnish usually takes place when Varnish doesn’t have enough cache storage, when Varnish doesn’t have enough workspace memory or when Varnish doesn’t have enough threads.
Not enough caching storage
By default, Varnish allocates 256 MB of memory for caching objects. That’s not a lot, but it’s a conservative default that should work on systems that have less than 1 GB of RAM.
If Varnish is the only service running on the server, the rule of thumb is to allocate 80% of the server’s available memory to Varnish. This percentage only accounts for the object storage; memory that is used to process requests is not part of that calculation.
Monitoring the object storage memory consumption in Varnish can be done through the varnishstat program that is part of your Varnish installation.
varnishstat displays continuously updated counters, which can be filtered by using the -f option.
Here’s a varnishstat example that only monitors Shared Memory Allocation counters:
sudo varnishstat -f "SMA.*.g_*" -1
Here’s the output:
SMA.s0.g_alloc 40 . Allocations outstanding
SMA.s0.g_bytes 41445760 . Bytes outstanding
SMA.s0.g_space 63411840 . Bytes available
SMA.Transient.g_alloc 23 . Allocations outstanding
SMA.Transient.g_bytes 7176 . Bytes outstanding
SMA.Transient.g_space 0 . Bytes available
We can conclude that this server still has 63411840 bytes of storage available and 41445760 bytes of storage are in use.
Note: unless you explicitly name the storage engine via the -s runtime parameter of varnishd, Varnish will name the first storage engine s0. The Transient storage engine is a secondary storage engine that is not limited in size; it is used for short-lived objects and for streaming uncacheable data.
If you’re running out of memory and Varnish needs to store new objects, it will evict the least recently used (LRU) objects to free up space.
The MAIN.n_lru_nuked counter in varnishstat will increase when this happens. This is a clear warning that you’re running out of memory. This is not disastrous because we’re dealing with a cache: the content can always be fetched from the origin server.
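You can keep an eye on this counter with a one-shot varnishstat query:
sudo varnishstat -1 -f MAIN.n_lru_nuked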
Increasing the cache storage
If it turns out that you haven’t allocated enough memory to Varnish, you can increase the cache storage by modifying the -s runtime parameter in your systemd unit file.
You can edit the runtime configuration through the following command:
sudo systemctl edit --full varnish
An editor will open and this is what you’ll see:
[Unit]
Description=Varnish Cache, a high-performance HTTP accelerator
After=network-online.target nss-lookup.target
[Service]
Type=forking
KillMode=process
# Maximum number of open files (for ulimit -n)
LimitNOFILE=131072
# Locked shared memory - should suffice to lock the shared memory log
# (varnishd -l argument)
# Default log size is 80MB vsl + 1M vsm + header -> 82MB
# unit is bytes
LimitMEMLOCK=85983232
# Enable this to avoid "fork failed" on reload.
TasksMax=infinity
# Maximum size of the corefile.
LimitCORE=infinity
ExecStart=/usr/sbin/varnishd \
-a :80 \
-a localhost:8443,PROXY \
-p feature=+http2 \
-f /etc/varnish/default.vcl \
-s malloc,256m
ExecReload=/usr/sbin/varnishreload
[Install]
WantedBy=multi-user.target
Replace -s malloc,256m with a value that suits your needs. Once you’re done, save the file and close the editor.
Restart Varnish to apply these new settings by running the following command:
sudo systemctl restart varnish
Not enough workspace memory
Workspace memory is a Varnish-specific concept. This type of memory is used for request and response parsing, for VCL storage, and also for any VMOD requiring memory space to store data.
There are different kinds of workspaces, and each of them can be tuned. When a workspace overflow occurs, it means a transaction couldn’t allocate enough memory to perform its task.
- The workspace_client parameter, with a default value of 64 KB, is used to limit memory allocation for HTTP request handling.
- The workspace_backend parameter, which also has a default value of 64 KB, sets the amount of memory that can be used during backend processing.
- The workspace_session parameter limits the size of workspace memory used to establish the TCP connection to Varnish. The default value is 0.5 KB.
When a task consumes more memory than allowed in one of the specific workspace contexts, the transaction is aborted, and an HTTP 503 response is returned. When a workspace_session overflow occurs, the connection will be closed.
It is always possible to increase the size of the various workspaces. Memory consumption depends on what happens in VCL, but also depends on the input Varnish receives from clients and backends.
A better solution is to optimize your VCL, or reduce the size and the amount of headers that are sent by the backend. But sometimes, you have no control over this, or no way to significantly reduce memory consumption. In that case, increasing workspace_client or workspace_backend is your best move.
Luckily there are ways to monitor workspace overflow. These workspaces have a varnishstat overflow counter:
MAIN.ws_client_overflow
MAIN.ws_backend_overflow
MAIN.ws_session_overflow
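You can watch all three counters at once by using a glob pattern:
sudo varnishstat -1 -f "MAIN.ws_*_overflow"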
When these counters start increasing, don’t blindly increase the workspace size. Instead, have a look at your logs, see which transactions cause the overflow, and try to figure out if you can optimize that part of your VCL to avoid the overflows in the future.
As always, varnishstat and varnishlog will be the tools you need to figure out what is going on before deciding to increase the size of the workspaces.
Note: workspace sizes are tuned by passing -p options to the varnishd program. This happens in the systemd unit file. The procedure involves sudo systemctl edit --full varnish to open the unit file and sudo systemctl restart varnish to restart Varnish.
Not enough threads
The threading settings are the best way to tune the concurrency of Varnish. The thread_pool_min and thread_pool_max runtime parameters control how many threads are in each thread pool.
By default there are two thread pools, which can be changed by modifying the thread_pools parameter. But benchmarks have shown no significant improvement from changing this value.
When varnishd starts, 200 threads are created in advance: 100 threads per thread pool. The thread_pool_min parameter can be used to tune this number.
Growing the thread pools
When the thread pools are running out of available threads, Varnish will grow the pools until thread_pool_max is reached. Growing the pools is also somewhat resource intensive.
By increasing thread_pool_min, it’s easier to cope with an onslaught of incoming traffic upon restart. This is common when Varnish sits behind a load balancer and is suddenly added to the rotation.
On the other hand, starting varnishd with too many threads will have an impact on the resource consumption of your server.
A good indicator is the MAIN.threads counter in varnishstat. It lets you know how many threads are currently active. You can correlate this number to your resource usage, and it helps you establish a baseline.
The thread_pool_min parameter should be at least the number of threads in use for an average traffic pattern.
When the MAIN.threads_limited counter increases, it means the thread pools ran out of available threads. If there is room to grow the pools, Varnish will add threads. If not, tasks will be queued until a thread is available.
The MAIN.threads_limited counter might increase early on, when the MAIN.threads counter reaches the thread_pool_min threshold. As the pools are grown, this will happen less often. But when MAIN.threads reaches the thread_pool_max value, the change rate of the MAIN.threads_limited counter can indicate problematic behavior.
The thread queue doesn’t have an infinite size: each thread pool has a queue limit of 20 tasks. This is configurable via the thread_queue_limit parameter. When the queue is full, any new task or request will be dropped.
So when the MAIN.threads_limited counter increases, and the MAIN.sess_dropped or MAIN.req_dropped counters are increasing as well, you know that the queue is full and sessions/streams are being dropped.
Note: the MAIN.sess_dropped counter refers to HTTP/1.1 sessions being dropped, whereas MAIN.req_dropped refers to HTTP/2 streams being dropped.
You can choose to increase the thread_queue_limit, which will allow more tasks to be queued. Unless resources are too tight, you really want to increase thread_pool_min because it will make your system more responsive.
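As an illustration, the following commands raise both thread pool boundaries at runtime. The values are arbitrary and should follow the baseline you established with the MAIN.threads counter; persist them with matching -p options in your systemd unit file:
sudo varnishadm param.set thread_pool_min 200
sudo varnishadm param.set thread_pool_max 5000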
Shrinking the thread pools
When a thread has been idle for 300 seconds, Varnish will clean it up. This is controlled by the thread_pool_timeout parameter. The MAIN.threads counter will reflect this.
This means Varnish will automatically shrink the thread pools based on demand, but with a delay. But if increased server load, caused by varnishd worker threads, is too much for your system, you should decrease thread_pool_max to an acceptable level.
If you believe Varnish needs to clean up idle threads quicker, you can reduce the thread_pool_timeout. But remember: destroying threads also consumes resources.
Tuning the worker stack size
Another factor that will impact server load is the worker stack size. This is stack space that is consumed by every worker thread. By limiting the stack size, we manage to reduce the memory footprint of Varnish on the system. The size is configurable via the thread_pool_stack parameter. The default stack size in Varnish is 48 KB. The default process stack size on Linux is typically multiple orders of magnitude larger than the stack sizes we use in Varnish.
Stack space is typically consumed by third-party libraries that are used by Varnish. libpcre, the library to run Perl Compatible Regular Expressions, can consume quite a bit of stack space. If you write very complicated regular expressions in VCL, you might even cause a stack overflow.
When a stack overflow happens, you should increase the value of thread_pool_stack. But this will, in turn, have a direct impact on resource consumption because the worker stack size applies per thread.
If you set your worker stack size to 100 KB and you have 5000 threads in each of the two thread pools, this will consume almost 1 GB of memory (2 × 5000 × 100 KB). So be careful, and consider reducing thread_pool_max when this would be too taxing on your system.
HTTP limits
HTTP requests and responses are parsed by Varnish. As mentioned earlier, parsing them requires a bit of workspace memory.
Incoming requests and cached responses are parsed in the client context, and use client workspace memory. When a cache miss takes place, and the response needs to be parsed from the origin server, we operate in the backend context. This will consume backend workspace memory.
There are certain limits in place that prevent Varnish from having to waste too much memory on request and response parsing and to avoid DoS attacks.
Here’s an overview of runtime parameters that limit the length and size of requests and responses:
- http_max_hdr: the maximum number of headers an HTTP request or response may contain. The default value is 64.
- http_req_hdr_len: the maximum size of an individual request header. By default this is 8 KB.
- http_req_size: the maximum total size of the HTTP request headers. This defaults to 32 KB.
- http_resp_hdr_len: the maximum size of an individual response header. By default this is 8 KB.
- http_resp_size: the maximum total size of the HTTP response headers. This defaults to 32 KB.
When requests or responses exceed these limits, the transaction will fail.
HTTP request limit examples
Here’s some example logging output when the http_max_hdr threshold is exceeded:
* << Request >> 5
- Begin req 4 rxreq
- Timestamp Start: 1611051232.286266 0.000000 0.000000
- Timestamp Req: 1611051232.286266 0.000000 0.000000
- BogoHeader Too many headers: foo:bar
- HttpGarbage "GET%00"
- RespProtocol HTTP/1.1
- RespStatus 400
- RespReason Bad Request
- ReqAcct 519 0 519 28 0 28
- End
As you can see, an HTTP 400 status code is returned when this happens.
Here’s an example where an individual request header exceeds the http_req_hdr_len limit:
* << Request >> 98314
- Begin req 98313 rxreq
- Timestamp Start: 1611051653.320914 0.000000 0.000000
- Timestamp Req: 1611051653.320914 0.000000 0.000000
- BogoHeader Header too long: test:YnEJyVqxTMgn7aX
- HttpGarbage "HEAD%00"
- RespProtocol HTTP/1.1
- RespStatus 400
- RespReason Bad Request
- ReqAcct 10081 0 10081 28 0 28
- End
When the total request size exceeds http_req_size, the following output can be found in your VSL:
* << Session >> 32793
- Begin sess 0 HTTP/1
- SessOpen 172.21.0.1 60576 http 172.21.0.3 80 1611052643.429084 30
- SessClose RX_OVERFLOW 0.001
- End
HTTP response limit examples
When the origin server returns too many headers and exceeds the http_max_hdr limit, this doesn’t result in an HTTP 400 status, but in an actual HTTP 503.
You might see the following output appear in your VSL:
-- BogoHeader Too many headers:foo: bar
-- HttpGarbage "HTTP/1.1%00"
-- BerespStatus 503
-- BerespReason Service Unavailable
-- FetchError http format error
And when this happens, the MAIN.losthdr counter in varnishstat will also increase.
When the http_resp_hdr_len limit is exceeded, you will see the following output end up in VSL:
-- BogoHeader Header too long: Last-Modified: Tue,
-- HttpGarbage "HTTP/1.1%00"
-- BerespStatus 503
-- BerespReason Service Unavailable
-- FetchError http format error
And finally, when the http_resp_size limit is exceeded, the following VSL line may serve as an indicator:
-- FetchError overflow
Get help
Troubleshooting can be hard sometimes. Although this detailed troubleshooting tutorial can get you a long way, you might need some help figuring out what’s going on.
Luckily there are ways to get help: help from the community and help from Varnish Software’s support department.