October 24, 2014

Web traffic analysis with Varnish and VCS

Web traffic analysis software is synonymous with bouncing traffic and user data to a third-party vendor or trawling through logs and consolidating numbers at the end of the day. These metrics, such as unique visitors and page views, can play a central role in the business decision-making process.

What if you could make these decisions on the fly, as events happen? What if you could categorize users by criteria other than cookies, for cases where uniqueness can be coarser-grained, such as client IP or geolocation? And what if you could embed this data into any other system through an easy-to-use RESTful API?

This blog post aims to illustrate that the same type of statistics can easily be gathered in real time, without involving any third-party vendors or using cookies. Once these metrics are defined, a later section outlines how to conduct A/B testing through Varnish. What's the catch? Well, there is none if you are already using Varnish to improve your website's performance. Enter Varnish Custom Statistics, VCS.

Background on VCS

VCS 1.3.0 was recently released. It is our marquee real-time statistics tool, and it gives users full control over what they wish to track. Data is collected in real time, and VCS has a simple API that supports queries and basic data consolidation. Since Varnish lives on the edge of your stack, all requests pass through it, and therein lies a mountain of interesting data. The data is user-defined, and each unique set of data is referenced via a key. Much like Varnish Cache, one can think of VCS as a high-performance key-value store for statistical data. A typical use case is tracking unique visitors or views for, say, streaming content.

VCS has a few characteristics that are unique.

  • VCS is real time.
  • VCS is capable of high-performance sorting of keys based on query criteria.
  • VCS leverages VCL and is therefore very flexible and capable of tracking statistics based on behaviors.
  • VCS is easily extensible to cater to very large websites that require tracking a large number of keys.
  • Query results are usually returned as a sorted list as per query criteria.
  • Statistics on a single key can be fetched.

To log data into VCS, you add a statement to your VCL that writes the transaction to the shared memory log. What you log is the key of the transaction, as you define it. This key is later used when querying and filtering data from VCS. The following example adds one VCS-specific std.log line per key to your existing VCL:

	# std.log is provided by the std VMOD, so it must be imported once.
	import std;

	sub vcl_deliver {
		# This creates a grouping for each specific value of the Host request header
		std.log("vcs-key: host:" + req.http.host);
		# This creates a key based on both the Host header and the URL
		std.log("vcs-key: url:" + req.http.host + req.url);
	}

Redefining hits


Hits, not to be confused with cache hits in Varnish lingo, refer to the number of requests going to your site. Each individual request for a resource, such as an image, JS, CSS or HTML file, counts as one hit. For example, if your index.html requires five images, one CSS file and one JS file, the hit count comes to 8, including the index.html request itself. Because of this broad scope, the hit count is often misleading and is considered an inaccurate view of real-life traffic.

You can choose whether to log all your traffic to VCS, only the page views, or any other subset. You do this in VCL with ordinary logic. So, if you only want to log deliveries with content type text/html, you wrap the log statement in an appropriate if-statement. Another possible limitation is to log only transactions that have an X-Article-ID header.
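
As a minimal sketch of such filtering, the two limitations could look like the following. It assumes that X-Article-ID is a response header set by the backend, and the key prefixes pageview: and articleview: are only illustrative namespaces, not names that VCS requires.

```vcl
import std;

sub vcl_deliver {
	# Log only page views: responses whose Content-Type is text/html.
	if (resp.http.Content-Type ~ "^text/html") {
		std.log("vcs-key: pageview:" + req.http.host + req.url);
	}

	# Log only transactions that carry an X-Article-ID header
	# (assumed here to be a response header).
	if (resp.http.X-Article-ID) {
		std.log("vcs-key: articleview:" + req.http.host + req.url);
	}
}
```

Requests for images, CSS or JS fall through both conditions and are simply not logged, so they never inflate the numbers.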

The key for VCS doesn’t have to be a URL. You can choose to log request or response headers as well. If you are logging different kinds of keys into VCS, we recommend prefixing each key with its kind to make it easier to query that particular namespace. Logging these other keys also avoids a few potential problems, such as a web page being available under multiple URLs (desktop vs. mobile views) or query parameters that would otherwise litter the data.

Let’s say that we are logging article IDs into VCS, derived from the X-Article-ID header. The VCL would look like this:

	std.log("vcs-key: artid:" + obj.http.x-article-id);

To query VCS and get a top list of article IDs, use the following query. Note that the regular expression is URL-encoded.

	/match/%5Eartid%3A/top
	# ^artid: is the regular expression 
	# Returns the top 10 most requested articles.

Unique visitors

Unique visitors, or visits, are by far the most important performance indicator of a website. Identifying unique visitors usually requires a logged-in session. Alternatively, one could track unique visitors by IP address. The latter approach is of course less accurate, as a single IP address can mask thousands of different users.

VCS can count the number of unique visits per key. More interestingly, it can give you a list sorted by the number of current unique visits. To define the uniqueness of a request, one could use a cookie or some other identification mechanism that resides in the HTTP headers. For example:

	sub vcl_recv { 
		# Use the Cookie header as the unique request identifier
		# If the client has multiple cookies you probably want to extract the cookie
		# that uniquely identifies the client.
		# std.log("vcs-unique-id: " + req.http.cookie);
		# Use client ip as the unique request identifier
		# std.log("vcs-unique-id: " + client.ip);
		# Use a custom header as the unique request identifier.
		# std.log("vcs-unique-id: " + req.http.identifier);
		# Use a combination of client IP and cookie
		# std.log("vcs-unique-id: " + client.ip + req.http.cookie );
	}
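
The first option above notes that you probably want to extract the one cookie that identifies the client. A minimal sketch of that, assuming a hypothetical session cookie named sessionid, could use regsub to pull its value out of the Cookie header and fall back to the client IP when the cookie is absent:

```vcl
import std;

sub vcl_recv {
	if (req.http.Cookie ~ "(^|; *)sessionid=") {
		# "sessionid" is a hypothetical cookie name; replace it with your own.
		# Capture group 2 holds the cookie value up to the next semicolon.
		std.log("vcs-unique-id: " +
		    regsub(req.http.Cookie, "(^|.*; *)sessionid=([^;]*).*", "\2"));
	} else {
		# No session cookie: fall back to the less accurate client IP.
		std.log("vcs-unique-id: " + client.ip);
	}
}
```

Logging only the extracted value keeps unrelated cookies, such as analytics or preference cookies, from making the same visitor look like several different ones.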

VCS will keep track of the number of times a unique ID occurs for a specific key. So, for a certain article you might have 9283 hits requested by 7863 unique users.

A/B testing with Varnish and VCS

A/B testing can easily be accomplished on a live site with Varnish and VCS. The benefit of combining the two is that VCS reports results on the fly, which speeds up the decision-making process.

A simple A/B testing setup can be achieved by, for example, routing traffic to a different backend through Varnish, or through URL rewrites. In both cases, data can be collected through VCS, and each variant can be identified by simply prefixing the VCS keys inside your VCL. For example:

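	sub vcl_deliver {
		# X-AB-Group is a hypothetical request header, assumed to be set
		# earlier (e.g. in vcl_recv) to mark which variant the user belongs to.
		if (req.http.X-AB-Group == "A") {
			std.log("vcs-key: abstats:A");
		} else {
			std.log("vcs-key: abstats:B");
		}
	}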
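
The backend-routing variant mentioned above could be sketched roughly as follows. This is only an illustration under several assumptions: it uses Varnish 4 syntax (req.backend_hint; Varnish 3 uses req.backend instead), the backend names, addresses, the ab cookie and the X-AB-Group header are all hypothetical, and in practice the fragment would be merged into your existing VCL rather than used as a complete file.

```vcl
vcl 4.0;

# Hypothetical backends serving the two variants.
backend variant_a { .host = "127.0.0.1"; .port = "8080"; }
backend variant_b { .host = "127.0.0.1"; .port = "8081"; }

sub vcl_recv {
	# Classify the user, here by a hypothetical "ab=B" cookie.
	# The header is always overwritten so clients cannot pick their own group.
	if (req.http.Cookie ~ "(^|; *)ab=B(;|$)") {
		set req.http.X-AB-Group = "B";
		set req.backend_hint = variant_b;
	} else {
		set req.http.X-AB-Group = "A";
		set req.backend_hint = variant_a;
	}
}

sub vcl_hash {
	# Keep cached objects for the two variants apart.
	hash_data(req.http.X-AB-Group);
}
```

Adding the group to the hash matters because both variants share the same URLs; without it, a cached response from one backend could be delivered to users in the other group.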

Using the VCS query API, you can then get a live feed of the number of transactions for the A category versus the B category. The more data you extract from the A/B test and feed into VCS, the more you get out. You could, for instance, log successful conversions into two other separate VCS keys, giving you a live feed of how successful your A/B experiment is.
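
As a rough sketch of logging conversions, assume that a conversion is a successful request to a hypothetical /signup/complete URL and that the user's variant is available in a hypothetical X-AB-Group request header. The abconv: prefix is likewise only an illustrative namespace.

```vcl
import std;

sub vcl_deliver {
	# "/signup/complete" is a hypothetical URL that marks a conversion.
	if (req.url ~ "^/signup/complete" && resp.status == 200) {
		# X-AB-Group is a hypothetical header holding the user's variant.
		if (req.http.X-AB-Group == "A") {
			std.log("vcs-key: abconv:A");
		} else {
			std.log("vcs-key: abconv:B");
		}
	}
}
```

Comparing the counts for an abconv: key with those for the matching abstats: key then gives a live view of the conversion rate per variant.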

Summary

VCS is a powerful and flexible tool for collecting real-time statistical data. This data forms critical business metrics for the decision-making process. With Varnish and VCS, one can collect and query these metrics in real time, without any third-party vendors or a daily trawl through logs.