Over the weekend I saw a post by Ben Hoyt on replacing Google Analytics with GoAccess, which resonated with me. I turned off Google Analytics a year ago and, to be honest, I haven’t really looked back (mostly).
However, I am writing more consistently now and some stats might be interesting. I am certainly not reaching for Google Analytics, but my current approach is suboptimal.
When I do need to look at stats, I parse my access log with GoAccess and have a gander. The problem with this is, well, it’s mostly just bots and rubbish, so it’s actually hard to determine anything.
zcat -f access_log* | awk '$7!~/\..*$/' | grep -Ev 'HEAD' | goaccess --log-format=COMBINED \
    --ignore-crawlers --ignore-panel=REQUESTS_STATIC --http-protocol=no
Even after stripping out a bunch of bits, most of the content is spam bots and third-party tools like Feedbin, making the results less than ideal.
So it was interesting to see Ben still effectively using logs and GoAccess, but instead of parsing the generic access log, he parses just a pixel log.
A very dull lightbulb came on: can I do this ultra simply with Nginx and satisfy my stat cravings?
Yes.
ok.
Custom log by Location
Within Nginx, most things that deal with a visitor’s interaction are defined in a location block. So if we want to add a pixel, it makes sense to define it as a location.
location /pixel.gif {
    empty_gif;
}
With this defined, if you go to example.com/pixel.gif you will be served a 1×1 pixel transparent gif. The empty_gif directive is what’s delivering an in-memory pixel gif.
We can then expand this location to include a custom access log, so requests that hit this location are recorded in that log rather than the more generic one, if one is defined. Generally, but not always, directives inside a location block override the wider ones for that location.
So now we have
location /pixel.gif {
    empty_gif;
    access_log /var/www/example.com/statistics/logs/pixel.log;
}
Cool job done…
Adding some useful data
OK, so if we were to add that pixel to our website, leave it for a week and look at the data, we would be disappointed. Our logs would be nicely filled, and we would have some IPs and user agents, but we wouldn’t know which pages had been visited, as the request field would only ever show pixel.gif…
So how do we get the user location?
Well, most pixel trackers add extra information in the query string.
So our pixel becomes example.com/pixel.gif?u=mylocation/&r=https://google.com
We could simply hardcode the URL, assuming we are using a CMS, though that wouldn’t get us the referrer. For this we would need to resort to JavaScript, so borrowing Ben’s example:
<script>
  if (window.location.hostname == 'benhoyt.com') {
    var _pixel = new Image(1, 1);
    _pixel.src = "https://cloudfront.example.net/pixel.png?u=" +
      encodeURIComponent(window.location.pathname) +
      (document.referrer ? "&r=" + encodeURIComponent(document.referrer)
                         : "");
  }
</script>
<noscript>
  <img src="https://cloudfront.example.net/pixel.png?u={{ page.url | url_encode }}" />
</noscript>
we can do this:
<script>
  if (window.location.hostname == 'example.com') {
    var _pixel = new Image(1, 1);
    _pixel.src = "https://example.com/pixel.gif?u=" +
      encodeURIComponent(window.location.pathname) +
      (document.referrer ? "&r=" + encodeURIComponent(document.referrer)
                         : "");
  }
</script>
Then we need to make use of these arguments.
In Nginx, any query argument can be accessed in the location block by using $arg_[key], so u becomes $arg_u and r becomes $arg_r:
location /pixel.gif {
    set $referrer $arg_r;
    set $rurl $arg_u;
    empty_gif;
    access_log /var/www/example.com/statistics/logs/pixel.log;
}
So now we have used the Nginx set directive to define the variables $referrer and $rurl. By using set, we can now access those variables in our log format.
Now we need to add a new log format. This is done in the http context, not the location. If you have a conf.d folder, you can add this as a new file. The log format has to exist before the location block that uses it, in the order the configuration is read. In our case we are just doing:
log_format pixel '$remote_addr - $remote_user [$time_local] "$rurl" '
'$status $body_bytes_sent "$referrer" '
'"$http_user_agent" "$http_x_forwarded_for" ';
In the above, we have replaced the request and referrer fields ($request and $http_referer in the standard combined format) with our variables. The final step is to get our location to use this new format (which we have cunningly named pixel):
location /pixel.gif {
    set $referrer $arg_r;
    set $rurl $arg_u;
    empty_gif;
    access_log /var/www/example.com/statistics/logs/pixel.log pixel;
}
We do this by adding the log_format name to the end of the access_log directive.
Wrapping up & potential improvements
That’s it. You can add your JavaScript to the pages you want to track, and your log file will slowly fill up with your visitors. If you have the GeoIP module enabled you could also get Nginx to store that data, and in theory, you can add just about anything else you want in there.
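As a rough sketch of the GeoIP idea: assuming Nginx was built with ngx_http_geoip_module and a legacy country database is available, the module's $geoip_country_code variable can be appended to the log format. The database path and the pixel_geo format name here are illustrative assumptions, not something from my setup.

```nginx
# http context, e.g. the same conf.d file that holds the pixel log_format.
# Assumption: ngx_http_geoip_module is compiled in, and a legacy GeoIP
# country database lives at this (hypothetical) path.
geoip_country /usr/share/GeoIP/GeoIP.dat;

# Hypothetical variant of the pixel format with the visitor's two-letter
# country code in the last field instead of X-Forwarded-For.
log_format pixel_geo '$remote_addr - $remote_user [$time_local] "$rurl" '
                     '$status $body_bytes_sent "$referrer" '
                     '"$http_user_agent" "$geoip_country_code"';
```

The location block would then name pixel_geo instead of pixel at the end of its access_log directive.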
Now when you run GoAccess against the new log file, you get some actually useful stats.
So clearly my non-existent worries about gathering statistics are answered, and you are seeing a pixel on my site right now?
Well, it’s got a couple of problems. Number one: in the above, if the query string is empty, the variables end up empty (this doesn’t appear to cause any performance issue), which results in very weird log entries with missing data. We could probably do some validation, but where is the fun in that?
Generally this solution is open to a fair amount of abuse, and given 99% of my traffic is trying to abuse my site in some way, this feels like an area that we shouldn’t skimp on.
The second issue is more that I’m still pondering what to do with analytics. Something simple like this is one of the options I’m considering, but then so is something like Fathom.
I do, however, like the simplicity of a solution like this, combined with GoAccess. If I want to get fancy, GoAccess can export its output as JSON rather than an HTML report, so I could build a custom interface for it.
What GoAccess lacks, in general, is any way to handle custom data. For example, I would quite like to use UTM or similar to define sources. While there are open issues for both custom data panels and a UTM-specific panel, nothing concrete is planned.
So while I haven’t implemented this yet, I think it has a lot of potential to be the basic stats I have secretly been craving.
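For what it’s worth, a minimal sketch of that validation could lean on a map plus the if= parameter of access_log (available since Nginx 1.7.0), so requests without a u argument are simply never written to the pixel log. The $log_pixel variable name is made up for this example, and I haven’t run this in production.

```nginx
# http context, alongside the pixel log_format.
# $log_pixel is a hypothetical name: "0" when ?u= is missing or empty.
map $arg_u $log_pixel {
    ""      0;
    default 1;
}

server {
    # ... the rest of the server block stays as it is ...

    location /pixel.gif {
        set $referrer $arg_r;
        set $rurl $arg_u;
        empty_gif;
        # A request is not logged when the if= condition is "0" or empty.
        access_log /var/www/example.com/statistics/logs/pixel.log pixel if=$log_pixel;
    }
}
```

This only filters out the empty case. Anyone can still request the pixel with a made-up u value, so it does nothing about deliberate abuse.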