For those of us publishing content on the web there is a natural interest in gathering some statistical information about how that content is read and received. This has little to do with spying on users; in most cases resolving the data down to an individual is neither possible nor desired. Instead, such monitoring helps to understand if and how web pages are read, whether navigation works as intended, which connections from or to other pages exist and what people are actually looking for when they come to the page. In general the motivation is simply to optimize the presentation of the content.
There are a number of solutions available for such tracking tasks. I prefer the free and open Piwik suite. It works like a charm, is easy to install and can be customized to your needs. And, most important, it keeps the data where it belongs: on your system, under your control. Not at some service provider’s data center in a foreign country with twisted ideas about what is right and what is wrong. In places where people actually care about privacy and sensitive handling of usage data, like where I live, this might also be a legal issue.
The traditional way to monitor web pages using Piwik (as with most other tracking solutions) is to include a small JavaScript snippet in each page’s footer. That script signals detected usage activity back to the Piwik tracking system, where it is stored and evaluated. Piwik actually offers that snippet ready for copy and paste. So all that’s left is to locate the correct place to put that snippet and save the modified page.
But wait!
Often things are not that straightforward!
I ran into two major issues that made me think about whether the above strategy really is the best way to go:
I had to maintain a separate snippet of that javascript code for each single web page.
Now whilst this at first just sounds like storing a bunch of files, things get nasty the moment the number of pages increases. You have to name each file (and remember and cite that name!) and you have to modify each file (I had to modify the way https access was handled each time…). And you have to keep those files up to date if your piwik setup changes.
But actually more annoying is a problem that arises when you have a site that does not rely on a small set of markup templates for all its content. Imagine a site where the user’s choice of a theme picks templates from a set of offerings. Or where static pages play a relevant role. Or simply dynamic sites that do not rely on a common footer template. In those cases the act of including the script snippet in all pages becomes a challenge! You have to modify multiple files and make sure you don’t actually break the syntax. And above all, you really have to catch every page there is. Otherwise you end up with something like a black hole in your page monitoring!
Next, we all know what happens when you upgrade a page, install a new version of a CMS or theme or simply add a few posts.
How dumb, I told myself. Computers are meant to automate things!
So I started to look for a smarter way to include those snippets. For a less error-prone setup and a more convenient handling of site modifications. It took me a few attempts, but finally I ended up with a solution that actually works for me. I have been using it for a few months now and, as expected, it has not broken. No CMS upgrade broke the monitoring. No additionally installed theme left a monitoring gap. It just works. Even for a few additional static pages I always forget about. This is how publishing on the web is meant to be!
And since I am a fan of the idea of „share, don’t hide!“, I would like to present that solution here. Think about it, use it, modify it! Leave a comment or just smile.
However you like 🙂
The solution is based on the apache http server and builds on two mechanisms: server side includes (SSI, mod_include) and environment variables (the SetEnv directive used below is provided by mod_env). The filter chain itself additionally needs mod_filter and mod_substitute. I guess it can be ported to other components as well. And I am sure that there are other, smarter solutions. If you find one, drop me a note, share it!
First, I created a single, modified version of that piwik script snippet. It is close to the original version offered by Piwik itself, except that the numerical site index is replaced with a placeholder. The base url of the piwik application has also been turned into a placeholder, but that is only for convenience:
<?php
define('piwikBase', 'some.piwik.server/piwik/');
define('piwikSite', apache_getenv('PIWIK_ID'));
if (is_numeric(piwikSite)) {
?>
<!-- Piwik: vvv begin tracking code vvv -->
<script type="text/javascript">
var pkBaseURL = (("https:" == document.location.protocol) ? "https://<?php echo piwikBase; ?>" : "http://<?php echo piwikBase; ?>");
document.write(unescape("%3Cscript src='" + pkBaseURL + "piwik.js' type='text/javascript'%3E%3C/script%3E"));
</script><script type="text/javascript">
try {
  var piwikTracker = Piwik.getTracker(pkBaseURL + "piwik.php", <?php echo piwikSite; ?>);
  piwikTracker.trackPageView();
  piwikTracker.enableLinkTracking();
} catch( err ) {}
</script><noscript><p><img src="http://<?php echo piwikBase; ?>piwik.php?idsite=<?php echo piwikSite; ?>" style="border:0" alt="" /></p></noscript>
<!-- Piwik: ^^^ end tracking code ^^^ -->
<?php } else { ?>
<!-- invalid piwik site id: <?php echo piwikSite;?> -->
<?php } ?>
Ok, but how do I get that stuff included? After all this is said to be the tricky bit…
For this I use some SSI magic. The trick has two steps, two filters. Both filters are applied to all delivered pages of mimetype ‚text/html‘, so to page markup.
The first filter places a marker towards the end of each markup page by scanning for the closing body tag. I use the ‚SUBSTITUTE‘ command for this: the closing tag is prefixed with a marker, which is actually an inclusion command for the SSI engine. The mechanism works like any pattern replacement; the flags ‚n‘ and ‚i‘ make it treat the pattern as a fixed string and match it case-insensitively.
The marker itself is only temporary; it is processed and replaced by the second filter: the ‚INCLUDES‘ filter replaces the marker set in step 1 with the result of an internal, virtual request. Using an ‚Alias‘ directive we map that request to the modified piwik snippet prepared above. Both filters are chained, so that they are always applied as a pair:
# include the piwik tracking code at the bottom of every html page
FilterDeclare PIWIK_token
FilterProvider PIWIK_token SUBSTITUTE resp=Content-Type $text/html
SUBSTITUTE 's|</body>|<!--#include virtual="/piwik" --></body>|ni'
FilterDeclare PIWIK_code
FilterProvider PIWIK_code INCLUDES resp=Content-Type $text/html
FilterChain PIWIK_token PIWIK_code
# map virtual request to the file system
Alias /piwik /srv/www/internal/piwik.php
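To make the two steps easier to follow, here is a minimal sketch in Python that mimics what the filter pair does to a page. This is only an illustration and not part of the setup: the function names are made up, and the real work is done by apache's SUBSTITUTE and INCLUDES filters, not by this code.

```python
import re

# The SSI command that step 1 puts in front of the closing body tag,
# copied from the substitution rule in the configuration above.
SSI_MARKER = '<!--#include virtual="/piwik" -->'


def substitute_filter(html: str) -> str:
    """Step 1 (hypothetical stand-in for the SUBSTITUTE filter).

    Replaces every closing body tag, matched as a fixed string and
    case-insensitively (flags 'n' and 'i'), with marker + '</body>'.
    """
    pattern = re.escape("</body>")
    return re.sub(pattern, lambda m: SSI_MARKER + "</body>", html,
                  flags=re.IGNORECASE)


def includes_filter(html: str, snippet: str) -> str:
    """Step 2 (hypothetical stand-in for the INCLUDES filter).

    Replaces the marker with the output of the virtual request,
    which is passed in here as a plain string.
    """
    return html.replace(SSI_MARKER, snippet)


def deliver(html: str, snippet: str) -> str:
    """Both steps chained, the way FilterChain applies them as a pair."""
    return includes_filter(substitute_filter(html), snippet)
```

Calling deliver() with a page and some snippet text returns the page with the snippet sitting right before the closing body tag; a page without a body tag passes through unchanged.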
Now all that is left is to activate that strategy for every (virtual) host that should be tracked. For this I add two directives to each host’s central configuration, that is, to a file in the vhosts directory of your http server. One directive specifies the host’s numerical piwik id as an environment variable, the second includes the filter logic shown above:
# include local piwik setup
SetEnv PIWIK_ID 2
Include /etc/apache2/vhosts.d/_internal.inc
As a result, each delivered markup page now carries a piwik tracking snippet right before the closing body tag as shown in this example:
<html>
<body>
[...]
<!-- Piwik: vvv begin tracking code vvv -->
<script type="text/javascript">
var pkBaseURL = (("https:" == document.location.protocol) ? "https://some.piwik.server/piwik/" : "http://some.piwik.server/piwik/");
document.write(unescape("%3Cscript src='" + pkBaseURL + "piwik.js' type='text/javascript'%3E%3C/script%3E"));
</script><script type="text/javascript">
try {
  var piwikTracker = Piwik.getTracker(pkBaseURL + "piwik.php", 2);
  piwikTracker.trackPageView();
  piwikTracker.enableLinkTracking();
} catch( err ) {}
</script><noscript><p><img src="http://some.piwik.server/piwik/piwik.php?idsite=2" style="border:0" alt="" /></p></noscript>
<!-- Piwik: ^^^ end tracking code ^^^ -->
</body>
</html>
So let’s have a short summary of what I did:
Instead of modifying each and every page, template and resource served as part of my pages, I created a mechanism that includes the required script snippets automatically. First, this means that no modifications have to be made to the actual content of the pages, content that may exist in different formats and structures. But more important, this mechanism inserts the current numerical piwik id for all hosts meant to be tracked. And it does so even for those pages I did not think of adding the snippet to! Even after a modification of those pages, maybe by pulling fresh markup versions or by upgrading the CMS the page is based on. I don’t have to care about that. Whenever a markup page is delivered, I can be sure it contains exactly one single (correct!) copy of the Piwik script snippet.
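If you would rather check that claim than trust it, something like the following Python sketch could be pointed at a few delivered pages. It is an assumption-laden helper, not part of the setup: the function names are made up, and it relies on the "begin tracking code" comment from the snippet above being present unchanged.

```python
import re
import urllib.request

# Comment line that opens the snippet in the piwik.php template above.
BEGIN_MARKER = "<!-- Piwik: vvv begin tracking code vvv -->"


def count_snippets(html: str) -> int:
    """Number of tracking snippets found in a delivered page."""
    return html.count(BEGIN_MARKER)


def is_tracked_once(html: str) -> bool:
    """True if the page carries exactly one snippet before the last </body>."""
    if count_snippets(html) != 1:
        return False
    closing_tags = list(re.finditer(r"</body>", html, flags=re.IGNORECASE))
    if not closing_tags:
        return False
    return html.find(BEGIN_MARKER) < closing_tags[-1].start()


def fetch(url: str) -> str:
    """Download a page as text (hypothetical helper, bad bytes are replaced)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Feeding the result of fetch() for one of your own pages into is_tracked_once() tells you whether that page is covered; a loop over a handful of URLs, including the static pages one tends to forget, makes for a quick sanity check after an upgrade.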
Interesting 🙂
Would you know if there’s a way to pass PHP variables from the actual sites into the template? I’m asking because I use page/visit scope custom variables quite heavily and I guess this will be a bit hard with this approach. Or is it possible?
cheers
I don’t see a reason why this should not be possible, although I never tried it myself since I have no use case for it. I use an environment variable to pass values from the request to the script. That variable’s value is assigned dynamically. It should be possible to have PHP set up an (additional) environment variable and access that within the script, why not?
There is only one requirement: request and script must be handled inside the same process. So this probably won’t work if you run php as classical cgi, so with a forked process for each request, since environment variables are not propagated ‚upwards‘. In all other cases environment variables should be accessible from both sides…
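The point about variables not travelling ‚upwards‘ is a general property of processes and can be seen outside of apache and php as well. A small Python sketch, where the variable name PIWIK_DEMO and the function name are made up for the demonstration:

```python
import os
import subprocess
import sys


def child_sets_variable(name: str, value: str) -> str:
    """Start a child process that sets an environment variable for itself.

    Returns what the child printed, i.e. the value as seen inside the child.
    """
    code = (
        "import os, sys; "
        "os.environ[sys.argv[1]] = sys.argv[2]; "
        "print(os.environ[sys.argv[1]])"
    )
    result = subprocess.run(
        [sys.executable, "-c", code, name, value],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def visible_in_parent(name: str) -> bool:
    """Check whether the variable exists in the current (parent) process."""
    return name in os.environ
```

The child sees the value it has set, but after it exits the parent still knows nothing about it. This is the situation you would be in with a forked cgi process per request.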
I’d say: give it a try – and drop a note when you succeed 😉
Hello,
this sounds like exactly what I have been looking for, except…
When I try to restart apache with the new configuration, I get an „Unknown filter provider SUBSTITUTE“ error message. I have googled a bit, but I haven’t been able to figure out where this comes from. Would you have any idea what went wrong here?
Well, this sounds like you did not load the substitution module which provides this filter. The module is called mod_substitute and has to be loaded before it can be used:
LoadModule substitute_module /usr/lib64/apache2-prefork/mod_substitute.so
or wherever the modules are located on your system. The module typically comes bundled with the apache worker packages; on my openSUSE system, for example, it is part of the package apache2-prefork.
I figured out what the current compilation configuration was, so I will add --enable-modules and see what happens. Thanks again.