This post looks at algorithm visualisation in two guises: first, ‘how’ an algorithm does its work – looking ‘inside’ the box, if you like – and second, what comes out the other end as an output or outcome. I’ve written this partly spurred on by a recent high-media-profile classification or labelling error (see below), and partly to get something out on visualisation, which is important to me.
Visualisation is a ‘hot’ topic in many domains, and deservedly so. The ubiquity of the web, and of the technologies associated with it, has brought what was previously a more ‘arcane’ discipline mixing statistics with design into the wider world. The other key concept in visualisation is narrative, and the concept of time or time series behind it. As a historian by training, rather than a statistician or designer, I of course like this bit too. The way in which data ‘stories’ or narratives can be constructed and displayed is a fascinating and organic process. The world of TED is relevant to any domain or discipline where data, and communication or insight around it, is involved.
The two words that I want to concentrate on here are beauty, and belief. They are closely related.
Visualisation is an important communication tool, and can often be ‘beautiful’ in its own right. This applies to something that helps us understand how an algorithm is doing its job, step by step or stage by stage, and also to what comes out the other side. Beauty (or elegance) and function are often aligned closely, so in this case what looks good also works well.
Visualisation is also an important component of ‘testing’ or validating a process and an output, whether during development, to see what is working, in what way and when, or in getting a client or partner who is meant to use the output in an application to buy in to, or accept, what is going on behind the scenes. So we have to believe in it too.
I like a pretty picture. Who doesn’t? And ones that move or you can interact with are even better. I’ve read some, but by no means all, of the ‘textbooks’ on data visualisation, from ‘classics’ like Tufte to the work of Steven Few. I’ve worked in my own way in visualisation applications (for me, mainly Tableau in recent years) and in conjunction with colleagues in D3 and other web technologies. Most of this has been to do with Marketing and ‘Enterprise’ data in the guise of my role in Sports Alliance. This is not the place to showcase or parade my own work, thank god. I’m going to concentrate firmly on paradigms or examples from others. This section will be quite short.
It’s easy to say, but I do love the D3 work of Mike Bostock. The examples he generates are invariably elegant, sparse and functional all at the same time. D3 works on the web and therefore potentially for anyone at any time, and he releases the code for anyone else to use. They also really work for me in terms of the varying ‘levels’ of understanding that they allow for audiences with different levels of mathematical or programming knowledge. The example below is for a sampling approach using Poisson Discs:
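For anyone who prefers code to pictures, the idea behind Poisson-disc sampling can be sketched in a few lines of Python. This is a naive ‘dart throwing’ variant that I’ve written purely for illustration, not necessarily the method animated in the example; the function name, canvas size, radius and number of attempts are all my own assumptions.

```python
import math
import random


def poisson_disc_naive(width, height, radius, attempts=2000, seed=0):
    """Naive dart throwing: keep a random point only if it lies at least
    `radius` away from every point accepted so far."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    points = []
    for _ in range(attempts):
        candidate = (rng.uniform(0, width), rng.uniform(0, height))
        if all(math.dist(candidate, p) >= radius for p in points):
            points.append(candidate)
    return points


samples = poisson_disc_naive(100, 100, radius=10)
print(len(samples), "points, none closer than 10 units to another")
```

The result is random but evenly spread: no two points crowd each other, which is what gives the sampled picture its pleasing, ‘organic’ look. The brute-force distance check gets slow for large canvases, so a real implementation would want something smarter.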
This next is for a shuffle. What I like here is that the visual metaphors are clear and coherent – discs are, well, discs (or annuli), and cards and shuffling (sorting) go together – and also that the visualisation is ‘sparse’: meaning is indicated with a ‘light touch’ through sparingly used colour, shade, shape and motion across a time series or iteration steps.
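A shuffle is also short enough to write down. The sketch below is the standard Fisher-Yates shuffle; I’m assuming, rather than asserting, that it is the kind of algorithm being animated, and the function name and the ten-card ‘deck’ are mine.

```python
import random


def fisher_yates_shuffle(items, rng=None):
    """Return a shuffled copy of `items` using the Fisher-Yates method."""
    rng = rng or random.Random()
    deck = list(items)  # copy, so the caller's list is left untouched
    # Walk backwards; swap each position with a random one at or before it.
    for i in range(len(deck) - 1, 0, -1):
        j = rng.randint(0, i)  # randint includes both end points
        deck[i], deck[j] = deck[j], deck[i]
    return deck


cards = list(range(1, 11))
shuffled = fisher_yates_shuffle(cards, random.Random(42))
print(shuffled)
```

Each iteration is one ‘step’ of the kind a visualisation can show: one card settles into its final place and the unshuffled part of the deck shrinks by one.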
The next example is another D3 piece, by a team exploring the relationships between journal articles and citations across 25 years and 3 journals or periodicals. It’s sorted by a ‘citation’ metric, and shows clearly which articles have the most ‘influence’ in the domain.
The body of work that the exhibit above represents, the scientific visualisations from the IEEE Vis events and the related journals InfoVis, VAST and SciVis across 3 decades, is breathtaking. I’ve ‘stolen’ two examples below that have a strong relation to ‘Machine Learning’ or Algorithm output exploration, which serves to segue or link to the next section on ‘belief’.
Both these are examples of how a visualisation of the output of an algorithm or approach can also help understand or test what the algorithm, and any associated parameters or configuration, is actually doing, and therefore whether we ‘believe’ in it or not.
In our work in Sports Alliance, we’ve struggled at times to get clients to ‘buy in’ to a classifier in action, partly because of the limitations of the software we’re using for that, and partly because we haven’t gone the extra mile to ensure complete ‘transparency’ in what an algorithm has done to get the ‘output’. We’ve mostly used decision trees, partly because they work in our domain, and partly because of the relative communicative ease of a ‘tree’ to demonstrate and evaluate the process, regardless of whatever math or algorithm is actually behind it. What has worked best for us is tying the output of the model – a ‘score’ for an individual item (in our case a supporter churn/acquisition metric) – back to that individual’s ‘real world’ profile and their values for the features that the model utilises and has deemed ‘meaningful’.
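To make that last point concrete, here is a minimal sketch of a tree that returns not just a score but the route it took to get there. It is not our production model: the feature names (`matches_attended`, `seasons_as_member`), the thresholds and the scores are all invented for illustration.

```python
def churn_score(supporter):
    """Return (score, reasons) from a tiny hand-written decision tree.

    Feature names, thresholds and scores are all hypothetical.
    """
    reasons = []
    attended = supporter["matches_attended"]
    seasons = supporter["seasons_as_member"]

    if attended < 5:
        reasons.append(f"matches_attended = {attended} (below 5)")
        if seasons < 2:
            reasons.append(f"seasons_as_member = {seasons} (below 2)")
            return 0.8, reasons
        reasons.append(f"seasons_as_member = {seasons} (2 or more)")
        return 0.5, reasons

    reasons.append(f"matches_attended = {attended} (5 or more)")
    return 0.1, reasons


score, reasons = churn_score({"matches_attended": 3, "seasons_as_member": 1})
print("churn score:", score)
for reason in reasons:
    print(" -", reason)
```

Showing a client the list of reasons next to the score, in terms of a real supporter’s real values, is the part that tends to win the argument.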
I’ve not used it in production, but I particularly like the BigML UI for decision tree evaluation and inspection. Here is an example from their public gallery for Stroke Prediction based on data from Michigan State University:
Trees and branching are an ‘easy’ way or metaphor to understand classification or sorting. Additional information on relative feature or variable correlation to target or ‘importance’
The emergence of ‘Deep Neural Nets’ of varying flavours has involved a lot of these themes or issues, particularly in the area of image classification. What is the ‘Net’ actually doing inside in order to arrive at the label category? How is one version of a ‘Net’ different to another, and is this better or worse?
I like this version presented by Matthew Zeiler of Clarifai in February this year. I don’t pretend to follow exactly what this means in terms of the NN architecture, but the idea of digging in to the layers of a NN and ‘seeing’ what the Net is seeing at each stage makes some sense to me.
The talk then goes on to show how they used the ‘visualisation’ to modify the architecture of the net to improve both performance and speed.
Another approach that seems to me to serve to help demystify or ‘open the box’ is the ‘generative’ approach. At my level of understanding, this involves reversing the process, something along the lines of giving a trained Net a label and asking it to generate inputs (e.g. pictures) at different layers in the Net that are linked to the label.
See the Google DeepMind DRAW paper from Feb 2015 here and a Google Research piece from June 2015 entitled ‘Inceptionism: Going Deeper into Neural Nets’ here. Both show different aspects of the generative approach. I particularly like the DRAW reference to the ‘spatial attention mechanism that mimics the foveation of the human eye’. I’m not technically qualified to understand what this means in terms of architecture, but I think I follow what the DeepMind researchers are trying to do using ‘human’ psychological or biological approaches as paradigms to help their work progress:
Here is an example of reversing the process to generate images in the second Google Research paper.
This also raises the question of error. Errors are implicit in any classifier or ‘predictive’ process, and statisticians and engineers have worked on this area for many years. This is now the time to mention the ‘recent’ high-profile labelling error from Google+. Dogs as Horses is mild, but Black people as ‘Gorillas’? I’m most definitely not laughing at Google+ for this or about this. It’s serious. It’s a clear example of how limited our ability can be to anticipate ‘unforeseen’ errors and the contexts in which these errors will be seen and understood.
I haven’t myself worked on multi-class problems. In my inelegant way, I would imagine that there is a ‘final’ ‘if … where…’ SQL-style clause that can be implemented to pick up pre-defined scenarios: for example, where the classification possibilities include both ‘human’ or ‘named friend’ and ‘gorilla’, return ‘null’.
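Written out in Python rather than SQL, my inelegant idea might look something like the sketch below. It is only a guess at how such a guard could work, not a description of what Google or anyone else does; the label names, the blocked combination and the 0.05 threshold are all hypothetical.

```python
# Pairs of label groups that must never be plausible at the same time.
# The groups here are illustrative assumptions only.
BLOCKED_COMBINATIONS = [
    ({"human", "named friend"}, {"gorilla"}),
]


def safe_label(candidates, threshold=0.05):
    """Return the top-scoring label, or None when a blocked combination
    of labels is plausible for the same image.

    `candidates` maps each label to the classifier's score for it.
    """
    if not candidates:
        return None
    plausible = {label for label, s in candidates.items() if s >= threshold}
    for group_a, group_b in BLOCKED_COMBINATIONS:
        if plausible & group_a and plausible & group_b:
            return None  # decline to label rather than risk the error
    return max(candidates, key=candidates.get)


print(safe_label({"gorilla": 0.6, "human": 0.3, "dog": 0.1}))
print(safe_label({"dog": 0.7, "horse": 0.3}))
```

The first call declines to answer because both ‘human’ and ‘gorilla’ are in play; the second is left alone. The trade-off is that some correct labels will be suppressed as well, which is a question of risk appetite rather than maths.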
The latitude for error in a domain or application of course varies massively. Data Scientists, and their previous incarnations as Statisticians or Quants, have known this for a long time. Metrics for ‘precision’, ‘recall’, risk tolerance and what a false positive or false negative actually means will vary by application.
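For reference, precision and recall are simple ratios of the counts in a confusion matrix. The sketch below uses the standard definitions; the counts themselves are made up.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions from confusion-matrix counts.

    precision = TP / (TP + FP): of the items flagged, how many were right.
    recall    = TP / (TP + FN): of the items that should have been
                                flagged, how many were found.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 80 true positives, 20 false positives,
# 40 false negatives.
precision, recall = precision_recall(tp=80, fp=20, fn=40)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Which of the two matters more depends on what a mistake costs. A false positive in a marketing campaign is a wasted email; a false positive in an image label can be a public offence.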
Testing, validating and debugging, and attitude to risk or error are critical.
A few years ago I worked on a test implementation of Apache Mahout for Product Recommendation in our business. I found the work done by Sean Owen (now at Cloudera, where Myrrix became Oryx) and by Ted Dunning and Ellen Friedman, both now at MapR, particularly useful.
Dunning’s tongue-in-cheek approach amused me as much as his obvious command of the subject matter impressed and inspired me. The ‘Dog and Pony’ show and the ‘Pink Waffles’ are great ‘anecdotal’ or ‘metaphorical’ ways to explain important messages – about testing and training and version control, as much as the inner workings of anomalous co-occurrence and matrix factorisation.
And this on procedure, training and plain good sense in algorithm development and version control.
In our case we didn’t get to production on this. In professional sport retail and the data we had available there wasn’t very much variation in basket item choices as so much of the trade is focussed on a single product – the ‘shirt’, equivalent to the ‘everybody gets a pony’ in Dunning’s example above.
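The ‘shirt’ problem can be shown with a small calculation. The sketch below scores a pair of items with a log-likelihood ratio (G-test) on a 2x2 table of basket counts, which is my understanding of the kind of test behind ‘anomalous co-occurrence’; it is my own illustration rather than Mahout’s code, and the basket counts are invented.

```python
import math


def log_likelihood_ratio(k11, k12, k21, k22):
    """G-test statistic for a 2x2 co-occurrence table.

    k11: baskets with both A and B    k12: baskets with A but not B
    k21: baskets with B but not A     k22: baskets with neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g += o * math.log(o / expected)
    return 2.0 * g


# Hypothetical shop with 1000 baskets, every one containing the shirt (A);
# 100 of them also contain a scarf (B).
shirt_and_scarf = log_likelihood_ratio(100, 900, 0, 0)

# Hypothetical pair that really does go together: scarf (A) and hat (B).
scarf_and_hat = log_likelihood_ratio(80, 20, 20, 880)

print(shirt_and_scarf, scarf_and_hat)
```

When everybody buys the shirt, the shirt co-occurs with everything exactly as often as chance predicts, so its score is zero and it tells the recommender nothing. The scarf and hat pair, by contrast, scores highly because the two turn up together far more often than their individual popularity would suggest.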