Visualising Machine Learning Algorithms: Beauty and Belief

This post looks at algorithm visualisation in two guises: first, ‘how’ an algorithm does its work – looking ‘inside’ the box, if you like – and second, what comes out the other end as an output or outcome. I’ve written it spurred on initially by a recent high-media-profile classification or labelling error (see below), and partly to get something out on visualisation, which is important to me.

Visualisation is a ‘hot’ topic in many domains, and deservedly so. The ubiquity of the web, and the technologies associated with it, has brought what was previously a more ‘arcane’ discipline mixing statistics with design into the wider world. The other key concept in visualisation is narrative, and the concept of time or time series behind it. As a historian by training, rather than a statistician or designer, I of course like this bit too. The way in which data ‘stories’ or narratives can be constructed and displayed is a fascinating and organic process. The world of TED is relevant to any domain or discipline where data, and communication or insight around it, is involved.

The two words that I want to concentrate on here are beauty, and belief. They are closely related.

Visualisation is an important communication tool, and can often be ‘beautiful’ in its own right. This applies to something that helps us understand how an algorithm is doing its job, step by step or stage by stage, and also to what comes out the other side. Beauty (or elegance) and function are often aligned closely, so in this case what looks good also works well.

Visualisation is also an important component of ‘testing’ or validating a process and an output: either as part of the development process – what is working, in what way, when and how – or in getting a client or partner who is meant to use the output in an application to buy in to, or accept, what is going on behind the scenes. So we have to Believe in it too.

Beauty

I like a pretty picture. Who doesn’t? And ones that move, or that you can interact with, are even better. I’ve read some, but by no means all, of the ‘textbooks’ on data visualisation, from ‘classics’ like Tufte to the work of Stephen Few. I’ve worked in my own way in visualisation applications (for me, mainly Tableau in recent years) and, in conjunction with colleagues, in D3 and other web technologies. Most of this has been to do with Marketing and ‘Enterprise’ data in the guise of my role at Sports Alliance. This is not the place to showcase or parade my own work, thank god. I’m going to concentrate firmly on paradigms or examples from others. This section will be quite short.

It’s easy to say, but I do love the D3 work of Mike Bostock. The examples he generates are invariably elegant, sparse and functional all at the same time. D3 works on the web, and therefore potentially for anyone at any time, and he releases the code for anyone else to use. They also really work for me in terms of the varying ‘levels’ of understanding they allow for audiences with different degrees of mathematical or programming knowledge. The example below is for a sampling approach using Poisson discs:

[Figure: Bostock’s Poisson-disc sampling visualisation]
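Not Bostock’s D3, but to make the idea concrete, here is a minimal Python sketch of Poisson-disc sampling in its simplest ‘dart-throwing’ form (Bostock’s animation uses the faster Bridson variant); the dimensions and radius are arbitrary:

```python
import math
import random

def poisson_disc(width, height, r, n_target=500, max_attempts=30):
    """Naive 'dart-throwing' Poisson-disc sampling: accept a random
    candidate only if it is at least r away from every accepted point."""
    points = []
    attempts = 0
    while len(points) < n_target and attempts < n_target * max_attempts:
        attempts += 1
        x, y = random.uniform(0, width), random.uniform(0, height)
        if all(math.hypot(x - px, y - py) >= r for px, py in points):
            points.append((x, y))
    return points

samples = poisson_disc(400, 300, r=15)
print(len(samples), "points accepted")
```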

The next is for a shuffle. What I like here is that the visual metaphors are clear and coherent – discs are, well, discs (or annuli), and cards and shuffling (sorting) go together – and also that the visualisation is ‘sparse’: meaning is indicated with a ‘light touch’, using colour sparingly, plus shade, shape and motion to convey the iteration steps over time.

[Figure: Bostock’s quicksort shuffle visualisation]
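Behind an animation like this, the shuffle itself is usually the classic Fisher–Yates algorithm; a minimal Python sketch:

```python
import random

def fisher_yates(items):
    """Fisher-Yates shuffle: walk back from the end, swapping each
    element with a uniformly chosen element at or before it."""
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = random.randint(0, i)  # inclusive, so a[i] may stay in place
        a[i], a[j] = a[j], a[i]
    return a

print(fisher_yates(range(10)))
```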

The next example is another D3 piece, by a team exploring the relationships between journal articles and citations across 25 years and 3 journals or periodicals. It’s sorted by a ‘citation’ metric, and shows clearly which articles have the most ‘influence’ in the domain.

[Figure: D3 visualisation of IEEE journal article citations]

The body of work across three decades of scientific visualisation in the IEEE VIS events and the related InfoVis, VAST and SciVis tracks, which the exhibit above indexes, is breathtaking. I’ve chosen two examples, ‘stolen’ below, that have a strong relation to ‘Machine Learning’ or algorithm-output exploration, which serves to segue to the next section on ‘belief’.

[Figure: IEEE VIS 2015 example]

[Figure: IEEE VIS 2015 example 2]

Both these are examples of how a visualisation of the output of an algorithm or approach can also help understand or test what the algorithm, and any associated parameters or configuration, is actually doing, and therefore whether we ‘believe’ in it or not.

Belief

In our work at Sports Alliance, we’ve struggled at times to get clients to ‘buy in’ to a classifier in action, partly due to the limitations of the software we’re using, and partly down to us not going the extra mile to ensure complete ‘transparency’ about what an algorithm has done to get the ‘output’. We’ve mostly used decision trees, partly because they work in our domain, and partly because of the relative communicative ease of a ‘tree’ for demonstrating and evaluating the process, regardless of whatever maths or algorithm is actually behind it. What has worked best for us is tying the output of the model – a ‘score’ for an individual item (in our case a supporter churn/acquisition metric) – back to the individual’s ‘real world’ profile and their values for the features that the model utilises and has deemed ‘meaningful’.

I’ve not used it in production, but I particularly like the BigML UI for decision tree evaluation and inspection. Here is an example from their public gallery for stroke prediction, based on data from Michigan State University:

[Figure: BigML decision tree model for stroke prediction]

Trees and branching are an ‘easy’ metaphor for understanding classification or sorting. Additional information on each feature or variable’s correlation with the target – its ‘importance’ – is summarised alongside the tree:

[Figure: BigML decision tree model field-importance summary]
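BigML’s UI is its own engine, but the same fit–inspect–rank loop can be sketched in a few lines with scikit-learn; the file and column names below are hypothetical stand-ins for the stroke data:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("stroke.csv")                     # hypothetical extract
features = ["age", "avg_glucose", "hypertension"]  # illustrative columns
X, y = df[features], df["stroke"]

tree = DecisionTreeClassifier(max_depth=4).fit(X, y)

print(export_text(tree, feature_names=features))   # readable tree rules
# Rank features by 'importance', as in BigML's summary panel
for name, imp in sorted(zip(features, tree.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:15s} {imp:.3f}")
```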

The emergence of ‘Deep Neural Nets’ of varying flavours has brought a lot of these themes or issues to the fore, particularly in the area of image classification. What is the ‘Net’ actually doing inside in order to arrive at the label category? How is one version of a ‘Net’ different from another, and is it better or worse?

I like this version presented by Matthew Zeiler of Clarifai in February this year. I don’t pretend to follow exactly what this means in terms of the NN architecture, but the idea of digging into the layers of a NN and ‘seeing’ what the Net is seeing at each stage makes some sense to me.

[Figure: Clarifai DNN layer visualisation (Zeiler, February 2015)]

The talk then goes on to show how they used the ‘visualisation’ to modify the architecture of the net to improve both performance and speed.

Another approach that seems to me to help demystify or ‘open the box’ is the ‘generative’ approach. At my level of understanding, this involves reversing the process: something along the lines of giving a trained Net a label and asking it to generate inputs (e.g. pictures) at different layers in the Net that are linked to that label.
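As a hedged illustration of the flavour of this (not DeepMind’s or Google’s actual method), one simple ‘generative’ probe is activation maximisation: start from noise and nudge the input until the trained net’s score for a chosen label climbs. A PyTorch sketch, with the model choice and class index as arbitrary assumptions:

```python
import torch
import torchvision

# Any pretrained classifier will do; resnet18 is a convenient stand-in.
model = torchvision.models.resnet18(pretrained=True).eval()

x = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
opt = torch.optim.Adam([x], lr=0.05)
label = 254                                          # arbitrary class index

for _ in range(200):
    opt.zero_grad()
    score = model(x)[0, label]   # the net's logit for the chosen label
    (-score).backward()          # ascend the score via its negative
    opt.step()

# x is now an input the net increasingly 'sees' as the chosen label
```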

See the Google DeepMind DRAW paper from February 2015 here, and a Google Research piece from June 2015 entitled ‘Inceptionism: Going Deeper into Neural Nets’ here. Both show different aspects of the generative approach. I particularly like the DRAW reference to the ‘spatial attention mechanism that mimics the foveation of the human eye’. I’m not technically qualified to understand what this means in terms of architecture, but I think I follow what the DeepMind researchers are trying to do in using ‘human’ psychological or biological paradigms to help their work progress:

[Figure: Google DeepMind DRAW MNIST generation example]

Here is an example of reversing the process to generate images, from the second Google Research paper.

[Figure: ‘Inceptionism’ dumbbell generation example (Google Research, June 2015)]

This also raises the question of error. Errors are implicit in any classifier or ‘predictive’ process, and statisticians and engineers have worked on this area for many years. Now is the time to mention the ‘recent’ high-profile labelling error from Google+. Dogs as horses is mild, but Black people as ‘gorillas’? I’m most definitely not laughing at Google+ for this or about this. It’s serious. It’s a clear example of how limited we can be in understanding ‘unforeseen’ errors, and the contexts in which those errors will be seen and understood.

I haven’t myself worked on multi-class problems. In my inelegant way, I would imagine a ‘final’ ‘if … where …’ SQL-style clause that could be implemented to pick up pre-defined scenarios: for example, where the classification possibilities include both ‘human’ or ‘named friend’ and ‘gorilla’, return ‘null’.
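Sketched in Python rather than SQL, such a guard might look like the following; the label names and ‘blocked’ pairings are purely illustrative:

```python
# Pre-defined 'never confuse' pairings; labels are illustrative only.
BLOCKED_PAIRS = {frozenset({"person", "gorilla"})}

def safe_label(candidates):
    """candidates: list of (label, score) pairs, highest score first.
    Return None (the 'null' of the imagined SQL clause) whenever a
    blocked pairing appears among the top candidates."""
    top = {label for label, _ in candidates[:2]}
    if any(pair <= top for pair in BLOCKED_PAIRS):
        return None
    return candidates[0][0]

print(safe_label([("gorilla", 0.61), ("person", 0.38)]))  # -> None
print(safe_label([("dog", 0.90), ("horse", 0.10)]))       # -> dog
```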

The latitude for error in a domain or application of course varies massively. Data Scientists, and their previous incarnations as Statisticians or Quants, have known this for a long time. Metrics for ‘precision’ and ‘recall’, risk tolerance, and what a false positive or false negative actually means will vary by application.
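For concreteness, the two headline metrics on a toy binary example, using scikit-learn’s implementations:

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # what actually happened
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # what the classifier said

# precision: of everything flagged positive, how much really was?
# recall:    of everything really positive, how much did we flag?
print("precision:", precision_score(y_true, y_pred))  # 3/4 = 0.75
print("recall:   ", recall_score(y_true, y_pred))     # 3/4 = 0.75
```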

Testing, validating and debugging, and attitude to risk or error are critical.

A few years ago I worked on a test implementation of Apache Mahout for product recommendation in our business. I found the work done by Sean Owen (now at Cloudera, where his Myrrix project became Oryx) and by Ted Dunning and Ellen Friedman (both now at MapR) particularly useful.

Dunning’s tongue-in-cheek approach amused me as much as his obvious command of the subject matter impressed and inspired me. The ‘Dog and Pony’ show and the ‘Pink Waffles’ are great ‘anecdotal’ or ‘metaphorical’ ways to convey important messages – about testing, training and version control, as much as the inner workings of anomalous co-occurrence and matrix factorisation.

[Figure: Dunning & Friedman’s ‘Dog and Pony’ recommender example]
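The ‘anomalous co-occurrence’ test Dunning describes boils down to a log-likelihood ratio over a 2×2 contingency table; here is a common Python formulation of it (a standard rendering, not MapR’s code):

```python
import math

def _entropy(counts):
    """Unnormalised Shannon entropy of a list of counts."""
    total = sum(counts)
    return -sum(k * math.log(k / total) for k in counts if k > 0)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 co-occurrence table:
    k11 = both items together, k12/k21 = one without the other,
    k22 = neither. Large values flag 'anomalous' co-occurrence."""
    row = _entropy([k11 + k12, k21 + k22])
    col = _entropy([k11 + k21, k12 + k22])
    mat = _entropy([k11, k12, k21, k22])
    return 2 * (row + col - mat)

print(llr(100, 10, 10, 10000))   # strong co-occurrence signal
print(llr(10, 100, 100, 10000))  # much weaker signal
```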

And this on procedure, training and plain good sense in algorithm development and version control.

[Figure: Dunning & Friedman on recommender training]

In our case we didn’t get to production on this. In professional sport retail, with the data we had available, there wasn’t much variation in basket item choices, as so much of the trade is focussed on a single product – the ‘shirt’ – the equivalent of ‘everybody gets a pony’ in Dunning’s example above.

Machine Learning, Professional Sports and Customer Marketing in the UK and Europe

Professional Sports Customer Marketing is driven primarily by two key product lines or revenue areas: subscriptions or memberships, and seat or ticket products. The two are ‘combined’ in the classic, evergreen ‘Season Ticket’ packaged product, which is essentially the first tier in a membership programme, and on which other ‘loyalty’ programmes or schemes can function.

This post looks at the application of ‘Machine Learning’ in the form of both supervised and unsupervised methods to Customer Marketing in Professional Sport.

I’ll start with an example of an unsupervised approach, using a ‘standard’ k-means algorithm to identify clusters of professional sports club customers, based on features or attributes that describe, as broadly as possible, customer profile and behaviour over time. These features were built or sourced from an underlying ‘data model’ that broadly covers the following areas:

  1. Sales transactions – baskets or product items purchased by a customer over time, broken down by product area into Season Tickets, Match Tickets, Retail (Merchandise), Memberships and Content subscriptions
  2. Socio-demographic – relating to individual identity, gender and geography, and also to relationships with other supporters in the data
  3. Marketing behaviour – engagement and response to outbound and inbound marketing content over time

We wanted to create a ‘UK Behavioural Model’ that would be representative of UK sports clubs, so we created a sample, in proportion to overall club or client size, from a set of 25 UK clubs across Football, Rugby and Cricket. The sample consisted of 300k supporters from an overall base of approximately 10 million. The input or feature selection was normalised across all clubs. We experimented with different iterations based on cluster numbers and sizing.
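As a hedged sketch of the pipeline (the file and feature names are illustrative, not our production schema), the scikit-learn version of this normalise–cluster–summarise loop is short:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("uk_supporter_sample.csv")   # hypothetical 300k sample
features = ["season_tickets", "match_tickets", "retail_spend",
            "memberships", "content_subs", "email_engagement"]

X = StandardScaler().fit_transform(df[features])   # normalise across clubs
km = KMeans(n_clusters=15, random_state=0).fit(X)  # the 15-cluster version
df["cluster"] = km.labels_

# Per-cluster feature averages: the numbers behind the 'spectrograph'
print(df.groupby("cluster")[features].mean())
print(df["cluster"].value_counts().sort_index())   # cluster sizes
```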

The exhibit below shows a version with 15 clusters, numbered and coloured across the first row from 0–14. The row headers are the different features or feature groups, and the cluster colours persist down the rows. The horizontal ‘size’ of the bar for each cluster in each row is the average per customer for that feature; read against the cluster sizes in the first row, it provides a visual guide to the differences between clusters.

[Figure: UK sports clustering ‘spectrograph’ – 15 clusters by feature]

Revenue £££s in the bottom 3 rows is generated from a handful of clusters only:

  • Purple 8 and Mauve 9 dominate Ticketing revenue
  • Red Cluster 6 and Mauve 9 dominate Memberships revenue
  • Grey Cluster 14 contributes to Merchandise revenue

Pretty much all of the ‘Non-UK’ supporters have been allocated to Light Blue Cluster 1.

Interestingly, gender (M/F) and age (Kids, Adults, Seniors) don’t seem to discriminate much between clusters. See the exhibit below, which plots ‘Maleness’ on the Y axis against ‘Age’ on the X axis.

[Figure: clusters plotted by ‘Maleness’ (Y axis) against age (X axis)]

Cluster 4 is ‘Old Men’ and Cluster 14 is ‘Ladies of a Certain Age’, but the majority of clusters (circle diameter proportional to size, i.e. number of customers) aren’t really discriminated by these dimensions or features. We concluded that kids’ behaviour ‘followed’ or emulated adults’ in terms of the key features for attendance and membership.
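A plot like the one above is straightforward to reproduce; a minimal matplotlib sketch, with made-up cluster summaries standing in for the real ones:

```python
import matplotlib.pyplot as plt

# Hypothetical per-cluster summaries: id -> (mean age, share male, size)
clusters = {4: (61, 0.88, 9000), 14: (54, 0.22, 7000), 1: (35, 0.60, 30000)}

for cid, (age, maleness, size) in clusters.items():
    plt.scatter(age, maleness, s=size / 50, alpha=0.5)  # area ~ cluster size
    plt.annotate(str(cid), (age, maleness))

plt.xlabel("Mean age")
plt.ylabel("'Maleness' (share male)")
plt.show()
```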

The next section looks at a supervised approach, using a decision-tree classification algorithm to identify the ‘propensity’ of members or subscribers to renew or churn, and conversely of non-members to ‘convert’ to become a member, based on similarity to previous retention or acquisition events.

Our work in this area began tentatively in 2010, using an outside consultant (hello Knut!) from a large software vendor on a single membership churn project at a large North London football club. Over the past 5 years we’ve taken the approach ‘in house’, ‘democratised’ it in a certain way, and applied the techniques over and over to different clubs (hello Emanuela!). We’ve tried to make this as efficient as possible by engineering a common feature set across all clubs and seasons, based on a common data model.

For the retention model, we’ve continued to build and train a model for each club AND for each season, as we saw greater predictive accuracy that way over time, helped by including features that encapsulated each customer’s ‘history’ up to that season as fully as possible.

For the acquisition model, we have modified the approach slightly, using a single input of all acquisition events regardless of season, but still only one club at a time. This was based on the belief, or the observation, that people became members for roughly the same reasons regardless of season, whilst people ‘churned’ from member to non-member based more on season-to-season performance and issues.

Decision trees are often cited as being at the more ‘open’ or ‘transparent’ end of the scale of classification techniques. Even so, we’ve succeeded in operationalising the retention model into club sales and marketing systems and programmes only by using the features or variables that ‘float to the top’ of the decision tree to construct a ‘risk factor’ matrix, based on observed ‘real’ behaviour and changes in it over time for each customer.

Here’s an example of feature or variable correlation for the Season Ticket Holder (STH) acquisition model:

[Figure: variable correlations for the STH acquisition model]
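A hedged sketch of that ‘float to the top’ step – fit a tree, rank its features, and treat the leaders as candidate risk factors; the file and column names are illustrative only:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("sth_acquisition.csv")   # hypothetical training extract
features = ["half_season_ticket_prev", "household_hst_holders",
            "match_tickets_prev", "years_on_db", "email_opens"]
X, y = df[features], df["acquired_sth"]

tree = DecisionTreeClassifier(max_depth=5).fit(X, y)

# The features that 'float to the top' become candidate risk factors
importance = (pd.Series(tree.feature_importances_, index=features)
                .sort_values(ascending=False))
print(importance)

def flag_factors(customer_row, n=3):
    """A customer's values on the n most important features - the raw
    material for a 'risk factor' matrix."""
    return {f: customer_row[f] for f in importance.head(n).index}
```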

What was particularly interesting here was the importance of holding a ‘Half Season Ticket’ in the previous season, and then the ‘groupings’ represented by the other Half Season Ticket holders living with the same supporter. This points very clearly to the inter-relationships that we ‘know’ are important between individual supporters, and leads us towards a more graph-based analytical approach to identify and analyse the relationships at play at specific points in the customer life-cycle, life-stage and buying relationship with the club.

Our industry or sector is still dominated by the ‘Season Ticket’, a ‘hero product’ that continues, like fine wine or an ageing Hollywood A-lister, to defy the years and live on to snaffle the majority of our clients’ time, attention and share of revenue. The more we can do to understand the ‘patterns’ behind this, the better.

Applying AI – A More Detailed Look at Healthcare Diagnostics

This is intended to be a slightly more detailed look at a single vertical or domain – healthcare, and within it the single area of diagnosis support – and at how ‘new AI’ in different guises is being applied, by whom, in what way, and to what end.

It’s fair to say that the potential for a ‘marriage’ or liaison between healthcare and AI is no secret. Healthcare is large, complex, and deeply encased in centuries of knowledge and empirical reasoning. The network of relationships between physicians, institutions, patients, treatments and outcomes is a barely understood global resource of great potential value. The whole is too large for any single human to encompass. Inefficiencies or discrepancies in diagnosis and outcomes are inevitable. The potential for new technologies to help disrupt and reshape the healthcare market and the diagnosis process is clearly understood. VCs are also clearly interested in the outcome, whether commercially or philanthropically; a good example is Vinod Khosla and the ventures his firm represents, from Lumiata (see below) to Ginger.io and CrowdMed.

Healthcare is of course a universe in itself. Diagnosis, prescription, monitoring, intervention – each area or subsection has its own challenges, contexts and actors. The potential for ‘universal’ and non-invasive monitoring or sampling tools and applications is enormous in itself. The forecast explosion of consumer data-creating devices and applications is going to create a ‘stream processing’ challenge orders of magnitude beyond what exists currently. As stated, I’m going to concentrate here on diagnosis, and on software rather than hardware, in the form of ‘expert systems’ to support or guide human decision making.

One place to start is with the ‘who’ rather than the ‘what’ or the ‘how’.

The IBM Watson ‘cognitive computing’ project has a valid claim to be an early starter, and also to be at the forefront of many people’s minds, given the heritage of the ‘Deep Blue’ project, the 2011 ‘Jeopardy’ demonstration and the subsequent publicity. Back in the real world, Watson is now applied as a ‘Discovery Advisor’ solution in different domains, including healthcare for clinical trial selection and pharmaceutical drug development. It’s an approach that is both ambitious and intensive, involving many years of R&D and the associated costs, plus partnerships over as many years with leading physicians and institutions, including cancer and genomic research, for ‘training’. Outside healthcare, the ‘question and answer’ approach is merged with other IBM product lines for business analytics and knowledge discovery. The recent acquisition of AlchemyAPI – a younger, nimbler technology, ‘outside focussed’ by its very nature – should integrate well into the Bluemix platform. The example below, from IBM development evangelist Andrew Trice, adds a voice UI to a healthcare Q&A application.

Whilst I admire the ambition (whether commercially driven or not) and the underlying scale of Watson, I question whether the ‘blinkenlights’ aura generated by the humming blue appliance, linked to a ‘solutions and partner ecosystem’ notorious for tripling (or more) any proposed budget, will lead to true democratisation. I feel the same unnatural commingling of awe and fear in response to the Cray appliance use cases. Amazing, awesome, and yet also extremely expensive. I guess the alternative – commodity hardware run at scale, using a suitably clever network engineering process to distribute computation and process results – doesn’t come cheap either.

I also understand, and concur with, the need for ‘real stories’ that publicise and demonstrate an application in a way the ‘average Joe’ can understand. (My personal favourite is the Google DeepMind Atari simulation – more elsewhere on this.) Some attempts, however well intentioned, simply don’t work, at least in my opinion. The Watson-as-Chef food truck for ‘try me’ events makes me think ‘wow, desperate’ rather than ‘wow, cool’.

The fast-paced improvement and application of ‘Deep Learning’ neural networks in image classification has opened up a new opportunity in medical image analysis. Some ‘general purpose Deep Learning as a Service or Appliance’ companies, such as ErsatzLabs, offer their tools as a service and include healthcare diagnosis use cases in their portfolios.

Enlitic.com proposes a more ‘holistic’ approach, combining Deep Learning for imaging with, intriguingly, NLP and semantic approaches for healthcare diagnostics.

Lumiata’s approach appears more graph-driven: ingesting text and structured data from multiple ‘sources’ – insurance claims, health records and medical literature – creating an analytics framework for assessing or predicting patient ‘risk’, and exposing this as a service for other healthcare apps.

It’s also worth mentioning Google, a potential giant in any domain where the desire exists, who have already made a move into health, leveraging their dominance in search and their status as ‘first port of call on the internet’ to provide curated health content, including suitably gnostic pronouncements on search algorithm ‘tweaking’ to support this curated health service.

In terms of diagnosis and treatment, the ‘data types’ currently being referenced are essentially images, text or documents (including test results), and relationships. The technical approaches applied map closely to these: Deep Learning nets for classifying images; NLP/XML for semantics, ontology and meaning in unstructured documents and text; and graph analytics at scale for the complexity of the ‘web’ of patient-doctor-diagnosis-disease-treatment-outcomes.

Two of the companies discussed here – IBM Watson and Cray – have a heritage in high-end (read: expensive) appliance or supercomputer architectures for running memory- and processing-intensive real-time analytics at scale, plus the expensive hordes of suited consultants to implement, deploy and manage these solutions over time. The other, newer, smaller ventures show mixed approaches and, although it’s early days for any publicly available data, I would assume a more ‘flexible’ commercial basis.

So what’s the big story? Stepping back and looking down, healthcare and the data it consists of seem to me a big ‘brain’ of information, constructed from different formats and substances but linked together in complex relationships and patterns, hidden or obfuscated by barriers of format, location and access. This is traditionally referred to as the ‘real world’, whether it’s healthcare or the enterprise.

The goals or objectives can be simply phrased – improving and optimising patient outcomes, and placing the patient at the centre.

The adversarial paradigm of ‘bad AI’ – rapidly evolving software systems ‘competing’ against physicians, winner-takes-all, for the right to diagnose and treat patients – is of course naive. And yet the healthcare industry is clearly labelled and targeted for ‘disruption’ in the coming decades, in terms of who does, and who is responsible for, what. Whatever this ends up looking like, we can be sure it will be radically different from the way it is now.

It’s a big, big challenge. No one venture – even at the scale of Google or IBM – is going to do this by itself. It’s going to rely on a host of smaller ventures too, but ones with inversely large ambitions.

Deep Learning: Machine Learning becomes ‘Intelligent’, Artificially or Otherwise?

‘Deep Learning’ is the new big thing. ‘Old-fashioned’ academic AI from the second half of the C20th has passed through recent decades of a ‘narrower’ focus on Data Mining, Knowledge Discovery and Machine Learning to blossom into what is now touted as a new, vibrant age.

There is too much here to comment on in anything like a sane manner, and this is intended to be a very brief summary of my own ‘non-expert’ position, so I’ll be very brief with a few examples:

  1. The internet and technology giants are splashing cash on ‘Artificial Intelligence’ or ‘Machine Intelligence’: see Yahoo, IBM, Microsoft, Google and Facebook. That much money can’t be wrong, can it?
  2. Where the giants have trod, the VC world has followed. See this single VC post for the ‘Machine Intelligence’ landscape here, from the end of 2014. They can’t all be wrong either, surely?
  3. Governments have ‘quietly’ been doing their own thing for security, military and logistical purposes, regardless of public involvement or awareness. See the FBI NGI here, or anything that DARPA funds.

What’s also interesting is the emergence of ‘old school’ academics among the individuals leading this ‘new’ (read ‘old’!) era. This is, I believe, down to the fact that the skills and knowledge required to be a ‘master’ in this area are intensely academic and rare in themselves, and that the ‘Deep Learning’ technologies currently being worked on are a continuing evolution from the early neural nets through to the convolutional or ‘learning’ approaches that represent the ‘state of the art’ today. See first Hinton, LeCun and Ng, three eminent examples who have been ‘appropriated’ or ‘acquired’ or ‘assimilated’ by commercial operations. Ng’s website is the ‘outlier’; the others are gloriously and happily ‘old school’. Demis Hassabis’s journey to Google is slightly different – not a mainstream academic at all, but a games developer – though with his skills he could have been. Other leading research figures of the last decade or so include Bengio, Bottou and Ciresan. See the footnotes of published research, or NIPS, for further lists of individuals.

As a one-time academic of sorts, I find it fascinating, and will continue to, to see how this ‘cohort’ of researchers reacts to being ‘suddenly’ thrust into the spotlight of a broader and more commercial world.

Moving on from this aside, it is worth pointing out that it is a new, vibrant age. The reinforcement loops at play, particularly now with the internet/webscale technology giants and their real-world need for ‘intelligent applications’ at speed and at scale, have shifted the landscape, and a few paradigms with it.

One other clear outcome in recent years has been the inflation of ‘Data Scientist’ wages, perhaps long overdue, and the related ‘talent search’ or ‘fill a room with Machine Learning PhDs and wait for acquisition’ approach to company formation. Machine Learning or Data Science qualifications, and the courses offered to support them by existing academic institutions or online at Coursera, are, I would imagine, highly prized.

I’m going to write separately on particular areas or approaches of interest.