“Data-Driven Healthcare”: Pt II

Data-Driven Healthcare: Introduction

I wrote up some initial thoughts on Healthcare ‘Pt I’ a few weeks ago. This ‘Pt II’ is a follow-up, looking in more detail at some of the approaches and algorithms being applied, to examine what we might mean by ‘Data-Driven Healthcare’.

It’s worth restating some of the fundamentals here:

  1. Healthcare is a massive field, and the data related to it is increasing by the second. This is necessarily an extremely selective overview
  2. Healthcare on the inside (B2B) – the systems, workflows and business processes involved – is not yet ‘Data Driven’, and neither is the outside (B2C) in terms of an overall ‘customer’ or patient experience
  3. A lot of the focus for data-driven healthcare is on optimising outcomes and reducing risk by identifying, diagnosing or intervening earlier and more accurately. This also acts to reduce costs – financial and organisational, for healthcare providers or insurers, as well as personal, for the patients themselves
  4. Commercially, Data-Driven Healthcare is seen as a ‘promised land’ for those interested in disruption, and a Breughel-esque landscape of terror for those who are not. For the USA alone, trillions of $$$ are at stake. See McKinsey on market size here and on disruption here, and PWC on data-driven decision making in health here. In the USA the specific changes in the structure of the healthcare insurance market represented by the ACA have added a further ‘twist’. Here is a sample exhibit from one of the PWC papers on data and analytics in executive decision making:

PWCHealthcareDataDriven2014 v3

One thing that most observers agree on is that patience, tenacity and targeting are going to be required. There is unlikely to be a ‘killer app’ or a ‘single behemoth’ that floats to the top. Here’s a representative quote from VentureBeat in May 2015:

“There may be no killer apps in digital health. There may be no Uber for health care. The good investments in the space are startups that are willing to go neck-deep in the complexities of the business, the regulations, the clinical aspects, the old school workflows, and other minutia of the health care business. They fill a niche, erase a friction point, and then hang in for the long haul. Jack Young, a partner at dRx Capital, nails it with this quote: “It’s not about disruption, it’s about implementation.””

So, breaking this down into a narrative structure is also a ‘work in progress’. Some of the key themes include:

Instrumentation – or the ‘generation’ of data. This is split between ‘business’ or B2B users – clinical or hospital-based data instrumentation – and ‘consumer’ users – ranging from web apps, sensors and wearables to user-generated content on the web.

Relationships – this is the ‘big Graph’ that healthcare represents – relationships, over time, between any of the actors or participants in the process. This also includes some notes on organisational ‘culture’, including the policy and legislation landscapes for the USA and UK. For the USA see the FDA website on Science and Research guidelines here and on compliance.

Applications – not strictly in a software sense, but more towards clear ‘requirement- or needs-driven’ cases that are leading the way. This ranges from Genomic sampling to Diagnosis.

Players – again, it’s often easiest to try and make sense of what is going on by looking at who is doing what, from the ‘giants’ such as Google, Microsoft or IBM to the younger, more narrowly-targeted start-ups around the globe, including some of the most influential individuals involved, for the ‘human’ side.

Instrumentation from the Inside

Much of the focus currently is on future data from patients themselves that is yet to be instrumented, primarily through ‘wearables’ of some form, either as devices or applications. Before we get there, though, there is the question of the data that is already ‘instrumented’ in many senses and is waiting for analysis. The trouble is, it’s hidden away – in silos, in systems, in locker-rooms and across multiple locations. IBM Watson Health has a few years of foothold here:


Wearables: Instrumentation from the Outside

Wearables are now mainstream. These vary from the ‘multi-functional’ or decorative that provide a home for selected health-related applications or data such as the Apple Watch, to ‘dedicated’ health specific devices…


The data streams that will be instrumented from this and similar platforms make me think also of the Hugh Laurie ‘House’ character’s predisposition to mistrust anything a patient says – the ‘Patients are liars’ approach to medical history and diagnosis.

Already we’ve seen partnerships evolve between parties, both large and small, aligning those who can provide access to the data stream (the wearable / application providers) with the data consumers. Note also that in the USA the FTC has already laid down the law in 2015 on apps claiming to diagnose melanoma based on smartphone pictures.

User Generated Content

One distinct area for Public Health uses ‘trends’ inferred from unstructured ‘user-generated content’ – search activity or social media – for ‘early warning’ detection and/or impact or prevalence estimation for epidemics and communicable or infectious diseases.

The search giants – Google and Microsoft – have led the way here for obvious reasons. Examples include Google Flu Trends in production since 2008 (see the ‘comic book’ on how this then became Google Correlate and then Google Trends here), recently expanded also to include Dengue Trends, and then also Microsoft Research studies on Ebola and on the effectiveness of a health intervention campaign on Influenza (the ‘flu’) in the UK here.
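At its simplest, the ‘nowcasting’ idea behind these trend systems is a regression of query volumes against reported incidence. A minimal sketch, with entirely made-up numbers and ordinary least squares standing in for the real (much richer) models:

```python
# Minimal sketch (hypothetical data): estimate flu incidence from search-query
# volume by fitting a least-squares line to historical reported cases.
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

# Weekly query volume for flu-related terms vs reported ILI cases (made up).
queries = [120, 150, 200, 260, 310]
cases   = [240, 300, 400, 520, 620]   # exactly 2x here, for illustration

a, b = fit_line(queries, cases)
estimate = a * 180 + b   # 'nowcast' for a week with 180 queries
```

The real systems fit against official surveillance data (such as the CDC series mentioned below) and have to cope with media-driven query spikes, which is exactly where the 2013 over-reporting came from.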


The Google Flu Trends work has been ongoing for a number of years and, like many ‘early entrants’, suffered some unwanted and perhaps unfair adverse publicity due to significant over-reporting of ILI in 2013. Using the ‘hype cycle’ as a guide, it had firmly slipped down to the trough of disillusionment. Overall, ‘real’ organisations have responded favourably to try and help improve the approach, including incorporating more time-series or ‘real-time’ data from public reporting systems, and GFT underwent a major ‘upgrade‘ in 2014 to include the ‘real’ data published by the CDC for the USA.

Microsoft have also branched out to another form of ‘early warning’ with their ‘Project Premonition’, combining a mosquito-trapping, drone-based ‘AWACS’ system for capturing and genetically sampling mosquito populations, identifying pathogenic or malarial mosquitoes, and modelling the data in the cloud here.

(An interesting ‘on the ground’ comparison is Metabiota, offering ‘data-driven’ epidemiological and disease detection services from the bottom up. The Data Collective VC website has a set of other, interesting ‘Health’ or ‘BioEngineering’ startups as well).

Drug and Genome Research

Where to start? I’m not going to rehash or repeat what would be better studied elsewhere – the USA National Human Genome Research Institute or the UK Sanger Institute at Cambridge.

One ‘novel’ approach to the instrumentation of large data sets for study has been the Icelandic Genome Project, now deCODE, acquired by Amgen in 2013 and officially concentrating on the identification of key genetic risk factors for disease.

Large-scale drug discovery spanning pharmaceuticals and bio-engineering is a universe in itself. The time-to-market and R&D costs associated with drug discovery are so vast that the companies involved are by definition huge, but they in turn support a large ecosystem of smaller operations.

The ‘Medical Graph’ and Entity Relationships

The ‘Medical Graph’ is broad, deep and complex. There are many different graphs, and graphs-within-graphs….

  1. Inter-relatedness of medical entities such as treatments, devices, drugs and procedures. See the paper “Building the graph of medicine from millions of clinical narratives” from Nature in 2014, analysing co-occurrence in 20 million EHR records spanning 19 years from a California hospital. Here is a nice graphic on the workflow involved: Nature2014MedicalGraph
  2. ‘Big Data’ and ‘Graph Analytics’ on a Medical dataset running on dedicated appliances. For one example, see this presentation from 2012 on converting Mayo Clinic data from SQL to SPARQL/RDF for querying
  3. Three years is a long time. The software and math as well as the ‘High Performance Computing’ required to scale for this has advanced massively. There is the hugely impressive work of Jeremy Kepner of MIT on D4M, for example here on Bio-Sequencing cross-correlation, as well as the platform providers from Cray to MemSQL…
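The core of the co-occurrence idea behind that EHR graph can be sketched in a few lines – here with a toy set of per-record term sets standing in for the 20 million clinical narratives:

```python
# Sketch (toy data): build a co-occurrence graph of medical terms from
# per-record term sets, a much-simplified version of the EHR-graph approach.
from collections import Counter
from itertools import combinations

records = [
    {"metformin", "diabetes", "hba1c"},
    {"metformin", "diabetes"},
    {"statin", "cholesterol"},
    {"diabetes", "statin"},
]

edges = Counter()
for terms in records:
    # Each unordered pair of terms appearing in the same record adds an edge.
    for a, b in combinations(sorted(terms), 2):
        edges[(a, b)] += 1

# Edge weight = number of records in which both terms co-occur.
print(edges[("diabetes", "metformin")])  # 2
```

The published work then weights these raw counts statistically (to separate meaningful association from chance), but the graph structure itself is just this.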

Players, People, Partnerships

This is very much a work in progress. I’m going to do this as a simple list for now, and attempt to put together a version of a market overview picture at the end.

Major systems vendors’ healthcare divisions and large Pharma / Drug Co’s:

GE Healthcare

Siemens Healthcare

Philips Healthcare

Novartis for modelling and simulation in Drug Development

Roche research

Technology, Applications:

IBM Watson Health. See also relationship to Apple ‘application’ data streams reported in April 2015. IBM is a massive operation, and contains possibly ‘conflicting’ solutions – see this post from 2015 here that is basically selling hardware again.

SAS as an enterprise technology vendor with a ‘health’ vertical. See articles on sponsored AllAnalytics.com and examples on corporate site here.

Crowdmed launched in 2013 as a platform for ‘crowd-sourcing’ disease diagnosis.

Counsyl and 23andme for self-administered DNA sampling and screening; see also the FDA concern reported in Forbes in 2013 here and here.

Google by their very nature are a major player, operating across a wide range of research and application areas or domains. They provide algorithm design and research – see ML for Drug Discovery in 2015 – and also the infrastructure to support an ‘open’ basis for research by others – see their Cloud Platform for Genomics Research in 2014.

Anything to do with Google X and ‘futurist’ approaches is going to make headlines. See the partnership between Novartis and Google X for glucose-measuring contact lenses publicised in 2014, and the Google wristband as a data-instrumenting wearable for health data. Google Ventures is also active in the field – see their involvement in Calico Labs research into ageing.

Apple’s ResearchKit for medical and health data applications.

Brendan Frey and his team at the University of Toronto (Hinton’s legacy again) for building an algorithm to analyse DNA inputs and gene splicing outputs.

Vijay Pande and the Pande Labs at Stanford for Markov State Modelling for biophysics and biophysical chemistry.

Sage BioNetworks and Dr Stephen Friend for research in linkages between genetics and drug discovery.

Medidata as a ‘clinical cloud’ for data-driven research trials.

Microsoft Research and Health, including Dr Eric Horvitz, quoted here in Wired in 2014: “Electronic health records [are] like large quarries where there’s lots of gold, and we’re just beginning to mine them”. Here is a TV appearance on Data and Medical Research from 2013.

It’s always nice to appear to be ‘cutting edge’. The Microsoft Research Faculty Summit 2015, held yesterday and today (9th July 2015), includes a number of sessions that, if not directly mentioning healthcare, would be fascinating in terms of their general applicability – ‘Integrative AI’, ‘Semantics and Knowledge Bases’, ‘Programming Models for Probabilistic Reasoning’ and more. It would be lovely to be there; I’ll catch the stream later.

NCI and ABCC (Advanced Biomedical Computing Centre) under Jack Collins in Maryland.

The HPC world of pMATLAB and D4M of Jeremy Kepner and John Gilbert.

LifeImage for medical image sharing.

Phew. I’m going to conclude now with a brief note on some of the partnerships or alliances, whether strategic or acquisition-related, already emerging.

The partnership or alliances that stick out for me right now are those that help span the B2B and B2C divide – where a business or enterprise focused outfit with existing traction and experience in the healthcare market partners with a consumer focused outfit which has reach, technology and infrastructure… kind of like ‘Business’ server – ‘Consumer’ client architecture.

Due to reporting visibility and transparency these are kind of ‘obvious’ examples:

  • IBM Watson Health – Apple for app data sharing and processing. There seems to be a synergy here, but I’m not yet clear on exactly how this will be delivered.
  • Novartis – Google X for the Contact Lens Glucose / Diabetes monitoring. This is a more ‘research’ focussed partnership – where the drug and manufacturing skills of one are complemented by the data and consumer knowledge of the other.
  • Broad Institute – Google Genomics, more hybrid in that it’s a marriage of Google computational power and analytics infrastructure with deeper genomic analysis tools from Massachusetts.

Underneath the radar are more industry-specific partnerships between organisations with an interest in ‘benevolent disruption’ in terms of improving efficiencies and outcomes – insurers and healthcare providers – and application or solution providers that can help them do this, bit by bit if necessary.


Machine Intelligence at Speed: Some Technical or Platform Notes


This post looks at some of the underlying technologies, tools, platforms and architectures that are now enabling ‘Machine Intelligence at Speed’. Speed as a concept is closely related to both Scale and Scalability. For my convenience, and to try and organise things, by this I mean applications that:

  1. Are built on or involve ‘Big Data’ architecture, tools and technologies
  2. Utilise a stream or event processing design pattern for real-time ‘complex’ event processing
  3. Involve an ‘In-Memory Computing’ component to be quick and also to help scale predictably at speed
  4. Also support or embed ‘Machine Learning’ or ‘Machine Intelligence’ to help detect or infer patterns in ‘real time’

People in the Bay Area reading the above might well shout ‘AMPLabs! Spark!’, which is pretty much where I’ll finish!

Hype Cycle and the ‘New Big Thing(s)’

Here is the familiar Gartner Tech Hype Cycle curve for 2014. In it you can see ‘Big Data’, ‘Complex Event Processing’ and ‘In-Memory DBMS’ chugging their sad way down the ‘Trough of Disillusionment’, whilst ‘NLP’ is still merrily peaking. ‘Deep Learning’ in terms of Deep Neural Nets doesn’t seem to my eye to have made it in time for last year.


It’s a minor and unjustifiable quibble at Gartner, who have to cover an awful lot of ground in one place, but the semantic equivalence of many of the ‘tech’ terms here is questionable, and the shape and inflexion points of the curves in the cycle, as well as the time to reach plateau, may differ.

What this demonstrates is that the cyclicity it represents is well founded in new company and new technology ‘journeys’, and often in how these companies are funded, traded and acquired by VCs and by each other. What I’m also interested in here is how a number of these ‘separate’ technology entities or areas combine and are relevant to Machine Intelligence or Learning at Speed.

Big Data Architectural Models

(Important proviso – I am not another ‘self-professed next ****ing Google architect’, or even a ‘real’ technologist. See the ‘in/famous’ YouTube skit ‘MongoDB is webscale‘ from Garret Smith in 2010, approx 3 mins in, for a warning on this. I almost fell off my chair laughing etc etc. I work in a company where we do a lot of SQL and not much else. I also don’t code. But I’m entitled to my opinion, and I’ll try to back it up!)

Proviso aside, I quite enjoy ‘architecture’, as an observer mainly, trying to see how and why different design approaches evolve, which ones work better than others, and how everything in the pot works with everything else.

Here are two brief examples – MapR’s ‘Zeta‘ architecture and Nathan Marz’s ‘Lambda‘ architecture. I’ll start with Marz, as his is deceptively ‘simple’ in its approach, with 3 layers – speed, batch and serving. Marz worked on the initial BackType / Twitter engine and ‘wrote the book’ for Manning, so I’m inclined to treat him as an ‘expert’.
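Even as a non-coder, I find the three Lambda layers easiest to grasp as a tiny sketch: a precomputed batch view, a real-time speed view of events since the last batch run, and a serving layer that merges the two at query time (hypothetical page-view counts):

```python
# Toy sketch of the Lambda pattern: batch view + speed view, merged on query.
batch_view = {"page_a": 100, "page_b": 40}     # precomputed, e.g. nightly
speed_view = {"page_a": 3, "page_c": 7}        # events since last batch run

def query(key):
    # Serving layer: merge the (possibly stale) batch view with recent events.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(query("page_a"))  # 103
```

The point of the pattern is that the batch layer can be slow and exhaustive while the speed layer stays small and fast, and neither needs to know about the other.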


Marz’s book obviously goes into much more detail, but the simplicity of the diagram above pervades his approach. MapR’s ‘Zeta’ architecture applied to Google is here: MapRZetaGoogleExample

I know next to nothing about what Google actually does on the inside, but I’ll trust that Jim Scott from MapR does, or he wouldn’t put this out in public, would he?

What this is telling me is that the ‘redesign’ of Enterprise Architecture by the web giants – and what is now the ‘Big Data’ ecosystem – is here to stay, and is being ‘democratised’ via the IaaS / PaaS providers, including Google themselves, via Cloud access available to anyone, at a price per instance or unit per second, hour, day or month.

There are then the ‘new’ companies like MapR that will deliver this new architecture to Enterprises that may not want to go to the Cloud for legal or strategic reasons. Set against this are the ‘traditional’ technology Enterprise vendors – Oracle, IBM, SAS – which I’ll return to elsewhere, for reasons of brevity as well as the limits of my knowledge.

Big Data has evolved rapidly from something that 5 years ago was the exclusive preserve of the Web Giants to a set of tools that any company or enterprise can utilise now. Rather than a BYO proposition, ‘Big Data’ tool-kits and solutions are available on a service or rental model from a variety of vendors in the Infrastructure- or Platform-as-a-Service space, from ‘specialists’ such as Hortonworks, MapR or Cloudera, to the ‘generic’ IaaS cloud platforms such as AWS, Azure or Google.

As well as this democratisation, one of the chief changes in character has been from ‘batch’ to ‘non-batch’ in terms of architecture, latency and the applications this can then solve or support. ‘Big Data’ must also be ‘Fast Data’ now, which leads straight into Stream or Event processing frameworks.

Stream Processing

Other developments focus on making this faster, primarily on Spark and related stream or event processing. Even as a non-developer, I particularly like the Manning.com book series, for instance Nathan Marz’s ‘Big Data‘, Andrew Psaltis’s ‘Streaming Data‘ and Marko Bonaci’s ‘Spark in Action‘, and I also appreciated talking with Rene Houkstra at Tibco regarding their own StreamBase CEP product.

In technical terms this is well illustrated in the evolution from a batch data store and analytics process based on Hadoop HDFS / MapReduce / Hive towards stream or event processing based on more ‘molecular’ and ‘real-time’ architectures using frameworks and tools such as Spark / Storm / Kafka / MemSQL / Redis and so on. The Web PaaS giants have developed their own ‘flavours’ as part of their bigger Cloud services, based on internal tools or products – for example Amazon Kinesis and Google Cloud Dataflow.

As in many ‘big things’ there is an important evolution to bear in mind, and a question of how different vendors and tools fit into it. For example, at Sports Alliance we’ve just partnered with Tibco for their ‘entry’ SOA / ESB product BusinessWorks. I’ve discussed the Event Processing product with Tibco, but only for later reference or future layering on top. This product has an evolution inside Tibco of over a decade – ‘Event’ or ‘Stream’ processing was not necessarily invented in 2010 by Yahoo! or Google, and the enterprise software giants have been working in this area for a decade or more, driven primarily by industrial operations and financial services. Tibco use a set of terms including ‘Complex Event Processing’ and ‘Business Optimization’, which work on the basis of an underlying event stream sourced from disparate SOA systems via the ESB, and an In-Memory ‘Rules Engine’, where the state-machine or ‘what-if’ rules for pattern recognition may be Analyst-defined (an important exception to the ‘Machine Learning’ paradigm below) and applied within the ‘Event Cloud’ via a correlation or relationship engine.

The example below is for an ‘Airline Disruption Management’ system, applying Analyst-defined rules over a 20,000-events-per-second ‘cloud’ populated by the underlying SOA systems. Whether it’s a human-identified pattern or not, I’m still reassured that the Enterprise Software market can do this sort of thing in real time, in the ‘real world’.
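A toy version of such an analyst-defined rule – hypothetical thresholds and event shape, not Tibco’s actual product or API – might look like this:

```python
# Hedged sketch of an analyst-defined CEP rule: flag any flight whose
# delay reports sum past a threshold within a sliding window of events.
from collections import deque

WINDOW = 3          # last N delay reports per flight (illustrative)
THRESHOLD = 45      # minutes (illustrative)

recent = {}         # flight -> deque of recent delay reports

def on_event(flight, delay_minutes):
    q = recent.setdefault(flight, deque(maxlen=WINDOW))
    q.append(delay_minutes)
    if sum(q) > THRESHOLD:
        return f"ALERT {flight}"
    return None

on_event("BA123", 10)
on_event("BA123", 20)
print(on_event("BA123", 20))  # ALERT BA123  (10 + 20 + 20 = 50 > 45)
```

The real systems hold millions of such windows in memory and correlate across streams, but the ‘rule over a sliding window of state’ shape is the same.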


The enterprise market for this is summarised as ‘perishable insights’ and is well evaluated by Mike Gualtieri at Forrester – see his “The Forrester Wave™: Big Data Streaming Analytics Platforms, Q3 2014“. Apart from the Enterprise software vendors such as IBM, I’ll link very briefly to DataTorrent as an example of a hybrid batch / tuple model, with Google’s MillWheel also apparently something similar(?).

In-Memory Computing

Supporting this scale at speed also means In-Memory Computing. I don’t personally know a lot about this, so this is the briefest of brief mentions. See for example the list of contributors at the In-Memory Computing Summit in SF in June this year here. Reading through the ‘case studies’ of the vendors is enough to show the ‘real world’ applications that work in this way. It also touches on some of the wider debates such as ‘scale-up’ v ‘scale-out’, and what larger hardware or infrastructure companies such as Intel and Pivotal are doing.

Machine Learning at Speed: Berkeley BDAS and Spark!

So we’re back to where we started. One of the main issues with ‘Machine Learning’ at either scale or speed, in many guises, is scalability of algorithms and non-linearity of performance, particularly over clustered or distributed systems. I’ve worked alongside statisticians working in R on a laptop, and we’ve had to follow rules to sample, limit, condense and compress in order not to overload or time out.
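One of those ‘rules’ – capping the number of rows an in-memory model ever sees – can be done with reservoir sampling, which keeps a uniform fixed-size sample however large the source is. A minimal sketch:

```python
# Reservoir sampling: a uniform sample of k items from a stream of unknown
# (or very large) length, using O(k) memory. One of the 'sample and limit'
# tricks for keeping an in-memory model from overloading.
import random

def reservoir_sample(stream, k, seed=42):
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)           # fill the reservoir first
        else:
            j = rng.randint(0, i)         # then replace with decaying odds
            if j < k:
                sample[j] = item
    return sample

rows = range(1_000_000)                   # stand-in for a huge table
subset = reservoir_sample(rows, 1000)
print(len(subset))  # 1000
```

The nice property is that every row has the same 1000-in-a-million chance of surviving, so the subset is an unbiased basis for the model.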

In the Enterprise world one answer to this has been to ‘reverse engineer’ and productise accordingly, with the investment required to keep this proprietary and closely aligned with complementary products in your portfolio. I’m thinking mainly of Tibco and their Spotfire / TERR products, which I understand to be ‘Enterprise-speed’ R.

Another approach is to compare the evolution of competing solutions within the Apache ecosystem. Mahout was initially known to be ‘slow’ to scale – see for instance an earlier post from 2012 by Ted Dunning on the potential for scaling a k-nn clustering algorithm inside the MapR Mahout implementation. Scrolling forward a few years to now, this looks to be competitive territory between separate branded vendors ‘pushing’ their version of speed at scale. I couldn’t help noticing this as a Spark MLlib v Mahout bout in a talk from Xiangru Meng of Databricks (Spark as a Service), showing not only the improvements in their MLlib 1.3 over 1.2 (yellow line v red line) but ‘poor old Mahout’ top left in blue making a bad job of scaling at all, for a ‘benchmark’ of an ALS algorithm on Amazon Reviews:


So one valid answer to ‘So how do I actually do Machine Intelligence at Speed’ seems to be ‘Spark!’, and Databricks has cornered the SaaS market for this.

The Databricks performance metrics quoted are impressive, even to a novice such as myself. The evolving ecosystem, from technologies and APIs to partners and solution providers, looks great from a distance. There are APIs, pipeline and workflow tools, and a whole set more.

Databricks is a child of AMPLabs in Berkeley. The Berkeley Data Analytics Stack BDAS provides us with another (3rd) version of ‘architecture’ for both Big Data and Machine Learning at Speed.


BDAS already has a set of ‘In-house Apps’ or projects working, which is a good sign, or at least a direction towards ‘application’. One example is the Cancer Genomics Application ADAM, providing an API and CLI for manipulation of genomic data, running underneath on Parquet and Spark.

Velox, one of the most recent initiatives, is for model management and serving within the stack. It proposes to help deliver ‘real-time’ or low-latency model interaction with the data stream it is ingesting – a form of ‘self-learning’ in the form of iterative model lifecycle management and adaptive feedback. Until recently, only the large-scale ‘Web giants’ had developed their own approaches to manage this area.

AMPLabs Velox Example 1

This is particularly exciting, as it provides a framework for testing, validation and ongoing lifecycle adjustments that should allow Machine Intelligence model implementation and deployment to adapt to changing behaviours ‘online’ and not become obsolete over time, or at least not as quickly, before they require another round of ‘offline’ training and redeployment.

AMPLabs Velox Example

The examples given (for instance, above, for a music recommender system) are relatively constrained, but they show the power of this not only to make model lifecycle management more efficient, but also to help drive the creation of applications that rely on multiple or chained models – and thus a higher degree of complexity in model lifecycle management – and on models involving radically different data types or behavioural focus, which I’m going to look at later. And all at speed!
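The flavour of this ‘online’ model maintenance can be sketched as an incremental update to a per-user bias, nudged by each observation rather than retrained offline. This is a hypothetical recommender fragment of my own, not Velox’s actual API:

```python
# Tiny sketch of online model maintenance: a per-user bias nudged toward
# each observed rating, instead of waiting for an offline retrain.
user_bias = {}
LEARNING_RATE = 0.1     # illustrative
global_mean = 3.0       # illustrative baseline rating

def predict(user):
    return global_mean + user_bias.get(user, 0.0)

def observe(user, rating):
    # Incremental (online) update toward the observed residual.
    err = rating - predict(user)
    user_bias[user] = user_bias.get(user, 0.0) + LEARNING_RATE * err

for _ in range(50):
    observe("u1", 5.0)          # u1 consistently rates high

print(round(predict("u1"), 2))  # converges close to 5.0
```

The lifecycle-management point is that the model adapts ‘live’ between full offline retrains, which is exactly the obsolescence problem described above.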

Visualising Machine Learning Algorithms: Beauty and Belief

This post looks at algorithm visualisation in two guises: first in terms of ‘how’ an algorithm does its work – looking ‘inside’ the box, if you like – and second in terms of what comes out the other end as an output or outcome. I’ve written this spurred on initially by a recent high-media-profile classification or labelling error (see below), and partly to get something out on visualisation, which is important to me.

Visualisation is a ‘hot’ topic in many domains or senses, and deservedly so. The ubiquity of the web, and the technologies associated with it, has brought what was previously a more ‘arcane’ discipline mixing statistics with design into the wider world. The other key concept in visualisation is that of narrative, and the concept of time or time series behind this. As a historian by training, rather than a statistician or designer, I of course like this bit too. The way in which data ‘stories’ or narratives can be constructed and displayed is a fascinating and organic process. The world of TED is relevant to any domain or discipline where data and communication or insight around it is involved.

The two words that I want to concentrate on here are beauty, and belief. They are closely related.

Visualisation is an important communication tool, and can often be ‘beautiful’ in its own right. This applies to something that helps us understand how an algorithm is doing its job, step by step or stage by stage, and also to what comes out the other side. Beauty (or elegance) and function are often aligned closely, so in this case what looks good also works well.

Visualisation is also an important component of ‘testing’ or validating a process and an output, either in terms of the development process and what is working in what way when or how, or in getting a client or partner who is meant to use the output in an application to buy in to or accept what is going on behind the scenes. So we have to Believe in it too.


I like a pretty picture. Who doesn’t? And ones that move or you can interact with are even better. I’ve read some, but by no means all, of the ‘textbooks’ on data visualisation, from ‘classics’ like Tufte to the work of Steven Few. I’ve worked in my own way in visualisation applications (for me, mainly Tableau in recent years) and in conjunction with colleagues in D3 and other web technologies. Most of this has been to do with Marketing and ‘Enterprise’ data in the guise of my role in Sports Alliance. This is not the place to showcase or parade my own work, thank god. I’m going to concentrate firmly on paradigms or examples from others. This section will be quite short.

It’s easy to say, but I do love the D3 work of Mike Bostock. The examples he generates are invariably elegant, sparse and functional all at the same time. D3 works on the web, and therefore potentially for anyone at any time, and he releases the code for anyone else to use. They also really work for me in terms of the varying ‘levels’ of understanding that they allow for audiences with different levels of mathematical or programming knowledge. The example below is for a sampling approach using Poisson Discs:
BostockPoissonDiscII

This next is for a shuffle. What I like here is that the visual metaphors are clear and coherent – discs are, well, discs (or annuli), and cards and shuffling (sorting) go together – and also that the visualisation is ‘sparse’ – meaning is clearly indicated with a ‘light touch’, using colour sparingly, plus shade, shape and motion in terms of a time series or iteration steps.
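The shuffle being animated is, I believe, a Fisher–Yates-style pass: one sweep through the array, swapping each position with a uniformly chosen earlier-or-equal index. A minimal version:

```python
# Fisher-Yates shuffle: a single pass producing a uniform random permutation.
import random

def fisher_yates(items, seed=0):
    rng = random.Random(seed)
    a = list(items)
    for i in range(len(a) - 1, 0, -1):
        j = rng.randint(0, i)        # pick from the not-yet-fixed prefix
        a[i], a[j] = a[j], a[i]      # swap it into its final position
    return a

deck = fisher_yates(range(52))
print(sorted(deck) == list(range(52)))  # True: a permutation, nothing lost
```

Part of why the visualisation works is that each frame corresponds to exactly one of these swaps, so the ‘story’ of the algorithm maps one-to-one onto the motion.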


The next example is another D3 piece, by a team exploring the relationships between journal articles and citations across 25 years and 3 journals or periodicals. It’s sorted by a ‘citation’ metric, and shows clearly which articles have the most ‘influence’ in the domain.


The body of work across 3 decades represented by the scientific visualisations in the IEEE VIS events and related venues InfoVis, VAST and SciVis – which the exhibit above represents – is breathtaking. I’ve ‘stolen’ two examples below that have a strong relation to ‘Machine Learning’ or algorithm output exploration, which serves to segue or link to the next section on ‘belief’.
Viz2015Example


Both these are examples of how a visualisation of the output of an algorithm or approach can also help understand or test what the algorithm, and any associated parameters or configuration, is actually doing, and therefore whether we ‘believe’ in it or not.


In our work at Sports Alliance, we’ve struggled at times to get clients to ‘buy in’ to a classifier in action, due partly to the limitations of the software we’re using and partly to us not going the extra mile to ensure complete ‘transparency’ in what an algorithm has done to get the ‘output’. We’ve used decision trees mostly, partly because they work in our domain, and partly because of the relative communicative ease of a ‘tree’ to demonstrate and evaluate the process, regardless of whatever math or algorithm is actually behind it. What has worked best for us is tying the output of the model – a ‘score’ for an individual item (in our case a supporter churn/acquisition metric) – back to their individual ‘real world’ profile and the values of the features that the model utilises and has deemed ‘meaningful’.
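A hand-written toy tree – illustrative features and thresholds, not our actual model – shows why this communicates well: the score comes back with the human-readable rule that produced it, tying the number to the supporter’s real profile:

```python
# Toy two-level decision tree for a churn score, returning both the score
# and the rule that fired, so the output ties back to the profile values.
def churn_score(profile):
    # Hypothetical thresholds, purely for illustration.
    if profile["years_as_member"] < 2:
        if profile["tickets_last_season"] < 3:
            return 0.8, "new member, few tickets"
        return 0.4, "new member, regular attender"
    return 0.1, "long-standing member"

score, reason = churn_score({"years_as_member": 1, "tickets_last_season": 1})
print(score, reason)  # 0.8 new member, few tickets
```

Whatever algorithm actually fits the tree, it is this ‘score plus reason’ pairing that gets a client to believe the output.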

I’ve not used it in production, but I particularly like the BigML UI for decision tree evaluation and inspection. Here is an example from their public gallery for Stroke Prediction based on data from Michigan State University:


Trees and branching are an ‘easy’ way or metaphor to understand classification or sorting. Additional information on relative feature or variable correlation to the target – ‘importance’ – helps too.


The emergence of ‘Deep Neural Nets’ of varying flavours has involved a lot of these themes or issues, particularly in the area of image classification. What is the ‘Net’ actually doing inside in order to arrive at the label category? How is one version of a ‘Net’ different to another, and is this better or worse?

I like this version presented by Matthew Zeiler of Clarifai in February this year. I don’t pretend to follow exactly what this means in terms of the NN architecture, but the idea of digging into the layers of a NN and ‘seeing’ what the Net is seeing at each stage makes some sense to me.


The talk then goes on to show how they used the ‘visualisation’ to modify the architecture of the net to improve both performance and speed.

Another approach that seems to me to serve to help demystify or ‘open the box’ is the ‘generative’ approach. At my level of understanding, this involves reversing the process, something along the lines of giving a trained Net a label and asking it to generate inputs (e.g. pictures) at different layers in the Net that are linked to the label.

See the Google DeepMind DRAW paper from Feb 2015 here and a Google Research piece from June 2015 entitled ‘Inceptionism: Going Deeper into Neural Nets’ here. Both show different aspects of the generative approach. I particularly like the DRAW reference to the ‘spatial attention mechanism that mimics the foveation of the human eye’. I’m not technically qualified to understand what this means in terms of architecture, but I think I follow what the DeepMind researchers are trying to do in using ‘human’ psychological or biological approaches as paradigms to help their work progress:


Here is an example of reversing the process to generate images in the second Google Research paper.


This also raises the question of error. Errors are implicit in any classifier or ‘predictive’ process, and statisticians and engineers have worked on this area for many years. This is the time to mention the ‘recent’ high-profile labelling error from Google+. Dogs as horses is mild, but Black people as ‘Gorillas‘? I’m most definitely not laughing at Google+ for this or about this. It’s serious. It’s a clear example of how limited we can be in understanding ‘unforeseen’ errors and the contexts in which those errors will be seen and understood.

I haven’t myself worked in multi-class problems. In my inelegant way, I would imagine that there is a ‘final’ ‘if … where…’ SQL clause that can be implemented to pick up pre-defined scenarios – for example, where the classification possibilities include both ‘human’ or ‘named friend’ and ‘gorilla’, return ‘null’.
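A rough sketch of that guard-clause idea – hypothetical labels and function names, and in Python rather than SQL:

```python
# Post-classification guard: if the top-ranked label is on a blocked list
# for known-sensitive confusions, fall through to the next safe label.
BLOCKED = {"gorilla", "ape"}    # labels never to auto-apply to photos of people

def safe_label(ranked_labels):
    # ranked_labels: [(label, confidence), ...], best first.
    for label, conf in ranked_labels:
        if label not in BLOCKED:
            return label
    return None   # nothing safe to say; better silent than wrong

print(safe_label([("gorilla", 0.9), ("person", 0.85)]))  # person
```

Crude as it is, a deterministic post-filter like this at least bounds the worst-case output while the underlying model is retrained.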

The latitude for error in a domain or application of course varies massively. Data Scientists, and their previous incarnations as Statisticians or Quants, have known this for a long time. Metrics for ‘precision’, ‘recall’, risk tolerance and what a false positive or false negative actually mean will vary by application.
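For reference, the two metrics in question are simple ratios over the confusion counts:

```python
# Precision: of everything we flagged, how much was real?
# Recall: of everything real, how much did we flag?
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

p, r = precision_recall(tp=80, fp=20, fn=40)
print(p, r)  # precision 0.8; recall ~0.667
```

Which of the two an application should maximise is exactly the risk-tolerance question: a cancer screen wants recall (miss nothing), a spam filter wants precision (flag nothing wrongly).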

Testing, validating and debugging, and attitude to risk or error are critical.

A few years ago I worked on a test implementation of Apache Mahout for Product Recommendation in our business. I found the work done by Sean Owen (now at Cloudera, as Myrrix became Oryx) and by Ted Dunning and Ellen Friedman, both now at MapR, particularly useful.

Dunning’s tongue-in-cheek approach amused me as much as his obvious command of the subject matter impressed and inspired me. The ‘Dog and Pony’ show and the ‘Pink Waffles’ are great ‘anecdotal’ or ‘metaphorical’ ways to explain important messages – about testing and training and version control, as much as the inner workings of anomalous co-occurrence and matrix factorisation.


And this on procedure, training and plain good sense in algorithm development and version control.

DunningFriedmanRecommenderTraining
In our case we didn’t get to production on this. In professional sport retail, and with the data we had available, there wasn’t very much variation in basket item choices, as so much of the trade is focussed on a single product – the ‘shirt’, equivalent to the ‘everybody gets a pony’ in Dunning’s example above.