This post looks at some of the underlying technologies, tools, platforms and architectures that are now enabling ‘Machine Intelligence at Speed’. Speed as a concept is closely related to both Scale and Scalability. To organise things (and for my own convenience), by this I mean applications that:
- Are built on or involve ‘Big Data’ architecture, tools and technologies
- Utilise a stream or event processing design pattern for real-time ‘complex’ event processing
- Involve an ‘In-Memory Computing’ component to be quick and also to help scale predictably at speed
- Also support or embed ‘Machine Learning’ or ‘Machine Intelligence’ to help detect or infer patterns in ‘real time’
People in the Bay Area reading the above might well shout ‘AMPLabs! Spark!’, which is pretty much where I’ll finish!
Hype Cycle and the ‘New Big Thing(s)’
Here is the familiar Gartner Tech Hype Cycle curve for 2014. In it you can see ‘Big Data’, ‘Complex Event Processing’ and ‘In-Memory DBMS’ chugging their sad way down the ‘Trough of Disillusionment’, whilst ‘NLP’ is still merrily peaking. ‘Deep Learning’ in the sense of Deep Neural Nets doesn’t, to my eye, seem to have made it in time for last year’s curve.
It’s a minor and unjustifiable quibble at Gartner, who have to cover an awful lot of ground in one place, but the semantic equivalence of many of the ‘tech’ terms here is questionable, and the shape and inflexion points of the curves, as well as the time to reach the plateau, may differ.
What this demonstrates is that the cyclicity it represents is well founded in the ‘journeys’ of new companies and new technologies, and often in how those companies are funded, traded and acquired by VCs and by each other. What I’m also interested in here is how a number of these ‘separate’ technology entities or areas combine and become relevant to Machine Intelligence or Learning at Speed.
Big Data Architectural Models
(Important proviso – I am not another ‘self professed next ****ing Google architect’, or even a ‘real’ technologist. See the ‘in/famous’ YouTube skit ‘MongoDB is webscale‘ from Garret Smith in 2010, approx 3 mins in, for a warning on this. I almost fell off my chair laughing etc etc. I work in a company where we do a lot of SQL and not much else. I also don’t code. But I’m entitled to my opinion, and I’ll try to back it up!)
Proviso aside, I quite enjoy ‘architecture’, as an observer mainly, trying to see how and why different design approaches evolve, which ones work better than others, and how everything in the pot works with everything else.
Here are two brief examples – MapR’s ‘Zeta‘ architecture and Nathan Marz’s ‘Lambda‘ architecture. I’ll start with Marz, as his is deceptively ‘simple’ in its approach, with three layers: speed, batch and serving. Marz worked on the initial BackType / Twitter engine and ‘wrote the book’ for Manning, so I’m inclined to treat him as an ‘expert’.
Marz’s book obviously goes into much more detail, but the simplicity of the diagram above pervades his approach. MapR’s ‘Zeta’ architecture applied to Google is here:
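Since I don’t code, treat this as a toy sketch rather than anything from Marz’s book: the three layers can be caricatured in a few lines of Python, with word counts standing in for a real view (the class and method names are my own inventions).

```python
from collections import Counter

class LambdaSketch:
    """Toy Lambda architecture: batch, speed and serving layers."""
    def __init__(self):
        self.master = []              # immutable, append-only master dataset
        self.batch_view = Counter()   # recomputed wholesale by the batch layer
        self.speed_view = Counter()   # incremental view of events since the last batch run

    def ingest(self, event):
        # every new event feeds both the batch layer (via the master dataset)
        # and the speed layer (incrementally, for low latency)
        self.master.append(event)
        self.speed_view[event] += 1

    def run_batch(self):
        # batch layer: recompute the whole view from scratch (slow but simple);
        # the speed view then only needs to cover events newer than this run
        self.batch_view = Counter(self.master)
        self.speed_view.clear()

    def query(self, key):
        # serving layer: merge the batch and real-time views at query time
        return self.batch_view[key] + self.speed_view[key]
```

The point the diagram makes is visible even here: the batch layer is allowed to be slow and simple because the speed layer papers over its latency, and queries never see the seam.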
I know next to nothing about what Google actually does on the inside, but I’ll trust that Jim Scott from MapR does, or he wouldn’t put this out in public, would he?
What this is telling me is that the ‘redesign’ of Enterprise Architecture by the web giants and what is now the ‘Big Data’ ecosystem is here to stay, and is being ‘democratised’ via the IaaS / PaaS providers, including Google themselves, via Cloud access available to anyone, at a price per instance or unit per second, hour, day or month.
There are then the ‘new’ companies like MapR that will deliver this new architecture to Enterprises who may not want to go to the Cloud for legal or strategic reasons. Set against this are the ‘traditional’ Enterprise technology vendors – Oracle, IBM, SAS – which I’ll return to elsewhere, for reasons of brevity as well as the limits of my knowledge.
Big Data has evolved rapidly from something that five years ago was the exclusive preserve of the Web Giants to a set of tools that any company or enterprise can now utilise. Rather than build-your-own, ‘Big Data’ tool-kits and solutions are available on a service or rental model from a variety of vendors in the Infrastructure or Platform as-a-Service space, from ‘specialists’ such as Hortonworks, MapR or Cloudera, to the ‘generic’ IaaS cloud platforms such as AWS, Azure or Google.
As well as this democratisation, one of the chief changes in character has been from ‘batch’ to ‘non-batch’ in terms of architecture, latency and the applications this can then solve or support. ‘Big Data’ must also be ‘Fast Data’ now, which leads straight into Stream or Event processing frameworks.
Other developments focus on making this faster, primarily around Spark and related stream or event processing. Even as a non-developer, I particularly like the Manning.com book series, for instance Nathan Marz’s ‘Big Data‘, Andrew Psaltis’s ‘Streaming Data‘ and Marko Bonaci’s ‘Spark in Action‘, and I also appreciated talking with Rene Houkstra at Tibco regarding their own StreamBase CEP product.
In technical terms this is well illustrated in the evolution from a batch data store and analytics process based on Hadoop HDFS / MapReduce / Hive towards stream or event processing based on more ‘molecular’ and ‘real-time’ architectures using frameworks and tools such as Spark / Storm / Kafka / MemSQL / Redis and so on. The Web PaaS giants have developed their own ‘flavours’ as part of their own bigger Cloud services based on internal tools or products, for example Amazon Kinesis and Google Cloud Dataflow.
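To make the batch-versus-stream distinction concrete without pretending to be Storm or Spark: a stream processor handles events one at a time and emits results as each window closes, rather than waiting for the whole dataset. A toy tumbling-window counter (window size and event shape entirely my own choices) looks like this:

```python
from collections import Counter

def tumbling_window_counts(events, window_ms=1000):
    """events: iterable of (timestamp_ms, key) pairs, assumed time-ordered.
    Yields (window_start_ms, Counter) as soon as each window closes,
    instead of waiting for the whole dataset like a batch job would."""
    current_start, counts = None, Counter()
    for ts, key in events:
        window_start = (ts // window_ms) * window_ms
        if current_start is None:
            current_start = window_start
        elif window_start != current_start:
            # the event belongs to a later window: emit the old one immediately
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if counts:
        yield current_start, counts   # flush the final partial window
```

The real frameworks add the hard parts – distribution, fault tolerance, out-of-order events – but the ‘emit as you go’ shape is the essential change from the MapReduce model.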
As with many ‘big things’ there is an important evolution to bear in mind, and a question of how different vendors and tools fit into it. For example, at Sports Alliance we’ve just partnered with Tibco for their ‘entry’ SOA / ESB product BusinessWorks. I’ve discussed the Event Processing product with Tibco, but only for later reference or future layering on top. This product has an evolution inside Tibco of over a decade – ‘Event’ or ‘Stream’ processing was not necessarily invented in 2010 by Yahoo! or Google, and the enterprise software giants have been working in this area for a decade or more, driven primarily by industrial operations and financial services. Tibco use a set of terms including ‘Complex Event Processing’ and ‘Business Optimization’, which work on the basis of an underlying event stream sourced from disparate SOA systems via the ESB, and an In-Memory ‘Rules Engine’, where the state-machine or ‘what-if’ rules for pattern recognition are, or may be, Analyst-defined (an important exception to the ‘Machine Learning’ paradigm below) and applied within the ‘Event Cloud’ via a correlation or relationship engine.
The example below is for an ‘Airline Disruption Management’ system, applying Analyst-defined rules over a 20,000 events per second ‘cloud’ populated by the underlying SOA systems. Whether it’s a human-identified pattern or not, I’m still reassured that the Enterprise Software market can do this sort of thing in real-time, in the ‘real world’.
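I have no idea how Tibco implement their engine internally, but the Analyst-defined-rules idea itself is simple enough to caricature: rules are hand-written predicates correlated over an in-memory window of recent events, not learned patterns. All names and event fields below are my own illustrative inventions:

```python
from collections import deque

class RuleEngineSketch:
    """Toy CEP engine: analyst-defined predicates over an in-memory event window."""
    def __init__(self, window_size=100):
        self.window = deque(maxlen=window_size)   # the in-memory 'event cloud'
        self.rules = []                           # list of (name, predicate)

    def add_rule(self, name, predicate):
        # a rule is just a human-written function over the current window
        self.rules.append((name, predicate))

    def on_event(self, event):
        """Correlate the new event against the window; return names of fired rules."""
        self.window.append(event)
        return [name for name, pred in self.rules if pred(self.window)]

# An analyst-defined rule: flag any flight with two or more 'delay' events in the window
def repeated_delay(window):
    flights = [e["flight"] for e in window if e["type"] == "delay"]
    return any(flights.count(f) >= 2 for f in set(flights))
```

The contrast with the Machine Learning paradigm below is exactly here: `repeated_delay` is a pattern a human decided mattered, not one inferred from data.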
The enterprise market for this is summarised as ‘perishable insights’ and is well evaluated by Mike Gualtieri at Forrester – see his “The Forrester Wave™: Big Data Streaming Analytics Platforms, Q3 2014“. Apart from the Enterprise software vendors such as IBM, I’ll link very briefly to DataTorrent as an example of a hybrid batch / tuple model, with Google’s MillWheel also apparently something similar(?).
Supporting this scale at speed also means In-Memory Computing. I don’t personally know a lot about this, so this is the briefest of brief mentions. See for example the list of contributors at the In-Memory Computing Summit in SF in June this year here. Reading through the ‘case studies’ of the vendors is enough to show the ‘real world’ applications that work in this way. It also touches on some of the wider debates such as ‘scale-up’ v ‘scale-out’, and what larger hardware or infrastructure companies such as Intel and Pivotal are doing.
Machine Learning at Speed: Berkeley BDAS and Spark!
So we’re back to where we started. One of the main issues with ‘Machine Learning’ at either scale or speed, in many guises, is the scalability of algorithms and the non-linearity of their performance, particularly over clustered or distributed systems. I’ve worked alongside statisticians working in R on a laptop, and we’ve had to follow rules to sample, limit, condense and compress in order not to overload the machine or time out.
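The ‘sample, limit, condense’ discipline I mean can be as humble as reservoir sampling: a single pass that keeps a fixed-size uniform sample of a dataset too large to hold or model in full. This is a textbook technique, not the specific rules we used:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length,
    in one pass and O(k) memory (Algorithm R)."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)            # fill the reservoir first
        else:
            j = rng.randint(0, i)          # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample
```

Hand a statistician `reservoir_sample(rows, 100_000)` instead of the raw feed and the laptop-R workflow above stops timing out, at the cost of sampling error.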
In the Enterprise world one answer to this has been to ‘reverse engineer’ and productise accordingly, with the investment required to keep this proprietary and closely aligned with complementary products in your portfolio. I’m thinking mainly of Tibco and their Spotfire / TERR products, which I understand to be ‘Enterprise-speed’ R.
Another approach is to compare the evolution of competing solutions within the Apache ecosystem. Mahout was initially known to be ‘slow’ to scale – see for instance a 2012 post by Ted Dunning on the potential for scaling a k-nn clustering algorithm inside the MapR Mahout implementation. Scrolling forward a few years, this now looks to be competitive territory between separate branded vendors ‘pushing’ their version of speed at scale. I couldn’t help noticing this as a Spark MLlib v Mahout bout in a talk from Xiangrui Meng of Databricks (Spark as a Service), showing not only the improvements in their MLlib 1.3 over 1.2 (yellow line v red line) but ‘poor old Mahout’ top left in blue making a bad job of scaling at all, for a ‘benchmark’ of an ALS algorithm on Amazon Reviews:
So one valid answer to ‘So how do I actually do Machine Intelligence at Speed’ seems to be ‘Spark!’, and Databricks has cornered the SaaS market for this.
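For anyone wondering what the ALS benchmark above actually computes: Alternating Least Squares factorises a sparse ratings matrix by fixing the item factors while solving for user factors, then swapping, until it converges. A rank-1, pure-Python sketch (MLlib’s version is distributed, higher-rank and far more careful; the data and hyperparameters here are illustrative only):

```python
def als_rank1(ratings, n_users, n_items, n_iters=20, reg=0.1):
    """ratings: list of (user, item, value) triples.
    Alternately solves each user's and each item's 1-d regularised
    least-squares problem in closed form."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(n_iters):
        # fix item factors v, solve each user's scalar least squares
        for uu in range(n_users):
            num = sum(r * v[i] for (usr, i, r) in ratings if usr == uu)
            den = sum(v[i] ** 2 for (usr, i, r) in ratings if usr == uu) + reg
            u[uu] = num / den
        # fix user factors u, solve each item factor the same way
        for ii in range(n_items):
            num = sum(r * u[usr] for (usr, i, r) in ratings if i == ii)
            den = sum(u[usr] ** 2 for (usr, i, r) in ratings if i == ii) + reg
            v[ii] = num / den
    return u, v

def predict(u, v, user, item):
    return u[user] * v[item]
```

Every pass over `ratings` here is a full scan, which hints at why scaling this to Amazon-Reviews size across a cluster is the hard (and benchmarkable) part.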
The Databricks performance metrics quoted are impressive, even to a novice such as myself. The evolving ecosystem, from technologies and APIs to partners and solutions providers, looks great from a distance. There are APIs, pipeline and workflow tools, and a whole set more.
Databricks is a child of AMPLab in Berkeley. The Berkeley Data Analytics Stack (BDAS) provides us with another (third) version of ‘architecture’ for both Big Data and Machine Learning at Speed.
BDAS already has a set of ‘In-house Apps’ or projects working, which is a good sign or at least a direction towards ‘application’. One example is the Cancer Genomics Application ADAM, providing an API and CLI for manipulation of genomic data, running underneath on Parquet and Spark.
Velox, one of the most recent initiatives, is for model management and serving within the stack. It proposes to help deliver ‘real-time’ or low-latency model interaction with the data stream it is ingesting, a form of ‘self-learning’ through iterative model lifecycle management and adaptive feedback. Until recently, only the large-scale ‘Web giants’ had developed their own approaches to manage this area.
This is particularly exciting, as it provides a framework for testing, validation and ongoing lifecycle adjustments that should allow Machine Intelligence model implementation and deployment to adapt to changing behaviours ‘online’ and not become obsolete over time, or at least not as quickly, before they require another round of ‘offline’ training and redeployment.
The examples given (for instance above, for a music recommender system) are relatively constrained, but they show the power of this not only to make model lifecycle management more efficient, but also to help drive the creation of applications that rely on multiple or chained models – and thus a higher degree of complexity in lifecycle management – and on models involving radically different data types or behavioural focus, which I’m going to look at later. And all at speed!