
Machine Learning Sandcastles

Considerations on why we need a dominant design & not another startup building a machine learning platform

By Gregg Barrett, Head of Cirrus

At most organisations, machine learning (ML) represents a disparate mix of tasks and tools: data engineers work on data pipelines, data scientists on data analysis, model training, validation and testing, hardware engineers on the compute configuration, and software engineers on deployment. That tasks and tools are so segregated contributes to the high failure rates of ML projects and constrains both the quality and quantity of ML project throughput. This has led to the development of ML platforms. By an ML platform I mean an all-in-one product for data and model development, scaling experiments across multiple machines, tracking and versioning models, deploying models, and monitoring performance.

The landscape for these all-in-one platforms is still underdeveloped, though, and while this might sound appealing to those with engineering and startup aspirations, there is a need for caution.

Notes: Gartner: Data Science and Machine Learning (ML) Platforms

The process from a data science standpoint

The Cross Industry Standard Process for Data Mining (CRISP-DM) is an example of an open standard process model that describes many common data science tasks that today typically form part of the ML lifecycle.

It is now increasingly common for organisations that have competence in ML to have a single person owning the entire lifecycle.

Treating ML as an engineering problem is a problem

While it might be tempting to treat ML as an engineering problem, those that do lose sight of the broad swath of users who are critical to the process and who are not engineers. Subject matter experts, for example, frequently provide input on everything from data labeling and feature engineering to model evaluation. These experts need to be able to participate effectively in the process, yet many are not programmatically savvy, so it is reasonable to envisage a low-code environment for them. Supporting low-code users requires layers of abstraction, with each layer well defined. The highest level of abstraction, user interface (UI) driven ML, is yet to be realised for most subject matter experts; for a moment, however, imagine the emergence of many ML platforms in the marketplace that provide a UI-driven capability for these users.

Without doubt each platform will operate differently. Pretty much everyone knows how to operate Microsoft Office because it is the dominant design: the design that dominates the marketplace, and by a large margin over similar offerings. In the case of Microsoft Office, it is the same design whether you are at home, at the office or at an academic institution. Without a dominant design there is no consistent and simplified user experience across organisations, domains, and tools.

Rapid iteration where ML code is but a small piece

ML platforms, like general software systems, need to be reliable, scalable and maintainable, with the addition of being adaptable, as they are systems that learn from data, which can change frequently, necessitating rapid development and deployment cycles. In addition, ML at present is in general an inherently empirical process, with success in part proportional to experimental throughput, requiring rapid iteration.

However, unlike traditional systems where programming is predominantly the hard part, in ML programming is but a small fraction of the overall effort, and dealing with the technical debt introduced by all the various components is a significant constraint on throughput.

Figure: Generic tasks (bold) and outputs (italic) of the CRISP-DM reference model

Figure: Distinct tones symbolise the distinct possible tasks for an individual or team to perform. A. By separating distinct types of tasks, specialists can address more conditions than one individual; the numbers add. Two specialists can do twice as many tasks as one: if each can do 10,000 tasks, together they can do 20,000. B. Teams can address an even more diverse set of conditions because the numbers multiply: for a two-member team it would be 10,000 x 10,000 = 100,000,000.

Notes: Best Practices for ML Engineering

Notes: Taxonomy of real faults in deep learning systems

Frameworks and APIs

ML deployed in the wild can resemble a hydra of models and algorithms. A deployment could include an ensemble of Principal Component Analysis, Gradient Boosting, and Neural Networks. In terms of an ML platform, the decision on which frameworks and libraries to support is heavily influenced by Deep Learning, although support for Deep Learning alone is wholly insufficient. Supporting the predominant Deep Learning frameworks and libraries like PyTorch and TensorFlow brings APIs into the picture.
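As a minimal sketch of such a hydra (assuming scikit-learn is available; the dataset, model choices and parameters here are purely illustrative), PCA can feed a soft-voting ensemble of gradient boosting and a small neural network:

```python
# Illustrative "hydra": PCA for dimensionality reduction feeding a
# soft-voting ensemble of gradient boosting and a small neural network.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

ensemble = make_pipeline(
    PCA(n_components=10),  # compress 20 features down to 10 components
    VotingClassifier(
        estimators=[
            ("gbm", GradientBoostingClassifier(random_state=0)),
            ("nn", MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                                 random_state=0)),
        ],
        voting="soft",  # average the predicted class probabilities
    ),
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```

Even this toy pipeline mixes three quite different algorithm families behind one interface, which is precisely why platform support for a single framework family is insufficient.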

Given the nascency of the landscape, with no overarching standard or dominant design, there is a plethora of APIs, which are costly to develop and maintain. APIs need to be open sourced to ensure that architectures remain open and organisations are not forced into proprietary technology stacks. In addition, there is a need for development of, and access to, data orchestration rules and APIs as a single interface in order to support the deployment of ML across distributed environments (see ModelOps). This points to the need for a more universal platform and less diversification.

Rethinking architecture

Rather than simply combining the disparate systems that exist today to meet the unique requirements of ML systems, there is a need to rethink the design of the ecosystem supporting ML, particularly when it comes to data: what is now termed a “data-centric” approach to ML. Whether to pursue a Data-Oriented Architecture (DOA) or a Microservice Architecture is an important discussion that needs to take place on this front.

A DOA approach is essentially a streaming-based architecture that makes data flowing between elements of business logic more explicit and accessible; tasks such as data discovery, collection and labeling are thus made simpler. Microservice Architecture is, however, pervasive due to its scalability.
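To make the contrast concrete, here is a toy sketch of the data-oriented idea in plain Python (the stream names and stages are invented for illustration): stages never call one another directly but read from and write to named streams, so every intermediate dataset is discoverable by name.

```python
# Toy data-oriented sketch: business-logic stages communicate only via
# named streams, never via direct calls between services.
from collections import defaultdict

streams = defaultdict(list)  # stream name -> ordered records


def publish(stream, record):
    streams[stream].append(record)


def subscribe(stream):
    # Any stage (or a labeling/discovery tool) can read any stream.
    return list(streams[stream])


# Stage 1: ingestion writes raw events to a stream.
for price in [100, 101, 99]:
    publish("raw.prices", {"price": price})

# Stage 2: a feature builder consumes the stream and publishes features.
for rec in subscribe("raw.prices"):
    publish("features.prices", {"positive": rec["price"] > 0})

# Data discovery falls out for free: every dataset is listed by name.
print(sorted(streams))  # ['features.prices', 'raw.prices']
```

In a microservice design the equivalent feature builder would call the ingestion service's API directly, and the intermediate data would be hidden inside each service rather than addressable by name.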

Notes: Modern Data Oriented Programming, Milan: An Evolution of Data-Oriented Programming & Data-Oriented Architecture

Open source as a standard

While there is no dominant design for an ML platform, open source has become the standard approach. There are a number of reasons for this; in the case of startups, customer organisations frequently require open source so that, in the event of the startup failing, they retain access to the source code.

Pursuing open source also means that the startup is now competing with existing open source tools and has to determine a viable business model that includes some sort of mix of proprietary and open source features. For a startup this is no easy task. If you are a large organisation like Google for example you can direct funds from profitable parts of the business to support open source efforts on TensorFlow to create an ecosystem with a virtuous circle of network effects — the more people use the framework, the more people know about it, in turn leading to more users — leading to a large sustaining ecosystem where there are opportunities to monetise proprietary tools and services like Google Cloud Platform.

Google’s TensorFlow team is rumoured to be almost 1,000 strong. In terms of an ML platform, I have yet to see many opportunities for:

1) establishing a large open source user community around a particular tool and

2) revenue opportunities from other sources that can fund the long runway to build such a community.

Notes: The Linux Foundation AI and Data, Mlflow, H2O

Focusing on a narrow use case

Many startups are now attempting to develop ML platforms that serve a particular domain, which, amongst other things, confines the startup to a smaller market and user base. My argument is that any ML platform must be agnostic to the use case, as the core underlying technology is not domain specific.

As the application of the underlying technology will not be confined to a single domain, the ability to build an enduring moat around a narrow use case will be found wanting: competitors serving larger markets, and therefore commanding a larger base, will encroach.

Notes: Farewell to “Watson For Drug Discovery”

Focusing on a small step in the workflow

An ML tool (I refer to “tool” in that it is not an all-in-one product) that only supports part of the ML lifecycle, say model training and evaluation, inevitably requires an organisation to stitch multiple tools together. As products in the ML stack are constantly evolving and there is no common industry standard for interfaces, the cost of developing and maintaining the necessary integration across the ML workflow is nontrivial.

Aside from all the integration headaches, the burden placed on users of having to be familiar with multiple tools and user interfaces inhibits adoption. For a large technology organisation where ML is a core component of products and services, stitching together tools to create an ML platform is the current approach, given the absence of a dominant design.

These organisations (unlike most) have the necessary skills, expertise, experience and resources to throw at the effort. Such organisations will typically focus on interoperability to build an integrated solution spanning the entire workflow and for them it is sufficient that the platform handle only the use cases of the products and services the organisation is supporting.

Notes: Meet Michelangelo: Uber’s Machine Learning Platform, Productionizing ML with workflows at Twitter, TFX: A TensorFlow-Based Production-Scale Machine Learning Platform, Introducing FBLearner Flow: Facebook’s AI backbone

Tightly coupled components

ML development and deployment environments are heterogeneous across organisations, and an ML platform that is too tightly coupled with upstream and downstream software components will find its portability restricted.

Similarly, an ML platform that is too tightly coupled to a specific hardware accelerator will itself be limited by the adoption of that hardware. In such circumstances the startup needs to be careful to ensure that it is not betting the future of its market on the adoption of a particular piece of hardware or software.

An ML platform needs to work in as many environments, and on as many hardware configurations, as possible, which brings us to ModelOps and Domain Specific Architectures.

Enter ModelOps

Model Operations (ModelOps) is concerned with the best practices and tools used to test, deploy, manage, and monitor ML models in real-world production, so as to avoid the problems associated with hidden technical debt.

ModelOps is particularly relevant in managing the evolution of the model and data changes in the context of the underlying heterogeneous software and infrastructure stacks in operation across organisations. With organisations moving to the cloud, the major cloud providers are now looking to integrate ModelOps with the rest of the organisational infrastructure which is driving a renewed focus on open integration across various ML tools and services.
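As a small illustration of the monitoring side of ModelOps (a deliberately simplified sketch; production systems use richer statistics such as population stability indices or KS tests, and the numbers here are invented), a deployed model's inputs can be checked for drift against the training distribution:

```python
# Simplified drift monitor: alert when the live mean of a feature drifts
# more than `threshold` standard errors from the training mean.
import statistics


def drift_alert(train_values, live_values, threshold=3.0):
    mu = statistics.mean(train_values)
    sd = statistics.stdev(train_values)
    se = sd / (len(live_values) ** 0.5)  # standard error of the live mean
    return abs(statistics.mean(live_values) - mu) > threshold * se


# Training distribution of one model input, centred near 10.
train = [10.0, 10.5, 9.8, 10.2, 9.9, 10.1, 10.3, 9.7]

print(drift_alert(train, [10.0, 10.1, 9.9, 10.2]))  # stable input: False
print(drift_alert(train, [14.8, 15.2, 15.1, 14.9]))  # drifted input: True
```

A check like this would run continuously against production traffic, with an alert feeding back into retraining or rollback, which is exactly the model-and-data evolution that ModelOps is meant to manage.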

Such an approach by the major cloud providers is however easier said than done, as most have spent several years building proprietary walls around their products and services. Until such time as these platforms are truly open it is questionable as to whether any of these offerings will be the pathway to a dominant design for an ML platform.

Enter Domain Specific Architectures

Domain Specific Architectures (DSAs), often called accelerators, are a class of processors tailored to a specific domain. This hardware-centric approach is driven by the performance and efficiency gains that come from tailoring the processor to the needs of the application. Examples of DSAs include Tensor Processing Units (TPUs) and Graphics Processing Units (GPUs).

DSAs use domain-specific languages (DSLs) to leverage memory access and parallelism and to improve the mapping of the application to the domain-specific processor. DSLs are a challenge, though, as while being designed for specific architectures the software needs to be portable to different environments. The vertical integration of hardware/software co-design for DSAs is also supportive of open architectures, as amongst other things this increases the number of users and improves security.

The question of what happens to programming languages like Python in the context of ML also remains open. Building a new language is a mountain of work, and at the speed at which ML is moving it would simply take too long. Looking at what is in existence, the best options appear to be Julia or Swift. At present Swift has little presence in the ML ecosystem and has mainly been used for iOS apps; however, in recent years both Apple and Google have been moving it along in similar directions, and from Google’s side there is S4TF (Swift for TensorFlow).

Notes: Graphcore Poplar, Cerebras, Machine Learning Systems are Stuck in a Rut, Flashlight: Fast and flexible machine learning in C++

Not understanding the requirement

The absence of a dominant design for ML platforms results in many having a poor understanding of the ultimate requirement — the ML platform builders all too often simply don’t know what they don’t know.

This results in 1) organisations significantly underestimating the scope and complexity of the undertaking and 2) decisions being made (and justified) on questionable grounds. Startups think that with a handful of engineers they can get something working.

Non-software technology companies in industry think it boils down to build or buy. And academic institutions, aided and abetted by external funding agencies, delude themselves into thinking that developing an ML platform is somehow fundamentally part of their research work and a good use of resources.

Ignorance, politics and fiefdom building aside, a major contributor to the build-it bias stems from people’s affection for their own creations: as crappy as it might be, those who built it ascribe more value to it. The reality is that most have grossly insufficient resources to achieve superior execution against the likes of Palantir, C3.ai, Databricks and others, who have already had several years of runway and thrown a lot of resources at the problem, and who may still not succeed at being profitable standalone businesses.

That for most organisations the value of ML lies in application, and not in building and maintaining an ML platform, seems obvious; yet the absence of a dominant design and a poor understanding of the requirement lead many to make poor purchasing decisions. Ultimately this creates an opportunity for startups to offer professional services around specific organisational needs; however, there has been far less startup activity and traction in this area than among those building tools and platforms.

Conclusion

Ultimately the marketplace will settle the debate on any dominant design for an ML platform. Part of that process will be the evolution of financial conditions. Interest rates at a 4,000-year nadir have misdirected capital, advanced speculation and perpetuated the unnatural lives of unsustainable businesses.

When capital is eventually repriced, the demise of many large and small vendors in the ML marketplace, together with the eventual maturing of the industry, will create opportunities for consolidation and encourage the evolution of some sort of dominant design. In the interim, providers and users of ML platforms should think more strategically about the considerations raised here, to navigate towards an enduring, value-creating solution.

Additional resources: The Coming Wave of ML Systems, Stanford MLSys Seminar Series, Full Stack Deep Learning, Chip Huyen
