A rising tide of AI industrialisation expectations is changing Data Science.
“A new era of data science is coming, one that shifts focus away from Value Discovery and onto Value Delivery, as organisations seek to capitalise on AI.”
Expert opinion article from Andrew Morgan, Director of Data at 6point6, author of “Mastering Spark for Data Science” (2017), and a 25-year practitioner in data engineering and science.
Notable new trends are emerging in innovative, data-mature organisations, and they indicate we are moving into a new era for data science.
Mature organisations are starting to change their strategies and tactics in how they organise their data people and how they build their data science platforms. These organisations aim to extract more value from data than their peers, and to do it more consistently, more rapidly, and more cost-effectively.
So what are these innovative changes they are making?
The first new trend we see is the merger of data science teams with general data analytics teams into larger organisations focussed on Data Enablement, with the goal of delivering AI-powered products and services in production.
It is a move that brings data engineering teams and data science teams together under the same management structure, reversing a decade of ring-fencing data science teams away from their generalist data analytics peers.
This is a big change in direction. Previously, data science teams had been broadly organised around an activity I call value discovery. Value discovery was a “getting started” approach, used early on to help organisations find out whether data science and AI technologies could be valuable to them. During this early phase of enterprise data science adoption, the focus was on building nascent AI teams, building their relationships with the business, and defining and testing hypotheses to discover how valuable the data science use cases might be. In short, what these teams discovered was whether or not AI could be of value to the business.
The legacy of this value discovery phase can be seen in the organisation chart. Five years ago, almost all data science teams I know of were ring-fenced away from generalist data analytics teams. Today we are starting to see mature organisations remove that ring-fencing. Part of the motivation is that they are demanding that their AI investments result in products that run in production, rather than limited proof-of-value applications that run in local laboratories, or under the desk. Leaders are now pushing teams to move to a new mode of working that I call value delivery. They want AI-led growth that shareholders can bank on.
The second new trend is a move towards investing in technologies that incorporate data engineering best practices, helping Data Scientists meet a rising tide of AI industrialisation expectations.
Data science leaders are now inheriting the requirements that general analytics teams (doing data engineering) have had for decades — such as provenance, auditability, security, resiliency, and scalability — and they are responding to them by investing in new types of tools that help them to raise their game. In these forward looking enterprises, business leaders have been very clear, “AI and data science isn’t an experiment anymore, it’s business as usual, please act accordingly.” The directive to productise AI is driving ideas from data engineering into the data science stack, a move further supporting the need to fold these teams together.
The rise of the Feature Store [1] is one of the clearest examples of this change in focus. Feature Stores, if you’ve not yet heard of them, are a specialist type of production-grade data warehouse, re-imagined to accelerate data scientists in building AI services. Feature Stores help data scientists share trusted training data, select and reuse trusted feature engineering code, and deliver data security. They are also scalable and resilient, both when training and when deploying AI models. In short, they help you do data-science-engineering in production rather than in labs, they accelerate best practices like MLOps, and the finished models can be reliably integrated with other production-grade services to deliver enterprise and web-scale applications.
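To make the idea concrete, here is a minimal sketch using Feast, a popular open-source feature store; the repository path, feature names, and entity values are hypothetical, and other feature store products expose similar APIs.

```python
# A minimal sketch of retrieving shared features from a Feast feature store.
# The repo path, feature names, and entity values are hypothetical.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a Feast feature repository

# Fetch trusted, centrally defined features for online inference.
features = store.get_online_features(
    features=[
        "driver_stats:trips_today",
        "driver_stats:avg_rating",
    ],
    entity_rows=[{"driver_id": 1001}],
).to_dict()

print(features)
```

The point is that the feature definitions, lineage, and serving infrastructure live in the store rather than in each data scientist’s notebook, which is what makes the features trusted and reusable across teams.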
The second example is the rise of Apache Spark as a first-class data science tool. Apache Spark is a highly scalable parallel computation environment that helps construct production-grade applications that process data. Having personally written a book on the subject in 2017, I have a great deal of first-hand experience of what I call Spark hesitancy in the data science community, which prefers Python tools that often do not scale natively. Year on year, this hesitancy is turning to advocacy. The change is partly a response to the monumental efforts to build cloud-ready products like Databricks and AWS Glue, and to community work to get Spark running on Kubernetes, all of which are making it easy to run cloud-native AI applications that organisations can trust to run lights-out. This brings me to the next emerging trend.
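As a flavour of why data scientists are converting, here is a minimal PySpark sketch of feature engineering; the source path and column names are hypothetical, but the same code runs unchanged on a laptop or on a large cluster.

```python
# A minimal PySpark sketch of feature engineering at scale.
# The input path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("feature-engineering").getOrCreate()

# Spark parallelises this across however many executors are available,
# whether running locally, on Databricks, AWS Glue, or Kubernetes.
trips = spark.read.parquet("s3://example-bucket/trips/")

# Aggregate raw trip records into per-driver features.
driver_features = (
    trips.groupBy("driver_id")
         .agg(
             F.count("*").alias("trips_total"),
             F.avg("rating").alias("avg_rating"),
         )
)

driver_features.write.mode("overwrite").parquet("s3://example-bucket/features/driver/")
```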
The third new trend is that data science teams are migrating to the cloud, and to move this along they are engaging third-party suppliers, often for the first time.
Migrating to the cloud is challenging, and engineering skills not typically found in data science teams are needed to make it happen. As established teams struggle to scale up these new skills, they are starting to reach out to firms like 6point6 to help them make the transition.
Our Response:
To summarise, I see a large structural shift under way in the enterprise Data Science space. The new challenge being set by leaders is firmly aimed at Value Delivery and the industrialisation of established use cases to drive shareholder value. The emerging technical and organisational tactics we are observing all corroborate this change in direction, and I expect it to spread to all sectors as a wider group of organisations mature past the discovery phase.
This trend is so clear that I myself have created a dedicated Feature Engineering (FE) team at 6point6 that is specifically positioned to offer these organisations the help they need to make that data science transition to the cloud. Feature Engineering is where I’m placing my bets.
We have spent nearly 18 months establishing a blended set of Data Engineering teams that are uniquely positioned to offer Data Science leaders this help. The group brings deep experience in Data Science, Cloud Security, DevOps, DataOps, MLOps, Feed Management, and Feature Store Platform design together under a single Feature Engineering banner.
In our first few engagements with large organisations, we have learned that the new Data Science client / Feature Engineering supplier relationship can be difficult. Client-side data scientists who have lived through the Value Discovery era have a culture of rapidly building Proof of Concept (POC) solutions. These data scientists feel greatly challenged when Value Delivery-focussed enterprise solutions, often rooted in data engineering best practice, are introduced, forcing them into production environments. They worry that the new tools and new ways of working have “poor ergonomics” that will “slow them down,” and perhaps they are right. The new tools are not designed to accelerate use case discovery; rather, they are designed to productionise, manage, and scale up the success stories reliably for years to come, heralding a new era of value delivery.
A concrete example is the introduction of proper binary management of data science software libraries. Data engineering teams have been using tools like JFrog Artifactory for a decade to manage shared enterprise binaries, but data science teams, who typically install software straight from CRAN and PyPI, remain sceptical. Once you move to the cloud, these types of binary management tools are imperative; without them you run a high risk that your solutions will be unstable. For many data scientists, however, the benefits of the added process and complexity are not yet clear, and on many points like this there is significant push-back. To reduce this challenge and friction at the “coalface”, business leaders should be more explicit with their data science teams about their industrialisation expectations as data science use cases become the new normal and part of the core infrastructure. Non-functional requirements need to be articulated as part of the data science goalposts.
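As a flavour of what this looks like in practice, here is a minimal sketch of pointing pip at an internal Artifactory PyPI repository rather than the public index; the host and repository names are hypothetical placeholders.

```ini
# ~/.pip/pip.conf: a minimal sketch. The Artifactory host and repository
# names below are hypothetical placeholders.
[global]
index-url = https://artifactory.example.com/artifactory/api/pypi/pypi-virtual/simple
```

With this in place, every pip install resolves through a curated, auditable repository instead of the open internet, which is what makes the dependency set stable and reviewable; the trade-off is exactly the added process that data scientists push back on.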
Further Reading:
[1] A great article from Uber Engineering is here: https://eng.uber.com/optimal-feature-discovery-ml/