ISSUE 6
JAMES DUEZ THE FUTURE OF ENTERPRISE AI IS NEUROSYMBOLIC
5 ENTERPRISE USE CASES FOR LLMS BY COLIN HARMAN
TRANSFORMING FREIGHT LOGISTICS WITH ML BY LUÍS MOREIRA-MATIAS
LLMS & THE FUTURE OF ADVERTISING BY JAVIER CAMPOS
LUÍS MOREIRA-MATIAS
THE SMARTEST MINDS IN DATA SCIENCE & AI
JULIA STOYANOVICH
PATRICK MCQUILLAN
Transforming Freight Logistics with AI and Machine Learning – LUÍS MOREIRA-MATIAS: “We’re not designing AI algorithms to replace humans. What we’re trying to do is to enable humans to be more productive.”
The Path to Responsible AI – JULIA STOYANOVICH: “Responsible AI is about human agency. It’s about people at every level taking responsibility for what we do professionally… the agency is ours, and the responsibility is ours.”
Data Strategy Evolved: How the Biological Model Fuels Enterprise Data Performance – PATRICK MCQUILLAN: “The concept takes the idea that the best way to design a data strategy is to align it closely with a biological system.”
Expect smart thinking and insights from leaders and academics in data science and AI as they explore how their research can scale into broader industry applications.
Helping you to expand your knowledge and enhance your career. Hear the latest podcast over on
datascienceconversations.com
INSIDE ISSUE #6
CONTRIBUTORS
James Duez Luís Moreira-Matias Javier Campos Michel de Ru Colin Parry Bartmoss St. Clair Francesco Gadaleta Rebecca Vickery Colin Harman
EDITOR
Damien Deighan
06 | COVER STORY: JAMES DUEZ – AI the World Can Trust: Bridging the Gap Between Data Science and Decision Intelligence With Neurosymbolic AI
12 | TRANSFORMING FREIGHT LOGISTICS WITH ADVANCED MACHINE LEARNING Luís Moreira-Matias / sennder
16 | NAVIGATING THE FUTURE OF ADS: LARGE LANGUAGE MODELS AND THEIR IMPACT ON PROGRAMMATIC STRATEGIES Javier Campos / Fenestra
21 | LEVERAGING CONTEXT FOR SUSTAINABLE COMPETITIVE ADVANTAGE WITH GENAI Michel de Ru / DataStax
28 | MEASURING THE EFFECT OF CHANGE ON PHYSICAL ASSETS Colin Parry / Head For Data
32 | USING LLMS IN LANGUAGE FOR GRAMMATICAL ERROR CORRECTION (GEC) Bartmoss St. Clair / LanguageTool
36 | HOW LANGUAGE MODELS ARE THE ULTIMATE DATABASE Francesco Gadaleta / Amethix Technologies & Data Science At Home Podcast
40 | A JOURNEY FROM DATA ANALYST TO DATA SCIENCE LEADER Rebecca Vickery / EDF
42 | THE 5 USE CASES FOR ENTERPRISE LLMS Colin Harman / Nesh
DESIGN
Imtiaz Deighan imtiaz@datasciencetalent.co.uk
NEXT ISSUE
21ST MAY 2024
DISCLAIMER
The Data Scientist is published quarterly by Data Science Talent Ltd, Whitebridge Estate, Whitebridge Lane, Stone, Staffordshire, ST15 8LQ, UK. Access a digital copy of the magazine at datasciencetalent.co.uk/media.
The views and content expressed in The Data Scientist reflect the opinions of the author(s) and do not necessarily reflect the views of the magazine, Data Science Talent Ltd, or its staff. All material is published in good faith. All rights reserved: products, logos, brands and any other trademarks featured within The Data Scientist are the property of their respective trademark holders. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form by means of mechanical, electronic, photocopying, recording or otherwise without prior written permission. Data Science Talent Ltd cannot guarantee, and accepts no liability for, any loss or damage of any kind caused by this magazine, or for the accuracy of claims made by the advertisers.
EDITORIAL
HELLO, AND WELCOME TO ISSUE 6 OF THE DATA SCIENTIST
As we reflect on the rapid progress that occurred in 2023, we continue the conversation on what’s happening in enterprise Data & AI around the world. LLMs are incredible tools, but there are relatively few real enterprise use cases currently.
After the initial ChatGPT breakthrough moment and the relentless pace of innovation that followed in 2023, it’s been interesting to observe a much-needed realism appear from many in the Data and AI community. On November 27th, I spoke at the AI World Congress in London. My main message was that, even though LLMs might well be one of the most important inventions in recent human history, it could take several years for GenAI to prove real value in the complex environment of large enterprises.
The diversity of topics and speakers at the conference was excellent. It was good to see a general consensus from most of the speakers about the actual enterprise use cases for LLMs that have real potential in the next 1-2 years. These are marketing & content generation, code automation, and summarisation of documents in legal, compliance and R&D. In general, there was a distinct lack of hype around GenAI, and most people are very realistic about the road ahead, which will require a lot of hard work.
Probably the most important part of this journey involves ensuring the data foundations are strong, and Michel de Ru of DataStax explains how this can be done with open source vector database technology.
We have several articles in this issue that focus on the genuine uses of LLMs in the enterprise setting. Colin Harman sets out the categories of use cases for enterprise LLMs in his piece. The interview with Bartmoss St. Clair of LanguageTool reveals the solid use case for LLMs in Grammatical Error Correction (GEC). Finally, Javier Campos talks about the use of LLMs in the programmatic advertising industry.
Is Neuro-symbolic AI the Future for Enterprise?
The limitations of GenAI are well documented. Less well known is the work of a handful of companies and individuals working in the developing area of neurosymbolic AI. This field combines knowledge graphs with AI & ML, and has the power to deliver AI that is explainable and reliable. Knowledge graphs have been an integral part of the success of the Google, Facebook and Amazon platforms for over a decade, yet they are rarely talked about.
Our front cover and main story features James Duez, founder of Rainbird AI. Rainbird are probably the world’s leading company in using knowledge graphs in an enterprise setting, and have been doing so for over a decade. They are the first AI company to feature on the front cover.

The vital role that traditional ML & data science still play
As GenAI continues to develop, ML and data science will carry on playing a vital role in enterprise. Rebecca Vickery of EDF is back to tell us about her journey in data science and how she got started. In our interview with Luís Moreira-Matias, Director of AI at sennder, Moreira-Matias talks in depth about how they have generated high growth consistently in recent years by transforming the freight logistics industry with advanced ML. Colin Parry lays out how he used CNNs to forecast energy consumption in buildings accurately, and our regular contributor Francesco Gadaleta discusses how LLMs can actually be viewed as the ultimate database. Overall, this might well be our strongest issue yet, so I really hope you enjoy the magazine and we would love to hear from you if you want us to run a feature on you or your company in 2024.
Damien Deighan Editor
JAMES DUEZ - RAINBIRD
AI THE WORLD CAN TRUST
BRIDGING THE GAP BETWEEN DATA SCIENCE AND DECISION INTELLIGENCE WITH NEUROSYMBOLIC AI
JAMES DUEZ is the CEO and co-founder of Rainbird.AI, a decision intelligence business focused on the automation of complex human decision-making. James has over 30 years’ experience building and investing in technology companies with experience in global compliance, enterprise transformation and decision science. He has worked extensively with Global 250 organisations and state departments, is one of Grant Thornton’s ‘Faces of a Vibrant Economy’, a member of the NextMed faculty and an official member of the Forbes Technology Council.
The evolution of artificial intelligence (AI) is a story of remarkable breakthroughs followed by challenging realisations and it dates back to the dawn of computing. AI has gone through a series of ‘summer’ and ‘winter’ cycles for decades, driven by hype, and followed by troughs of failed deployments and disillusionment. And here we are again, this time marvelling at generative AI and the capabilities of large language models (LLMs) like GPT-4. Data leaders are continuing to figure out how to leverage LLMs safely, to discover where the value actually is and what may come next.
Of course, AI was never ‘one thing’, but rather a broad church of capabilities, all at the forefront of innovation, trying to add value. Since the term AI was coined in 1955 by John McCarthy, we have suffered from the AI Effect. This is a phenomenon where once an AI capability starts to deliver value, it often ceases to be recognised as
intelligent, or as true AI. This shift in perception occurs because the once-novel feats of technology become integrated into the fabric of everyday technology, leading people to recalibrate what they consider to be AI, often in search of the next groundbreaking advancement that eludes current capabilities. As you work in data, you’ll know that it’s your job to use AI technologies to drive value and to do so responsibly, ensuring that AI models provide reliable outcomes as well as comply with potential upcoming regulations regarding transparency, explainability, and bias. So how can you cut through the current market hype to locate the true business value? Like the generations of data-centric AI before it, we are becoming acutely aware of the challenges this latest chapter of generative AI presents. The power to generate content is no longer in question, but the real test, once again, is to focus on the business outcomes we are trying to achieve. Are we looking to generative AI to enhance the user experience of experts, or to tackle decisioning? And if the latter, can we trust the decisions it might make?
People forget their AI history. Back in the 1980s and 90s, the structured, rule-based world of symbolic AI was all the rage. Back then, AI wasn’t even about data but instead was focussed on solving problems by representing knowledge in models that could be reasoned over by inference engines. We called it symbolic AI because it focused on the manipulation of symbols – discrete labels that represent ideas, concepts, or objects, which when combined with logic, could perform tasks that required intelligent behaviour. This approach to AI was based on the premise that human thought and cognition could be represented through the use of symbols and rules for manipulating them.
Symbolic AI’s golden age had governments investing billions in AI, much as we see in the hype bubble we have around generative AI today. But, in time, symbolic AI gave way to sub-symbolic AI, later known as machine learning (ML), due to the latter’s ability to learn directly from data without the need for explicit programming of rules and logic. In fact the pendulum swung so comprehensively, many forgot about symbolic AI and it retreated back to academia.
ML has revolutionised numerous fields, from autonomous vehicles to personalised medicine, but its Achilles’ heel has always been the opacity of its prediction-making. The ‘black box’ nature of ML models has long been deemed unacceptable and unreliable when it comes to making decisions of consequence, especially those that are regulated, such as in financial services and healthcare. We value ML for its ability to make predictions based on the past, but it is not in itself a decision-making technology. It cannot be left unchecked. Today, we are living in a world that increasingly demands transparency and accountability, and a focus on outcomes and their consequences.
As I write this, I am aware there are many data scientists reading who may proclaim that the ‘black box’ problem is intrinsically solved and that they understand their ML models. There are many methods for statistical understanding of how ML models work, but none of them comprise a description of the chain of causation that leads to a decision. Those who are responsible for such business decisions need to understand why and how individual outcomes are achieved, if they are to trust AI. That trust remains lacking.
As a society, we have a long history of being harmed by technology that we didn’t understand. It is no surprise then that despite the massive value we have extracted from data science and ML, regulators have increasingly felt the need to govern it due to this trust gap. We have to accept that what ML produces is a prediction, not a judgement, and therefore a degree of human intervention and oversight is required downstream from an ML output to bridge the gap between ‘insight’ and our ability to take ‘action’.
In parallel to this evolution of AI has been the rise of the automation agenda. This largely separate function is obsessed with efficiency, and has its roots in linear process automation like robotic process automation (RPA) and other rule-based workflow tools. Process automation has been focused on reducing the cost of, and reliance on, human labour. But increasingly touches on the organisational desire to digitise products and services to meet evolving human demands for 24/7 multi-channel experiences.
The majority of the technology used by automation teams has historically been rules-based and linear, although the last decade has seen increasing attempts to leverage AI tools to achieve intelligent document processing (IDP) and other more complex tasks. But, despite these efforts, process people have also struggled with a gap, with their toolkits falling short of being able to automate the more complex, human-centric and contextual decision-making that should naturally follow the automation of simple and repetitive tasks.
It’s like everyone on ‘planet data’ is looking for ways of leveraging AI to automate more complex human decision-making, and everyone on ‘planet process’ is doing the same.
So does this new summer of generative AI and LLMs close this gap? They have certainly pushed the boundaries of text generation and natural language understanding, but are they the answer to decisioning? Unfortunately, the excitement around LLMs remains tempered by this same critical undercurrent of concern that exists around all other ML – that outputs generated by them still lack the transparency and accountability required for us to fully trust them. LLMs are enhancing the user experience and efficiency of experts, but due to their statistical nature, we cannot delegate decision-making to them unchecked.
So what IS the answer? Fortunately, a new field has emerged, that of decision intelligence (DI), founded by Dr Lorien Pratt. There are many labels for this evolving field. For example, Forrester refers to it as ‘AI decisioning.’ Whatever the label, it represents a paradigm shift and is an absolute superpower for anyone who adopts it. Gartner defines decision intelligence as: ‘a practical
discipline used to improve decision-making by explicitly understanding and engineering how decisions are made and how outcomes are evaluated, managed and improved by feedback’. They already recognise it as having the same transformational potential, on the same timescale, as the whole of generative AI – a $10bn market as of 2022 that is growing at a tremendous rate.
But what is it? Data science has had decades of being a technology looking for problems to solve. DI starts with a focus on the decisions we are trying to make, and then works backwards from there to the tools we might use to make them. It leverages a hybrid approach, combining a number of technologies to achieve the desired business outcomes. That sounds terribly simple, but when we take this ‘outcome first’ approach, we find that the tools lie not only in data but in knowledge.
DI encourages us to combine the analytical power of ML with the clarity and transparency of human-like reasoning that is synonymous with symbolic AI. This has led to the development of neurosymbolic AI methodologies, with explainability and trust baked into the heart of both the technology and the methodology. One approach is to leverage and extend knowledge graphs – nonlinear representations of knowledge – over which a very fast symbolic reasoning engine can reason to answer queries.
Unlike most AI, this neurosymbolic approach is not making a statistical prediction over historic data, it is reasoning over knowledge and data, juggling all the necessary probabilities that are synonymous with the real world, to automate complex decisions in a transparent and explainable way. A neurosymbolic approach, working as a composite of machine learning and symbolic AI, allows us to address use cases featuring a greater degree of decision complexity. It powers solutions for some big organisations like Deloitte, EY and BDO. Decision complexity measures the chain of events following a decision, along with the number of external factors that influence the decision outcomes in combination with the action related to the decision. By way of example, a low-decision complexity use case might be the choice to show a particular advertisement on a platform like Google or Facebook. The low complexity derives from the platform’s goal to create a simple behaviour; for the user to click on the ad and then possibly buy the product. In contrast, a high-complexity decision might involve choices regarding a tax policy created by a government: there are complex implications of such a decision, which ripple through society and into the future. Decisions like these require a high degree of human expertise, which is not captured in data sets. The world is attracted to LLMs because we all like the idea that they can process unstructured inputs, and provide us with natural language answers. But most have now realised their limitations. LLMs cannot reason, they are ‘stochastic parrots’, designed to produce the mean best output. We must look at LLMs as prediction machines, well suited to creating predicted drafts of content to aid experts, but not capable of the real judgement required to make decisions of consequence in high-complexity domains. Even with the utilisation of fine-tuning or techniques like retrieval augmented generation (RAG), there is a high degree of risk of hallucination (generative AI’s term for outputting errors). When you combine this with its inability to provide a chain of reasoning, you can see why powering complex decisioning is not on the cards. Like all ML, generative AI techniques remain the extension of search and are not a proxy for reasoning or decision-making.
However, it transpires that LLMs represent a powerful missing jigsaw piece in a different paradigm. Their birth now enables graph models to be built programmatically, turning other structured forms of knowledge into transparent knowledge graphs. LLMs are phenomenally good at understanding language, and as such are extremely well-suited to extracting knowledge graphs from documents. Properly tuned and prompted, you can use an LLM to process regulations, policies, or a standard operating procedure and have it return a well-formed knowledge graph that represents the expertise contained within this documentation. It’s even possible to extract weights and logic from documentation via LLMs. By doing so, you are able to turn unstructured documentation into structured, symbolic models that can be used, alongside the LLM, for reasoning. Critically this sort of reasoning takes place in the symbolic domain, with no dependency on the LLM or any other ML model. This guarantees explainability and transparency. Symbolic AI has long suffered from the closed world assumption – that they only know what has been explicitly encoded in them, potentially limiting their adoption to discrete, well-defined narrow domains. Thanks to LLMs, we can also tackle this problem and start to open up this closed world. An LLM is able to consider unstructured inputs in the context of a knowledge graph. Those inputs could be a natural language query, or any form of unstructured data. Because the LLM has the graph as context, it is able to make inferences over the data that should be extracted, even where the language use has not been explicitly defined in the graph. In effect, an LLM can find semantic similarity within degrees of certainty and inject them into the knowledge graph. Of course, this only works if you have a powerful reasoning engine that is capable of processing such queries. By way of example, imagine an insurance policyholder asking questions about whether they are covered by their insurance policy after an incident. They may provide descriptions of what happened using natural language. Using an LLM alone is not sufficient, as it cannot provide a reasoned answer that we can be confident makes sense in the context of the policy.
However, if we extract a knowledge graph from the policy documentation, we are able to reason about the policyholder’s cover over this formal model. By leveraging an LLM, we can
accept unstructured input into the symbolic domain, and produce a trustworthy, evidence-backed decision that we can be sure correctly adheres to the terms of the policy.
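To make the pattern concrete, here is a minimal, hypothetical sketch in Python – not Rainbird’s implementation – in which an LLM is prompted to turn policy text into knowledge-graph triples, and a small symbolic routine then answers a coverage question over that graph with an explicit chain of evidence. The call_llm placeholder, the triple format and the example rules are assumptions for illustration only.

```python
import json

def call_llm(prompt: str) -> str:
    """Placeholder for any LLM completion call (an assumption, not a specific API)."""
    raise NotImplementedError("Wire this to your LLM provider of choice.")

def extract_triples(policy_text: str) -> list[tuple[str, str, str]]:
    """Ask the LLM to return (subject, relation, object) triples as JSON."""
    prompt = (
        "Extract the rules in this insurance policy as JSON triples "
        '[{"subject": ..., "relation": ..., "object": ...}].\n\n' + policy_text
    )
    return [(t["subject"], t["relation"], t["object"]) for t in json.loads(call_llm(prompt))]

# A tiny hand-written graph standing in for the LLM output, so the reasoning
# step below is runnable on its own.
graph = [
    ("water_damage", "is_covered_if", "pipe_burst"),
    ("water_damage", "is_excluded_if", "gradual_leak"),
]

def is_covered(peril: str, facts: set[str]) -> tuple[bool, list[str]]:
    """Walk the graph symbolically and keep the chain of reasoning as evidence."""
    evidence = []
    for subj, rel, obj in graph:
        if subj == peril and obj in facts:
            evidence.append(f"{subj} {rel} {obj}")
            if rel == "is_excluded_if":
                return False, evidence
            if rel == "is_covered_if":
                return True, evidence
    return False, evidence or ["no matching rule found"]

# The claimant's natural-language description would be mapped onto graph terms
# by the LLM; here we assume that mapping produced the fact 'pipe_burst'.
decision, why = is_covered("water_damage", {"pipe_burst"})
print(decision, why)
```

The point of the sketch is the division of labour: the LLM handles unstructured language, while the decision itself is taken by a transparent walk over the graph, so every outcome comes with its evidence.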
This methodology delivers the best of all worlds. It enables organisations to merge the world of probabilistic modelling that is synonymous with two decades of ML with what at its heart is a symbolic technology – knowledge graphs. It can handle natural language inputs – and generate natural language outputs – with a symbolic and evidence-based core. This AI configuration is neurosymbolic, leveraging knowledge graphs, symbolic reasoning, LLMs and neural networks in a hybrid configuration. What’s particularly exciting about neurosymbolic AI is it breaks what has become a saturated mindset that all AI must start with data. The approaches inherited from the digital native businesses like Apple, Google and Meta – who have succeeded in automating advertising decisions using massive amounts of rich data – have led to a false assumption that this translates to other sectors, where more complex decisions are being made with poorer data. That mindset led everyone to believe that all AI must start with data – a problem that
decision intelligence has now resolved. Looking forward, trust is now going to be the bedrock of AI adoption. As regulatory frameworks like the EU’s AI Act take shape, the imperative for explainable, unbiased AI becomes clear. Neurosymbolic AI, with its transparent reasoning, is well-positioned to meet these demands, offering a form of AI that regulators and the public can accept. As we move from a data-centric to a decision-centric AI mindset, the promise of AI that the world can trust becomes tangible. For data scientists and business leaders alike, the call to action is clear: embrace the decision intelligence approach and unlock the full potential of AI, not just as a tool for analysis, but as a partner in consequential decision-making that earns the trust of those it serves.
LUÍS MOREIRA-MATIAS
TRANSFORMING FREIGHT LOGISTICS WITH ADVANCED MACHINE LEARNING
DR LUÍS MOREIRA-MATIAS is Senior Director of Artificial Intelligence at sennder, Europe’s leading digital freight forwarder. At sennder, Luís created sennAI: sennder’s business unit that oversees the creation (from R&D to real-world productisation) of proprietary AI technology for the road logistics industry. During his 15-year career, Luís led 50+ FTEs across 4+ organisations to develop
award-winning ML solutions to address real-world problems in e-commerce, travel, logistics, and finance. Luís holds a PhD in Machine Learning from the University of Porto, Portugal. He possesses a world-class academic track record, with publications on ML/AI fundamentals, five patents and multiple keynotes worldwide.

Can you give us an overview of the role of data, ML and AI in terms of sennder’s operations?
Sure. So in the past, the business model of a freight forwarder had many limitations. On one side, we had large companies that wanted to move bulk containers of goods from A to B. On the other are typically small trucking companies (90% of trucking companies in Europe have a fleet of 15 or fewer trucks). Because these gigantic companies and these very small companies operate with different business models, they need the broker, or the ‘freight forwarder’, to serve as a proxy. And traditionally, these brokers were contacted via analogue methods. So there’d be dozens of phone calls, emails, faxes, etc. between these big companies, the broker, and the trucking companies. It was hard work to arrange services, to evaluate the legitimacy of the trucking businesses,
and also complex from the admin side of giving feedback, receiving invoices and so on. So what sennder did was to digitalise all this into one marketplace. This brings transparency and significantly improves the customer experience on both sides. These large companies with cargo they need transporting, approach us and advertise a service request on the platform. The small truck companies on the other side can log in to the marketplace and bid, or simply take one of the services available. One problem we encountered when going digital was how to manage the volume of bids we were receiving (we have 7,000-10,000 loads every day). So we introduced predictive pricing and enabled a ‘one button sell’, meaning we can create an optimal price for everyone involved: we can guarantee we have a brokerage margin,
and also that the carrier has enough money to pay the cost of their trip and still do some business for their side. Another big problem with the old system was finding the right load. Searching through 7,000-10,000 loads for the carrier was a major hassle. So, we worked to minimise this time spent on the platform, so we could maximise conversion rate. We now have a recommendation system that guarantees a personalised experience for our carriers, recommending three to ten loads to each of them at login time. This approach offers a marketplace that probably has no parallel in the European market today.
How did you go about setting up this uber-style marketplace for freight?
The challenge for us was we needed to convert, and maximise our margin as well. And these two things play against each other, because the larger the margin, the less likely you are to convert. And as a B2B organisation, we had to be careful. Consider how B2C companies scale up by increasing the number of customers they have. If these companies make an error and a customer has a bad experience, they could lose that customer – but that’s not an issue because they have millions more. At sennder, however, we deal with these massive companies. If we lose their custom, we could lose up to 10% of our revenue. Consequently, we couldn’t approach experimentation and modelling with the same freedom as, say, a retail company.
Another challenge was how to create a model that would cope with fluctuations over time. We deal with dynamic systems, so what is a good recommendation today, may not be a good recommendation tomorrow; what’s a good price right now may not be a good price in five minutes. So we needed to incorporate into our model components that can evolve over time, offer predictions, and also evolve with the patterns we’re learning over time. We needed a model that could continuously learn.
With B2B companies in particular, the high price of mistakes gives an extra degree of complexity, and we have to deal with that on a daily basis.

B2B is always a more complex environment to do anything. What methods or philosophies did you use to navigate that?
We continue to do a lot of empiric experimentation, but we’re also looking to incorporate a lot of business knowledge. So, we work very closely with the businesses to learn from them in the way we create the systems. Often in a direct way: which data sources to use, how to use them in the model, how to use an eligibility rule (i.e. matching geography, load and distance requirements with the appropriate carrier), and so on.
So there are mathematical ways to optimise: these complex, predictive models we can create from the data. But we also need constraints put in place by business experts, who suggest more basic approaches. So we aim to strike a balance between these basic, common-sense business ideas, and also incorporate the complex models and data in the creation of our systems. We also have strong monitoring and feedback loops: we do have errors, but we’re quick to react to those. We have mechanisms in place to counter them.

So, you mentioned something really important, which is working closely with the business. How do you make sure that this happens with very technical, intelligent engineers or ML practitioners? What’s the structure?
What I see on many machine learning teams is they have the engineers, and they have the manager. And the manager is seen as this mythological figure capable of doing everything: the hiring, the team’s outputs, the model results, performance reviews, one-to-ones, etc. And what ends up happening is either the manager gets burned out, or they end up being negligent in one of these areas. Typically, one of the things that falls short is the stakeholder management, because machine learning teams aren’t well understood by the rest of the company. So, these teams end up feeling like they don’t need to connect with the business. Of course, this setup isn’t ideal.
So, one of the things that I introduced in sennder was to have a technical product owner on the team. In practice, this means there’s no longer this mythological figure of a manager. Instead we have two humans with different responsibilities. We have this technical product owner who represents the customers in the team, who will build that rapport with the stakeholders and work together with them. This allows the other team members to focus on their work. And if there’s an emergency in the company, this product owner is the one who will make sure the team drops everything and investigates it promptly. This person is responsible for roadmap building and backlog prioritisation, and they’re also the voice of the customer in the team.
When it comes to technical leadership: hiring, employer branding, performance reviews, this is where the engineering manager steps in. So, the manager doesn’t need to worry about what the team needs to do next, what’s the next priority. They just need to be concerned with the how: how will we be solving this problem?
I think this approach is a professionalisation of leadership in ML teams that’s greatly needed, especially knowing that these teams tend to be distanced from the business and less well-understood. I believe it’s a way to get them closer to success. Do the product owners have one project that they primarily focus on, or are they focusing on multiple projects? So each team has a scope, a certain set of tasks or problems to solve. But inevitably in an organisation there are overlaps between teams. So one thing we do is try and minimise those overlaps. And we ensure that the product owner is responsible for that scope. So, whatever problems are prioritised by product leadership to be solved by AI teams, it will fall to that product owner to be the customer representative in the team and determine what is to be done next. And we need to be very clear on what success looks like. For this reason, it’s important that the technical product owner has a background in computer science, ideally data science and ML. Because they must have this ability to translate business into technical language. I want this person to be able to be critical of the work of the engineers if needed, and challenge them when required. Changing things up now, and maybe looking at some of your major successes in terms of ML products and solutions. Can you talk about the pricing algorithm that you’ve been working on? So, this is a use case that has existed in the company since 2021. I joined in April 2022, and since then the algorithm has undergone several incarnations. There were three challenges when perfecting the algorithm: one was how to reconcile the different flavours within our business. For example, we have contract business, and we also have spot business. Another challenge was taking into account the different prices, the way they can fluctuate and the dynamic nature of the market. This makes our business model very unpredictable. Thirdly, our business is different from geography to geography. Europe is a very regional business where each country has its own legislation on how to run road logistics. So, we designed a model that’s able to learn. Typically, machine algorithms on these types of problems end up focusing a lot on outliers; in other words, they avoid having just one price that misses the mark, which is great. But then because the phenomenon is so stochastic, our pricing as a whole will be very off. So what we do instead is we divide the learning regime into two components. One where we say, ‘okay, are we able to give a price to this load? Yes or no?’ And if we say yes, then we have another machine learning model that
says: ‘this is the price for this load.’ At this point, we can trust the model to set automated prices on certain loads, and on those loads with no price, they’ll need to be bid on in order to be sold. This allows us to be much more aggressive on the machine learning model, and say for instance, ‘ignore outliers and change the loss function of the machine learning algorithm, and be much more focused on the normal points and on the absolute error.’ And this meant we could build a more accurate model for a smaller percentage; let’s say 80%, 90% of the points, rather than trying to be exact on 100% of the points. It’s a subtle change, but it made a significant difference in the end, in terms of an automated profit that this algorithm could drive for the company.

You’ve talked about how the company clearly automates this process. Is there an impact to the customers as well, with this approach?
So, as a society we’ve had tough times economically, and supply chain businesses in particular have been struggling. However, sennder keeps growing their business and their revenue massively and organically, year on year. And this year was no exception. One thing that changed for us this year is that we saw a large growth in the usage of our marketplace. And at the same time, we also saw a large increase in the volume of sales made through our pricing, or with the price driven by our algorithm. So, these three factors are connected, and this can only happen if our customers are happy with our platform. Otherwise, they’ll not keep coming back to buy in the marketplace, and the shippers won’t keep coming to throw loads on the marketplace. The positive impact varies according to geography, but overall, we can be talking about a growth of 30% or 40% year on year; in some areas, even more. And when it comes to the usage of the marketplace, 50% are the numbers in some geographies. And in terms of the profit achieved through sales (solely based on the machine learning algorithms that we have, the recommender and the dynamic pricing algorithm), from last year to this year, there’s a 500% difference.

That’s seriously impressive. I guess the old-school way of doing things is still there in the background. So, if your price wasn’t competitive to the brokers that they can call up, they wouldn’t use your platform.
Yeah. Absolutely. So, not all the business is driven through the platform. We still have a considerable part that’s done traditionally with the human in the loop. That percentage is decreasing, but it’s a component that will always exist. One thing I want to point out is that we’re not designing AI algorithms to replace humans. What we’re
trying to do is to enable humans to be more productive. So, right now, a human operator can process 30 to 50 loads a day. We want to enable these same human operators to process 500 or 1000. There’s no way this person will be able to do that without the support of AI and machine learning algorithms that automate some decision-making in the process. And we allow these human operators to step in when a problem arises. By doing this, we’re actually elevating the profession of these people so they can get better salaries, because they’re responsible for driving much more revenue per headcount. So I think we’re achieving, through technology, a significant contribution to society and improvements to the way we live. And I would like to highlight that. How do you go from zero over a three-year period to something as advanced as your pricing algorithm? So, the baseline system you initially start with won’t be ML, or at least will be a very rudimentary ML. That’s because there are problems that must be dealt with before you risk anything more advanced. There are typically five main issues you need to solve when you design a new ML project: Firstly you need the problem statement: how to go from the business problem to the machine learning problem, and what success will look like. This is absolutely key. Second is methodology: the machine learning modelling. Which data you’ll use, which features you use, which label you use, which algorithm, and which loss function. Third, offline evaluation. How do we determine that this model is good enough to go live? Do we have a baseline to compare it with, something really simple and intuitive to the business? If the machine-led method doesn’t outperform these baselines, we’ll go with the baseline because it’s easier to interpret and operate. Fourth is the architecture: how your data is served to train your model; how automated your process is to train and deploy it, and how your model is served once it’s live. And finally, live evaluation. Once your model is live, how do we determine that it’s working? This involves experimentation, but it also involves live monitoring and so on. You must get these five right, otherwise you risk having a model that doesn’t scale. Or risk having a complex model with no way of evaluating whether it works. And once you have the five defined, you might discover you’ll need to tweak the methodology or the problem statement slightly, or the architecture, to get the right type of data source. So you iterate on these components. This was a pattern we saw with our pricing model: we started with a linear model, then went for a more classic
ensemble of decision trees, and now we have something more advanced. You can see this process at work in other businesses, too. A great example is Amazon, with their sales forecasting model. They started in the early 2000s with a very simple linear model, then introduced the time series model, decision trees, more complex decision trees, until the deep learning model they have today. That transition took over 10 years. And the model will continue to evolve as their data evolves and their business evolves. So, the key message is: don’t rush to find that optimal solution. It’s something you need to build on, to develop and iterate over time. There are lots of issues and considerations at play. One statistic you’ve cited is that 30% of kilometres that the freight industry goes through in Europe is with empty lorries, which has a huge environmental footprint. Is what sennder’s doing helping to reduce that kind of wastage? Yes, absolutely. Our vision is to achieve increased sustainability and to minimise emissions; to go beyond the smooth assignment of individual loads between shippers and carriers on our marketplace, to chartering contracts. In practice, this means that we say to each carrier: ‘I want you to run your drivers and your trucks for one million kilometres for three months or six months, at this price, on these routes.’ So, instead of us selling each load individually to that carrier, we basically operate their assets for them. And that means we can coordinate the scheduling for the drivers and set the scheduling for the trucks. We can specify that the trucks drive here, rest here, fuel there, pick up load here and so on, allowing the drivers to focus on what’s the priority for them, which is driving their trucks. So, it’s the ultimate customer experience for them. But this also means we’ll be able to have a holistic optimisation of the truck fleet across Europe, and implement network optimisation at AI level. So we’ll have the AI and experts dedicated to maximising efficiency. And this is the ultimate strategy that will drive our company vision of enabling trucking companies to run their businesses more profitably and with lower emissions.
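For readers who want to experiment with the kind of two-stage pricing setup described earlier in this interview, here is a minimal, hypothetical scikit-learn sketch – a toy illustration of the pattern (a classifier that decides whether a load can be priced automatically, followed by a regressor trained with an outlier-robust absolute-error loss), not sennder’s actual system. The features, labels and thresholds are invented.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(0)

# Toy features for 1,000 loads: distance (km) and a market-volatility score.
X = np.column_stack([rng.uniform(50, 1500, 1000), rng.uniform(0, 1, 1000)])
price = 100 + 1.2 * X[:, 0] + rng.normal(0, 30, 1000)                 # typical loads
price[X[:, 1] > 0.8] += rng.normal(0, 800, (X[:, 1] > 0.8).sum())     # volatile outliers

# Stage 1: can we price this load automatically? Here we simply label highly
# volatile loads as 'do not auto-price'; in reality this label would come from
# business rules or from past pricing errors.
auto_priceable = (X[:, 1] <= 0.8).astype(int)
gate = GradientBoostingClassifier().fit(X, auto_priceable)

# Stage 2: a regressor trained only on the auto-priceable loads, with an
# absolute-error loss so it focuses on typical points rather than outliers.
mask = auto_priceable == 1
pricer = GradientBoostingRegressor(loss="absolute_error").fit(X[mask], price[mask])

# Inference: price automatically only where the gate says yes;
# everything else falls back to the human bidding process.
new_loads = np.array([[400.0, 0.2], [900.0, 0.95]])
for x in new_loads:
    if gate.predict([x])[0] == 1:
        print(f"auto price: {pricer.predict([x])[0]:.0f} EUR")
    else:
        print("route to manual bidding")
```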
JAVIER CAMPOS
NAVIGATING THE FUTURE OF ADS: LARGE LANGUAGE MODELS AND THEIR IMPACT ON PROGRAMMATIC STRATEGIES

JAVIER CAMPOS is a visionary Chief Information Officer at Fenestra, steering the company’s Product and Technology divisions with a focus on pioneering AI advancements in programmatic management. With a wealth of 28+ years in the global arena, his expertise spans finance, AI, market research, media, and technology. Notably, Javier has previously enhanced data innovation as Head of Experian DataLabs for UK&I and EMEA. His thought leadership in AI was further recognised through his permanent seat at the Bank of England & FCA Artificial Intelligence Public-Private Forum, which aims to expedite AI integration in the financial sector. An esteemed author, Javier released How to Grow Your Business with AI in 2023, adding depth to his role as a dynamic speaker at industry events. His career includes pivotal positions such as Global Chief Technology Officer at Kantar-WPP, Havas Media, and EMEA CIO at GroupM-WPP, marking him as a vanguard in his field.
THIS ARTICLE WILL DELVE INTO THE UNDERPINNINGS OF LARGE LANGUAGE MODELS (LLMS) THAT ARE RESHAPING MANY INDUSTRIES, INCLUDING PROGRAMMATIC ADVERTISING. WE WILL EXPLORE THE NUANCES OF FOUNDATIONAL MODELS, DEBATE THE MERITS OF OPEN SOURCE VERSUS PROPRIETARY TECHNOLOGIES, AND DISCUSS STRATEGIC APPROACHES SUCH AS FULL TRAINING VERSUS FINE-TUNING, AS WELL AS THE USE OF RETRIEVAL-AUGMENTED GENERATION (RAG) FOR ENHANCING AD RELEVANCE AND PERSONALISATION. PRACTICAL GUIDELINES FOR DATA SCIENTISTS IN THE ADVERTISING DOMAIN WILL BE PROVIDED, WITH A LENS ON ETHICAL IMPLICATIONS AND FUTURE DIRECTIONS.
Large language models (LLMs) are reshaping the landscape across various industries with their profound ability to mimic human language, bringing a new wave of innovation and efficiency. In the realm of advertising and programmatic media buying, this transformation is particularly pronounced. LLMs are revolutionising the field by enabling more nuanced content generation and optimisation strategies. Their capacity to accurately predict, understand, and emulate human-like text and interactions is changing the way advertising content is conceptualised, created, and delivered.
This article will explore the core technologies behind these models, understand their capabilities in the context of industry-specific applications, and discuss the strategic implications of integrating these advanced AI tools into business workflows. As the field rapidly evolves, the choices between open source and proprietary models, as well as between full training, fine-tuning, and leveraging advanced techniques like retrieval-augmented generation (RAG), present both opportunities and challenges for data scientists. Through an exploration of these pivotal decisions, this article will provide data scientists with a roadmap to navigate the intricacies of LLMs in programmatic advertising, underscored by an understanding of ethical practices and an anticipation of future AI trends.
Programmatic advertising represents the automated buying and selling of online advertising space, where software algorithms determine the placement and price of ads in real time. This process uses data analytics and machine learning to deliver personalised advertising content to users across various digital platforms, such as websites, social media, and Connected TV (CTV).
LLMs are uniquely positioned to address several prevailing challenges in the programmatic advertising industry, thanks to their advanced capabilities in language understanding and generation. The impending obsolescence of third-party cookies, for instance, threatens the data-driven nature of targeted advertising. LLMs, with their sophisticated data processing and generation abilities, offer an alternative by enabling the creation of highly personalised ad content without relying heavily on third-party data.
LLMs can analyse user interactions and content engagement to generate targeted advertising that aligns with user interests and behaviours. This approach not only ensures ad relevance but also respects user privacy, a growing concern in the digital advertising space. Furthermore, LLMs can significantly reduce the occurrence of ad fraud. By understanding and predicting user engagement patterns, these models can identify anomalies that may indicate fraudulent activity, thereby ensuring that advertising budgets are spent on genuine user interactions.
Another challenge in programmatic advertising is the creation and testing of a multitude of ad creatives, which requires significant resources and time. LLMs streamline this process by automating the generation of diverse and personalised ad content, from textual copy to potential image suggestions. This automation allows for rapid A/B testing and optimisation of ads, ensuring that the most effective content is delivered to the right audience at the right time.
In summary, LLMs bring scalability, efficiency, and a new level of innovation to programmatic advertising. By leveraging their ability to generate and optimise ad content, LLMs can produce a diverse range of advertisements tailored to various user segments and contexts. This not only mitigates the challenges posed by the loss of third-party cookies, but also significantly enhances the overall user experience with ads that are highly relevant and engaging.
WHAT ARE LARGE LANGUAGE MODELS? LLMs are advanced AI systems designed to understand, interpret, and generate human-like text. They can be conceptualised as tools that effectively compress vast amounts of internet data, distilling the essence of human communication into algorithms capable of mimicking language. The true potential of LLMs was unlocked following the landmark 2017 paper on the attention mechanism, which introduced a more efficient way for models to process and prioritise different parts of the input data.
This attention mechanism, particularly exemplified in models like the Transformer, revolutionised the field of natural language processing (NLP). It allowed for the development of more sophisticated and contextually aware models, capable of handling longer sequences of text and understanding the nuances and complexities of human language. Since then, LLMs have rapidly evolved, with notable examples like OpenAI’s GPT-4 and Google’s Gemini showcasing their ability to generate coherent and contextually relevant text across a variety of applications.
These models are trained on enormous datasets, often encompassing a significant portion of the publicly available text on the internet. This training enables them to learn patterns, styles, and information from a wide range of sources, effectively giving them a compressed understanding of human language as represented online. As a result, LLMs have become pivotal in numerous applications, particularly in areas like programmatic advertising, where the ability to generate personalised and relevant content at scale is crucial. Their emergence
represents a significant leap in AI’s capability to interact with and understand human language, opening up new possibilities in technology and communication.
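As a brief illustration of the attention idea referenced above (the 2017 ‘Attention Is All You Need’ paper listed in the references), here is a minimal NumPy sketch of scaled dot-product attention, the building block of Transformer-based LLMs. It is a didactic toy with random inputs, not production code.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights                     # weighted mix of the values

# Toy example: a sequence of 4 tokens with 8-dimensional representations.
rng = np.random.default_rng(42)
x = rng.normal(size=(4, 8))
# In a real Transformer, Q, K and V are learned linear projections of x;
# random projection matrices stand in for them here.
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
output, attn = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(attn.round(2))  # each row shows how much one token attends to the others
```

The weights matrix is what lets the model prioritise different parts of the input, which is exactly the property the article attributes to the 2017 breakthrough.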
OPEN SOURCE VS PROPRIETARY LLMS In the realm of LLMs, the distinction between open source and proprietary models presents a significant crossroads for data scientists and organisations. Open source models, like those released by Hugging Face, offer transparency and community-driven innovation. They allow researchers and practitioners to inspect the model’s architecture, training data, and inner workings, fostering an environment of collaboration and trust. The agility of open source models accelerates experimentation and adoption in programmatic advertising, enabling practitioners to fine-tune models to specific domains or tasks without the constraints of licensing agreements. Hugging Face also hosts an open LLM leaderboard, where teams around the world submit new LLMs every week, which are automatically benchmarked and ranked. By regularly visiting this page, you can view the best-performing open LLM – at the time of writing (December 2023) this was Microsoft’s Phi-2, but it will likely be different by the time you read this article.
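For readers who want to try an open model of the kind ranked on that leaderboard, the sketch below shows one common way to load and query one with the Hugging Face transformers library. The specific checkpoint name is only an example (and may since have been overtaken); treat this as a starting point under those assumptions rather than a recommendation.

```python
# pip install transformers torch
from transformers import pipeline

# Any open text-generation checkpoint from the Hugging Face Hub can be used here;
# "microsoft/phi-2" is the example mentioned in the text. Note that the first run
# downloads several gigabytes of weights.
generator = pipeline("text-generation", model="microsoft/phi-2")

prompt = "Write a one-sentence ad for eco-friendly running shoes aimed at commuters:"
result = generator(prompt, max_new_tokens=40, do_sample=True, temperature=0.7)
print(result[0]["generated_text"])
```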
Proprietary models, on the other hand, are developed and controlled by organisations like OpenAI. These models often come with performance benefits derived from proprietary datasets and resources unavailable to the broader community. Their closed nature can mean better integration, support, and potentially advanced features that cater to specific business needs in advertising, like enhanced personalisation and targeting capabilities. However, the benefits of proprietary models come with trade-offs. The lack of transparency can raise concerns about the replicability of results and the ability to audit for biases or errors. The cost of access and potential usage restrictions can also be limiting factors, particularly for smaller organisations or independent developers. Each type of LLM carries implications for scalability, innovation, and ethical considerations. Open source models enable wider accessibility and collective problem-solving, which can be crucial for tackling industry-wide
challenges such as ad fraud detection and privacy-preserving personalisation. Proprietary models may offer competitive advantages but require a commitment to vendor relationships and may necessitate navigating around black-box algorithms. In the context of programmatic advertising, the choice between open source and proprietary LLMs hinges on factors like budget, desired control over the model, ethical considerations, and the specific advertising goals of an organisation. As the industry continues to evolve, the interplay between these two paradigms will shape the development and deployment of AI-driven advertising strategies.
STRATEGIES FOR IMPLEMENTING LLMS Implementing LLMs within programmatic advertising frameworks requires strategic planning to fully harness their capabilities. Full model training is a resource-intensive process, often involving significant computational power and a vast corpus of training data to produce a model that can understand and generate human-like text. The advantages of training an LLM from scratch include the ability to customise the model to highly specific advertising needs, which can lead to increased ad relevance and engagement. However, the cost, expertise, and time required to train such models can be prohibitive for many organisations. Fine-tuning pre-trained LLMs presents a more accessible alternative. By adjusting an existing model – such as GPT-3 or BERT – on a smaller, domain-specific dataset, data scientists can imbue the model with the nuances of their target audience or specific advertising context. This method requires far less computational resources and time, allowing for rapid deployment and iteration. Fine-tuning is particularly effective when the base model is already performing well and only minor adjustments are needed to tailor the model to specific campaign objectives. RAG is a novel approach that combines the power of LLMs with external knowledge sources to generate content that is both relevant and contextually rich. By querying a database of information during the generation process, RAG can produce ad content that is informed by the latest market data, trends, or user-specific information, making it highly adaptive and personalised. This technique is especially beneficial in scenarios where ads need to be responsive to real-time events or user interactions.
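To ground the RAG idea in something concrete, here is a minimal, hypothetical retrieval-augmented sketch: documents are embedded with the open sentence-transformers library, the closest ones to a user context are retrieved by cosine similarity, and the result is stuffed into a prompt for whichever LLM you use. The document snippets are invented, and the final generation step is left as a placeholder rather than tied to a specific API.

```python
# pip install sentence-transformers
import numpy as np
from sentence_transformers import SentenceTransformer

# A tiny in-memory "knowledge base" of market notes an ad generator could draw on.
docs = [
    "Running shoe sales spike in January as people start new fitness routines.",
    "Commuters respond well to messaging about saving time and money.",
    "Eco-friendly materials are a top purchase driver for urban shoppers under 35.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(docs, normalize_embeddings=True)  # unit vectors

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(user_context: str) -> str:
    """Stuff the retrieved notes into a generation prompt."""
    context = "\n".join(retrieve(user_context))
    return (
        f"Using the market notes below, draft a short ad for this user.\n"
        f"Notes:\n{context}\n\nUser context: {user_context}\nAd copy:"
    )

prompt = build_prompt("25-year-old city commuter browsing sustainable trainers")
print(prompt)  # pass this prompt to the LLM of your choice
```

Because the retrieval step can be refreshed continuously, the generated copy can reflect current market data without retraining the model, which is the responsiveness the article highlights.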
EXPLORING FOUNDATION MODELS IN PROGRAMMATIC ADVERTISING Foundation models, particularly LLMs like GPT-3, are at the heart of the next generation of programmatic advertising. These models are transformative due to their deep understanding of language nuances and user intent. Their design goes beyond simple keyword matching, enabling them to interpret the subtleties of human
communication and generate responses or content that feels authentic and engaging. In programmatic advertising, the role of LLMs is multifaceted. They serve as the backbone for dynamic creative optimisation, where ad content is not just personalised but also generated in real-time to match the user’s current context and sentiment. LLMs can craft copy that resonates with the user’s current emotional state or intent, a capability that traditional models, which lack nuanced language understanding, cannot match. For instance, GPT-3 can analyse vast datasets of successful advertising copy and images, learning from the patterns of high-performing ads. It can then generate similar content, but with variations tailored to different user profiles and platforms. This not only increases engagement by ensuring relevance but also significantly reduces the time and resources required for ad creative development. Moreover, the application of these models extends to improving the efficiency of ad targeting. By understanding user queries and online behaviour, LLMs can help advertisers predict the most effective touchpoints for engagement, allowing for a more strategic deployment of advertising budgets. As we continue to unlock the capabilities of LLMs, their integration into programmatic advertising workflows is becoming more nuanced. The potential for these models to learn and adapt over time promises a continuously improving advertising ecosystem, one that becomes increasingly efficient at delivering the right message to the right user at the right time. LLMs like GPT-3 and DALL-E are driving significant transformations in programmatic advertising, offering innovative solutions to long-standing industry challenges. These models have opened up new avenues in content creation, ad targeting, and campaign optimisation, reshaping the way programmatic advertising is conceptualised and executed. The ongoing development of foundation models presents an exciting frontier for data scientists in
advertising. By leveraging these advanced AI tools, they can address the industry’s most pressing challenges, including improving user engagement and ensuring content relevance, while navigating the ever-important issues of privacy and user consent.
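As a small illustration of the dynamic creative optimisation pattern described above, the sketch below builds segment-specific prompts for ad copy variants that could then be A/B tested. The segments, tones, product details and the call_llm placeholder are invented for the example and would be replaced by your own data and model.

```python
from itertools import product

SEGMENTS = ["budget-conscious students", "time-poor commuters", "eco-minded parents"]
TONES = ["playful", "straight-to-the-point"]

def build_ad_prompt(product_name: str, segment: str, tone: str) -> str:
    """Assemble one prompt per (segment, tone) combination for A/B testing."""
    return (
        f"Write a {tone} two-sentence ad for {product_name}, aimed at {segment}. "
        f"Mention one concrete benefit and end with a call to action."
    )

def call_llm(prompt: str) -> str:
    """Placeholder for your text-generation model of choice (an assumption)."""
    raise NotImplementedError

# Generate the full grid of creative variants to feed into an A/B test.
prompts = [
    build_ad_prompt("recycled-material running shoes", seg, tone)
    for seg, tone in product(SEGMENTS, TONES)
]
for p in prompts[:2]:
    print(p)  # swap the print for call_llm(p) once a model is wired in
```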
ETHICAL IMPLICATIONS The integration of LLMs into programmatic advertising necessitates careful consideration of ethical implications, particularly regarding data privacy, algorithmic bias, and transparency. Data privacy concerns revolve around the extent and manner of personal data usage by LLMs to personalise advertising content. There is a growing demand for models that not only comply with data protection regulations like GDPR and CCPA, but also align with broader ethical principles of respect for user autonomy and consent. Algorithmic bias in LLMs is another critical ethical issue. Since these models are trained on large datasets that may contain historical biases, there is a risk of perpetuating or amplifying these biases in ad targeting and content generation. It’s essential to implement measures to identify and mitigate such biases, ensuring that advertising practices are fair and do not discriminate against any group. Transparency in the use of LLMs is about making the model’s decision-making processes understandable to users and regulators. It involves explaining how personal data influences the content generated by these models and how decisions about ad targeting are made. This level of transparency is crucial for building trust among users and for the ethical use of AI in advertising. Current ethical guidelines, such as those proposed by AI ethics boards and industry groups, emphasise principles like fairness, accountability, and transparency. Applying these guidelines to the use of LLMs in advertising means ensuring that models are audited for biases, data usage is transparent and consensual, and there are mechanisms for accountability in cases of misuse or harm.
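One simple way to start acting on the bias point above is to compare how a model’s decisions are distributed across audience groups. The sketch below computes a demographic-parity-style ratio of ad-serving rates by group with pandas; the column names, the toy data and the 0.8 threshold (an informal ‘four-fifths’ rule of thumb) are assumptions for illustration, not a complete fairness audit.

```python
import pandas as pd

# Hypothetical log of model decisions: one row per ad opportunity.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B", "B", "A"],
    "served_premium_ad": [1, 1, 0, 1, 0, 0, 0, 1],
})

# Rate at which each group is shown the premium ad.
rates = df.groupby("group")["served_premium_ad"].mean()
parity_ratio = rates.min() / rates.max()

print(rates)
print(f"demographic parity ratio: {parity_ratio:.2f}")
if parity_ratio < 0.8:  # informal threshold; investigate rather than conclude
    print("Potential disparity - review features, labels and training data.")
```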
FUTURE DIRECTIONS AND RESEARCH The future of LLMs in advertising is poised for significant advancements, with potential developments including more advanced personalisation techniques, improved user privacy protections, and enhanced model interpretability. Personalisation is likely to become more nuanced, with LLMs being able to understand and adapt to complex user preferences and behaviours while respecting privacy boundaries. Research in model explainability will be crucial, as there is a growing need to make AI decision-making processes transparent, especially when they impact consumer experiences and choices. This involves developing methods to interpret how LLMs generate specific content and how they decide on particular ad placements. Such research will not only aid in compliance with evolving privacy regulations, but also help in building user trust. Another vital area of research is finding the balance between personalisation and privacy. This includes developing techniques for data anonymisation and synthetic data generation, which can help in training LLMs without relying on sensitive personal information. The call to action for the data science community is to actively engage in these research areas. Collaboration between industry practitioners, academics, and regulatory bodies will be key in advancing these technologies in an ethical and sustainable manner. Additionally, ongoing monitoring of the societal impacts of LLMs in advertising is necessary to ensure that the benefits of AI are realised broadly and equitably.
CONCLUSION
The exploration of LLMs in the context of programmatic advertising uncovers a landscape rich with potential and challenges. LLMs, as foundational models, are redefining the norms of content creation and ad personalisation, heralding a new era in digital marketing. The shift from traditional data-driven approaches to AI-centric methods, particularly in a post-cookie world, emphasises the need for innovative solutions to maintain ad relevance and user engagement. The comparison between open source and proprietary LLMs reveals a trade-off between transparency, cost, innovation, and control. While open source models offer accessibility and collaborative advancement opportunities, proprietary models bring tailored solutions with competitive advantages. Implementing these models, whether through full training or fine-tuning, requires a strategic approach, balancing the campaign's objectives with available resources. The integration of techniques like RAG promises enhanced responsiveness and contextual relevance in ad content.
Ethical considerations, particularly concerning data privacy, bias, and transparency, remain at the forefront. The application of LLMs in advertising must adhere to ethical guidelines that prioritise fairness, accountability, and user consent. This ethical framework is not static; it evolves with the technology and societal values, necessitating continuous vigilance and adaptation from data scientists.
As we look to the future, the potential advancements in LLMs beckon a wave of innovation in advertising. Research in model explainability, balancing personalisation with privacy, and the mitigation of algorithmic bias will be pivotal. For data scientists in this domain, the journey is one of continuous learning and ethical introspection, ensuring that the advancements in AI are leveraged responsibly and beneficially.
On a closing note, our AI team at Fenestra.io stands at the forefront of innovation in programmatic advertising, leveraging advanced technologies to automate and optimise processes for traders. Specialising in transforming the complex landscape of digital ad trading, Fenestra.io harnesses the power of LLMs to revolutionise the industry. By automating intricate processes, from campaign optimisation to campaign reporting, the company provides traders with more time to focus on strategic decision-making and creative aspects of advertising campaigns.

REFERENCES
1. Campos Zabala, Francisco Javier (2023). How to Grow Your Business with AI. Apress/Springer Nature.
2. Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., & Bengio, Y. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems.
3. Kingma, D. P., & Welling, M. (2013). Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.
4. Vaswani, A., et al. (2017). Attention Is All You Need. 31st Conference on Neural Information Processing Systems.
5. Radford, A., et al. (2019). Language Models are Unsupervised Multitask Learners. OpenAI Blog.
6. Brown, T. B., et al. (2020). Language Models are Few-Shot Learners. arXiv preprint arXiv:2005.14165.
7. Lewis, P., et al. (2020). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv preprint arXiv:2005.11401.
8. GDPR (2018). General Data Protection Regulation.
9. CCPA (2018). California Consumer Privacy Act.
10. Bender, E. M., et al. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of FAccT.
MICHEL DE RU
LEVERAGING DATA FOR A SUSTAINABLE COMPETITIVE ADVANTAGE WITH GENAI
THERE IS NO AI WITHOUT DATA
MICHEL DE RU currently holds the position of Head of Solution Engineering at DataStax, where he oversees the Technical Solutions team for Astra DB in the EMEA region. His role involves guiding the team to enhance business value for customers by leveraging Astra DB’s capabilities in vector search and generative AI. Astra DB, known for its robust NoSQL framework based on Apache Cassandra, is integral for real-time, business-critical applications. Prior to his current role, Michel gained valuable experience in AI and Distributed Technologies through various positions in different organisations. Michel’s educational background includes a Business Management Program from Nyenrode Business University and a BSc in Information Technology from Saxion University of Applied Sciences.
In the evolving landscape of artificial intelligence, the generative AI (GenAI) arena stands out as a frontier of innovation and opportunity. This realm, characterised by systems capable of creating content and solutions autonomously, is reshaping industries and challenging traditional notions of creativity and problem-solving. As GenAI continues to advance, its impact is being felt across various sectors, from healthcare to entertainment, highlighting its potential to revolutionise how we interact with technology and data in our daily lives.
There are a plethora of practical applications that are transforming various industries. In healthcare, GenAI is used for drug discovery and personalised treatment plans, harnessing vast datasets to predict patient outcomes more accurately. In the creative arts, it assists in generating novel music compositions, artwork, and even literary pieces, offering a new dimension to human creativity. In the business world, GenAI aids in automating content creation for marketing, like generating targeted advertising copy, thus streamlining operational efficiencies. Additionally, it plays a crucial role in developing advanced virtual assistants, capable of understanding and responding to complex user queries with human-like fluency. These examples only scratch the surface of GenAI's vast potential across multiple domains.

ACCELERATED INNOVATION AND RISKS
Reflecting on the advancements made by organisations like OpenAI, Microsoft, and the leadership of individuals such as Ilya Sutskever (OpenAI), Yann LeCun (Meta) and Demis Hassabis (DeepMind), the technology industry is witnessing a significant shift. The dynamic AI ecosystem, characterised by rapid changes and the emergence of platforms like AWS's Q or GCP's Gemini, signifies an era of accelerated innovation. The accessibility of GenAI to a wide range of use cases, metaphorically likened to 'English becoming the new programming language' (according to Tesla's former AI director, Andrej Karpathy), underscores the universal impact of these technologies.
In an era of rapid technological advancements, businesses face the challenge of judiciously selecting technologies within the GenAI ecosystem. The pace at which new frameworks emerge and integration layers evolve necessitates a strategic approach. Organisations must be cautious not to become overly dependent on a single provider, a lesson highlighted by recent events involving major players in the GenAI field like OpenAI and Microsoft. These developments underscore the importance of maintaining agility and foresight in technological adoption.
The acceleration of businesses transitioning into AI-driven organisations is evident. However, within the rapidly evolving GenAI ecosystem, establishing a lasting competitive edge becomes challenging. Although GenAI offers significant potential for innovation and progress, its widespread availability means that unique differentiation is key to leveraging its full benefits. Sure, the GenAI ecosystem helps make magic happen, but it's available to everyone!

CONTENT IS KING
Those familiar with the publishing industry know the guiding principle of 'content is king'. The publishing era, marked by print initially, emphasised the paramount importance of content in attracting and retaining audiences. Publishers and media companies thrived by producing high-quality, engaging, and often exclusive content. This content drove readership, viewership, and ultimately, advertising revenues. The phrase underscored the idea that in a landscape filled with various media outlets and platforms, the success of a publication largely hinged on the quality and appeal of its content.
In today's digital landscape, the adage 'content is king' remains as relevant as it was during the publishing era. Despite technological advancements and the rise of new media formats, the core principle that engaging, quality content drives audience engagement and business success continues to hold true. Whether it's through social media, blogs, video streaming, or interactive platforms, the ability to create compelling content is still a critical factor in capturing and maintaining audience interest in a highly competitive digital world.
As we move into GenAI use cases, the principle of 'content is king' evolves to become even more significant. In the GenAI context, it's not just about creating content, but also about how AI can generate personalised, contextually relevant, and highly engaging content at scale. This capability extends the concept of content value, making it a pivotal aspect in various applications, from personalised marketing to automated content creation. The success of GenAI implementations will heavily depend on the quality and relevance of the content they produce, maintaining the age-old adage's relevance in a new technological era.
Shifting from 'content is king' to 'context is king' marks a crucial transformation, offering significant opportunities. Foundational large language models (LLMs) are trained on vast, publicly available datasets, up to a certain cut-off date, and do not inherently include proprietary or organisation-specific information. This training approach limits their direct applicability in specialised or up-to-date business contexts. To harness a competitive edge, augmenting these LLMs with tailored, contextual information becomes pivotal. Businesses can infuse LLMs with unique, context-rich data, enhancing the models' relevance and applicability to specific business needs, thereby creating distinct, valuable results that are not just general but uniquely advantageous.

[Figure: GenAI potential use cases]
DIFFERENTIATE WITH DATA By now it has become clear that GenAI will absolutely offer your customers new magical experiences and interactions with your business. However, using GenAI will not be a differentiator for long. The ecosystem is easy to use and accessible to millions of developers, unlike traditional AI which requires specialised skills. So, what sets you apart? It’s the unique combination of your business, the talent of your team, your intellectual property and your data! Your data and its extreme added value will make the difference in the end and provide the much-needed differentiated and sustainable competitive advantage! Positioning GenAI around a centralised data strategy ensures that AI applications are optimally aligned with the organisation’s core information assets, enhancing the effectiveness and relevance of AI-driven solutions. This approach ensures that AI applications are deeply integrated with the unique context and specifics of the business’s data, leading to more tailored and relevant outputs. By aligning GenAI capabilities directly with the rich, proprietary datasets they possess, businesses can leverage AI to generate insights, solutions, and content that are directly applicable to their specific operational and customer needs. This data-centric focus allows for a more nuanced and effective use of AI, enhancing customer experiences through personalised interactions and services. It also ensures that the AI’s functionality is grounded in the reality of the business’s data landscape, making its applications more practical and impactful. In essence, by centralising GenAI around their own unique data, businesses can harness the full potential of AI to create valueadded services that resonate more deeply with their customer base.
Lastly, and perhaps even more importantly, a centralised data strategy also means you stay in control of the GenAI ecosystem by not being locked in to a single provider. It allows you to switch technologies and capabilities in and out as innovation progresses.
CENTRALISING DATA
New innovative databases have emerged, specifically designed to handle the new requirements around providing context to AI, and particularly GenAI solutions, in real-time. These so-called vector databases play a crucial role in positioning GenAI around a centralised data strategy. Vector data, essentially arrays of numbers, semantically describe complex data points such as images, sounds, texts, and other high-dimensional data types often used in AI and machine learning.
A vector database works in a way that is similar to how humans understand the deeper meaning of sentences, images, and similar content. Let's break this down with an analogy: imagine you're having a conversation with a friend. They tell you, 'I'm feeling under the weather.' You understand they mean they're feeling ill, not that they are physically beneath the weather. This is because you comprehend the semantic meaning, or the deeper intent, behind their words. A vector database mirrors this as follows:
1. Translating Data into Vectors: Just like you translated your friend's words into their deeper meaning, a vector database translates sentences, images, and other complex data into vectors. These vectors are like a mathematical code or language that represents the deeper meaning or essence of that data.
2. Finding Similarities: When you hear different phrases with similar meanings, like 'I'm not feeling well' and 'I'm feeling sick,' you understand they're conveying the same idea. Similarly, a vector database can find and match vectors that are semantically similar. It recognises that different data can have similar underlying meanings or themes, even if they're not exactly the same on the surface.
3. Responding to Queries: If someone asks you for movie recommendations based on the movie they just watched, you think about the themes, genre, and style of that movie to suggest similar ones. A vector database does something like this. When given a query, it looks for vectors (representing data) that are semantically similar to the query's vector.
4. Handling Diverse Data: Just as you can understand meanings across various types of information – be it text, an image, or spoken word – a vector database can handle different types of data, finding semantic similarities across them all.
In essence, a vector database functions by converting complex data into a form where it can easily understand and compare the deeper meanings, much like how we grasp the semantic meanings in our everyday interactions. This capability makes it incredibly useful for tasks where understanding and finding similarities in the deeper essence of data is key.
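To make the analogy concrete, here is a minimal sketch in Python of the underlying operation: embed a handful of texts as vectors and rank them by cosine similarity against a query. The sentence-transformers library, the model name and the example sentences are illustrative assumptions; a real vector database adds indexing, persistence and filtering on top of this basic idea.

```python
# A minimal sketch of semantic search: embed texts as vectors, then rank by cosine similarity.
# The library, model name and sentences are illustrative choices, not a reference to any product.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, general-purpose embedding model

documents = [
    "I'm feeling under the weather.",
    "The quarterly report is due on Friday.",
    "Our new product launches next month.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)  # one vector per document

query = "I'm feeling sick."
query_vector = model.encode([query], normalize_embeddings=True)[0]

# With normalised vectors, cosine similarity reduces to a dot product.
scores = doc_vectors @ query_vector
best = int(np.argmax(scores))
print(f"Closest match: {documents[best]!r} (score={scores[best]:.3f})")
```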
SELECTING THE RIGHT TECHNOLOGY
Choosing the best vector solution from the numerous available options for vector storage and search is a highly impactful decision for an organisation. As vectors and AI are crucial in developing the next wave of intelligent applications for businesses and the software industry, the most effective choice typically also demonstrates superior performance. Keep in mind these key aspects while selecting a vector database for your organisation:
1. Open Source with Enterprise Support: Opt for a vector database that is open-source. This ensures transparency, community support, and continuous improvement of the software.
2. Availability on All Cloud Service Providers to Avoid Lock-In: Choose a vector database available across all major CSPs. This prevents vendor lock-in, giving you the flexibility to switch providers or use multiple providers without compatibility issues. Being able to change CSPs is especially important while tapping into the added value of the GenAI ecosystem, as explained before.
3. Proven Track Record: Look for a vector database that is proven effective through use cases by, for instance, Fortune 100 companies. This indicates reliability and effectiveness in handling large-scale, complex data needs.
4. Consumption-Based Cost Model: A consumption-based cost model that scales with your business case is essential. This ensures that you only pay for what you use, making it cost-effective as your business grows.
5. Relevance of Vector Similarity Search: Ensure the vector database excels in vector similarity search. This functionality is critical for efficiently finding and retrieving data based on similarity, which is a cornerstone of many AI and machine learning applications.
6. Hybrid Search with Metadata and Full Text: The ability to use metadata and full text for hybrid search is a valuable feature. This allows for more nuanced and comprehensive searches, combining traditional full-text search with advanced vector search capabilities, ultimately boosting relevancy, which improves GenAI results significantly.
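As a rough illustration of point 6, the toy sketch below first filters on metadata and then ranks the remaining records by vector similarity. The record structure and field names are invented for the example; production systems express this through the database's own query language and indexes.

```python
# Toy illustration of hybrid search: filter on metadata first, then rank the
# survivors by vector similarity. Field names and data are invented for the example.
import numpy as np

catalogue = [
    {"id": 1, "lang": "en", "text": "Boiler maintenance guide", "vec": np.array([0.9, 0.1, 0.0])},
    {"id": 2, "lang": "de", "text": "Heizungswartung",          "vec": np.array([0.8, 0.2, 0.1])},
    {"id": 3, "lang": "en", "text": "Quarterly sales report",   "vec": np.array([0.1, 0.9, 0.2])},
]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hybrid_search(query_vec, lang, top_k=2):
    # 1. Traditional metadata / full-text style filter.
    candidates = [item for item in catalogue if item["lang"] == lang]
    # 2. Vector similarity ranking of the remaining candidates.
    return sorted(candidates, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)[:top_k]

results = hybrid_search(np.array([1.0, 0.0, 0.0]), lang="en")
print([r["text"] for r in results])
```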
Your data are your crown jewels. It's your Intellectual Property that will set you apart from the competition. And for this reason alone, it's imperative to store it in a database that provides performance and reliability!
WHAT’S NEXT? The field of GenAI is advancing swiftly, presenting vast opportunities for businesses to enhance customer interactions through personalisation. While the array of innovations, possibilities, and providers in this space may initially seem overwhelming, the key lies in making strategic decisions. Choosing the right architecture and identifying the optimal data storage solution is crucial. This approach ensures that your data, a vital source of sustainable competitive advantage, remains under your control. By maintaining ownership of your data and
avoiding locking it into a single (Gen)AI provider, you retain the flexibility to choose and adapt AI technologies as needed, keeping you at the forefront of AI application in your industry. Secondly, practical experience underscores the significance of beginning with prototypes of GenAI applications to discern what aligns best with your business needs. Numerous DataStax customers have experimented with various approaches and are now successfully running their initial GenAI applications in production, which are delivering tangible benefits to their customers. These pioneering applications not only inspire new use cases but also provide a solid foundation for further development. Launching your first GenAI application into production can serve as a catalyst, unlocking a multitude of new opportunities and possibilities for your business.
Need to make a critical hire for your team? To hire the best person available for your job you need to search the entire candidate pool – not just rely on the 20% who are responding to job adverts.
Data Science Talent recruits the top 5% of Data, AI & Cloud professionals.
OUR 3 PROMISES:
1. Fast, effective recruiting - our 80:20 hiring system delivers contractors within 3 days & permanent hires within 21 days.
2. Pre-assessed candidates - technical profiling via our proprietary DST Profiler.
3. Access to talent you won't find elsewhere - employer branding via our magazine and superior digital marketing campaigns.
Are you open to identifying the gaps in your current hiring process? Then why not have a hiring improvement consultation with one of our senior experts? Our expert will review your hiring process to identify issues preventing you hiring the best people. Every consultation takes just 30 minutes. There’s no pressure, no sales pitch and zero obligation.
Let us help you hire a winning team. To book your consultation, visit: datasciencetalent.co.uk/consultation
COLIN PARRY
MEASURING THE EFFECT OF CHANGE ON PHYSICAL ASSETS
COLIN PARRY is the founder of Head for Data, a data consultancy based in Scotland. Prior to this he was Director of Data Science at arbnco, where the work described below was developed. He has worked in the energy field for 14 years using data to drive innovation and improve business outcomes. He has worked on a diverse range of projects throughout his career and holds four patents in the area of energy management in buildings.
TRADITIONAL CHANGE MEASUREMENT PROCESSES
How do you measure the effect of a change? Ask a data scientist this question and they are likely to think of A/B testing. An A/B test is a form of randomised control experiment in which multiple measurements are taken of a metric, but this is acquired from two different groups with a slight variation in user experience. For example, a website might measure click-through rate whilst slightly varying elements on a page, showing one group a red button and another group a green one. A drug trial might measure the recovery rate from illness whilst slightly varying the medication given to each group, giving one group a tablet with an active ingredient and another group a placebo. This is now an industry standard way of measuring the effectiveness of a change. But there are some implicit assumptions:
1. It must be physically possible to get samples from multiple groups simultaneously.
2. The marginal cost of acquiring new samples cannot be prohibitive.
3. Each sample must be comparable inside the group and across groups.
Serving up a new website layout only requires a few configuration changes and the cost of acquiring multiple groups is negligible, therefore this is an easy experiment to run. In the case of a drug trial, this is trickier as more people are required to participate so the marginal cost
is higher, resulting in generally smaller sample sizes, but still enough to make the results meaningful. In both examples assumptions about the people involved need to be made – a website may segment visitors by interests or affluence to ensure the groups are comparable. The drug trial may limit participation to those with a certain BMI or age to reduce the effect of these latent variables on the results. What happens, though, if the thing being measured breaks all three of these rules? What if the measurement takes place on an asset that is one-of-a-kind, and it costs thousands of dollars to make a change?
ENTER BUILDINGS An oft-quoted statistic in building science is that 80% of the buildings that will be occupied in 2050 already exist today[i]. Globally, buildings account for 40% of our energy consumption and 33% of our greenhouse gas emissions[ii]. Unlike areas such as transport and technology, reducing the energy consumption from the buildings we all live and work in cannot be primarily achieved by building cleaner and more efficient products. Clearly, new buildings will be more efficient than those already built, but for the large stock of already constructed buildings we need to rely on retrofits. A retrofit to a building is anything that changes something already existent in the building. Switching to LED lightbulbs, adding insulation and changing out a HVAC or boiler to a more efficient alternative are all examples of retrofits aimed at reducing energy consumption. As well as the environmental benefit
these retrofits are cost-effective in the medium to long term, but do not bring immediate benefit to the building operator due to the often high upfront cost. The traditional way to model energy consumption in buildings and assess how beneficial a retrofit might be, is through simulation. Software such as EnergyPlus[iii] is a physics-based engine that gives a very good picture of how the energy consumption of a building will vary over time. The downside is it needs a lot of accurate information about the building’s construction materials, internal layouts and usage patterns. For a building in development, this will be known from the plans, but very few existing buildings (and it only gets worse the older the building) have plans sufficiently detailed to build a simulated model. Acquiring this information requires intrusive surveys and even then, there is still a lot of guesswork around materials, particularly in areas not visible to the eye. Translating this into model parameters requires experienced building scientists and is a time-consuming process that cannot be used at scale.
WHERE DATA SCIENCE CAN HELP How does this involve data science? Let’s break down the problem. Building operators want to reduce the energy consumption of their buildings, primarily to save money. Applying retrofits is expensive and it takes time to measure the effect, and since most buildings are unique, we cannot simply run one building with the change and one without. Creating a simulation of the change is timeconsuming – and in many cases impossible – without specialist knowledge, and anyone responsible for more than a handful of buildings cannot get this done at scale. Therefore, if we could build a data science model that could approximate the physics-based simulated engine then this could be trained on buildings and used to estimate the normal operation before a retrofit is applied, and compared to the actual performance post-retrofit. A building typically has two main factors that affect the energy consumption: weather and human activity:
It is intuitive that extremes in weather require the building to be heated or cooled to make it habitable for the occupants. Where the building is situated will affect how exposed it is to weather; factors like whether the
building is detached or terraced, has areas underground or is sheltered from the prevailing winds, will be important. Additionally, the level and frequency of human occupancy will affect energy intensity – a building used for 24 hours will require more energy than one used for 8 hours. The types of equipment used by human activity will also affect consumption, so building type (such as office or factory) will impact the way that the same occupancy pattern changes the energy used.
MODEL PREPARATION
If we think about what we want to achieve here – we know weather and human occupancy patterns and want to forecast energy consumption – the model is effectively learning the transfer function of how the building reacts to these external factors, and this is clearly unique to each building. It might be tempting to look at this as a problem requiring a universal solution. Maybe that would work, but it would need to factor in building type, age of construction and other hard-to-obtain information. The more of this information that is required for the model to work, the higher the chance of not being able to predict for a building if it is not available, and the closer it gets to requiring a physics-based solution. It is far simpler to train the model on one building at a time. Since this model does not need to do real-time forecasting, we can afford to wait minutes for training time, and therefore an instantaneous response is unnecessary. Now we have defined the parameters, let us dive into the data used for training and prediction.
CHOOSING DATA AND FEATURE SELECTION There’s a lot to consider when it comes to which weather data to use. This data can be acquired inexpensively from weather APIs and have a dizzying array of parameters ranging from temperature and pressure to less useful ones such as moon phase and ozone level. For this model, it was found that only air temperature and relative humidity were required to get good predictive power. Whilst historical data is relatively easy to come by, for this use case the weather data must be forecasted into the future. Again, many weather APIs can forecast months ahead and this is how the data for this part of the model was obtained – more on this at the end. The data on human occupancy would only be available if sensors were used to obtain it, and this is very unlikely to be available for most buildings. Therefore, a clustering methodology was applied which identifies periods of similar usage and creates normalised patterns for each day. The exact way this is determined is beyond the scope of this article but is described in another patent WO2022023454A1. Once these load patterns have been calculated for historical data, they can be applied onto the future predictions by making intelligent deductions about
building energy consumption patterns. If multiple years of data are available as part of the training dataset then the most likely usage case can be estimated, and where only one year of data is available then knowing the approximate building type can be used to estimate the typical annual pattern. For example, an office is likely to operate year-round but with known dips around public holidays, whilst a university will have drops in consumption when students are off.
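As a rough illustration of the general idea only (not the patented method referenced above), daily load profiles can be normalised and clustered so that each historical day is assigned to a typical usage pattern; the synthetic data and the choice of k-means below are assumptions made for the sketch.

```python
# Sketch: derive typical daily usage patterns by clustering normalised daily load profiles.
# A simplified illustration of the general concept, not the patented approach.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic example: 365 days of hourly energy readings (365 x 24).
weekday = np.concatenate([np.full(8, 5.0), np.full(10, 20.0), np.full(6, 8.0)])
weekend = np.full(24, 6.0)
days = np.array([weekday if d % 7 < 5 else weekend for d in range(365)])
days += rng.normal(0, 1.0, days.shape)  # measurement noise

# Normalise each day so clustering is driven by the *shape* of the profile, not its magnitude.
profiles = days / days.sum(axis=1, keepdims=True)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(profiles)
labels = kmeans.labels_             # which usage pattern each historical day belongs to
patterns = kmeans.cluster_centers_  # normalised 'typical day' shapes to project forward

print("Days per pattern:", np.bincount(labels))
```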
MODEL ARCHITECTURE AND TRAINING
When it comes to time-series forecasting there are a plethora of approaches to choose from: all the way from univariate statistical models such as ARIMA through to deep learning approaches such as RNNs and LSTMs. In this case, while the chosen architecture requires online training, we still need to be mindful of performance and costs as the data volume scales. Unfortunately, due to the way sequences are handled in RNN and LSTM architectures, they can be very slow to train. The solution is to use another type of deep learning architecture – convolutional neural networks (CNNs). Normally, CNN architectures are used for tasks such as image recognition, and work by using hidden layers that modify the input data by applying a short filter window to it to reduce the size of the resulting output. For a long period of time-series data, such as in our problem, this would result in an enormous number of hidden neurons.
To reduce the number of neurons required in the hidden layers, we can use a dilated-CNN architecture which skips out historical points in the frame of the window. This vastly reduces the size of the model, results in a shorter and less computationally intensive training time, and has the bonus effect of reducing overfitting. Applying p 1D convolutions to the input data reduces the number of neurons down to 1/2^p of the size of the input dataset.
The model is fed a series of frames of shape N_tr × M_f, where N_tr is the number of training rows in a frame and M_f is the number of features. The model learns from this to predict the energy consumption a small window ahead for each frame, of size N_pr × 1, where N_pr is the number of predicted rows. Making decisions about the length of training data in a frame and how far to predict ahead is largely a balancing act of accuracy vs training performance. Making the frame window too long will reduce the number of frames to train on, but will allow the model to go further back to learn from – however there is a limit to causality (i.e. did the temperature 6 months ago really affect the energy consumption yesterday?). Making the frame window too short will not give the model enough data to learn from.
Other commonly used deep learning techniques were applied, such as skipping connections (to reduce the chance of the network becoming stuck at a local minimum by connecting non-adjacent layers) and dropout regularisation (to further reduce overfitting by randomly disabling neurons during training).
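A minimal sketch of this kind of architecture in Keras is shown below. The layer widths, dilation rates, dropout rate and frame sizes are illustrative assumptions rather than the production configuration; the shapes follow the frame convention above, with N_tr input rows of M_f features predicting N_pr rows ahead.

```python
# Sketch of a dilated 1D-CNN forecaster with skip connections and dropout.
# Hyperparameters are illustrative, not the production configuration.
from tensorflow.keras import layers, models

N_TR, M_F, N_PR = 336, 3, 24   # e.g. two weeks of hourly rows, 3 features, predict 24 rows ahead

inputs = layers.Input(shape=(N_TR, M_F))

x = inputs
skip = None
for dilation in [1, 2, 4, 8]:                         # doubling dilation rates skip historical points
    x = layers.Conv1D(32, kernel_size=2, dilation_rate=dilation,
                      padding="causal", activation="relu")(x)
    x = layers.Dropout(0.1)(x)                        # dropout regularisation against overfitting
    skip = x if skip is None else layers.Add()([skip, x])  # accumulate skip connections across blocks

x = layers.GlobalAveragePooling1D()(skip)
outputs = layers.Dense(N_PR)(x)                       # energy consumption for the next N_PR rows

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()
```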
RESULTS The model was evaluated against a database of buildings that had multiple years of data so that the model could learn from a portion of the data and be evaluated against the ‘future’ real data. The goal was to meet the entire building calibration standard as defined by the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE). In Guideline 14, they state that a building can be considered calibrated if the Normalised Mean Bias Error (NMBE) is less than 10% and the Coefficient of Variation of the Root Mean Square Error (CV(RMSE)) is less than 30%. Using this test database of buildings, this algorithm exceeded these standards when evaluating the 6-month ahead prediction.
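For reference, the two Guideline 14 metrics can be computed as in the sketch below. Conventions for the degrees-of-freedom adjustment vary slightly between sources, so the choice of p = 1 here is an assumption.

```python
# Sketch of the ASHRAE Guideline 14 calibration metrics referenced above.
# The degrees-of-freedom adjustment p = 1 is a common but not universal convention.
import numpy as np

def nmbe(actual, predicted, p=1):
    """Normalised Mean Bias Error, as a percentage."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    n = actual.size
    return 100.0 * np.sum(actual - predicted) / ((n - p) * actual.mean())

def cv_rmse(actual, predicted, p=1):
    """Coefficient of Variation of the RMSE, as a percentage."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    n = actual.size
    rmse = np.sqrt(np.sum((actual - predicted) ** 2) / (n - p))
    return 100.0 * rmse / actual.mean()

# A building would then be considered calibrated if, for example:
# abs(nmbe(y_true, y_pred)) < 10 and cv_rmse(y_true, y_pred) < 30
```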
The main strength of this model is that no prior knowledge of the seasonality of the energy profile is required, since the weather features cover the low frequency changes, and the human occupancy gives the model information about the high frequency changes. Therefore, predictions can be made on a wide range of building types with diverse operating patterns without any real knowledge of the building itself. This is the main benefit of not attempting to build a universal solution to this problem.
LIMITATIONS One of the model features is human occupancy. As discussed, this is very rarely directly measured in a building and must be inferred from the energy data. Since the approach here normalises patterns, then the more astute reader might ask why these are not used for prediction. Whilst this would be a very simple and easy way to produce an output, the normalisation approach removes the effect of the weather features and usage changes, therefore this does not work for buildings with a high degree of seasonality caused by weather or sudden changes in usage. The other features come from a weather API. These generally source this data from weather stations and these are very unlikely to be sourced directly at the building itself. Therefore, effects such as urban heat islanding[iv] can result in the measured weather data not reflecting the true values experienced at the building. Fortunately, the weather API will return the station coordinates so these can be compared to the building location, but in some rare cases, if the weather station is too far away, then caution should be applied to the accuracy of the results. When considering the reliability of these predictions, sudden changes in use case can break the model forecast. For example, if a factory purchases a new
piece of equipment, the energy profile can change quite dramatically. This can be accounted for in the historical data by training separate models for each usage case (assuming there is enough data) but if this takes place after the retrofit is applied, then this previously unseen profile could render these predictions invalid. As discussed earlier when evaluating a potential retrofit, future weather forecasts need to be obtained in order to make a comparison. However, if this forecast ends up being incorrect (i.e. the real weather ends up being colder or warmer than at the time of the comparison) this could make the initial comparison under or over-estimate the benefit of the retrofit. It is recommended that once the real weather data has been recorded, a retrospective comparison is carried out.
SUMMARY To assess how energy consumption will change when a retrofit is applied to a building we first need to capture how the building operates currently, and this can be compared to the actual performance after the retrofit is applied. A/B testing cannot be used because there is only one copy of the building, and traditional physics-based approaches often require information that can only be obtained through intrusive physical surveys which do not scale well to a large portfolio of buildings. By using a dilated-CNN architecture to learn the transfer function we can use temperature, humidity and human occupancy as features to learn energy consumption on a historical training dataset of at least 12 months in length. After a retrofit has been applied, this can be used to estimate the energy consumption that would have taken place without it, and therefore a comparison can be made to determine whether the change was successful.
This work was developed by arbnco and is patented worldwide under patent WO2022043403A1.
[i] UK Green Building Council, Climate Change Mitigation, Visited 16th November 2023, ukgbc.org/our-work/climate-change-mitigation
[ii] World Economic Forum, Why buildings are the foundation of an energy-efficient future, Visited 16th November 2023, weforum.org/agenda/2021/02/why-the-buildings-of-the-future-are-key-to-an-efficient-energy-ecosystem
[iii] EnergyPlus, EnergyPlus, Visited 20th November 2023, energyplus.net
[iv] National Geographic, Urban Heat Island, Visited 29th November 2023, education.nationalgeographic.org/resource/urban-heat-island
BARTMOSS ST. CLAIR
USING OPEN SOURCE LLMs IN LANGUAGE FOR GRAMMATICAL ERROR CORRECTION (GEC)
BARTMOSS ST. CLAIR holds maths and physics degrees from Heidelberg University. He's an AI researcher who previously worked at Harman International and Samsung, and has been a guest researcher at the Alexander von Humboldt Institute in Berlin. Currently, Bartmoss is Head of AI at LanguageTool. He's also the creator of the open source community Secret Sauce AI.
How did you go from maths and physics at Heidelberg into data science and AI? I was a physicist at university, but I found academic life just didn’t suit me. Through a connection with a colleague, I got involved in a very interesting project with natural language processing, dealing with automating content governance systems for banks. Then I founded a company to build up solutions for five or six different languages, and that’s when I discovered my passion. Now you’re at the cutting edge of using LLMs in the business world. Can you give us an overview of what you’re doing at LanguageTool? So at LanguageTool, the one use case for us is grammatical error correction. Someone writes a sentence or some text, and they want their grammar checked, then the tool replaces it with the correct grammar. Of course, that’s a very basic use case. LanguageTool has
existed for about 20 years, but back then we didn’t use AI or machine learning. As head of AI, one of the things I wanted to do was create a general grammatical error correction system (GEC) to catch all kinds of errors for all languages possible, and correct them. So initially, we needed a model that would correct the text someone has written. We had to decide whether to take an existing LLM like GPT-3 or GPT-4 and just use prompting, or to create our own model. And we found that creating our own model was cheaper, faster, and more accurate. There was a lot to consider when creating a model that will run for millions of users: we had to make sure there was a good trade-off between the performance, speed, and price. Decoder models have become extremely popular, and we’ve seen a lot of scaling behind them. But sometimes the question is: how big do you need the model to be for your purpose? And do you have to benchmark and test that to see how well that works?
Can you dive a little deeper into the business case? Who are the users, and how does the grammar tool work in practice? This is a real-time application that needs to work with low latency as users are typing in a document, or on a website, or anything in their browser. So there are actually plenty of ways you can use LanguageTool; while it needs to work as quickly as possible, the use cases are completely varied. It can be academic writing. It could be for business (we have both B2B and B2C customers). So we don’t have a one-size-fits-all solution for our customers. But we always have to find a good balance between accurate grammar correction and not annoying people. You might assume that grammar’s either correct or not, but the reality is more nuanced. There are cases where users don’t like a rule even if it’s technically correct. In these cases, you have to debate whether you want to turn it off. Take, for example, the phrase ‘to whom’ or ‘for whom.’ Nowadays, people just say ‘who.’ Grammar changes over time, and we have to be mindful of that, and strive to strike a balance with our users. How many users do you have? We have several million users in many languages; we’ve over 4 million installations of the browser add-on, in fact. We have six primary languages, but we support over 30 different languages worldwide. And we also have employees from all over the world.
fascinating challenge where it’s not black and white. Other challenges include issues with hallucinations or extreme edits. You want the output to retain the same meaning as the original sentence, and maybe you don’t want certain words changed, you just want to correct the grammar. We like to offer users a multitude of different ways of correcting. Some users want to apply a certain style; some just want their grammar checked, and some want both. So, you have to be certain you’re changing language at the right level, in the right way, to meet the user’s specifications. There are countless different tools to solve these types of problems: everything from checking edit distances with Levenshtein distance, or similarity with cosine similarity. You can also classify and tag sentence edits. This can reduce the issue of over editing, or over changing a sentence. Occasionally, there’s also the question of: how much do you change a sentence before you shouldn’t change it anymore? Because LanguageTool has existed for 20 years, there are many standards and practices in there that have developed through building a rule-based system, and we’ve inherited these when moving into AI.
So you incorporated that rule-based system from the past, and you now have a hybrid system that's generally high-end rule-base?
Absolutely. When it comes to basic things that could be formulated into rules, it's very cheap, very accurate, and it just works. There's no reason to fix that. It's these more complex, contextual based grammar issues where you can't just create a simple rule for it because there are so many exceptions, that's where machine learning is needed. And running machine learning models with inference in production could be pricier than just writing a simple rule with RegEx or Python.

How do your rule-based systems work together with the AI systems?
One essential aspect is that we don't just correct the grammar, but we explain to the user why; we give the reason for the correction. And every time we have a match, we also have to explain the reasoning behind that. And this means that every match has a unique rule ID. So, for certain rule-based systems, you have an ID for those, and you have an ID for all the types of matches that can occur in the machine learning aspects, and you can then prioritise those rules. For example, let's say that nine times out of ten, there's a rule that works with the rule-based system, but
then it triggers for that one time when there’s a deeper context. In that case, you can prioritise the AI model over the rule-based system. This approach works really well. It ensures we don’t get stuck in endless correction loops. You mentioned you were trying to develop a generalised GEC. How far away are you from that? Is it specialised agents that solve specific tasks, and the rule-based system uses them, depending on the need? Or is it really like one generalised AI system that answers some questions? We do keep our legacy rules where they function very well, and they always work. There are certain cases where something’s always going to be true, like capitalising at the beginning of a sentence or punctuation at the end, and so on. But for general grammatical error correction, this is something that we worked on for a while, and we actually do have running in production. The system has many, many layers in production from rule-based systems to specific machine learning-based solutions, for one type of correction. And we also use the GEC, the general grammatical error correction, which can solve a lot. How much data do you need to train those models, and what’s the role of the data? What kind of data do you use – natural, or synthetic? In the end, it’s a mixture. We have opt-in user data that
we collect from our website, and also golden data that we’ve annotated internally ourselves from our language experts. We have experts for every language, who go through and review and annotate data for us, which is extremely helpful. Furthermore, we also generate synthetic data. Generally, when it comes to these models, you have to ask yourself: do you want to train these things in stages, or do you want to train them all in one go? Is it better to use synthetic data for a part of it, or should you just use fewer points of data and focus on quality? There’s an interesting mixture of tasks and methods that we use with generation and user data, with data that we internally create and annotate. One important thing we have to watch out for is the distribution of the data. It must reflect how it is for the users. We always aim to keep that distribution as close to how it’s used as possible. That way, you can pick up common user errors – such as misplaced commas. You’ll notice these errors occur more often than certain other types of errors. Given that language is so subjective, how do you effectively benchmark the GEC and its tasks? We’ve had to build tools from scratch to address that issue because they just didn’t exist before. As I mentioned earlier, you can get something marked wrong, but it’s technically correct. There’s more than one way to correct a sentence. Of course, the most obvious solution
is to collect every possible variant of a correct version of a sentence. But that's a momentous task, and it's not always the best way to do it. You need to ask yourself: what's more important, precision or recall? When it comes to the users, we do have some fantastic analytics. We can see if users generally applied our suggestion, or if they said no to it, or just ignored it. And we guide ourselves a lot by that. There's no silver bullet. Instead, it's a barrage of methodologies to ensure that we are giving our users the best possible quality corrections that they can trust, whether they're writing a thesis or a legal document. And because we've been doing this for 20 years, we've ironed out a lot of those issues. And how do you measure the performance of your system? There are countless different measures we use for performance. There's the typical F1 scores: when you're training or fine-tuning a model with your evaluation dataset, this is something you look at. That's not 100% foolproof, but it offers a good rule of thumb. Also, the user analytics. When you put something online, you can see very quickly if the users really hate it. So we do partial rollouts and A/B tests. We also do manual reviews: taking subsets of data, and have professional language experts review it. These reviews are time-consuming, but extremely important. Our language experts have a very deep understanding of language, so they will catch any errors. Can you talk about the platform before LLMs, and how introducing LLMs has improved the user experience and the performance? There are two main use cases for us with LLMs. One is general grammatical error correction, and the other is rewriting or rephrasing; for example, you can't do paraphrasing without these kinds of models. That was a huge use case for us, something you couldn't do before with previous systems. And also with grammatical error correction, you can catch things that you couldn't catch before LLMs. Back then, the best we could do was train much smaller models on a very specific error – where there's a comma missing, for example. You would have to literally think of each and every type of error possible. Whereas when you're training a model, it can catch a lot of things that are much deeper than you would ever think about; very deep contextual things. You can also change things with style, improve fluency. There are so many use cases for language that we can bring into all kinds of applications. I use the tool myself, and even though I taught English for several years, I'll discover that I missed a comma here and there. And that's quite helpful!
Regarding that switch to using LLMs more, did that have an impact on your cost side? Absolutely. There was a cost trade-off there because we host all of our own infrastructure. With our GPU servers, it’s a no-brainer. You get higher retention when you find more errors and improve people’s spelling and grammar. And consequently, you get better conversion rates for our premium product. So, it’s a worthy investment for us. Of course, working to bring the cost down is something we actively do. We’ve worked a lot on compressing models and accelerating models. These things aren’t plug and play; you’ve got to build the framework, fix bugs etc. You mentioned earlier that language is something that’s changing. Do you think that at some point in the future, in 10 years or so, LLMs will start having an impact on shaping language? I think to some degree they already do. We can, a lot of times, detect whether a longer text is generated. And I’ve noticed by playing around with different LLMs that they like to inject certain words all the time. And I don’t know if that’s for watermarking purposes or what that actually is because a lot of times they’re not very transparent about these things. I think as we rely on LLMs more, we will definitely see a greater impact on language. But it’s always important to note that these things are just predicting the next token based on a huge amount of data from people. And so, as long as you’re training new models or fine-tuning new models, that’ll change with the times as well. Language is very fluid, and it changes over time, and we have to change with it. And I think there’s nothing wrong with that. When using LLMs with LanguageTool, are you combining it with GPT via the API or are you using open source LLMs? We don’t use external APIs, especially for grammatical error correction. That would be a privacy concern because a lot of our users are very privacy-focused, and our data generally stays within the EU. And we don’t save data from our users unless they’ve opted in from the website. And so, we keep that all in-house with processing. Everything we’re doing is completely built internally. And we use a variety of different types of open-source models, and we’re always experimenting with new ones.
FRANCESCO GADALETA
ARE LARGE LANGUAGE MODELS THE ULTIMATE DATABASE?
In a recent article, software engineer François Chollet (the mind behind the Python library Keras) put forward a bold interpretation of large language models (LLMs): he claimed that modern LLMs act like databases of programs. Chollet's interpretation might on the surface sound strange, but I believe there are lots of ways his analogy is an astute one. In this article, I'll explore whether LLMs really are a new breed of database – and deep-dive into the intricate structure of LLMs, revealing how these powerful algorithms exploit concepts from the past.
FRANCESCO GADALETA is a seasoned professional in the field of technology, AI and data science. He is the founder of Amethix Technologies, a firm specialising in advanced data and robotics solutions. Francesco also shares his insights and knowledge as the host of the podcast Data Science at Home. His illustrious career includes a significant tenure as the Chief Data Officer at Abe AI, which was later acquired by Envestnet Yodlee Inc. Francesco was a pivotal member of the Advanced Analytics Team at Johnson & Johnson. His professional interests are diverse, spanning applied mathematics, advanced machine learning, computer programming, robotics, and the study of decentralised and distributed systems. Francesco’s expertise spans domains including healthcare, defence, pharma, energy, and finance.
THE ORIGINS OF LLMS
Before we explore Chollet's analogy, let's consider his take on the development of LLMs' core methodology. Chollet sees this development as stemming from a key advancement in the field of natural language processing (NLP) made a decade ago by Tomáš Mikolov and the folks at Google. Mikolov introduced Word2vec, which solved the problem of how to compute with text numerically. The Word2vec
algorithm works by translating words, phrases and paragraphs into vectors called ‘word vectors’ and then operates on those vectors. Today, of course, this is a normal concept, but back in 2013, it was very innovative (though I should point out there have been many papers prior to 2013 in which the concept of embedding was very well-known to academics and researchers). Back in 2013, these word vectors could do things like arithmetic
operations. As a classic example, if you take the embedding of 'king', subtract the embedding of 'man' and add the embedding of 'woman', the result is (approximately) the embedding of 'queen' – the semantically closest concept to this arithmetic operation. Thanks to these word embeddings, we could calculate arithmetic operations on words. Fast-forward 10 years, and what's changed? These algorithms have evolved into LLMs. In 2023, around March (at least officially), a new wave of LLMs was announced, and we started using ChatGPT and the many other LLMs out there. And so, billion-parameter models became the new normal.
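As an aside, the classic analogy can still be reproduced with pretrained Word2vec vectors; the snippet below is a sketch that assumes the gensim library and its downloadable Google News model are available.

```python
# Sketch: reproduce the classic king - man + woman ≈ queen analogy with pretrained vectors.
# Assumes gensim is installed and can fetch the (large, multi-gigabyte) Google News model.
import gensim.downloader as api

vectors = api.load("word2vec-google-news-300")  # large download on first use

result = vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # typically [('queen', ...)]: the closest word to the vector arithmetic
```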
WORD2VEC: THE DECADE-OLD CONCEPT AT THE HEART OF MODERN LLMS
The original 2013 Word2vec was, of course, incapable of generating fluent language. Back then, methodologies were much
more limited. But at the core of our modern LLMs, there's still a version of that Word2vec concept: of embedding tokens of words, or sub-words, or entire documents, in a vector space. In his article, Chollet explores this correlation between LLMs and Word2vec. He points out that LLMs are, in essence, autoregressive models, trained to predict the next word conditioned on the previous word sequence. And if the previous sequence is large enough – if there is a lot of context, or a more complete context – the prediction of the next word can be increasingly accurate. But in fact, Word2vec and LLMs both rely on the same fundamental principle: to learn the vector space that allows them to predict the next token (or the next word), given a condition; given a context. The common strength of these two relatively different methods is that tokens that appear together in the training text also end up
close together in the embedding space. That’s the most important consequence of training these models. In fact, it’s crucial, because if we lost that correspondence, we wouldn’t be able to match numerical vectors to similar semantic concepts. Even the dimensionality of the embedding space is quite similar: on the order of 10³ to 10⁴, magic numbers obtained by trial and error. As LLMs increased in scale, that aspect really didn’t grow much, because people noticed that increasing the dimensionality of the embedding space yielded little improvement in the accuracy of the next estimated token or word, while increasing the dimensionality of the vectors also increases the compute needed to work with them. So LLMs still encode correlated tokens in nearby locations: there’s still a strong connection
between training LLMs and the training part of Word2vec.
SELF-ATTENTION AND THE DEVELOPMENT OF LLMS
The outcome of all this, as Chollet says, is self-attention. Chollet sees self-attention as the single most important component in the transformer architecture. He defines it as: ‘...a mechanism for learning a new token embedding space by linearly recombining token embeddings from some prior space in weighted combinations which give greater importance to tokens that are already “closer” to each other,’ which I think is a great summary of how self-attention works. A classic example is the input sequence: ‘The train left the station on time.’ First, the self-attention mechanism converts each word into its token vector. So, we would have the vector of ‘the’, the vector of ‘train’, the vector of ‘left’, and so on. We then arrange those vectors into matrices that we use to calculate the attention scores, because we want the attention score of each word against all the others. Next (I’m oversimplifying this stage of the process, of course), we get to the scores for the word ‘station’, and we will immediately find that ‘station’ is usually close to the word ‘train’ and the word ‘left’. In fact, ‘station’ and ‘train’ are almost always close to each other (at least according to the enormous amount of text used to learn such relations). So, when we do this exercise for the scores of ‘station’, we immediately see that it has a higher score with respect to ‘train’ than to all the other words. If we extract these scores from the sequence and order the words by their attention scores, we would find ‘train’, ‘left’ and ‘station’ at the top, which gives us a hint about what’s going on in this sequence: that a train is indeed leaving a station.
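For readers who want to see the mechanics, here is a toy sketch of that score computation in Python with NumPy. It uses random vectors in place of learned embeddings (so the actual numbers are arbitrary) and omits the learned query/key/value projections of a real transformer; the point is only to show how scores become weights that recombine the embeddings.

```python
# A toy sketch of the attention-score computation described above. Random
# vectors stand in for learned token embeddings, and the query/key/value
# projections of a real transformer are omitted, so the numbers here are
# arbitrary; with trained embeddings, co-occurring words such as 'train'
# and 'station' would receive the higher scores.
import numpy as np

tokens = ["the", "train", "left", "the", "station", "on", "time"]
d = 8                                  # toy embedding dimensionality
rng = np.random.default_rng(0)
E = rng.normal(size=(len(tokens), d))  # one embedding vector per token

scores = E @ E.T / np.sqrt(d)          # score of each token against every other
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)   # softmax over each row

# Each token's new embedding is a weighted recombination of all embeddings,
# giving greater importance to tokens with higher attention scores.
new_E = weights @ E

station = tokens.index("station")
print(sorted(zip(weights[station], tokens), reverse=True)[:3])
```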
Of course, we don’t know any further details because we have just one sequence and one phrase. But imagine you had several gigabytes or even terabytes of text in which all these correspondences of words and contexts were scored according to the self-attention mechanism. Then you would have a very interesting context in which vectors summarise very nicely whatever is going on in those sequences. Two important things are happening here.
Firstly, the embedding spaces that these architectures learn are semantically continuous. This means that if we move slightly in the embedding space, we’re also slightly changing the corresponding token, and the semantic meaning that we humans assign to that group of words changes slightly too. This semantic continuity is a very interesting property: we can move continuously both in the embedding space, which is made of numeric vectors, and in the world of semantics, which is the world we humans live in. Slightly change the numeric parameters or vectors, and you slightly change the semantic meaning of those vectors.
Secondly, the embedding spaces they learn are semantically interpolative. This means that if you take the intermediate point between any two points in the embedding space, this third point represents the intermediate meaning between the corresponding tokens. In other words, if you cut somewhere in between two points, you get the semantic intermediate point as well (I’d argue this feature is a consequence of the first property, semantic continuity; you can’t have the second without the first).
Why are these two properties so important? Because human brains work in a very similar fashion. Essentially, neurons that fire together often wire together, so when the brain learns something new, it creates maps of a space of information (this is known as Hebbian learning). Indeed, there are strategies people employ (such as mind maps) that enhance this innate property of the brain to build maps of a space of information: to learn fast, to remember concepts better, or to retrieve these concepts after years and years. So the way the brain learns is in fact very like what’s happening on our GPUs when we train LLMs, which exhibit these two properties of semantic continuity and semantic interpolation.
Of course, there have been huge improvements since Word2vec was introduced in 2013. LLMs aren’t just about finding a semantically similar word anymore; modern LLMs have become far more powerful. For example, you can provide a paragraph, a document, or even a description of how your day went, and ask in your prompt: ‘Write this in the style of Shakespeare’, and you would get, as an output, a poem that resembles Shakespeare’s. In this way, LLMs are enhancing the concept of Word2vec and taking it to new dimensions. This brings us to Chollet’s intriguing interpretation of LLMs as program databases.
LLMS AS DATABASES
You might be wondering: what does a database have to do with a large language model? Well, consider that, just like a database, an LLM stores information, and this information can be retrieved via a query, which we now call a prompt.
However, the LLM goes beyond a basic database because it doesn’t merely retrieve the same data that you have inserted, but it also retrieves data that’s interpolative and continuous. So, the LLM can be seen as a continuous database that allows you to insert some data points and retrieve not only the data points that you have inserted, but all the other data points that might have been in between. That’s the generative power of LLMs as a database: the LLM doesn’t only give you what you inserted, but also something more, something that is an interpolation of what you have inserted. These results don’t always make sense, because the LLM still has the potential to hallucinate. But even when taking these glitches into account, Chollet has given us a very neat interpretation. The LLM could be analogous to a database, with this major difference: the capability of returning not just the data points that you have inserted, but also an interpolation of those (and therefore points you never inserted before). The second major difference between LLMs and conventional databases is that LLMs don’t just
contain data (of course, they do contain billions and billions of parameters now, even hundreds of billions with GPT-4); LLMs also contain a database of programs. So, the search isn’t only in the data space; it’s also in a program space containing millions of programs. The programs are ways of interpolating the data differently. In other words, the prompt essentially points to the most appropriate program to solve your problem. For example, when you say, ‘rewrite this poem in the style of Shakespeare’, and provide the poem, the ‘rewrite this thing in the style of’ element is a program key pointing to a particular location in the program space of the LLM, while ‘Shakespeare’ and the poem that you input are the program inputs. The output – the result of the LLM – is the result of the program execution. In summary: you point to the program, you provide some arguments or inputs, the program executes and you receive an output. The model operates in the same way as giving inputs to functions in an imperative programming language and waiting
for the result after compute. We can also think of LLMs as machines that generate the next word given a context, which acts as the statistical condition. This interpretation is easier to explain to non-technical people, even though there are many technicalities we’re skipping here, of course.
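To make the analogy concrete, here is a hedged sketch of that framing in Python. The `call_llm` helper is hypothetical, a stand-in for whichever LLM API you actually use; the structure simply separates the program key from the program inputs.

```python
# A hedged illustration of the 'program database' framing: the prompt
# template acts as a key selecting a program in the model's program space,
# and the user-supplied text is that program's input. `call_llm` is a
# hypothetical helper standing in for whichever LLM API you actually use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your LLM provider's API call")

def run_program(program_key: str, program_input: str) -> str:
    # program key   -> points at a location in the model's program space
    # program input -> the arguments handed to that program
    # return value  -> the result of the program execution
    return call_llm(f"{program_key}\n\n{program_input}")

# Usage, mirroring the example in the text:
# run_program("Rewrite this poem in the style of Shakespeare:", my_poem)
```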
CONCLUSION
There is one more thing that Chollet and I agree on: even though LLMs seem to be sentient machines that understand your prompt, this, of course, is not happening. LLMs do not understand what you’re typing. That prompt is a query, a way to search in a program space, just like a query you insert in Google Search, or a query you’re providing to a database. It’s in fact a key, just a complex one, that allows the database to search and retrieve the information you’re seeking or interpolate it for you. That’s the advice that Chollet is giving to all of us: resist the temptation to anthropomorphise neural networks and in particular, LLMs. Instead, consider them a sophisticated kind of database. Happy prompting!
Resist the temptation to anthropomorphise neural networks and in particular, LLMs. Instead, consider them a sophisticated kind of database.
REBECCA VICKERY
A JOURNEY FROM DATA ANALYST TO DATA SCIENCE LEADER
REBECCA VICKERY is an award-winning data science leader with more than 16 years of extensive experience in the field of data across diverse industries. A prolific writer, she has published over 100 articles on Medium, primarily dedicated to guiding others in starting their journey into the world of data science. Rebecca is also a sought-after speaker, having presented at numerous data conferences globally over the last few years. Currently, she directs her expertise towards leading a data team at EDF, where their primary focus is on developing machine learning models and conducting research to provide insights for targeted customer engagement.
My career in data started a long time before cloud computing had become established. Maintaining records of plant genome size at the Royal Botanic Gardens Kew during my undergraduate placement, in an on-premise Microsoft Access database, was very different from the work I’m currently doing running machine learning workloads on Amazon Web Services (AWS) and Snowflake. However, it sparked a passion for data discovery, science and problem-solving that would ultimately lead to a career in data science.
EARLY CAREER Inspired by a birthday gift of a microscope at age ten, I developed a keen interest in science and nature from a
young age. As a result, I chose to study Molecular and Cellular Biology at university. An industrial placement year at the Royal Botanic Gardens Kew involved early work with databases and writing basic code. After graduating from university in 2001, keen to further pursue programming, I took a trainee role in a media agency to learn front-end web development. Data analysis gradually became another part of my role at the agency as Google Analytics became established and clients wanted to understand how their websites were performing. Having found a particular passion for analytics, I applied for a role with the online travel retailer Holiday Extras, which was looking for someone to build a new web analytics practice.
DIGITAL TRANSFORMATION, CAREER TRANSFORMATION I would spend the next few years working to establish a suite of web analytics reporting tools using an in-house event tracking system. To begin with, my work mainly centred around SQL and Excel, but as the business rapidly underwent a digital transformation so did my role. The company’s data migrated to Google Cloud, and as an early adopter of the business intelligence tool Looker, I was able to automate much of the manual data analysis that previously consumed my time. I now had an increased capacity to perform more strategic data analysis and build models. Quickly finding the limits of SQL and Excel, and keen to expand the impact of my work, I started on a journey to learn Python for Data Science. At the time I worked a full-time job and had two young children, so studying had to fit into the limited free time that I had. I was fortunate to have an experienced data scientist as a mentor who guided me in designing a bespoke curriculum to learn from. Data science is a cross-functional field and the chances are that anyone starting out will already have some of the skills required. I already had a background in data analysis and statistics so for me, the main focus for learning was programming in Python and machine learning. Focussed on learning as much as I could in limited time, I quickly developed a technique for accelerated learning. I would use a variety of resources, mostly freely available tutorials and articles on the internet. I’d learn just enough to be able to apply these skills in my work and then explain the concept back by writing a blog post. This act of explaining would guide me to develop a deeper understanding of the topic and would solidify my knowledge, but most importantly it would tell me where the remaining gaps were in my understanding. I would then repeat the process, learn more, and build even better applications. Writing has since become another passion of mine and continues to be a source for learning and development as well as enabling me to give back to the data science community who helped me so much at the beginning of my career. I wrote my first article on Medium in 2018 and to date, I have published over 100 articles on the platform.
TRANSITION TO LEADERSHIP Being driven by the impact that my work in data science could have, I later applied for a data scientist role at EDF UK. EDF’s core mission is to help Britain achieve Net Zero and I was hugely inspired by the opportunity to apply my
skills to solving a problem that affects everyone and the world around us. Two years later, I had the chance to step into a leadership position by taking on a role as Senior Leader of a brand new Customer Insight and Targeting team. My team and I are responsible for building models, research insight and data sets to develop audience segmentation and improved targeting across digital, media and direct marketing. I remain partially hands-on and now also have the opportunity to train and mentor others in Python for Data Science. Moving into leadership has allowed me to grow my impact even further through guiding and directing an entire team.
THE FUTURE I’m not sure there is a traditional route into data science and my journey certainly wasn’t a direct path. My focus has consistently been on the impact of my work, continuous learning, and the joy derived from the process, rather than adhering strictly to job titles, and this has resulted in an interesting and varied career in data. The data space has never been more interesting than it is right now. The explosion in generative AI this year has made machine learning more mainstream and it seems that everyone is talking about it now. Unlocking the potential of machine learning used to be a significant challenge for many organisations, but I think we are now at a tipping point for driving substantial positive change. In my new role, I am already witnessing the tangible effects of this shift and I’m excited about the future of data and the impact it will have on the world as we know it.
My views expressed in this article are entirely my own and don’t necessarily reflect the views of any of the companies mentioned.
COLIN HARMAN
THE 5 USE CASES FOR ENTERPRISE LLMS FROM LLM CAPABILITIES TO USE CASE CATEGORIES
COLIN HARMAN is an enterprise AI-focused engineer, leader, and writer. He has been implementing LLM-based software solutions for large enterprises for over two years and serves as the Head of Technology at Nesh. Over this time, he’s come to understand the unique challenges and risks posed by the interaction of generative AI and the enterprise environment and has come up with recipes to overcome them consistently.
Welcome to the age of Enterprise LLM pilot projects! Right now, enterprises are cautiously but enthusiastically embarking on their first large language model projects, with the goal of demonstrating value and lighting the way for mass LLM adoption, use case proliferation across the organisation, and business impact. In a well-planned and well-executed project, that’s exactly what will happen: the use case will provide new capabilities and efficiencies to the business, function reliably, delight users, and make a measurable impact on the business. Understanding of the technology and its value will spread outside the pilot project, where new use cases will be surfaced and championed. However, if the wrong use case is chosen, the project could fall flat, even when executed to perfection. Imagine a use case is chosen that’s a poor fit for the capabilities LLMs provide: You implement an e-commerce analytics assistant, but the LLM-based conversational interface is a clumsier experience than the previous column-based lookup interface, so you remove it. Then you realise all the
analytics requirements are best satisfied by traditional data analytics patterns like statistical analysis, topic modelling, and anomaly detection, so instead you shoehorn a conversational interface into some side feature that nobody will use. You’ve reinvented the wheel, and the LLM provides no benefit to users. Or imagine that a use case is chosen that fails to provide an advantage over existing workflows, even if it is a fit for LLM capabilities: You implement a chat-with-data system on a set of marketing publications. Users try it out, but prefer to use Google because they can’t ask your system about topics not contained in the dataset, while Google has everything and still gets answers and sources. Feedback is negative and your fancy AI system is perceived as less useful than Google search. Or imagine that the data for your selected use case is not organised or reliable enough to support it: You implement a question-answering system over a massive dump of unstructured data. Enterprise search initiatives over this dataset had previously failed spectacularly, due to nonexistent data management, chaotic content, and erratic formatting. Now, the
question-answering system, which is also search-based, faces the same problems and also fails spectacularly. In any of these cases, your future LLM projects will struggle to gain support from leadership or adoption from users. However, under different circumstances, an LLM-based assistant, chat-with-data system, or question-answering system can be an impressive success! In this article, we’ll review the core capabilities of LLMs, map those to use case categories with examples, and consider what kind of impact they can generate. By the end of the article, you should have an intuition of how to spot candidate use cases in your organisation.
BUT FIRST: AI AND GENERATIVE AI DO TOTALLY DIFFERENT THINGS
If you’re already very familiar with LLMs and their capabilities, skip to the ‘Major Use Case Categories’ section below. What did ‘enterprise AI’ mean before 2023? Usually it meant deriving insights or making judgments based on data. It went hand in hand with Big Data, and was almost never generative. Predict, classify, detect, analyse, recommend. These functions typically manifested in optimisations grounded in the dataset.
● You’re recommending products to buy? Recommend better ones, based on the data.
● You’re predicting account lifetime value? Predict more accurately, based on the data.
● You’re detecting suspicious transactions? Detect more accurately, based on the data.
The idea is, the secret to doing better is right there in the data! After all, enterprises have tons of data, so it makes sense and works really well. Grab a handful of business processes in your organisation and if there’s relevant data, there’s likely an optimisation to be had.
Generative AI is totally different. By generating information, it provides new capabilities that are almost mutually exclusive from the optimisation power of non-generative AI. And the capabilities of a generative language model don’t come from the data surrounding a business task – they come primarily from the text of the entire internet, and the instruction data it was fine-tuned over, which sometimes extends into your enterprise. This coin has two faces:
● LLMs don’t need any of your data in order to be useful.
● LLMs will never perfectly align with the distribution of your data.
But perfectly matching your internal data distribution is generally only needed for optimisations, so that’s okay! What’s more, the LLM was not trained on your specific task either. To reinforce the difference between them, let’s spell out how the different models are trained:
● Non-generative AI: Usually 100% trained on the relevant task and data.
● Generative LLMs: Often 0% trained on the relevant task and data.
Clearly, they couldn’t be more different. As you approach LLMs, be careful not to think of them in the same category as non-generative AI.
We’ll use LLM as shorthand for generative large language model throughout this article, and expand these capabilities below.
LLM CORE CAPABILITIES
To lay the groundwork for use case selection, let’s divide LLMs’ broad capabilities into three more specific ones:
1. Instruction following and reasoning
2. Natural language fluency
3. Memorised knowledge
1. INSTRUCTION FOLLOWING AND REASONING
LLMs with some measure of fluency, memorised knowledge, and creativity have existed since as far back as 2018 (GPT-1). GPT-3, a version of which underpins ChatGPT, was released in 2020 and yet failed to permeate the technology landscape. What changed (and enabled an entire class of models to be used off-the-shelf for countless enterprise purposes) was training it to follow human instructions, which picked up speed with the release of InstructGPT in 2022 and reached exit velocity with ChatGPT later that year. Previously, generative models (BART, T5, etc) had to be extensively fine-tuned for every type of task you wanted them to perform, e.g. text summarisation, query generation. Even GPT-3 had to be tricked into following instructions with clever yet unreliable prompts. But with instruction-tuned (we’ll use that to encompass both task-tuning and RLHF) models, you could simply describe the task, and the model would somewhat reliably interpret the instructions to generate a plausible answer. All of a sudden, projects could skip the costly data collection and training processes that were required before, with even better results!
This ability is the cornerstone of the enterprise value that can be derived from LLMs: follow text instructions, perform basic reasoning, and provide a text output. Without it, LLMs regress to 2021, when they could be used for plenty of fun, but not enterprise-valuable, applications. It’s so important that I recommend the following mindset while ideating and selecting LLM use cases: think of the LLM as an engine that transforms text inputs into text outputs by following instructions.
This framing correctly focuses thinking on the transformation of inputs into outputs by following instructions. It avoids treating LLMs as a knowledge source. It avoids focusing solely on their conversational power, because that is just the tip of the iceberg and the majority of enterprise value lies under the water, beneath the chit-chat. It focuses on processing an input at a time, rather than generating a single output over an entire dataset. The disclaimer to this impressive skill is that instruction following and reasoning performance depend on the complexity and nature of the task. Rarely can it be relied on to make business-critical decisions, and some types of tasks (like numerical reasoning) can cause even the strongest LLM to err spectacularly.
2. NATURAL LANGUAGE FLUENCY
Since the early generative models of 2018, LLMs have steadily improved their fluency in natural language, making fewer mistakes and generating text in more tones, formats, styles, and languages (even programming languages!). To explore how far we’ve come, check out the Write with Transformers demo from Hugging Face. (Note: since I wrote this, the webpage containing models other than GPT-2 seems to have broken, so the link only points to the GPT-2 version.) The older models hosted there are still quite fluent, albeit without the level of intelligence behind them to which we’re now accustomed in 2023. They’re also poorly suited to task-oriented fluency due to the lack of instruction-tuning, which nowadays often includes chat-tuning to produce seamless conversational experiences. However, it’s reductive and dangerous to focus on chat as the only way to interact with or benefit from LLMs; just know that these models can produce fluent language under nearly any condition. As a side note to fluency, let’s propose another capability: creativity. This won’t get its own section as it can be considered a combination of fluency, reasoning, and some X-factor. Suffice to say that LLMs can produce surprisingly novel and interesting
outputs, even as far back as the demo linked above. Debate continues to rage about whether the models are ‘truly creative’, but when you ditch the philosophy and focus on the capability, the answer is resoundingly: yes! However, creativity is at odds with many enterprise use cases, but fortunately it is easily discouraged through narrow, grounded, task-oriented prompting.
3. MEMORISED KNOWLEDGE
LLMs accumulate a vast amount of ‘knowledge’ during their pre-training task of attempting to autocomplete text across billions of words. This feature strongly contributed to the hype and utility of ChatGPT upon its release – it’s like a search engine that only gives me what I need! However, the community quickly discovered the limitations of this knowledge:
1. The memorised knowledge is unreliable and unverifiable.
2. The memorised knowledge cannot be edited without building a new model.
We are all now familiar with ‘hallucinations’, or factually inaccurate generations, as a harmful byproduct of trying to extract memorised knowledge from a large language model. Information also changes over time; both issues could clearly harm an enterprise LLM project if unmitigated. The accepted (and only feasible) practice to address these issues is called retrieval-augmented generation, or RAG. Factual information is stored in an external knowledge base, information relevant to a particular task is retrieved, and then the LLM is fed that information and instructed to consider only the information provided to it when generating. Another victory for instruction-tuning! But that does mean that most enterprise use cases now require a search engine or database and the requisite data readiness to support it. There are still some use cases where the memorised knowledge in a large language model can be useful. I’ve even found some small applications of it in my designs. So don’t totally count it out, but in general, assume that your system’s knowledge will come from outside the model and treat the memorised knowledge with caution.
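As a rough sketch of the RAG pattern just described, the following Python outline retrieves context and instructs the model to answer only from it. Both `search_knowledge_base` and `call_llm` are hypothetical stand-ins for your own search engine (or vector store) and LLM API.

```python
# A minimal sketch of the RAG pattern just described. Both helpers are
# hypothetical stand-ins: `search_knowledge_base` for your search engine or
# vector store, and `call_llm` for your LLM provider's API.
from typing import List

def search_knowledge_base(query: str, top_k: int = 3) -> List[str]:
    raise NotImplementedError("stand-in for your search engine / vector store")

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your LLM provider's API call")

def answer_with_rag(question: str) -> str:
    # 1. Retrieve the factual information relevant to this question.
    context = "\n\n".join(search_knowledge_base(question))
    # 2. Instruct the model to answer from the retrieved context only,
    #    keeping its unverifiable memorised knowledge out of the loop.
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```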
MAJOR USE CASE CATEGORIES
The capabilities of instruction following and reasoning, natural language fluency, and memorised knowledge manifest in use cases that can be grouped into the following categories, from least to most complex:
1. Data transformations
2. Natural language interfaces
3. Workflow automations
4. Copilots and assistants
5. Autonomous agents
Nearly every use case can either be classified into one of these categories or be composed from them. If you’re not sure how a certain project idea of yours might map into these, let me know!
1. DATA TRANSFORMATIONS
The simplest thing you can do with an LLM is to submit a text input and receive a text output. That can be just a single call to a model, although it can be more complex, too, including advanced prompting systems (chain-of-thought, tree-of-thought, etc), guided generation (guidance, LMQL, outlines), or context from a conversation. But what these have in common is that the input to the system is text, the output is text, and no external tools or data are used. Within this category, there are two very distinct subgroupings.
DATA TRANSFORMATIONS TYPE 1: AUGMENTATION OR REFORMATTING
These are typically data solutions, not user-facing, that transform data in some pipeline. The outputs may eventually be presented to a user, but usually with other application layers in between. Guided generation is extremely important for achieving output formatting, for integration with the rest of the pipeline. These systems are often very quick to spin up compared to their non-generative AI-based alternatives, but may be prone to bias and require data collection to evaluate (which is a normal part of the training and evaluation process for most non-generative alternatives). Here are some examples:
SUMMARISATION
● E.g. An email thread summarisation feature in a customer support tool.
● Highlights reasoning and fluency capabilities.
CLASSIFICATION
● E.g. Tag podcasts with topics by classifying their transcript text: ‘entertainment’, ‘culture’, ‘technology’, etc.
● Highlights reasoning capabilities.
FEATURE EXTRACTION
● E.g. Extract key words and phrases from customer complaint tickets.
● Highlights reasoning capabilities.
FORMAT CONVERSION
● E.g. Convert free-text feedback messages into JSON objects that can be submitted as bug tickets (see the sketch after this list).
● Highlights reasoning capabilities.
INFORMATION EXPANSION
● E.g. Get definitions or synonyms for open-domain terms.
● Highlights memory capabilities.
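Picking up the format-conversion example from the list above, here is a minimal sketch of such a transformation in Python. The `call_llm` helper is a hypothetical stand-in for your LLM API, and in practice a guided-generation library helps enforce the output schema.

```python
# A sketch of the format-conversion transformation from the list above:
# free-text feedback in, a structured JSON bug ticket out. `call_llm` is a
# hypothetical stand-in for your LLM API; in practice a guided-generation
# library (guidance, LMQL, outlines) helps enforce the schema.
import json

def call_llm(prompt: str) -> str:
    raise NotImplementedError("stand-in for your LLM provider's API call")

def feedback_to_ticket(feedback: str) -> dict:
    prompt = (
        "Convert the feedback below into a JSON object with exactly these keys: "
        '"title", "severity" (one of "low", "medium", "high"), "description". '
        "Return only the JSON.\n\n"
        f"Feedback: {feedback}"
    )
    raw = call_llm(prompt)
    # Parse and validate; in a real pipeline you would retry or repair on failure.
    return json.loads(raw)
```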
DATA TRANSFORMATIONS TYPE 2: MODEL-ONLY CHATBOTS
In contrast to the ‘data product’ examples above, these examples are very much user-facing and familiar. They are model-as-a-product, with a user interface. Their limitations have to do with limited knowledge (relying on model memory or a short corpus). It may seem strange
to think of chatbots as data transformations, but that is all they are – text in, text out. No contact with external systems. Examples:
MODEL-ONLY CHATBOT
● E.g. ChatGPT without plugins.
● Highlights reasoning, fluency, and memory capabilities.
SMALL-CORPUS CHATBOT (ENTIRE KNOWLEDGE BASE INCLUDED IN PROMPT)
● E.g. Chat-a-document, InstaBase AI Hub.
● Highlights reasoning and fluency capabilities.
IMPACT
The value of these use cases can vary widely. On one hand, operations like classifying records with an LLM can typically be performed with non-generative methods equally well or better, and at lower operational costs. But the startup costs are much higher, and if your team doesn’t have the skills or time to develop that system, quick-starting using an LLM could unlock a capability you didn’t have time for before, or allow you to prototype much faster. And single-document chatbots and question-answering systems can improve review efficiency, but that only matters if you have a document-review process that happens frequently, and if it can’t be solved by simply highlighting keywords or extracting consistently formatted fields. So, as with any use case, the impact depends on your business. Here are some clues to help you spot good ones:
● You want to perform AI data transformations without having machine learning/data science skill sets on the team.
● You want to prototype data transformations to gauge impact before investing in a machine learning/data science project.
● You have frequent document review processes that deal with variable language or formatting.
2. NATURAL LANGUAGE INTERFACES (NLI)
This is the most common of the enterprise LLM pilot project types. Connect an LLM to a knowledge base or (less commonly) a tool, like an internal API, to automate question answering or basic tool usage.
CHAT-YOUR-DATA
● E.g. A customer support chatbot that answers questions based on product documentation indexed in a search engine. Demos like Weaviate Verba.
● Note: May seem similar to the small-corpus chatbot mentioned above. The difference is that the entire knowledge base cannot be included in a single model prompt. This requires a system to retrieve the relevant records, which introduces a lot of complexity.
● Possible forms of knowledge bases: search engines, graph databases, data warehouses, CMS, etc.
● Highlights reasoning and fluency capabilities.
TOOL-USER CHATBOT
● E.g. A chatbot that can book flights by interacting with an API. ChatGPT with Plugins (can also be a chat-your-data example, depending on the plugin type).
● Possible forms of tools: run a query against a database, submit a ticket via API.
● Highlights reasoning and fluency capabilities.
IMPACT
It can be easy to overestimate the value of these applications. Essentially, they improve the usability of existing search engines and business tools – they rarely introduce new capabilities of their own. If you don’t have a search engine, you’ll need to build one as part of the project, and that will provide a step-change in capabilities. The natural language interface provides efficiencies on top of that. However, if your data is chaotic you may have trouble getting a search engine working, and then your NLI on top of it is useless. Do not underestimate the difficulty of enterprise search. With tools it’s similar: most tools’ programmatic interfaces get translated into a UI with buttons and fields… so make sure that using natural language would be more valuable than that type of interface. On the other hand, the visibility and familiarity of these applications can help give members of your organisation a taste of LLMs, which can be valuable on its own. This is why so many leaders are choosing NLIs as the subject of their pilot projects. Here are my suggestions for the best opportunities to look for:
● Enterprise search systems already in use and generating value, but without an NLI.
● Well-organised and heavily trafficked data sources, with bounded use cases, that would benefit from search.
● A business tool with a busy interface that frustrates non-technical users.
3. WORKFLOW AUTOMATIONS
Psst: everybody is focused on the other categories, but this one has the potential to generate the most value at your business, IF you can find the right place for it. Put yourself in the shoes of a random business function at your organisation. Over the course of a week,
that function performs dozens of distinct workflows (a workflow being a collection of tasks towards an outcome).
● Some repeat often, some rarely. (Frequency)
● Some are quick and easy, others are slow and tedious. (Burden)
● Some vary wildly, others are nearly the same each time. (Variability)
● Some involve using tools and outside knowledge sources, some require interactions with colleagues or clients, some are based solely on the input artefact. (Resource complexity)
● Some require complex reasoning processes, others simple operations. (Process complexity)
If you choose workflows in the best slice of these axes, you can completely automate them and generate massive business value. With the right execution you can save employee hours and deliver outcomes near-instantly. Of course, if you choose the wrong workflow you’ll find it an infeasible automation target, or it won’t generate any value and nobody will realise it exists. Keep reading to see how to choose the right one. An implementation generally looks like this:
1. Define the inputs, outputs, process steps, and external resources necessary to complete the workflow.
2. Create logic to represent the process steps, and interfaces with any of the external resources necessary.
EXAMPLES
It sounds simple! But of course the complexity depends on the application. Here are some examples of what effective workflow automations might look like in your business:
Comparing company policies against updated regulations and surfacing risks. Inputs/resources are a collection of policy documents and a collection of regulation documents. Outputs are suspected conflicts, with the respective sections from both policy and regulation documents.
● FREQUENCY: Medium (whenever policies or regulations change).
● BURDEN: High (reading through policies and regulations and comparing them is time-intensive, error-prone, and tedious).
● VARIABILITY: Low (the review process is nearly identical every time).
● RESOURCE COMPLEXITY: Low (two narrowly scoped data sources – policies and regulations – is very simple, as far as these projects go; and keeping the human review for afterwards means the workflow can run end-to-end).
● PROCESS COMPLEXITY: Low (though it is time-consuming to evaluate every statement in a policy against a corpus of regulations, it is a relatively simple task with few steps; some domain-specific reasoning may be involved, though).
Filling out e-commerce forms using product documentation, for submitting products to new marketplaces. Inputs are forms to complete. Resources are product documentation. Outputs are completed forms.
● FREQUENCY: Medium (any time you have a new or updated product, or need to submit your products to a new marketplace).
● BURDEN: Medium (filling out forms is fairly easy, but can take a while).
● VARIABILITY: Low (the forms and product documentation may be fairly consistent).
● RESOURCE COMPLEXITY: Low (the forms are simple, but it all depends on the product documentation; if it’s well organised, this is very simple).
● PROCESS COMPLEXITY: Low (usually, little reasoning needs to happen when filling out a form from documentation – it’s a mechanical process).
Reviewing engineering change requests for compliance, quality and traceability. Inputs are change requests. Resources are policies and quality guidelines. Outputs are audits of requests, containing suspected risks or gaps in documentation.
● FREQUENCY: High.
● BURDEN: Medium.
● VARIABILITY: Low (reviewing a request should involve roughly the same steps, each time).
● RESOURCE COMPLEXITY: Low (the change requests should have a consistent style and format, and the policies and guidelines should be easily circumscribed).
● PROCESS COMPLEXITY: Medium (using the resources to audit the change request could be straightforward, but there may be more advanced reasoning involved in assessing quality, for example).
What makes this impact possible is circumscribing and defining the workflows. We’re not expecting LLMs to figure out how to solve new problems – that’s the job of autonomous agent systems (see the section below). Instead, we’re turning a business process into a pipeline of transformations, interactions, and decisions. To keep your mind open to the best opportunities, do not fixate on solutions with a conversational/chat interface. If you limit yourself to chatbots, you lose the document comparison, form filler, and automated review use cases above, among others. Many of these automations are best implemented as a behind-the-scenes pipeline rather than a user-facing app.
Here’s a basic guide to selecting candidates for automation:
1. Select for high impact, as measured by frequency × burden. This translates to hours saved, minus some time for manual review of the automation output. For a work function, create a Pareto chart of the workflows and their frequency, then multiply the frequency values by the time burden. Take the 20% of workflows on the left and move to the next stage. (A toy scoring sketch follows after this list.)
2. From the impactful candidate workflows, select the most feasible to automate. This is determined by the other three attributes, which we want all to be low:
VARIABILITY
● Can you create a flowchart of the workflow that’s followed each time?
● If the process to complete the task changes every time, it may be a poor fit for automation. Instead, what we’re looking for is a repeatable process for which the inputs or data change every time. This has likely made it resistant to automation in the past, but LLMs’ reasoning abilities may now render automation feasible.
EXTERNAL SYSTEM COMPLEXITY
● Can you draw a clean box around the resources (information, tools, and interactions) involved in the workflow?
● If a workflow normally involves looking things up in a narrowly scoped set of documents, great – that’ll be easy to automate. If it involves performing Google searches and using many information sources that change as you go, it’ll be tougher.
● If collaboration with colleagues or clients is part of the workflow, it may still be feasible to automate, but it is clearly more complex than otherwise. We’re moving toward an enterprise where automated processes send messages to employees to collect the information necessary to perform their automated workflow. Since this is still a new idea, treat these workflows carefully, as the employees the system depends on create another risk for your project.
PROCESS COMPLEXITY
● How complex are the operations in the workflow?
● LLMs are impressive reasoning engines that newly enable automation of countless operations. However, they’re not a perfect replacement for expert decision-making and have some critical weak spots, such as numeric reasoning (although that can be outsourced to symbolic systems). A good rule of thumb here is: for each operation in the workflow, could you teach it to a new employee with a one-pager or less?
3. Control or avoid risk. Avoid any use cases where bias is a consideration. Include some form of human review as part of the process, usually as a final step before delivery. Build and test iteratively, with stakeholders involved throughout the process.
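To make step 1 concrete, here is a toy Python sketch of the impact scoring; the workflows and numbers are invented purely for illustration.

```python
# A toy sketch of the impact-scoring step above: impact = frequency x burden
# (hours per run), sorted so the top slice moves on to the feasibility check.
# The workflows and numbers are invented purely for illustration.
candidates = [
    # (workflow, runs per month, hours per run)
    ("Policy-vs-regulation review", 4, 10.0),
    ("Marketplace form filling", 8, 1.5),
    ("Change-request audit", 30, 0.75),
    ("Ad-hoc research memo", 2, 3.0),
]

scored = sorted(
    ((name, freq * burden) for name, freq, burden in candidates),
    key=lambda item: item[1],
    reverse=True,
)

cutoff = max(1, round(0.2 * len(scored)))   # keep roughly the top 20%
for name, impact in scored[:cutoff]:
    print(f"{name}: ~{impact:.1f} hours/month saved (before review overhead)")
```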
IMPACT
The potential value of these solutions is straightforward: employee hours saved and outcomes accelerated. Reaching this potential depends on choosing appropriate automation candidates and executing the project well. The challenges are significant, but if your organisation is looking for actual business impact rather than flashy demos, this is the answer. That very strength is also the drawback, though: these implementations will not result in a fun chatbot everyone can play with to spread the LLM gospel. And integrating this automation tightly into your business creates upside, but also risk. Workflow automations are cold, hard business tools, not toys.
4. COPILOTS
Take a natural language interface and connect it to additional data sources and systems. Maybe integrate it with an existing text editor. If you like, even connect it to some workflow automations. Now you’ve got a copilot! The idea is: with each knowledge source and tool integrated, and the deeper that integration goes, the greater the utility. The goal of copilots is to dramatically increase the efficiency of the user across a broad set of activities. Here are some examples of what that might look like:
ADVANCED CUSTOMER SUPPORT CHATBOT
● Answer questions from documentation
● Answer questions from user account data
● Perform actions on behalf of the user
CODING ASSISTANT, E.G. GITHUB COPILOT X
● Autocomplete code
● Create and rewrite code based on knowledge of the codebase
● Answer questions about the code and codebase
These systems are very complex and require an immense amount of engineering to perfect. Unsurprisingly, some of the best current examples come from Microsoft (GitHub Copilot, Microsoft 365 Copilot), which has been pouring resources into LLM-related solutions for years.
IMPACT
As these systems seep into our everyday lives, it will become evident that they can make us several times more productive. Even the old, autocomplete-only GitHub Copilot automated 25-60% of users’ code. The same is happening with word processing, and I’m sure you’ve already used it in emails. If these simple augmentations can approach 2x efficiency (in the GitHub example), then imagine how effective they will be with deep integrations with other systems. The value is very real, but very difficult to reach. For now, expect these solutions to come from extremely sophisticated tech organisations with deep pockets. However, building your own NLIs and workflow automations is a good way to progress towards that goal, generating value along the way.
5. AUTONOMOUS AGENTS
Now imagine a copilot that pilots itself. Remember those workflows we talked about automating, and the ones we ignored?
The undefined, long tail of workflows is what autonomous agents seek to address, by integrating as many knowledge sources and tools as possible and implementing an executive reasoning system that can plan and execute, doing the job of the user in a Copilot-style partnership. Imagine a system that can completely replace human employees, using the same tools and knowledge sources that they do. Currently there is a massive amount of capital being invested in developing these systems ($1.3B to Inflection AI, $415M to Adept AI, etc), but they aren’t changing the world just yet. As it turns out, the complexity of navigating the entire digital world is rather high. In domains of limited scope, we know that automation works. But once you add all the tools, all the knowledge sources, all the possible environments… the long tail of complications explodes and reliability tanks. Sound familiar? Autonomous cars present similar benefits but encounter the same problem: in limited environments they are reliable, but once you throw them into the infinite complexity of the real world, the issues never seem to end. And, sure – in 2023 they’re pretty advanced. But it took decades of work and $75B to get here. Don’t expect LLM agents to be much easier.
IMPACT
Autonomous agents represent escape velocity for the productivity of AI systems. But we are not there yet, and making progress towards them remains the domain of tech giants and startups with hundreds of millions of dollars to spend toward uncertain returns.
ISSUE 5
ALEC SPROTEN & IRIS HOLLICK: ADVANCED DEMAND FORECASTING AT BREUNINGER
THE PATH TO RESPONSIBLE AI

FUTURE ISSUE RELEASE DATES FOR 2024
ENTERPRISE DATA & LLMS
HOW ML IS DRIVING PATIENT-CENTRED DRUG DISCOVERY
21st May | 3rd September
WOULD YOU LIKE TO BE FEATURED IN THE DATA SCIENTIST MAGAZINE? We welcome contributions from both individuals and organisations, and we do not charge for features.
Key Benefits of Getting Published: EDUCATIONAL CONTRIBUTION By sharing your technical know-how and business insights, you contribute to the larger data science community, fostering an environment of learning and growth.
THOUGHT LEADERSHIP A feature in an industry magazine positions you or your company as a thought leader in the field of data science. This can attract talent, partners, and customers who are looking for forward-thinking businesses.
RECRUITMENT A feature can showcase your company’s work culture, projects, and achievements, making it an attractive place for top-tier data scientists and other professionals.
BRAND EXPOSURE An article can significantly enhance a company’s or individual’s visibility among a targeted, niche audience of professionals and enthusiasts.
SHOWCASING SUCCESS STORIES Highlight your personal achievements or company-level successful projects, providing proof of your expertise and building trust in your capabilities.
To set up a 30-minute initial chat with our editor to talk about contributing a magazine article please email: imtiaz@datasciencetalent.co.uk