Pinterest - June 2023


How data engineering is powering Pinterest’s global platform

DIGITAL REPORT 2023

PINTEREST

Pinterest is the visual inspiration platform people around the world use to shop for products personalised to their taste, find ideas and crafts to do offline, and discover the most inspiring creators.

Beginning as a tool to help people collect the things they were passionate about online, today more than 460 million people flock to Pinterest’s platform every month to explore and experience billions of ideas.

Central to powering this platform is data engineering on a vast scale – as Dr. Dave Burgess, VP of Data Engineering at Pinterest, explains.

“In Data Engineering we create and run reliable and efficient planet-scale data platforms and services to accelerate innovation at Pinterest and sustain our business,” he says. “We do everything from online data systems to logging data, big data and stream processing platforms, analytics and experimentation platforms, machine learning (ML) platforms, and the Pinterest Developer Platform for external developers to build applications using Pinterest APIs.”

4 pinterest.com PINTEREST
With more than 460 million ‘Pinners’ using its platform each month, Pinterest is powered by its data architecture, ML and experimentation platforms

As Burgess explains, one of the biggest challenges in data engineering is improving Pinterest’s developer productivity, which is measured through surveys and the time taken to complete tasks: “For example, the time it takes to train and deploy a new machine learning model or run an experiment.”

From the survey results, a developer productivity NPS (Net Promoter Score) is calculated, from +100 to -100. “When I first started at Pinterest four years ago, our developer productivity NPS was -5 and now it’s +65.”
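The article does not say how Pinterest scores its developer survey, so the following is only a minimal sketch of the conventional Net Promoter Score calculation. It assumes ratings on a 0 to 10 scale, with 9 and 10 counted as promoters and 0 to 6 as detractors; the function name and the sample ratings are hypothetical.

```python
def net_promoter_score(ratings):
    """Return the NPS (between -100 and +100) for a list of 0-10 ratings.

    Assumes the conventional buckets: 9-10 are promoters, 0-6 are
    detractors, and 7-8 are passives that count only towards the total.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    # NPS is the percentage of promoters minus the percentage of detractors.
    return 100.0 * (promoters - detractors) / len(ratings)


# Hypothetical survey responses, for illustration only.
sample_ratings = [10, 9, 8, 7, 6, 5, 9, 10, 3, 8]
sample_nps = net_promoter_score(sample_ratings)
```

Under this definition a score of -5 means detractors slightly outnumber promoters, while +65 means promoters exceed detractors by 65 percentage points of all respondents.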

Since joining the business four years ago, Burgess has overseen the replacement of many of Pinterest’s data engineering systems with the latest in open-source software. “We’ve also built machine learning and experimentation platforms on top of our data platform, increased ML Engineering velocity by 10x and run hundreds of new experiments every week,” he adds. “We’ve also democratised our data so that everyone in the company can use data to make decisions, build applications and experiment. All of this has significantly improved our agility, developer productivity, and the products for our customers.”

Year founded: 2010
Revenue in 2022: $2.8bn
Employees around the globe: 4K+
Monthly users: 460m+


Pinterest’s machine learning and experimentation platforms

‘Under the covers’, according to Burgess, Pinterest is a ‘massive ML machine’: “We use ML to generate recommendations for our home feed, search results, related products, advertising, and also have augmented reality for our Pinners (the affectionate name we call our users) to see makeup on their face.”

Central to Pinterest’s success is its ML platform. It powers everything from product recommendations and image categorisation to online advertising and spam filtering, and Burgess explains that it enables Pinterest’s engineers to be significantly more productive.

“Our ML engineers can iterate much more quickly, building and deploying new ML models in a day, performing offline training to iterate and improve their models offline before testing them with real production traffic, and have production ML systems be automatically monitored and self-healed,” he comments.

One such tool is Pinterest Lens, a visual search tool allowing users to search for ideas and products using images. The tech trick behind this feature is computer vision, which identifies objects in photos to suggest related content, allowing users to find similar items on Pinterest. These innovations, Burgess explains, are powered by open-source and internal advancements in ML technology.

“Our ML platform is built with a combination of open source ML technologies, like PyTorch, TensorFlow and MLflow, and tech that integrates with our own big data and online systems,” he explains. “That enables us to train ML models and automatically deploy them into serving systems for ML inference.”

Pinterest is an organisation defined by a culture of experimentation. As Burgess describes, its Experimentation Platform encourages experimentation and data-driven decision-making throughout the whole organisation, while also enabling it to test thousands of new ideas.

“Our Experimentation Platform is designed to support rapid iteration and the continuous improvement of our products, and allow us to quickly test and refine new features, user interfaces, and other elements of the user experience. By using data to guide our product development decisions, Pinterest is able to better meet the needs and preferences of our users, as well as increase inspiration.”
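The article does not describe the statistics behind the Experimentation Platform. As a generic illustration of how a single A/B experiment might be read, here is a minimal sketch of a two-proportion z-test using only the Python standard library. The function name, the metric (a conversion count per group) and the sample numbers are all assumptions, not Pinterest’s method.

```python
import math


def two_proportion_z_test(control_hits, control_n, treatment_hits, treatment_n):
    """Compare two conversion rates; return (z statistic, two-sided p-value).

    Uses the pooled-proportion standard error, the usual choice when the
    null hypothesis is that both groups share the same underlying rate.
    """
    p_control = control_hits / control_n
    p_treatment = treatment_hits / treatment_n
    pooled = (control_hits + treatment_hits) / (control_n + treatment_n)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_n + 1 / treatment_n))
    if std_err == 0:
        # Both groups converted at exactly 0% or 100%: no evidence of a difference.
        return 0.0, 1.0
    z = (p_treatment - p_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 10,000 users per group.
z_stat, p_val = two_proportion_z_test(1000, 10_000, 1200, 10_000)
```

A small p-value suggests the difference between groups is unlikely to be noise; a production platform would typically add safeguards such as multiple-comparison corrections that this sketch leaves out.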


Pinterest had long recognised the need to optimise its data storage system. Using HBase, the image sharing platform was carrying a large footprint, with more than 50 clusters and data totalling one petabyte. Enter PingCAP, an enterprise company launched in 2015 by seasoned infrastructure engineers frustrated with the way databases were managed, scaled and maintained.

Seeing no capable solutions on the market, they built TiDB, an advanced, open-source, distributed SQL database for powering modern applications with elastic scaling, real-time analytics and continuous access to data.

What was Pinterest trying to achieve?

“Pinterest’s storage and caching team wanted to find their next-generation, unifying storage system,” explains Liquan Pei, Principal Technologist at PingCAP. “As a NoSQL database, HBase offers a very simple key value interface, but the business logistics are complex. To add new features, Pinterest had to build additional layers on top of HBase, which incurs a very high maintenance workload.” With those motivations in mind, Pinterest evaluated more than 15 solutions and settled on TiDB in 2020. Pei says the reason for Pinterest choosing PingCAP came down to TiDB’s robust technical capabilities and PingCAP’s high-quality enterprise support.

PingCAP brings lasting benefits to Pinterest data operations

TiDB is set to bring a host of benefits to Pinterest’s day-to-day operations. When carrying out the project, PingCAP evaluated Pinterest’s secondary index services system and, using TiDB, achieved better performance and an 80% cost reduction.

“Because of TiDB’s capabilities, we were able to reduce the system from six components to one, greatly reducing the maintenance burden,” adds Pei. In the long run, TiDB’s expressiveness and scalability should also help Pinterest’s IT teams from a practical perspective. Pei continues: “People from Pinterest will enjoy peace of mind because a lot of work is handled by TiDB, so they can focus instead on more impactful work.”

As Pinterest went in search of a next-generation, unifying data storage system, the company found the perfect solution in PingCAP’s TiDB

A next-generation data warehouse

One of a number of changes made to Pinterest’s data systems involves the building of a next-generation data warehouse and the transition to a Data Mesh: an emerging approach to data architecture that aims to address the challenges of managing large and complex data environments, first introduced in 2019 by Zhamak Dehghani, a software architect at ThoughtWorks.

“At a high level, Data Mesh is a decentralised data architecture that emphasises data ownership and autonomy,” Burgess explains. “Rather than having a central data team manage all the data for an organisation, Data Mesh encourages each business unit or team to take ownership of their own data domains, managing their data in a way that is best suited to their needs.”

This approach involves breaking down data into smaller, more manageable domains that can be owned and managed by individual teams. Each team is responsible for the data within their domain, including defining the schema, ensuring data quality, and providing access to other teams that need to use the data.
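The responsibilities listed above (schema, data quality, access) can be pictured as a small per-domain contract. The sketch below is purely illustrative: the `DataProduct` class, its fields and the example domain are hypothetical and do not reflect how Pinterest actually models its domains.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """A hypothetical descriptor for one team-owned data domain."""

    name: str
    owner_team: str
    schema: dict                      # column name -> expected Python type
    quality_tier: int = 3             # assumed convention: 1 is the highest tier
    consumers: set = field(default_factory=set)

    def validate(self, record):
        """Return a list of schema problems found in one record."""
        problems = []
        for column, expected in self.schema.items():
            if column not in record:
                problems.append(f"missing column: {column}")
            elif not isinstance(record[column], expected):
                problems.append(f"{column}: expected {expected.__name__}")
        for column in record:
            if column not in self.schema:
                problems.append(f"unexpected column: {column}")
        return problems

    def grant_access(self, team):
        """The owning team decides which other teams may read the data."""
        self.consumers.add(team)

    def can_read(self, team):
        return team == self.owner_team or team in self.consumers


# Hypothetical domain owned by a hypothetical team.
pins = DataProduct(
    name="pin_saves",
    owner_team="home-feed",
    schema={"pin_id": int, "board": str},
    quality_tier=1,
)
pins.grant_access("ads")
```

Keeping the schema, quality tier and access list next to the owning team's name is one simple way to express that each domain, rather than a central data team, is accountable for its data.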

To enable collaboration and sharing across domains, Pinterest has a catalogue of schemas and metadata stored in the open-source DataHub, has standardised its data vocabularies and metrics, has tiered the quality of its data, and has integrated its open-sourced Querybook platform to collaborate and share SQL queries.

“Querybook is an open-source data collaboration platform developed by Pinterest,” Burgess explains. “It has a user-friendly interface for data analysts and engineers to collaborate on data analysis tasks, allowing them to share queries, datasets, and insights with one another. It’s the most popular and highly-rated internal tooling platform at Pinterest.”

As Burgess describes, Querybook also benefits from advanced data analysis capabilities for ad-hoc data analysis, generating visualisations, and even building machine learning models: “We’ve also built a ChatGPT-like interface to automatically generate and execute queries from a text business statement. For example, you could ask it how many daily active users there are on Pinterest over the past month and it will generate a SQL query with the right tables and fields.”
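To make the daily-active-users example concrete, here is a hand-written sketch of the kind of SQL such a question might translate to, run against an in-memory SQLite database. The `user_activity` table, its columns and the sample rows are hypothetical; the article does not show Pinterest’s actual tables or the queries its assistant generates.

```python
import sqlite3

# Hypothetical query for "daily active users over the past month".
DAU_QUERY = """
    SELECT activity_date, COUNT(DISTINCT user_id) AS daily_active_users
    FROM user_activity
    WHERE activity_date > date(:as_of, '-30 days')
      AND activity_date <= :as_of
    GROUP BY activity_date
    ORDER BY activity_date
"""


def build_demo_db():
    """Create an in-memory database with a few made-up activity rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE user_activity (user_id TEXT, activity_date TEXT)")
    conn.executemany(
        "INSERT INTO user_activity VALUES (?, ?)",
        [
            ("u1", "2023-06-29"),
            ("u2", "2023-06-29"),
            ("u1", "2023-06-29"),  # repeat visit, counted once per day
            ("u1", "2023-06-30"),
            ("u3", "2023-04-01"),  # outside the 30-day window
        ],
    )
    return conn


def daily_active_users(conn, as_of):
    """Return (date, distinct users) rows for the 30 days ending at as_of."""
    return conn.execute(DAU_QUERY, {"as_of": as_of}).fetchall()


rows = daily_active_users(build_demo_db(), "2023-06-30")
```

The value of a text-to-SQL assistant lies in choosing the right tables and fields automatically; the query itself is ordinary SQL of the shape shown here.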

“Overall,” Burgess asserts, “Data Mesh represents a new way of thinking about data architecture that helps us to manage our large and complex data environment more effectively, while also fostering greater collaboration and innovation.”


Building a successful partner ecosystem

Pinterest’s Data Engineering department works with a number of third-party partners, including AWS for cloud infrastructure and Percona for MySQL support, along with a number of other companies on open source software such as Netflix, Lyft, Airbnb, AWS, Starburst (for Presto/Trino), StarRocks Technologies, and Preset (for Superset), as well as close collaborations with the open source community.

Another of Pinterest’s partners, PingCAP, has assisted with the deployment of its TiDB system: a distributed SQL database engine that provided users with better data consistency, reducing tail latencies by 30-90% while reducing hardware instance costs by more than 50%.

“We had been using an older version of HBase for many years, which is a scalable open-source, distributed, column-oriented NoSQL database,” Burgess explains. “We’ve made many fixes to HBase over the years to make it fault-tolerant at our scale on AWS, used it for different kinds of use cases, and added a lot of functionality on top.”

“The biggest pain points with this older version of HBase were: the total cost of ownership to maintain and run this; limited functionality, which led to lower engineering productivity and increased application complexity; the lack of data consistency across tables, affecting our users’ experience; and the scalability requirements our internal users wanted to run at.”

This partnership with PingCAP to use TiDB is already reaping benefits, providing better data consistency, a lower total cost of ownership, and more powerful features than the previous solution, HBase.

“As a NewSQL database, TiDB provides a scalable solution in a huge problem space for use cases that need stronger consistency or richer functionalities”, Burgess explains. “It fills in the gap between our existing SQL and NoSQL systems, allowing developers to build storage applications faster without making painful tradeoffs.”

“All these factors combined enable us to more easily build and scale business-critical applications including shopping catalogues, advertising index systems, trust and safety systems and many more.”


What are Pinterest’s main aims for the next five years?

As Burgess describes, central to Pinterest’s plans for the future is innovating and creating new technologies and products that put Pinners first. “This means enhancing the user experience and driving growth internationally.”

The organisation will also look to improve its advertising products and expand its advertising partnerships with businesses of all sizes, while becoming a more sustainable and socially responsible company. Reducing its environmental impact is part of the latter, as is promoting diversity and inclusion, in addition to supporting causes related to social and environmental issues.

“We will make it easier for Pinners to shop for the things they love. They’ll be able to go from being inspired to making this a reality in their lives,” Burgess adds. “We will also be a more sustainable company, with almost 100% renewable energy for our operations. This includes renewable energy for our offices and data centres.”

With the space moving quickly, making the most of the opportunities presented by developments in ML and AI will also be central to Pinterest’s success going forward.

“This space is changing quickly with the recent advances in Large Language Models, Stable Diffusion, and Transformer models,” Burgess concludes. “We have the ability to generate images and text answers, augment ML models with more data, recognise objects in images, and create an augmented reality. We can also significantly improve our productivity with AI-assisted bots that generate code and answers.”

“There are many applications of this and it’s going to be a game changer.”

Pinterest
651 Brannan Street, San Francisco, CA 94107, USA
pinterest.com
