Building Data Pipelines on Google Cloud Platform

In today’s digital world, enormous volumes of data are churned out every day. That data might include information essential for businesses to thrive, for governments to function, and for us to receive the right products and services we ordered from an online marketplace.

As an entrepreneur or business owner in this century, you might have already considered hiring a data analyst to analyze and process the collected data and transform your business.

To process this data, data analysts use data pipelines. But what exactly do we mean by data pipelines, what are their features, and how can we use cloud platforms like Google Cloud to build them?

This article will help you understand everything about data pipelines, so without further ado, let’s get started!

What is a Data Pipeline?

“Pipeline”, in general, refers to a system of big pipes moving resources like natural gas or oil from one place to another. Undoubtedly, these pipelines are a faster means of carrying large amounts of material over large distances.

Read: What is Data Pipeline Architecture

Similarly, data processing pipelines act as a backbone working on the same principle for data ingestion. Data pipelines are the set of steps for data processing where the data is ingested at the initial stage of the pipeline, if that data has not been stored in the data platform already. The pipeline defines what, where, and how the data will be collected.

Simply put, a pipeline is a series of steps where each step produces an output that acts as the input for the next one; this continues until the pipeline is complete.

Moreover, a data pipeline includes three elements, namely a source, processing steps, and a destination (sink). With data pipelines, it becomes easier to transfer data from an application to a data warehouse, or from a data lake to an analytics database. It is also possible for a data pipeline to have the same source and destination; such a pipeline exists purely to modify the existing dataset. A data pipeline might also include filtering and resilience features for better performance.
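As a loose, purely illustrative sketch of those three elements outside any cloud service, a single shell command chain can act as a tiny pipeline: a hypothetical orders.csv is the source, the filtering and counting steps are the processing, and summary.txt is the destination (all file and column choices here are invented for illustration).

# source: orders.csv (product,date,amount) -> processing: keep 2024 rows, count per product -> destination: summary.txt
grep ',2024-' orders.csv | cut -d',' -f1 | sort | uniq -c > summary.txt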

Types of Data Pipeline

Data pipelines are divided into batch processing and streaming data pipelines.

● Batch Processing Data Pipelines

In batch processing data pipelines, “batches of data” are loaded into the repository at set time intervals, often scheduled during off-peak business hours. The batch is then queried by a software program or user when it is ready for processing, allowing them to explore and visualize the data.

Batch processing tasks form a sequenced workflow of commands, i.e., the output of one command becomes the input of the next one. One command may trigger column filtering, and the next may work on data aggregation, for instance (see the small command-chain sketch after this list).

Batch processing is the optimal data pipeline choice when there is no immediate requirement for dataset analysis.

● Streaming Data Pipelines

Streaming data pipelines are used when there is a near real-time data processing requirement. Unlike batch processing, streaming is about deriving insights from the data within milliseconds by ingesting datasets as they are created and continuously updating reports, metrics, or summaries in response to every event.

Read: Top 5 Data Streaming Tools

Streaming enables organizations to gain real-time analytics and act on up-to-date operational information without delay. Streaming data pipelines are better suited for social media or point-of-sale apps that need to update data and information instantly.
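To make the batch idea concrete, here is a minimal command-chain sketch, with a made-up timesheet.csv (name,department,hours) standing in for the batch: the first step filters columns, and the second aggregates the result, each step’s output feeding the next.

# step 1: keep only the department and hours columns
# step 2: aggregate total hours per department
cut -d',' -f2,3 timesheet.csv | awk -F',' '{sum[$1] += $2} END {for (d in sum) print d, sum[d]}'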

Data Pipeline Elements

Understanding the elements of a data pipeline will help you understand how it works. So, let’s take a brief look at these data pipeline components.

Read: What is DataOps

Source:

The source is the entry point of the data pipeline. The source can be a company’s storage system like a data lake or data warehouse, or other data sources such as IoT devices, APIs, transaction processing systems, and social media.

Destination:

The destination is the final point of the data pipeline where all the data collected from the source gets stored. More often than not, a data warehouse or data lake acts as the destination.

Dataflow:

Dataflow refers to the entire movement and the changes data undergoes while transferring from its source to its destination.

Processing:

Processing refers to the steps or activities involved in extracting or ingesting data from sources, transforming it, and moving it to the destination. It decides how the movement of the dataflow should be implemented.

Workflow:

In the data pipeline, the workflow focuses on defining the process sequence and its dependencies.

Monitoring:

Working with a data pipeline requires continuous monitoring to ensure data integrity and guard against potential data loss. Beyond that, monitoring the data pipeline helps check whether the pipeline’s efficiency is affected by an increasing data load.

Now that we have a better knowledge of data pipelines, it would be beneficial to understand what the Google Cloud Platform (GCP) is before we move ahead to building data pipelines on GCP.

Google Cloud Platform - An Overview

Google Cloud Platform is a suite of cloud computing services, running on the same infrastructure that Google uses internally for its own products like Google Drive, Gmail, and Google Search. GCP provides modular cloud services such as data storage, computing, machine learning, and data analytics, along with their management tools.

Read: 5 Ways Cloud Computing Can Benefit Web App Development

Platform as a Service (PaaS), Infrastructure as a Service (IaaS), and serverless computing environments are other examples of services that Google Cloud Platform offers.

Under the Google Cloud brand, Google has over 100 products. Some of the key services that we need to know are listed below:

● App Engine
● Google Kubernetes Engine
● Cloud Functions
● Compute Engine
● Cloud Run
● Cloud Storage
● Cloud SQL
● Cloud Bigtable
● Cloud Spanner
● Cloud Datastore
● Persistent Disk
● Cloud Memorystore
● Local SSD
● Filestore
● AlloyDB
● Cloud CDN
● Cloud DNS
● Cloud Interconnect
● Cloud Armor
● Cloud Load Balancing
● Virtual Private Cloud
● Network Service Tiers
● Dataproc
● BigQuery
● Cloud Dataflow
● Cloud Composer
● Cloud Dataprep
● Cloud Datalab
● Cloud Data Studio
● Cloud Shell
● Cloud APIs
● Cloud AutoML
● Cloud TPU
● Cloud Console
● Cloud Identity
● Edge TPU

Methods to Build Data Pipelines on the Google Cloud Platform

Before creating data pipelines, make sure to add the necessary IAM roles, such as datapipelines.admin, datapipelines.invoker, and datapipelines.viewer, to allow the corresponding operations.
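If you prefer the command line, roles like these can be granted with gcloud. This is only a sketch: PROJECT_ID and the member email are placeholders, and the exact role names to grant should be checked against your project’s IAM configuration.

gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:analyst@example.com" --role="roles/datapipelines.admin"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:analyst@example.com" --role="roles/datapipelines.invoker"
gcloud projects add-iam-policy-binding PROJECT_ID \
  --member="user:analyst@example.com" --role="roles/datapipelines.viewer"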

To create the data pipeline using the Google Cloud Platform, access the data pipeline feature from its console. A setup page will then open where you can enable the listed APIs before creating data pipelines. Now you can either import a job or create a data pipeline.
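Those APIs can also be enabled from the command line. The service names below are an assumption about what the setup page typically asks for (Dataflow, Data Pipelines, and Cloud Scheduler for recurring batch runs), so verify them against your own setup page.

gcloud services enable dataflow.googleapis.com
gcloud services enable datapipelines.googleapis.com
gcloud services enable cloudscheduler.googleapis.com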

Read: Principles of Web API Design

To create a pipeline, follow these steps:

● In the Google Cloud Console, go to the Dataflow Pipelines page and select ‘Create Data Pipeline’.

● Provide a name for the data pipeline, and fill in the other parameters and template selections on the pipeline template page.

● For a batch job, you can provide a recurrence schedule for the pipeline.

Now, to create a batch data pipeline, give your project access to a Cloud Storage bucket and a BigQuery dataset for storing the input and output data and creating the required tables.
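A quick sketch of creating those two resources from the command line, assuming a bucket named BUCKET_ID and the attendence_data dataset used in the example below:

# create the Cloud Storage bucket that will hold the schema, UDF, and input files
gsutil mb gs://BUCKET_ID
# create the BigQuery dataset that will hold the destination table
bq mk --dataset attendence_data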

Let’s take an example pipeline that reads CSV files from Cloud Storage (source), runs a transformation, and then stores the values in a BigQuery table (destination) with three columns.

Now, create the below-mentioned files on your local drive:

1. A big_query_column_table.json file that will contain the destination schema, with the destination table created as:

bq query --nouse_legacy_sql \
"CREATE TABLE attendence_data.current_attendence (
employee_name string,
employee_id string,
attendance_count int64
)"

2. A transformation.js JavaScript file to implement a simple data transformation:

function transform(line) {
  var values = line.split(',');
  var obj = new Object();
  obj.employee_name = values[0];
  obj.employee_id = values[1];
  obj.attendance_count = values[2];
  var jsonString = JSON.stringify(obj);
  return jsonString;
}

3. A record01.csv CSV file with the records to be inserted in the BigQuery table:

Kayling,65487,30
Scarlet,65878,31
Frank,45781,28
Tyler,63679,29
Elena,54876,25
Stefan,54845,30
Markus,69324,28
Adelyn,54751,31
Jonas,54875,27
Blaze,48721,31

4. Use gsutil to copy the JSON and JS files to the attendence_record folder of your project’s Cloud Storage bucket (BUCKET_ID) and the CSV file to the inputs folder, as:

gsutil cp big_query_column_table.json gs://BUCKET_ID/attendence_record/
gsutil cp transformation.js gs://BUCKET_ID/attendence_record/
gsutil cp record01.csv gs://BUCKET_ID/inputs/

After creating the record folder in Cloud Storage, create an attendance record pipeline by entering the pipeline name, source, and destination, selecting “Text Files on Cloud Storage to BigQuery” under Process Data in Bulk (batch), and scheduling the pipeline based on your needs.

Other than the batch data pipeline, you can also create a streaming data pipeline based on the batch pipeline instructions, but remember the differences given below:

● Streaming data pipelines do not have a schedule specified under Pipeline schedule, as Dataflow streaming begins immediately.

● When choosing the Dataflow template, go to Process Data Continuously (stream) and then Text Files on Cloud Storage to BigQuery.

● The pipeline processes the records you upload to the inputs/ folder that match the pattern gs://BUCKET_ID/inputs/record01.csv. For the Worker machine type, to avoid out-of-memory errors when the CSV files exceed several GBs, select a machine type with more memory than the default n1-standard-4 machine type.
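As an alternative to clicking through the console, the same kind of batch job can be launched from the command line with a Google-provided Dataflow template. Treat this strictly as a sketch: the template path (GCS_Text_to_BigQuery) and the parameter names (JSONPath, javascriptTextTransformGcsPath, inputFilePattern, outputTable, and so on) are assumptions about the “Text Files on Cloud Storage to BigQuery” template, and PROJECT_ID, BUCKET_ID, and the region are placeholders, so confirm everything against the Dataflow template documentation.

gcloud dataflow jobs run attendence-record-job \
  --region us-central1 \
  --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
  --parameters \
JSONPath=gs://BUCKET_ID/attendence_record/big_query_column_table.json,\
javascriptTextTransformGcsPath=gs://BUCKET_ID/attendence_record/transformation.js,\
javascriptTextTransformFunctionName=transform,\
inputFilePattern=gs://BUCKET_ID/inputs/record01.csv,\
outputTable=PROJECT_ID:attendence_data.current_attendence,\
bigQueryLoadingTemporaryDirectory=gs://BUCKET_ID/tmp

# once the job finishes, a quick check that the rows landed in the destination table:
bq query --nouse_legacy_sql "SELECT employee_name, attendance_count FROM attendence_data.current_attendence LIMIT 5"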

Conclusion

So that was all about data pipelines and the Google Cloud Platform, and this is how you can easily create a simple yet functional data pipeline using GCP. Remember, no exception handling is included above, so while working for an organization you will have to add it yourself.
