Site Impact – Big Data Pipeline

Site Impact needed a way to modernize and scale out its data management procedures. We were tasked with architecting a cloud-based platform that scales the company's business operations by presenting normalized data for better analysis.

The challenge

Site Impact approached us about the uptime and scaling problems of its existing data management platform. Its workflow for using the data was also piecemeal and relied on many manual data manipulations.

The solution

After assessing the project during a Discovery phase, we scoped an MVP that delivered what was needed. We focused heavily on data science and built a multi-functional data pipeline: the client supplies data, the pipeline standardizes and dedupes it, and anyone in the organization can then analyze the exported data wherever they choose.
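As a rough illustration of what standardizing and deduping mean in practice, here is a minimal sketch in plain Python. The field names (`name`, `email`, `phone`), the normalization rules, and the choice of email as the dedupe key are assumptions made for this example only; they are not taken from the Site Impact pipeline, which ran on PySpark and BigQuery.

```python
import re


def standardize(record):
    """Normalize one raw record (hypothetical fields, for illustration only)."""
    return {
        # Collapse repeated whitespace and apply title case to the name.
        "name": " ".join(record.get("name", "").split()).title(),
        # Trim and lowercase the email so equal addresses compare equal.
        "email": record.get("email", "").strip().lower(),
        # Keep only the digits of the phone number.
        "phone": re.sub(r"\D", "", record.get("phone", "")),
    }


def dedupe(records, key="email"):
    """Keep the first record seen for each value of `key`."""
    seen = set()
    unique = []
    for rec in records:
        value = rec[key]
        if value in seen:
            continue
        seen.add(value)
        unique.append(rec)
    return unique


raw = [
    {"name": "  ada   lovelace ", "email": "Ada@Example.com ", "phone": "(555) 010-0001"},
    {"name": "Ada Lovelace", "email": "ada@example.com", "phone": "555.010.0001"},
    {"name": "alan turing", "email": "alan@example.com", "phone": "555 010 0002"},
]

# Standardize first so that records differing only in formatting collapse together.
clean = dedupe([standardize(r) for r in raw])
```

Standardizing before deduping matters: the first two sample records only become recognizable as duplicates once their emails have been normalized.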

The results

Key Technologies:
BigQuery – Executes efficient SQL queries on 400 GB tables with hundreds of millions of rows, some spanning 600+ columns.
Composer – Airflow provides a dependency-driven ETL pipeline that runs all required manipulations and automatically loads the data into BigQuery.
Dataproc – PySpark code running on Dataproc's compute, built to handle petabytes of data.
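The idea of a dependency-driven pipeline can be sketched without Airflow itself. The example below uses only Python's standard library (`graphlib`, Python 3.9+) to run steps in an order that respects their dependencies. The step names and the dependency map are hypothetical and chosen for illustration; a real Composer deployment would declare these as Airflow tasks instead.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps; each one just records that it ran.
executed = []


def make_step(name):
    def step():
        executed.append(name)
    return step


steps = {
    name: make_step(name)
    for name in ("ingest", "standardize", "dedupe", "load_to_warehouse")
}

# Map each step to the set of steps it depends on.
dependencies = {
    "standardize": {"ingest"},
    "dedupe": {"standardize"},
    "load_to_warehouse": {"dedupe"},
}

# static_order() yields every step only after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
for name in order:
    steps[name]()
```

Declaring dependencies rather than a fixed script order lets a scheduler work out what can run, and when, as the pipeline grows.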
We focused heavily on Data Science and built a multi-functional data pipeline that allowed the client to provide data to run through standardizing and deduping processes in real time or batch form.
About Site Impact, LLC
Site Impact is one of the leading providers of data and marketing resources, specializing in multi-channel direct marketing services.
Industry: Advertising & Marketing
Primary project location: United States

 

About WALTLabs.io, LLC
At WALT Labs, we provide professional services around application modernization and cloud strategy.
Products: Google Cloud Platform
First Published on July 30, 2020