Case Study: Marketing Data Warehouse Engineering
Case Problem Overview
In the fast-paced world of digital marketing, companies often struggle to manage and engineer data pipelines from many different sources. Traditional data warehouses, even cloud-based ones, can be expensive and complex to maintain, driving up spending on both infrastructure and the salaries of the specialists needed to run them.
Stakeholders from teams such as marketing, accounting, product, and development need comprehensive data from websites, ad systems, and the CRM to build their own dashboards and reports and present performance results to the company board. However, they are usually not familiar with the fundamentals of data engineering and have not read "Data Engineering for Dummies." As a result, each report may show significantly different numbers: there is no single source of truth, and each stakeholder makes their own calculations without considering how their colleagues processed the data. The final reports end up with conflicting figures that look unprofessional and can be misleading.
The challenge is to create a cost-effective, scalable, and easy-to-maintain data warehouse solution that can efficiently handle marketing data from GA4, Facebook Ads, and Shopify and provide a single source of truth for all stakeholders, grounded in sound data engineering principles.
Dataflow process overview
Company growth is not a linear, well-controlled process. Many services arrive and many are phased out as the company evolves. This constant change makes it hard to maintain a cohesive, integrated data engineering infrastructure: new tools and platforms are adopted to meet emerging needs with little regard for data engineering principles. Such a dynamic environment requires a flexible and scalable pipeline architecture that can adapt to the ever-changing landscape of company requirements.
However, ignoring fundamental data engineering principles leads to data silos, where information is isolated within different systems and departments. These silos prevent seamless data integration and hinder the ability to gain comprehensive insights. Consequently, the lack of a unified data view slows down decision-making and impedes further growth, as stakeholders struggle to access and analyze the data they need efficiently.
As-Is Process
The diagram illustrates the common problem of data silos: multiple services such as Facebook, Google Analytics, and Shopify are each connected to their own ETL (Extract, Transform, Load) processes and reporting tools. The more connections and services involved, the greater the complexity and the potential for data inconsistencies, leading to fragmented and unreliable reports. This fragmentation makes it difficult to maintain a single source of truth, causing discrepancies and inefficiencies in data analysis and decision-making.
To-Be Process
The approach involves identifying shared functionality among the various services and creating a unified pipeline instead of maintaining multiple parallel processes. By analyzing the common characteristics and data processing requirements of services like Facebook Ads, Google Analytics, and Shopify, a single, streamlined ETL pipeline can be engineered. This unified process reduces redundancy, minimizes data silos, and ensures consistency across reports, ultimately providing a single source of truth for all stakeholders.
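To make the idea concrete, here is a minimal sketch of what such a shared pipeline could look like in Node.js, assuming a hypothetical connector interface in which every source exposes the same name and fetchData function. The connectors, bucket name, and S3 key layout below are illustrative assumptions rather than the case study's actual code.
// A minimal sketch of a unified ingestion step (illustrative only).
// Every source connector exposes the same interface: { name, fetchData() }.
const AWS = require('aws-sdk');
const s3 = new AWS.S3({ region: 'us-east-1' });

// Hypothetical connectors -- real ones would call the Facebook Ads,
// Google Analytics 4, and Shopify APIs, as shown later in this case study.
const connectors = [
  { name: 'facebook_ads', fetchData: async () => [{ date: '2024-01-01', spend: '10.50' }] },
  { name: 'ga4',          fetchData: async () => [{ date: '2024-01-01', sessions: 120 }] },
  { name: 'shopify',      fetchData: async () => [{ date: '2024-01-01', revenue: '99.90' }] },
];

// One generic pipeline step replaces a separate ETL process per source:
// fetch the data and land it as raw JSON under a common S3 layout.
async function ingest(connector, bucketName) {
  const rows = await connector.fetchData();
  const key = `raw/${connector.name}/${new Date().toISOString().slice(0, 10)}.json`;
  await s3.putObject({ Bucket: bucketName, Key: key, Body: JSON.stringify(rows) }).promise();
  console.log(`Stored ${rows.length} rows at s3://${bucketName}/${key}`);
}

(async () => {
  for (const connector of connectors) {
    await ingest(connector, 'YOUR_S3_BUCKET_NAME');
  }
})();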
Implementation Architecture
The detailed AWS architecture leverages serverless AWS Lambda functions and the in-memory DuckDB database to create a scalable data pipeline. The process begins with Lambda functions ingesting data from sources such as Facebook, Google Analytics, and Shopify and storing the raw data in Amazon S3. The raw data is then processed and transformed by additional Lambda functions, which load it into DuckDB for further analysis. DuckDB, reading directly from the S3 data, provides a lightweight and efficient database engine capable of performing complex queries and transformations in-memory, so data is processed quickly with minimal latency and overhead. The final processed data is stored in a datamart within S3, ready for analysis and reporting by any department. This serverless approach not only reduces costs by eliminating the need for dedicated servers but also scales automatically with the volume of data, providing a robust and flexible solution for marketing data engineering.
The key advantage of this architecture lies in the use of two types of functions that are easy to scale, update, and schedule. The first type connects to the source APIs and collects data on a daily basis. The second type is responsible for ETL transformations and can be controlled simply by deploying code from a Git repository.
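As one possible way to set up the daily schedule (the case study does not prescribe a specific mechanism), the sketch below uses the AWS SDK's CloudWatch Events (EventBridge) API to trigger an ingestion Lambda once a day. The rule name, cron expression, and function ARN are illustrative assumptions.
// A minimal sketch: schedule a daily run of an ingestion Lambda using
// CloudWatch Events (EventBridge). All names and ARNs are illustrative assumptions.
const AWS = require('aws-sdk');
AWS.config.update({ region: 'us-east-1' });

const events = new AWS.CloudWatchEvents();
const lambda = new AWS.Lambda();

async function scheduleDailyIngestion(functionArn) {
  // Create (or update) a rule that fires every day at 03:00 UTC
  const rule = await events.putRule({
    Name: 'daily-facebook-ads-ingestion',
    ScheduleExpression: 'cron(0 3 * * ? *)',
    State: 'ENABLED',
  }).promise();

  // Point the rule at the ingestion Lambda function
  await events.putTargets({
    Rule: 'daily-facebook-ads-ingestion',
    Targets: [{ Id: 'ingestion-lambda', Arn: functionArn }],
  }).promise();

  // Allow EventBridge to invoke the function
  await lambda.addPermission({
    FunctionName: functionArn,
    StatementId: 'allow-eventbridge-daily-ingestion',
    Action: 'lambda:InvokeFunction',
    Principal: 'events.amazonaws.com',
    SourceArn: rule.RuleArn,
  }).promise();
}

scheduleDailyIngestion('arn:aws:lambda:us-east-1:123456789012:function:facebook-ads-ingestion')
  .catch((err) => console.error('Scheduling failed:', err));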
Facebook Ads API cost data ingestion function example
// This code can easily be translated to Python if the data engineering process is built in Python.
const AWS = require('aws-sdk');
const fetch = require('node-fetch'); // node-fetch v2 (CommonJS)
const fs = require('fs');
const path = require('path');

// Configure AWS SDK
AWS.config.update({ region: 'us-east-1' });
const s3 = new AWS.S3();

// Facebook Ads API credentials
const accessToken = 'YOUR_FACEBOOK_ACCESS_TOKEN';
const adAccountId = 'YOUR_AD_ACCOUNT_ID';

// Function to fetch cost data from Facebook Ads API
async function fetchFacebookAdsCostData() {
  const url = `https://graph.facebook.com/v12.0/act_${adAccountId}/insights?fields=spend&access_token=${accessToken}`;
  const response = await fetch(url);
  const data = await response.json();
  return data;
}

// Function to upload file to S3
async function uploadFileToS3(filePath, bucketName, key) {
  const fileContent = fs.readFileSync(filePath);
  const params = {
    Bucket: bucketName,
    Key: key,
    Body: fileContent
  };
  await s3.upload(params).promise();
  console.log(`File uploaded successfully at https://s3.amazonaws.com/${bucketName}/${key}`);
}

// Lambda handler: fetch the cost data and store it as raw JSON in S3
exports.handler = async () => {
  try {
    const data = await fetchFacebookAdsCostData();
    // Lambda's filesystem is read-only except for /tmp
    const filePath = path.join('/tmp', 'facebook_ads_cost_data.json');
    fs.writeFileSync(filePath, JSON.stringify(data));
    const bucketName = 'YOUR_S3_BUCKET_NAME';
    const key = 'facebook_ads_cost_data.json';
    await uploadFileToS3(filePath, bucketName, key);
  } catch (error) {
    console.error('Error:', error);
    throw error;
  }
};
ETL function example: a DuckDB instance connects to the raw Facebook API data in S3, extracts the date and cost fields, aggregates by date, and stores the result as a new file in the S3 extracted layer
// This code can easily be translated to Python if the data engineering process is built in Python.
const AWS = require('aws-sdk');
const fetch = require('node-fetch'); // node-fetch v2 (CommonJS)
const fs = require('fs');
const path = require('path');
const duckdb = require('duckdb');

// Configure AWS SDK
AWS.config.update({ region: 'us-east-1' });
const s3 = new AWS.S3();

// Facebook Ads API credentials
const accessToken = 'YOUR_FACEBOOK_ACCESS_TOKEN';
const adAccountId = 'YOUR_AD_ACCOUNT_ID';

// Function to fetch cost data from Facebook Ads API
async function fetchFacebookAdsCostData() {
  const url = `https://graph.facebook.com/v12.0/act_${adAccountId}/insights?fields=date_start,date_stop,spend&access_token=${accessToken}`;
  const response = await fetch(url);
  const data = await response.json();
  return data.data;
}

// Function to upload file to S3
async function uploadFileToS3(filePath, bucketName, key) {
  const fileContent = fs.readFileSync(filePath);
  const params = {
    Bucket: bucketName,
    Key: key,
    Body: fileContent
  };
  await s3.upload(params).promise();
  console.log(`File uploaded successfully at https://s3.amazonaws.com/${bucketName}/${key}`);
}

// Promisify DuckDB's callback-based connection methods
function execSql(connection, sql) {
  return new Promise((resolve, reject) => {
    connection.exec(sql, (err) => (err ? reject(err) : resolve()));
  });
}

function queryAll(connection, sql) {
  return new Promise((resolve, reject) => {
    connection.all(sql, (err, rows) => (err ? reject(err) : resolve(rows)));
  });
}

// Lambda handler: land the raw data in S3, aggregate it with DuckDB, and store the result in the extracted layer
exports.handler = async () => {
  try {
    const data = await fetchFacebookAdsCostData();
    // Lambda's filesystem is read-only except for /tmp
    const rawFilePath = path.join('/tmp', 'facebook_ads_raw_data.json');
    fs.writeFileSync(rawFilePath, JSON.stringify(data));
    const bucketName = 'YOUR_S3_BUCKET_NAME';
    const rawKey = 'raw/facebook_ads_raw_data.json';
    await uploadFileToS3(rawFilePath, bucketName, rawKey);

    // Connect to an in-memory DuckDB instance and read the raw data directly from S3
    const db = new duckdb.Database(':memory:');
    const connection = db.connect();
    await execSql(connection, `
      INSTALL httpfs;
      LOAD httpfs;
      SET s3_region='us-east-1';
      SET s3_access_key_id='YOUR_AWS_ACCESS_KEY';
      SET s3_secret_access_key='YOUR_AWS_SECRET_KEY';
    `);
    await execSql(connection, `CREATE TABLE raw_data AS SELECT * FROM read_json_auto('s3://${bucketName}/${rawKey}');`);

    // Aggregate spend by date (the API returns spend as a string, so cast it before summing)
    const result = await queryAll(connection, `
      SELECT date_start AS date, SUM(CAST(spend AS DOUBLE)) AS cost
      FROM raw_data
      GROUP BY date_start;
    `);

    // Save aggregated data to a new file in the extracted layer
    const extractedFilePath = path.join('/tmp', 'facebook_ads_extracted_data.json');
    fs.writeFileSync(extractedFilePath, JSON.stringify(result));
    const extractedKey = 'extracted/facebook_ads_extracted_data.json';
    await uploadFileToS3(extractedFilePath, bucketName, extractedKey);
  } catch (error) {
    console.error('Error:', error);
    throw error;
  }
};
Results and Benefits
By implementing this data architecture, the outcome is a set of denormalized datamarts stored efficiently in S3. These datamarts can easily be connected to various visualization tools such as Power BI, Looker, and even Excel. The denormalized structure ensures that data is readily available for analysis and reporting, providing a seamless experience and coherent results for stakeholders from different teams or even different departments.
A typical datamart is a CSV file stored in S3 cloud storage that contains aggregated marketing data ready for analysis and reporting.
By providing one comprehensive table that contains all the necessary information, independent stakeholders from the marketing, product, accounting, or development teams can craft their own reports while maintaining consistency. This unified table ensures that all reports are based on the same data source, eliminating discrepancies and fostering a coherent understanding across departments. As a result, stakeholders can generate accurate and professional reports that align with the overall objectives and insights derived from a single, reliable dataset.
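As a sketch of how such a unified table could be assembled, the query below uses DuckDB's httpfs extension to join the extracted Facebook Ads data with hypothetical GA4 and Shopify extracts and write the result back to S3 as a CSV datamart. The GA4 and Shopify file names, their column names, and the datamart path are illustrative assumptions and would need to match the actual extracted layer.
// A minimal sketch: build a denormalized CSV datamart from the extracted layer.
// File names and columns for GA4 and Shopify are illustrative assumptions.
const duckdb = require('duckdb');
const db = new duckdb.Database(':memory:');
const connection = db.connect();

// Promisify DuckDB's callback-based exec()
const execSql = (sql) =>
  new Promise((resolve, reject) => connection.exec(sql, (err) => (err ? reject(err) : resolve())));

(async () => {
  await execSql(`
    INSTALL httpfs;
    LOAD httpfs;
    SET s3_region='us-east-1';
    SET s3_access_key_id='YOUR_AWS_ACCESS_KEY';
    SET s3_secret_access_key='YOUR_AWS_SECRET_KEY';
  `);

  // Join the per-source extracts on date and write one wide table to the datamart layer
  await execSql(`
    COPY (
      SELECT
        fb.date,
        fb.cost     AS facebook_cost,
        ga.sessions AS ga4_sessions,
        sh.revenue  AS shopify_revenue
      FROM read_json_auto('s3://YOUR_S3_BUCKET_NAME/extracted/facebook_ads_extracted_data.json') AS fb
      LEFT JOIN read_json_auto('s3://YOUR_S3_BUCKET_NAME/extracted/ga4_extracted_data.json')     AS ga USING (date)
      LEFT JOIN read_json_auto('s3://YOUR_S3_BUCKET_NAME/extracted/shopify_extracted_data.json') AS sh USING (date)
    ) TO 's3://YOUR_S3_BUCKET_NAME/datamart/marketing_datamart.csv' (FORMAT CSV, HEADER);
  `);

  console.log('Datamart written to s3://YOUR_S3_BUCKET_NAME/datamart/marketing_datamart.csv');
})().catch((err) => console.error('Datamart build failed:', err));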
This architecture also offers significant benefits in terms of automation and maintenance. By leveraging serverless AWS Lambda functions, the entire data processing pipeline can be automated, reducing the need for manual intervention and minimizing the risk of human error. Scheduled Lambda functions can handle data ingestion, transformation, and loading processes seamlessly, ensuring that data is always up-to-date and ready for analysis. Additionally, the use of infrastructure-as-code practices allows for easy updates and maintenance of the data pipeline, enabling quick adjustments to changing business requirements without disrupting the overall workflow. This automated and maintainable approach ensures a reliable and efficient data processing system that can adapt to the evolving needs of the organization.
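For example, assuming the team adopts the AWS CDK as its infrastructure-as-code tool (the case study does not name a specific one), the two function types and their daily schedule could be declared roughly as follows; the stack, handler, and asset names are illustrative.
// A minimal AWS CDK (v2) sketch: define the ingestion and ETL Lambdas and a daily schedule.
// The CDK is one possible infrastructure-as-code choice; all names are illustrative assumptions.
const cdk = require('aws-cdk-lib');
const lambda = require('aws-cdk-lib/aws-lambda');
const events = require('aws-cdk-lib/aws-events');
const targets = require('aws-cdk-lib/aws-events-targets');

class MarketingPipelineStack extends cdk.Stack {
  constructor(scope, id, props) {
    super(scope, id, props);

    // Type 1: API ingestion function, runs once a day
    const ingestFn = new lambda.Function(this, 'FacebookAdsIngestion', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'facebook_ingest.handler',
      code: lambda.Code.fromAsset('src'),
      timeout: cdk.Duration.minutes(5),
    });

    // Type 2: ETL transformation function using DuckDB
    const etlFn = new lambda.Function(this, 'FacebookAdsEtl', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'facebook_etl.handler',
      code: lambda.Code.fromAsset('src'),
      memorySize: 1024,
      timeout: cdk.Duration.minutes(10),
    });

    // Daily schedule triggering both functions
    new events.Rule(this, 'DailyMarketingPipeline', {
      schedule: events.Schedule.cron({ minute: '0', hour: '3' }),
      targets: [new targets.LambdaFunction(ingestFn), new targets.LambdaFunction(etlFn)],
    });
  }
}

const app = new cdk.App();
new MarketingPipelineStack(app, 'MarketingPipelineStack');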
Category | Details |
---|---|
Cost-Effective | Serverless Lambda functions and S3 storage eliminate the need for dedicated servers and specialized maintenance. |
Scalability | The pipeline scales automatically with the volume of data coming from the marketing sources. |
Simplicity | Two types of small, single-purpose functions are easy to update, schedule, and maintain from a Git repository. |
Data Ingestion | Lambda functions collect data daily from the Facebook Ads, GA4, and Shopify APIs and land it as raw files in S3. |
Data Processing | Lambda functions running DuckDB transform and aggregate the raw S3 data in-memory. |
Data Storage | Raw, extracted, and datamart layers are all stored in S3. |
Data Analysis | Denormalized datamarts connect directly to Power BI, Looker, or Excel. |
Cost Savings | No dedicated warehouse infrastructure means lower equipment and salary costs. |
Improved Performance | In-memory DuckDB queries run with minimal latency and overhead. |
Ease of Use | Stakeholders build their own reports from one unified table without needing deep data engineering expertise. |
Enhanced Insights | A single source of truth yields consistent, coherent reports across departments. |
Summary
This data engineering case study explores the challenges and solutions for marketing data warehouse engineering in a dynamic business environment. Traditional data warehouses are often costly and complex, making it difficult for small to medium-sized businesses to manage and analyze large volumes of data from various sources effectively. The proposed engineering solution involves creating a cost-effective, scalable, and easy-to-maintain data warehouse using serverless AWS Lambda services and the in-memory DuckDB database.
Summary of Key Points | Details |
---|---|
Case Problem Overview | Stakeholders build reports from the same marketing sources (GA4, Facebook Ads, Shopify) independently, producing conflicting numbers and no single source of truth. |
Dataflow Process Overview | Constant adoption and retirement of services creates data silos that hinder integration and decision-making. |
To-Be Process | Shared functionality across sources is consolidated into a single, unified ETL pipeline. |
Implementation Architecture | Serverless AWS Lambda functions ingest and transform data, DuckDB processes it in-memory, and S3 stores the raw, extracted, and datamart layers. |
Results and Benefits | Denormalized datamarts in S3 connect to Power BI, Looker, or Excel, providing consistent, automated reporting at low cost. |
Recommendations | Unify pipelines around shared functionality, keep functions small and scheduled, and manage the pipeline as code from a Git repository. |
By following these recommendations, businesses can create a robust and flexible data warehouse solution that meets their marketing data engineering needs while ensuring cost-effectiveness, scalability, and simplicity.