Case Study: Marketing Data Warehouse Engineering

Case Problem Overview

In the fast-paced world of digital marketing, companies often struggle to manage and engineer data pipelines from many different sources. Traditional data warehouses, even cloud-based ones, can be expensive and complex to maintain, driving up spending on infrastructure and salaries.

Stakeholders from teams such as marketing, accounting, product, and development need comprehensive data from websites, ad systems, and the CRM to build their own dashboards and reports and present performance results to the company board. However, they may not be familiar with the fundamentals of data engineering or have read "Data Engineering for Dummies." As a result, each report may show significantly different results: there is no single source of truth, and each stakeholder makes their own calculations without considering the data engineering processes of their colleagues. The final reports end up with conflicting numbers that look unprofessional and provide misleading information.

The challenge is to create a cost-effective, scalable, and easy-to-maintain data warehouse solution that can handle marketing data from GA4, Facebook Ads, and Shopify efficiently and provide a single source of truth for all stakeholders, grounded in data engineering principles.

Dataflow process overview

Company growth is not a linear, well-controlled process. Many services will arrive and many will be phased out as the company evolves. This constant change can make it hard to maintain a cohesive, integrated data engineering infrastructure: new tools and platforms may be adopted to meet emerging needs without regard for data engineering principles. This dynamic environment requires a flexible and scalable data pipeline architecture that can adapt to the ever-changing landscape of company requirements.

However, violating fundamental data engineering principles leads to data silos, where information is isolated within different systems and departments. These silos prevent seamless data integration and hinder the ability to gain comprehensive insights. The resulting lack of a unified data view slows down decision-making and impedes further growth, as stakeholders struggle to access and analyze the data they need efficiently.

As-Is Process

The graph illustrates the common problem of data silos: multiple services such as Facebook, Google Analytics, and Shopify are each connected to different ETL (Extract, Transform, Load) processes and reporting tools. The more connections and services involved, the greater the complexity and the potential for data inconsistencies, leading to fragmented and unreliable reports. This fragmentation makes it challenging to maintain a single source of truth, causing discrepancies and inefficiencies in data analysis and decision-making.

To-Be Process

The approach involves identifying shared functionality among the various services and creating a unified pipeline instead of maintaining multiple parallel processes. By analyzing the common characteristics and data processing requirements of services like Facebook Ads, Google Analytics, and Shopify, a single, streamlined ETL pipeline can be engineered. This unified process reduces redundancy, minimizes data silos, and ensures consistency across reports, ultimately providing a single source of truth for all stakeholders.
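
To make the idea of shared functionality concrete, the sketch below treats each source as a small connector with the same shape, so a single generic routine can land raw data from every service in the same S3 layout. The connector list and the GA4 and Shopify helper functions are hypothetical illustrations, not code from the case:

// Hypothetical sketch: every source exposes the same connector contract,
// so one generic routine ingests all of them into the raw S3 layer.
const connectors = [
    { name: 'facebook_ads', fetchRaw: fetchFacebookAdsCostData }, // defined in the ingestion example below
    { name: 'ga4', fetchRaw: fetchGa4SessionsData },              // hypothetical helper
    { name: 'shopify', fetchRaw: fetchShopifyOrdersData }         // hypothetical helper
];

async function ingestAll(s3, bucketName) {
    const runDate = new Date().toISOString().slice(0, 10); // partition raw files by day, e.g. raw/ga4/2024-01-31.json
    for (const connector of connectors) {
        const payload = await connector.fetchRaw();
        await s3.upload({
            Bucket: bucketName,
            Key: `raw/${connector.name}/${runDate}.json`,
            Body: JSON.stringify(payload)
        }).promise();
    }
}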

Implementation Architecture

The detailed AWS architecture leverages serverless AWS Lambda functions and the in-memory DuckDB database to create a scalable data pipeline. The process begins with Lambda functions ingesting data from sources such as Facebook, Google Analytics, and Shopify and storing the raw data in Amazon S3. Additional Lambda functions then process and transform the raw data, loading it into DuckDB for further analysis. DuckDB, reading directly from the data in S3, provides a lightweight and efficient database engine capable of performing complex queries and transformations in memory, so the pipeline runs quickly with minimal latency and overhead.

The final processed data is stored in a datamart within S3, ready for analysis and reporting by any department. This serverless approach not only reduces costs by eliminating the need for dedicated servers but also scales automatically with the volume of data, providing a robust and flexible solution for marketing data engineering.
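
The storage layers described above map naturally onto S3 prefixes. The names below are illustrative assumptions: the later examples use the raw and extracted prefixes, and the datamart prefix is added here for completeness:

// Illustrative S3 layout for the storage layers described above (prefix names are assumptions)
const LAYERS = {
    raw: 'raw/',             // untouched API responses straight from each source
    extracted: 'extracted/', // cleaned and aggregated output of the transformation Lambdas
    datamart: 'datamart/'    // denormalized, BI-ready tables consumed by any department
};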

The key advantage of this architecture lies in the use of two types of Lambda functions that are easy to scale, update, and schedule. The first type connects to source APIs and collects data on a daily basis. The second type is responsible for ETL transformations and can easily be controlled by deploying code from a Git repository.

Facebook Ads API cost data ingestion function example

// This code can easily be translated to Python if the data engineering process is implemented in Python
const AWS = require('aws-sdk');
const fetch = require('node-fetch');
const fs = require('fs');
const path = require('path');

// Configure AWS SDK
AWS.config.update({ region: 'us-east-1' });
const s3 = new AWS.S3();

// Facebook Ads API credentials
const accessToken = 'YOUR_FACEBOOK_ACCESS_TOKEN';
const adAccountId = 'YOUR_AD_ACCOUNT_ID';

// Function to fetch cost data from Facebook Ads API
async function fetchFacebookAdsCostData() {
    const url = `https://graph.facebook.com/v12.0/act_${adAccountId}/insights?fields=spend&access_token=${accessToken}`;
    const response = await fetch(url);
    const data = await response.json();
    return data;
}

// Function to upload file to S3
async function uploadFileToS3(filePath, bucketName, key) {
    const fileContent = fs.readFileSync(filePath);
    const params = {
        Bucket: bucketName,
        Key: key,
        Body: fileContent
    };
    await s3.upload(params).promise();
    console.log(`File uploaded successfully at https://s3.amazonaws.com/${bucketName}/${key}`);
}

// Main function
(async () => {
    try {
        const data = await fetchFacebookAdsCostData();
        // Lambda's filesystem is read-only except for /tmp, so stage the file there
        const filePath = path.join('/tmp', 'facebook_ads_cost_data.json');
        fs.writeFileSync(filePath, JSON.stringify(data));

        const bucketName = 'YOUR_S3_BUCKET_NAME';
        const key = 'facebook_ads_cost_data.json';
        await uploadFileToS3(filePath, bucketName, key);
    } catch (error) {
        console.error('Error:', error);
    }
})();
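
The script above is written as a standalone program; when deployed as the daily ingestion Lambda described earlier, the same logic would typically be exposed through a handler instead of an immediately invoked function. A minimal sketch, reusing the helpers defined above and the same placeholder bucket and key (the return value shape is an illustrative choice):

// Minimal handler sketch for deploying the ingestion logic as a Lambda function
// (assumes fetchFacebookAdsCostData and uploadFileToS3 from the example above)
exports.handler = async (event) => {
    const data = await fetchFacebookAdsCostData();
    const filePath = path.join('/tmp', 'facebook_ads_cost_data.json'); // /tmp is Lambda's writable directory
    fs.writeFileSync(filePath, JSON.stringify(data));
    await uploadFileToS3(filePath, 'YOUR_S3_BUCKET_NAME', 'facebook_ads_cost_data.json');
    return { statusCode: 200, body: 'Facebook Ads cost data ingested' };
};

A daily trigger can then be attached with an Amazon EventBridge (CloudWatch Events) schedule rule, for example cron(0 3 * * ? *) for a 03:00 UTC run.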

DuckDB instance connecting to the raw Facebook API data in S3, extracting the date and cost fields, aggregating by date, and storing the result as a new file in the S3 extracted layer

// This code can easily be translated to Python if the data engineering process is implemented in Python
const AWS = require('aws-sdk');
const fetch = require('node-fetch');
const fs = require('fs');
const path = require('path');
const duckdb = require('duckdb');

// Configure AWS SDK
AWS.config.update({ region: 'us-east-1' });
const s3 = new AWS.S3();

// Facebook Ads API credentials
const accessToken = 'YOUR_FACEBOOK_ACCESS_TOKEN';
const adAccountId = 'YOUR_AD_ACCOUNT_ID';

// Function to fetch cost data from Facebook Ads API
async function fetchFacebookAdsCostData() {
    const url = `https://graph.facebook.com/v12.0/act_${adAccountId}/insights?fields=date_start,date_stop,spend&access_token=${accessToken}`;
    const response = await fetch(url);
    const data = await response.json();
    return data.data;
}

// Function to upload file to S3
async function uploadFileToS3(filePath, bucketName, key) {
    const fileContent = fs.readFileSync(filePath);
    const params = {
        Bucket: bucketName,
        Key: key,
        Body: fileContent
    };
    await s3.upload(params).promise();
    console.log(`File uploaded successfully at https://s3.amazonaws.com/${bucketName}/${key}`);
}

// Main function
(async () => {
    try {
        const data = await fetchFacebookAdsCostData();
        // Lambda's filesystem is read-only except for /tmp, so stage the file there
        const rawFilePath = path.join('/tmp', 'facebook_ads_raw_data.json');
        fs.writeFileSync(rawFilePath, JSON.stringify(data));

        const bucketName = 'YOUR_S3_BUCKET_NAME';
        const rawKey = 'raw/facebook_ads_raw_data.json';
        await uploadFileToS3(rawFilePath, bucketName, rawKey);

        // Connect to an in-memory DuckDB instance and query the raw JSON directly from S3
        const db = new duckdb.Database(':memory:');
        const connection = db.connect();

        // DuckDB's Node API is callback-based, so wrap it in small promise helpers
        const exec = (sql) => new Promise((resolve, reject) =>
            connection.exec(sql, (err) => (err ? reject(err) : resolve())));
        const queryAll = (sql) => new Promise((resolve, reject) =>
            connection.all(sql, (err, rows) => (err ? reject(err) : resolve(rows))));

        // httpfs lets DuckDB read s3:// paths; credentials are set for the session
        await exec(`INSTALL httpfs; LOAD httpfs;
            SET s3_region='us-east-1';
            SET s3_access_key_id='YOUR_AWS_ACCESS_KEY';
            SET s3_secret_access_key='YOUR_AWS_SECRET_KEY';
            CREATE TABLE raw_data AS SELECT * FROM read_json_auto('s3://${bucketName}/${rawKey}');`);

        // Aggregate spend by date (the API returns spend as a string, so cast before summing)
        const result = await queryAll(`SELECT date_start AS date, SUM(CAST(spend AS DOUBLE)) AS cost FROM raw_data GROUP BY date_start;`);

        // Save aggregated data to a new file in /tmp before uploading it to the extracted layer
        const extractedFilePath = path.join('/tmp', 'facebook_ads_extracted_data.json');
        fs.writeFileSync(extractedFilePath, JSON.stringify(result));

        const extractedKey = 'extracted/facebook_ads_extracted_data.json';
        await uploadFileToS3(extractedFilePath, bucketName, extractedKey);
    } catch (error) {
        console.error('Error:', error);
    }
})();

Results and Benefits

By implementing this data architecture, the outcome is a set of denormalized datamarts efficiently stored in S3. These datamarts can easily be connected to various visualization tools such as Power BI, Looker, and even Excel. The denormalized structure ensures that data is readily available for analysis and reporting, providing a seamless experience and coherent results for stakeholders from different teams or even different departments.

Here is an example of a CSV data mart file stored in S3 cloud storage. This file contains aggregated marketing data that can be used for analysis and reporting.
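
The datamart file itself is an ordinary CSV object in S3. A minimal sketch of producing it from the extracted layer with DuckDB's COPY command is shown below; the bucket name, prefixes, table name, and column list are illustrative, and the connection is assumed to be configured with httpfs and S3 credentials as in the transformation example:

// Sketch: build a denormalized datamart from the extracted layer and export it to S3 as CSV
// (assumes a DuckDB connection already configured with httpfs and S3 credentials)
connection.exec(`
    CREATE TABLE marketing_datamart AS
    SELECT fb.date, fb.cost AS facebook_ads_cost
    FROM read_json_auto('s3://YOUR_S3_BUCKET_NAME/extracted/facebook_ads_extracted_data.json') AS fb;

    COPY marketing_datamart
    TO 's3://YOUR_S3_BUCKET_NAME/datamart/marketing_datamart.csv' (FORMAT CSV, HEADER);
`, (err) => {
    if (err) console.error('Datamart export failed:', err);
});

In practice, the other sources (GA4 sessions, Shopify orders) would be joined into the same table in the same way, which is what makes the datamart a single source of truth for every team.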

By providing a comprehensive table that contains all the necessary information, independent stakeholders from the marketing, product, accounting, or development teams can craft their own reports while maintaining consistency. This unified table ensures that all reports are based on the same data source, eliminating discrepancies and fostering a coherent understanding across departments. As a result, stakeholders can generate accurate and professional reports that align with the overall objectives and insights derived from a single, reliable data set.

This architecture also offers significant benefits in terms of automation and maintenance. By leveraging serverless AWS Lambda functions, the entire data processing pipeline can be automated, reducing the need for manual intervention and minimizing the risk of human error. Scheduled Lambda functions can handle data ingestion, transformation, and loading processes seamlessly, ensuring that data is always up-to-date and ready for analysis. Additionally, the use of infrastructure-as-code practices allows for easy updates and maintenance of the data pipeline, enabling quick adjustments to changing business requirements without disrupting the overall workflow. This automated and maintainable approach ensures a reliable and efficient data processing system that can adapt to the evolving needs of the organization.
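
As one concrete example of such infrastructure-as-code, the ingestion function and its daily schedule can be declared together. The sketch below uses the AWS CDK in JavaScript; the construct names, asset path, runtime, and the 03:00 UTC schedule are illustrative choices, not details taken from the case:

// Illustrative AWS CDK stack: a daily EventBridge rule triggering the ingestion Lambda
const { Stack, Duration } = require('aws-cdk-lib');
const lambda = require('aws-cdk-lib/aws-lambda');
const events = require('aws-cdk-lib/aws-events');
const targets = require('aws-cdk-lib/aws-events-targets');

class MarketingIngestionStack extends Stack {
    constructor(scope, id, props) {
        super(scope, id, props);

        // Ingestion Lambda packaged from the Git repository (asset path is an assumption)
        const ingestFn = new lambda.Function(this, 'FacebookAdsIngest', {
            runtime: lambda.Runtime.NODEJS_18_X,
            handler: 'index.handler',
            code: lambda.Code.fromAsset('lambdas/facebook-ads-ingest'),
            timeout: Duration.minutes(5)
        });

        // Invoke the ingestion function once a day at 03:00 UTC
        new events.Rule(this, 'DailyIngestSchedule', {
            schedule: events.Schedule.cron({ minute: '0', hour: '3' }),
            targets: [new targets.LambdaFunction(ingestFn)]
        });
    }
}

module.exports = { MarketingIngestionStack };

Because the stack definition lives in the same Git repository as the Lambda code, updates to the pipeline are ordinary code changes that can be reviewed and redeployed without manual console work.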

The key characteristics and benefits of the architecture are summarized below.
Cost-Effective
  • AWS Lambda: Serverless architecture reduces costs by only charging for actual compute time.
  • Amazon S3: Cost-effective storage solution for large volumes of data.
  • DuckDB: Lightweight and efficient database engine that can run directly on S3 data.
Scalability
  • AWS Lambda: Automatically scales with the volume of data and processing needs.
  • Amazon S3: Virtually unlimited storage capacity.
Simplicity
  • Serverless: No need to manage servers or infrastructure.
  • DuckDB: Simple integration with S3 and easy to use SQL interface.
Data Ingestion
  • AWS Lambda functions to process incoming marketing data from various sources (e.g., social media, email campaigns, web analytics).
  • Amazon S3 buckets to store raw and processed data.
Data Processing
  • AWS Lambda functions to transform and clean data.
  • DuckDB to query and analyze data directly from S3.
Data Storage
  • Amazon S3 for storing raw, processed, and aggregated data.
Data Analysis
  • DuckDB for running SQL queries on data stored in S3.
  • Integration with BI tools for visualization and reporting.
Cost Savings
  • Significant reduction in infrastructure and maintenance costs.
  • Pay-as-you-go pricing model with AWS services.
Improved Performance
  • Faster data processing and querying with serverless architecture and DuckDB.
  • Scalable solution that can handle increasing data volumes without performance degradation.
Ease of Use
  • Simplified data pipeline with minimal management overhead.
  • Easy integration with existing marketing tools and platforms.
Enhanced Insights
  • Ability to quickly analyze and visualize marketing data.
  • Improved decision-making with timely and accurate data insights.

Summary

This data engineering case study explores the challenges and solutions for marketing data warehouse engineering in a dynamic business environment. Traditional data warehouses are often costly and complex, making it difficult for small to medium-sized businesses to manage and analyze large volumes of data from various sources effectively. The proposed engineering solution involves creating a cost-effective, scalable, and easy-to-maintain data warehouse using serverless AWS Lambda services and the in-memory DuckDB database.

Summary of Key Points
Case Problem Overview
  • Companies struggle with managing and analyzing large volumes of marketing data.
  • Traditional data warehouses are expensive and complex.
  • Stakeholders need a single source of truth to avoid discrepancies in reports.
Dataflow Process Overview
  • Business growth leads to data silos and integration challenges.
  • A unified data architecture is needed to adapt to changing business requirements.
To-Be Process
  • Develop a single, streamlined ETL pipeline to reduce redundancy and ensure consistency.
  • Use a cloud-based ETL pipeline on AWS to integrate data from various sources and provide a single source of truth.
Implementation Architecture
  • Use AWS Lambda functions for data ingestion and transformation.
  • Store raw data in Amazon S3 and process it using DuckDB.
  • Create a datamart within S3 for analysis and reporting.
Results and Benefits
  • Denormalized datamarts stored in S3 can be connected to visualization tools.
  • Automated and maintainable data processing pipeline.
  • Cost-effective, scalable, and simple architecture.
  • Enhanced data analysis and decision-making capabilities.
Recommendations
  • Adopt Serverless Architecture: Utilize AWS Lambda functions to reduce costs and simplify management.
  • Leverage Cloud Storage: Use Amazon S3 for scalable and cost-effective data storage.
  • Implement Unified ETL Pipeline: Develop a single ETL process to minimize data silos and ensure consistency.
  • Use In-Memory Database: Employ DuckDB for efficient data processing and querying.
  • Automate Data Processing: Schedule Lambda functions for seamless data ingestion, transformation, and loading.
  • Integrate with BI Tools: Connect datamarts to visualization tools for comprehensive data analysis and reporting.

By following these recommendations, businesses can create a robust and flexible data warehouse solution that meets their marketing data engineering needs while ensuring cost-effectiveness, scalability, and simplicity.