Greek Net Pulse: A Serverless GCP Data Lakehouse
A robust, end-to-end data pipeline designed to ingest, transform, and serve internet quality analytics at scale. This project demonstrates modern infrastructure-as-code practices and custom serverless data processing using the Medallion architecture.
🚀 The Motivation
To accurately track and analyze internet connectivity trends across Greece using the official data.gov.gr API, I needed a highly reliable pipeline that wouldn't incur the costs of always-on compute clusters. I designed this platform to solve three main challenges:
- Infrastructure Reproducibility: Moving away from manual click-ops in the cloud console by fully managing all GCP resources and IAM permissions via Terraform.
- Serverless Scalability: Utilizing stateless, containerized PySpark jobs on Google Cloud Run instead of maintaining expensive, idle Dataproc clusters.
- Data Quality Guarantees: Implementing a strict Medallion architecture to ensure downstream dashboards only read highly curated, aggregated data.
🛠️ Key Technical Features
This pipeline is completely decoupled, relying on event-driven orchestration rather than monolithic scripts.
- **Event-Driven Orchestration (GitHub Actions):** Acts as the control plane for the entire pipeline. A scheduled cron job triggers a lightweight Python script for data ingestion, monitors for a success exit code, and only then triggers the heavy-duty compute jobs.
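A minimal sketch of that control-plane logic, in plain Python rather than the actual GitHub Actions workflow steps; the command lists here are placeholders for the real ingestion script and Cloud Run trigger:

```python
# Gate the expensive compute step on the ingestion step's exit code.
# The real pipeline expresses this dependency as GitHub Actions job steps;
# this is an illustrative stand-in.
import subprocess
import sys

def run_step(cmd: list[str]) -> bool:
    """Run one pipeline step; True iff it exited with code 0."""
    result = subprocess.run(cmd)
    return result.returncode == 0

def orchestrate(ingest_cmd: list[str], transform_cmd: list[str]) -> bool:
    # Ingestion must report success before the Spark job is launched.
    if not run_step(ingest_cmd):
        print("ingestion failed; skipping transform", file=sys.stderr)
        return False
    return run_step(transform_cmd)

if __name__ == "__main__":
    ok = orchestrate(
        [sys.executable, "-c", "print('ingest ok')"],     # stand-in for the ingest script
        [sys.executable, "-c", "print('transform ok')"],  # stand-in for the Cloud Run trigger
    )
    sys.exit(0 if ok else 1)
```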
- **Serverless Compute (Cloud Run + PySpark):** Data transformation is handled by a custom PySpark Docker image hosted on Docker Hub. Google Cloud Run spins up the container on demand to process data from GCS into BigQuery, scaling back to zero when finished.
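Because the container is stateless, each invocation has to resolve its entire job configuration from the environment. A hypothetical sketch of that entrypoint logic; the variable names, bucket name, and object layout are illustrative, not the project's exact values:

```python
# Resolve the Spark job's inputs and outputs purely from environment
# variables, so every Cloud Run invocation is self-contained and the
# service can scale to zero between runs. All names are hypothetical.
def resolve_job_config(env: dict) -> dict:
    bucket = env.get("BRONZE_BUCKET", "greek-net-pulse-bronze")
    dataset = env.get("BQ_DATASET", "silver")
    run_date = env["RUN_DATE"]  # injected by the orchestrator, e.g. "2024-01-15"
    return {
        "input_uri": f"gs://{bucket}/internet_traffic/{run_date}.json",
        "output_table": f"{dataset}.measurements",
    }
```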
- **Medallion Architecture:** A structured progression of data quality: Bronze (raw data ingestion in GCS), Silver (partitioned and clustered relational tables for sessions), and Gold (business-level aggregations like user retention, regional metrics, and a custom "frustration staircase").
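The real transformations run in PySpark, but the layering can be illustrated with plain Python; the record fields here (`region`, `download_mbps`) are hypothetical stand-ins for the actual schema:

```python
# Illustrative Medallion progression: reject malformed Bronze rows on the
# way into Silver, then aggregate Silver into a Gold business metric.
from collections import defaultdict
from statistics import mean

def to_silver(bronze_rows: list[dict]) -> list[dict]:
    """Bronze -> Silver: drop malformed rows, coerce types."""
    silver = []
    for row in bronze_rows:
        try:
            silver.append({
                "region": str(row["region"]),
                "download_mbps": float(row["download_mbps"]),
            })
        except (KeyError, TypeError, ValueError):
            continue  # rows failing basic quality checks never reach Silver
    return silver

def to_gold(silver_rows: list[dict]) -> dict:
    """Silver -> Gold: average download speed per region."""
    by_region = defaultdict(list)
    for row in silver_rows:
        by_region[row["region"]].append(row["download_mbps"])
    return {region: mean(speeds) for region, speeds in by_region.items()}
```

Downstream dashboards only ever read the Gold output, which is what gives the quality guarantee described above.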
📐 System Architecture
I designed the architecture as code using D2. The diagram below updates automatically via a CI/CD workflow whenever the infrastructure changes.
💻 Tech Stack
- Infrastructure as Code: Terraform (remote state, GCS, BigQuery, IAM)
- Compute: Google Cloud Run, PySpark, Docker (image hosted on Docker Hub)
- Storage & Warehouse: Google Cloud Storage, BigQuery
- Orchestration: GitHub Actions
- Serving: Streamlit
- Diagrams: D2
🔄 The Data Flow
The entire lifecycle is automated and requires zero manual intervention:
- Provision: Terraform establishes the remote state backend, provisions GCS buckets, configures BigQuery datasets, and locks down IAM service accounts and their secrets.
- Ingest (Bronze): GitHub Actions triggers a Python script to fetch the latest internet traffic data from the data.gov.gr API, performing a raw data dump into our Bronze GCS buckets.
- Transform (Silver/Gold): On ingestion success, Actions triggers a Cloud Run job, dynamically pulling the custom PySpark image from Docker Hub. The job reads the Bronze data, cleans it into Silver tables (measurements, sessions), and aggregates it into Gold tables (regional metrics, user retention, and frustration indexes).
- Serve: A hosted Streamlit dashboard connects directly to the BigQuery Gold dataset to visualize the finalized analytics.
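The Ingest step above can be sketched as follows. The data.gov.gr endpoint slug, the bucket layout, and the object naming are assumptions for illustration, and the upload uses the `google-cloud-storage` client:

```python
# Sketch of the Bronze ingestion step: fetch the raw payload and dump it
# as-is into a date-partitioned GCS object. Endpoint slug and object
# layout are assumed, not the project's exact values.
import datetime
import urllib.request

API_URL = "https://data.gov.gr/api/v1/query/internet_traffic"  # assumed dataset slug

def bronze_blob_name(run_date: datetime.date) -> str:
    """Date-partitioned object key for the raw dump in the Bronze bucket."""
    return f"bronze/internet_traffic/{run_date.isoformat()}.json"

def fetch_raw(api_token: str) -> bytes:
    """Pull the raw payload; data.gov.gr authenticates via a token header."""
    req = urllib.request.Request(
        API_URL, headers={"Authorization": f"Token {api_token}"}
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()

def upload_to_bronze(payload: bytes, bucket_name: str, run_date: datetime.date) -> None:
    """Raw dump only; all cleaning is deferred to the Silver layer."""
    from google.cloud import storage  # requires the google-cloud-storage package

    blob = storage.Client().bucket(bucket_name).blob(bronze_blob_name(run_date))
    blob.upload_from_string(payload, content_type="application/json")
```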