This document outlines my most important professional and personal projects across various industries and domains, with a focus on Data Engineering, Machine Learning (ML), Generative AI (GenAI), and Databricks Solution Architecture.
| Category | Details |
|---|---|
| Business Domain | ML model and GenAI development for audit-related business problems at a Big 4 client. |
| Project Description | Developed various ML models and GenAI tools for high-impact business problems. Key deliverables included: - Binary and multi-class classification models for project records and reports. - LLM-assisted processing of client-supplied files in various formats (text, Excel, PDF, images, zip files, folders). - Automated data extraction and enrichment from finance documents (e.g., balance sheets, invoices, fund reports). - Business-facing reporting dashboards and a lightweight UI for the AI tools. (An illustrative sketch follows this table.) |
| Tech Stack | Azure (ADLS Gen2, AI Document Intelligence, AI Content Understanding, Azure OpenAI); Databricks (Delta Lake, Unity Catalog, Vector Search, Jobs & Workflows, REST API); Python; SQL |
| Role(s) | ML Engineer, Data Engineer, Data Scientist, Databricks Solution Architect. Note: roles often overlapped, sometimes on parallel projects. |
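To illustrate the LLM-assisted document processing mentioned above, here is a minimal sketch of extracting a few fields from a finance document's text with Azure OpenAI. The endpoint, API key, deployment name, and field schema are placeholders of my own, not the project's actual implementation.

```python
# Minimal sketch, assuming an Azure OpenAI chat deployment is available.
# Endpoint, key, deployment name and the field list are placeholders.
import json

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",  # placeholder
    api_key="<api-key>",                                        # placeholder
    api_version="2024-02-01",
)

def extract_invoice_fields(document_text: str) -> dict:
    """Ask the model to pull a few invoice fields out of raw document text."""
    response = client.chat.completions.create(
        model="gpt-4o",  # name of the Azure OpenAI deployment (placeholder)
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract invoice_number, total_amount and currency from the "
                    "document and reply with a single JSON object only."
                ),
            },
            {"role": "user", "content": document_text},
        ],
        temperature=0,
    )
    # The prompt asks for JSON only, so the reply is parsed directly.
    return json.loads(response.choices[0].message.content)
```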
| Category | Details |
|---|---|
| Business Domain | AI-powered project health analysis |
| Project Description | Developed an AI tool that analyzes project health metrics based on textual summaries of project status. It uses a configurable LLM backend and RAG (with ChromaDB) for intelligent insights and recommendations. (An illustrative RAG sketch follows this table.) |
| Tech Stack | Python (FastAPI, LangChain, pytest); LLM APIs (OpenAI, Gemini, LiteLLM); ChromaDB for RAG; Bootstrap for the UI; GitHub Actions for CI/CD (uv, Ruff, MyPy, Codecov, Docker) |
| Role(s) | AI Engineer |
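As a rough illustration of the RAG setup described above, the sketch below indexes a couple of project status summaries in ChromaDB and retrieves the most relevant ones for a question; the retrieved text would then be passed to the configured LLM backend. The collection name, IDs, and sample summaries are invented for illustration.

```python
# Minimal RAG retrieval sketch with ChromaDB; names and data are hypothetical.
import chromadb

client = chromadb.PersistentClient(path="./chroma")  # local persistent store
collection = client.get_or_create_collection("project_status")

# Index a few status summaries (ChromaDB embeds them with its default model).
collection.add(
    ids=["p1-w12", "p2-w12"],
    documents=[
        "Project Alpha: two sprints behind schedule, key dependency blocked.",
        "Project Beta: on track, minor budget overrun expected in Q3.",
    ],
    metadatas=[{"project": "alpha"}, {"project": "beta"}],
)

# Retrieve the most relevant context for a health question; these documents
# would then be injected into the LLM prompt as grounding context.
results = collection.query(query_texts=["Which projects are at risk?"], n_results=2)
print(results["documents"][0])
```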
| Category | Details |
|---|---|
| Business Domain | LLM, RAG, and Vector Database related solution development |
| Project Description | As part of Datapao’s Data Science Group, we built an in-house RAG framework with end-to-end support for deploying RAG pipelines on Databricks technologies. My main focus was integrating various vector databases and vector search providers (e.g., ChromaDB, FAISS, Databricks Vector Search) and providing abstractions over those providers to improve developer experience. (An illustrative abstraction sketch follows this table.) |
| Tech Stack | Vector databases & libraries (Databricks Vector Search, ChromaDB, FAISS); Python; Apache Spark |
| Role(s) | ML Engineer, Backend Developer |
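The snippet below sketches the kind of provider abstraction described above: a small interface that pipeline code depends on, with a ChromaDB-backed implementation as one example. The interface and class names are my own invention, not the framework's actual API.

```python
# Hypothetical provider abstraction sketch; not the framework's real interface.
from typing import Protocol

import chromadb


class VectorSearchProvider(Protocol):
    """Minimal interface each vector search backend has to satisfy."""

    def index(self, ids: list[str], texts: list[str]) -> None: ...
    def search(self, query: str, k: int = 5) -> list[str]: ...


class ChromaProvider:
    """ChromaDB-backed implementation of the provider interface."""

    def __init__(self, collection_name: str) -> None:
        self._collection = chromadb.Client().get_or_create_collection(collection_name)

    def index(self, ids: list[str], texts: list[str]) -> None:
        self._collection.add(ids=ids, documents=texts)

    def search(self, query: str, k: int = 5) -> list[str]:
        hits = self._collection.query(query_texts=[query], n_results=k)
        return hits["documents"][0]


def retrieve_context(provider: VectorSearchProvider, question: str) -> list[str]:
    """Pipeline code depends only on the interface, so backends are swappable."""
    return provider.search(question, k=3)
```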
| Category | Details |
|---|---|
| Business Domain | PoC for ML use cases on Databricks |
| Project Description | Proof of concept for an enterprise client of Databricks, showcasing machine learning workflows on the Databricks platform, including model training, deployment, and integration with various data sources. (An illustrative MLflow sketch follows this table.) |
| Tech Stack | Databricks (MLflow, Delta Lake, Unity Catalog); Python; Apache Spark |
| Role(s) | Data Engineer |
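A minimal sketch of the MLflow training-and-registration flow such a PoC typically demonstrates; the experiment path, registered model name, and toy dataset are placeholders rather than the client's actual setup.

```python
# Minimal MLflow sketch: train, log, and register a model. Names are placeholders.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

mlflow.set_experiment("/Shared/ml-poc")  # hypothetical experiment path

with mlflow.start_run():
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)

    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Log and register the model so it can later be served or promoted.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="ml_poc_classifier",  # hypothetical name
    )
```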
| Category | Details |
|---|---|
| Business Domain | Near real-time ELT solution for Australian Energy Market Operator (AEMO) data. |
| Project Description | Migrated the Australian subsidiary of a major energy company off a slow, costly legacy Oracle system that could not handle modern ML workloads. Notable features: - A custom solution built on Azure and Databricks reduced data availability time for traders from 24 hours to under 5 minutes, at a significantly lower cost than the old system. - Enabled new use cases such as predicting energy demand & prices and “day-trading” electricity (coupled with the client’s battery farm data), resulting in multiple 10-100k USD profit gains daily. - The system lands data in Azure Blob Storage/ADLS Gen2, then uses Databricks Auto Loader and Spark Structured Streaming to ingest files as they arrive (see the illustrative sketch after this table). - A custom Python parser unzips files (also in a streaming fashion), processes the AEMO proprietary format, and separates the data into about 800 different tables, exposed via Unity Catalog as Delta tables. |
| Tech Stack | Azure (ADLS Gen2, Blob Storage, Azure Data Factory); Databricks (Delta Lake, Unity Catalog, Auto Loader, Jobs & Workflows); Python, Pandas; Apache Spark |
| Role(s) | Data Engineer |
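The sketch below shows the Auto Loader ingestion pattern described above in its simplest form, intended for a Databricks notebook where a `spark` session is already available. The landing path, schema/checkpoint locations, and table name are placeholders, and the real pipeline additionally unzips and parses the AEMO proprietary format before writing the ~800 tables.

```python
# Minimal Auto Loader sketch for a Databricks notebook (`spark` is pre-provided).
# Paths and the target table are placeholders, not the project's actual values.
landing_path = "abfss://landing@<storage-account>.dfs.core.windows.net/aemo/"

raw_stream = (
    spark.readStream.format("cloudFiles")                 # Auto Loader source
    .option("cloudFiles.format", "text")                  # read raw file lines
    .option("cloudFiles.schemaLocation", "/Volumes/main/aemo/_schemas")
    .load(landing_path)
)

(
    raw_stream.writeStream
    .option("checkpointLocation", "/Volumes/main/aemo/_checkpoints/raw")
    .trigger(availableNow=True)                           # process new files, then stop
    .toTable("main.aemo.raw_files")                       # Delta table in Unity Catalog
)
```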
| Category | Details |
|---|---|
| Business Domain | SAP HANA/BODS to Azure Databricks migration of a Global Retail Analytics platform. |
| Project Description | Migration and refactoring of an enterprise-scale Retail Analytics & Reporting platform from an end-of-life SAP HANA & SAP BusinessObjects Data Services (BODS) solution to a modern Databricks-based system on Azure. Key deliverables: - Redesigned the entire architecture for better performance and a better fit with business needs. - Rewrote and optimized all legacy SAP HANA SQL code, SAP Calculation Views, and procedures as Databricks SQL and Spark-native Python code (an illustrative sketch follows this table). |
| Tech Stack | Azure (ADLS Gen2); Databricks (Delta Lake, Unity Catalog, Delta Live Tables, Jobs & Workflows); Apache Spark; SQL; Python |
| Role(s) | Data Engineer, Solutions Consultant |
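As a hypothetical example of the kind of rewrite involved, the sketch below expresses a calculation-view-style join and aggregation as Spark-native Python on Databricks; all table and column names are invented for illustration.

```python
# Hypothetical rewrite sketch: a join + aggregation as Spark-native Python.
# Assumes the Delta tables referenced below exist; on Databricks `spark` is
# pre-provided, otherwise a session is created here.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

sales = spark.table("retail.silver.sales")      # hypothetical Delta tables
stores = spark.table("retail.silver.stores")

weekly_store_revenue = (
    sales.join(stores, on="store_id", how="inner")
    .withColumn("week", F.date_trunc("week", F.col("sale_date")))
    .groupBy("week", "store_id", "region")
    .agg(
        F.sum("net_amount").alias("revenue"),
        F.countDistinct("receipt_id").alias("transactions"),
    )
)

weekly_store_revenue.write.mode("overwrite").saveAsTable("retail.gold.weekly_store_revenue")
```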
| Category | Details |
|---|---|
| Business Domain | Spark-native, highly scalable Payments Reconciliation. |
| Project Description | Addressed a huge backlog of financial records (tens of millions of invoice items and payment transactions) traditionally reconciled manually by a large team; the manual process took 20 FTEs over a year for a single account with only about 100k items. - Achieved multi-million USD cost savings and large efficiency gains by making the process largely automatic and fully scalable. - The solution extracts items from SAP FI, then executes preprocessing and a custom matching algorithm on Databricks using Spark (an illustrative sketch follows this table). - Azure Data Factory manages extraction from and loading back to SAP, and triggers the Databricks Workflow that runs the reconciliation. |
| Tech Stack | Azure (ADLS Gen2, ADF); Azure Databricks (Delta Lake, Unity Catalog, Workflows); Apache Spark; Python, Pandas |
| Role(s) | Data Engineer |
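A heavily simplified sketch of a first, exact-match pass of such a reconciliation in PySpark. The production matching algorithm is more involved (tolerances, partial payments, many-to-many cases), and all table and column names here are hypothetical.

```python
# Simplified exact-match reconciliation sketch; names and tables are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()  # pre-provided on Databricks

invoices = spark.table("finance.silver.invoice_items")
payments = spark.table("finance.silver.payments")

# First pass: exact match on business reference and amount.
matched = (
    invoices.alias("i")
    .join(
        payments.alias("p"),
        on=[
            F.col("i.reference") == F.col("p.reference"),
            F.col("i.amount") == F.col("p.amount"),
        ],
        how="inner",
    )
    .select(
        F.col("i.invoice_id"),
        F.col("p.payment_id"),
        F.lit("exact_reference_amount").alias("match_rule"),
    )
)

# Anything not matched in this pass moves on to the next, fuzzier rule.
unmatched_invoices = invoices.join(matched, on="invoice_id", how="left_anti")

matched.write.mode("overwrite").saveAsTable("finance.gold.reconciliation_matches")
```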
| Category | Details |
|---|---|
| Business Domain | Cost and performance optimization of Apache Spark & Databricks workloads; architecture reviews and best-practice workshops. |
| Project Description | Optimization (delivered 30+ smaller use cases): - Optimized the performance of existing SQL or Python (sometimes Scala) code. - Refactored code to run efficiently on Spark/Databricks, or optimized the architecture to reduce operating costs (a typical example follows this table). Consulting & workshops (ranging from startup to enterprise clients, mainly NL, UK, DE, ES): - Reviewed architectures, suggested improvements, and identified performance and cost optimization opportunities. - Held various Databricks-related workshops (e.g., Delta Lake, Unity Catalog, migration, integration with systems like Power BI/MSSQL/PowerApps) and general workshops (e.g., Data Governance, DevOps in data, MLOps). |
| Tech Stack | Azure (ADLS Gen2, Blob Storage, ADF); AWS, GCP; Databricks (Delta Lake, Unity Catalog, Auto Loader, Jobs & Workflows, Vector Search, Model Serving, DLT, Lakeflow); Apache Spark; Python (PySpark, Pandas, NumPy, MLflow, etc.); SQL |
| Role(s) | Data Engineer, ML Engineer, Solutions Consultant, Resident Solutions Architect |
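A typical example of the optimizations mentioned above: replacing a row-by-row Python UDF with native Spark column expressions so the work stays inside the engine. The table and column names are hypothetical; it is a generic illustration, not one of the actual client use cases.

```python
# Generic optimization sketch: Python UDF vs. native column expressions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()  # pre-provided on Databricks

orders = spark.table("sales.silver.orders")  # hypothetical table

# Before: a Python UDF forces serialization of every row out to Python.
@F.udf(returnType=DoubleType())
def gross_amount_udf(net, vat_rate):
    return float(net) * (1.0 + float(vat_rate))

slow = orders.withColumn("gross_amount", gross_amount_udf("net_amount", "vat_rate"))

# After: the same logic with native column expressions, which Spark can
# optimize and execute without leaving the engine.
fast = orders.withColumn(
    "gross_amount", F.col("net_amount") * (1.0 + F.col("vat_rate"))
)
```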
| Category | Details |
|---|---|
| Business Domain | Slack-integrated Databricks automation. |
| Project Description | Developed a Slack bot with integrated Databricks cluster management capabilities and user notifications, helping the company control Databricks resource consumption organization-wide. (Built as part of a team; I was responsible for the Databricks API integration and the end-to-end cluster management workflow. An illustrative sketch follows this table.) |
| Tech Stack | Databricks (Delta Lake, Unity Catalog, REST API); Python; Slack Python SDK |
| Role(s) | Full-stack developer |
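A minimal sketch of a Slack command that manages Databricks clusters, using slack_bolt and the Databricks Python SDK; the actual bot integrated with the Databricks REST API directly, and the command names and tokens here are placeholders.

```python
# Minimal sketch: Slack slash commands backed by the Databricks SDK.
# Command names, tokens, and ports are placeholders.
import os

from databricks.sdk import WorkspaceClient
from slack_bolt import App

app = App(token=os.environ["SLACK_BOT_TOKEN"],
          signing_secret=os.environ["SLACK_SIGNING_SECRET"])
dbx = WorkspaceClient()  # reads DATABRICKS_HOST / DATABRICKS_TOKEN from the env


@app.command("/clusters")
def list_clusters(ack, respond, command):
    """Reply to /clusters with the name and state of each cluster."""
    ack()
    lines = [f"{c.cluster_name}: {c.state}" for c in dbx.clusters.list()]
    respond("\n".join(lines) or "No clusters found.")


@app.command("/stop-cluster")
def stop_cluster(ack, respond, command):
    """Reply to /stop-cluster <cluster-id> by terminating that cluster."""
    ack()
    cluster_id = command["text"].strip()
    dbx.clusters.delete(cluster_id=cluster_id)  # terminates the cluster
    respond(f"Termination requested for cluster {cluster_id}.")


if __name__ == "__main__":
    app.start(port=3000)  # simplest mode; production would use Socket Mode or a WSGI server
```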