Peter Kaszt

📑 Relevant Projects

This document outlines my most important professional and personal projects across various industries and domains, focusing on Data Engineering, Machine Learning (ML), Generative AI (GenAI), and Databricks Solution Architecture.


ML and GenAI Development

Business Domain: ML model and GenAI development for audit-related business problems at a Big4 client.
Project Description: Developed various ML models and GenAI tools for high-impact business problems.

Key deliverables:
- Binary and multi-class classification models for project records and reports (a minimal sketch follows this entry).
- LLM-assisted processing of client-supplied files in various formats (text, Excel, PDF, images, zip files, folders).
- Automated data extraction and enrichment from finance documents (e.g., balance sheets, invoices, fund reports).
- Business-facing reporting dashboards and a lightweight UI for the AI tools.

Tech Stack:
- Azure (ADLS Gen2, AI Document Intelligence, AI Content Understanding, Azure OpenAI)
- Databricks (Delta Lake, Unity Catalog, Vector Search, Jobs & Workflows, REST API)
- Python
- SQL

Role(s): ML Engineer, Data Engineer, Data Scientist, Databricks Solution Architect
Note: Roles often overlapped, sometimes across parallel projects.
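
To give a concrete flavour of the classification deliverable above, here is a minimal sketch of a text classifier in the same spirit, assuming scikit-learn and fully illustrative records and labels (the real models, features, and data are client-specific):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # Illustrative training data: free-text record snippets and their labels.
    texts = [
        "revenue recognised before delivery",
        "inventory count matches the ledger",
        "duplicate invoice detected on the account",
    ]
    labels = ["review", "ok", "review"]

    # TF-IDF features feeding a logistic regression; the same pipeline covers
    # binary and, with more label values, multi-class problems.
    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(texts, labels)

    print(model.predict(["unmatched payment found on the account"]))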

Project Health Status Analyzer

Business Domain: AI-powered project health analysis.
Project Description: Developed an AI tool that analyzes projects for health metrics based on a textual summary of project status. It uses a configurable LLM backend and RAG (with ChromaDB) for intelligent insights and recommendations (a minimal sketch follows this entry).

Tech Stack:
- Python (FastAPI, LangChain, PyTest)
- LLM APIs (OpenAI, Gemini, LiteLLM)
- ChromaDB for RAG
- Bootstrap for UI
- GitHub Actions for CI/CD: UV, Ruff, MyPy, CodeCov, Docker

Role(s): AI Engineer
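
A minimal sketch of the retrieve-then-generate flow described above, assuming ChromaDB's default embedding function and the OpenAI client (documents, model name, and prompt are illustrative; the real tool selects its backend via LiteLLM configuration):

    import chromadb
    from openai import OpenAI

    # In-memory Chroma collection holding past project status summaries.
    chroma = chromadb.Client()
    collection = chroma.get_or_create_collection("project_status")
    collection.add(
        ids=["p1", "p2"],
        documents=[
            "Project A: two sprints behind schedule, key engineer left.",
            "Project B: on track, minor budget overrun resolved.",
        ],
    )

    def analyze(summary: str) -> str:
        # Retrieve the most similar historical summaries as grounding context.
        hits = collection.query(query_texts=[summary], n_results=2)
        context = "\n".join(hits["documents"][0])
        # Ask the LLM for a health assessment based on that context.
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; the backend is configurable
            messages=[
                {"role": "system", "content": "Assess project health and give recommendations."},
                {"role": "user", "content": f"Context:\n{context}\n\nStatus:\n{summary}"},
            ],
        )
        return resp.choices[0].message.content

    print(analyze("Project C: blocked on a vendor API, morale dropping."))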

Datapao RAG Framework Development

Business Domain: LLM, RAG, and vector database solution development.
Project Description: As part of Datapao’s Data Science Group, we built an in-house RAG framework with end-to-end support for deploying RAG pipelines using Databricks technologies.

My main focus was integrating various vector databases and vector search providers (e.g., ChromaDB, FAISS, Databricks Vector Search) and providing abstractions above the providers to improve developer experience (a sketch follows this entry).

Tech Stack:
- Vector databases & libraries (Databricks Vector Search, ChromaDB, FAISS)
- Python
- Apache Spark

Role(s): ML Engineer, Backend Developer
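
A minimal sketch of the provider-abstraction idea, using a hypothetical VectorStore interface with a ChromaDB-backed implementation (all names are illustrative, not the framework's actual API):

    from abc import ABC, abstractmethod

    import chromadb

    class VectorStore(ABC):
        """Common interface hiding provider-specific details (hypothetical)."""

        @abstractmethod
        def add(self, ids: list[str], documents: list[str]) -> None: ...

        @abstractmethod
        def search(self, query: str, k: int = 4) -> list[str]: ...

    class ChromaStore(VectorStore):
        def __init__(self, name: str) -> None:
            self._collection = chromadb.Client().get_or_create_collection(name)

        def add(self, ids: list[str], documents: list[str]) -> None:
            self._collection.add(ids=ids, documents=documents)

        def search(self, query: str, k: int = 4) -> list[str]:
            hits = self._collection.query(query_texts=[query], n_results=k)
            return hits["documents"][0]

    # Pipelines depend only on VectorStore, so a FAISS or Databricks Vector
    # Search backend can be swapped in without touching pipeline code.
    store: VectorStore = ChromaStore("docs")
    store.add(["d1"], ["Delta Lake provides ACID transactions on data lakes."])
    print(store.search("What gives Delta Lake ACID guarantees?", k=1))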

Databricks ML PoC

Business Domain: PoC for ML use cases on Databricks.
Project Description: Proof of concept for an Enterprise client of Databricks, showcasing machine learning workflows on the Databricks platform, including model training, deployment, and integration with various data sources (a sketch follows this entry).

Tech Stack:
- Databricks (MLflow, Delta Lake, Unity Catalog)
- Python
- Apache Spark

Role(s): Data Engineer
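
A minimal sketch of the MLflow-tracked training loop such a PoC demonstrates, assuming scikit-learn and a toy dataset (Unity Catalog model registration and serving steps are omitted):

    import mlflow
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    with mlflow.start_run():
        model = RandomForestClassifier(n_estimators=100)
        model.fit(X_train, y_train)

        # Track parameters, metrics, and the model artifact in MLflow.
        mlflow.log_param("n_estimators", 100)
        mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
        mlflow.sklearn.log_model(model, "model")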

Near Real-time ELT of Australian Energy Market Data

Business Domain: Near real-time ELT solution for Australian Energy Market Operator (AEMO) data.
Project Description: Migrated the Australian subsidiary of a major energy company off a slow, costly legacy Oracle system that couldn’t handle modern ML workloads.

Notable features:
- The custom Azure and Databricks solution reduced data availability time for traders from 24 hours to less than 5 minutes, at a significantly lower cost than the old system.
- Enabled new use cases, such as predicting energy demand and prices and “day-trading” electricity (coupled with the client’s battery farm data), resulting in an extra 10-100k USD profit daily.
- Data lands in Azure Blob Storage/ADLS Gen2, then Databricks Auto Loader and Spark Structured Streaming ingest the files as they arrive (a sketch follows this entry).
- A custom Python parser unzips the files (also in a streaming fashion), processes the proprietary AEMO format, and separates the data into about 800 tables, exposed as Delta tables via Unity Catalog.

Tech Stack:
- Azure (ADLS Gen2, Blob Storage, Azure Data Factory)
- Databricks (Delta Lake, Unity Catalog, Auto Loader, Jobs & Workflows)
- Python, Pandas
- Apache Spark

Role(s): Data Engineer
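
A minimal sketch of the ingestion pattern described above. It assumes a Databricks runtime (where the cloudFiles Auto Loader source is available); parse_aemo_zip is a heavily simplified stand-in for the custom parser, and all storage paths and table names are illustrative:

    import io
    import zipfile

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def parse_aemo_zip(content: bytes):
        # Simplified stand-in for the custom parser: unzip the payload and
        # yield (table_name, rows) per member file. The real parser splits the
        # proprietary AEMO multi-table format into ~800 logical tables.
        with zipfile.ZipFile(io.BytesIO(content)) as zf:
            for name in zf.namelist():
                rows = [{"raw_line": line} for line in zf.read(name).decode().splitlines()]
                yield name.rsplit(".", 1)[0], rows

    # Auto Loader picks up new files from the landing zone as they arrive.
    raw = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load("abfss://landing@myaccount.dfs.core.windows.net/aemo/")
    )

    def route_tables(batch_df, batch_id):
        # Parse each new file and append its tables to matching Delta tables.
        for file in batch_df.select("path", "content").collect():
            for table_name, rows in parse_aemo_zip(bytes(file.content)):
                spark.createDataFrame(rows).write.mode("append").saveAsTable(f"aemo.{table_name}")

    (raw.writeStream
        .foreachBatch(route_tables)
        .option("checkpointLocation", "abfss://checkpoints@myaccount.dfs.core.windows.net/aemo")
        .start())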

SAP HANA/BODS to Azure Databricks Migration (Retail Analytics)

Business Domain: SAP HANA/BODS to Azure Databricks migration of a global Retail Analytics platform.
Project Description: Migration and refactoring of an enterprise-scale Retail Analytics & Reporting platform from an end-of-life SAP HANA & BusinessObjects Data Services (BODS) solution to a modern Databricks-based system on Azure.

Key deliverables:
- Redesigned the entire architecture for better performance and a better fit with business needs.
- Rewrote and optimized all legacy SAP HANA SQL code, SAP Calculation Views, and procedures into Databricks SQL and Spark-native Python code (an illustrative example follows this entry).

Tech Stack:
- Azure (ADLS Gen2)
- Databricks (Delta Lake, Unity Catalog, Delta Live Tables, Jobs & Workflows)
- Apache Spark
- SQL
- Python

Role(s): Data Engineer, Solutions Consultant
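
An illustrative, non-client-specific example of the kind of rewrite involved, moving a simple HANA-style aggregation to Spark-native Python:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    sales = spark.createDataFrame(
        [("store-1", "2024-01-01", 120.0), ("store-1", "2024-01-02", 80.0)],
        ["store_id", "sales_date", "amount"],
    )

    # Legacy HANA SQL (conceptually):
    #   SELECT store_id, SUM(amount) AS total_sales FROM sales GROUP BY store_id;
    # Spark-native equivalent running against Delta tables:
    totals = sales.groupBy("store_id").agg(F.sum("amount").alias("total_sales"))
    totals.show()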

Spark-Native Payments Reconciliation

Business Domain: Spark-native, highly scalable payments reconciliation.
Project Description: Addressed a huge backlog of financial records (tens of millions of invoice items and payment transactions) that had traditionally been reconciled manually by a large team; the manual process took 20 FTEs over a year for a single account of only about 100k items.

- Achieved multi-million USD cost savings and major efficiency gains by making the process mostly automatic and fully scalable.
- The solution extracts items from SAP FI, then runs preprocessing and a custom matching algorithm on Databricks using Spark (a sketch follows this entry).
- Azure Data Factory manages extraction from and loading back to SAP, and triggers the Databricks Workflow that performs the reconciliation.

Tech Stack:
- Azure (ADLS Gen2, ADF)
- Azure Databricks (Delta Lake, Unity Catalog, Workflows)
- Apache Spark
- Python, Pandas

Role(s): Data Engineer
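
A minimal sketch of the exact-match first pass of such a matcher, with illustrative schemas and data; the real algorithm adds fuzzier passes (aggregated amounts, tolerances, partial references) for whatever falls through:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative inputs; in the real pipeline these come from SAP FI extracts.
    invoices = spark.createDataFrame(
        [(1, "A-100", "REF-7", 250.0), (2, "A-100", "REF-8", 90.0)],
        ["invoice_item_id", "account_id", "reference", "amount"],
    )
    payments = spark.createDataFrame(
        [(10, "A-100", "REF-7", 250.0)],
        ["payment_id", "account_id", "reference", "amount"],
    )

    # Pass 1: exact matches on account, reference, and amount.
    exact = invoices.join(payments, ["account_id", "reference", "amount"], "inner")
    matched = exact.select("invoice_item_id", "payment_id")

    # Anything not matched exactly is handed to the next, fuzzier pass.
    unmatched = invoices.join(matched, "invoice_item_id", "left_anti")

    matched.show()
    unmatched.show()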

Optimization & Consulting (various use cases & clients)

Business Domain: Cost and performance optimization of Apache Spark & Databricks workloads; architecture reviews and best-practice workshops.
Project Description:
Optimization (delivered 30+ smaller use cases):
- Optimized the performance of existing SQL or Python (sometimes Scala) code.
- Refactored code to run efficiently on Spark/Databricks, or optimized the architecture to reduce operating cost (typical refactors are sketched after this entry).

Consulting & Workshops (clients ranging from startups to enterprises, mainly in NL, UK, DE, ES):
- Reviewed architectures, suggested improvements, and identified performance and cost optimization opportunities.
- Held various Databricks-related workshops (e.g., Delta Lake, Unity Catalog, migration, integration with systems like Power BI/MSSQL/PowerApps) and general workshops (e.g., Data Governance, DevOps in data, MLOps).

Tech Stack:
- Azure (ADLS Gen2, Blob Storage, ADF), AWS, GCP
- Databricks (Delta Lake, Unity Catalog, Auto Loader, Jobs & Workflows, Vector Search, Model Serving, DLT, Lakeflow)
- Apache Spark
- Python (PySpark, Pandas, NumPy, MLflow, etc.)
- SQL

Role(s): Data Engineer, ML Engineer, Solutions Consultant, Resident Solutions Architect
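
Two typical micro-examples of such refactors, with illustrative data: replacing a row-wise Python UDF with built-in column expressions, and broadcasting a small dimension table to avoid a shuffle:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("  Alice ",), ("BOB",)], ["name"])

    # Before (slow): a row-wise Python UDF that serializes every value out of
    # the JVM:
    #   normalize = F.udf(lambda s: s.strip().lower())
    #   df = df.withColumn("name", normalize("name"))

    # After (fast): the equivalent built-in expressions stay engine-native.
    df = df.withColumn("name", F.lower(F.trim("name")))
    df.show()

    # Broadcasting the small side of a join avoids shuffling the large side.
    facts = spark.createDataFrame([(1, 9.99), (2, 5.00)], ["customer_id", "amount"])
    dims = spark.createDataFrame([(1, "retail"), (2, "wholesale")], ["customer_id", "segment"])
    facts.join(F.broadcast(dims), "customer_id").show()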

Slack-Integrated Databricks Automation

Business Domain: Slack-integrated Databricks automation.
Project Description: Developed a Slack bot with integrated Databricks cluster management capabilities and user notifications, giving better control over company-wide Databricks resource consumption. As part of a team, I was responsible for the Databricks API integration and the end-to-end cluster management workflow (a sketch follows this entry).

Tech Stack:
- Databricks (Delta Lake, Unity Catalog, REST API)
- Python
- Slack Python SDK

Role(s): Full-stack Developer
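
A minimal sketch of the bot's Databricks side, assuming the public Clusters REST API and the Slack SDK; environment variables and the channel name are illustrative:

    import os

    import requests
    from slack_sdk import WebClient

    DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<id>.azuredatabricks.net
    DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]

    def list_running_clusters() -> list[str]:
        # Databricks Clusters API: list every cluster in the workspace.
        resp = requests.get(
            f"{DATABRICKS_HOST}/api/2.0/clusters/list",
            headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
            timeout=30,
        )
        resp.raise_for_status()
        clusters = resp.json().get("clusters", [])
        return [c["cluster_name"] for c in clusters if c["state"] == "RUNNING"]

    def notify(channel: str) -> None:
        # Post the running-cluster list to Slack.
        slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
        names = list_running_clusters() or ["(none)"]
        slack.chat_postMessage(channel=channel, text="Running clusters:\n" + "\n".join(names))

    if __name__ == "__main__":
        notify("#databricks-alerts")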

Open Source Contributions


Personal Projects

AI & LLM

Data Engineering

Webapps & API Experiments

Games & Game Helpers


Participation in the Hungarian National IT Competition