Peter Kaszt

📑 Relevant Projects

This document outlines my most important professional and personal projects across various industries and domains, focusing on Data Engineering, Machine Learning (ML), Generative AI (GenAI), and Databricks Solution Architecture.


ML and GenAI Development

Business Domain: ML model and GenAI development for audit-related business problems at a Big4 client.
Project Description: Developed various ML models and GenAI tools for high-impact business problems.

Key deliverables:
- Binary and multi-class classification models for project records and reports (a minimal sketch follows this entry).
- LLM-assisted processing of client-supplied files in various formats (text, Excel, PDF, images, zip files, folders).
- Automated data extraction and enrichment from finance documents (e.g., balance sheets, invoices, fund reports).
- Business-facing reporting dashboards and a lightweight UI for the AI tools.

Tech Stack:
- Azure (ADLS Gen2, AI Document Intelligence, AI Content Understanding, Azure OpenAI)
- Databricks (Delta Lake, Unity Catalog, Vector Search, Jobs & Workflows, REST API)
- Python
- SQL

Role(s): ML Engineer, Data Engineer, Data Scientist, Databricks Solution Architect
Note: Roles often overlapped, sometimes across parallel projects.
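
To give a concrete flavour of the classification deliverable above, here is a minimal sketch of a text classifier in the same spirit, assuming scikit-learn and fully illustrative records and labels (the real models, features, and data are client-specific):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    # Illustrative training data: free-text record snippets and their labels.
    texts = [
        "revenue recognised before delivery",
        "inventory count matches the ledger",
        "duplicate invoice detected on the account",
    ]
    labels = ["review", "ok", "review"]

    # TF-IDF features feeding a logistic regression; the same pipeline covers
    # binary and, with more label values, multi-class problems.
    model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    model.fit(texts, labels)

    print(model.predict(["unmatched payment found on the account"]))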

Project Health Status Analyzer

Business Domain: AI-powered project health analysis.
Project Description: Developed an AI tool that analyzes projects for health metrics based on a textual summary of project status. It uses a configurable LLM backend and RAG (with ChromaDB) for intelligent insights and recommendations (a minimal sketch follows this entry).

Tech Stack:
- Python (FastAPI, LangChain, PyTest)
- LLM APIs (OpenAI, Gemini, LiteLLM)
- ChromaDB for RAG
- Bootstrap for UI
- GitHub Actions for CI/CD: UV, Ruff, MyPy, CodeCov, Docker

Role(s): AI Engineer
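
A minimal sketch of the retrieve-then-generate flow described above, assuming ChromaDB's default embedding function and the OpenAI client (documents, model name, and prompt are illustrative; the real tool selects its backend via LiteLLM configuration):

    import chromadb
    from openai import OpenAI

    # In-memory Chroma collection holding past project status summaries.
    chroma = chromadb.Client()
    collection = chroma.get_or_create_collection("project_status")
    collection.add(
        ids=["p1", "p2"],
        documents=[
            "Project A: two sprints behind schedule, key engineer left.",
            "Project B: on track, minor budget overrun resolved.",
        ],
    )

    def analyze(summary: str) -> str:
        # Retrieve the most similar historical summaries as grounding context.
        hits = collection.query(query_texts=[summary], n_results=2)
        context = "\n".join(hits["documents"][0])
        # Ask the LLM for a health assessment based on that context.
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative; the backend is configurable
            messages=[
                {"role": "system", "content": "Assess project health and give recommendations."},
                {"role": "user", "content": f"Context:\n{context}\n\nStatus:\n{summary}"},
            ],
        )
        return resp.choices[0].message.content

    print(analyze("Project C: blocked on a vendor API, morale dropping."))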

Datapao RAG Framework Development

Business Domain: LLM, RAG, and vector database solution development.
Project Description: As part of Datapao’s Data Science Group, we built an in-house RAG framework with end-to-end support for deploying RAG pipelines using Databricks technologies.

My main focus was integrating various vector databases and vector search providers (e.g., ChromaDB, FAISS, Databricks Vector Search) and providing abstractions above the providers to improve developer experience (a sketch follows this entry).

Tech Stack:
- Vector databases & libraries (Databricks Vector Search, ChromaDB, FAISS)
- Python
- Apache Spark

Role(s): ML Engineer, Backend Developer
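
A minimal sketch of the provider-abstraction idea, using a hypothetical VectorStore interface with a ChromaDB-backed implementation (all names are illustrative, not the framework's actual API):

    from abc import ABC, abstractmethod

    import chromadb

    class VectorStore(ABC):
        """Common interface hiding provider-specific details (hypothetical)."""

        @abstractmethod
        def add(self, ids: list[str], documents: list[str]) -> None: ...

        @abstractmethod
        def search(self, query: str, k: int = 4) -> list[str]: ...

    class ChromaStore(VectorStore):
        def __init__(self, name: str) -> None:
            self._collection = chromadb.Client().get_or_create_collection(name)

        def add(self, ids: list[str], documents: list[str]) -> None:
            self._collection.add(ids=ids, documents=documents)

        def search(self, query: str, k: int = 4) -> list[str]:
            hits = self._collection.query(query_texts=[query], n_results=k)
            return hits["documents"][0]

    # Pipelines depend only on VectorStore, so a FAISS or Databricks Vector
    # Search backend can be swapped in without touching pipeline code.
    store: VectorStore = ChromaStore("docs")
    store.add(["d1"], ["Delta Lake provides ACID transactions on data lakes."])
    print(store.search("What gives Delta Lake ACID guarantees?", k=1))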

Databricks ML PoC

Business Domain: PoC for ML use cases on Databricks.
Project Description: Proof of concept for an Enterprise client of Databricks, showcasing machine learning workflows on the Databricks platform, including model training, deployment, and integration with various data sources (a sketch follows this entry).

Tech Stack:
- Databricks (MLflow, Delta Lake, Unity Catalog)
- Python
- Apache Spark

Role(s): Data Engineer
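
A minimal sketch of the MLflow-tracked training loop such a PoC demonstrates, assuming scikit-learn and a toy dataset (Unity Catalog model registration and serving steps are omitted):

    import mlflow
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    with mlflow.start_run():
        model = RandomForestClassifier(n_estimators=100)
        model.fit(X_train, y_train)

        # Track parameters, metrics, and the model artifact in MLflow.
        mlflow.log_param("n_estimators", 100)
        mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
        mlflow.sklearn.log_model(model, "model")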

Near Real-time ELT of Australian Energy Market Data

Business Domain: Near real-time ELT solution for Australian Energy Market Operator (AEMO) data.
Project Description: Migrated the Australian subsidiary of a major energy company off a slow, costly legacy Oracle system that couldn’t handle modern ML workloads.

Notable features:
- The custom Azure and Databricks solution reduced data availability time for traders from 24 hours to less than 5 minutes, at a significantly lower cost than the old system.
- Enabled new use cases, such as predicting energy demand and prices and “day-trading” electricity (coupled with the client’s battery farm data), resulting in an extra 10-100k USD profit daily.
- Data lands in Azure Blob Storage/ADLS Gen2, then Databricks Auto Loader and Spark Structured Streaming ingest the files as they arrive (a sketch follows this entry).
- A custom Python parser unzips the files (also in a streaming fashion), processes the proprietary AEMO format, and separates the data into about 800 tables, exposed as Delta tables via Unity Catalog.

Tech Stack:
- Azure (ADLS Gen2, Blob Storage, Azure Data Factory)
- Databricks (Delta Lake, Unity Catalog, Auto Loader, Jobs & Workflows)
- Python, Pandas
- Apache Spark

Role(s): Data Engineer
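
A minimal sketch of the ingestion pattern described above. It assumes a Databricks runtime (where the cloudFiles Auto Loader source is available); parse_aemo_zip is a heavily simplified stand-in for the custom parser, and all storage paths and table names are illustrative:

    import io
    import zipfile

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    def parse_aemo_zip(content: bytes):
        # Simplified stand-in for the custom parser: unzip the payload and
        # yield (table_name, rows) per member file. The real parser splits the
        # proprietary AEMO multi-table format into ~800 logical tables.
        with zipfile.ZipFile(io.BytesIO(content)) as zf:
            for name in zf.namelist():
                rows = [{"raw_line": line} for line in zf.read(name).decode().splitlines()]
                yield name.rsplit(".", 1)[0], rows

    # Auto Loader picks up new files from the landing zone as they arrive.
    raw = (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "binaryFile")
        .load("abfss://landing@myaccount.dfs.core.windows.net/aemo/")
    )

    def route_tables(batch_df, batch_id):
        # Parse each new file and append its tables to matching Delta tables.
        for file in batch_df.select("path", "content").collect():
            for table_name, rows in parse_aemo_zip(bytes(file.content)):
                spark.createDataFrame(rows).write.mode("append").saveAsTable(f"aemo.{table_name}")

    (raw.writeStream
        .foreachBatch(route_tables)
        .option("checkpointLocation", "abfss://checkpoints@myaccount.dfs.core.windows.net/aemo")
        .start())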

SAP HANA/BODS to Azure Databricks Migration (Retail Analytics)

Business Domain: SAP HANA/BODS to Azure Databricks migration of a global Retail Analytics platform.
Project Description: Migration and refactoring of an enterprise-scale Retail Analytics & Reporting platform from an end-of-life SAP HANA & BusinessObjects Data Services (BODS) solution to a modern Databricks-based system on Azure.

Key deliverables:
- Redesigned the entire architecture for better performance and a better fit with business needs.
- Rewrote and optimized all legacy SAP HANA SQL code, SAP Calculation Views, and procedures into Databricks SQL and Spark-native Python code (an illustrative example follows this entry).

Tech Stack:
- Azure (ADLS Gen2)
- Databricks (Delta Lake, Unity Catalog, Delta Live Tables, Jobs & Workflows)
- Apache Spark
- SQL
- Python

Role(s): Data Engineer, Solutions Consultant
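
An illustrative, non-client-specific example of the kind of rewrite involved, moving a simple HANA-style aggregation to Spark-native Python:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    sales = spark.createDataFrame(
        [("store-1", "2024-01-01", 120.0), ("store-1", "2024-01-02", 80.0)],
        ["store_id", "sales_date", "amount"],
    )

    # Legacy HANA SQL (conceptually):
    #   SELECT store_id, SUM(amount) AS total_sales FROM sales GROUP BY store_id;
    # Spark-native equivalent running against Delta tables:
    totals = sales.groupBy("store_id").agg(F.sum("amount").alias("total_sales"))
    totals.show()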

Spark-Native Payments Reconciliation

Business Domain: Spark-native, highly scalable payments reconciliation.
Project Description: Addressed a huge backlog of financial records (tens of millions of invoice items and payment transactions) that had traditionally been reconciled manually by a large team; the manual process took 20 FTEs over a year for a single account of only about 100k items.

- Achieved multi-million USD cost savings and major efficiency gains by making the process mostly automatic and fully scalable.
- The solution extracts items from SAP FI, then runs preprocessing and a custom matching algorithm on Databricks using Spark (a sketch follows this entry).
- Azure Data Factory manages extraction from and loading back to SAP, and triggers the Databricks Workflow that performs the reconciliation.

Tech Stack:
- Azure (ADLS Gen2, ADF)
- Azure Databricks (Delta Lake, Unity Catalog, Workflows)
- Apache Spark
- Python, Pandas

Role(s): Data Engineer
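
A minimal sketch of the exact-match first pass of such a matcher, with illustrative schemas and data; the real algorithm adds fuzzier passes (aggregated amounts, tolerances, partial references) for whatever falls through:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Illustrative inputs; in the real pipeline these come from SAP FI extracts.
    invoices = spark.createDataFrame(
        [(1, "A-100", "REF-7", 250.0), (2, "A-100", "REF-8", 90.0)],
        ["invoice_item_id", "account_id", "reference", "amount"],
    )
    payments = spark.createDataFrame(
        [(10, "A-100", "REF-7", 250.0)],
        ["payment_id", "account_id", "reference", "amount"],
    )

    # Pass 1: exact matches on account, reference, and amount.
    exact = invoices.join(payments, ["account_id", "reference", "amount"], "inner")
    matched = exact.select("invoice_item_id", "payment_id")

    # Anything not matched exactly is handed to the next, fuzzier pass.
    unmatched = invoices.join(matched, "invoice_item_id", "left_anti")

    matched.show()
    unmatched.show()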

Optimization & Consulting (various use cases & clients)

Business Domain: Cost and performance optimization of Apache Spark & Databricks workloads; architecture reviews and best-practice workshops.
Project Description:
Optimization (delivered 30+ smaller use cases):
- Optimized the performance of existing SQL or Python (sometimes Scala) code.
- Refactored code to run efficiently on Spark/Databricks, or optimized the architecture to reduce operating cost (typical refactors are sketched after this entry).

Consulting & Workshops (clients ranging from startups to enterprises, mainly in NL, UK, DE, ES):
- Reviewed architectures, suggested improvements, and identified performance and cost optimization opportunities.
- Held various Databricks-related workshops (e.g., Delta Lake, Unity Catalog, migration, integration with systems like Power BI/MSSQL/PowerApps) and general workshops (e.g., Data Governance, DevOps in data, MLOps).

Tech Stack:
- Azure (ADLS Gen2, Blob Storage, ADF), AWS, GCP
- Databricks (Delta Lake, Unity Catalog, Auto Loader, Jobs & Workflows, Vector Search, Model Serving, DLT, Lakeflow)
- Apache Spark
- Python (PySpark, Pandas, NumPy, MLflow, etc.)
- SQL

Role(s): Data Engineer, ML Engineer, Solutions Consultant, Resident Solutions Architect
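
Two typical micro-examples of such refactors, with illustrative data: replacing a row-wise Python UDF with built-in column expressions, and broadcasting a small dimension table to avoid a shuffle:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("  Alice ",), ("BOB",)], ["name"])

    # Before (slow): a row-wise Python UDF that serializes every value out of
    # the JVM:
    #   normalize = F.udf(lambda s: s.strip().lower())
    #   df = df.withColumn("name", normalize("name"))

    # After (fast): the equivalent built-in expressions stay engine-native.
    df = df.withColumn("name", F.lower(F.trim("name")))
    df.show()

    # Broadcasting the small side of a join avoids shuffling the large side.
    facts = spark.createDataFrame([(1, 9.99), (2, 5.00)], ["customer_id", "amount"])
    dims = spark.createDataFrame([(1, "retail"), (2, "wholesale")], ["customer_id", "segment"])
    facts.join(F.broadcast(dims), "customer_id").show()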

Slack-Integrated Databricks Automation

Business Domain: Slack-integrated Databricks automation.
Project Description: Developed a Slack bot with integrated Databricks cluster management capabilities and user notifications, giving better control over company-wide Databricks resource consumption. As part of a team, I was responsible for the Databricks API integration and the end-to-end cluster management workflow (a sketch follows this entry).

Tech Stack:
- Databricks (Delta Lake, Unity Catalog, REST API)
- Python
- Slack Python SDK

Role(s): Full-stack Developer
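
A minimal sketch of the bot's Databricks side, assuming the public Clusters REST API and the Slack SDK; environment variables and the channel name are illustrative:

    import os

    import requests
    from slack_sdk import WebClient

    DATABRICKS_HOST = os.environ["DATABRICKS_HOST"]    # e.g. https://adb-<id>.azuredatabricks.net
    DATABRICKS_TOKEN = os.environ["DATABRICKS_TOKEN"]

    def list_running_clusters() -> list[str]:
        # Databricks Clusters API: list every cluster in the workspace.
        resp = requests.get(
            f"{DATABRICKS_HOST}/api/2.0/clusters/list",
            headers={"Authorization": f"Bearer {DATABRICKS_TOKEN}"},
            timeout=30,
        )
        resp.raise_for_status()
        clusters = resp.json().get("clusters", [])
        return [c["cluster_name"] for c in clusters if c["state"] == "RUNNING"]

    def notify(channel: str) -> None:
        # Post the running-cluster list to Slack.
        slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
        names = list_running_clusters() or ["(none)"]
        slack.chat_postMessage(channel=channel, text="Running clusters:\n" + "\n".join(names))

    if __name__ == "__main__":
        notify("#databricks-alerts")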

Open Source Contributions


Personal Projects

AI & LLM

Data Engineering

Webapps & API Experiments

Games & Game Helpers


Participation in the Hungarian National IT Competition