Projects
A showcase of my work across data engineering, machine learning, analytics, and operations research.
Predicting News Article Popularity with Machine Learning
Two-step ML pipeline combining classification and regression to predict article engagement and click-to-impression ratios.
Optimized data aggregation for a 5M-record dataset by distributing the work in parallel on Google Cloud Vertex AI, cutting processing time by 60x
Designed a two-step ML pipeline combining classification and regression models to predict the click-to-impression ratio, improving on the baseline by 7x with a custom cost function
Evaluated model performance and feature importance to interpret drivers of article engagement, linked predictive insights to potential revenue and content optimization strategies, and presented the results to 50+ peers
dataset
5M records
improvement
7x over baseline
speedup
60x faster processing
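A rough sketch of how a two-step classification-then-regression prediction can be wired together. This is an illustration, not the project's code: the gating rule, the `threshold` parameter, the function name `two_step_predict`, and the toy stand-in models are all assumptions, and clamping to [0, 1] assumes clicks cannot exceed impressions.

```python
from typing import Callable, Sequence

Features = Sequence[float]


def two_step_predict(
    features: Features,
    classify: Callable[[Features], float],
    regress: Callable[[Features], float],
    threshold: float = 0.5,
) -> float:
    """Predict a click-to-impression ratio in two steps (hypothetical gating rule)."""
    # Step 1: estimate the probability that the article draws engagement at all.
    p_engaged = classify(features)
    if p_engaged < threshold:
        return 0.0
    # Step 2: regress the ratio only for articles that pass the gate,
    # clamping the result to [0, 1] because the target is a ratio.
    return min(max(regress(features), 0.0), 1.0)


# Stand-in models for illustration only; real ones would be trained estimators.
def toy_classifier(features: Features) -> float:
    return 0.9 if features[0] > 0.5 else 0.1


def toy_regressor(features: Features) -> float:
    return 0.04 * features[0]
```

Separating the stages lets the regressor focus on articles that actually attract clicks, which is one common motivation for this kind of design when most records have a ratio at or near zero.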
Gaze Detection with CNN Architectures
Convolutional neural network for binary gaze direction classification using transfer learning with MobileNetV2.
Developed a convolutional neural network architecture to classify a subject's gaze direction into one of two classes (looking at the camera vs. averted gaze)
Preprocessed a dataset of 5,000+ images using Keras, standardizing resolution and image size to improve model reliability and performance
Tested data augmentation methods (Gaussian noise, varied image brightness) and added regularization (extra dropout layers) to improve model precision and reduce overfitting
dataset
5,000+ images
data size
5GB
task
Binary classification
Minimizing MLB Team Travel Using Mixed-Integer Optimization
Prescriptive analytics project optimizing MLB scheduling and travel logistics using mixed-integer programming.
Built an airport-based MLB travel dataset by mapping teams to IATA airports and computing a full pairwise distance matrix, enabling accurate and consistent travel cost estimation for scheduling
Developed a slot-based mixed-integer optimization model in Gurobi with binary matchup, team location, and travel transition decisions, producing schedules that correctly capture travel across consecutive series
Applied real-world scheduling rules and ran sensitivity benchmarking against observed travel estimates, ensuring feasible itineraries while quantifying the efficiency and fairness tradeoff
teams
30 MLB teams
approach
Mixed-integer optimization
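A minimal sketch of a pairwise airport distance matrix like the one this project describes, using the haversine great-circle formula. The three airports and their approximate coordinates are chosen purely for illustration, and the helper names (`haversine_km`, `distance_matrix`, `itinerary_km`) are hypothetical; the project's own distance method is not stated.

```python
import math
from itertools import combinations

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate (latitude, longitude) in degrees; illustrative values only.
AIRPORTS = {
    "JFK": (40.6413, -73.7781),
    "ORD": (41.9742, -87.9073),
    "LAX": (33.9416, -118.4085),
}


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (
        math.sin(dlat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_matrix(airports):
    """Symmetric nested dict: matrix[u][v] is the distance from u to v."""
    codes = sorted(airports)
    matrix = {code: {code: 0.0} for code in codes}
    for u, v in combinations(codes, 2):
        d = haversine_km(airports[u], airports[v])
        matrix[u][v] = d
        matrix[v][u] = d
    return matrix


def itinerary_km(matrix, stops):
    """Total travel for a sequence of consecutive stops."""
    return sum(matrix[u][v] for u, v in zip(stops, stops[1:]))
```

In an optimization model, entries of such a matrix would typically serve as the cost coefficients on the travel transition variables, so computing it once up front keeps the travel cost consistent across every candidate schedule.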
Financial Transaction Anomaly Detection
Large-scale fraud detection system processing 60M+ transactions on a Spark cluster with ML-based anomaly detection.
Processed 60M+ transactions on a 4-node Spark cluster; exploratory analysis revealed weekday and midnight spikes and showed that 98.5% of volume was U.S.-based, informing fraud risk hypotheses
Built a class-weighted logistic regression tuned by grid search, achieving 83% recall and 89% accuracy, with LIME highlighting foreign-currency payments (AUD) as the top risk driver
Flagged cross-border outliers (USD→JPY) and segmented 7M high-value transactions using Isolation Forest + K-Means, generating tiered monitoring rules for fraud teams
scale
60M+ transactions
accuracy
89%
recall
83%
cluster
4-node Spark cluster
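To make the class-weighting idea and the recall/accuracy distinction concrete, here is a small sketch. It assumes inverse-frequency weights (the same heuristic scikit-learn applies for `class_weight="balanced"`); whether the project used exactly this weighting is not stated, and the function names are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def recall_and_accuracy(y_true, y_pred, positive=1):
    """Return (recall on the positive class, overall accuracy)."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    actual_pos = true_pos + false_neg
    recall = true_pos / actual_pos if actual_pos else 0.0
    return recall, correct / len(pairs)
```

Rare fraud labels receive a larger weight, so misclassifying them costs more during training. Recall then measures how many true fraud cases are caught, which is why it is reported alongside accuracy on imbalanced data.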
Categorizing Text of Yelp Reviews
Large-scale NLP project analyzing 5GB of Yelp reviews with sentiment analysis and a custom recommendation system.
Applied machine learning methods such as sentiment analysis to 5GB of Yelp review text, categorizing reviews into relevant groups
Created Python functions to present users with a relevant review for a selected business based on user preferences
Tuned text preprocessing parameters so the functions return more relevant reviews for a given user input
data size
5GB
notebooks
7 comprehensive notebooks
Job Market Analytics with Adzuna
Production-ready job-market data platform with automated ETL pipelines, semantic matching, and RAG-powered hiring assistant.
Built a production-ready job-market data platform by orchestrating 6 daily Airflow ETL pipelines from the Adzuna API into a BigQuery star schema, delivering reliable, up-to-date analytics for job seekers and recruiters
Implemented semantic resume-to-job matching using vector embeddings, and deployed a K-means clustering pipeline to group jobs by skill similarity using LLM-extracted skills
Shipped a Streamlit app with a RAG hiring assistant that retrieves from live postings to generate data-grounded job descriptions and recommendations, accelerating recruiter workflows
scale
6 daily ETL pipelines
team
4-person team
impact
Production-ready analytics platform
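Embedding-based resume-to-job matching usually comes down to ranking job vectors by similarity to a resume vector. The sketch below assumes cosine similarity and takes the vectors as already computed; the function names and the notion of a `job_vecs` mapping are hypothetical, and the platform's actual embedding model and vector store are not specified here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_jobs(resume_vec, job_vecs):
    """Return (job_id, score) pairs sorted from best to worst match."""
    scored = [
        (job_id, cosine_similarity(resume_vec, vec))
        for job_id, vec in job_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

Cosine similarity ignores vector length, so a long posting and a short resume can still match closely when they point in the same direction in embedding space.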
Interactive Real-Time Investment Rating Dashboard
Multi-page Streamlit dashboard for dynamic stock risk scoring based on fundamental and technical analysis with live market data.
Engineered an interactive five-page Streamlit dashboard that ingests live Yahoo Finance data, lets users assign custom weightings to fundamental vs. technical factors, and returns a normalized 1-10 Investment Rating
Integrated two pre-trained ML pipelines (fundamental and technical) with a Min-Max scaler to harmonize raw model outputs, and enabled re-scoring through a slider-controlled weighting scheme
rating
1-10 scale
data source
Live Yahoo Finance
deployment
Streamlit Cloud
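One way a Min-Max scaler and a weighting slider can combine into a 1-10 rating is sketched below. The blending formula, the linear mapping onto 1-10, and the function names are assumptions made for illustration; the dashboard's actual formula is not given.

```python
def min_max_scale(values):
    """Rescale a list of raw scores linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All scores identical: no spread to scale, so place them mid-range.
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def investment_rating(fundamental, technical, fundamental_weight):
    """Blend two scaled scores in [0, 1] and map the result onto 1-10.

    fundamental_weight plays the role of the slider value in [0, 1];
    the technical score receives the remaining weight.
    """
    if not 0.0 <= fundamental_weight <= 1.0:
        raise ValueError("fundamental_weight must be between 0 and 1")
    blended = (
        fundamental_weight * fundamental
        + (1.0 - fundamental_weight) * technical
    )
    return 1.0 + 9.0 * blended
```

Scaling both model outputs to the same range first keeps one pipeline from dominating the blend simply because its raw scores are larger.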
Comprehensive Analysis of UK Corporations with SQL and Tableau
Multi-dimensional economic analysis of 1M+ UK firms examining corporate structure, industry resilience, and investment patterns.
Unified fragmented Companies House data for 1M+ UK firms using SQL, enabling clear insights into survival, ownership, and industry resilience
Exposed an 80% corporate dissolution rate (2014–2024) and a 60% surge in foreign ownership, informing investor and policy strategy post-Brexit
Developed interactive Tableau dashboards that transformed raw registration data into executive-ready insights, guiding capital allocation across sectors
scale
1M+ firms
insights
80% dissolution rate, 60% foreign ownership surge
Corporate Pulse: Analysing Financial KPIs and Market Trends
Financial statement analysis exploring correlations between KPIs and stock performance for VOO and VSMAX portfolio companies.
Used Python (pandas) to process, clean, and visualize data for 5 companies from the Vanguard S&P 500 ETF (VOO) and 5 from the Vanguard Small-Cap Index Fund (VSMAX)
Analysed corporate financial data to explore correlations between KPIs and stock returns, showing that FCF/Sales and EBITDA have weaker correlations with stock price movements
Presented key financial metrics driving market performance using data visualizations; collaborated with a team of 6 and facilitated Q&A for an audience of 60
portfolios
10 companies (VOO + VSMAX)
team
6 members
presentation
60 attendees
Causal Impact of Jazz Music on Task Performance Using A/B Testing
Randomized controlled trial assessing jazz music's effect on cognitive performance with causal inference analysis.
Designed and implemented a randomized controlled trial (RCT) to assess how jazz music influences task performance and cognitive metrics
Applied A/B testing, exploratory data analysis (EDA), and regression analysis in Python, and estimated treatment effects (ATE and CATE)
Ran the randomized trial (n=62), testing jazz music's effect on focus and accuracy during cognitive tasks
sample
n=62
findings
15s faster, 0.86-point accuracy gain
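In a randomized trial, the average treatment effect (ATE) can be estimated as a simple difference in group means. The sketch below shows that estimator with an unpooled standard error; the function name is hypothetical, and any numbers passed to it here would be placeholders rather than the study's data.

```python
import math
import statistics


def estimate_ate(treated, control):
    """Difference-in-means ATE estimate with its standard error.

    treated / control: outcome values (e.g. task completion time in seconds)
    for the jazz group and the no-music group. Each needs at least two values.
    """
    ate = statistics.mean(treated) - statistics.mean(control)
    # Unpooled standard error: sample variances are not assumed equal.
    std_err = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return ate, std_err
```

Because assignment is random, this difference is an unbiased estimate of the ATE without adjusting for covariates. A CATE estimate applies the same comparison within a subgroup, or uses a regression with treatment-covariate interaction terms.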