Aryan Sehgal | Data Scientist & Business Analytics Professional

Projects

A showcase of my work across data engineering, machine learning, analytics, and operations research.

Machine Learning | Oct 2024 - Dec 2024

Predicting News Article Popularity with Machine Learning

Two-step ML pipeline combining classification and regression to predict article engagement and click-to-impression ratios.

▸ Optimized data aggregation for a 5M-record dataset with parallel processing on Google Cloud Vertex AI, cutting processing time by a factor of 60

▸ Designed a two-step ML pipeline combining classification and regression models to predict the click-to-impression ratio, improving on the baseline 7x with a custom cost function

▸ Evaluated model performance and feature importance to interpret the drivers of article engagement, linked predictive insights to potential revenue and content optimization strategies, and presented the results to 50+ peers
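The two-step idea above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the 0.5 threshold, the clamping step, and the toy stand-in models are all assumptions, and the custom cost function is not shown because the page gives no details about it.

```python
from typing import Callable, List, Sequence

Features = Sequence[float]


def two_step_predict(
    rows: Sequence[Features],
    engagement_prob: Callable[[Features], float],
    ratio_model: Callable[[Features], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the project
) -> List[float]:
    """Stage 1 gates each row; stage 2 scores only rows predicted as engaged."""
    predictions = []
    for row in rows:
        if engagement_prob(row) >= threshold:
            # A click-to-impression ratio is a proportion, so clamp to [0, 1].
            predictions.append(min(1.0, max(0.0, ratio_model(row))))
        else:
            # Rows the classifier rejects get a ratio of zero.
            predictions.append(0.0)
    return predictions


# Hypothetical stand-ins for a fitted classifier and regressor.
def toy_classifier(row: Features) -> float:
    return 0.9 if row[0] > 10 else 0.1


def toy_regressor(row: Features) -> float:
    return 0.01 * row[0]


preds = two_step_predict([[5.0], [20.0], [200.0]], toy_classifier, toy_regressor)
```

In practice the two callables would wrap fitted models, for example a scikit-learn classifier's probability output and a regressor's prediction.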

Dataset: 5M records

Improvement: 7x over baseline

Speedup: 60x faster processing

Python, Google Cloud Vertex AI, scikit-learn, Machine Learning, Parallel Processing

GitHub

Computer Vision & Deep Learning | Mar 2025 - Apr 2025

Gaze Detection with CNN Architectures

Convolutional neural network for binary gaze direction classification using transfer learning with MobileNetV2.

▸ Developed a convolutional neural network architecture for binary classification of a subject's gaze direction (looking at camera vs. averted gaze)

▸ Preprocessed a dataset of 5,000+ images using Keras, standardizing image resolution and size to improve model reliability and performance

▸ Tested data augmentation and regularization methods, such as Gaussian noise, varied image brightness, and extra dropout layers, to improve model precision and reduce overfitting

Dataset: 5,000+ images

Data size: 5GB

Task: Binary classification

Python, TensorFlow, Keras, MobileNetV2, Computer Vision, Deep Learning, +1 more

GitHub

Operations Research | Sep 2025 - Dec 2025

Minimizing MLB Team Travel Using Mixed-Integer Optimization

Prescriptive analytics project optimizing MLB scheduling and travel logistics using mixed-integer programming.

▸ Built an airport-based MLB travel dataset by mapping teams to IATA airports and computing a full pairwise distance matrix, enabling accurate and consistent travel cost estimation for scheduling

▸ Developed a slot-based mixed-integer optimization model in Gurobi with binary decision variables for matchups, team locations, and travel transitions, producing schedules that correctly capture travel across consecutive series

▸ Applied real-world scheduling rules and ran sensitivity benchmarking against observed travel estimates, ensuring feasible itineraries while quantifying the tradeoff between efficiency and fairness
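A pairwise distance matrix of the kind described in the first bullet could be built along these lines. This is a sketch under stated assumptions: it uses the haversine great-circle formula (the page does not say which distance measure the project used), the airport coordinates are approximate, and the team-to-airport mapping is purely illustrative.

```python
import math
from itertools import product

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

# Approximate (latitude, longitude) in degrees, keyed by IATA code.
AIRPORTS = {
    "LAX": (33.9416, -118.4085),
    "ORD": (41.9742, -87.9073),
    "JFK": (40.6413, -73.7781),
}

# Illustrative mapping only; the project's actual team-to-airport mapping is not given.
TEAM_AIRPORT = {"LAD": "LAX", "CHC": "ORD", "NYY": "JFK"}


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def distance_matrix(airports):
    """Full pairwise matrix as a dict keyed by (origin, destination)."""
    return {
        (i, j): haversine_km(airports[i], airports[j])
        for i, j in product(airports, repeat=2)
    }


def team_distance(matrix, home, away):
    """Look up travel distance between two teams via their mapped airports."""
    return matrix[(TEAM_AIRPORT[home], TEAM_AIRPORT[away])]


matrix = distance_matrix(AIRPORTS)
lad_to_nyy = team_distance(matrix, "LAD", "NYY")
```

A dictionary keyed by airport pairs is convenient here because the same keys can index travel-cost coefficients in an optimization model's objective.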

Teams: 30 MLB teams

Approach: Mixed-integer optimization

Python, Gurobi, Mixed-Integer Programming, Optimization, Operations Research

GitHub

Big Data | Mar 2025 - Apr 2025

Financial Transaction Anomaly Detection

Large-scale fraud detection system processing 60M+ transactions on a Spark cluster with ML-based anomaly detection.

▸ Processed 60M+ transactions on a 4-node Spark cluster; exploratory analysis revealed weekday and midnight spikes and showed that U.S. transactions made up 98.5% of volume, informing fraud risk hypotheses

▸ Built a class-weighted Logistic Regression model, tuned via grid search, achieving 83% recall / 89% accuracy, with LIME highlighting foreign-currency payments (AUD) as the top risk driver

▸ Flagged cross-border outliers (USD→JPY) and segmented 7M high-value transactions using Isolation Forest + K-Means, generating tiered monitoring rules for fraud teams
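Class weighting and the two reported metrics can be illustrated with a small standard-library sketch. The weighting shown is the common "balanced" heuristic, n_samples / (n_classes * class_count); whether the project used exactly this scheme is an assumption, and the labels below are made-up toy data, not project results.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency (the 'balanced' heuristic)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def recall_and_accuracy(y_true, y_pred, positive=1):
    """Recall on the positive (fraud) class and overall accuracy."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return recall, correct / len(pairs)


# Toy labels: 8 legitimate (0) and 2 fraudulent (1) transactions.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

weights = balanced_class_weights(y_true)
recall, accuracy = recall_and_accuracy(y_true, y_pred)
```

On imbalanced fraud data, recall on the minority class is usually the more telling number, since a model that never flags fraud can still score high accuracy.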

Scale: 60M+ transactions

Accuracy: 89%

Recall: 83%

Cluster: 4-node Spark cluster

PySpark, Apache Spark, Python, scikit-learn, LIME, Isolation Forest, +2 more

NLP | Jan 2025 - Mar 2025

Categorizing Text of Yelp Reviews

Large-scale NLP project analyzing 5GB of Yelp reviews with sentiment analysis and a custom recommendation system.

▸ Applied machine learning methods such as sentiment analysis to 5GB of Yelp review text, categorizing reviews into relevant groups

▸ Created Python functions that present users with a relevant review for a selected business based on user preferences

▸ Tuned text preprocessing parameters so the functions return more relevant reviews for a given user input

Data size: 5GB

Notebooks: 7 comprehensive notebooks

Python, NLP, Sentiment Analysis, Topic Modeling, Machine Learning, Text Mining

GitHub

Data Engineering | Sep 2025 - Dec 2025

Job Market Analytics with Adzuna

Production-ready job-market data platform with automated ETL pipelines, semantic matching, and RAG-powered hiring assistant.

▸ Built a production-ready job-market data platform by orchestrating 6 daily Airflow ETL pipelines from the Adzuna API into a BigQuery star schema, delivering reliable, up-to-date analytics for job seekers and recruiters

▸ Implemented semantic resume-to-job matching using vector embeddings, and deployed a K-means clustering pipeline that groups jobs by skill similarity using LLM-extracted skills

▸ Shipped a Streamlit app with a RAG hiring assistant that retrieves from live postings to generate data-grounded job descriptions and recommendations, accelerating recruiter workflows
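Embedding-based resume-to-job matching usually comes down to ranking postings by vector similarity. The sketch below assumes cosine similarity and uses tiny hand-written vectors in place of real embeddings; the job identifiers, the vectors, and the choice of similarity measure are all illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_jobs(resume_vec, job_vecs):
    """Return (job_id, score) pairs sorted from most to least similar."""
    scored = [
        (job_id, cosine_similarity(resume_vec, vec))
        for job_id, vec in job_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
resume = [1.0, 0.0, 1.0]
jobs = {
    "designer": [0.0, 1.0, 0.0],
    "analyst": [1.0, 1.0, 0.0],
    "data_engineer": [1.0, 0.0, 1.0],
}

ranking = rank_jobs(resume, jobs)
```

At larger scale the same ranking would typically be delegated to a vector index rather than computed pairwise in Python.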

Scale: 6 daily ETL pipelines

Team: 4-person team

Impact: Production-ready analytics platform

Python, Apache Airflow, BigQuery, Docker, Streamlit, Vector Embeddings, +4 more

GitHub

Financial Analytics | Mar 2025 - Apr 2025

Interactive Real-Time Investment Rating Dashboard

Multi-page Streamlit dashboard for dynamic stock risk scoring based on fundamental and technical analysis with live market data.

▸ Engineered an interactive five-page Streamlit dashboard that ingests live Yahoo Finance data, lets users assign custom weightings to fundamental vs. technical factors, and returns a normalized 1-10 Investment Rating

▸ Integrated two pre-trained ML pipelines (fundamental and technical) with a Min-Max scaler to harmonize raw model outputs, and enabled re-scoring through a slider-controlled weighting scheme
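The scaling-and-weighting step described above might look roughly like this. It is a sketch under assumptions: the raw scores are invented, the function names are hypothetical, and a simple linear blend of the two scaled scores is assumed since the page does not spell out the exact formula. Only the 1-10 target range comes from the project description.

```python
def min_max_scale(values, low=1.0, high=10.0):
    """Linearly rescale values so the minimum maps to `low` and the maximum to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # No spread to rescale; place everything at the midpoint.
        return [(low + high) / 2 for _ in values]
    return [low + (v - v_min) * (high - low) / (v_max - v_min) for v in values]


def investment_rating(fundamental_score, technical_score, fundamental_weight):
    """Blend two scaled scores; the slider supplies fundamental_weight in [0, 1]."""
    if not 0.0 <= fundamental_weight <= 1.0:
        raise ValueError("fundamental_weight must be between 0 and 1")
    return (
        fundamental_weight * fundamental_score
        + (1.0 - fundamental_weight) * technical_score
    )


# Made-up raw model outputs for three stocks, on different scales.
fundamental_raw = [0.2, 0.5, 0.8]
technical_raw = [-1.0, 0.0, 3.0]

fundamental_scaled = min_max_scale(fundamental_raw)
technical_scaled = min_max_scale(technical_raw)

ratings = [
    investment_rating(f, t, fundamental_weight=0.6)
    for f, t in zip(fundamental_scaled, technical_scaled)
]
```

Because both inputs already sit on the 1-10 scale and the weights sum to one, the blended rating stays within 1-10 for any slider position.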

Rating: 1-10 scale

Data source: Live Yahoo Finance

Deployment: Streamlit Cloud

Python, Streamlit, scikit-learn, Yahoo Finance API, Machine Learning, Jupyter Notebook

GitHub

Business Intelligence | Oct 2024 - Dec 2024

Comprehensive Analysis of UK Corporations with SQL and Tableau

Multi-dimensional economic analysis of 1M+ UK firms examining corporate structure, industry resilience, and investment patterns.

▸ Unified fragmented Companies House data for 1M+ UK firms using SQL, enabling clear insights into survival, ownership, and industry resilience

▸ Exposed an 80% corporate dissolution rate (2014–2024) and a 60% surge in foreign ownership, influencing investor and policy strategy post-Brexit

▸ Developed interactive Tableau dashboards that transformed raw registration data into executive-ready insights, guiding capital allocation across sectors

Scale: 1M+ firms

Insights: 80% dissolution rate, 60% foreign ownership surge

SQL, Tableau, Python, Data Analysis, Business Intelligence

GitHub

Financial Analytics | Sep 2024 - Oct 2024

Corporate Pulse: Analysing Financial KPIs and Market Trends

Financial statement analysis exploring correlations between KPIs and stock performance for VOO and VSMAX portfolio companies.

▸ Used Python (pandas) to process, clean, and visualize data for 5 companies from the Vanguard S&P 500 ETF (VOO) and 5 from the Vanguard Small-Cap Index Fund (VSMAX)

▸ Analysed corporate financial data to explore correlations between KPIs and stock returns, showing that FCF/Sales and EBITDA have weaker correlations with stock price movements

▸ Presented key financial metrics driving market performance using data visualizations, collaborated with a team of 6, and facilitated Q&A for an audience of 60
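The KPI-versus-returns comparison rests on a correlation coefficient. A minimal sketch, assuming Pearson correlation (the page does not name the measure) and using invented numbers rather than any company's data:

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


# Invented quarterly values for one hypothetical company.
kpi = [1.0, 2.0, 3.0, 4.0]
returns_up = [2.0, 4.0, 6.0, 8.0]      # moves with the KPI
returns_down = [8.0, 6.0, 4.0, 2.0]    # moves against the KPI

r_up = pearson(kpi, returns_up)
r_down = pearson(kpi, returns_down)
```

With pandas, `DataFrame.corr()` computes the same coefficient for every pair of columns at once.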

Portfolios: 10 companies (VOO + VSMAX)

Team: 6 members

Presentation: 60 attendees

Python, pandas, Data Visualization, Financial Analysis, Quantitative Finance

GitHub

A/B Testing & Statistical Analysis | Jan 2025 - Mar 2025

Causal Impact of Jazz Music on Task Performance Using A/B Testing

Randomized controlled trial assessing jazz music's effect on cognitive performance with causal inference analysis.

▸ Designed and implemented a randomized controlled trial (RCT) to assess how jazz music influences task performance and cognitive metrics

▸ Applied A/B testing, exploratory data analysis (EDA), and regression analysis in Python to estimate treatment effects (ATE and CATE)

▸ Ran the trial with 62 participants (n=62), measuring focus and accuracy during cognitive tasks
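In a randomized trial, the average treatment effect (ATE) can be estimated as a simple difference in group means. The sketch below shows that estimator with a Welch-style standard error; the completion times are made-up toy values, not the study's data, and the project's own analysis also used regression, which is not reproduced here.

```python
import math
import statistics


def estimate_ate(treated, control):
    """Difference-in-means ATE estimate and its standard error (unequal variances)."""
    ate = statistics.mean(treated) - statistics.mean(control)
    std_err = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return ate, std_err


# Toy task-completion times in seconds (invented for illustration).
jazz_group = [70.0, 72.0, 68.0, 74.0]
silence_group = [80.0, 78.0, 82.0, 84.0]

ate, std_err = estimate_ate(jazz_group, silence_group)
```

A negative ATE on completion time means the treated group finished faster on average; dividing the estimate by its standard error gives a rough test statistic.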

Sample: n=62

Findings: 15s faster, 0.86-point accuracy gain

Python, A/B Testing, Statistical Analysis, Regression Analysis, Causal Inference

GitHub