

VINAY JOGANI
Master of Science in Information Systems, Northeastern University (August 2025)
Former AI/ML Research Assistant at Amal Lab, Northeastern University, Boston
Former Research Data Scientist at Brigham and Women's Hospital, Boston
Passionate about data science, deep learning, computer vision, and their applications in healthcare and finance
ABOUT ME
I am a data and machine learning professional with expertise spanning the full data lifecycle, from engineering scalable pipelines and databases to building production ML systems and delivering actionable analytics. I hold a Master of Science in Information Systems from Northeastern University (August 2025) and a Bachelor of Technology in Information Technology from Veermata Jijabai Technological Institute.

My experience encompasses both research and production environments. At Brigham and Women's Hospital, I developed NLP pipelines processing clinical trial reports, built statistical models for biomarker discovery, and engineered quantitative finance systems analyzing pharmaceutical market dynamics. As a Research Assistant at Northeastern University's Amal Lab, I architected deep learning models for medical image classification and built distributed computing infrastructure processing large-scale multi-omics datasets using Apache Spark and Dask.

I have strong software engineering foundations from my work at HCL Technologies, where I designed enterprise automation systems that cut manual search processes by 85%, and from building production-grade applications, including a microservices-based Personal Finance API with 92% test coverage and a real-time Healthcare Data Pipeline processing patient records using Kafka, Spark, and Airflow.

Proficient in Python (TensorFlow, PyTorch, Scikit-learn, Pandas, NumPy), SQL/NoSQL databases, cloud platforms (AWS, GCP), and MLOps tools (Docker, Kubernetes, MLflow), I excel at building end-to-end solutions, from data infrastructure and ETL pipelines to advanced ML models and interactive analytics dashboards. My work spans statistical analysis, A/B testing, time series forecasting, computer vision, and NLP.

I have presented peer-reviewed technical papers at international conferences on explainable AI, adversarial robustness, and deep learning applications. I am actively seeking opportunities to apply my versatile skill set in building impactful, scalable data and ML solutions across healthcare, finance, and beyond.
EDUCATION
September 2023 - August 2025
Master of Science in Information Systems
Northeastern University
Boston, MA
Focused Coursework: Advance Data Science & Architecture, Parallel Machine Learning & AI, LLM w/ Knowledge Graph DB, Natural Language Engineering, AI Generative Modeling with focus in Finance
August 2019 - May 2023
Bachelor of Technology in Information Technology
Veermata Jijabai Technological Institute
Mumbai, India
Focused Coursework: Data Structures and Algorithms, Linear Algebra, Discrete Mathematics, Artificial Intelligence, Machine Learning, Data Architecture, Network Security, Big Data Analysis, Computer Networks, Operating Systems
SKILLS

Programming Languages
Python, SQL, C++, Java, R, JavaScript, MATLAB, Cypher
Deep Learning
CNN, LSTM, GRU, ResNet, DenseNet, EfficientNet, VGG, Vision Transformers, DQN, PPO, Distributed Data Parallel, Multi-GPU Training
Computer Vision
OpenCV, torchvision, scikit-image, Grad-CAM, Image Segmentation, Object Detection
Data Science & Analytics
NumPy, Pandas, SciPy, statsmodels, Statistical Analysis, A/B Testing, Hypothesis Testing, Experimental Design, Time Series Forecasting, Feature Engineering, PCA, t-SNE, UMAP, Causal Inference, Monte Carlo Simulations
Cloud Platforms
AWS (EC2, S3, Lambda), GCP (BigQuery, GCS, Cloud Pub/Sub)
MLOps & Production
MLflow, Docker, Kubernetes, Docker Compose, CI/CD, GitHub Actions, Model Monitoring, Model Versioning, Prometheus, Grafana, Terraform
Automation & Web Scraping
BeautifulSoup, Selenium, UiPath, RPA
Machine Learning & AI
TensorFlow, PyTorch, Keras, Scikit-learn, XGBoost, LightGBM, Stable-Baselines3, OpenAI Gym, Gymnasium, Supervised Learning, Unsupervised Learning, Transfer Learning, Reinforcement Learning, Ensemble Methods, Hyperparameter Optimization (Optuna, GridSearchCV)
Natural Language Processing
BERT, RoBERTa, Transformers (Hugging Face), NLTK, spaCy, CodeT5, VADER, TextBlob, Named Entity Recognition
Explainable AI & Model Security
LIME, SHAP, Adversarial Training, Adversarial Robustness Toolbox
Data Engineering & ETL
Apache Spark, PySpark, Apache Kafka, Apache Airflow, dbt, Hadoop, Dask, ETL/ELT Pipelines, Stream Processing, Batch Processing, Data Orchestration, Great Expectations, Pydantic, Medallion Architecture
Databases & Data Storage
PostgreSQL, MySQL, MongoDB, Neo4j, SQLite, Snowflake, BigQuery, Redis, SQLAlchemy, FAISS, Database Design, Window Functions, CTEs, Indexing
Software Engineering
FastAPI, Flask, REST APIs, Microservices, Git, pytest, Unit Testing, Integration Testing, Async/Await, Exception Handling, Agile/Scrum, Jira
Quantitative Finance
Portfolio Optimization, Options Pricing, Risk Metrics (Sharpe, VaR), Time Series Modeling (ARIMA, GARCH), yfinance, Bloomberg API
PROFESSIONAL EXPERIENCE
Northeastern University
Boston, MA
June 2024 - August 2025
AI/ML RESEARCH ASSISTANT
-
Engineered Med-SAM medical image segmentation system for multi-modal datasets (MRI, CT, histopathology) with automated preprocessing pipelines and custom data loaders, achieving superior organ and lesion boundary detection through fine-tuned transformer architectures and cross-validation frameworks on clinical datasets
-
Architected scalable big data infrastructure processing 18TB+ TCGA multi-omics datasets (RNA-Seq, methylation, CNV) using distributed computing frameworks, implementing unsupervised clustering algorithms (K-means++, DBSCAN, hierarchical clustering) and dimensionality reduction (PCA, t-SNE, UMAP) for biomarker discovery across 33+ cancer types
-
Developed production-ready deep learning classification system achieving 97.28% accuracy on skin cancer detection through comparative analysis of state-of-the-art architectures (ResNet-34, EfficientNet-B1, VGG16, Vision Transformers) with ensemble learning and external validation across Dermofit, BCN20000, and Buenos Aires datasets
-
Technical Stack: PyTorch, TensorFlow, Keras, Hugging Face Transformers, OpenCV, scikit-learn, pandas, NumPy, Dask, Apache Spark, PySpark
Brigham and Women’s Hospital
Boston, MA
August 2024 - December 2024
RESEARCH DATA SCIENTIST
-
Architected meta-analysis framework over ClinicalTrials.gov and PubMed data to investigate participant heterogeneity across 10+ years of IBS and psychiatric disease trials, combining automated data extraction pipelines, machine learning (random forests, gradient boosting, ensemble methods), and statistical modeling (Cox proportional hazards, mixed-effects models) on PostgreSQL, while maintaining HIPAA compliance and IRB protocol adherence
-
Engineered clinical trial news analytics pipeline analyzing 10,000+ pharmaceutical reports using transformer-based NLP models (BERT, RoBERTa), sentiment analysis (VADER, TextBlob), and spaCy named entity recognition to identify correlations between Phase 2/3 trial media coverage sentiment and regulatory failure rates
-
Developed quantitative finance modeling system analyzing market microstructure impacts of clinical trial announcements using yfinance and Alpha Vantage APIs, applying time-series econometric analysis to stock price volatility patterns and detecting trading volume anomalies through statistical hypothesis testing (t-tests, Mann-Whitney U) and volatility modeling (ARCH, GARCH)
-
Technical Stack: Python, pandas, scikit-learn, NumPy, SciPy, spaCy, BERT, RoBERTa, Transformers, VADER, TextBlob, statsmodels, yfinance, Alpha Vantage, ClinicalTrials.gov API, PubMed API, REST APIs, PostgreSQL, HIPAA compliance, IRB protocols
HCL Technologies
Noida, India
June 2022 - July 2022
SOFTWARE ENGINEER
-
Architected enterprise-grade RPA automation system using UiPath Studio integrating web scraping algorithms, API orchestration, and dynamic data extraction pipelines to process travel booking platforms with multi-threaded execution and parameterized input validation (origin/destination cities, travel dates), achieving 85% reduction in manual search processes
-
Engineered intelligent document processing solution leveraging Regular Expression parsing and OCR technologies within UiPath framework, developing machine learning-enhanced extraction algorithms for structured PDF invoice processing with 95% accuracy across variable document formats, implementing data validation schemas and exception handling for invoice metadata
-
Designed production-ready automation infrastructure with comprehensive error handling mechanisms and robust logging frameworks using UiPath Orchestrator, supporting scalable workflow architecture for dynamic web content parsing, PDF format variations, and automated Excel report generation with advanced formatting capabilities
-
Technical Stack: UiPath Studio, UiPath Orchestrator, Regular Expressions, OCR, Web Scraping, API Integration, PDF Processing, Excel Automation, Exception Handling
Technoriya eTechnologies Pvt. Ltd.
Navi Mumbai, India
December 2021 - January 2022
SOFTWARE DEVELOPER
-
Architected machine learning-powered adaptive assessment engine implementing reinforcement learning algorithms and real-time performance analytics to dynamically adjust examination difficulty levels, achieving 92.31% prediction accuracy in performance-based question recommendation systems through collaborative filtering and behavioral pattern recognition models
-
Developed full-stack web application infrastructure using JavaScript frameworks and cloud-based architectures with responsive user interfaces, real-time data synchronization, secure authentication protocols, and scalable database management systems supporting concurrent multi-user examination environments with automated grading
-
Engineered intelligent question difficulty calibration system utilizing statistical modeling techniques (Item Response Theory, Bayesian inference) and historical performance data analysis, implementing machine learning pipelines for continuous model training and validation to personalize assessment experiences
-
Technical Stack: Python, scikit-learn, pandas, NumPy, Machine Learning, JavaScript, React.js, Node.js, Express.js, MongoDB, PostgreSQL, AWS, RESTful APIs, Authentication, SciPy, statsmodels
DataBit Technologies Pvt. Ltd.
Pune, India
May 2021 - July 2021
DATA ANALYST
-
Architected comprehensive data preprocessing pipelines implementing statistical techniques for anomaly detection, missing value imputation using MICE, outlier identification through IQR and Z-score methodologies, and duplicate record resolution, ensuring 99.5% data integrity across 100,000+ client records
-
Engineered machine learning analytics framework deploying ensemble methods (Random Forest, Gradient Boosting), unsupervised clustering algorithms (K-means++, DBSCAN, hierarchical clustering), and supervised learning models (Linear/Polynomial Regression, K-Nearest Neighbors) to extract actionable business intelligence patterns, achieving statistical significance (p<0.05) in predictive model performance
-
Developed end-to-end analytics solutions implementing feature engineering techniques, dimensionality reduction (PCA, t-SNE), cross-validation frameworks, and model evaluation metrics (ROC-AUC, precision-recall curves, confusion matrices), creating automated reporting dashboards using Tableau and Power BI with MySQL database integration for stakeholder decision-making
-
Technical Stack: Python, pandas, NumPy, scikit-learn, matplotlib, seaborn, SciPy, statsmodels, Machine Learning, Statistical Analysis, Tableau, Power BI, Data Visualization, SQL, MySQL
Navlakhi
Mumbai, India
July 2020
BACK END PROGRAMMER
-
Architected comprehensive fee payment oversight module implementing secure transaction processing architecture using PHP backend frameworks and MySQL database optimization with indexed queries and stored procedures, developing RESTful API endpoints for real-time balance monitoring, transaction validation, and automated CRUD operations with encrypted data handling
-
Engineered full-stack student fee management system integrating server-side PHP logic with responsive frontend interfaces using JavaScript, HTML5, and CSS3, implementing asynchronous payment processing workflows and session management protocols ensuring seamless user experience across desktop and mobile platforms
-
Developed scalable payment gateway integration system with automated payment reminders, installment processing algorithms, and real-time dashboard updates, integrating secure payment streams while maintaining PCI DSS compliance standards and implementing comprehensive error handling for transaction failures
-
Technical Stack: PHP, MySQL, JavaScript, HTML5, CSS3, RESTful APIs, Payment Gateway APIs, Session Management
TECHNICAL PAPERS
Analysis of Explainable Methods on Medical Image Classification
Third International Conference on Advances in Electrical, Computing, Communications and Sustainable Technologies (ICAECT 2023) affiliated to IEEE, Published in May 2023
-
Conducted comprehensive comparative analysis of Explainable AI methodologies for lung cancer classification using deep convolutional neural networks (VGG-16, ResNet-50), implementing gradient-based attribution techniques (Grad-CAM, Integrated Gradients) and perturbation-based interpretability methods (LIME) on histopathology image datasets
-
Engineered systematic evaluation framework measuring computational efficiency and interpretability effectiveness across multiple XAI approaches, implementing performance benchmarking protocols with execution time analysis and memory utilization profiling for optimal XAI method selection in clinical diagnostic workflows
-
Developed reproducible research methodology with rigorous experimental design for medical AI interpretability assessment, implementing cross-validation protocols, statistical significance testing, and comprehensive ablation studies for transparent machine learning systems in healthcare applications
-
Technical Stack: TensorFlow, Keras, PyTorch, VGG-16, ResNet-50, Grad-CAM, LIME, Computer Vision, Medical Image Processing, Statistical Analysis, NumPy, SciPy, pandas
Intrusion Detection: A Deep Learning Approach
2023 Second International Conference on Electrical, Electronics, Information and Communication Technologies (ICEEICT 2023), Published in April 2023
-
Architected novel hybrid intrusion detection system combining deep learning architectures (CNN feature extraction layers, LSTM temporal sequence modeling) with classical machine learning classifiers (Support Vector Machine with RBF kernel), implementing ensemble learning methodology to achieve 97.29% accuracy on multi-class network attack classification, outperforming traditional IDS approaches by 15+ percentage points
-
Engineered comprehensive comparative analysis framework evaluating traditional machine learning algorithms (Random Forest, Naive Bayes, Decision Trees) against deep learning architectures (CNNs, LSTMs, hybrid models) across multiple cybersecurity datasets, implementing rigorous statistical evaluation protocols with cross-validation, ROC-AUC analysis, and computational complexity assessment
-
Developed advanced data preprocessing pipeline implementing Principal Component Analysis for dimensionality reduction, feature scaling normalization techniques, and statistical feature selection algorithms, optimizing computational efficiency while maintaining detection accuracy above 95% for real-time network security monitoring
-
Technical Stack: Deep Learning, TensorFlow, Keras, Machine Learning, scikit-learn, SVM, CNN, LSTM, PCA, Feature Engineering, Cybersecurity Datasets (NSL-KDD, CICIDS2017, UNSW-NB15), Statistical Analysis, NumPy, pandas, SciPy
Adversarial Attacks and Defences for Skin Cancer Classification
International Conference for Advancement in Technology (ICONAT 2023) affiliated to IEEE, Published in April 2023
-
Architected comprehensive adversarial attack evaluation framework implementing gradient-based attack methodologies (Projected Gradient Descent, Fast Gradient Sign Method) against deep CNN architectures for dermatoscopic skin cancer classification, conducting systematic vulnerability assessment across multiple attack perturbation budgets and demonstrating critical security vulnerabilities in medical AI diagnostic systems
-
Engineered robust adversarial defense mechanisms implementing PGD-based adversarial training protocols with multi-step gradient ascent optimization, data augmentation strategies, and ensemble defense techniques, achieving 27.73 percentage point improvement in model robustness against white-box attacks while maintaining baseline classification performance on clean datasets
-
Developed end-to-end adversarial robustness evaluation pipeline with automated attack generation, defense validation protocols, and comprehensive performance benchmarking across multiple skin cancer datasets, implementing statistical significance testing and confidence interval analysis for medical AI security
-
Technical Stack: Deep Learning, CNN Architectures, Adversarial Machine Learning, Attack Methods (PGD, FGSM), Adversarial Training, Data Augmentation, Medical Imaging, Statistical Analysis
Image Captioning Using Transformer: VISIONAID
International Research Journal of Engineering and Technology (IRJET), Published in October 2022
-
Architected novel image captioning system "VisionAid" implementing Swin Transformer architectures with hierarchical shifted window attention mechanisms, addressing internal covariate shift through batch normalization techniques and geometric-aware self-attention modules, achieving superior contextual understanding and caption accuracy compared to traditional CNN-RNN approaches
-
Engineered comprehensive comparative analysis framework conducting systematic literature review of transformer-based image captioning methodologies, implementing rigorous evaluation protocols across multiple benchmark datasets (MSCOCO, Flickr30k), identifying critical performance bottlenecks in existing models through statistical significance testing and ablation studies
-
Developed innovative attention mechanism architectures integrating multi-head self-attention with geometric spatial reasoning capabilities, implementing advanced word embedding techniques (positional encoding, semantic embeddings) and caption diversity enhancement algorithms through beam search optimization and nucleus sampling strategies with quantifiable improvements in BLEU, METEOR, and CIDEr evaluation metrics
-
Technical Stack: Transformer Architectures, Swin Transformer, Deep Learning, Computer Vision, Natural Language Processing, Multi-Head Attention, Word Embeddings, Positional Encoding, Evaluation Metrics (BLEU, METEOR, CIDEr), Benchmark Datasets (MSCOCO, Flickr30k)
PROJECTS
Healthcare Data Pipeline: Real-Time Analytics & Enterprise Data Engineering Platform
October 2025
-
Architected production-grade data pipeline processing 10M+ patient records with dual processing paradigms: real-time event streaming via Apache Kafka handling 500+ events/minute with Spark Structured Streaming using 5-minute tumbling windows and stateful aggregations, plus batch ETL via Spark jobs implementing incremental processing with date partitioning achieving 70% processing time reduction
-
Engineered enterprise data transformation framework with dbt implementing medallion architecture (staging, intermediate, marts layers) across 31 models including SCD Type 2 dimension tables for historical tracking, 50+ data quality tests with Great Expectations validation suites maintaining 99%+ pass rates, and incremental materialization strategies reducing full-refresh times by 85%, orchestrated via Apache Airflow with complex DAG dependencies and workflow automation
-
Implemented robust data quality and observability infrastructure with Spark-based validation jobs checking completeness, uniqueness constraints, referential integrity across 6 source tables, and data freshness monitoring, integrated with Prometheus metrics collection and Grafana dashboards tracking pipeline throughput, processing latency, and data quality scores with automated alerting
-
Designed cloud-native deployment architecture with Docker Compose and Kubernetes containerization, Terraform infrastructure-as-code provisioning Snowflake data warehouse and GCP resources (BigQuery, GCS), PostgreSQL for metadata management, automated installation scripts with Make commands, and comprehensive testing suite with pytest demonstrating enterprise DevOps/DataOps practices with CI/CD
-
Technical Stack: Python, Apache Kafka, Apache Spark, PySpark, Structured Streaming, Apache Airflow, dbt, PostgreSQL, Snowflake, BigQuery, GCP, GCS, Great Expectations, Docker, Docker Compose, Kubernetes, Terraform, Prometheus, Grafana, SQL, pytest, Make, CI/CD, ETL, Stream Processing, Batch Processing, Medallion Architecture
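A minimal plain-Python sketch of the 5-minute tumbling-window aggregation described above. It is an illustration only, not the project's Spark Structured Streaming code: the event tuples and the function names `window_start` and `count_per_window` are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

WINDOW = timedelta(minutes=5)  # window size taken from the bullet above
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime, size: timedelta = WINDOW) -> datetime:
    """Floor a timezone-aware timestamp to the start of its tumbling window."""
    buckets = (ts - EPOCH) // size  # whole windows elapsed since the epoch
    return EPOCH + buckets * size


def count_per_window(events):
    """Count events per (window start, event type).

    `events` is an iterable of (timestamp, event_type) pairs; this shape is
    an assumption made for the example.
    """
    counts = defaultdict(int)
    for ts, event_type in events:
        counts[(window_start(ts), event_type)] += 1
    return dict(counts)
```

Each event falls into exactly one window because windows do not overlap, which is what distinguishes tumbling windows from sliding ones.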
Personal Finance API: Microservices Architecture & Production Infrastructure
October 2024
-
Architected production-grade microservices system with 4 independent FastAPI services (Authentication, Transaction Management, Budget Tracking, Analytics) implementing RESTful architecture with JWT-based authentication, achieving 92% test coverage across 80+ pytest unit and integration tests with parameterized fixtures, mocking strategies, and automated CI/CD pipeline executing linting (flake8), formatting (black), type checking (mypy), and security scanning (bandit) on every GitHub Actions deployment
-
Engineered high-performance backend infrastructure with PostgreSQL database cluster featuring normalized schemas with indexed queries achieving sub-50ms response times, Redis caching layer reducing API latency by 30% with 5-10 minute TTL and 85%+ cache hit rates, distributed rate limiting (100 requests/minute per user using Redis atomic operations), Nginx load balancing across service replicas, and database connection pooling (10 connections per service) handling 500+ concurrent users in Apache Bench load testing
-
Designed scalable deployment architecture with Docker containerization using multi-stage Dockerfiles with health checks, Kubernetes orchestration manifests with deployments, services, ingress configurations, resource limits and auto-scaling, complete observability with structured logging, automated backup and restore procedures, and production-ready configurations demonstrating enterprise DevOps practices with infrastructure-as-code principles
-
Technical Stack: Python, FastAPI, PostgreSQL, Redis, SQLAlchemy, Pydantic, Docker, Docker Compose, Kubernetes, Nginx, pytest, GitHub Actions, JWT, bcrypt, Uvicorn, Make, Apache Bench, CI/CD, REST APIs, Microservices, Async Programming
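A minimal sketch of per-user fixed-window rate limiting in the spirit of the 100 requests/minute limit above. It assumes a fixed-window counter and uses an in-memory dict as a stand-in for Redis atomic increments with key expiry; the class name and the fixed-window choice are illustrative assumptions, not the project's actual implementation.

```python
import time


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per user in each fixed time window."""

    def __init__(self, limit=100, window_seconds=60, clock=time.time):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        # (user_id, window_index) -> request count. A Redis-backed version
        # would let old keys expire; this sketch never evicts them.
        self._counts = {}

    def allow(self, user_id: str) -> bool:
        window_index = int(self.clock() // self.window_seconds)
        key = (user_id, window_index)
        count = self._counts.get(key, 0) + 1  # increment-then-check
        self._counts[key] = count
        return count <= self.limit
```

Incrementing before checking mirrors how an atomic counter is typically used: the returned count alone decides whether the request passes, so no separate read is needed.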
Sports Betting A/B Testing Platform & Analytics Infrastructure
October 2024
-
Engineered production-grade experimentation framework implementing multiple statistical methodologies (frequentist t-tests, chi-square, Mann-Whitney U, bootstrap resampling, Bayesian inference with Beta-Binomial conjugates) plus sequential testing with O'Brien-Fleming spending functions for alpha control, power analysis calculating optimal sample sizes across 50+ experiments, and multi-armed bandit algorithms (epsilon-greedy, UCB1, Thompson Sampling) for dynamic traffic allocation
-
Architected PostgreSQL analytics database with 5 normalized tables processing 1M+ betting transactions across 100K users, implementing advanced SQL with window functions (NTILE, LAG, PERCENTILE_CONT), recursive CTEs for cohort retention analysis, multi-touch revenue attribution models (first-touch, last-touch, time-decay, U-shaped), and covering indexes achieving sub-100ms query performance on complex analytical workloads
-
Developed end-to-end ML pipeline with Random Forest models for CLV prediction (R²=0.89, RMSE=$47) and churn classification (AUC-ROC=0.92, 85% recall), featuring automated feature engineering (RFM scoring, engagement trends, interaction terms), cross-validation, and scikit-learn deployment, with ROI calculations projecting $500K+ annual revenue impact from experiment optimization
-
Technical Stack: PostgreSQL, Python, pandas, scikit-learn, scipy, statsmodels, numpy, matplotlib, seaborn, psycopg2, Jupyter, pytest, SQL (CTEs, window functions, PERCENTILE_CONT, recursive queries)
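A minimal sketch of the Beta-Binomial conjugate update mentioned above, assuming a uniform Beta(1, 1) prior and a Monte Carlo estimate of the probability that one variant beats another. The function names, the prior, and the draw count are illustrative assumptions.

```python
import random


def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Return (a, b) of the Beta posterior after observing binomial data."""
    return prior_a + successes, prior_b + (trials - successes)


def posterior_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def prob_b_beats_a(post_a, post_b, draws=20000, seed=0):
    """Estimate P(rate_B > rate_A) by sampling both posteriors."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    wins = sum(
        rng.betavariate(*post_b) > rng.betavariate(*post_a)
        for _ in range(draws)
    )
    return wins / draws
```

For example, `beta_posterior(120, 1000)` gives the posterior for a control arm with 120 conversions in 1,000 users, and `prob_b_beats_a` compares it against a treatment arm's posterior. Drawing one sample per arm and picking the larger is also the core step of Thompson Sampling.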
Financial Health Dashboard & Analytics Platform
October 2025
-
Architected PostgreSQL data warehouse with 7 normalized tables processing 1,754 transactions, implementing recursive CTEs for hierarchical queries, window functions (RANK, LAG, LEAD) for moving averages, and B-tree indexing achieving sub-50ms query latency on complex aggregations with 12+ advanced SQL analytics queries
-
Built Python ETL pipeline with pandas processing 1,700+ records across 6 data sources, implementing connection pooling with psycopg2 and automated CSV and Excel exports with data validation and quality checks, plus monthly PDF reporting system with matplotlib visualizations and Z-score anomaly detection (z-score > 2 threshold) identifying unusual transactions
-
Developed 5-page interactive Tableau dashboard (Executive Summary, Spending Analysis, Budget Performance, Financial Goals, Insights & Alerts) with LOD expressions enabling real-time budget variance analysis, spending patterns visualization, and statistical anomaly detection across normalized data sources
-
Technical Stack: PostgreSQL, Python, pandas, psycopg2, Tableau Public, SQL (CTEs, window functions, recursive queries), matplotlib, ReportLab, openpyxl
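A minimal sketch of the Z-score anomaly check described above, using only the standard library. It assumes the population standard deviation and flags values whose absolute z-score exceeds the threshold; whether the project flags both directions or only large positive values is not stated, so the two-sided check is an assumption.

```python
from statistics import mean, pstdev


def zscore_anomalies(amounts, threshold=2.0):
    """Return (index, amount) pairs whose |z-score| exceeds `threshold`."""
    mu = mean(amounts)
    sigma = pstdev(amounts)  # population standard deviation (assumption)
    if sigma == 0:
        return []  # all values identical, nothing can be anomalous
    return [
        (i, x)
        for i, x in enumerate(amounts)
        if abs((x - mu) / sigma) > threshold
    ]
```

With a small sample a single extreme value inflates the standard deviation, so a threshold of 2 is easier to reach on larger transaction histories than on a handful of rows.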
NBA Injury Prediction System
October 2025
-
Engineered end-to-end ML pipeline processing 4,525 NBA games from official API with 55+ engineered features (rolling averages, fatigue indices, workload metrics, cumulative minutes, days rest, back-to-back indicators) using pandas and NumPy, achieving 78% AUC-ROC and 76% recall on imbalanced dataset through advanced feature engineering and hyperparameter optimization
-
Built production REST API with FastAPI serving real-time injury predictions at sub-100ms p95 latency, implementing Pydantic validation, Redis caching layer, async request handling, and Prometheus metrics collection (prediction latency histograms, cache hit rates, error counters) with 90%+ pytest coverage and load testing handling 1000+ requests per second
-
Developed MLOps infrastructure with Docker multi-stage builds, Docker Compose orchestrating 5 services (API, Redis, MLflow, Prometheus, Grafana), MLflow experiment tracking for model versioning across XGBoost, LightGBM, and PyTorch models, and Kubernetes deployment manifests with autoscaling, health probes, and CI/CD pipelines
-
Technical Stack: Python, XGBoost, LightGBM, PyTorch, FastAPI, Uvicorn, Pydantic, Redis, MLflow, Prometheus, Grafana, Docker, Docker Compose, Kubernetes, pytest, pandas, NumPy, scikit-learn
Trading Signals
October 2025
-
Built real-time trading app using Streamlit and Roboquant framework, integrating Yahoo Finance APIs with yfinance library to process live market data and generate automated buy/sell signals through multi-indicator technical analysis (RSI, MACD, Simple Moving Averages, Bollinger Bands) with consensus decision-making algorithm
-
Developed financial data pipeline using pandas for complex indicator calculations with vectorized operations, created interactive candlestick charts with multi-panel analysis (Price, RSI, MACD) for market visualization, and implemented consensus algorithm combining 4 technical indicators for risk-adjusted price targets
-
Engineered trading analytics system with automated risk/reward ratio calculations, dynamic support and resistance detection using Bollinger Bands, and real-time profit/loss analysis with stop loss and take profit target recommendations
-
Technical Stack: Python, Streamlit, Roboquant, yfinance, pandas, NumPy
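A minimal sketch of how indicator values can feed a consensus vote like the one described above. It assumes a simple-average RSI (Cutler's variant rather than Wilder's smoothing) and a rule that at least three of four indicators must agree. Both choices, and all function names, are illustrative assumptions rather than the app's actual logic.

```python
def sma(prices, window):
    """Simple moving average of the last `window` prices."""
    return sum(prices[-window:]) / window


def rsi(prices, period=14):
    """Simple-average RSI over the last `period` price changes.

    Requires at least period + 1 prices.
    """
    recent = prices[-(period + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    gains = sum(d for d in deltas if d > 0)
    losses = sum(-d for d in deltas if d < 0)
    if losses == 0:
        return 100.0  # no down moves in the lookback window
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)


def consensus(votes, min_agree=3):
    """Combine per-indicator votes (+1 buy, -1 sell, 0 neutral)."""
    if votes.count(1) >= min_agree:
        return "BUY"
    if votes.count(-1) >= min_agree:
        return "SELL"
    return "HOLD"
```

Each indicator would first be mapped to a vote, for instance an RSI below 30 voting +1 and above 70 voting -1, before `consensus` is applied; those cutoffs are conventional defaults, not values taken from the project.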
KellyBet Analytics Platform
August 2025
-
Engineered distributed sports analytics system utilizing Python and Streamlit, implementing concurrent data processing with asyncio for real-time odds integration via RESTful APIs (The Odds API, CricAPI), achieving <50ms latency and handling 1000+ simultaneous user requests
-
Developed sophisticated ML pipeline leveraging scikit-learn ensemble methods (RandomForestClassifier, GradientBoostingClassifier) with custom feature engineering for each sport, achieving 87% prediction accuracy through automated hyperparameter tuning via GridSearchCV and cross-validation
-
Designed scalable ETL architecture using SQLite with custom ORM patterns, implementing efficient database indexing and normalized schema design for 4 sports verticals, reducing query latency by 60% while processing 100K+ historical odds records daily for model training
-
Technical Stack: Python, Streamlit, scikit-learn, asyncio, SQLite, The Odds API, CricAPI, GridSearchCV
CodeBuddy: A Natural Language Code Explanation Generator Using Fine-Tuned CodeT5 and Retrieval-Augmented Generation for Enhanced Developer Productivity
July 2025 - August 2025
-
Developed CodeBuddy, an AI-powered code explanation system integrating fine-tuned CodeT5 transformer with Retrieval-Augmented Generation pipeline, achieving 24.1% BLEU score improvement over GPT-4 baseline and 78.1% accuracy in interactive Q&A tasks
-
Architected scalable vector database solution using FAISS with semantic embeddings, enabling sub-35ms similarity search across 2M+ code snippets while maintaining 384-dimensional embedding space for real-time code retrieval
-
Implemented comprehensive ML training pipeline processing 2.1M code-documentation pairs across 4 datasets (CodeSearchNet, HumanEval, MBPP, DocString), with automated preprocessing, AST validation, and multi-metric evaluation framework achieving 0.84 BERTScore semantic similarity
-
Technical Stack: Python, CodeT5, Transformers, FAISS, PyTorch, BLEU, BERTScore
SorosAI Intelligent Biography and Pairs Trading System
July 2025 - August 2025
-
Built NLP chatbot using paraphrase-MiniLM-L6-v2 transformer with FAISS vector search for semantic similarity, processing 4,000+ Q&A pairs and biography text into embeddings for contextual question-answering from unstructured text, achieving 90%+ accuracy on structured queries
-
Developed statistical arbitrage system implementing Engle-Granger cointegration tests with Z-score trading signals for spread dynamics prediction, achieving Sharpe ratios >1.5 through market-neutral pairs trading strategies with consistent profitability in historical backtests
-
Engineered real-time trading pipeline with yfinance integration, asynchronous data processing for stock data retrieval, and comprehensive backtesting framework with performance metrics visualization (PnL, Sharpe ratio, returns, drawdowns) using machine learning models for predictive spread analysis
-
Technical Stack: Python, PyTorch, scikit-learn, statsmodels, SentenceTransformers, FAISS, yfinance, pandas, NumPy, Matplotlib, Seaborn, Plotly
Workflow Automation Assistant: Converting Swimlane Diagrams into Structured Event Sequences
June 2025
-
Engineered domain-specific chatbot parsing Swimlane diagrams and OpenAPI 3.0 specifications using GPT-4 Vision API, transforming unstructured process flow images and structured YAML/JSON into structured event sequences with actor, action, and decision extraction for intelligent query handling
-
Built semantic retrieval system using transformer-based embeddings (all-MiniLM-L6-v2 via SentenceTransformer) and FAISS vector store for in-memory indexing, enabling fast similarity search across unstructured diagrams and API documentation for contextual question-answering with GPT-4o
-
Designed modular, production-ready pipeline with Streamlit UI featuring separate components for image-to-sequence extraction, OpenAPI spec parsing, query understanding, and response generation, ensuring scalability and adaptability to enterprise workflow automation use cases
-
Technical Stack: Python, GPT-4 Vision, GPT-4o, SentenceTransformer, all-MiniLM-L6-v2, FAISS, Streamlit, OpenAI API, OpenAPI 3.0
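The semantic retrieval step above can be illustrated with a plain-Python cosine-similarity search. This is a stand-in for the FAISS index, not the project's code: the toy three-dimensional vectors are hypothetical values rather than all-MiniLM-L6-v2 output, and which FAISS index type the project used is not stated. An exact flat inner-product index over normalized vectors would return the same ranking as this sketch.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query, corpus, k=2):
    """Return (index, score) pairs for the k corpus vectors most similar to query."""
    q = normalize(query)
    scored = []
    for idx, vec in enumerate(corpus):
        v = normalize(vec)
        scored.append((idx, sum(a * b for a, b in zip(q, v))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" (hypothetical values, not model output)
corpus = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
hits = top_k([1.0, 0.1, 0.0], corpus, k=2)
```

The retrieved items would then be passed to the language model as context for answering the query.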
AgileSmartSOR: A Trading Simulator for Smart Order Routing and Execution Algos
May 2024
​
-
Built quantitative trading simulator in Python with smart order routing (SOR) engine and agency execution algorithms (VWAP, TWAP, POV), simulating institutional order execution across multiple exchanges under synthetic and real-time market data conditions
-
Engineered modular, data-driven simulation framework using Streamlit and pandas, enabling analysis of execution costs, slippage, and market microstructure behaviors across varying strategies with interactive parameter tuning and real-time price refresh capabilities
-
Applied quantitative execution logic with volume-weighted order splitting (VWAP), time-based allocation (TWAP), and dynamic participation rate control (POV) using adaptive price feeds via Yahoo Finance API (yfinance) to emulate institutional trade flows in agency brokerage context
-
Technical Stack: Python, Streamlit, pandas, yfinance, Yahoo Finance API
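The three execution styles named above can be sketched as simple order-slicing rules. The function names and the rounding conventions (where leftover shares go) are assumptions made for illustration; quantities and volumes are taken to be integers.

```python
def twap_slices(total_qty, n_intervals):
    """TWAP: split an order evenly over time; early slices absorb any remainder."""
    base, extra = divmod(total_qty, n_intervals)
    return [base + (1 if i < extra else 0) for i in range(n_intervals)]

def vwap_slices(total_qty, volume_profile):
    """VWAP: split an order in proportion to expected volume per interval."""
    total_volume = sum(volume_profile)
    slices = [total_qty * v // total_volume for v in volume_profile]
    slices[-1] += total_qty - sum(slices)  # rounding leftover goes in the last slice
    return slices

def pov_slice(market_volume, participation_rate, remaining_qty):
    """POV: trade a fixed fraction of observed market volume, capped by what is left."""
    return min(remaining_qty, int(market_volume * participation_rate))
```

For example, `twap_slices(1000, 3)` yields three near-equal child orders, while `vwap_slices(1000, [100, 300, 600])` weights the child orders toward the intervals with the most expected volume.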
Skin Cancer Classification using High Parallel Machine Learning
March - April 2025
​
-
Implemented deep learning pipeline for dermatological disease classification using EfficientNet-B3 and DenseNet-121 architectures with ImageNet pretraining, achieving 91.8% multi-class accuracy and 96.2% macro F1 score across 35 skin conditions on 245,000-image dataset supporting automated medical diagnosis applications
-
Accelerated model training by 3.37× using PyTorch DistributedDataParallel (DDP) across 4× NVIDIA A100 GPUs, achieving 84.3% parallel efficiency, with Automatic Mixed Precision (AMP) reducing memory usage by 50%
-
Conducted comprehensive performance profiling and scalability analysis with deterministic convolution settings and synchronized training processes, systematically analyzing computational bottlenecks in training workflows and multi-threaded data loading with DistributedSampler to optimize distributed training efficiency
-
Technical Stack: Python, PyTorch, PyTorch DistributedDataParallel (DDP), Automatic Mixed Precision (AMP), EfficientNet-B3, DenseNet-121, NVIDIA A100 GPUs, torchrun, Adam optimizer, ImageNet pretraining
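As a quick check on how the speedup and efficiency figures above relate, parallel efficiency is conventionally the measured speedup divided by the number of workers. The helper name is an assumption for illustration.

```python
def parallel_efficiency(speedup, n_workers):
    """Fraction of ideal linear scaling actually achieved."""
    return speedup / n_workers

eff = parallel_efficiency(3.37, 4)  # figures quoted in the bullets above
print(f"{eff:.1%}")
```

With the rounded speedup of 3.37 on 4 GPUs this gives roughly 84%, in line with the reported 84.3%; the small difference presumably comes from rounding of the speedup figure.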
Supply Chain Management using Knowledge Graph and LLM
February - April 2025
​
-
Built interactive supply chain dashboard with Streamlit and Neo4j graph database, enabling real-time graph analysis of product, supplier, order, and shipment data using Cypher-driven visualizations with Plotly, Seaborn, and Matplotlib for custom queries and relationship-based analytics
-
Modeled complex retail supply networks in Neo4j with node-relationship architecture, mapping CSV datasets (Orders, Products, Shipments, Suppliers, Departments, Aisles) into graph structures supporting use cases like supplier traceability, shipment delay detection, bottleneck analysis, and department-wise trend analysis
-
Designed natural language querying framework that translates user input into Cypher queries (NL2Cypher), with GPT-4 API integration planned for conversational analytics
-
Technical Stack: Python, Streamlit, Neo4j, Neo4j Driver, Cypher Query Language, pandas, NumPy, Plotly, Seaborn, Matplotlib, LLM Integration (GPT-4 API)
Volatility Surface Modeling
February 2025
-
Developed and calibrated volatility surface model using implied volatility data with Black-Scholes and Heston stochastic volatility models, utilizing root-finding algorithms (Brent, Newton-Raphson) and optimization libraries (SciPy, CVXPY) for precise parameter estimation through convex and non-convex optimization routines
-
Engineered and backtested options trading strategies leveraging volatility surface to identify arbitrage opportunities through long/short volatility and delta-neutral strategies, tracking Sharpe ratio improvements, drawdown analysis, and P&L attribution for enhanced portfolio performance
-
Conducted comprehensive risk analysis with Value-at-Risk (VaR) and Expected Shortfall (ES) metrics and stress testing to inform portfolio hedging strategies, using real-time data from yfinance and interactive 3D Plotly visualizations of volatility smiles and surfaces
-
Technical Stack: Python, NumPy, pandas, SciPy, CVXPY, Plotly, yfinance, Streamlit, Black-Scholes Model, Heston Model
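The VaR and Expected Shortfall metrics mentioned above can be illustrated with a historical (empirical) estimator. Whether the project used historical, parametric, or simulated estimates is not stated, so the approach here is an assumption, as are the function name and the tail-size rounding rule.

```python
def historical_var_es(returns, confidence=0.95):
    """Return (VaR, ES) as positive loss fractions from a list of periodic returns."""
    losses = sorted((-r for r in returns), reverse=True)  # largest loss first
    # Number of observations in the tail; at least one so the tail is never empty
    n_tail = max(1, int(round(len(losses) * (1 - confidence))))
    tail = losses[:n_tail]
    var = tail[-1]              # smallest loss inside the tail
    es = sum(tail) / len(tail)  # average loss inside the tail
    return var, es
```

By construction ES is at least as large as VaR, since it averages the losses at and beyond the VaR cutoff.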
Options Pricing Model
September 2024
​
-
Implemented European option pricing models in Python, including the Black-Scholes analytical solution, a Binomial Tree discrete-time model, and Monte Carlo simulation under geometric Brownian motion, using market data retrieved in real time via the yfinance API
-
Engineered multi-method financial model comparison tools by integrating NumPy for large array computations, SciPy for statistical functions and cumulative normal distribution, and Matplotlib for visualization of pricing discrepancies across stochastic and analytical models, demonstrating model convergence and computational differences
-
Technical Stack: Python, NumPy, SciPy, pandas, Matplotlib, yfinance
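To illustrate the model-convergence comparison described above, the sketch below prices a European call with the Black-Scholes closed form and with a Cox-Ross-Rubinstein binomial tree. It uses only the standard library rather than NumPy and SciPy, the function names are assumptions, and the choice of the Cox-Ross-Rubinstein parameterization is also an assumption since the source only says "Binomial Tree".

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(s, k, t, r, sigma):
    """Black-Scholes price of a European call (no dividends)."""
    d1 = (math.log(s / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    return s * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)

def crr_call(s, k, t, r, sigma, steps=500):
    """Cox-Ross-Rubinstein binomial price of a European call."""
    dt = t / steps
    u = math.exp(sigma * math.sqrt(dt))   # up factor
    d = 1.0 / u                           # down factor
    p = (math.exp(r * dt) - d) / (u - d)  # risk-neutral up probability
    disc = math.exp(-r * dt)
    # Payoffs at expiry, indexed by number of up moves
    values = [max(s * u ** j * d ** (steps - j) - k, 0.0) for j in range(steps + 1)]
    # Roll back through the tree one step at a time
    for _ in range(steps):
        values = [disc * (p * values[j + 1] + (1 - p) * values[j])
                  for j in range(len(values) - 1)]
    return values[0]
```

Comparing `crr_call(100, 100, 1, 0.05, 0.2, steps=n)` against `bs_call(100, 100, 1, 0.05, 0.2)` for increasing `n` shows the tree price approaching the analytical value.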
Deep Reinforcement Learning Algorithm
August 2024
​
-
Designed and implemented custom trading environment leveraging OpenAI Gym, incorporating action spaces (buy, sell, hold), observation spaces with financial time-series OHLCV data from Yahoo Finance, technical indicators via ta library, and dynamic reward functions linked to profit/loss for optimizing trading agent performance
-
Engineered and deployed deep reinforcement learning models using Stable-Baselines3 with hyperparameter optimization via Bayesian search, enabling automated trading decisions and visualization of profits, losses, and strategy performance over large-scale historical market data
-
Technical Stack: Python, OpenAI Gym, Stable-Baselines3, pandas, NumPy, yfinance, ta (technical analysis), Matplotlib, Bayesian Optimization
Reinforcement Learning for Portfolio Management
July 2024
​
-
Engineered custom Gymnasium-based reinforcement learning environment (PortfolioEnv) to simulate portfolio dynamics across 5 synthetic assets over 1,000 trading days, incorporating 1.5% transaction costs, volatility shocks, and concentration penalties for positions exceeding 50% allocation to single assets to reflect real-world trading conditions
-
Designed dynamic reward function balancing portfolio returns and transaction costs with penalties for overly concentrated positions, promoting portfolio diversification and reducing exposure to high-risk allocations while enabling the agent to maintain 90%+ diversification consistency throughout training
-
Trained Proximal Policy Optimization (PPO) agent using Stable-Baselines3 over 10,000+ time steps on continuous action space for dynamic weight allocation, achieving 23% improvement in cumulative return compared to baseline strategies while controlling downside risk through optimized portfolio allocation decisions
-
Technical Stack: Python, Gymnasium, Stable-Baselines3, PPO (Proximal Policy Optimization), NumPy, pandas
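A minimal sketch of a reward of the kind described above: portfolio return, minus transaction costs, minus a penalty for any weight above the concentration cap. The 1.5% cost rate and 50% cap come from the bullets above; the turnover-based cost model, the penalty coefficient, and the function name are assumptions for illustration and may differ from the actual PortfolioEnv.

```python
def portfolio_reward(prev_weights, new_weights, asset_returns,
                     cost_rate=0.015, cap=0.5, penalty=1.0):
    """One-step reward: return minus trading cost minus concentration penalty.

    penalty is a hypothetical coefficient; cost is charged on total turnover.
    """
    gross = sum(w * r for w, r in zip(new_weights, asset_returns))
    turnover = sum(abs(n - p) for n, p in zip(new_weights, prev_weights))
    cost = cost_rate * turnover
    excess = sum(max(w - cap, 0.0) for w in new_weights)  # weight above the cap
    return gross - cost - penalty * excess
```

Holding an equal-weight portfolio unchanged incurs no cost or penalty, while shifting 60% into one asset is charged both for the turnover and for the 10 percentage points above the cap.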
Deep Reinforcement Learning Pairs Trading
June 2024
​
-
Developed Deep Q-Network (DQN) based reinforcement learning model to automate market-neutral pairs trading strategy between correlated assets (AAPL and MSFT), incorporating real-time feature engineering of stock price spreads, rolling averages (10-day and 30-day), and volatility indicators via rolling standard deviation for enhanced decision-making
-
Designed and deployed custom OpenAI Gym-like trading environment with discrete action space (buy, sell, hold), integrating reinforcement learning algorithms and reward maximization functions based on profits/losses to train the agent on dynamic market conditions and mean-reverting spread behavior
-
Technical Stack: Python, Deep Q-Network (DQN), Deep Reinforcement Learning, OpenAI Gym, pandas, NumPy
Mean Variance Optimization with ESG scores
June 2024
​
-
Engineered ESG-integrated mean-variance optimizer using PyPortfolioOpt with 15 S&P 500 stocks, targeting Sharpe ratio maximization through constrained quadratic programming with SciPy while enforcing ESG score thresholds (≥0.6 and ≥0.8), enabling scenario-based sustainable investing strategies with ESG screening for above-average sustainability performance
-
Calibrated covariance matrices using Ledoit-Wolf shrinkage and CAPM-based return estimates, generating efficient frontiers for varying ESG mandates, and visualized portfolio performance using Matplotlib across three ESG constraint levels, highlighting return-risk tradeoffs in both ESG and financial performance dimensions
-
Achieved up to 22% reduction in portfolio volatility and Sharpe ratio improvement of 18% under ESG-constrained portfolios versus unconstrained baselines through backtesting against S&P 500 benchmark, demonstrating the quantitative viability of ESG integration in portfolio optimization
-
Technical Stack: Python, PyPortfolioOpt, SciPy, Matplotlib, Mean-Variance Optimization, Sharpe Ratio, CAPM
Algorithmic Insider Trading Detector
May 2024
​
-
Constructed comprehensive data ingestion pipeline using yfinance for historical stock price retrieval and BeautifulSoup with HTTP requests for parsing SEC Form 4 filings from EDGAR database, employing Python pandas for ETL processes including data normalization, outlier detection, and feature engineering of transaction volumes, trade timings, and price movements
-
Engineered financial features including moving averages, price-volume trend (PVT), insider buy-sell ratios, and historical price trends, utilizing NumPy for numerical computations and scikit-learn for training classification models (Random Forest, Gradient Boosting) and unsupervised anomaly detection models (Isolation Forest, clustering algorithms) to detect anomalous trading behaviors indicative of potential insider trading
-
Technical Stack: Python, yfinance, BeautifulSoup, pandas, NumPy, scikit-learn, Random Forest, Gradient Boosting, Isolation Forest, Matplotlib, Seaborn, Time-Series Analysis
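Two of the engineered features named above can be sketched in a few lines. Price-volume trend follows its standard cumulative definition (starting the series at zero is a convention chosen here); the count-based buy-sell ratio is an assumed definition, since the source does not say whether it was computed on transaction counts, shares, or dollar value.

```python
def price_volume_trend(closes, volumes):
    """Cumulative PVT: each step adds volume times the fractional price change."""
    pvt = [0.0]
    for i in range(1, len(closes)):
        pct_change = (closes[i] - closes[i - 1]) / closes[i - 1]
        pvt.append(pvt[-1] + volumes[i] * pct_change)
    return pvt

def buy_sell_ratio(n_buys, n_sells):
    """Insider buys per insider sell (hypothetical count-based definition)."""
    return n_buys / n_sells if n_sells else float("inf")
```

Features like these would be joined per ticker and date before being passed to the classifiers and anomaly detectors.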
Trading Recommendation with Technical and Sentiment Analysis
April 2024
​
-
Designed and implemented end-to-end stock analysis system using Python, incorporating data extraction via NASDAQ FTP server for stock symbols and Yahoo Finance API (yfinance) for historical stock data and market capitalization, integrating technical analysis with sentiment analysis from Reddit and Twitter to algorithmically rank and recommend the top 5 stocks daily
-
Developed scalable automated data pipeline for retrieval, cleaning, and integration of stock market data and social media sentiment, employing FTP for NASDAQ symbol downloads, pandas for data manipulation and processing, and real-time sentiment scraping to deliver timely and reliable investment recommendations
-
Technical Stack: Python, yfinance, Yahoo Finance API, pandas, FTP, Sentiment Analysis, Web Scraping, Reddit API, Twitter API
Air Pollution Analysis
April 2024
​
-
Applied causal inference techniques to analyze the impact of various pollutants including PM2.5 and NO2 on air quality, utilizing datasets from global air pollution reports to uncover causal relationships between pollutants and public health impacts
-
Performed statistical analysis and exploratory data visualization in Python, using pandas for data manipulation and Matplotlib and Seaborn to explore air quality indicators and environmental health relationships
-
Technical Stack: Python, pandas, Matplotlib, Seaborn, Causal Inference, Statistical Analysis, Data Visualization
Product Recommendation
March 2024
​
-
Implemented causal inference techniques including Propensity Score Matching and Difference-in-Differences (DiD) to analyze cause-and-effect relationships and uncover actionable insights in product recommendation systems for e-grocery brand applications
-
Conducted advanced data analysis using methodologies including Randomized Controlled Trials (RCTs) and Directed Acyclic Graphs (DAGs) for causal modeling, improving customer lifetime value predictions and informing business strategy decisions
-
Technical Stack: Python, pandas, Causal Inference, Propensity Score Matching, Difference-in-Differences (DiD), RCTs, DAGs, Statistical Analysis
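The Difference-in-Differences estimate mentioned above reduces to simple arithmetic on group means: the change in the treated group minus the change in the control group. The function name and the toy numbers in the usage note are hypothetical, and the estimate is only causal under the usual parallel-trends assumption.

```python
from statistics import mean

def did_estimate(treat_pre, treat_post, ctrl_pre, ctrl_post):
    """Difference-in-Differences: treated change minus control change."""
    treated_change = mean(treat_post) - mean(treat_pre)
    control_change = mean(ctrl_post) - mean(ctrl_pre)
    return treated_change - control_change
```

For instance, if treated customers' average basket rises from 11 to 16 while the control group's rises from 10 to 12, the estimated effect is 5 - 2 = 3.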
FIFA Data Analysis
February 2024
​
-
Conducted in-depth exploratory data analysis on FIFA player data including foot preference, position distribution, and attribute correlations using data aggregation, pivoting, and visualization techniques with Python pandas for data manipulation and analysis
-
Generated key insights through correlation analysis and heatmap visualizations using Matplotlib and Seaborn, identifying patterns in player ratings, defending attributes, and goalkeeper performance to contribute to deeper understanding of player dynamics
-
Technical Stack: Python, pandas, Matplotlib, Seaborn, Exploratory Data Analysis, Data Visualization, Statistical Analysis
Titanic Analysis
February 2024
​
-
Conducted data cleaning and exploratory analysis on Titanic passenger data, exploring key variables including survival rate, passenger class, and gender using pandas for data manipulation and Seaborn for statistical visualization
-
Visualized survival patterns across different demographics including gender and class, identifying significant trends through count plots and statistical summaries to uncover insights about passenger survival factors
-
Technical Stack: Python, pandas, Seaborn, Exploratory Data Analysis, Data Visualization, Statistical Analysis
Tennis Analysis
November 2023 - December 2023
​
-
Analyzed 508 ATP Grand Slam matches from 2023, extracting key performance metrics including aces, win percentages, and break points from over 3,000 records using Python for data processing and statistical analysis
-
Automated data extraction and analysis with pandas for data manipulation, Matplotlib for visualization, and PyMC for statistical modeling, producing insights and visualizations of player performance in major tournaments
-
Technical Stack: Python, pandas, Matplotlib, PyMC, Data Analysis, Statistical Analysis, Data Visualization
Copy Move Forgery Detection
September 2022 - May 2023
​
-
Developed customized Convolutional Neural Network (CNN) using CASIA 2.0 dataset with multiple weight initialization techniques to detect image forgery and picture modifications, achieving 96.90% testing accuracy for copy-move manipulation detection
-
Deployed model to cloud using Amazon EC2 for accessibility and real-time image testing, integrated Support Vector Machine (SVM) for picture categorization, and created user-friendly graphical user interface (GUI) website enabling users to test and interact with the forgery detection system
-
Technical Stack: Python, CNN, Convolutional Neural Networks, SVM, CASIA 2.0 Dataset, AWS EC2, Weight Initialization, Image Processing, Web GUI
Deep Fake Detection
December 2022 - May 2023
​
-
Developed DeepFake detection system utilizing Generative Adversarial Network (GAN) based models to classify manipulated content by exploiting perceptual differences in images, implementing MRI-GAN framework that generates MRI-like outputs highlighting synthesized artifacts in DeepFake images using DeepFake Detection Challenge Dataset
-
Achieved 91% test accuracy with plain frames-based model, surpassing the MRI-GAN model based on the Structural Similarity Index Measure (SSIM) at 74% accuracy, with potential for further enhancements through adjustments in loss functions, hyperparameters, and advanced perceptual metrics
-
Technical Stack: Python, Generative Adversarial Networks (GANs), MRI-GAN, SSIM (Structural Similarity Index Measure), Deep Learning, Image Processing, DeepFake Detection Challenge Dataset




