Database Management: SQL Server Configuration, Remote Connectivity, Data Storage & Retrieval
Visualization & Analytics: Power BI, Python (Matplotlib, Seaborn, Scikit-learn)
Project Overview:
This project automates the end-to-end data processing, machine learning model training,
and visualization of the Wisconsin Breast Cancer Dataset using Apache Airflow, SQL Server, and Power BI.
Data Cleaning & Preparation: Airflow DAGs clean the dataset, handling missing values and duplicates before storing the processed data.
Model Training & Prediction: A Jupyter notebook, triggered by Airflow, trains a Logistic Regression model and stores predictions in SQL Server.
Data Visualization: A Power BI dashboard provides interactive insights into feature trends, model performance, and risk evaluation.
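The cleaning step described above can be sketched with pandas. This is a minimal illustration, not the project's actual DAG task: the function name clean_dataset, the sample column names, and the choice of median imputation are all assumptions, since the description only says that missing values and duplicates are handled.

```python
import pandas as pd


def clean_dataset(df: pd.DataFrame) -> pd.DataFrame:
    """Drop duplicate rows, then fill missing numeric values with the column median.

    Median imputation is an assumption made for this sketch; the project
    description does not say which strategy was used.
    """
    cleaned = df.drop_duplicates().copy()
    numeric_cols = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())
    # Any row still missing a non-numeric value (for example the label) is dropped.
    return cleaned.dropna().reset_index(drop=True)


# Tiny made-up sample: one exact duplicate row and one missing measurement.
raw = pd.DataFrame(
    {
        "radius_mean": [14.2, 14.2, None, 20.1],
        "texture_mean": [19.1, 19.1, 17.5, 22.3],
        "diagnosis": ["B", "B", "B", "M"],
    }
)
cleaned = clean_dataset(raw)
print(cleaned)
```

Inside an Airflow task, a function like this would typically read the raw CSV, call the cleaning routine, and write the result to the mounted folder.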
Key Contributions:
End-to-End Automated Data Pipeline: Developed an Airflow DAG that cleans and preprocesses the data, writes the result to a mounted folder, and then triggers the Jupyter notebook.
Machine Learning Model Implementation: Trained a Logistic Regression model to predict whether a tumor is Benign or Malignant.
Power BI Dashboard for Model Insights: Designed an interactive dashboard with feature impact analysis, model performance metrics, a confusion matrix, and an ROC curve.
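The model training and the metrics behind the dashboard can be sketched with scikit-learn. This is an illustrative sketch under stated assumptions, not the project's notebook: it loads scikit-learn's bundled copy of the Wisconsin dataset rather than the cleaned file from the pipeline, and the train/test split, scaling step, and max_iter value are choices made for the example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn's copy of the dataset encodes the target as 0 = malignant, 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling first helps the logistic regression solver converge on these features.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # predicted probability of class 1 (benign)

cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
auc = roc_auc_score(y_test, y_score)
print(cm)
print(f"ROC AUC: {auc:.3f}")
```

The predictions and scores are the kind of output that would be written to SQL Server so that Power BI can draw the confusion matrix and ROC curve from them.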
Results and Impact:
Enabled seamless integration of ML, Data Engineering, and Business Intelligence, making predictive insights accessible to non-technical users.
Automated model training and retraining so the model can be refreshed as new data arrives.
Demonstrated a real-world application of an ML-powered analytics dashboard in healthcare, aimed at supporting early breast cancer detection.
Learnings and Takeaways:
Gained expertise in end-to-end ML deployment, data pipelines, and automation.
Configured remote access so that Docker-based Airflow can connect to an on-premises SQL Server instance, solving connectivity challenges.
Explored ML model evaluation and visualization in Power BI, making results easily interpretable for stakeholders.
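The Docker-to-SQL Server connectivity mentioned above can be illustrated by building a connection URL for a container to use. This is only a sketch under assumptions: the helper build_sqlalchemy_url, the database name, the credentials, and the driver version are hypothetical, and host.docker.internal is the host alias provided by Docker Desktop (on Linux it usually has to be mapped explicitly). The project's actual connection settings are not stated.

```python
from urllib.parse import quote_plus, unquote_plus


def build_sqlalchemy_url(host, database, user, password, port=1433,
                         driver="ODBC Driver 17 for SQL Server"):
    """Build a SQLAlchemy URL for SQL Server via pyodbc (hypothetical helper).

    Values containing ';' or braces would need ODBC escaping, omitted here.
    """
    odbc = (
        f"DRIVER={{{driver}}};"
        f"SERVER={host},{port};"  # SQL Server ODBC uses "host,port"
        f"DATABASE={database};"
        f"UID={user};PWD={password};"
        "TrustServerCertificate=yes;"
    )
    # SQLAlchemy accepts a URL-encoded ODBC string through odbc_connect.
    return "mssql+pyodbc:///?odbc_connect=" + quote_plus(odbc)


# Placeholder values; real credentials would come from an Airflow connection or secret.
url = build_sqlalchemy_url(
    host="host.docker.internal",
    database="breast_cancer",
    user="airflow_user",
    password="example-password",
)
print(unquote_plus(url))
```

On the SQL Server side, this only works once TCP/IP connections are enabled and the port (1433 by default) is reachable from the container.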