Distributed ML Pipeline in Pandas and Dask

Mentor: Nicholas Lind, Engagement Manager, Strategy at Deloitte Consulting

Team members: Megan Jacob, Rahul Chandra, Riteka Murugesh

Build a distributed machine learning pipeline in Pandas and Dask using gigabytes of data from a large retail company. The team will learn ML, CI/CD, and data engineering skills applicable to real-world work.

Technologies Used: pandas, Dask, various modeling algorithms (XGBoost, LightGBM, CatBoost, Prophet), seaborn, GitHub, Trello [note that the team is free to suggest alternative technologies]

Final Deliverable: presentation in Jupyter / PowerPoint describing your exploration approach, modeling techniques, final results, and considerations for the future

How much experience does your group have? Does the project use anything (art, music, starter kits) you didn't create?

CodeDay Labs advanced-track team