🚀 Data Engineering Challenge - Take Home Assignment

⏰ SUBMISSION DEADLINE

📅 Date: June 23rd, 2025

🕐 Time: 12:30 PM

Late submissions will not be accepted

🎯 Problem Statement

Business Context:

You've joined TechCorp as a Data Engineering Intern. The company has recently acquired three e-commerce platforms, each with its own data storage systems and formats. Your first task is to create a unified data pipeline that ingests data from all three platforms, reconciles their inconsistent schemas, and loads clean, normalized data into a single database.

📊 The Data Challenge

The datasets you'll work with represent real-world messy data scenarios:

🔍 Data Quality Issues You'll Encounter (a cleaning sketch follows this list):

  • Multiple ID Systems: Same entities referenced by different ID formats
  • Inconsistent Naming: customer_name, full_name, customer_full_name for same field
  • Mixed Date Formats: MM/DD/YYYY, YYYY-MM-DD, timestamps
  • Data Type Inconsistencies: Numbers stored as strings, booleans as text
  • Missing Relationships: Foreign keys not explicitly defined
  • Duplicate Fields: Same information in multiple columns with slight variations
  • Null Values: Represented inconsistently as null, 'N/A', 'NULL', or the empty string ''
  • Case Sensitivity Issues: 'ACTIVE', 'active', 'Active' for same status
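
A minimal pandas sketch of how several of these issues could be normalized (the `status` column and the token list are illustrative assumptions, not taken from the datasets):

```python
import pandas as pd

# Tokens that all mean "missing" in the raw files (assumed list; extend it as you find more).
NULL_TOKENS = ["", "null", "NULL", "N/A", "n/a"]

def normalize_nulls(df: pd.DataFrame) -> pd.DataFrame:
    """Map every null-like token to a real missing value."""
    return df.replace(NULL_TOKENS, pd.NA)

def normalize_status(df: pd.DataFrame, col: str = "status") -> pd.DataFrame:
    """Collapse 'ACTIVE' / 'active' / 'Active' into one canonical form."""
    df[col] = df[col].str.strip().str.lower()
    return df

def normalize_dates(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Parse mixed MM/DD/YYYY, YYYY-MM-DD, and timestamp strings.
    format='mixed' (pandas >= 2.0) infers the format per element;
    unparseable values become NaT instead of raising."""
    df[col] = pd.to_datetime(df[col], format="mixed", errors="coerce")
    return df
```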

🧩 Your Mission

Build a complete data engineering solution that transforms chaos into insights!

📋 Technical Requirements

Phase 1: Data Discovery & Analysis (Jupyter Notebook)
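
For Phase 1, a quality-assessment pass in the notebook might start like this (the file name `orders.csv` and the `status` column are placeholders):

```python
import pandas as pd

df = pd.read_csv("orders.csv", dtype=str)  # read everything as text so nothing is silently coerced

print(df.shape)                # row and column counts
print(df.dtypes)               # all object until types are fixed deliberately
print(df.isna().sum())         # explicit NaNs only; won't catch 'N/A' strings yet
print(df.duplicated().sum())   # exact duplicate rows

# Raw value frequencies expose case variants and null tokens in one view.
print(df["status"].value_counts(dropna=False))
```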

Phase 2: ETL Pipeline Development
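
One possible shape for the Phase 2 pipeline as a standalone script (function, file, and table names are illustrative, not prescribed):

```python
import sqlite3

import pandas as pd

def extract(path: str) -> pd.DataFrame:
    """Read a raw CSV with everything as strings."""
    return pd.read_csv(path, dtype=str)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules discovered in Phase 1 (nulls, casing, dates, ...)."""
    df = df.replace(["", "N/A", "NULL", "null"], pd.NA)
    df.columns = [c.strip().lower() for c in df.columns]
    return df

def load(df: pd.DataFrame, table: str, db_path: str = "techcorp.db") -> None:
    """Write the cleaned frame into SQLite, replacing any previous run."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract("orders.csv")), table="orders")
```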

Phase 3: Interactive Dashboard (Streamlit)
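
A minimal Streamlit starting point for Phase 3 (assumes the SQLite file produced in Phase 2; `techcorp.db`, the `orders` table, and the `status` column are placeholder names):

```python
import sqlite3

import pandas as pd
import streamlit as st

st.title("TechCorp Unified Data Dashboard")

@st.cache_data
def load_table(name: str) -> pd.DataFrame:
    """Load one table from the cleaned database, cached across reruns."""
    with sqlite3.connect("techcorp.db") as conn:
        return pd.read_sql(f"SELECT * FROM {name}", conn)

orders = load_table("orders")
st.metric("Total orders", len(orders))
st.dataframe(orders.head(100))                  # interactive preview
st.bar_chart(orders["status"].value_counts())   # distribution of cleaned statuses
```

Run it with `streamlit run app.py`.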

Phase 4: Advanced Challenge (Bonus)

📁 Dataset Downloads

Download the messy datasets below. Each contains 15+ columns with various data quality issues:

Primary Datasets:

Reconciliation Challenge:

⚠️ Important: The datasets intentionally contain inconsistencies and quality issues. Your job is to identify and fix these problems systematically.

🎯 Expected Deliverables

  1. 📓 Jupyter Notebook:
    • Data exploration and quality assessment
    • Relationship discovery process
    • Step-by-step data cleaning approach
    • Clear documentation and reasoning
  2. 🗄️ SQLite Database:
    • Clean, normalized data structure
    • Proper relationships and constraints
    • Optimized with appropriate indexes (see the schema sketch after this list)
  3. 📊 Streamlit Application:
    • Interactive data exploration dashboard
    • Business insights and KPIs
    • Data quality visualization
    • User-friendly interface
  4. 📚 Documentation:
    • README with setup instructions
    • Architecture decisions explanation
    • Challenges faced and solutions implemented
  5. 🤖 AI Reconciliation (Bonus):
    • Strategy for handling schema mismatches
    • Implementation using Gemini AI (a prompt sketch follows this list)
    • Documentation of your analytical thinking
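
To make "proper relationships and constraints" concrete, here is a hedged schema sketch for the SQLite deliverable (table and column names follow a generic e-commerce model and are assumptions, not the actual dataset schemas):

```python
import sqlite3

DDL = """
PRAGMA foreign_keys = ON;

CREATE TABLE IF NOT EXISTS customers (
    customer_id INTEGER PRIMARY KEY,
    full_name   TEXT NOT NULL,
    status      TEXT CHECK (status IN ('active', 'inactive'))
);

CREATE TABLE IF NOT EXISTS orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  TEXT NOT NULL,          -- ISO-8601 after cleaning
    total       REAL CHECK (total >= 0)
);

-- Index the foreign key so customer-level joins and lookups stay fast.
CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders(customer_id);
"""

with sqlite3.connect("techcorp.db") as conn:
    conn.executescript(DDL)
```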

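For the bonus, one approach is to ask Gemini to propose a column mapping between two source schemas and then validate that mapping yourself before applying it. A rough sketch with the `google-generativeai` package (model name and prompt are assumptions; check the current Gemini documentation):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder; load real keys from the environment
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

prompt = """Two e-commerce exports describe the same customers.
Source A columns: customer_name, cust_id, signup_dt
Source B columns: full_name, customer_id, created_at
Return a JSON object mapping each Source A column to its Source B equivalent."""

response = model.generate_content(prompt)
print(response.text)  # parse and sanity-check this mapping before using it in the ETL
```
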
🏆 Evaluation Criteria

Technical Skills (60%)

  • Data Quality Handling: How well you identify and fix issues
  • ETL Pipeline: Robustness and efficiency of your solution
  • Database Design: Proper normalization and relationships
  • Code Quality: Clean, documented, maintainable code

Problem Solving (40%)

  • Analytical Thinking: Systematic approach to problems
  • Creativity: Innovative solutions to challenges
  • Documentation: Clear communication of approach
  • Business Understanding: Practical insights from data

💡 Tips for Success

📋 Submission Instructions

Submit your complete solution by June 23rd, 2025 at 12:30 PM

Include: Jupyter notebook, Python scripts, SQLite database, Streamlit app, and documentation