🚀 Data Engineering Challenge - Take Home Assignment
⏰ SUBMISSION DEADLINE
📅 Date: June 23rd, 2025
🕐 Time: 12:30 PM
Late submissions will not be accepted
🎯 Problem Statement
Business Context:
You've joined TechCorp as a Data Engineering Intern. The company has recently acquired three different e-commerce platforms, each with its own data storage systems and formats. Your first task is to create a unified data pipeline that can:
Extract data from multiple inconsistent sources
Transform and clean messy, unstructured data
Load the cleaned data into a normalized database
Analyze the data to provide business insights
📊 The Data Challenge
The datasets you'll work with represent real-world messy data scenarios:
🔍 Data Quality Issues You'll Encounter:
Multiple ID Systems: Same entities referenced by different ID formats
Inconsistent Naming: customer_name, full_name, and customer_full_name used for the same field
Mixed Date Formats: MM/DD/YYYY, YYYY-MM-DD, timestamps
Data Type Inconsistencies: Numbers stored as strings, booleans as text
Missing Relationships: Foreign keys not explicitly defined
Duplicate Fields: Same information in multiple columns with slight variations
Null Values: represented variously as null, 'N/A', 'NULL', or an empty string
Case Sensitivity Issues: 'ACTIVE', 'active', and 'Active' used for the same status (see the cleaning sketch after this list)
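For instance, a minimal sketch of normalization helpers addressing the null, case, and date issues above (function names and the token list are illustrative, not prescribed):

```python
import pandas as pd

# Tokens that should all collapse to a single missing-value marker
NULL_TOKENS = {"", "null", "n/a", "none"}

def normalize_nulls(value):
    """Map null, '', 'N/A', 'NULL' (any casing) to pd.NA."""
    if pd.isna(value) or str(value).strip().lower() in NULL_TOKENS:
        return pd.NA
    return value

def normalize_status(value):
    """Collapse 'ACTIVE', 'active', 'Active' into one canonical form."""
    value = normalize_nulls(value)
    return value if value is pd.NA else str(value).strip().lower()

def normalize_date(value):
    """Parse MM/DD/YYYY, YYYY-MM-DD, or timestamps; unparseable -> NaT."""
    return pd.to_datetime(normalize_nulls(value), errors="coerce")
```

Applied column-wise with Series.map, small helpers like these keep each cleaning rule testable in isolation.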
🧩 Your Mission
Build a complete data engineering solution that transforms chaos into insights!
📋 Technical Requirements
Phase 1: Data Discovery & Analysis (Jupyter Notebook)
Explore each dataset and document data quality issues
Identify hidden relationships between tables
Map out the entity-relationship model
Document your data cleaning strategy (a profiling sketch follows this list)
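A quick profiling pass along these lines (assuming the files load into pandas; the filename is a placeholder) will surface most of the issues listed earlier:

```python
import pandas as pd

df = pd.read_csv("platform_a_customers.csv")   # placeholder filename

df.info()                          # dtypes expose numbers stored as strings
print(df.isna().sum())             # explicit nulls per column
for col in df.select_dtypes("object"):
    # spot 'N/A' tokens, case variants, and mixed date formats
    print(col, df[col].dropna().unique()[:10])
print("duplicate rows:", df.duplicated().sum())
```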
Phase 2: ETL Pipeline Development
Create robust data cleaning functions
Handle all data inconsistencies and edge cases
Implement data validation and error handling
Design normalized database schema
Load cleaned data into a SQLite database (a load-step sketch follows this list)
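For the load step, one possible pattern (the table, column, and database names are assumptions) validates each cleaned frame before writing it to SQLite:

```python
import sqlite3
import pandas as pd

def load_table(df: pd.DataFrame, table: str, db_path: str = "techcorp.db") -> None:
    """Validate a cleaned DataFrame, then load it into SQLite."""
    if df["customer_id"].isna().any():        # example validation rule
        raise ValueError(f"{table}: null keys survived cleaning")
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)
```

Raising on bad rows rather than silently dropping them keeps pipeline failures visible.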
Phase 3: Interactive Dashboard (Streamlit)
Build a user-friendly data exploration interface
Create meaningful visualizations and KPIs
Implement data filtering and search capabilities
Show data quality metrics and cleaning results (a minimal dashboard sketch follows this list)
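A minimal Streamlit skeleton of that shape might look like this (table and column names are carried over from the ETL sketch above, and are still assumptions):

```python
import sqlite3
import pandas as pd
import streamlit as st

st.title("TechCorp Unified Data Explorer")

conn = sqlite3.connect("techcorp.db")        # path assumed from the ETL step
customers = pd.read_sql("SELECT * FROM customers", conn)

status = st.selectbox("Status", sorted(customers["status"].dropna().unique()))
filtered = customers[customers["status"] == status]

st.metric("Customers with this status", len(filtered))   # simple KPI
st.dataframe(filtered)                                    # interactive table
st.bar_chart(customers["status"].value_counts())          # one data quality view
```

Saved as app.py, this runs with `streamlit run app.py`.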
Phase 4: Advanced Challenge (Bonus)
Use Gemini AI to reconcile mismatched schema data (a client sketch follows this list)
Document your approach and reasoning (no LLM assistance for documentation)
Show creative problem-solving skills
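One way into the bonus, sketched with the google-generativeai client (the model name, prompt, and schemas are all illustrative):

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")            # read from an env var in practice
model = genai.GenerativeModel("gemini-1.5-flash")  # model choice is an assumption

schema_a = ["cust_id", "full_name", "signup_dt"]            # hypothetical columns
schema_b = ["customer_id", "customer_full_name", "joined"]

prompt = (
    "Two tables describe the same entities under different schemas.\n"
    f"Schema A: {schema_a}\nSchema B: {schema_b}\n"
    "Propose a column-to-column mapping as JSON, with a confidence per pair."
)
response = model.generate_content(prompt)
print(response.text)   # review the suggested mapping before applying it
```

Treat the model's output as a proposal to verify against the data, not a ground-truth mapping, and remember that the accompanying write-up must be in your own words.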
📁 Dataset Downloads
Download the messy datasets below. Each contains 15+ columns with various data quality issues:
Primary Datasets:
Reconciliation Challenge:
⚠️ Important: The datasets intentionally contain inconsistencies and quality issues. Your job is to identify and fix these problems systematically.
🎯 Expected Deliverables
📓 Jupyter Notebook:
Data exploration and quality assessment
Relationship discovery process
Step-by-step data cleaning approach
Clear documentation and reasoning
🗄️ SQLite Database:
Clean, normalized data structure
Proper relationships and constraints
Optimized with appropriate indexes (an example DDL fragment follows)
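To make "relationships and constraints" concrete, here is the kind of DDL fragment expected (the specific tables are placeholders, not a prescribed schema):

```python
import sqlite3

conn = sqlite3.connect("techcorp.db")
conn.executescript("""
PRAGMA foreign_keys = ON;

CREATE TABLE IF NOT EXISTS customers (
    customer_id INTEGER PRIMARY KEY,
    full_name   TEXT NOT NULL,
    status      TEXT CHECK (status IN ('active', 'inactive'))
);

CREATE TABLE IF NOT EXISTS orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    order_date  TEXT NOT NULL
);

-- Index the foreign key that joins and filters will hit most often
CREATE INDEX IF NOT EXISTS idx_orders_customer ON orders(customer_id);
""")
conn.close()
```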
📊 Streamlit Application:
Interactive data exploration dashboard
Business insights and KPIs
Data quality visualization
User-friendly interface
📚 Documentation:
README with setup instructions
Architecture decisions explanation
Challenges faced and solutions implemented
🤖 AI Reconciliation (Bonus):
Strategy for handling schema mismatches
Implementation using Gemini AI
Documentation of your analytical thinking
🏆 Evaluation Criteria
Technical Skills (60%)
Data Quality Handling: How well you identify and fix issues
ETL Pipeline: Robustness and efficiency of your solution
Database Design: Proper normalization and relationships