# MySQL to PostgreSQL Migration Workflow

## Overview

This tool supports three migration modes:

1. **Full Migration** (`full_migration.py`) - Initial complete migration
2. **Incremental Migration (Timestamp-based)** - Sync changes since the last migration
3. **Incremental Migration (ID-based)** - Resumable migration from the last checkpoint

---

## 1. Initial Full Migration

### First Time Setup

```bash
# Create the PostgreSQL schema
python main.py setup --create-schema

# Run full migration (one-time)
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP
```

**When to use:** First time migrating data, or when you need a complete fresh migration.

**Characteristics:**
- Fetches ALL rows from MySQL
- No checkpoint tracking
- Cannot resume if interrupted
- Good for the initial data load

---

## 2. Timestamp-based Incremental Migration

### For Continuous Sync (Recommended for Most Cases)

```bash
# After the initial full migration, use incremental with timestamps
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

**When to use:** Continuous sync of new/updated records.

**Characteristics:**
- Tracks `created_at` (RAWDATACOR) or `updated_at` (ELABDATADISP)
- Uses a JSON state file (`migration_state.json`)
- Only fetches rows modified since the last run
- Well suited for scheduled jobs (cron, Airflow, etc.)
- Syncs changes but NOT deletions

**How it works:**
1. First run: returns with the message "No previous migration found" - you must run a full migration first
2. Subsequent runs: only fetches rows where `created_at` > last_migration_timestamp
3. Updates the state file with the new timestamp for the next run

**Example workflow:**

```bash
# Day 1: Initial full migration
python main.py migrate --full RAWDATACOR

# Day 1: Then incremental (will find nothing new)
python main.py migrate --incremental RAWDATACOR

# Day 2, 3, 4: Daily syncs via cron
python main.py migrate --incremental RAWDATACOR
```

---

## 3. ID-based Incremental Migration (Resumable)

### For Large Datasets or Unreliable Connections

```bash
# First run
python main.py migrate --incremental RAWDATACOR --use-id

# Can interrupt and resume multiple times
python main.py migrate --incremental RAWDATACOR --use-id
```

**When to use:**
- Large datasets that may time out
- You need to resume from the exact last position
- The network is unstable

**Characteristics:**
- Tracks `last_id` instead of a timestamp
- Updates the state file after EACH BATCH (not just at the end)
- Can be interrupted and resumed any number of times
- Resumes from the exact record ID where it stopped
- Works with `migration_state.json`

**How it works:**
1. First run: starts from the beginning (ID = 0)
2. Each batch: updates the state file with the max ID from the batch
3. Interrupt: can stop at any time
4. Resume: the next run continues from the last stored ID
5. Continues until all rows are processed

**Example workflow for a large dataset:**

```bash
# Start ID-based migration (will migrate in batches)
python main.py migrate --incremental RAWDATACOR --use-id

# [If interrupted after 1M rows processed]

# Resume from ID 1M (automatically detects last position)
python main.py migrate --incremental RAWDATACOR --use-id

# [Continues until complete]
```

---

## State Management

### State File Location

```
migration_state.json   # In project root
```

### State File Content (Timestamp-based)

```json
{
  "rawdatacor": {
    "last_timestamp": "2024-12-11T19:30:45.123456",
    "last_updated": "2024-12-11T19:30:45.123456",
    "total_migrated": 50000
  }
}
```

### State File Content (ID-based)

```json
{
  "rawdatacor": {
    "last_id": 1000000,
    "total_migrated": 1000000,
    "last_updated": "2024-12-11T19:45:30.123456"
  }
}
```

### Reset Migration State

```python
from src.migrator.state import MigrationState

state = MigrationState()

# Reset a specific table
state.reset("rawdatacor")

# Reset all tables
state.reset()
```

---

## Recommended Workflow

### For Daily Continuous Sync

```bash
# Week 1: Initial setup
python main.py setup --create-schema
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP

# Week 2+: Daily incremental syncs (via cron job)
# Schedule: `0 2 * * * cd /path/to/project && python main.py migrate --incremental RAWDATACOR`
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

### For Large Initial Migration

```bash
# If the dataset is > 10 million rows
python main.py setup --create-schema
python main.py migrate --incremental RAWDATACOR --use-id   # Can interrupt/resume

# For subsequent syncs, use timestamps
python main.py migrate --incremental RAWDATACOR            # Timestamp-based
```

---

## Key Differences at a Glance

| Feature | Full | Timestamp | ID-based |
|---------|------|-----------|----------|
| Initial setup | ✅ Required first | ✅ After full | ✅ After full |
| Sync new/updated | ❌ No | ✅ Yes | ✅ Yes |
| Resumable | ❌ No | ⚠️ Partial* | ✅ Full |
| Batched state tracking | ❌ No | ❌ No | ✅ Yes |
| Large datasets | ⚠️ Risky | ✅ Good | ✅ Best |
| Scheduled jobs | ❌ No | ✅ Perfect | ⚠️ Unnecessary |

*Timestamp mode can resume, but must wait for the full batch to complete before continuing.

---

## Default Partitions

Both tables are partitioned by year (2014-2031), plus a DEFAULT partition:

- **rawdatacor_2014** through **rawdatacor_2031** (yearly partitions)
- **rawdatacor_default** (catches data outside 2014-2031)

The same applies to ELABDATADISP. This ensures that data with edge-case timestamps doesn't break the migration.
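The yearly-plus-DEFAULT layout above can be sketched as generated DDL. This is an illustrative helper, not the tool's actual schema code: the `make_partition_ddl` function name is invented here, and the assumption that partitioning is by a timestamp range column is inferred from the yearly layout.

```python
# Illustrative sketch: generate yearly range partitions (2014-2031)
# plus a DEFAULT partition, mirroring the layout described above.
# The helper name is an assumption, not part of the tool.

def make_partition_ddl(table: str, first_year: int = 2014, last_year: int = 2031) -> list[str]:
    """Return CREATE TABLE statements for yearly partitions plus a default."""
    stmts = []
    for year in range(first_year, last_year + 1):
        stmts.append(
            f"CREATE TABLE {table}_{year} PARTITION OF {table} "
            f"FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01');"
        )
    # Catch-all for rows whose timestamp falls outside 2014-2031
    stmts.append(f"CREATE TABLE {table}_default PARTITION OF {table} DEFAULT;")
    return stmts

for stmt in make_partition_ddl("rawdatacor"):
    print(stmt)
```

Note that with range partitions, a row for `2031-12-31` lands in `rawdatacor_2031` (the upper bound `2032-01-01` is exclusive), while anything outside the covered range falls through to the DEFAULT partition instead of failing the insert.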
---

## Monitoring

### Check Migration Progress

```bash
# View the state file
cat migration_state.json

# Check PostgreSQL row counts
psql -U postgres -h localhost -d your_db -c "SELECT COUNT(*) FROM rawdatacor;"
```

### Common Issues

**"No previous migration found"** (Timestamp mode)
- Solution: Run a full migration first with the `--full` flag

**"Duplicate key value violates unique constraint"**
- Cause: Running a full migration twice
- Solution: Use timestamp-based incremental sync instead

**"Timeout during migration"** (Large datasets)
- Solution: Switch to ID-based resumable migration with `--use-id`

---

## Summary

- **Start with:** Full migration (`--full`) for the initial data load
- **Then use:** Timestamp-based incremental (`--incremental`) for daily syncs
- **Switch to:** ID-based resumable (`--incremental --use-id`) if the full migration is too large
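The checkpoint-per-batch behaviour behind `--use-id` can be sketched as a small loop: persist `last_id` after every batch, so an interrupted run picks up exactly where it stopped. This is a simplified stand-in, not the tool's actual implementation; it only reuses the documented `migration_state.json` layout, and the `migrate_by_id` function, with an in-memory row list in place of the real MySQL query and PostgreSQL insert, is hypothetical.

```python
# Simplified sketch of the ID-based checkpoint loop. The JSON layout
# matches migration_state.json as documented above; the row list stands
# in for the real MySQL fetch, and the function names are illustrative.
import json
import os
from datetime import datetime

STATE_FILE = "migration_state.json"

def load_last_id(table: str) -> int:
    """Read the resume point for a table; 0 means start from the beginning."""
    if not os.path.exists(STATE_FILE):
        return 0
    with open(STATE_FILE) as f:
        return json.load(f).get(table, {}).get("last_id", 0)

def save_checkpoint(table: str, last_id: int, total: int) -> None:
    """Persist the checkpoint so the next run can resume from last_id."""
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)
    state[table] = {
        "last_id": last_id,
        "total_migrated": total,
        "last_updated": datetime.now().isoformat(),
    }
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)

def migrate_by_id(table: str, rows: list[int], batch_size: int = 1000) -> int:
    """Copy rows with id > checkpoint, checkpointing after EACH batch."""
    last_id = load_last_id(table)
    pending = sorted(r for r in rows if r > last_id)
    total = 0
    for i in range(0, len(pending), batch_size):
        batch = pending[i:i + batch_size]
        # ... INSERT the batch into PostgreSQL here ...
        total += len(batch)
        save_checkpoint(table, batch[-1], total)  # resume point = max id in batch
    return total
```

Checkpointing after each batch (rather than once at the end) is what makes the mode safely interruptible: at worst, a crash re-processes one batch, never the whole table.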