MySQL to PostgreSQL Migration Workflow

Overview

This tool supports three migration modes:

  1. Full Migration (full_migration.py) - Initial complete migration
  2. Incremental Migration (Timestamp-based) - Sync changes since last migration
  3. Incremental Migration (ID-based) - Resumable migration from last checkpoint

1. Initial Full Migration

First Time Setup

# Create the PostgreSQL schema
python main.py setup --create-schema

# Run full migration (one-time)
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP

When to use: The first time you migrate data, or when you need a complete fresh migration.

Characteristics:

  • Fetches ALL rows from MySQL
  • No checkpoint tracking
  • Cannot resume if interrupted
  • Good for initial data load
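
Conceptually, a full migration is a batched fetch-and-insert loop over the entire source table. The sketch below is an illustration of that idea only, assuming pymysql and psycopg2 with placeholder connection settings and column names; the tool's actual connector and batching logic live in the project source.

# Illustrative sketch only -- not the tool's actual connector.
# Connection settings and column names other than RAWDATACOR are placeholders.
import pymysql
from pymysql.cursors import SSCursor
import psycopg2
from psycopg2.extras import execute_values

BATCH_SIZE = 10_000

mysql_conn = pymysql.connect(host="mysql-host", user="user", password="secret",
                             database="source_db", read_timeout=300)
pg_conn = psycopg2.connect(host="localhost", dbname="target_db", user="postgres")

# SSCursor streams rows from the server instead of buffering everything in memory
with mysql_conn.cursor(SSCursor) as src, pg_conn.cursor() as dst:
    src.execute("SELECT id, value, created_at FROM RAWDATACOR")  # fetches ALL rows
    while True:
        rows = src.fetchmany(BATCH_SIZE)
        if not rows:
            break
        # Bulk-insert one batch into the partitioned PostgreSQL table
        execute_values(dst,
                       "INSERT INTO rawdatacor (id, value, created_at) VALUES %s",
                       rows)
        pg_conn.commit()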

2. Timestamp-based Incremental Migration

# After initial full migration, use incremental with timestamps
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP

When to use: Continuous sync of new/updated records.

Characteristics:

  • Tracks created_at (RAWDATACOR) or updated_at (ELABDATADISP)
  • Uses JSON state file (migration_state.json)
  • Only fetches rows modified since last run
  • Perfect for scheduled jobs (cron, airflow, etc.)
  • Syncs changes but NOT deletions

How it works:

  1. First run: Returns with message "No previous migration found" - must run full migration first
  2. Subsequent runs: Only fetches rows where created_at > last_migration_timestamp
  3. Updates state file with new timestamp for next run

Example workflow:

# Day 1: Initial full migration
python main.py migrate --full RAWDATACOR

# Day 1: Then incremental (will find nothing new)
python main.py migrate --incremental RAWDATACOR

# Day 2, 3, 4: Daily syncs via cron
python main.py migrate --incremental RAWDATACOR
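
Under the hood, the timestamp mode boils down to filtering on the stored watermark and then advancing it. Here is a minimal sketch of that logic, assuming pymysql and psycopg2 with placeholder column names and connection settings; it is not the tool's actual implementation, which may, for example, record the maximum created_at seen rather than the current time.

# Illustrative sketch of the timestamp-based incremental logic -- not the tool's code.
import json
from datetime import datetime

import pymysql
import psycopg2
from psycopg2.extras import execute_values

STATE_FILE = "migration_state.json"

with open(STATE_FILE) as f:
    state = json.load(f)
last_ts = state["rawdatacor"]["last_timestamp"]   # watermark saved by the previous run

mysql_conn = pymysql.connect(host="mysql-host", user="user", password="secret",
                             database="source_db")
pg_conn = psycopg2.connect(host="localhost", dbname="target_db", user="postgres")

with mysql_conn.cursor() as src, pg_conn.cursor() as dst:
    # Only rows created since the last run are fetched
    src.execute("SELECT id, value, created_at FROM RAWDATACOR "
                "WHERE created_at > %s ORDER BY created_at", (last_ts,))
    rows = src.fetchall()
    if rows:
        execute_values(dst,
                       "INSERT INTO rawdatacor (id, value, created_at) VALUES %s",
                       rows)
        pg_conn.commit()

# Advance the watermark so the next run only sees newer rows
now = datetime.now().isoformat()
state["rawdatacor"]["last_timestamp"] = now
state["rawdatacor"]["last_updated"] = now
state["rawdatacor"]["total_migrated"] += len(rows)
with open(STATE_FILE, "w") as f:
    json.dump(state, f, indent=2)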

3. ID-based Incremental Migration (Resumable)

For Large Datasets or Unreliable Connections

# First run
python main.py migrate --incremental RAWDATACOR --use-id

# Can interrupt and resume multiple times
python main.py migrate --incremental RAWDATACOR --use-id

When to use:

  • Large datasets that may time out
  • Need to resume from the exact last position
  • Unstable or unreliable network connections

Characteristics:

  • Tracks last_id instead of timestamp
  • Updates state file after EACH BATCH (not just at end)
  • Can interrupt and resume dozens of times
  • Resumes from the exact record ID where it stopped
  • Works with migration_state.json

How it works:

  1. First run: Starts from beginning (ID = 0)
  2. Each batch: Updates state file with max ID from batch
  3. Interrupt: Can stop at any time
  4. Resume: Next run continues from last ID stored
  5. Continues until all rows processed

Example workflow for large dataset:

# Start ID-based migration (will migrate in batches)
python main.py migrate --incremental RAWDATACOR --use-id

# [If interrupted after 1M rows processed]

# Resume from ID 1M (automatically detects last position)
python main.py migrate --incremental RAWDATACOR --use-id

# [Continues until complete]
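
What makes this mode resumable is that the highest migrated ID is checkpointed after every batch. A minimal sketch of that loop follows, again assuming pymysql and psycopg2 with placeholder column names and connection settings rather than the tool's actual code.

# Illustrative sketch of the ID-based resumable loop -- not the tool's code.
import json
from datetime import datetime

import pymysql
import psycopg2
from psycopg2.extras import execute_values

STATE_FILE = "migration_state.json"
BATCH_SIZE = 10_000

try:
    with open(STATE_FILE) as f:
        state = json.load(f)
except FileNotFoundError:
    state = {}
entry = state.setdefault("rawdatacor", {"last_id": 0, "total_migrated": 0})

mysql_conn = pymysql.connect(host="mysql-host", user="user", password="secret",
                             database="source_db")
pg_conn = psycopg2.connect(host="localhost", dbname="target_db", user="postgres")

with mysql_conn.cursor() as src, pg_conn.cursor() as dst:
    while True:
        # Fetch the next batch strictly after the last migrated ID
        src.execute("SELECT id, value, created_at FROM RAWDATACOR "
                    "WHERE id > %s ORDER BY id LIMIT %s",
                    (entry["last_id"], BATCH_SIZE))
        rows = src.fetchall()
        if not rows:
            break
        execute_values(dst,
                       "INSERT INTO rawdatacor (id, value, created_at) VALUES %s",
                       rows)
        pg_conn.commit()

        # Checkpoint after EVERY batch so an interruption loses at most one batch
        entry["last_id"] = rows[-1][0]        # id is the first selected column
        entry["total_migrated"] = entry.get("total_migrated", 0) + len(rows)
        entry["last_updated"] = datetime.now().isoformat()
        with open(STATE_FILE, "w") as f:
            json.dump(state, f, indent=2)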

State Management

State File Location

migration_state.json  # In project root

State File Content (Timestamp-based)

{
  "rawdatacor": {
    "last_timestamp": "2024-12-11T19:30:45.123456",
    "last_updated": "2024-12-11T19:30:45.123456",
    "total_migrated": 50000
  }
}

State File Content (ID-based)

{
  "rawdatacor": {
    "last_id": 1000000,
    "total_migrated": 1000000,
    "last_updated": "2024-12-11T19:45:30.123456"
  }
}

Reset Migration State

from src.migrator.state import MigrationState

state = MigrationState()

# Reset specific table
state.reset("rawdatacor")

# Reset all tables
state.reset()

For Daily Continuous Sync

# Week 1: Initial setup
python main.py setup --create-schema
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP

# Week 2+: Daily incremental syncs (via cron job)
# Schedule: `0 2 * * * cd /path/to/project && python main.py migrate --incremental RAWDATACOR`
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP

For Large Initial Migration

# If dataset > 10 million rows
python main.py setup --create-schema
python main.py migrate --incremental RAWDATACOR --use-id  # Can interrupt/resume

# For subsequent syncs, use timestamp
python main.py migrate --incremental RAWDATACOR  # Timestamp-based

Key Differences at a Glance

Feature                  Full             Timestamp      ID-based
Initial setup            Required first   After full     After full
Sync new/updated         No               Yes            Yes
Resumable                No               ⚠️ Partial*     Full
Batched state tracking   No               No             Yes
Large datasets           ⚠️ Risky          Good           Best
Scheduled jobs           No               Perfect        ⚠️ Unnecessary

*Timestamp mode can resume, but only from the last completed run; unlike ID-based mode, it does not checkpoint after each batch.


Default Partitions

Both tables are partitioned by year (2014-2031) plus a DEFAULT partition:

  • rawdatacor_2014 through rawdatacor_2031 (yearly partitions)
  • rawdatacor_default (catches data outside 2014-2031)

The same layout applies to ELABDATADISP. This ensures that data with edge-case timestamps doesn't break the migration.
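
In PostgreSQL terms, a layout like this corresponds to a table partitioned BY RANGE on the timestamp column, with one partition per year plus a DEFAULT partition. The sketch below just prints equivalent DDL with assumed column names; the real schema is created by python main.py setup --create-schema.

# Illustrative DDL for the yearly-plus-default layout described above.
# Column names are assumptions; the real schema comes from the setup command.
ddl = ["""
CREATE TABLE rawdatacor (
    id BIGINT NOT NULL,
    value TEXT,
    created_at TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);
"""]

for year in range(2014, 2032):  # yearly partitions 2014-2031
    ddl.append(
        f"CREATE TABLE rawdatacor_{year} PARTITION OF rawdatacor "
        f"FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01');"
    )

# Catch-all partition for timestamps outside 2014-2031
ddl.append("CREATE TABLE rawdatacor_default PARTITION OF rawdatacor DEFAULT;")

print("\n".join(ddl))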


Monitoring

Check Migration Progress

# View state file
cat migration_state.json

# Check PostgreSQL row counts
psql -U postgres -h localhost -d your_db -c "SELECT COUNT(*) FROM rawdatacor;"
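
For a quick end-to-end check, the source and target row counts can also be compared directly. A small sketch, assuming pymysql and psycopg2 with placeholder connection settings (this script is not part of the tool):

# Illustrative row-count comparison between the MySQL source and PostgreSQL target.
import pymysql
import psycopg2

mysql_conn = pymysql.connect(host="mysql-host", user="user", password="secret",
                             database="source_db")
pg_conn = psycopg2.connect(host="localhost", dbname="your_db", user="postgres")

for mysql_table, pg_table in [("RAWDATACOR", "rawdatacor"),
                              ("ELABDATADISP", "elabdatadisp")]:
    with mysql_conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {mysql_table}")
        src_count = cur.fetchone()[0]
    with pg_conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {pg_table}")
        dst_count = cur.fetchone()[0]
    print(f"{mysql_table}: MySQL={src_count}  PostgreSQL={dst_count}  "
          f"missing={src_count - dst_count}")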

Common Issues

"No previous migration found" (Timestamp mode)

  • Solution: Run full migration first with --full flag

"Duplicate key value violates unique constraint"

  • Cause: Running full migration twice
  • Solution: Use timestamp-based incremental sync instead

"Timeout during migration" (Large datasets)

  • Solution: Switch to ID-based resumable migration with --use-id

Summary

  • Start with: Full migration (--full) for initial data load
  • Then use: Timestamp-based incremental (--incremental) for daily syncs
  • Switch to: ID-based resumable (--incremental --use-id) if full migration is too large