MySQL to PostgreSQL Migration Workflow
Overview
This tool supports three migration modes:
- Full Migration (`full_migration.py`) - Initial complete migration
- Incremental Migration (Timestamp-based) - Sync changes since last migration
- Incremental Migration (ID-based) - Resumable migration from last checkpoint
1. Initial Full Migration
First Time Setup
# Create the PostgreSQL schema
python main.py setup --create-schema
# Run full migration (one-time)
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP
When to use: The first time you migrate the data, or when you need a complete fresh migration.
Characteristics:
- Fetches ALL rows from MySQL
- No checkpoint tracking
- Cannot resume if interrupted
- Good for initial data load
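Conceptually this mode is a straight batched copy from MySQL to PostgreSQL. A minimal sketch of that idea, assuming pymysql and psycopg2 as drivers; the connection settings and column list below are placeholders, not this project's actual code:
import pymysql
import pymysql.cursors
import psycopg2
from psycopg2.extras import execute_values

BATCH_SIZE = 10_000

# Placeholder connection settings and columns; the real tool reads its own config.
mysql_conn = pymysql.connect(host="mysql-host", user="user", password="pw",
                             database="source_db",
                             cursorclass=pymysql.cursors.SSCursor)
pg_conn = psycopg2.connect(host="localhost", dbname="your_db", user="postgres")

with mysql_conn.cursor() as src, pg_conn.cursor() as dst:
    src.execute("SELECT id, payload, created_at FROM RAWDATACOR")  # ALL rows, no checkpoint
    while True:
        rows = src.fetchmany(BATCH_SIZE)
        if not rows:
            break
        execute_values(dst,
                       "INSERT INTO rawdatacor (id, payload, created_at) VALUES %s",
                       rows)
        pg_conn.commit()  # committed per batch, but no resume point is recorded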
2. Timestamp-based Incremental Migration
For Continuous Sync (Recommended for most cases)
# After initial full migration, use incremental with timestamps
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
When to use: Continuous sync of new/updated records.
Characteristics:
- Tracks `created_at` (RAWDATACOR) or `updated_at` (ELABDATADISP)
- Uses a JSON state file (`migration_state.json`)
- Only fetches rows modified since the last run
- Perfect for scheduled jobs (cron, Airflow, etc.)
- Syncs new and updated rows but NOT deletions
How it works:
- First run: Exits with the message "No previous migration found" - you must run a full migration first
- Subsequent runs: Only fetches rows where `created_at` > last_migration_timestamp (sketched below)
- Updates the state file with the new timestamp for the next run
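A minimal sketch of the timestamp filter described above, based on the state file shape shown later in this document; the helper is illustrative, not the tool's real internals:
import json
from datetime import datetime

STATE_FILE = "migration_state.json"

def load_state() -> dict:
    # The state file does not exist before the first full migration.
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

state = load_state()
last_ts = state.get("rawdatacor", {}).get("last_timestamp")
if last_ts is None:
    raise SystemExit("No previous migration found - run a full migration first")

# Only rows changed since the last run (created_at for RAWDATACOR,
# updated_at for ELABDATADISP).
query = "SELECT * FROM RAWDATACOR WHERE created_at > %s ORDER BY created_at"

# ... run the query and copy the rows, then record the new high-water mark
# (a real implementation would typically store the max created_at seen, not now()).
state.setdefault("rawdatacor", {})["last_timestamp"] = datetime.now().isoformat()
state["rawdatacor"]["last_updated"] = datetime.now().isoformat()
with open(STATE_FILE, "w") as f:
    json.dump(state, f, indent=2)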
Example workflow:
# Day 1: Initial full migration
python main.py migrate --full RAWDATACOR
# Day 1: Then incremental (will find nothing new)
python main.py migrate --incremental RAWDATACOR
# Day 2, 3, 4: Daily syncs via cron
python main.py migrate --incremental RAWDATACOR
3. ID-based Incremental Migration (Resumable)
For Large Datasets or Unreliable Connections
# First run
python main.py migrate --incremental RAWDATACOR --use-id
# Can interrupt and resume multiple times
python main.py migrate --incremental RAWDATACOR --use-id
When to use:
- Large datasets that may timeout
- Need to resume from exact last position
- Network is unstable
Characteristics:
- Tracks `last_id` instead of a timestamp
- Updates the state file after EACH BATCH (not just at the end)
- Can be interrupted and resumed as many times as needed
- Resumes from the exact record ID where it stopped
- Works with `migration_state.json`
How it works:
- First run: Starts from beginning (ID = 0)
- Each batch: Updates state file with max ID from batch
- Interrupt: Can stop at any time
- Resume: Next run continues from last ID stored
- Continues until all rows processed
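A rough sketch of the per-batch checkpointing loop described above, again assuming pymysql/psycopg2 and placeholder column names; the real migrator internals may differ, but the resume-from-`last_id` idea is the same:
import json
from datetime import datetime

import pymysql
import psycopg2
from psycopg2.extras import execute_values

STATE_FILE = "migration_state.json"
BATCH_SIZE = 10_000

def load_state() -> dict:
    try:
        with open(STATE_FILE) as f:
            return json.load(f)
    except FileNotFoundError:
        return {}

def save_checkpoint(state: dict, table: str, last_id: int, total: int) -> None:
    # Persisted after EVERY batch so an interrupted run resumes from this exact ID.
    state[table] = {"last_id": last_id, "total_migrated": total,
                    "last_updated": datetime.now().isoformat()}
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)

state = load_state()
last_id = state.get("rawdatacor", {}).get("last_id", 0)        # first run starts at 0
total = state.get("rawdatacor", {}).get("total_migrated", 0)

mysql_conn = pymysql.connect(host="mysql-host", user="user", password="pw", database="source_db")
pg_conn = psycopg2.connect(host="localhost", dbname="your_db", user="postgres")

with mysql_conn.cursor() as src, pg_conn.cursor() as dst:
    while True:
        src.execute("SELECT id, payload, created_at FROM RAWDATACOR "
                    "WHERE id > %s ORDER BY id LIMIT %s", (last_id, BATCH_SIZE))
        rows = src.fetchall()
        if not rows:
            break
        execute_values(dst, "INSERT INTO rawdatacor (id, payload, created_at) VALUES %s", rows)
        pg_conn.commit()
        last_id = rows[-1][0]          # max ID in the batch (rows are ordered by id)
        total += len(rows)
        save_checkpoint(state, "rawdatacor", last_id, total)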
Example workflow for large dataset:
# Start ID-based migration (will migrate in batches)
python main.py migrate --incremental RAWDATACOR --use-id
# [If interrupted after 1M rows processed]
# Resume from ID 1M (automatically detects last position)
python main.py migrate --incremental RAWDATACOR --use-id
# [Continues until complete]
State Management
State File Location
migration_state.json # In project root
State File Content (Timestamp-based)
{
"rawdatacor": {
"last_timestamp": "2024-12-11T19:30:45.123456",
"last_updated": "2024-12-11T19:30:45.123456",
"total_migrated": 50000
}
}
State File Content (ID-based)
{
"rawdatacor": {
"last_id": 1000000,
"total_migrated": 1000000,
"last_updated": "2024-12-11T19:45:30.123456"
}
}
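For reference, a minimal version of what a JSON-backed state helper such as `MigrationState` could look like, reconstructed only from the file shapes above and the `reset()` call shown below; every method other than `reset()` is illustrative:
import json
from datetime import datetime
from pathlib import Path
from typing import Optional

class MigrationState:
    # JSON-backed checkpoint store; only reset() is documented here, the rest is a guess.
    def __init__(self, path: str = "migration_state.json"):
        self.path = Path(path)
        self._state = json.loads(self.path.read_text()) if self.path.exists() else {}

    def get(self, table: str) -> dict:
        return self._state.get(table, {})

    def update(self, table: str, **fields) -> None:
        entry = self._state.setdefault(table, {})
        entry.update(fields)
        entry["last_updated"] = datetime.now().isoformat()
        self.path.write_text(json.dumps(self._state, indent=2))

    def reset(self, table: Optional[str] = None) -> None:
        # Drop one table's checkpoint, or the whole file when no table is given.
        if table is None:
            self._state = {}
        else:
            self._state.pop(table, None)
        self.path.write_text(json.dumps(self._state, indent=2))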
Reset Migration State
from src.migrator.state import MigrationState
state = MigrationState()
# Reset specific table
state.reset("rawdatacor")
# Reset all tables
state.reset()
Recommended Workflow
For Daily Continuous Sync
# Week 1: Initial setup
python main.py setup --create-schema
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP
# Week 2+: Daily incremental syncs (via cron job)
# Schedule: `0 2 * * * cd /path/to/project && python main.py migrate --incremental RAWDATACOR`
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
For Large Initial Migration
# If dataset > 10 million rows
python main.py setup --create-schema
python main.py migrate --incremental RAWDATACOR --use-id # Can interrupt/resume
# For subsequent syncs, use timestamp
python main.py migrate --incremental RAWDATACOR # Timestamp-based
Key Differences at a Glance
| Feature | Full | Timestamp | ID-based |
|---|---|---|---|
| Initial setup | ✅ Required first | ✅ After full | ✅ After full |
| Sync new/updated | ❌ No | ✅ Yes | ✅ Yes |
| Resumable | ❌ No | ⚠️ Partial* | ✅ Full |
| Batched state tracking | ❌ No | ❌ No | ✅ Yes |
| Large datasets | ⚠️ Risky | ✅ Good | ✅ Best |
| Scheduled jobs | ❌ No | ✅ Perfect | ⚠️ Unnecessary |
*Timestamp mode can be re-run after an interruption, but it only records progress at the end of a run, so an interrupted run restarts from the previous timestamp rather than from where it stopped.
Default Partitions
Both tables are partitioned by year (2014-2031) plus a DEFAULT partition:
- rawdatacor_2014 through rawdatacor_2031 (yearly partitions)
- rawdatacor_default (catches data outside 2014-2031)
The same layout applies to ELABDATADISP. The DEFAULT partition ensures rows with timestamps outside 2014-2031 don't break the migration.
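The actual DDL comes from `setup --create-schema`; the sketch below only illustrates the yearly-partition layout, assuming the partition key is the row timestamp and using placeholder column definitions:
import psycopg2

# Placeholder connection and columns; the real schema is created by `setup --create-schema`.
conn = psycopg2.connect(host="localhost", dbname="your_db", user="postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS rawdatacor (
            id         bigint      NOT NULL,
            payload    jsonb,
            created_at timestamptz NOT NULL
        ) PARTITION BY RANGE (created_at);
    """)
    # Yearly partitions 2014-2031, exactly as listed above.
    for year in range(2014, 2032):
        cur.execute(f"""
            CREATE TABLE IF NOT EXISTS rawdatacor_{year}
            PARTITION OF rawdatacor
            FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01');
        """)
    # Catch-all for timestamps outside 2014-2031.
    cur.execute("CREATE TABLE IF NOT EXISTS rawdatacor_default PARTITION OF rawdatacor DEFAULT;")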
Monitoring
Check Migration Progress
# View state file
cat migration_state.json
# Check PostgreSQL row counts
psql -U postgres -h localhost -d your_db -c "SELECT COUNT(*) FROM rawdatacor;"
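As an alternative to the two commands above, a small script can cross-check the state file against the actual PostgreSQL row count (connection parameters are placeholders):
import json
import psycopg2

# Compare the checkpoint file with what actually landed in PostgreSQL.
with open("migration_state.json") as f:
    state = json.load(f)

conn = psycopg2.connect(host="localhost", dbname="your_db", user="postgres")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM rawdatacor;")
    pg_count = cur.fetchone()[0]

print("state file says:", state.get("rawdatacor", {}).get("total_migrated", 0))
print("postgres has:   ", pg_count)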
Common Issues
"No previous migration found" (Timestamp mode)
- Solution: Run a full migration first with the `--full` flag
"Duplicate key value violates unique constraint"
- Cause: Running full migration twice
- Solution: Use timestamp-based incremental sync instead
"Timeout during migration" (Large datasets)
- Solution: Switch to the ID-based resumable migration with `--use-id` (see the sketch below)
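Long-running fetches can also drop the connection outright ("connection is lost"). That is usually mitigated at the connection level with longer read/write timeouts (e.g. 300 seconds), a larger max_allowed_packet (e.g. 64 MB), and a reconnect-and-retry wrapper around fetches. A sketch of that idea with pymysql, not necessarily how this tool configures its own connection:
import pymysql

# Longer server waits and a bigger packet ceiling reduce mid-query disconnects on bulk reads.
conn = pymysql.connect(
    host="mysql-host", user="user", password="pw", database="source_db",
    read_timeout=300,                     # 5 minutes for long SELECTs
    write_timeout=300,                    # 5 minutes for writes
    max_allowed_packet=64 * 1024 * 1024,  # 64 MB for bulk transfers
)

def fetch_with_retry(query, retries=3):
    # Reconnect on connection loss, then retry; give up after `retries` attempts.
    for attempt in range(1, retries + 1):
        try:
            with conn.cursor() as cur:
                cur.execute(query)
                return cur.fetchall()
        except pymysql.OperationalError as exc:
            print(f"fetch failed (attempt {attempt}/{retries}): {exc}")
            conn.ping(reconnect=True)
    raise RuntimeError(f"query still failing after {retries} retries")

rows = fetch_with_retry("SELECT COUNT(*) FROM RAWDATACOR")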
Summary
- Start with: Full migration (`--full`) for the initial data load
- Then use: Timestamp-based incremental (`--incremental`) for daily syncs
- Switch to: ID-based resumable (`--incremental --use-id`) if the full migration is too large