MySQL to PostgreSQL Migration Workflow

Overview

This tool supports three migration modes:

  1. Full Migration (full_migration.py) - Initial complete migration
  2. Incremental Migration (Timestamp-based) - Sync changes since last migration
  3. Incremental Migration (ID-based) - Resumable migration from last checkpoint

1. Initial Full Migration

First Time Setup

# Create the PostgreSQL schema
python main.py setup --create-schema

# Run full migration (one-time)
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP

When to use: The first time you migrate data, or when you need a complete fresh migration.

Characteristics:

  • Fetches ALL rows from MySQL
  • No checkpoint tracking
  • Cannot resume if interrupted
  • Good for initial data load
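
Conceptually, a full migration is a batched fetch-and-insert loop over the entire source table. The sketch below is an illustration of that idea only, assuming pymysql and psycopg2 with placeholder connection settings and column names; the tool's actual connector and batching logic live in the project source.

# Illustrative sketch only -- not the tool's actual connector.
# Connection settings and column names other than RAWDATACOR are placeholders.
import pymysql
from pymysql.cursors import SSCursor
import psycopg2
from psycopg2.extras import execute_values

BATCH_SIZE = 10_000

mysql_conn = pymysql.connect(host="mysql-host", user="user", password="secret",
                             database="source_db", read_timeout=300)
pg_conn = psycopg2.connect(host="localhost", dbname="target_db", user="postgres")

# SSCursor streams rows from the server instead of buffering everything in memory
with mysql_conn.cursor(SSCursor) as src, pg_conn.cursor() as dst:
    src.execute("SELECT id, value, created_at FROM RAWDATACOR")  # fetches ALL rows
    while True:
        rows = src.fetchmany(BATCH_SIZE)
        if not rows:
            break
        # Bulk-insert one batch into the partitioned PostgreSQL table
        execute_values(dst,
                       "INSERT INTO rawdatacor (id, value, created_at) VALUES %s",
                       rows)
        pg_conn.commit()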

2. Timestamp-based Incremental Migration

# After initial full migration, use incremental with timestamps
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP

When to use: Continuous sync of new/updated records.

Characteristics:

  • Tracks created_at (RAWDATACOR) or updated_at (ELABDATADISP)
  • Uses JSON state file (migration_state.json)
  • Only fetches rows modified since last run
  • Perfect for scheduled jobs (cron, airflow, etc.)
  • Syncs changes but NOT deletions

How it works:

  1. First run: Returns with message "No previous migration found" - must run full migration first
  2. Subsequent runs: Only fetches rows where created_at > last_migration_timestamp
  3. Updates state file with new timestamp for next run

Example workflow:

# Day 1: Initial full migration
python main.py migrate --full RAWDATACOR

# Day 1: Then incremental (will find nothing new)
python main.py migrate --incremental RAWDATACOR

# Day 2, 3, 4: Daily syncs via cron
python main.py migrate --incremental RAWDATACOR
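
Under the hood, the timestamp mode boils down to filtering on the stored watermark and then advancing it. Here is a minimal sketch of that logic, assuming pymysql and psycopg2 with placeholder column names and connection settings; it is not the tool's actual implementation, which may, for example, record the maximum created_at seen rather than the current time.

# Illustrative sketch of the timestamp-based incremental logic -- not the tool's code.
import json
from datetime import datetime

import pymysql
import psycopg2
from psycopg2.extras import execute_values

STATE_FILE = "migration_state.json"

with open(STATE_FILE) as f:
    state = json.load(f)
last_ts = state["rawdatacor"]["last_timestamp"]   # watermark saved by the previous run

mysql_conn = pymysql.connect(host="mysql-host", user="user", password="secret",
                             database="source_db")
pg_conn = psycopg2.connect(host="localhost", dbname="target_db", user="postgres")

with mysql_conn.cursor() as src, pg_conn.cursor() as dst:
    # Only rows created since the last run are fetched
    src.execute("SELECT id, value, created_at FROM RAWDATACOR "
                "WHERE created_at > %s ORDER BY created_at", (last_ts,))
    rows = src.fetchall()
    if rows:
        execute_values(dst,
                       "INSERT INTO rawdatacor (id, value, created_at) VALUES %s",
                       rows)
        pg_conn.commit()

# Advance the watermark so the next run only sees newer rows
now = datetime.now().isoformat()
state["rawdatacor"]["last_timestamp"] = now
state["rawdatacor"]["last_updated"] = now
state["rawdatacor"]["total_migrated"] += len(rows)
with open(STATE_FILE, "w") as f:
    json.dump(state, f, indent=2)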

3. ID-based Incremental Migration (Resumable)

For Large Datasets or Unreliable Connections

# First run
python main.py migrate --incremental RAWDATACOR --use-id

# Can interrupt and resume multiple times
python main.py migrate --incremental RAWDATACOR --use-id

When to use:

  • Large datasets that may time out
  • Need to resume from the exact last position
  • Unstable or unreliable network connections

Characteristics:

  • Tracks last_id instead of timestamp
  • Updates state file after EACH BATCH (not just at end)
  • Can interrupt and resume dozens of times
  • Resumes from the exact record ID where it stopped
  • Works with migration_state.json

How it works:

  1. First run: Starts from beginning (ID = 0)
  2. Each batch: Updates state file with max ID from batch
  3. Interrupt: Can stop at any time
  4. Resume: Next run continues from last ID stored
  5. Continues until all rows processed

Example workflow for large dataset:

# Start ID-based migration (will migrate in batches)
python main.py migrate --incremental RAWDATACOR --use-id

# [If interrupted after 1M rows processed]

# Resume from ID 1M (automatically detects last position)
python main.py migrate --incremental RAWDATACOR --use-id

# [Continues until complete]
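
What makes this mode resumable is that the highest migrated ID is checkpointed after every batch. A minimal sketch of that loop follows, again assuming pymysql and psycopg2 with placeholder column names and connection settings rather than the tool's actual code.

# Illustrative sketch of the ID-based resumable loop -- not the tool's code.
import json
from datetime import datetime

import pymysql
import psycopg2
from psycopg2.extras import execute_values

STATE_FILE = "migration_state.json"
BATCH_SIZE = 10_000

try:
    with open(STATE_FILE) as f:
        state = json.load(f)
except FileNotFoundError:
    state = {}
entry = state.setdefault("rawdatacor", {"last_id": 0, "total_migrated": 0})

mysql_conn = pymysql.connect(host="mysql-host", user="user", password="secret",
                             database="source_db")
pg_conn = psycopg2.connect(host="localhost", dbname="target_db", user="postgres")

with mysql_conn.cursor() as src, pg_conn.cursor() as dst:
    while True:
        # Fetch the next batch strictly after the last migrated ID
        src.execute("SELECT id, value, created_at FROM RAWDATACOR "
                    "WHERE id > %s ORDER BY id LIMIT %s",
                    (entry["last_id"], BATCH_SIZE))
        rows = src.fetchall()
        if not rows:
            break
        execute_values(dst,
                       "INSERT INTO rawdatacor (id, value, created_at) VALUES %s",
                       rows)
        pg_conn.commit()

        # Checkpoint after EVERY batch so an interruption loses at most one batch
        entry["last_id"] = rows[-1][0]        # id is the first selected column
        entry["total_migrated"] = entry.get("total_migrated", 0) + len(rows)
        entry["last_updated"] = datetime.now().isoformat()
        with open(STATE_FILE, "w") as f:
            json.dump(state, f, indent=2)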

State Management

State File Location

migration_state.json  # In project root

State File Content (Timestamp-based)

{
  "rawdatacor": {
    "last_timestamp": "2024-12-11T19:30:45.123456",
    "last_updated": "2024-12-11T19:30:45.123456",
    "total_migrated": 50000
  }
}

State File Content (ID-based)

{
  "rawdatacor": {
    "last_id": 1000000,
    "total_migrated": 1000000,
    "last_updated": "2024-12-11T19:45:30.123456"
  }
}

Reset Migration State

from src.migrator.state import MigrationState

state = MigrationState()

# Reset specific table
state.reset("rawdatacor")

# Reset all tables
state.reset()

For Daily Continuous Sync

# Week 1: Initial setup
python main.py setup --create-schema
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP

# Week 2+: Daily incremental syncs (via cron job)
# Schedule: `0 2 * * * cd /path/to/project && python main.py migrate --incremental RAWDATACOR`
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP

For Large Initial Migration

# If dataset > 10 million rows
python main.py setup --create-schema
python main.py migrate --incremental RAWDATACOR --use-id  # Can interrupt/resume

# For subsequent syncs, use timestamp
python main.py migrate --incremental RAWDATACOR  # Timestamp-based

Key Differences at a Glance

Feature                  Full             Timestamp      ID-based
Initial setup            Required first   After full     After full
Sync new/updated         No               Yes            Yes
Resumable                No               ⚠️ Partial*     Full
Batched state tracking   No               No             Yes
Large datasets           ⚠️ Risky          Good           Best
Scheduled jobs           No               Perfect        ⚠️ Unnecessary

*Timestamp mode can resume, but only from the last completed run; unlike ID-based mode, it does not checkpoint after each batch.


Default Partitions

Both tables are partitioned by year (2014-2031) plus a DEFAULT partition:

  • rawdatacor_2014 through rawdatacor_2031 (yearly partitions)
  • rawdatacor_default (catches data outside 2014-2031)

The same layout applies to ELABDATADISP. This ensures that data with edge-case timestamps doesn't break the migration.
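
In PostgreSQL terms, a layout like this corresponds to a table partitioned BY RANGE on the timestamp column, with one partition per year plus a DEFAULT partition. The sketch below just prints equivalent DDL with assumed column names; the real schema is created by python main.py setup --create-schema.

# Illustrative DDL for the yearly-plus-default layout described above.
# Column names are assumptions; the real schema comes from the setup command.
ddl = ["""
CREATE TABLE rawdatacor (
    id BIGINT NOT NULL,
    value TEXT,
    created_at TIMESTAMP NOT NULL
) PARTITION BY RANGE (created_at);
"""]

for year in range(2014, 2032):  # yearly partitions 2014-2031
    ddl.append(
        f"CREATE TABLE rawdatacor_{year} PARTITION OF rawdatacor "
        f"FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01');"
    )

# Catch-all partition for timestamps outside 2014-2031
ddl.append("CREATE TABLE rawdatacor_default PARTITION OF rawdatacor DEFAULT;")

print("\n".join(ddl))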


Monitoring

Check Migration Progress

# View state file
cat migration_state.json

# Check PostgreSQL row counts
psql -U postgres -h localhost -d your_db -c "SELECT COUNT(*) FROM rawdatacor;"
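
For a quick end-to-end check, the source and target row counts can also be compared directly. A small sketch, assuming pymysql and psycopg2 with placeholder connection settings (this script is not part of the tool):

# Illustrative row-count comparison between the MySQL source and PostgreSQL target.
import pymysql
import psycopg2

mysql_conn = pymysql.connect(host="mysql-host", user="user", password="secret",
                             database="source_db")
pg_conn = psycopg2.connect(host="localhost", dbname="your_db", user="postgres")

for mysql_table, pg_table in [("RAWDATACOR", "rawdatacor"),
                              ("ELABDATADISP", "elabdatadisp")]:
    with mysql_conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {mysql_table}")
        src_count = cur.fetchone()[0]
    with pg_conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {pg_table}")
        dst_count = cur.fetchone()[0]
    print(f"{mysql_table}: MySQL={src_count}  PostgreSQL={dst_count}  "
          f"missing={src_count - dst_count}")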

Common Issues

"No previous migration found" (Timestamp mode)

  • Solution: Run full migration first with --full flag

"Duplicate key value violates unique constraint"

  • Cause: Running full migration twice
  • Solution: Use timestamp-based incremental sync instead

"Timeout during migration" (Large datasets)

  • Solution: Switch to ID-based resumable migration with --use-id

Summary

  • Start with: Full migration (--full) for initial data load
  • Then use: Timestamp-based incremental (--incremental) for daily syncs
  • Switch to: ID-based resumable (--incremental --use-id) if full migration is too large