fix: Add timeout settings and retry logic to MySQL connector
Configuration improvements:
- Set read_timeout=300 (5 minutes) to handle long queries
- Set write_timeout=300 (5 minutes) for writes
- Set max_allowed_packet=64MB to handle larger data transfers

Retry logic:
- Added retry mechanism with max 3 retries on fetch failure
- Auto-reconnect on connection loss before retry
- Better error messages showing retry attempts

This fixes the 'connection is lost' error that occurs during long-running migrations by:
1. Giving MySQL queries more time to complete
2. Allowing larger packet sizes for bulk data
3. Automatically recovering from connection drops

Fixes: 'Connection is lost' error during full migration
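As a rough sketch of the behaviour described above, assuming a PyMySQL-based connector; the helper names `connect_mysql` and `fetch_with_retry` are illustrative, not the connector's actual API:

```python
import time
import pymysql

def connect_mysql(cfg: dict) -> pymysql.connections.Connection:
    """Open a MySQL connection with the timeout/packet settings from this commit."""
    return pymysql.connect(
        host=cfg["host"],
        user=cfg["user"],
        password=cfg["password"],
        database=cfg["database"],
        read_timeout=300,                     # allow long-running SELECTs (5 minutes)
        write_timeout=300,                    # allow slow writes (5 minutes)
        max_allowed_packet=64 * 1024 * 1024,  # 64 MB packets for bulk transfers
    )

def fetch_with_retry(conn, query, params=None, max_retries=3):
    """Fetch rows, reconnecting and retrying up to max_retries times on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            conn.ping(reconnect=True)  # transparently reopen a dropped connection
            with conn.cursor() as cur:
                cur.execute(query, params)
                return cur.fetchall()
        except pymysql.err.OperationalError as exc:
            if attempt == max_retries:
                raise
            print(f"Fetch failed (attempt {attempt}/{max_retries}): {exc}; retrying...")
            time.sleep(2 * attempt)  # simple backoff before the next attempt
```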
New file: MIGRATION_WORKFLOW.md
# MySQL to PostgreSQL Migration Workflow

## Overview

This tool supports three migration modes:

1. **Full Migration** (`full_migration.py`) - Initial complete migration
2. **Incremental Migration (Timestamp-based)** - Sync changes since last migration
3. **Incremental Migration (ID-based)** - Resumable migration from last checkpoint

---
## 1. Initial Full Migration

### First Time Setup

```bash
# Create the PostgreSQL schema
python main.py setup --create-schema

# Run full migration (one-time)
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP
```

**When to use:** The first time you migrate the data, or when you need a complete fresh migration.

**Characteristics:**
- Fetches ALL rows from MySQL
- No checkpoint tracking
- Cannot resume if interrupted
- Good for initial data load

---
## 2. Timestamp-based Incremental Migration

### For Continuous Sync (Recommended for most cases)

```bash
# After initial full migration, use incremental with timestamps
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

**When to use:** Continuous sync of new/updated records.

**Characteristics:**
- Tracks `created_at` (RAWDATACOR) or `updated_at` (ELABDATADISP)
- Uses a JSON state file (`migration_state.json`)
- Only fetches rows modified since the last run
- Perfect for scheduled jobs (cron, Airflow, etc.)
- Syncs changes but NOT deletions

**How it works:**
1. First run: Exits with the message "No previous migration found" - you must run a full migration first
2. Subsequent runs: Only fetches rows where `created_at` > last_migration_timestamp
3. Updates the state file with the new timestamp for the next run
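For orientation, a minimal sketch of that loop, assuming a PyMySQL connection and the state-file layout shown later in this document; the function and its parameters are illustrative, and the PostgreSQL write step is elided:

```python
import json
from datetime import datetime

STATE_FILE = "migration_state.json"

def incremental_sync(mysql_conn, source_table="RAWDATACOR",
                     state_key="rawdatacor", ts_column="created_at"):
    """Fetch only rows newer than the stored timestamp, then advance the checkpoint."""
    with open(STATE_FILE) as f:
        state = json.load(f)
    last_ts = state.get(state_key, {}).get("last_timestamp")
    if last_ts is None:
        raise SystemExit("No previous migration found - run a full migration first")

    with mysql_conn.cursor() as cur:
        cur.execute(
            f"SELECT * FROM {source_table} WHERE {ts_column} > %s ORDER BY {ts_column}",
            (last_ts,),
        )
        rows = cur.fetchall()

    # ... write `rows` to PostgreSQL here (e.g. batched INSERTs via psycopg2) ...

    # Advance the high-water mark; the real tool may instead use the largest
    # created_at value seen in this batch.
    entry = state.setdefault(state_key, {})
    entry["last_timestamp"] = datetime.now().isoformat()
    entry["last_updated"] = datetime.now().isoformat()
    entry["total_migrated"] = entry.get("total_migrated", 0) + len(rows)
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)
```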
**Example workflow:**

```bash
# Day 1: Initial full migration
python main.py migrate --full RAWDATACOR

# Day 1: Then incremental (will find nothing new)
python main.py migrate --incremental RAWDATACOR

# Day 2, 3, 4: Daily syncs via cron
python main.py migrate --incremental RAWDATACOR
```

---
## 3. ID-based Incremental Migration (Resumable)

### For Large Datasets or Unreliable Connections

```bash
# First run
python main.py migrate --incremental RAWDATACOR --use-id

# Can interrupt and resume multiple times
python main.py migrate --incremental RAWDATACOR --use-id
```

**When to use:**
- Large datasets that may time out
- You need to resume from the exact last position
- The network is unstable

**Characteristics:**
- Tracks `last_id` instead of a timestamp
- Updates the state file after EACH BATCH (not just at the end)
- Can be interrupted and resumed as many times as needed
- Resumes from the exact record ID where it stopped
- Works with `migration_state.json`

**How it works:**
1. First run: Starts from the beginning (ID = 0)
2. Each batch: Updates the state file with the maximum ID from the batch
3. Interrupt: You can stop at any time
4. Resume: The next run continues from the last stored ID
5. Continues until all rows are processed
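A minimal sketch of that batch loop, with illustrative names and assuming a PyMySQL connection; the PostgreSQL write step is elided:

```python
import json

STATE_FILE = "migration_state.json"
BATCH_SIZE = 10_000

def id_based_sync(mysql_conn, source_table="RAWDATACOR",
                  state_key="rawdatacor", id_column="id"):
    """Migrate in ID-ordered batches, checkpointing after every batch."""
    with open(STATE_FILE) as f:
        state = json.load(f)
    last_id = state.get(state_key, {}).get("last_id", 0)  # first run starts at 0

    while True:
        with mysql_conn.cursor() as cur:
            cur.execute(
                f"SELECT * FROM {source_table} WHERE {id_column} > %s "
                f"ORDER BY {id_column} LIMIT %s",
                (last_id, BATCH_SIZE),
            )
            rows = cur.fetchall()
        if not rows:
            break  # all rows processed

        # ... write the batch to PostgreSQL here ...

        last_id = rows[-1][0]  # assumes the ID is the first selected column
        entry = state.setdefault(state_key, {})
        entry["last_id"] = last_id
        entry["total_migrated"] = entry.get("total_migrated", 0) + len(rows)
        with open(STATE_FILE, "w") as f:
            json.dump(state, f, indent=2)  # checkpoint after every batch
```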
**Example workflow for large dataset:**

```bash
# Start ID-based migration (will migrate in batches)
python main.py migrate --incremental RAWDATACOR --use-id

# [If interrupted after 1M rows processed]

# Resume from ID 1M (automatically detects last position)
python main.py migrate --incremental RAWDATACOR --use-id

# [Continues until complete]
```

---
## State Management

### State File Location
```
migration_state.json   # In project root
```

### State File Content (Timestamp-based)
```json
{
  "rawdatacor": {
    "last_timestamp": "2024-12-11T19:30:45.123456",
    "last_updated": "2024-12-11T19:30:45.123456",
    "total_migrated": 50000
  }
}
```

### State File Content (ID-based)
```json
{
  "rawdatacor": {
    "last_id": 1000000,
    "total_migrated": 1000000,
    "last_updated": "2024-12-11T19:45:30.123456"
  }
}
```

### Reset Migration State
```python
from src.migrator.state import MigrationState

state = MigrationState()

# Reset specific table
state.reset("rawdatacor")

# Reset all tables
state.reset()
```
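For context, a hypothetical state wrapper consistent with the file formats above; the real `src.migrator.state.MigrationState` may differ in detail:

```python
import json
import os

STATE_FILE = "migration_state.json"

class SimpleMigrationState:
    """Hypothetical helper: load, update, and reset per-table checkpoints."""

    def __init__(self, path=STATE_FILE):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def update(self, table, **fields):
        """Merge fields (e.g. last_timestamp, last_id, total_migrated) for a table."""
        self.state.setdefault(table, {}).update(fields)
        self._save()

    def reset(self, table=None):
        """Forget checkpoints for one table, or for all tables if none is given."""
        if table is None:
            self.state = {}
        else:
            self.state.pop(table, None)
        self._save()

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(self.state, f, indent=2)
```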
---
## Recommended Workflow

### For Daily Continuous Sync
```bash
# Week 1: Initial setup
python main.py setup --create-schema
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP

# Week 2+: Daily incremental syncs (via cron job)
# Schedule: `0 2 * * * cd /path/to/project && python main.py migrate --incremental RAWDATACOR`
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

### For Large Initial Migration
```bash
# If dataset > 10 million rows
python main.py setup --create-schema
python main.py migrate --incremental RAWDATACOR --use-id   # Can interrupt/resume

# For subsequent syncs, use timestamp
python main.py migrate --incremental RAWDATACOR            # Timestamp-based
```

---
## Key Differences at a Glance

| Feature | Full | Timestamp | ID-based |
|---------|------|-----------|----------|
| Initial setup | ✅ Required first | ✅ After full | ✅ After full |
| Sync new/updated | ❌ No | ✅ Yes | ✅ Yes |
| Resumable | ❌ No | ⚠️ Partial* | ✅ Full |
| Batched state tracking | ❌ No | ❌ No | ✅ Yes |
| Large datasets | ⚠️ Risky | ✅ Good | ✅ Best |
| Scheduled jobs | ❌ No | ✅ Perfect | ⚠️ Unnecessary |

*Timestamp mode can resume, but must wait for the full batch to complete before continuing.

---
## Default Partitions

Both tables are partitioned by year (2014-2031) plus a DEFAULT partition:
- **rawdatacor_2014** through **rawdatacor_2031** (yearly partitions)
- **rawdatacor_default** (catches data outside 2014-2031)

The same applies to ELABDATADISP. This ensures that data with edge-case timestamps doesn't break the migration.
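As an illustration of what that layout amounts to in PostgreSQL, a sketch of DDL the schema setup might generate; the real statements come from `python main.py setup --create-schema`, and the range-partitioning column is an assumption here:

```python
import psycopg2

def create_partitions(dsn, parent="rawdatacor"):
    """Create yearly partitions 2014-2031 plus a DEFAULT catch-all (illustrative only)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for year in range(2014, 2032):  # rawdatacor_2014 .. rawdatacor_2031
            cur.execute(
                f"CREATE TABLE IF NOT EXISTS {parent}_{year} PARTITION OF {parent} "
                f"FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01')"
            )
        # Catch-all partition for rows whose timestamp falls outside 2014-2031
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {parent}_default PARTITION OF {parent} DEFAULT"
        )
```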
---
## Monitoring

### Check Migration Progress
```bash
# View state file
cat migration_state.json

# Check PostgreSQL row counts
psql -U postgres -h localhost -d your_db -c "SELECT COUNT(*) FROM rawdatacor;"
```
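Beyond eyeballing counts, a small verification sketch that compares row counts on both sides; connection parameters and table names are placeholders to adjust for your environment:

```python
import pymysql
import psycopg2

def compare_counts(mysql_cfg, pg_dsn, mysql_table="RAWDATACOR", pg_table="rawdatacor"):
    """Print MySQL vs PostgreSQL row counts and flag any mismatch."""
    my_conn = pymysql.connect(**mysql_cfg)
    try:
        with my_conn.cursor() as cur:
            cur.execute(f"SELECT COUNT(*) FROM {mysql_table}")
            mysql_count = cur.fetchone()[0]
    finally:
        my_conn.close()

    with psycopg2.connect(pg_dsn) as pg_conn, pg_conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {pg_table}")
        pg_count = cur.fetchone()[0]

    status = "OK" if mysql_count == pg_count else "MISMATCH"
    print(f"{pg_table}: MySQL={mysql_count}, PostgreSQL={pg_count} -> {status}")
```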
### Common Issues

**"No previous migration found"** (Timestamp mode)
- Solution: Run a full migration first with the `--full` flag

**"Duplicate key value violates unique constraint"**
- Cause: Running a full migration twice
- Solution: Use timestamp-based incremental sync instead

**"Timeout during migration"** (Large datasets)
- Solution: Switch to ID-based resumable migration with `--use-id`

---

## Summary

- **Start with:** Full migration (`--full`) for the initial data load
- **Then use:** Timestamp-based incremental (`--incremental`) for daily syncs
- **Switch to:** ID-based resumable (`--incremental --use-id`) if the full migration is too large