fix: Add timeout settings and retry logic to MySQL connector
Configuration improvements:
- Set read_timeout=300 (5 minutes) to handle long queries
- Set write_timeout=300 (5 minutes) for writes
- Set max_allowed_packet=64MB to handle larger data transfers

Retry logic:
- Added retry mechanism with max 3 retries on fetch failure
- Auto-reconnect on connection loss before retry
- Better error messages showing retry attempts

This fixes the 'connection is lost' error that occurs during long-running migrations by:
1. Giving MySQL queries more time to complete
2. Allowing larger packet sizes for bulk data
3. Automatically recovering from connection drops

Fixes: 'Connection is lost' error during full migration
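As a rough sketch of the behaviour described above, assuming a PyMySQL-based connector; the helper names `connect_mysql` and `fetch_with_retry` are illustrative, not the connector's actual API:

```python
import time
import pymysql

def connect_mysql(cfg: dict) -> pymysql.connections.Connection:
    """Open a MySQL connection with the timeout/packet settings from this commit."""
    return pymysql.connect(
        host=cfg["host"],
        user=cfg["user"],
        password=cfg["password"],
        database=cfg["database"],
        read_timeout=300,                     # allow long-running SELECTs (5 minutes)
        write_timeout=300,                    # allow slow writes (5 minutes)
        max_allowed_packet=64 * 1024 * 1024,  # 64 MB packets for bulk transfers
    )

def fetch_with_retry(conn, query, params=None, max_retries=3):
    """Fetch rows, reconnecting and retrying up to max_retries times on failure."""
    for attempt in range(1, max_retries + 1):
        try:
            conn.ping(reconnect=True)  # transparently reopen a dropped connection
            with conn.cursor() as cur:
                cur.execute(query, params)
                return cur.fetchall()
        except pymysql.err.OperationalError as exc:
            if attempt == max_retries:
                raise
            print(f"Fetch failed (attempt {attempt}/{max_retries}): {exc}; retrying...")
            time.sleep(2 * attempt)  # simple backoff before the next attempt
```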
New file: MIGRATION_WORKFLOW.md
# MySQL to PostgreSQL Migration Workflow

## Overview

This tool supports three migration modes:

1. **Full Migration** (`full_migration.py`) - Initial complete migration
2. **Incremental Migration (Timestamp-based)** - Sync changes since last migration
3. **Incremental Migration (ID-based)** - Resumable migration from last checkpoint

---
## 1. Initial Full Migration

### First Time Setup

```bash
# Create the PostgreSQL schema
python main.py setup --create-schema

# Run full migration (one-time)
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP
```

**When to use:** The first time you migrate the data, or when you need a complete fresh migration.

**Characteristics:**
- Fetches ALL rows from MySQL
- No checkpoint tracking
- Cannot resume if interrupted
- Good for initial data load

---
## 2. Timestamp-based Incremental Migration

### For Continuous Sync (Recommended for most cases)

```bash
# After initial full migration, use incremental with timestamps
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

**When to use:** Continuous sync of new/updated records.

**Characteristics:**
- Tracks `created_at` (RAWDATACOR) or `updated_at` (ELABDATADISP)
- Uses a JSON state file (`migration_state.json`)
- Only fetches rows modified since the last run
- Perfect for scheduled jobs (cron, Airflow, etc.)
- Syncs changes but NOT deletions

**How it works:**
1. First run: Exits with the message "No previous migration found" - you must run a full migration first
2. Subsequent runs: Only fetches rows where `created_at` > last_migration_timestamp
3. Updates the state file with the new timestamp for the next run
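For orientation, a minimal sketch of that loop, assuming a PyMySQL connection and the state-file layout shown later in this document; the function and its parameters are illustrative, and the PostgreSQL write step is elided:

```python
import json
from datetime import datetime

STATE_FILE = "migration_state.json"

def incremental_sync(mysql_conn, source_table="RAWDATACOR",
                     state_key="rawdatacor", ts_column="created_at"):
    """Fetch only rows newer than the stored timestamp, then advance the checkpoint."""
    with open(STATE_FILE) as f:
        state = json.load(f)
    last_ts = state.get(state_key, {}).get("last_timestamp")
    if last_ts is None:
        raise SystemExit("No previous migration found - run a full migration first")

    with mysql_conn.cursor() as cur:
        cur.execute(
            f"SELECT * FROM {source_table} WHERE {ts_column} > %s ORDER BY {ts_column}",
            (last_ts,),
        )
        rows = cur.fetchall()

    # ... write `rows` to PostgreSQL here (e.g. batched INSERTs via psycopg2) ...

    # Advance the high-water mark; the real tool may instead use the largest
    # created_at value seen in this batch.
    entry = state.setdefault(state_key, {})
    entry["last_timestamp"] = datetime.now().isoformat()
    entry["last_updated"] = datetime.now().isoformat()
    entry["total_migrated"] = entry.get("total_migrated", 0) + len(rows)
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)
```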
**Example workflow:**

```bash
# Day 1: Initial full migration
python main.py migrate --full RAWDATACOR

# Day 1: Then incremental (will find nothing new)
python main.py migrate --incremental RAWDATACOR

# Day 2, 3, 4: Daily syncs via cron
python main.py migrate --incremental RAWDATACOR
```

---
## 3. ID-based Incremental Migration (Resumable)

### For Large Datasets or Unreliable Connections

```bash
# First run
python main.py migrate --incremental RAWDATACOR --use-id

# Can interrupt and resume multiple times
python main.py migrate --incremental RAWDATACOR --use-id
```

**When to use:**
- Large datasets that may time out
- You need to resume from the exact last position
- The network is unstable

**Characteristics:**
- Tracks `last_id` instead of a timestamp
- Updates the state file after EACH BATCH (not just at the end)
- Can be interrupted and resumed as many times as needed
- Resumes from the exact record ID where it stopped
- Works with `migration_state.json`

**How it works:**
1. First run: Starts from the beginning (ID = 0)
2. Each batch: Updates the state file with the maximum ID from the batch
3. Interrupt: You can stop at any time
4. Resume: The next run continues from the last stored ID
5. Continues until all rows are processed
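A minimal sketch of that batch loop, with illustrative names and assuming a PyMySQL connection; the PostgreSQL write step is elided:

```python
import json

STATE_FILE = "migration_state.json"
BATCH_SIZE = 10_000

def id_based_sync(mysql_conn, source_table="RAWDATACOR",
                  state_key="rawdatacor", id_column="id"):
    """Migrate in ID-ordered batches, checkpointing after every batch."""
    with open(STATE_FILE) as f:
        state = json.load(f)
    last_id = state.get(state_key, {}).get("last_id", 0)  # first run starts at 0

    while True:
        with mysql_conn.cursor() as cur:
            cur.execute(
                f"SELECT * FROM {source_table} WHERE {id_column} > %s "
                f"ORDER BY {id_column} LIMIT %s",
                (last_id, BATCH_SIZE),
            )
            rows = cur.fetchall()
        if not rows:
            break  # all rows processed

        # ... write the batch to PostgreSQL here ...

        last_id = rows[-1][0]  # assumes the ID is the first selected column
        entry = state.setdefault(state_key, {})
        entry["last_id"] = last_id
        entry["total_migrated"] = entry.get("total_migrated", 0) + len(rows)
        with open(STATE_FILE, "w") as f:
            json.dump(state, f, indent=2)  # checkpoint after every batch
```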
**Example workflow for large dataset:**

```bash
# Start ID-based migration (will migrate in batches)
python main.py migrate --incremental RAWDATACOR --use-id

# [If interrupted after 1M rows processed]

# Resume from ID 1M (automatically detects last position)
python main.py migrate --incremental RAWDATACOR --use-id

# [Continues until complete]
```

---
## State Management

### State File Location
```
migration_state.json   # In project root
```

### State File Content (Timestamp-based)
```json
{
  "rawdatacor": {
    "last_timestamp": "2024-12-11T19:30:45.123456",
    "last_updated": "2024-12-11T19:30:45.123456",
    "total_migrated": 50000
  }
}
```

### State File Content (ID-based)
```json
{
  "rawdatacor": {
    "last_id": 1000000,
    "total_migrated": 1000000,
    "last_updated": "2024-12-11T19:45:30.123456"
  }
}
```

### Reset Migration State
```python
from src.migrator.state import MigrationState

state = MigrationState()

# Reset specific table
state.reset("rawdatacor")

# Reset all tables
state.reset()
```
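For context, a hypothetical state wrapper consistent with the file formats above; the real `src.migrator.state.MigrationState` may differ in detail:

```python
import json
import os

STATE_FILE = "migration_state.json"

class SimpleMigrationState:
    """Hypothetical helper: load, update, and reset per-table checkpoints."""

    def __init__(self, path=STATE_FILE):
        self.path = path
        self.state = {}
        if os.path.exists(path):
            with open(path) as f:
                self.state = json.load(f)

    def update(self, table, **fields):
        """Merge fields (e.g. last_timestamp, last_id, total_migrated) for a table."""
        self.state.setdefault(table, {}).update(fields)
        self._save()

    def reset(self, table=None):
        """Forget checkpoints for one table, or for all tables if none is given."""
        if table is None:
            self.state = {}
        else:
            self.state.pop(table, None)
        self._save()

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(self.state, f, indent=2)
```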
---
## Recommended Workflow

### For Daily Continuous Sync
```bash
# Week 1: Initial setup
python main.py setup --create-schema
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP

# Week 2+: Daily incremental syncs (via cron job)
# Schedule: `0 2 * * * cd /path/to/project && python main.py migrate --incremental RAWDATACOR`
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

### For Large Initial Migration
```bash
# If dataset > 10 million rows
python main.py setup --create-schema
python main.py migrate --incremental RAWDATACOR --use-id   # Can interrupt/resume

# For subsequent syncs, use timestamp
python main.py migrate --incremental RAWDATACOR            # Timestamp-based
```

---
## Key Differences at a Glance

| Feature | Full | Timestamp | ID-based |
|---------|------|-----------|----------|
| Initial setup | ✅ Required first | ✅ After full | ✅ After full |
| Sync new/updated | ❌ No | ✅ Yes | ✅ Yes |
| Resumable | ❌ No | ⚠️ Partial* | ✅ Full |
| Batched state tracking | ❌ No | ❌ No | ✅ Yes |
| Large datasets | ⚠️ Risky | ✅ Good | ✅ Best |
| Scheduled jobs | ❌ No | ✅ Perfect | ⚠️ Unnecessary |

*Timestamp mode can resume, but must wait for the full batch to complete before continuing.

---
## Default Partitions

Both tables are partitioned by year (2014-2031) plus a DEFAULT partition:
- **rawdatacor_2014** through **rawdatacor_2031** (yearly partitions)
- **rawdatacor_default** (catches data outside 2014-2031)

The same applies to ELABDATADISP. This ensures that data with edge-case timestamps doesn't break the migration.
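As an illustration of what that layout amounts to in PostgreSQL, a sketch of DDL the schema setup might generate; the real statements come from `python main.py setup --create-schema`, and the range-partitioning column is an assumption here:

```python
import psycopg2

def create_partitions(dsn, parent="rawdatacor"):
    """Create yearly partitions 2014-2031 plus a DEFAULT catch-all (illustrative only)."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        for year in range(2014, 2032):  # rawdatacor_2014 .. rawdatacor_2031
            cur.execute(
                f"CREATE TABLE IF NOT EXISTS {parent}_{year} PARTITION OF {parent} "
                f"FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01')"
            )
        # Catch-all partition for rows whose timestamp falls outside 2014-2031
        cur.execute(
            f"CREATE TABLE IF NOT EXISTS {parent}_default PARTITION OF {parent} DEFAULT"
        )
```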
---
## Monitoring

### Check Migration Progress
```bash
# View state file
cat migration_state.json

# Check PostgreSQL row counts
psql -U postgres -h localhost -d your_db -c "SELECT COUNT(*) FROM rawdatacor;"
```
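Beyond eyeballing counts, a small verification sketch that compares row counts on both sides; connection parameters and table names are placeholders to adjust for your environment:

```python
import pymysql
import psycopg2

def compare_counts(mysql_cfg, pg_dsn, mysql_table="RAWDATACOR", pg_table="rawdatacor"):
    """Print MySQL vs PostgreSQL row counts and flag any mismatch."""
    my_conn = pymysql.connect(**mysql_cfg)
    try:
        with my_conn.cursor() as cur:
            cur.execute(f"SELECT COUNT(*) FROM {mysql_table}")
            mysql_count = cur.fetchone()[0]
    finally:
        my_conn.close()

    with psycopg2.connect(pg_dsn) as pg_conn, pg_conn.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {pg_table}")
        pg_count = cur.fetchone()[0]

    status = "OK" if mysql_count == pg_count else "MISMATCH"
    print(f"{pg_table}: MySQL={mysql_count}, PostgreSQL={pg_count} -> {status}")
```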
### Common Issues

**"No previous migration found"** (Timestamp mode)
- Solution: Run a full migration first with the `--full` flag

**"Duplicate key value violates unique constraint"**
- Cause: Running a full migration twice
- Solution: Use timestamp-based incremental sync instead

**"Timeout during migration"** (Large datasets)
- Solution: Switch to ID-based resumable migration with `--use-id`

---

## Summary

- **Start with:** Full migration (`--full`) for the initial data load
- **Then use:** Timestamp-based incremental (`--incremental`) for daily syncs
- **Switch to:** ID-based resumable (`--incremental --use-id`) if the full migration is too large