# MySQL to PostgreSQL Migration Workflow

## Overview

This tool supports three migration modes:

1. **Full Migration** (`full_migration.py`) - Initial complete migration
2. **Incremental Migration (Timestamp-based)** - Sync changes since the last migration
3. **Incremental Migration (ID-based)** - Resumable migration from the last checkpoint

---
## 1. Initial Full Migration

### First Time Setup

```bash
# Create the PostgreSQL schema
python main.py setup --create-schema

# Run the full migration (one-time)
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP
```

**When to use:** the first time you migrate the data, or when you need a complete fresh migration.

**Characteristics:**
- Fetches ALL rows from MySQL
- No checkpoint tracking
- Cannot resume if interrupted
- Good for the initial data load

---
## 2. Timestamp-based Incremental Migration

### For Continuous Sync (Recommended for most cases)

```bash
# After the initial full migration, use incremental sync with timestamps
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

**When to use:** continuous sync of new/updated records.

**Characteristics:**
- Tracks `created_at` (RAWDATACOR) or `updated_at` (ELABDATADISP)
- Uses a JSON state file (`migration_state.json`)
- Only fetches rows modified since the last run
- Perfect for scheduled jobs (cron, Airflow, etc.)
- Syncs changes but NOT deletions

**How it works:**
1. First run: exits with the message "No previous migration found" - a full migration must be run first
2. Subsequent runs: only fetches rows where `created_at` (or `updated_at`) > last migration timestamp
3. Updates the state file with the new timestamp for the next run

A minimal sketch of this fetch loop is shown after the example workflow below.

**Example workflow:**
```bash
# Day 1: Initial full migration
python main.py migrate --full RAWDATACOR

# Day 1: Then incremental (will find nothing new)
python main.py migrate --incremental RAWDATACOR

# Day 2, 3, 4: Daily syncs via cron
python main.py migrate --incremental RAWDATACOR
```
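
For readers who want to see the mechanics, here is a minimal sketch of what a timestamp-based fetch loop looks like. It is not the project's actual implementation; the `pymysql` driver, the helper names, and the state-file handling shown here are assumptions for illustration only.

```python
import json

import pymysql  # assumed driver; the real tool may use a different MySQL client

STATE_FILE = "migration_state.json"


def load_last_timestamp(table_key: str):
    """Return the last migrated timestamp recorded for a table, or None."""
    try:
        with open(STATE_FILE) as f:
            return json.load(f).get(table_key, {}).get("last_timestamp")
    except FileNotFoundError:
        return None


def fetch_incremental(conn, table: str, ts_column: str, last_ts: str, batch_size: int = 10_000):
    """Yield batches of rows whose timestamp is newer than the last recorded one."""
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT * FROM {table} WHERE {ts_column} > %s ORDER BY {ts_column}",
            (last_ts,),
        )
        while batch := cur.fetchmany(batch_size):
            yield batch


# Illustrative usage (connection parameters are placeholders):
# conn = pymysql.connect(host="mysql-host", user="user", password="pw", database="db")
# for batch in fetch_incremental(conn, "RAWDATACOR", "created_at",
#                                load_last_timestamp("rawdatacor")):
#     ...  # write the batch to PostgreSQL, then record the newest timestamp seen
```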
---
## 3. ID-based Incremental Migration (Resumable)

### For Large Datasets or Unreliable Connections

```bash
# First run
python main.py migrate --incremental RAWDATACOR --use-id

# Can interrupt and resume multiple times
python main.py migrate --incremental RAWDATACOR --use-id
```

**When to use:**
- Large datasets that may time out
- You need to resume from the exact last position
- The network is unstable

**Characteristics:**
- Tracks `last_id` instead of a timestamp
- Updates the state file after EACH BATCH (not just at the end)
- Can be interrupted and resumed any number of times
- Resumes from the exact record ID where it stopped
- Works with `migration_state.json`

**How it works:**
1. First run: starts from the beginning (ID = 0)
2. Each batch: updates the state file with the maximum ID in the batch
3. Interrupt: the process can be stopped at any time
4. Resume: the next run continues from the last stored ID
5. Continues until all rows are processed

A minimal sketch of this checkpointed loop is shown after the example workflow below.

**Example workflow for a large dataset:**
```bash
# Start ID-based migration (will migrate in batches)
python main.py migrate --incremental RAWDATACOR --use-id

# [If interrupted after 1M rows processed]

# Resume from ID 1M (automatically detects last position)
python main.py migrate --incremental RAWDATACOR --use-id

# [Continues until complete]
```
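
The key idea is that the checkpoint is written after every batch, so an interruption loses at most the batch in flight. Below is a minimal sketch of that loop; it is not the project's actual code, and the `pymysql` usage, helper names, and batch size are illustrative assumptions.

```python
import json

import pymysql  # assumed driver, for illustration

STATE_FILE = "migration_state.json"
BATCH_SIZE = 10_000


def save_checkpoint(table_key: str, last_id: int, batch_count: int) -> None:
    """Persist the highest migrated ID (and a running total) after each batch."""
    try:
        with open(STATE_FILE) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}
    entry = state.setdefault(table_key, {"total_migrated": 0})
    entry["last_id"] = last_id
    entry["total_migrated"] = entry.get("total_migrated", 0) + batch_count
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)


def migrate_by_id(conn, table: str, table_key: str, start_id: int = 0) -> None:
    """Copy rows in ID order, checkpointing after each batch."""
    last_id = start_id
    while True:
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT * FROM {table} WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, BATCH_SIZE),
            )
            rows = cur.fetchall()
        if not rows:
            break  # all rows processed
        # ... insert `rows` into PostgreSQL here ...
        last_id = rows[-1][0]  # assumes the ID is the first column of each row
        save_checkpoint(table_key, last_id, len(rows))  # resume point survives interruption


# Illustrative usage:
# conn = pymysql.connect(host="mysql-host", user="user", password="pw", database="db")
# migrate_by_id(conn, "RAWDATACOR", "rawdatacor", start_id=0)  # or the stored last_id
```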
---
## State Management

### State File Location
```
migration_state.json   # in the project root
```

### State File Content (Timestamp-based)
```json
{
  "rawdatacor": {
    "last_timestamp": "2024-12-11T19:30:45.123456",
    "last_updated": "2024-12-11T19:30:45.123456",
    "total_migrated": 50000
  }
}
```

### State File Content (ID-based)
```json
{
  "rawdatacor": {
    "last_id": 1000000,
    "total_migrated": 1000000,
    "last_updated": "2024-12-11T19:45:30.123456"
  }
}
```

### Reset Migration State
```python
from src.migrator.state import MigrationState

state = MigrationState()

# Reset a specific table
state.reset("rawdatacor")

# Reset all tables
state.reset()
```

---
## Recommended Workflow

### For Daily Continuous Sync
```bash
# Week 1: Initial setup
python main.py setup --create-schema
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP

# Week 2+: Daily incremental syncs (via cron job)
# Example cron schedule: 0 2 * * * cd /path/to/project && python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

### For Large Initial Migration
```bash
# If the dataset is larger than ~10 million rows
python main.py setup --create-schema
python main.py migrate --incremental RAWDATACOR --use-id   # can interrupt/resume

# For subsequent syncs, switch to timestamp-based mode
python main.py migrate --incremental RAWDATACOR
```

---
## Key Differences at a Glance

| Feature | Full | Timestamp | ID-based |
|---------|------|-----------|----------|
| Initial setup | ✅ Required first | ✅ After full | ✅ After full |
| Sync new/updated | ❌ No | ✅ Yes | ✅ Yes |
| Resumable | ❌ No | ⚠️ Partial* | ✅ Full |
| Batched state tracking | ❌ No | ❌ No | ✅ Yes |
| Large datasets | ⚠️ Risky | ✅ Good | ✅ Best |
| Scheduled jobs | ❌ No | ✅ Perfect | ⚠️ Unnecessary |

\*Timestamp mode can resume, but it only records its state when a run completes, so an interrupted run has to be repeated from the previous timestamp.

---
## Default Partitions

Both tables are partitioned by year (2014-2031) plus a DEFAULT partition:
- **rawdatacor_2014** through **rawdatacor_2031** (yearly partitions)
- **rawdatacor_default** (catches data outside 2014-2031)

The same layout applies to ELABDATADISP. This ensures that rows with timestamps outside the expected range do not break the migration.
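
As an illustration of what this layout looks like in PostgreSQL, the snippet below builds the yearly-plus-default partition DDL for `rawdatacor`. It is a sketch only: the parent-table definition and the choice of partition key column are assumptions, not the tool's actual schema.

```python
def partition_ddl(parent: str, first_year: int = 2014, last_year: int = 2031) -> list[str]:
    """Build CREATE TABLE statements for yearly range partitions plus a DEFAULT partition."""
    statements = [
        f"CREATE TABLE {parent}_{year} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01');"
        for year in range(first_year, last_year + 1)
    ]
    statements.append(f"CREATE TABLE {parent}_default PARTITION OF {parent} DEFAULT;")
    return statements


# Assumes the parent table was created with something like:
#   CREATE TABLE rawdatacor (...) PARTITION BY RANGE (created_at);
for ddl in partition_ddl("rawdatacor"):
    print(ddl)
```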
---
## Monitoring

### Check Migration Progress
```bash
# View the state file
cat migration_state.json

# Check PostgreSQL row counts
psql -U postgres -h localhost -d your_db -c "SELECT COUNT(*) FROM rawdatacor;"
```
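
A quick way to sanity-check progress is to compare row counts on both sides. The sketch below does that; the `pymysql`/`psycopg2` drivers and all connection parameters are assumptions for illustration.

```python
import pymysql    # assumed MySQL driver
import psycopg2   # assumed PostgreSQL driver


def count_rows(cursor, table: str) -> int:
    """Return the row count for a table via the given DB-API cursor."""
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]


# Connection parameters below are placeholders.
mysql_conn = pymysql.connect(host="mysql-host", user="user", password="pw", database="db")
pg_conn = psycopg2.connect(host="localhost", user="postgres", password="pw", dbname="your_db")

with mysql_conn.cursor() as mcur, pg_conn.cursor() as pcur:
    for table in ("RAWDATACOR", "ELABDATADISP"):
        source = count_rows(mcur, table)
        target = count_rows(pcur, table.lower())
        print(f"{table}: MySQL={source} PostgreSQL={target} diff={source - target}")
```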
### Common Issues

**"No previous migration found"** (Timestamp mode)
- Solution: Run full migration first with `--full` flag

**"Duplicate key value violates unique constraint"**
- Cause: Running full migration twice
- Solution: Use timestamp-based incremental sync instead

**"Timeout during migration"** (Large datasets)
- Solution: Switch to ID-based resumable migration with `--use-id`

---
## Summary

- **Start with:** a full migration (`--full`) for the initial data load
- **Then use:** timestamp-based incremental (`--incremental`) for daily syncs
- **Switch to:** ID-based resumable migration (`--incremental --use-id`) if the full migration is too large to run uninterrupted