# MySQL to PostgreSQL Migration Workflow

## Overview

This tool supports three migration modes:

1. **Full Migration** (`full_migration.py`) - Initial complete migration
2. **Incremental Migration (Timestamp-based)** - Sync changes since the last migration
3. **Incremental Migration (ID-based)** - Resumable migration from the last checkpoint

---
## 1. Initial Full Migration

### First Time Setup

```bash
# Create the PostgreSQL schema
python main.py setup --create-schema

# Run the full migration (one-time)
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP
```

**When to use:** the first time you migrate the data, or when you need a complete fresh migration.

**Characteristics:**
- Fetches ALL rows from MySQL
- No checkpoint tracking
- Cannot resume if interrupted
- Good for the initial data load

---
## 2. Timestamp-based Incremental Migration

### For Continuous Sync (Recommended for most cases)

```bash
# After the initial full migration, use incremental sync with timestamps
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

**When to use:** continuous sync of new/updated records.

**Characteristics:**
- Tracks `created_at` (RAWDATACOR) or `updated_at` (ELABDATADISP)
- Uses a JSON state file (`migration_state.json`)
- Only fetches rows modified since the last run
- Perfect for scheduled jobs (cron, Airflow, etc.)
- Syncs changes but NOT deletions

**How it works:**
1. First run: exits with the message "No previous migration found" - a full migration must be run first
2. Subsequent runs: only fetches rows where `created_at` (or `updated_at`) > last migration timestamp
3. Updates the state file with the new timestamp for the next run

A minimal sketch of this fetch loop is shown after the example workflow below.

**Example workflow:**
```bash
# Day 1: Initial full migration
python main.py migrate --full RAWDATACOR

# Day 1: Then incremental (will find nothing new)
python main.py migrate --incremental RAWDATACOR

# Day 2, 3, 4: Daily syncs via cron
python main.py migrate --incremental RAWDATACOR
```
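
For readers who want to see the mechanics, here is a minimal sketch of what a timestamp-based fetch loop looks like. It is not the project's actual implementation; the `pymysql` driver, the helper names, and the state-file handling shown here are assumptions for illustration only.

```python
import json

import pymysql  # assumed driver; the real tool may use a different MySQL client

STATE_FILE = "migration_state.json"


def load_last_timestamp(table_key: str):
    """Return the last migrated timestamp recorded for a table, or None."""
    try:
        with open(STATE_FILE) as f:
            return json.load(f).get(table_key, {}).get("last_timestamp")
    except FileNotFoundError:
        return None


def fetch_incremental(conn, table: str, ts_column: str, last_ts: str, batch_size: int = 10_000):
    """Yield batches of rows whose timestamp is newer than the last recorded one."""
    with conn.cursor() as cur:
        cur.execute(
            f"SELECT * FROM {table} WHERE {ts_column} > %s ORDER BY {ts_column}",
            (last_ts,),
        )
        while batch := cur.fetchmany(batch_size):
            yield batch


# Illustrative usage (connection parameters are placeholders):
# conn = pymysql.connect(host="mysql-host", user="user", password="pw", database="db")
# for batch in fetch_incremental(conn, "RAWDATACOR", "created_at",
#                                load_last_timestamp("rawdatacor")):
#     ...  # write the batch to PostgreSQL, then record the newest timestamp seen
```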
---
## 3. ID-based Incremental Migration (Resumable)

### For Large Datasets or Unreliable Connections

```bash
# First run
python main.py migrate --incremental RAWDATACOR --use-id

# Can interrupt and resume multiple times
python main.py migrate --incremental RAWDATACOR --use-id
```

**When to use:**
- Large datasets that may time out
- You need to resume from the exact last position
- The network is unstable

**Characteristics:**
- Tracks `last_id` instead of a timestamp
- Updates the state file after EACH BATCH (not just at the end)
- Can be interrupted and resumed any number of times
- Resumes from the exact record ID where it stopped
- Works with `migration_state.json`

**How it works:**
1. First run: starts from the beginning (ID = 0)
2. Each batch: updates the state file with the maximum ID in the batch
3. Interrupt: the process can be stopped at any time
4. Resume: the next run continues from the last stored ID
5. Continues until all rows are processed

A minimal sketch of this checkpointed loop is shown after the example workflow below.

**Example workflow for a large dataset:**
```bash
# Start ID-based migration (will migrate in batches)
python main.py migrate --incremental RAWDATACOR --use-id

# [If interrupted after 1M rows processed]

# Resume from ID 1M (automatically detects last position)
python main.py migrate --incremental RAWDATACOR --use-id

# [Continues until complete]
```
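
The key idea is that the checkpoint is written after every batch, so an interruption loses at most the batch in flight. Below is a minimal sketch of that loop; it is not the project's actual code, and the `pymysql` usage, helper names, and batch size are illustrative assumptions.

```python
import json

import pymysql  # assumed driver, for illustration

STATE_FILE = "migration_state.json"
BATCH_SIZE = 10_000


def save_checkpoint(table_key: str, last_id: int, batch_count: int) -> None:
    """Persist the highest migrated ID (and a running total) after each batch."""
    try:
        with open(STATE_FILE) as f:
            state = json.load(f)
    except FileNotFoundError:
        state = {}
    entry = state.setdefault(table_key, {"total_migrated": 0})
    entry["last_id"] = last_id
    entry["total_migrated"] = entry.get("total_migrated", 0) + batch_count
    with open(STATE_FILE, "w") as f:
        json.dump(state, f, indent=2)


def migrate_by_id(conn, table: str, table_key: str, start_id: int = 0) -> None:
    """Copy rows in ID order, checkpointing after each batch."""
    last_id = start_id
    while True:
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT * FROM {table} WHERE id > %s ORDER BY id LIMIT %s",
                (last_id, BATCH_SIZE),
            )
            rows = cur.fetchall()
        if not rows:
            break  # all rows processed
        # ... insert `rows` into PostgreSQL here ...
        last_id = rows[-1][0]  # assumes the ID is the first column of each row
        save_checkpoint(table_key, last_id, len(rows))  # resume point survives interruption


# Illustrative usage:
# conn = pymysql.connect(host="mysql-host", user="user", password="pw", database="db")
# migrate_by_id(conn, "RAWDATACOR", "rawdatacor", start_id=0)  # or the stored last_id
```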
---
## State Management

### State File Location
```
migration_state.json   # in the project root
```

### State File Content (Timestamp-based)
```json
{
  "rawdatacor": {
    "last_timestamp": "2024-12-11T19:30:45.123456",
    "last_updated": "2024-12-11T19:30:45.123456",
    "total_migrated": 50000
  }
}
```

### State File Content (ID-based)
```json
{
  "rawdatacor": {
    "last_id": 1000000,
    "total_migrated": 1000000,
    "last_updated": "2024-12-11T19:45:30.123456"
  }
}
```

### Reset Migration State
```python
from src.migrator.state import MigrationState

state = MigrationState()

# Reset a specific table
state.reset("rawdatacor")

# Reset all tables
state.reset()
```

---
## Recommended Workflow

### For Daily Continuous Sync
```bash
# Week 1: Initial setup
python main.py setup --create-schema
python main.py migrate --full RAWDATACOR
python main.py migrate --full ELABDATADISP

# Week 2+: Daily incremental syncs (via cron job)
# Example cron schedule: 0 2 * * * cd /path/to/project && python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental RAWDATACOR
python main.py migrate --incremental ELABDATADISP
```

### For Large Initial Migration
```bash
# If the dataset is larger than ~10 million rows
python main.py setup --create-schema
python main.py migrate --incremental RAWDATACOR --use-id   # can interrupt/resume

# For subsequent syncs, switch to timestamp-based mode
python main.py migrate --incremental RAWDATACOR
```

---
## Key Differences at a Glance

| Feature | Full | Timestamp | ID-based |
|---------|------|-----------|----------|
| Initial setup | ✅ Required first | ✅ After full | ✅ After full |
| Sync new/updated | ❌ No | ✅ Yes | ✅ Yes |
| Resumable | ❌ No | ⚠️ Partial* | ✅ Full |
| Batched state tracking | ❌ No | ❌ No | ✅ Yes |
| Large datasets | ⚠️ Risky | ✅ Good | ✅ Best |
| Scheduled jobs | ❌ No | ✅ Perfect | ⚠️ Unnecessary |

\*Timestamp mode can resume, but it only records its state when a run completes, so an interrupted run has to be repeated from the previous timestamp.

---
## Default Partitions

Both tables are partitioned by year (2014-2031) plus a DEFAULT partition:
- **rawdatacor_2014** through **rawdatacor_2031** (yearly partitions)
- **rawdatacor_default** (catches data outside 2014-2031)

The same layout applies to ELABDATADISP. This ensures that rows with timestamps outside the expected range do not break the migration.
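
As an illustration of what this layout looks like in PostgreSQL, the snippet below builds the yearly-plus-default partition DDL for `rawdatacor`. It is a sketch only: the parent-table definition and the choice of partition key column are assumptions, not the tool's actual schema.

```python
def partition_ddl(parent: str, first_year: int = 2014, last_year: int = 2031) -> list[str]:
    """Build CREATE TABLE statements for yearly range partitions plus a DEFAULT partition."""
    statements = [
        f"CREATE TABLE {parent}_{year} PARTITION OF {parent} "
        f"FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01');"
        for year in range(first_year, last_year + 1)
    ]
    statements.append(f"CREATE TABLE {parent}_default PARTITION OF {parent} DEFAULT;")
    return statements


# Assumes the parent table was created with something like:
#   CREATE TABLE rawdatacor (...) PARTITION BY RANGE (created_at);
for ddl in partition_ddl("rawdatacor"):
    print(ddl)
```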
---
## Monitoring

### Check Migration Progress
```bash
# View the state file
cat migration_state.json

# Check PostgreSQL row counts
psql -U postgres -h localhost -d your_db -c "SELECT COUNT(*) FROM rawdatacor;"
```
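
A quick way to sanity-check progress is to compare row counts on both sides. The sketch below does that; the `pymysql`/`psycopg2` drivers and all connection parameters are assumptions for illustration.

```python
import pymysql    # assumed MySQL driver
import psycopg2   # assumed PostgreSQL driver


def count_rows(cursor, table: str) -> int:
    """Return the row count for a table via the given DB-API cursor."""
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    return cursor.fetchone()[0]


# Connection parameters below are placeholders.
mysql_conn = pymysql.connect(host="mysql-host", user="user", password="pw", database="db")
pg_conn = psycopg2.connect(host="localhost", user="postgres", password="pw", dbname="your_db")

with mysql_conn.cursor() as mcur, pg_conn.cursor() as pcur:
    for table in ("RAWDATACOR", "ELABDATADISP"):
        source = count_rows(mcur, table)
        target = count_rows(pcur, table.lower())
        print(f"{table}: MySQL={source} PostgreSQL={target} diff={source - target}")
```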
### Common Issues

**"No previous migration found"** (Timestamp mode)
- Solution: Run full migration first with `--full` flag

**"Duplicate key value violates unique constraint"**
- Cause: Running full migration twice
- Solution: Use timestamp-based incremental sync instead

**"Timeout during migration"** (Large datasets)
- Solution: Switch to ID-based resumable migration with `--use-id`

---
## Summary

- **Start with:** a full migration (`--full`) for the initial data load
- **Then use:** timestamp-based incremental (`--incremental`) for daily syncs
- **Switch to:** ID-based resumable migration (`--incremental --use-id`) if the full migration is too large to run uninterrupted