clean docs

2025-12-30 15:24:19 +01:00
parent 5c9df3d06f
commit 5f6e3215a5
6 changed files with 492 additions and 159 deletions

## Overview
This tool implements **consolidation-based incremental migration** for two tables:

- **RAWDATACOR**: Raw sensor measurements
- **ELABDATADISP**: Elaborated/calculated data

Both tables use **consolidation keys** to group and migrate data efficiently.
---
## Migration Modes
### 1. Full Migration
Initial migration of all historical data, one partition at a time.
```bash
# Migrate all partitions for all tables
python main.py migrate full

# Migrate specific table
python main.py migrate full --table RAWDATACOR

# Migrate specific partition (year-based)
python main.py migrate full --table ELABDATADISP --partition 2024

# Dry-run to see what would be migrated
python main.py migrate full --dry-run
```
**Characteristics:**
- Migrates data partition by partition (year-based)
- Uses consolidation groups for efficiency
- Tracks progress in `migration_state` table (PostgreSQL)
- Can resume from last completed partition if interrupted
- Uses `mysql_max_id` optimization for performance
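
A minimal sketch of the resume check, assuming a DB-API (psycopg2-style) cursor; `migrate_partition` is a hypothetical worker injected by the caller, not the tool's actual API:

```python
from typing import Callable

def migrate_full_resumable(
    pg_cur,
    table: str,
    migrate_partition: Callable[[str, int], int],  # hypothetical worker: returns rows migrated
    years=range(2014, 2032),                       # matches the yearly partitions
) -> None:
    """Migrate one partition at a time, skipping partitions already completed."""
    for year in years:
        pg_cur.execute(
            "SELECT status FROM migration_state"
            " WHERE table_name = %s AND partition_name = %s",
            (table, str(year)),
        )
        row = pg_cur.fetchone()
        if row and row[0] == "completed":
            continue  # finished in a previous run: this is the resume point
        total = migrate_partition(table, year)
        pg_cur.execute(
            "UPDATE migration_state"
            " SET status = 'completed', completed_at = NOW(), total_rows = %s"
            " WHERE table_name = %s AND partition_name = %s",
            (total, table, str(year)),
        )
```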
---
### 2. Incremental Migration
Sync only new data since the last migration.
```bash
# Migrate new data for all tables
python main.py migrate incremental
# Migrate specific table
python main.py migrate incremental --table ELABDATADISP
# Dry-run to check what would be migrated
python main.py migrate incremental --dry-run
```
**Characteristics:**
- Uses **consolidation keys** to identify new records:
- `(UnitName, ToolNameID, EventDate, EventTime)`
- Tracks last migrated key in `migration_state` table
- Optimized with `min_mysql_id` filter for performance
- Handles duplicates with `ON CONFLICT DO NOTHING`
- Perfect for scheduled jobs (cron, systemd timers)
**How it works:**
1. Retrieves `last_key` from `migration_state` table
2. Gets `MAX(mysql_max_id)` from PostgreSQL table for optimization
3. Queries MySQL: `WHERE id > max_mysql_id AND (key_tuple) > last_key`
4. Migrates new consolidation groups
5. Updates `migration_state` with new `last_key`
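
Condensed into Python, one run might look like the sketch below (DB-API cursors on both sides are assumed; only the table and column names come from these docs, the rest is illustrative):

```python
KEY_COLS = "UnitName, ToolNameID, EventDate, EventTime"

def fetch_new_keys(mysql_cur, pg_cur, table="RAWDATACOR", limit=10_000):
    # 1. Last migrated consolidation key (stored as a JSONB array)
    pg_cur.execute(
        "SELECT last_key FROM migration_state"
        " WHERE table_name = %s AND partition_name = '_global'",
        (table.lower(),),
    )
    last_key = pg_cur.fetchone()[0]

    # 2. MAX(mysql_max_id) lets MySQL skip already-migrated rows via its PK index
    pg_cur.execute(f"SELECT COALESCE(MAX(mysql_max_id), 0) FROM {table.lower()}")
    max_id = pg_cur.fetchone()[0]

    # 3. Only consolidation keys strictly after last_key, among rows with id > max_id
    mysql_cur.execute(
        f"SELECT {KEY_COLS} FROM {table}"
        f" WHERE id > %s AND ({KEY_COLS}) > (%s, %s, %s, %s)"
        f" GROUP BY {KEY_COLS} ORDER BY {KEY_COLS} LIMIT {limit}",
        (max_id, *last_key),
    )
    return mysql_cur.fetchall()  # steps 4-5: migrate these groups, then store the last key
```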
---
## Consolidation Keys
Both tables use consolidation to group multiple measurements into a single JSONB record.

### Consolidation Key Structure
```sql
(UnitName, ToolNameID, EventDate, EventTime)
```
### Why Consolidation?
Instead of migrating individual sensor readings, we:
1. **Group** all measurements for the same (unit, tool, date, time)
2. **Transform** 16-25 columns into structured JSONB
3. **Migrate** as a single consolidated record
**Example:**
MySQL has 10 rows for `(Unit1, Tool1, 2024-01-01, 10:00:00)`:
```
id | UnitName | ToolNameID | EventDate | EventTime | Val0 | Val1 | ...
1 | Unit1 | Tool1 | 2024-01-01 | 10:00:00 | 23.5 | 45.2 | ...
2 | Unit1 | Tool1 | 2024-01-01 | 10:00:00 | 23.6 | 45.3 | ...
...
```
PostgreSQL gets 1 consolidated record:
```json
{
"unit_name": "Unit1",
"tool_name_id": "Tool1",
"event_timestamp": "2024-01-01 10:00:00",
"measurements": {
"0": {"value": 23.5, "unit": "°C"},
"1": {"value": 45.2, "unit": "bar"},
...
},
"mysql_max_id": 10
}
```
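
A Python sketch of that transform, mirroring the example above; the `Val*`/`Unit*` column layout and the "first non-null reading per channel wins" rule are assumptions, not documented behavior:

```python
import json

VAL_COLS = 16  # the docs mention 16-25 measurement columns per row

def consolidate(rows: list[dict]) -> str:
    """Collapse MySQL rows sharing one consolidation key into one JSONB payload."""
    first = rows[0]
    measurements = {}
    for row in rows:
        for i in range(VAL_COLS):
            value = row.get(f"Val{i}")
            # assumption: first non-null value per channel wins, as in the example
            if value is not None and str(i) not in measurements:
                measurements[str(i)] = {"value": value, "unit": row.get(f"Unit{i}")}
    return json.dumps({
        "unit_name": first["UnitName"],
        "tool_name_id": first["ToolNameID"],
        "event_timestamp": f'{first["EventDate"]} {first["EventTime"]}',
        "measurements": measurements,
        "mysql_max_id": max(r["id"] for r in rows),  # highest source id in the group
    })
```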
---
## State Management
### Migration State Table
The `migration_state` table in PostgreSQL tracks migration progress:
```sql
CREATE TABLE migration_state (
table_name VARCHAR(50),
partition_name VARCHAR(50),
last_key JSONB, -- Last migrated consolidation key
started_at TIMESTAMP,
completed_at TIMESTAMP,
total_rows INTEGER,
status VARCHAR(20)
);
```
### State Records
- **Per-partition state**: Tracks each partition's progress
  - Example: `('rawdatacor', '2024', {...}, '2024-01-15 09:00:00', '2024-01-15 10:30:00', 1000000, 'completed')`
- **Global state**: Tracks overall incremental migration position
  - Example: `('rawdatacor', '_global', {...}, NULL, NULL, 0, 'in_progress')`
### Checking State
```sql
-- View all migration state
SELECT * FROM migration_state ORDER BY table_name, partition_name;
-- View global state (for incremental migration)
SELECT * FROM migration_state WHERE partition_name = '_global';
```
---
## Performance Optimization
### MySQL ID Filter
The incremental migration uses `MAX(mysql_max_id)` from PostgreSQL to filter MySQL queries:
```sql
SELECT UnitName, ToolNameID, EventDate, EventTime
FROM RAWDATACOR
WHERE id > 267399536 -- max_mysql_id from PostgreSQL
AND (UnitName, ToolNameID, EventDate, EventTime) > (?, ?, ?, ?)
GROUP BY UnitName, ToolNameID, EventDate, EventTime
ORDER BY UnitName, ToolNameID, EventDate, EventTime
LIMIT 10000
```
**Why this is fast:**
- Uses PRIMARY KEY index on `id` to skip millions of already-migrated rows
- Tuple comparison only applied to filtered subset
- Avoids full table scans
### Consolidation Group Batching
Instead of fetching individual rows, we:
1. Fetch 10,000 consolidation keys at a time
2. For each key, fetch all matching rows from MySQL
3. Transform and insert into PostgreSQL
4. Update state every batch
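
Roughly, in Python (all four callables are hypothetical stand-ins for the tool's internals):

```python
def run_in_batches(fetch_keys, fetch_rows, insert_group, save_state, batch=10_000):
    """Migrate consolidation groups batch by batch, persisting state each round."""
    while True:
        keys = fetch_keys(limit=batch)       # step 1: next 10,000 consolidation keys
        if not keys:
            break                            # nothing new: fully caught up
        for key in keys:
            rows = fetch_rows(key)           # step 2: all MySQL rows for this key
            insert_group(key, rows)          # step 3: one consolidated JSONB record
        save_state(last_key=keys[-1])        # step 4: resume point survives interruption
```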
---
## Recommended Workflow
### Initial Setup (One-time)
```bash
# 1. Configure .env file
cp .env.example .env
nano .env
# 2. Create PostgreSQL schema
python main.py setup --create-schema
# 3. Run full migration
python main.py migrate full
```
### Daily Incremental Sync
```bash
# Run incremental migration (via cron or manual)
python main.py migrate incremental
```
**Cron example** (daily at 2 AM):
```cron
0 2 * * * cd /path/to/mysql2postgres && python main.py migrate incremental >> /var/log/migration.log 2>&1
```
---
## Resuming Interrupted Migrations
### Full Migration
If interrupted, full migration resumes from the last completed partition:
```bash
# First run: migrates partitions 2014, 2015, 2016... (interrupted after 2020)
python main.py migrate full --table RAWDATACOR
# Resume: continues from partition 2021
python main.py migrate full --table RAWDATACOR
```
### Incremental Migration
Incremental migration uses the `last_key` from `migration_state`:
```bash
# Always safe to re-run - uses ON CONFLICT DO NOTHING
python main.py migrate incremental
```
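
With psycopg2, the idempotent insert might look like this sketch (the driver choice and the column names, inferred from the JSON example earlier, are assumptions):

```python
import psycopg2  # assumed driver; any DB-API PostgreSQL driver works the same way

INSERT_SQL = """
INSERT INTO rawdatacor (unit_name, tool_name_id, event_timestamp, measurements, mysql_max_id)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT DO NOTHING
"""

def insert_groups(conn, records):
    """Insert consolidated records; rows that already exist are silently skipped."""
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, records)
    conn.commit()
```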
---
## Syncing Migration State
If `migration_state` becomes out of sync with actual data, use the sync utility:
```bash
# Sync migration_state with actual PostgreSQL data
python scripts/sync_migration_state.py
```
This finds the most recent row (by `created_at`) and updates `migration_state._global`.
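
A rough sketch of what such a sync does (the exact SQL and the stored key layout are assumptions):

```python
import json

def sync_global_state(pg_cur, table="rawdatacor"):
    """Rebuild migration_state._global from the newest row actually in PostgreSQL."""
    pg_cur.execute(
        f"SELECT unit_name, tool_name_id, event_timestamp FROM {table}"
        " ORDER BY created_at DESC LIMIT 1"
    )
    row = pg_cur.fetchone()
    if row is None:
        return  # empty table: nothing to sync
    pg_cur.execute(
        "UPDATE migration_state SET last_key = %s"
        " WHERE table_name = %s AND partition_name = '_global'",
        (json.dumps([str(v) for v in row]), table),
    )
```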
---
## Monitoring
### Check Progress
```bash
# View migration state
psql -h localhost -U postgres -d migrated_db -c \
"SELECT table_name, partition_name, status, total_rows, completed_at
FROM migration_state
ORDER BY table_name, partition_name"
```
### Verify Row Counts
**"No previous migration found"** (Timestamp mode)
- Solution: Run full migration first with `--full` flag
```sql
-- PostgreSQL
SELECT COUNT(*) FROM rawdatacor;
SELECT COUNT(*) FROM elabdatadisp;
**"Duplicate key value violates unique constraint"**
- Cause: Running full migration twice
- Solution: Use timestamp-based incremental sync instead
-- Compare with MySQL
-- mysql> SELECT COUNT(DISTINCT UnitName, ToolNameID, EventDate, EventTime) FROM RAWDATACOR;
```
**"Timeout during migration"** (Large datasets)
- Solution: Switch to ID-based resumable migration with `--use-id`
---
## Common Issues
### "No previous migration found"
**Cause**: Running incremental migration before full migration
**Solution**: Run full migration first
```bash
python main.py migrate full
```
### "Duplicate key value violates unique constraint"
**Cause**: Data already exists (shouldn't happen with ON CONFLICT DO NOTHING)
**Solution**: Migration handles this automatically - check logs for details
### Slow Incremental Migration
**Cause**: `MAX(mysql_max_id)` query is slow (~60 seconds on large tables)
**Solution**: This is expected and only happens once at start. The MySQL queries are instant thanks to the optimization.
**Alternative**: Create an index on `mysql_max_id` in PostgreSQL (uses disk space):
```sql
CREATE INDEX idx_rawdatacor_mysql_max_id ON rawdatacor (mysql_max_id DESC);
CREATE INDEX idx_elabdatadisp_mysql_max_id ON elabdatadisp (mysql_max_id DESC);
```
---
## Key Technical Details
### Tuple Comparison in MySQL
MySQL supports lexicographic tuple comparison:
```sql
WHERE (UnitName, ToolNameID, EventDate, EventTime) > ('Unit1', 'Tool1', '2024-01-01', '10:00:00')
```
This is equivalent to:
```sql
WHERE UnitName > 'Unit1'
OR (UnitName = 'Unit1' AND ToolNameID > 'Tool1')
OR (UnitName = 'Unit1' AND ToolNameID = 'Tool1' AND EventDate > '2024-01-01')
OR (UnitName = 'Unit1' AND ToolNameID = 'Tool1' AND EventDate = '2024-01-01' AND EventTime > '10:00:00')
```
But much more efficient!
### Partitioning in PostgreSQL
Tables are partitioned by year (2014-2031):
```sql
CREATE TABLE rawdatacor_2024 PARTITION OF rawdatacor
FOR VALUES FROM (2024) TO (2025);
```
PostgreSQL automatically routes INSERTs to the correct partition based on `event_year`.
---
## Summary
1. **Full migration**: One-time initial load, partition by partition
2. **Incremental migration**: Daily sync of new data using consolidation keys
3. **State tracking**: PostgreSQL `migration_state` table
4. **Performance**: `mysql_max_id` filter + consolidation batching
5. **Resumable**: Both modes can resume from interruptions
6. **Safe**: ON CONFLICT DO NOTHING prevents duplicates