clean docs

2025-12-30 15:24:19 +01:00
parent 5c9df3d06f
commit 5f6e3215a5
6 changed files with 492 additions and 159 deletions

## Overview
This tool implements **consolidation-based incremental migration** for two tables:

- **RAWDATACOR**: Raw sensor measurements
- **ELABDATADISP**: Elaborated/calculated data

Both tables use **consolidation keys** to group and migrate data efficiently.
---
## Migration Modes
### 1. Full Migration
Initial migration of all historical data, one partition at a time.
```bash
# Migrate all partitions for all tables
python main.py migrate full

# Migrate specific table
python main.py migrate full --table RAWDATACOR

# Migrate specific partition (year-based)
python main.py migrate full --table ELABDATADISP --partition 2024

# Dry-run to see what would be migrated
python main.py migrate full --dry-run
```
**Characteristics:**
- Migrates data partition by partition (year-based)
- Uses consolidation groups for efficiency
- Tracks progress in `migration_state` table (PostgreSQL)
- Can resume from last completed partition if interrupted
- Uses `mysql_max_id` optimization for performance
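
A minimal sketch of the resume check, assuming a DB-API (psycopg2-style) cursor; `migrate_partition` is a hypothetical worker injected by the caller, not the tool's actual API:

```python
from typing import Callable

def migrate_full_resumable(
    pg_cur,
    table: str,
    migrate_partition: Callable[[str, int], int],  # hypothetical worker: returns rows migrated
    years=range(2014, 2032),                       # matches the yearly partitions
) -> None:
    """Migrate one partition at a time, skipping partitions already completed."""
    for year in years:
        pg_cur.execute(
            "SELECT status FROM migration_state"
            " WHERE table_name = %s AND partition_name = %s",
            (table, str(year)),
        )
        row = pg_cur.fetchone()
        if row and row[0] == "completed":
            continue  # finished in a previous run: this is the resume point
        total = migrate_partition(table, year)
        pg_cur.execute(
            "UPDATE migration_state"
            " SET status = 'completed', completed_at = NOW(), total_rows = %s"
            " WHERE table_name = %s AND partition_name = %s",
            (total, table, str(year)),
        )
```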
---
### 2. Incremental Migration
Sync only new data since the last migration.
```bash
# Migrate new data for all tables
python main.py migrate incremental
# Migrate specific table
python main.py migrate incremental --table ELABDATADISP
# Dry-run to check what would be migrated
python main.py migrate incremental --dry-run
```
**Characteristics:**
- Uses **consolidation keys** to identify new records:
- `(UnitName, ToolNameID, EventDate, EventTime)`
- Tracks last migrated key in `migration_state` table
- Optimized with `min_mysql_id` filter for performance
- Handles duplicates with `ON CONFLICT DO NOTHING`
- Perfect for scheduled jobs (cron, systemd timers)
**How it works:**
1. Retrieves `last_key` from `migration_state` table
2. Gets `MAX(mysql_max_id)` from PostgreSQL table for optimization
3. Queries MySQL: `WHERE id > max_mysql_id AND (key_tuple) > last_key`
4. Migrates new consolidation groups
5. Updates `migration_state` with new `last_key`
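
Condensed into Python, one run might look like the sketch below (DB-API cursors on both sides are assumed; only the table and column names come from these docs, the rest is illustrative):

```python
KEY_COLS = "UnitName, ToolNameID, EventDate, EventTime"

def fetch_new_keys(mysql_cur, pg_cur, table="RAWDATACOR", limit=10_000):
    # 1. Last migrated consolidation key (stored as a JSONB array)
    pg_cur.execute(
        "SELECT last_key FROM migration_state"
        " WHERE table_name = %s AND partition_name = '_global'",
        (table.lower(),),
    )
    last_key = pg_cur.fetchone()[0]

    # 2. MAX(mysql_max_id) lets MySQL skip already-migrated rows via its PK index
    pg_cur.execute(f"SELECT COALESCE(MAX(mysql_max_id), 0) FROM {table.lower()}")
    max_id = pg_cur.fetchone()[0]

    # 3. Only consolidation keys strictly after last_key, among rows with id > max_id
    mysql_cur.execute(
        f"SELECT {KEY_COLS} FROM {table}"
        f" WHERE id > %s AND ({KEY_COLS}) > (%s, %s, %s, %s)"
        f" GROUP BY {KEY_COLS} ORDER BY {KEY_COLS} LIMIT {limit}",
        (max_id, *last_key),
    )
    return mysql_cur.fetchall()  # steps 4-5: migrate these groups, then store the last key
```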
---
## Consolidation Keys
Both tables use consolidation to group multiple measurements into a single JSONB record.

### Consolidation Key Structure
```sql
(UnitName, ToolNameID, EventDate, EventTime)
```
### Why Consolidation?
Instead of migrating individual sensor readings, we:
1. **Group** all measurements for the same (unit, tool, date, time)
2. **Transform** 16-25 columns into structured JSONB
3. **Migrate** as a single consolidated record
**Example:**
MySQL has 10 rows for `(Unit1, Tool1, 2024-01-01, 10:00:00)`:
```
id | UnitName | ToolNameID | EventDate | EventTime | Val0 | Val1 | ...
1 | Unit1 | Tool1 | 2024-01-01 | 10:00:00 | 23.5 | 45.2 | ...
2 | Unit1 | Tool1 | 2024-01-01 | 10:00:00 | 23.6 | 45.3 | ...
...
```
PostgreSQL gets 1 consolidated record:
```json
{
"unit_name": "Unit1",
"tool_name_id": "Tool1",
"event_timestamp": "2024-01-01 10:00:00",
"measurements": {
"0": {"value": 23.5, "unit": "°C"},
"1": {"value": 45.2, "unit": "bar"},
...
},
"mysql_max_id": 10
}
```
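
A Python sketch of that transform, mirroring the example above; the `Val*`/`Unit*` column layout and the "first non-null reading per channel wins" rule are assumptions, not documented behavior:

```python
import json

VAL_COLS = 16  # the docs mention 16-25 measurement columns per row

def consolidate(rows: list[dict]) -> str:
    """Collapse MySQL rows sharing one consolidation key into one JSONB payload."""
    first = rows[0]
    measurements = {}
    for row in rows:
        for i in range(VAL_COLS):
            value = row.get(f"Val{i}")
            # assumption: first non-null value per channel wins, as in the example
            if value is not None and str(i) not in measurements:
                measurements[str(i)] = {"value": value, "unit": row.get(f"Unit{i}")}
    return json.dumps({
        "unit_name": first["UnitName"],
        "tool_name_id": first["ToolNameID"],
        "event_timestamp": f'{first["EventDate"]} {first["EventTime"]}',
        "measurements": measurements,
        "mysql_max_id": max(r["id"] for r in rows),  # highest source id in the group
    })
```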
---
## State Management
### Migration State Table
The `migration_state` table in PostgreSQL tracks migration progress:
```sql
CREATE TABLE migration_state (
table_name VARCHAR(50),
partition_name VARCHAR(50),
last_key JSONB, -- Last migrated consolidation key
started_at TIMESTAMP,
completed_at TIMESTAMP,
total_rows INTEGER,
status VARCHAR(20)
);
```
### State Records
- **Per-partition state**: Tracks each partition's progress
  - Example: `('rawdatacor', '2024', {...}, '2024-01-15 09:00:00', '2024-01-15 10:30:00', 1000000, 'completed')`
- **Global state**: Tracks overall incremental migration position
  - Example: `('rawdatacor', '_global', {...}, NULL, NULL, 0, 'in_progress')`
### Checking State
```sql
-- View all migration state
SELECT * FROM migration_state ORDER BY table_name, partition_name;
-- View global state (for incremental migration)
SELECT * FROM migration_state WHERE partition_name = '_global';
```
---
## Performance Optimization
### MySQL ID Filter
The incremental migration uses `MAX(mysql_max_id)` from PostgreSQL to filter MySQL queries:
```sql
SELECT UnitName, ToolNameID, EventDate, EventTime
FROM RAWDATACOR
WHERE id > 267399536 -- max_mysql_id from PostgreSQL
AND (UnitName, ToolNameID, EventDate, EventTime) > (?, ?, ?, ?)
GROUP BY UnitName, ToolNameID, EventDate, EventTime
ORDER BY UnitName, ToolNameID, EventDate, EventTime
LIMIT 10000
```
**Why this is fast:**
- Uses PRIMARY KEY index on `id` to skip millions of already-migrated rows
- Tuple comparison only applied to filtered subset
- Avoids full table scans
### Consolidation Group Batching
Instead of fetching individual rows, we:
1. Fetch 10,000 consolidation keys at a time
2. For each key, fetch all matching rows from MySQL
3. Transform and insert into PostgreSQL
4. Update state every batch
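
Roughly, in Python (all four callables are hypothetical stand-ins for the tool's internals):

```python
def run_in_batches(fetch_keys, fetch_rows, insert_group, save_state, batch=10_000):
    """Migrate consolidation groups batch by batch, persisting state each round."""
    while True:
        keys = fetch_keys(limit=batch)       # step 1: next 10,000 consolidation keys
        if not keys:
            break                            # nothing new: fully caught up
        for key in keys:
            rows = fetch_rows(key)           # step 2: all MySQL rows for this key
            insert_group(key, rows)          # step 3: one consolidated JSONB record
        save_state(last_key=keys[-1])        # step 4: resume point survives interruption
```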
---
## Recommended Workflow
### Initial Setup (One-time)
```bash
# 1. Configure .env file
cp .env.example .env
nano .env
# 2. Create PostgreSQL schema
python main.py setup --create-schema
# 3. Run full migration
python main.py migrate full
```
### Daily Incremental Sync
```bash
# Run incremental migration (via cron or manual)
python main.py migrate incremental
```
**Cron example** (daily at 2 AM):
```cron
0 2 * * * cd /path/to/mysql2postgres && python main.py migrate incremental >> /var/log/migration.log 2>&1
```
---
## Resuming Interrupted Migrations
### Full Migration
If interrupted, full migration resumes from the last completed partition:
```bash
# First run: migrates partitions 2014, 2015, 2016... (interrupted after 2020)
python main.py migrate full --table RAWDATACOR
# Resume: continues from partition 2021
python main.py migrate full --table RAWDATACOR
```
### Incremental Migration
Incremental migration uses the `last_key` from `migration_state`:
```bash
# Always safe to re-run - uses ON CONFLICT DO NOTHING
python main.py migrate incremental
```
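
With psycopg2, the idempotent insert might look like this sketch (the driver choice and the column names, inferred from the JSON example earlier, are assumptions):

```python
import psycopg2  # assumed driver; any DB-API PostgreSQL driver works the same way

INSERT_SQL = """
INSERT INTO rawdatacor (unit_name, tool_name_id, event_timestamp, measurements, mysql_max_id)
VALUES (%s, %s, %s, %s, %s)
ON CONFLICT DO NOTHING
"""

def insert_groups(conn, records):
    """Insert consolidated records; rows that already exist are silently skipped."""
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, records)
    conn.commit()
```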
---
## Syncing Migration State
If `migration_state` becomes out of sync with actual data, use the sync utility:
```bash
# Sync migration_state with actual PostgreSQL data
python scripts/sync_migration_state.py
```
This finds the most recent row (by `created_at`) and updates `migration_state._global`.
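
A rough sketch of what such a sync does (the exact SQL and the stored key layout are assumptions):

```python
import json

def sync_global_state(pg_cur, table="rawdatacor"):
    """Rebuild migration_state._global from the newest row actually in PostgreSQL."""
    pg_cur.execute(
        f"SELECT unit_name, tool_name_id, event_timestamp FROM {table}"
        " ORDER BY created_at DESC LIMIT 1"
    )
    row = pg_cur.fetchone()
    if row is None:
        return  # empty table: nothing to sync
    pg_cur.execute(
        "UPDATE migration_state SET last_key = %s"
        " WHERE table_name = %s AND partition_name = '_global'",
        (json.dumps([str(v) for v in row]), table),
    )
```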
---
## Monitoring
### Check Progress
```bash
# View migration state
psql -h localhost -U postgres -d migrated_db -c \
"SELECT table_name, partition_name, status, total_rows, completed_at
FROM migration_state
ORDER BY table_name, partition_name"
```
### Verify Row Counts
**"No previous migration found"** (Timestamp mode)
- Solution: Run full migration first with `--full` flag
```sql
-- PostgreSQL
SELECT COUNT(*) FROM rawdatacor;
SELECT COUNT(*) FROM elabdatadisp;
**"Duplicate key value violates unique constraint"**
- Cause: Running full migration twice
- Solution: Use timestamp-based incremental sync instead
-- Compare with MySQL
-- mysql> SELECT COUNT(DISTINCT UnitName, ToolNameID, EventDate, EventTime) FROM RAWDATACOR;
```
**"Timeout during migration"** (Large datasets)
- Solution: Switch to ID-based resumable migration with `--use-id`
---
## Common Issues
### "No previous migration found"
**Cause**: Running incremental migration before full migration
**Solution**: Run full migration first
```bash
python main.py migrate full
```
### "Duplicate key value violates unique constraint"
**Cause**: Data already exists (shouldn't happen with ON CONFLICT DO NOTHING)
**Solution**: Migration handles this automatically - check logs for details
### Slow Incremental Migration
**Cause**: `MAX(mysql_max_id)` query is slow (~60 seconds on large tables)
**Solution**: This is expected and only happens once at start. The MySQL queries are instant thanks to the optimization.
**Alternative**: Create an index on `mysql_max_id` in PostgreSQL (uses disk space):
```sql
CREATE INDEX idx_rawdatacor_mysql_max_id ON rawdatacor (mysql_max_id DESC);
CREATE INDEX idx_elabdatadisp_mysql_max_id ON elabdatadisp (mysql_max_id DESC);
```
---
## Key Technical Details
### Tuple Comparison in MySQL
MySQL supports lexicographic tuple comparison:
```sql
WHERE (UnitName, ToolNameID, EventDate, EventTime) > ('Unit1', 'Tool1', '2024-01-01', '10:00:00')
```
This is equivalent to:
```sql
WHERE UnitName > 'Unit1'
OR (UnitName = 'Unit1' AND ToolNameID > 'Tool1')
OR (UnitName = 'Unit1' AND ToolNameID = 'Tool1' AND EventDate > '2024-01-01')
OR (UnitName = 'Unit1' AND ToolNameID = 'Tool1' AND EventDate = '2024-01-01' AND EventTime > '10:00:00')
```
But much more efficient!
### Partitioning in PostgreSQL
Tables are partitioned by year (2014-2031):
```sql
CREATE TABLE rawdatacor_2024 PARTITION OF rawdatacor
FOR VALUES FROM (2024) TO (2025);
```
PostgreSQL automatically routes INSERTs to the correct partition based on `event_year`.
---
## Summary
1. **Full migration**: One-time initial load, partition by partition
2. **Incremental migration**: Daily sync of new data using consolidation keys
3. **State tracking**: PostgreSQL `migration_state` table
4. **Performance**: `mysql_max_id` filter + consolidation batching
5. **Resumable**: Both modes can resume from interruptions
6. **Safe**: ON CONFLICT DO NOTHING prevents duplicates