# MySQL to PostgreSQL Migration Workflow

## Overview

This tool implements consolidation-based incremental migration for two tables:
- RAWDATACOR: Raw sensor measurements
- ELABDATADISP: Elaborated/calculated data
Both tables use consolidation keys to group and migrate data efficiently.
## Migration Modes

### 1. Full Migration

Initial migration of all historical data, migrating one partition at a time.
```bash
# Migrate all partitions for all tables
python main.py migrate full

# Migrate specific table
python main.py migrate full --table RAWDATACOR

# Migrate specific partition (year-based)
python main.py migrate full --table ELABDATADISP --partition 2024

# Dry-run to see what would be migrated
python main.py migrate full --dry-run
```
**Characteristics:**

- Migrates data partition by partition (year-based)
- Uses consolidation groups for efficiency
- Tracks progress in the `migration_state` table (PostgreSQL)
- Can resume from the last completed partition if interrupted
- Uses the `mysql_max_id` optimization for performance
### 2. Incremental Migration

Sync only new data since the last migration.

```bash
# Migrate new data for all tables
python main.py migrate incremental

# Migrate specific table
python main.py migrate incremental --table ELABDATADISP

# Dry-run to check what would be migrated
python main.py migrate incremental --dry-run
```
**Characteristics:**

- Uses consolidation keys to identify new records: `(UnitName, ToolNameID, EventDate, EventTime)`
- Tracks the last migrated key in the `migration_state` table
- Optimized with the `mysql_max_id` filter for performance
- Handles duplicates with `ON CONFLICT DO NOTHING`
- Perfect for scheduled jobs (cron, systemd timers)
**How it works:**

1. Retrieves `last_key` from the `migration_state` table
2. Gets `MAX(mysql_max_id)` from the PostgreSQL table for optimization
3. Queries MySQL: `WHERE id > max_mysql_id AND (key_tuple) > last_key`
4. Migrates new consolidation groups
5. Updates `migration_state` with the new `last_key`
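
The state bookkeeping around this loop can be sketched as follows. This is a minimal illustration, not the tool's actual code: it assumes psycopg2 and the `migration_state` schema shown under State Management below, and the `load_position`/`save_position` names are hypothetical.

```python
# Minimal sketch (not the tool's actual API): read the incremental
# position (steps 1-2) and persist the new one (step 5).
import json


def load_position(pg, table: str):
    """Read last_key and the MAX(mysql_max_id) watermark for one table."""
    with pg.cursor() as cur:
        cur.execute(
            "SELECT last_key FROM migration_state "
            "WHERE table_name = %s AND partition_name = '_global'",
            (table,),
        )
        row = cur.fetchone()
        last_key = row[0] if row else None

        # `table` comes from a fixed whitelist, so interpolation is safe here
        cur.execute(f"SELECT COALESCE(MAX(mysql_max_id), 0) FROM {table}")
        max_mysql_id = cur.fetchone()[0]
    return last_key, max_mysql_id


def save_position(pg, table: str, new_last_key) -> None:
    """Persist the newest consolidation key after a successful batch."""
    with pg.cursor() as cur:
        cur.execute(
            "UPDATE migration_state SET last_key = %s "
            "WHERE table_name = %s AND partition_name = '_global'",
            (json.dumps(new_last_key), table),
        )
    pg.commit()
```

Steps 3-4 (key fetching and group migration) are sketched under Performance Optimization below.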
## Consolidation Keys

Both tables use consolidation to group multiple measurements into a single JSONB record.

### Consolidation Key Structure

```
(UnitName, ToolNameID, EventDate, EventTime)
```
### Why Consolidation?

Instead of migrating individual sensor readings, we:

- Group all measurements for the same (unit, tool, date, time)
- Transform 16-25 columns into structured JSONB
- Migrate as a single consolidated record
**Example:** MySQL has 10 rows for `(Unit1, Tool1, 2024-01-01, 10:00:00)`:

```
id | UnitName | ToolNameID | EventDate  | EventTime | Val0 | Val1 | ...
 1 | Unit1    | Tool1      | 2024-01-01 | 10:00:00  | 23.5 | 45.2 | ...
 2 | Unit1    | Tool1      | 2024-01-01 | 10:00:00  | 23.6 | 45.3 | ...
...
```

PostgreSQL gets 1 consolidated record:

```json
{
  "unit_name": "Unit1",
  "tool_name_id": "Tool1",
  "event_timestamp": "2024-01-01 10:00:00",
  "measurements": {
    "0": {"value": 23.5, "unit": "°C"},
    "1": {"value": 45.2, "unit": "bar"},
    ...
  },
  "mysql_max_id": 10
}
```
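
A simplified version of this transform could look like the sketch below. It is illustrative only: the mapping from `Val` columns to measurement channels is an assumption, and the unit lookup is omitted.

```python
# Illustrative transform: collapse all MySQL rows sharing one consolidation
# key into a single JSONB-ready dict. Unit lookup and edge cases omitted.
def consolidate(rows: list) -> dict:
    first = rows[0]
    measurements = {}
    for row in rows:
        for col, value in row.items():
            # Map Val0, Val1, ... to measurement channels "0", "1", ...
            if col.startswith("Val") and value is not None:
                measurements[col[3:]] = {"value": value}
    return {
        "unit_name": first["UnitName"],
        "tool_name_id": first["ToolNameID"],
        "event_timestamp": f"{first['EventDate']} {first['EventTime']}",
        "measurements": measurements,
        # Highest source id in the group, used later as the id watermark
        "mysql_max_id": max(row["id"] for row in rows),
    }
```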
## State Management

### Migration State Table

The `migration_state` table in PostgreSQL tracks migration progress:

```sql
CREATE TABLE migration_state (
    table_name     VARCHAR(50),
    partition_name VARCHAR(50),
    last_key       JSONB,        -- Last migrated consolidation key
    started_at     TIMESTAMP,
    completed_at   TIMESTAMP,
    total_rows     INTEGER,
    status         VARCHAR(20)
);
```
### State Records

- **Per-partition state**: tracks each partition's progress.
  Example: `('rawdatacor', '2024', {...}, '2024-01-15 10:00:00', '2024-01-15 10:30:00', 1000000, 'completed')`
- **Global state**: tracks the overall incremental migration position.
  Example: `('rawdatacor', '_global', {...}, NULL, NULL, 0, 'in_progress')`
### Checking State

```sql
-- View all migration state
SELECT * FROM migration_state ORDER BY table_name, partition_name;

-- View global state (for incremental migration)
SELECT * FROM migration_state WHERE partition_name = '_global';
```
## Performance Optimization

### MySQL ID Filter

The incremental migration uses `MAX(mysql_max_id)` from PostgreSQL to filter MySQL queries:

```sql
SELECT UnitName, ToolNameID, EventDate, EventTime
FROM RAWDATACOR
WHERE id > 267399536  -- max_mysql_id from PostgreSQL
  AND (UnitName, ToolNameID, EventDate, EventTime) > (?, ?, ?, ?)
GROUP BY UnitName, ToolNameID, EventDate, EventTime
ORDER BY UnitName, ToolNameID, EventDate, EventTime
LIMIT 10000
```
**Why this is fast:**

- Uses the PRIMARY KEY index on `id` to skip millions of already-migrated rows
- The tuple comparison is applied only to the filtered subset
- Avoids full table scans
### Consolidation Group Batching

Instead of fetching individual rows, we:

- Fetch 10,000 consolidation keys at a time
- For each key, fetch all matching rows from MySQL
- Transform and insert into PostgreSQL
- Update state every batch
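
Putting the pieces together for a single key, one migration step could be sketched like this. It reuses the hypothetical `consolidate()` helper from the consolidation example above; the PostgreSQL column list is also an assumption.

```python
from psycopg2.extras import Json


def migrate_group(my, pg, key):
    """Fetch all MySQL rows for one consolidation key and insert one
    consolidated record into PostgreSQL, skipping duplicates."""
    mcur = my.cursor(dictionary=True)
    mcur.execute(
        "SELECT * FROM RAWDATACOR WHERE UnitName = %s AND ToolNameID = %s "
        "AND EventDate = %s AND EventTime = %s",
        key,
    )
    record = consolidate(mcur.fetchall())  # see the earlier sketch
    with pg.cursor() as cur:
        cur.execute(
            "INSERT INTO rawdatacor "
            "(unit_name, tool_name_id, event_timestamp, measurements, mysql_max_id) "
            "VALUES (%s, %s, %s, %s, %s) ON CONFLICT DO NOTHING",
            (
                record["unit_name"],
                record["tool_name_id"],
                record["event_timestamp"],
                Json(record["measurements"]),
                record["mysql_max_id"],
            ),
        )
    pg.commit()
```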
## Recommended Workflow

### Initial Setup (One-time)

```bash
# 1. Configure the .env file
cp .env.example .env
nano .env

# 2. Create the PostgreSQL schema
python main.py setup --create-schema

# 3. Run the full migration
python main.py migrate full
```
### Daily Incremental Sync

```bash
# Run incremental migration (via cron or manually)
python main.py migrate incremental
```

Cron example (daily at 2 AM):

```cron
0 2 * * * cd /path/to/mysql2postgres && python main.py migrate incremental >> /var/log/migration.log 2>&1
```
## Resuming Interrupted Migrations

### Full Migration

If interrupted, a full migration resumes from the last completed partition:

```bash
# First run: migrates partitions 2014, 2015, 2016... (interrupted after 2020)
python main.py migrate full --table RAWDATACOR

# Resume: continues from partition 2021
python main.py migrate full --table RAWDATACOR
```
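
Conceptually, the resume logic only has to skip partitions already marked `completed` in `migration_state`. A hypothetical sketch:

```python
# Hypothetical resume check: list the year partitions still to migrate.
def pending_partitions(pg, table, years=range(2014, 2032)):
    with pg.cursor() as cur:
        cur.execute(
            "SELECT partition_name FROM migration_state "
            "WHERE table_name = %s AND status = 'completed'",
            (table,),
        )
        done = {row[0] for row in cur.fetchall()}
    return [year for year in years if str(year) not in done]
```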
### Incremental Migration

Incremental migration uses the `last_key` from `migration_state`:

```bash
# Always safe to re-run - uses ON CONFLICT DO NOTHING
python main.py migrate incremental
```
### Syncing Migration State

If `migration_state` becomes out of sync with the actual data, use the sync utility:

```bash
# Sync migration_state with actual PostgreSQL data
python scripts/sync_migration_state.py
```

This finds the most recent row (by `created_at`) and updates the `_global` record in `migration_state`.
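
In essence, the utility does something like the following sketch (it assumes a `created_at` column on the target table and string-convertible key fields; the real script may differ):

```python
from psycopg2.extras import Json


def sync_global_state(pg, table):
    """Rebuild the _global last_key from the most recently inserted row."""
    with pg.cursor() as cur:
        # `table` comes from a fixed whitelist, so interpolation is safe here
        cur.execute(
            f"SELECT unit_name, tool_name_id, event_timestamp FROM {table} "
            f"ORDER BY created_at DESC LIMIT 1"
        )
        row = cur.fetchone()
        if row is None:
            return  # nothing migrated yet
        cur.execute(
            "UPDATE migration_state SET last_key = %s "
            "WHERE table_name = %s AND partition_name = '_global'",
            (Json([str(v) for v in row]), table),
        )
    pg.commit()
```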
## Monitoring

### Check Progress

```bash
# View migration state
psql -h localhost -U postgres -d migrated_db -c \
  "SELECT table_name, partition_name, status, total_rows, completed_at
   FROM migration_state
   ORDER BY table_name, partition_name"
```
### Verify Row Counts

```sql
-- PostgreSQL
SELECT COUNT(*) FROM rawdatacor;
SELECT COUNT(*) FROM elabdatadisp;
```

Compare with MySQL:

```sql
SELECT COUNT(DISTINCT UnitName, ToolNameID, EventDate, EventTime) FROM RAWDATACOR;
```
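
The two counts can also be compared in a small script. A sketch, with connection setup omitted and the function name assumed:

```python
def verify_counts(my, pg, mysql_table, pg_table):
    """Compare MySQL's distinct consolidation keys with PostgreSQL rows."""
    mcur = my.cursor()
    mcur.execute(
        f"SELECT COUNT(DISTINCT UnitName, ToolNameID, EventDate, EventTime) "
        f"FROM {mysql_table}"
    )
    expected = mcur.fetchone()[0]
    with pg.cursor() as cur:
        cur.execute(f"SELECT COUNT(*) FROM {pg_table}")
        actual = cur.fetchone()[0]
    print(f"{pg_table}: {actual} of {expected} consolidated rows")
    return actual == expected
```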
## Common Issues

### "No previous migration found"

**Cause**: Running incremental migration before a full migration.

**Solution**: Run the full migration first:

```bash
python main.py migrate full
```

### "Duplicate key value violates unique constraint"

**Cause**: Data already exists (this shouldn't happen with ON CONFLICT DO NOTHING).

**Solution**: The migration handles this automatically; check the logs for details.

### Slow Incremental Migration

**Cause**: The `MAX(mysql_max_id)` query is slow (~60 seconds on large tables).

**Solution**: This is expected and happens only once, at startup. The subsequent MySQL queries are fast thanks to the id filter.

**Alternative**: Create an index on `mysql_max_id` in PostgreSQL (at the cost of disk space):

```sql
CREATE INDEX idx_rawdatacor_mysql_max_id ON rawdatacor (mysql_max_id DESC);
CREATE INDEX idx_elabdatadisp_mysql_max_id ON elabdatadisp (mysql_max_id DESC);
```
## Key Technical Details

### Tuple Comparison in MySQL

MySQL supports lexicographic tuple (row constructor) comparison:

```sql
WHERE (UnitName, ToolNameID, EventDate, EventTime) > ('Unit1', 'Tool1', '2024-01-01', '10:00:00')
```

This is equivalent to:

```sql
WHERE UnitName > 'Unit1'
   OR (UnitName = 'Unit1' AND ToolNameID > 'Tool1')
   OR (UnitName = 'Unit1' AND ToolNameID = 'Tool1' AND EventDate > '2024-01-01')
   OR (UnitName = 'Unit1' AND ToolNameID = 'Tool1' AND EventDate = '2024-01-01' AND EventTime > '10:00:00')
```

but it is far more concise and easier for the optimizer to handle efficiently.
### Partitioning in PostgreSQL

Tables are partitioned by year (2014-2031):

```sql
CREATE TABLE rawdatacor_2024 PARTITION OF rawdatacor
    FOR VALUES FROM (2024) TO (2025);
```

PostgreSQL automatically routes INSERTs to the correct partition based on `event_year`.
## Summary

- **Full migration**: one-time initial load, partition by partition
- **Incremental migration**: daily sync of new data using consolidation keys
- **State tracking**: the `migration_state` table in PostgreSQL
- **Performance**: `mysql_max_id` filter + consolidation group batching
- **Resumable**: both modes can resume after an interruption
- **Safe**: `ON CONFLICT DO NOTHING` prevents duplicates