Commit Graph

20 Commits

ca2f7c5756 fix: Ensure last_completed_partition is saved on final migration state update
Problem: The final migration_state update (made when marking the migration as
complete) was not passing the last_partition parameter, so the last completed
partition was being lost from the migration_state table. If the migration was
interrupted at any point, resume would lose the partition tracking.

Solution:
1. Track last_processed_partition throughout the migration loop
2. Update it when each partition completes
3. Pass it to the final _update_migration_state() call when marking the migration as complete

Additional fix:
- Use the correct postgres_pk column when querying the MAX() ID for the final state update
- This ensures we get the correct last ID even for tables with non-standard PK names
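
A minimal sketch of the corrected final update; postgres_pk, is_final and
_update_migration_state are named in this log, while table_config, pg_conn and
the surrounding variables are assumptions:

```python
# Sketch only: table_config and pg_conn are assumed shapes, not the
# project's actual objects.
pk = table_config.postgres_pk                      # "id" or "id_elab_data"
cur = pg_conn.connection.cursor()
cur.execute(f"SELECT MAX({pk}) FROM {table_config.postgres_table}")
final_last_id = cur.fetchone()[0]

self._update_migration_state(
    last_id=final_last_id,
    last_partition=last_processed_partition,  # tracked through the loop
    is_final=True,                            # marks status="completed"
)
```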

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-26 18:24:28 +01:00
1430ef206f fix: Ensure complete node consolidation by ordering MySQL query by consolidation key
Root cause: Nodes 1-11 had IDs in the 132M+ range while nodes 12-22 had IDs in
the 298-308 range, so keyset pagination by ID fetched them thousands of batches
apart. They therefore arrived as separate groups and were never unified into a
single consolidated row.

Solution: Order MySQL query by (UnitName, ToolNameID, EventDate, EventTime) instead
of by ID. This guarantees all rows for the same consolidation key arrive together,
ensuring they are grouped and consolidated into a single row with JSONB measurements
keyed by node number.

Changes:
- fetch_consolidation_groups_from_partition(): Changed from keyset pagination by ID
  to ORDER BY on the consolidation key. Simplified the grouping logic, since
  ORDER BY already guarantees that consecutive rows share the same key.
- full_migration.py: Add cleanup of partial partitions on resume. When resuming and a
  partition was started but not completed, delete its incomplete data before
  re-processing to avoid duplicates. Also recalculate total_rows_migrated from actual
  database count.
- config.py: Add postgres_pk field to TABLE_CONFIGS to specify correct primary key
  column names in PostgreSQL (id vs id_elab_data).
- Cleanup: Remove temporary test scripts used during debugging

Performance note: ordering by the consolidation key needs a matching index to be
fast. The index on (UnitName, ToolNameID, EventDate, EventTime) was created with
ALGORITHM=INPLACE, LOCK=NONE to avoid blocking reads.
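
As a sketch, the index DDL and query shape implied by this commit might look
like the following; fetch_ordered_rows is a hypothetical helper and the index
name is illustrative:

```python
# Sketch inferred from this commit's description, not the actual module.
ADD_INDEX = """
ALTER TABLE ELABDATADISP
  ADD INDEX ix_consolidation_key (UnitName, ToolNameID, EventDate, EventTime),
  ALGORITHM=INPLACE, LOCK=NONE
"""

FETCH_PARTITION_ORDERED = """
SELECT * FROM ELABDATADISP PARTITION ({partition})
ORDER BY UnitName, ToolNameID, EventDate, EventTime
"""

def fetch_ordered_rows(mysql_conn, partition):
    # Partition names cannot be bound as query parameters, hence format().
    cur = mysql_conn.cursor(dictionary=True)
    cur.execute(FETCH_PARTITION_ORDERED.format(partition=partition))
    yield from cur
    cur.close()
```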

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-26 18:22:23 +01:00
6ca97f0ba4 fix: Only update last_completed_partition when partition is fully processed
Previously, last_completed_partition was updated during batch flushes while
the partition was still being processed. This caused resume to skip partitions
that were only partially completed.

Now, last_completed_partition is only updated AFTER all consolidation groups
in a partition have been processed and the final buffer flush is complete.
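
A sketch of the corrected sequencing; the helper names are illustrative, and
only the "update state after the final flush" ordering comes from this commit:

```python
# Intermediate flushes no longer touch last_completed_partition.
for group in fetch_consolidation_groups_from_partition(partition):
    buffer.append(consolidate(group))
    if len(buffer) >= insert_buffer_size:
        flush_buffer(buffer)              # may run many times per partition

flush_buffer(buffer)                      # final flush for this partition
update_last_completed_partition(partition)  # only now is it safe to record
```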

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-26 00:49:14 +01:00
9ef65995d4 feat: Add granular resume within partitions using last inserted ID
Problem: If migration was interrupted in the middle of processing a partition
(e.g., at row 100k of 500k), resume would re-process all 100k rows, causing
duplicate insertions and wasted time.

Solution:
1. Modified fetch_consolidation_groups_from_partition() to accept start_id parameter
2. When resuming within the same partition, query the last inserted ID from
   migration_state.last_migrated_id
3. Use keyset pagination starting from (id > last_id) to skip already-processed rows
4. Added logic to detect when we're resuming within the same partition vs resuming
   from a new partition

Flow:
- If last_completed_partition < current_partition: start from beginning of partition
- If last_completed_partition == current_partition: start from last_migrated_id
- If last_completed_partition > current_partition: skip to next uncompleted partition
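
A sketch of that decision flow; fetch_consolidation_groups_from_partition and
last_migrated_id are named in this log, _get_last_migrated_id and process() are
hypothetical, and partition names are assumed to sort in processing order:

```python
last_done = self._get_last_completed_partition()
last_id = self._get_last_migrated_id()

for partition in partitions:
    if last_done is not None and partition < last_done:
        continue                                   # already completed, skip
    start_id = last_id if partition == last_done else None
    for group in fetch_consolidation_groups_from_partition(
            partition, start_id=start_id):         # keyset: id > start_id
        process(group)
```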

This ensures resume is granular:
- Won't re-insert already inserted rows within a partition
- Continues exactly from where it stopped
- Combines with existing partition tracking for complete accuracy

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-25 23:41:57 +01:00
e5c87d145f feat: Track last completed partition for accurate resume capability
Problem: Resume was re-processing all partitions from the beginning because
migration_state didn't track which partition was the last one completed.
This caused duplicate data insertion and wasted time.

Solution:
1. Added 'last_completed_partition' column to migration_state table
2. Created _get_last_completed_partition() method to retrieve saved state
3. Updated _update_migration_state() to accept and save last_partition parameter
4. Modified migration loop to:
   - Retrieve last_completed_partition on resume
   - Skip partitions that were already completed (partition <= last_completed_partition)
   - Update last_completed_partition after each partition finishes
   - Log which partitions are being skipped during resume

Now when resuming:
- Only processes partitions after the last completed one
- Avoids re-migrating already completed partitions
- Provides clear logging showing which partitions are skipped

For example, if the last completed partition was d5 when the migration was interrupted, resume will:
- Skip d0 through d5 (logging each skip)
- Continue with d6 onwards
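
A sketch of the new column and its accessor; the column name comes from this
commit, while the migration_state key column is an assumption:

```python
ADD_COLUMN = """
ALTER TABLE migration_state
  ADD COLUMN IF NOT EXISTS last_completed_partition TEXT
"""

def _get_last_completed_partition(self):
    cur = self.pg_conn.connection.cursor()
    cur.execute(
        "SELECT last_completed_partition FROM migration_state"
        " WHERE table_name = %s",
        (self.table_name,),
    )
    row = cur.fetchone()
    return row[0] if row else None
```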

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-25 23:30:37 +01:00
3532631f3f fix: Reduce INSERT buffer size and update state after every flush
Problems identified:
1. The buffer size of batch_size * 10 (100k rows) was too large, causing
   migration_state to go several minutes without an update on low-consolidation
   partitions
2. State updates only happened every 10 batches, so they did not reflect actual
   progress

Changes:
- Reduce insert_buffer_size from 10x to 5x batch_size (50k rows)
- Update migration_state after EVERY batch flush, not every 10 batches
- Add debug logging showing flush operations and total migrated count
- This provides better visibility into migration progress and checkpointing

For partitions with a low consolidation ratio (like d0 at 1.1x), this ensures
migration_state is updated far more frequently, supporting better resume
capability and giving visibility into actual progress.
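
A sketch of the revised flush path; only the 5x sizing and the
checkpoint-per-flush behaviour come from this commit, the rest is assumed:

```python
import logging

log = logging.getLogger(__name__)

batch_size = 10_000                   # configurable, illustrative value
insert_buffer_size = batch_size * 5   # was batch_size * 10

def flush(self):
    # Sketch of a method on the migrator; _insert_rows and the state
    # call's signature are assumptions.
    if not self.buffer:
        return
    self._insert_rows(self.buffer)
    self.total_migrated += len(self.buffer)
    log.debug("Flushed %d rows (total migrated: %d)",
              len(self.buffer), self.total_migrated)
    self.buffer.clear()
    self._update_migration_state(total=self.total_migrated)  # EVERY flush
```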

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-25 23:01:33 +01:00
dfc54cf867 perf: Batch INSERT statements to reduce database round-trips
When processing partitions with many small consolidation groups (low consolidation
ratio), the previous approach of inserting each group individually caused excessive
database round-trips.

Example from partition d0:
- 572k MySQL rows
- 514k unique consolidation keys (1.1x consolidation ratio)
- 514k separate INSERT statements = severe performance bottleneck

Changes:
- Accumulate consolidated rows in a buffer (size = batch_size * 10)
- Flush buffer to PostgreSQL when full or when partition is complete
- Reduces 514k INSERT statements to ~50 batches for d0
- Significant performance improvement expected (8-10x faster for low-consolidation partitions)

The progress tracker still counts MySQL source rows (before consolidation), so
the progress bar remains accurate.
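
A minimal sketch of the buffering, using psycopg2's execute_values; the target
column list is an assumption, only the buffer-and-flush pattern is from this
commit:

```python
from psycopg2.extras import execute_values

# Hypothetical target columns.
INSERT_SQL = ("INSERT INTO elabdatadisp "
              "(unit_name, tool_name_id, event_ts, measurements) VALUES %s")

buffer = []

def add_consolidated_row(row, cur):
    buffer.append(row)
    if len(buffer) >= insert_buffer_size:    # batch_size * 10
        execute_values(cur, INSERT_SQL, buffer)
        buffer.clear()
```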

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-25 22:53:20 +01:00
c30d77e24b Fix N+1 query problem - use single ordered query with Python grouping
CRITICAL FIX: The previous implementation did a GROUP BY to get the unique
keys, then a separate WHERE query for EACH group. With millions of groups,
this meant millions of separate MySQL queries = 12 bytes/sec = unusable.

New approach (single query):
- Fetch all rows from partition ordered by consolidation key
- Group them in Python as we iterate
- One query per LIMIT batch, not one per group
- ~100,000x faster than the N+1 approach

Query uses index efficiently: ORDER BY (UnitName, ToolNameID, EventDate, EventTime, NodeNum)
matches index prefix and keeps groups together for consolidation.
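
The grouping can be sketched with itertools.groupby, since the single ordered
query guarantees that equal keys arrive consecutively (helper name is
hypothetical):

```python
from itertools import groupby
from operator import itemgetter

KEY = itemgetter("UnitName", "ToolNameID", "EventDate", "EventTime")

def consolidation_groups(ordered_rows):
    # groupby only merges consecutive rows, which the ORDER BY guarantees;
    # ordered_rows must come from the single ordered query, not from
    # per-group lookups.
    for key, rows in groupby(ordered_rows, key=KEY):
        yield key, list(rows)
```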

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-25 22:32:41 +01:00
b6886293f6 Add detailed partition progress logging
The log shows:
- Current partition index and total ([X/Y])
- Partition name being processed
- Number of groups consolidated per partition after completion

This helps track migration progress when processing 18 partitions,
making it easier to identify slow partitions or issues.

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-25 22:10:43 +01:00
255fb1c520 Simplify resume logic for partition-based consolidation
With partition-based consolidation, resume is now simpler:
- No longer track last_migrated_id (not useful for partition iteration)
- Resume capability: if rows exist in the target table, the migration was interrupted
- Use total_rows_migrated count to calculate remaining work
- Update state every 10 consolidations instead of maintaining per-batch state

This aligns resume mechanism with the new partition-based architecture
where we process complete consolidation groups, not sequential ID ranges.

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-25 21:54:40 +01:00
bb27f749a0 Implement partition-based consolidation for ELABDATADISP
Changed consolidation strategy to leverage MySQL partitioning:
- Added get_table_partitions() to list all partitions
- Added fetch_consolidation_groups_from_partition() to read groups by consolidation key
- Each group (UnitName, ToolNameID, EventDate, EventTime) is fetched completely
- All nodes of same group are consolidated into single row with JSONB measurements
- Process partitions sequentially for predictable memory usage

Key benefits:
- Guaranteed complete consolidation (no fragmentation across batches)
- Deterministic behavior - same group always consolidated together
- Better memory efficiency with partition limits (100k groups per query)
- Clear audit trail of which partition each row came from

Tested with partition d3: 6960 input rows → 100 consolidated rows (69.6:1 ratio)
with groups containing 24-72 nodes each.
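
A sketch of the partition discovery; the helper name comes from this commit,
and the query is standard information_schema MySQL:

```python
LIST_PARTITIONS = """
SELECT PARTITION_NAME
FROM information_schema.PARTITIONS
WHERE TABLE_SCHEMA = %s
  AND TABLE_NAME = %s
  AND PARTITION_NAME IS NOT NULL
ORDER BY PARTITION_ORDINAL_POSITION
"""

def get_table_partitions(mysql_conn, schema, table):
    cur = mysql_conn.cursor()
    cur.execute(LIST_PARTITIONS, (schema, table))
    return [name for (name,) in cur.fetchall()]
```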

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-25 21:49:30 +01:00
a394de99ef Fix ELABDATADISP consolidation by consolidating across batches
Previously, consolidation happened per-batch, which meant if the same
(unit, tool, date, time) group spanned multiple batches, nodes would be
split into separate rows. For example, nodes 1-32 would be split into 4
separate rows instead of 1 consolidated row.

Now, we buffer rows with the same consolidation key and only consolidate
when we see a NEW consolidation key. This ensures all nodes of the same
group are consolidated together, regardless of batch boundaries.
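
A sketch of that buffering; consolidate() and emit() are hypothetical, and rows
with equal keys are assumed to arrive consecutively (the assumption a later
commit revisits):

```python
pending_key, pending_rows = None, []

for row in fetch_batches():                   # may span batch boundaries
    key = (row["UnitName"], row["ToolNameID"],
           row["EventDate"], row["EventTime"])
    if pending_rows and key != pending_key:
        emit(consolidate(pending_key, pending_rows))  # key changed: flush
        pending_rows = []
    pending_key = key
    pending_rows.append(row)

if pending_rows:                              # last group after final batch
    emit(consolidate(pending_key, pending_rows))
```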

Results: Proper 25:1 consolidation ratio with all nodes grouped correctly.

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-25 20:23:31 +01:00
d3ada1ded2 fix: Mark migration as completed when migration finishes
The _update_migration_state() method was using logic:
  status = "in_progress" if last_id is not None else "completed"

This was incorrect because:
1. last_id is always set during periodic updates (to track resume point)
2. So status would always be "in_progress" even when migration finished
3. migration_completed_at would never be set

Solution: Add is_final parameter to explicitly mark when migration is
complete. During periodic updates, is_final=False (status="in_progress").
Only when called at the end, is_final=True (status="completed").

This ensures:
- migration_state.status = "completed" when done
- migration_state.migration_completed_at is set
- Proper tracking for knowing if migration is finished
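
Sketched as a signature change; only is_final and the two status values come
from this commit, _persist_state is a hypothetical helper:

```python
from datetime import datetime, timezone

def _update_migration_state(self, last_id=None, is_final=False):
    status = "completed" if is_final else "in_progress"
    completed_at = datetime.now(timezone.utc) if is_final else None
    self._persist_state(status=status, last_id=last_id,
                        completed_at=completed_at)
```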

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-23 20:41:46 +01:00
8d9e63081a chore: Add detailed logging for migration state update
Added logging to trace the final migration state update process:
- Log final count from PostgreSQL
- Log final last ID from table
- Log before and after _update_migration_state() call

This helps debug why migration_state might not be getting updated
when migration completes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-23 20:32:33 +01:00
26b3ccb06e fix: Ensure migration_state updates are committed to database
The _update_migration_state() method was using pg_conn.execute() which has
its own connection management. This could cause issues with transaction
handling when called at the end of the migration.

Changed to use explicit cursor with guaranteed commit:
- Use pg_conn.connection.cursor() to get a direct cursor
- Execute the INSERT ... ON CONFLICT query
- Explicitly call pg_conn.connection.commit()
- This matches the pattern used in other parts of the code

This ensures that final migration state (completed status, final counts)
are properly persisted to the database.
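
A sketch of the explicit-commit pattern; the cursor and commit calls are named
in this commit, while the migration_state column list is an assumption:

```python
UPSERT_STATE = """
INSERT INTO migration_state (table_name, status, total_rows_migrated)
VALUES (%s, %s, %s)
ON CONFLICT (table_name) DO UPDATE
SET status = EXCLUDED.status,
    total_rows_migrated = EXCLUDED.total_rows_migrated
"""

cursor = pg_conn.connection.cursor()   # direct cursor, not pg_conn.execute()
cursor.execute(UPSERT_STATE, (table_name, status, total_rows))
pg_conn.connection.commit()            # explicit, guaranteed commit
cursor.close()
```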

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-23 20:26:10 +01:00
1708969616 fix: Update migration state with final count when migration completes
When migration finishes, we need to update migration_state with:
1. The final actual row count from PostgreSQL
2. The final last_migrated_id (MAX(id) from the table)
3. Mark status as 'completed' (handled by _update_migration_state)

Previously, the final state update was missing, so migration_state
was left with stale data from the periodic updates.

Now _update_migration_state is called at the end to record the
authoritative final state.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-23 20:16:28 +01:00
7cb4783385 fix: Reduce expensive COUNT(*) queries to every 10 batches
The previous fix was too aggressive - calling get_row_count() on every batch
meant executing COUNT(*) on a 14M row table for each batch. With a typical
batch size of ~10k rows and a consolidation ratio of ~10:1, this meant:
- ~500-1000 batches total
- ~500-1000 COUNT(*) queries on a huge table = completely destroyed performance

New approach:
- Keep local accumulator for migrated count (fast)
- Update total_rows_migrated in the DB only every 10 batches (10x fewer COUNT(*) queries)
- Update last_migrated_id on every batch via UPDATE (fast, no COUNT)
- Do final COUNT(*) at end of migration for accurate total

This maintains accuracy while being performant. The local count is reliable
because we're tracking inserts in a single sequential migration.
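
A sketch of the cadence; get_row_count is named in this log, the other helpers
are illustrative:

```python
migrated_in_session = 0                       # cheap local accumulator

for batch_no, batch in enumerate(batches, start=1):
    migrated_in_session += insert_batch(batch)
    update_last_migrated_id(max_id(batch))    # fast UPDATE, every batch
    if batch_no % 10 == 0:
        update_total_rows(get_row_count())    # COUNT(*), every 10th batch

update_total_rows(get_row_count())            # authoritative final count
```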

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-23 16:10:40 +01:00
0cb4a0f71e fix: Update progress tracking to use MySQL row count instead of PostgreSQL count
The progress bar was appearing frozen because:
- Total was set to MySQL rows to process (111M)
- Progress was updated by PostgreSQL rows inserted (11M after consolidation)
- This created a 10:1 mismatch, making progress appear to crawl

Solution:
- Track progress based on MySQL rows processed (matches total)
- Use batch_size (MySQL rows) instead of inserted count (PostgreSQL rows)
- Change batch_max_id calculation to use original batch instead of transformed

This ensures the progress bar advances at a visible rate while still
maintaining accurate row count tracking from the database.
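
A sketch, assuming the progress bar is rich's (the initial commit mentions Rich
logging with progress tracking); the helper names are illustrative:

```python
from rich.progress import Progress

with Progress() as progress:
    task = progress.add_task("ELABDATADISP", total=mysql_total_rows)
    for batch in fetch_batches():
        insert_batch(transform(batch))            # ~10:1 consolidation
        # Advance by SOURCE rows so progress matches the MySQL total.
        progress.update(task, advance=len(batch))
```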

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-23 15:40:50 +01:00
0f217379ea fix: Use actual PostgreSQL row count for total_rows_migrated tracking
Replace session-level counting with direct table COUNT queries so that
total_rows_migrated always reflects the actual state of PostgreSQL. This fixes
the discrepancy where the counter only tracked rows from the current session
and did not account for earlier insertions or duplicates from failed resume
attempts.

Key improvements:
- Use get_row_count() after each batch to get authoritative total
- Preserve previous count on resume and accumulate across sessions
- Remove dependency on error-prone session-level counters
- Ensures migration_state.total_rows_migrated matches actual table row count

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-23 15:33:27 +01:00
62577d3200 feat: Add MySQL to PostgreSQL migration tool with JSONB transformation
Implement comprehensive migration solution with:
- Full and incremental migration modes
- JSONB schema transformation for RAWDATACOR and ELABDATADISP tables
- Native PostgreSQL partitioning (2014-2031)
- Optimized GIN indexes for JSONB queries
- Rich logging with progress tracking
- Complete benchmark system for MySQL vs PostgreSQL comparison
- CLI interface with multiple commands (setup, migrate, benchmark)
- Configuration management via .env file
- Error handling and retry logic
- Batch processing for performance (configurable batch size)

Database transformations:
- RAWDATACOR: 16 Val columns + units → single JSONB measurements
- ELABDATADISP: 25+ measurement fields → structured JSONB with categories
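
An illustrative sketch of the RAWDATACOR transformation; the column naming
beyond "16 Val columns + units" is an assumption:

```python
import json

def rawdatacor_measurements(row):
    # Collapse Val1..Val16 plus their units into one JSONB document;
    # the UM{i} unit-column name is hypothetical.
    measurements = {}
    for i in range(1, 17):
        value = row.get(f"Val{i}")
        if value is None:
            continue
        measurements[str(i)] = {"value": value, "unit": row.get(f"UM{i}")}
    return json.dumps(measurements)   # inserted into a JSONB column
```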

🤖 Generated with Claude Code

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
2025-12-10 19:57:11 +01:00