mysql2postgres

Author	SHA1	Message	Date
alex	ff0187b74a	debug: Add logging for key changes within batch grouping	2025-12-26 09:46:52 +01:00
alex	c9088d9144	fix: Merge consolidation groups with same key across batch boundaries Fix critical issue where consolidation groups with the same consolidation key (UnitName, ToolNameID, EventDate, EventTime) but arriving in different batches were being yielded separately instead of being merged. Now when a buffered group has the same key as the start of the next batch, they are prepended and consolidated together. If the key changes, the buffered group is yielded before processing the new key's rows. This fixes the issue where nodes 1-11 and 12-22 (with the same consolidation key) were being inserted as two separate rows instead of one consolidated row with all 22 nodes. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-26 09:16:21 +01:00
alex	6ca97f0ba4	fix: Only update last_completed_partition when partition is fully processed Previously, last_completed_partition was updated during batch flushes while the partition was still being processed. This caused resume to skip partitions that were only partially completed. Now, last_completed_partition is only updated AFTER all consolidation groups in a partition have been processed and the final buffer flush is complete. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-26 00:49:14 +01:00
alex	32f90fdd47	refactor: Remove legacy consolidation methods from MySQLConnector Remove unused fetch_all_rows() and fetch_rows_ordered_for_consolidation() methods. These were part of the old migration strategy before partition-based consolidation. The current implementation uses fetch_consolidation_groups_from_partition() which handles keyset pagination and consolidation group buffering more efficiently. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-26 00:20:46 +01:00
alex	d6564b7f9e	refactor: Improve logging for consolidation group tracking Enhanced debug logging to show: - Max ID for each yielded group (important for resume tracking) - Group size and consolidation key for each operation - Clear distinction between buffered and final groups The max ID is tracked because: - PostgreSQL stores MAX(id) per consolidated group for resume - This logging helps verify correct ID tracking - Assists debugging consolidation completeness No functional changes, improved observability. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-26 00:10:16 +01:00
alex	4277dd8d2c	fix: Yield all groups in final batch, not just last group Critical bug fix for missing nodes in consolidated groups. Problem: When a partition batch contained multiple consolidation groups, only the LAST group was being buffered/yielded, causing earlier groups to be lost. This happened when: 1. Batch < limit rows (final batch) 2. Multiple different consolidation keys present 3. First groups were yielded correctly 4. But FINAL group was only yielded if batch == limit 5. If batch < limit, final group was discarded Example from partition d10: - Fetch returns 22 rows with 2 groups: (nodes 1-11) and (nodes 12-22) - Old code: yield nodes 1-11 on key change, then didn't yield nodes 12-22 - Result: inserted row had only nodes 12-22 Fix: Detect final batch with len(rows) < limit, then yield ALL groups including the final one instead of buffering it. Changes: - Detect final batch early: is_final_batch = len(rows) < limit - If final batch: yield current_group even if no key change follows - If NOT final batch: buffer last group for continuity (original logic) Now all nodes from all groups are properly consolidated. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-26 00:03:45 +01:00
alex	3687f77911	debug: Add detailed logging to consolidation group buffering Added logging to track: - When groups are buffered at batch boundaries - Group consolidation keys and row counts - When buffered groups are resumed in next batch - Final batch group yields This will help diagnose why some nodes are being lost during consolidation (observed: nodes 1-11 missing from consolidated group, only nodes 12-22 present). 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-26 00:03:17 +01:00
alex	9ef65995d4	feat: Add granular resume within partitions using last inserted ID Problem: If migration was interrupted in the middle of processing a partition (e.g., at row 100k of 500k), resume would re-process all 100k rows, causing duplicate insertions and wasted time. Solution: 1. Modified fetch_consolidation_groups_from_partition() to accept start_id parameter 2. When resuming within the same partition, query the last inserted ID from migration_state.last_migrated_id 3. Use keyset pagination starting from (id > last_id) to skip already-processed rows 4. Added logic to detect when we're resuming within the same partition vs resuming from a new partition Flow: - If last_completed_partition < current_partition: start from beginning of partition - If last_completed_partition == current_partition: start from last_migrated_id - If last_completed_partition > current_partition: skip to next uncompleted partition This ensures resume is granular: - Won't re-insert already inserted rows within a partition - Continues exactly from where it stopped - Combines with existing partition tracking for complete accuracy 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 23:41:57 +01:00
alex	e5c87d145f	feat: Track last completed partition for accurate resume capability Problem: Resume was re-processing all partitions from the beginning because migration_state didn't track which partition was the last one completed. This caused duplicate data insertion and wasted time. Solution: 1. Added 'last_completed_partition' column to migration_state table 2. Created _get_last_completed_partition() method to retrieve saved state 3. Updated _update_migration_state() to accept and save last_partition parameter 4. Modified migration loop to: - Retrieve last_completed_partition on resume - Skip partitions that were already completed (partition <= last_completed_partition) - Update last_completed_partition after each partition finishes - Log which partitions are being skipped during resume Now when resuming: - Only processes partitions after the last completed one - Avoids re-migrating already completed partitions - Provides clear logging showing which partitions are skipped For example, if migration was at partition d5 when interrupted, resume will: - Skip d0 through d5 (logging each skip) - Continue with d6 onwards 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 23:30:37 +01:00
alex	3532631f3f	fix: Reduce INSERT buffer size and update state after every flush Problems identified: 1. Buffer size of batch_size * 10 (100k rows) was too large, causing migration_state to not update for several minutes on low-consolidation partitions 2. State updates only happened every 10 batches, not reflecting actual progress Changes: - Reduce insert_buffer_size from 10x to 5x batch_size (50k rows) - Update migration_state after EVERY batch flush, not every 10 batches - Add debug logging showing flush operations and total migrated count - This provides better visibility into migration progress and checkpointing For partitions with low consolidation ratio (like d0 with 1.1x), this ensures migration_state is updated more frequently, supporting better resume capability and providing visibility into actual progress. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 23:01:33 +01:00
alex	dfc54cf867	perf: Batch INSERT statements to reduce database round-trips When processing partitions with many small consolidation groups (low consolidation ratio), the previous approach of inserting each group individually caused excessive database round-trips. Example from partition d0: - 572k MySQL rows - 514k unique consolidation keys (1.1x consolidation ratio) - 514k separate INSERT statements = severe performance bottleneck Changes: - Accumulate consolidated rows in a buffer (size = batch_size * 10) - Flush buffer to PostgreSQL when full or when partition is complete - Reduces 514k INSERT statements to ~50 batches for d0 - Significant performance improvement expected (8-10x faster for low-consolidation partitions) The progress tracker still counts MySQL source rows (before consolidation), so the progress bar remains accurate. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 22:53:20 +01:00
alex	d513920788	fix: Buffer incomplete groups at batch boundaries for complete consolidation The consolidation grouping logic now properly handles rows with the same consolidation key (UnitName, ToolNameID, EventDate, EventTime) that span across multiple fetch batches. Key improvements: - Added buffering of incomplete groups at batch boundaries - When a batch is full (has exactly limit rows), the final group is buffered to be prepended to the next batch, ensuring complete group consolidation - When the final batch is reached (fewer than limit rows), all buffered and current groups are yielded This ensures that all nodes with the same consolidation key are grouped together in a single consolidated row, eliminating node fragmentation. Added comprehensive unit tests verifying: - Multi-node consolidation with batch boundaries - RAWDATACOR consolidation with multiple nodes - Groups that span batch boundaries are kept complete 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 22:36:15 +01:00
alex	c30d77e24b	Fix N+1 query problem - use single ordered query with Python grouping CRITICAL FIX: Previous implementation was doing GROUP BY to get unique keys, then a separate WHERE query for EACH group. With millions of groups, this meant millions of separate MySQL queries = 12 bytes/sec = unusable. New approach (single query): - Fetch all rows from partition ordered by consolidation key - Group them in Python as we iterate - One query per LIMIT batch, not one per group - ~100,000x faster than N+1 approach Query uses index efficiently: ORDER BY (UnitName, ToolNameID, EventDate, EventTime, NodeNum) matches index prefix and keeps groups together for consolidation. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 22:32:41 +01:00
alex	fe2d173b0f	Optimize consolidation fetching with GROUP BY and reduced limit Changed consolidation_group_limit from 100k to 10k for faster queries. Reverted to GROUP BY approach for getting consolidation keys: - Uses MySQL index efficiently: (UnitName, ToolNameID, NodeNum, EventDate, EventTime) - GROUP BY with NodeNum ensures we don't lose any combinations - Faster GROUP BY queries than large ORDER BY queries - Smaller LIMIT = faster pagination This matches the original optimization suggestion and should be faster. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 22:22:30 +01:00
alex	b6886293f6	Add detailed partition progress logging Log shows: - Current partition index and total ([X/Y]) - Partition name being processed - Number of groups consolidated per partition after completion This helps track migration progress when processing 18 partitions, making it easier to identify slow partitions or issues. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 22:10:43 +01:00
alex	255fb1c520	Simplify resume logic for partition-based consolidation With partition-based consolidation, resume is now simpler: - No longer track last_migrated_id (not useful for partition iteration) - Resume capability: if rows exist in target table, migration was interrupted - Use total_rows_migrated count to calculate remaining work - Update state every 10 consolidations instead of maintaining per-batch state This aligns resume mechanism with the new partition-based architecture where we process complete consolidation groups, not sequential ID ranges. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 21:54:40 +01:00
alex	bb27f749a0	Implement partition-based consolidation for ELABDATADISP Changed consolidation strategy to leverage MySQL partitioning: - Added get_table_partitions() to list all partitions - Added fetch_consolidation_groups_from_partition() to read groups by consolidation key - Each group (UnitName, ToolNameID, EventDate, EventTime) is fetched completely - All nodes of same group are consolidated into single row with JSONB measurements - Process partitions sequentially for predictable memory usage Key benefits: - Guaranteed complete consolidation (no fragmentation across batches) - Deterministic behavior - same group always consolidated together - Better memory efficiency with partition limits (100k groups per query) - Clear audit trail of which partition each row came from Tested with partition d3: 6960 input rows → 100 consolidated rows (69.6:1 ratio) with groups containing 24-72 nodes each. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 21:49:30 +01:00
alex	a394de99ef	Fix ELABDATADISP consolidation by consolidating across batches Previously, consolidation happened per-batch, which meant if the same (unit, tool, date, time) group spanned multiple batches, nodes would be split into separate rows. For example, nodes 1-32 would be split into 4 separate rows instead of 1 consolidated row. Now, we buffer rows with the same consolidation key and only consolidate when we see a NEW consolidation key. This ensures all nodes of the same group are consolidated together, regardless of batch boundaries. Results: Proper 25:1 consolidation ratio with all nodes grouped correctly. 🤖 Generated with Claude Code Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 20:23:31 +01:00
alex	9cc12abe11	fix: Order rows by consolidation key to keep related nodes together in batches When fetching rows for consolidation, the original keyset pagination only ordered by id, which caused nodes from the same (unit, tool, timestamp) to be split across multiple batches. This resulted in incomplete consolidation, with some nodes being missed. Solution: Order by consolidation columns in addition to id: - Primary: id (for keyset pagination) - Secondary: UnitName, ToolNameID, EventDate, EventTime, NodeNum This ensures all nodes with the same (unit, tool, timestamp) are grouped together in the same batch, allowing proper consolidation within the batch. Fixes: Nodes being lost during ELABDATADISP consolidation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 19:32:52 +01:00
alex	648bd98a09	chore: Add debug logging to ELABDATADISP consolidation Added logging to track which nodes are being consolidated and how many measurement categories each node has. This helps debug cases where data appears to be lost during consolidation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 19:27:34 +01:00
alex	72035bb1b5	fix: Convert MySQL Decimal values to float for JSON serialization in ELABDATADISP MySQL returns numeric values as Decimal objects, which are not JSON serializable. PostgreSQL JSONB requires proper JSON types. Added convert_value() helper in _build_measurement_for_elabdatadisp_node() to: - Convert Decimal → float - Convert str → float - Pass through other types unchanged This ensures all numeric values are JSON-serializable before insertion into the measurements JSONB column. Fixes: "Object of type Decimal is not JSON serializable" error 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 19:06:50 +01:00
alex	3c0a6f72b4	fix: Use correct ID column for ELABDATADISP in fetch_rows_ordered_for_consolidation() ELABDATADISP uses 'idElabData' as the primary key, while RAWDATACOR uses 'id'. Updated the fetch method to detect the correct column based on the table name: - RAWDATACOR: use 'id' column - ELABDATADISP: use 'idElabData' column This allows keyset pagination to work correctly for both tables. Fixes: "Unknown column 'id' in 'order clause'" error when fetching ELABDATADISP 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 19:04:51 +01:00
alex	e75cf4c545	fix: Support ELABDATADISP in fetch_rows_ordered_for_consolidation() The method was restricted to only RAWDATACOR, but the consolidation logic works for both tables. Updated the check to allow both: - RAWDATACOR - ELABDATADISP The keyset pagination (id-based WHERE clause) works identically for both tables, and consolidation happens in Python for both. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 19:02:00 +01:00
alex	5045c8bd86	fix: Add updated_at column back to ELABDATADISP table The updated_at column was removed from the schema but should be kept for consistency with the original table structure and to track when rows are modified. Changes: - Added updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP to table schema - Added updated_at to get_column_order() for elabdatadisp - Added updated_at to transform_elabdatadisp_row() output This maintains backward compatibility while still consolidating node_num, state, and calc_err into the measurements JSONB. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 18:46:03 +01:00
alex	42c0d9cdaf	chore: Update column order for ELABDATADISP to exclude node/state/calc_err Updated get_column_order() for elabdatadisp table to return only the columns that are now stored separately: - id_elab_data - unit_name - tool_name_id - event_timestamp - measurements (includes node_num, state, calc_err keyed by node) - created_at Removed: node_num, state, calc_err, updated_at (not used after consolidation) This matches the schema defined in schema_transformer.py where these fields are noted as being stored in the JSONB measurements column. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 18:42:19 +01:00
alex	693228c0da	feat: Implement node consolidation for ELABDATADISP table Add consolidation logic to ELABDATADISP similar to RAWDATACOR: - Group rows by (unit_name, tool_name_id, event_timestamp) - Consolidate multiple nodes with same timestamp into single row - Store node_num, state, calc_err in JSONB measurements keyed by node Changes: 1. Add _build_measurement_for_elabdatadisp_node() helper - Builds measurement object with state, calc_err, and measurement categories - Filters out empty categories to save space 2. Update transform_elabdatadisp_row() signature - Accept optional measurements parameter for consolidated rows - Build from single row if measurements not provided - Remove node_num, state, calc_err from returned columns (now in JSONB) - Keep only: id_elab_data, unit_name, tool_name_id, event_timestamp, measurements, created_at 3. Add consolidate_elabdatadisp_batch() method - Group rows by consolidation key - Build consolidated measurements with node numbers as keys - Use MAX(idElabData) for checkpoint tracking (resume capability) - Use MIN(idElabData) as template for other fields 4. Update transform_batch() to support ELABDATADISP consolidation - Check consolidate flag for both tables - Call consolidate_elabdatadisp_batch() when needed Result: ELABDATADISP now consolidates ~5-10:1 like RAWDATACOR, with all node data (node_num, state, calc_err, measurements) keyed by node number in JSONB. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-25 18:41:54 +01:00
alex	4d72d2a42e	chore: Add validation queries for default timestamp records Added queries to identify and sample records with default timestamp (1970-01-01 00:00:00) which resulted from invalid MySQL dates during migration. These records need date recovery from the MySQL source. Queries: - Count records with default timestamp in both tables - Sample first 10 records from rawdatacor with default timestamp These will help quantify the scope of date recovery work needed. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 20:47:01 +01:00
alex	d3ada1ded2	fix: Mark migration as completed when migration finishes The _update_migration_state() method was using logic: status = "in_progress" if last_id is not None else "completed" This was incorrect because: 1. last_id is always set during periodic updates (to track resume point) 2. So status would always be "in_progress" even when migration finished 3. migration_completed_at would never be set Solution: Add is_final parameter to explicitly mark when migration is complete. During periodic updates, is_final=False (status="in_progress"). Only when called at the end, is_final=True (status="completed"). This ensures: - migration_state.status = "completed" when done - migration_state.migration_completed_at is set - Proper tracking for knowing if migration is finished 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 20:41:46 +01:00
alex	8d9e63081a	chore: Add detailed logging for migration state update Added logging to trace the final migration state update process: - Log final count from PostgreSQL - Log final last ID from table - Log before and after _update_migration_state() call This helps debug why migration_state might not be getting updated when migration completes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 20:32:33 +01:00
alex	26b3ccb06e	fix: Ensure migration_state updates are committed to database The _update_migration_state() method was using pg_conn.execute() which has its own connection management. This could cause issues with transaction handling when called at end of migration. Changed to use explicit cursor with guaranteed commit: - Use pg_conn.connection.cursor() to get a direct cursor - Execute the INSERT ... ON CONFLICT query - Explicitly call pg_conn.connection.commit() - This matches the pattern used in other parts of the code This ensures that final migration state (completed status, final counts) are properly persisted to the database. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 20:26:10 +01:00
alex	1708969616	fix: Update migration state with final count when migration completes When migration finishes, we need to update migration_state with: 1. The final actual row count from PostgreSQL 2. The final last_migrated_id (MAX(id) from the table) 3. Mark status as 'completed' (handled by _update_migration_state) Previously, the final state update was missing, so migration_state was left with stale data from the periodic updates. Now _update_migration_state is called at the end to record the authoritative final state. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 20:16:28 +01:00
alex	0461bb3b44	fix: Handle invalid MySQL dates (0000-00-00) gracefully MySQL can contain invalid/zero dates like '0000-00-00' which cannot be parsed with strptime. These should be treated as NULL and converted to the default timestamp (1970-01-01 00:00:00). Changes to _convert_date(): - Check for '0000-00-00' and invalid date strings - Wrap strptime in try/except to catch ValueError - Return None for invalid dates instead of crashing - Updated callers to check for None and use default timestamp This allows the migration to continue even when encountering invalid historical dates in the MySQL database. Fixes: "time data '0000-00-00' does not match format '%Y-%m-%d'" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 19:06:38 +01:00
alex	4f4ba6af51	fix: Import date type explicitly to fix isinstance checks When we import datetime from the datetime module, we get the datetime class, not the module. This caused isinstance() checks to fail when checking against datetime.date (which doesn't exist when datetime is a class). Solution: Import date explicitly from datetime module and use it in isinstance checks. Order matters - check datetime before date since datetime is a subclass of date. Fixes: "isinstance() arg 2 must be a type, a tuple of types, or a union" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 18:56:12 +01:00
alex	eb315c90ff	fix: Handle date conversion for string dates in data transformer When resuming migration, EventDate may be a string (from PostgreSQL queries) instead of a datetime.date object (from MySQL). The combine() function expects a datetime.date object, so we now convert strings to dates before combining with time. Added _convert_date() helper similar to _convert_time() that handles: - str: Parse from "YYYY-MM-DD" format - datetime.date: Return as-is - datetime.datetime: Extract date component Fixes error: "combine() argument 1 must be datetime.date, not str" 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 18:52:42 +01:00
alex	262edd0ed2	chore: Revert throughput reporting feature from progress tracker The Rich progress bar has complexities with live mode that make it difficult to get visual feedback working correctly. Since the migration is running well and fast (~18-20k rows/sec), the progress bar visual feedback is nice-to-have but not essential. Focus on what matters: the migration completing correctly. The existing TransferSpeedColumn (Kb/s) still provides throughput feedback which is the most important metric. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 16:47:10 +01:00
alex	678cd22c89	fix: Use print_status() for throughput reporting in progress tracker The print_status() method properly handles printing with the live progress bar, whereas direct .print() calls don't work correctly with Progress in live mode. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 16:34:47 +01:00
alex	38b359a72d	feat: Add periodic throughput reporting to progress tracker For very large migrations (111M rows), the progress bar can appear frozen when showing percentage-based progress on 60M+ remaining rows. Even at 20k rows/sec, progress moves slowly on screen. Solution: Print periodic throughput updates every 1M rows processed. Shows: - Actual count processed and total - Current throughput in rows/sec - Elapsed time in hours This gives users visual feedback that migration is actively processing without needing to wait for percentage to visibly change. Example output: Progress: 5,000,000/111,000,000 items (18,500 items/sec, 4.2h elapsed) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 16:31:12 +01:00
alex	7cb4783385	fix: Reduce expensive COUNT() queries to every 10 batches The previous fix was too aggressive - calling get_row_count() on every batch meant executing COUNT() on a 14M row table for each batch. With a typical batch size of ~10k rows and consolidation ratio of ~10:1, this meant: - ~500-1000 batches total - ~500k COUNT() queries on a huge table = completely destroyed performance New approach: - Keep local accumulator for migrated count (fast) - Update total_rows_migrated to DB only every 10 batches (reduces COUNT() 50x) - Update last_migrated_id on every batch via UPDATE (fast, no COUNT) - Do final COUNT(*) at end of migration for accurate total This maintains accuracy while being performant. The local count is reliable because we're tracking inserts in a single sequential migration. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 16:10:40 +01:00
alex	0cb4a0f71e	fix: Update progress tracking to use MySQL row count instead of PostgreSQL count The progress bar was appearing frozen because: - Total was set to MySQL rows to process (111M) - Progress was updated by PostgreSQL rows inserted (11M after consolidation) - This created a 10:1 mismatch, making progress appear to crawl Solution: - Track progress based on MySQL rows processed (matches total) - Use batch_size (MySQL rows) instead of inserted count (PostgreSQL rows) - Change batch_max_id calculation to use original batch instead of transformed This ensures the progress bar advances at a visible rate while still maintaining accurate row count tracking from the database. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 15:40:50 +01:00
alex	0f217379ea	fix: Use actual PostgreSQL row count for total_rows_migrated tracking Replace session-level counting with direct table COUNT queries to ensure total_rows_migrated always reflects actual reality in PostgreSQL. This fixes the discrepancy where the counter was only tracking rows from the current session and didn't account for earlier insertions or duplicates from failed resume attempts. Key improvements: - Use get_row_count() after each batch to get authoritative total - Preserve previous count on resume and accumulate across sessions - Remove dependency on error-prone session-level counters - Ensures migration_state.total_rows_migrated matches actual table row count 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>	2025-12-23 15:33:27 +01:00
alex	b09cfcf9df	fix: Add timeout settings and retry logic to MySQL connector Configuration improvements: - Set read_timeout=300 (5 minutes) to handle long queries - Set write_timeout=300 (5 minutes) for writes - Set max_allowed_packet=64MB to handle larger data transfers Retry logic: - Added retry mechanism with max 3 retries on fetch failure - Auto-reconnect on connection loss before retry - Better error messages showing retry attempts This fixes the 'connection is lost' error that occurs during long-running migrations by: 1. Giving MySQL queries more time to complete 2. Allowing larger packet sizes for bulk data 3. Automatically recovering from connection drops Fixes: 'Connection is lost' error during full migration	2025-12-21 09:53:34 +01:00
alex	821cda850e	fix: Change from COPY to parameterized INSERT for batch inserts Replace cursor.copy() with cursor.executemany() for more reliable batch inserts in PostgreSQL. The COPY method has issues with format and data encoding in psycopg3. Changes: - Use executemany() with parameterized INSERT statements - Let psycopg handle parameter escaping and encoding - Convert JSONB dicts to JSON strings automatically - More compatible with various data types This ensures that data is actually being inserted into PostgreSQL during migration, fixing the issue where data wasn't appearing in the database after migration completed. Fixes: Data not being persisted in PostgreSQL during migration	2025-12-10 20:48:20 +01:00
alex	e2377d4191	fix: Add explicit commit/rollback in PostgreSQL context manager exit - On successful execution (no exception): explicitly commit before closing - On exception: explicitly rollback before closing - Add try-except to handle commit/rollback failures gracefully This ensures that all inserted data is committed to the database when the context manager exits. Previously, commits were only done per-batch in insert_batch(), but the final context exit wasn't ensuring a final commit. Fixes: Data not appearing in PostgreSQL after migration completes	2025-12-10 20:39:04 +01:00
alex	e381618255	fix: Support both uppercase and lowercase table names in TABLE_CONFIGS - TABLE_CONFIGS now accepts both 'RAWDATACOR' and 'rawdatacor' as keys - TABLE_CONFIGS now accepts both 'ELABDATADISP' and 'elabdatadisp' as keys - Reuse same config dict for both cases to avoid duplication This allows FullMigrator to work correctly when initialized with uppercase table names from the CLI while DataTransformer works with lowercase names. Fixes: 'Unknown table: RAWDATACOR' error during migration	2025-12-10 20:28:19 +01:00
alex	de6bde17c9	feat: Add sequences for auto-incrementing IDs - Create rawdatacor_id_seq for auto-increment of id column - Create elabdatadisp_id_seq for auto-increment of id_elab_data column - Both sequences use DEFAULT nextval() to auto-generate IDs on insert This replaces PRIMARY KEY functionality since PostgreSQL doesn't support PRIMARY KEY on partitioned tables with expression-based ranges. IDs are now auto-incremented without primary key constraint. Tested: schema creation works correctly with sequences	2025-12-10 20:20:52 +01:00
alex	2834f8b578	fix: Remove unsupported constraints from partitioned tables PostgreSQL doesn't support PRIMARY KEY or UNIQUE constraints on partitioned tables when using RANGE partitioning on expressions (like EXTRACT(YEAR FROM event_date)). Changed: - RAWDATACOR: removed PRIMARY KEY (id, event_date) and UNIQUE constraint - ELABDATADISP: removed PRIMARY KEY (id_elab_data, event_date) and UNIQUE constraint - Tables now have no constraints except NOT NULL on required columns This is a PostgreSQL limitation with partitioned tables. Constraints can be added per-partition if needed, but for simplicity we rely on application-level validation. Fixes: 'vincolo PRIMARY KEY non supportato con una definizione di chiave di partizione' (PostgreSQL's Italian-locale message for 'unsupported PRIMARY KEY constraint with partition key definition')	2025-12-10 20:18:20 +01:00
alex	410b253808	fix: Update Pydantic v2 configuration for .env loading - Fix ConfigDict model_config for Pydantic v2.12+ compatibility - Add env_file and env_file_encoding to all config classes - Each config class now properly loads from .env with correct prefix Fixes: ValidationError when loading settings from .env file CLI now works correctly with 'uv run python main.py'	2025-12-10 20:11:12 +01:00
alex	9b18db029b	docs: Add quick navigation guide (START_HERE.md)	2025-12-10 20:00:50 +01:00
alex	8e705e33da	docs: Add detailed example workflow	2025-12-10 19:59:22 +01:00
alex	38c6b4c6d8	docs: Add implementation summary	2025-12-10 19:58:49 +01:00
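
Several of the commits above (a394de99ef, d513920788, 4277dd8d2c, c9088d9144) deal with the same problem: rows that share a consolidation key (UnitName, ToolNameID, EventDate, EventTime) can arrive in different fetch batches and must still end up in one group. The following is a minimal sketch of that idea, not the repository's code. It assumes rows are dictionaries already ordered so that equal keys are adjacent, and the names `consolidation_key` and `group_across_batches` are hypothetical. It also simplifies the described logic by carrying one buffer across all batches instead of checking `len(rows) < limit`.

```python
from typing import Any, Iterable, Iterator

Row = dict[str, Any]

# Consolidation key columns as named in the commit messages.
KEY_COLUMNS = ("UnitName", "ToolNameID", "EventDate", "EventTime")


def consolidation_key(row: Row) -> tuple:
    """Return the tuple that identifies a consolidation group (hypothetical helper)."""
    return tuple(row[col] for col in KEY_COLUMNS)


def group_across_batches(batches: Iterable[list[Row]]) -> Iterator[list[Row]]:
    """Yield complete groups of rows sharing a key, even across batch boundaries.

    Assumes rows with the same key are adjacent in the overall stream.
    """
    buffered: list[Row] = []
    for batch in batches:
        for row in batch:
            # A key change means the buffered group is complete: yield it.
            if buffered and consolidation_key(row) != consolidation_key(buffered[0]):
                yield buffered
                buffered = []
            buffered.append(row)
        # At the end of a batch the last group stays buffered, because the
        # next batch may start with more rows for the same key.
    # After the final batch, whatever is still buffered is a complete group.
    if buffered:
        yield buffered
```

With 22 nodes of one key split into two batches of 11 rows, this yields a single group of 22 rows rather than two groups of 11, which is the outcome the fix in c9088d9144 describes.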

52 Commits