Changed consolidation strategy to leverage MySQL partitioning:
- Added get_table_partitions() to list all partitions
- Added fetch_consolidation_groups_from_partition() to read groups by consolidation key
- Each group (UnitName, ToolNameID, EventDate, EventTime) is fetched completely
- All nodes of same group are consolidated into single row with JSONB measurements
- Process partitions sequentially for predictable memory usage
Key benefits:
- Guaranteed complete consolidation (no fragmentation across batches)
- Deterministic behavior - same group always consolidated together
- Better memory efficiency with partition limits (100k groups per query)
- Clear audit trail of which partition each row came from
Tested with partition d3: 6960 input rows → 100 consolidated rows (69.6:1 ratio)
with groups containing 24-72 nodes each.
🤖 Generated with Claude Code
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Previously, consolidation happened per-batch, which meant if the same
(unit, tool, date, time) group spanned multiple batches, nodes would be
split into separate rows. For example, nodes 1-32 would be split into 4
separate rows instead of 1 consolidated row.
Now, we buffer rows with the same consolidation key and only consolidate
when we see a NEW consolidation key. This ensures all nodes of the same
group are consolidated together, regardless of batch boundaries.
Results: Proper 25:1 consolidation ratio with all nodes grouped correctly.
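
The buffering described above can be sketched roughly as follows. This is a minimal illustration, not the migrator's actual code: the function names are hypothetical, and it assumes rows arrive ordered so that rows with the same consolidation key are adjacent.

```python
from typing import Any, Iterable, Iterator

Row = dict[str, Any]
KEY_COLUMNS = ("UnitName", "ToolNameID", "EventDate", "EventTime")


def consolidation_key(row: Row) -> tuple:
    """Return the (unit, tool, date, time) tuple that identifies a group."""
    return tuple(row[col] for col in KEY_COLUMNS)


def iter_complete_groups(batches: Iterable[list[Row]]) -> Iterator[list[Row]]:
    """Yield a group only once a row with a different key has been seen."""
    buffer: list[Row] = []
    current_key = None
    for batch in batches:
        for row in batch:
            key = consolidation_key(row)
            if buffer and key != current_key:
                yield buffer  # previous group is complete
                buffer = []
            current_key = key
            buffer.append(row)
    if buffer:
        yield buffer  # flush the last group at end of input
```

Because the buffer survives across batches, a group that starts at the end of one batch and continues in the next is still emitted as a single unit.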
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
When fetching rows for consolidation, the original keyset pagination only
ordered by id, which caused nodes from the same (unit, tool, timestamp) to
be split across multiple batches. This resulted in incomplete consolidation,
with some nodes being missed.
Solution: Order by consolidation columns in addition to id:
- Primary: id (for keyset pagination)
- Secondary: UnitName, ToolNameID, EventDate, EventTime, NodeNum
This ensures all nodes with the same (unit, tool, timestamp) are grouped
together in the same batch, allowing proper consolidation within the batch.
Fixes: Nodes being lost during ELABDATADISP consolidation
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Added logging to track which nodes are being consolidated and how many
measurement categories each node has. This helps debug cases where data
appears to be lost during consolidation.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
MySQL returns numeric values as Decimal objects, which are not JSON serializable.
PostgreSQL JSONB requires proper JSON types.
Added convert_value() helper in _build_measurement_for_elabdatadisp_node() to:
- Convert Decimal → float
- Convert str → float
- Pass through other types unchanged
This ensures all numeric values are JSON-serializable before insertion into
the measurements JSONB column.
Fixes: "Object of type Decimal is not JSON serializable" error
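
A minimal sketch of such a helper is shown below. It is written as a standalone function for illustration; the fallback for non-numeric strings is an assumption and is not described in this commit.

```python
import json
from decimal import Decimal
from typing import Any


def convert_value(value: Any) -> Any:
    """Make a MySQL value safe to embed in a JSON document."""
    if isinstance(value, Decimal):
        return float(value)
    if isinstance(value, str):
        try:
            return float(value)
        except ValueError:
            return value  # assumption: keep non-numeric strings unchanged
    return value  # None, int, float, bool pass through


measurement = {"value": convert_value(Decimal("1.5")), "raw": convert_value("2.25")}
payload = json.dumps(measurement)  # no longer raises TypeError
```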
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
ELABDATADISP uses 'idElabData' as the primary key, while RAWDATACOR uses 'id'.
Updated the fetch method to detect the correct column based on the table name:
- RAWDATACOR: use 'id' column
- ELABDATADISP: use 'idElabData' column
This allows keyset pagination to work correctly for both tables.
Fixes: "Unknown column 'id' in 'order clause'" error when fetching ELABDATADISP
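
The column selection could look roughly like this. The mapping and the function name are hypothetical, and the query shape is an assumption about how the keyset pagination is built.

```python
PRIMARY_KEY_COLUMNS = {
    "RAWDATACOR": "id",
    "ELABDATADISP": "idElabData",
}


def build_keyset_query(table: str, batch_size: int) -> str:
    """Build a keyset-paginated SELECT using the table's own key column."""
    # Raises KeyError for unknown tables, so only known names are interpolated.
    id_column = PRIMARY_KEY_COLUMNS[table.upper()]
    return (
        f"SELECT * FROM `{table}` "
        f"WHERE `{id_column}` > %s "
        f"ORDER BY `{id_column}` ASC "
        f"LIMIT {int(batch_size)}"
    )
```

The last seen key value is passed as the single query parameter, so each batch resumes right after the previous one.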
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
The method was restricted to only RAWDATACOR, but the consolidation logic
works for both tables. Updated the check to allow both:
- RAWDATACOR
- ELABDATADISP
The keyset pagination (id-based WHERE clause) works identically for both
tables, and consolidation happens in Python for both.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
The updated_at column was removed from the schema but should be kept for
consistency with the original table structure and to track when rows are
modified.
Changes:
- Added updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP to table schema
- Added updated_at to get_column_order() for elabdatadisp
- Added updated_at to transform_elabdatadisp_row() output
This maintains backward compatibility while still consolidating node_num,
state, and calc_err into the measurements JSONB.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Updated get_column_order() for elabdatadisp table to return only the
columns that are now stored separately:
- id_elab_data
- unit_name
- tool_name_id
- event_timestamp
- measurements (includes node_num, state, calc_err keyed by node)
- created_at
Removed: node_num, state, calc_err, updated_at (not used after consolidation)
This matches the schema defined in schema_transformer.py where these fields
are noted as being stored in the JSONB measurements column.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Add consolidation logic to ELABDATADISP similar to RAWDATACOR:
- Group rows by (unit_name, tool_name_id, event_timestamp)
- Consolidate multiple nodes with same timestamp into single row
- Store node_num, state, calc_err in JSONB measurements keyed by node
Changes:
1. Add _build_measurement_for_elabdatadisp_node() helper
- Builds measurement object with state, calc_err, and measurement categories
- Filters out empty categories to save space
2. Update transform_elabdatadisp_row() signature
- Accept optional measurements parameter for consolidated rows
- Build from single row if measurements not provided
- Remove node_num, state, calc_err from returned columns (now in JSONB)
- Keep only: id_elab_data, unit_name, tool_name_id, event_timestamp, measurements, created_at
3. Add consolidate_elabdatadisp_batch() method
- Group rows by consolidation key
- Build consolidated measurements with node numbers as keys
- Use MAX(idElabData) for checkpoint tracking (resume capability)
- Use MIN(idElabData) as template for other fields
4. Update transform_batch() to support ELABDATADISP consolidation
- Check consolidate flag for both tables
- Call consolidate_elabdatadisp_batch() when needed
Result: ELABDATADISP now consolidates ~5-10:1 like RAWDATACOR,
with all node data (node_num, state, calc_err, measurements) keyed
by node number in JSONB.
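
As an illustration of the grouping and the MIN/MAX id rule, here is a simplified sketch. It is not the real consolidate_elabdatadisp_batch(): the "State" and "calc_err" source column names are assumptions, measurement categories are omitted, and EventDate/EventTime are assumed to already be date and time objects.

```python
from collections import defaultdict
from datetime import date, datetime, time
from typing import Any

Row = dict[str, Any]


def consolidate_rows(rows: list[Row]) -> list[Row]:
    """Merge all nodes sharing (unit, tool, date, time) into one output row."""
    groups: dict[tuple, list[Row]] = defaultdict(list)
    for row in rows:
        key = (row["UnitName"], row["ToolNameID"], row["EventDate"], row["EventTime"])
        groups[key].append(row)

    consolidated: list[Row] = []
    for members in groups.values():
        # The row with the smallest id supplies the shared fields.
        template = min(members, key=lambda r: r["idElabData"])
        # The largest id is kept so a resumed run continues after the whole group.
        checkpoint_id = max(r["idElabData"] for r in members)
        measurements = {
            # JSON object keys must be strings, so node numbers are stringified.
            str(r["NodeNum"]): {"state": r["State"], "calc_err": r["calc_err"]}
            for r in members
        }
        consolidated.append({
            "id_elab_data": checkpoint_id,
            "unit_name": template["UnitName"],
            "tool_name_id": template["ToolNameID"],
            "event_timestamp": datetime.combine(template["EventDate"], template["EventTime"]),
            "measurements": measurements,
        })
    return consolidated
```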
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Added queries to identify and sample records with default timestamp
(1970-01-01 00:00:00) which resulted from invalid MySQL dates during
migration. These records need date recovery from the MySQL source.
Queries:
- Count records with default timestamp in both tables
- Sample first 10 records from rawdatacor with default timestamp
These will help quantify the scope of date recovery work needed.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
The _update_migration_state() method derived the status with this logic:
status = "in_progress" if last_id is not None else "completed"
This was incorrect because:
1. last_id is always set during periodic updates (to track resume point)
2. So status would always be "in_progress" even when migration finished
3. migration_completed_at would never be set
Solution: Add is_final parameter to explicitly mark when migration is
complete. During periodic updates, is_final=False (status="in_progress").
Only when called at the end, is_final=True (status="completed").
This ensures:
- migration_state.status = "completed" when done
- migration_state.migration_completed_at is set
- Callers can reliably tell whether the migration has finished
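
The corrected status logic can be sketched as a pure function. The function name is hypothetical and the real method writes to migration_state directly; only the is_final handling is the point here.

```python
from datetime import datetime, timezone
from typing import Any, Optional


def build_state_update(
    last_id: Optional[int], total_rows: int, is_final: bool = False
) -> dict[str, Any]:
    """Compute the values to store in migration_state for one update."""
    return {
        "last_migrated_id": last_id,
        "total_rows_migrated": total_rows,
        # The status depends only on is_final, never on whether last_id is set.
        "status": "completed" if is_final else "in_progress",
        "migration_completed_at": datetime.now(timezone.utc) if is_final else None,
    }
```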
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Added logging to trace the final migration state update process:
- Log final count from PostgreSQL
- Log final last ID from table
- Log before and after _update_migration_state() call
This helps debug why migration_state might not be getting updated
when migration completes.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
The _update_migration_state() method was using pg_conn.execute() which has
its own connection management. This could cause issues with transaction
handling when called at end of migration.
Changed to use explicit cursor with guaranteed commit:
- Use pg_conn.connection.cursor() to get a direct cursor
- Execute the INSERT ... ON CONFLICT query
- Explicitly call pg_conn.connection.commit()
- This matches the pattern used in other parts of the code
This ensures that the final migration state (completed status, final counts)
is properly persisted to the database.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
When migration finishes, we need to update migration_state with:
1. The final actual row count from PostgreSQL
2. The final last_migrated_id (MAX(id) from the table)
3. Mark status as 'completed' (handled by _update_migration_state)
Previously, the final state update was missing, so migration_state
was left with stale data from the periodic updates.
Now _update_migration_state is called at the end to record the
authoritative final state.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
MySQL can contain invalid/zero dates like '0000-00-00' which cannot be
parsed with strptime. These should be treated as NULL and converted to
the default timestamp (1970-01-01 00:00:00).
Changes to _convert_date():
- Check for '0000-00-00' and invalid date strings
- Wrap strptime in try/except to catch ValueError
- Return None for invalid dates instead of crashing
- Updated callers to check for None and use default timestamp
This allows the migration to continue even when encountering invalid
historical dates in the MySQL database.
Fixes: "time data '0000-00-00' does not match format '%Y-%m-%d'"
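
A minimal sketch of the string-handling path, assuming "YYYY-MM-DD" input. The function names here are hypothetical stand-ins for _convert_date() and its callers.

```python
from datetime import date, datetime, time
from typing import Optional

DEFAULT_TIMESTAMP = datetime(1970, 1, 1, 0, 0, 0)


def parse_date_string(value: str) -> Optional[date]:
    """Parse 'YYYY-MM-DD'; return None for zero or otherwise invalid dates."""
    if value.startswith("0000-00-00"):
        return None
    try:
        return datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return None


def build_event_timestamp(date_string: str, event_time: time) -> datetime:
    """Combine date and time, falling back to the default timestamp."""
    parsed = parse_date_string(date_string)
    if parsed is None:
        return DEFAULT_TIMESTAMP
    return datetime.combine(parsed, event_time)
```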
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
When we import datetime from the datetime module, we get the datetime class,
not the module. This caused isinstance() checks to fail when checking against
datetime.date (which is then the datetime.date() method, not the date class).
Solution: Import date explicitly from datetime module and use it in isinstance
checks. Order matters - check datetime before date since datetime is a subclass
of date.
Fixes: "isinstance() arg 2 must be a type, a tuple of types, or a union"
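
The pitfall and the ordering rule can be reproduced in isolation. The helper names below are hypothetical and exist only for this demonstration.

```python
from datetime import date, datetime


def describe_temporal(value: object) -> str:
    """Classify a value, checking datetime before its parent class date."""
    if isinstance(value, datetime):
        return "datetime"
    if isinstance(value, date):
        return "date"
    return "other"


def is_usable_as_type(candidate: object) -> bool:
    """Return True if candidate is accepted as isinstance()'s second argument."""
    try:
        isinstance(None, candidate)
    except TypeError:
        return False
    return True


# With `from datetime import datetime`, datetime.date is a method, not a type.
method_is_type = is_usable_as_type(datetime.date)  # False
class_is_type = is_usable_as_type(date)            # True
```

If the date check came first, every datetime value would also match it and lose its time component.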
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
When resuming migration, EventDate may be a string (from PostgreSQL queries)
instead of a datetime.date object (from MySQL). The combine() function expects
a datetime.date object, so we now convert strings to dates before combining
with time.
Added _convert_date() helper similar to _convert_time() that handles:
- str: Parse from "YYYY-MM-DD" format
- datetime.date: Return as-is
- datetime.datetime: Extract date component
Fixes error: "combine() argument 1 must be datetime.date, not str"
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
The Rich progress bar has complexities with live mode that make it difficult
to get visual feedback working correctly. Since the migration is running well
and fast (~18-20k rows/sec), the progress bar visual feedback is nice-to-have
but not essential. Focus on what matters: the migration completing correctly.
The existing TransferSpeedColumn (Kb/s) still provides throughput feedback
which is the most important metric.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
The print_status() method properly handles printing with the live progress
bar, whereas direct .print() calls don't work correctly with Progress in
live mode.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
For very large migrations (111M rows), the progress bar can appear frozen
when showing percentage-based progress on 60M+ remaining rows. Even at
20k rows/sec, progress moves slowly on screen.
Solution: Print periodic throughput updates every 1M rows processed.
Shows:
- Actual count processed and total
- Current throughput in rows/sec
- Elapsed time in hours
This gives users visual feedback that migration is actively processing
without needing to wait for percentage to visibly change.
Example output:
Progress: 5,000,000/111,000,000 items (18,500 items/sec, 4.2h elapsed)
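
A line in this format could be produced by a small formatter like the one below. The function name is hypothetical, and it reports average throughput since start, which may differ from how the migrator computes its rate.

```python
def format_progress(processed: int, total: int, elapsed_seconds: float) -> str:
    """Render a one-line throughput update for the console."""
    rate = processed / elapsed_seconds if elapsed_seconds > 0 else 0.0
    hours = elapsed_seconds / 3600
    return (
        f"Progress: {processed:,}/{total:,} items "
        f"({rate:,.0f} items/sec, {hours:.1f}h elapsed)"
    )
```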
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
The previous fix was too aggressive - calling get_row_count() on every batch
meant executing COUNT(*) on a 14M row table for each batch. With a typical
batch size of ~10k rows and consolidation ratio of ~10:1, this meant:
- ~500-1000 batches total
- one COUNT(*) query per batch on a huge table, which destroyed performance
New approach:
- Keep local accumulator for migrated count (fast)
- Update total_rows_migrated to DB only every 10 batches (reduces COUNT(*) 50x)
- Update last_migrated_id on every batch via UPDATE (fast, no COUNT)
- Do final COUNT(*) at end of migration for accurate total
This maintains accuracy while being performant. The local count is reliable
because we're tracking inserts in a single sequential migration.
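
The accumulator pattern can be sketched as follows. The class is hypothetical, and the database write is replaced by an in-memory list so the example stays self-contained.

```python
from typing import Optional


class ProgressTracker:
    """Count migrated rows locally and persist the total only periodically."""

    def __init__(self, flush_every: int = 10) -> None:
        self.flush_every = flush_every
        self.total_rows = 0
        self.last_id: Optional[int] = None
        self.batches_seen = 0
        # Stand-in for writes to migration_state: (last_id, total_rows) pairs.
        self.flushes: list[tuple[Optional[int], int]] = []

    def record_batch(self, inserted: int, batch_max_id: int) -> None:
        """Update the in-memory counters; flush every N batches."""
        self.total_rows += inserted
        self.last_id = batch_max_id
        self.batches_seen += 1
        if self.batches_seen % self.flush_every == 0:
            self.flush()

    def flush(self) -> None:
        """Persist the current totals (here: append to a list)."""
        self.flushes.append((self.last_id, self.total_rows))
```

No COUNT(*) is needed during the run; a single authoritative count at the end can correct any drift.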
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
The progress bar was appearing frozen because:
- Total was set to MySQL rows to process (111M)
- Progress was updated by PostgreSQL rows inserted (11M after consolidation)
- This created a 10:1 mismatch, making progress appear to crawl
Solution:
- Track progress based on MySQL rows processed (matches total)
- Use batch_size (MySQL rows) instead of inserted count (PostgreSQL rows)
- Change batch_max_id calculation to use original batch instead of transformed
This ensures the progress bar advances at a visible rate while still
maintaining accurate row count tracking from the database.
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Replace session-level counting with direct table COUNT queries to ensure
total_rows_migrated always reflects actual reality in PostgreSQL. This fixes
the discrepancy where the counter was only tracking rows from the current session
and didn't account for earlier insertions or duplicates from failed resume attempts.
Key improvements:
- Use get_row_count() after each batch to get authoritative total
- Preserve previous count on resume and accumulate across sessions
- Remove dependency on error-prone session-level counters
- Ensures migration_state.total_rows_migrated matches actual table row count
Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Configuration improvements:
- Set read_timeout=300 (5 minutes) to handle long queries
- Set write_timeout=300 (5 minutes) for writes
- Set max_allowed_packet=64MB to handle larger data transfers
Retry logic:
- Added retry mechanism with max 3 retries on fetch failure
- Auto-reconnect on connection loss before retry
- Better error messages showing retry attempts
This fixes the 'connection is lost' error that occurs during
long-running migrations by:
1. Giving MySQL queries more time to complete
2. Allowing larger packet sizes for bulk data
3. Automatically recovering from connection drops
Fixes: 'Connection is lost' error during full migration
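
The retry flow can be sketched independently of the MySQL driver. The function name and the use of the built-in ConnectionError are assumptions; the real code would catch the driver's own exception type.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], T],
    reconnect: Callable[[], None],
    max_retries: int = 3,
    delay_seconds: float = 1.0,
) -> T:
    """Call fetch(); on connection loss, reconnect and retry up to max_retries."""
    attempt = 0
    while True:
        try:
            return fetch()
        except ConnectionError as exc:
            attempt += 1
            if attempt > max_retries:
                raise  # give up and surface the original error
            print(f"Fetch failed ({exc}); retry {attempt}/{max_retries}")
            reconnect()
            time.sleep(delay_seconds)
```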
Replace cursor.copy() with cursor.executemany() for more reliable
batch inserts in PostgreSQL. The COPY method has issues with format
and data encoding in psycopg3.
Changes:
- Use executemany() with parameterized INSERT statements
- Let psycopg handle parameter escaping and encoding
- Convert JSONB dicts to JSON strings automatically
- More compatible with various data types
This ensures that data is actually being inserted into PostgreSQL
during migration, fixing the issue where data wasn't appearing in
the database after migration completed.
Fixes: Data not being persisted in PostgreSQL during migration
- On successful execution (no exception): explicitly commit before closing
- On exception: explicitly rollback before closing
- Add try-except to handle commit/rollback failures gracefully
This ensures that all inserted data is committed to the database
when the context manager exits. Previously, commits were only done
per-batch in insert_batch(), but the final context exit wasn't
ensuring a final commit.
Fixes: Data not appearing in PostgreSQL after migration completes
- TABLE_CONFIGS now accepts both 'RAWDATACOR' and 'rawdatacor' as keys
- TABLE_CONFIGS now accepts both 'ELABDATADISP' and 'elabdatadisp' as keys
- Reuse same config dict for both cases to avoid duplication
This allows FullMigrator to work correctly when initialized with
uppercase table names from the CLI while DataTransformer works
with lowercase names.
Fixes: 'Unknown table: RAWDATACOR' error during migration
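
A sketch of the shared-config approach. The contents of each config dict are placeholders, and get_table_config() is a hypothetical accessor.

```python
# Placeholder configs; the real entries hold the per-table settings.
_RAWDATACOR_CONFIG = {"primary_key": "id"}
_ELABDATADISP_CONFIG = {"primary_key": "idElabData"}

# Both spellings point at the same dict object, so nothing is duplicated.
TABLE_CONFIGS = {
    "RAWDATACOR": _RAWDATACOR_CONFIG,
    "rawdatacor": _RAWDATACOR_CONFIG,
    "ELABDATADISP": _ELABDATADISP_CONFIG,
    "elabdatadisp": _ELABDATADISP_CONFIG,
}


def get_table_config(table: str) -> dict:
    """Look up a table's config, accepting upper- or lowercase names."""
    try:
        return TABLE_CONFIGS[table]
    except KeyError:
        raise ValueError(f"Unknown table: {table}") from None
```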
- Create rawdatacor_id_seq for auto-increment of id column
- Create elabdatadisp_id_seq for auto-increment of id_elab_data column
- Both sequences use DEFAULT nextval() to auto-generate IDs on insert
This keeps ID auto-generation without a PRIMARY KEY, since PostgreSQL doesn't
support PRIMARY KEY on partitioned tables with expression-based ranges.
IDs are auto-incremented, but uniqueness is no longer enforced by a constraint.
Tested: schema creation works correctly with sequences
PostgreSQL doesn't support PRIMARY KEY or UNIQUE constraints on
partitioned tables when using RANGE partitioning on expressions
(like EXTRACT(YEAR FROM event_date)).
Changed:
- RAWDATACOR: removed PRIMARY KEY (id, event_date) and UNIQUE constraint
- ELABDATADISP: removed PRIMARY KEY (id_elab_data, event_date) and UNIQUE constraint
- Tables now have no constraints except NOT NULL on required columns
This is a PostgreSQL limitation with partitioned tables.
Constraints can be added per-partition if needed, but for simplicity
we rely on application-level validation.
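Fixes: 'unsupported PRIMARY KEY constraint with partition key definition'
(reported by the Italian-locale server as 'vincolo PRIMARY KEY non supportato con una definizione di chiave di partizione')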
- Fix ConfigDict model_config for Pydantic v2.12+ compatibility
- Add env_file and env_file_encoding to all config classes
- Each config class now properly loads from .env with correct prefix
Fixes: ValidationError when loading settings from .env file
CLI now works correctly with 'uv run python main.py'