
alex 23e9fc9d82 feat: Add error logging and fix incremental migration state tracking

Implement comprehensive error handling and fix state management bug in incremental migration:

Error Logging System:
- Add validation for consolidation keys (NULL dates, empty IDs, corrupted Java strings)
- Log invalid keys to dedicated error files with detailed reasons
- Full migration: migration_errors_<table>_<partition>.log
- Incremental migration: migration_errors_<table>_incremental_<timestamp>.log (timestamped to preserve history)
- Report total count of skipped invalid keys at migration completion
- Auto-delete empty error log files

State Tracking Fix:
- Fix critical bug where last_key wasn't updated after final buffer flush
- Track last_processed_key throughout migration loop
- Update state both during periodic flushes and after final flush
- Ensures incremental migration correctly resumes from last migrated key

Validation Checks:
- EventDate IS NULL or EventDate = '0000-00-00'
- EventTime IS NULL
- ToolNameID IS NULL or empty string
- UnitName IS NULL or empty string
- UnitName starting with '[L' (corrupted Java strings)

Documentation:
- Update README.md with error logging behavior
- Update MIGRATION_WORKFLOW.md with validation details
- Update CHANGELOG.md with new features and fixes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2026-01-01 19:49:44 +01:00



Changelog

[Current] - 2025-12-30

Added

Consolidation-based incremental migration: Uses consolidation keys (UnitName, ToolNameID, EventDate, EventTime) instead of timestamps
MySQL ID optimization: Uses MAX(mysql_max_id) from PostgreSQL to filter MySQL queries, avoiding full table scans
State management in PostgreSQL: Replaced JSON file with migration_state table for more reliable tracking
Sync utility: Added scripts/sync_migration_state.py to sync state with actual data
Performance optimization: MySQL key queries now complete in under 0.1 seconds (previously 60+ seconds) by filtering on the PRIMARY KEY
Data quality validation: Automatically validates and logs invalid consolidation keys to dedicated error files
Error logging: Invalid keys (null dates, empty tool IDs, corrupted Java strings) are logged and skipped during migration
Better documentation: Consolidated and updated all documentation files
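
The ID optimization above can be sketched as a query builder that combines a PRIMARY KEY filter with a "resume after the last key" condition. This is only an illustration under stated assumptions: `build_keys_query` is a hypothetical helper (not the project's `fetch_consolidation_keys_after()`), the MySQL primary key column is assumed to be named `id`, and `%s` placeholders assume a DB-API driver with the `format` paramstyle.

```python
from typing import List, Optional, Sequence, Tuple

# Consolidation key columns as listed in this changelog.
KEY_COLUMNS = ("UnitName", "ToolNameID", "EventDate", "EventTime")


def build_keys_query(
    table: str,
    last_key: Optional[Sequence[object]],
    min_mysql_id: Optional[int],
    limit: int,
) -> Tuple[str, List[object]]:
    """Build a parameterized MySQL query for the next batch of keys (hypothetical helper)."""
    columns = ", ".join(KEY_COLUMNS)
    conditions: List[str] = []
    params: List[object] = []

    if min_mysql_id is not None:
        # Assumed column name `id`: a PRIMARY KEY range filter lets MySQL
        # skip rows that were already migrated instead of scanning the table.
        conditions.append("id > %s")
        params.append(min_mysql_id)

    if last_key is not None:
        # Row-constructor comparison: resume strictly after the last migrated key.
        placeholders = ", ".join(["%s"] * len(KEY_COLUMNS))
        conditions.append(f"({columns}) > ({placeholders})")
        params.extend(last_key)

    where = " WHERE " + " AND ".join(conditions) if conditions else ""
    sql = f"SELECT DISTINCT {columns} FROM `{table}`{where} ORDER BY {columns} LIMIT %s"
    params.append(limit)
    return sql, params
```

In this sketch the value passed as `min_mysql_id` would come from `MAX(mysql_max_id)` on the PostgreSQL side, and the returned pair would be handed to a cursor's `execute(sql, params)`.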

Changed

Incremental migration: Now uses consolidation keys instead of timestamp-based approach
Full migration: Improved to save global last_key after completing all partitions
State tracking: Moved from migration_state.json to PostgreSQL table migration_state
Query performance: Added min_mysql_id parameter to fetch_consolidation_keys_after() for optimization
Configuration: Renamed BATCH_SIZE to CONSOLIDATION_GROUP_LIMIT to better reflect what it controls
Configuration: Added PROGRESS_LOG_INTERVAL to control logging frequency
Configuration: Added BENCHMARK_OUTPUT_DIR to specify benchmark results directory
Documentation: Updated README.md, MIGRATION_WORKFLOW.md, QUICKSTART.md, EXAMPLE_WORKFLOW.md with current implementation
Documentation: Corrected index and partitioning documentation to reflect actual PostgreSQL schema:
- Uses event_timestamp (not separate event_date/event_time)
- Primary key includes event_year for partitioning
- Consolidation key is UNIQUE (unit_name, tool_name_id, event_timestamp, event_year)
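
As a rough illustration of how a UNIQUE consolidation key of this shape makes repeated inserts harmless when combined with ON CONFLICT DO NOTHING, here is a sketch that uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table name `rawdatacor_demo` and the helpers `insert_record` and `count_records` are hypothetical, partitioning and the primary key are omitted, deriving `event_year` from the timestamp string is an assumption, and SQLite 3.24 or newer is assumed for the ON CONFLICT clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE rawdatacor_demo (
        unit_name       TEXT    NOT NULL,
        tool_name_id    TEXT    NOT NULL,
        event_timestamp TEXT    NOT NULL,
        event_year      INTEGER NOT NULL,
        UNIQUE (unit_name, tool_name_id, event_timestamp, event_year)
    )
    """
)


def insert_record(conn, unit_name, tool_name_id, event_timestamp):
    """Insert one consolidated record; a duplicate key is silently ignored."""
    # Assumption: event_year is the year part of an ISO-style timestamp string.
    event_year = int(event_timestamp[:4])
    conn.execute(
        "INSERT INTO rawdatacor_demo "
        "(unit_name, tool_name_id, event_timestamp, event_year) "
        "VALUES (?, ?, ?, ?) ON CONFLICT DO NOTHING",
        (unit_name, tool_name_id, event_timestamp, event_year),
    )


def count_records(conn):
    """Return the number of rows currently stored."""
    return conn.execute("SELECT COUNT(*) FROM rawdatacor_demo").fetchone()[0]


# Re-running the same insert (for example after a resumed migration)
# does not create a second row.
insert_record(conn, "UNIT1", "TOOL1", "2024-01-01 10:00:00")
insert_record(conn, "UNIT1", "TOOL1", "2024-01-01 10:00:00")
```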

Removed

migration_state.json: Replaced by PostgreSQL table
Timestamp-based migration: Replaced by consolidation key-based approach
ID-based resumable migration: Consolidated into single consolidation-based approach
Temporary debug scripts: Cleaned up all /tmp/ debug files

Fixed

Incremental migration performance: MySQL queries now roughly 600x faster with ID filter (from 60+ seconds to <0.1 seconds)
State synchronization: Can now sync migration_state with actual data using utility script
Duplicate handling: Uses ON CONFLICT DO NOTHING to prevent duplicates
Last key tracking: Properly updates global state after full migration
Corrupted data handling: Both full and incremental migrations now validate keys and log errors instead of crashing
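
The last-key tracking fix can be pictured with a minimal loop: remember the last processed key as keys are buffered, and persist it after every flush, including the final partial one. The names `migrate_keys`, `flush`, and `save_last_key` are hypothetical stand-ins for the project's own functions, and the buffer size is an arbitrary example value.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Key = Tuple[str, str, str, str]  # (UnitName, ToolNameID, EventDate, EventTime)


def migrate_keys(
    keys: Iterable[Key],
    flush: Callable[[List[Key]], None],
    save_last_key: Callable[[Key], None],
    buffer_size: int = 1000,
) -> Optional[Key]:
    """Buffer keys, flush in batches, and persist the last key after each flush."""
    buffer: List[Key] = []
    last_processed_key: Optional[Key] = None

    for key in keys:
        buffer.append(key)
        last_processed_key = key
        if len(buffer) >= buffer_size:
            flush(buffer)
            save_last_key(last_processed_key)  # periodic state update
            buffer = []

    if buffer:
        flush(buffer)
        # Without this update, a resumed run would restart from the key saved
        # at the last periodic flush and revisit the final partial batch.
        save_last_key(last_processed_key)

    return last_processed_key
```

With five keys and `buffer_size=2`, this sketch saves state three times: after the second key, after the fourth, and after the final flush of the fifth.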

Error Logging

Both full and incremental migrations now handle corrupted consolidation keys gracefully:

Error files:

Full migration: migration_errors_<table>_<partition>.log (e.g., migration_errors_rawdatacor_p2024.log)
Incremental migration: migration_errors_<table>_incremental_<timestamp>.log (e.g., migration_errors_rawdatacor_incremental_20260101_194500.log)

Each incremental migration creates a new timestamped file to preserve error history across runs.

File format:

```
# Migration errors for <table> partition <partition>
# Format: UnitName|ToolNameID|EventDate|EventTime|Reason

ID0350||0000-00-00|0:00:00|EventDate is invalid: 0000-00-00
[Ljava.lang.String;@abc123|TOOL1|2024-01-01|10:00:00|UnitName is corrupted Java string: [Ljava.lang.String;@abc123
UNIT1||2024-01-01|10:00:00|ToolNameID is NULL or empty
```
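
A minimal sketch of how lines in this format could be produced, assuming the validation rules listed in the commit message. The function names `invalid_key_reason` and `format_error_line` are hypothetical, the order of the checks is an assumption, and the reason strings for a NULL EventTime and an empty UnitName are guesses modeled on the three sample lines.

```python
from typing import Optional


def invalid_key_reason(unit_name, tool_name_id, event_date, event_time) -> Optional[str]:
    """Return a rejection reason for an invalid consolidation key, or None if valid."""
    if event_date is None or str(event_date) == "0000-00-00":
        return f"EventDate is invalid: {event_date}"
    if event_time is None:
        return "EventTime is NULL"  # assumed wording
    if tool_name_id is None or tool_name_id == "":
        return "ToolNameID is NULL or empty"
    if unit_name is None or unit_name == "":
        return "UnitName is NULL or empty"  # assumed wording
    if unit_name.startswith("[L"):
        # A Java array's default toString() looks like "[Ljava.lang.String;@abc123".
        return f"UnitName is corrupted Java string: {unit_name}"
    return None


def format_error_line(unit_name, tool_name_id, event_date, event_time, reason) -> str:
    """Render one log line as UnitName|ToolNameID|EventDate|EventTime|Reason."""
    fields = [
        "" if value is None else str(value)
        for value in (unit_name, tool_name_id, event_date, event_time)
    ]
    return "|".join(fields + [reason])
```

For the key `("UNIT1", "", "2024-01-01", "10:00:00")` this sketch yields the third sample line shown in the file format.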

Behavior:

Invalid keys are automatically skipped to prevent migration failure
Each skipped key is logged with the reason for rejection
Total count of skipped keys is reported at the end of migration
Empty error files (no errors) are automatically deleted
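
The behavior above (per-run file names, a running count, and removal of empty files) could look roughly like the following. `error_log_name` and `ErrorLog` are hypothetical names, not the project's actual classes; the timestamp pattern `%Y%m%d_%H%M%S` is inferred from the example file name `migration_errors_rawdatacor_incremental_20260101_194500.log`.

```python
import os
from datetime import datetime
from typing import Optional


def error_log_name(table: str, partition: Optional[str] = None,
                   now: Optional[datetime] = None) -> str:
    """Full migration: one file per partition. Incremental: one timestamped file per run."""
    if partition is not None:
        return f"migration_errors_{table}_{partition}.log"
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"migration_errors_{table}_incremental_{stamp}.log"


class ErrorLog:
    """Writes skipped keys to a file and deletes the file if nothing was logged."""

    def __init__(self, path: str, table: str, partition: str) -> None:
        self.path = path
        self.count = 0
        self._file = open(path, "w", encoding="utf-8")
        self._file.write(f"# Migration errors for {table} partition {partition}\n")
        self._file.write("# Format: UnitName|ToolNameID|EventDate|EventTime|Reason\n\n")

    def log(self, line: str) -> None:
        """Append one already-formatted error line and count it."""
        self._file.write(line + "\n")
        self.count += 1

    def close(self) -> int:
        """Close the file, remove it if no errors were logged, and return the count."""
        self._file.close()
        if self.count == 0:
            os.remove(self.path)
        return self.count
```

The value returned by `close()` is what a caller would report as the total number of skipped keys at the end of the run.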

Migration Guide (from old to new)

If you have an existing installation with migration_state.json:

Back up your data (optional but recommended):
```
cp migration_state.json migration_state.json.backup
```

Run full migration to populate migration_state table:
```
python main.py migrate full
```
Sync state (if you have existing data):
```
python scripts/sync_migration_state.py
```
Remove old state file:
```
rm migration_state.json
```

Run incremental migration:
```
python main.py migrate incremental --dry-run
python main.py migrate incremental
```

Performance Improvements

MySQL query time: From 60+ seconds to <0.1 seconds (600x faster)
Consolidation efficiency: Multiple MySQL rows → single PostgreSQL record
State reliability: PostgreSQL table instead of JSON file

Breaking Changes

--state-file parameter removed from incremental migration (no longer uses JSON)
--use-id flag removed (consolidation-based approach is now default)
Incremental migration requires full migration to be run first
BATCH_SIZE environment variable renamed to CONSOLIDATION_GROUP_LIMIT (update your .env file)
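
For the renamed variable, one possible way to surface a stale `.env` early is to fail loudly when only the old name is set. This is a sketch, not the project's configuration code: `read_group_limit` is a hypothetical helper and the default of 1000 is a placeholder, since the changelog does not state the real default.

```python
from typing import Mapping


def read_group_limit(env: Mapping[str, str], default: int = 1000) -> int:
    """Read CONSOLIDATION_GROUP_LIMIT, rejecting the old BATCH_SIZE name."""
    if "CONSOLIDATION_GROUP_LIMIT" in env:
        return int(env["CONSOLIDATION_GROUP_LIMIT"])
    if "BATCH_SIZE" in env:
        # The old name is no longer read; point the user at the rename.
        raise RuntimeError(
            "BATCH_SIZE was renamed to CONSOLIDATION_GROUP_LIMIT; update your .env file"
        )
    return default  # placeholder default, not taken from the project
```

In a real program the mapping passed in would typically be `os.environ`.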

[Previous] - Before 2025-12-30

Features

Full migration support
Incremental migration with timestamp tracking
JSONB transformation
Partitioning by year
GIN indexes for JSONB queries
Benchmark system
Progress tracking
Rich logging
