# Changelog

## [Current] - 2025-12-30

### Added

- **Consolidation-based incremental migration**: Uses consolidation keys `(UnitName, ToolNameID, EventDate, EventTime)` instead of timestamps
- **MySQL ID optimization**: Uses `MAX(mysql_max_id)` from PostgreSQL to filter MySQL queries, avoiding full table scans
- **State management in PostgreSQL**: Replaced JSON file with `migration_state` table for more reliable tracking
- **Sync utility**: Added `scripts/sync_migration_state.py` to sync state with actual data
- **Performance optimization**: MySQL queries now return in under 0.1 seconds using the PRIMARY KEY filter
- **Data quality validation**: Automatically validates consolidation keys and logs invalid ones to dedicated error files
- **Error logging**: Invalid keys (null dates, empty tool IDs, corrupted Java strings) are logged and skipped during migration
- **Better documentation**: Consolidated and updated all documentation files

### Changed

- **Incremental migration**: Now uses consolidation keys instead of the timestamp-based approach
- **Full migration**: Now saves the global `last_key` after completing all partitions
- **State tracking**: Moved from `migration_state.json` to PostgreSQL table `migration_state`
- **Query performance**: Added `min_mysql_id` parameter to `fetch_consolidation_keys_after()` for optimization
- **Configuration**: Renamed `BATCH_SIZE` to `CONSOLIDATION_GROUP_LIMIT` to better reflect what it controls
- **Configuration**: Added `PROGRESS_LOG_INTERVAL` to control logging frequency
- **Configuration**: Added `BENCHMARK_OUTPUT_DIR` to specify benchmark results directory
- **Documentation**: Updated README.md, MIGRATION_WORKFLOW.md, QUICKSTART.md, EXAMPLE_WORKFLOW.md with current implementation
- **Documentation**: Corrected index and partitioning documentation to reflect actual PostgreSQL schema:
  - Uses `event_timestamp` (not separate event_date/event_time)
  - Primary key includes `event_year` for partitioning
  - Consolidation key is UNIQUE (unit_name, tool_name_id,
event_timestamp, event_year)

### Removed

- **migration_state.json**: Replaced by PostgreSQL table
- **Timestamp-based migration**: Replaced by consolidation key-based approach
- **ID-based resumable migration**: Consolidated into single consolidation-based approach
- **Temporary debug scripts**: Cleaned up all `/tmp/` debug files

### Fixed

- **Incremental migration performance**: MySQL queries now ~1000x faster with ID filter
- **State synchronization**: Can now sync `migration_state` with actual data using utility script
- **Duplicate handling**: Uses `ON CONFLICT DO NOTHING` to prevent duplicates
- **Last key tracking**: Properly updates global state after full migration
- **Corrupted data handling**: Both full and incremental migrations now validate keys and log errors instead of crashing

### Error Logging

Both full and incremental migrations now handle corrupted consolidation keys gracefully:

**Error files:**

- Full migration: `migration_errors__.log` (e.g., `migration_errors_rawdatacor_p2024.log`)
- Incremental migration: `migration_errors_
_incremental_.log` (e.g., `migration_errors_rawdatacor_incremental_20260101_194500.log`)

Each incremental migration creates a new timestamped file to preserve error history across runs.

**File format:**

```
# Migration errors for partition
# Format: UnitName|ToolNameID|EventDate|EventTime|Reason
ID0350||0000-00-00|0:00:00|EventDate is invalid: 0000-00-00
[Ljava.lang.String;@abc123|TOOL1|2024-01-01|10:00:00|UnitName is corrupted Java string: [Ljava.lang.String;@abc123
UNIT1||2024-01-01|10:00:00|ToolNameID is NULL or empty
```

**Behavior:**

- Invalid keys are automatically skipped to prevent migration failure
- Each skipped key is logged with the reason for rejection
- Total count of skipped keys is reported at the end of migration
- Empty error files (no errors) are automatically deleted

### Migration Guide (from old to new)

If you have an existing installation with `migration_state.json`:

1. **Back up your state file** (optional but recommended):

   ```bash
   cp migration_state.json migration_state.json.backup
   ```

2. **Run full migration** to populate the `migration_state` table:

   ```bash
   python main.py migrate full
   ```

3. **Sync state** (if you have existing data):

   ```bash
   python scripts/sync_migration_state.py
   ```

4. **Remove old state file**:

   ```bash
   rm migration_state.json
   ```

5. **Run incremental migration**:

   ```bash
   python main.py migrate incremental --dry-run
   python main.py migrate incremental
   ```

### Performance Improvements

- **MySQL query time**: From 60+ seconds to <0.1 seconds (at least 600x faster)
- **Consolidation efficiency**: Multiple MySQL rows → single PostgreSQL record
- **State reliability**: PostgreSQL table instead of JSON file

### Breaking Changes

- `--state-file` parameter removed from incremental migration (no longer uses JSON)
- `--use-id` flag removed (consolidation-based approach is now default)
- Incremental migration requires full migration to be run first
- `BATCH_SIZE` environment variable renamed to `CONSOLIDATION_GROUP_LIMIT` (update your `.env` file)

## [Previous] - Before 2025-12-30

### Features

- Full migration support
- Incremental migration with timestamp tracking
- JSONB transformation
- Partitioning by year
- GIN indexes for JSONB queries
- Benchmark system
- Progress tracking
- Rich logging
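
For reference, the key validation and error-line format described under Error Logging in the [Current] entry can be sketched roughly as below. This is a minimal illustration, not the project's actual code: `rejection_reason` and `format_error_line` are hypothetical names, and both the set of checks and their order are assumptions inferred from the sample error file.

```python
from typing import Optional, Tuple

# (UnitName, ToolNameID, EventDate, EventTime), all treated as strings here.
Key = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]

# Prefix produced when a Java String[] is stringified instead of its contents.
JAVA_ARRAY_PREFIX = "[Ljava.lang.String;@"


def rejection_reason(key: Key) -> Optional[str]:
    """Return why a key is rejected, or None if it passes these checks."""
    unit_name, tool_name_id, event_date, _event_time = key
    if unit_name and unit_name.startswith(JAVA_ARRAY_PREFIX):
        return f"UnitName is corrupted Java string: {unit_name}"
    # Assumed rule: a missing date or MySQL's zero date is invalid.
    if not event_date or event_date == "0000-00-00":
        return f"EventDate is invalid: {event_date}"
    if not tool_name_id:
        return "ToolNameID is NULL or empty"
    return None


def format_error_line(key: Key, reason: str) -> str:
    """Render UnitName|ToolNameID|EventDate|EventTime|Reason."""
    fields = ["" if part is None else str(part) for part in key]
    return "|".join(fields + [reason])


if __name__ == "__main__":
    sample = ("UNIT1", "", "2024-01-01", "10:00:00")
    reason = rejection_reason(sample)
    if reason is not None:
        print(format_error_line(sample, reason))
```

A migration loop built on this sketch would skip any key for which `rejection_reason` returns a string, append the formatted line to the error file, and count the skips for the end-of-run summary.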