Implement comprehensive error handling and fix state management bug in incremental migration

Error Logging System:
- Add validation for consolidation keys (NULL dates, empty IDs, corrupted Java strings)
- Log invalid keys to dedicated error files with detailed reasons
- Full migration: migration_errors_<table>_<partition>.log
- Incremental migration: migration_errors_<table>_incremental_<timestamp>.log (timestamped to preserve history)
- Report total count of skipped invalid keys at migration completion
- Auto-delete empty error log files

State Tracking Fix:
- Fix critical bug where last_key wasn't updated after final buffer flush
- Track last_processed_key throughout migration loop
- Update state both during periodic flushes and after final flush
- Ensures incremental migration correctly resumes from last migrated key

Validation Checks:
- EventDate IS NULL or EventDate = '0000-00-00'
- EventTime IS NULL
- ToolNameID IS NULL or empty string
- UnitName IS NULL or empty string
- UnitName starting with '[L' (corrupted Java strings)

Documentation:
- Update README.md with error logging behavior
- Update MIGRATION_WORKFLOW.md with validation details
- Update CHANGELOG.md with new features and fixes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
# Changelog

## [Current] - 2025-12-30

### Added

- **Consolidation-based incremental migration**: Uses consolidation keys `(UnitName, ToolNameID, EventDate, EventTime)` instead of timestamps
- **MySQL ID optimization**: Uses `MAX(mysql_max_id)` from PostgreSQL to filter MySQL queries, avoiding full table scans
- **State management in PostgreSQL**: Replaced JSON file with `migration_state` table for more reliable tracking
- **Sync utility**: Added `scripts/sync_migration_state.py` to sync state with actual data
- **Performance optimization**: MySQL key queries now complete in under 0.1 seconds by filtering on the PRIMARY KEY
- **Data quality validation**: Automatically validates and logs invalid consolidation keys to dedicated error files
- **Error logging**: Invalid keys (null dates, empty tool IDs, corrupted Java strings) are logged and skipped during migration
- **Better documentation**: Consolidated and updated all documentation files
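
As an illustration of the MySQL ID optimization and key-based pagination listed above, here is a minimal sketch of how such a query could be assembled. The function name `build_keys_query`, the `id` column, and the default limit are assumptions made for this example; the changelog does not show the actual body or signature of `fetch_consolidation_keys_after()`.

```python
from typing import List, Optional, Sequence, Tuple

# Consolidation key columns as named in this changelog.
KEY_COLUMNS = ("UnitName", "ToolNameID", "EventDate", "EventTime")


def build_keys_query(
    table: str,
    last_key: Optional[Sequence[object]] = None,
    min_mysql_id: Optional[int] = None,
    limit: int = 1000,
) -> Tuple[str, List[object]]:
    """Build a parameterized MySQL query for the next batch of keys.

    `table` is interpolated directly, so it must come from trusted
    configuration, never from user input.
    """
    cols = ", ".join(KEY_COLUMNS)
    clauses: List[str] = []
    params: List[object] = []

    if min_mysql_id is not None:
        # Assumed primary key column `id`: lets MySQL skip already-migrated rows.
        clauses.append("id > %s")
        params.append(min_mysql_id)

    if last_key is not None:
        # MySQL row-constructor comparison: resume strictly after the last key.
        placeholders = ", ".join(["%s"] * len(KEY_COLUMNS))
        clauses.append(f"({cols}) > ({placeholders})")
        params.extend(last_key)

    where = f" WHERE {' AND '.join(clauses)}" if clauses else ""
    sql = f"SELECT DISTINCT {cols} FROM {table}{where} ORDER BY {cols} LIMIT %s"
    params.append(limit)
    return sql, params
```

The returned pair can be passed to a DB-API cursor as `cursor.execute(sql, params)`, which keeps all values out of the SQL string itself.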

### Changed

- **Incremental migration**: Now uses consolidation keys instead of timestamp-based approach
- **Full migration**: Improved to save global `last_key` after completing all partitions
- **State tracking**: Moved from `migration_state.json` to PostgreSQL table `migration_state`
- **Query performance**: Added `min_mysql_id` parameter to `fetch_consolidation_keys_after()` for optimization
- **Configuration**: Renamed `BATCH_SIZE` to `CONSOLIDATION_GROUP_LIMIT` to better reflect what it controls
- **Configuration**: Added `PROGRESS_LOG_INTERVAL` to control logging frequency
- **Configuration**: Added `BENCHMARK_OUTPUT_DIR` to specify benchmark results directory
- **Documentation**: Updated README.md, MIGRATION_WORKFLOW.md, QUICKSTART.md, EXAMPLE_WORKFLOW.md with current implementation
- **Documentation**: Corrected index and partitioning documentation to reflect actual PostgreSQL schema:
  - Uses `event_timestamp` (not separate `event_date`/`event_time`)
  - Primary key includes `event_year` for partitioning
  - Consolidation key is `UNIQUE (unit_name, tool_name_id, event_timestamp, event_year)`
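
To make the schema note above concrete, the following sketch shows one way a MySQL `(EventDate, EventTime)` pair could map to `event_timestamp` and `event_year`, together with an insert that relies on the unique consolidation key. The table name `rawdatacor` is taken from the error-file example in this changelog; the `measurements` column and both helper names are hypothetical, and the real transformation code may differ.

```python
from datetime import datetime
from typing import Tuple


def to_event_timestamp(event_date: str, event_time: str) -> Tuple[datetime, int]:
    """Combine date and time strings into (event_timestamp, event_year)."""
    ts = datetime.strptime(f"{event_date} {event_time}", "%Y-%m-%d %H:%M:%S")
    return ts, ts.year


# `measurements` is a placeholder column name for the consolidated payload.
# ON CONFLICT DO NOTHING skips rows that would violate the unique key.
INSERT_SQL = (
    "INSERT INTO rawdatacor "
    "(unit_name, tool_name_id, event_timestamp, event_year, measurements) "
    "VALUES (%s, %s, %s, %s, %s) "
    "ON CONFLICT DO NOTHING"
)
```

Because `event_year` is derived from the timestamp, both values stay consistent with the partition the row lands in.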

### Removed

- **migration_state.json**: Replaced by PostgreSQL table
- **Timestamp-based migration**: Replaced by consolidation key-based approach
- **ID-based resumable migration**: Consolidated into single consolidation-based approach
- **Temporary debug scripts**: Cleaned up all `/tmp/` debug files

### Fixed

- **Incremental migration performance**: MySQL queries now roughly 600x faster with the ID filter (from 60+ seconds to under 0.1 seconds)
- **State synchronization**: Can now sync `migration_state` with actual data using utility script
- **Duplicate handling**: Uses `ON CONFLICT DO NOTHING` to prevent duplicates
- **Last key tracking**: Properly updates global state after full migration
- **Corrupted data handling**: Both full and incremental migrations now validate keys and log errors instead of crashing
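
The last-key fix comes down to recording state after the final buffer flush as well as after each periodic one. The sketch below illustrates that pattern under simplifying assumptions: `migrate_keys`, `flush`, `save_state`, and `buffer_size` are hypothetical names, and the real migration loop does considerably more work per key.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

K = TypeVar("K")


def migrate_keys(
    keys: Iterable[K],
    flush: Callable[[List[K]], None],
    save_state: Callable[[K], None],
    buffer_size: int = 3,
) -> Optional[K]:
    """Flush keys in batches and persist the last processed key each time."""
    buffer: List[K] = []
    last_processed_key: Optional[K] = None

    for key in keys:
        buffer.append(key)
        last_processed_key = key
        if len(buffer) >= buffer_size:
            flush(list(buffer))
            save_state(last_processed_key)  # periodic state update
            buffer.clear()

    if buffer:
        flush(list(buffer))
        # Without this update, a resumed run would restart from the key
        # saved at the previous periodic flush and redo the final batch.
        save_state(last_processed_key)

    return last_processed_key
```

Saving state only after a successful flush means an interrupted run can repeat at most one batch, which `ON CONFLICT DO NOTHING` then absorbs.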

### Error Logging

Both full and incremental migrations now handle corrupted consolidation keys gracefully:

**Error files:**
- Full migration: `migration_errors_<table>_<partition>.log` (e.g., `migration_errors_rawdatacor_p2024.log`)
- Incremental migration: `migration_errors_<table>_incremental_<timestamp>.log` (e.g., `migration_errors_rawdatacor_incremental_20260101_194500.log`)

Each incremental migration creates a new timestamped file to preserve error history across runs.
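
The naming scheme above can be sketched as a small helper. The function name `error_log_name` and its parameters are assumptions for illustration; the timestamp format `%Y%m%d_%H%M%S` is inferred from the example filename rather than confirmed by the source.

```python
from datetime import datetime
from typing import Optional


def error_log_name(
    table: str,
    partition: Optional[str] = None,
    now: Optional[datetime] = None,
) -> str:
    """Return the error log filename for a full or incremental run."""
    if partition is not None:
        # Full migration: one file per table partition.
        return f"migration_errors_{table}_{partition}.log"
    # Incremental migration: timestamped so earlier runs are not overwritten.
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"migration_errors_{table}_incremental_{stamp}.log"
```

Passing `now` explicitly keeps the helper deterministic in tests; in normal use it can be omitted.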

**File format:**
```
# Migration errors for <table> partition <partition>
# Format: UnitName|ToolNameID|EventDate|EventTime|Reason

ID0350||0000-00-00|0:00:00|EventDate is invalid: 0000-00-00
[Ljava.lang.String;@abc123|TOOL1|2024-01-01|10:00:00|UnitName is corrupted Java string: [Ljava.lang.String;@abc123
UNIT1||2024-01-01|10:00:00|ToolNameID is NULL or empty
```
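
For reviewing skipped keys after a run, a file in this format can be read back with a few lines of Python. This is a sketch only: `SkippedKey` and `parse_error_lines` are hypothetical names, and it assumes the first four fields never contain a `|` character.

```python
from typing import Iterable, List, NamedTuple


class SkippedKey(NamedTuple):
    unit_name: str
    tool_name_id: str
    event_date: str
    event_time: str
    reason: str


def parse_error_lines(lines: Iterable[str]) -> List[SkippedKey]:
    """Parse pipe-delimited error log lines, ignoring comments and blanks."""
    entries: List[SkippedKey] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        # Split at most 4 times so the reason field is kept whole.
        parts = line.split("|", 4)
        if len(parts) != 5:
            continue  # malformed line; skip rather than fail
        entries.append(SkippedKey(*parts))
    return entries
```

Calling `parse_error_lines(open(path, encoding="utf-8"))` yields one `SkippedKey` per logged key, which makes it easy to count rejections by reason.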

**Behavior:**
- Invalid keys are automatically skipped to prevent migration failure
- Each skipped key is logged with the reason for rejection
- Total count of skipped keys is reported at the end of migration
- Empty error files (no errors) are automatically deleted
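
The behavior above can be sketched as a validator plus a small logging helper. The checks follow the validation rules given in the commit message; the function names, the order of the checks, and the reason wording for cases not shown in the sample file (`EventTime is NULL`, `UnitName is NULL or empty`) are assumptions.

```python
import os
from typing import Iterable, Optional, Sequence


def validate_key(unit_name, tool_name_id, event_date, event_time) -> Optional[str]:
    """Return a rejection reason, or None if the consolidation key is valid."""
    if event_date is None or str(event_date) == "0000-00-00":
        return f"EventDate is invalid: {event_date}"
    if event_time is None:
        return "EventTime is NULL"
    if tool_name_id is None or str(tool_name_id) == "":
        return "ToolNameID is NULL or empty"
    if unit_name is None or str(unit_name) == "":
        return "UnitName is NULL or empty"
    if str(unit_name).startswith("[L"):
        # Typical of a Java array's default toString() leaking into the data.
        return f"UnitName is corrupted Java string: {unit_name}"
    return None


def log_invalid_keys(path: str, keys: Iterable[Sequence[object]]) -> int:
    """Write invalid keys to `path`; delete the file if nothing was logged."""
    skipped = 0
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("# Format: UnitName|ToolNameID|EventDate|EventTime|Reason\n")
        for key in keys:
            reason = validate_key(*key)
            if reason is None:
                continue
            fields = ["" if value is None else str(value) for value in key]
            fh.write("|".join(fields + [reason]) + "\n")
            skipped += 1
    if skipped == 0:
        os.remove(path)  # no errors: do not leave an empty log behind
    return skipped
```

The return value is the skipped-key count that would be reported at the end of the run.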

### Migration Guide (from old to new)

If you have an existing installation with `migration_state.json`:

1. **Back up your state file** (optional but recommended):
   ```bash
   cp migration_state.json migration_state.json.backup
   ```

2. **Run full migration** to populate `migration_state` table:
   ```bash
   python main.py migrate full
   ```

3. **Sync state** (if you have existing data):
   ```bash
   python scripts/sync_migration_state.py
   ```

4. **Remove old state file**:
   ```bash
   rm migration_state.json
   ```

5. **Run incremental migration**:
   ```bash
   # Preview first, then run for real
   python main.py migrate incremental --dry-run
   python main.py migrate incremental
   ```

### Performance Improvements

- **MySQL query time**: From 60+ seconds to <0.1 seconds (roughly 600x faster)
- **Consolidation efficiency**: Multiple MySQL rows → single PostgreSQL record
- **State reliability**: PostgreSQL table instead of JSON file

### Breaking Changes

- `--state-file` parameter removed from incremental migration (no longer uses JSON)
- `--use-id` flag removed (consolidation-based approach is now default)
- Incremental migration requires full migration to be run first
- `BATCH_SIZE` environment variable renamed to `CONSOLIDATION_GROUP_LIMIT` (update your `.env` file)
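
For deployments that might still carry the old variable, a startup check along these lines can surface the rename early. This is not described as project behavior: the helper name `consolidation_group_limit` and the default of 1000 are assumptions for illustration.

```python
import os
from typing import Mapping, Optional


def consolidation_group_limit(
    env: Optional[Mapping[str, str]] = None,
    default: int = 1000,  # assumed default, not taken from the project
) -> int:
    """Read CONSOLIDATION_GROUP_LIMIT and fail loudly on the old name."""
    env = os.environ if env is None else env
    if "CONSOLIDATION_GROUP_LIMIT" in env:
        return int(env["CONSOLIDATION_GROUP_LIMIT"])
    if "BATCH_SIZE" in env:
        # The old name is set but the new one is not: stop instead of
        # silently falling back to the default.
        raise RuntimeError(
            "BATCH_SIZE was renamed to CONSOLIDATION_GROUP_LIMIT; "
            "update your .env file"
        )
    return default
```

Raising instead of quietly honoring `BATCH_SIZE` is a design choice here: it forces the `.env` file to be updated once rather than keeping two names alive.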

## [Previous] - Before 2025-12-30

### Features

- Full migration support
- Incremental migration with timestamp tracking
- JSONB transformation
- Partitioning by year
- GIN indexes for JSONB queries
- Benchmark system
- Progress tracking
- Rich logging