Interactive visualization and analysis of the file organization ML dataset
Each file in the dataset has the following features extracted for ML training:
- filename: Original filename
- filename_lower: Lowercase filename
- filepath: Full file path
- category: Assigned category
- subcategory: Assigned subcategory
- extension: File extension (.png, .jpg)
- extension_category: Type (image, video, audio)
- filename_tokens: Tokenized words array
- filename_length: Character count
- has_numbers: Contains digits
- has_underscores: Contains underscores
- has_dashes: Contains dashes
- has_spaces: Contains spaces
- starts_with_date: Date prefix pattern
- is_screenshot: Screenshot pattern match
- is_game_asset: Game asset pattern match
- is_document: Document pattern match
- has_extracted_text: OCR text present
- extracted_text_length: Text char count
- has_company_name: Company detected
- has_people_names: Names detected
- people_count: Number of people
- has_datetime: EXIF datetime present
- has_gps: GPS coordinates present
- has_location: Location name present
- path_depth: Directory nesting level
- parent_folder: Parent directory name
- filename_hash: MD5 hash (dedup)

| Filename | Category | Subcategory | Extension | Patterns |
|---|---|---|---|---|

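
The filename-derived features listed above could be computed roughly as in the following sketch. It is illustrative only: the function name `extract_features`, the date-prefix regex, the tokenization rule, and the choice to hash the filename string (rather than the file contents) are all assumptions, not the dataset's actual extraction code. Pattern flags such as is_screenshot and the OCR and EXIF features are left out because their rules are not described here.

```python
import hashlib
import re
from pathlib import PurePosixPath

# Assumed pattern: a leading date such as 2023-01-15, 2023_01_15 or 20230115.
DATE_PREFIX = re.compile(r"^\d{4}[-_]?\d{2}[-_]?\d{2}")


def extract_features(filepath: str) -> dict:
    """Derive a subset of the listed features from a POSIX-style path."""
    path = PurePosixPath(filepath)
    name = path.name
    # Assumed tokenization: split the lowercase stem on non-alphanumerics.
    tokens = [t for t in re.split(r"[^a-z0-9]+", path.stem.lower()) if t]
    return {
        "filename": name,
        "filename_lower": name.lower(),
        "filepath": filepath,
        "extension": path.suffix.lower(),
        "filename_tokens": tokens,
        "filename_length": len(name),
        "has_numbers": any(c.isdigit() for c in name),
        "has_underscores": "_" in name,
        "has_dashes": "-" in name,
        "has_spaces": " " in name,
        "starts_with_date": bool(DATE_PREFIX.match(name)),
        # Assumed definition: number of path components above the file.
        "path_depth": len(path.parts) - 1,
        "parent_folder": path.parent.name,
        # Assumed: MD5 of the filename string, used only for deduplication.
        "filename_hash": hashlib.md5(name.encode("utf-8")).hexdigest(),
    }


if __name__ == "__main__":
    for key, value in extract_features("photos/2023-01-15_beach trip.JPG").items():
        print(f"{key}: {value}")
```

MD5 is adequate here because the hash only serves as a deduplication key, not as a security measure.
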
| Category | Precision | Recall | F1 Score | Support |
|---|---|---|---|---|
| game_assets | 98.79% | 78.22% | 87.31% | 5,111 |
| media | 14.88% | 93.95% | 25.69% | 314 |
| uncategorized | 0.00% | 0.00% | 0.00% | 546 |
| property_management | 0.00% | 0.00% | 0.00% | 49 |
| other (9 categories) | 0.00% | 0.00% | 0.00% | 12 |
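
Because game_assets accounts for 5,111 of the 6,032 files, any single summary score is dominated by that class. As a hedged illustration, the sketch below computes a support-weighted F1 from the rows of the table above; the `ROWS` list and the `weighted_f1` helper are hypothetical names introduced for this example, and the per-category values are copied from the table as reported.

```python
# (category, F1 as a fraction, support), copied from the table above.
ROWS = [
    ("game_assets", 0.8731, 5111),
    ("media", 0.2569, 314),
    ("uncategorized", 0.0, 546),
    ("property_management", 0.0, 49),
    ("other (9 categories)", 0.0, 12),
]


def weighted_f1(rows):
    """Average the per-category F1 scores, weighting each by its support."""
    total_support = sum(support for _, _, support in rows)
    return sum(f1 * support for _, f1, support in rows) / total_support


if __name__ == "__main__":
    print(f"support-weighted F1: {weighted_f1(ROWS):.4f}")
```

A weighted score of this kind hides the fact that most categories score 0.00%, so it is best read alongside the per-category rows rather than in place of them.
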

| Actual \ Predicted | game_assets | media | technical | legal |
|---|---|---|---|---|
| game_assets | 3,998 | 1,113 | - | - |
| media | 19 | 295 | - | - |
| uncategorized | 28 | 518 | - | - |
| property_management | 2 | 47 | - | - |
| personal | - | - | - | 1 |
| filepath | - | - | 1 | - |
Green = correct predictions | Red = significant misclassifications | Orange = minor misclassifications
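
The per-category scores can be cross-checked against the confusion matrix. The sketch below is a minimal example under stated assumptions: the `confusion` dictionary holds only the four largest rows shown above, "-" cells are treated as 0, and a metric with a zero denominator is reported as 0.0 (which would explain the 0.00% rows for categories the model never predicts). The helper name `per_class_metrics` is hypothetical.

```python
# confusion[actual][predicted] = count, copied from the matrix above.
# Only the four largest rows are included; "-" cells are treated as 0.
confusion = {
    "game_assets": {"game_assets": 3998, "media": 1113},
    "media": {"game_assets": 19, "media": 295},
    "uncategorized": {"game_assets": 28, "media": 518},
    "property_management": {"game_assets": 2, "media": 47},
}


def per_class_metrics(matrix, label):
    """Return (precision, recall, f1) for one label; 0.0 if undefined."""
    row = matrix.get(label, {})
    tp = row.get(label, 0)
    fn = sum(row.values()) - tp  # actual label, predicted as something else
    fp = sum(
        other_row.get(label, 0)
        for actual, other_row in matrix.items()
        if actual != label
    )  # predicted as label, actually something else
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    for label in confusion:
        p, r, f1 = per_class_metrics(confusion, label)
        print(f"{label}: precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
```

For game_assets this reproduces the reported 98.79% precision, 78.22% recall, and 87.31% F1. The media precision computed from these four rows alone may differ slightly from the reported 14.88%, since the small categories are not included in this dictionary.
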