ML Data Explorer

Interactive visualization and analysis of the file organization ML dataset

Total Records
30,133
Files in dataset
Training Set
24,101
80.0% of total
Test Set
6,032
20.0% of total
Data Quality
95.47%
Excellent quality
Categories
13
Unique classes
Vocabulary Size
1,430
Unique tokens (min_freq: 5)
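
The vocabulary card above reports unique tokens kept under a minimum frequency of 5. A minimal sketch of that kind of filtering follows. The tokenization rule (lowercase, split on non-alphanumeric characters) and the function names are assumptions for illustration, not the dashboard's actual pipeline.

```python
import re
from collections import Counter

def tokenize(filename):
    """Lowercase a filename and split on non-alphanumeric characters (assumed rule)."""
    return [t for t in re.split(r"[^a-z0-9]+", filename.lower()) if t]

def build_vocabulary(filenames, min_freq=5):
    """Map each token seen at least `min_freq` times to an integer index."""
    counts = Counter(tok for name in filenames for tok in tokenize(name))
    kept = sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: idx for idx, tok in enumerate(kept)}

# Hypothetical input: one filename repeated 5 times, another only twice.
names = ["IMG_9352.heic"] * 5 + ["walking-outside.png"] * 2
vocab = build_vocabulary(names, min_freq=5)
print(vocab)  # {'9352': 0, 'heic': 1, 'img': 2}
```

Tokens from the second filename occur only twice, so they fall below the threshold and are dropped.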
Category Distribution
Top Extensions
Top 20 Filename Tokens
Features Extracted Per File

Each file in the dataset has the following features extracted for ML training:

📁 Basic Features

  • filename - Original filename
  • filename_lower - Lowercase filename
  • filepath - Full file path
  • category - Assigned category
  • subcategory - Assigned subcategory

📎 Extension Features

  • extension - File extension (.png, .jpg)
  • extension_category - Type (image, video, audio)

🔤 Filename Pattern Features

  • filename_tokens - Tokenized words array
  • filename_length - Character count
  • has_numbers - Contains digits
  • has_underscores - Contains underscores
  • has_dashes - Contains dashes
  • has_spaces - Contains spaces
  • starts_with_date - Date prefix pattern

🎯 Pattern Detection

  • is_screenshot - Screenshot pattern match
  • is_game_asset - Game asset pattern match
  • is_document - Document pattern match

📊 Metadata Features

  • has_extracted_text - OCR text present
  • extracted_text_length - Text char count
  • has_company_name - Company detected
  • has_people_names - Names detected
  • people_count - Number of people

📷 Image Metadata

  • has_datetime - EXIF datetime present
  • has_gps - GPS coordinates present
  • has_location - Location name present

📂 Path Features

  • path_depth - Directory nesting level
  • parent_folder - Parent directory name

🔧 Derived Features

  • filename_hash - MD5 hash (dedup)

Total Features
24
Feature Groups
8
Vocabulary Tokens
1,430
Label Classes
13
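
Several of the filename and path features listed above can be derived from the path string alone. The sketch below shows one possible way to compute them. The date-prefix pattern, the tokenization rule, the choice to hash the lowercase filename, and the example path are all assumptions, since the page does not specify them.

```python
import hashlib
import re
from pathlib import PurePosixPath

# Assumed date prefixes: YYYYMMDD or YYYY-MM-DD at the start of the name.
DATE_PREFIX = re.compile(r"^(\d{8}|\d{4}-\d{2}-\d{2})")

def extract_features(filepath):
    """Compute a subset of the listed features from a relative POSIX-style path."""
    path = PurePosixPath(filepath)
    name = path.name
    lower = name.lower()
    return {
        "filename": name,
        "filename_lower": lower,
        "filepath": filepath,
        "extension": path.suffix.lower(),
        "filename_tokens": [t for t in re.split(r"[^a-z0-9]+", lower) if t],
        "filename_length": len(name),
        "has_numbers": any(c.isdigit() for c in name),
        "has_underscores": "_" in name,
        "has_dashes": "-" in name,
        "has_spaces": " " in name,
        "starts_with_date": bool(DATE_PREFIX.match(name)),
        "path_depth": len(path.parts) - 1,  # number of directories above the file
        "parent_folder": path.parent.name,
        # Assumption: the dedup hash is taken over the lowercase filename.
        "filename_hash": hashlib.md5(lower.encode("utf-8")).hexdigest(),
    }

# Hypothetical path; only the filename comes from the sample list on this page.
features = extract_features("photos/2025/20251021_030330_9.webp")
print(features["extension"], features["path_depth"], features["starts_with_date"])
# .webp 2 True
```

Label, OCR, and EXIF features (category, has_extracted_text, has_gps, and so on) need external data and are left out of this sketch.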
Dataset Split
Training Set 24,101
Test Set 6,032
Test Ratio 20%
Stratified By Category
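
A split stratified by category keeps roughly the same test ratio inside every class. The following is a minimal standard-library sketch of the idea. The record layout, the seed, and the per-category rounding are assumptions and may differ from how this dataset was actually split.

```python
import random
from collections import defaultdict

def stratified_split(records, test_ratio=0.2, seed=42):
    """Split records so each category contributes about `test_ratio` to the test set."""
    by_category = defaultdict(list)
    for rec in records:
        by_category[rec["category"]].append(rec)

    rng = random.Random(seed)
    train, test = [], []
    for category in sorted(by_category):
        group = by_category[category][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)  # rounded per category
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test

# Hypothetical records: 50 media files and 100 game assets.
records = [{"filename": f"f{i}.png", "category": "media"} for i in range(50)]
records += [{"filename": f"g{i}.obj", "category": "game_assets"} for i in range(100)]
train, test = stratified_split(records)
print(len(train), len(test))  # 120 30
```

Because the test count is rounded within each category, the overall ratio can land slightly off the nominal 20%.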
Train/Test Distribution
Category Distribution (All Categories)
Main Categories (Pie Chart)
Subcategory Distribution
Screenshots Detected
180
0.60% of dataset
Game Assets Detected
6,729
22.33% of dataset
Documents Detected
3
0.01% of dataset
Files with Text
236
0.78% of dataset
Pattern Detection Summary
Metadata Availability
Quality Gauge
95.47%
Quality Score
Uncategorized Files 2,727
9.05% of dataset
Duplicate Files 11
0.04% of dataset
Unknown Extensions 2
0.01% of dataset
Sample Issues

Uncategorized Files (Sample)

ScreenshotResult.js.map
ScreenshotResult.d.ts.map
93419027627913a58f...
20251021_030330_9.webp
20251021_030330_5.webp
20251021_030330_4.webp

Duplicate Files (Sample)

IMG_9352.heic
IMG_9352.mov
IMG_4766.heic
IMG_2395.mov
walking-outside.png
stretching.png
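
The feature list mentions an MD5 filename hash used for deduplication. One way such a check could work is to group paths by the hash of the filename. In this sketch the hashing of the lowercase filename and the directory names are assumptions, since the page does not say what exactly is hashed.

```python
import hashlib
from collections import defaultdict

def find_duplicates(filepaths):
    """Group paths whose lowercase filename has the same MD5 digest."""
    groups = defaultdict(list)
    for fp in filepaths:
        name = fp.rsplit("/", 1)[-1].lower()
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        groups[digest].append(fp)
    # Only groups with more than one path count as duplicates.
    return [paths for paths in groups.values() if len(paths) > 1]

# Hypothetical directories "a" and "b"; filenames taken from the samples above.
paths = ["a/IMG_9352.heic", "b/IMG_9352.heic", "a/stretching.png"]
dups = find_duplicates(paths)
print(dups)  # [['a/IMG_9352.heic', 'b/IMG_9352.heic']]
```

Hashing the name alone flags same-named files in different folders. Hashing file contents instead would catch renamed copies, at a higher I/O cost.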
Feature Records
Filename Category Subcategory Extension Patterns
Category Accuracy
71.17%
4,293 / 6,032 correct
Subcategory Accuracy
52.90%
3,191 / 6,032 correct
Avg Confidence
75.79%
Mean prediction confidence
Test Samples
6,032
FileCategorizationModel v1.0
Per-Category F1 Scores
Precision vs Recall by Category
Top Misclassifications
Per-Category Metrics
Category              Precision  Recall  F1 Score  Support
game_assets              98.79%  78.22%    87.31%    5,111
media                    14.88%  93.95%    25.69%      314
uncategorized             0.00%   0.00%     0.00%      546
property_management       0.00%   0.00%     0.00%       49
other (9 categories)      0.00%   0.00%     0.00%       12
Key Insights
Strong Game Asset Detection
98.79% precision for game_assets - model rarely mislabels other files as game assets
High Recall for Media
93.95% recall catches most media files, but low precision (14.88%) means many false positives
Missing Categories
Model never correctly predicts uncategorized, property_management, or the other minority classes (0% precision and recall)
Top Misclassification Flows
game_assets → media 1,113
21.8% of game assets misclassified as generic media
uncategorized → media 518
94.9% of uncategorized predicted as media
property_management → media 47
95.9% of property files predicted as media
Confusion Matrix (Major Categories)
Actual \ Predicted    game_assets  media  technical  legal
game_assets                 3,998  1,113          -      -
media                          19    295          -      -
uncategorized                  28    518          -      -
property_management             2     47          -      -
personal                        -      -          -      1
filepath                        -      -          1      -

Green = correct predictions | Red = significant misclassifications | Orange = minor misclassifications
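
The game_assets metrics and the overall category accuracy can be recomputed from the confusion matrix counts shown above. This is a small sketch with a hypothetical helper name. It assumes the cells shown as "-" in the game_assets column are zero, so false positives come only from the media, uncategorized, and property_management rows.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts read from the confusion matrix for game_assets.
tp = 3998          # game_assets predicted as game_assets
fn = 1113          # game_assets predicted as media
fp = 19 + 28 + 2   # other rows predicted as game_assets (assumes "-" means 0)

p, r, f1 = precision_recall_f1(tp, fp, fn)
print(f"{p:.2%} {r:.2%} {f1:.2%}")  # 98.79% 78.22% 87.31%

# Category accuracy: correct game_assets plus correct media over all test samples.
accuracy = (3998 + 295) / 6032
print(f"{accuracy:.2%}")  # 71.17%
```

The same helper gives 0.0 for any class with no correct predictions, which matches the 0.00% rows in the per-category table.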