ML Data Explorer - File Organization System

Total Records

30,133

Files in dataset

Training Set

24,101

80.0% of total

Test Set

6,032

20.0% of total

Data Quality

95.47%

Excellent quality

📁 Basic Features

filename - Original filename
filename_lower - Lowercase filename
filepath - Full file path
category - Assigned category
subcategory - Assigned subcategory

📎 Extension Features

extension - File extension (e.g. .png, .jpg)
extension_category - File type (e.g. image, video, audio)
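
As an illustration of how the two extension features could be derived, here is a minimal sketch. The `EXTENSION_CATEGORIES` mapping and the `"unknown"` fallback are assumptions made for this example; the dashboard does not list the real mapping.

```python
from pathlib import PurePath

# Hypothetical mapping, for illustration only.
EXTENSION_CATEGORIES = {
    ".png": "image", ".jpg": "image", ".webp": "image", ".heic": "image",
    ".mov": "video", ".mp4": "video",
    ".mp3": "audio", ".wav": "audio",
}

def extension_features(filename: str) -> dict:
    """Return the lowercased final extension and its assumed type."""
    ext = PurePath(filename).suffix.lower()  # only the last suffix, e.g. ".map"
    return {
        "extension": ext,
        "extension_category": EXTENSION_CATEGORIES.get(ext, "unknown"),
    }
```

With this sketch, a multi-suffix name such as ScreenshotResult.js.map yields the extension ".map", which falls through to "unknown".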

🔤 Filename Pattern Features

filename_tokens - Tokenized words array
filename_length - Character count
has_numbers - Contains digits
has_underscores - Contains underscores
has_dashes - Contains dashes
has_spaces - Contains spaces
starts_with_date - Date prefix pattern
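
The filename pattern features above can be sketched in a few lines. This is one possible implementation, not the explorer's actual code: the tokenizer (split on non-alphanumeric characters) and the two date formats accepted by `DATE_PREFIX` are assumptions.

```python
import re

# Assumed date prefixes: YYYYMMDD or YYYY-MM-DD at the start of the name.
DATE_PREFIX = re.compile(r"^(\d{8}|\d{4}-\d{2}-\d{2})")

def filename_pattern_features(filename: str) -> dict:
    """Compute the simple string-level features for one filename."""
    stem = filename.rsplit(".", 1)[0]  # drop the final extension, if any
    return {
        "filename_tokens": [t for t in re.split(r"[^A-Za-z0-9]+", stem) if t],
        "filename_length": len(filename),
        "has_numbers": any(c.isdigit() for c in filename),
        "has_underscores": "_" in filename,
        "has_dashes": "-" in filename,
        "has_spaces": " " in filename,
        "starts_with_date": bool(DATE_PREFIX.match(filename)),
    }
```

For 20251021_030330_9.webp this gives the tokens 20251021, 030330 and 9, with has_underscores and starts_with_date both true.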

🎯 Pattern Detection

is_screenshot - Screenshot pattern match
is_game_asset - Game asset pattern match
is_document - Document pattern match

📊 Metadata Features

has_extracted_text - OCR text present
extracted_text_length - Text char count
has_company_name - Company detected
has_people_names - Names detected
people_count - Number of people

📷 Image Metadata

has_datetime - EXIF datetime present
has_gps - GPS coordinates present
has_location - Location name present

📂 Path Features

path_depth - Directory nesting level
parent_folder - Parent directory name
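
A minimal sketch of the two path features, assuming POSIX-style paths and assuming that path_depth counts the directories above the file (excluding the root). The dashboard does not state the exact counting convention.

```python
from pathlib import PurePosixPath

def path_features(filepath: str) -> dict:
    """Derive nesting depth and parent directory name from a file path."""
    parent = PurePosixPath(filepath).parent
    # Do not count the leading "/" of an absolute path as a directory level.
    depth = len(parent.parts) - (1 if parent.is_absolute() else 0)
    return {"path_depth": depth, "parent_folder": parent.name}
```

Under these assumptions, photos/2025/IMG_9352.heic has a depth of 2 and a parent folder of 2025. The directory names in that path are made up for the example.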

🔧 Derived Features

filename_hash - MD5 hash (dedup)
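
To show how an MD5 hash can support deduplication, the sketch below groups filenames by digest. Hashing the lowercased filename is an assumption; the dashboard only says that the feature is an MD5 hash used for dedup. The helper `find_duplicates` is hypothetical.

```python
import hashlib
from collections import defaultdict

def filename_hash(filename: str) -> str:
    """MD5 hex digest of the lowercased filename (assumed input)."""
    return hashlib.md5(filename.lower().encode("utf-8")).hexdigest()

def find_duplicates(filenames):
    """Return groups of filenames that share the same hash."""
    groups = defaultdict(list)
    for name in filenames:
        groups[filename_hash(name)].append(name)
    return [names for names in groups.values() if len(names) > 1]
```

MD5 is adequate here because the hash serves as a grouping key, not as a security measure.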

Total Features

Feature Groups

1,430

Vocabulary Tokens

Label Classes

Dataset Split

Training Set 24,101

Test Set 6,032

Test Ratio 20%

Stratified By Category
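
A stratified split keeps each category's share roughly equal in the training and test sets. The sketch below shows the idea with the standard library only. It is an illustration, not the explorer's implementation: the record layout (a dict with a "category" key), the seed, and the per-category rounding are assumptions.

```python
import random
from collections import defaultdict

def stratified_split(records, test_ratio=0.2, seed=0):
    """Split records into (train, test), sampling test_ratio per category."""
    by_category = defaultdict(list)
    for record in records:
        by_category[record["category"]].append(record)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, test = [], []
    for items in by_category.values():
        items = items[:]  # copy so the caller's list is not shuffled
        rng.shuffle(items)
        n_test = round(len(items) * test_ratio)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test
```

Because the test count is rounded separately for each category, the overall test share can differ slightly from the nominal ratio.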

Train/Test Distribution

Category Distribution (All Categories)

Main Categories (Pie Chart)

Subcategory Distribution

Screenshots Detected

180

0.60% of dataset

Game Assets Detected

6,729

22.33% of dataset

Documents Detected

0.01% of dataset

Files with Text

236

0.78% of dataset has extracted text

Pattern Detection Summary

Metadata Availability

Quality Gauge

95.47%

Quality Score

Uncategorized Files 2,727

9.05% of dataset

Duplicate Files 11

0.04% of dataset

Unknown Extensions 2

0.01% of dataset

Sample Issues

Uncategorized Files (Sample)

ScreenshotResult.js.map

ScreenshotResult.d.ts.map

93419027627913a58f...

20251021_030330_9.webp

20251021_030330_5.webp

20251021_030330_4.webp

Duplicate Files (Sample)

IMG_9352.heic

IMG_9352.mov

IMG_4766.heic

IMG_2395.mov

walking-outside.png

stretching.png

Feature Records

Filename	Category	Subcategory	Extension	Patterns

Category Accuracy

71.17%

4,293 / 6,032 correct

Subcategory Accuracy

52.90%

3,191 / 6,032 correct

Avg Confidence

75.79%

Mean prediction confidence

Test Samples

6,032

FileCategorizationModel v1.0

Per-Category F1 Scores

Precision vs Recall by Category

Top Misclassifications

Per-Category Metrics

Category	Precision	Recall	F1 Score	Support
game_assets	98.79%	78.22%	87.31%	5,111
media	14.88%	93.95%	25.69%	314
uncategorized	0.00%	0.00%	0.00%	546
property_management	0.00%	0.00%	0.00%	49
other (9 categories)	0.00%	0.00%	0.00%	12
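
The game_assets row can be checked against the counts in the confusion matrix of this report. The sketch below uses the standard definitions of precision, recall and F1. The false-positive count of 49 (19 + 28 + 2) assumes that the cells shown as "-" in the game_assets column are zero.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Standard per-class metrics; returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# game_assets: 3,998 correct, 49 other files predicted as game_assets,
# and 5,111 - 3,998 = 1,113 game assets predicted as something else.
p, r, f1 = precision_recall_f1(tp=3998, fp=49, fn=1113)
print(f"{p:.2%} {r:.2%} {f1:.2%}")  # 98.79% 78.22% 87.31%
```

The zero guard also reproduces the 0.00% rows: a class with no correct predictions scores 0 on all three metrics.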

Key Insights

Strong Game Asset Detection

98.79% precision for game_assets - model rarely mislabels other files as game assets

High Recall for Media

93.95% recall catches most media files, but low precision (14.88%) means many false positives

Missing Categories

Precision and recall are both 0% for uncategorized, property_management, and the nine other minority classes: the model never labels a file in any of them correctly

Top Misclassification Flows

game_assets → media 1,113

21.8% of game assets misclassified as generic media

uncategorized → media 518

94.9% of uncategorized predicted as media

property_management → media 47

95.9% of property files predicted as media
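
Each flow percentage is the misclassified count divided by the support of the actual category, taken from the Per-Category Metrics table. A short check of that arithmetic (the dictionary names are illustrative):

```python
# (actual, predicted) -> number of test files, from the flows above
flows = {
    ("game_assets", "media"): 1113,
    ("uncategorized", "media"): 518,
    ("property_management", "media"): 47,
}
# Test-set support per actual category, from the metrics table
support = {"game_assets": 5111, "uncategorized": 546, "property_management": 49}

rates = {
    pair: round(100 * count / support[pair[0]], 1)
    for pair, count in flows.items()
}
for (actual, predicted), rate in rates.items():
    print(f"{actual} -> {predicted}: {rate}%")
```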

Confusion Matrix (Major Categories)

Actual \ Predicted	game_assets	media	technical	legal
game_assets	3,998	1,113	-	-
media	19	295	-	-
uncategorized	28	518	-	-
property_management	2	47	-	-
personal	-	-	-	1
filepath	-	-	1	-

Green = correct predictions | Red = significant misclassifications | Orange = minor misclassifications
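
The diagonal of the matrix ties back to the headline category accuracy. Only game_assets and media have correct predictions, so the check is short (variable names are illustrative):

```python
# Correct predictions per category (matrix diagonal)
correct = {"game_assets": 3998, "media": 295}
test_samples = 6032

n_correct = sum(correct.values())
accuracy = n_correct / test_samples
print(n_correct, f"{accuracy:.2%}")  # 4293 71.17%
```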