FEU Institute of Technology Undergraduate Thesis

Detecting Internet Brain Rot with Multimodal AI

Visual-Qwen pairs an EVA-CLIP vision encoder and a Q-Former attention bottleneck with Qwen3-4B and Whisper transcripts to flag “sludge” short-form videos: stacked, multi-feed clips engineered to bypass single-modality moderators.

96.67%

Video-level accuracy on the 300-video test split

97.19%

F1-score (95.58% precision, 98.86% recall)

~6,000

Multimodal samples in the open Kaggle dataset

4B

Parameter Qwen3 backbone, LoRA-tuned

How It Works

A frozen-projector tri-modal classifier

Visual-Qwen reads each clip across three signals at once: an EVA-CLIP frame embedding, a Q-Former cross-modal attention bottleneck, and a Whisper transcript of the audio. Qwen3-4B fuses the streams and emits a sludge / not-sludge verdict.

What is sludge?

Multi-feed short-form clips that stack unrelated content (gameplay over reaction video over text crawl) to defeat algorithmic moderation built for single coherent scenes.

Full architecture breakdown on /thesis

Research

What the paper shows

Three headline contributions, every number traceable to the public test split.

96.67%

Video-level test accuracy

300-video held-out split

97.19%

F1-score

precision 95.58% / recall 98.86%

+0.77 pp

Lift from frozen projector

regularization finding

~6,000

Multimodal samples

open on Kaggle
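The headline figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes F1 as the harmonic mean of the reported precision and recall, and converts the accuracy back into a count of correct videos. The count of 290 is an inference from 96.67% of 300, not a number stated on this page, and the function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported values from the 300-video test split.
precision = 0.9558
recall = 0.9886

f1 = f1_score(precision, recall)
print(f"F1 = {f1 * 100:.2f}%")  # F1 = 97.19%

# Inferred, not reported: 96.67% of 300 videos rounds to 290 correct predictions.
correct = round(0.9667 * 300)
print(correct, f"{correct / 300 * 100:.2f}%")  # 290 96.67%
```

Both results agree with the published numbers to two decimal places.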

Cross-modal Q-Former

A 32-token attention bottleneck distills heterogeneous vision and audio signals into a single embedding the LLM can fuse.

See the architecture
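To make the bottleneck idea concrete, here is a minimal, dependency-free sketch of cross-attention with a fixed set of 32 query tokens. It is not the thesis implementation: the toy width `DIM`, the random stand-ins for learned queries, and the omission of learned key/value projections, multiple heads, and stacked layers are all simplifying assumptions. It only illustrates why the output length stays at 32 regardless of how many input features a clip produces.

```python
import math
import random

NUM_QUERIES = 32  # bottleneck size quoted in the card above
DIM = 8           # toy feature width, an assumption for illustration


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, features):
    """Single-head scaled dot-product cross-attention.

    Each query attends over every input feature and returns a weighted
    average of them, so the output has len(queries) rows no matter how
    many features come in. Keys and values are the raw features here;
    a real Q-Former applies learned projections first.
    """
    scale = math.sqrt(len(queries[0]))
    output = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, f)) / scale for f in features]
        weights = softmax(scores)
        mixed = [sum(w * f[d] for w, f in zip(weights, features))
                 for d in range(len(q))]
        output.append(mixed)
    return output


def random_vectors(count, dim, rng):
    """Hypothetical helper: random vectors standing in for real embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
queries = random_vectors(NUM_QUERIES, DIM, rng)  # stand-in for learned query tokens
short_clip = random_vectors(50, DIM, rng)        # e.g. 50 frame-patch features
long_clip = random_vectors(400, DIM, rng)        # e.g. 400 frame-patch features

print(len(cross_attend(queries, short_clip)))  # prints 32
print(len(cross_attend(queries, long_clip)))   # prints 32
```

Whether the clip yields 50 features or 400, the language model downstream sees the same 32 vectors, which is what keeps its input length predictable.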

Frozen-projector ablation

Freezing the stage-1 Linear projector during LoRA fine-tuning beat training it by 0.77 pp: less alignment drift, better generalization.

Read the ablation

Open 2K TikTok-sludge dataset

Two thousand short-form clips, human-validated, paired with Whisper-V3-Turbo transcripts. Released on Kaggle under an open license.

Open on Kaggle

Visual-Qwen: Augmenting Multimodal Deep Learning with Attention Mechanisms

FEU Institute of Technology, 2025. Open paper, open code, open dataset, open weights.

Try it Yourself

Upload Your Video

Run our fine-tuned multimodal model on your own clip. It looks for sludge: short-form video that stacks unrelated streams together (think Subway Surfers under Family Guy under soap-cutting) to defeat single-modality moderators.

Hosted on Hugging Face Spaces (free CPU). Inference takes about 1 to 2 minutes per video on the default settings, longer with deep analysis.


Our Team

Meet the Researchers

Four dedicated CS students and their extraordinary advisor from FEU Institute of Technology, combining academic excellence with entrepreneurial vision.