Visual-Qwen

Augmenting Multimodal Deep Learning with Attention Mechanisms to Recognize “Sludge” Videos from Short-Form Content

Marc Olata, Alpha Romer Coma, Job Isaac Ong, Kristoffer Ian Sioson

Justine Jude Pura (Project Mentor), Shaneth Ambat (Course Adviser)

Abstract

The proliferation of “sludge” content, short-form videos in which multiple unrelated clips play simultaneously, presents a significant challenge to conventional content moderation systems on platforms like TikTok and YouTube Shorts. This format is engineered to manipulate recommendation algorithms and circumvent moderation by creating deliberate audiovisual mismatches, a tactic that unimodal analysis tools fail to detect reliably. This research addresses this gap by developing and evaluating Visual-Qwen, a novel multimodal deep learning architecture augmented with attention mechanisms for the automated recognition of sludge videos.

The proposed model integrates a frozen EVA-CLIP-G/14 vision encoder (via the BLIP-2 bundle from Salesforce/blip2-opt-2.7b) and a Whisper V3 Turbo audio transcription module to extract visual and textual features, respectively. A lightweight Query-Former (Q-Former) acts as a cross-modal attention fusion mechanism, distilling these heterogeneous inputs into a compact set of learned embeddings. These fused features are then projected into a frozen Qwen3-4B large language model, which generates a final classification and a human-readable explanation. To ensure robust and generalizable performance, the model was trained on a custom-built dataset of 2,000 TikTok and YouTube Shorts videos, evenly balanced between sludge and non-sludge content, ethically sourced and annotated through a human-in-the-loop pipeline with external expert validation.

Evaluated on the held-out 300-video test set (Kaggle stratified split), Visual-Qwen achieved 96.67% video-level accuracy (95% CI 94.33–98.67), 95.58% precision, 98.86% recall, and a 97.19% F1-score. Furthermore, evaluations conducted with content creators, content moderators, and machine learning experts confirmed the system's high utility and trustworthiness, scoring favorably on assessments based on the Technology Acceptance Model (TAM) and ISO/IEC TR 24028 guidelines. This study demonstrates that an attention-augmented multimodal approach can effectively identify complex and evasive content formats, offering a significant contribution to developing more sophisticated and resilient automated content moderation systems.

Model

Visual-Qwen consists of a frozen EVA-CLIP-G/14 vision encoder, a Whisper V3 Turbo audio transcription module, a lightweight Query-Former (Q-Former), and a frozen Qwen3-4B large language model. The vision encoder and Q-Former are inherited from the BLIP-2 bundle (Salesforce/blip2-opt-2.7b) and remain frozen at every stage. Only the linear projection layer (stage 1) and a low-rank Qwen3-4B adapter (stage 2) are trained.

Component	Output shape	Hidden size	Trainable?
EVA-CLIP-G/14 vision encoder	`(B, 257, 1408)`	1408	Frozen
Q-Former (12 layers, 32 query tokens)	`(B, 32, 768)`	768	Frozen
Linear projector	`(B, 32, 2560)`	768 → 2560	Trained stage 1; frozen stage 2
Whisper V3 Turbo transcript	Qwen text tokens	n/a	Frozen
Qwen3-4B decoder	Causal LM	2560	Frozen + LoRA (stage 2)

The 32 query tokens are a sequence length, not a feature dimension. The linear projector maps per-token features (768 to 2560), preserving the 32-token sequence. Visual tokens, Whisper transcript tokens, and instruction tokens are concatenated as the Qwen3-4B input context.
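
The shape bookkeeping in the table and paragraph above can be traced with a minimal sketch. It tracks shapes only and creates no tensors or model weights; the dimensions come from the component table, while the function names and the transcript and instruction token counts in the usage note are hypothetical.

```python
# Shape bookkeeping only: no tensors or model weights are created here.
# Dimensions come from the component table; all function names are hypothetical.

NUM_QUERY_TOKENS = 32   # Q-Former query tokens (a sequence length)
QFORMER_HIDDEN = 768    # Q-Former per-token feature size
QWEN_HIDDEN = 2560      # Qwen3-4B hidden size


def qformer_output_shape(batch_size):
    """Shape of the Q-Former output: (B, 32, 768)."""
    return (batch_size, NUM_QUERY_TOKENS, QFORMER_HIDDEN)


def project_shape(shape, in_features=QFORMER_HIDDEN, out_features=QWEN_HIDDEN):
    """A linear layer acts on the last axis only, so the token axis is preserved."""
    *leading, features = shape
    if features != in_features:
        raise ValueError(f"expected last dim {in_features}, got {features}")
    return (*leading, out_features)


def context_shape(batch_size, num_transcript_tokens, num_instruction_tokens):
    """Concatenate visual, transcript, and instruction tokens along the sequence axis."""
    _, visual_tokens, hidden = project_shape(qformer_output_shape(batch_size))
    seq_len = visual_tokens + num_transcript_tokens + num_instruction_tokens
    return (batch_size, seq_len, hidden)
```

For example, with a hypothetical 100-token transcript and 20-token instruction, `context_shape(2, 100, 20)` returns `(2, 152, 2560)`: the 32 visual tokens contribute to the sequence length, not to the feature dimension.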

Dataset

The dataset is a balanced collection of 2,000 short-form videos (1,000 sludge and 1,000 non-sludge), assembled through ethical scraping of public TikTok and YouTube Shorts feeds in accordance with the YouTube Researcher Program. Each video contributes paired visual, audio, and textual modalities, totaling 6,000 rows of multimodal data (2,000 videos × 3 modalities). The collection process combined automated platform-API scraping, manual screening, synthetic feature generation with Gemini 2.5 Flash, human verification, and external expert validation. The corpus is split 70% training / 15% validation / 15% test (1,400 / 300 / 300 videos) with stratified sampling, matching the official release on Kaggle.
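
The split arithmetic can be illustrated with a minimal stratified-split sketch. This is not the procedure used to produce the Kaggle release and will not reproduce its membership; the function name, seed, and label strings are assumptions made for illustration.

```python
import random


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split indices into train/val/test so each class keeps the same proportions.

    Hypothetical helper: the seed and shuffling scheme are illustrative only.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    splits = {"train": [], "val": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        splits["train"] += indices[:n_train]
        splits["val"] += indices[n_train:n_train + n_val]
        splits["test"] += indices[n_train + n_val:]  # remainder goes to test
    return splits


# Placeholder labels standing in for the 1,000 sludge and 1,000 non-sludge videos.
labels = ["sludge"] * 1000 + ["non-sludge"] * 1000
splits = stratified_split(labels)
sizes = {name: len(indices) for name, indices in splits.items()}
```

Here `sizes` is `{"train": 1400, "val": 300, "test": 300}`, and each split contains the two classes in equal numbers, which is what stratification over a balanced corpus implies.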

Training

The model was trained on Google Cloud TPU v4-64 pods granted by the TPU Research Cloud, with training data ingested via Cloud Storage FUSE. Training proceeded in two stages: a pre-training stage on the LLaVA image-caption dataset (177 minutes for 4 epochs, training only the linear projection layer while EVA-CLIP-G/14, Q-Former, and Qwen3-4B remained frozen), followed by a fine-tuning stage on the 2,000-video sludge dataset (9.6 minutes for 6 epochs, training only LoRA adapters injected into Qwen3-4B). Total training time was approximately 3 hours (about 187 minutes).
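
The two-stage schedule can be summarized as a small configuration sketch. The component names are descriptive labels rather than identifiers from the project's code, and the projector parameter count assumes a single linear layer with a bias term, which the text does not confirm.

```python
# Hypothetical summary of the two-stage schedule described above.
# Component names are labels for illustration, not identifiers from the codebase.
STAGES = {
    "stage1_pretraining": {
        "trainable": ("linear_projector",),
        "frozen": ("vision_encoder", "q_former", "qwen3_4b_base"),
        "epochs": 4,
        "minutes": 177.0,
    },
    "stage2_finetuning": {
        "trainable": ("qwen3_4b_lora",),
        "frozen": ("vision_encoder", "q_former", "linear_projector", "qwen3_4b_base"),
        "epochs": 6,
        "minutes": 9.6,
    },
}


def projector_parameter_count(in_features=768, out_features=2560, bias=True):
    """Parameters in one linear layer; the bias term is an assumption."""
    return in_features * out_features + (out_features if bias else 0)


total_minutes = sum(stage["minutes"] for stage in STAGES.values())
total_hours = total_minutes / 60
```

Under these assumptions the projector holds 1,968,640 parameters (1,966,080 without a bias), and the two stages sum to 186.6 minutes, or roughly 3.1 hours.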

Acknowledgement

This website is adapted from MiniGPT-4, licensed under a BSD-3-Clause License.