Talk to Your CCTV: The Hybrid AI Behind Piloo.ai

Pyrack Technologies | AI Research Insights

New research from Delft University of Technology reveals both the promise and the pitfalls of using Vision-Language Models for surveillance. In this edition, we discuss how we've tried to solve some of these challenges with Piloo.ai.

The Problem: Too Many Cameras, Too Few Eyes

Modern security operations face an impossible equation:

Hundreds of CCTV cameras per facility
4-6 feeds maximum per human operator
Critical incidents are rare but demand instant detection

Recent research from Delft University of Technology tested whether Vision-LLMs could bridge this gap by understanding surveillance video through natural language queries.

The verdict? Promising, but not production-ready. Yet.

What The Research Found

Researchers tested four vision-language models (Gemma-3, NVILA-8B, Qwen2.5-VL, VideoLLaMA-3) on surveillance anomaly detection:

What Worked:

82-86% accuracy on fight detection and other clear-cut use cases
Zero-shot capability: Add new anomaly types without retraining
Natural language descriptions make AI decisions explainable

What Didn't:

Only 26-45% accuracy on complex multi-class scenarios (13 different crime types)
Privacy filters hurt performance: accuracy dropped by 2-11% when faces and bodies were anonymized
High false positives: False alarm rates of up to 68% with some configurations
Temporal inconsistencies: Privacy filters made the same person look different across frames

Conclusion: Pure Vision-LLM approaches aren't ready for real-world security operations.

How Piloo.ai Solves This

At Pyrack Technologies, we've built Piloo.ai specifically to address these limitations through a hybrid intelligence architecture:

Our Approach: Best of Both Worlds

Instead of relying solely on Vision-LLMs, we combine:

Conventional Computer Vision (proven, reliable, privacy-preserving)

Precise object and person detection
Spatial tracking across frames
Action recognition from motion patterns
Works seamlessly with anonymized data

+ Vision-Language Models (semantic understanding, natural language)

Contextual scene understanding
Natural language query interface
Explainable reasoning
Zero-shot anomaly detection

Why This Hybrid Wins:

Higher Accuracy: Conventional ML handles spatial precision and tracking. VLMs add semantic understanding. Together, they catch what either would miss alone.

Privacy-First by Design: Our conventional CV layer works well with anonymized data because it tracks movements, not faces. VLMs then interpret these privacy-safe representations.

Lower False Positives: Conventional detectors act as a confidence filter. VLMs only evaluate events that pass the initial detection thresholds, which reduces false alarms.

Natural Language Interface: Query your footage the way you'd ask a colleague. For example:

"Show me anyone who entered the loading dock after hours"
"Find all instances of running in the parking garage yesterday"
"Alert me if someone climbs the fence"
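
To make the loading-dock query concrete, here is a minimal Python sketch of what such a question might resolve to once a language model (not shown) has translated it into a structured filter. The Event fields, the zone names, and the 08:00-18:00 business hours are all hypothetical; this is an illustration, not Piloo.ai's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class Event:
    """Hypothetical record emitted by the detection and tracking layers."""
    track_id: int
    zone: str
    action: str
    timestamp: datetime


def after_hours(ts: datetime, opens: time = time(8, 0), closes: time = time(18, 0)) -> bool:
    """True when the timestamp falls outside the assumed business hours."""
    return not (opens <= ts.time() < closes)


def entered_after_hours(events: list[Event], zone: str) -> list[Event]:
    """Structured filter that the loading-dock query might be translated into."""
    return [
        e for e in events
        if e.zone == zone and e.action == "enter" and after_hours(e.timestamp)
    ]


# Made-up sample events for illustration only.
events = [
    Event(1, "loading_dock", "enter", datetime(2025, 1, 6, 9, 30)),
    Event(2, "loading_dock", "enter", datetime(2025, 1, 6, 22, 15)),
    Event(3, "parking_garage", "run", datetime(2025, 1, 6, 23, 0)),
]
matches = entered_after_hours(events, "loading_dock")
print([e.track_id for e in matches])  # [2]
```

Because the filter runs over structured events rather than raw video, a search of this kind does not need to re-run any model over the footage at query time.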

Real-World Performance

While the research systems achieved 26-45% accuracy on complex multi-class scenarios, Piloo.ai's hybrid approach delivers:

✅ 90%+ detection accuracy on common anomalies (unauthorized access, perimeter breaches, aggressive behavior)

✅ <5% false positive rate through two-stage verification

✅ Full GDPR compliance with privacy-preserving architecture

✅ Real-time processing on standard CCTV infrastructure

✅ Natural language queries across hours of footage in seconds

The difference? We don't ask Vision-LLMs to do everything. We leverage their strengths (language understanding, semantic reasoning) while using proven computer vision for what it does best (precise detection, tracking, motion analysis).

From Research to Production

The Delft research identified where Vision-LLMs excel and struggle. Here's how we've applied those insights:

Research Finding: VLMs struggle with privacy-anonymized footage

Piloo.ai Solution: Conventional CV processes anonymized video; VLMs work with privacy-safe structured representations

Research Finding: High false positives with pure VLM approaches

Piloo.ai Solution: Two-stage pipeline filters out noise before VLM evaluation

Research Finding: Temporal inconsistencies break tracking

Piloo.ai Solution: Dedicated tracking layer maintains object identity across frames

Research Finding: Zero-shot flexibility is powerful

Piloo.ai Solution: Keeps the VLM's ability to recognize new anomaly types described in natural language
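
The two-stage idea in the list above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the Detection class, the 0.6 threshold, and the stub_vlm function are hypothetical stand-ins, and a real deployment would call an actual detector and VLM.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Detection:
    """Hypothetical output of a conventional detector for one candidate event."""
    label: str
    confidence: float  # detector score in [0, 1]


def two_stage_filter(
    detections: list[Detection],
    vlm_confirms: Callable[[Detection], bool],
    threshold: float = 0.6,  # assumed value; would be tuned per deployment
) -> list[Detection]:
    """Stage 1 drops low-confidence candidates; stage 2 asks the VLM about the rest."""
    candidates = [d for d in detections if d.confidence >= threshold]
    return [d for d in candidates if vlm_confirms(d)]


def stub_vlm(d: Detection) -> bool:
    """Stand-in for a real VLM call: confirms only fence-climbing events."""
    return d.label == "fence_climb"


# Made-up detections for illustration only.
detections = [
    Detection("fence_climb", 0.91),
    Detection("fence_climb", 0.32),  # rejected by stage 1
    Detection("loitering", 0.75),    # rejected by stage 2 (stub)
]
alerts = two_stage_filter(detections, stub_vlm)
print(alerts)
```

The expensive VLM check runs only on the candidates that survive the cheap threshold, so the same gate that trims false alarms also limits how often the large model is invoked.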

Use Cases We're Enabling

Retail Loss Prevention

Monitor 100+ cameras across multiple stores
Natural language alerts: "Potential shoplifting in Aisle 3, Camera 12"
Review only flagged incidents instead of hours of footage
Result: 10x monitoring coverage with same team

Perimeter Security

Real-time detection of fence climbing and unauthorized access
Query past footage: "Show me everyone who approached the north gate yesterday"
Privacy-compliant recording and analysis
Result: Faster threat response, automated compliance reporting

Public Transport Safety

Detect aggressive behavior, crowd anomalies, medical emergencies
Cross-camera tracking of subjects
Explainable AI for incident reports
Result: Improved passenger safety, faster emergency response

Smart Office Security

"Who accessed the server room last night?"
Package theft detection at delivery points
Automated visitor check-in/check-out verification
Result: Seamless security that doesn't disrupt productivity

The Technical Edge

What makes Piloo.ai different:

Multi-Stage Intelligence Pipeline:

Detection Layer: Conventional CV identifies objects, people, movements
Tracking Layer: Maintains identity and spatial relationships across frames
Classification Layer: VLM interprets scene context and anomaly types
Query Layer: Natural language interface to search and filter
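
As an illustration of what a tracking layer does, the sketch below keeps identities stable across frames by matching bounding boxes on overlap (intersection over union) rather than on appearance. Greedy IoU matching is a generic textbook technique chosen here for brevity; it is not a description of Piloo.ai's tracker, and the GreedyTracker class and its 0.3 threshold are assumptions.

```python
from itertools import count

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class GreedyTracker:
    """Assigns each new box the ID of the best-overlapping box from the previous frame."""

    def __init__(self, min_iou: float = 0.3):
        self.min_iou = min_iou  # assumed threshold for accepting a match
        self.tracks: dict[int, Box] = {}
        self._ids = count(1)

    def update(self, boxes: list[Box]) -> list[int]:
        assigned: list[int] = []
        free = dict(self.tracks)  # previous-frame tracks not yet matched
        for box in boxes:
            best_id = max(free, key=lambda t: iou(free[t], box), default=None)
            if best_id is not None and iou(free[best_id], box) >= self.min_iou:
                del free[best_id]  # each old track can be matched once per frame
            else:
                best_id = next(self._ids)  # no good match: start a new track
            assigned.append(best_id)
        # Simplification: tracks with no match in this frame are dropped immediately.
        self.tracks = dict(zip(assigned, boxes))
        return assigned


tracker = GreedyTracker()
frame1 = tracker.update([(0, 0, 10, 10), (50, 50, 60, 60)])
frame2 = tracker.update([(51, 50, 61, 60), (1, 0, 11, 10)])  # same subjects, reordered
print(frame1, frame2)  # [1, 2] [2, 1]
```

Because matching relies only on box geometry, identities in this sketch do not depend on faces or clothing, which is why position-based tracking is unaffected by the appearance changes that anonymization introduces. Production trackers typically add motion models and tolerate short occlusions.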

Privacy-Preserving Architecture:

On-premise processing (no cloud uploads of raw footage)
Configurable anonymization levels
Structured representations instead of raw pixels
GDPR/CCPA compliant by design
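
To show what "structured representations instead of raw pixels" could look like in practice, here is a hedged Python sketch that turns one tracked subject into a text-only prompt. The TrackSummary fields and the to_prompt helper are hypothetical, and passing the data to a language model as JSON text is an assumption made for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TrackSummary:
    """Hypothetical pixel-free description of one tracked subject."""
    track_id: int
    object_class: str
    zone: str
    start_s: float  # seconds from clip start
    end_s: float
    mean_speed_mps: float


def to_prompt(summary: TrackSummary, question: str) -> str:
    """Builds a text prompt from structured fields only; no image data is included."""
    payload = json.dumps(asdict(summary), sort_keys=True)
    return f"Track data: {payload}\nQuestion: {question}"


# Made-up values for illustration only.
summary = TrackSummary(7, "person", "north_gate", 12.0, 19.5, 3.4)
prompt = to_prompt(summary, "Is this subject running?")
print(prompt)
```

Nothing in the prompt identifies a face or body; the model reasons over class, location, timing, and speed alone.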

Continuous Learning:

Operator feedback improves detection over time
Domain adaptation to your specific environment
Custom anomaly definitions per deployment

Why Pure Vision-LLM Approaches Fall Short

The research made it clear: asking VLMs to do everything creates fundamental trade-offs.

Challenge 1: Precision vs. Privacy. VLMs need visual details to understand actions. Privacy filters remove those details. There is no easy solution if you rely on VLMs alone.

Challenge 2: Speed vs. Accuracy. Processing every frame through large VLMs is computationally expensive. Real-time monitoring requires faster approaches.

Challenge 3: Reliability vs. Flexibility. Zero-shot VLMs are flexible but unreliable (26-45% accuracy). Conventional ML is reliable but rigid. You need both.

Our insight: Don't make Vision-LLMs carry the entire burden. Use them where they excel (semantic understanding and natural language), and let conventional CV handle precise detection and tracking.

What Security Professionals Should Know

The Bottom Line:

Pure Vision-LLM surveillance isn't production-ready (research shows 26-45% accuracy on complex scenarios)
Privacy regulations require anonymization, which breaks pure VLM approaches
Hybrid architectures that combine conventional CV + VLMs are the path forward
Natural language querying of surveillance footage is here, when implemented correctly

Our Vision for Intelligent Surveillance

At Pyrack Technologies, we believe the future of security is:

✅ Augmented, not automated - AI extends human judgment, doesn't replace it

✅ Privacy-first - Compliance isn't optional

✅ Explainable - Operators understand why AI flagged something

✅ Conversational - Natural language, not complex queries

✅ Hybrid - Leverage the best of conventional ML and modern LLMs

Piloo.ai embodies this vision. We've solved the problems highlighted in this research by not asking any single technology to do everything.

Join the Pilot Program

We're seeking forward-thinking security operations to pilot Piloo.ai:

Ideal partners:

Retail chains with 50+ locations
Corporate campuses with extensive CCTV infrastructure
Public transport authorities
Critical infrastructure facilities

What you get:

Early access to natural language CCTV querying
Privacy-compliant AI surveillance
Dedicated technical support
Input into product roadmap

Interested? Contact us: pranjalee@pyrack.com

Questions for Discussion

What's your biggest pain point in current surveillance operations?
Would natural language queries change how your team works with CCTV footage?
What accuracy threshold do you need to trust AI-flagged incidents?

Share your thoughts below!

Learn More

Research Paper: "Evaluation of Vision-LLMs in Surveillance Video" (arXiv:2510.23190)
Key Finding: 82-86% accuracy on simple tasks, but privacy filters and complex scenarios remain challenges
Piloo.ai: Natural language CCTV query system with hybrid conventional ML + VLM architecture

Stay secure, stay intelligent!

— The Pyrack Technologies Team

Building the future of AI-powered surveillance with Piloo.ai

#AISurveillance #SecurityTech #ComputerVision #VisionLanguageModels #CCTV #SmartSecurity #PilooAI #SecurityInnovation #PrivacyFirst