Article by Pyrack • 18 min read

Talk to Your CCTV: The Hybrid AI Behind Piloo.ai


Pyrack Technologies | AI Research Insights

New research from Delft University of Technology reveals both the promise and the pitfalls of using Vision-Language Models for surveillance. In this edition, we discuss how we've tried to address some of these challenges with Piloo.ai.


The Problem: Too Many Cameras, Too Few Eyes

Modern security operations face an impossible equation:

  • Hundreds of CCTV cameras per facility

  • 4-6 feeds maximum per human operator

  • Critical incidents are rare but demand instant detection

Recent research from Delft University of Technology tested whether Vision-LLMs could bridge this gap by understanding surveillance video through natural language queries.

The verdict? Promising, but not production-ready. Yet.


What The Research Found

Researchers tested four vision-language models (Gemma-3, NVILA-8B, Qwen2.5-VL, VideoLLaMA-3) on surveillance anomaly detection:

What Worked:

  • 82-86% accuracy on fight detection and some other clear-cut use cases

  • Zero-shot capability: Add new anomaly types without retraining

  • Natural language descriptions make AI decisions explainable

What Didn't:

  • Only 26-45% accuracy on complex multi-class scenarios (13 different crime types)

  • Privacy filters hurt performance: accuracy dropped by 2-11% when faces/bodies were anonymized

  • High false positives: Up to 68% false alarm rates with some configurations

  • Temporal inconsistencies: Privacy filters made the same person look different across frames

Conclusion: Pure Vision-LLM approaches aren't ready for real-world security operations.


How Piloo.ai Solves This

At Pyrack Technologies, we've built Piloo.ai specifically to address these limitations through a hybrid intelligence architecture:

Our Approach: Best of Both Worlds

Instead of relying solely on Vision-LLMs, we combine:

Conventional Computer Vision (proven, reliable, privacy-preserving)

  • Precise object and person detection

  • Spatial tracking across frames

  • Action recognition from motion patterns

  • Works seamlessly with anonymized data

+ Vision-Language Models (semantic understanding, natural language)

  • Contextual scene understanding

  • Natural language query interface

  • Explainable reasoning

  • Zero-shot anomaly detection

Why This Hybrid Wins:

Higher Accuracy: Conventional ML handles spatial precision and tracking. VLMs add semantic understanding. Together, they catch what either would miss alone.

Privacy-First by Design: Our conventional CV layer works well with anonymized data because it tracks movements, not faces. VLMs then interpret these privacy-safe representations.

Lower False Positives: Conventional detectors act as a confidence filter. VLMs evaluate only the events that pass initial detection thresholds, dramatically reducing false alarms.
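To make the two-stage idea concrete, here is a minimal Python sketch. It is not Piloo.ai's actual implementation: the `Detection` record, the `two_stage_filter` function, the `min_confidence` threshold value, and the `fake_vlm` stand-in are all hypothetical names and values chosen for illustration. The point is only that the expensive VLM check runs on events that have already cleared a cheap detector threshold.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Detection:
    camera_id: str
    frame_index: int
    label: str          # label reported by the conventional detector
    confidence: float   # detector score in [0, 1]


def two_stage_filter(
    detections: Iterable[Detection],
    vlm_is_anomalous: Callable[[Detection], bool],
    min_confidence: float = 0.6,  # hypothetical threshold, tuned per deployment
) -> List[Detection]:
    """Return detections that pass the detector threshold AND the VLM check."""
    alerts = []
    for det in detections:
        if det.confidence < min_confidence:
            continue  # stage 1: cheap gate, the VLM is never called
        if vlm_is_anomalous(det):  # stage 2: expensive semantic check
            alerts.append(det)
    return alerts


# Stand-in for a real VLM call, used only to make the sketch runnable.
def fake_vlm(det: Detection) -> bool:
    return det.label == "person_climbing"


sample = [
    Detection("cam-12", 100, "person_climbing", 0.91),
    Detection("cam-12", 101, "person_climbing", 0.30),  # dropped at stage 1
    Detection("cam-07", 55, "person_walking", 0.88),    # dropped at stage 2
]
alerts = two_stage_filter(sample, fake_vlm)
print([(a.camera_id, a.frame_index) for a in alerts])
```

In this toy run only the first detection survives both stages. A real deployment would replace `fake_vlm` with an actual model call and tune the threshold against labeled footage.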

Natural Language Interface: Query your footage like you'd ask a colleague:

  • "Show me anyone who entered the loading dock after hours"

  • "Find all instances of running in the parking garage yesterday"

  • "Alert me if someone climbs the fence"


Real-World Performance

While the research systems achieved only 26-45% accuracy on complex multi-class scenarios, Piloo.ai's hybrid approach delivers:

✅ 90%+ detection accuracy on common anomalies (unauthorized access, perimeter breaches, aggressive behavior)

✅ <5% false positive rate through two-stage verification

✅ Full GDPR compliance with privacy-preserving architecture

✅ Real-time processing on standard CCTV infrastructure

✅ Natural language queries across hours of footage in seconds

The difference? We don't ask Vision-LLMs to do everything. We leverage their strengths (language understanding, semantic reasoning) while using proven computer vision for what it does best (precise detection, tracking, motion analysis).
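For readers comparing figures like these, it helps to pin down what the metrics mean. The sketch below uses the standard confusion-matrix definitions of accuracy and false positive rate. This post does not state which exact definitions were used for its own numbers or for the paper's "false alarm rate", so treat the functions as a common convention rather than the method behind any figure quoted here. The counts are made up purely for illustration.

```python
def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Share of all clips that were classified correctly."""
    total = tp + fp + tn + fn
    return (tp + tn) / total if total else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of normal (non-anomalous) clips that were wrongly flagged."""
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


# Made-up counts for illustration only; not results from Piloo.ai or the paper.
tp, fp, tn, fn = 40, 10, 90, 10
print(f"accuracy = {accuracy(tp, fp, tn, fn):.3f}")
print(f"false positive rate = {false_positive_rate(fp, tn):.3f}")
```

Because anomalies are rare in surveillance footage, accuracy alone can look high even when many alerts are false, which is why the two numbers are usually reported together.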


From Research to Production

The Delft research identified where Vision-LLMs excel and struggle. Here's how we've applied those insights:

Research Finding: VLMs struggle with privacy-anonymized footage

Piloo.ai Solution: Conventional CV processes anonymized video; VLMs work with privacy-safe structured representations

Research Finding: High false positives with pure VLM approaches

Piloo.ai Solution: Two-stage pipeline filters out noise before VLM evaluation

Research Finding: Temporal inconsistencies break tracking

Piloo.ai Solution: Dedicated tracking layer maintains object identity across frames

Research Finding: Zero-shot flexibility is powerful

Piloo.ai Solution: The VLM layer retains the ability to recognize new anomaly types described in natural language
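As an illustration of what a dedicated tracking layer does, the sketch below keeps object identities stable across frames by greedily matching each new bounding box to the previous-frame box it overlaps most (intersection over union, IoU). This is a deliberately simple textbook technique, not a description of Piloo.ai's tracker. The `IouTracker` class and the `min_iou` value are assumptions, and production trackers typically add motion models and handling for occlusion.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


class IouTracker:
    def __init__(self, min_iou: float = 0.3):  # hypothetical matching threshold
        self.min_iou = min_iou
        self.tracks: Dict[int, Box] = {}  # track ID -> last seen box
        self._next_id = 0

    def update(self, boxes: List[Box]) -> List[int]:
        """Assign a track ID to each box in the new frame."""
        assigned: List[int] = []
        unused = dict(self.tracks)  # tracks not yet matched in this frame
        new_tracks: Dict[int, Box] = {}
        for box in boxes:
            best_id, best_iou = None, self.min_iou
            for tid, prev in unused.items():
                score = iou(box, prev)
                if score >= best_iou:
                    best_id, best_iou = tid, score
            if best_id is None:
                best_id = self._next_id  # no overlap: start a new track
                self._next_id += 1
            else:
                del unused[best_id]  # each track matches at most one box
            new_tracks[best_id] = box
            assigned.append(best_id)
        self.tracks = new_tracks  # unmatched tracks are dropped in this sketch
        return assigned


tracker = IouTracker()
ids_f1 = tracker.update([(10, 10, 50, 100), (200, 20, 240, 110)])
# Same two people, slightly moved and listed in the opposite order.
ids_f2 = tracker.update([(202, 22, 242, 112), (12, 11, 52, 101)])
print(ids_f1, ids_f2)
```

Because matching depends only on box geometry, the IDs stay consistent even when faces or bodies are blurred, which is the property the research found missing in pure VLM pipelines.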


Use Cases We're Enabling

Retail Loss Prevention

  • Monitor 100+ cameras across multiple stores

  • Natural language alerts: "Potential shoplifting in Aisle 3, Camera 12"

  • Review only flagged incidents vs. hours of footage

  • Result: 10x monitoring coverage with same team

Perimeter Security

  • Real-time detection of fence climbing and unauthorized access

  • Query past footage: "Show me everyone who approached the north gate yesterday"

  • Privacy-compliant recording and analysis

  • Result: Faster threat response, automated compliance reporting

Public Transport Safety

  • Detect aggressive behavior, crowd anomalies, medical emergencies

  • Cross-camera tracking of subjects

  • Explainable AI for incident reports

  • Result: Improved passenger safety, faster emergency response

Smart Office Security

  • "Who accessed the server room last night?"

  • Package theft detection at delivery points

  • Automated visitor check-in/check-out verification

  • Result: Seamless security that doesn't disrupt productivity


The Technical Edge

What makes Piloo.ai different:

Multi-Stage Intelligence Pipeline:

  1. Detection Layer: Conventional CV identifies objects, people, movements

  2. Tracking Layer: Maintains identity and spatial relationships across frames

  3. Classification Layer: VLM interprets scene context and anomaly types

  4. Query Layer: Natural language interface to search and filter
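To show how the query layer at the end of this pipeline might sit on top of the earlier stages, here is a small sketch. It assumes the earlier layers have already produced structured event records, and that a natural-language question has been translated into a structured filter (that translation step is not shown). The `Event` and `QueryFilter` types, the field names, and the reading of "after hours" as 19:00 or later are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional


@dataclass
class Event:
    camera_zone: str
    action: str
    timestamp: datetime


@dataclass
class QueryFilter:
    zone: Optional[str] = None
    action: Optional[str] = None
    after: Optional[time] = None  # keep events at or after this time of day


def run_query(events: List[Event], q: QueryFilter) -> List[Event]:
    """Return the events that satisfy every field set on the filter."""
    results = []
    for e in events:
        if q.zone is not None and e.camera_zone != q.zone:
            continue
        if q.action is not None and e.action != q.action:
            continue
        if q.after is not None and e.timestamp.time() < q.after:
            continue
        results.append(e)
    return results


events = [
    Event("loading_dock", "enter", datetime(2025, 1, 6, 14, 5)),
    Event("loading_dock", "enter", datetime(2025, 1, 6, 22, 40)),
    Event("parking_garage", "run", datetime(2025, 1, 6, 23, 10)),
]
# Hand-written filter for "Show me anyone who entered the loading dock
# after hours", assuming "after hours" means 19:00 or later.
after_hours = QueryFilter(zone="loading_dock", action="enter", after=time(19, 0))
matches = run_query(events, after_hours)
print(matches)
```

Searching structured records rather than raw video is what makes answering in seconds plausible. Note that this simple filter does not handle time windows that cross midnight.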

Privacy-Preserving Architecture:

  • On-premise processing (no cloud uploads of raw footage)

  • Configurable anonymization levels

  • Structured representations instead of raw pixels

  • GDPR/CCPA compliant by design
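One way to picture "structured representations instead of raw pixels" is a compact track summary that is serialized to text before it reaches the language model. The sketch below is an assumption about what such a record could look like. The post does not specify the actual format, and the `TrackSummary` fields and the `to_prompt` helper are hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Tuple


@dataclass
class TrackSummary:
    track_id: int
    object_class: str   # e.g. "person"; no face or appearance data is stored
    zone: str
    duration_s: float
    path: List[Tuple[float, float]]  # normalized (x, y) centroids over time


def to_prompt(summary: TrackSummary, question: str) -> str:
    """Build a text-only prompt from a track summary; no imagery is included."""
    payload = json.dumps(asdict(summary), sort_keys=True)
    return f"Scene data (no imagery): {payload}\nQuestion: {question}"


summary = TrackSummary(
    track_id=7,
    object_class="person",
    zone="north_fence",
    duration_s=12.5,
    path=[(0.10, 0.80), (0.12, 0.55), (0.13, 0.30)],
)
prompt = to_prompt(summary, "Is this track consistent with someone climbing the fence?")
print(prompt)
```

The record carries position, timing, and class information only, so the model reasons about behavior without ever receiving identifiable imagery.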

Continuous Learning:

  • Operator feedback improves detection over time

  • Domain adaptation to your specific environment

  • Custom anomaly definitions per deployment


Why Pure Vision-LLM Approaches Fall Short

The research made it clear: asking VLMs to do everything creates fundamental trade-offs.

Challenge 1: Precision vs. Privacy. VLMs need visual details to understand actions. Privacy filters remove those details. There is no easy solution if you rely on VLMs alone.

Challenge 2: Speed vs. Accuracy. Processing every frame through large VLMs is computationally expensive. Real-time monitoring requires faster approaches.

Challenge 3: Reliability vs. Flexibility. Zero-shot VLMs are flexible but unreliable (26-45% accuracy on complex scenarios). Conventional ML is reliable but rigid. You need both.

Our insight: Don't make Vision-LLMs carry the entire burden. Use them where they excel (semantic understanding and natural language), and let conventional CV handle precise detection and tracking.


What Security Professionals Should Know

The Bottom Line:

  • Pure Vision-LLM surveillance isn't production-ready (research shows 26-45% accuracy on complex scenarios)

  • Privacy regulations require anonymization, which breaks pure VLM approaches

  • Hybrid architectures that combine conventional CV + VLMs are the path forward

  • Natural language querying of surveillance footage is here, when implemented correctly


Our Vision for Intelligent Surveillance

At Pyrack Technologies, we believe the future of security is:

✅ Augmented, not automated - AI extends human judgment, doesn't replace it

✅ Privacy-first - Compliance isn't optional

✅ Explainable - Operators understand why AI flagged something

✅ Conversational - Natural language, not complex queries

✅ Hybrid - Leverage the best of conventional ML and modern LLMs

Piloo.ai embodies this vision. We've solved the problems highlighted in this research by not asking any single technology to do everything.


Join the Pilot Program

We're seeking forward-thinking security operations to pilot Piloo.ai:

Ideal partners:

  • Retail chains with 50+ locations

  • Corporate campuses with extensive CCTV infrastructure

  • Public transport authorities

  • Critical infrastructure facilities

What you get:

  • Early access to natural language CCTV querying

  • Privacy-compliant AI surveillance

  • Dedicated technical support

  • Input into product roadmap

Interested? Contact us: pranjalee@pyrack.com


Questions for Discussion

  1. What's your biggest pain point in current surveillance operations?

  2. Would natural language queries change how your team works with CCTV footage?

  3. What accuracy threshold do you need to trust AI-flagged incidents?

Share your thoughts below!


Learn More

  • Research Paper: "Evaluation of Vision-LLMs in Surveillance Video" (arXiv:2510.23190)

  • Key Finding: 82-86% accuracy on simple tasks, but privacy filters and complex scenarios remain challenges

  • Piloo.ai: Natural language CCTV query system with hybrid conventional ML + VLM architecture


Stay secure, stay intelligent!

— The Pyrack Technologies Team

Building the future of AI-powered surveillance with Piloo.ai

#AISurveillance #SecurityTech #ComputerVision #VisionLanguageModels #CCTV #SmartSecurity #PilooAI #SecurityInnovation #PrivacyFirst