How We Index AI Model Performance 24/7
The Challenge of Real-Time AI Benchmarking
AI models evolve rapidly—sometimes multiple times per day. Traditional benchmarks become outdated within weeks. We built AIStupidLevel.info to solve this: a continuous indexing system that tracks 20+ AI models hourly, providing real-time insights into model performance, failures, and changes.
This article provides a technical deep dive into our architecture, methodology, and the challenges of building a search-engine-style indexer for AI model performance.
System Architecture Overview
- Test Orchestration: Distributed queue system executing thousands of tests per hour
- Real-Time Analysis: Stream processing pipeline for instant result classification
- Data Storage: Time-series database optimized for performance tracking
Key Components
- Test Suite Manager: Production-style coding tasks across difficulty levels
- Model API Clients: Adapters for 20+ different AI provider APIs
- Execution Environment: Isolated sandboxes for running generated code
- Result Analyzer: Multi-dimensional scoring and classification
- Version Tracker: Fingerprints model versions from API metadata
- Alert System: Detects performance regressions and anomalies
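A rough sketch of how these components relate, in Python. The names below (TestCase, ModelResponse, ModelClient, Sandbox) and their fields are illustrative assumptions for exposition, not our production interfaces.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class TestCase:
    """A single production-style coding task from the test suite."""
    task_id: str
    prompt: str
    difficulty: str              # e.g. "easy", "medium", "hard"


@dataclass
class ModelResponse:
    """Raw model output plus the metadata we fingerprint versions from."""
    model_id: str
    text: str
    metadata: dict = field(default_factory=dict)


class ModelClient(Protocol):
    """Adapter interface implemented for each of the 20+ provider APIs."""
    model_id: str
    def complete(self, prompt: str) -> ModelResponse: ...


class Sandbox(Protocol):
    """Isolated execution environment for running untrusted generated code."""
    def run(self, code: str) -> dict: ...
```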
The 7-Axis Scoring System
Unlike simple pass/fail benchmarks, our 7-axis system captures nuanced performance (a composite-score sketch follows the breakdown below):
1. Correctness (35%)
Does the output solve the actual problem?
- Automated test execution
- Edge case validation
- Error-free compilation/execution
2. Spec Compliance (15%)
Does it follow requirements precisely?
- Required features implemented
- Prohibited features avoided
- API contracts honored
3. Code Quality (15%)
Is it maintainable and well-structured?
- Proper error handling
- Readable structure
- Best practices followed
4. Efficiency (10%)
Does it use appropriate algorithms?
- Time complexity
- Space complexity
- Resource usage
5. Stability (10%)
Does it handle unexpected inputs gracefully?
- Input validation
- Error recovery
- Edge case resilience
6. Refusal Rate (10%)
Does it attempt tasks within scope?
- Appropriate task engagement
- No false refusals
- Safety boundaries respected
7. Recovery (5%)
Can it fix its own mistakes?
- Self-correction ability
- Error diagnosis
- Iterative improvement
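Putting the weights together, the overall score is a weighted sum of the seven axes. The sketch below assumes each axis is already normalized to the 0-1 range (with refusal and recovery expressed so that higher is better); the key names are illustrative.

```python
# Axis weights from the 7-axis system above (sum to 1.0).
WEIGHTS = {
    "correctness": 0.35,
    "spec_compliance": 0.15,
    "code_quality": 0.15,
    "efficiency": 0.10,
    "stability": 0.10,
    "refusal": 0.10,      # higher = fewer false refusals
    "recovery": 0.05,
}


def composite_score(axis_scores: dict[str, float]) -> float:
    """Weighted sum of per-axis scores, each normalized to 0-1."""
    return sum(weight * axis_scores.get(axis, 0.0)
               for axis, weight in WEIGHTS.items())


# Example: strong correctness and compliance with solid recovery lands ~0.91.
print(composite_score({
    "correctness": 0.9, "spec_compliance": 1.0, "code_quality": 0.8,
    "efficiency": 0.85, "stability": 0.9, "refusal": 1.0, "recovery": 1.0,
}))
```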
Continuous Testing Pipeline
Hourly Execution Cycle
1. Test Selection: Choose representative subset from full test suite (rotating coverage)
2. Model Querying: Send prompts to all configured models simultaneously
3. Response Capture: Store complete responses with full metadata
4. Code Extraction: Parse model output to extract generated code
5. Execution: Run code in isolated sandbox environments
6. Scoring: Apply 7-axis evaluation criteria
7. Storage: Persist results with time-series indexing
8. Analysis: Detect regressions, trends, and anomalies
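Expressed as code, the cycle is a linear pipeline over every (model, test) pair. This is a simplified sketch with the individual stages passed in as callables; the names and signatures are placeholders, not our actual implementation.

```python
from typing import Callable, Sequence


def run_hourly_cycle(
    models: Sequence,          # one ModelClient adapter per provider
    full_suite: Sequence,      # every TestCase in the suite
    select_tests: Callable,    # step 1: rotating-coverage subset for this cycle
    extract_code: Callable,    # step 4: pull generated code out of the response
    run_in_sandbox: Callable,  # step 5: execute in an isolated environment
    score: Callable,           # step 6: apply the 7-axis criteria
    persist: Callable,         # step 7: write to the time-series store
    analyze: Callable,         # step 8: detect regressions, trends, anomalies
    cycle_id: int,
) -> None:
    """One pass of the hourly cycle, following steps 1-8 above."""
    tests = select_tests(full_suite, cycle_id)
    for model in models:
        for test in tests:
            response = model.complete(test.prompt)    # steps 2-3: query, capture
            code = extract_code(response.text)
            result = run_in_sandbox(code)
            scores = score(test, response, result)
            persist(cycle_id, model, test, response, scores)
    analyze(cycle_id)
```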
Scale & Coverage
- 20+ models tested every hour
- Hundreds of unique test cases
- Tens of thousands of API calls per day
- Terabytes of historical data collected
- 24/7/365 operation with redundancy
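As a back-of-envelope check on those figures, assuming a rotating subset of roughly 30 tests per model per cycle (an illustrative number, not our actual configuration):

```python
models = 20            # models tested every hour
tests_per_cycle = 30   # assumed rotating subset per model, per hourly cycle
cycles_per_day = 24

print(models * tests_per_cycle * cycles_per_day)  # 14,400 base calls per day;
# retries and multi-turn recovery prompts push this into the tens of thousands
```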
Version Tracking & Change Detection
One of our most valuable features is tracking silent model updates. Many AI providers update models without public announcement, sometimes causing performance regressions.
How We Detect Version Changes
- API Metadata: Extract version identifiers from response headers
- Behavior Fingerprinting: Statistical analysis of response patterns
- Performance Baseline Shifts: Sudden changes in aggregate scores
- Output Format Changes: Structural differences in responses
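A minimal sketch of the third signal, performance baseline shifts: flag a model when its recent composite scores drift well outside the trailing baseline. The window sizes and threshold below are illustrative.

```python
from statistics import mean, stdev


def baseline_shift(hourly_scores: list[float],
                   baseline_window: int = 168,   # trailing week of hourly scores
                   recent_window: int = 24,      # most recent day
                   threshold: float = 3.0) -> bool:
    """Return True when recent scores drift more than `threshold` standard
    deviations away from the trailing baseline, hinting at a silent update."""
    if len(hourly_scores) < baseline_window + recent_window:
        return False
    baseline = hourly_scores[-(baseline_window + recent_window):-recent_window]
    recent = hourly_scores[-recent_window:]
    sigma = stdev(baseline) or 1e-9               # guard against zero variance
    return abs(mean(recent) - mean(baseline)) / sigma > threshold
```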
What We Track
- Version identifier (when available)
- First detection timestamp
- Performance delta vs previous version
- Specific regressions or improvements
- Rollout pattern (gradual vs instant)
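Each detected change ends up stored as a record along these lines; the field names are illustrative rather than our actual schema.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class VersionChange:
    model_id: str
    version_id: str | None    # None when the provider exposes no identifier
    first_detected: datetime
    score_delta: float        # performance delta vs. the previous version
    regressions: list[str]    # axes or test categories that got worse
    improvements: list[str]   # axes or test categories that got better
    rollout: str              # "gradual" or "instant"
```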
Failure Pattern Analysis
Not all failures are equal. We classify them into four categories for deeper insight (a classifier sketch follows the list):
- Syntax Errors: Code that doesn't compile or parse. Often indicates prompt-understanding issues.
- Runtime Errors: Code runs but crashes. Shows logic flaws or missing edge-case handling.
- Logic Errors: Code executes successfully but produces wrong results. The hardest category to detect.
- Incomplete Output: The model stops mid-generation or omits required components.
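A simplified classifier over sandbox results might look like the sketch below; the result fields (truncated, compiled, exit_code, tests_passed) are hypothetical.

```python
def classify_failure(result: dict) -> str:
    """Map a sandbox result onto the four failure categories above."""
    if result.get("truncated"):             # model stopped mid-generation
        return "incomplete_output"
    if not result.get("compiled", True):    # code didn't compile/parse
        return "syntax_error"
    if result.get("exit_code", 0) != 0:     # ran but crashed
        return "runtime_error"
    if not result.get("tests_passed", False):
        return "logic_error"                # ran cleanly, wrong answer
    return "success"
```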
Cost & Latency Tracking
Performance means nothing without cost context. We track:
Cost Metrics
- Token usage per task (input + output)
- Cost per successful completion
- Cost per attempt (including failures)
- Cost efficiency (success rate / cost)
Latency Metrics
- Time to first token
- Total completion time
- P50, P95, P99 latencies
- Throughput under load
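The derived metrics fall out of the raw per-attempt records. This sketch assumes each attempt carries cost_usd, success, and latency_s fields (illustrative names) and uses a simple nearest-rank percentile.

```python
def cost_and_latency_summary(attempts: list[dict]) -> dict:
    """Aggregate cost and latency metrics from per-attempt records."""
    latencies = sorted(a["latency_s"] for a in attempts)
    total_cost = sum(a["cost_usd"] for a in attempts)
    successes = sum(1 for a in attempts if a["success"])

    def pct(p: float) -> float:
        # nearest-rank percentile over the sorted latencies
        return latencies[min(len(latencies) - 1, int(p * len(latencies)))]

    return {
        "cost_per_attempt": total_cost / len(attempts),
        "cost_per_success": total_cost / max(successes, 1),
        "cost_efficiency": (successes / len(attempts)) / max(total_cost, 1e-9),
        "p50_latency": pct(0.50),
        "p95_latency": pct(0.95),
        "p99_latency": pct(0.99),
    }
```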
Safety & Adversarial Testing
Beyond functional testing, we probe model safety boundaries:
- Jailbreak Attempts: Known techniques for bypassing safety filters
- Prompt Injection: Attempts to override system instructions
- Edge Case Stress Tests: Unusual or adversarial inputs
- Refusal Pattern Analysis: When and why models refuse tasks
This data is particularly valuable for AI safety research and red teaming efforts.
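Refusal-pattern analysis starts from something as simple as flagging responses that decline without producing code. The phrase list below is a deliberately crude illustrative heuristic, not our full classifier.

```python
REFUSAL_MARKERS = (
    "i can't help with",
    "i cannot assist",
    "i'm unable to",
    "this request violates",
)


def looks_like_refusal(response_text: str, extracted_code: str) -> bool:
    """First-pass heuristic: no code produced and a refusal phrase present."""
    text = response_text.lower()
    return not extracted_code.strip() and any(m in text for m in REFUSAL_MARKERS)
```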
Data Quality & Validation
Quality Assurance Process
- Automated Checks: Validate data completeness and consistency
- Statistical Analysis: Detect anomalies and outliers
- Manual Sampling: Human review of edge cases and failures
- Reproducibility Testing: Verify results are consistent
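The automated checks are mostly invariants over each hourly batch; a minimal sketch with illustrative field names:

```python
def validate_batch(records: list[dict], expected_models: set[str]) -> list[str]:
    """Return a list of data-quality issues found in one hourly batch."""
    issues = []
    seen = {r["model_id"] for r in records}
    missing = expected_models - seen
    if missing:
        issues.append(f"missing results for models: {sorted(missing)}")
    for r in records:
        if not 0.0 <= r["composite_score"] <= 1.0:
            issues.append(f"score out of range: {r['model_id']}/{r['task_id']}")
        if r.get("raw_response") is None:
            issues.append(f"raw response not preserved: {r['task_id']}")
    return issues
```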
What Makes Our Data Reliable
- Production-style tasks, not synthetic examples
- Consistent methodology over time
- Full raw data preserved (not just aggregates)
- Transparent scoring criteria
- Auditable execution logs
Challenges & Lessons Learned
Technical Challenges
- Rate Limiting: Managing API quotas across 20+ providers
- Cost Management: Balancing coverage with budget constraints
- Sandbox Security: Safely executing untrusted code
- Data Volume: Storing and querying terabytes efficiently
- API Changes: Adapting to provider API updates
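Rate limiting in particular pushes every provider adapter behind a shared throttle. A minimal token-bucket sketch follows; the per-provider rates are illustrative.

```python
import time


class TokenBucket:
    """Simple per-provider rate limiter: `rate` request slots refill per second."""

    def __init__(self, rate: float, capacity: int):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = float(capacity), time.monotonic()

    def acquire(self) -> None:
        """Block until a request slot is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)


# One bucket per provider, tuned to that provider's quota (values illustrative).
limiters = {"provider_a": TokenBucket(rate=2.0, capacity=10)}
```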
Key Insights
- Models change more frequently than publicly announced
- Performance varies significantly by task type
- Cost/performance tradeoffs are non-linear
- Failure patterns are highly informative
- Real-world testing reveals issues synthetic benchmarks miss
The Future of AI Indexing
We're continuously improving our indexing capabilities:
- Multimodal Testing: Expanding beyond text to image/audio/video
- Domain-Specific Suites: Specialized tests for medical, legal, financial domains
- Adversarial Robustness: More sophisticated safety testing
- Fine-Tuning Analysis: Tracking custom model variants
- Real-Time Alerts: Instant notifications of performance changes
Conclusion
Building a 24/7 AI performance indexer is complex, but essential in today's rapidly evolving AI landscape. Our system provides real-time visibility into model behavior that no static benchmark can match.
By continuously tracking performance, costs, failures, and changes across 20+ models, we help enterprises and researchers make informed decisions about AI model selection and usage.
The data we collect not only powers our public benchmark site but also provides training data for the next generation of AI systems: systems that can learn from the successes and failures of today's models.
Access Our AI Performance Data
License our comprehensive benchmark data for your AI research or development.
