Monitoring & Observability Guide
Author: Anderson Henrique da Silva
Last Updated: 2025-09-20 07:28:07 -03 (São Paulo, Brazil)
Overview
Cidadão.AI implements a comprehensive observability stack providing real-time insights into system health, performance, and business metrics.
Observability Pillars
1. Metrics (Prometheus)
- System performance indicators
- Business KPIs
- Custom application metrics
2. Logs (Structured JSON)
- Centralized logging
- Correlation IDs
- Contextual information
3. Traces (OpenTelemetry)
- Distributed request tracking
- Service dependency mapping
- Performance bottleneck identification
Architecture
┌─────────────────┐      ┌─────────────────┐      ┌─────────────────┐
│   Application   │─────▶│   Prometheus    │─────▶│     Grafana     │
│                 │      │                 │      │                 │
│  - Metrics      │      │  - Storage      │      │  - Dashboards   │
│  - Health       │      │  - Alerting     │      │  - Alerts       │
│  - SLO/SLA      │      │  - Rules        │      │  - Reports      │
└─────────────────┘      └─────────────────┘      └─────────────────┘
Metrics Implementation
Business Metrics
Location: src/infrastructure/observability/metrics.py
from prometheus_client import Counter

# Agent task execution
agent_tasks_total = Counter(
    'cidadao_ai_agent_tasks_total',
    'Total agent tasks executed',
    ['agent_name', 'task_type', 'status']
)

# Investigation lifecycle
investigations_total = Counter(
    'cidadao_ai_investigations_total',
    'Total investigations',
    ['status', 'investigation_type']
)

# Anomaly detection
anomalies_detected_total = Counter(
    'cidadao_ai_anomalies_detected_total',
    'Total anomalies detected',
    ['anomaly_type', 'severity', 'agent']
)
System Metrics
# API performance
@observe_request(
    histogram=request_duration_histogram,
    counter=request_count_counter
)
async def api_endpoint():
    ...  # duration and request count are recorded automatically
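A minimal sketch of how such a decorator can be implemented. The real `observe_request` lives in the project's metrics module and presumably records to Prometheus instruments; `SimpleMetric` below is a hypothetical stand-in so the example is self-contained:

```python
import asyncio
import functools
import time

class SimpleMetric:
    """Stand-in for a Prometheus histogram/counter (assumption for this sketch)."""
    def __init__(self):
        self.values = []
    def observe(self, value):
        self.values.append(value)
    def inc(self):
        self.values.append(1)

request_duration_histogram = SimpleMetric()
request_count_counter = SimpleMetric()

def observe_request(histogram, counter):
    """Time the wrapped coroutine and record duration + request count."""
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                # Metrics are recorded even if the handler raises
                histogram.observe(time.perf_counter() - start)
                counter.inc()
        return wrapper
    return decorator

@observe_request(histogram=request_duration_histogram, counter=request_count_counter)
async def api_endpoint():
    await asyncio.sleep(0.01)
    return {"status": "ok"}

result = asyncio.run(api_endpoint())
```

Recording in a `finally` block ensures failed requests are counted too, which is what error-rate alerting depends on.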
Metric Endpoints
- /health/metrics - Prometheus format
- /health/metrics/json - JSON format
- /api/v1/observability/metrics/custom - Custom metrics
Health Monitoring
Dependency Health Checks
Location: src/infrastructure/health/dependency_checker.py
Monitored Dependencies:
- Database - Connection pool, query performance
- Redis - Cache availability, latency
- External APIs - Portal da Transparência, LLM services
- File System - Disk space, write permissions
Health Check Features:
- Parallel execution
- Configurable timeouts
- Retry logic
- Trend analysis
- Degradation detection
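The parallel execution and per-check timeout behaviour can be sketched with `asyncio.gather` and `asyncio.wait_for`; the two check functions below are hypothetical stand-ins for the real dependency checkers:

```python
import asyncio

# Hypothetical checks; the real ones live in
# src/infrastructure/health/dependency_checker.py
async def check_database():
    await asyncio.sleep(0.01)  # e.g. SELECT 1 against the pool
    return "healthy"

async def check_redis():
    await asyncio.sleep(0.01)  # e.g. PING
    return "healthy"

async def run_checks(checks, timeout=2.0):
    """Run all dependency checks in parallel, each bounded by a timeout."""
    async def guarded(name, check):
        try:
            return name, await asyncio.wait_for(check(), timeout)
        except asyncio.TimeoutError:
            return name, "timeout"
        except Exception as exc:
            return name, f"unhealthy: {exc}"

    results = await asyncio.gather(
        *(guarded(name, check) for name, check in checks.items())
    )
    return dict(results)

statuses = asyncio.run(run_checks({"database": check_database, "redis": check_redis}))
```

Guarding each check individually means one slow dependency reports "timeout" without delaying or failing the others.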
Health Endpoints
GET /health # Basic health (for load balancers)
GET /health/detailed # Comprehensive health report
GET /health/dependencies/{name} # Specific dependency health
POST /health/check # Trigger manual health check
SLA/SLO Monitoring
SLO Configuration
Location: src/infrastructure/monitoring/slo_monitor.py
Default SLOs:
# API Availability
- Target: 99.9% uptime
- Time Window: 24 hours
- Warning: 98%
- Critical: 95%
# API Response Time
- Target: P95 < 2 seconds
- Time Window: 1 hour
- Warning: 90% compliance
- Critical: 80% compliance
# Investigation Success Rate
- Target: 95% success
- Time Window: 4 hours
- Warning: 92%
- Critical: 88%
# Agent Error Rate
- Target: < 1% errors
- Time Window: 1 hour
- Warning: 0.8%
- Critical: 1.5%
Error Budget Tracking
# Automatic error budget calculation (as a percentage of the budget)
error_budget_remaining = 100 * (1 - (100 - current_compliance) / (100 - target))

# Alerts on budget consumption
if error_budget_consumed > 80:
    alert("High error budget consumption")
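Error-budget arithmetic as a small helper: the budget is the allowed unavailability (100 - target), and what remains is the fraction of it not yet consumed by observed unavailability:

```python
def error_budget_remaining(current_compliance: float, target: float) -> float:
    """Percent of the SLO error budget still unspent."""
    budget = 100.0 - target                 # allowed unavailability
    consumed = 100.0 - current_compliance   # observed unavailability
    return max(0.0, 100.0 * (1.0 - consumed / budget))

# With a 99.9% target and 99.95% observed compliance,
# half the 0.1% budget has been spent
remaining = error_budget_remaining(current_compliance=99.95, target=99.9)
```

The `max(0.0, ...)` clamp keeps the figure at zero once the budget is exhausted instead of going negative.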
SLO Endpoints
GET /api/v1/monitoring/slo # All SLO status
GET /api/v1/monitoring/slo/{name} # Specific SLO
POST /api/v1/monitoring/slo # Create SLO
GET /api/v1/monitoring/error-budget # Error budget report
GET /api/v1/monitoring/alerts/violations # SLO violations
Structured Logging
Implementation
Location: src/infrastructure/observability/structured_logging.py
Log Format:
{
  "timestamp": "2025-09-20T10:28:07.123Z",
  "level": "INFO",
  "correlation_id": "uuid-1234-5678",
  "service": "cidadao-ai",
  "component": "agent.zumbi",
  "message": "Anomaly detected",
  "context": {
    "investigation_id": "inv-123",
    "anomaly_type": "price_spike",
    "confidence": 0.95
  }
}
Features:
- JSON structured format
- Correlation ID propagation
- Contextual enrichment
- Performance metrics inclusion
- Sensitive data masking
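A minimal formatter producing this shape with the stdlib `logging` module, using a `ContextVar` so the correlation ID propagates across async calls (a sketch; the project's implementation in `structured_logging.py` may differ):

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Correlation ID carried through async call chains
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="")

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log record, matching the format above."""
    def format(self, record):
        entry = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "correlation_id": correlation_id.get(),
            "service": "cidadao-ai",
            "component": record.name,
            "message": record.getMessage(),
            # `extra={"context": {...}}` on the logging call lands here
            "context": getattr(record, "context", {}),
        }
        return json.dumps(entry)

logger = logging.getLogger("agent.zumbi")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

correlation_id.set(str(uuid.uuid4()))
logger.info("Anomaly detected", extra={"context": {"confidence": 0.95}})
```

Because `ContextVar` values are task-local, concurrent requests each see their own correlation ID without explicit plumbing.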
Distributed Tracing
OpenTelemetry Integration
Location: src/infrastructure/observability/tracing.py
Trace Context:
@trace_operation("investigation.analyze")
async def analyze_contracts(contracts):
    with tracer.start_span("data_validation"):
        ...  # a child span wraps the validation step
Trace Propagation:
- B3 headers support
- W3C Trace Context
- Baggage propagation
- Custom attributes
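W3C Trace Context encodes the caller's trace state in a `traceparent` header of the form `version-trace_id-parent_id-flags`. A minimal parser illustrates the format (in practice OpenTelemetry's propagators handle this):

```python
import re
from typing import NamedTuple, Optional

class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool

# version (2 hex) - trace_id (32 hex) - parent_id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse a W3C `traceparent` header; return None if malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    flags = int(match["flags"], 16)
    return TraceContext(
        version=match["version"],
        trace_id=match["trace_id"],
        parent_id=match["parent_id"],
        sampled=bool(flags & 0x01),  # bit 0 = sampled flag
    )

ctx = parse_traceparent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
```

The sampled flag is what downstream services consult to decide whether to record spans for the request.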
Trace Visualization
- Jaeger UI integration
- Service dependency graphs
- Latency analysis
- Error tracking
Alerting System
Prometheus Alert Rules
Location: monitoring/prometheus/rules/cidadao-ai-alerts.yml
Alert Categories:
1. System Health
- alert: SystemDown
  expr: up{job="cidadao-ai-backend"} == 0
  for: 30s
  labels:
    severity: critical

- alert: HighErrorRate
  expr: error_rate > 5
  for: 2m
  labels:
    severity: warning
2. Infrastructure
- alert: DatabaseConnectionsCritical
  expr: db_connections_used / db_connections_total > 0.95
  for: 30s
  labels:
    severity: critical

- alert: CacheHitRateLow
  expr: cache_hit_rate < 70
  for: 5m
  labels:
    severity: warning
3. Agent Performance
- alert: AgentTaskFailureHigh
  expr: agent_error_rate > 10
  for: 3m
  labels:
    severity: warning

- alert: AgentQualityScoreLow
  expr: agent_quality_score < 0.8
  for: 5m
  labels:
    severity: warning
4. Business Metrics
- alert: InvestigationSuccessRateLow
  expr: investigation_success_rate < 90
  for: 10m
  labels:
    severity: warning

- alert: AnomalyDetectionAccuracyLow
  expr: anomaly_accuracy < 0.85
  for: 15m
  labels:
    severity: warning
Grafana Dashboards
System Overview Dashboard
Location: monitoring/grafana/dashboards/cidadao-ai-overview.json
Panels:
- System health status
- Active investigations count
- API response time P95
- Anomalies detected (24h)
- Request rate graph
- Agent tasks performance
- SLO compliance table
- Error budget consumption
- Database connection pool
- Cache hit rate
- External API health
- Investigation success rate
- Top anomaly types
- Memory/CPU usage
- Alert status
Agent Performance Dashboard
Location: monitoring/grafana/dashboards/cidadao-ai-agents.json
Panels:
- Agent task success rate
- Active agents count
- Average task duration
- Reflection iterations
- Performance by agent type
- Task duration percentiles
- Agent status distribution
- Top performing agents
- Error distribution
- Agent-specific metrics
- Memory usage by agent
- Communication matrix
- Quality score trends
Monitoring Configuration
Prometheus Configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'cidadao-ai-backend'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/health/metrics'
Grafana Data Sources
{
  "name": "Prometheus",
  "type": "prometheus",
  "url": "http://prometheus:9090",
  "access": "proxy"
}
Key Performance Indicators
Technical KPIs
- Uptime: Target 99.95%
- API Latency P99: < 500ms
- Error Rate: < 0.1%
- Cache Hit Rate: > 90%
- Agent Success Rate: > 95%
Business KPIs
- Investigations/Day: Track growth
- Anomalies Detected: Measure effectiveness
- Report Generation Time: < 30s
- User Satisfaction: Via feedback metrics
APM Integration
Supported Platforms
Location: src/infrastructure/apm/
New Relic
apm_integrations.setup_newrelic(
    license_key="your-key",
    app_name="cidadao-ai"
)

Datadog

apm_integrations.setup_datadog(
    api_key="your-api-key",
    app_key="your-app-key"
)

Elastic APM

apm_integrations.setup_elastic_apm(
    server_url="http://apm-server:8200",
    secret_token="your-token"
)
APM Features
- Performance tracking decorators
- Error reporting with context
- Custom business metrics
- Distributed trace correlation
Chaos Engineering
Chaos Experiments
Location: src/api/routes/chaos.py
Available Experiments:
Latency Injection
- Configurable delays
- Probability-based
- Auto-expiration
Error Injection
- HTTP error codes
- Configurable rate
- Multiple error types
Resource Pressure
- Memory consumption
- CPU load
- Controlled intensity
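Probability-based latency injection reduces to a guarded `asyncio.sleep` placed in front of the handler. A self-contained sketch (the real experiments are driven through the chaos endpoints and carry auto-expiration logic this sketch omits):

```python
import asyncio
import random
import time

async def inject_latency(handler, delay_seconds=0.5, probability=0.25, rng=random):
    """Delay the handler with the given probability, then run it normally."""
    if rng.random() < probability:
        await asyncio.sleep(delay_seconds)
    return await handler()

async def handler():
    return "ok"

# probability=1.0 makes the experiment deterministic for this demonstration
start = time.perf_counter()
result = asyncio.run(inject_latency(handler, delay_seconds=0.05, probability=1.0))
elapsed = time.perf_counter() - start
```

Passing the random source as `rng` keeps the injector testable: a seeded or stubbed generator makes experiments reproducible.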
Chaos Endpoints
POST /api/v1/chaos/inject/latency
POST /api/v1/chaos/inject/errors
POST /api/v1/chaos/experiments/memory-pressure
POST /api/v1/chaos/experiments/cpu-pressure
POST /api/v1/chaos/stop/{experiment}
GET /api/v1/chaos/status
Best Practices
- Set Meaningful SLOs: Based on user expectations
- Monitor Business Metrics: Not just technical ones
- Use Correlation IDs: For request tracing
- Alert on Symptoms: Not causes
- Document Runbooks: For each alert
- Regular Reviews: Of metrics and thresholds
- Capacity Planning: Based on trends
Troubleshooting
Missing Metrics
- Check Prometheus scrape configuration
- Verify metrics endpoint accessibility
- Review metric registration code
Alert Fatigue
- Tune alert thresholds
- Implement alert grouping
- Use inhibition rules
Dashboard Performance
- Optimize query time ranges
- Use recording rules
- Implement caching
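Recording rules precompute expensive dashboard queries into a single series evaluated on a schedule. A sketch of the shape such a rule takes; the metric name `http_requests_total` and its labels are placeholders, not the project's actual series:

```yaml
groups:
  - name: cidadao_ai_recording
    interval: 30s
    rules:
      # Precomputed 5m error ratio, read by dashboards as one series
      - record: job:http_request_error_rate:5m
        expr: |
          sum(rate(http_requests_total{job="cidadao-ai-backend", status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total{job="cidadao-ai-backend"}[5m]))
```

The `level:metric:operations` naming convention makes recorded series easy to distinguish from raw ones.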
Additional Resources
For monitoring questions or improvements, contact: Anderson Henrique da Silva