| # 🔧 Troubleshooting Guide - Cidadão.AI Backend | |
| **Author**: Anderson Henrique da Silva | |
| **Last Updated**: 2025-10-03 (São Paulo, Brazil) | |
| This directory contains troubleshooting guides and solutions for common issues encountered during development, deployment, and operation of the Cidadão.AI Backend. | |
| ## 📋 Available Guides | |
| ### Deployment Issues | |
| - **[FIX_HUGGINGFACE_DEPLOYMENT.md](./FIX_HUGGINGFACE_DEPLOYMENT.md)** - HuggingFace Spaces deployment fixes | |
| - Common HF Spaces errors | |
| - Docker configuration issues | |
| - Environment variable problems | |
| - Build failures and solutions | |
| - **[EMERGENCY_SOLUTION.md](./EMERGENCY_SOLUTION.md)** - Emergency recovery procedures | |
| - Critical system failures | |
| - Data recovery strategies | |
| - Rollback procedures | |
| - Incident response guide | |
| --- | |
| ## 🚨 Common Issues & Quick Fixes | |
| ### 1. Import Errors | |
| **Problem**: `ModuleNotFoundError` or `ImportError` | |
| **Solution**: | |
| ```bash | |
| # Reinstall dependencies | |
| make install-dev | |
| # Or manually | |
| pip install -r requirements.txt | |
| # Clear Python cache | |
| find . -type d -name "__pycache__" -exec rm -r {} + | |
| find . -type f -name "*.pyc" -delete | |
| ``` | |
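If reinstalling does not help, it is worth confirming which interpreter is running and which modules it can actually see. The following is a minimal sketch using only the standard library. The helper name `missing_modules` and the example module names are assumptions for illustration, not the project's real dependency list.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Show which interpreter is active (a wrong virtualenv is a common cause).
print("interpreter:", sys.executable)
# Example module names only; replace them with the ones from your traceback.
print("missing:", missing_modules(["json", "fastapi", "redis"]))
```

If the interpreter path points outside your virtual environment, activate the environment and reinstall before digging further.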
| --- | |
| ### 2. Database Connection Issues | |
| **Problem**: `OperationalError: could not connect to database` | |
| **Solution**: | |
| ```bash | |
| # Check PostgreSQL is running | |
| sudo systemctl status postgresql | |
| # Check the connection string in .env (expected format shown below) | |
| DATABASE_URL=postgresql://user:password@localhost:5432/cidadao_ai | |
| # Fallback to in-memory (development only) | |
| # Remove or comment out DATABASE_URL in .env | |
| ``` | |
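To separate "PostgreSQL is not reachable" from "the credentials are wrong", a quick TCP check against the host and port in `DATABASE_URL` can help. This is a sketch under the assumption that the URL follows the `postgresql://user:password@host:port/dbname` shape shown above. `parse_database_url` and `can_reach` are hypothetical helpers, not part of the project.

```python
import socket
from urllib.parse import urlsplit


def parse_database_url(url):
    """Split a postgresql:// URL into the parts needed for a connectivity check."""
    parts = urlsplit(url)
    return {
        "host": parts.hostname or "localhost",
        "port": parts.port or 5432,  # 5432 is the PostgreSQL default port
        "database": parts.path.lstrip("/"),
        "user": parts.username,
    }


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `info = parse_database_url(os.environ["DATABASE_URL"])` followed by `can_reach(info["host"], info["port"])`. A `True` result only shows that the port is open; it says nothing about whether the user, password, or database name are valid.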
| --- | |
| ### 3. Redis Connection Errors | |
| **Problem**: `redis.exceptions.ConnectionError` | |
| **Solution**: | |
| ```bash | |
| # Start Redis | |
| sudo systemctl start redis | |
| # Or use Docker | |
| docker run -d -p 6379:6379 redis:alpine | |
| # Redis is OPTIONAL - system works without it | |
| # Remove REDIS_URL from .env to disable | |
| ``` | |
| --- | |
| ### 4. API Key / Authentication Issues | |
| **Problem**: `401 Unauthorized` or `Invalid API key` | |
| **Solution**: | |
| ```bash | |
| # Check .env file has required keys | |
| GROQ_API_KEY=your-groq-api-key | |
| JWT_SECRET_KEY=your-jwt-secret | |
| SECRET_KEY=your-secret-key | |
| # Generate new secrets | |
| python -c "import secrets; print(secrets.token_urlsafe(32))" | |
| # Test an access token against the auth endpoint | |
| curl -H "Authorization: Bearer YOUR_TOKEN" http://localhost:8000/api/v1/auth/me | |
| ``` | |
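A small check for missing or placeholder values can rule out the most common cause before you start debugging tokens. This sketch assumes the variables from `.env` have already been loaded into the process environment. `find_key_problems` is a hypothetical helper, and the `your-` prefix test simply matches the sample values shown above.

```python
import os

REQUIRED_KEYS = ("GROQ_API_KEY", "JWT_SECRET_KEY", "SECRET_KEY")
PLACEHOLDER_PREFIX = "your-"  # matches the sample values shown above


def find_key_problems(env, required=REQUIRED_KEYS):
    """Return {key: reason} for every required key that is missing or a placeholder."""
    problems = {}
    for key in required:
        value = env.get(key, "").strip()
        if not value:
            problems[key] = "missing or empty"
        elif value.startswith(PLACEHOLDER_PREFIX):
            problems[key] = "still a placeholder"
    return problems


# An empty dict means all required keys look set.
print(find_key_problems(os.environ))
```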
| --- | |
| ### 5. Portal da Transparência 403 Errors | |
| **Problem**: Most Portal da Transparência endpoints return 403 Forbidden | |
| **Solution**: | |
| This is **expected behavior**: 78% of the endpoints return 403, and the access tiers that would unlock them are not documented. | |
| **Workarounds**: | |
| 1. Use the 22% of endpoints that do work (contracts with `codigoOrgao`, public servants by CPF) | |
| 2. Enable demo mode (works without API key) | |
| 3. Use dados.gov.br integration as fallback | |
| ```bash | |
| # In .env: leave the value empty for demo mode | |
| TRANSPARENCY_API_KEY= | |
| ``` | |
| See: [Portal Integration Guide](../api/PORTAL_TRANSPARENCIA_INTEGRATION.md) | |
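The workarounds above amount to a fallback chain: try the Portal first, and move on to the next source when it answers 403. The sketch below shows one way such a chain might look. It is not the project's actual integration code; `Forbidden`, `fetch_with_fallback`, and the fetcher callables are hypothetical names introduced for illustration.

```python
class Forbidden(Exception):
    """Raised by a fetcher when the upstream API answers 403."""


def fetch_with_fallback(query, fetchers):
    """Try each (name, fetch) pair in order, skipping sources that answer 403.

    Returns (source_name, data) from the first fetcher that succeeds.
    """
    skipped = []
    for name, fetch in fetchers:
        try:
            return name, fetch(query)
        except Forbidden:
            skipped.append(name)  # expected for blocked endpoints; try the next source
    raise RuntimeError(f"all sources returned 403: {', '.join(skipped)}")
```

Each fetcher is any callable that takes the query and either returns data or raises `Forbidden`, so the Portal client, demo mode, and the dados.gov.br integration could all be plugged in as `(name, callable)` pairs in priority order.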
| --- | |
| ### 6. Test Failures | |
| **Problem**: Tests failing with various errors | |
| **Solution**: | |
| ```bash | |
| # Run specific test to diagnose | |
| pytest tests/unit/agents/test_zumbi.py -v | |
| # Check test environment | |
| pytest --collect-only | |
| # Clear test cache | |
| pytest --cache-clear | |
| # Run with debug output | |
| pytest -vv --tb=long | |
| # Check coverage | |
| pytest --cov=src --cov-report=html | |
| ``` | |
| --- | |
| ### 7. Agent Timeout Errors | |
| **Problem**: Agent operations timing out | |
| **Solution**: | |
| ```bash | |
| # Increase timeout in .env (value in seconds; 300 = 5 minutes) | |
| AGENT_TIMEOUT=300 | |
| # Check GROQ API status | |
| curl https://api.groq.com/openai/v1/models -H "Authorization: Bearer $GROQ_API_KEY" | |
| # Monitor agent logs | |
| tail -f logs/agents.log | |
| ``` | |
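To reproduce a timeout in isolation, it can help to wrap a single coroutine in the same kind of limit. This is a sketch of how a value like `AGENT_TIMEOUT` might be applied with `asyncio.wait_for`; it is not the project's actual agent code. `run_with_timeout` is a hypothetical helper, and the fallback of 300 seconds just mirrors the value shown above.

```python
import asyncio
import os

# AGENT_TIMEOUT is the variable shown above; 300 is used here only as a fallback.
AGENT_TIMEOUT = float(os.environ.get("AGENT_TIMEOUT", "300"))


async def run_with_timeout(coro, timeout=AGENT_TIMEOUT):
    """Await coro and return (True, result), or (False, None) if it times out."""
    try:
        return True, await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        return False, None
```

Calling `asyncio.run(run_with_timeout(some_coroutine(), timeout=5))` with a deliberately short limit shows quickly whether the slowness comes from the operation itself or from something upstream such as the LLM API.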
| --- | |
| ### 8. Memory Issues / Out of Memory | |
| **Problem**: Application crashes with OOM errors | |
| **Solution**: | |
| ```bash | |
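| # Reduce agent pool size (default is 5) | |
| AGENT_POOL_SIZE=3 | |
| # Use the C library allocator instead of pymalloc; this does not change garbage collection | |
| # and must be set in the environment before Python starts | |
| PYTHONMALLOC=malloc | |
| # Monitor memory usage | |
| make monitoring-up | |
| # Check Grafana dashboard | |
| # Clear the Redis cache (FLUSHALL deletes every key in every Redis database) | |
| redis-cli FLUSHALL | |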
| ``` | |
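When the settings above only postpone the crash, the standard library's `tracemalloc` can show which source lines hold the most memory. A minimal sketch follows; `top_allocations` is a hypothetical helper, and the list of byte strings is only a stand-in for the workload you suspect.

```python
import tracemalloc


def top_allocations(limit=5):
    """Return the `limit` source lines that currently hold the most traced memory."""
    snapshot = tracemalloc.take_snapshot()
    return [str(stat) for stat in snapshot.statistics("lineno")[:limit]]


tracemalloc.start()
data = [bytes(1000) for _ in range(1000)]  # stand-in for the workload under suspicion
for line in top_allocations():
    print(line)
tracemalloc.stop()
```

Tracing adds overhead, so this is better suited to a local reproduction than to a production process.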
| --- | |
| ### 9. CORS Errors (Frontend Integration) | |
| **Problem**: `CORS policy: No 'Access-Control-Allow-Origin' header` | |
| **Solution**: | |
| ```python | |
| # In src/api/app.py, verify CORS settings | |
| ALLOWED_ORIGINS = [ | |
|     "http://localhost:3000", | |
|     "http://localhost:3001", | |
|     "https://your-frontend.vercel.app", | |
| ] | |
| ``` | |
| Check: [CORS Configuration Guide](../development/CORS_CONFIGURATION.md) | |
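Browsers send the `Origin` header as scheme, host, and optional port, with no path or trailing slash, so near misses in the allow list are a frequent cause of this error. The sketch below assumes the middleware compares origins as exact strings, which is typical but worth verifying for your setup. `explain_origin` is a hypothetical helper for diagnosing a mismatch.

```python
ALLOWED_ORIGINS = [
    "http://localhost:3000",
    "http://localhost:3001",
    "https://your-frontend.vercel.app",
]


def explain_origin(origin, allowed=ALLOWED_ORIGINS):
    """Return 'allowed', or a hint describing why the origin does not match."""
    if origin in allowed:
        return "allowed"
    # An entry such as "http://localhost:3000/" never equals the browser's Origin.
    if origin in [entry.rstrip("/") for entry in allowed]:
        return "allow-list entry has a trailing slash"
    # Check whether the same host is listed under the other scheme.
    scheme, _, rest = origin.partition("://")
    other = ("https" if scheme == "http" else "http") + "://" + rest
    if other in allowed:
        return "scheme mismatch (http vs https)"
    return "origin not listed"
```

Pass in the exact `Origin` value from the failing request in the browser's network tab, for example `explain_origin("https://localhost:3000")`.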
| --- | |
| ### 10. HuggingFace Spaces Build Failures | |
| **Problem**: Build fails on HuggingFace Spaces | |
| **Common causes**: | |
| 1. Missing dependencies in `requirements-minimal.txt` | |
| 2. Port not set to 7860 | |
| 3. Dockerfile not found or misconfigured | |
| **Solution**: | |
| ```dockerfile | |
| # Ensure Dockerfile exposes port 7860 | |
| EXPOSE 7860 | |
| # Use the simplified app.py for HF Spaces, not the full src/api/app.py | |
| CMD ["python", "app.py"] | |
| ``` | |
| See: [FIX_HUGGINGFACE_DEPLOYMENT.md](./FIX_HUGGINGFACE_DEPLOYMENT.md) | |
| --- | |
| ## 🔍 Debugging Tools | |
| ### Enable Debug Logging | |
| ```python | |
| # In your code (for example, at application startup) | |
| import logging | |
| logging.basicConfig(level=logging.DEBUG) | |
| # For specific modules | |
| logging.getLogger("src.agents").setLevel(logging.DEBUG) | |
| logging.getLogger("src.api").setLevel(logging.INFO) | |
| ``` | |
| ### Use Interactive Debugger | |
| ```python | |
| # Add breakpoint | |
| import pdb; pdb.set_trace() | |
| # Or use ipdb for a better experience (requires: pip install ipdb) | |
| import ipdb; ipdb.set_trace() | |
| ``` | |
| ### Profile Performance | |
| ```bash | |
| # Profile with cProfile | |
| python -m cProfile -o profile.stats src/api/app.py | |
| # Analyze with snakeviz | |
| pip install snakeviz | |
| snakeviz profile.stats | |
| ``` | |
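Profiling data can also be inspected without extra tools through the standard library's `pstats`. The sketch below profiles a single call and returns the top entries by cumulative time. `slow_sum` is a stand-in workload and `profile_report` is a hypothetical helper; neither comes from the project.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Stand-in workload; replace with the call you want to profile."""
    return sum(i * i for i in range(n))


def profile_report(func, *args, limit=10):
    """Run func under cProfile and return the top `limit` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func(*args)
    profiler.disable()
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


print(profile_report(slow_sum, 100_000))
```

A file written with `-o profile.stats` can be loaded the same way via `pstats.Stats("profile.stats")`.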
| ### Monitor in Real-time | |
| ```bash | |
| # Start monitoring stack | |
| make monitoring-up | |
| # Access Grafana at http://localhost:3000 | |
| # User: admin, Password: cidadao123 | |
| # Check Prometheus metrics at http://localhost:9090 | |
| ``` | |
| --- | |
| ## 📊 Health Check Endpoints | |
| Use these endpoints to diagnose system health: | |
| ```bash | |
| # Basic health check | |
| curl http://localhost:8000/health | |
| # Detailed health with dependencies | |
| curl http://localhost:8000/api/v1/health/detailed | |
| # Agent status | |
| curl http://localhost:8000/api/v1/agents/status | |
| # Database connection | |
| curl http://localhost:8000/api/v1/health/db | |
| # Redis connection | |
| curl http://localhost:8000/api/v1/health/cache | |
| ``` | |
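The same checks can be run in one pass from Python, which is handy in scripts or CI. This is a sketch using only the standard library and the endpoint paths listed above. `http_status` and `check_all` are hypothetical helpers, and treating any 2xx status as healthy is an assumption about how these endpoints report.

```python
import urllib.error
import urllib.request

BASE_URL = "http://localhost:8000"
ENDPOINTS = [
    "/health",
    "/api/v1/health/detailed",
    "/api/v1/agents/status",
    "/api/v1/health/db",
    "/api/v1/health/cache",
]


def http_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, and similar


def check_all(base_url=BASE_URL, endpoints=ENDPOINTS, get_status=http_status):
    """Map each endpoint path to 'ok', 'http <code>', or 'unreachable'."""
    report = {}
    for path in endpoints:
        status = get_status(base_url + path)
        if status is None:
            report[path] = "unreachable"
        elif 200 <= status < 300:
            report[path] = "ok"
        else:
            report[path] = f"http {status}"
    return report
```

Running `for path, state in check_all().items(): print(path, state)` prints one line per endpoint. If every path is `unreachable`, the application itself is probably down rather than a single dependency.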
| --- | |
| ## 🚑 Emergency Procedures | |
| ### System Down / Critical Failure | |
| 1. **Check health endpoints** to identify failing components | |
| 2. **Review logs**: `tail -f logs/*.log` | |
| 3. **Restart services**: | |
| ```bash | |
| systemctl restart cidadao-ai | |
| # Or | |
| docker-compose restart | |
| ``` | |
| 4. **Rollback if needed**: See [EMERGENCY_SOLUTION.md](./EMERGENCY_SOLUTION.md) | |
| ### Data Corruption | |
| 1. **Stop the application immediately** | |
| 2. **Create database backup**: | |
| ```bash | |
| pg_dump cidadao_ai > backup_$(date +%Y%m%d_%H%M%S).sql | |
| ``` | |
| 3. **Investigate in read-only mode** | |
| 4. **Restore from last known good backup if necessary** | |
| ### Security Incident | |
| 1. **Rotate all secrets immediately**: | |
| ```bash | |
| # Generate new secrets | |
| python -c "import secrets; print(secrets.token_urlsafe(32))" | |
| ``` | |
| 2. **Revoke compromised API keys** | |
| 3. **Review access logs** | |
| 4. **Apply security patches** | |
| 5. **Notify affected users** | |
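Step 1 above can be scripted so that every locally generated secret gets a fresh value in one go. This is a sketch only: `generate_env_lines` is a hypothetical helper, and the key names are the ones mentioned earlier in this guide.

```python
import secrets

# Keys named earlier in this guide. API keys issued by third parties
# (for example GROQ_API_KEY) must be rotated in the provider's console instead.
LOCAL_SECRETS = ("JWT_SECRET_KEY", "SECRET_KEY")


def generate_env_lines(names=LOCAL_SECRETS, nbytes=32):
    """Return .env-style lines with a fresh random value for each name."""
    return [f"{name}={secrets.token_urlsafe(nbytes)}" for name in names]


for line in generate_env_lines():
    print(line)
```

Rotating a JWT signing key will typically invalidate tokens signed with the old key, so expect users to have to log in again.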
| --- | |
| ## 📝 Logging & Monitoring | |
| ### Log Locations | |
| ```bash | |
| # Application logs | |
| logs/app.log | |
| # Agent logs | |
| logs/agents.log | |
| # Error logs | |
| logs/error.log | |
| # Access logs (if nginx/reverse proxy) | |
| /var/log/nginx/access.log | |
| ``` | |
| ### Log Analysis | |
| ```bash | |
| # Search for errors | |
| grep -i error logs/*.log | |
| # Find specific agent errors | |
| grep "agent=zumbi" logs/agents.log | grep ERROR | |
| # Count errors by type | |
| awk '/ERROR/ {print $NF}' logs/error.log | sort | uniq -c | |
| ``` | |
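The `awk` one-liner above can be mirrored in Python when you want to post-process the counts. Like the shell version, this sketch treats the last whitespace-separated field of each `ERROR` line as the error type. The sample lines are hypothetical, and the real format of `logs/error.log` may differ.

```python
from collections import Counter


def count_errors(lines):
    """Count ERROR lines by their last whitespace-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if "ERROR" in line and fields:
            counts[fields[-1]] += 1
    return counts


# Hypothetical log lines; the real format of logs/error.log may differ.
sample = [
    "2025-10-03 12:00:01 ERROR agent=zumbi TimeoutError",
    "2025-10-03 12:00:02 INFO agent=zumbi done",
    "2025-10-03 12:00:03 ERROR agent=zumbi TimeoutError",
    "2025-10-03 12:00:04 ERROR api ConnectionError",
]
print(count_errors(sample).most_common())
```

To run it against a real file, pass the open file object: `count_errors(open("logs/error.log", encoding="utf-8"))`.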
| --- | |
| ## 🔗 Related Resources | |
| - [Deployment Guide](../deployment/README.md) | |
| - [Development Guide](../development/README.md) | |
| - [API Documentation](../api/README.md) | |
| - [Architecture Overview](../architecture/README.md) | |
| --- | |
| ## 📞 Getting Help | |
| ### Before Opening an Issue | |
| 1. ✅ Check this troubleshooting guide | |
| 2. ✅ Search existing GitHub issues | |
| 3. ✅ Review relevant documentation | |
| 4. ✅ Try suggested solutions above | |
| ### When Opening an Issue | |
| Include: | |
| - **Error message** (full stack trace) | |
| - **Steps to reproduce** | |
| - **Environment details** (OS, Python version, deployment type) | |
| - **Configuration** (relevant .env variables, sanitized) | |
| - **Logs** (relevant sections) | |
| - **What you've tried** (from this guide) | |
| ### Issue Template | |
| ````markdown | |
| **Environment**: | |
| - OS: Ubuntu 22.04 | |
| - Python: 3.11.5 | |
| - Deployment: Local development | |
| **Problem**: | |
| [Describe the issue] | |
| **Error Message**: | |
| ``` | |
| [Paste full error] | |
| ``` | |
| **Steps to Reproduce**: | |
| 1. ... | |
| 2. ... | |
| **What I've Tried**: | |
| - Checked logs: [findings] | |
| - Tried solution X from troubleshooting guide: [result] | |
| **Additional Context**: | |
| [Any other relevant information] | |
| ```` | |
| --- | |
| ## 💡 Tips for Preventing Issues | |
| ### Development | |
| - ✅ Run `make ci` before committing | |
| - ✅ Keep dependencies updated | |
| - ✅ Write tests for new features | |
| - ✅ Use type hints and linting | |
| ### Deployment | |
| - ✅ Use environment variables (never hardcode) | |
| - ✅ Test in staging before production | |
| - ✅ Monitor health endpoints | |
| - ✅ Keep backups current | |
| ### Operations | |
| - ✅ Set up alerts for critical metrics | |
| - ✅ Regular log rotation | |
| - ✅ Capacity planning | |
| - ✅ Security updates | |
| --- | |
| **Remember**: Most issues have been encountered and solved before. Check this guide first, then ask for help! 🚀 | |