Sunday, May 10, 2026

Using Codex to investigate PagerDuty alerts

Codex CLI is remarkably smart (gpt 5.5 medium, maybe better than Claude Code). All I told it was "get my PagerDuty incidents and find the root cause", and it:


- read the PagerDuty skill in our company skills repo

- used it to list my team's active incidents

- checked the missing S3 success files

- traced the alerts back through the Airflow DAGs

- queried live Airflow task state in prod

- checked S3 timestamps for upstream sludge success files

- used Trino to verify the alert

- concluded that the main issue was late sludge log completion delaying downstream metrics, while the alert was a real low-volume threshold miss, not missing data
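
The last few steps boil down to a small decision: did the upstream success file land at all, did it land late, and was the row count really under the threshold? Here is a rough sketch of that triage logic in Python. It is not what Codex actually ran; the function name `triage`, its parameters, and the example values are all hypothetical, and in practice the timestamp would come from S3 and the row count from a Trino query.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Optional


def triage(
    success_landed_at: Optional[datetime],  # when the success file appeared, or None
    expected_by: datetime,                  # assumed deadline for the upstream job
    row_count: int,                         # rows actually observed downstream
    min_rows: int,                          # assumed alert threshold
) -> List[str]:
    """Return human-readable findings for one alert (hypothetical helper)."""
    findings = []
    if success_landed_at is None:
        # No success file at all: this is the "missing data" case.
        findings.append("missing upstream success file")
    elif success_landed_at > expected_by:
        # File exists but arrived after the deadline: upstream was late.
        findings.append(f"upstream late by {success_landed_at - expected_by}")
    if row_count < min_rows:
        # Independent of lateness: the volume threshold was genuinely missed.
        findings.append(f"low volume: {row_count} < {min_rows}")
    return findings


# Made-up values for illustration only.
expected = datetime(2026, 5, 10, 6, 0, tzinfo=timezone.utc)
landed = expected + timedelta(hours=2, minutes=30)
print(triage(landed, expected, row_count=800, min_rows=1000))
# → ['upstream late by 2:30:00', 'low volume: 800 < 1000']
```

Keeping the lateness check and the volume check separate mirrors the conclusion above: an alert can be a real threshold miss even when no data is missing.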


Pretty amazing.
