Senior Site Reliability Engineer
💥 The impact you will have
- Design for reliability: Set SLOs/SLIs, build self-healing architectures, and drive incident-prevention projects that keep our APIs and real-time ordering flows under 100 ms at p95.
- Own observability: Level up dashboards, alerts, and distributed tracing so teams can detect issues before customers do.
- Automate deployments: Evolve our Buildkite pipelines and Terraform modules to give engineers <10-minute, one-click rollouts (and clean rollbacks).
- Champion security & compliance: Harden infra with least-privilege IAM, threat-model topology changes, and guide SOC 2 / PCI efforts.
- Partition & scale datastores: Tune Postgres for multi-TB workloads, maintain MongoDB sharding, and shepherd Kafka topic management as event volume climbs.
- Lead incident response: Rotate with the on-call SREs, run blameless post-mortems, and convert findings into durable fixes.
- Mentor & collaborate: Pair with product engineers on capacity reviews, guide junior devs on Docker best practices, and evangelize “you build it, you run it.”
🤝 Who you’ll work with
- Backend, frontend, and data engineers across three time zones, whom you'll partner with daily
- Product, Customer Support, and Restaurant Success teams, to keep the customer experience seamless
✅ Minimum requirements
- 5+ years running production workloads on AWS (or GCP/Azure) with infrastructure-as-code (Terraform/CDK/CloudFormation)
- Hands-on experience operating container orchestration (ECS, EKS, Kubernetes, Nomad, etc.) and designing blue/green or canary rollouts
- Depth in at least two of our core datastores (Postgres, MongoDB, Kafka) including backup/restore, upgrades, and performance tuning
- Fluency with CI/CD pipelines (we use Buildkite + GitHub Actions) and a knack for automating everything with shell, Python, or TypeScript
- Proven track record setting up monitoring/alerting in Datadog, Prometheus, or similar, with clear SLO/SLA ownership
- Strong grasp of Linux networking, load balancing (Cloudflare/ELB), and CDN/edge-security concepts
- Excellent incident-management and root-cause analysis skills; able to write crisp RCAs and follow through on action items
- Passion for customer-centric thinking, rapid iteration, and continuous learning