Senior Site Reliability Engineer

💥 The impact you will have

  • Design for reliability: Set SLOs/SLIs, build self-healing architectures, and drive incident-prevention projects that keep our APIs and real-time ordering flows under 100 ms at p95.
  • Own observability: Level up dashboards, alerts, and distributed tracing so teams can detect issues before customers do.
  • Automate deployments: Evolve our Buildkite pipelines and Terraform modules to give engineers one-click rollouts in under 10 minutes (and clean rollbacks).
  • Champion security & compliance: Harden infra with least-privilege IAM, threat-model topology changes, and guide SOC 2 / PCI efforts.
  • Partition & scale datastores: Tune Postgres for multi-TB workloads, maintain MongoDB sharding, and shepherd Kafka topic management as event volume climbs.
  • Lead incident response: Rotate with the on-call SREs, run blameless post-mortems, and convert findings into durable fixes.
  • Mentor & collaborate: Pair with product engineers on capacity reviews, guide junior devs on Docker best practices, and evangelize “you build it, you run it.”

🤝 Who you’ll work with

  • You'll partner daily with backend, frontend, and data engineers across three time zones
  • You'll collaborate with Product, Customer Support, and Restaurant Success teams to keep the customer experience seamless

✅ Minimum requirements

  • 5+ years running production workloads on AWS (or GCP/Azure) with infrastructure-as-code (Terraform/CDK/CloudFormation)
  • Hands-on experience operating container orchestration (ECS, EKS, Kubernetes, Nomad, etc.) and designing blue/green or canary rollouts
  • Depth in at least two of our core datastores (Postgres, MongoDB, Kafka) including backup/restore, upgrades, and performance tuning
  • Fluency with CI/CD pipelines (we use Buildkite + GitHub Actions) and a knack for automating everything with shell, Python, or TypeScript
  • Proven track record setting up monitoring/alerting in Datadog, Prometheus, or similar, with clear SLO/SLA ownership
  • Strong grasp of Linux networking, load balancing (Cloudflare/ELB), and CDN/edge-security concepts
  • Excellent incident-management and root-cause analysis skills; able to write crisp RCAs and follow through on action items
  • Passion for customer-centric thinking, rapid iteration, and continuous learning

Owner.com
