Here at Sysdig, we’re what you might call container-obsessed. It starts with our unique technology, which listens to the heart of the operating system to surface the deepest data with the least overhead. From there, we’ve created the first-ever Container Intelligence Platform, which proactively uncovers issues before they manifest, and allows for deep digging to solve the most complex problems.
We’re looking for a Senior Site Reliability Engineer (SRE) to help us build large-scale distributed solutions. You will apply software engineering techniques and discipline to production operations to attack major problems and fix them for good. You will design, build, and own the end-to-end availability and reliability of both SaaS and On-Prem Sysdig products.
What you'll do:
- Build reliable, maintainable systems and be responsible for setting the standards for our production environment
- Work on code and automation to create new systems for scaling deployment and operations of Sysdig products
- Build and manage various components of the internal and production environments with a focus on configuration management, continuous integration and platform automation
- Manage software delivery, systems integration, and developer support tools
- Build custom tools and instrumentation that ensure maximum system uptime and health
- Enhance developer CI/CD pipelines using Jenkins and GitHub
- Support services before they go live through activities such as developing software platforms and frameworks, capacity planning and launch reviews
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health
- Respond to pings, pages, and alerts, and investigate issues in our products that you can really sink your teeth into
What you'll need:
- 5-7+ years of experience as a DevOps or Site Reliability Engineer
- Strong knowledge of the container ecosystem with experience in Docker / Rocket environments
- Experience in managing and troubleshooting CI/CD pipelines using Jenkins, Bamboo or TeamCity
- Experience in managing AWS resources including EC2, RDS, Auto Scaling groups, ALB/NLB, IAM
- Experience in diagnosing and troubleshooting customer facing production service outages
- Aptitude for troubleshooting complex problems in high-throughput web applications and network services
- Proficient in data structures, algorithms, and software design
- Command of at least one of the following: Java, Python, Bash, or Golang
- Deep understanding of Linux systems and networking
- Working knowledge of Git
We'd be super excited if you have:
- Built, automated, and maintained infrastructure in Amazon Web Services using CloudFormation or Terraform (or at least Puppet, Chef, or SaltStack)
- Experience in monitoring cloud services using tools like Sysdig, Datadog, Prometheus, Grafana, Graphite, Nagios, or Zabbix
- Deployed Kubernetes or OpenStack clusters
- Managed any of these clusters: Cassandra, HBase, HDFS, or Elasticsearch
- Set up Kafka or Redis clusters
- Used log aggregation services like Elasticsearch or Splunk
- Knowledge of ITIL terminology for incident and problem management
Why work at Sysdig?
- We’re a well-funded startup that already has a large enterprise customer base.
- We have a pragmatic, approachable engineering culture, from the CEO down.
- We have an organizational focus on delivering value to customers.
- Our open source tools (https://www.sysdig.org) are widely used and loved by technologists & developers.
- We have fun team and company events, beer outings, and lots of espresso (if you’re into that).
Along with top-notch health insurance coverage, we offer a variety of benefits and perks, such as:
- Desk and tech setup of your choice (for wherever you work)
- IRA with company matching up to 3% of salary
- Unlimited vacation policy
- Monthly self-improvement grant – spend on yourself however you see fit
- Free weekly team lunches and delicious snacks every day of the week
- Free monthly house cleaning service