Senior Systems Engineer
Sauce Labs Inc
Sauce Labs provides the world's largest automation cloud for testing web and native/hybrid mobile applications. Founded by the original creator of Selenium, Sauce Labs helps companies accelerate software development cycles, improve application quality, and deploy with confidence across 500+ browser/OS platforms. Join us in making the world a better place for continuous integration and software development. We’re building a next-generation infrastructure-as-a-service platform.
We’re looking for a Senior Systems Engineer to join our Ops Team. This role will be responsible for the successful operation and scaling of the infrastructure and software that powers Sauce Labs and launches over 10 million VMs a month.
- Provide strong leadership on the team under the guidance of the Systems Manager/Architect
- Write tools and scripts to provide automation and self-service solutions for ourselves and other teams
- Design new systems to support production services
- Install, configure and debug hardware and systems in our data center
- Creatively solve scaling challenges in a rapidly expanding cloud environment
- Work with real hardware - Cisco UCS B & C series servers, SuperMicro Twin-Pro, storage (NAS and SAN), Mac-in-a-datacenter, custom appliances for mobile devices, load balancers, and beyond
- Help improve monitoring and identify key performance metrics
- Proactive R&D - discovering and implementing new tools, emerging technology, etc.
- Disaster recovery design, implementation, and maintenance
- Create NOC runbooks, procedures, documentation, and diagrams of the environments you manage
- Troubleshoot and resolve server/network issues
- Help maintain hardware in Sauce’s colocation facilities
- Help build out new data centers around the globe
- Participation in 24x7 on-call rotation
Here are a few examples of the kind of projects you might work on:
- Optimizing hardware and configuration to improve hypervisor performance
- Automating deployment of operating systems to bare metal servers
- Building and optimizing an ELK cluster for our development team to monitor and analyze production system usage
We have a lot of big projects and decisions that need to be made, and in this role you would be a key part of that process. Sound like fun? Here’s what we’re looking for:
Our Ideal Candidate:
- Able to execute on high-level goals independently and with cross-functional teams
- 8+ years of recent experience working as a Linux administrator/engineer at scale (hundreds of systems) and designing/deploying ‘highly available’ solutions
- 2+ years of recent professional experience designing, developing, and operating Configuration Management solutions such as Chef, Puppet, Salt (preferred), or Ansible (preferred) at scale
- Solid experience in Linux tuning, profiling, and monitoring
- Strong skills in at least one language: Python (preferred), Ruby, or Bash
- Experience deploying/managing KVM/QEMU and LXC
- Experience with Kubernetes, Docker, and their ecosystems
- Experience managing day-to-day operations with Redis and Memcached
- Solid understanding of cloud/networking/distributed computing environment concepts, including TCP/IP connections, firewalls, VLANs, etc.
- Familiar with ZFS on Linux and storage appliances (iSCSI and NFS)
- Experience with and understanding of contemporary metrics, monitoring, and logging solutions, especially StatsD, Graphite, ELK, Splunk, Nagios, etc.
- Highly organized, able to multi-task, able to work individually, as well as within a team, and across teams
- Excellent communication skills, both verbal and written, across all user levels
- Experience with deployment automation in physical and virtual environments (PXE, MAAS (preferred))
- Experience with InSpec or a similar tool for testing configuration management
- Working knowledge of load balancing technologies (hard/soft)
- Proven experience collaborating in a cross-functional team environment
- Familiarity with software engineering practices, including n-tier architecture, configuration management, development methodologies (e.g. agile, waterfall, spiral, prototyping), etc.
- This role can be remote from SF within the continental US. Some travel to the South Bay or SF is required.