Lead Site Reliability Engineer in Payments and Wallet Space
What this job entails
As the Lead SRE at Tutuka, you'll be working closely with the entire technical team to ensure the reliability of enterprise-level, highly scalable, highly secure financial processing systems that power tens of millions of transactions, and tying them to web, mobile and API interfaces that make it easy for people to issue, redeem and reconcile prepaid cards all over the world.
We already have a team of amazing developers who work out of our local offices in Johannesburg, South Africa, as well as remotely across Europe and Southeast Asia, and now we need you to drive improvements in our reliability, scalability and efficiency.
What you will be doing
You'll find every day an exciting challenge, helping our technical team transform a monolithic enterprise processing environment with bank-level security and 99.95% uptime into a sleek, nimble, microservice-based, serverless processing environment with better-than-bank-level security and 99.99% uptime.
If it were easy, we would already have done it! This role may or may not involve the following:
- Work closely with software engineering teams to improve availability, latency, performance, efficiency, monitoring, emergency response, and capacity planning of services
- Work across a hybrid cloud environment spanning a hosted data centre and AWS
- Handle upgrades of infrastructure and services through automation
- Identify, gather and document key performance metrics, logs and alerts, and automate responses to them
- Find optimizations and other efficiencies to scale the application
- Develop playbooks and tools to streamline processes and shorten problem resolution time
- Maintain the infrastructure-as-code management process
- Perform periodic on-call duties
Skills & requirements
We love taking on team members with a variety of skill levels, from intern to PhD. But there's no getting around the fact that we need this person to know what they're doing and to hit the ground running.
You should already be an SRE guru with:
- A solid understanding of operational principles, such as capacity planning, monitoring and incident handling
- Experience automating manual processes, leveraging cloud (preferably AWS) platforms
- Knowledge of telemetry, tracing, logging and alerting best practices
- Experience implementing monitored and seamless deployment pipelines
- Internet fundamentals: HTTP/S, DNS, TCP/IP, security-by-design, caching
Extra kudos are awarded for:
- JVM performance tuning
- Experience monitoring cloud-based systems
- Knowledge of automated testing frameworks and methodologies
If you have no site reliability engineering experience, your application cannot be considered.