Software Reliability and Scalability Expert - 8-Month Contract, Full-Time

Raising the Floor - US

Type of Position: 8-Month Contract, Full-Time

Anytime, Anywhere, Any Computer Access. We’re an international coalition of individuals and organizations dedicated to ensuring that the Internet, and everything available through it, is accessible to people who face accessibility barriers due to disability, literacy, digital literacy, or aging, regardless of their economic resources. Our vision is to revolutionize the landscape of assistive technology by creating an infrastructure to facilitate the development, distribution, and support of a wide range of affordable accessibility solutions around the world. This infrastructure is the Global Public Inclusive Infrastructure (GPII).
You will help a team of bright and talented developers located across continents who are passionate about our vision of radically improving access to technology. How? By helping to develop the associated system that supports the “portability” of user preferences across any platform or device, making it easier for anyone to have the technology they encounter automatically change into a form they can understand and use.
  • Work with the Global Public Inclusive Infrastructure (GPII) architects and subject-matter experts (SMEs) to understand the infrastructure components and define the reliability and performance/scalability metrics that need to be implemented and monitored.
  • Plan large-scale stress testing, in which the stability of the GPII is tested with automated test cases that mimic the very heavy loads, across different user profiles, created when many users simultaneously access the cloud component of the GPII for a defined, short period of time.
  • Design and document a reliability plan and a performance/scalability plan. Each plan should include a test approach, strategy, and scenarios.
  • Implement the instrumentation required to collect data for analysis.
  • Recommend and document best practices.
  • Perform data analysis to detect performance bottlenecks and reliability issues.
  • Provide recommendations to the GPII developers on how to optimize the system to improve reliability and performance.
  • Integrate the reliability and performance/scalability test cases into release processes, automate them in the GPII’s Continuous Integration environment, store results using technologies such as Elasticsearch, and provide dashboards to team members.
  • Work with Infrastructure developers to plan application deployments on Kubernetes clusters for reliability testing.
  • Debug and resolve issues relating to the automated test scripts.
  • 10+ years of hands-on experience designing and writing reliability and performance test plans.
  • Experience with modern, containerized cloud infrastructure and load balancing techniques (in particular, Docker and Kubernetes), and the reliability techniques best suited to this style of architecture.
  • Knowledge of and hands-on experience with open source testing tools.
  • An Agile mindset and a team-player attitude, with experience contributing to open source communities using collaborative environments such as GitHub.
  • Development background with the ability to review code and write automation scripts and instrumentation for data gathering.
  • In-depth experience with profiling and debugging tools for Node.js (e.g. node-inspector, Chrome DevTools, heapdump, NSolid), and experience using these tools to identify the source of failures.
  • In-depth knowledge of profiling the performance of services deployed on Unix-like operating systems using technologies such as dtrace, perf, systemtap, tcpdump, etc. to diagnose reliability issues from the network down to the kernel.
  • Ability to understand deployment topologies, identify problem areas, simulate failures, and recommend improvements.
  • Experience with load testing tools such as Gatling, JMeter, Tsung, etc., and the ability to simulate dynamic user traffic.
  • Experience with networking protocols and one or more programming languages (JavaScript, Go, Python, Ruby).
  • Experience using JIRA to report issues.
  • Experience working in a distributed environment.