Job Summary

This is an opportunity to be one of the founding members of Canva's Chaos Engineering team. The Chaos Engineering team is responsible for ensuring that the resiliency measures we have developed and implemented work as expected. When they don't, we work with other engineering teams across the business to investigate and remediate the issues.


  • As an individual contributor, design and implement tools and libraries that service teams can use to improve the reliability of their services. For example, adding a long-awaited feature to our circuit breaker library.
  • Conduct and automate chaos experiments to identify scenarios where cascading failures may occur, and to verify that the reliability measures we introduce to prevent them work as expected. For example: what happens when a newly introduced service goes down? Does the fallback for a rare failure actually work?
  • Work with product engineering teams to ensure that reliability best practices and tools are rolled out in every service across the whole organization. It’s not enough to create a new throttling library; we want to make sure that it’s successfully used in every service.
  • Investigate production incidents in depth, then apply the learnings to the code base
  • Research, develop, and justify the best choices, in the form of design docs, for tools and processes that will shape the future of reliability at Canva
  • Promote creative and conceptual problem-solving approaches, as opposed to framework- or library-heavy patchwork
  • Propose new approaches and solutions to ensure we future-proof Canva’s distributed cloud infrastructure as we scale
  • Participate in design meetings, hiring interviews, and code reviews


  • At least five (5) years of commercial experience working as a reliability or chaos engineer in a large, distributed, cloud-based environment. Any of the usual suspects (AWS, Google Cloud, Azure) is fine!
  • Willingness to work in Java, since our services and libraries are primarily written in Java 11
  • Disciplined coding practices and experience with code reviews and pull requests
  • Strong communication and team collaboration skills, both written and verbal. As a reliability engineer, you will need to share knowledge and to communicate and coordinate changes across multiple service teams.
  • Solid understanding of resiliency techniques and patterns: load balancing, throttling, back pressure, circuit breaking, etc. (the good stuff)


  • Competitive salary, plus equity options
  • Flexible daily working hours; we value work-life balance
  • In-house chefs who cook delicious breakfast and lunch for us each day
  • Onsite gym and yoga benefits
  • Generous parental (including secondary) leave policy
  • Pet-friendly offices
  • Sponsored social clubs and team events
  • Relocation budget for interstate or overseas individuals who legally qualify for visa sponsorship


Canva is a web-based graphic design tool founded in 2012. It uses a drag-and-drop format and provides access to over a million photographs, graphics, and fonts. It is used by non-designers as well as professionals, for both web and print media design.



