Building Ultra-Resilient Systems in Go

In the past 6 months, my team encountered two critical incidents stemming from the same root cause - complete power loss in one of our primary data centers. The incidents occurred 5 months apart, and although the root cause was the same, our services responded very differently the second time around. Drawing from lessons learnt from this, this talk dives deep into the strategies and techniques implemented to build ultra-resilient systems in Go.

LEVEL: Introductory and Overview

Place: GoLab Discovery
Length: 25 min
When: November 13th, 2024, 10:30

Abstract

DESCRIPTION:

In the past 6 months, my team encountered two critical incidents stemming from the same root cause - complete power loss in one of our primary data centers. The incidents occurred 5 months apart, and although the root cause was the same, our services responded very differently the second time around. These incidents underscored the importance of resilience and prompted us to enhance our system's robustness significantly.

In November 2023, our primary data center lost power for around 36 hours. Although we already had a disaster-recovery facility in another location, it took some time to restore many of our services. We therefore spent the following weeks and months focusing on the reliability of our control plane, in a process we called Code Orange. In April 2024, the same thing happened again, but this time our applications faced little to no downtime and needed minimal manual intervention. Drawing from the lessons learnt and from managing applications at massive scale, this talk dives deep into the strategies and techniques for building ultra-resilient systems in Go.

We'll explore the intricacies of ensuring high availability with Go, from building software that can scale horizontally across data centers to ensuring we have a reliable and well-tested disaster recovery plan in place. We'll look at how some distributed design patterns can be implemented in Go: the Circuit Breaker pattern to mitigate failures and prevent cascading system breakdowns, and the leader election pattern to avoid conflicts and duplication across distributed systems.

We'll also walk through how to make remote procedure calls reliably in Go, with sensible retry strategies and timeouts, and how to implement caching to reduce load on external systems and improve performance while maintaining data integrity.

Measuring resilience is crucial for maintaining system reliability. We'll look at Service Level Objectives (SLOs) and discuss practical methods for defining and monitoring them effectively, as well as chaos testing to verify that our disaster recovery plan is effective.

With valuable insights drawn from real-world incidents, this talk aims to equip engineers with a deeper understanding of building ultra-resilient systems in Go, with practical techniques for ensuring high availability and handling failures gracefully.

GoLab is a conference made by Develer.
Develer is a company based in Campi Bisenzio, near Florence. Our motto is: "Technology to give life to your products". We produce hardware and software to create exceptional products and to improve industrial processes and people's well-being.
At Develer we have a passion for new technologies, and we offer our clients effective solutions that are also efficient, simple and safe for end users. We also believe in a friendly and welcoming environment where anybody can contribute. This passion and this vision are what drove us to organize our conference, "made by developers for developers".

