/dev/reading
Category

Site Reliability

3 books
Becoming SRE: First Steps Toward Reliability for You and Your Organization
by David N. Blank-Edelman

Do you wish the existing books on site reliability engineering started at the beginning? Do you wish someone would walk you through how to become an SRE, how to think like an SRE, or how to build and grow a successful SRE function in your organization?

Becoming SRE addresses all of these needs and more with three interconnected sections: the essential groundwork for understanding SRE and SRE culture, advice for individuals on becoming an SRE, and guidance for organizations on creating and developing a thriving SRE practice.

Acting as your personal and personable guide, author David Blank-Edelman takes you through subjects like:

  • SRE mindset, SRE culture, and SRE advocacy
  • What you need to get started and hired in SRE and what the job will be like when you get there
  • What you need to bring SRE into an organization and what is required for a good organizational fit so it can thrive there
  • How to work with your business folks and management around SRE
  • How SRE can grow and mature in an organization over time

Ready to become an SRE or introduce SRE into your organization? This book is here to help.

Chaos Engineering: Site reliability through controlled disruption
by Mikolaj Pawlikowski

Auto engineers test the safety of a car by intentionally crashing it and carefully observing the results. Chaos engineering applies the same principles to software systems. In Chaos Engineering: Site reliability through controlled disruption, you'll learn to run your applications and infrastructure through a series of tests that simulate real-life failures. You'll maximize the benefits of chaos engineering by learning to think like a chaos engineer and to design the proper experiments to ensure the reliability of your software. With examples that cover a whole spectrum of software, you'll be ready to run an intensive testing regime on anything from a simple WordPress site to a massive distributed system running on Kubernetes.

Site Reliability Engineering: How Google Runs Production Systems
by Betsy Beyer, Chris Jones, Niall Richard Murphy and Jennifer Petoff

The overwhelming majority of a software system's lifespan is spent in use, not in design or implementation. So, why does conventional wisdom insist that software engineers focus primarily on the design and development of large-scale computing systems?

In this collection of essays and articles, key members of Google's Site Reliability Team explain how and why their commitment to the entire lifecycle has enabled the company to successfully build, deploy, monitor, and maintain some of the largest software systems in the world. You'll learn the principles and practices that enable Google engineers to make systems more scalable, reliable, and efficient—lessons directly applicable to your organization.

This book is divided into four sections:

  • Introduction—Learn what site reliability engineering is and why it differs from conventional IT industry practices
  • Principles—Examine the patterns, behaviors, and areas of concern that influence the work of a site reliability engineer (SRE)
  • Practices—Understand the theory and practice of an SRE's day-to-day work: building and operating large distributed computing systems
  • Management—Explore Google's best practices for training, communication, and meetings that your organization can use