Improve Your Service Scalability and Reliability with SRE

Pioneered by Google to create more scalable and reliable large-scale systems, Site Reliability Engineering (SRE) has become one of today's most valuable software innovation opportunities. Establishing SRE Foundations is a concise, practical guide that shows how to drive successful SRE adoption in your own organization. Dr. Vladyslav Ukis presents a step-by-step approach to establishing the right cultural, organizational, and technical process foundations, quickly achieving a "minimum viable SRE" and continually improving from there.

Dr. Ukis draws extensively on his own experiences leading an SRE transformation journey at a major healthcare company. Throughout, he answers specific questions that organizations ask about SRE, identifies pitfalls, and shows how to avoid or overcome them. Whatever your role in software development, engineering, or operations, this guide will help you apply SRE to improve what matters most: user and customer experience.

- Understand how SRE works, its role in software operations, and the challenges of SRE transformation
- Assess your organization's current operations and readiness for SRE transformation
- Achieve organizational buy-in and initiate foundational activities, including SLO definitions, alerting, on-call rotations, incident response, and error budget-based decision-making
- Align organizational structures to support a full SRE transformation
- Measure the progress and success of your SRE initiative
- Sustain and advance your SRE transformation beyond the foundations

"The techniques and principles of SRE are not only clearly defined here, but also the rationale behind them is explained in a way that will stick. This is not some dry definition, this is practical, usable understanding. . . . I can whole-heartedly recommend this book without any reservation. This is a very good book on an important topic that helps to move the game forward for our discipline!"
--From the Foreword by David Farley, Founder and CEO of Continuous Delivery Ltd.

Register your book for convenient access to downloads, updates, and/or corrections as they become available. See inside book for details.
Dr. Vladyslav Ukis is head of R&D for the Siemens Healthineers teamplay digital health platform and reliability lead for all Siemens Healthineers Digital Health products. Previously, as software development lead, he drove Continuous Delivery, SRE, and DevRel transformation, helping this large distributed development organization evolve architecture, deployment, testing, operations, and culture to implement these new processes at scale.
Table of Contents
Foreword
Preface
Acknowledgments
About the Author

Part I: Foundations

Chapter 1: Introduction to SRE
  1.1 Why SRE?
  1.2 Alignment Using SRE
  1.3 Why Does SRE Work?
  1.4 Summary
Chapter 2: The Challenge
  2.1 Misalignment
  2.2 Collective Ownership
  2.3 Ownership Using SRE
  2.4 The Challenge Statement
  2.5 Coaching
  2.6 Summary
Chapter 3: SRE Basic Concepts
  3.1 Service Level Indicators
  3.2 Service Level Objectives
  3.3 Error Budgets
  3.4 Error Budget Policies
  3.5 SRE Concept Pyramid
  3.6 Alignment Using the SRE Concept Pyramid
  3.7 Summary
Chapter 4: Assessing the Status Quo
  4.1 Where Is the Organization?
  4.2 Where Are the People?
  4.3 Where Is the Tech?
  4.4 Where Is the Culture?
  4.5 Where Is the Process?
  4.6 SRE Maturity Model
  4.7 Posing Hypotheses
  4.8 Summary

Part II: Running the Transformation

Chapter 5: Achieving Organizational Buy-In
  5.1 Getting People Behind SRE
  5.2 SRE Marketing Funnel
  5.3 SRE Coaches
  5.4 Top-Down Buy-In
  5.5 Bottom-Up Buy-In
  5.6 Lateral Buy-In
  5.7 Buy-In Staggering
  5.8 Team Coaching
  5.9 Traversing the Organization
  5.10 Organizational Coaching
  5.11 Summary
Chapter 6: Laying Down the Foundations
  6.1 Introductory Talks by Team
  6.2 Conveying the Basics
  6.3 SLI Standardization
  6.4 Enabling Logging
  6.5 Teaching the Log Query Language
  6.6 Defining Initial SLOs
  6.7 Default SLOs
  6.8 Providing Basic Infrastructure
  6.9 Engaging Champions
  6.10 Dealing with Detractors
  6.11 Creating Documentation
  6.12 Broadcast Success
  6.13 Summary
Chapter 7: Reacting to Alerts on SLO Breaches
  7.1 Environment Selection
  7.2 Responsibilities
  7.3 Ways of Working
  7.4 Setting Up On-Call Rotations
  7.5 On-Call Management Tools
  7.6 Out-of-Hours On-Call
  7.7 Systematic Knowledge Sharing
  7.8 Broadcast Success
  7.9 Summary
Chapter 8: Implementing Alert Dispatching
  8.1 Alert Escalation
  8.2 Defining an Alert Escalation Policy
  8.3 Defining Stakeholder Groups
  8.4 Triggering Stakeholder Notifications
  8.5 Defining Stakeholder Rings
  8.6 Defining Effective Stakeholder Notifications
  8.7 Getting the Stakeholders Subscribed
  8.8 Broadcast Success
  8.9 Summary
Chapter 9: Implementing Incident Response
  9.1 Incident Response Foundations
  9.2 Incident Priorities
  9.3 Complex Incident Coordination
  9.4 Incident Postmortems
  9.5 Effective Postmortem Criteria
  9.6 Mashing Up the Tools
  9.7 Service Status Broadcast
  9.8 Documenting the Incident Response Process
  9.9 Broadcast Success
  9.10 Summary
Chapter 10: Setting Up an Error Budget Policy
  10.1 Motivation
  10.2 Terminology
  10.3 Error Budget Policy Structure
  10.4 Error Budget Policy Conditions
  10.5 Error Budget Policy Consequences
  10.6 Error Budget Policy Governance
  10.7 Extending the Error Budget Policy
  10.8 Agreeing to the Error Budget Policy
  10.9 Storing the Error Budget Policy
  10.10 Enacting the Error Budget Policy
  10.11 Reviewing the Error Budget Policy
  10.12 Related Concepts
  10.13 Summary
Chapter 11: Enabling Error Budget-Based Decision-Making
  11.1 Reliability Decision-Making Taxonomy
  11.2 Implementing SRE Indicators
  11.3 Process Indicators, Not People KPIs
  11.4 Decisions Versus Indicators
  11.5 Decision-Making Workflows
  11.6 Summary
Chapter 12: Implementing Organizational Structure
  12.1 SRE Principles Versus Organizational Structure
  12.2 Who Builds It, Who Runs It?
  12.3 You Build It, You Run It
  12.4 You Build It, You and SRE Run It
  12.5 You Build It, SRE Run It
  12.6 Cost Optimization
  12.7 Team Topologies
  12.8 Choosing a Model
  12.9 A New Role: SRE
  12.10 SRE Career Path
  12.11 Communicating the Chosen Model
  12.12 Introducing the Chosen Model
  12.13 Summary

Part III: Measuring and Sustaining the Transformation

Chapter 13: Measuring the SRE Transformation
  13.1 Testing Transformation Hypotheses
  13.2 Outages Not Detected Internally
  13.3 Services Exhausting Error Budgets Prematurely
  13.4 Executives' Perceptions
  13.5 Reliability Perception by Users and Partners
  13.6 Summary
Chapter 14: Sustaining the SRE Movement
  14.1 Maturing the SRE CoP
  14.2 SRE Minutes
  14.3 Availability Newsletter
  14.4 SRE Column in the Engineering Blog
  14.5 Promote Long-Form SRE Wiki Articles
  14.6 SRE Broadcasting
  14.7 Combining SRE and CD Indicators
  14.8 SRE Feedback Loops
  14.9 New Hypotheses
  14.10 Providing Learning Opportunities
  14.11 Supporting SRE Coaches
  14.12 Summary
Chapter 15: The Road Ahead
  15.1 Service Catalog
  15.2 SLAs
  15.3 Regulatory Compliance
  15.4 SRE Infrastructure
  15.5 Game Days

Appendix: Topics for Quick Reference
Index