Site reliability engineering: how Google runs production systems
Other authors: | |
---|---|
Format: | Book |
Language: | English |
Published: |
Beijing ; Boston ; Farnham ; Sebastopol ; Tokyo
O'Reilly
April 2016
|
Edition: | First edition |
Subjects: | |
Online access: | Table of contents |
Description: | xxiv, 524 pages, diagrams |
ISBN: | 9781491929124 |
Internal format
MARC
LEADER | 00000nam a2200000 c 4500 | ||
---|---|---|---|
001 | BV043538727 | ||
003 | DE-604 | ||
005 | 20240201 | ||
007 | t | ||
008 | 160503s2016 |||| |||| 00||| eng d | ||
020 | |a 9781491929124 |9 978-1-4919-2912-4 | ||
024 | 3 | |a 9781491929124 | |
035 | |a (OCoLC)950549250 | ||
035 | |a (DE-599)BSZ467974403 | ||
040 | |a DE-604 |b ger |e rda | ||
041 | 0 | |a eng | |
049 | |a DE-1049 |a DE-739 |a DE-83 |a DE-706 |a DE-91G |a DE-1043 |a DE-M347 | ||
084 | |a ST 233 |0 (DE-625)143620: |2 rvk | ||
084 | |a DAT 675 |2 stub | ||
084 | |a DAT 345 |2 stub | ||
245 | 1 | 0 | |a Site reliability engineering |b how Google runs production systems |c edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy |
250 | |a First edition | ||
264 | 1 | |a Beijing ; Boston ; Farnham ; Sebastopol ; Tokyo |b O'Reilly |c April 2016 | |
300 | |a xxiv, 524 Seiten |b Diagramme | ||
336 | |b txt |2 rdacontent | ||
337 | |b n |2 rdamedia | ||
338 | |b nc |2 rdacarrier | ||
505 | 8 | |a System Administration | |
650 | 0 | 7 | |a Verteiltes System |0 (DE-588)4238872-7 |2 gnd |9 rswk-swf |
650 | 0 | 7 | |a Zuverlässigkeit |0 (DE-588)4059245-5 |2 gnd |9 rswk-swf |
689 | 0 | 0 | |a Verteiltes System |0 (DE-588)4238872-7 |D s |
689 | 0 | 1 | |a Zuverlässigkeit |0 (DE-588)4059245-5 |D s |
689 | 0 | |5 DE-604 | |
700 | 1 | |a Beyer, Betsy |4 edt | |
776 | 0 | 8 | |i Erscheint auch als |n Online-Ausgabe |z 978-1-4919-5118-7 |
856 | 4 | 2 | |m Digitalisierung UB Passau - ADAM Catalogue Enrichment |q application/pdf |u http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=028954230&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |3 Inhaltsverzeichnis |
999 | |a oai:aleph.bib-bvb.de:BVB01-028954230 |
Record in the search index
_version_ | 1804176194932834304 |
---|---|
adam_text | Table of Contents
Foreword
Preface xv
Part I. Introduction
1. Introduction.......................................................... 3
The Sysadmin Approach to Service Management 3
Google's Approach to Service Management: Site Reliability Engineering 5
Tenets of SRE 7
The End of the Beginning 12
2. The Production Environment at Google, from the Viewpoint of an SRE... 13
Hardware 13
System Software That “Organizes” the Hardware 15
Other System Software 18
Our Software Infrastructure 19
Our Development Environment 19
Shakespeare: A Sample Service 20
Part II. Principles
3. Embracing Risk
Managing Risk 25
Measuring Service Risk 26
Risk Tolerance of Services 28
Motivation for Error Budgets 33
4. Service Level Objectives............................................... 37
Service Level Terminology 37
Indicators in Practice 40
Objectives in Practice 43
Agreements in Practice 47
5. Eliminating Toil....................................................... 49
Toil Defined 49
Why Less Toil Is Better 51
What Qualifies as Engineering? 52
Is Toil Always Bad? 52
Conclusion 54
6. Monitoring Distributed Systems......................................... 55
Definitions 55
Why Monitor? 56
Setting Reasonable Expectations for Monitoring 57
Symptoms Versus Causes 58
Black-Box Versus White-Box 59
The Four Golden Signals 60
Worrying About Your Tail (or, Instrumentation and Performance) 61
Choosing an Appropriate Resolution for Measurements 62
As Simple as Possible, No Simpler 62
Tying These Principles Together 63
Monitoring for the Long Term 64
Conclusion 66
7. The Evolution of Automation at Google...................................67
The Value of Automation 67
The Value for Google SRE 70
The Use Cases for Automation 70
Automate Yourself Out of a Job: Automate ALL the Things! 73
Soothing the Pain: Applying Automation to Cluster Turnups 75
Borg: Birth of the Warehouse-Scale Computer 81
Reliability Is the Fundamental Feature 83
Recommendations 84
8. Release Engineering
The Role of a Release Engineer 87
Philosophy 88
Continuous Build and Deployment 90
Configuration Management 93
Conclusions 95
9. Simplicity.............................................................. 97
System Stability Versus Agility 97
The Virtue of Boring 98
I Won't Give Up My Code! 98
The “Negative Lines of Code” Metric 99
Minimal APIs 99
Modularity 100
Release Simplicity 100
A Simple Conclusion 101
Part III. Practices
10. Practical Alerting from Time-Series Data............. 107
The Rise of Borgmon 108
Instrumentation of Applications 109
Collection of Exported Data 110
Storage in the Time-Series Arena 111
Rule Evaluation 114
Alerting 118
Sharding the Monitoring Topology 119
Black-Box Monitoring 120
Maintaining the Configuration 121
Ten Years On... 122
11. Being On-Call............................................................ 125
Introduction 125
Life of an On-Call Engineer 126
Balanced On-Call 127
Feeling Safe 128
Avoiding Inappropriate Operational Load 130
Conclusions 132
12. Effective Troubleshooting
Theory 134
In Practice 136
Negative Results Are Magic 144
Case Study 146
Making Troubleshooting Easier 150
Conclusion 150
13. Emergency Response.................................................. 151
What to Do When Systems Break 151
Test-Induced Emergency 152
Change-Induced Emergency 153
Process-Induced Emergency 155
All Problems Have Solutions 158
Learn from the Past. Don’t Repeat It. 158
Conclusion 159
14. Managing Incidents.................................................. 161
Unmanaged Incidents 161
The Anatomy of an Unmanaged Incident 162
Elements of Incident Management Process 163
A Managed Incident 165
When to Declare an Incident 166
In Summary 166
15. Postmortem Culture: Learning from Failure......................... 169
Google’s Postmortem Philosophy 169
Collaborate and Share Knowledge 171
Introducing a Postmortem Culture 172
Conclusion and Ongoing Improvements 175
16. Tracking Outages.....................................................177
Escalator 178
Outalator 178
17. Testing for Reliability............................................. 183
Types of Software Testing 185
Creating a Test and Build Environment 190
Testing at Scale 192
Conclusion 204
18. Software Engineering in SRE......................................... 205
Why Is Software Engineering Within SRE Important? 205
Auxon Case Study: Project Background and Problem Space 207
Intent-Based Capacity Planning 209
Fostering Software Engineering in SRE 218
Conclusions 222
19. Load Balancing at the Frontend.......................................... 223
Power Isn’t the Answer 223
Load Balancing Using DNS 224
Load Balancing at the Virtual IP Address 227
20. Load Balancing in the Datacenter.........................................231
The Ideal Case 232
Identifying Bad Tasks: Flow Control and Lame Ducks 233
Limiting the Connections Pool with Subsetting 235
Load Balancing Policies 240
21. Handling Overload.................................................... 247
The Pitfalls of “Queries per Second” 248
Per-Customer Limits 248
Client-Side Throttling 249
Criticality 251
Utilization Signals 253
Handling Overload Errors 253
Load from Connections 257
Conclusions 258
22. Addressing Cascading Failures.................................. 259
Causes of Cascading Failures and Designing to Avoid Them 260
Preventing Server Overload 265
Slow Startup and Cold Caching 274
Triggering Conditions for Cascading Failures 276
Testing for Cascading Failures 278
Immediate Steps to Address Cascading Failures 280
Closing Remarks 283
23. Managing Critical State: Distributed Consensus for Reliability........ 285
Motivating the Use of Consensus: Distributed Systems Coordination Failure 288
How Distributed Consensus Works 289
System Architecture Patterns for Distributed Consensus 291
Distributed Consensus Performance 296
Deploying Distributed Consensus-Based Systems 304
Monitoring Distributed Consensus Systems 312
Conclusion 313
24. Distributed Periodic Scheduling with Cron
Cron 315
Cron Jobs and Idempotency 316
Cron at Large Scale 317
Building Cron at Google 319
Summary 326
25. Data Processing Pipelines.............................................. 327
Origin of the Pipeline Design Pattern 327
Initial Effect of Big Data on the Simple Pipeline Pattern 328
Challenges with the Periodic Pipeline Pattern 328
Trouble Caused By Uneven Work Distribution 328
Drawbacks of Periodic Pipelines in Distributed Environments 329
Introduction to Google Workflow 333
Stages of Execution in Workflow 335
Ensuring Business Continuity 337
Summary and Concluding Remarks 338
26. Data Integrity: What You Read Is What You Wrote......................... 339
Data Integrity’s Strict Requirements 340
Google SRE Objectives in Maintaining Data Integrity and Availability 344
How Google SRE Faces the Challenges of Data Integrity 349
Case Studies 360
General Principles of SRE as Applied to Data Integrity 367
Conclusion 368
27. Reliable Product Launches at Scale.................................... 369
Launch Coordination Engineering 370
Setting Up a Launch Process 372
Developing a Launch Checklist 375
Selected Techniques for Reliable Launches 380
Development of LCE 384
Conclusion 387
Part IV. Management
28. Accelerating SREs to On-Call and Beyond................................... 391
You’ve Hired Your Next SRE(s), Now What? 391
Initial Learning Experiences: The Case for Structure Over Chaos 394
Creating Stellar Reverse Engineers and Improvisational Thinkers 397
Five Practices for Aspiring On-Callers 400
On-Call and Beyond: Rites of Passage, and Practicing Continuing Education 406
Closing Thoughts 406
29. Dealing with Interrupts............................................. 407
Managing Operational Load 408
Factors in Determining How Interrupts Are Handled 408
Imperfect Machines 409
30. Embedding an SRE to Recover from Operational Overload................ 417
Phase 1: Learn the Service and Get Context 418
Phase 2: Sharing Context 420
Phase 3: Driving Change 421
Conclusion 423
31. Communication and Collaboration in SRE........................... 425
Communications: Production Meetings 426
Collaboration within SRE 430
Case Study of Collaboration in SRE: Viceroy 432
Collaboration Outside SRE 437
Case Study: Migrating DFP to F1 437
Conclusion 440
32. The Evolving SRE Engagement Model................................ 441
SRE Engagement: What, How, and Why 441
The PRR Model 442
The SRE Engagement Model 443
Production Readiness Reviews: Simple PRR Model 444
Evolving the Simple PRR Model: Early Engagement 448
Evolving Services Development: Frameworks and SRE Platform 451
Conclusion 456
Part V. Conclusions
33. Lessons Learned from Other Industries...................... 459
Meet Our Industry Veterans 460
Preparedness and Disaster Testing 462
Postmortem Culture 465
Automating Away Repetitive Work and Operational Overhead 467
Structured and Rational Decision Making 469
Conclusions 470
34. Conclusion
A. Availability Table.............................................................. 477
B. A Collection of Best Practices for Production Services...........................479
C. Example Incident State Document..................................................485
D. Example Postmortem.............................................................. 487
E. Launch Coordination Checklist................................................... 493
F. Example Production Meeting Minutes.............................................. 497
Bibliography........................................................................501
Index.............................................................................. 513
|
any_adam_object | 1 |
author2 | Beyer, Betsy |
author2_role | edt |
author2_variant | b b bb |
author_facet | Beyer, Betsy |
building | Verbundindex |
bvnumber | BV043538727 |
classification_rvk | ST 233 |
classification_tum | DAT 675 DAT 345 |
contents | System Administration |
ctrlnum | (OCoLC)950549250 (DE-599)BSZ467974403 |
discipline | Informatik |
edition | First edition |
format | Book |
fullrecord | <?xml version="1.0" encoding="UTF-8"?><collection xmlns="http://www.loc.gov/MARC21/slim"><record><leader>01700nam a2200409 c 4500</leader><controlfield tag="001">BV043538727</controlfield><controlfield tag="003">DE-604</controlfield><controlfield tag="005">20240201 </controlfield><controlfield tag="007">t</controlfield><controlfield tag="008">160503s2016 |||| |||| 00||| eng d</controlfield><datafield tag="020" ind1=" " ind2=" "><subfield code="a">9781491929124</subfield><subfield code="9">978-1-4919-2912-4</subfield></datafield><datafield tag="024" ind1="3" ind2=" "><subfield code="a">9781491929124</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(OCoLC)950549250</subfield></datafield><datafield tag="035" ind1=" " ind2=" "><subfield code="a">(DE-599)BSZ467974403</subfield></datafield><datafield tag="040" ind1=" " ind2=" "><subfield code="a">DE-604</subfield><subfield code="b">ger</subfield><subfield code="e">rda</subfield></datafield><datafield tag="041" ind1="0" ind2=" "><subfield code="a">eng</subfield></datafield><datafield tag="049" ind1=" " ind2=" "><subfield code="a">DE-1049</subfield><subfield code="a">DE-739</subfield><subfield code="a">DE-83</subfield><subfield code="a">DE-706</subfield><subfield code="a">DE-91G</subfield><subfield code="a">DE-1043</subfield><subfield code="a">DE-M347</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">ST 233</subfield><subfield code="0">(DE-625)143620:</subfield><subfield code="2">rvk</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">DAT 675</subfield><subfield code="2">stub</subfield></datafield><datafield tag="084" ind1=" " ind2=" "><subfield code="a">DAT 345</subfield><subfield code="2">stub</subfield></datafield><datafield tag="245" ind1="1" ind2="0"><subfield code="a">Site reliability engineering</subfield><subfield code="b">how Google runs production systems</subfield><subfield code="c">edited by Betsy Beyer, Chris 
Jones, Jennifer Petoff, and Niall Richard Murphy</subfield></datafield><datafield tag="250" ind1=" " ind2=" "><subfield code="a">First edition</subfield></datafield><datafield tag="264" ind1=" " ind2="1"><subfield code="a">Beijing ; Boston ; Farnham ; Sebastopol ; Tokyo</subfield><subfield code="b">O'Reilly</subfield><subfield code="c">April 2016</subfield></datafield><datafield tag="300" ind1=" " ind2=" "><subfield code="a">xxiv, 524 Seiten</subfield><subfield code="b">Diagramme</subfield></datafield><datafield tag="336" ind1=" " ind2=" "><subfield code="b">txt</subfield><subfield code="2">rdacontent</subfield></datafield><datafield tag="337" ind1=" " ind2=" "><subfield code="b">n</subfield><subfield code="2">rdamedia</subfield></datafield><datafield tag="338" ind1=" " ind2=" "><subfield code="b">nc</subfield><subfield code="2">rdacarrier</subfield></datafield><datafield tag="505" ind1="8" ind2=" "><subfield code="a">System Administration</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Verteiltes System</subfield><subfield code="0">(DE-588)4238872-7</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="650" ind1="0" ind2="7"><subfield code="a">Zuverlässigkeit</subfield><subfield code="0">(DE-588)4059245-5</subfield><subfield code="2">gnd</subfield><subfield code="9">rswk-swf</subfield></datafield><datafield tag="689" ind1="0" ind2="0"><subfield code="a">Verteiltes System</subfield><subfield code="0">(DE-588)4238872-7</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2="1"><subfield code="a">Zuverlässigkeit</subfield><subfield code="0">(DE-588)4059245-5</subfield><subfield code="D">s</subfield></datafield><datafield tag="689" ind1="0" ind2=" "><subfield code="5">DE-604</subfield></datafield><datafield tag="700" ind1="1" ind2=" "><subfield code="a">Beyer, Betsy</subfield><subfield code="4">edt</subfield></datafield><datafield tag="776" ind1="0" 
ind2="8"><subfield code="i">Erscheint auch als</subfield><subfield code="n">Online-Ausgabe</subfield><subfield code="z">978-1-4919-5118-7</subfield></datafield><datafield tag="856" ind1="4" ind2="2"><subfield code="m">Digitalisierung UB Passau - ADAM Catalogue Enrichment</subfield><subfield code="q">application/pdf</subfield><subfield code="u">http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=028954230&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA</subfield><subfield code="3">Inhaltsverzeichnis</subfield></datafield><datafield tag="999" ind1=" " ind2=" "><subfield code="a">oai:aleph.bib-bvb.de:BVB01-028954230</subfield></datafield></record></collection> |
id | DE-604.BV043538727 |
illustrated | Not Illustrated |
indexdate | 2024-07-10T07:28:19Z |
institution | BVB |
isbn | 9781491929124 |
language | English |
oai_aleph_id | oai:aleph.bib-bvb.de:BVB01-028954230 |
oclc_num | 950549250 |
open_access_boolean | |
owner | DE-1049 DE-739 DE-83 DE-706 DE-91G DE-BY-TUM DE-1043 DE-M347 |
owner_facet | DE-1049 DE-739 DE-83 DE-706 DE-91G DE-BY-TUM DE-1043 DE-M347 |
physical | xxiv, 524 Seiten Diagramme |
publishDate | 2016 |
publishDateSearch | 2016 |
publishDateSort | 2016 |
publisher | O'Reilly |
record_format | marc |
spelling | Site reliability engineering how Google runs production systems edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy First edition Beijing ; Boston ; Farnham ; Sebastopol ; Tokyo O'Reilly April 2016 xxiv, 524 Seiten Diagramme txt rdacontent n rdamedia nc rdacarrier System Administration Verteiltes System (DE-588)4238872-7 gnd rswk-swf Zuverlässigkeit (DE-588)4059245-5 gnd rswk-swf Verteiltes System (DE-588)4238872-7 s Zuverlässigkeit (DE-588)4059245-5 s DE-604 Beyer, Betsy edt Erscheint auch als Online-Ausgabe 978-1-4919-5118-7 Digitalisierung UB Passau - ADAM Catalogue Enrichment application/pdf http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=028954230&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA Inhaltsverzeichnis |
spellingShingle | Site reliability engineering how Google runs production systems System Administration Verteiltes System (DE-588)4238872-7 gnd Zuverlässigkeit (DE-588)4059245-5 gnd |
subject_GND | (DE-588)4238872-7 (DE-588)4059245-5 |
title | Site reliability engineering how Google runs production systems |
title_auth | Site reliability engineering how Google runs production systems |
title_exact_search | Site reliability engineering how Google runs production systems |
title_full | Site reliability engineering how Google runs production systems edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy |
title_fullStr | Site reliability engineering how Google runs production systems edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy |
title_full_unstemmed | Site reliability engineering how Google runs production systems edited by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy |
title_short | Site reliability engineering |
title_sort | site reliability engineering how google runs production systems |
title_sub | how Google runs production systems |
topic | Verteiltes System (DE-588)4238872-7 gnd Zuverlässigkeit (DE-588)4059245-5 gnd |
topic_facet | Verteiltes System Zuverlässigkeit |
url | http://bvbr.bib-bvb.de:8991/F?func=service&doc_library=BVB01&local_base=BVB01&doc_number=028954230&sequence=000002&line_number=0001&func_code=DB_RECORDS&service_type=MEDIA |
work_keys_str_mv | AT beyerbetsy sitereliabilityengineeringhowgooglerunsproductionsystems |