Distributed machine learning with Python :: accelerating model training and serving with distributed systems /
Saved in:
Main Author: | Wang, Guanhua |
---|---|
Format: | Electronic eBook |
Language: | English |
Published: | Birmingham : Packt Publishing, Limited, 2022. |
Subjects: | Machine learning ; Python (Computer program language) |
Online Access: | Full text |
Summary: | Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter server -- Defining the worker -- Passing data between the parameter server and worker -- Issues with the parameter server -- The parameter server architecture introduces a high coding complexity for practitioners -- All-Reduce architecture -- Reduce -- All-Reduce -- Ring All-Reduce. |
Description: | Pros and cons of pipeline parallelism. |
Description: | 1 online resource (284 pages) : color illustrations |
ISBN: | 1801817219 9781801817219 |
Internal format
MARC
LEADER |  |  | 00000cam a22000007a 4500
---|---|---|---
001 |  |  | ZDB-4-EBA-on1312162521
003 |  |  | OCoLC
005 |  |  | 20241004212047.0
006 |  |  | m o d
007 |  |  | cr cnu---unuuu
008 |  |  | 220423s2022 enka o 000 0 eng d
040 |  |  | |a EBLCP |b eng |e pn |c EBLCP |d ORMDA |d OCLCO |d UKMGB |d OCLCF |d OCLCQ |d N$T |d UKAHL |d OCLCQ |d IEEEE |d OCLCO
015 |  |  | |a GBC274179 |2 bnb
016 | 7 |  | |a 020566484 |2 Uk
020 |  |  | |a 1801817219
020 |  |  | |a 9781801817219 |q (electronic bk.)
020 |  |  | |z 9781801815697 |q (pbk.)
035 |  |  | |a (OCoLC)1312162521
037 |  |  | |a 9781801815697 |b O'Reilly Media
037 |  |  | |a 10163213 |b IEEE
050 |  | 4 | |a Q325.5
082 | 7 |  | |a 006.3/1 |2 23/eng/20220503
049 |  |  | |a MAIN
100 | 1 |  | |a Wang, Guanhua.
245 | 1 | 0 | |a Distributed machine learning with Python : |b accelerating model training and serving with distributed systems / |c Guanhua Wang.
260 |  |  | |a Birmingham : |b Packt Publishing, Limited, |c 2022.
300 |  |  | |a 1 online resource (284 pages) : |b color illustrations
336 |  |  | |a text |b txt |2 rdacontent
337 |  |  | |a computer |b c |2 rdamedia
338 |  |  | |a online resource |b cr |2 rdacarrier
588 | 0 |  | |a Print version record.
505 | 0 |  | |a Intro -- Title page -- Copyright and Credits -- Dedication -- Contributors -- Table of Contents -- Preface -- Section 1 -- Data Parallelism -- Chapter 1: Splitting Input Data -- Single-node training is too slow -- The mismatch between data loading bandwidth and model training bandwidth -- Single-node training time on popular datasets -- Accelerating the training process with data parallelism -- Data parallelism -- the high-level bits -- Stochastic gradient descent -- Model synchronization -- Hyperparameter tuning -- Global batch size -- Learning rate adjustment -- Model synchronization schemes
520 |  |  | |a Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter server -- Defining the worker -- Passing data between the parameter server and worker -- Issues with the parameter server -- The parameter server architecture introduces a high coding complexity for practitioners -- All-Reduce architecture -- Reduce -- All-Reduce -- Ring All-Reduce.
505 | 8 |  | |a Collective communication -- Broadcast -- Gather -- All-Gather -- Summary -- Chapter 3: Building a Data Parallel Training and Serving Pipeline -- Technical requirements -- The data parallel training pipeline in a nutshell -- Input pre-processing -- Input data partition -- Data loading -- Training -- Model synchronization -- Model update -- Single-machine multi-GPUs and multi-machine multi-GPUs -- Single-machine multi-GPU -- Multi-machine multi-GPU -- Checkpointing and fault tolerance -- Model checkpointing -- Load model checkpoints -- Model evaluation and hyperparameter tuning
505 | 8 |  | |a Model serving in data parallelism -- Summary -- Chapter 4: Bottlenecks and Solutions -- Communication bottlenecks in data parallel training -- Analyzing the communication workloads -- Parameter server architecture -- The All-Reduce architecture -- The inefficiency of state-of-the-art communication schemes -- Leveraging idle links and host resources -- Tree All-Reduce -- Hybrid data transfer over PCIe and NVLink -- On-device memory bottlenecks -- Recomputation and quantization -- Recomputation -- Quantization -- Summary -- Section 2 -- Model Parallelism -- Chapter 5: Splitting the Model
505 | 8 |  | |a Technical requirements -- Single-node training error -- out of memory -- Fine-tuning BERT on a single GPU -- Trying to pack a giant model inside one state-of-the-art GPU -- ELMo, BERT, and GPT -- Basic concepts -- RNN -- ELMo -- BERT -- GPT -- Pre-training and fine-tuning -- State-of-the-art hardware -- P100, V100, and DGX-1 -- NVLink -- A100 and DGX-2 -- NVSwitch -- Summary -- Chapter 6: Pipeline Input and Layer Split -- Vanilla model parallelism is inefficient -- Forward propagation -- Backward propagation -- GPU idle time between forward and backward propagation -- Pipeline input
500 |  |  | |a Pros and cons of pipeline parallelism.
650 |  | 0 | |a Machine learning. |0 http://id.loc.gov/authorities/subjects/sh85079324
650 |  | 0 | |a Python (Computer program language) |0 http://id.loc.gov/authorities/subjects/sh96008834
650 |  | 6 | |a Apprentissage automatique.
650 |  | 6 | |a Python (Langage de programmation)
650 |  | 7 | |a Machine learning |2 fast
650 |  | 7 | |a Python (Computer program language) |2 fast
776 | 0 | 8 | |i Print version: |a Wang, Guanhua. |t Distributed Machine Learning with Python. |d Birmingham : Packt Publishing, Limited, ©2022
856 | 4 | 0 | |l FWS01 |p ZDB-4-EBA |q FWS_PDA_EBA |u https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=3242106 |3 Volltext
938 |  |  | |a Askews and Holts Library Services |b ASKH |n AH39813577
938 |  |  | |a ProQuest Ebook Central |b EBLB |n EBL6956758
938 |  |  | |a EBSCOhost |b EBSC |n 3242106
994 |  |  | |a 92 |b GEBAY
912 |  |  | |a ZDB-4-EBA
049 |  |  | |a DE-863