Distributed machine learning with Python :: accelerating model training and serving with distributed systems /
Saved in:
Main Author: | Wang, Guanhua |
---|---|
Format: | Electronic eBook |
Language: | English |
Published: | Birmingham : Packt Publishing, Limited, 2022. |
Subjects: | Machine learning ; Python (Computer program language) |
Online Access: | Full text |
Summary: | Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter server -- Defining the worker -- Passing data between the parameter server and worker -- Issues with the parameter server -- The parameter server architecture introduces a high coding complexity for practitioners -- All-Reduce architecture -- Reduce -- All-Reduce -- Ring All-Reduce. |
Description: | Pros and cons of pipeline parallelism. |
Description: | 1 online resource (284 pages) : color illustrations |
ISBN: | 1801817219 9781801817219 |
Internal format
MARC
LEADER |  |  | 00000cam a22000007a 4500
---|---|---|---
001 |  |  | ZDB-4-EBA-on1312162521
003 |  |  | OCoLC
005 |  |  | 20241004212047.0
006 |  |  | m o d
007 |  |  | cr cnu---unuuu
008 |  |  | 220423s2022 enka o 000 0 eng d
040 |  |  | |a EBLCP |b eng |e pn |c EBLCP |d ORMDA |d OCLCO |d UKMGB |d OCLCF |d OCLCQ |d N$T |d UKAHL |d OCLCQ |d IEEEE |d OCLCO
015 |  |  | |a GBC274179 |2 bnb
016 | 7 |  | |a 020566484 |2 Uk
020 |  |  | |a 1801817219
020 |  |  | |a 9781801817219 |q (electronic bk.)
020 |  |  | |z 9781801815697 |q (pbk.)
035 |  |  | |a (OCoLC)1312162521
037 |  |  | |a 9781801815697 |b O'Reilly Media
037 |  |  | |a 10163213 |b IEEE
050 |  | 4 | |a Q325.5
082 | 7 |  | |a 006.3/1 |2 23/eng/20220503
049 |  |  | |a MAIN
100 | 1 |  | |a Wang, Guanhua.
245 | 1 | 0 | |a Distributed machine learning with Python : |b accelerating model training and serving with distributed systems / |c Guanhua Wang.
260 |  |  | |a Birmingham : |b Packt Publishing, Limited, |c 2022.
300 |  |  | |a 1 online resource (284 pages) : |b color illustrations
336 |  |  | |a text |b txt |2 rdacontent
337 |  |  | |a computer |b c |2 rdamedia
338 |  |  | |a online resource |b cr |2 rdacarrier
588 | 0 |  | |a Print version record.
505 | 0 |  | |a Intro -- Title page -- Copyright and Credits -- Dedication -- Contributors -- Table of Contents -- Preface -- Section 1 -- Data Parallelism -- Chapter 1: Splitting Input Data -- Single-node training is too slow -- The mismatch between data loading bandwidth and model training bandwidth -- Single-node training time on popular datasets -- Accelerating the training process with data parallelism -- Data parallelism -- the high-level bits -- Stochastic gradient descent -- Model synchronization -- Hyperparameter tuning -- Global batch size -- Learning rate adjustment -- Model synchronization schemes
520 |  |  | |a Chapter 2: Parameter Server and All-Reduce -- Technical requirements -- Parameter server architecture -- Communication bottleneck in the parameter server architecture -- Sharding the model among parameter servers -- Implementing the parameter server -- Defining model layers -- Defining the parameter server -- Defining the worker -- Passing data between the parameter server and worker -- Issues with the parameter server -- The parameter server architecture introduces a high coding complexity for practitioners -- All-Reduce architecture -- Reduce -- All-Reduce -- Ring All-Reduce.
505 | 8 |  | |a Collective communication -- Broadcast -- Gather -- All-Gather -- Summary -- Chapter 3: Building a Data Parallel Training and Serving Pipeline -- Technical requirements -- The data parallel training pipeline in a nutshell -- Input pre-processing -- Input data partition -- Data loading -- Training -- Model synchronization -- Model update -- Single-machine multi-GPUs and multi-machine multi-GPUs -- Single-machine multi-GPU -- Multi-machine multi-GPU -- Checkpointing and fault tolerance -- Model checkpointing -- Load model checkpoints -- Model evaluation and hyperparameter tuning
505 | 8 |  | |a Model serving in data parallelism -- Summary -- Chapter 4: Bottlenecks and Solutions -- Communication bottlenecks in data parallel training -- Analyzing the communication workloads -- Parameter server architecture -- The All-Reduce architecture -- The inefficiency of state-of-the-art communication schemes -- Leveraging idle links and host resources -- Tree All-Reduce -- Hybrid data transfer over PCIe and NVLink -- On-device memory bottlenecks -- Recomputation and quantization -- Recomputation -- Quantization -- Summary -- Section 2 -- Model Parallelism -- Chapter 5: Splitting the Model
505 | 8 |  | |a Technical requirements -- Single-node training error -- out of memory -- Fine-tuning BERT on a single GPU -- Trying to pack a giant model inside one state-of-the-art GPU -- ELMo, BERT, and GPT -- Basic concepts -- RNN -- ELMo -- BERT -- GPT -- Pre-training and fine-tuning -- State-of-the-art hardware -- P100, V100, and DGX-1 -- NVLink -- A100 and DGX-2 -- NVSwitch -- Summary -- Chapter 6: Pipeline Input and Layer Split -- Vanilla model parallelism is inefficient -- Forward propagation -- Backward propagation -- GPU idle time between forward and backward propagation -- Pipeline input
500 |  |  | |a Pros and cons of pipeline parallelism.
650 |  | 0 | |a Machine learning. |0 http://id.loc.gov/authorities/subjects/sh85079324
650 |  | 0 | |a Python (Computer program language) |0 http://id.loc.gov/authorities/subjects/sh96008834
650 |  | 6 | |a Apprentissage automatique.
650 |  | 6 | |a Python (Langage de programmation)
650 |  | 7 | |a Machine learning |2 fast
650 |  | 7 | |a Python (Computer program language) |2 fast
776 | 0 | 8 | |i Print version: |a Wang, Guanhua. |t Distributed Machine Learning with Python. |d Birmingham : Packt Publishing, Limited, ©2022
856 | 4 | 0 | |l FWS01 |p ZDB-4-EBA |q FWS_PDA_EBA |u https://search.ebscohost.com/login.aspx?direct=true&scope=site&db=nlebk&AN=3242106 |3 Volltext
938 |  |  | |a Askews and Holts Library Services |b ASKH |n AH39813577
938 |  |  | |a ProQuest Ebook Central |b EBLB |n EBL6956758
938 |  |  | |a EBSCOhost |b EBSC |n 3242106
994 |  |  | |a 92 |b GEBAY
912 |  |  | |a ZDB-4-EBA
049 |  |  | |a DE-863