When training large neural networks on large datasets, there are several ways of breaking the problem down, across machines and across cores within a machine, for parallel computation.
To my knowledge, one could have:
- Different cores or machines operate on different parts of the computation graph ("graph splitting"). For example, backpropagation can be parallelized by hosting different layers on different machines, which works because the autodiff graph is always a DAG.
- Different cores or machines operate on different samples of data ("data splitting"). In SGD, the gradients for different batches or samples can be computed independently and then combined (e.g. averaged) before the weight update. (Is this the same thing as what is called "gradient accumulation", or is that a different concept?)
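To make the "data splitting" idea concrete, here is a toy sketch of what I mean (plain NumPy, a linear least-squares model; all names are my own, not from any framework). It checks that gradients computed independently on two shards and then combined, weighted by shard size, equal the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))   # 8 samples, 3 features
y = rng.normal(size=8)
w = rng.normal(size=3)        # current parameters

def grad(Xb, yb, w):
    # Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2)
    return Xb.T @ (Xb @ w - yb) / len(yb)

# Full-batch gradient, computed in one place
g_full = grad(X, y, w)

# "Data splitting": two workers each compute a gradient on their own shard,
# then the results are combined (a weighted average by shard size).
shards = [(X[:5], y[:5]), (X[5:], y[5:])]
g_combined = sum(len(yb) * grad(Xb, yb, w) for Xb, yb in shards) / len(y)

print(np.allclose(g_full, g_combined))  # True: the two agree
```

So, at least for this kind of loss, combining per-shard gradients is mathematically equivalent to one big batch, which is what makes me think the parallelism here is "free" up to communication costs.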
On top of that, I have read about:
- Asynchronous training
- Synchronous training
but I don’t know what is typically synchronous or asynchronous here. Is it the computation of gradients on different data batches, or the computation of gradients on different subgraphs (i.e. synchronous vs. asynchronous data or graph parallelism)? Or does it refer to something else altogether?
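If it helps clarify the question, here is my current (possibly wrong) mental model of the synchronous variant, again as a NumPy toy with names of my own: every worker computes a gradient on its shard for the same parameter vector, and one averaged update is applied per step, as if there were a barrier at each iteration:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w                      # noiseless targets, for a clean check
w = np.zeros(3)

shards = np.array_split(np.arange(16), 4)  # 4 equal-sized "workers"

for step in range(500):
    # Synchronous: all workers compute gradients for the SAME w,
    # then we wait for all of them before applying one averaged update.
    grads = [X[i].T @ (X[i] @ w - y[i]) / len(i) for i in shards]
    w -= 0.1 * np.mean(grads, axis=0)

print(w)  # converges toward true_w
```

Under this reading, "asynchronous" would mean dropping the barrier, so workers may compute gradients against stale parameters. Is that the right way to think about it?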
More broadly, what forms of parallelization are used in practice (and why) in modern architectures for computer vision, text, and speech, and which make more sense when training a neural network:
- across cores within a single machine (e.g. on a GPU)?
- across machines on a network or a rack?