What strategies and forms of parallelization are feasible and available for training and serving a neural network? I am interested in both levels:
- inside a machine across cores (e.g. GPU / TPU / CPU)
- across machines on a network or a rack
I’m also looking for evidence of how these strategies are used in practice, e.g. in TensorFlow, PyTorch or MXNet.
To my knowledge, when training large neural networks on large datasets, one could at least have:
- Different cores or machines operate on different parts of the graph ("graph splitting"). For example, backpropagation through the graph could be parallelized by hosting different layers on different machines, since (I think?) the autodiff graph is always a DAG.
- Different cores or machines operate on different samples of data ("data splitting"). In SGD, the computation of gradients across batches or samples can also be parallelized (e.g. the gradients can be combined after computing them independently on different batches). I believe this is also called gradient accumulation (?).
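To make the "data splitting" case concrete, here is a minimal sketch of what I have in mind, in plain Python rather than any particular framework. It assumes a toy linear model with a mean-squared-error loss and equal-sized shards; the function names `grad_mse` and `data_parallel_grad` are made up for illustration and are not part of any library.

```python
# Toy illustration of "data splitting": a linear model y ~ w*x + b with
# mean-squared-error loss. All names are hypothetical, not a library API.

def grad_mse(w, b, xs, ys):
    """Gradient of mean((w*x + b - y)^2) with respect to (w, b)."""
    n = len(xs)
    gw = sum(2.0 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    gb = sum(2.0 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return gw, gb

def data_parallel_grad(w, b, xs, ys, num_workers):
    """Split the batch into equal shards, compute one gradient per shard
    (each could run on its own core or machine), then average them."""
    shard = len(xs) // num_workers
    assert shard * num_workers == len(xs), "equal-sized shards assumed"
    grads = [
        grad_mse(w, b, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    gw = sum(g[0] for g in grads) / num_workers
    gb = sum(g[1] for g in grads) / num_workers
    return gw, gb

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0, 15.0]
full = grad_mse(0.5, 0.0, xs, ys)
sharded = data_parallel_grad(0.5, 0.0, xs, ys, num_workers=4)
```

With equal-sized shards, the averaged per-shard gradient should match the full-batch gradient up to floating-point error, which is why I assume the shards can be processed independently and combined afterwards.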
When is each strategy better, and for what type of problem or neural network? Which modes are supported by modern libraries? And can one combine all four (2×2) strategies?
On top of that, I have read about:
- Asynchronous training
- Synchronous training
but I don’t know exactly what these terms refer to: is it the computation of gradients on different data batches, or the computation of gradients on different subgraphs? Or do they refer to something else altogether?
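My best guess, which may well be wrong, is that the distinction concerns when the workers' gradients are applied to the shared parameters in the data-splitting case. A toy sketch of that guess follows, simulated sequentially in plain Python. The one-parameter loss and the names `sync_step` and `async_steps` are assumptions made purely for illustration.

```python
# Each "worker" holds one target and the loss (w - target)^2.
# The functions below are hypothetical and only simulate the two schemes.

def grad(w, target):
    return 2.0 * (w - target)  # d/dw of (w - target)^2

def sync_step(w, targets, lr):
    """Synchronous: every worker computes its gradient at the same w,
    the gradients are averaged, and a single update is applied."""
    g = sum(grad(w, t) for t in targets) / len(targets)
    return w - lr * g

def async_steps(w, targets, lr):
    """Asynchronous (simulated): all workers read the same snapshot of w,
    but each applies its own update as soon as it finishes, so later
    updates are computed from a stale copy of the parameters."""
    snapshot = w
    for t in targets:
        w = w - lr * grad(snapshot, t)
    return w

w_sync = sync_step(0.0, [1.0, 3.0], lr=0.1)     # one averaged update
w_async = async_steps(0.0, [1.0, 3.0], lr=0.1)  # two stale updates
```

In this toy setting the two schemes end at different parameter values after one round (0.4 versus 0.8), which is the kind of difference I would like to understand better.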
If the network is huge, prediction / inference may also be slow, and the model may not fit in the memory of a single machine at serving time, so I’m interested in multi-core and multi-node prediction solutions for such models.
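As a rough picture of what I mean by multi-node prediction, here is a minimal sketch in plain Python where each stage holds only a slice of the layers and activations are handed from one stage to the next. The `Stage` class and `predict` function are hypothetical stand-ins for separate machines, not an existing serving API.

```python
class Stage:
    """Hypothetical stand-in for one machine holding a slice of the layers."""

    def __init__(self, weights, biases):
        self.weights = weights  # list of rows, one row per output unit
        self.biases = biases

    def forward(self, x):
        # Dense layer followed by ReLU.
        return [
            max(0.0, sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b)
            for row, b in zip(self.weights, self.biases)
        ]

def predict(stages, x):
    """Run the input through each stage in order; in a real deployment the
    activations would be sent over the network between machines."""
    for stage in stages:
        x = stage.forward(x)
    return x

stage_a = Stage([[1.0, -1.0], [0.5, 0.5]], [0.0, 1.0])  # first layer
stage_b = Stage([[2.0, 1.0]], [-1.0])                   # second layer
out = predict([stage_a, stage_b], [3.0, 1.0])
```

Here no single stage needs the full set of weights, but a single request still passes through the stages one after another, so I am unsure how much this helps latency as opposed to memory.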