To keep a GPU fully utilized during training I need to be able to feed about 250 MB/s of raw data to the GPU (the data is incompressible). I am accessing the data over a fast network which can feed well over 2 GB/s without a problem. Python's GIL makes it rather hard to get those speeds into the same process that runs TensorFlow without negatively impacting the training loop. Python 3.8's shared memory may alleviate this, but that's not supported by TensorFlow just yet.
So I'm using `tf.io.gfile.GFile` to read data over the network (the data is stored on a high-bandwidth S3-compliant interface). The value of `GFile` is that it doesn't engage the GIL, and thus plays nicely with the training loop. In order to achieve high throughput there needs to be significant parallelization of the network I/O.
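For reference, a single read through `GFile` looks roughly like this (a minimal sketch; the helper name is mine, not a TensorFlow API):

```python
import tensorflow as tf

def read_blob(path: str) -> bytes:
    """Read one object in full through GFile.

    GFile accepts local paths as well as gs:// or s3:// URLs (given the
    appropriate filesystem plugin), and its reads happen in C++ without
    holding the GIL.
    """
    with tf.io.gfile.GFile(path, "rb") as f:
        return f.read()
```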
I only seem to be able to get about 75-100 MB/s out of this approach, though.
I’ve timed two approaches:
- Create a `tf.data.Dataset.map(mymapfunc, num_parallel_calls=50)` pipeline (I've tried many values of `num_parallel_calls`, including `AUTOTUNE`).
- Create a function that reads data using `tf.io.gfile.GFile` and simply run it using multiple threads in a `concurrent.futures.ThreadPoolExecutor`, attempting thread counts up to about 100 (there's no improvement above about 20, and eventually more threads slow it down).
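The thread-pool variant is roughly the sketch below. The reader is pluggable so the same harness can be timed with plain `open` or with `tf.io.gfile.GFile`; the function names, chunk handling, and default worker count are my placeholders, not anything from TensorFlow.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence

def read_file(path: str, opener: Callable = open) -> int:
    # opener can be swapped for tf.io.gfile.GFile to time the TF path
    with opener(path, "rb") as f:
        return len(f.read())

def measure_throughput(paths: Sequence[str], workers: int = 20,
                       opener: Callable = open) -> float:
    """Fan the reads out over a thread pool and return MB/s."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        total_bytes = sum(pool.map(lambda p: read_file(p, opener), paths))
    elapsed = time.perf_counter() - start
    return total_bytes / elapsed / 1e6
```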
In both cases I'm topping out at 75-100 MB/s.
I'm wondering if there's a reason for `GFile` to hit an upper limit that is perhaps more obvious to someone else.
I'm also making an assumption I should validate: `GFile` runs in numpy land; in both cases above I'm running it from Python land (in the `tf.data` case, via `tf.py_function`). If `GFile` is meant to run as part of the graph operations more efficiently, I'm unaware of this and need to be corrected.
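For concreteness, this is roughly how the Python-land read is bridged into the `tf.data` pipeline in the first approach; it is a sketch, the function names are mine, and `num_parallel_calls=50` is just one of the values I tried.

```python
import tensorflow as tf

def _read_py(path_tensor):
    # Runs in Python land: path_tensor arrives as an eager string tensor
    path = path_tensor.numpy().decode("utf-8")
    with tf.io.gfile.GFile(path, "rb") as f:
        return f.read()

def make_dataset(paths):
    ds = tf.data.Dataset.from_tensor_slices(list(paths))
    # tf.py_function hops back into Python for every element, which is
    # exactly where the GIL question comes in
    return ds.map(
        lambda p: tf.py_function(_read_py, [p], tf.string),
        num_parallel_calls=50,  # also tried tf.data.AUTOTUNE
    )
```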