#StackBounty: #python #tensorflow How to utilize 100% of GPU memory with Tensorflow?

Bounty: 50

I have a 32Gb graphics card and upon start of my script I see:

2019-07-11 01:26:19.985367: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 95.16G (102174818304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.988090: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 85.64G (91957338112 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.990806: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 77.08G (82761605120 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.993527: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 69.37G (74485440512 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.996219: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 62.43G (67036893184 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:19.998911: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 56.19G (60333203456 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.001601: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 50.57G (54299881472 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.004296: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 45.51G (48869892096 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.006981: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 40.96G (43982901248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.009660: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 36.87G (39584608256 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2019-07-11 01:26:20.012341: E tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 33.18G (35626147840 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

After which TF settles with using 96% of my memory. And later, when it runs out of memory it tryies to allocate 65G

tensorflow/stream_executor/cuda/cuda_driver.cc:936] failed to allocate 65.30G (70111285248 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY

My question is, what about remaining 1300MB (0.04*32480)? I would not mind using those before running OOM.

How can I make TF utilize 99.9% of memory instead of 96%?

Update: nvidia-smi output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:16.0 Off |                    0 |
| N/A   66C    P0   293W / 300W |  31274MiB / 32480MiB |    100%      Default |

I am asking about these 1205MB (31274MiB – 32480MiB) remaining unused. Maybe they are there for a reason, maybe they are used just before OOM.


Get this bounty!!!

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.