#StackBounty: #20.04 #nvidia #virtualization #cuda #gpu How does one make a GPU in a brand new ubuntu 20.04 VM usable?

Bounty: 50

I’ve been trying all day to get this (V100) GPU working in a new Ubuntu VM. I tried installing the drivers and rebooting, and also purging/uninstalling everything related to NVIDIA, but none of these things seem to work.

In particular, I ran the following:

apt update;
apt install build-essential;

sudo add-apt-repository ppa:graphics-drivers
sudo apt install ubuntu-drivers-common
ubuntu-drivers devices
sudo apt-get install nvidia-driver-460
sudo reboot now

Sometimes nvidia-smi seems to work (as of writing this question it wasn’t, so I couldn’t copy-paste its output from when it works), but when it doesn’t work it says this:

(synthesis) miranda9@miranda9:~$ nvidia-smi
Unable to determine the device handle for GPU 0000:00:06.0: Unknown Error

Any help is appreciated.

Note that I also do not have access to the VM’s vmx file, so this question and its answers do not help me: https://forums.developer.nvidia.com/t/nvidia-smi-reports-unable-to-determine-the-device-handle-for-gpu/46835
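As a basic sanity check (my own suggestion, not something from the original question), it may help to confirm after the reboot whether the kernel module was actually built and loaded; a minimal sketch, assuming the packaged 460 driver from the PPA:

# Check whether the nvidia kernel module was built and is loaded
dkms status                   # should list something like nvidia/460.xx: installed
lsmod | grep nvidia           # should show nvidia, nvidia_uvm, nvidia_modeset
sudo dmesg | grep -i nvrm     # driver initialization messages / errors, if any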

In addition, I have tried to uninstall everything from NVIDIA and re-install it with:

sudo apt-get --purge remove "*nvidia*"
sudo /usr/bin/nvidia-uninstall

then

apt update;
apt install build-essential;

sudo add-apt-repository ppa:graphics-drivers
sudo apt install ubuntu-drivers-common
ubuntu-drivers devices
sudo apt-get install nvidia-driver-460
sudo reboot now

but that doesn’t seem to work either.


More info in case it helps:

(synthesis) miranda9@miranda9:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

also:

(synthesis) miranda9@miranda9:~$ python
Python 3.9.5 (default, Jun  4 2021, 12:28:51) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/home/miranda9/miniconda3/envs/synthesis/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at  /opt/conda/conda-bld/pytorch_1623448238472/work/c10/cuda/CUDAFunctions.cpp:115.)
  return torch._C._cuda_getDeviceCount() > 0
False

As requested by comment:

# lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 System peripheral: XenSource, Inc. Citrix XenServer PCI Device for Windows Update (rev 01)
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)

another vm:

$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 SCSI storage controller: XenSource, Inc. Xen Platform Device (rev 01)
00:05.0 System peripheral: XenSource, Inc. Citrix XenServer PCI Device for Windows Update (rev 01)
00:06.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev a1)
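One additional detail worth capturing here (my suggestion, not part of the original output): which kernel driver, if any, is bound to the passthrough device at 00:06.0. A minimal sketch:

# Show the kernel driver and modules in use for the GPU at bus address 00:06.0
lspci -k -s 00:06.0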

Resources I’ve searched for help:


Get this bounty!!!

#StackBounty: #gpu Computer lags with GPU0 at 100% and GPU1 at 0%

Bounty: 50

I have a laptop that often lags/stutters when doing "normal" tasks that are not intensive:

  • Firefox with Youtube
  • Microsoft Teams makes my computer lag, both with the desktop app and even when using the web app in Google Chrome

Here is my GPU usage: [screenshot]

We can see that 3 apps take roughly 80% of one GPU ("System", Firefox, and "Windows Driver Foundation – Host process…" each take between 20% and 30% of my GPU 0)

Another weird thing is that my second GPU is not used; it stays at 0% even when launching games.

Here are my specs:

  • CPU: Intel Core i5-8265U @ 1.6 GHz (4 cores / 8 threads)
  • RAM: 24 GB
  • GPU 0: Intel UHD Graphics 620
  • GPU 1: NVIDIA GeForce MX250

What can I do to further troubleshoot my issue? How can I make use of both my GPUs?


Get this bounty!!!

#StackBounty: #cuda #gpu #nvidia #ubuntu-20.04 Nvidia-smi ; No devices were found

Bounty: 50

I have been facing this issue for 3 days. I have uninstalled the NVIDIA drivers and reinstalled them, and tried searching through many answers, but I could not find a satisfactory response.

When I check my GPU using nvidia-smi, I get the response No devices were found.

Ubuntu drivers on my system:

sudo ubuntu-drivers devices
[sudo] password for dev: 
== /sys/devices/pci0000:00/0000:00:01.1/0000:01:00.0 ==
modalias : pci:v000010DEd00001F99sv000017AAsd00003A43bc03sc00i00
vendor   : NVIDIA Corporation
driver   : nvidia-driver-460-server - distro non-free
driver   : nvidia-driver-460 - third-party non-free recommended
driver   : nvidia-driver-465 - third-party non-free
driver   : nvidia-driver-450-server - distro non-free
driver   : xserver-xorg-video-nouveau - distro free builtin

Output of lsb_release:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:    20.04
Codename:   focal

GPU information :

cat /proc/driver/nvidia/gpus/0000:01:00.0/information
Model:       GeForce GTX 1650
IRQ:         91
GPU UUID:    GPU-e40be3a1-7830-6e15-7330-30fd6a28ae8f
Video BIOS:      ??.??.??.??.??
Bus Type:    PCIe
DMA Size:    47 bits
DMA Mask:    0x7fffffffffff
Bus Location:    0000:01:00.0
Device Minor:    0
Blacklisted:     No

Drivers installed :

lspci -k | grep -EA3 'VGA|3D|Display'
01:00.0 VGA compatible controller: NVIDIA Corporation Device 1f99 (rev a1)
    Subsystem: Lenovo Device 3a3f
    Kernel driver in use: nvidia
    Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
--
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Renoir (rev c7)
    Subsystem: Lenovo Renoir
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu

NVIDIA kernel modules loaded (lsmod):

nvidia_uvm           1011712  0
nvidia_drm             53248  0
nvidia_modeset       1228800  1 nvidia_drm
nvidia              34168832  2 nvidia_uvm,nvidia_modeset
drm_kms_helper        217088  2 amdgpu,nvidia_drm
nvidiafb               53248  0
vgastate               20480  1 nvidiafb
fb_ddc                 16384  1 nvidiafb
i2c_algo_bit           16384  2 nvidiafb,amdgpu
drm                   552960  22 gpu_sched,drm_kms_helper,amdgpu,nvidia_drm,ttm
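One step not shown in the question that often narrows this down (my suggestion): since the nvidia kernel driver is bound but nvidia-smi reports no devices, the kernel log usually records why device initialization failed. A minimal sketch:

# Look for NVIDIA driver initialization errors (e.g. "RmInitAdapter failed")
sudo dmesg | grep -iE "nvrm|nvidia"
# Or search the whole current boot:
journalctl -b | grep -iE "nvrm|nvidia"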


Get this bounty!!!

#StackBounty: #fedora #gpu #amd-graphics Check which GPU a process is using

Bounty: 50

I have a notebook with an Intel iGPU and a dedicated AMD GPU, and I was told that by default the iGPU is used, but that I can explicitly tell a program to use the AMD GPU by running it like this:

DRI_PRIME=1 example_program
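For what it’s worth (my own illustration, not from the original question), the same environment variable can be combined with glxinfo to print which renderer a GL context ends up on, assuming the mesa-utils/glx-utils package is installed; note this only verifies glxinfo itself, not an arbitrary process:

# Compare the reported OpenGL renderer with and without PRIME offload
glxinfo | grep "OpenGL renderer"
DRI_PRIME=1 glxinfo | grep "OpenGL renderer"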

I know that this works for glmark2 because it tells me in the terminal, but how can I verify this for any other process?

For nvidia GPUs there apparently is a utility called nvidia-smi, but I need something that works for AMD GPUs.

I’m using Fedora 34 in case it matters…


Get this bounty!!!

#StackBounty: #server #nvidia #power-management #gpu #headless How to turn off Nvidia GPU on a headless server?

Bounty: 200

I am running a headless server with an Nvidia GPU.
Even when the GPU is not doing any work, it is consuming about 25 Watts of power:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.26       Driver Version: 430.26       CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 950     Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   61C    P0    26W / 110W |      0MiB /  2001MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Is there a way to completely turn off power delivery to the GPU when it is not in use?

I tried

sudo prime-select intel

which does cause nvidia-smi to stop working, but a power meter connected to the wall shows exactly the same power consumption with either intel or nvidia selected.

Completely removing the GPU reduces the power consumption by about 30 Watts as expected.

The main purpose is to save power and costs during idle operations, with an option to spin up the GPU when it is needed (remotely via ssh).
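One approach that is sometimes suggested in this situation (a hedged sketch, not a confirmed solution for this hardware) is to detach the card from the kernel entirely via sysfs, using the bus address shown in nvidia-smi, and rescan the bus later when the GPU is needed again over SSH. Whether this actually cuts power at the wall depends on the platform’s PCIe power management:

# Unbind and remove the GPU from the kernel's view
# (make sure the nvidia modules are unloaded first, e.g. after prime-select intel)
echo 1 | sudo tee /sys/bus/pci/devices/0000:01:00.0/remove
# Later, rediscover the device without rebooting
echo 1 | sudo tee /sys/bus/pci/rescan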


Get this bounty!!!

#StackBounty: #20.04 #gpu Cannot use PC while crypto-mining on Ubuntu: works great on windows though

Bounty: 100

I have an RTX 2080, and on Windows 10 I can mine Ethereum at 37-40 MH/s while using the PC for mundane tasks and browsing; I can even play video games (in that case the hashrate drops to 10 MH/s but mining still works) and the PC runs smoothly.

I also have Ubuntu 20.04 on a different partition of the same PC, on which I work for most of the day. No matter what mining software I use, mining makes the system lag so much I can barely interact with anything: 5-10 seconds of lag, and I even struggle to stop the mining command line.

My drivers are OK. I can see in the resource monitor that CPU cores sometimes jump to 100% (but not all the time), and nvidia-smi --loop=1 outputs normal values. Any idea what the cause of this could be?

Thanks in advance.

CPU usage while mining: [screenshot]

nvidia-smi --loop=1 output (two consecutive timestamps):

Wed Feb 24 14:55:30 2021        
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     | 
|-------------------------------+----------------------+----------------------+ 
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC | 
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. | 
|                               |                      |               MIG M. |
|===============================+======================+======================| 
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0  On |                  N/A | 
| 47%   61C    P0   166W / 245W |   5243MiB /  7959MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+ 
| Processes:                                                                  | 
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory | 
|        ID   ID                                                   Usage      | 
|=============================================================================| 
|    0   N/A  N/A      1267      G   /usr/lib/xorg/Xorg                120MiB | 
|    0   N/A  N/A      2419      G   /usr/lib/xorg/Xorg                510MiB | 
|    0   N/A  N/A      2556      G   /usr/bin/gnome-shell              101MiB | 
|    0   N/A  N/A      2755      G   livewallpaper                      55MiB | 
|    0   N/A  N/A      3029      G   ...AAAAAAAA== --shared-files       42MiB | 
|    0   N/A  N/A     20326      G   gnome-control-center                2MiB | 
|    0   N/A  N/A     20561      C   ./bminer                         4377MiB | 
+-----------------------------------------------------------------------------+ 
Wed Feb 24 14:55:31 2021         
+-----------------------------------------------------------------------------+ 
| NVIDIA-SMI 460.39       Driver Version: 460.39       CUDA Version: 11.2     | 
|-------------------------------+----------------------+----------------------+ 
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC | 
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. | 
|                               |                      |               MIG M. | 
|===============================+======================+======================| 
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0  On |                  N/A | 
| 47%   61C    P0   163W / 245W |   5243MiB /  7959MiB |    100%      Default | 
|                               |                      |                  N/A | 
+-------------------------------+----------------------+----------------------+
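One variable worth ruling out here (my own suggestion, not part of the original post): whether the miner’s CPU threads, rather than the GPU, are starving the desktop. A minimal, hedged experiment using the bminer binary shown in the process list above:

# Watch which processes saturate CPU cores while mining
top -o %CPU
# Experiment: relaunch the miner at the lowest CPU priority (flags/paths as in your setup)
nice -n 19 ./bminer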


Get this bounty!!!

#StackBounty: #nvidia #20.04 #gpu #fancontrol mouse problems after nvidia-xconfig --cool-bits=4

Bounty: 100

nvidia-xconfig --enable-all-gpus
nvidia-xconfig --cool-bits=4

from forums.developer.nvidia.com worked with my dual Titan RTX setup on Ubuntu 18.04. This allows me to manually set the fan speed using NVIDIA X Server Settings. Otherwise, the GPU runs very hot during training (86 °C). When I set the fans to 100%, the temperature stays at 66 °C during training.

I upgraded to Ubuntu 20.04 and everything is very stable, except after setting coolbits=4. When I do, I get manual fan control, but after the screen times out and darkens and I log back in, the mouse pointer does not appear. I reboot by using Ctrl+Alt+T to get a terminal (no mouse) and running sudo reboot; after logging in, all is fine until the screen darkens again after a timeout.

If I delete /etc/X11/xorg.conf and reboot, then all is good again, even after the screen times out and darkens, but there is no GPU fan speed control.

Very repeatable.

Any other way to enable manual fan speed control on NVIDIA?

I tried sudo nvidia-smi --gpu-target-temp=70 but this had no effect on fan speed.
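For reference (a hedged sketch of my own, and it does not avoid the Coolbits requirement), fan control can also be driven from a terminal with nvidia-settings attributes rather than through the GUI; the attribute names below are the ones recent drivers expose, and the fan indices may differ per system:

# List the fans nvidia-settings knows about
nvidia-settings -q fans
# Enable manual fan control and set both cards' fans to 100% (requires Coolbits and a running X session)
nvidia-settings -a "[gpu:0]/GPUFanControlState=1" -a "[fan:0]/GPUTargetFanSpeed=100"
nvidia-settings -a "[gpu:1]/GPUFanControlState=1" -a "[fan:1]/GPUTargetFanSpeed=100"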

The /etc/X11/xorg.conf created by nvidia-xconfig:

# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig:  version 460.32.03


Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen0" 0 0
    Screen      1  "Screen1" RightOf "Screen0"
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
EndSection

Section "Files"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Mouse0"
    Driver         "mouse"
    Option         "Protocol" "auto"
    Option         "Device" "/dev/psaux"
    Option         "Emulate3Buttons" "no"
    Option         "ZAxisMapping" "4 5"
EndSection

Section "InputDevice"

    # generated from default
    Identifier     "Keyboard0"
    Driver         "kbd"
EndSection

Section "Monitor"
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Monitor"
    Identifier     "Monitor1"
    VendorName     "Unknown"
    ModelName      "Unknown"
    Option         "DPMS"
EndSection

Section "Device"
    Identifier     "Device0"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "TITAN RTX"
    BusID          "PCI:33:0:0"
EndSection

Section "Device"
    Identifier     "Device1"
    Driver         "nvidia"
    VendorName     "NVIDIA Corporation"
    BoardName      "TITAN RTX"
    BusID          "PCI:74:0:0"
EndSection

Section "Screen"
    Identifier     "Screen0"
    Device         "Device0"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "Coolbits" "4"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor1"
    DefaultDepth    24
    Option         "Coolbits" "4"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection


Get this bounty!!!

#StackBounty: #nvidia #power-management #gpu #ethereum unable to set power-limit via nvidia-smi

Bounty: 50

I have gotten a used mining rig and am in the process of trying to tune the clock/memory/power of the GPUs to increase the hash rate. However, I am unable to control or even monitor the power for 1 of the 4 GPUs, which are all of the same kind.

More detail below:

While tuning the power limit, I’ve noticed via nvidia-smi that the Power Readings, while supported, show "N/A" under Power Draw and Power Limit (same thing when running nvidia-smi -q). All 4 cards are EVGA GeForce GTX 1080 Ti 11GB (11g-p4-6696-kr), but only 1 is behaving as described.


Hardware: 500 GB SSD, 16 GB RAM, B250-MINING-EXPERT motherboard, 1600 W PSU; PCIe risers are used to connect the GPUs to the motherboard.

The OS is Ubuntu 20.04; the installed NVIDIA driver is 450.80.02.

I’ve downgraded the driver from 455, hoping it’s a software issue, but the problem is the same. I’ve even switched the PCIe slots to match the recommendation from the manual.

When setting the power limit, I get the warning message below:
Changing power management limit is not supported for GPU: 00000000:09:00.0.
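For context (my reconstruction of the commands implied above, not verbatim from the post), the querying and setting were presumably done roughly like this, with the GPU index being hypothetical:

# Query the power readings / limits for every GPU
nvidia-smi -q -d POWER
# Set a 225 W power limit on GPU index 3 (pick the index of the affected card)
sudo nvidia-smi -i 3 -pl 225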


From this link, a poster suggests it could be power related. The PC has a 1600 W power supply that is currently connected to a 120 V outlet (but supports 240 V). To eliminate power as a suspect, I plugged each GPU into the motherboard individually with the remaining GPUs unpowered, and only the 1 particular GPU shows "N/A" for the power values.

More info: all 4 GPUs seem to work and can reach a hash rate of >= 37 MH/s via Ethminer. However, the other 3 GPUs reach > 40 MH/s with power capped at 225 watts (GPUGraphicsClockOffset=0, GPUMemoryTransferRateOffset=1530, with the "pill").

Any suggestions or thoughts?

Edits:

  • Found a related link, similar issue 1, with no resolution.
  • Pulled the problematic GPU out and plugged it into another PC; same issue.
  • Attempted to downgrade to NVIDIA’s other available drivers, 390 (same issue) and 340 (errored out…can’t install), per this post (though that post stated 367 works). Purged drivers and went back to 430, which actually installed version 450.102.04; the issue remains.


Let me know if this should be moved to Unix/Server Fault/Ethereum/Bitcoin SE. Thanks. Personally I think Ask Ubuntu is the most relevant, unless there is a forum more specific to GPU + Ubuntu. Most miners use Windows, so I feel my chances here at Ask Ubuntu are higher, considering there are also ML users here who help troubleshoot GPU + driver + Ubuntu issues.


Get this bounty!!!

#StackBounty: #20.04 #amd-graphics #gpu #testing I fear my GPU might be broken. How can I test my dedicated GPU in Ubuntu? DRI_PRIME=1 …

Bounty: 50

I would like to test my dedicated graphics card on my Ubuntu partition, but I can’t run anything with it. Everything just defaults to the integrated Intel graphics instead of my AMD Radeon M360. The graphics card does show up when I use the command

$ lspci -nn | grep -E 'VGA|Display'

This is the output:

00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 5500 [8086:1616] (rev 09)
04:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445 / 530/535 / 620/625 Mobile] [10... (rev 81)

And I’ve seen elsewhere on the web that using DRI_PRIME=1 before the program you wish to run should switch to the dedicated graphics card, but in my case it doesn’t work.

For example when I try the benchmark glmark2 by running the following command:

DRI_PRIME=1 glmark2

This happens:

=======================================================
    glmark2 2014.03+git20150611.fa71af2d
=======================================================
    OpenGL Information
    GL_VENDOR:     Intel
    GL_RENDERER:   Mesa Intel(R) HD Graphics 5500 (BDW GT2)
    GL_VERSION:    4.6 (Compatibility Profile) Mesa 20.0.8

Followed by the benchmark results. The same thing happens if I try DRI_PRIME=0 or DRI_PRIME=2. Does anyone have any other way to switch GPUs in order to test the card?
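As a side note (my suggestion, not from the original question), before blaming the hardware it may be worth confirming that the kernel has actually brought the AMD card up as a second PRIME provider; a minimal sketch using the bus address from the lspci output above:

# Check that the amdgpu kernel module is loaded and bound to the card
lspci -k -s 04:00.0
# List the render/offload providers X knows about; the AMD GPU should appear here
xrandr --listproviders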

For context, I’ve been having issues with my laptop on my Windows partition, and I suspect the graphics card or its drivers. Windows just freezes (no blue screen or anything; everything freezes, including all input) after a few minutes of being on. After trying to fix it by resetting Windows, the Windows partition is now broken, so before repairing it I’d like to test the GPU on Ubuntu to make sure the problem is in software and not hardware.

Thank you all in advance.


Get this bounty!!!

#StackBounty: #ubuntu #gpu #amd Ubuntu 20.04 screen flickering and glitching

Bounty: 100

I’ve installed Ubuntu 20.04 on my computer, and I’m having trouble with glitching and flickering.
My specs are:

  • AMD Ryzen 3 3200G with Vega 8 graphics
  • 16 GB of RAM, DDR4 3200 MHz
  • Motherboard: Gigabyte B450 DS3H
  • SSD: Samsung 860 EVO
I’ve been trying some configuration in GRUB but it didn’t work; these are the solutions I’ve tried:

https://askubuntu.com/questions/1239149/graphics-glitches-and-artifacts-with-ryzen-5-3400g-apu

https://www.muylinux.com/2020/07/30/forzar-driver-amdgpu-vieja-grafica-amd/
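Both of the linked solutions boil down to passing kernel parameters for amdgpu at boot. For reference, this is roughly how that is done on Ubuntu (a sketch only; the parameter below is a placeholder, not a recommendation; use whichever one the linked guides suggest):

# Edit the default kernel command line (the parameter shown is a stand-in)
sudo nano /etc/default/grub
# e.g. GRUB_CMDLINE_LINUX_DEFAULT="quiet splash some.amdgpu.parameter=1"
sudo update-grub
sudo reboot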

The last thing I’ve done is log in with "Ubuntu on Wayland", and the flickers and glitches were reduced a lot, but they still happen sometimes.

This is what was happening: [screenshot]

After "ubuntu on wayland" it’s working better, but still glitching sometimes:

enter image description here

I thought that maybe the problem was the OS, so I tried to install Debian, but when I chose the graphical installation, this happened (tried 3 times, always the same):

https://i.imgur.com/0zqHw8U.jpg

So I think that maybe the Vega 8 isn’t fully compatible with Ubuntu, but this computer MUST run Linux.

[EDIT]

Maybe if I try another desktop environment that is not GNOME it could work better?


Get this bounty!!!