NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver

I am running an AWS EC2 g2.2xlarge instance with Ubuntu 14.04 LTS.
I'd like to observe the GPU utilization while training my TensorFlow models.
I get an error when trying to run 'nvidia-smi'.

ubuntu@ip-10-0-1-213:/etc/alternatives$ cd /usr/lib/nvidia-375/bin
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ls
nvidia-bug-report.sh     nvidia-debugdump     nvidia-xconfig
nvidia-cuda-mps-control  nvidia-persistenced
nvidia-cuda-mps-server   nvidia-smi
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ ./nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.


ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ dpkg -l | grep nvidia 
ii  nvidia-346                                            352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-346
ii  nvidia-346-dev                                        346.46-0ubuntu1                                     amd64        NVIDIA binary Xorg driver development files
ii  nvidia-346-uvm                                        346.96-0ubuntu0.0.1                                 amd64        Transitional package for nvidia-346
ii  nvidia-352                                            375.26-0ubuntu1                                     amd64        Transitional package for nvidia-375
ii  nvidia-375                                            375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary driver - version 375.39
ii  nvidia-375-dev                                        375.39-0ubuntu0.14.04.1                             amd64        NVIDIA binary Xorg driver development files
ii  nvidia-modprobe                                       375.26-0ubuntu1                                     amd64        Load the NVIDIA kernel driver and create device files
ii  nvidia-opencl-icd-346                                 352.63-0ubuntu0.14.04.1                             amd64        Transitional package for nvidia-opencl-icd-352
ii  nvidia-opencl-icd-352                                 375.26-0ubuntu1                                     amd64        Transitional package for nvidia-opencl-icd-375
ii  nvidia-opencl-icd-375                                 375.39-0ubuntu0.14.04.1                             amd64        NVIDIA OpenCL ICD
ii  nvidia-prime                                          0.6.2.1                                             amd64        Tools to enable NVIDIA's Prime
ii  nvidia-settings                                       375.26-0ubuntu1                                     amd64        Tool for configuring the NVIDIA graphics driver
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ lspci | grep -i nvidia
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
ubuntu@ip-10-0-1-213:/usr/lib/nvidia-375/bin$ 
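
The `dpkg -l` listing above contains packages whose versions come from several driver branches (346, 352 and 375), which may be worth ruling out as a cause. A minimal sketch, assuming the usual `dpkg -l` column order (status, name, version, architecture, description), that groups the installed `nvidia-*` packages by branch; the function name is hypothetical and the sample rows are copied from the listing above:

```python
import re
from collections import defaultdict

def nvidia_driver_branches(dpkg_output):
    """Group installed nvidia-* packages by the driver branch in their version."""
    branches = defaultdict(list)
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Installed packages start with "ii"; then come name and version.
        if len(fields) < 3 or fields[0] != "ii" or not fields[1].startswith("nvidia-"):
            continue
        match = re.match(r"(\d{3})\.\d+", fields[2])
        if match:  # skips tools such as nvidia-prime, whose version is not a driver version
            branches[match.group(1)].append(fields[1])
    return dict(branches)

sample = """\
ii  nvidia-346      352.63-0ubuntu0.14.04.1  amd64  Transitional package for nvidia-346
ii  nvidia-346-dev  346.46-0ubuntu1          amd64  NVIDIA binary Xorg driver development files
ii  nvidia-375      375.39-0ubuntu0.14.04.1  amd64  NVIDIA binary driver - version 375.39
ii  nvidia-prime    0.6.2.1                  amd64  Tools to enable NVIDIA's Prime
"""
print(nvidia_driver_branches(sample))
```

More than one key in the result means packages from different branches are installed side by side. That is only a hint, not proof that the mix is the cause of the error.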

$ inxi -G
Graphics:  Card-1: Cirrus Logic GD 5446 
           Card-2: NVIDIA GK104GL [GRID K520] 
           X.org: 1.15.1 driver: N/A tty size: 80x24 Advanced Data: N/A out of X

$  lspci -k | grep -A 2 -E "(VGA|3D)"
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
    Subsystem: XenSource, Inc. Device 0001
    Kernel driver in use: cirrus
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
    Subsystem: NVIDIA Corporation Device 1014
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
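
In the `lspci -k` output above, the Cirrus device reports "Kernel driver in use: cirrus" while the NVIDIA device has no such line, which suggests that no kernel module is bound to the GPU. A small sketch that automates this check on `lspci -k` text; the function name is hypothetical and the sample is taken from the output above:

```python
def nvidia_devices_without_driver(lspci_k_output):
    """Return NVIDIA device lines that lack a 'Kernel driver in use' entry."""
    devices = []  # each item is [device_line, has_driver]
    for line in lspci_k_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # Unindented lines start a new PCI device entry.
            devices.append([line.strip(), False])
        elif devices and "Kernel driver in use:" in line:
            devices[-1][1] = True
    return [name for name, has_driver in devices
            if "nvidia" in name.lower() and not has_driver]

sample = (
    "00:02.0 VGA compatible controller: Cirrus Logic GD 5446\n"
    "    Subsystem: XenSource, Inc. Device 0001\n"
    "    Kernel driver in use: cirrus\n"
    "00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)\n"
    "    Subsystem: NVIDIA Corporation Device 1014\n"
)
print(nvidia_devices_without_driver(sample))
```

Note that `grep -A 2` can cut off the driver line on entries with more detail lines, so feeding the full `lspci -k` output to the function is safer.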

I followed these instructions to install CUDA 7 and cuDNN:

$sudo apt-get -q2 update
$sudo apt-get upgrade
$sudo reboot

=======================================================================

After the reboot, update the initramfs by running '$sudo update-initramfs -u'

Now edit the /etc/modprobe.d/blacklist.conf file to blacklist nouveau. Open the file in a text editor and insert the following lines at the end of the file.

blacklist nouveau
blacklist lbm-nouveau
options nouveau modeset=0
alias nouveau off
alias lbm-nouveau off

Save and close the file.
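
Before rebooting, it can help to confirm that the nouveau blacklist entries really ended up in the file. A minimal sketch, assuming the five directives from the instructions above (`blacklist`, `options` and `alias` lines); the helper name is hypothetical and the sample text stands in for the real file:

```python
REQUIRED_LINES = [
    "blacklist nouveau",
    "blacklist lbm-nouveau",
    "options nouveau modeset=0",
    "alias nouveau off",
    "alias lbm-nouveau off",
]

def missing_blacklist_lines(conf_text, required=REQUIRED_LINES):
    """Return the required directives that are absent from the config text."""
    present = set()
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]          # drop trailing comments
        present.add(" ".join(line.split()))   # normalise whitespace
    return [line for line in required if line not in present]

# Stand-in for the contents of /etc/modprobe.d/blacklist.conf.
sample = "blacklist nouveau\noptions   nouveau modeset=0  # disable KMS\n"
print(missing_blacklist_lines(sample))
```

On a real machine the text would come from `open("/etc/modprobe.d/blacklist.conf").read()`; an empty result means every directive is present.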

Now install the build-essential tools, update the initramfs, and reboot again as below:

$sudo apt-get install linux-{headers,image,image-extra}-$(uname -r) build-essential
$sudo update-initramfs -u
$sudo reboot

========================================================================

After the reboot, run the following commands to install Nvidia.

$sudo wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/cuda_7.0.28_linux.run
$sudo chmod 700 ./cuda_7.0.28_linux.run
$sudo ./cuda_7.0.28_linux.run
$sudo update-initramfs -u
$sudo reboot

========================================================================

Now that the system has come back up, verify the installation by running the following commands.

$sudo modprobe nvidia
$sudo nvidia-smi -q | head

You should see output like 'nvidia.png'.

Now run the following commands.

$cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery
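
The `modprobe nvidia` step only helps if the module actually ends up loaded and nouveau stays out. A small sketch that checks both from `/proc/modules`-style text; the function name is hypothetical, and the sample lines are illustrative, not taken from the machine in the question:

```python
def module_status(proc_modules_text, names=("nvidia", "nouveau")):
    """Report, for each module name, whether it appears in /proc/modules text."""
    # The first whitespace-separated field of each line is the module name.
    loaded = {line.split()[0]
              for line in proc_modules_text.splitlines() if line.strip()}
    return {name: name in loaded for name in names}

# Illustrative sample in the /proc/modules layout (name, size, refcount, users, state, address).
sample = (
    "nvidia 10592256 0 - Live 0x0000000000000000 (POE)\n"
    "drm 302817 1 nvidia, Live 0x0000000000000000\n"
)
print(module_status(sample))
```

On a real system the input would be `open("/proc/modules").read()`. If `nvidia` is reported as not loaded, nvidia-smi has no driver to talk to, which matches the error in the title.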

However, 'nvidia-smi' doesn't show any GPU activity while TensorFlow is training models:

ubuntu@ip-10-0-1-48:~$ ipython
Python 2.7.11 |Anaconda custom (64-bit)| (default, Dec  6 2015, 18:08:32) 
Type "copyright", "credits" or "license" for more information.

IPython 4.1.2 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: import tensorflow as tf 
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.7.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.7.5 locally



ubuntu@ip-10-0-1-48:~$ nvidia-smi
Thu Mar 30 05:45:26 2017       
+------------------------------------------------------+                       
| NVIDIA-SMI 346.46     Driver Version: 346.46         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P0    38W /125W |     10MiB / 4095MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
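
To watch utilization over time rather than reading one snapshot, nvidia-smi can emit machine-readable rows through its `--query-gpu` and `--format=csv,noheader,nounits` options. A minimal sketch that parses such rows; the function names and dictionary keys are hypothetical, and the sample row simply mirrors the table above (0% utilization, 10 MiB of 4095 MiB used):

```python
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_stats(csv_text):
    """Parse 'util, used, total' rows (one per GPU) into dictionaries."""
    stats = []
    for row in csv_text.strip().splitlines():
        util, used, total = (int(field) for field in row.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats

def read_gpu_stats():
    """Run nvidia-smi once and parse its output (needs a working driver)."""
    return parse_gpu_stats(subprocess.check_output(QUERY, universal_newlines=True))

# Row shaped like the table above; not a fresh measurement.
print(parse_gpu_stats("0, 10, 4095\n"))
```

Calling `read_gpu_stats()` in a loop with a short sleep while training runs in another terminal gives a simple utilization log. The parser assumes numeric fields, so a GPU that reports a field as not supported would raise a `ValueError`. The IPython session above shows only the `import`, so an idle GPU at that point is possibly just TensorFlow not having started any work on the device yet.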

Original author: dbl001 | 2017-03-23

gpu

18

I solved "NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver" on my ASUS laptop with a GTX 950M and Ubuntu 18.04 by disabling Secure Boot Control from the BIOS.

That did the trick for my ASUS Zenbook with a 940MX. Thanks!!!!
Thanks a lot! This worked for my Alienware Aurora.
Works for me. Now I can use CUDA again.
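
With Secure Boot enabled, the kernel typically refuses to load unsigned modules, which would explain why disabling it lets the NVIDIA module load. A sketch for checking the state without entering the BIOS, assuming the `mokutil` tool is installed and that `mokutil --sb-state` prints a line such as "SecureBoot enabled" or "SecureBoot disabled"; the function names are hypothetical:

```python
import subprocess

def secure_boot_enabled(sb_state_output):
    """Interpret `mokutil --sb-state` output; None means the state is unclear."""
    text = sb_state_output.lower()
    if "secureboot enabled" in text:
        return True
    if "secureboot disabled" in text:
        return False
    return None

def read_secure_boot_state():
    """Query mokutil (assumed to be installed) and interpret its answer."""
    result = subprocess.run(["mokutil", "--sb-state"],
                            stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                            universal_newlines=True)
    return secure_boot_enabled(result.stdout)

print(secure_boot_enabled("SecureBoot enabled\n"))
```

A `None` result covers machines where the state cannot be read, for example systems booted in legacy BIOS mode.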

Original author: nuicca

I had the same error on my Ubuntu 16.04 (Linux 4.14 kernel) in Google Compute Engine with a K80 GPU. I upgraded the kernel from 4.14 to 4.15 and boom, the problem was solved. Here is how I upgraded my Linux kernel from 4.14 to 4.15:

Step 1:
Check the existing kernel of your Ubuntu Linux:

uname -a

Step 2:

Ubuntu maintains a website for all the versions of kernel that have 
been released. At the time of this writing, the latest stable release 
of Ubuntu kernel is 4.15. If you go to this 
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will 
see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64 
bit, I would download the following deb files:

wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb

Step 4:

Install all the downloaded deb files:

sudo dpkg -i *.deb

Step 5:
Reboot your machine and check if the kernel has been updated by:
uname -a

You should see that your kernel has been updated, and hopefully nvidia-smi should work.
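
The "check with uname -a" steps can also be scripted. A minimal sketch that compares the running kernel release (the same string `uname -r` prints) with a target version; the function names are hypothetical, and the 4.15.0 target is just the version installed in the steps above:

```python
import platform
import re

def kernel_version(release):
    """Turn a release string such as '4.15.0-041500-generic' into (4, 15, 0)."""
    match = re.match(r"(\d+)(?:\.(\d+))?(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    # Missing minor or patch components are treated as 0.
    return tuple(int(part or 0) for part in match.groups())

def kernel_at_least(release, minimum):
    """True if the release is at or above the (major, minor, patch) minimum."""
    return kernel_version(release) >= minimum

running = platform.release()
print(running, kernel_at_least(running, (4, 15, 0)))
```

Comparing tuples instead of raw strings avoids the trap where "4.4" sorts after "4.15" alphabetically.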

This worked for me with the nvidia-390 driver; it modprobes now. I upgraded from kernel 4.4.0 to 4.15.0.
After wasting 4 hours, this one solved my problem. Thank you.

Original author: Heapify

4

Run the following to find the right NVIDIA driver:

sudo ubuntu-drivers devices

Then choose the right one and run the following, with the chosen package name appended:

sudo apt install

Original author: gowin
3

I work with an AWS DeepAMI P2 instance, and suddenly I found that the NVIDIA driver command was not working and the GPU was not found by the torch or tensorflow library. I solved the problem in the following way.

Run nvcc --version. If that does not work,

then run the following:

apt install nvidia-cuda-toolkit

I hope this solves the problem.

This works for me. In my case, a reboot was needed for nvidia-smi to work again.

Original author: Rabindra Nath Nandi
1

I just want to thank @Heapify for providing a practical answer, and to update his answer because the links are out of date.

Step 1:
Check the existing kernel of your Ubuntu Linux:
```
uname -a
```
Step 2:

Ubuntu maintains a website for all the versions of kernel that have
been released. At the time of this writing, the latest stable release
of Ubuntu kernel is 4.15. If you go to this
link: http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/, you will
see several links for download.

Step 3:

Download the appropriate files based on the type of OS you have. For 64
bit, I would download the following deb files:
```
# UP-TO-DATE 2019-03-18
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500_4.15.0-041500.201802011154_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-headers-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v4.15/linux-image-4.15.0-041500-generic_4.15.0-041500.201802011154_amd64.deb
```
Step 4:

Install all the downloaded deb files:
```
sudo dpkg -i *.deb
```
Step 5:

Reboot your machine and check if the kernel has been updated by:
```
uname -a
```
Original author: Weixing

I had to install the NVIDIA 367.57 driver and CUDA 7.5 with TensorFlow on the g2.2xlarge Ubuntu 14.04 LTS instance.
e.g.
nvidia-graphics-drivers-367_367.57.orig.tar

Now the GRID K520 GPU is working while I train TensorFlow models:

ubuntu@ip-10-0-1-70:~$ nvidia-smi
Sat Apr  1 18:03:32 2017       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   39C    P8    43W /125W |   3800MiB / 4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      2254    C   python                                        3798MiB |
+-----------------------------------------------------------------------------+

ubuntu@ip-10-0-1-70:~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery$ ./deviceQuery 
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GRID K520"
  CUDA Driver Version /Runtime Version          8.0 /7.0
  CUDA Capability Major/Minor version number:    3.0
  Total amount of global memory:                 4036 MBytes (4232052736 bytes)
  ( 8) Multiprocessors, (192) CUDA Cores/MP:     1536 CUDA Cores
  GPU Max Clock rate:                            797 MHz (0.80 GHz)
  Memory Clock rate:                             2500 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 524288 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID /Bus ID /location ID:   0 /0 /3
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.0, NumDevs = 1, Device0 = GRID K520
Result = PASS

Original author: dbl001
