PCI assignment with nvidia K1: RmInitAdapter failed

Discussion:

Guillaume Thouvenin

2013-07-09 09:18:10 UTC

Hello,

I'd like to test an NVIDIA K1 with PCI assignment, using either the
pci-stub module or the vfio module. The host runs Ubuntu Raring with
qemu-system-x86 version 1.4.0. The NVIDIA K1 has a GRID architecture,
and I see four K1 devices on one card.

I unbound the device from its old driver and bound it to vfio-pci,
then started the VM with the following command:

$ sudo sh -c "qemu-system-x86_64 -M q35 -m 4096 --enable-kvm \
-net nic,model=virtio,macaddr=52:54:00:82:69:75 -net tap,ifname=tap0 \
-drive file=/home/thouveng/ubu13.04_amd64_base.qcow2,if=virtio \
-device vfio-pci,host=87:00.0"
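
For readers who want to reproduce the unbind/bind step described above, here is a minimal sketch of the usual sysfs sequence. It assumes the vfio-pci module is already loaded and the IOMMU is enabled on the host. The function name `bind_to_vfio` and the `SYSFS_ROOT` override (only there to allow a dry run against a scratch directory) are illustrative and do not come from the original message.

```shell
#!/bin/sh
# Sketch: move one PCI function from its current driver to vfio-pci.
# SYSFS_ROOT is a hypothetical override for dry runs; leave it unset
# on a real host, and run as root.
bind_to_vfio() {
    bdf=$1                                  # e.g. 0000:87:00.0
    root=${SYSFS_ROOT:-/sys}
    dev=$root/bus/pci/devices/$bdf

    [ -d "$dev" ] || { echo "no such device: $bdf" >&2; return 1; }

    # vendor and device read back with a 0x prefix (0x10de for NVIDIA);
    # new_id expects both without the prefix, separated by a space.
    vendor=$(cat "$dev/vendor")
    device=$(cat "$dev/device")

    # Detach from whatever driver currently owns the function.
    if [ -e "$dev/driver/unbind" ]; then
        echo "$bdf" > "$dev/driver/unbind"
    fi

    # Ask vfio-pci to claim this vendor:device ID.
    echo "${vendor#0x} ${device#0x}" > "$root/bus/pci/drivers/vfio-pci/new_id"
}
```

Usage would be `bind_to_vfio 0000:87:00.0`. Note that writing to `new_id` lets vfio-pci claim every currently unbound function with the same ID, which matters on a card that exposes four identical GPUs.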

Then I logged into the VM and I can see the nvidia device:

$ lspci | grep -i nvi
00:03.0 VGA compatible controller: NVIDIA Corporation GK107GL [GRID K1] (rev a1)

I installed the NVIDIA driver that I downloaded from their website.
Everything seemed OK. To validate that everything was working, I
ran the command "nvidia-smi -q" and got the following error:

$ nvidia-smi -q
NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
Unable to determine the device handle for GPU 0000:00:03.0: Unknown Error

and in the syslog of the guest I can see:
[ 1502.459123] NVRM: RmInitAdapter failed! (0x26:0x38:1170)
[ 1502.459150] NVRM: rm_init_adapter(0) failed

I checked the BAR registers in the guest and, as far as I understand
them, the memory regions seem correct:

$ lspci -v
00:03.0 VGA compatible controller: NVIDIA Corporation GK107GL [GRID K1] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device 099d
	Flags: bus master, fast devsel, latency 0, IRQ 23
	Memory at fd000000 (32-bit, non-prefetchable) [size=16M]
	Memory at f0000000 (64-bit, prefetchable) [size=128M]
	Memory at fa000000 (64-bit, prefetchable) [size=32M]
	I/O ports at c000 [size=128]
	Expansion ROM at fe000000 [disabled] [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: nvidia

I also tried the assignment with pci-stub, but I have the same
problem.
The only clue I could find on the web was on the page
http://us.download.nvidia.com/XFree86/Linux-x86/304.60/README/commonproblems.html
where they say that the same problem occurs when the "VBIOS fail to
load on my Optimus system". About this error they say: "Such problems
are typically beyond the control of the NVIDIA driver, which relies on
proper cooperation of ACPI and the System BIOS to retrieve important
information about the GPU, including the Video BIOS."
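
Since that README passage points at the Video BIOS, one thing that can be checked is whether the device's expansion ROM is readable at all. The following is a minimal sketch, assuming the standard sysfs `rom` attribute and the PCI convention that an option ROM image begins with the bytes 0x55 0xAA. The helper names `rom_has_signature` and `dump_rom` are made up for illustration, and a valid signature does not prove that the NVIDIA driver will accept the image.

```shell
#!/bin/sh
# Return 0 if FILE starts with the PCI option ROM signature 55 AA.
rom_has_signature() {
    sig=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Dump the expansion ROM of one PCI function through sysfs (needs root).
# Writing 1 to the rom attribute enables reads, writing 0 disables them.
dump_rom() {
    dev=/sys/bus/pci/devices/$1     # e.g. 0000:00:03.0 inside the guest
    echo 1 > "$dev/rom" &&
    cat "$dev/rom" > "$2"
    status=$?
    echo 0 > "$dev/rom"
    return $status
}
```

A possible invocation is `dump_rom 0000:00:03.0 /tmp/vbios.rom && rom_has_signature /tmp/vbios.rom && echo "ROM signature present"`.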

I don't really understand the problem. I would really appreciate any help
troubleshooting it, and any links that could help are more than welcome :).

I would also appreciate hearing about any success story with PCI
assignment of a K1. I don't have all of them, but I may have
the opportunity to test them, so any comments are welcome :).

Best regards,
Guillaume

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to ***@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Guillaume Thouvenin

2013-07-09 10:15:45 UTC

> I'd like to test a nvidia K1 with pci assignment by using the
> pci-stub module or the vfio module. I'm running an ubuntu raring as
> the host and a version 1.4.0 of qemu-system-x86 1.4.0.

I just tested with QEMU emulator version 1.5.50, which I compiled from
git, but I have the same issue. The kernel running on the host is
3.8.0-23-generic, the one that comes with Ubuntu.

I also tried to reset the card before binding it, but I got the same issue:

NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
Unable to determine the device handle for GPU 0000:00:03.0: Unknown Error

even though in /dev I have:

$ ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 juil. 9 12:09 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 juil. 9 12:09 /dev/nvidiactl

Regards,
Guillaume

Alex Williamson

2013-07-09 17:52:52 UTC

Post by Guillaume Thouvenin
>> I'd like to test a nvidia K1 with pci assignment by using the
>> pci-stub module or the vfio module. I'm running an ubuntu raring as
>> the host and a version 1.4.0 of qemu-system-x86 1.4.0.
>
> I just tested with QEMU emulator version 1.5.50 that I compiled from
> git but I have the same issue. The kernel running on the host is a
> 3.8.0-23-generic that comes with ubuntu.
>
> I also tried to reset the card before binding it but I got the same issue:
>
> NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
> Unable to determine the device handle for GPU 0000:00:03.0: Unknown Error
>
> even if in /dev I have:
>
> $ ls -l /dev/nvidia*
> crw-rw-rw- 1 root root 195, 0 juil. 9 12:09 /dev/nvidia0
> crw-rw-rw- 1 root root 195, 255 juil. 9 12:09 /dev/nvidiactl

Are you sure that nvidia-smi is relevant to the K1/K2 devices? Does it
work on the host? I wouldn't be surprised if a system management
interface tries to use backdoors that are not available in a VM. Are
there other tests you can do to check whether the device is otherwise
available? Also, I'm curious to see what these cards look like; could
you provide a 'sudo lspci -vvv' of the host system? Thanks,

Alex

Guillaume Thouvenin

2013-07-10 12:54:31 UTC

> Are you sure that nvidia-smi is relevant to the K1/K2 devices? Does it
> work on the host?

Yes, it works on the host and I get information such as power
consumption, GPU temperature, etc.

> Are there other tests you can do to check whether the device is
> otherwise available?

I also tried to start an X server, but I got the same error reported in
the syslog:

[ 435.745673] NVRM: RmInitAdapter failed! (0x26:0x38:1170)
[ 435.745695] NVRM: rm_init_adapter(0) failed

And in the xorg.log I have:

[ 423.624] (==) NVIDIA(0): Depth 24, (==) framebuffer bpp 32
[ 423.624] (==) NVIDIA(0): RGB weight 888
[ 423.624] (==) NVIDIA(0): Default visual is TrueColor
[ 423.624] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
[ 423.624] (**) NVIDIA(0): Option "NoLogo" "true"
[ 423.624] (**) NVIDIA(0): Option "UseDisplayDevice" "none"
[ 423.624] (**) NVIDIA(0): Enabling 2D acceleration
[ 423.624] (**) NVIDIA(0): Option "UseDisplayDevice" set to "none"; enabling NoScanout
[ 423.624] (**) NVIDIA(0): mode
[ 435.746] (EE) NVIDIA(0): Failed to initialize the NVIDIA GPU at PCI:0:3:0. Please
[ 435.746] (EE) NVIDIA(0): check your system's kernel log for additional error
[ 435.746] (EE) NVIDIA(0): messages and refer to Chapter 8: Common Problems in the
[ 435.746] (EE) NVIDIA(0): README for additional information.
[ 435.746] (EE) NVIDIA(0): Failed to initialize the NVIDIA graphics device!
[ 435.746] (EE) NVIDIA(0): Failing initialization of X screen 0

Regards,
Guillaume

Alex Williamson

2013-07-10 15:25:32 UTC

Post by Guillaume Thouvenin
>> Are you sure that nvidia-smi is relevant to the K1/K2 devices? Does it
>> work on the host?
>
> Yes it works on the host and I have some information like power
> consumption, temperature of the GPU, etc...
>
>> Are
>> there other tests you can do to check whether the device is otherwise
>> available?
>
> I also tried to start an X server but I got the same error reported in
> the syslog: [ 435.745673] NVRM: RmInitAdapter failed! (0x26:0x38:1170)
> [ 435.745695] NVRM: rm_init_adapter(0) failed
>
> And in the xorg.log I have:
>
> [ 423.624] (==) NVIDIA(0): Depth 24, (==) framebuffer bpp 32
> [ 423.624] (==) NVIDIA(0): RGB weight 888
> [ 423.624] (==) NVIDIA(0): Default visual is TrueColor
> [ 423.624] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
> [ 423.624] (**) NVIDIA(0): Option "NoLogo" "true"
> [ 423.624] (**) NVIDIA(0): Option "UseDisplayDevice" "none"
> [ 423.624] (**) NVIDIA(0): Enabling 2D acceleration
> [ 423.624] (**) NVIDIA(0): Option "UseDisplayDevice" set to "none"; enabling NoScanout
> [ 423.624] (**) NVIDIA(0): mode
> [ 435.746] (EE) NVIDIA(0): Failed to initialize the NVIDIA GPU at PCI:0:3:0. Please
> [ 435.746] (EE) NVIDIA(0): check your system's kernel log for additional error
> [ 435.746] (EE) NVIDIA(0): messages and refer to Chapter 8: Common Problems in the
> [ 435.746] (EE) NVIDIA(0): README for additional information.
> [ 435.746] (EE) NVIDIA(0): Failed to initialize the NVIDIA graphics device!
> [ 435.746] (EE) NVIDIA(0): Failing initialization of X screen 0

AFAICT, rm_init_adapter is in the binary blob part of the nvidia driver,
so we can't simply check the code to see what's wrong. Google finds
this:

http://forums.gentoo.org/viewtopic-t-961110.html?sid=0a60d40ba001f7bbe799bb16b2921cac

That was of course on bare metal, but note that the IOMMU prevented some
accesses and the solution was to disable the IOMMU. Of course we can't
disable the IOMMU in this case. Do you see any IOMMU faults in dmesg on
the host around this error? Does nouveau work? Thanks,

Alex
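
The check for IOMMU faults asked about above can be sketched as a simple filter over the host kernel log. The patterns below are a best-effort assumption covering common Intel (DMAR/DRHD) and AMD (AMD-Vi) fault wording; exact messages vary between kernel versions, and the function name `iommu_faults` is illustrative.

```shell
#!/bin/sh
# Read kernel log text on stdin and keep only lines that look like
# IOMMU fault reports. The patterns are assumptions, not an exhaustive
# list of what every kernel version prints.
iommu_faults() {
    grep -iE 'DMAR.*fault|DRHD.*fault|AMD-Vi.*(IO_PAGE_FAULT|event logged)'
}
```

On the host, `dmesg | iommu_faults` run right after reproducing the guest failure would show whether anything was logged beyond the boot-time IOMMU messages.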

Guillaume Thouvenin

2013-07-11 10:05:30 UTC

> Do you see any IOMMU faults in dmesg on
> the host around this error?

No, the only messages produced by the IOMMU occurred at boot time.

> Does nouveau work?

Not really. When I load the nouveau module in the guest I get the
following messages:

Jul 11 11:56:29 raringvm1 kernel: [ 555.659921] [drm] Initialized drm 1.1.0 20060810
Jul 11 11:56:29 raringvm1 kernel: [ 555.668919] wmi: Mapper loaded
Jul 11 11:56:29 raringvm1 kernel: [ 555.685793] checking generic (fa000000 160000) vs hw (f0000000 8000000)
Jul 11 11:56:29 raringvm1 kernel: [ 555.685795] checking generic (fa000000 160000) vs hw (f8000000 2000000)
Jul 11 11:56:29 raringvm1 kernel: [ 555.689434] nouveau [  DEVICE][0000:00:03.0] BOOT0 : 0x0e7320a2
Jul 11 11:56:29 raringvm1 kernel: [ 555.689438] nouveau [  DEVICE][0000:00:03.0] Chipset: GK107 (NVE7)
Jul 11 11:56:29 raringvm1 kernel: [ 555.689439] nouveau [  DEVICE][0000:00:03.0] Family : NVE0
Jul 11 11:56:29 raringvm1 kernel: [ 555.690528] nouveau [  VBIOS][0000:00:03.0] checking PRAMIN for image...
Jul 11 11:56:29 raringvm1 kernel: [ 555.690554] nouveau [  VBIOS][0000:00:03.0] ... signature not found
Jul 11 11:56:29 raringvm1 kernel: [ 555.690556] nouveau [  VBIOS][0000:00:03.0] checking PROM for image...
Jul 11 11:56:29 raringvm1 kernel: [ 555.899715] nouveau [  VBIOS][0000:00:03.0] ... appears to be valid
Jul 11 11:56:29 raringvm1 kernel: [ 555.899717] nouveau [  VBIOS][0000:00:03.0] using image from PROM
Jul 11 11:56:29 raringvm1 kernel: [ 555.899797] nouveau [  VBIOS][0000:00:03.0] BIT signature found
Jul 11 11:56:29 raringvm1 kernel: [ 555.899798] nouveau [  VBIOS][0000:00:03.0] version 80.07.4e.00.06
Jul 11 11:56:29 raringvm1 kernel: [ 555.906258] nouveau [  PFB][0000:00:03.0] RAM type: DDR3
Jul 11 11:56:29 raringvm1 kernel: [ 555.906261] nouveau [  PFB][0000:00:03.0] RAM size: 4096 MiB
Jul 11 11:56:29 raringvm1 kernel: [ 555.906262] nouveau [  PFB][0000:00:03.0] ZCOMP: 0 tags
Jul 11 11:56:29 raringvm1 kernel: [ 555.955664] nouveau [  THERM][0000:00:03.0] Found an max1617 at address 0x4c (controlled by lm_sensors)
Jul 11 11:56:31 raringvm1 kernel: [ 555.955666] nouveau [  I2C][0000:00:03.0] detected monitoring device: max1617
Jul 11 11:56:31 raringvm1 kernel: [ 555.999204] [TTM] Zone kernel: Available graphics memory: 2024782 kiB
Jul 11 11:56:31 raringvm1 kernel: [ 555.999205] [TTM] Initializing pool allocator
Jul 11 11:56:31 raringvm1 kernel: [ 555.999208] [TTM] Initializing DMA pool allocator
Jul 11 11:56:31 raringvm1 kernel: [ 555.999267] nouveau [ DRM] VRAM: 4096 MiB
Jul 11 11:56:31 raringvm1 kernel: [ 555.999268] nouveau [ DRM] GART: 512 MiB
Jul 11 11:56:31 raringvm1 kernel: [ 555.999270] nouveau [ DRM] BIT BIOS found
Jul 11 11:56:31 raringvm1 kernel: [ 555.999271] nouveau [ DRM] Bios version 80.07.4e.00
Jul 11 11:56:31 raringvm1 kernel: [ 555.999273] nouveau [ DRM] TMDS table version 2.0
Jul 11 11:56:31 raringvm1 kernel: [ 555.999274] nouveau [ DRM] DCB version 4.0
Jul 11 11:56:31 raringvm1 kernel: [ 555.999275] nouveau [ DRM] DCB outp 00: 02000f00 00020030
Jul 11 11:56:31 raringvm1 kernel: [ 555.999276] nouveau [ DRM] DCB conn 00: 00000000
Jul 11 11:56:31 raringvm1 kernel: [ 556.002406] [drm] Supports vblank timestamp caching Rev 1 (10.10.2010).
Jul 11 11:56:31 raringvm1 kernel: [ 556.002408] [drm] No driver support for vblank timestamp query.
Jul 11 11:56:31 raringvm1 kernel: [ 556.012527] nouveau E[  PDISP][0000:00:03.0] chid 0 mthd 0x0000 data 0x00000000 0x10001000 0x00000001
Jul 11 11:56:31 raringvm1 kernel: [ 558.005281] [TTM] Finalizing pool allocator
Jul 11 11:56:31 raringvm1 kernel: [ 558.005285] [TTM] Finalizing DMA pool allocator
Jul 11 11:56:31 raringvm1 kernel: [ 558.005300] [TTM] Zone kernel: Used memory at exit: 0 kiB
Jul 11 11:56:31 raringvm1 kernel: [ 558.013388] nouveau: probe of 0000:00:03.0 failed with error -16

I don't know what error -16 means, but when I try to start an X
server it starts and then complains that KMS is not
enabled. I checked the kernel config file and CONFIG_FRAMEBUFFER_CONSOLE is
set to y.
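
The -16 in the probe failure is a negated Linux errno value. As a small aid that is not part of the original message, the lookup below maps a handful of common values, including the two that appear in this thread: 16 is EBUSY and 5 is EIO. The function name `errno_name` is illustrative and the table is deliberately incomplete. EBUSY suggests that something else was holding a resource the driver wanted, though the log alone does not say what.

```shell
#!/bin/sh
# Map a few Linux errno values, as they appear negated in kernel
# messages such as "failed with error -16", to their symbolic names.
# Only a handful of common values are listed; extend as needed.
errno_name() {
    n=${1#-}                      # accept "-16" as well as "16"
    case $n in
        1)  echo "EPERM (Operation not permitted)" ;;
        5)  echo "EIO (Input/output error)" ;;
        12) echo "ENOMEM (Cannot allocate memory)" ;;
        16) echo "EBUSY (Device or resource busy)" ;;
        19) echo "ENODEV (No such device)" ;;
        22) echo "EINVAL (Invalid argument)" ;;
        *)  echo "unknown errno $n" ;;
    esac
}
```

For example, `errno_name -16` reports EBUSY for the nouveau probe failure shown in the log.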

As KMS seems to be required by the nouveau driver, X failed to use nouveau.
Here are the logs from Xorg.log:

[ 22.234] (II) LoadModule: "nouveau"
[ 22.235] (II) Loading /usr/lib/xorg/modules/drivers/nouveau_drv.so
[ 22.236] (II) Module nouveau: vendor="X.Org Foundation"
[ 22.236] compiled for 1.13.3, module version = 1.0.7
[ 22.236] Module class: X.Org Video Driver
[ 22.236] ABI class: X.Org Video Driver, version 13.1
[ 22.236] (II) NOUVEAU driver Date: Wed Mar 27 09:50:03 2013 +0100
[ 22.236] (II) NOUVEAU driver for NVIDIA chipset families :
[ 22.236] RIVA TNT (NV04)
[ 22.236] RIVA TNT2 (NV05)
[ 22.236] GeForce 256 (NV10)
[ 22.236] GeForce 2 (NV11, NV15)
[ 22.236] GeForce 4MX (NV17, NV18)
[ 22.236] GeForce 3 (NV20)
[ 22.236] GeForce 4Ti (NV25, NV28)
[ 22.236] GeForce FX (NV3x)
[ 22.236] GeForce 6 (NV4x)
[ 22.236] GeForce 7 (G7x)
[ 22.236] GeForce 8 (G8x)
[ 22.236] GeForce GTX 200 (NVA0)
[ 22.236] GeForce GTX 400 (NVC0)
[ 22.236] (--) using VT number 7

[ 22.239] (EE) [drm] KMS not enabled
[ 22.239] (EE) No devices detected.

Regards,
Guillaume

Guillaume Thouvenin

2013-07-10 15:30:17 UTC

> NVIDIA: could not open the device file /dev/nvidia0 (Input/output error).
> Unable to determine the device handle for GPU 0000:00:03.0: Unknown Error

I looked with strace and I can see this:

stat("/dev/nvidiactl", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 255), ...}) = 0
open("/dev/nvidiactl", O_RDWR) = 3
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
ioctl(3, 0xc04846d2, 0x7fff95a7f380) = 0
ioctl(3, 0xc00446ca, 0x7f7692482000) = 0
ioctl(3, 0xc70046c8, 0x7f7692482060) = 0
ioctl(3, 0xc020462b, 0x7fff95a7f3d0) = 0
ioctl(3, 0xc020462a, 0x7fff95a7f3b0) = 0
ioctl(3, 0xc020462a, 0x7fff95a7f3b0) = 0
ioctl(3, 0xc020462a, 0x7fff95a7f410) = 0
open("/proc/driver/nvidia/params", O_RDONLY) = 4
fstat(4, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f7692e8b000
read(4, "Mobile: 4294967295\nResmanDebugLe"..., 1024) = 417
close(4) = 0
munmap(0x7f7692e8b000, 4096) = 0
stat("/dev/nvidia0", {st_mode=S_IFCHR|0666, st_rdev=makedev(195, 0), ...}) = 0
open("/dev/nvidia0", O_RDWR) = -1 EIO (Input/output error)
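
The ioctl request numbers in the trace are opaque, but they can at least be split into the fields that the kernel's `_IOC()` macro packs on x86: 2 bits of direction, 14 bits of argument size, 8 bits of type and 8 bits of command number. The sketch below does only that arithmetic. What each NVIDIA command means is not documented in this thread, and the function name `decode_ioctl` is illustrative.

```shell
#!/bin/sh
# Split a Linux ioctl request number into the fields packed by the
# kernel's _IOC() macro on x86: dir (2 bits), size (14), type (8), nr (8).
decode_ioctl() {
    req=$(($1))
    dir=$(( (req >> 30) & 0x3 ))
    size=$(( (req >> 16) & 0x3fff ))
    type=$(( (req >> 8) & 0xff ))
    nr=$(( req & 0xff ))
    case $dir in
        0) d=none ;;
        1) d=write ;;
        2) d=read ;;
        3) d=read/write ;;
    esac
    printf 'dir=%s size=%d type=0x%02x nr=0x%02x\n' "$d" "$size" "$type" "$nr"
}
```

Applied to the first call in the trace, `decode_ioctl 0xc04846d2` gives a read/write request with a 72-byte argument, type 0x46 and command 0xd2. All of the ioctl calls on /dev/nvidiactl return 0; the trace fails only at the open of /dev/nvidia0, which returns EIO.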

I'll keep digging...
