Discussion:
[PATCH/RFC 3/9] s390 guest detection
(too old to reply)
Carsten Otte
2007-05-11 17:35:55 UTC
Permalink
From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>

This patch adds functionality to detect if the kernel runs under an s390host
hypervisor. A macro MACHINE_IS_GUEST is exported for device drivers. This
allows drivers to skip device detection if the systems runs non-virtualized.

Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

---
arch/s390/kernel/early.c | 4 ++++
arch/s390/kernel/setup.c | 9 ++++++---
include/asm-s390/setup.h | 1 +
3 files changed, 11 insertions(+), 3 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/setup.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/setup.c
+++ linux-2.6.21/arch/s390/kernel/setup.c
@@ -744,9 +744,12 @@ setup_arch(char **cmdline_p)
"This machine has an IEEE fpu\n" :
"This machine has no IEEE fpu\n");
#else /* CONFIG_64BIT */
- printk((MACHINE_IS_VM) ?
- "We are running under VM (64 bit mode)\n" :
- "We are running native (64 bit mode)\n");
+ if (MACHINE_IS_VM)
+ printk("We are running under VM (64 bit mode)\n");
+ else if (MACHINE_IS_GUEST)
+ printk("We are running on a non z/VM host\n");
+ else
+ printk("We are running native (64 bit mode)\n");
#endif /* CONFIG_64BIT */

/* Save unparsed command line copy for /proc/cmdline */
Index: linux-2.6.21/include/asm-s390/setup.h
===================================================================
--- linux-2.6.21.orig/include/asm-s390/setup.h
+++ linux-2.6.21/include/asm-s390/setup.h
@@ -61,6 +61,7 @@ extern unsigned long machine_flags;
#define MACHINE_IS_VM (machine_flags & 1)
#define MACHINE_IS_P390 (machine_flags & 4)
#define MACHINE_HAS_MVPG (machine_flags & 16)
+#define MACHINE_IS_GUEST (machine_flags & 64)
#define MACHINE_HAS_IDTE (machine_flags & 128)
#define MACHINE_HAS_DIAG9C (machine_flags & 256)

Index: linux-2.6.21/arch/s390/kernel/early.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/early.c
+++ linux-2.6.21/arch/s390/kernel/early.c
@@ -139,6 +139,10 @@ static noinline __init void detect_machi
/* Running on a P/390 ? */
if (cpuinfo->cpu_id.machine == 0x7490)
machine_flags |= 4;
+
+ /* Running under a host ? */
+ if (cpuinfo->cpu_id.version == 0xfe)
+ machine_flags |= 64;
}

#ifdef CONFIG_64BIT



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-11 17:35:53 UTC
Permalink
From: Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+***@public.gmane.org>

Add interface which allows a process to start a virtual machine.

To keep things easy each thread group is allowed to have only one
virtual machine and each thread of the thread group can only control
one virtual cpu of the virtual machine. All the information about
the virtual machines/cpus can be found via the thread_info structures
of the participating threads.

This patch adds three new s390 specific system calls:

long sys_s390host_add_cpu(unsigned long addr, unsigned long flags,
struct sie_block __user *sie_template)

Adds a new cpu to a the virtual machine that belongs to the current
thread group. If no virtual machine exists it will be created. In
addition two pages will be allocated and mapped at <addr> into the
address space of the process. These two pages are used so user space
and kernel space can easily exchange/modify the state of the
corresponding virtual cpu without a ton of copy_from/to_user calls.
The sie_template is a pointer to a data structure that contains
initial information how the virtual cpu should be setup. The
resulting block will be used as a parameter to issue the sie (start
interpretive execution) instruction which starts a virtual cpu.

int sys_s390host_remove_cpu(void)

Removes a virtual cpu from a virtual machine.

int sys_s390host_sie(unsigned long action)

Starts / re-enters the virtual cpu of the virtual machine that the
thread belongs to, if any.

Please note that this patch is nothing more than a proof-of-concept
and may contain quite a few bugs.
Since we want to convert to use kvm instead, most of this will be
dropped anyway. But maybe this is of interest for others as well.

Signed-off-by: Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+***@public.gmane.org>
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

---
arch/s390/Kconfig | 7
arch/s390/Makefile | 2
arch/s390/host/Makefile | 5
arch/s390/host/s390_intercept.c | 42 ++++
arch/s390/host/s390host.c | 418 ++++++++++++++++++++++++++++++++++++++++
arch/s390/host/s390host.h | 16 +
arch/s390/host/sie64a.S | 38 +++
arch/s390/kernel/asm-offsets.c | 2
arch/s390/kernel/process.c | 15 +
arch/s390/kernel/setup.c | 4
arch/s390/kernel/syscalls.S | 3
include/asm-s390/sie64.h | 279 ++++++++++++++++++++++++++
include/asm-s390/thread_info.h | 8
include/asm-s390/unistd.h | 5
kernel/sys_ni.c | 3
15 files changed, 842 insertions(+), 5 deletions(-)

Index: linux-2.6.21/arch/s390/kernel/asm-offsets.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/asm-offsets.c
+++ linux-2.6.21/arch/s390/kernel/asm-offsets.c
@@ -44,5 +44,7 @@ int main(void)
DEFINE(__SF_BACKCHAIN, offsetof(struct stack_frame, back_chain),);
DEFINE(__SF_GPRS, offsetof(struct stack_frame, gprs),);
DEFINE(__SF_EMPTY, offsetof(struct stack_frame, empty1),);
+ BLANK();
+ DEFINE(__SIE_USER_gprs, offsetof(struct sie_user, gprs),);
return 0;
}
Index: linux-2.6.21/arch/s390/kernel/syscalls.S
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/syscalls.S
+++ linux-2.6.21/arch/s390/kernel/syscalls.S
@@ -322,3 +322,6 @@ NI_SYSCALL /* 310 sys_move_pages *
SYSCALL(sys_getcpu,sys_getcpu,sys_getcpu_wrapper)
SYSCALL(sys_epoll_pwait,sys_epoll_pwait,compat_sys_epoll_pwait_wrapper)
SYSCALL(sys_utimes,sys_utimes,compat_sys_utimes_wrapper)
+SYSCALL(sys_ni_syscall,sys_s390host_add_cpu,sys_ni_syscall)
+SYSCALL(sys_ni_syscall,sys_s390host_remove_cpu,sys_ni_syscall)
+SYSCALL(sys_ni_syscall,sys_s390host_sie,sys_ni_syscall)
Index: linux-2.6.21/arch/s390/host/Makefile
===================================================================
--- /dev/null
+++ linux-2.6.21/arch/s390/host/Makefile
@@ -0,0 +1,5 @@
+#
+# Makefile for the s390host components.
+#
+
+obj-$(CONFIG_S390_HOST) += s390host.o sie64a.o s390_intercept.o
Index: linux-2.6.21/arch/s390/host/sie64a.S
===================================================================
--- /dev/null
+++ linux-2.6.21/arch/s390/host/sie64a.S
@@ -0,0 +1,38 @@
+/*
+ * arch/s390/host/sie64a.S
+ * low level sie call
+ *
+ * Copyright IBM Corp. 2007
+ * Author(s): Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * License : GPL
+ */
+
+#include <linux/errno.h>
+#include <asm/asm-offsets.h>
+
+SP_R6 = 6 * 8 # offset into stackframe
+
+ .globl sie64a
+sie64a:
+ stmg %r6,%r15,SP_R6(%r15) # save register on entry
+ lgr %r14,%r2 # pointer to program parms
+ aghi %r2,4096
+ lmg %r0,%r13,__SIE_USER_gprs(%r2) # load guest gprs 0-13
+sie_inst:
+ sie 0(%r14)
+ aghi %r14,4096
+ stmg %r0,%r13,__SIE_USER_gprs(%r14) # save guest gprs 0-13
+ lghi %r2,0
+ lmg %r6,%r15,SP_R6(%r15)
+ br %r14
+
+sie_err:
+ aghi %r14,4096
+ stmg %r0,%r13,__SIE_USER_gprs(%r14) # save guest gprs 0-13
+ lghi %r2,-EFAULT
+ lmg %r6,%r15,SP_R6(%r15)
+ br %r14
+
+ .section __ex_table,"a"
+ .quad sie_inst,sie_err
+ .previous
Index: linux-2.6.21/arch/s390/host/s390host.c
===================================================================
--- /dev/null
+++ linux-2.6.21/arch/s390/host/s390host.c
@@ -0,0 +1,418 @@
+/*
+ * s390host.c -- hosting zSeries Linux virtual engines
+ *
+ * Copyright IBM Corp. 2007
+ * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>,
+ * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>,
+ * Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * License : GPL
+ */
+
+#include <linux/pagemap.h>
+#include <linux/module.h>
+#include <linux/fs.h>
+#include <linux/mm.h>
+#include <linux/init.h>
+#include <linux/file.h>
+#include <linux/mman.h>
+#include <linux/mutex.h>
+#include <asm/uaccess.h>
+#include <asm/processor.h>
+#include <asm/tlbflush.h>
+#include <asm/semaphore.h>
+#include <asm/sie64.h>
+#include "s390host.h"
+
+static int s390host_do_action(unsigned long, struct sie_io *);
+
+static DEFINE_MUTEX(s390host_init_mutex);
+
+static void s390host_get_data(struct s390host_data *data)
+{
+ atomic_inc(&data->count);
+}
+
+void s390host_put_data(struct s390host_data *data)
+{
+ int cpu;
+
+ if (atomic_dec_return(&data->count))
+ return;
+
+ for (cpu = 0; cpu < S390HOST_MAX_CPUS; cpu++)
+ if (data->sie_io[cpu])
+ free_page((unsigned long)data->sie_io[cpu]);
+
+ if (data->sca_block)
+ free_page((unsigned long)data->sca_block);
+
+ kfree(data);
+}
+
+static void s390host_vma_close(struct vm_area_struct *vma)
+{
+ s390host_put_data(vma->vm_private_data);
+}
+
+static struct page *s390host_vma_nopage(struct vm_area_struct *vma,
+ unsigned long address, int *type)
+{
+ return NOPAGE_SIGBUS;
+}
+
+static struct vm_operations_struct s390host_vmops = {
+ .close = s390host_vma_close,
+ .nopage = s390host_vma_nopage,
+};
+
+static struct s390host_data *get_s390host_context(void)
+{
+ struct thread_info *tif;
+ struct sca_block *sca_block = NULL;
+ struct s390host_data *data = NULL;
+ struct task_struct *tsk;
+
+ /* zlh context for current thread already created? */
+ tif = current_thread_info();
+ if (tif->s390host_data)
+ return tif->s390host_data;
+
+ /* zlh context in thread group available? */
+ write_lock_irq(&tasklist_lock);
+ tsk = next_thread(current);
+ for (; tsk != current; tsk = next_thread(tsk)) {
+ data = tsk->thread_info->s390host_data;
+ if (data) {
+ s390host_get_data(data);
+ tif->s390host_data = data;
+ break;
+ }
+ }
+ write_unlock_irq(&tasklist_lock);
+
+ if (data)
+ return data;
+
+ /* create new context */
+ data = kzalloc(sizeof(*data), GFP_KERNEL);
+
+ if (!data)
+ return NULL;
+
+ sca_block = (struct sca_block *)get_zeroed_page(GFP_KERNEL);
+
+ if (!sca_block) {
+ kfree(data);
+ return NULL;
+ }
+
+ data->sca_block = sca_block;
+ tif->s390host_data = data;
+ s390host_get_data(data);
+
+ return data;
+}
+
+static unsigned long
+s390host_create_io_area(unsigned long addr, unsigned long flags,
+ unsigned long io_addr, struct s390host_data *data)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ unsigned long ret;
+
+ flags &= MAP_FIXED;
+ addr = get_unmapped_area(NULL, addr, 2 * PAGE_SIZE, 0, flags);
+
+ if (addr & ~PAGE_MASK)
+ return addr;
+
+ vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
+
+ if (!vma)
+ return -ENOMEM;
+
+ vma->vm_mm = mm;
+ vma->vm_start = addr;
+ vma->vm_end = addr + 2 * PAGE_SIZE;
+ vma->vm_flags = VM_READ | VM_MAYREAD | VM_IO | VM_RESERVED;
+ vma->vm_flags |= VM_SHARED | VM_MAYSHARE | VM_DONTCOPY;
+
+#if 1 /* FIXME: write access until sys_s390host_sie interface is final */
+ vma->vm_flags |= VM_WRITE | VM_MAYWRITE;
+#endif
+
+ vma->vm_page_prot = protection_map[vma->vm_flags & 0xf];
+ vma->vm_private_data = data;
+ vma->vm_ops = &s390host_vmops;
+
+ down_write(&mm->mmap_sem);
+ ret = insert_vm_struct(mm, vma);
+ if (ret) {
+ kmem_cache_free(vm_area_cachep, vma);
+ goto out;
+ }
+ s390host_get_data(data);
+ mm->total_vm += 2;
+ vm_insert_page(vma, addr, virt_to_page(io_addr));
+
+ ret = split_vma(mm, vma, addr + PAGE_SIZE, 0);
+ if (ret)
+ goto out;
+ s390host_get_data(data);
+
+ vma = find_vma(mm, addr + PAGE_SIZE);
+ vma->vm_flags |= VM_WRITE | VM_MAYWRITE;
+ vma->vm_page_prot = protection_map[vma->vm_flags & 0xf];
+ vm_insert_page(vma, addr + PAGE_SIZE,
+ virt_to_page(io_addr + PAGE_SIZE));
+ ret = addr;
+out:
+ up_write(&mm->mmap_sem);
+ return ret;
+}
+
+long sys_s390host_add_cpu(unsigned long addr, unsigned long flags,
+ struct sie_block __user *sie_template)
+{
+ struct sie_block *sie_block;
+ struct sie_io *sie_io;
+ struct sca_block *sca_block;
+ struct s390host_data *data = NULL;
+ unsigned long ret;
+ __u16 cpu;
+
+ if (current_thread_info()->sie_cpu != -1)
+ return -EINVAL;
+
+ if (copy_from_user(&cpu, &sie_template->icpua, sizeof(u16)))
+ return -EFAULT;
+
+ if (cpu >= S390HOST_MAX_CPUS)
+ return -EINVAL;
+
+ mutex_lock(&s390host_init_mutex);
+
+ data = get_s390host_context();
+ if (!data) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
+
+ sca_block = data->sca_block;
+ if (sca_block->mcn & (1UL << (S390HOST_MAX_CPUS - 1 - cpu))) {
+ ret = -EINVAL;
+ goto out_err;
+ }
+
+ if (!data->sie_io[cpu]) {
+ unsigned long tmp;
+
+ /* allocate two pages: 1st is r/o 2nd r/w area */
+ tmp = __get_free_pages(GFP_KERNEL, 1);
+ if (!tmp) {
+ ret = -ENOMEM;
+ goto out_err;
+ }
+ split_page(virt_to_page(tmp), 1);
+ data->sie_io[cpu] = (struct sie_io *)tmp;
+ }
+
+ sie_io = data->sie_io[cpu];
+ memset(sie_io, 0, 2 * PAGE_SIZE);
+
+ sie_block = &sie_io->sie_kernel.sie_block;
+ sca_block->cpu[cpu].sda = (__u64)sie_block;
+
+ if (copy_from_user(sie_block, sie_template, sizeof(struct sie_block))) {
+ ret = -EFAULT;
+ goto out_err;
+ }
+ sie_block->icpua = cpu;
+
+ ret = s390host_create_io_area(addr, flags, (unsigned long)sie_io, data);
+
+ if (ret & ~PAGE_MASK)
+ goto out_err;
+
+ sca_block->mcn |= 1UL << (S390HOST_MAX_CPUS - 1 - cpu);
+ sie_block->scaoh = (__u32)(((__u64)sca_block) >> 32);
+ sie_block->scaol = (__u32)(__u64)sca_block;
+ current_thread_info()->sie_cpu = cpu;
+ goto out;
+out_err:
+ if (data)
+ s390host_put_data(data);
+out:
+ mutex_unlock(&s390host_init_mutex);
+ return ret;
+}
+
+int sys_s390host_remove_cpu(void)
+{
+ struct sca_block *sca_block;
+ int cpu;
+
+ cpu = current_thread_info()->sie_cpu;
+ if (cpu == -1)
+ return -EINVAL;
+
+ mutex_lock(&s390host_init_mutex);
+ sca_block = current_thread_info()->s390host_data->sca_block;
+ sca_block->mcn &= ~(1UL << (S390HOST_MAX_CPUS - 1 - cpu));
+ current_thread_info()->sie_cpu = -1;
+ mutex_unlock(&s390host_init_mutex);
+ return 0;
+}
+
+int sys_s390host_sie(unsigned long action)
+{
+ struct sie_kernel *sie_kernel;
+ struct sie_user *sie_user;
+ struct sie_io *sie_io;
+ int cpu;
+ int ret = 0;
+
+ cpu = current_thread_info()->sie_cpu;
+ if (cpu == -1)
+ return -EINVAL;
+
+ sie_io = current_thread_info()->s390host_data->sie_io[cpu];
+
+ if (action)
+ ret = s390host_do_action(action, sie_io);
+ if (ret)
+ goto out_err;
+ sie_kernel = &sie_io->sie_kernel;
+ sie_user = &sie_io->sie_user;
+
+ save_fp_regs(&sie_kernel->host_fpregs);
+ save_access_regs(sie_kernel->host_acrs);
+ sie_user->guest_fpregs.fpc &= FPC_VALID_MASK;
+ restore_fp_regs(&sie_user->guest_fpregs);
+ restore_access_regs(sie_user->guest_acrs);
+ memcpy(&sie_kernel->sie_block.gg14, &sie_user->gprs[14], 16);
+again:
+ if (need_resched())
+ schedule();
+
+ sie_kernel->sie_block.icptcode = 0;
+ ret = sie64a(sie_kernel);
+ if (ret)
+ goto out;
+
+ if (signal_pending(current)) {
+ ret = -EINTR;
+ goto out;
+ }
+
+ ret = s390host_handle_intercept(sie_kernel);
+
+ /* intercept reason was handled, enter SIE again */
+ if (!ret)
+ goto again;
+
+ /* if kernel cannot hanle intercept, pass to the user */
+ if (ret == -ENOTSUPP)
+ ret = 0;
+
+out:
+ memcpy(&sie_user->gprs[14], &sie_kernel->sie_block.gg14, 16);
+ save_fp_regs(&sie_user->guest_fpregs);
+ save_access_regs(sie_user->guest_acrs);
+ restore_fp_regs(&sie_kernel->host_fpregs);
+ restore_access_regs(sie_kernel->host_acrs);
+out_err:
+ return ret;
+}
+
+static void s390host_vsmxm_local_update(struct sie_io *sie_io)
+{
+ struct sie_kernel *local_sie_kernel;
+ struct sie_user *sie_user;
+ atomic_t *cpuflags;
+ int old, new;
+
+ mutex_lock(&s390host_init_mutex);
+
+ sie_user = &sie_io->sie_user;
+ local_sie_kernel = &sie_io->sie_kernel;
+
+ cpuflags = &local_sie_kernel->sie_block.cpuflags;
+ do {
+ old = atomic_read(cpuflags);
+ new = old | sie_user->vsmxm_or_local;
+ new &= sie_user->vsmxm_and_local;
+ } while (atomic_cmpxchg(cpuflags, old, new) != old);
+
+ mutex_unlock(&s390host_init_mutex);
+ return;
+}
+
+static int s390host_vsmxm_dist_update(struct sie_io *sie_io)
+{
+ struct sie_kernel *dist_sie_kernel;
+ struct sie_user *sie_user;
+ struct sca_block *sca_block;
+ struct thread_info *tif;
+ atomic_t *cpuflags;
+ int cpu;
+ int old, new;
+ int rc = -EINVAL;
+
+ mutex_lock(&s390host_init_mutex);
+
+ sie_user = &sie_io->sie_user;
+ cpu = sie_user->vsmxm_cpuid;
+
+ if (cpu >= S390HOST_MAX_CPUS)
+ goto out;
+
+ tif = current_thread_info();
+ sca_block = tif->s390host_data->sca_block;
+ if (!(sca_block->mcn & (1UL << (S390HOST_MAX_CPUS - 1 - cpu))))
+ goto out;
+
+ dist_sie_kernel = &((tif->s390host_data->sie_io[cpu])->sie_kernel);
+
+ cpuflags = &dist_sie_kernel->sie_block.cpuflags;
+ do {
+ old = atomic_read(cpuflags);
+ new = old | sie_user->vsmxm_or;
+ new &= sie_user->vsmxm_and;
+ } while (atomic_cmpxchg(cpuflags, old, new) != old);
+
+ rc = 0;
+out:
+ mutex_unlock(&s390host_init_mutex);
+ return rc;
+}
+
+static int s390host_do_action(unsigned long action, struct sie_io *sie_io)
+{
+ void *src;
+ void *dest;
+ int rc = 0;
+
+ if (action & SIE_BLOCK_UPDATE) {
+ src = &(sie_io->sie_user.sie_block);
+ dest = &(sie_io->sie_kernel.sie_block);
+
+ memcpy(dest + 4, src + 4, 88);
+ memcpy(dest + 96, src + 96, 4);
+ memcpy(dest + 104, src + 104, 408);
+ }
+
+ if (action & SIE_UPDATE_PSW)
+ sie_io->sie_kernel.sie_block.psw.gpsw = sie_io->sie_user.psw;
+
+ if (action & SIE_FLUSH_TLB)
+ flush_tlb_mm(current->mm);
+
+ if (action & SIE_VSMXM_LOCAL_UPDATE)
+ s390host_vsmxm_local_update(sie_io);
+
+ if (action & SIE_VSMXM_DIST_UPDATE)
+ rc = s390host_vsmxm_dist_update(sie_io);
+ return rc;
+}
Index: linux-2.6.21/include/asm-s390/unistd.h
===================================================================
--- linux-2.6.21.orig/include/asm-s390/unistd.h
+++ linux-2.6.21/include/asm-s390/unistd.h
@@ -251,8 +251,11 @@
#define __NR_getcpu 311
#define __NR_epoll_pwait 312
#define __NR_utimes 313
+#define __NR_s390host_add_cpu 314
+#define __NR_s390host_remove_cpu 315
+#define __NR_s390host_sie 316

-#define NR_syscalls 314
+#define NR_syscalls 317

/*
* There are some system calls that are not present on 64 bit, some
Index: linux-2.6.21/include/asm-s390/sie64.h
===================================================================
--- /dev/null
+++ linux-2.6.21/include/asm-s390/sie64.h
@@ -0,0 +1,279 @@
+/*
+ * include/asm-s390/sie64.h
+ *
+ * Copyright IBM Corp. 2007
+ * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>,
+ * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>,
+ * Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#ifndef _ASM_S390_SIE64_H
+#define _ASM_S390_SIE64_H
+
+#include <asm/atomic.h>
+#include <asm/ptrace.h> //FIXME: psw_t definition needs relocation
+
+struct sie_block {
+ atomic_t cpuflags; /* 0x0000 */
+ __u32 prefix; /* 0x0004 */
+ __u32 :32; /* 0x0008 */
+ __u32 :32; /* 0x000c */
+ __u64 :64; /* 0x0010 */
+ __u64 :64; /* 0x0018 */
+ __u64 :64; /* 0x0020 */
+ __u64 cputm; /* 0x0028 */
+ __u64 ckc; /* 0x0030 */
+ __u64 epoch; /* 0x0038 */
+ __u8 svcnn :1, /* 0x0040 */
+ svc1c :1,
+ svc2c :1,
+ svc3c :1,
+ :4;
+ __u8 svc1n; /* 0x0041 */
+ __u8 svc2n; /* 0x0042 */
+ __u8 svc3n; /* 0x0043 */
+ __u16 lctl0 :1, /* 0x0044 */
+ lctl1 :1,
+ lctl2 :1,
+ lctl3 :1,
+ lctl4 :1,
+ lctl5 :1,
+ lctl6 :1,
+ lctl7 :1,
+ lctl8 :1,
+ lctl9 :1,
+ lctla :1,
+ lctlb :1,
+ lctlc :1,
+ lctld :1,
+ lctle :1,
+ lctlf :1;
+ __s16 icpua; /* 0x0046 */
+ __u32 icpop :1, /* 0x0048 */
+ icpro :1,
+ icprg :1,
+ :4,
+ icipte :1,
+ :1, /* 0x0049 */
+ iclpsw :1,
+ icptlb :1,
+ icssm :1,
+ icbsa :1,
+ icstctl :1,
+ icstnsm :1,
+ icstosm :1,
+ icstck :1, /* 0x004a */
+ iciske :1,
+ icsske :1,
+ icrrbe :1,
+ icpc :1,
+ icpt :1,
+ ictprot :1,
+ iclasp :1,
+ :1, /* 0x004b */
+ icstpt :1,
+ icsckc :1,
+ :1,
+ icpr :1,
+ icbakr :1,
+ icpg :1,
+ :1;
+ __u32 ecext :1, /* 0x004c */
+ ecint :1,
+ ecwait :1,
+ ecsigp :1,
+ ecalt :1,
+ ecio2 :1,
+ :1,
+ ecmvp :1;
+ __u8 eca1; /* 0x004d */
+ __u8 eca2; /* 0x004e */
+ __u8 eca3; /* 0x004f */
+ __u8 icptcode; /* 0x0050 */
+ __u8 :6, /* 0x0051 */
+ icif :1,
+ icex :1;
+ __u16 ihcpu; /* 0x0052 */
+ __u16 :16; /* 0x0054 */
+ struct {
+ union {
+ __u16 ipa; /* 0x0056 */
+ __u16 inst; /* 0x0056 */
+ struct {
+ union {
+ __u8 ipa0; /* 0x0056 */
+ __u8 viwho; /* 0x0056 */
+ };
+ union {
+ __u8 ipa1; /* 0x0057 */
+ __u8 viwhen; /* 0x0057 */
+ };
+ };
+ };
+ union {
+ __u32 ipb; /* 0x0058 */
+ struct {
+ union {
+ __u16 ipbh0; /* 0x0058 */
+ __u16 viwhy; /* 0x0058 */
+ struct {
+ __u8 ipb0; /* 0x0058 */
+ __u8 ipb1; /* 0x0059 */
+ };
+ };
+ union {
+ __u16 ipbh1; /* 0x005a */
+ struct {
+ __u8 ipb2; /* 0x005a */
+ __u8 ipb3; /* 0x005b */
+ };
+ };
+ };
+ };
+ } __attribute__((packed));
+ __u32 scaoh; /* 0x005c */
+ union {
+ __u32 rcp; /* 0x0060 */
+ struct {
+ __u8 ska :1, /* 0x0060 */
+ skaip :1,
+ :6;
+ __u8 ecb :8; /* 0x0061 */
+ __u8 :3, /* 0x0062 */
+ cpby :1,
+ :4;
+ __u8 :8; /* 0x0063 */
+ };
+ };
+ __u32 scaol; /* 0x0064 */
+ __u32 :32; /* 0x0068 */
+ union {
+ __u32 todpr; /* 0x006c */
+ struct {
+ __u16 :16; /* 0x006c */
+ __u16 todpf; /* 0x006e */
+ };
+ } __attribute__((packed));
+ __u32 gisa; /* 0x0070 */
+ __u32 iopct; /* 0x0074 */
+ __u32 rsvd3; /* 0x0078 */
+ __u32 :32; /* 0x007c */
+ __u64 gmsor; /* 0x0080 */
+ __u64 gmslm; /* 0x0088 */
+ union {
+ psw_t gpsw; /* 0x0090 */
+ struct {
+ __u64 pswh; /* 0x0090 */
+
+ __u64 pswl; /* 0x0098 */
+ };
+ } psw;
+ __u64 gg14; /* 0x00a0 */
+ __u64 gg15; /* 0x00a8 */
+ __u64 :64; /* 0x00b0 */
+ __u64 :16, /* 0x00b8 */
+ xso :24,
+ xsl :24;
+ union {
+ __u8 uzp0[56]; /* 0x00c0 */
+ struct {
+ __u32 exmsf; /* 0x00c0 */
+ union {
+ __u32 iexcf; /* 0x00c4 */
+ struct {
+ __u16 iexca; /* 0x00c4 */
+ __u16 iexcd; /* 0x00c6 */
+ };
+ };
+ __u16 svcil; /* 0x00c8 */
+ __u16 svcnt; /* 0x00ca */
+ __u16 iprcl; /* 0x00cc */
+ __u16 iprcc; /* 0x00ce */
+ __u32 itrad; /* 0x00d0 */
+ __u32 imncl; /* 0x00d4 */
+ __u64 gpera; /* 0x00d8 */
+ __u8 excpar; /* 0x00e0 */
+ __u8 perar; /* 0x00e1 */
+ __u8 oprid; /* 0x00e2 */
+ __u8 :8; /* 0x00e3 */
+ __u32 :32; /* 0x00e4 */
+ __u64 gtrad; /* 0x00e8 */
+ __u32 :32; /* 0x00f0 */
+ __u32 :32; /* 0x00f4 */
+ };
+ };
+ __u16 :16; /* 0x00f8 */
+ __u16 ief; /* 0x00fa */
+ __u32 apcbk; /* 0x00fc */
+ __u64 gcr[16]; /* 0x0100 */
+ __u8 reserved[128]; /* 0x0180 */
+} __attribute__((packed));
+
+struct sie_kernel {
+ struct sie_block sie_block;
+ s390_fp_regs host_fpregs;
+ int host_acrs[NUM_ACRS];
+} __attribute__((packed,aligned(4096)));
+
+#define SIE_UPDATE_PSW (1UL << 0)
+#define SIE_FLUSH_TLB (1UL << 1)
+#define SIE_ISKE (1UL << 2)
+#define SIE_SSKE (1UL << 3)
+#define SIE_BLOCK_UPDATE (1UL << 4)
+#define SIE_VSMXM_LOCAL_UPDATE (1UL << 5)
+#define SIE_VSMXM_DIST_UPDATE (1UL << 6)
+
+struct sie_skey_parm {
+ unsigned long sk_reg;
+ unsigned long sk_addr;
+};
+
+struct sie_user {
+ struct sie_block sie_block;
+ psw_t psw;
+ unsigned long gprs[NUM_GPRS];
+ s390_fp_regs guest_fpregs;
+ int guest_acrs[NUM_ACRS];
+ struct sie_skey_parm iske_parm;
+ struct sie_skey_parm sske_parm;
+ int vsmxm_or_local;
+ int vsmxm_and_local;
+ int vsmxm_or;
+ int vsmxm_and;
+ int vsmxm_cpuid;
+} __attribute__((packed,aligned(4096)));
+
+struct sie_io {
+ struct sie_kernel sie_kernel;
+ struct sie_user sie_user;
+};
+
+struct sca_entry {
+ atomic_t scn;
+ __u64 reserved;
+ __u64 sda;
+ __u64 reserved2[2];
+}__attribute__((packed));
+
+struct sca_block {
+ __u64 ipte_control;
+ __u64 reserved[5];
+ __u64 mcn;
+ __u64 reserved2;
+ struct sca_entry cpu[64];
+}__attribute__((packed));
+
+#define S390HOST_MAX_CPUS 64
+
+struct s390host_data {
+ atomic_t count;
+ struct sie_io *sie_io[S390HOST_MAX_CPUS];
+ struct sca_block *sca_block;
+};
+
+/* function definitions */
+extern int sie64a(struct sie_kernel *);
+extern void s390host_put_data(struct s390host_data *);
+
+#endif /* _ASM_S390_SIE64_H */
Index: linux-2.6.21/arch/s390/Makefile
===================================================================
--- linux-2.6.21.orig/arch/s390/Makefile
+++ linux-2.6.21/arch/s390/Makefile
@@ -85,7 +85,7 @@ LDFLAGS_vmlinux := -e start
head-y := arch/s390/kernel/head.o arch/s390/kernel/init_task.o

core-y += arch/s390/mm/ arch/s390/kernel/ arch/s390/crypto/ \
- arch/s390/appldata/ arch/s390/hypfs/
+ arch/s390/appldata/ arch/s390/hypfs/ arch/s390/host/
libs-y += arch/s390/lib/
drivers-y += drivers/s390/
drivers-$(CONFIG_MATHEMU) += arch/s390/math-emu/
Index: linux-2.6.21/kernel/sys_ni.c
===================================================================
--- linux-2.6.21.orig/kernel/sys_ni.c
+++ linux-2.6.21/kernel/sys_ni.c
@@ -122,6 +122,9 @@ cond_syscall(sys32_sysctl);
cond_syscall(ppc_rtas);
cond_syscall(sys_spu_run);
cond_syscall(sys_spu_create);
+cond_syscall(sys_s390host_add_cpu);
+cond_syscall(sys_s390host_remove_cpu);
+cond_syscall(sys_s390host_sie);

/* mmu depending weak syscall entries */
cond_syscall(sys_mprotect);
Index: linux-2.6.21/arch/s390/kernel/process.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/process.c
+++ linux-2.6.21/arch/s390/kernel/process.c
@@ -274,12 +274,23 @@ int copy_thread(int nr, unsigned long cl
#endif /* CONFIG_64BIT */
/* start new process with ar4 pointing to the correct address space */
p->thread.mm_segment = get_fs();
- /* Don't copy debug registers */
- memset(&p->thread.per_info,0,sizeof(p->thread.per_info));
+ /* Don't copy debug registers */
+ memset(&p->thread.per_info,0,sizeof(p->thread.per_info));
+ p->thread_info->s390host_data = NULL;
+ p->thread_info->sie_cpu = -1;

return 0;
}

+void free_thread_info(struct thread_info *ti)
+{
+#ifdef CONFIG_S390_HOST
+ if (ti->s390host_data)
+ s390host_put_data(ti->s390host_data);
+#endif
+ free_pages((unsigned long) (ti),THREAD_ORDER);
+}
+
asmlinkage long sys_fork(struct pt_regs regs)
{
return do_fork(SIGCHLD, regs.gprs[15], &regs, 0, NULL, NULL);
Index: linux-2.6.21/include/asm-s390/thread_info.h
===================================================================
--- linux-2.6.21.orig/include/asm-s390/thread_info.h
+++ linux-2.6.21/include/asm-s390/thread_info.h
@@ -38,6 +38,7 @@
#ifndef __ASSEMBLY__
#include <asm/processor.h>
#include <asm/lowcore.h>
+#include <asm/sie64.h>

/*
* low level task data that entry.S needs immediate access to
@@ -52,6 +53,8 @@ struct thread_info {
unsigned int cpu; /* current CPU */
int preempt_count; /* 0 => preemptable, <0 => BUG */
struct restart_block restart_block;
+ struct s390host_data *s390host_data; /* s390host data */
+ int sie_cpu; /* sie cpu number */
};

/*
@@ -67,6 +70,8 @@ struct thread_info {
.restart_block = { \
.fn = do_no_restart_syscall, \
}, \
+ .s390host_data = NULL, \
+ .sie_cpu = 0, \
}

#define init_thread_info (init_thread_union.thread_info)
@@ -81,7 +86,8 @@ static inline struct thread_info *curren
/* thread information allocation */
#define alloc_thread_info(tsk) ((struct thread_info *) \
__get_free_pages(GFP_KERNEL,THREAD_ORDER))
-#define free_thread_info(ti) free_pages((unsigned long) (ti),THREAD_ORDER)
+
+extern void free_thread_info(struct thread_info *);

#endif

Index: linux-2.6.21/arch/s390/Kconfig
===================================================================
--- linux-2.6.21.orig/arch/s390/Kconfig
+++ linux-2.6.21/arch/s390/Kconfig
@@ -153,6 +153,7 @@ config S390_SWITCH_AMODE

config S390_EXEC_PROTECT
bool "Data execute protection"
+ depends on !S390_HOST
select S390_SWITCH_AMODE
help
This option allows to enable a buffer overflow protection for user
@@ -514,6 +515,12 @@ config KEXEC
current kernel, and to start another kernel. It is like a reboot
but is independent of hardware/microcode support.

+config S390_HOST
+ bool "s390 host support (EXPERIMENTAL)"
+ depends on 64BIT && EXPERIMENTAL
+ select S390_SWITCH_AMODE
+ help
+ Select this option if you want to host guest Linux images
endmenu

source "net/Kconfig"
Index: linux-2.6.21/arch/s390/kernel/setup.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/setup.c
+++ linux-2.6.21/arch/s390/kernel/setup.c
@@ -394,7 +394,11 @@ static int __init early_parse_ipldelay(c
early_param("ipldelay", early_parse_ipldelay);

#ifdef CONFIG_S390_SWITCH_AMODE
+#ifdef CONFIG_S390_HOST
+unsigned int switch_amode = 1;
+#else
unsigned int switch_amode = 0;
+#endif
EXPORT_SYMBOL_GPL(switch_amode);

static void set_amode_and_uaccess(unsigned long user_amode,
Index: linux-2.6.21/arch/s390/host/s390_intercept.c
===================================================================
--- /dev/null
+++ linux-2.6.21/arch/s390/host/s390_intercept.c
@@ -0,0 +1,42 @@
+/*
+ * s390_intercept.c -- handle SIE intercept codes
+ *
+ * Copyright IBM Corp. 2007
+ * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ */
+
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <asm/sie64.h>
+#include <linux/pagemap.h>
+#include "s390host.h"
+
+static int s390host_handle_validity (struct sie_kernel *sie_kernel)
+{
+ if (sie_kernel->sie_block.viwhy == 0x37) {
+ //debug message here
+ fault_in_pages_writeable((void*)0 + S390HOST_ORIGIN,
+ PAGE_SIZE);
+ fault_in_pages_writeable((void*)(unsigned long)
+ sie_kernel->sie_block.prefix+
+ S390HOST_ORIGIN, 2*PAGE_SIZE);
+ return 0;
+ }
+ // debug message here
+ return -ENOTSUPP;
+}
+
+int s390host_handle_intercept(struct sie_kernel *sie_kernel)
+{
+ switch (sie_kernel->sie_block.icptcode) {
+ case 0x00:
+ case 0x24:
+ return 0;
+ case 0x20:
+ return s390host_handle_validity(sie_kernel);
+ default:
+ // debug message here
+ return -ENOTSUPP;
+ }
+}
+
Index: linux-2.6.21/arch/s390/host/s390host.h
===================================================================
--- /dev/null
+++ linux-2.6.21/arch/s390/host/s390host.h
@@ -0,0 +1,16 @@
+/*
+ * s390host.h -- hosting zSeries Linux virtual engines
+ *
+ * Copyright IBM Corp. 2007
+ * Author(s): Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>,
+ * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>,
+ * Heiko Carstens <heiko.carstens-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ */
+
+#ifndef __S390HOST_H
+#define __S390HOST_H
+#include <asm/sie64.h>
+#define S390HOST_ORIGIN 0
+
+int s390host_handle_intercept(struct sie_kernel *sie_kernel);
+#endif // defined __S390HOST_H



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-11 17:35:58 UTC
Permalink
From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

This patch adds support for a new bus type that manages paravirtualized
devices. The bus uses the s390 diagnose instruction to query devices, and
match them with the corresponding drivers.
Future enhancements should include hotplug and hotremoval of virtual devices
triggered by the host, and supend/resume of virtual devices for migration.

This code is s390 architecture specific, please review.

Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

---
arch/s390/Kconfig | 6 +
drivers/s390/Makefile | 2
drivers/s390/guest/Makefile | 6 +
drivers/s390/guest/vdev.c | 158 +++++++++++++++++++++++++++++++++++++++
drivers/s390/guest/vdev_device.c | 50 ++++++++++++
include/asm-s390/vdev.h | 53 +++++++++++++
6 files changed, 274 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/Kconfig
===================================================================
--- linux-2.6.21.orig/arch/s390/Kconfig
+++ linux-2.6.21/arch/s390/Kconfig
@@ -521,6 +521,12 @@ config S390_HOST
select S390_SWITCH_AMODE
help
Select this option if you want to host guest Linux images
+
+config S390_GUEST
+ bool "s390 guest support (EXPERIMENTAL)"
+ depends on 64BIT && EXPERIMENTAL
+ help
+ Select this option if you want to run the kernel under s390 linux
endmenu

source "net/Kconfig"
Index: linux-2.6.21/drivers/s390/Makefile
===================================================================
--- linux-2.6.21.orig/drivers/s390/Makefile
+++ linux-2.6.21/drivers/s390/Makefile
@@ -5,7 +5,7 @@
CFLAGS_sysinfo.o += -Iinclude/math-emu -Iarch/s390/math-emu -w

obj-y += s390mach.o sysinfo.o s390_rdev.o
-obj-y += cio/ block/ char/ crypto/ net/ scsi/
+obj-y += cio/ block/ char/ crypto/ net/ scsi/ guest/

drivers-y += drivers/s390/built-in.o

Index: linux-2.6.21/drivers/s390/guest/Makefile
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/Makefile
@@ -0,0 +1,6 @@
+#
+# s390 Linux virtual environment
+#
+
+obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o
+
Index: linux-2.6.21/drivers/s390/guest/vdev.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vdev.c
@@ -0,0 +1,158 @@
+/*
+ * vdev - guest os layer for device virtualization
+ *
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#include <asm/vdev.h>
+
+static void vdev_bus_release(struct device *);
+
+struct bus_type vdev_bus_type = {
+ .name = "vdev",
+ .match = vdev_match,
+ .probe = vdev_probe,
+};
+
+struct device vdev_bus = {
+ .bus_id = "vdev0",
+ .release = vdev_bus_release
+};
+
+int vdev_match(struct device * dev, struct device_driver *drv)
+{
+ struct vdev *vdev = to_vdev(dev);
+ struct vdev_driver *vdrv = to_vdrv(drv);
+
+ if (vdev->vdev_type == vdrv->vdev_type)
+ return 1;
+
+ return 0;
+}
+
+int vdev_probe(struct device * dev)
+{
+ struct vdev *vdev = to_vdev(dev);
+ struct vdev_driver *vdrv = to_vdrv(dev->driver);
+
+ return vdrv->probe(vdev);
+}
+
+static void vdev_bus_release (struct device *device)
+{
+ /* noop, static bus object */
+}
+
+static inline int vdev_diag_hotplug(char symname[128], char hostid[128])
+{
+ register char * __arg1 asm("2") = symname;
+ register char * __arg2 asm("3") = hostid;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,0x1e"
+ : "=d" (__svcres)
+ : "0" (__arg1),
+ "d" (__arg2)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+
+static int vdev_scan_coldplug(void)
+{
+ int rc;
+ struct vdev *device;
+
+ do {
+ device = kzalloc(sizeof(struct vdev), GFP_ATOMIC);
+ if (!device) {
+ rc = -ENOMEM;
+ goto out;
+ }
+ rc = vdev_diag_hotplug(device->symname, device->hostid);
+ if (rc == -ENODEV)
+ break;
+ if (rc < 0) {
+ printk (KERN_WARNING "vdev: error %d detecting" \
+ " initial devices\n", rc);
+ break;
+ }
+ device->vdev_type = rc;
+
+ //sanity: are strings terminated?
+ if ((strnlen(device->symname, 128) == 128) ||
+ (strnlen(device->hostid, 128) == 128)) {
+ // warn and discard device
+ printk ("vdev: illegal device entry received\n");
+ break;
+ }
+
+ rc = vdevice_register(device);
+ if (rc) {
+ kfree(device);
+ } else
+ switch (device->vdev_type) {
+ case VDEV_TYPE_DISK:
+ printk (KERN_INFO "vdev: storage device " \
+ "detected: %s\n", device->symname);
+ break;
+ case VDEV_TYPE_NET:
+ printk (KERN_INFO "vdev: network device " \
+ "detected: %s\n", device->symname);
+ break;
+ default:
+ printk (KERN_INFO "vdev: unknown device " \
+ "detected: %s\n", device->symname);
+ }
+ } while(1);
+ kfree (device);
+ out:
+ return 0;
+}
+
+
+int __init vdev_init(void)
+{
+ int rc;
+
+ rc = bus_register(&vdev_bus_type);
+ if (rc) {
+ printk (KERN_WARNING "vdev: failed to register bus type\n");
+ goto out;
+ }
+ rc = device_register(&vdev_bus);
+ if (rc) {
+ printk (KERN_WARNING "vdev: failed to register bus device\n");
+ goto bunregister;
+ }
+ printk (KERN_INFO "vdev: initialization complete\n");
+ rc = vdev_scan_coldplug();
+ if (rc) {
+ printk (KERN_WARNING "vdev: failed to scan devices\n");
+ goto dunregister;
+ }
+ goto out;
+ dunregister:
+ device_unregister(&vdev_bus);
+
+ bunregister:
+ bus_unregister(&vdev_bus_type);
+ out:
+ return rc;
+}
+
+void vdev_exit(void)
+{
+ bus_unregister(&vdev_bus_type);
+}
+
+module_init(vdev_init);
+module_exit(vdev_exit);
+MODULE_DESCRIPTION("Guest layer for device virtualization");
+MODULE_AUTHOR("Copyright IBM Corp. 2007");
+MODULE_LICENSE("GPL");
Index: linux-2.6.21/drivers/s390/guest/vdev_device.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vdev_device.c
@@ -0,0 +1,50 @@
+/*
+ * vdev - guest layer for device virtualization
+ *
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#include <asm/vdev.h>
+
+int vdev_driver_register (struct vdev_driver *vdrv)
+{
+ struct device_driver *drv = &vdrv->driver;
+
+ drv->bus = &vdev_bus_type;
+ drv->name = vdrv->name;
+
+ return driver_register(drv);
+}
+
+int vdevice_register(struct vdev *vdev)
+{
+ struct device *dev = &vdev->dev;
+ int ret,typesize;
+
+ dev->bus = &vdev_bus_type;
+ dev->parent = &vdev_bus;
+ memset(dev->bus_id, 0, BUS_ID_SIZE);
+ switch (vdev->vdev_type) {
+ case VDEV_TYPE_DISK:
+ strncpy (dev->bus_id, "block:", 6);
+ typesize=6;
+ break;
+ case VDEV_TYPE_NET:
+ strncpy (dev->bus_id, "net:", 4);
+ typesize=4;
+ break;
+ default:
+ strncpy (dev->bus_id, "unknown:", 8);
+ typesize=8;
+ break;
+ }
+ strncpy (dev->bus_id+typesize, vdev->symname, BUS_ID_SIZE-typesize-1);
+
+ ret = device_register(dev);
+
+ //FIXME: add device attribs here
+
+ return ret;
+}
Index: linux-2.6.21/include/asm-s390/vdev.h
===================================================================
--- /dev/null
+++ linux-2.6.21/include/asm-s390/vdev.h
@@ -0,0 +1,53 @@
+/*
+ * vdev - guest layer for device virtualization
+ *
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#ifndef __VDEV_H
+#define __VDEV_H
+#include <linux/device.h>
+
+/* in vdev.c */
+extern int vdev_match(struct device *, struct device_driver *);
+extern int vdev_probe (struct device *);
+
+extern struct device vdev_bus;
+extern struct bus_type vdev_bus_type;
+
+#define VDEV_TYPE_DISK 0
+#define VDEV_TYPE_NET 1
+
+struct vdev {
+ unsigned int vdev_type;
+ char symname[128];
+ char hostid[128];
+ struct vdev_driver *driver;
+ struct device dev;
+ void *drv_private;
+};
+
+struct vdev_driver {
+ struct module *owner;
+ int vdev_type;
+ int (*probe) (struct vdev *);
+ int (*set_online) (struct vdev *);
+ int (*set_offline) (struct vdev *);
+ int (*suspend) (struct vdev *);
+ int (*resume) (struct vdev *);
+ struct device_driver driver; /* higher level structure, don't init
+ this from your driver */
+ char *name;
+ void *drv_private;
+};
+
+#define to_vdev(n) container_of(n, struct vdev, dev)
+#define to_vdrv(n) container_of(n, struct vdev_driver, driver)
+
+
+/* in vdevice.c */
+extern int vdevice_register(struct vdev *);
+extern int vdev_driver_register(struct vdev_driver *);
+#endif



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Arnd Bergmann
2007-05-11 20:06:01 UTC
Permalink
On Friday 11 May 2007, Carsten Otte wrote:

> This patch adds support for a new bus type that manages paravirtualized
> devices. The bus uses the s390 diagnose instruction to query devices, and
> match them with the corresponding drivers.

It seems that the diagnose instruction is really the only s390 specific
thing in here, right? I guess this part of your series is the first one
that we should have in an architecture independent way.

There may also be the chance of merging this with existing virtual
buses like the one for the ps3, which also just exists using
hypercalls.

> +int vdev_match(struct device * dev, struct device_driver *drv)
> +{
> + struct vdev *vdev = to_vdev(dev);
> + struct vdev_driver *vdrv = to_vdrv(drv);
> +
> + if (vdev->vdev_type == vdrv->vdev_type)
> + return 1;
> +
> + return 0;
> +}

Why invent device type numbers? On open firmware, we just do a string compare,
which more intuitive, and means you don't need any further

> +int vdev_probe(struct device * dev)
> +{
> + struct vdev *vdev = to_vdev(dev);
> + struct vdev_driver *vdrv = to_vdrv(dev->driver);
> +
> + return vdrv->probe(vdev);
> +}

This abstraction is unnecessary, just do the do_vdev() conversion inside
of the individual drivers.

> +
> +struct device vdev_bus = {
> + .bus_id = "vdev0",
> + .release = vdev_bus_release
> +};
>
> +static void vdev_bus_release (struct device *device)
> +{
> + /* noop, static bus object */
> +}

Just make the root of your devices a platform_device, then you don't need
to do dirty tricks like this.

> +static int vdev_scan_coldplug(void)
> +{
> + int rc;
> + struct vdev *device;
> +
> + do {
> + device = kzalloc(sizeof(struct vdev), GFP_ATOMIC);
> + if (!device) {
> + rc = -ENOMEM;
> + goto out;
> + }
> + rc = vdev_diag_hotplug(device->symname, device->hostid);
> + if (rc == -ENODEV)
> + break;
> + if (rc < 0) {
> + printk (KERN_WARNING "vdev: error %d detecting" \
> + " initial devices\n", rc);
> + break;
> + }
> + device->vdev_type = rc;
> +
> + //sanity: are strings terminated?
> + if ((strnlen(device->symname, 128) == 128) ||
> + (strnlen(device->hostid, 128) == 128)) {
> + // warn and discard device
> + printk ("vdev: illegal device entry received\n");
> + break;
> + }
> +
> + rc = vdevice_register(device);
> + if (rc) {
> + kfree(device);
> + } else
> + switch (device->vdev_type) {
> + case VDEV_TYPE_DISK:
> + printk (KERN_INFO "vdev: storage device " \
> + "detected: %s\n", device->symname);
> + break;
> + case VDEV_TYPE_NET:
> + printk (KERN_INFO "vdev: network device " \
> + "detected: %s\n", device->symname);
> + break;
> + default:
> + printk (KERN_INFO "vdev: unknown device " \
> + "detected: %s\n", device->symname);
> + }
> + } while(1);
> + kfree (device);
> + out:
> + return 0;
> +}

Interesting concept of probing the bus -- so you just ask if there are
any new devices, right?

> +#define VDEV_TYPE_DISK 0
> +#define VDEV_TYPE_NET 1
> +
> +struct vdev {
> + unsigned int vdev_type;
> + char symname[128];
> + char hostid[128];
> + struct vdev_driver *driver;
> + struct device dev;
> + void *drv_private;
> +};

You shouldn't need the driver and drv_private fields -- they are already
present in struct device.

Arnd <><

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-14 11:26:11 UTC
Permalink
Carsten Otte wrote:
> From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
>
> This patch adds support for a new bus type that manages paravirtualized
> devices. The bus uses the s390 diagnose instruction to query devices, and
> match them with the corresponding drivers.
> Future enhancements should include hotplug and hotremoval of virtual devices
> triggered by the host, and supend/resume of virtual devices for migration.
>
>

Interesting. We could use a variation this for x86 as well, but I'm not
sure how easy it is to integrate it into closed source OSes (Windows).
The diag instruction could be replaced by a hypercall which would make
the code generic.

--
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-14 11:43:25 UTC
Permalink
Avi Kivity wrote:
> Interesting. We could use a variation this for x86 as well, but I'm not
> sure how easy it is to integrate it into closed source OSes (Windows).
> The diag instruction could be replaced by a hypercall which would make
> the code generic.
I think we need to freeze the hypercall API at some time, and consider
it a stable kernel external API. We do then need to document these
calls, and non-GPL hypervisors can implement it. We could eventually
have a similar situation with one of the other non-GPL hypervisors on
s390 that run Linux.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Dor Laor
2007-05-14 12:00:16 UTC
Permalink
>Avi Kivity wrote:
>> Interesting. We could use a variation this for x86 as well, but I'm
not
>> sure how easy it is to integrate it into closed source OSes
(Windows).
>> The diag instruction could be replaced by a hypercall which would
make
>> the code generic.
>I think we need to freeze the hypercall API at some time, and consider
>it a stable kernel external API. We do then need to document these
>calls, and non-GPL hypervisors can implement it. We could eventually
>have a similar situation with one of the other non-GPL hypervisors on
>s390 that run Linux.

I think Avi meant using a virtual bus as an option for HVMs too (windows
especially). Currently we're using the cpi bus. Using a new virtualized
bus might be a good idea, it's easy & clean for open source. The
question is it make life easier for HVMs. For instance, on windows we'll
need Pnp support for these devices.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-14 13:32:14 UTC
Permalink
Dor Laor wrote:
> I think Avi meant using a virtual bus as an option for HVMs too (windows
> especially). Currently we're using the cpi bus. Using a new virtualized
> bus might be a good idea, it's easy & clean for open source. The
> question is it make life easier for HVMs. For instance, on windows we'll
> need Pnp support for these devices.
Oh that way around. Thanks for clarification.
As far as I see, a stable hypercall API would also be good for
maintaining non-GPL HVMs. Probably we should forge the API with
respect to other HVMs needs then.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-11 17:36:03 UTC
Permalink
From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

This driver provides access to virtual block devices. It does use its own
make_request function which passes the bio to a workqueue thread. The workqueue
thread does use the diagnose hypervisor call to call the hosting Linux.
The hypervisor code in host userspace does use aio_submit to initiate the IO.
Once the IO is done, the host will use io_getevents and then generate an
interrupt to the guest. The interrupt handler calls bio_endio.
This device driver is currently architecture dependent. We intend to move the
host API to hypercall instead of the diagnose instuction. Please review.

Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

---
drivers/s390/block/Kconfig | 7
drivers/s390/guest/Makefile | 1
drivers/s390/guest/vdisk.c | 153 +++++++++++++++++
drivers/s390/guest/vdisk.h | 230 ++++++++++++++++++++++++++
drivers/s390/guest/vdisk_blk.c | 355 +++++++++++++++++++++++++++++++++++++++++
5 files changed, 746 insertions(+)

Index: linux-2.6.21/drivers/s390/guest/vdisk.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vdisk.c
@@ -0,0 +1,153 @@
+/*
+ * guest virtual block device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/spinlock.h>
+#include <linux/types.h>
+#include <asm/ptrace.h>
+#include <asm/s390_ext.h>
+#include <asm/vdev.h>
+#include "vdisk.h"
+
+MODULE_DESCRIPTION("Guest virtual block device driver");
+MODULE_AUTHOR("Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>");
+MODULE_LICENSE("GPL");
+
+static struct vdev_driver vdisk_driver;
+
+static int __find_fd(struct device *dev, void* fdptr) {
+ int fd = (long)fdptr;
+
+ struct vdev *vdev = to_vdev(dev);
+ struct vdisk_device *vdisk = (struct vdisk_device *)vdev->drv_private;
+
+ if (vdisk->vfd == fd)
+ return 1;
+ else
+ return 0;
+}
+
+vdisk_irq_t vdisk_get_irqpage(int fd)
+{
+ struct device *dev;
+ struct vdev *vdev;
+ struct vdisk_device *vdisk;
+
+ dev = driver_find_device(&vdisk_driver.driver, NULL, (void*)(long)fd, __find_fd);
+ if (!dev)
+ return NULL;
+ vdev = to_vdev(dev);
+ vdisk = (struct vdisk_device *)vdev->drv_private;
+ return vdisk->irq_page;
+}
+
+struct vdisk_device * vdisk_get_device_by_fd(int fd)
+{
+ struct device *dev;
+ struct vdev *vdev;
+ struct vdisk_device *vdisk;
+
+ dev = driver_find_device(&vdisk_driver.driver, NULL, (void*)(long)fd, __find_fd);
+ if (!dev)
+ return NULL;
+ vdev = to_vdev(dev);
+ vdisk = (struct vdisk_device *)vdev->drv_private;
+ return vdisk;
+}
+
+
+static int vdisk_probe(struct vdev *vdev)
+{
+ struct vdisk_device *vdisk;
+ int rc;
+
+ vdisk = kzalloc(sizeof(struct vdisk_device), GFP_ATOMIC);
+ if (!vdisk)
+ return -ENOMEM;
+
+ vdisk->vdev = vdev;
+ vdev->drv_private = vdisk;
+ vdisk->submit_page = (void*)get_zeroed_page(GFP_KERNEL);
+
+ if (!vdisk->submit_page) {
+ rc = -ENOMEM;
+ goto out_free;
+ }
+
+ vdisk->irq_page = (void*)get_zeroed_page(GFP_KERNEL);
+
+ if (!vdisk->irq_page) {
+ rc = -ENOMEM;
+ goto out_free;
+ }
+
+ rc = diag_vdisk_disk_info(vdisk->vdev->hostid,
+ &vdisk->blocksize, &vdisk->size,
+ &vdisk->read_only);
+ if (rc)
+ goto out_free;
+ spin_lock_init(&vdisk->lock);
+ init_rwsem(&vdisk->pump_sem);
+ init_waitqueue_head(&vdisk->wait);
+
+ vdisk_init_blockdev(vdisk);
+ goto out;
+
+out_free:
+ if (vdisk->irq_page)
+ free_page((unsigned long)(vdisk->irq_page));
+ if (vdisk->submit_page)
+ free_page((unsigned long)(vdisk->submit_page));
+ kfree(vdisk);
+out:
+ return rc;
+}
+
+static struct vdev_driver vdisk_driver = {
+ .name = "vdisk",
+ .owner = THIS_MODULE,
+ .vdev_type = VDEV_TYPE_DISK,
+ .probe = vdisk_probe,
+};
+
+static int __init vdisk_init(void)
+{
+ int rc;
+ if (!MACHINE_IS_GUEST)
+ return -ENODEV;
+
+ rc = register_blkdev(VDISK_MAJOR, "vdisk");
+ if (rc) {
+ printk(KERN_WARNING "vdisk: cannot register block device\n");
+ return rc;
+ }
+ rc = register_external_interrupt(0x1235, vdisk_ext_handler);
+ if (rc)
+ goto unregister_blk;
+ rc = vdev_driver_register(&vdisk_driver);
+ if (rc)
+ goto unregister_irq;
+ goto out;
+unregister_irq:
+ unregister_external_interrupt(0x1235, vdisk_ext_handler);
+unregister_blk:
+ unregister_blkdev(VDISK_MAJOR, "vdisk");
+ printk (KERN_WARNING "vdisk: initialization failed\n");
+out:
+ return rc;
+}
+
+static void __exit vdisk_exit(void)
+{
+ unregister_external_interrupt(0x1235, vdisk_ext_handler);
+ unregister_blkdev(VDISK_MAJOR, "vdisk");
+ return;
+}
+
+module_init(vdisk_init);
+module_exit(vdisk_exit);
Index: linux-2.6.21/drivers/s390/guest/vdisk.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vdisk.h
@@ -0,0 +1,230 @@
+/*
+ * guest virtual block device driver header file
+ * Copyright IBM Corp.
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ */
+
+#include <linux/list.h>
+#include <linux/genhd.h>
+#include <linux/types.h>
+#include <linux/aio_abi.h>
+#include <linux/wait.h>
+#include <asm/vdev.h>
+
+#define VDISK_MAJOR 95
+#define SECTOR_SHIFT 9
+#define VDISK_NR_REQ 256
+#define VDISK_NR_RES 170
+
+#define VDISK_WRITE 1
+#define VDISK_READ 0
+
+struct vdisk_request {
+ unsigned long buf;
+ unsigned long count;
+};
+
+typedef struct vdisk_request (*vdisk_req_t)[VDISK_NR_REQ];
+
+struct vdisk_response {
+ unsigned long intparm;
+ unsigned long count;
+ unsigned long failed;
+};
+
+typedef struct vdisk_response (*vdisk_irq_t)[VDISK_NR_RES];
+
+struct vdisk_device {
+ struct list_head head;
+ int blocksize;
+ long size;
+ int read_only;
+ struct gendisk *gd;
+ struct vdev *vdev;
+ spinlock_t lock;
+ struct rw_semaphore pump_sem;
+ int open_count;
+ int vfd;
+ struct vdisk_request (*submit_page)[VDISK_NR_REQ];
+ struct workqueue_struct *wq;
+ vdisk_irq_t irq_page;
+ wait_queue_head_t wait;
+};
+
+struct vdisk_work {
+ struct work_struct work;
+ struct bio* bio;
+};
+
+struct vdisk_elem {
+ unsigned int fd;
+ unsigned int command;
+ unsigned long offset;
+ unsigned long buffer;
+ unsigned long nbytes;
+};
+
+struct vdisk_iocb_container {
+ struct iocb iocb;
+ struct bio *bio;
+ struct vdisk_device *dev;
+ int ctx_index;
+ unsigned long context;
+ struct list_head list;
+};
+
+// from aio_abi.h
+typedef enum io_iocb_cmd {
+ IO_CMD_PREAD = 0,
+ IO_CMD_PWRITE = 1,
+
+ IO_CMD_FSYNC = 2,
+ IO_CMD_FDSYNC = 3,
+
+ IO_CMD_POLL = 5,
+ IO_CMD_NOOP = 6,
+} io_iocb_cmd_t;
+
+static inline int
+diag_vdisk_disk_info(char name[256], int* blocksize,
+ long* size, int* read_only)
+{
+ register char* __arg1 asm("2") = name;
+ register int* __arg2 asm("3") = blocksize;
+ register long* __arg3 asm("4") = size;
+ register int* __arg4 asm("5") = read_only;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,5"
+ : "=d" (__svcres)
+ : "0" (__arg1),
+ "d" (__arg2),
+ "d" (__arg3),
+ "d" (__arg4)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+static inline int
+diag_vdisk_open(const char* name, int read_only, void* irq_page)
+{
+ register const char* __arg1 asm("2") = name;
+ register long __arg2 asm("3") = read_only;
+ register unsigned long __arg3 asm("4") = (unsigned long) irq_page;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,7"
+ : "=d" (__svcres)
+ : "0" (__arg1),
+ "d" (__arg2),
+ "d" (__arg3)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+static inline int
+diag_vdisk_close(int fd)
+{
+ register long __arg1 asm("2") = fd;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,9"
+ : "=d" (__svcres)
+ : "0" (__arg1)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+static inline int
+diag_vdisk_aio_setup(unsigned int index, unsigned int num_events,
+ unsigned long *context, void *int_page)
+{
+ register unsigned long __arg1 asm("2") = index;
+ register unsigned long __arg2 asm("3") = num_events;
+ register unsigned long* __arg3 asm("4") = context;
+ register void* __arg4 asm("5") = int_page;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,0x0a"
+ : "=d" (__svcres)
+ : "0" (__arg1),
+ "d" (__arg2),
+ "d" (__arg3),
+ "d" (__arg4)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+static inline void
+diag_vdisk_aio_destroy(unsigned int index)
+{
+ register unsigned long __arg1 asm("2") = index;
+ __asm__ __volatile__ (
+ "diag 0,0,0x12"
+ :: "d" (__arg1)
+ : "cc", "memory");
+}
+
+static inline int
+diag_vdisk_submit_request(int fd, void *submit_page, int op,
+ loff_t start_offset, int nrreq, void* parm)
+{
+ register long __arg1 asm("2") = fd;
+ register unsigned long __arg2 asm("3") = (unsigned long)submit_page;
+ register long __arg3 asm("4") = op;
+ register unsigned long __arg4 asm("5") = start_offset;
+ register long __arg5 asm("6") = nrreq;
+ register unsigned long __arg6 asm("7") = (unsigned long)parm;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,0x0b"
+ : "=d" (__svcres)
+ : "0" (__arg1),
+ "d" (__arg2),
+ "d" (__arg3),
+ "d" (__arg4),
+ "d" (__arg5),
+ "d" (__arg6)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+static inline int
+diag_vdisk_getevents(int index) {
+ register long __arg1 asm("2") = index;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,0x0d"
+ : "=d" (__svcres)
+ : "0" (__arg1)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+// in vdisk.c
+extern struct device *vdisk_sysfs_root;
+int vdisk_disk_info(struct vdisk_device *dev);
+vdisk_irq_t vdisk_get_irqpage(int fd);
+struct vdisk_device * vdisk_get_device_by_fd(int fd);
+
+// in vdisk_blk.c
+void vdisk_init_blockdev(struct vdisk_device *dev);
+void vdisk_ext_handler(__u16 code);
Index: linux-2.6.21/drivers/s390/guest/vdisk_blk.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vdisk_blk.c
@@ -0,0 +1,355 @@
+/*
+ * guest virtual block device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ */
+
+#include <linux/blkdev.h>
+#include "vdisk.h"
+
+static int vdisk_open(struct inode *inode, struct file *filp);
+static int vdisk_release(struct inode *inode, struct file *filp);
+static int vdisk_make_request(request_queue_t *q, struct bio *bio);
+
+static struct block_device_operations vdisk_fops = {
+ .owner = THIS_MODULE,
+ .open = vdisk_open,
+ .release = vdisk_release,
+};
+
+void vdisk_init_blockdev(struct vdisk_device *dev)
+{
+ static int lastminor = 0;
+
+ dev->gd = alloc_disk(1);
+ if (!dev->gd) {
+ printk (KERN_WARNING "vdisk: out of memory while " \
+ "initializing %s\n", dev->vdev->symname);
+ return;
+ }
+ dev->open_count = 0;
+ dev->gd->first_minor = lastminor++;
+ dev->gd->queue = blk_alloc_queue(GFP_KERNEL);
+ if (!dev->gd->queue) {
+ printk (KERN_WARNING "vdisk: out of memory while " \
+ "initializing %s\n", dev->vdev->symname);
+ goto free_gd;
+ }
+ blk_queue_max_segment_size(dev->gd->queue, 15 * dev->blocksize);
+ strlcpy(dev->gd->disk_name, dev->vdev->symname, 32);
+ dev->gd->disk_name[32] = '\0';
+ dev->gd->major = VDISK_MAJOR;
+ dev->gd->fops = &vdisk_fops;
+ dev->gd->private_data = dev;
+ dev->gd->driverfs_dev = &dev->vdev->dev;
+ set_capacity(dev->gd, dev->size);
+ get_device(&dev->vdev->dev);
+ add_disk(dev->gd);
+ blk_queue_make_request(dev->gd->queue, vdisk_make_request);
+ blk_queue_hardsect_size(dev->gd->queue, dev->blocksize);
+ set_disk_ro(dev->gd, dev->read_only);
+ if (dev->blocksize)
+ printk(KERN_INFO "vdisk: device %s(%d:%d) active with " \
+ "block size %d and %ld sectors\n", dev->vdev->symname,
+ VDISK_MAJOR, dev->gd->first_minor, dev->blocksize,
+ dev->size);
+ else
+ printk(KERN_INFO "vdisk: device %s(%d:%d) inactive\n",
+ dev->vdev->symname, VDISK_MAJOR, dev->gd->first_minor);
+ return;
+ free_gd:
+ kfree(dev->gd);
+ dev->gd = NULL;
+ return;
+}
+
+static int vdisk_open(struct inode *inode, struct file *filp)
+{
+ struct vdisk_device *dev = inode->i_bdev->bd_disk->private_data;
+ unsigned long flags;
+ char* wq_name;
+ struct workqueue_struct *new_wq;
+ int rc;
+
+ if (!dev) {
+ rc = -ENODEV;
+ goto out;
+ }
+
+ down_write(&dev->pump_sem);
+ spin_lock_irqsave(&dev->lock, flags);
+ if (dev->open_count) {
+ dev->open_count++;
+ rc = 0;
+ goto unlock;
+ }
+
+
+ wq_name = kmalloc(strlen(dev->gd->disk_name)+9, GFP_ATOMIC);
+ if (!wq_name) {
+ rc = -ENOMEM;
+ goto unlock;
+ }
+ memcpy(wq_name, "IO_pump_", 8);
+ strcpy(wq_name+8, dev->gd->disk_name);
+
+ spin_unlock_irqrestore(&dev->lock, flags);
+ new_wq = create_singlethread_workqueue(wq_name);
+ spin_lock_irqsave(&dev->lock, flags);
+
+ dev->wq = new_wq;
+ kfree (wq_name);
+
+ if (!dev->wq) {
+ rc = -EIO;
+ goto unlock;
+ }
+
+ rc = diag_vdisk_disk_info(dev->vdev->hostid, &dev->blocksize,
+ &dev->size, &dev->read_only);
+ if (rc) {
+ printk(KERN_ERR "vdisk: error querying %s\n", dev->vdev->hostid);
+ goto cleanup;
+ }
+ inode->i_bdev->bd_block_size = dev->blocksize;
+ dev->vfd = diag_vdisk_open(dev->vdev->hostid, dev->read_only,
+ dev->irq_page);
+
+ if (dev->vfd < 0) {
+ rc = dev->vfd;
+ printk(KERN_ERR "vdisk: error opening %s\n", dev->vdev->hostid);
+ goto cleanup;
+ } else {
+ dev->open_count++;
+ rc = 0;
+ }
+ goto unlock;
+
+ cleanup:
+ spin_unlock_irqrestore(&dev->lock, flags);
+ destroy_workqueue(new_wq);
+ spin_lock_irqsave(&dev->lock, flags);
+ unlock:
+ spin_unlock_irqrestore(&dev->lock, flags);
+ up_write(&dev->pump_sem);
+ out:
+ return rc;
+}
+
+static int
+vdisk_release(struct inode *inode, struct file *filp)
+{
+ int rc;
+ unsigned long flags;
+ struct vdisk_device *dev = inode->i_bdev->bd_disk->private_data;
+ struct workqueue_struct *old_wq;
+
+ if (!dev) {
+ rc = -ENODEV;
+ goto out;
+ }
+
+ down_write(&dev->pump_sem);
+ spin_lock_irqsave(&dev->lock, flags);
+ dev->open_count--;
+
+ if (dev->open_count) {
+ rc = 0;
+ spin_unlock_irqrestore(&dev->lock, flags);
+ goto up;
+ }
+ rc = diag_vdisk_close(dev->vfd);
+
+ old_wq = dev->wq;
+ dev->wq = NULL;
+
+ spin_unlock_irqrestore(&dev->lock, flags);
+
+ destroy_workqueue(old_wq);
+
+ up:
+ up_write(&dev->pump_sem);
+ out:
+ return rc;
+}
+
+static void vdisk_pump_bvecs(struct vdisk_device *dev, int op,
+ loff_t start_offset, int requestno,
+ struct bio* bio, struct bio_vec *(vectors[256]))
+{
+ int i, rc;
+ loff_t offset = start_offset;
+ int nr_done = 0;
+ long size;
+ long flags=0;
+ DEFINE_WAIT(wait);
+
+ spin_lock_irqsave(&dev->lock, flags);
+ prepare_to_wait_exclusive(&dev->wait, &wait,
+ TASK_UNINTERRUPTIBLE);
+
+ while (nr_done < requestno) {
+ memset(dev->submit_page, 0, PAGE_SIZE);
+ for (i=nr_done; i<requestno; i++) {
+ (*dev->submit_page)[i-nr_done].buf =
+ (unsigned long)page_address(vectors[i]->bv_page) +
+ vectors[i]->bv_offset;
+ (*dev->submit_page)[i-nr_done].count = vectors[i]->bv_len;
+ }
+
+ rc = diag_vdisk_submit_request(dev->vfd,
+ dev->submit_page,
+ op, offset,
+ requestno-nr_done, bio);
+
+ if (rc < 0) {
+ // error case
+ size = 0;
+ for (i=0; i<(requestno-nr_done); i++)
+ size += (*dev->submit_page)[i].count;
+ bio_io_error(bio, size);
+ break;
+ }
+
+ if (rc == requestno - nr_done)
+ // everything was submitted propper
+ break;
+
+ if (rc) {
+ //request was partly submitted
+ for (i=0; i<rc; i++)
+ offset += (*dev->submit_page)[i].count;
+ nr_done += rc;
+ }
+ // we need to throttle IO, and retry submission later
+ spin_unlock_irqrestore(&dev->lock, flags);
+ io_schedule();
+ spin_lock_irqsave(&dev->lock, flags);
+ }
+ finish_wait(&dev->wait, &wait);
+ spin_unlock_irqrestore(&dev->lock, flags);
+ return;
+}
+
+static void vdisk_pump_bio(struct work_struct *zw)
+{
+ struct vdisk_work *work =
+ container_of(zw, struct vdisk_work, work);
+
+ struct bio *bio = work->bio;
+ struct bio_vec *bvec;
+ struct bio_vec *(vectors[256]);
+ struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data;
+ int i, op, requestno=0;
+ loff_t start_offset, offset;
+
+ BUG_ON(!dev);
+
+ kfree (zw);
+
+ if (bio_data_dir(bio))
+ op = VDISK_WRITE;
+ else
+ op = VDISK_READ;
+
+ offset = start_offset = ((loff_t)bio->bi_sector)<<SECTOR_SHIFT;
+
+ bio_for_each_segment(bvec, bio, i) {
+ if (bvec->bv_len & (dev->blocksize - 1)) //FIXME: Zugriff auf dev ohne lock
+ goto out;
+
+ vectors[requestno] = bvec;
+ offset += bvec->bv_len;
+ requestno++;
+ if (requestno == 255) {
+ vdisk_pump_bvecs(dev, op, start_offset, requestno,
+ bio, vectors);
+ start_offset = offset;
+ requestno = 0;
+ }
+ }
+
+ if (requestno)
+ vdisk_pump_bvecs(dev, op, start_offset, requestno, bio, vectors);
+
+out:
+ return;
+}
+
+static int vdisk_make_request(request_queue_t *q, struct bio *bio)
+{
+ struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data;
+ struct vdisk_work *work;
+ int rc;
+
+ if (!dev) {
+ rc = -ENODEV;
+ goto out;
+ }
+
+ if (bio_barrier(bio)) {
+ rc = -EOPNOTSUPP;
+ goto out;
+ }
+
+ work = kmalloc(sizeof(struct vdisk_work), GFP_KERNEL);
+ if (!work) {
+ rc = -ENOMEM;
+ goto out;
+ }
+
+ work->bio = bio;
+
+ INIT_WORK(&work->work, vdisk_pump_bio);
+
+ if (!queue_work(dev->wq, &work->work)) {
+ rc = -EIO;
+ kfree(work);
+ } else
+ rc = 0;
+
+out:
+ return rc;
+}
+
+static void __post_response(vdisk_irq_t irq_page)
+{
+ int i;
+ struct bio *bio;
+
+ for (i=0; i<VDISK_NR_RES; i++) {
+ if (!(*irq_page)[i].intparm)
+ break;
+ bio = (struct bio *)((*irq_page)[i].intparm);
+ if ((*irq_page)[i].count)
+ bio_endio(bio, (*irq_page)[i].count, 0);
+ if ((*irq_page)[i].failed)
+ bio_endio(bio, (*irq_page)[i].failed, 1);
+ }
+}
+
+void vdisk_ext_handler(__u16 code)
+{
+ int rc=0; //FIXME: no initialization here
+ int fd = S390_lowcore.ext_params;
+ vdisk_irq_t irq_page;
+ struct vdisk_device *vdev;
+
+ irq_page = vdisk_get_irqpage(fd);
+
+ if (irq_page) {
+ do {
+ __post_response(irq_page);
+ rc = diag_vdisk_getevents(fd); // get more interrupts
+ } while(rc > 0);
+ vdev = vdisk_get_device_by_fd(fd);
+ if (!vdev)
+ panic("cannot find vdisk device while in interrupt");
+ spin_lock(&vdev->lock);
+ if (waitqueue_active(&vdev->wait))
+ wake_up(&vdev->wait);
+ spin_unlock(&vdev->lock);
+ } else
+ printk (KERN_WARNING "vdisk got interrupt for non-existing" \
+ " aio context id %d\n", fd);
+}
Index: linux-2.6.21/drivers/s390/block/Kconfig
===================================================================
--- linux-2.6.21.orig/drivers/s390/block/Kconfig
+++ linux-2.6.21/drivers/s390/block/Kconfig
@@ -63,4 +63,11 @@ config DASD_EER
DASD extended error reporting. This is only needed if you want to
use applications written for the EER facility.

+
+config VDISK
+ tristate "guest disk device support"
+ depends on S390_GUEST
+ help
+ This driver provides access to block devices for Linux systems running
+ under non z/VM hosts. If you are running on LPAR or z/VM only, say N.
endif
Index: linux-2.6.21/drivers/s390/guest/Makefile
===================================================================
--- linux-2.6.21.orig/drivers/s390/guest/Makefile
+++ linux-2.6.21/drivers/s390/guest/Makefile
@@ -4,4 +4,5 @@

obj-$(CONFIG_GUEST_CONSOLE) += guest_console.o guest_tty.o
obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o
+obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-14 11:49:51 UTC
Permalink
Carsten Otte wrote:
> From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
>
> This driver provides access to virtual block devices. It does use its own
> make_request function which passes the bio to a workqueue thread. The workqueue
> thread does use the diagnose hypervisor call to call the hosting Linux.
> The hypervisor code in host userspace does use aio_submit to initiate the IO.
> Once the IO is done, the host will use io_getevents and then generate an
> interrupt to the guest. The interrupt handler calls bio_endio.
> This device driver is currently architecture dependent. We intend to move the
> host API to hypercall instead of the diagnose instuction. Please review.
>
> Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
>

> +struct vdisk_device * vdisk_get_device_by_fd(int fd)
> +{
> + struct device *dev;
> + struct vdev *vdev;
> + struct vdisk_device *vdisk;
> +
> + dev = driver_find_device(&vdisk_driver.driver, NULL, (void*)(long)fd, __find_fd);
> + if (!dev)
> + return NULL;
> + vdev = to_vdev(dev);
> + vdisk = (struct vdisk_device *)vdev->drv_private;
> + return vdisk;
> +}
>

Is this the host file descriptor? If so, we want to use something more
abstract (if the host side is in kernel, there will be no fd, or if the
device is implemented using >1 files (or <1 files)).

> +
> +#define VDISK_WRITE 1
> +#define VDISK_READ 0
> +
> +struct vdisk_request {
> + unsigned long buf;
> + unsigned long count;
> +};
> +
> +typedef struct vdisk_request (*vdisk_req_t)[VDISK_NR_REQ];
> +
> +struct vdisk_response {
> + unsigned long intparm;
> + unsigned long count;
> + unsigned long failed;
> +};
> +
> +typedef struct vdisk_response (*vdisk_irq_t)[VDISK_NR_RES];
> +
> +struct vdisk_device {
> + struct list_head head;
> + int blocksize;
> + long size;
> + int read_only;
> + struct gendisk *gd;
> + struct vdev *vdev;
> + spinlock_t lock;
> + struct rw_semaphore pump_sem;
> + int open_count;
> + int vfd;
> + struct vdisk_request (*submit_page)[VDISK_NR_REQ];
>


> + struct workqueue_struct *wq;
> + vdisk_irq_t irq_page;
> + wait_queue_head_t wait;
> +};
> +
> +struct vdisk_work {
> + struct work_struct work;
> + struct bio* bio;
> +};
> +
> +struct vdisk_elem {
> + unsigned int fd;
> + unsigned int command;
> + unsigned long offset;
> + unsigned long buffer;
> + unsigned long nbytes;
>

We'll want scatter/gather here.

> +};
> +
> +struct vdisk_iocb_container {
> + struct iocb iocb;
> + struct bio *bio;
> + struct vdisk_device *dev;
> + int ctx_index;
> + unsigned long context;
> + struct list_head list;
> +};
> +
> +// from aio_abi.h
> +typedef enum io_iocb_cmd {
> + IO_CMD_PREAD = 0,
> + IO_CMD_PWRITE = 1,
> +
> + IO_CMD_FSYNC = 2,
> + IO_CMD_FDSYNC = 3,
> +
> + IO_CMD_POLL = 5,
> + IO_CMD_NOOP = 6,
> +} io_iocb_cmd_t;
>

Our own commands, please. We need READV, WRITEV, and a barrier for
journalling filesystems. FDSYNC should work as a barrier, but is
wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP are
irrelevant.

> +static void vdisk_pump_bvecs(struct vdisk_device *dev, int op,
> + loff_t start_offset, int requestno,
> + struct bio* bio, struct bio_vec *(vectors[256]))
> +{
> + int i, rc;
> + loff_t offset = start_offset;
> + int nr_done = 0;
> + long size;
> + long flags=0;
> + DEFINE_WAIT(wait);
> +
> + spin_lock_irqsave(&dev->lock, flags);
> + prepare_to_wait_exclusive(&dev->wait, &wait,
> + TASK_UNINTERRUPTIBLE);
> +
> + while (nr_done < requestno) {
> + memset(dev->submit_page, 0, PAGE_SIZE);
> + for (i=nr_done; i<requestno; i++) {
> + (*dev->submit_page)[i-nr_done].buf =
> + (unsigned long)page_address(vectors[i]->bv_page) +
> + vectors[i]->bv_offset;
> + (*dev->submit_page)[i-nr_done].count = vectors[i]->bv_len;
> + }
> +
> + rc = diag_vdisk_submit_request(dev->vfd,
> + dev->submit_page,
> + op, offset,
> + requestno-nr_done, bio);
> +
> + if (rc < 0) {
> + // error case
> + size = 0;
> + for (i=0; i<(requestno-nr_done); i++)
> + size += (*dev->submit_page)[i].count;
> + bio_io_error(bio, size);
> + break;
> + }
> +
> + if (rc == requestno - nr_done)
> + // everything was submitted propper
> + break;
> +
> + if (rc) {
> + //request was partly submitted
> + for (i=0; i<rc; i++)
> + offset += (*dev->submit_page)[i].count;
> + nr_done += rc;
> + }
> + // we need to throttle IO, and retry submission later
> + spin_unlock_irqrestore(&dev->lock, flags);
> + io_schedule();
> + spin_lock_irqsave(&dev->lock, flags);
> + }
> + finish_wait(&dev->wait, &wait);
> + spin_unlock_irqrestore(&dev->lock, flags);
> + return;
> +}
>

We want to amortize the hypercall over multiple bios (but maybe you're
doing that -- I'm not 100% up to speed on the block layer)

> +
> +static void vdisk_pump_bio(struct work_struct *zw)
> +{
> + struct vdisk_work *work =
> + container_of(zw, struct vdisk_work, work);
> +
> + struct bio *bio = work->bio;
> + struct bio_vec *bvec;
> + struct bio_vec *(vectors[256]);
> + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data;
> + int i, op, requestno=0;
> + loff_t start_offset, offset;
> +
> + BUG_ON(!dev);
> +
> + kfree (zw);
> +
> + if (bio_data_dir(bio))
> + op = VDISK_WRITE;
> + else
> + op = VDISK_READ;
> +
> + offset = start_offset = ((loff_t)bio->bi_sector)<<SECTOR_SHIFT;
> +
> + bio_for_each_segment(bvec, bio, i) {
> + if (bvec->bv_len & (dev->blocksize - 1)) //FIXME: Zugriff auf dev ohne lock
> + goto out;
> +
> + vectors[requestno] = bvec;
> + offset += bvec->bv_len;
> + requestno++;
> + if (requestno == 255) {
> + vdisk_pump_bvecs(dev, op, start_offset, requestno,
> + bio, vectors);
> + start_offset = offset;
> + requestno = 0;
> + }
> + }
> +
> + if (requestno)
> + vdisk_pump_bvecs(dev, op, start_offset, requestno, bio, vectors);
> +
> +out:
> + return;
> +}
> +
> +static int vdisk_make_request(request_queue_t *q, struct bio *bio)
> +{
> + struct vdisk_device *dev = bio->bi_bdev->bd_disk->private_data;
> + struct vdisk_work *work;
> + int rc;
> +
> + if (!dev) {
> + rc = -ENODEV;
> + goto out;
> + }
> +
> + if (bio_barrier(bio)) {
> + rc = -EOPNOTSUPP;
> + goto out;
> + }
> +
> + work = kmalloc(sizeof(struct vdisk_work), GFP_KERNEL);
> + if (!work) {
> + rc = -ENOMEM;
> + goto out;
> + }
> +
> + work->bio = bio;
> +
> + INIT_WORK(&work->work, vdisk_pump_bio);
> +
> + if (!queue_work(dev->wq, &work->work)) {
> + rc = -EIO;
> + kfree(work);
> + } else
> + rc = 0;
>

Any reason not to perform the work directly?

--
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-14 13:23:53 UTC
Permalink
Avi Kivity wrote:
> Is this the host file descriptor? If so, we want to use something more
> abstract (if the host side is in kernel, there will be no fd, or if the
> device is implemented using >1 files (or <1 files)).
This is indeed the host file descriptor. Host userland uses sys_open
to retrieve it. I see the beauty of having the remote side in the
kernel, however I fail to see why we would want to reinvent the wheel:
asynchronous IO with O_DIRECT (to avoid host caching) does just what
we want. System call latency adds to the in-kernel approach here.

> We'll want scatter/gather here.
If you want scatter/gather, you have to do request merging in the
guest and use the do_request function of the block queue. That is
because in make_request you only have a single chunk at hand.
With do_request, you would do that request merging twice and get twice
the block device plug latency for nothing. The host is the better
place to do IO scheduling, because it can optimize over IO from all
guest machines.
>
>> +};
>> +
>> +struct vdisk_iocb_container {
>> + struct iocb iocb;
>> + struct bio *bio;
>> + struct vdisk_device *dev;
>> + int ctx_index;
>> + unsigned long context;
>> + struct list_head list;
>> +};
>> +
>> +// from aio_abi.h
>> +typedef enum io_iocb_cmd {
>> + IO_CMD_PREAD = 0,
>> + IO_CMD_PWRITE = 1,
>> +
>> + IO_CMD_FSYNC = 2,
>> + IO_CMD_FDSYNC = 3,
>> +
>> + IO_CMD_POLL = 5,
>> + IO_CMD_NOOP = 6,
>> +} io_iocb_cmd_t;
>>
>
> Our own commands, please. We need READV, WRITEV, and a barrier for
> journalling filesystems. FDSYNC should work as a barrier, but is
> wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP are
> irrelevant.
This matches the api of libaio. If userland translates this into
struct iocp, this makes sense. The barrier however is a general
problem with this approach: today, the asynchronous IO userspace api
does not allow to submit a barrier. Therefore, our make_request
function in the guest returns -ENOTSUPP in the guest which forces the
file system to wait for IO completion. This does sacrifice some
performance. The right thing to do would be to add the possibility to
submit a barrier to the kernel aio interface.

> We want to amortize the hypercall over multiple bios (but maybe you're
> doing that -- I'm not 100% up to speed on the block layer)
We don't. We do one per bio, and I agree that this is a major
disadvantage of this approach. Since IO is slow (compared to
vmenter/vmexit), it pays back from to better IO scheduling. On our
platform, this approach outperforms the scatter/gather do_request one.

> Any reason not to perform the work directly?
I owe you an answer to this one, I have to revisit our CVS logs to
find out. We used to call from make_request without workqueue before,
and I cannot remember why we changed that.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-14 14:39:16 UTC
Permalink
Carsten Otte wrote:
>
> Avi Kivity wrote:
>> Is this the host file descriptor? If so, we want to use something
>> more abstract (if the host side is in kernel, there will be no fd, or
>> if the device is implemented using >1 files (or <1 files)).
> This is indeed the host file descriptor. Host userland uses sys_open
> to retrieve it. I see the beauty of having the remote side in the
> kernel, however I fail to see why we would want to reinvent the wheel:
> asynchronous IO with O_DIRECT (to avoid host caching) does just what
> we want.

I don't see an immediate need to put the host-side driver in the kernel,
but I don't want to embed the host fd (which is an implementation
detail) into the host/guest ABI. There may not even be a host fd.

> System call latency adds to the in-kernel approach here.

I don't understand this.

>
>> We'll want scatter/gather here.
> If you want scatter/gather, you have to do request merging in the
> guest and use the do_request function of the block queue. That is
> because in make_request you only have a single chunk at hand.
> With do_request, you would do that request merging twice and get twice
> the block device plug latency for nothing. The host is the better
> place to do IO scheduling, because it can optimize over IO from all
> guest machines.

The bio layer already has scatter/gather (basically, a biovec), but the
aio api (which you copy) doesn't. The basic request should be a bio,
not a bio page.

I don't think the guest driver needs to do its own merging.

>>
>>> +};
>>> +
>>> +struct vdisk_iocb_container {
>>> + struct iocb iocb;
>>> + struct bio *bio;
>>> + struct vdisk_device *dev;
>>> + int ctx_index;
>>> + unsigned long context;
>>> + struct list_head list;
>>> +};
>>> +
>>> +// from aio_abi.h
>>> +typedef enum io_iocb_cmd {
>>> + IO_CMD_PREAD = 0,
>>> + IO_CMD_PWRITE = 1,
>>> +
>>> + IO_CMD_FSYNC = 2,
>>> + IO_CMD_FDSYNC = 3,
>>> +
>>> + IO_CMD_POLL = 5,
>>> + IO_CMD_NOOP = 6,
>>> +} io_iocb_cmd_t;
>>>
>>
>> Our own commands, please. We need READV, WRITEV, and a barrier for
>> journalling filesystems. FDSYNC should work as a barrier, but is
>> wasteful. The FSYNC/FDSYNC distinction is meaningless. POLL/NOOP
>> are irrelevant.
> This matches the api of libaio. If userland translates this into
> struct iocp, this makes sense. The barrier however is a general
> problem with this approach: today, the asynchronous IO userspace api
> does not allow to submit a barrier. Therefore, our make_request
> function in the guest returns -ENOTSUPP in the guest which forces the
> file system to wait for IO completion. This does sacrifice some
> performance. The right thing to do would be to add the possibility to
> submit a barrier to the kernel aio interface.

Right. But the ABI needs to support barriers regardless of host kernel
support. When unavailable, barriers can be emulated by waiting for the
request queue to flush itself. If we do implement the host side in the
kernel, then barriers become available.

>
>> We want to amortize the hypercall over multiple bios (but maybe
>> you're doing that -- I'm not 100% up to speed on the block layer)
> We don't. We do one per bio, and I agree that this is a major
> disadvantage of this approach. Since IO is slow (compared to
> vmenter/vmexit), it pays back from to better IO scheduling. On our
> platform, this approach outperforms the scatter/gather do_request one.

I/O may be slow, but you can have a lot more disks than cpus.

For example, if an I/O takes 1ms, and you have 100 disks, then you can
issue 100K IOPS. With one hypercall per request, that's ~50% of a cpu
(at about 5us per hypercall that goes all the way to userspace). That's
not counting the overhead of calling io_submit().


--
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-15 11:47:53 UTC
Permalink
Avi Kivity wrote:
> I don't see an immediate need to put the host-side driver in the kernel,
> but I don't want to embed the host fd (which is an implementation
> detail) into the host/guest ABI. There may not even be a host fd.
Your point is taken, it also punches a hole in the security barrier
between guest kernel and userspace which our usage scenario of
multiple guests per uid requires.

>> System call latency adds to the in-kernel approach here.
> I don't understand this.
What I meant to state was: If the host side of the block driver runs
in userspace, we have the extra latency to leave the kernel system
call context, compute on behalf of the user process, and do another
system call (to drive the IO). This extra overhead does not show when
handling IO requests from the guest in the kernel.

> The bio layer already has scatter/gather (basically, a biovec), but the
> aio api (which you copy) doesn't. The basic request should be a bio,
> not a bio page.
With our block driver it is, we submit an entire bio which may contain
multiple biovecs at one hypercall.

> Right. But the ABI needs to support barriers regardless of host kernel
> support. When unavailable, barriers can be emulated by waiting for the
> request queue to flush itself. If we do implement the host side in the
> kernel, then barriers become available.
Agreed.

> I/O may be slow, but you can have a lot more disks than cpus.
>
> For example, if an I/O takes 1ms, and you have 100 disks, then you can
> issue 100K IOPS. With one hypercall per request, that's ~50% of a cpu
> (at about 5us per hypercall that goes all the way to userspace). That's
> not counting the overhead of calling io_submit().
Even when a hypercall round-trip takes as long as 5us, and even if you
have 512byte per biovec only (we use 4k blocksize), I don't see how
this gets a performance problem:
With linear read/write you get 200.000 hypercalls per second with 128
kbyte per hypercall. That's 25.6 GByte per second per CPU. With random
read (worst case: 512 byte per hypercall) you still get 100 MByte per
second per CPU. There are tighter bottlenecks in the IO hardware afaics.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-16 10:01:16 UTC
Permalink
Carsten Otte wrote:
>
>>> System call latency adds to the in-kernel approach here.
>> I don't understand this.
> What I meant to state was: If the host side of the block driver runs
> in userspace, we have the extra latency to leave the kernel system
> call context, compute on behalf of the user process, and do another
> system call (to drive the IO). This extra overhead does not show when
> handling IO requests from the guest in the kernel.
>

Well, this argument seems to be in favor of not using an fd ;)

Actually, an fd is usable when storing a disk in a raw file. But qemu
supports non-raw (formatted) disk images, which have additional features
like snapshots. The fd alone does not contain enough information.


>
>> I/O may be slow, but you can have a lot more disks than cpus.
>>
>> For example, if an I/O takes 1ms, and you have 100 disks, then you
>> can issue 100K IOPS. With one hypercall per request, that's ~50% of
>> a cpu (at about 5us per hypercall that goes all the way to
>> userspace). That's not counting the overhead of calling io_submit().
> Even when a hypercall round-trip takes as long as 5us, and even if you
> have 512byte per biovec only (we use 4k blocksize), I don't see how
> this gets a performance problem:
> With linear read/write you get 200.000 hypercalls per second with 128
> kbyte per hypercall. That's 25.6 GByte per second per CPU. With random
> read (worst case: 512 byte per hypercall) you still get 100 MByte per
> second per CPU. There are tighter bottlenecks in the IO hardware afaics.
>

If all you do is I/O, sure. If you want to limit I/O cpu overhead to
10%, the raw bandwidth becomes 10 MB/s/cpu.

(bandwidth isn't a good measure for random I/O, IOPS is the interesting
metric)

Both the guest and host will favor batched requests. It's a shame to
deny that because of a simplistic protocol.


--
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-14 11:52:36 UTC
Permalink
Carsten Otte wrote:
> From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
>
> This driver provides access to virtual block devices. It does use its own
> make_request function which passes the bio to a workqueue thread. The workqueue
> thread does use the diagnose hypervisor call to call the hosting Linux.
> The hypervisor code in host userspace does use aio_submit to initiate the IO.
> Once the IO is done, the host will use io_getevents and then generate an
> interrupt to the guest. The interrupt handler calls bio_endio.
> This device driver is currently architecture dependent. We intend to move the
> host API to hypercall instead of the diagnose instuction. Please review.
>
>

Oh. Why not use Xen's pending block driver? It probably has everything
needed.

--
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-14 13:26:14 UTC
Permalink
Avi Kivity wrote:
> Oh. Why not use Xen's pending block driver? It probably has everything
> needed.
We're not too eager to have our own device drivers become the solution
of choice. I have'nt looked at it so far, will do.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-11 17:36:05 UTC
Permalink
From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>

This is a work-in- progress paravirtualized network driver for Linux systems
running under a hypervisor.
The basic idea of this network driver is to have a shared memory pool between
host and guest. The guest allocates the buffer and registers its memory with
the host.

There are two queues one for guest to host traffic and vice versa. The queue
state is tracked via an 32 bit atomic. The first 16 bits indicate the slot in
the queue, where to put the next packet in, the last 16 bits indicate the slot
in the queue where to take the next packet out. Macros are provided to check
the queue for empty and full.

We use notification, when the queue _was_ empty or full. Guest to host
notification is done via the diagnose instruction. This is basically an
instruction for hypervisor call, similar to vmmcall and vmcall. For the
reverse notification the host sends an interrupt to the guest. Using NAPI, we
react on changes of the queue with netif_rx_schedule, netif_wake_queue,
netif_stop_queue and netif_rx_complete.
As the host only sends an interrupt if the queue was empty, we also need to
check for a race in the poll function and use netif_rx_reschedule. Otherwise
we would miss a wakup and would sleep forever.

Our s390 network drivers support cards that do IP packets only and provide no
MAC header at all. Therefore, the driver currently supports a layer2 based mode
(like ethernet) and a layer3 mode (we claim to be ptp) depending on the host.

So we have several prototypes and device drivers for paravirtualized network:
KVM, XEN, our prototype,lguest.... In the long term we want to have one driver
to rule^w drive them all, right?

This driver has currently some s390 specific aspects. I think we could get rid
of the diagnose calls with an architecture defined hypervisor call.

Please review, comments on the design are very welcome.

Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

---
drivers/s390/guest/Makefile | 2
drivers/s390/guest/vnet.h | 147 +++++++++++
drivers/s390/guest/vnet_guest.c | 509 ++++++++++++++++++++++++++++++++++++++++
drivers/s390/guest/vnet_guest.h | 111 ++++++++
drivers/s390/net/Kconfig | 9
5 files changed, 777 insertions(+), 1 deletion(-)

Index: linux-2.6.21/drivers/s390/guest/vnet.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet.h
@@ -0,0 +1,147 @@
+/*
+ * vnet - virtual networking for guest systems
+ *
+ * Copyright IBM Corp. 2007
+ * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#ifndef __VNET_H
+#define __VNET_H
+#include <linux/crc32.h>
+#include <linux/ioctl.h>
+#include <linux/if_ether.h>
+#include <linux/netdevice.h>
+#include <asm/atomic.h>
+#include <asm/page.h>
+
+#define VNET_MAJOR 12 //COFIXME
+
+#define VNET_QUEUE_LEN 80 // careful, vnet_control must be < PAGE
+#define VNET_BUFFER_SIZE 65536
+#define VNET_BUFFER_ORDER get_order(VNET_BUFFER_SIZE)
+#define VNET_BUFFER_PAGES (((VNET_BUFFER_SIZE-1)>>PAGE_SHIFT)+1)
+
+#define VNET_TIMEOUT (5*HZ)
+
+#define VNET_IRQ_START_RX 0
+#define VNET_IRQ_START_TX 1
+
+struct vnet_info {
+ int linktype;
+ int maxmtu;
+};
+
+#define VNET_IOCTL_ID 'Z'
+#define VNET_REGISTER_CTL _IOW(VNET_IOCTL_ID ,0, unsigned long long)
+#define VNET_INTERRUPT _IOW(VNET_IOCTL_ID, 1, int)
+#define VNET_INFO _IOR(VNET_IOCTL_ID, 2, struct vnet_info*)
+
+#define QUEUE_IS_EMPTY 1
+#define QUEUE_WAS_EMPTY 2
+#define QUEUE_IS_FULL 4
+#define QUEUE_WAS_FULL 8
+
+
+struct xmit_buffer {
+ __be16 len;
+ __be16 proto;
+ void *data;
+};
+
+struct vnet_control {
+ int buffer_size;
+ char mac[ETH_ALEN];
+ atomic_t p2smit __attribute__((__aligned__(SMP_CACHE_BYTES)));
+ atomic_t s2pmit __attribute__((__aligned__(SMP_CACHE_BYTES)));
+ struct xmit_buffer p2sbufs[VNET_QUEUE_LEN] __attribute__((__aligned__(SMP_CACHE_BYTES)));
+ struct xmit_buffer s2pbufs[VNET_QUEUE_LEN] __attribute__((__aligned__(SMP_CACHE_BYTES)));
+};
+
+
+#define __nextx(val) (((val) & 0xffff0000)>>16)
+#define __nextr(val) ((val) & 0xffff)
+#define __mkxr(x,r) ((((x) & 0xffff)<<16) | ((r) & 0xffff))
+
+static inline int
+vnet_q_empty(int val)
+{
+ return (__nextx(val) == __nextr(val));
+}
+
+static inline int
+vnet_q_half(int val)
+{
+ if (__nextx(val) == __nextr(val))
+ return 0;
+ if (__nextx(val) < __nextr(val)) {
+ if ((__nextr(val) - __nextx(val)) < VNET_QUEUE_LEN / 2)
+ return 1;
+ } else {
+ if ((__nextx(val) - __nextr(val)) > VNET_QUEUE_LEN / 2)
+ return 1;
+ }
+ return 0;
+}
+
+
+static inline int
+vnet_q_full(int val)
+{
+ if (__nextx(val) == __nextr(val) - 1)
+ return 1;
+ if ((__nextr(val) == 0) && (__nextx(val) == VNET_QUEUE_LEN - 1))
+ return 1;
+ return 0;
+}
+
+/* returns values:
+ * bit RX_QUEUE_NOW_EMPTY (1)
+ * and/or RX_QUEUE_WAS_FULL (2)
+ */
+static inline int
+vnet_rx_packet(atomic_t *ato)
+{
+ int oldval, newval, rc;
+
+ do {
+ oldval = atomic_read(ato);
+
+ //do we wrap?
+ if (__nextr(oldval)+1 == VNET_QUEUE_LEN)
+ newval = __mkxr(__nextx(oldval),0);
+ else
+ newval = __mkxr(__nextx(oldval),__nextr(oldval)+1);
+ } while (atomic_cmpxchg(ato, oldval, newval) != oldval);
+
+ rc = 0;
+ if (vnet_q_empty(newval))
+ rc |= QUEUE_IS_EMPTY;
+ if (vnet_q_full(oldval))
+ rc |= QUEUE_WAS_FULL;
+ return rc;
+}
+
+static inline int
+vnet_tx_packet(atomic_t *ato)
+{
+ int oldval, newval, rc;
+
+ do {
+ oldval = atomic_read(ato);
+
+ //do we wrap?
+ if (__nextx(oldval)+1 == VNET_QUEUE_LEN)
+ newval = __mkxr(0, __nextr(oldval));
+ else
+ newval = __mkxr(__nextx(oldval)+1,__nextr(oldval));
+ } while (atomic_cmpxchg(ato, oldval, newval) != oldval);
+ rc = 0;
+ if (vnet_q_empty(oldval))
+ rc |= QUEUE_WAS_EMPTY;
+ if (vnet_q_full(newval))
+ rc |= QUEUE_IS_FULL;
+ return rc;
+}
+#endif
Index: linux-2.6.21/drivers/s390/guest/vnet_guest.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_guest.c
@@ -0,0 +1,509 @@
+/*
+ * vnet - virtual network device driver
+ *
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <linux/etherdevice.h>
+#include <linux/if.h>
+#include <linux/if_ether.h>
+#include <linux/if_arp.h>
+#include <linux/rtnetlink.h>
+#include <linux/hardirq.h>
+#include <linux/mm.h>
+#include <linux/notifier.h>
+#include <asm/s390_ext.h>
+#include <asm/atomic.h>
+#include <asm/vdev.h>
+
+#include "vnet.h"
+#include "vnet_guest.h"
+
+static LIST_HEAD(vnet_devices);
+static rwlock_t vnet_devices_lock = RW_LOCK_UNLOCKED;
+
+
+static int
+vnet_net_open(struct net_device *dev)
+{
+ struct vnet_guest_device *guest;
+ struct vnet_control *control;
+
+ guest = netdev_priv(dev);
+ control = guest->control;
+ atomic_set(&control->s2pmit, 0);
+ netif_start_queue(dev);
+ return 0;
+}
+
+static int
+vnet_net_stop(struct net_device *dev)
+{
+ netif_stop_queue(dev);
+ return 0;
+}
+
+static void vnet_net_tx_timeout(struct net_device *dev)
+{
+ struct vnet_guest_device *zk = netdev_priv(dev);
+ struct vnet_control *control = zk->control;
+
+ printk(KERN_ERR "problems in xmit for device %s\n Resetting...\n",
+ dev->name);
+ atomic_set(&control->p2smit, 0);
+ atomic_set(&control->s2pmit, 0);
+ diag_vnet_send_interrupt(zk->hostfd, VNET_IRQ_START_RX);
+ netif_wake_queue(dev);
+}
+
+
+static void skbcopy(char *dest, struct sk_buff *skb)
+{
+ int i;
+
+ memcpy(dest, skb->data, skb_headlen(skb));
+ dest += skb_headlen(skb);
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ memcpy(dest, page_address(frag->page) +
+ frag->page_offset, frag->size);
+ dest+=frag->size;
+ }
+}
+
+static int
+vnet_net_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ struct vnet_guest_device *zk = netdev_priv(dev);
+ struct vnet_control *control = zk->control;
+ struct xmit_buffer *buf;
+ int pkid;
+ int buffer_status;
+
+ if (!spin_trylock(&zk->lock))
+ return NETDEV_TX_LOCKED;
+ if(vnet_q_full(atomic_read(&control->p2smit))) {
+ netif_stop_queue(dev);
+ goto full;
+ }
+ pkid = __nextx(atomic_read(&control->p2smit));
+ buf = &control->p2sbufs[pkid];
+ buf->len = skb->len;
+ buf->proto = skb->protocol;
+ skbcopy(buf->data, skb);
+ buffer_status = vnet_tx_packet(&control->p2smit);
+ spin_unlock(&zk->lock);
+ zk->stats.tx_packets++;
+ zk->stats.tx_bytes += skb->len;
+ dev_kfree_skb_any(skb);
+ dev->trans_start = jiffies;
+ if (buffer_status & QUEUE_WAS_EMPTY)
+ diag_vnet_send_interrupt(zk->hostfd, VNET_IRQ_START_RX);
+ if (!(buffer_status & QUEUE_IS_FULL))
+ return NETDEV_TX_OK;
+ netif_stop_queue(dev);
+ spin_lock(&zk->lock);
+full:
+ if (!vnet_q_full(atomic_read(&control->p2smit)))
+ netif_start_queue(dev);
+ spin_unlock(&zk->lock);
+ return NETDEV_TX_OK;
+}
+
+static int
+vnet_l2_poll(struct net_device *dev, int *budget)
+{
+ struct vnet_guest_device *zk = netdev_priv(dev);
+ struct vnet_control *control = zk->control;
+ struct xmit_buffer *buf;
+ struct sk_buff *skb;
+ int pkid, count, numpackets = min(dev->quota, *budget);
+ int buffer_status;
+
+ if (vnet_q_empty(atomic_read(&control->s2pmit))) {
+ count = 0;
+ goto empty;
+ }
+loop:
+ count = 0;
+ while(numpackets) {
+ pkid = __nextr(atomic_read(&control->s2pmit));
+ buf = &control->s2pbufs[pkid]; /* kernel pointer!*/
+ skb = dev_alloc_skb(buf->len);
+ if (likely(skb)) {
+ memcpy(skb_put(skb, buf->len), buf->data, buf->len);
+ skb->dev = dev;
+ skb->protocol = eth_type_trans(skb, dev);
+ zk->stats.rx_packets++;
+ zk->stats.rx_bytes += buf->len;
+ netif_receive_skb(skb);
+ numpackets--;
+ (*budget)--;
+ dev->quota--;
+ count++;
+ } else
+ zk->stats.rx_dropped++;
+ buffer_status = vnet_rx_packet(&control->s2pmit);
+ if (buffer_status & QUEUE_WAS_FULL)
+ diag_vnet_send_interrupt(zk->hostfd,
+ VNET_IRQ_START_TX);
+ if (buffer_status & QUEUE_IS_EMPTY)
+ goto empty;
+ }
+ return 1; /* please ask us again */
+ empty:
+ netif_rx_complete(dev);
+ /* we might have raced against a wakup */
+ if (!vnet_q_empty(atomic_read(&control->s2pmit))) {
+ if (netif_rx_reschedule(dev, count))
+ goto loop;
+ }
+ return 0; /* we're done for now */
+}
+
+
+static int
+vnet_l3_poll(struct net_device *dev, int *budget)
+{
+ struct vnet_guest_device *zk = dev->priv;
+ struct vnet_control *control = zk->control;
+ struct xmit_buffer *buf;
+ struct sk_buff *skb;
+ int pkid, count, numpackets = min(dev->quota, *budget);
+ int buffer_status;
+
+ if (vnet_q_empty(atomic_read(&control->s2pmit))) {
+ count = 0;
+ goto empty;
+ }
+loop:
+ count = 0;
+ while(numpackets) {
+ pkid = __nextr(atomic_read(&control->s2pmit));
+ buf = &control->s2pbufs[pkid]; /*kernel pointer*/
+ skb = dev_alloc_skb(buf->len + NET_IP_ALIGN);
+ if (likely(skb)) {
+ skb_reserve(skb, NET_IP_ALIGN);
+ memcpy(skb_put(skb, buf->len), buf->data, buf->len);
+ skb->dev = dev;
+ skb->protocol = buf->proto;
+ skb->mac.raw = skb->data;
+ zk->stats.rx_packets++;
+ zk->stats.rx_bytes += buf->len;
+ netif_receive_skb(skb);
+ numpackets--;
+ (*budget)--;
+ dev->quota--;
+ count++;
+ } else
+ zk->stats.rx_dropped++;
+ buffer_status = vnet_rx_packet(&control->s2pmit);
+ if (buffer_status & QUEUE_WAS_FULL)
+ diag_vnet_send_interrupt(zk->hostfd,
+ VNET_IRQ_START_TX);
+ if (buffer_status & QUEUE_IS_EMPTY)
+ goto empty;
+ }
+ return 1; /* please ask us again */
+ empty:
+ netif_rx_complete(dev);
+ /* we might have raced against a wakup */
+ if (!vnet_q_empty(atomic_read(&control->s2pmit))) {
+ if (netif_rx_reschedule(dev, count))
+ goto loop;
+ }
+ return 0; /* we're done for now */
+}
+
+static struct net_device_stats *
+vnet_net_stats(struct net_device *dev)
+{
+ struct vnet_guest_device *zk = netdev_priv(dev);
+ return &zk->stats;
+}
+
+static int
+vnet_net_change_mtu(struct net_device *dev, int new_mtu)
+{
+ if (new_mtu <= ETH_ZLEN)
+ return -ERANGE;
+ if (new_mtu > VNET_BUFFER_SIZE-ETH_HLEN)
+ return -ERANGE;
+ dev->mtu = new_mtu;
+ return 0;
+}
+
+
+static void
+__vnet_common_init(struct net_device *dev)
+{
+ dev->open = vnet_net_open;
+ dev->stop = vnet_net_stop;
+ dev->hard_start_xmit = vnet_net_xmit;
+ dev->get_stats = vnet_net_stats;
+ dev->tx_timeout = vnet_net_tx_timeout;
+ dev->watchdog_timeo = VNET_TIMEOUT;
+ dev->change_mtu = vnet_net_change_mtu;
+ dev->weight = 64;
+ dev->features |= NETIF_F_SG | NETIF_F_LLTX;
+}
+
+static void
+__vnet_layer3_init(struct net_device *dev)
+{
+ dev->mtu = ETH_DATA_LEN;
+ dev->tx_queue_len = 1000;
+ dev->flags = IFF_BROADCAST|IFF_MULTICAST|IFF_NOARP;
+ dev->type = ARPHRD_PPP;
+ dev->mtu = 1492;
+ dev->poll = vnet_l3_poll;
+ __vnet_common_init(dev);
+}
+
+static void
+__vnet_layer2_init(struct net_device *dev)
+{
+ ether_setup(dev);
+ dev->mtu = 1492;
+ dev->poll = vnet_l2_poll;
+ __vnet_common_init(dev);
+}
+
+static struct vnet_guest_device *
+__get_vnet_dev_by_fd(int fd)
+{
+ struct vnet_guest_device *zk;
+
+ read_lock(&vnet_devices_lock);
+ list_for_each_entry(zk, &vnet_devices, lh) {
+ if (zk->hostfd == fd)
+ goto found;
+ }
+ zk = NULL;
+ found:
+ read_unlock (&vnet_devices_lock);
+ return zk;
+}
+
+void vnet_ext_handler(__u16 code)
+{
+ unsigned int type = S390_lowcore.ext_params & 3;
+ unsigned int fd = S390_lowcore.ext_params >> 2;
+
+ struct vnet_guest_device *zk = __get_vnet_dev_by_fd(fd);
+
+ BUG_ON(!zk);
+ switch (type) {
+ case VNET_IRQ_START_RX:
+ netif_rx_schedule(zk->netdev);
+ break;
+ case VNET_IRQ_START_TX:
+ netif_wake_queue(zk->netdev);
+ break;
+ default:
+ BUG();
+ }
+}
+
+static void
+vnet_delete_device(struct vnet_guest_device *zd)
+{
+ int i;
+ unsigned long flags;
+
+ if (zd->hostfd >= 0)
+ diag_vnet_release(zd->hostfd);
+ write_lock_irqsave(&vnet_devices_lock, flags);
+ list_del(&zd->lh);
+ write_unlock_irqrestore(&vnet_devices_lock, flags);
+
+ for (i=0; i<VNET_QUEUE_LEN; i++) {
+ if (zd->control->s2pbufs[i].data) {
+ free_pages((unsigned long) zd->control->s2pbufs[i].data, VNET_BUFFER_ORDER);
+ zd->control->s2pbufs[i].data = NULL;
+ }
+ if (zd->control->p2sbufs[i].data) {
+ free_pages((unsigned long) zd->control->p2sbufs[i].data, VNET_BUFFER_ORDER);
+ zd->control->p2sbufs[i].data = NULL;
+ }
+ }
+ if (zd->control) {
+ kfree(zd->control);
+ zd->control = NULL;
+ }
+ if (zd->netdev) /* CAUTION: this also frees zd*/
+ free_netdev(zd->netdev);
+}
+
+
+static int vnet_device_alloc(struct vnet_guest_device *zd)
+{
+ int i;
+
+ zd->control = kzalloc(sizeof(struct vnet_control), GFP_KERNEL);
+ if (!zd->control)
+ return -ENOMEM;
+ for (i=0; i<VNET_QUEUE_LEN; i++) {
+ zd->control->s2pbufs[i].data = (void *) __get_free_pages(GFP_KERNEL, VNET_BUFFER_ORDER);
+ if (!zd->control->s2pbufs[i].data)
+ return -ENOMEM;
+ zd->control->p2sbufs[i].data = (void *) __get_free_pages(GFP_KERNEL, VNET_BUFFER_ORDER);
+ if (!zd->control->p2sbufs[i].data)
+ return -ENOMEM;
+ }
+ return 0;
+}
+
+static int vnet_probe(struct vdev *vdev)
+{
+ int ret;
+ long flags;
+ struct vnet_guest_device *zd;
+ struct net_device *netdev;
+ int linktype;
+
+ if (strlen(vdev->symname) > IFNAMSIZ) {
+ printk(KERN_ERR "vnet: %s is too long for a network device,"
+ "discarding it\n", vdev->symname);
+ return -EINVAL;
+ }
+ ret = diag_vnet_info(vdev->hostid, &linktype);
+ if (ret)
+ return ret;
+ if (linktype == 3)
+ netdev = alloc_netdev(sizeof(*zd), vdev->symname,__vnet_layer3_init);
+ else
+ netdev = alloc_netdev(sizeof(*zd), vdev->symname, __vnet_layer2_init);
+ if (!netdev)
+ return -ENOMEM;
+ zd = netdev_priv(netdev);
+ zd->netdev = netdev;
+
+ ret =vnet_device_alloc(zd);
+ if (ret)
+ goto out;
+ zd->control->buffer_size = VNET_BUFFER_SIZE;
+ zd->linktype = linktype;
+ memcpy(zd->ifname, vdev->symname, IFNAMSIZ);
+ INIT_LIST_HEAD(&zd->lh);
+
+ write_lock_irqsave(&vnet_devices_lock, flags);
+ zd->hostfd = diag_vnet_open(vdev->hostid, zd->control);
+ if (zd->hostfd < 0) {
+ write_unlock_irqrestore(&vnet_devices_lock, flags);
+ goto out;
+ }
+ list_add_tail(&zd->lh, &vnet_devices);
+ write_unlock_irqrestore(&vnet_devices_lock, flags);
+
+ // host is ready, now we can set up our local network interface
+ rtnl_lock();
+ memcpy(netdev->dev_addr, zd->control->mac, 6);
+ spin_lock_init(&zd->lock);
+
+ if (!(ret = register_netdevice(zd->netdev))) {
+ /* good case */
+ rtnl_unlock();
+ printk("vnet: Successfully registered %s\n", vdev->symname);
+ return 0;
+ }
+ printk("vnet: Could not register network interface %s\n", vdev->symname);
+ rtnl_unlock();
+ out:
+ vnet_delete_device(zd);
+ return ret;
+}
+
+static struct vdev_driver vnet_driver = {
+ .name = "vnet",
+ .owner = THIS_MODULE,
+ .vdev_type = VDEV_TYPE_NET,
+ .probe = vnet_probe,
+};
+
+static int vnet_ip_event(struct notifier_block *this,
+ unsigned long event,void *ptr)
+{
+ struct in_ifaddr *ifa = (struct in_ifaddr *)ptr;
+ struct net_device *dev =(struct net_device *) ifa->ifa_dev->dev;
+ struct vnet_guest_device *zk;
+ read_lock(&vnet_devices_lock);
+ list_for_each_entry(zk, &vnet_devices, lh)
+ if (zk->netdev == dev) {
+ read_unlock(&vnet_devices_lock);
+ if (event == NETDEV_UP)
+ diag_vnet_ip(1, ifa->ifa_address,
+ ifa->ifa_mask,
+ ifa->ifa_broadcast);
+ if (event == NETDEV_DOWN)
+ diag_vnet_ip(0, ifa->ifa_address,
+ ifa->ifa_mask,
+ ifa->ifa_broadcast);
+ return NOTIFY_OK;
+ }
+ read_unlock(&vnet_devices_lock);
+ return NOTIFY_DONE;
+}
+
+static struct notifier_block vnet_ip_notifier = {
+ vnet_ip_event,
+ NULL
+};
+
+/* module related section */
+int
+vnet_guest_init(void)
+{
+ int ret;
+
+ if (!MACHINE_IS_GUEST)
+ return -ENODEV;
+ BUILD_BUG_ON(sizeof(struct vnet_control) > PAGE_SIZE);
+ register_external_interrupt(0x1236, vnet_ext_handler);
+ if(register_inetaddr_notifier(&vnet_ip_notifier)) {
+ printk(KERN_ERR "vnet: Could not register ip callback\n");
+ unregister_external_interrupt(0x1236, vnet_ext_handler);
+ }
+ ret = vdev_driver_register(&vnet_driver);
+ if (ret) {
+ printk(KERN_ERR "vnet: Could not register driver\n");
+ unregister_external_interrupt(0x1236, vnet_ext_handler);
+ unregister_inetaddr_notifier(&vnet_ip_notifier);
+ return ret;
+ }
+ return ret;
+}
+
+void
+vnet_guest_exit(void)
+{
+ struct vnet_guest_device *zk;
+ struct vnet_guest_device *temp;
+
+
+ unregister_external_interrupt(0x1236, vnet_ext_handler);
+ unregister_inetaddr_notifier(&vnet_ip_notifier);
+ rtnl_lock();
+ write_lock(&vnet_devices_lock);
+ list_for_each_entry_safe(zk, temp, &vnet_devices, lh) {
+ netif_stop_queue(zk->netdev);
+ unregister_netdevice(zk->netdev);
+ vnet_delete_device(zk);
+ }
+ write_unlock(&vnet_devices_lock);
+ rtnl_unlock();
+}
+
+module_init(vnet_guest_init);
+module_exit(vnet_guest_exit);
+MODULE_DESCRIPTION("VNET: Virtual network driver");
+MODULE_AUTHOR("Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>");
+MODULE_LICENSE("GPL");
Index: linux-2.6.21/drivers/s390/guest/vnet_guest.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_guest.h
@@ -0,0 +1,111 @@
+/*
+ * vnet - zlive insular communication knack
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * Author: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#ifndef __VNET_GUEST_H
+#define __VNET_GUEST_H
+
+#include <linux/netdevice.h>
+#include <linux/workqueue.h>
+#include <linux/spinlock.h>
+#include "vnet.h"
+
+
+struct vnet_guest_device {
+ struct list_head lh;
+ int hostfd;
+ char ifname[IFNAMSIZ];
+ struct net_device *netdev;
+ struct vnet_control *control;
+ struct net_device_stats stats;
+ struct work_struct work;
+ int linktype;
+ spinlock_t lock;
+};
+
+static inline int
+diag_vnet_info(char *ifname, int *linktype)
+{
+ register char * __arg1 asm ("2") = ifname;
+ register int * __arg2 asm ("3") = linktype;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,0x0e"
+ : "=d" (__svcres)
+ : "0" (__arg1),
+ "d" (__arg2)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+static inline int
+diag_vnet_open(char *ifname, struct vnet_control *ctrl)
+{
+ register char * __arg1 asm ("2") = ifname;
+ register struct vnet_control * __arg2 asm ("3") = ctrl;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,0x0f"
+ : "=d" (__svcres)
+ : "0" (__arg1),
+ "d" (__arg2)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+static inline void
+diag_vnet_send_interrupt(int fd, int type)
+{
+ register long __arg1 asm ("2") = fd;
+ register long __arg2 asm ("3") = type;
+
+ __asm__ __volatile__ (
+ "diag 0,0,0x11"
+ : : "d" (__arg1),
+ "d" (__arg2)
+ : "cc", "memory");
+}
+
+static inline void
+diag_vnet_release(int fd)
+{
+ register long __arg1 asm ("2") = fd;
+
+ __asm__ __volatile__ (
+ "diag 0,0,0x13"
+ : : "d" (__arg1)
+ : "cc", "memory");
+}
+static inline int
+diag_vnet_ip(int add, u32 addr, u32 mask, u32 broadcast)
+{
+ register long __arg1 asm ("2") = add;
+ register long __arg2 asm ("3") = addr;
+ register long __arg3 asm ("4") = mask;
+ register long __arg4 asm ("5") = broadcast;
+ register int __svcres asm("2");
+ int __res;
+
+ __asm__ __volatile__ (
+ "diag 0,0,0x1f"
+ : "=d" (__svcres)
+ : "d" (__arg1),
+ "d" (__arg2),
+ "d" (__arg3),
+ "d" (__arg4)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+#endif
Index: linux-2.6.21/drivers/s390/guest/Makefile
===================================================================
--- linux-2.6.21.orig/drivers/s390/guest/Makefile
+++ linux-2.6.21/drivers/s390/guest/Makefile
@@ -5,4 +5,4 @@
obj-$(CONFIG_GUEST_CONSOLE) += guest_console.o guest_tty.o
obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o
obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o
-
+obj-$(CONFIG_VNET_GUEST) += vnet_guest.o
Index: linux-2.6.21/drivers/s390/net/Kconfig
===================================================================
--- linux-2.6.21.orig/drivers/s390/net/Kconfig
+++ linux-2.6.21/drivers/s390/net/Kconfig
@@ -86,4 +86,13 @@ config CCWGROUP
tristate
default (LCS || CTC || QETH)

+config VNET_GUEST
+ tristate "virtual networking support (GUEST)"
+ depends on S390_GUEST
+ help
+ This is the guest part of the vnet guest network connection.
+ Say Y if you plan to run this kernel as guest with network
+ connection.
+ If you're not using host/guest support, say N.
+
endmenu



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-11 19:44:02 UTC
Permalink
Let me ask what may seem to be a naive question to the linux world. I
see you are doing a lot off solid work on adding block and network
devices. The code for block and network devices
is implemented in different ways. I've also seen this difference of
inerface/implementation on Xen.

Hence my question:
Why are the INTERFACES to the block and network devices different? I
can understand that the implementation -- what goes on "inside the
box" -- would be different. But, again, why is the interface to the
resource different in each case? Will every distinct type of I/O
device end up with a different interface?

These questions doubtless seem naive, I suppose, except I use a system
(Plan 9) in which a common interface is in fact used for the different
resources. I have been hoping that we could bring this model -- same
interface, different resource -- to the inter-vm communications. I
would like to at least raise the idea that it could be used on KVM.

Avoiding too much detail, in the plan 9 world, read and write of data
to a disk is via file read and write system calls. Same for a network.
Same for the mouse, the window system, the serial port, the console,
USB, and so on. Please see this note from IBM on what is
possible:http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument
or http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf

Different resources, same interface. In the hypervisor world, you
build one shared memory queue as a basic abstraction. On top of that
queue, you run 9P. The provider (network, block device, etc.) provides
certain resources to you, the guest domain The resources have names. A
network can look like this, to a kvm guest (this command from a Plan 9
system):
cpu% ls /net/ether0
/net/ether0/0
/net/ether0/1
/net/ether0/2
/net/ether0/addr
/net/ether0/clone
/net/ether0/ifstats
/net/ether0/stats
To get network stats, or do I/O, one simply gains access to the
appropriate ring buffer, by finding the name, and does the ring buffer
sends and receives via shared memory queues. The I/O operations can be
very efficient.

Disk looks like this:
cpu% ls -l /dev/sdC0
--rw-r----- S 0 bootes bootes 104857600 Jan 22 15:49 /dev/sdC0/9fat
--rw-r----- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
--rw-r----- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/ctl
--rw-r----- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
--rw-r----- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
--rw-r----- S 0 bootes bootes 3268060672 Jan 22 15:49 /dev/sdC0/isect
--rw-r----- S 0 bootes bootes 512 Jan 22 15:49 /dev/sdC0/nvram
--rw-r----- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
-lrw------- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/raw
--rw-r----- S 0 bootes bootes 536870912 Jan 22 15:49 /dev/sdC0/swap
cpu%

So the disk partitions are "files", with the "data" file being the
whole disk. Again, on a hypervisor system, to do I/O, software could
create a connection to the "file" and establish the in-memory ring
buffer, for that partition. This I/O can be very efficient; IBM
research is working on zero-copy mechanisms for moving data between
domains.

The result is a single, consistent mechanism for accessing all
resources from a guest domain. The resources have names, and it is
easy to examine the status -- binary interfaces can be minimized. The
resources can be provided by in-kernel servers -- Linux drivers -- or
out-of-kernel servers -- proceses. Same interface, and yet the
implementation of the provider of the resource can be utterly
different.

We had hoped to get something like this into Xen. On Xen, for example,
the block device and ethernet device interfaces are as different as
one could imagine. Disk I/O does not steal pages from the guest. The
network does. Disk I/O is in 4k chunks, period, with a bitmap
describing which of the 8 512-byte subunits are being sent. The enet
device, on read, returns a page with your packet, but also potentially
containing bits of other domain's packets too. The interfaces are as
dissimilar as they can be, and I see no reason for such a huge
variance between what are basically read/write devices.

Another issue is that kvm, in its current form (-24) is beautifully
simple. These additions seem to detract from the beauty a bit. Might
it be worth taking a little time to consider these ideas in order to
preserve the basic elegance of KVM?

So, before we go too far down the Xen-like paravirtualized device
route, can we discuss the way this ought to look a bit?

thanks

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-11 20:12:05 UTC
Permalink
ron minnich wrote:
> Avoiding too much detail, in the plan 9 world, read and write of data
> to a disk is via file read and write system calls.

For low speed devices, I think paravirtualization doesn't make a lot of
sense unless it's absolutely required. I don't know enough about s390
to know if it supports things like uarts but if so, then emulating a
uart would in my mind make a lot more sense than a PV console device.

> Same for a network.
> Same for the mouse, the window system, the serial port, the console,
> USB, and so on. Please see this note from IBM on what is
> possible:http://domino.watson.ibm.com/library/CyberDig.nsf/0/c6c779bbf1650fa4852570670054f3ca?OpenDocument
> or http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
> Different resources, same interface. In the hypervisor world, you
> build one shared memory queue as a basic abstraction. On top of that
> queue, you run 9P. The provider (network, block device, etc.) provides
> certain resources to you, the guest domain The resources have names. A
> network can look like this, to a kvm guest (this command from a Plan 9
> system):
> cpu% ls /net/ether0
> /net/ether0/0
> /net/ether0/1
> /net/ether0/2
> /net/ether0/addr
> /net/ether0/clone
> /net/ether0/ifstats
> /net/ether0/stats
>

This smells a bit like XenStore which I think most will agree was an
unmitigated disaster. This sort of thing gets terribly complicated to
deal with in the corner cases. Atomic operation of multiple read/write
operations is difficult to express. Moreover, quite a lot of things are
naturally expressed as a state machine which is not straight forward to
do in this sort of model. This may have been all figured out in 9P but
it's certainly not a simple thing to get right.

I think a general rule of thumb for a virtualized environment is that
the closer you stick to the way hardware tends to do things, the less
likely you are to screw yourself up and the easier it will be for other
platforms to support your devices. Implementing a full 9P client just
to get console access in something like mini-os would be unfortunate.
At least the posted s390 console driver behaves roughly like a uart so
it's pretty obvious that it will be easy to implement in any OS that
supports uarts already.

Regards,

Anthony Liguori

> To get network stats, or do I/O, one simply gains access to the
> appropriate ring buffer, by finding the name, and does the ring buffer
> sends and receives via shared memory queues. The I/O operations can be
> very efficient.
>
> Disk looks like this:
> cpu% ls -l /dev/sdC0
> --rw-r----- S 0 bootes bootes 104857600 Jan 22 15:49 /dev/sdC0/9fat
> --rw-r----- S 0 bootes bootes 65361213440 Jan 22 15:49 /dev/sdC0/arenas
> --rw-r----- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/ctl
> --rw-r----- S 0 bootes bootes 82348277760 Jan 22 15:49 /dev/sdC0/data
> --rw-r----- S 0 bootes bootes 13072242688 Jan 22 15:49 /dev/sdC0/fossil
> --rw-r----- S 0 bootes bootes 3268060672 Jan 22 15:49 /dev/sdC0/isect
> --rw-r----- S 0 bootes bootes 512 Jan 22 15:49 /dev/sdC0/nvram
> --rw-r----- S 0 bootes bootes 82343245824 Jan 22 15:49 /dev/sdC0/plan9
> -lrw------- S 0 bootes bootes 0 Jan 22 15:49 /dev/sdC0/raw
> --rw-r----- S 0 bootes bootes 536870912 Jan 22 15:49 /dev/sdC0/swap
> cpu%
>
> So the disk partitions are "files", with the "data" file being the
> whole disk. Again, on a hypervisor system, to do I/O, software could
> create a connection to the "file" and establish the in-memory ring
> buffer, for that partition. This I/O can be very efficient; IBM
> research is working on zero-copy mechanisms for moving data between
> domains.
>
> The result is a single, consistent mechanism for accessing all
> resources from a guest domain. The resources have names, and it is
> easy to examine the status -- binary interfaces can be minimized. The
> resources can be provided by in-kernel servers -- Linux drivers -- or
> out-of-kernel servers -- proceses. Same interface, and yet the
> implementation of the provider of the resource can be utterly
> different.
>
> We had hoped to get something like this into Xen. On Xen, for example,
> the block device and ethernet device interfaces are as different as
> one could imagine. Disk I/O does not steal pages from the guest. The
> network does. Disk I/O is in 4k chunks, period, with a bitmap
> describing which of the 8 512-byte subunits are being sent. The enet
> device, on read, returns a page with your packet, but also potentially
> containing bits of other domain's packets too. The interfaces are as
> dissimilar as they can be, and I see no reason for such a huge
> variance between what are basically read/write devices.
>
> Another issue is that kvm, in its current form (-24) is beautifully
> simple. These additions seem to detract from the beauty a bit. Might
> it be worth taking a little time to consider these ideas in order to
> preserve the basic elegance of KVM?
>
> So, before we go too far down the Xen-like paravirtualized device
> route, can we discuss the way this ought to look a bit?
>
> thanks
>
> ron
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Eric Van Hensbergen
2007-05-11 21:15:57 UTC
Permalink
On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
> > cpu% ls /net/ether0
> > /net/ether0/0
> > /net/ether0/1
> > /net/ether0/2
> > /net/ether0/addr
> > /net/ether0/clone
> > /net/ether0/ifstats
> > /net/ether0/stats
> >
>
> This smells a bit like XenStore which I think most will agree was an
> unmitigated disaster.
>

I'd have to disagree with you Anthony. The Plan 9 interfaces are
simple and built into the kernel - they don't have the
multi-layered-stack-python-xmlrpc garbage that made up the Xen
interfaces.

>This sort of thing gets terribly complicated to deal with in the
corner cases.
>Atomic operation of multiple read/write operations is difficult to express.
> Moreover, quite a lot of things are naturally expressed as a state machine which
> is not straight forward to do in this sort of model. This may have been all
> figured out in 9P but it's certainly not a simple thing to get right.
>

That's true, but we have been doing it for over 20 years - I think we
have a good model to base stuff on.

> I think a general rule of thumb for a virtualized environment is that
> the closer you stick to the way hardware tends to do things, the less
> likely you are to screw yourself up and the easier it will be for other
> platforms to support your devices. Implementing a full 9P client just
> to get console access in something like mini-os would be unfortunate.
> At least the posted s390 console driver behaves roughly like a uart so
> it's pretty obvious that it will be easy to implement in any OS that
> supports uarts already.
>

If it were just console access, I would agree with you, but its really
about implementing a single solution for all drivers you are accessing
across the interface. A single client versus dozens of different
driver variants. Our existing 9p client for mini-os is ~3000 LOC and
it is a pretty naive port from the p9p code base so it could probably
be reduced even further. It is a very small percentage of our
existing mini-os kernels and gives us console, disk, network, IP
stack, file system, and control interfaces. Of course Linux clients
could just use v9fs with a hypervisor-shared-memory transport which I
haven't merged yet. We'll also be using the same set of interfaces
for the simulator shortly.

Oh yeah, and don't forget the fact that resource access can bridge
seamlessly over any network and the protocol has provisions to be
secured with authentication/encryption/digesting if desired.

Los Alamos will be presenting 9p based control interfaces for KVM at OLS.

-eric

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-11 21:47:02 UTC
Permalink
Eric Van Hensbergen wrote:
> On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>
>>> cpu% ls /net/ether0
>>> /net/ether0/0
>>> /net/ether0/1
>>> /net/ether0/2
>>> /net/ether0/addr
>>> /net/ether0/clone
>>> /net/ether0/ifstats
>>> /net/ether0/stats
>>>
>>>
>> This smells a bit like XenStore which I think most will agree was an
>> unmitigated disaster.
>>
>>
>
> I'd have to disagree with you Anthony. The Plan 9 interfaces are
> simple and built into the kernel - they don't have the
> multi-layered-stack-python-xmlrpc garbage that made up the Xen
> interfaces.
>

My point isn't that 9p is just like XenStore but rather that turning
this idea into something that is useful and elegant is non-trivial.

> If it were just console access, I would agree with you, but its really
> about implementing a single solution for all drivers you are accessing
> across the interface. A single client versus dozens of different
> driver variants.

There's definitely a conversation to have here. There are going to be a
lot of small devices that would benefit from a common transport
mechanism. Someone mentioned a PV entropy device on LKML. A
host=>guest filesystem is another consumer of such an interface.

I'm inclined to think though that the abstraction point should be the
transport and not the actual protocol. My concern with standardizing on
a protocol like 9p would be that one would lose some potential
optimizations (like passing PFN's directly between guest and host).

> Our existing 9p client for mini-os is ~3000 LOC and
> it is a pretty naive port from the p9p code base so it could probably
> be reduced even further. It is a very small percentage of our
> existing mini-os kernels and gives us console, disk, network, IP
> stack, file system, and control interfaces. Of course Linux clients
> could just use v9fs with a hypervisor-shared-memory transport which I
> haven't merged yet. We'll also be using the same set of interfaces
> for the simulator shortly.
>

So is there any reason to even tie 9p to KVM? Why not just have a
common PV transport that 9p can use. For certain things, it may make
sense (like v9fs).

Regards,

Anthony Liguori

> Oh yeah, and don't forget the fact that resource access can bridge
> seamlessly over any network and the protocol has provisions to be
> secured with authentication/encryption/digesting if desired.
>
> Los Alamos will be presenting 9p based control interfaces for KVM at OLS.
>
> -eric
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Eric Van Hensbergen
2007-05-11 22:21:44 UTC
Permalink
On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>
> There's definitely a conversation to have here. There are going to be a
> lot of small devices that would benefit from a common transport
> mechanism. Someone mentioned a PV entropy device on LKML. A
> host=>guest filesystem is another consumer of such an interface.
>
> I'm inclined to think though that the abstraction point should be the
> transport and not the actual protocol. My concern with standardizing on
> a protocol like 9p would be that one would lose some potential
> optimizations (like passing PFN's directly between guest and host).
>

I think that there are two layers - having a standard, well defined,
simple shared memory transport between partitions (or between
emulators and the host system) is certainly a prerequisite. There are
lots of different decisions to made here:
a) does it communicate with userspace, kernelspace, or both?
b) is it multi-channel? prioritized? interrupt driven or poll driven?
c) how big are the buffers? is it packetized?
d) can all of these parameters be something controllable from userspace?
e) I'm sure there are many others that I can't be bothered to think
of on a Friday

Regardless of the details, I think we can definitely come together on
a common mechanism here and avoid lots of duplication in the drivers
are already there and which will follow. My personal preference is to
keep things as simple and flat as possible. No XML, no multiple
stacks and daemons to contend with.

What runs on top of the transport is no doubt going to be a touchy
subject for some time to come. Many of Ron's arguments for 9p mostly
apply to this upper level. I/we will be pursuing this as a unified PV
resource sharing mechanism over the next few months in combination
with reorganization and optimization of the Linux 9p code. LANL has
also been making progress in this same direction. I'd have gotten
started sooner, but I was waiting for my new Thinkpad so that I can
actually run KVM ;)

>
> So is there any reason to even tie 9p to KVM? Why not just have a
> common PV transport that 9p can use. For certain things, it may make
> sense (like v9fs).
>

Well, I think we were discussing tying KVM to 9p, not vice-versa.

My personal view is that developing a generalized solution for
resource sharing of all manner of devices and services across
virtualization, emulation, and network boundaries is a better way to
spend our time than writing a bunch of specific
drivers/protocols/interfaces for each type of device and each type of
interconnect.

-eric

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-16 17:28:00 UTC
Permalink
Eric Van Hensbergen wrote:
> On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>>
>> There's definitely a conversation to have here. There are going to be a
>> lot of small devices that would benefit from a common transport
>> mechanism. Someone mentioned a PV entropy device on LKML. A
>> host=>guest filesystem is another consumer of such an interface.
>>
>> I'm inclined to think though that the abstraction point should be the
>> transport and not the actual protocol. My concern with standardizing on
>> a protocol like 9p would be that one would lose some potential
>> optimizations (like passing PFN's directly between guest and host).
>>
>
> I think that there are two layers - having a standard, well defined,
> simple shared memory transport between partitions (or between
> emulators and the host system) is certainly a prerequisite. There are
> lots of different decisions to made here:

What do you think about a socket interface? I'm not sure how discovery
would work yet, but there are a few PV socket implementations for Xen at
the moment.

> a) does it communicate with userspace, kernelspace, or both?

sockets are usable for both userspace/kernespace.

> b) is it multi-channel? prioritized? interrupt driven or poll driven?

Of course, arguments can be made for any of these depending on the
circumstance. I think you'd have to start with something simple that
would cover the most number of users (non-multiplexed, interrupt driven).

> c) how big are the buffers? is it packetized?

This could probably be tweaked with sockopts. I suspect you would have
an implementation for Xen, KVM, etc. and support a common set of options
(and possible some per-VM type of options).

> d) can all of these parameters be something controllable from userspace?
> e) I'm sure there are many others that I can't be bothered to think
> of on a Friday

The biggest point of contention would probably be what goes in the
sockaddr structure.

Thoughts?

Regards,

Anthony Liguori

> Regardless of the details, I think we can definitely come together on
> a common mechanism here and avoid lots of duplication in the drivers
> are already there and which will follow. My personal preference is to
> keep things as simple and flat as possible. No XML, no multiple
> stacks and daemons to contend with.
>
> What runs on top of the transport is no doubt going to be a touchy
> subject for some time to come. Many of Ron's arguments for 9p mostly
> apply to this upper level. I/we will be pursuing this as a unified PV
> resource sharing mechanism over the next few months in combination
> with reorganization and optimization of the Linux 9p code. LANL has
> also been making progress in this same direction. I'd have gotten
> started sooner, but I was waiting for my new Thinkpad so that I can
> actually run KVM ;)
>
>>
>> So is there any reason to even tie 9p to KVM? Why not just have a
>> common PV transport that 9p can use. For certain things, it may make
>> sense (like v9fs).
>>
>
> Well, I think we were discussing tying KVM to 9p, not vice-versa.
>
> My personal view is that developing a generalized solution for
> resource sharing of all manner of devices and services across
> virtualization, emulation, and network boundaries is a better way to
> spend our time than writing a bunch of specific
> drivers/protocols/interfaces for each type of device and each type of
> interconnect.
>
> -eric


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Daniel P. Berrange
2007-05-16 17:38:22 UTC
Permalink
On Wed, May 16, 2007 at 12:28:00PM -0500, Anthony Liguori wrote:
> Eric Van Hensbergen wrote:
> > On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
> >>
> >> There's definitely a conversation to have here. There are going to be a
> >> lot of small devices that would benefit from a common transport
> >> mechanism. Someone mentioned a PV entropy device on LKML. A
> >> host=>guest filesystem is another consumer of such an interface.
> >>
> >> I'm inclined to think though that the abstraction point should be the
> >> transport and not the actual protocol. My concern with standardizing on
> >> a protocol like 9p would be that one would lose some potential
> >> optimizations (like passing PFN's directly between guest and host).
> >>
> >
> > I think that there are two layers - having a standard, well defined,
> > simple shared memory transport between partitions (or between
> > emulators and the host system) is certainly a prerequisite. There are
> > lots of different decisions to made here:
>
> What do you think about a socket interface? I'm not sure how discovery
> would work yet, but there are a few PV socket implementations for Xen at
> the moment.

As a userspace apps service, I'd very much like to see a common sockets
interface for inter-VM communication that is portable across virt systems
like Xen & KVM. I'd see it as similar to UNIX domain sockets in style. So
basically any app which could do UNIX domain sockets, could be ported to
inter-VM sockets by just changing PF_UNIX to say, PF_VIRT
Lots of interesting details around impl & security (what VMs are allowed
to talk to each other, whether this policy should be controlled by the
host, or allow VMs to decide for themselves).

> > a) does it communicate with userspace, kernelspace, or both?
>
> sockets are usable for both userspace/kernespace.

For userspace, it would be very easy to adapt existing sockets based
apps using IP or UNIX sockets to use inter-VM sockets, which is a big
positive.

> > d) can all of these parameters be something controllable from userspace?
> > e) I'm sure there are many others that I can't be bothered to think
> > of on a Friday
>
> The biggest point of contention would probably be what goes in the
> sockaddr structure.

Keeping it very simple would be some arbitrary 'path', similar to UNIX
domain sockets in the abstract namespace ?

Regards,
Dan.
--
|=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=|
|=- Perl modules: http://search.cpan.org/~danberr/ -=|
|=- Projects: http://freshmeat.net/~danielpb/ -=|
|=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-17 09:29:13 UTC
Permalink
Daniel P. Berrange wrote:
> As a userspace apps service, I'd very much like to see a common sockets
> interface for inter-VM communication that is portable across virt systems
> like Xen & KVM. I'd see it as similar to UNIX domain sockets in style. So
> basically any app which could do UNIX domain sockets, could be ported to
> inter-VM sockets by just changing PF_UNIX to say, PF_VIRT
> Lots of interesting details around impl & security (what VMs are allowed
> to talk to each other, whether this policy should be controlled by the
> host, or allow VMs to decide for themselves).
z/VM, the premium hypervisor on 390 already has this capability for
decades. This is called IUCV (inter user communication vehicle), where
user really means virtual machine. It so happens the support for
AF_IUCV was recently merged to Linux mainline. It may be worth a look,
either for using it or because learning from existing solutions is
always a good idea.

so long,
Carsten


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-17 14:22:23 UTC
Permalink
Carsten Otte wrote:
> Daniel P. Berrange wrote:
>
>> As a userspace apps service, I'd very much like to see a common sockets
>> interface for inter-VM communication that is portable across virt systems
>> like Xen & KVM. I'd see it as similar to UNIX domain sockets in style. So
>> basically any app which could do UNIX domain sockets, could be ported to
>> inter-VM sockets by just changing PF_UNIX to say, PF_VIRT
>> Lots of interesting details around impl & security (what VMs are allowed
>> to talk to each other, whether this policy should be controlled by the
>> host, or allow VMs to decide for themselves).
>>
> z/VM, the premium hypervisor on 390 already has this capability for
> decades. This is called IUCV (inter user communication vehicle), where
> user really means virtual machine. It so happens the support for
> AF_IUCV was recently merged to Linux mainline. It may be worth a look,
> either for using it or because learning from existing solutions is
> always a good idea.
>

Is there anything that explains what the fields in sockaddr mean:

sa_family_t siucv_family;
unsigned short siucv_port; /* Reserved */
unsigned int siucv_addr; /* Reserved */
char siucv_nodeid[8]; /* Reserved */
char siucv_user_id[8]; /* Guest User Id */
char siucv_name[8]; /* Application Name */

Regards,

Anthony LIugori

> so long,
> Carsten
>
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christian Borntraeger
2007-05-21 11:11:35 UTC
Permalink
Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+***@public.gmane.org> wrote on 17.05.2007 16:22:23:
> Is there anything that explains what the fields in sockaddr mean:
>
> sa_family_t siucv_family;
> unsigned short siucv_port; /* Reserved */
> unsigned int siucv_addr; /* Reserved */
> char siucv_nodeid[8]; /* Reserved */
> char siucv_user_id[8]; /* Guest User Id */
> char siucv_name[8]; /* Application Name */

There is a small description in "Device Drivers, Features, and
Commands SC33-8289-03" on page 211 (its page 235 if you use the pdf
viewer page number)
http://download.boulder.ibm.com/ibmdl/pub/software/dw/linux390/docu/l26cdd03.pdf
(the file is 6.7 MB)

More generic information about iucv can be found in in
http://www-03.ibm.com/servers/eserver/zseries/zos/bkserv/zvmpdf/zvm52.html
or to be precise:
http://publibz.boulder.ibm.com/epubs/pdf/hcse5b11.pdf part 2. (11 MB)

That said, AF_IUCV builds on top of iucv and therefore requires z/VM
as hypervisor. I dont think that KVM should implement (af_)iucv. But
(af_)iucv shows several aspects how to make things good and bad.
(e.g. AF_IUCV as procotol on top of iucv was first defined in
CMS several years ago and is, therefore, not very smp-friendly.
On the other hand iucv itself offers modern features like
scatter/gather).

Back to the old question:
If shared memory or socket is better - I dont know. z/VM has both, see
dcss for its shared memory support.
Eric Van Hensbergen
2007-05-16 17:41:17 UTC
Permalink
On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>
> What do you think about a socket interface? I'm not sure how discovery
> would work yet, but there are a few PV socket implementations for Xen at
> the moment.
>

>From a functional standpoint I don't have a huge problem with it,
particularly if its more of a pure socket and not something that tries
to look like a TCP/IP endpoint -- I would prefer something closer to
netlink. Sockets would allow the exisitng 9p stuff to pretty much
work as-is.

However, all that being said, I noticed some pretty big differences
between sockets and shared memory in terms of overhead under Linux.

If you take a look at the RPC latency graph in:
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf

You'll see that a local socket implementation has about an order of
magnitude worse latency than a PROSE/Libra inter-partition shared
memory channel. Furthermore it will really limit our ability to trim
the fat of unnecessary copies in order to have competitive
performance. But perhaps there's magic you can do to eliminate that.

Of course, you could always layer a socket interface for userspace
simplicity on top of a more performance-optimized underlying transport
that could be used directly by kernel-modules.

-eric

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-16 18:47:04 UTC
Permalink
Eric Van Hensbergen wrote:
> On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>>
>> What do you think about a socket interface? I'm not sure how discovery
>> would work yet, but there are a few PV socket implementations for Xen at
>> the moment.
>>
>
> From a functional standpoint I don't have a huge problem with it,
> particularly if its more of a pure socket and not something that tries
> to look like a TCP/IP endpoint -- I would prefer something closer to
> netlink. Sockets would allow the exisitng 9p stuff to pretty much
> work as-is.

So you would prefer assigning out types instead of using an identifier
string in the sockaddr?

> However, all that being said, I noticed some pretty big differences
> between sockets and shared memory in terms of overhead under Linux.
>
> If you take a look at the RPC latency graph in:
> http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
>
> You'll see that a local socket implementation has about an order of
> magnitude worse latency than a PROSE/Libra inter-partition shared
> memory channel.

You seem to suggest that the low latency is due to a very greedy (CPU
hungry) polling algorithm. A poll vs. interrupt model would seem to me
to be orthogonal to using sockets as an interface.

> Furthermore it will really limit our ability to trim
> the fat of unnecessary copies in order to have competitive
> performance. But perhaps there's magic you can do to eliminate that.

sockets do add copies. My initial thinking is that one can work around
this by passing guest PFNs (or grant references in Xen). I'm also happy
to start out focusing on "low-speed" devices.

> Of course, you could always layer a socket interface for userspace
> simplicity on top of a more performance-optimized underlying transport
> that could be used directly by kernel-modules.

Right.

Regards,

Anthony Liguori

> -eric


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Eric Van Hensbergen
2007-05-16 19:33:34 UTC
Permalink
On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
> Eric Van Hensbergen wrote:
> >
> > From a functional standpoint I don't have a huge problem with it,
> > particularly if its more of a pure socket and not something that tries
> > to look like a TCP/IP endpoint -- I would prefer something closer to
> > netlink. Sockets would allow the exisitng 9p stuff to pretty much
> > work as-is.
>
> So you would prefer assigning out types instead of using an identifier
> string in the sockaddr?
>

I wasn't really thinking that extreme, just having an assigned type
for the vm sockets so that we can minimize baggage. Perhaps I'm
being overzealous.

> > However, all that being said, I noticed some pretty big differences
> > between sockets and shared memory in terms of overhead under Linux.
> >
> > If you take a look at the RPC latency graph in:
> > http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
> >
> > You'll see that a local socket implementation has about an order of
> > magnitude worse latency than a PROSE/Libra inter-partition shared
> > memory channel.
>
> You seem to suggest that the low latency is due to a very greedy (CPU
> hungry) polling algorithm. A poll vs. interrupt model would seem to me
> to be orthogonal to using sockets as an interface.
>

That certainly was a theory -- I never did detailed measurements,
however, there is certainly extra overhead associated with the socket
path due to kernel-user space boundary crossings and additional code
path length associated with socket operations. Still I'm game to
comparing the alternatives.

-eric

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Gregory Haskins
2007-05-16 17:45:42 UTC
Permalink
>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+***@public.gmane.org>,
Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>
> What do you think about a socket interface? I'm not sure how discovery
> would work yet, but there are a few PV socket implementations for Xen at
> the moment.

FYI: The work I am doing is exactly that. I am going to extend host-based unix domain sockets up to the KVM guest. Not sure how well it will work yet, as I had to lay the LAPIC work down first for IO-completion.

-Greg


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-16 18:39:39 UTC
Permalink
Gregory Haskins wrote:
>>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+***@public.gmane.org>,
>>>>
> Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>
>> What do you think about a socket interface? I'm not sure how discovery
>> would work yet, but there are a few PV socket implementations for Xen at
>> the moment.
>>
>
> FYI: The work I am doing is exactly that. I am going to extend host-based unix domain sockets up to the KVM guest. Not sure how well it will work yet, as I had to lay the LAPIC work down first for IO-completion.
>

Do you plan on introducing a new address family in the guest?

Regards,

Anthony Liguori

> -Greg
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Gregory Haskins
2007-05-16 18:57:10 UTC
Permalink
>>> On Wed, May 16, 2007 at 2:39 PM, in message <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+***@public.gmane.org>,
Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
> Gregory Haskins wrote:
>>>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+***@public.gmane.org>,
>>>>>
>> Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>>
>>> What do you think about a socket interface? I'm not sure how discovery
>>> would work yet, but there are a few PV socket implementations for Xen at
>>> the moment.
>>>
>>
>> FYI: The work I am doing is exactly that. I am going to extend host- based
> unix domain sockets up to the KVM guest. Not sure how well it will work yet,
> as I had to lay the LAPIC work down first for IO- completion.
>>
>
> Do you plan on introducing a new address family in the guest?

Well, since I had to step back and lay some infrastructure groundwork I haven't vetted this approach yet...so its possible what I am about to say is relatively naive: But my primary application is to create a guest-kernel to host IVMC. For that you can just think of the guest as any other process on the host, and it will just use the sockets normally as any host-process would. There might be some thunking that has to happen to deal with gpa vs va, etc, but otherwise its a standard consumer. If you want to extend IVMC up to guest-userspace, I think making some kind of new socket family makes sense in the guests stack. PF_VIRT like someone else suggested, for instance. But since I dont need this type of IVMC I haven't really thought about this too much.

-Greg



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-16 19:10:36 UTC
Permalink
Gregory Haskins wrote:
>>>> On Wed, May 16, 2007 at 2:39 PM, in message <464B4FEB.7070300-r/Jw6+rmf7HQT0dZR+***@public.gmane.org>,
>>>>
> Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>
>> Gregory Haskins wrote:
>>
>>>>>> On Wed, May 16, 2007 at 1:28 PM, in message <464B3F20.4030904-r/Jw6+rmf7HQT0dZR+***@public.gmane.org>,
>>>>>>
>>>>>>
>>> Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>>>
>>>
>>>> What do you think about a socket interface? I'm not sure how discovery
>>>> would work yet, but there are a few PV socket implementations for Xen at
>>>> the moment.
>>>>
>>>>
>>> FYI: The work I am doing is exactly that. I am going to extend host- based
>>>
>> unix domain sockets up to the KVM guest. Not sure how well it will work yet,
>> as I had to lay the LAPIC work down first for IO- completion.
>>
>>>
>>>
>> Do you plan on introducing a new address family in the guest?
>>
>
> Well, since I had to step back and lay some infrastructure groundwork I haven't vetted this approach yet...so its possible what I am about to say is relatively naive: But my primary application is to create a guest-kernel to host IVMC.

This is quite easy with KVM. I like the approach that vmchannel has
taken. A simple PCI device. That gives you a discovery mechanism for
shared memory and an interrupt and then you can just implement a ring
queue using those mechanisms (along with a PIO port for signaling from
the guest to the host). So given that underlying mechanism, the
question is how to expose that within the guest kernel/userspace and
within the host.

For the host, you can probably stay entirely within QEMU. Interguest
communication would be a bit tricky but guest->host communication is
real simple.

You could stop at exposing the channel as a socket within the guest
kernel/userspace. That would work, but you may also want to expose the
ring queue within the kernel at least if there are consumers that need
to avoid the copy.

A tricky bit of this is how to do discovery. If you want to support
interguest communication, it's not really sufficient to just use strings
since they identifiers would have to be unique throughout the entire
system. Maybe you just leave it as a guest=>host channel and be done
with it.

Regards,

Anthony Liguori

> For that you can just think of the guest as any other process on the host, and it will just use the sockets normally as any host-process would. There might be some thunking that has to happen to deal with gpa vs va, etc, but otherwise its a standard consumer. If you want to extend IVMC up to guest-userspace, I think making some kind of new socket family makes sense in the guests stack. PF_VIRT like someone else suggested, for instance. But since I dont need this type of IVMC I haven't really thought about this too much.
>
> -Greg
>
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Rusty Russell
2007-05-17 04:24:40 UTC
Permalink
On Wed, 2007-05-16 at 14:10 -0500, Anthony Liguori wrote:
> For the host, you can probably stay entirely within QEMU. Interguest
> communication would be a bit tricky but guest->host communication is
> real simple.

guest->host is always simple. But it'd be great if it didn't matter to
the guest whether it's talking to the host or another guest.

I think shared memory is an obvious start, but it's not enough for
inter-guest where they can't freely access each other's memory. So you
really want a ring-buffer of descriptors with a hypervisor-assist to say
"read/write this into the memory referred to by that descriptor".

I think this can be done as a simple variation of the current schemes in
existence.

But I'm shutting up until I have some demonstration code 8)

> A tricky bit of this is how to do discovery. If you want to support
> interguest communication, it's not really sufficient to just use strings
> since they identifiers would have to be unique throughout the entire
> system. Maybe you just leave it as a guest=>host channel and be done
> with it.

Hmm, I was going to leave that unspecified. One thing at a time...

Rusty.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-17 16:13:57 UTC
Permalink
Rusty Russell wrote:
> On Wed, 2007-05-16 at 14:10 -0500, Anthony Liguori wrote:
>
>> For the host, you can probably stay entirely within QEMU. Interguest
>> communication would be a bit tricky but guest->host communication is
>> real simple.
>>
>
> guest->host is always simple. But it'd be great if it didn't matter to
> the guest whether it's talking to the host or another guest.
>
> I think shared memory is an obvious start, but it's not enough for
> inter-guest where they can't freely access each other's memory. So you
> really want a ring-buffer of descriptors with a hypervisor-assist to say
> "read/write this into the memory referred to by that descriptor".
>

I think this is getting a little ahead of ourselves. An example of this
idea is pretty straight-forward but it gets more complicated when trying
to support the existing memory sharing mechanisms on various
hypervisors. There are a few cases to consider:

1) The target VM can access all of the memory of the guest VM with no
penalty. This is the case when going from guest=>QEMU in KVM or going
from guest=>kernel (ignoring highmem) in KVM. For this, you can send
arbitrary memory to the host.

2) The target VM can access all of the memory of the guest VM with a
penalty. For guest=>other userspace process in KVM, an mmap() would be
required. This would work for Xen provided the target VM was domain-0
but it would incur a xc_map_foreign_range().

3) The target and source VM can only share memory based on an existing
pool. This is the guest with Xen and grant tables.

I think an API that covers these three cases is a bit tricky and will
likely make undesired trade-offs. I think it's easier to start out
focusing on the "low-speed" case where there's a mandatory data-copy.

You can still pass gntref's or PFNs down this transport if you like and
perhaps down the road we'll find that we can make a common interface for
doing this sort of thing.

Regards,

Anthony Liguori

> I think this can be done as a simple variation of the current schemes in
> existence.
>
> But I'm shutting up until I have some demonstration code 8)
>
>
>> A tricky bit of this is how to do discovery. If you want to support
>> interguest communication, it's not really sufficient to just use strings
>> since they identifiers would have to be unique throughout the entire
>> system. Maybe you just leave it as a guest=>host channel and be done
>> with it.
>>
>
> Hmm, I was going to leave that unspecified. One thing at a time...
>
> Rusty.
>
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Rusty Russell
2007-05-17 23:34:08 UTC
Permalink
On Thu, 2007-05-17 at 11:13 -0500, Anthony Liguori wrote:
> Rusty Russell wrote:
> > I think shared memory is an obvious start, but it's not enough for
> > inter-guest where they can't freely access each other's memory. So you
> > really want a ring-buffer of descriptors with a hypervisor-assist to say
> > "read/write this into the memory referred to by that descriptor".
>
> I think this is getting a little ahead of ourselves. An example of this
> idea is pretty straight-forward but it gets more complicated when trying
> to support the existing memory sharing mechanisms on various
> hypervisors. There are a few cases to consider:

To clarify, I'm not overly interested in existing mechanisms. I'm first
trying for something sane from a Linux driver POV, then see if it can be
implemented in terms of legacy systems.

This reflects my belief that we will see more virtualization solutions
in the medium term, so it's reasonable to look at a new system.

Cheers,
Rusty.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christian Borntraeger
2007-05-21 09:07:07 UTC
Permalink
> This is quite easy with KVM. I like the approach that vmchannel has
> taken. A simple PCI device. That gives you a discovery mechanism for
> shared memory and an interrupt and then you can just implement a ring
> queue using those mechanisms (along with a PIO port for signaling from
> the guest to the host). So given that underlying mechanism, the
> question is how to expose that within the guest kernel/userspace and
> within the host.

Sorry for answering late, but I dont like PCI as a device bus for all
platforms. s390 has no PCI and s390 has no PIO. I would prefer a new
simple hypercall based virtual bus. I dont know much about windows
driver programming, but I guess it it is not that hard to add a new bus.
Cornelia Huck
2007-05-21 09:27:21 UTC
Permalink
On Mon, 21 May 2007 11:07:07 +0200,
Christian Borntraeger <CBORNTRA-tA70FqPdS9bQT0dZR+***@public.gmane.org> wrote:

> > This is quite easy with KVM. I like the approach that vmchannel has
> > taken. A simple PCI device. That gives you a discovery mechanism for
> > shared memory and an interrupt and then you can just implement a ring
> > queue using those mechanisms (along with a PIO port for signaling from
> > the guest to the host). So given that underlying mechanism, the
> > question is how to expose that within the guest kernel/userspace and
> > within the host.
>
> Sorry for answering late, but I dont like PCI as a device bus for all
> platforms. s390 has no PCI and s390 has no PIO. I would prefer a new
> simple hypercall based virtual bus. I dont know much about windows
> driver programming, but I guess it it is not that hard to add a new bus.

Agreed. Moreover, if you have an existing OS running on a non-pci
platform, it will be far more likely that they will be able to write a
driver against a simple hypercall-based bus than to cook up a
full-blown pci interface.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Arnd Bergmann
2007-05-21 11:28:03 UTC
Permalink
On Monday 21 May 2007, Christian Borntraeger wrote:
> > This is quite easy with KVM.  I like the approach that vmchannel has
> > taken.  A simple PCI device.  That gives you a discovery mechanism for
> > shared memory and an interrupt and then you can just implement a ring
> > queue using those mechanisms (along with a PIO port for signaling from
> > the guest to the host).  So given that underlying mechanism, the
> > question is how to expose that within the guest kernel/userspace and
> > within the host.
>
> Sorry for answering late, but I dont like PCI as a device bus for all
> platforms. s390 has no PCI and s390 has no PIO. I would prefer a new
> simple hypercall based virtual bus. I dont know much about windows
> driver programming, but I guess it it is not that hard to add a new bus.

We've had the same discussion about PCI as virtual device abstraction
recently when hpa made the suggestions to get a set of PCI device
numbers registered for Linux.

IIRC, the conclusion to which we came was that it is indeed helpful
for most architecture to have a PCI device as one way to probe for
the functionality, but not to rely on it. s390 is the obvious
example where you can't have PCI, but you may also want to build
a guest kernel without PCI support because of space constraints
in a many-guests machine.

What I think would be ideal is to have a new bus type in Linux
that does not have any dependency on PCI itself, but can be
easily implemented as a child of a PCI device.

If we only need the stuff mentioned by Anthony, the interface could
look like

struct vmchannel_device {
struct resource virt_mem;
struct vm_device_id id;
int irq;
int (*signal)(struct vmchannel_device *);
int (*irq_ack)(struct vmchannel_device *);
struct device dev;
};

Such a device can easily be provided as a child of a PCI device,
or as something that is purely virtual based on an hcall interface.

Arnd <><

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Cornelia Huck
2007-05-21 11:56:28 UTC
Permalink
On Mon, 21 May 2007 13:28:03 +0200,
Arnd Bergmann <arnd-***@public.gmane.org> wrote:

> We've had the same discussion about PCI as virtual device abstraction
> recently when hpa made the suggestions to get a set of PCI device
> numbers registered for Linux.

(If you want to read it up, it's the thread at
http://marc.info/?t=117554525400003&r=1&w=2)

>
> IIRC, the conclusion to which we came was that it is indeed helpful
> for most architecture to have a PCI device as one way to probe for
> the functionality, but not to rely on it. s390 is the obvious
> example where you can't have PCI, but you may also want to build
> a guest kernel without PCI support because of space constraints
> in a many-guests machine.
>
> What I think would be ideal is to have a new bus type in Linux
> that does not have any dependency on PCI itself, but can be
> easily implemented as a child of a PCI device.
>
> If we only need the stuff mentioned by Anthony, the interface could
> look like
>
> struct vmchannel_device {
> struct resource virt_mem;
> struct vm_device_id id;
> int irq;
^^^^^^^^
> int (*signal)(struct vmchannel_device *);
> int (*irq_ack)(struct vmchannel_device *);
> struct device dev;
> };

IRQ numbers are evil :)

It should be more like a
void *vmchannel_device_handle;
which could be different things depending on what we want the
vmchannel_device to be a child of (it could be an IRQ number for
PCI devices, or something like subchannel_id if we wanted to
support channel devices).

>
> Such a device can easily be provided as a child of a PCI device,
> or as something that is purely virtual based on an hcall interface.

This looks like a flexible approach.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Arnd Bergmann
2007-05-21 13:53:25 UTC
Permalink
On Monday 21 May 2007, Cornelia Huck wrote:
> IRQ numbers are evil :)

yes, but getting rid of them is an entirely different discussion.
I really think that in the first step, you should be able to
use its "external interrupts" with the same request_irq interface
as the other architectures.

Fundamentally, the s390 architecture has external interrupt numbers
as well, you're just using a different interface for registering them.
The ccw devices obviously have a better interface already, but
that doesn't help you here.

> It should be more like a
>         void *vmchannel_device_handle;
> which could be different things depending on what we want the
> vmchannel_device to be a child of (it could be an IRQ number for
> PCI devices, or something like subchannel_id if we wanted to
> support channel devices).

No, the driver needs to know how to get at the interrupt without
caring about the bus implementation, that's why you either need
to have a callback function set by the driver (like s390 CCW
or USB have it), or visible interrupt number (like everyone does).

There is no need for a pointer back to a vmchannel_device_handle,
all information needed by the bus layer can simply be in a
subclass derived from the vmchannel_device, e.g.

struct vmchannel_pci {
struct pci_device *parent; /* shortcut, same as
to_pci_dev(&this.vmdev.dev.parent) */
unsigned long signal_ioport; /* for interrupt generation */
struct vmchannel_device vmdev;
};

You would allocate this structure in the pci_driver that registers
the vmchannel_device.

Arnd <><

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-21 18:45:37 UTC
Permalink
Arnd Bergmann wrote:
> On Monday 21 May 2007, Christian Borntraeger wrote:
>
>>> This is quite easy with KVM. I like the approach that vmchannel has
>>> taken. A simple PCI device. That gives you a discovery mechanism for
>>> shared memory and an interrupt and then you can just implement a ring
>>> queue using those mechanisms (along with a PIO port for signaling from
>>> the guest to the host). So given that underlying mechanism, the
>>> question is how to expose that within the guest kernel/userspace and
>>> within the host.
>>>
>> Sorry for answering late, but I dont like PCI as a device bus for all
>> platforms. s390 has no PCI and s390 has no PIO.

Right, I'm not interested in the lowest level implementation (PCI device
+ PIO). I'm more interested in the higher level interface. The goal is
to allow drivers to be able to be written to the higher level interface
so that they work on any platform that implements the lower level
interface. On x86, that would be PCI/PIO. On s390, that could be
hypercall based.

>> I would prefer a new
>> simple hypercall based virtual bus. I dont know much about windows
>> driver programming, but I guess it it is not that hard to add a new bus.
>>
>
> We've had the same discussion about PCI as virtual device abstraction
> recently when hpa made the suggestions to get a set of PCI device
> numbers registered for Linux.
>
> IIRC, the conclusion to which we came was that it is indeed helpful
> for most architecture to have a PCI device as one way to probe for
> the functionality, but not to rely on it. s390 is the obvious
> example where you can't have PCI, but you may also want to build
> a guest kernel without PCI support because of space constraints
> in a many-guests machine.
>
> What I think would be ideal is to have a new bus type in Linux
> that does not have any dependency on PCI itself, but can be
> easily implemented as a child of a PCI device.
>
> If we only need the stuff mentioned by Anthony, the interface could
> look like
>
> struct vmchannel_device {
> struct resource virt_mem;
> struct vm_device_id id;
> int irq;
> int (*signal)(struct vmchannel_device *);
> int (*irq_ack)(struct vmchannel_device *);
> struct device dev;
> };
>
> Such a device can easily be provided as a child of a PCI device,
> or as something that is purely virtual based on an hcall interface.
>

Yes, this is close to what I was thinking. I'm not sure that this
particular interface can encompass the variety of memory sharing
mechanisms though.

When I mentioned shared memory via the PCI device, I was referring to
the memory needed for boot strapping the device. You still need a
mechanism to transfer memory for things like zero-copy disk IO and
network devices. This may involve passing memory addresses directly,
copying data, or page flipping.

This leads me to think that a higher level interface that provided a
data passing interface would be more useful. Something like:

struct vmchannel_device {
struct vm_device_id id;
int (*open)(struct vmchannel_device *, const char *name, const char
*service)
int (*release)(struct vmchannel_device *);
ssize_t (*sendmsg)(struct vmchannel_device *, const void *, size_t);
ssize_t (*recvmsg)(struct vmchannel_device *, void *, size_t);
struct device dev;
};

The consuming interface of this would be a socket (PF_VIRTLINK). The
sockaddr would contain a name identifying a VM and a service description.

This doesn't address the memory issues I raised above but I think it
would be easier to special case the drivers where it mattered. For
instance, on x86 KVM, a PV disk driver front end would consist of
connecting to a virtlink socket, and then transferring struct bio's.
QEMU instances would listen on the virtlink socket in the host, and
service them directly (QEMU can access all of the guests memory directly
in userspace).

A PV graphics device could just be a VNC server that listened on a
virtlink socket.

Regards,

Anthony Liguori

> Arnd <><
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-21 23:09:55 UTC
Permalink
OK, so what are we doing here? We're using a PCI abstraction, as a
common abstraction,which is not common really, because we don't have a
common abstraction? So we describe all these non-pci resources with a
pci abstraction?

I don't get it at all. I really think the resource interface idea I
mentioned, which is borrowed from Plan 9, makes a whole lot more
sense. IBM Austin has already shown it in practice in the papers I
referenced. It can work. A memory channel at the bottom, with a
resource sharing protocol (9p) above it, and then you describe your
resources via names and a simple file-directory model. Note that PCI
sort of tries to do this tree model, but it's all binary, and, as
noted, it's hardly universal.

All of this is trivially exported over a network, so the use of shared
memory channels in no way rules out network access. Plan 9 exports
devices over the network routinely.

If you're using a PCI abstraction, something has gone badly wrong I think.

thanks

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-22 00:29:06 UTC
Permalink
ron minnich wrote:
> OK, so what are we doing here? We're using a PCI abstraction, as a
> common abstraction,which is not common really, because we don't have a
> common abstraction? So we describe all these non-pci resources with a
> pci abstraction?
>

No. You're confusing PV device discovery with the actual paravirtual
transport. In a fully virtual environment like KVM, a PCI bus is
present. You need some way for the guest to detect that a PV device is
present. The most natural way to do this IMHO is to have an entry for
the PV device in the PCI bus. That will make a lot of existing code happy.

Once you've identified that the device exists, you're free to do
whatever you want with it.

Regards,

Anthony Liguori



> I don't get it at all. I really think the resource interface idea I
> mentioned, which is borrowed from Plan 9, makes a whole lot more
> sense. IBM Austin has already shown it in practice in the papers I
> referenced. It can work. A memory channel at the bottom, with a
> resource sharing protocol (9p) above it, and then you describe your
> resources via names and a simple file-directory model. Note that PCI
> sort of tries to do this tree model, but it's all binary, and, as
> noted, it's hardly universal.
>
> All of this is trivially exported over a network, so the use of shared
> memory channels in no way rules out network access. Plan 9 exports
> devices over the network routinely.
>
> If you're using a PCI abstraction, something has gone badly wrong I think.
>
> thanks
>
> ron
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-22 00:45:47 UTC
Permalink
On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+***@public.gmane.org> wrote:
> ron minnich wrote:
> > OK, so what are we doing here? We're using a PCI abstraction, as a
> > common abstraction,which is not common really, because we don't have a
> > common abstraction? So we describe all these non-pci resources with a
> > pci abstraction?
> >
>
> No. You're confusing PV device discovery with the actual paravirtual
> transport. In a fully virtual environment like KVM, a PCI bus is
> present. You need some way for the guest to detect that a PV device is
> present. The most natural way to do this IMHO is to have an entry for
> the PV device in the PCI bus. That will make a lot of existing code happy.
>

I don't think I am confusing it, now that you've explained it more
fully. I'm even less happy with it :-)

How will I explain this sort of thing to my grandchildren? :-)
"grandpop, why do those PV devices look like a bus defined in 1994?"

Why would you not have, e.g., a 9p server for PV device "config space"
as well? I actually implemented that on Xen -- it was quite trivial,
and it makes more sense -- to me anyway -- than pretending a PV device
is something it's not.

What it happening, it seems to me, is that people are still trying to
use an abstraction -- "PCI device" -- which is not really an
abstraction, to model aspects of PV device discovery, enumeration,
configuration and operation. I'm still pretty uncomfortable with it --
well, honestly, it seems kind of gross to me. It's just as easy to
build the right abstraction underneath all this, and then, for those
OSes that have existing code that needs to be happy, present that
abstraction as a PCI bus. But making the PCI bus the underlying
abstraction is getting the order inverted,I believe.

I realize that PCI device space is a pretty handy way to do this, that
it is very convenient. I wonder what happens when you get a system
without enough "holes" in the config space for you to hide the PV
devices in, or that has some other weird property that breaks this
model. I've already worked with one system that had 32 PCI busses.

There are other hypervisors that made convenient choices over the
right choice, and they are paying for it. Let's try to avoid that on
kvm. Kvm has so much going for it right now.

thanks

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-22 01:13:45 UTC
Permalink
ron minnich wrote:
> On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+***@public.gmane.org> wrote:
>> No. You're confusing PV device discovery with the actual paravirtual
>> transport. In a fully virtual environment like KVM, a PCI bus is
>> present. You need some way for the guest to detect that a PV device is
>> present. The most natural way to do this IMHO is to have an entry for
>> the PV device in the PCI bus. That will make a lot of existing code
>> happy.
>>
>
> I don't think I am confusing it, now that you've explained it more
> fully. I'm even less happy with it :-)

Sometimes I think the best way to make you happy is to just stop talking :-)

> How will I explain this sort of thing to my grandchildren? :-)
> "grandpop, why do those PV devices look like a bus defined in 1994?"
>
> Why would you not have, e.g., a 9p server for PV device "config space"
> as well? I actually implemented that on Xen -- it was quite trivial,
> and it makes more sense -- to me anyway -- than pretending a PV device
> is something it's not.
>
> What it happening, it seems to me, is that people are still trying to
> use an abstraction -- "PCI device" -- which is not really an
> abstraction, to model aspects of PV device discovery, enumeration,
> configuration and operation. I'm still pretty uncomfortable with it --
> well, honestly, it seems kind of gross to me. It's just as easy to
> build the right abstraction underneath all this, and then, for those
> OSes that have existing code that needs to be happy, present that
> abstraction as a PCI bus. But making the PCI bus the underlying
> abstraction is getting the order inverted,I believe.

Okay. The first problem here is that you're assuming that I'm
suggesting that this who thing mandate a PCI bus. I'm not. I'm merely
saying that one possible way to implement this is by using a PCI bus to
discover the existing of a VIRTLINK socket. Clearly, the s390 guys
would have to use something else.

For PV Xen where there is no PCI bus, XenBus would be used. So very
concretely, there are three separate classes of problems:

1) How to determine that a VM can use virtlink sockets
2) How to enumerate paravirtual devices
3) The various PV protocols for each device

Whatever Linux implements, it has to allow multiple implementations for
#1. For x86 VMs, PCI is just the easiest thing to do here. You could
do hypercalls but it gets messy on different hypervisors (vmcall with 0
in eax may do something funky in Xen but be the probing hypercall on KVM).

For #2, I'm not really proposing anything concrete. One possibility is
to allow virtlink sockets to be addressed with a "service" and to use
that. That doesn't allow for enumeration though so it may not be perfect.

I'm not proposing anything at all for #3. That's outside the scope of
this discussion in my mind.

Now, once you have a virtlink socket, could you use p9 to implement #2
and #3? Sounds like something you could write a paper about :-) But
that's later argument. Right now, I'm just focused on solving the boot
strap issue.

Hope this clarifies things a bit.

Regards,

Anthony Liguori

> I realize that PCI device space is a pretty handy way to do this, that
> it is very convenient. I wonder what happens when you get a system
> without enough "holes" in the config space for you to hide the PV
> devices in, or that has some other weird property that breaks this
> model. I've already worked with one system that had 32 PCI busses.
>
> There are other hypervisors that made convenient choices over the
> right choice, and they are paying for it. Let's try to avoid that on
> kvm. Kvm has so much going for it right now.
>
> thanks
>
> ron
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Eric Van Hensbergen
2007-05-22 01:34:16 UTC
Permalink
On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+***@public.gmane.org> wrote:
> ron minnich wrote:
> > OK, so what are we doing here? We're using a PCI abstraction, as a
> > common abstraction,which is not common really, because we don't have a
> > common abstraction? So we describe all these non-pci resources with a
> > pci abstraction?
> >
>
> No. You're confusing PV device discovery with the actual paravirtual
> transport.

In a PV environment why not just pass an initial cookie/hash/whatever
as a command-line argument/register/memory-space to the underlying
kernel? The presence of such a kernel argument would suggest the
existence of a hypercall interface or other such mechanism to "attach"
to the initial transport(s). Command-line arguments may be a bit too
linux-centric to Ron's taste, but if we are going to chose something
arbitrary like PCI, I'd prefer we chose something a bit more
straightforward to interact with instead of doing crazy ritual dances
to extract what should be straightforward information. I really don't
want to have integrate PCI parsing into my testOS/libOS kernels.

-eric

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-22 01:42:17 UTC
Permalink
Eric Van Hensbergen wrote:
> On 5/21/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+***@public.gmane.org> wrote:
>> ron minnich wrote:
>> > OK, so what are we doing here? We're using a PCI abstraction, as a
>> > common abstraction,which is not common really, because we don't have a
>> > common abstraction? So we describe all these non-pci resources with a
>> > pci abstraction?
>> >
>>
>> No. You're confusing PV device discovery with the actual paravirtual
>> transport.
>
> In a PV environment why not just pass an initial cookie/hash/whatever
> as a command-line argument/register/memory-space to the underlying
> kernel?

You can't pass a command line argument to Windows (at least, not easily
AFAIK). You could get away with an MSR/CPUID flag but then you're
relying on uniqueness which isn't guaranteed.

> The presence of such a kernel argument would suggest the
> existence of a hypercall interface or other such mechanism to "attach"
> to the initial transport(s). Command-line arguments may be a bit too
> linux-centric to Ron's taste, but if we are going to chose something
> arbitrary like PCI, I'd prefer we chose something a bit more
> straightforward to interact with instead of doing crazy ritual dances
> to extract what should be straightforward information. I really don't
> want to have integrate PCI parsing into my testOS/libOS kernels.

You could just hard code a PIC interrupt and rely on some static memory
address for IO and avoid the PCI bus entirely. The whole point of the
PCI bus is to avoid hardcoding this sort of things but if you don't want
the complexity associated with PCI, then using the "older" mechanisms
seems like the obvious thing to do.

Regards,

Anthony Liguori

> -eric
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-22 05:17:13 UTC
Permalink
Anthony Liguori wrote:
>>
>> In a PV environment why not just pass an initial cookie/hash/whatever
>> as a command-line argument/register/memory-space to the underlying
>> kernel?
>>
>
> You can't pass a command line argument to Windows (at least, not easily
> AFAIK). You could get away with an MSR/CPUID flag but then you're
> relying on uniqueness which isn't guaranteed.
>


In the general case, you can't pass a command line argument to Linux
either. kvm doesn't boot Linux; it boots the bios, which boots the boot
sector, which boots grub, which boots Linux. Relying on the user to
edit the command line in grub is wrong.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Eric Van Hensbergen
2007-05-22 12:49:51 UTC
Permalink
On 5/22/07, Avi Kivity <avi-atKUWr5tajBWk0Htik3J/***@public.gmane.org> wrote:
> Anthony Liguori wrote:
> >>
> >> In a PV environment why not just pass an initial cookie/hash/whatever
> >> as a command-line argument/register/memory-space to the underlying
> >> kernel?
> >>
> >
> > You can't pass a command line argument to Windows (at least, not easily
> > AFAIK). You could get away with an MSR/CPUID flag but then you're
> > relying on uniqueness which isn't guaranteed.
> >
>
> In the general case, you can't pass a command line argument to Linux
> either. kvm doesn't boot Linux; it boots the bios, which boots the boot
> sector, which boots grub, which boots Linux. Relying on the user to
> edit the command line in grub is wrong.
>

I didn't think we were talking about the general case, I thought we
were discussing the PV case. In the PV case, having bios/bootloader
is unnecessary overhead. To that same end, I don't see Windows in the
PV case unless they magically want to to coordinate PV standards with
us, in which case we certainly can negotiate a more sane discovery
mechanism.

-eric

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christoph Hellwig
2007-05-22 12:56:55 UTC
Permalink
On Tue, May 22, 2007 at 07:49:51AM -0500, Eric Van Hensbergen wrote:
> > In the general case, you can't pass a command line argument to Linux
> > either. kvm doesn't boot Linux; it boots the bios, which boots the boot
> > sector, which boots grub, which boots Linux. Relying on the user to
> > edit the command line in grub is wrong.
> >
>
> I didn't think we were talking about the general case, I thought we
> were discussing the PV case. In the PV case, having bios/bootloader
> is unnecessary overhead. To that same end, I don't see Windows in the
> PV case unless they magically want to to coordinate PV standards with
> us, in which case we certainly can negotiate a more sane discovery
> mechanism.

In case of KVM no one is speaking of pure PV. What people have been
working on is PV accelaration of a vullvirt host, similar to how
s390 is working for decaded. The host emulates the full architecture,
but there are some escape for speedups. Typical escapes would be drivers
for storage or networking because those can no be virtualized very well
on x86-style hardware.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Eric Van Hensbergen
2007-05-22 14:50:38 UTC
Permalink
On 5/22/07, Christoph Hellwig <hch-***@public.gmane.org> wrote:
> >
> > I didn't think we were talking about the general case, I thought we
> > were discussing the PV case.
> >
>
> In case of KVM no one is speaking of pure PV.
>

Why not? It seems worthwhile to come up with something that can cover
the whole spectrum instead of having different hypervisors (and
interfaces).

Maybe my view is skewed because I don't care to run windows.

-eric

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-22 15:05:18 UTC
Permalink
Eric Van Hensbergen wrote:
> On 5/22/07, Christoph Hellwig <hch-***@public.gmane.org> wrote:
>
>>> I didn't think we were talking about the general case, I thought we
>>> were discussing the PV case.
>>>
>>>
>> In case of KVM no one is speaking of pure PV.
>>
>>
>
> Why not? It seems worthwhile to come up with something that can cover
> the whole spectrum instead of having different hypervisors (and
> interfaces).
>

Because in a few years, almost everyone will have hardware capable of
doing full virtualization so why bother with pure PV.

> Maybe my view is skewed because I don't care to run windows.
>

It's not just windows. There are a lot of people who want to use
virtualization to run RHEL2 or even RH9. Backporting PV to these
kernels is a huge effort.

Regards,

Anthony Liguori

> -eric
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-22 15:31:48 UTC
Permalink
On 5/22/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+***@public.gmane.org> wrote:
> Eric Van Hensbergen wrote:
> > On 5/22/07, Christoph Hellwig <hch-***@public.gmane.org> wrote:
> >
> >>> I didn't think we were talking about the general case, I thought we
> >>> were discussing the PV case.
> >>>
> >>>
> >> In case of KVM no one is speaking of pure PV.
> >>
> >>
> >
> > Why not? It seems worthwhile to come up with something that can cover
> > the whole spectrum instead of having different hypervisors (and
> > interfaces).
> >
>
> Because in a few years, almost everyone will have hardware capable of
> doing full virtualization so why bother with pure PV.

I don't know, we could shoot for a clean, simple interface that makes
PV easy to integrate into any kernel. Pick a common underlying
abstraction for all resources.
Define a simple, efficient memory channel for the comms. Lay 9p over
it. Then take it from there for each device.

I agree, from the way (e.g.) the Xen devices work, PV is a pain. But
it need not be that way.

I think from the Plan 9 side we're happy to run full PV. But we're 0%
of the world, so that may bias our importance a bit :-)

thanks

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Eric Van Hensbergen
2007-05-22 16:25:34 UTC
Permalink
On 5/22/07, Anthony Liguori <anthony-rdkfGonbjUSkNkDKm+***@public.gmane.org> wrote:
> Eric Van Hensbergen wrote:
> > On 5/22/07, Christoph Hellwig <hch-***@public.gmane.org> wrote:
> >>>
> >>>
> >> In case of KVM no one is speaking of pure PV.
> >>
> >>
> >
> > Why not? It seems worthwhile to come up with something that can cover
> > the whole spectrum instead of having different hypervisors (and
> > interfaces).
> >
>
> Because in a few years, almost everyone will have hardware capable of
> doing full virtualization so why bother with pure PV.
>

No matter what the capabilities, full device emulation is always going
to be wasteful. Just because I have the hardware to run Vista,
doesn't mean I should run Vista.

> > Maybe my view is skewed because I don't care to run windows.
> >
>
> It's not just windows. There are a lot of people who want to use
> virtualization to run RHEL2 or even RH9. Backporting PV to these
> kernels is a huge effort.
>

I'm not opposed to supporting emulation environments, just don't make
a large pile of crap the default like Xen -- and having to integrate
PCI probing code in my guest domains is a large pile of crap.

-eric

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-22 17:00:42 UTC
Permalink
On 5/22/07, Eric Van Hensbergen <ericvh-***@public.gmane.org> wrote:

> I'm not opposed to supporting emulation environments, just don't make
> a large pile of crap the default like Xen -- and having to integrate
> PCI probing code in my guest domains is a large pile of crap.

Exactly. I'm about to start a pretty large project here, using xen or
kvm, not sure. One thing for sure, we are NOT going to use anything
but PV devices. Full emulation is nice, but it's just plain silly if
you don't have to do it. And we don't have to do it. So let's get the
PV devices right, not try to shoehorn them into some framework like
PCI.

What happens to these schemes if I want to try, e.g., 2^16 PV devices?
Or some other crazy thing that doesn't play well with PCI -- simple
example -- I want a 256 GB region of memory for a device. PCI rules
require me to align it on 256GB boundaries and it must be contiguous
address space. This is a hardware rule, done for hardware reasons, and
has no place in the PV world. What if I want a bit more than the basic
set of BARs that PCI gives me? Why would we apply such rules to a PV?
Why limit ourselves this early in the game?

thanks

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christoph Hellwig
2007-05-22 17:06:28 UTC
Permalink
On Tue, May 22, 2007 at 10:00:42AM -0700, ron minnich wrote:
> On 5/22/07, Eric Van Hensbergen <ericvh-***@public.gmane.org> wrote:
>
> >I'm not opposed to supporting emulation environments, just don't make
> >a large pile of crap the default like Xen -- and having to integrate
> >PCI probing code in my guest domains is a large pile of crap.
>
> Exactly. I'm about to start a pretty large project here, using xen or
> kvm, not sure. One thing for sure, we are NOT going to use anything
> but PV devices. Full emulation is nice, but it's just plain silly if
> you don't have to do it. And we don't have to do it. So let's get the
> PV devices right, not try to shoehorn them into some framework like
> PCI.

If you don't care about full virtualization kvm is the wrong project for
you. You might want to take a look at lguest.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-22 17:34:41 UTC
Permalink
On 5/22/07, Christoph Hellwig <hch-***@public.gmane.org> wrote:

> If you don't care about full virtualization kvm is the wrong project for
> you. You might want to take a look at lguest.

Ah, I had not realized that KVM was purely a full-virt environment
with no real use for PV-only users. I'll move on. Thanks for the tip!

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Dor Laor
2007-05-22 20:03:59 UTC
Permalink
>> If you don't care about full virtualization kvm is the wrong project
for
>> you. You might want to take a look at lguest.
>
>Ah, I had not realized that KVM was purely a full-virt environment
>with no real use for PV-only users. I'll move on. Thanks for the tip!
>ron

Don't quit so soon on us.
KVM has already PV kernel capabilities (in Ingo Molnar's tree) and has
network and block PV drivers.

We do plan on supporting/improving the PV kernel capabilities. The near
future change is direct guest paging.
Although all new x86 cpus now ship with hardware support, software PV
can always find spots for acceleration.

Regarding PV drivers, our initial approach was try not to invent the
wheel and implement the PV discovery using pci. For full-virt OSs,
especially windows it was simpler. Now that more platforms might be kvm
based, I agree we should switch to a generic solution.
Dor.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-22 20:10:18 UTC
Permalink
On 5/22/07, Dor Laor <dor.laor-atKUWr5tajBWk0Htik3J/***@public.gmane.org> wrote:

> Don't quit so soon on us.

OK. I'll go look at Ingo's stuff.

Thanks again

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Nakajima, Jun
2007-05-22 22:56:34 UTC
Permalink
Dor Laor wrote:
> > > If you don't care about full virtualization kvm is the wrong
project for
> > > you. You might want to take a look at lguest.
> >
> > Ah, I had not realized that KVM was purely a full-virt environment
> > with no real use for PV-only users. I'll move on. Thanks for the
tip!
> > ron
>
> Don't quit so soon on us.
> KVM has already PV kernel capabilities (in Ingo Molnar's tree) and has
> network and block PV drivers.
>
> We do plan on supporting/improving the PV kernel capabilities. The
near
> future change is direct guest paging.
> Although all new x86 cpus now ship with hardware support, software PV
> can always find spots for acceleration.

BTW, I'm presenting this at OLS:
http://www.linuxsymposium.org/2007/view_abstract.php?content_key=192

This uses direct paging mode today.

>
> Regarding PV drivers, our initial approach was try not to invent the
> wheel and implement the PV discovery using pci. For full-virt OSs,
> especially windows it was simpler. Now that more platforms might be
kvm
> based, I agree we should switch to a generic solution.
> Dor.
>

Jun
---
Intel Open Source Technology Center

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-23 08:15:03 UTC
Permalink
I have been closely following thisvery interresting discussion. Here's
my summary:
- PV capabilities is something we'll want
- being able to surface virtual devices to the guest as PCI is
preferable to Windows
- we need an additional way to surface virtual devices to the guest.
We don't have PCI on s390, and Ron doesn't want PCI in his guests.
- complex interfaces are a mess to implement and maintain in different
hypervisors and guest operating systems, we need a simple and clear
structure like plan9 has today

To me, it looks like we need a virtual device abstraction both in the
guest kernel and in the kvm/qemu. This abstraction needs to be simple
and fast, and needs to be representable as PCI device and in a simpler
way. PCI obstacles are supposed to be transparent to the virutal device.
For me, plan9 does provide answers to a lot of above requirements.
However, it does not provide capabilities for shared memory and it
adds extra complexity. It's been designed to solve a different problem.

I think the virtual device abstraction should provide the following
functionality:
- hypercall guest to host with parameters and return value
- interrupt from host to guest with parameters
- thin interrupt from host to guest, no parameters
- shared memory between guest and host
- dma access to guest memory, possibly via kmap on the host
- copy from/to guest memory

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-23 12:25:41 UTC
Permalink
Carsten Otte wrote:
> I have been closely following thisvery interresting discussion. Here's
> my summary:
> - PV capabilities is something we'll want
> - being able to surface virtual devices to the guest as PCI is
> preferable to Windows
> - we need an additional way to surface virtual devices to the guest.
> We don't have PCI on s390, and Ron doesn't want PCI in his guests.
> - complex interfaces are a mess to implement and maintain in different
> hypervisors and guest operating systems, we need a simple and clear
> structure like plan9 has today
>
> To me, it looks like we need a virtual device abstraction both in the
> guest kernel and in the kvm/qemu. This abstraction needs to be simple
> and fast, and needs to be representable as PCI device and in a simpler
> way. PCI obstacles are supposed to be transparent to the virutal device.
> For me, plan9 does provide answers to a lot of above requirements.
> However, it does not provide capabilities for shared memory and it
> adds extra complexity. It's been designed to solve a different problem.
>
> I think the virtual device abstraction should provide the following
> functionality:
> - hypercall guest to host with parameters and return value
> - interrupt from host to guest with parameters
> - thin interrupt from host to guest, no parameters
> - shared memory between guest and host
> - dma access to guest memory, possibly via kmap on the host
> - copy from/to guest memory
>
>

I agree with all of the above. In addition, it would be nice if we can
share this interface with other hypervisors. Unfortunately Xen is
riding the XenBus, but maybe we can share the interface with lguest and VMI.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-23 12:21:42 UTC
Permalink
Nakajima, Jun wrote:
> BTW, I'm presenting this at OLS:
> http://www.linuxsymposium.org/2007/view_abstract.php?content_key=192
>
> This uses direct paging mode today.
>

Are there patches available anywhere?

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-23 12:16:50 UTC
Permalink
Christoph Hellwig wrote:
> On Tue, May 22, 2007 at 10:00:42AM -0700, ron minnich wrote:
>
>> On 5/22/07, Eric Van Hensbergen <ericvh-***@public.gmane.org> wrote:
>>
>>
>>> I'm not opposed to supporting emulation environments, just don't make
>>> a large pile of crap the default like Xen -- and having to integrate
>>> PCI probing code in my guest domains is a large pile of crap.
>>>
>> Exactly. I'm about to start a pretty large project here, using xen or
>> kvm, not sure. One thing for sure, we are NOT going to use anything
>> but PV devices. Full emulation is nice, but it's just plain silly if
>> you don't have to do it. And we don't have to do it. So let's get the
>> PV devices right, not try to shoehorn them into some framework like
>> PCI.
>>
>
> If you don't care about full virtualization kvm is the wrong project for
> you. You might want to take a look at lguest.
>
>

This is incorrect. While kvm started out as a full virtualization
project, it will expand with I/O PV and core PV. Eventually most of the
paravirt_ops interface will have a kvm implementation.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christoph Hellwig
2007-05-23 12:20:02 UTC
Permalink
On Wed, May 23, 2007 at 03:16:50PM +0300, Avi Kivity wrote:
> Christoph Hellwig wrote:
> > On Tue, May 22, 2007 at 10:00:42AM -0700, ron minnich wrote:
> >
> >> On 5/22/07, Eric Van Hensbergen <ericvh-***@public.gmane.org> wrote:
> >>
> >>
> >>> I'm not opposed to supporting emulation environments, just don't make
> >>> a large pile of crap the default like Xen -- and having to integrate
> >>> PCI probing code in my guest domains is a large pile of crap.
> >>>
> >> Exactly. I'm about to start a pretty large project here, using xen or
> >> kvm, not sure. One thing for sure, we are NOT going to use anything
> >> but PV devices. Full emulation is nice, but it's just plain silly if
> >> you don't have to do it. And we don't have to do it. So let's get the
> >> PV devices right, not try to shoehorn them into some framework like
> >> PCI.
> >>
> >
> > If you don't care about full virtualization kvm is the wrong project for
> > you. You might want to take a look at lguest.
> >
> >
>
> This is incorrect. While kvm started out as a full virtualization
> project, it will expand with I/O PV and core PV. Eventually most of the
> paravirt_ops interface will have a kvm implementation.

The statement above was a little misworded I think. It should have
been a "if you care about pure PV ..."


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-23 12:20:25 UTC
Permalink
ron minnich wrote:
> On 5/22/07, Eric Van Hensbergen <ericvh-***@public.gmane.org> wrote:
>
>
>> I'm not opposed to supporting emulation environments, just don't make
>> a large pile of crap the default like Xen -- and having to integrate
>> PCI probing code in my guest domains is a large pile of crap.
>>
>
> Exactly. I'm about to start a pretty large project here, using xen or
> kvm, not sure. One thing for sure, we are NOT going to use anything
> but PV devices. Full emulation is nice, but it's just plain silly if
> you don't have to do it. And we don't have to do it. So let's get the
> PV devices right, not try to shoehorn them into some framework like
> PCI.
>
> What happens to these schemes if I want to try, e.g., 2^16 PV devices?
> Or some other crazy thing that doesn't play well with PCI -- simple
> example -- I want a 256 GB region of memory for a device. PCI rules
> require me to align it on 256GB boundaries and it must be contiguous
> address space. This is a hardware rule, done for hardware reasons, and
> has no place in the PV world. What if I want a bit more than the basic
> set of BARs that PCI gives me? Why would we apply such rules to a PV?
> Why limit ourselves this early in the game?
>
>

Device discovery and device operation are separate. Closed operating
systems and older Linuces will need pci as a way to have easy
plug'n'play discovery with no modifications to the kernel.
Virtualization-friendly systems like newer Linux and s390 can have a
virtual bus for discovery.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-23 11:55:13 UTC
Permalink
Eric Van Hensbergen wrote:
> On 5/22/07, Christoph Hellwig <hch-***@public.gmane.org> wrote:
>> >
>> > I didn't think we were talking about the general case, I thought we
>> > were discussing the PV case.
>> >
>>
>> In case of KVM no one is speaking of pure PV.
>>
>
> Why not? It seems worthwhile to come up with something that can cover
> the whole spectrum instead of having different hypervisors (and
> interfaces).
>

That's the plan. PV I/O and PV mmu are on the roadmap. PV timers and
interrupts should be easily doable too. The far end of the spectrum (PV
with no hardware virtualization extensions) is possible, but no one is
planning to do it AFAIK.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-22 13:08:43 UTC
Permalink
Eric Van Hensbergen wrote:
> On 5/22/07, Avi Kivity <avi-atKUWr5tajBWk0Htik3J/***@public.gmane.org> wrote:
>> Anthony Liguori wrote:
>> >>
>> >> In a PV environment why not just pass an initial cookie/hash/whatever
>> >> as a command-line argument/register/memory-space to the underlying
>> >> kernel?
>> >>
>> >
>> > You can't pass a command line argument to Windows (at least, not
>> easily
>> > AFAIK). You could get away with an MSR/CPUID flag but then you're
>> > relying on uniqueness which isn't guaranteed.
>> >
>>
>> In the general case, you can't pass a command line argument to Linux
>> either. kvm doesn't boot Linux; it boots the bios, which boots the boot
>> sector, which boots grub, which boots Linux. Relying on the user to
>> edit the command line in grub is wrong.
>>
>
> I didn't think we were talking about the general case, I thought we
> were discussing the PV case.

It is still useful to use PV drivers with full virtualization so it's
something that ought to be considered.

Regards,

Anthony Liguori

> In the PV case, having bios/bootloader
> is unnecessary overhead. To that same end, I don't see Windows in the
> PV case unless they magically want to to coordinate PV standards with
> us, in which case we certainly can negotiate a more sane discovery
> mechanism.
>
> -eric
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-18 05:31:20 UTC
Permalink
On 5/16/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
> What do you think about a socket interface? I'm not sure how discovery
> would work yet, but there are a few PV socket implementations for Xen at
> the moment.

Hi Anthony,

I still feel that "how about a socket interface" is still focused on
the "how to implement", and not "what the interface should be". I also
am not sure the socket system call interface is quite what we want,
although it's a neat idea. It's also not that portable outside the
"everything is a Linux variant" world.

So how about this as an interface design. The communications channels
are visible in our name space at a mountpoint of our choice. Let's
call this mount point, for sake of argument, vmic.

When we mount on vmic, we see one file:
/vmic/clone

When we open and read /vmic/clone, we get a number, let's pretend for
this example we get '0'. The numbers are not important, except to
distinguish connections. Opening the clone file gets us a connection
endpoint. Ls of the directory now shows this:
/vmic/clone
/vmic/0

The "directory", and the "files" in it, are owned by me, mode 700 or
600 or 400 as the file requires. The mode can be changed, of course,
if I wish to allow wider access to the channel. Here, already, we see
some advantage to the use of the file system for this type of
capability.

What is in the directory? Here is one proposal.
/vmic/0/data
/vmic/0/status
/vmic/0/ctl
/vmic/0/local
/vmic/0/remote
What can we do with this?
Data is pretty obvious: we can read it or write it, and that data is
received/sent from the other endpoint. Note that I'm not saying how
the data flows: it can be done in whatever manner is most efficient,
by the kernel, including zero copy. It can be different for many
reasons, but the point is that the interface is basically unchanging.
Of course, it is an error to read or write data until something at the
other end connects to the local end!

What is status? We cat it and it gets us status in some meaningful
text string. E.g.:
cat /vmic/0/status
connected /domain/name

What is local? It's our local name for the resource in this domain
What is remote? It's the name of other endpoint.

What's a name look like? I'm thinking it might look like /domain/name,
but that is just a guess ...

What is ctl? here is where the fun begins. We might do things such as
echo bind somename > /vmic/0/ctl
this names the vmic. We might want to wait for a connection:
echo listen 1> /vmic/0/ctl
We might want to restrict it somehow
echo key somekey > /vmic/0/ctl
echo listendomain domainnumber > /vmic/0/ctl
or we might know there is something out there.
echo connect /domainname/somename > /vmic/0/ctl

Once it is connected, we can move data.

This is similar to your socket idea, but consider that:
o to see active vmics, I use 'ls'
o I don't have to create a new sockaddr address type
o I can control access with chmod
o I am seperating the interface from the implementation
o This is, of course, not really 'files', but in-memory data
structures; this can
(and will) be fast
o No binary data structures.
For different domains, even on the same machine, alignment rules etc. are not
always the same -- I hit this when I ported Plan 9 to Xen, esp. back when Xen
relied so heavily on gcc tricks such as __align__ and packed. Using
character strings
eliminates that problem.

This is, I think, the kind of thing Eric would also like to see, but
he can correct me.
Thanks

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-18 14:31:01 UTC
Permalink
ron minnich wrote:
> Hi Anthony,
>
> I still feel that "how about a socket interface" is still focused on
> the "how to implement", and not "what the interface should be".

Right. I'm not trying to answer that question ATM. There are a number
of paravirt devices that would be useful in a virtual setting. For
instance, a PV device for providing the guest with entropy and a shared
PV clipboard. These devices should be simple but all current
communication mechanisms are far too complicated.

> I also
> am not sure the socket system call interface is quite what we want,
> although it's a neat idea. It's also not that portable outside the
> "everything is a Linux variant" world.

A filesystem interface certainly isn't very portable outside the POSIX
world :-)

> Once it is connected, we can move data.
>
> This is similar to your socket idea, but consider that:
> o to see active vmics, I use 'ls'
> o I don't have to create a new sockaddr address type
> o I can control access with chmod
> o I am seperating the interface from the implementation
> o This is, of course, not really 'files', but in-memory data
> structures; this can
> (and will) be fast
> o No binary data structures.
> For different domains, even on the same machine, alignment rules etc.
> are not
> always the same -- I hit this when I ported Plan 9 to Xen, esp. back
> when Xen
> relied so heavily on gcc tricks such as __align__ and packed. Using
> character strings
> eliminates that problem.

The interface you're proposing is almost functionally identical to a
socket. In fact, once you open /data you've got an fd that you interact
with in the same way as you would interact with a socket.

It's not that there's an unique value for this sort of interface in
virtualization; I don't think you're making that argument. Instead,
you're making a general argument as to why this way of doing things is
better than what Unix has been doing forever (with things like
sockets). That's fine, I think you have a valid point, but that's a
larger argument to have on LKML or at a conference. This isn't the
place to shoe-horn this sort of thing.

A socket interface would provide a simple, well-understood interface
that few people in the Linux community would disagree with (it's already
there for s390). It should also be easy enough to stream p9 over the
socket so you can build these interfaces easily and continue your
attempts to expose the world as a virtual filesystem :-)

Regards,

Anthony Liguori

> This is, I think, the kind of thing Eric would also like to see, but
> he can correct me.
> Thanks
>
> ron


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-18 15:14:28 UTC
Permalink
On 5/18/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:
>
> > I also
> > am not sure the socket system call interface is quite what we want,
> > although it's a neat idea. It's also not that portable outside the
> > "everything is a Linux variant" world.
>
> A filesystem interface certainly isn't very portable outside the POSIX
> world :-)

Actually, it probably the most portable thing you can have.

> The interface you're proposing is almost functionally identical to a
> socket. In fact, once you open /data you've got an fd that you interact
> with in the same way as you would interact with a socket.

Well, sure, I stole the interface from Plan 9, and they use this
interface to do sockets, among *many* other things -- and there's the
point. The interface is not just sockets. But if you're used to
sockets, it looks familiar. I only steal from the best :-)

Note, btw, that the fd has a path, and can be examined easily, and
also passed to other programs for use. That's messy and ugly with
sockets.

>
> It's not that there's an unique value for this sort of interface in
> virtualization; I don't think you're making that argument. Instead,
> you're making a general argument as to why this way of doing things is
> better than what Unix has been doing forever (with things like
> sockets)

Yes, Unix has been "doing it this way" forever. The interface I am
proposing was
the one designed by the Unix guys -- once they realized how deficient
the Unix way of doing things had become.

But, forgetting all this argument, it still seems to me that the file
system interface is far simpler than a socket interface. No binary
structures. No new sockaddr structures needed. No alignment/padding
rules. You can actually set up a link from a shell script, or perl, or
python, or whatever, without a special set of bindings.

> A socket interface would provide a simple, well-understood interface
> that few people in the Linux community would disagree with (it's already
> there for s390).

Yes, but ... well understood to the Linux community. Can we look at a
broader scope?

We've got a golden opportunity here to build a really flexible VMIC
interface. I would hate to lose it.

Anyway, thanks for discussing this.

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
ron minnich
2007-05-11 21:51:13 UTC
Permalink
On 5/11/07, Anthony Liguori <aliguori-r/Jw6+rmf7HQT0dZR+***@public.gmane.org> wrote:

> For low speed devices, I think paravirtualization doesn't make a lot of
> sense unless it's absolutely required. I don't know enough about s390
> to know if it supports things like uarts but if so, then emulating a
> uart would in my mind make a lot more sense than a PV console device.

I don't see how. Paravirtualization is pretty trivial for a console. I
think emulating hardware
is always worth avoiding. A PV ocnsole driver is going to much more
flexible than a uart
emulator.

> This smells a bit like XenStore which I think most will agree was an
> unmitigated disaster.

No, not at all. Just because we represent resources as directories and
files, does that imply or require xenstore? Is /proc a xenstore
entity? Is /sys? Not at all. These resources, which are represented
over 9p as files and directories, are simply a representation of
kernel data structures. I think you are jumping ahead too far, because
that's not what I'm talking about.

What I'm trying to propose is that the kvm host use a standard model
for paravirt resources, and, since we've had 20 years of very good
luck on Plan 9 using 9p and a directory/file model for all resources,
including devices, I am hoping we can use that for the way that kvm
communicates with its guests about devices.

Consider /proc. It works. It's not a thing on disk, or a python glob
like xenstore. There are not even really tree-like data structures in
/proc. The proc outputs are generated on demand as programs do
operations on files in the /proc file system.

This idea is similar -- not the same code, or implementation
technique, but similar.

Our proposal (it was Eric's idea, really, and he has in fact shown it
in practice on IBM hypervisors) is that we define a standard memory
channel for comms, as in Eric's paper; we define a standard
request/response protocol to run over that channel, i.e. 9p, again, as
in Eric's paper(s); and then, what you layer over it is up to the
provider of the resource. This gives us one interface, and it can be
efficient.

Again, in this way, we get a common interface to diverse resources.
This is a basic technique in computer science, and I was sorry to see
Xen ignore it. Eric and I tried to get the Xen team to look at this,
but they were too far along with their myriad interfaces, and it was
too late to change. It's not too late for KVM. I am hoping we can use
this model on KVM, before we have a whole pile of totally different
interfaces to different PV devices.

> This sort of thing gets terribly complicated to
> deal with in the corner cases. Atomic operation of multiple read/write
> operations is difficult to express. Moreover, quite a lot of things are
> naturally expressed as a state machine which is not straight forward to
> do in this sort of model. This may have been all figured out in 9P but
> it's certainly not a simple thing to get right.

We have the QED. It's called Plan 9. Then we have the second QED. It's
called Inferno. They are each a reliable, simple, industrial-strength
kernel running in a router near you. I accept it is hard to get right.
I think you'd have to accept that it can, and in fact has, been gotten
right for quite some time -- 20 years in the case of Plan 9.

> I think a general rule of thumb for a virtualized environment is that
> the closer you stick to the way hardware tends to do things,

You mean like level interrupts emulation in Xen? That was easy? Or not
screwed up? It was one of the messiest things I had to deal with in
the Plan 9 port to Xen. And it made no sense, whatsoever, to have a
level interrupt emulation. Except, of course, that the edge interrupts
were even less fun :-)

I believe that PV can buy us a very clean interface if done right.
Emulating hardware is easy for the simple bits, and very hard to get
perfect for the messy bits .Do we really want to emulate a 10G PHY,
for example?

> Implementing a full 9P client just
> to get console access in something like mini-os would be unfortunate.

9p clients are trivial. newsham's 9p python client is a whopping 352
lines, 20 of them comments. A 9p client is far less code than the sum
of the Linux uart code.

> At least the posted s390 console driver behaves roughly like a uart so
> it's pretty obvious that it will be easy to implement in any OS that
> supports uarts already.

Including all the fifo bugs? Because to really emulate hardware, to
match a driver, you have to correctly emulate the *bugs*, not just the
spec. That's where the fun begins.

I think KVM has a great opportunity here to do a better job than Xen
did with devices. So, I'll keep arguing and see if I can convince you
:-)

thanks

ron

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-12 08:46:49 UTC
Permalink
ron minnich wrote:
> Let me ask what may seem to be a naive question to the linux world. I
> see you are doing a lot off solid work on adding block and network
> devices. The code for block and network devices
> is implemented in different ways. I've also seen this difference of
> inerface/implementation on Xen.
Actually, the difference derives from the fact that block and network
are indeed different:
- block submits requests that ask the host to transfer from/to
preallocated guest data buffers via dma (request driven)
- net transmits packets that should end up in an skb on the remote
side (two way, push driven)
- net is sensitive to round-trip times, block is not due to the device
plug for request merging

We tried different access methods for both block and network. We have
selected the current communication mechanics after doing performance
measurements.
I believe for a portable solution we need to develop a set of
primitives for sending signals (read: interrupts) back and forth, for
copying data to guest memory, and for establishing shared memory
between guests and between guest+host. These primitives need to be
implemented for each platform, and paravirtual drivers should build on
top of that.
At this point in time, we are aware that these device drivers don't do
what we'd want for a portable solution. We'll focus on getting the
kernel interfaces to sie/vt/svm proper and portable first.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Dor Laor
2007-05-13 12:04:19 UTC
Permalink
>ron minnich wrote:
>> Let me ask what may seem to be a naive question to the linux world. I
>> see you are doing a lot off solid work on adding block and network
>> devices. The code for block and network devices
>> is implemented in different ways. I've also seen this difference of
>> inerface/implementation on Xen.
>Actually, the difference derives from the fact that block and network
>are indeed different:
>- block submits requests that ask the host to transfer from/to
>preallocated guest data buffers via dma (request driven)
>- net transmits packets that should end up in an skb on the remote
>side (two way, push driven)
>- net is sensitive to round-trip times, block is not due to the device
>plug for request merging
>
>We tried different access methods for both block and network. We have
>selected the current communication mechanics after doing performance
>measurements.
>I believe for a portable solution we need to develop a set of
>primitives for sending signals (read: interrupts) back and forth, for
>copying data to guest memory, and for establishing shared memory
>between guests and between guest+host. These primitives need to be
>implemented for each platform, and paravirtual drivers should build on
>top of that.
>At this point in time, we are aware that these device drivers don't do
>what we'd want for a portable solution. We'll focus on getting the
>kernel interfaces to sie/vt/svm proper and portable first.
>
>so long,
>Carsten

Based on the previous discussion and the s390 PV drivers I have more
gasoline to pour to the flame:

We have a working PV driver with 1Gbit performance. The reasons we don't

push it into the kernel are:
a. We should perform much better
b. It would be a painful task getting all the code review that a

complicated network interface should get.
c. There's already a PV driver that answers a,b.
The Xen's PV network driver is now pushed into the kernel.
It is optimized, and support tso.
By adding a generic ops calls we can make enjoy all the above.

Using Xen's core PV code doesn't imply that we will have their interface
{xenstore} the interface creation and tear-down would be kvm specific.
They could even have a plain directory structure.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-13 14:49:35 UTC
Permalink
Dor Laor wrote:
> push it into the kernel are:
> a. We should perform much better
> b. It would be a painful task getting all the code review that a
>
> complicated network interface should get.
> c. There's already a PV driver that answers a,b.
> The Xen's PV network driver is now pushed into the kernel.
>

Actually, it's not (at least not as of a few moments ago). Furthermore,
the plan is to completely rearchitect the netback/netfront protocol for
the next Xen release (this effort is referred to netchannel2).

See some of the XenSummit slides as to why this is necessary.

Regards,

Anthony Liguori

> It is optimized, and support tso.
> By adding a generic ops calls we can make enjoy all the above.
>
> Using Xen's core PV code doesn't imply that we will have their interface
> {xenstore} the interface creation and tear-down would be kvm specific.
> They could even have a plain directory structure.
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Dor Laor
2007-05-13 16:23:05 UTC
Permalink
>Subject: Re: [kvm-devel] [PATCH/RFC 7/9] Virtual network guest device
>driver
>
>Dor Laor wrote:
>> push it into the kernel are:
>> a. We should perform much better
>> b. It would be a painful task getting all the code review that a
>>
>> complicated network interface should get.
>> c. There's already a PV driver that answers a,b.
>> The Xen's PV network driver is now pushed into the kernel.
>>
>
>Actually, it's not (at least not as of a few moments ago).
Furthermore,
>the plan is to completely rearchitect the netback/netfront protocol for
>the next Xen release (this effort is referred to netchannel2).

But isn't Jeremy Fitzhardinge is pushing big patch queue into the
kernel?
If we manage to plant hooks into the netback/front for using net_ops,
they and the code will get into the kernel they will be have to keep the
hooks for netchannel2.

>
>See some of the XenSummit slides as to why this is necessary.

It's looks like generalizing all the level 0,1,2 features plus
performance optimizations. It's not something we couldn't upgrade to.

>Regards,
>
>Anthony Liguori
>
>> It is optimized, and support tso.
>> By adding a generic ops calls we can make enjoy all the above.
>>
>> Using Xen's core PV code doesn't imply that we will have their
interface
>> {xenstore} the interface creation and tear-down would be kvm
specific.
>> They could even have a plain directory structure.
>>
>>
------------------------------------------------------------------------
-
>> This SF.net email is sponsored by DB2 Express
>> Download DB2 Express C - the FREE version of DB2 express and take
>> control of your XML. No limits. Just data. Click to get it now.
>> http://sourceforge.net/powerbar/db2/
>> _______________________________________________
>> kvm-devel mailing list
>> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
>> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>>
>>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-13 16:49:14 UTC
Permalink
Dor Laor wrote:
> Furthermore,
>
>> the plan is to completely rearchitect the netback/netfront protocol for
>> the next Xen release (this effort is referred to netchannel2).
>>
>
> But isn't Jeremy Fitzhardinge is pushing big patch queue into the
> kernel?
>

Yes, but it's not in the kernel yet and there's no guarantee it'll get
there in time for KVM's consumption.

> If we manage to plant hooks into the netback/front for using net_ops,
> they and the code will get into the kernel they will be have to keep the
> hooks for netchannel2.
>
>
>> See some of the XenSummit slides as to why this is necessary.
>>
>
> It's looks like generalizing all the level 0,1,2 features plus
> performance optimizations. It's not something we couldn't upgrade to.
>

I'm curious what Rusty thinks as I do not know nearly enough about the
networking subsystem to make an educated statement here. Would it be
better to just try and generalize netback/netfront or build something
from scratch? Could the lguest driver be generalized more easily?

Regards,

Anthony LIguori

>> Regards,
>>
>> Anthony Liguori
>>
>>
>>> It is optimized, and support tso.
>>> By adding a generic ops calls we can make enjoy all the above.
>>>
>>> Using Xen's core PV code doesn't imply that we will have their
>>>
> interface
>
>>> {xenstore} the interface creation and tear-down would be kvm
>>>
> specific.
>
>>> They could even have a plain directory structure.
>>>
>>>
>>>
> ------------------------------------------------------------------------
> -
>
>>> This SF.net email is sponsored by DB2 Express
>>> Download DB2 Express C - the FREE version of DB2 express and take
>>> control of your XML. No limits. Just data. Click to get it now.
>>> http://sourceforge.net/powerbar/db2/
>>> _______________________________________________
>>> kvm-devel mailing list
>>> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
>>> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>>>
>>>
>>>
>
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Muli Ben-Yehuda
2007-05-13 17:06:08 UTC
Permalink
On Sun, May 13, 2007 at 11:49:14AM -0500, Anthony Liguori wrote:
> Dor Laor wrote:
> > Furthermore,
> >
> >> the plan is to completely rearchitect the netback/netfront
> >> protocol for the next Xen release (this effort is referred to
> >> netchannel2).
> >>
> >
> > But isn't Jeremy Fitzhardinge is pushing big patch queue into the
> > kernel?
> >
>
> Yes, but it's not in the kernel yet and there's no guarantee it'll
> get there in time for KVM's consumption.

On the other hand, there's strong interest in having unified virtual
drivers. Given that the Xen drivers are out there, have been submitted
and have been reasonably optimized, there will be some resistance to
putting in "yet another" set of PV drivers. Also, the contentious
merge point as I understand it is xenbus needing review, rather than
the drivers themselves which are in pretty good shape.

Cheers,
Muli

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Dor Laor
2007-05-13 20:31:32 UTC
Permalink
Subject: Re: [kvm-devel] [PATCH/RFC 7/9] Virtual network guest device
>driver
>
>On Sun, May 13, 2007 at 11:49:14AM -0500, Anthony Liguori wrote:
>> Dor Laor wrote:
>> > Furthermore,
>> >
>> >> the plan is to completely rearchitect the netback/netfront
>> >> protocol for the next Xen release (this effort is referred to
>> >> netchannel2).
>> >>
>> >
>> > But isn't Jeremy Fitzhardinge is pushing big patch queue into the
>> > kernel?
>> >
>>
>> Yes, but it's not in the kernel yet and there's no guarantee it'll
>> get there in time for KVM's consumption.
>
>On the other hand, there's strong interest in having unified virtual
>drivers. Given that the Xen drivers are out there, have been submitted
>and have been reasonably optimized, there will be some resistance to
>putting in "yet another" set of PV drivers. Also, the contentious
>merge point as I understand it is xenbus needing review, rather than
>the drivers themselves which are in pretty good shape.

Moreover, it's not that it is too complex to write set of back/front
ends, it just it's already written and optimized down to the bit.
Our current implementation has all the regular bells and whistles
(rings, delayed notifications, napi) it is simper than Xen's but it
lacks further optimizations and tso/scatter gather.
If we even use the NetChannel2 we should enjoy from smart NIC features
too.
It's more tempting and fun to continue to support our implementation but
it's righter to reuse code.
Nevertheless, we'll be happy to hear and discuss what others are
thinking.

If the current Xen code fail to hit the kernel, then it would be even
easier for us - we'll just rip off all the Xen wrapping, the grant
tables and the flipping would go away leaving clean optimized network
code.
Regards,
Dor.

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Rusty Russell
2007-05-14 02:39:15 UTC
Permalink
On Sun, 2007-05-13 at 11:49 -0500, Anthony Liguori wrote:
> Dor Laor wrote:
> > Furthermore,
> >
> >> the plan is to completely rearchitect the netback/netfront protocol for
> >> the next Xen release (this effort is referred to netchannel2).
> > It's looks like generalizing all the level 0,1,2 features plus
> > performance optimizations. It's not something we couldn't upgrade to.
>
> I'm curious what Rusty thinks as I do not know nearly enough about the
> networking subsystem to make an educated statement here. Would it be
> better to just try and generalize netback/netfront or build something
> from scratch? Could the lguest driver be generalized more easily?

In turn, I'm curious as to Herbert's opinions on this.

The lguest netdriver has only two features: it's small, and it does
multi-way inter-guest networking as well as guest<->host. It's not
clear how much the latter wins in real life over a point-to-point comms
system.

My interest is in a common low-level transport. My experience is that
it's easy to create an efficient comms channel between a guest and host
(ie. one side can access the others' memory), but it's worthwhile trying
for a model which transparently allows untrusted comms (ie.
hypervisor-assisted to access the other guest's memory). That's easier
if you only want point-to-point (see lguest's io.c for a more general
solution).

Cheers,
Rusty.


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-14 11:53:33 UTC
Permalink
Anthony Liguori wrote:
> Dor Laor wrote:
>
>> Furthermore,
>>
>>
>>> the plan is to completely rearchitect the netback/netfront protocol for
>>> the next Xen release (this effort is referred to netchannel2).
>>>
>>>
>> But isn't Jeremy Fitzhardinge is pushing big patch queue into the
>> kernel?
>>
>>
>
> Yes, but it's not in the kernel yet and there's no guarantee it'll get
> there in time for KVM's consumption.
>

I doubt we could add the missing features to kvmnet, test, optimize,
submit to netdev, apply comments, re-submit, re-write, update to latest
netdev api, and fix all the bugs much faster.

--
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-14 12:05:04 UTC
Permalink
ron minnich wrote:
> We had hoped to get something like this into Xen. On Xen, for example,
> the block device and ethernet device interfaces are as different as
> one could imagine. Disk I/O does not steal pages from the guest. The
> network does. Disk I/O is in 4k chunks, period, with a bitmap
> describing which of the 8 512-byte subunits are being sent. The enet
> device, on read, returns a page with your packet, but also potentially
> containing bits of other domain's packets too. The interfaces are as
> dissimilar as they can be, and I see no reason for such a huge
> variance between what are basically read/write devices.
>

The reason for the variance is that hardware capabilities are very
different for disk and block. Block device requests are always
guest-initiated and sector-aligned, and often span many pages. On the
other hand, network packets are byte aligned, and rx packets are
host-initiated, triggering the stolen pages concept (which
unsurprisingly turned out not to be a win). Network has such esoteric
features as TSO. Block is very interested in actually getting things
onto the disk (barrier support).

In short, the "everything is a stream of bytes" grossly oversimplifies
things.

> Another issue is that kvm, in its current form (-24) is beautifully
> simple. These additions seem to detract from the beauty a bit. Might
> it be worth taking a little time to consider these ideas in order to
> preserve the basic elegance of KVM?
>

kvm? elegant and simple? it's basically a pile of special cases.

But I agree that the growing code base is a problem. With the block
driver we can probably keep the host side in userspace, but to do the
same for networking is much more work. I do think (now) that it is doable.

--
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christian Bornträger
2007-05-14 12:24:44 UTC
Permalink
On Monday 14 May 2007 14:05, Avi Kivity wrote:
> But I agree that the growing code base is a problem. With the block
> driver we can probably keep the host side in userspace, but to do the
> same for networking is much more work. I do think (now) that it is doable.

Interesting. What kind of userspace networking do you have in mind?

One of the first trys from Carsten was to use tun/tap, which proved to be slow
performance-wise.

What I had in mind was some kind of switch in userspace. That would allow
non-root guests to define there own private networks. We could use Linux fast
pipe implementation for guest-to-guest communication.

The questions is how to connect user space networks to the host ones?
- tun/tap is quite slow
- last time we checked, netfiler offered only IP hooks (if you dont use the
bridging code)
- raw sockets get tricky if you do in/out at the same time because you have to
manually deal with loops

This reminds me, that we actually have another party doing virtual networking
between guests: UML. User mode linux actually can do networking/switching in
userspace, but I cannot tell how well UMLs concept works out.

Christian

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Avi Kivity
2007-05-14 12:32:07 UTC
Permalink
Christian Bornträger wrote:
> On Monday 14 May 2007 14:05, Avi Kivity wrote:
>
>> But I agree that the growing code base is a problem. With the block
>> driver we can probably keep the host side in userspace, but to do the
>> same for networking is much more work. I do think (now) that it is doable.
>>
>
> Interesting. What kind of userspace networking do you have in mind?
>
> One of the first trys from Carsten was to use tun/tap, which proved to be slow
> performance-wise.
>

tun/tap, but extended with:

- true aio
- aio with scatter/gather (IO_CMD_PWRITEV/IO_CMD_PREADV)
- qemu support for native Linux aio (not the glibc hackaround currently
in place), so we get event coalescing and cheap multi request submission
- tap support for tso

With these, we could conceivably reach speeds close to an in-kernel
driver. Unfortunately we'd only know after all the hard work was done.

> What I had in mind was some kind of switch in userspace. That would allow
> non-root guests to define there own private networks. We could use Linux fast
> pipe implementation for guest-to-guest communication.
>
> The questions is how to connect user space networks to the host ones?
> - tun/tap is quite slow
> - last time we checked, netfiler offered only IP hooks (if you dont use the
> bridging code)
> - raw sockets get tricky if you do in/out at the same time because you have to
> manually deal with loops
>

qemu has some support for this, see the '-net socket' option.


--
error compiling committee.c: too many arguments to function


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-14 13:36:43 UTC
Permalink
Avi Kivity wrote:
> But I agree that the growing code base is a problem. With the block
> driver we can probably keep the host side in userspace, but to do the
> same for networking is much more work. I do think (now) that it is doable.
I agree that networking needs to be handled in the host kernel. We go
out to userspace for signaling at this time, but that's simply broken.
All our userspace does is do a system call next.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-11 17:36:00 UTC
Permalink
From: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

This driver provides a simple virtualized console. Userspace can
use read/write to its console to pass the data to the host.

Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

---
drivers/s390/Kconfig | 5 +
drivers/s390/guest/Makefile | 1
drivers/s390/guest/guest_console.c | 72 +++++++++++++++++
drivers/s390/guest/guest_console.h | 47 +++++++++++
drivers/s390/guest/guest_tty.c | 153 +++++++++++++++++++++++++++++++++++++
5 files changed, 278 insertions(+)

Index: linux-2.6.21/drivers/s390/guest/guest_console.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/guest_console.c
@@ -0,0 +1,72 @@
+/*
+ * guest console device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ */
+
+#include "linux/kernel.h"
+#include "linux/types.h"
+#include "linux/console.h"
+#include "linux/string.h"
+#include "linux/init.h"
+#include "linux/errno.h"
+#include "guest_console.h"
+
+#define guest_console_major 4 /* TTYAUX_MAJOR */
+#define guest_console_minor 65
+#define guest_console_name "ttyS"
+
+static void guest_console_write(struct console *console, const char *string,
+ unsigned len)
+{
+ int ret;
+ size_t pos;
+
+ for(pos=0; pos < strlen(string); pos += ret) {
+ ret = diag_write(1, string + pos, len - pos);
+ if (ret <= 0)
+ break;
+ }
+}
+
+static struct tty_driver *
+guest_console_device(struct console *c, int *index)
+{
+ *index = c->index;
+ return guest_tty_driver;
+}
+
+static void
+guest_console_unblank(void)
+{
+ return;
+}
+
+static struct console guest_console =
+{
+ .name = guest_console_name,
+ .write = guest_console_write,
+ .device = guest_console_device,
+ .unblank = guest_console_unblank,
+ .flags = CON_PRINTBUFFER,
+ .index = 0 /* ttyS0 */
+};
+
+/*
+ * called by console_init() in drivers/char/tty_io.c at boot-time.
+ */
+static int __init
+guest_console_init(void)
+{
+ if (!MACHINE_IS_GUEST)
+ return 0;
+
+ printk (KERN_INFO "z/Live console initialized\n");
+
+ /* enable printk-access to this driver */
+ register_console(&guest_console);
+ return 0;
+}
+
+console_initcall(guest_console_init);
+
Index: linux-2.6.21/drivers/s390/guest/guest_console.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/guest_console.h
@@ -0,0 +1,47 @@
+/*
+ * guest console device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ */
+
+
+#ifndef __GCONSOLE_H
+#define __GCONSOLE_H
+extern struct tty_driver *guest_tty_driver;
+static inline int diag_write(int fd, const void *buffer, size_t count)
+{
+ register long __arg1 asm("2") = fd;
+ register const void * __arg2 asm("3") = buffer;
+ register size_t __arg3 asm("4") = count;
+ register long __svcres asm("2");
+ long __res;
+ asm volatile (
+ "diag 0,0,2"
+ : "=d" (__svcres)
+ : "0" (__arg1),
+ "d" (__arg2),
+ "d" (__arg3)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+
+static inline int diag_read(int fd, const void *buffer, size_t count)
+{
+ register long __arg1 asm("2") = fd;
+ register const void * __arg2 asm("3") = buffer;
+ register size_t __arg3 asm("4") = count;
+ register long __svcres asm("2");
+ long __res;
+ asm volatile (
+ "diag 0,0,1"
+ : "=d" (__svcres)
+ : "0" (__arg1),
+ "d" (__arg2),
+ "d" (__arg3)
+ : "cc", "memory");
+ __res = __svcres;
+ return __res;
+}
+#endif
+
Index: linux-2.6.21/drivers/s390/guest/guest_tty.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/guest_tty.c
@@ -0,0 +1,153 @@
+/*
+ * guest console tty device driver
+ * Copyright IBM Corp. 2007
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ */
+
+#include <linux/fs.h>
+#include <linux/tty.h>
+#include <linux/tty_flip.h>
+#include <linux/module.h>
+#include <asm/s390_ext.h>
+#include "guest_console.h"
+
+struct tty_driver *guest_tty_driver;
+static struct tty_struct *guest_tty;
+
+MODULE_DESCRIPTION("Guest console for linux guests");
+MODULE_AUTHOR("Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>");
+MODULE_LICENSE("GPL");
+
+static int
+guest_tty_open(struct tty_struct *tty, struct file *filp)
+{
+ guest_tty = tty;
+ tty->driver_data = NULL;
+ return 0;
+}
+
+static void
+guest_tty_close(struct tty_struct *tty, struct file *filp)
+{
+ if (tty->count > 1)
+ return;
+ guest_tty = NULL;
+}
+
+static int
+guest_tty_ioctl(struct tty_struct *tty, struct file * file,
+ unsigned int cmd, unsigned long arg)
+{
+ return -ENOIOCTLCMD;
+}
+
+static int
+guest_tty_write(struct tty_struct *tty, const unsigned char *str, int count)
+{
+ int ret;
+ size_t pos;
+
+ for(pos=0; pos < count; pos += ret) {
+ ret = diag_write(1, str + pos, count - pos);
+ if (ret <= 0)
+ break;
+ }
+ return pos;
+}
+
+static void
+guest_tty_put_char(struct tty_struct *tty, unsigned char ch)
+{
+ guest_tty_write (tty, &ch, 1);
+}
+
+static void
+guest_tty_flush_chars(struct tty_struct *tty)
+{
+ int nop;
+ nop=0; // :)
+}
+
+static int
+guest_tty_chars_in_buffer(struct tty_struct *tty)
+{
+ return 0;
+}
+
+static void
+guest_tty_flush_buffer(struct tty_struct *tty)
+{
+ guest_tty_flush_chars(tty); // :)
+}
+
+static int
+guest_tty_write_room (struct tty_struct *tty)
+{
+ return 65536;
+}
+
+static struct tty_operations guest_ops = {
+ .open = guest_tty_open,
+ .close = guest_tty_close,
+ .write = guest_tty_write,
+ .put_char = guest_tty_put_char,
+ .flush_chars = guest_tty_flush_chars,
+ .write_room = guest_tty_write_room,
+ .chars_in_buffer = guest_tty_chars_in_buffer,
+ .flush_buffer = guest_tty_flush_buffer,
+ .ioctl = guest_tty_ioctl,
+};
+
+static void
+guest_tty_ext_handler(__u16 code)
+{
+ char buffer[256];
+ int count;
+
+ count = diag_read(0, buffer, 256);
+ if (count <= 0)
+ return;
+
+ if (!guest_tty)
+ return;
+ tty_insert_flip_string(guest_tty, buffer, count);
+ tty_flip_buffer_push(guest_tty);
+}
+
+int __init
+guest_tty_init(void)
+{
+ struct tty_driver *driver;
+ int rc;
+
+ if (!MACHINE_IS_GUEST)
+ return 0;
+ register_external_interrupt(0x1234, guest_tty_ext_handler);
+ driver = alloc_tty_driver(1);
+ if (!driver)
+ return -ENOMEM;
+ guest_tty = NULL;
+ driver->owner = THIS_MODULE;
+ driver->driver_name = "guest_line";
+ driver->name = "guest_line";
+ driver->major = TTY_MAJOR;
+ driver->minor_start = 65;
+ driver->type = TTY_DRIVER_TYPE_SYSTEM;
+ driver->subtype = SYSTEM_TYPE_TTY;
+ driver->init_termios = tty_std_termios;
+ driver->init_termios.c_iflag = IGNBRK | IGNPAR;
+ driver->init_termios.c_oflag = ONLCR | XTABS;
+ driver->init_termios.c_lflag = ISIG | ECHO;
+ driver->flags = TTY_DRIVER_REAL_RAW;
+ tty_set_operations(driver, &guest_ops);
+ rc = tty_register_driver(driver);
+ if (rc) {
+ printk(KERN_ERR "guest tty driver: could not register tty - "
+ "tty_register_driver returned %d\n", rc);
+ put_tty_driver(driver);
+ return rc;
+ }
+ guest_tty_driver = driver;
+ return 0;
+}
+module_init(guest_tty_init);
Index: linux-2.6.21/drivers/s390/Kconfig
===================================================================
--- linux-2.6.21.orig/drivers/s390/Kconfig
+++ linux-2.6.21/drivers/s390/Kconfig
@@ -211,6 +211,11 @@ config MONWRITER
help
Character device driver for writing z/VM monitor service records

+config GUEST_CONSOLE
+ bool "Guest console support"
+ depends on S390_GUEST
+ help
+ Select this option if you want to run as an s390 guest
endmenu

menu "Cryptographic devices"
Index: linux-2.6.21/drivers/s390/guest/Makefile
===================================================================
--- linux-2.6.21.orig/drivers/s390/guest/Makefile
+++ linux-2.6.21/drivers/s390/guest/Makefile
@@ -2,5 +2,6 @@
# s390 Linux virtual environment
#

+obj-$(CONFIG_GUEST_CONSOLE) += guest_console.o guest_tty.o
obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-11 19:00:12 UTC
Permalink
I think it would be better to use hvc_console as Xen now uses it too.

Carsten Otte wrote:
> + if (!MACHINE_IS_GUEST)
> + return 0;
> + register_external_interrupt(0x1234, guest_tty_ext_handler);
>

This is an interesting way to get input data from the console :-) How
many interrupts does s390 support (the x86 only supports 256)? Can you
afford to burn interrupts like this? Is there not a better way to
assign interrupts such that conflict isn't an issue?

Regards,

Anthony Liguori

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christian Bornträger
2007-05-11 19:42:11 UTC
Permalink
On Friday 11 May 2007 21:00, Anthony Liguori wrote:

> I think it would be better to use hvc_console as Xen now uses it too.

I dont know hvc_console, but I will have a look at it.

> Carsten Otte wrote:
> > + if (!MACHINE_IS_GUEST)
> > + return 0;
> > + register_external_interrupt(0x1234, guest_tty_ext_handler);
> >
>
> This is an interesting way to get input data from the console :-) How
> many interrupts does s390 support (the x86 only supports 256)? Can you
> afford to burn interrupts like this? Is there not a better way to
> assign interrupts such that conflict isn't an issue?

On s390 we have a 16 bit interrupt code, so we actually have plenty of
numbers... But, yes its a very good point, burning interrupts wont work
cross-platform.

Our patches are prototypes and need rework anyway. Take these patches as
discussion contribution in the spirit of release early. :-)

cheers

Christian

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-12 08:07:21 UTC
Permalink
Anthony Liguori wrote:
> I think it would be better to use hvc_console as Xen now uses it too.
This console driver is pretty basic indeed.

> This is an interesting way to get input data from the console :-) How
> many interrupts does s390 support (the x86 only supports 256)? Can you
> afford to burn interrupts like this? Is there not a better way to
> assign interrupts such that conflict isn't an issue?
We have 2^16 external interrupts on 390, plus IO interrupts,
multiplied by the fact that each interrupt can be used in various
interrupt subclasses. We can burn irqs indeed, but as Christian
mentioned this cannot go into the portable approach.

so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christian Bornträger
2007-05-14 16:23:13 UTC
Permalink
On Friday 11 May 2007 21:00, Anthony Liguori wrote:
> I think it would be better to use hvc_console as Xen now uses it too.

I just had a look at hvc_console, and indeed this driver looks appropriate for
us. Looking at the xen-frontend driver (~130 lines of code) and the simple
interface (get_char and put_char) it should be reasonably easy to convert our
driver to a hvc_console user.

Christian

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christian Borntraeger
2007-05-14 16:48:18 UTC
Permalink
On Monday 14 May 2007 18:23, Christian Bornträger wrote:
> On Friday 11 May 2007 21:00, Anthony Liguori wrote:
> > I think it would be better to use hvc_console as Xen now uses it too.
> I just had a look at hvc_console, and indeed this driver looks appropriate

As I started prototyping this frontend I realized that hvc_console requires
some interfaces, which are not present on s390, e.g. we have no request_irq
and free_irq. Dont know if hvc_console is still the right way to go for us.
This needs more thinking.

Christian

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-14 17:49:17 UTC
Permalink
Christian Borntraeger wrote:
> On Monday 14 May 2007 18:23, Christian Bornträger wrote:
>
>> On Friday 11 May 2007 21:00, Anthony Liguori wrote:
>>
>>> I think it would be better to use hvc_console as Xen now uses it too.
>>>
>> I just had a look at hvc_console, and indeed this driver looks appropriate
>>
>
> As I started prototyping this frontend I realized that hvc_console requires
> some interfaces, which are not present on s390, e.g. we have no request_irq
> and free_irq. Dont know if hvc_console is still the right way to go for us.
>

It seems like request_irq is roughly the same as
register_external_interrupt. I suspect that you could get away with
either patching hvc_console to use register_external_interrupt if
CONFIG_S390 or perhaps providing a common interface.

I suspect that this is going to come up again for sharing other paravirt
drivers.

Regards,

Anthony Liguori

> This needs more thinking.
>
> Christian
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Arnd Bergmann
2007-05-15 00:27:34 UTC
Permalink
On Monday 14 May 2007, Anthony Liguori wrote:
> It seems like request_irq is roughly the same as
> register_external_interrupt.  I suspect that you could get away with
> either patching hvc_console to use register_external_interrupt if
> CONFIG_S390 or perhaps providing a common interface.
>
> I suspect that this is going to come up again for sharing other paravirt
> drivers.

request_irq() is not a nice interface for s390, but it will probably make
sense to convert the two existing users of register_external_interrupt to
use that instead, in order to get something that can be shared across
architectures for virtual drivers.

It basically means extending struct ext_int_info_t to include a name and
a void* member that gets passed back to the interrupt handler, and to check
for invalid flags passed to request_irq.

You might want to show these in /proc/interrupts then as well,
as per-interrupt values.

Arnd <><

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-15 07:54:45 UTC
Permalink
Anthony Liguori wrote:
> It seems like request_irq is roughly the same as
> register_external_interrupt. I suspect that you could get away with
> either patching hvc_console to use register_external_interrupt if
> CONFIG_S390 or perhaps providing a common interface.
>
> I suspect that this is going to come up again for sharing other paravirt
> drivers.
Maybe we should have a wrappers for request_irq/free_irq in arch/
rather then #ifdefs in each paravirtual driver. We need to talk this
over with Martin (our arch maintainer).


so long,
Carsten

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-11 17:36:08 UTC
Permalink
From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>

This is the host counterpart for the virtual network device driver. This driver
has an char device node where the hypervisor can attach. It also
has a kind of dumb switch that passes packets between guests. Last but not least
it contains a host network interface. Patches for attaching other host network
devices to the switch via raw sockets, extensions to qeth or netfilter are
currently tested but not ready yet. We did not use the linux bridging code to
allow non-root users to create virtual networks between guests.

Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

---
drivers/s390/guest/Makefile | 3
drivers/s390/guest/vnet_port_guest.c | 302 ++++++++++++
drivers/s390/guest/vnet_port_guest.h | 21
drivers/s390/guest/vnet_port_host.c | 418 +++++++++++++++++
drivers/s390/guest/vnet_port_host.h | 18
drivers/s390/guest/vnet_switch.c | 828 +++++++++++++++++++++++++++++++++++
drivers/s390/guest/vnet_switch.h | 119 +++++
drivers/s390/net/Kconfig | 12
8 files changed, 1721 insertions(+)

Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.c
@@ -0,0 +1,302 @@
+/*
+ * Copyright (C) 2005 IBM Corporation
+ * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+#include <linux/etherdevice.h>
+#include <linux/fs.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/pagemap.h>
+#include <linux/poll.h>
+#include <linux/spinlock.h>
+
+#include "vnet.h"
+#include "vnet_port_guest.h"
+#include "vnet_switch.h"
+
+static void COFIXME_add_irq(struct vnet_guest_port *zgp, int data)
+{
+ int oldval, newval;
+
+ do {
+ oldval = atomic_read(&zgp->pending_irqs);
+ newval = oldval | data;
+ } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, newval) != oldval);
+}
+
+static int COFIXME_get_irq(struct vnet_guest_port *zgp)
+{
+ int oldval;
+
+ do {
+ oldval = atomic_read(&zgp->pending_irqs);
+ } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, 0) != oldval);
+
+ return oldval;
+}
+
+static void
+vnet_guest_interrupt(struct vnet_port *port, int type)
+{
+ struct vnet_guest_port *priv;
+
+ priv = port->priv;
+
+ if (!priv->fasync) {
+ printk (KERN_WARNING "vnet: cannot send interrupt,"
+ "fd not async\n");
+ return;
+ }
+ switch (type) {
+ case VNET_IRQ_START_RX:
+ COFIXME_add_irq(priv, POLLIN);
+ kill_fasync(&priv->fasync, SIGIO, POLL_IN);
+ break;
+ case VNET_IRQ_START_TX:
+ COFIXME_add_irq(priv, POLLOUT);
+ kill_fasync(&priv->fasync, SIGIO, POLL_OUT);
+ break;
+ default:
+ BUG();
+ }
+}
+
+/* release all pinned user pages*/
+static void
+vnet_guest_release_pages(struct vnet_port *port)
+{
+ int i,j;
+
+ for (i=0; i<VNET_QUEUE_LEN; i++)
+ for (j=0; j<VNET_BUFFER_PAGES; j++) {
+ if (port->s2p_data[i][j]) {
+ page_cache_release(virt_to_page(port->s2p_data[i][j]));
+ port->s2p_data[i][j] = NULL;
+ }
+ if (port->p2s_data[i][j]) {
+ page_cache_release(virt_to_page(port->p2s_data[i][j]));
+ port->p2s_data[i][j] = NULL;
+ }
+ }
+ if (port->control) {
+ page_cache_release(virt_to_page(port->control));
+ port->control = NULL;
+ }
+}
+
+static int
+vnet_chr_open(struct inode *ino, struct file *filp)
+{
+ int minor;
+ struct vnet_port *port;
+ char name[BUS_ID_SIZE];
+
+ minor = iminor(filp->f_dentry->d_inode);
+ snprintf(name, BUS_ID_SIZE, "guest:%d", current->pid);
+ port = vnet_port_get(minor, name);
+ if (!port)
+ return -ENODEV;
+ port->priv = kzalloc(sizeof(struct vnet_guest_port), GFP_KERNEL);
+ if (!port->priv) {
+ vnet_port_put(port);
+ return -ENOMEM;
+ }
+ port->interrupt = vnet_guest_interrupt;
+ filp->private_data = port;
+ return nonseekable_open(ino, filp);
+}
+
+static int
+vnet_chr_release (struct inode *ino, struct file *filp)
+{
+ struct vnet_port *port;
+ port = (struct vnet_port *) filp->private_data;
+
+//FIXME: what about open close? We unregister non exisiting mac addresses
+// in vnet_port_detach!
+ vnet_port_detach(port);
+ vnet_guest_release_pages(port);
+ vnet_port_put(port);
+ return 0;
+}
+
+
+/* helper function which maps a user page into the kernel
+ * the memory must be free with page_cache_release */
+static void *user_to_kernel(char __user *user)
+{
+ struct page *temp_page;
+ int rc;
+
+ BUG_ON(((unsigned long) user) % PAGE_SIZE);
+ rc = fault_in_pages_writeable(user, PAGE_SIZE);
+ if (rc)
+ return NULL;
+ rc = get_user_pages(current, current->mm, (unsigned long) user,
+ 1, 1, 1, &temp_page, NULL);
+ if (rc != 1)
+ return NULL;
+ return page_address(temp_page);
+}
+
+/* this function pins the userspace buffers into memory*/
+static int
+vnet_guest_alloc_pages(struct vnet_port *port)
+{
+ int i,j;
+
+ down_read(&current->mm->mmap_sem);
+ for (i=0; i<VNET_QUEUE_LEN; i++)
+ for (j=0; j<VNET_BUFFER_PAGES; j++) {
+ port->s2p_data[i][j] = user_to_kernel(port->control->
+ s2pbufs[i].data + j*PAGE_SIZE);
+ if (!port->s2p_data[i][j])
+ goto cleanup;
+ port->p2s_data[i][j] = user_to_kernel(port->control->
+ p2sbufs[i].data + j*PAGE_SIZE);
+ if (!port->p2s_data[i][j])
+ goto cleanup;
+
+ }
+ up_read(&current->mm->mmap_sem);
+ return 0;
+cleanup:
+ up_read(&current->mm->mmap_sem);
+ vnet_guest_release_pages(port);
+ return -ENOMEM;
+}
+
+/* userspace control data structure stuff */
+static int
+vnet_register_control(struct vnet_port *port, unsigned long user_addr)
+{
+ u64 uaddr;
+ int rc;
+ struct page *control_page;
+
+ rc = copy_from_user(&uaddr, (void __user *) user_addr, sizeof(uaddr));
+ if (rc)
+ return -EFAULT;
+ if (uaddr % PAGE_SIZE)
+ return -EFAULT;
+ down_read(&current->mm->mmap_sem);
+ rc = get_user_pages(current, current->mm, (unsigned long)uaddr,
+ 1, 1, 1, &control_page, NULL);
+ up_read(&current->mm->mmap_sem);
+ if (rc!=1)
+ return -EFAULT;
+ port->control = (struct vnet_control *) page_address(control_page);
+ rc = vnet_guest_alloc_pages(port);
+ if (rc) {
+ printk("vnet: could not get buffers\n");
+ return rc;
+ }
+ random_ether_addr(port->mac);
+ memcpy(port->control->mac, port->mac,6);
+ vnet_port_attach(port);
+ return 0;
+}
+
+static int
+vnet_interrupt(struct vnet_port *port, int __user *u_type)
+{
+ int type, rc;
+
+ rc = copy_from_user (&type, u_type, sizeof(int));
+ if (rc)
+ return -EFAULT;
+ switch (type) {
+ case VNET_IRQ_START_RX:
+ vnet_port_rx(port);
+ break;
+ case VNET_IRQ_START_TX: /* noop with current drop packet approach*/
+ break;
+ default:
+ printk(KERN_ERR "vnet: Unknown interrupt type %d\n", type);
+ rc = -EINVAL;
+ }
+ return rc;
+}
+
+
+
+
+//this is a HACK. >>COFIXME<<
+unsigned int
+vnet_poll(struct file *filp, poll_table * wait)
+{
+ struct vnet_port *port;
+ struct vnet_guest_port *zgp;
+
+ port = filp->private_data;
+ zgp = port->priv;
+ return COFIXME_get_irq(zgp);
+}
+
+static int vnet_fill_info(struct vnet_port *zp, void __user *data)
+{
+ struct vnet_info info;
+
+ info.linktype = zp->zs->linktype;
+ info.maxmtu=32768; //FIXME
+ return copy_to_user(data, &info, sizeof(info));
+}
+long
+vnet_ioctl(struct file *filp, unsigned int no, unsigned long data)
+{
+ struct vnet_port *port =
+ (struct vnet_port *) filp->private_data;
+ int rc;
+
+ switch (no) {
+ case VNET_REGISTER_CTL:
+ rc = vnet_register_control(port, data);
+ break;
+ case VNET_INTERRUPT:
+ rc = vnet_interrupt(port, (int __user *) data);
+ break;
+ case VNET_INFO:
+ rc = vnet_fill_info(port, (void __user *) data);
+ break;
+ default:
+ rc = -ENOTTY;
+ }
+ return rc;
+}
+
+int vnet_fasync(int fd, struct file *filp, int on)
+{
+ struct vnet_port *port;
+ struct vnet_guest_port *zgp;
+ int rc;
+
+ port = filp->private_data;
+ zgp = port->priv;
+
+ if ((rc = fasync_helper(fd, filp, on, &zgp->fasync)) < 0)
+ return rc;
+
+ if (on)
+ rc = f_setown(filp, current->pid, 0);
+ return rc;
+}
+
+
+static struct file_operations vnet_char_fops = {
+ .owner = THIS_MODULE,
+ .open = vnet_chr_open,
+ .release = vnet_chr_release,
+ .unlocked_ioctl = vnet_ioctl,
+ .fasync = vnet_fasync,
+ .poll = vnet_poll,
+};
+
+
+
+void vnet_cdev_init(struct cdev *cdev)
+{
+ cdev_init(cdev, &vnet_char_fops);
+}
Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.h
@@ -0,0 +1,21 @@
+/*
+ * Copyright (C) 2005 IBM Corporation
+ * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#ifndef __VNET_PORTS_GUEST_H
+#define __VNET_PORTS_GUEST_H
+
+#include <linux/fs.h>
+#include <linux/cdev.h>
+#include <asm/atomic.h>
+
+struct vnet_guest_port {
+ struct fasync_struct *fasync;
+ atomic_t pending_irqs;
+};
+
+extern void vnet_cdev_init(struct cdev *cdev);
+#endif
Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_port_host.c
@@ -0,0 +1,418 @@
+/*
+ * vnet zlswitch handling
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#include <linux/etherdevice.h>
+#include <linux/if.h>
+#include <linux/if_ether.h>
+#include <linux/if_arp.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
+#include <linux/pagemap.h>
+#include <linux/spinlock.h>
+
+#include "vnet.h"
+#include "vnet_switch.h"
+#include "vnet_port_host.h"
+
+static void
+vnet_host_interrupt(struct vnet_port *zp, int type)
+{
+ struct vnet_host_port *zhp;
+
+ zhp = zp->priv;
+
+ BUG_ON(!zhp->netdev);
+
+ switch (type) {
+ case VNET_IRQ_START_RX:
+ netif_rx_schedule(zhp->netdev);
+ break;
+ case VNET_IRQ_START_TX:
+ netif_wake_queue(zhp->netdev);
+ break;
+ default:
+ BUG();
+ }
+ /* we are called via system call path. enforce softirq handling */
+ do_softirq();
+}
+
+static void
+vnet_host_free(struct vnet_port *zp)
+{
+ int i,j;
+
+ for (i=0; i<VNET_QUEUE_LEN; i++)
+ for (j=0; j<VNET_BUFFER_PAGES; j++) {
+ if (zp->s2p_data[i][j]) {
+ free_page((unsigned long) zp->s2p_data[i][j]);
+ zp->s2p_data[i][j] = NULL;
+ }
+ if (zp->p2s_data[i][j]) {
+ free_page((unsigned long) zp->p2s_data[i][j]);
+ zp->p2s_data[i][j] = NULL;
+ }
+ }
+ if (zp->control) {
+ kfree(zp->control);
+ zp->control = NULL;
+ }
+}
+
+static int
+vnet_port_hostsetup(struct vnet_port *zp)
+{
+ int i,j;
+
+ zp->control = kzalloc(sizeof(*zp->control), GFP_KERNEL);
+ if (!zp->control)
+ return -ENOMEM;
+ for (i=0; i<VNET_QUEUE_LEN; i++)
+ for (j=0; j<VNET_BUFFER_PAGES; j++) {
+ zp->s2p_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0);
+ if (!zp->s2p_data[i][j])
+ goto oom;
+ zp->p2s_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0);
+ if (!zp->p2s_data[i][j]) {
+ free_page((unsigned long) zp->s2p_data[i][j]);
+ goto oom;
+ }
+ }
+ zp->control->buffer_size = VNET_BUFFER_SIZE;
+ return 0;
+oom:
+ printk(KERN_WARNING "vnet: No memory for buffer space of host device\n");
+ vnet_host_free(zp);
+ return -ENOMEM;
+}
+
+/* host interface specific parts */
+
+
+static int
+vnet_net_open(struct net_device *dev)
+{
+ struct vnet_port *port;
+ struct vnet_control *control;
+
+ port = dev->priv;
+ control = port->control;
+ atomic_set(&control->s2pmit, 0);
+ netif_start_queue(dev);
+ return 0;
+}
+
+static int
+vnet_net_stop(struct net_device *dev)
+{
+ netif_stop_queue(dev);
+ return 0;
+}
+
+static void vnet_net_tx_timeout(struct net_device *dev)
+{
+ struct vnet_port *port = dev->priv;
+ struct vnet_control *control = port->control;
+
+ printk(KERN_ERR "problems in xmit for device %s\n Resetting...\n",
+ dev->name);
+ atomic_set(&control->p2smit, 0);
+ atomic_set(&control->s2pmit, 0);
+ vnet_port_rx(port);
+ netif_wake_queue(dev);
+}
+
+
+static int
+vnet_net_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ struct vnet_port *zhost;
+ struct vnet_host_port *zhp;
+ struct vnet_control *control;
+ struct xmit_buffer *buf;
+ int buffer_status;
+ int pkid;
+
+ zhost = dev->priv;
+ zhp = zhost->priv;
+ control = zhost->control;
+
+ if (!spin_trylock(&zhost->txlock))
+ return NETDEV_TX_LOCKED;
+ if (vnet_q_full(atomic_read(&control->p2smit))) {
+ netif_stop_queue(dev);
+ goto full;
+ }
+ pkid = __nextx(atomic_read(&control->p2smit));
+ buf = &control->p2sbufs[pkid];
+ buf->len = skb->len;
+ buf->proto = skb->protocol;
+ vnet_copy_buf_to_pages(zhost->p2s_data[pkid], skb->data, skb->len);
+ buffer_status = vnet_tx_packet(&control->p2smit);
+ spin_unlock(&zhost->txlock);
+ zhp->stats.tx_packets++;
+ zhp->stats.tx_bytes += skb->len;
+ dev_kfree_skb(skb);
+ dev->trans_start = jiffies;
+ if (buffer_status & QUEUE_WAS_EMPTY)
+ vnet_port_rx(zhost);
+ if (buffer_status & QUEUE_IS_FULL) {
+ netif_stop_queue(dev);
+ spin_lock(&zhost->txlock);
+ } else
+ return NETDEV_TX_OK;
+full:
+ /* we might have raced against the wakeup */
+ if (!vnet_q_full(atomic_read(&control->p2smit)))
+ netif_start_queue(dev);
+ spin_unlock(&zhost->txlock);
+ return NETDEV_TX_OK;
+}
+
+static int
+vnet_l3_poll(struct net_device *dev, int *budget)
+{
+ struct vnet_port *zp = dev->priv;
+ struct vnet_host_port *zhp = zp->priv;
+ struct vnet_control *control = zp->control;
+ struct xmit_buffer *buf;
+ struct sk_buff *skb;
+ int pkid, count, numpackets = min(64, min(dev->quota, *budget));
+ int buffer_status;
+
+ if (vnet_q_empty(atomic_read(&control->s2pmit))) {
+ count = 0;
+ goto empty;
+ }
+loop:
+ count = 0;
+ while(numpackets) {
+ pkid = __nextr(atomic_read(&control->s2pmit));
+ buf = &control->s2pbufs[pkid];
+ skb = dev_alloc_skb(buf->len + 2);
+ if (likely(skb)) {
+ skb_reserve(skb, 2);
+ vnet_copy_pages_to_buf(skb_put(skb, buf->len),
+ zp->s2p_data[pkid], buf->len);
+ skb->dev = dev;
+ skb->protocol = buf->proto;
+// skb->ip_summed = CHECKSUM_UNNECESSARY;
+ zhp->stats.rx_packets++;
+ zhp->stats.rx_bytes += buf->len;
+ netif_receive_skb(skb);
+ numpackets--;
+ (*budget)--;
+ dev->quota--;
+ count++;
+ } else
+ zhp->stats.rx_dropped++;
+ buffer_status = vnet_rx_packet(&control->s2pmit);
+ if (buffer_status & QUEUE_IS_EMPTY)
+ goto empty;
+ }
+ return 1; //please ask us again
+empty:
+ netif_rx_complete(dev);
+ /* we might have raced against a wakup*/
+ if (!vnet_q_empty(atomic_read(&control->s2pmit))) {
+ if (netif_rx_reschedule(dev, count))
+ goto loop;
+ }
+ return 0;
+}
+
+
+static int
+vnet_l2_poll(struct net_device *dev, int *budget)
+{
+ struct vnet_port *zp = dev->priv;
+ struct vnet_host_port *zhp = zp->priv;
+ struct vnet_control *control = zp->control;
+ struct xmit_buffer *buf;
+ struct sk_buff *skb;
+ int pkid, count, numpackets = min(64, min(dev->quota, *budget));
+ int buffer_status;
+
+ if (vnet_q_empty(atomic_read(&control->s2pmit))) {
+ count = 0;
+ goto empty;
+ }
+loop:
+ count = 0;
+ while(numpackets) {
+ pkid = __nextr(atomic_read(&control->s2pmit));
+ buf = &control->s2pbufs[pkid];
+ skb = dev_alloc_skb(buf->len + 2);
+ if (likely(skb)) {
+ skb_reserve(skb, 2);
+ vnet_copy_pages_to_buf(skb_put(skb, buf->len),
+ zp->s2p_data[pkid], buf->len);
+ skb->dev = dev;
+ skb->protocol = eth_type_trans(skb, dev);
+// skb->ip_summed = CHECKSUM_UNNECESSARY;
+ zhp->stats.rx_packets++;
+ zhp->stats.rx_bytes += buf->len;
+ netif_receive_skb(skb);
+ numpackets--;
+ (*budget)--;
+ dev->quota--;
+ count++;
+ } else
+ zhp->stats.rx_dropped++;
+ buffer_status = vnet_rx_packet(&control->s2pmit);
+ if (buffer_status & QUEUE_IS_EMPTY)
+ goto empty;
+ }
+ return 1; //please ask us again
+empty:
+ netif_rx_complete(dev);
+ /* we might have raced against a wakup*/
+ if (!vnet_q_empty(atomic_read(&control->s2pmit))) {
+ if (netif_rx_reschedule(dev, count))
+ goto loop;
+ }
+ return 0;
+}
+
+static struct net_device_stats *
+vnet_net_stats(struct net_device *dev)
+{
+ struct vnet_port *zp;
+ struct vnet_host_port *zhp;
+
+ zp = dev->priv;
+ zhp = zp->priv;
+ return &zhp->stats;
+}
+
+static int
+vnet_net_change_mtu(struct net_device *dev, int new_mtu)
+{
+ if (new_mtu <= ETH_ZLEN)
+ return -ERANGE;
+ if (new_mtu > VNET_BUFFER_SIZE-ETH_HLEN)
+ return -ERANGE;
+ dev->mtu = new_mtu;
+ return 0;
+}
+
+static void
+__vnet_common_init(struct net_device *dev)
+{
+ dev->open = vnet_net_open;
+ dev->stop = vnet_net_stop;
+ dev->hard_start_xmit = vnet_net_xmit;
+ dev->get_stats = vnet_net_stats;
+ dev->tx_timeout = vnet_net_tx_timeout;
+ dev->watchdog_timeo = VNET_TIMEOUT;
+ dev->change_mtu = vnet_net_change_mtu;
+ dev->weight = 64;
+ //dev->features |= NETIF_F_NO_CSUM | NETIF_F_LLTX;
+ dev->features |= NETIF_F_LLTX;
+}
+
+static void
+__vnet_layer3_init(struct net_device *dev)
+{
+ dev->mtu = ETH_DATA_LEN;
+ dev->tx_queue_len = 1000;
+ dev->flags = IFF_BROADCAST|IFF_MULTICAST|IFF_NOARP;
+ dev->type = ARPHRD_PPP;
+ dev->mtu = 1492;
+ dev->poll = vnet_l3_poll;
+ __vnet_common_init(dev);
+}
+
+static void
+__vnet_layer2_init(struct net_device *dev)
+{
+ ether_setup(dev);
+ random_ether_addr(dev->dev_addr);
+ dev->mtu = 1492;
+ dev->poll = vnet_l2_poll;
+ __vnet_common_init(dev);
+}
+
+static void
+vnet_host_destroy(struct vnet_port *zhost)
+{
+ struct vnet_host_port *zhp;
+ zhp = zhost->priv;
+
+ vnet_port_detach(zhost);
+ unregister_netdev(zhp->netdev);
+ free_netdev(zhp->netdev);
+ zhp->netdev = NULL;
+ vnet_host_free(zhost);
+ kfree(zhp);
+ vnet_port_put(zhost);
+}
+
+
+
+struct vnet_port *
+vnet_host_create(char *name)
+{
+ int rc;
+ struct vnet_port *port;
+ struct vnet_host_port *host;
+ char busname[BUS_ID_SIZE];
+ int minor;
+
+ snprintf(busname, BUS_ID_SIZE, "host:%s", name);
+
+ minor = vnet_minor_by_name(name);
+ if (minor < 0)
+ return NULL;
+ port = vnet_port_get(minor, busname);
+ if (!port)
+ goto out;
+ host = kzalloc(sizeof(struct vnet_host_port), GFP_KERNEL);
+ if (!host) {
+ kfree(port);
+ port = NULL;
+ goto out;
+ }
+ port->priv = host;
+ rc =vnet_port_hostsetup(port);
+ if (rc)
+ goto out_free_host;
+ rtnl_lock();
+ if (port->zs->linktype == 2)
+ host->netdev = alloc_netdev(0, name, __vnet_layer2_init);
+ else
+ host->netdev = alloc_netdev(0, name, __vnet_layer3_init);
+ if (!host->netdev)
+ goto out_unlock;
+ memcpy(port->mac, host->netdev->dev_addr, ETH_ALEN);
+
+ host->netdev->priv = port;
+ port->interrupt = vnet_host_interrupt;
+ port->destroy = vnet_host_destroy;
+
+ if (!register_netdevice(host->netdev)) {
+ /* good case */
+ rtnl_unlock();
+ return port;
+ }
+ host->netdev->priv = NULL;
+ free_netdev(host->netdev);
+ host->netdev = NULL;
+out_unlock:
+ rtnl_unlock();
+ vnet_host_free(port);
+out_free_host:
+ vnet_port_put(port);
+ port = NULL;
+out:
+ return port;
+}
Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_port_host.h
@@ -0,0 +1,18 @@
+/*
+ * Copyright (C) 2005 IBM Corporation
+ * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#ifndef __VNET_PORTS_HOST_H
+#define __VNET_PORTS_HOST_H
+
+#include <linux/netdevice.h>
+#include "vnet_switch.h"
+
+struct vnet_host_port {
+ struct net_device_stats stats;
+ struct net_device *netdev;
+};
+extern struct vnet_port * vnet_host_create(char *name);
+#endif
Index: linux-2.6.21/drivers/s390/guest/vnet_switch.c
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_switch.c
@@ -0,0 +1,828 @@
+/*
+ * vnet zlswitch handling
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * Author: Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#include <linux/device.h>
+#include <linux/etherdevice.h>
+#include <linux/fs.h>
+#include <linux/if.h>
+#include <linux/if_ether.h>
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/miscdevice.h>
+#include <linux/module.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
+#include <linux/pagemap.h>
+#include <linux/spinlock.h>
+
+#include "vnet.h"
+#include "vnet_port_guest.h"
+#include "vnet_port_host.h"
+#include "vnet_switch.h"
+
+#define NUM_MINORS 1024
+
+/* devices housekeeping, creation & destruction */
+static LIST_HEAD(vnet_switches);
+static rwlock_t vnet_switches_lock = RW_LOCK_UNLOCKED;
+static struct class *zwitch_class;
+static int vnet_major;
+static struct device *root_dev;
+
+
+/* The following functions allow ports of the switch to know about
+ * the MAC addresses of other ports. This is necessary for special
+ * hardware like OSA express which silently drops incoming packets
+ * that not match known MAC addresses and do not support promiscous
+ * mode as well. We have to register all guest MAC addresses at OSA
+ * make packet receive working */
+
+/* Announces the own MAC address to all other ports
+ * this function is called if a new port is added */
+static void vnet_switch_add_mac(struct vnet_port *port)
+{
+ struct vnet_port *other_port;
+
+ read_lock(&port->zs->ports_lock);
+ list_for_each_entry(other_port, &port->zs->switch_ports, lh)
+ if ((other_port != port) && (other_port->set_mac))
+ other_port->set_mac(other_port,port->mac, 1);
+ read_unlock(&port->zs->ports_lock);
+}
+
+/* Removes the own MAC address from all other ports
+ * this function is called if a port is detached*/
+static void vnet_switch_del_mac(struct vnet_port *port)
+{
+ struct vnet_port *other_port;
+
+ read_lock(&port->zs->ports_lock);
+ list_for_each_entry(other_port, &port->zs->switch_ports, lh)
+ if (other_port->set_mac)
+ other_port->set_mac(other_port, port->mac, 0);
+ read_unlock(&port->zs->ports_lock);
+}
+
+/* Learn MACs from other ports on the same zwitch and forward
+ * the MAC addresses to the set_mac function of the port.*/
+static void __vnet_port_learn_macs(struct vnet_port *port)
+{
+ struct vnet_port *other_port;
+
+ if (!port->set_mac)
+ return;
+ list_for_each_entry(other_port, &port->zs->switch_ports, lh)
+ if (other_port != port)
+ port->set_mac(port, other_port->mac, 1);
+}
+
+/* Unlearn MACS from other ports on the same zwitch */
+static void __vnet_port_unlearn_macs(struct vnet_port *port)
+{
+ struct vnet_port *other_port;
+
+ if (!port->set_mac)
+ return;
+ list_for_each_entry(other_port, &port->zs->switch_ports, lh)
+ if (other_port != port)
+ port->set_mac(port, other_port->mac, 0);
+}
+
+
+static struct vnet_switch *__vnet_switch_by_minor(int minor)
+{
+ struct vnet_switch *zs;
+
+ list_for_each_entry(zs, &vnet_switches, lh) {
+ if (MINOR(zs->cdev.dev) == minor)
+ return zs;
+ }
+ return NULL;
+}
+
+static struct vnet_switch *__vnet_switch_by_name(char *name)
+{
+ struct vnet_switch *zs;
+
+ list_for_each_entry(zs, &vnet_switches, lh)
+ if (strncmp(zs->name, name, ZWITCH_NAME_SIZE) == 0)
+ return zs;
+ return NULL;
+}
+
+/* Returns a switch structure and increases the reference count. If no such
+ * switch exists a new one is created with reference count 1 */
+static struct vnet_switch *zwitch_get(int minor)
+{
+ struct vnet_switch *zs;
+
+ read_lock(&vnet_switches_lock);
+ zs = __vnet_switch_by_minor(minor);
+ if (!zs) {
+ read_unlock(&vnet_switches_lock);
+ return zs;
+ }
+ get_device(&zs->dev);
+ read_unlock(&vnet_switches_lock);
+ return zs;
+}
+
+/* reduces the reference count of the switch. */
+static void zwitch_put(struct vnet_switch * zs)
+{
+ put_device(&zs->dev);
+}
+
+/* looks into the packet and searches a matching MAC address
+ * return NULL if unknown or broadcast */
+static struct vnet_port *__vnet_find_l2(struct vnet_switch *zs, char *data)
+{
+ //FIXME: make this a hash lookup, more macs per device?
+ struct vnet_port *port;
+
+ if (is_multicast_ether_addr(data))
+ return NULL;
+ list_for_each_entry(port, &zs->switch_ports, lh) {
+ if (compare_ether_addr(port->mac, data)==0)
+ goto out;
+ }
+ port = NULL;
+ out:
+ return port;
+}
+
+/* searches the destination for IP only interfaces. Normally routing
+ * is the way to go, but guests should see the net transparently without
+ * a hop in between*/
+static struct vnet_port *__vnet_find_l3(struct vnet_switch *zs, char *data)
+{
+ return NULL;
+}
+
+static struct vnet_port * __vnet_find_destination(struct vnet_switch *zs,
+ char *data)
+{
+ switch (zs->linktype) {
+ case 2:
+ return __vnet_find_l2(zs, data);
+ case 3:
+ return __vnet_find_l3(zs, data);
+ default:
+ BUG();
+ }
+}
+
+/* copies len bytes of data from the memory specified by the list of
+ * pointers **from into the memory specified by the list of pointers **to
+ * with each pointer pointing to a page */
+static void
+vnet_switch_page_copy(void **to, void **from, int len)
+{
+ int remaining=len;
+ int pageid = 0;
+ int amount;
+
+ while(remaining) {
+ amount = min((int)PAGE_SIZE, remaining);
+ memcpy(to[pageid], from[pageid], amount);
+ pageid++;
+ remaining -= amount;
+ }
+}
+
+/* copies to data into a buffer of destination
+ * returns 0 if ok*/
+static int
+vnet_unicast(struct vnet_port *destination, void **from_data, int len, int proto)
+{
+ int pkid;
+ int buffer_status;
+ void **to_data;
+ struct vnet_control *control;
+
+ control = destination->control;
+ spin_lock_bh(&destination->rxlock);
+ if (vnet_q_full(atomic_read(&control->s2pmit))) {
+ destination->rx_dropped++;
+ spin_unlock_bh(&destination->rxlock);
+ return -ENOBUFS;
+ }
+ pkid = __nextx(atomic_read(&control->s2pmit));
+ to_data = destination->s2p_data[pkid];
+ vnet_switch_page_copy(to_data, from_data, len);
+ control->s2pbufs[pkid].len = len;
+ control->s2pbufs[pkid].proto = proto;
+ buffer_status = vnet_tx_packet(&control->s2pmit);
+ spin_unlock_bh(&destination->rxlock);
+ if (buffer_status & QUEUE_WAS_EMPTY)
+ destination->interrupt(destination, VNET_IRQ_START_RX);
+ destination->rx_bytes += len;
+ destination->rx_packets++;
+ return 0;
+}
+
+/* send packets to all ports and emulate broadcasts via unicasts*/
+static int vnet_allcast(struct vnet_port *from_port, void **fromdata,
+ int len, int proto)
+{
+ struct vnet_port *destination;
+ int failure = 0;
+
+ list_for_each_entry(destination, &from_port->zs->switch_ports, lh)
+ if (destination != from_port)
+ failure |= vnet_unicast(destination, fromdata,
+ len, proto);
+ return failure;
+}
+
+/* takes an incoming packet and forwards it to the right port
+ * if a failure occurs, increase the tx_dropped count of the sender*/
+static void vnet_switch_packet(struct vnet_port *from_port,
+ void **from_data, int len, int proto)
+{
+ struct vnet_port *destination;
+ int failure;
+
+ read_lock(&from_port->zs->ports_lock);
+ destination = __vnet_find_destination(from_port->zs, from_data[0]);
+ /* we dont want to loop. FIXME: document when this can happen*/
+ if (destination == from_port) {
+ read_unlock(&from_port->zs->ports_lock);
+ return;
+ }
+ if (destination)
+ failure = vnet_unicast(destination, from_data, len, proto);
+ else
+ failure = vnet_allcast(from_port, from_data, len, proto);
+ read_unlock(&from_port->zs->ports_lock);
+ if (failure)
+ from_port->tx_dropped++;
+ else {
+ from_port->tx_packets++;
+ from_port->tx_bytes += len;
+ }
+}
+
+static void vnet_port_release(struct device *dev)
+{
+ struct vnet_port *port;
+
+ port = container_of(dev, struct vnet_port, dev);
+ zwitch_put(port->zs);
+ kfree(port);
+}
+
+static ssize_t vnet_port_read_mac(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vnet_port *port;
+
+ port = container_of(dev, struct vnet_port, dev);
+ return sprintf(buf,"%02X:%02X:%02X:%02X:%02X:%02X", port->mac[0],
+ port->mac[1], port->mac[2], port->mac[3],
+ port->mac[4], port->mac[5]);
+}
+
+static ssize_t vnet_port_read_tx_bytes(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vnet_port *port;
+
+ port = container_of(dev, struct vnet_port, dev);
+ return sprintf(buf,"%lu", port->tx_bytes);
+}
+
+static ssize_t vnet_port_read_rx_bytes(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vnet_port *port;
+
+ port = container_of(dev, struct vnet_port, dev);
+ return sprintf(buf,"%lu", port->rx_bytes);
+}
+
+static ssize_t vnet_port_read_tx_packets(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vnet_port *port;
+
+ port = container_of(dev, struct vnet_port, dev);
+ return sprintf(buf,"%lu", port->tx_packets);
+}
+
+static ssize_t vnet_port_read_rx_packets(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vnet_port *port;
+
+ port = container_of(dev, struct vnet_port, dev);
+ return sprintf(buf,"%lu", port->rx_packets);
+}
+
+static ssize_t vnet_port_read_tx_dropped(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vnet_port *port;
+
+ port = container_of(dev, struct vnet_port, dev);
+ return sprintf(buf,"%lu", port->tx_dropped);
+}
+
+static ssize_t vnet_port_read_rx_dropped(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct vnet_port *port;
+
+ port = container_of(dev, struct vnet_port, dev);
+ return sprintf(buf,"%lu", port->rx_dropped);
+}
+
+static DEVICE_ATTR(mac, S_IRUSR, vnet_port_read_mac, NULL);
+static DEVICE_ATTR(tx_bytes, S_IRUSR, vnet_port_read_tx_bytes, NULL);
+static DEVICE_ATTR(rx_bytes, S_IRUSR, vnet_port_read_rx_bytes, NULL);
+static DEVICE_ATTR(tx_packets, S_IRUSR, vnet_port_read_tx_packets, NULL);
+static DEVICE_ATTR(rx_packets, S_IRUSR, vnet_port_read_rx_packets, NULL);
+static DEVICE_ATTR(tx_dropped, S_IRUSR, vnet_port_read_tx_dropped, NULL);
+static DEVICE_ATTR(rx_dropped, S_IRUSR, vnet_port_read_rx_dropped, NULL);
+
+static int vnet_port_attributes(struct device *dev)
+{
+ int rc;
+ rc = device_create_file(dev, &dev_attr_mac);
+ if (rc)
+ return rc;
+ rc = device_create_file(dev, &dev_attr_tx_dropped);
+ if (rc)
+ return rc;
+ rc = device_create_file(dev, &dev_attr_rx_dropped);
+ if (rc)
+ return rc;
+ rc = device_create_file(dev, &dev_attr_rx_bytes);
+ if (rc)
+ return rc;
+ rc = device_create_file(dev, &dev_attr_tx_bytes);
+ if (rc)
+ return rc;
+ rc = device_create_file(dev, &dev_attr_rx_packets);
+ if (rc)
+ return rc;
+ rc = device_create_file(dev, &dev_attr_tx_packets);
+ return rc;
+}
+
+
+//FIXME implement this
+static int vnet_port_exists(struct vnet_switch *zs, char *name)
+{
+ read_lock(&zs->ports_lock);
+ read_unlock(&zs->ports_lock);
+ return 0;
+
+}
+
+static struct vnet_port *vnet_port_create(struct vnet_switch *zs,
+ char *name)
+{
+ struct vnet_port *port;
+
+ if (vnet_port_exists(zs, name))
+ return NULL;
+
+ port = kzalloc(sizeof(*port), GFP_KERNEL);
+ if (port) {
+ spin_lock_init(&port->rxlock);
+ spin_lock_init(&port->txlock);
+ INIT_LIST_HEAD(&port->lh);
+ port->zs = zs;
+ } else
+ return NULL;
+ port->dev.parent = &zs->dev;
+ port->dev.release = vnet_port_release;
+ strncpy(port->dev.bus_id, name, BUS_ID_SIZE);
+ if (device_register(&port->dev)) {
+ kfree(port);
+ return NULL;
+ }
+ if (vnet_port_attributes(&port->dev)) {
+ device_unregister(&port->dev);
+ kfree(port);
+ return NULL;
+ }
+ return port;
+}
+
+/*------------------------ switch creation/Destruction/housekeeping---------*/
+
+static void zwitch_destroy_ports(struct vnet_switch *zs)
+{
+ struct vnet_port *port, *tmp;
+
+ list_for_each_entry_safe(port, tmp, &zs->switch_ports, lh) {
+ if (port->destroy)
+ port->destroy(port);
+ else
+ printk("No destroy function for port\n");
+ }
+}
+
+
+static void zwitch_destroy(struct vnet_switch *zs)
+{
+ class_device_destroy(zwitch_class, zs->cdev.dev);
+ cdev_del(&zs->cdev);
+ device_unregister(&zs->dev);
+}
+
+static void zwitch_release(struct device *dev)
+{
+ struct vnet_switch *zs;
+
+ zs = container_of(dev, struct vnet_switch, dev);
+ kfree(zs);
+}
+
+static int __zwitch_get_minor(void)
+{
+ int d, found;
+ struct vnet_switch *zs;
+
+ for (d=0; d< NUM_MINORS; d++) {
+ found = 0;
+ list_for_each_entry(zs, &vnet_switches, lh)
+ if (MINOR(zs->cdev.dev) == d)
+ found++;
+ if (!found) break;
+ }
+ if (found) return -ENODEV;
+ return d;
+}
+
+/*
+ * checks if this name already exists for a zwitch
+ */
+static int __zwitch_check_name(char *name)
+{
+ struct vnet_switch *zs;
+
+ list_for_each_entry(zs, &vnet_switches, lh)
+ if (!strncmp(name, zs->name, ZWITCH_NAME_SIZE))
+ return -EEXIST;
+ return 0;
+}
+
+static int zwitch_create(char *name, int linktype)
+{
+ struct vnet_switch *zs;
+ int minor;
+ int ret;
+
+ if ((linktype < 2) || (linktype > 3))
+ return -EINVAL;
+ zs = kzalloc(sizeof(*zs), GFP_KERNEL);
+ if (!zs) {
+ printk("Creation of %s failed: out of memory\n", name);
+ return -ENOMEM;
+ }
+ zs->linktype = linktype;
+ strncpy(zs->name, name, ZWITCH_NAME_SIZE);
+ rwlock_init(&zs->ports_lock);
+ INIT_LIST_HEAD(&zs->switch_ports);
+
+ write_lock(&vnet_switches_lock);
+ minor = __zwitch_get_minor();
+ if (minor < 0) {
+ write_unlock(&vnet_switches_lock);
+ printk("Creation of %s failed: No free minor number\n", name);
+ kfree(zs);
+ return minor;
+ }
+ if (__zwitch_check_name(zs->name)) {
+ write_unlock(&vnet_switches_lock);
+ printk("Creation of %s failed: name exists\n", name);
+ kfree(zs);
+ return -EEXIST;
+ }
+ list_add_tail(&zs->lh, &vnet_switches);
+ write_unlock(&vnet_switches_lock);
+ strncpy(zs->dev.bus_id, name, min((int) strlen(name),
+ ZWITCH_NAME_SIZE));
+ zs->dev.parent = root_dev;
+ zs->dev.release = zwitch_release;
+ ret = device_register(&zs->dev);
+ if (ret) {
+ write_lock(&vnet_switches_lock);
+ list_del(&zs->lh);
+ write_unlock(&vnet_switches_lock);
+ printk("Creation of %s failed: no device\n",name);
+ return ret;
+ }
+ vnet_cdev_init(&zs->cdev);
+ cdev_add(&zs->cdev, MKDEV(vnet_major, minor), 1);
+ zs->class_device = class_device_create(zwitch_class, NULL,
+ zs->cdev.dev, &zs->dev, name);
+ if (IS_ERR(zs->class_device)) {
+ cdev_del(&zs->cdev);
+ write_lock(&vnet_switches_lock);
+ list_del(&zs->lh);
+ write_unlock(&vnet_switches_lock);
+ printk("Creation of %s failed: no class_device\n", name);
+ device_unregister(&zs->dev);
+ return PTR_ERR(zs->class_device);
+ }
+ return 0;
+}
+
+
+static int zwitch_delete(char *name)
+{
+ struct vnet_switch *zs;
+
+ write_lock(&vnet_switches_lock);
+ zs = __vnet_switch_by_name(name);
+ if (!zs) {
+ write_unlock(&vnet_switches_lock);
+ return -ENOENT;
+ }
+ list_del(&zs->lh);
+ write_unlock(&vnet_switches_lock);
+ zwitch_destroy_ports(zs);
+ zwitch_destroy(zs);
+ return 0;
+}
+
+/* checks if a switch for the given minor exists
+ * if yes, create an unconnected port on this switch
+ * if no, return NULL */
+struct vnet_port *vnet_port_get(int minor, char *port_name)
+{
+ struct vnet_switch *zs;
+ struct vnet_port *port;
+
+ zs = zwitch_get(minor);
+ if (!zs)
+ return NULL;
+ port = vnet_port_create(zs, port_name);
+ if (!port)
+ zwitch_put(zs);
+ return port;
+}
+
+/* attaches the port to the switch. The port must be
+ * fully initialized, as it may get called immediately afterwards */
+void vnet_port_attach(struct vnet_port *port)
+{
+ write_lock_bh(&port->zs->ports_lock);
+ __vnet_port_learn_macs(port);
+ list_add(&port->lh, &port->zs->switch_ports);
+ write_unlock_bh(&port->zs->ports_lock);
+ vnet_switch_add_mac(port);
+ return;
+}
+
+/* detaches the port from the switch. After that,
+ * no calls into the port are made */
+void vnet_port_detach(struct vnet_port *port)
+{
+ vnet_switch_del_mac(port);
+ write_lock_bh(&port->zs->ports_lock);
+ if (!list_empty(&port->lh))
+ list_del(&port->lh);
+ __vnet_port_unlearn_macs(port);
+ write_unlock_bh(&port->zs->ports_lock);
+}
+
+/* releases all ressources allocated with vnet_port_get */
+void vnet_port_put(struct vnet_port *port)
+{
+ BUG_ON(!list_empty(&port->lh) &&( port->lh.next != LIST_POISON1));
+ device_unregister(&port->dev);
+}
+
+/* tell the switch that new data is available */
+void vnet_port_rx(struct vnet_port *port)
+{
+ struct vnet_control *control;
+ int pkid, rc;
+
+ control = port->control;
+ if (vnet_q_empty(atomic_read(&control->p2smit))) {
+ printk(KERN_WARNING "vnet_switch: Empty buffer"
+ "on interrupt\n");
+ return;
+ }
+ do {
+ pkid = __nextr(atomic_read(&control->p2smit));
+ /* fire and forget. Let the switch care about lost packets*/
+ vnet_switch_packet(port, port->p2s_data[pkid],
+ control->p2sbufs[pkid].len,
+ control->p2sbufs[pkid].proto);
+ rc = vnet_rx_packet(&control->p2smit);
+ if (rc & QUEUE_WAS_FULL) {
+ port->interrupt(port, VNET_IRQ_START_TX);
+ }
+ } while (!(rc & QUEUE_IS_EMPTY));
+ return;
+}
+
+/* checks if the given address is locally attached to the switch*/
+int vnet_address_is_local(struct vnet_switch *zs, char *address)
+{
+ struct vnet_port *port;
+
+ read_lock(&zs->ports_lock);
+ port = __vnet_find_destination(zs, address);
+ read_unlock(&zs->ports_lock);
+ return (port != NULL);
+}
+
+
+int vnet_minor_by_name(char *name)
+{
+ struct vnet_switch *zs;
+ int ret;
+
+ read_lock(&vnet_switches_lock);
+ zs = __vnet_switch_by_name(name);
+ if (zs)
+ ret = MINOR(zs->cdev.dev);
+ else
+ ret = -ENODEV;
+ read_unlock(&vnet_switches_lock);
+ return ret;
+}
+
+static void vnet_root_release(struct device *dev)
+{
+ kfree(dev);
+}
+
+
+struct command {
+ char *string1;
+ char *string2;
+};
+
+/*FIXME this is ugly. Dont worry: as soon as we have finalized the interface,
+ this crap is going away. Still, it works.......*/
+static long vnet_control_ioctl(struct file *f, unsigned int command,
+ unsigned long data)
+{
+ char string1[BUS_ID_SIZE];
+ char string2[BUS_ID_SIZE];
+ struct command com;
+ struct vnet_port *port;
+
+ if (!capable(CAP_NET_ADMIN))
+ return -EPERM;
+ if (copy_from_user(&com, (__user struct command*) data, sizeof(struct command)))
+ return -EFAULT;
+ if (copy_from_user(string1, (__user char *) com.string1, ZWITCH_NAME_SIZE))
+ return -EFAULT;
+ if (command >=2)
+ if (copy_from_user(string2, (__user char *) com.string2, ZWITCH_NAME_SIZE))
+ return -EFAULT;
+ if (strnlen(string1, ZWITCH_NAME_SIZE) == ZWITCH_NAME_SIZE)
+ return -EINVAL;
+ switch(command) {
+ case ADD_SWITCH:
+ return zwitch_create(string1,3);
+ case DEL_SWITCH:
+ return zwitch_delete(string1);
+ case ADD_HOST:
+ port = vnet_host_create(string1);
+ if (port) {
+ vnet_port_attach(port);
+ return 0;
+ } else
+ return -ENODEV;
+ default:
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int vnet_control_open(struct inode *inode, struct file *file)
+{
+ return 0;
+}
+
+static int vnet_control_release(struct inode *inode, struct file *file)
+{
+ return 0;
+}
+
+struct file_operations vnet_control_fops = {
+ .open = vnet_control_open,
+ .release = vnet_control_release,
+ .unlocked_ioctl = &vnet_control_ioctl,
+ .compat_ioctl = &vnet_control_ioctl,
+};
+
+struct miscdevice vnet_control_device = {
+ .minor = MISC_DYNAMIC_MINOR,
+ .name = "vnet",
+ .fops = &vnet_control_fops,
+};
+
+int vnet_register_control_device(void)
+{
+ return misc_register(&vnet_control_device);
+}
+
+int __init vnet_switch_init(void)
+{
+ int ret;
+ dev_t dev;
+
+ zwitch_class = class_create(THIS_MODULE, "vnet");
+ if (IS_ERR(zwitch_class)) {
+ printk(KERN_ERR "vnet_switch: class_create failed!\n");
+ ret = PTR_ERR(zwitch_class);
+ goto out;
+ }
+ ret = alloc_chrdev_region(&dev, 0, NUM_MINORS, "vnet");
+ if (ret) {
+ printk(KERN_ERR "vnet_switch: alloc_chrdev_region failed\n");
+ goto out_class;
+ }
+ vnet_major = MAJOR(dev);
+ root_dev = kzalloc(sizeof(*root_dev), GFP_KERNEL);
+ if (!root_dev) {
+ printk(KERN_ERR "vnet_switch:allocation of device failed\n");
+ ret = -ENOMEM;
+ goto out_chrdev;
+ }
+ strncpy(root_dev->bus_id, "vnet", 5);
+ root_dev->release = vnet_root_release;
+ ret =device_register(root_dev);
+ if (ret) {
+ printk(KERN_ERR "vnet_switch: could not register device\n");
+ kfree(root_dev);
+ goto out_chrdev;
+ }
+ ret = vnet_register_control_device();
+ if (ret) {
+ printk("vnet_switch: could not create control device\n");
+ goto out_dev;
+ }
+ printk ("vnet_switch loaded\n");
+/* FIXME ---------- remove these static defines as soon as everyone has the
+ * user tools */
+ {
+ struct vnet_port *port;
+ zwitch_create("myswitch0",2);
+ zwitch_create("myswitch1",3);
+
+ port = vnet_host_create("myswitch0");
+ if (port)
+ vnet_port_attach(port);
+ port = vnet_host_create("myswitch1");
+ if (port)
+ vnet_port_attach(port);
+ }
+/*-----------------------------------------------------------*/
+ return 0;
+out_dev:
+ device_unregister(root_dev);
+out_chrdev:
+ unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS);
+out_class:
+ class_destroy(zwitch_class);
+out:
+ return ret;
+}
+
+/* remove all existing vnet_zwitches in the system and unregister the
+ * character device from the system */
+void vnet_switch_exit(void)
+{
+ struct vnet_switch *zs, *tmp;
+ list_for_each_entry_safe(zs, tmp, &vnet_switches, lh) {
+ zwitch_destroy_ports(zs);
+ zwitch_destroy(zs);
+ }
+ device_unregister(root_dev);
+ misc_deregister(&vnet_control_device);
+ unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS);
+ class_destroy(zwitch_class);
+ printk ("vnet_switch unloaded\n");
+}
+
+module_init(vnet_switch_init);
+module_exit(vnet_switch_exit);
+MODULE_DESCRIPTION("VNET: Virtual switch for vnet interfaces");
+MODULE_AUTHOR("Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>");
+MODULE_LICENSE("GPL");
Index: linux-2.6.21/drivers/s390/guest/vnet_switch.h
===================================================================
--- /dev/null
+++ linux-2.6.21/drivers/s390/guest/vnet_switch.h
@@ -0,0 +1,119 @@
+/*
+ * vnet_switch - zlive insular communication knack switch
+ * infrastructure for virtual switching of Linux guests running under Linux
+ *
+ * Copyright (C) 2005 IBM Corporation
+ * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
+ *
+ */
+
+#ifndef __VNET_SWITCH_H
+#define __VNET_SWITCH_H
+
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/if_ether.h>
+#include <linux/spinlock.h>
+
+#include "vnet.h"
+
+/* defines for IOCTLs. interface should be replaced by something better */
+#define ADD_SWITCH 0
+#define DEL_SWITCH 1
+#define ADD_OSA 2
+#define DEL_OSA 3
+#define ADD_HOST 4
+#define DEL_HOST 5
+
+/* min(IFNAMSIZ, BUS_ID_SIZE)*/
+#define ZWITCH_NAME_SIZE 16
+
+/* This structure describes a virtual switch for ports to userspace network
+ * interfaces, e.g. in Linux under Linux environments*/
+struct vnet_switch {
+ struct list_head lh;
+ char name[ZWITCH_NAME_SIZE];
+ struct list_head switch_ports; /* list of ports */
+ rwlock_t ports_lock; /* lock for switch_ports */
+ struct class_device *class_device;
+ struct cdev cdev;
+ struct device dev;
+ struct vnet_port *osa;
+ int linktype; /* 2=ethernet 3=IP */
+};
+
+/* description of a port of the vnet_switch */
+struct vnet_port {
+ struct list_head lh;
+ struct vnet_switch *zs;
+ struct vnet_control *control;
+ void *s2p_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)];
+ void *p2s_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)];
+ char mac[ETH_ALEN];
+ void *priv;
+ int (*set_mac) (struct vnet_port *port, char mac[ETH_ALEN], int add);
+ void (*interrupt) (struct vnet_port *port, int type);
+ void (*destroy) (struct vnet_port *port);
+ struct device dev;
+ unsigned long rx_packets; /* total packets received */
+ unsigned long tx_packets; /* total packets transmitted */
+ unsigned long rx_bytes; /* total bytes received */
+ unsigned long tx_bytes; /* total bytes transmitted */
+ unsigned long rx_dropped; /* no space in receive buffer */
+ unsigned long tx_dropped; /* no space in destination buffer */
+ spinlock_t rxlock;
+ spinlock_t txlock;
+};
+
+
+static inline int
+vnet_copy_buf_to_pages(void **data, char *buf, int len)
+{
+ int i;
+
+ if (len == 0)
+ return 0;
+ for (i=0; i <= ((len - 1) >> PAGE_SHIFT); i++ )
+ memcpy(data[i], buf + i*PAGE_SIZE, min(PAGE_SIZE, len - i*PAGE_SIZE));
+ return len;
+}
+
+static inline int
+vnet_copy_pages_to_buf(char *buf, void **data, int len)
+{
+ int i;
+
+ if (len == 0)
+ return 0;
+ for (i=0; i <= ((len -1) >> PAGE_SHIFT); i++ )
+ memcpy(buf + i*PAGE_SIZE, data[i], min(PAGE_SIZE, len - i*PAGE_SIZE));
+ return len;
+}
+
+
+/* checks if a switch with the given minor exists
+ * if yes, create a named and unconnected port on
+ * this switch with the given name. if no, return NULL */
+extern struct vnet_port *vnet_port_get(int minor, char *port_name);
+
+/* attaches the port to the switch. The port must be
+ * fully initialized, as it may get data immediately afterwards */
+extern void vnet_port_attach(struct vnet_port *port);
+
+/* detaches the port from the switch. After that,
+ * no calls into the port are made */
+extern void vnet_port_detach(struct vnet_port *port);
+
+/* releases all ressources allocated with vnet_port_get */
+extern void vnet_port_put(struct vnet_port *port);
+
+/* tell the switch that new data is available */
+extern void vnet_port_rx(struct vnet_port *port);
+
+/* get the minor for a given name */
+extern int vnet_minor_by_name(char *name);
+
+/* checks if the given address is locally attached to the switch*/
+extern int vnet_address_is_local(struct vnet_switch *zs, char *address);
+#endif
Index: linux-2.6.21/drivers/s390/guest/Makefile
===================================================================
--- linux-2.6.21.orig/drivers/s390/guest/Makefile
+++ linux-2.6.21/drivers/s390/guest/Makefile
@@ -6,3 +6,6 @@ obj-$(CONFIG_GUEST_CONSOLE) += guest_con
obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o
obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o
obj-$(CONFIG_VNET_GUEST) += vnet_guest.o
+vnet_host-objs := vnet_switch.o vnet_port_guest.o vnet_port_host.o
+obj-$(CONFIG_VNET_HOST) += vnet_host.o
+
Index: linux-2.6.21/drivers/s390/net/Kconfig
===================================================================
--- linux-2.6.21.orig/drivers/s390/net/Kconfig
+++ linux-2.6.21/drivers/s390/net/Kconfig
@@ -95,4 +95,16 @@ config VNET_GUEST
connection.
If you're not using host/guest support, say N.

+config VNET_HOST
+ tristate "virtual networking support (HOST)"
+ depends on QETH && S390_HOST
+ help
+ This is the host part of the vnet guest network connection.
+ Say Y if you plan to host guests with network
+ connection. The host part consists of a virtual switch
+ a host device as well as a connection to the qeth
+ driver.
+ If you're not using this kernel for hosting guest, say N.
+
+
endmenu



-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Anthony Liguori
2007-05-11 20:21:28 UTC
Permalink
Carsten Otte wrote:
> From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
>
> This is the host counterpart for the virtual network device driver. This driver
> has an char device node where the hypervisor can attach. It also
> has a kind of dumb switch that passes packets between guests. Last but not least
> it contains a host network interface. Patches for attaching other host network
> devices to the switch via raw sockets, extensions to qeth or netfilter are
>

Any feel for the performance relative to the bridging code? The
bridging code is a pretty big bottle neck in guest=>guest communications
in Xen at least.

> currently tested but not ready yet. We did not use the linux bridging code to
> allow non-root users to create virtual networks between guests.
>

Is that the primary reason? If so, that seems like a rather large
hammer for something that a userspace suid wrapper could have addressed...

Regards,

Anthony Liguori

> Signed-off-by: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> Signed-off-by: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
>
> ---
> drivers/s390/guest/Makefile | 3
> drivers/s390/guest/vnet_port_guest.c | 302 ++++++++++++
> drivers/s390/guest/vnet_port_guest.h | 21
> drivers/s390/guest/vnet_port_host.c | 418 +++++++++++++++++
> drivers/s390/guest/vnet_port_host.h | 18
> drivers/s390/guest/vnet_switch.c | 828 +++++++++++++++++++++++++++++++++++
> drivers/s390/guest/vnet_switch.h | 119 +++++
> drivers/s390/net/Kconfig | 12
> 8 files changed, 1721 insertions(+)
>
> Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.c
> @@ -0,0 +1,302 @@
> +/*
> + * Copyright (C) 2005 IBM Corporation
> + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + *
> + */
> +#include <linux/etherdevice.h>
> +#include <linux/fs.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/pagemap.h>
> +#include <linux/poll.h>
> +#include <linux/spinlock.h>
> +
> +#include "vnet.h"
> +#include "vnet_port_guest.h"
> +#include "vnet_switch.h"
> +
> +static void COFIXME_add_irq(struct vnet_guest_port *zgp, int data)
> +{
> + int oldval, newval;
> +
> + do {
> + oldval = atomic_read(&zgp->pending_irqs);
> + newval = oldval | data;
> + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, newval) != oldval);
> +}
> +
> +static int COFIXME_get_irq(struct vnet_guest_port *zgp)
> +{
> + int oldval;
> +
> + do {
> + oldval = atomic_read(&zgp->pending_irqs);
> + } while (atomic_cmpxchg(&zgp->pending_irqs, oldval, 0) != oldval);
> +
> + return oldval;
> +}
> +
> +static void
> +vnet_guest_interrupt(struct vnet_port *port, int type)
> +{
> + struct vnet_guest_port *priv;
> +
> + priv = port->priv;
> +
> + if (!priv->fasync) {
> + printk (KERN_WARNING "vnet: cannot send interrupt,"
> + "fd not async\n");
> + return;
> + }
> + switch (type) {
> + case VNET_IRQ_START_RX:
> + COFIXME_add_irq(priv, POLLIN);
> + kill_fasync(&priv->fasync, SIGIO, POLL_IN);
> + break;
> + case VNET_IRQ_START_TX:
> + COFIXME_add_irq(priv, POLLOUT);
> + kill_fasync(&priv->fasync, SIGIO, POLL_OUT);
> + break;
> + default:
> + BUG();
> + }
> +}
> +
> +/* release all pinned user pages*/
> +static void
> +vnet_guest_release_pages(struct vnet_port *port)
> +{
> + int i,j;
> +
> + for (i=0; i<VNET_QUEUE_LEN; i++)
> + for (j=0; j<VNET_BUFFER_PAGES; j++) {
> + if (port->s2p_data[i][j]) {
> + page_cache_release(virt_to_page(port->s2p_data[i][j]));
> + port->s2p_data[i][j] = NULL;
> + }
> + if (port->p2s_data[i][j]) {
> + page_cache_release(virt_to_page(port->p2s_data[i][j]));
> + port->p2s_data[i][j] = NULL;
> + }
> + }
> + if (port->control) {
> + page_cache_release(virt_to_page(port->control));
> + port->control = NULL;
> + }
> +}
> +
> +static int
> +vnet_chr_open(struct inode *ino, struct file *filp)
> +{
> + int minor;
> + struct vnet_port *port;
> + char name[BUS_ID_SIZE];
> +
> + minor = iminor(filp->f_dentry->d_inode);
> + snprintf(name, BUS_ID_SIZE, "guest:%d", current->pid);
> + port = vnet_port_get(minor, name);
> + if (!port)
> + return -ENODEV;
> + port->priv = kzalloc(sizeof(struct vnet_guest_port), GFP_KERNEL);
> + if (!port->priv) {
> + vnet_port_put(port);
> + return -ENOMEM;
> + }
> + port->interrupt = vnet_guest_interrupt;
> + filp->private_data = port;
> + return nonseekable_open(ino, filp);
> +}
> +
> +static int
> +vnet_chr_release (struct inode *ino, struct file *filp)
> +{
> + struct vnet_port *port;
> + port = (struct vnet_port *) filp->private_data;
> +
> +//FIXME: what about open close? We unregister non exisiting mac addresses
> +// in vnet_port_detach!
> + vnet_port_detach(port);
> + vnet_guest_release_pages(port);
> + vnet_port_put(port);
> + return 0;
> +}
> +
> +
> +/* helper function which maps a user page into the kernel
> + * the memory must be free with page_cache_release */
> +static void *user_to_kernel(char __user *user)
> +{
> + struct page *temp_page;
> + int rc;
> +
> + BUG_ON(((unsigned long) user) % PAGE_SIZE);
> + rc = fault_in_pages_writeable(user, PAGE_SIZE);
> + if (rc)
> + return NULL;
> + rc = get_user_pages(current, current->mm, (unsigned long) user,
> + 1, 1, 1, &temp_page, NULL);
> + if (rc != 1)
> + return NULL;
> + return page_address(temp_page);
> +}
> +
> +/* this function pins the userspace buffers into memory*/
> +static int
> +vnet_guest_alloc_pages(struct vnet_port *port)
> +{
> + int i,j;
> +
> + down_read(&current->mm->mmap_sem);
> + for (i=0; i<VNET_QUEUE_LEN; i++)
> + for (j=0; j<VNET_BUFFER_PAGES; j++) {
> + port->s2p_data[i][j] = user_to_kernel(port->control->
> + s2pbufs[i].data + j*PAGE_SIZE);
> + if (!port->s2p_data[i][j])
> + goto cleanup;
> + port->p2s_data[i][j] = user_to_kernel(port->control->
> + p2sbufs[i].data + j*PAGE_SIZE);
> + if (!port->p2s_data[i][j])
> + goto cleanup;
> +
> + }
> + up_read(&current->mm->mmap_sem);
> + return 0;
> +cleanup:
> + up_read(&current->mm->mmap_sem);
> + vnet_guest_release_pages(port);
> + return -ENOMEM;
> +}
> +
> +/* userspace control data structure stuff */
> +static int
> +vnet_register_control(struct vnet_port *port, unsigned long user_addr)
> +{
> + u64 uaddr;
> + int rc;
> + struct page *control_page;
> +
> + rc = copy_from_user(&uaddr, (void __user *) user_addr, sizeof(uaddr));
> + if (rc)
> + return -EFAULT;
> + if (uaddr % PAGE_SIZE)
> + return -EFAULT;
> + down_read(&current->mm->mmap_sem);
> + rc = get_user_pages(current, current->mm, (unsigned long)uaddr,
> + 1, 1, 1, &control_page, NULL);
> + up_read(&current->mm->mmap_sem);
> + if (rc!=1)
> + return -EFAULT;
> + port->control = (struct vnet_control *) page_address(control_page);
> + rc = vnet_guest_alloc_pages(port);
> + if (rc) {
> + printk("vnet: could not get buffers\n");
> + return rc;
> + }
> + random_ether_addr(port->mac);
> + memcpy(port->control->mac, port->mac,6);
> + vnet_port_attach(port);
> + return 0;
> +}
> +
> +static int
> +vnet_interrupt(struct vnet_port *port, int __user *u_type)
> +{
> + int type, rc;
> +
> + rc = copy_from_user (&type, u_type, sizeof(int));
> + if (rc)
> + return -EFAULT;
> + switch (type) {
> + case VNET_IRQ_START_RX:
> + vnet_port_rx(port);
> + break;
> + case VNET_IRQ_START_TX: /* noop with current drop packet approach*/
> + break;
> + default:
> + printk(KERN_ERR "vnet: Unknown interrupt type %d\n", type);
> + rc = -EINVAL;
> + }
> + return rc;
> +}
> +
> +
> +
> +
> +//this is a HACK. >>COFIXME<<
> +unsigned int
> +vnet_poll(struct file *filp, poll_table * wait)
> +{
> + struct vnet_port *port;
> + struct vnet_guest_port *zgp;
> +
> + port = filp->private_data;
> + zgp = port->priv;
> + return COFIXME_get_irq(zgp);
> +}
> +
> +static int vnet_fill_info(struct vnet_port *zp, void __user *data)
> +{
> + struct vnet_info info;
> +
> + info.linktype = zp->zs->linktype;
> + info.maxmtu=32768; //FIXME
> + return copy_to_user(data, &info, sizeof(info));
> +}
> +long
> +vnet_ioctl(struct file *filp, unsigned int no, unsigned long data)
> +{
> + struct vnet_port *port =
> + (struct vnet_port *) filp->private_data;
> + int rc;
> +
> + switch (no) {
> + case VNET_REGISTER_CTL:
> + rc = vnet_register_control(port, data);
> + break;
> + case VNET_INTERRUPT:
> + rc = vnet_interrupt(port, (int __user *) data);
> + break;
> + case VNET_INFO:
> + rc = vnet_fill_info(port, (void __user *) data);
> + break;
> + default:
> + rc = -ENOTTY;
> + }
> + return rc;
> +}
> +
> +int vnet_fasync(int fd, struct file *filp, int on)
> +{
> + struct vnet_port *port;
> + struct vnet_guest_port *zgp;
> + int rc;
> +
> + port = filp->private_data;
> + zgp = port->priv;
> +
> + if ((rc = fasync_helper(fd, filp, on, &zgp->fasync)) < 0)
> + return rc;
> +
> + if (on)
> + rc = f_setown(filp, current->pid, 0);
> + return rc;
> +}
> +
> +
> +static struct file_operations vnet_char_fops = {
> + .owner = THIS_MODULE,
> + .open = vnet_chr_open,
> + .release = vnet_chr_release,
> + .unlocked_ioctl = vnet_ioctl,
> + .fasync = vnet_fasync,
> + .poll = vnet_poll,
> +};
> +
> +
> +
> +void vnet_cdev_init(struct cdev *cdev)
> +{
> + cdev_init(cdev, &vnet_char_fops);
> +}
> Index: linux-2.6.21/drivers/s390/guest/vnet_port_guest.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.21/drivers/s390/guest/vnet_port_guest.h
> @@ -0,0 +1,21 @@
> +/*
> + * Copyright (C) 2005 IBM Corporation
> + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + *
> + */
> +
> +#ifndef __VNET_PORTS_GUEST_H
> +#define __VNET_PORTS_GUEST_H
> +
> +#include <linux/fs.h>
> +#include <linux/cdev.h>
> +#include <asm/atomic.h>
> +
> +struct vnet_guest_port {
> + struct fasync_struct *fasync;
> + atomic_t pending_irqs;
> +};
> +
> +extern void vnet_cdev_init(struct cdev *cdev);
> +#endif
> Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.c
> @@ -0,0 +1,418 @@
> +/*
> + * vnet zlswitch handling
> + *
> + * Copyright (C) 2005 IBM Corporation
> + * Authors: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + *
> + */
> +
> +#include <linux/etherdevice.h>
> +#include <linux/if.h>
> +#include <linux/if_ether.h>
> +#include <linux/if_arp.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/module.h>
> +#include <linux/netdevice.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/pagemap.h>
> +#include <linux/spinlock.h>
> +
> +#include "vnet.h"
> +#include "vnet_switch.h"
> +#include "vnet_port_host.h"
> +
> +static void
> +vnet_host_interrupt(struct vnet_port *zp, int type)
> +{
> + struct vnet_host_port *zhp;
> +
> + zhp = zp->priv;
> +
> + BUG_ON(!zhp->netdev);
> +
> + switch (type) {
> + case VNET_IRQ_START_RX:
> + netif_rx_schedule(zhp->netdev);
> + break;
> + case VNET_IRQ_START_TX:
> + netif_wake_queue(zhp->netdev);
> + break;
> + default:
> + BUG();
> + }
> + /* we are called via system call path. enforce softirq handling */
> + do_softirq();
> +}
> +
> +static void
> +vnet_host_free(struct vnet_port *zp)
> +{
> + int i,j;
> +
> + for (i=0; i<VNET_QUEUE_LEN; i++)
> + for (j=0; j<VNET_BUFFER_PAGES; j++) {
> + if (zp->s2p_data[i][j]) {
> + free_page((unsigned long) zp->s2p_data[i][j]);
> + zp->s2p_data[i][j] = NULL;
> + }
> + if (zp->p2s_data[i][j]) {
> + free_page((unsigned long) zp->p2s_data[i][j]);
> + zp->p2s_data[i][j] = NULL;
> + }
> + }
> + if (zp->control) {
> + kfree(zp->control);
> + zp->control = NULL;
> + }
> +}
> +
> +static int
> +vnet_port_hostsetup(struct vnet_port *zp)
> +{
> + int i,j;
> +
> + zp->control = kzalloc(sizeof(*zp->control), GFP_KERNEL);
> + if (!zp->control)
> + return -ENOMEM;
> + for (i=0; i<VNET_QUEUE_LEN; i++)
> + for (j=0; j<VNET_BUFFER_PAGES; j++) {
> + zp->s2p_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0);
> + if (!zp->s2p_data[i][j])
> + goto oom;
> + zp->p2s_data[i][j] = (void *) __get_free_pages(GFP_KERNEL,0);
> + if (!zp->p2s_data[i][j]) {
> + free_page((unsigned long) zp->s2p_data[i][j]);
> + goto oom;
> + }
> + }
> + zp->control->buffer_size = VNET_BUFFER_SIZE;
> + return 0;
> +oom:
> + printk(KERN_WARNING "vnet: No memory for buffer space of host device\n");
> + vnet_host_free(zp);
> + return -ENOMEM;
> +}
> +
> +/* host interface specific parts */
> +
> +
> +static int
> +vnet_net_open(struct net_device *dev)
> +{
> + struct vnet_port *port;
> + struct vnet_control *control;
> +
> + port = dev->priv;
> + control = port->control;
> + atomic_set(&control->s2pmit, 0);
> + netif_start_queue(dev);
> + return 0;
> +}
> +
> +static int
> +vnet_net_stop(struct net_device *dev)
> +{
> + netif_stop_queue(dev);
> + return 0;
> +}
> +
> +static void vnet_net_tx_timeout(struct net_device *dev)
> +{
> + struct vnet_port *port = dev->priv;
> + struct vnet_control *control = port->control;
> +
> + printk(KERN_ERR "problems in xmit for device %s\n Resetting...\n",
> + dev->name);
> + atomic_set(&control->p2smit, 0);
> + atomic_set(&control->s2pmit, 0);
> + vnet_port_rx(port);
> + netif_wake_queue(dev);
> +}
> +
> +
> +static int
> +vnet_net_xmit(struct sk_buff *skb, struct net_device *dev)
> +{
> + struct vnet_port *zhost;
> + struct vnet_host_port *zhp;
> + struct vnet_control *control;
> + struct xmit_buffer *buf;
> + int buffer_status;
> + int pkid;
> +
> + zhost = dev->priv;
> + zhp = zhost->priv;
> + control = zhost->control;
> +
> + if (!spin_trylock(&zhost->txlock))
> + return NETDEV_TX_LOCKED;
> + if (vnet_q_full(atomic_read(&control->p2smit))) {
> + netif_stop_queue(dev);
> + goto full;
> + }
> + pkid = __nextx(atomic_read(&control->p2smit));
> + buf = &control->p2sbufs[pkid];
> + buf->len = skb->len;
> + buf->proto = skb->protocol;
> + vnet_copy_buf_to_pages(zhost->p2s_data[pkid], skb->data, skb->len);
> + buffer_status = vnet_tx_packet(&control->p2smit);
> + spin_unlock(&zhost->txlock);
> + zhp->stats.tx_packets++;
> + zhp->stats.tx_bytes += skb->len;
> + dev_kfree_skb(skb);
> + dev->trans_start = jiffies;
> + if (buffer_status & QUEUE_WAS_EMPTY)
> + vnet_port_rx(zhost);
> + if (buffer_status & QUEUE_IS_FULL) {
> + netif_stop_queue(dev);
> + spin_lock(&zhost->txlock);
> + } else
> + return NETDEV_TX_OK;
> +full:
> + /* we might have raced against the wakeup */
> + if (!vnet_q_full(atomic_read(&control->p2smit)))
> + netif_start_queue(dev);
> + spin_unlock(&zhost->txlock);
> + return NETDEV_TX_OK;
> +}
> +
> +static int
> +vnet_l3_poll(struct net_device *dev, int *budget)
> +{
> + struct vnet_port *zp = dev->priv;
> + struct vnet_host_port *zhp = zp->priv;
> + struct vnet_control *control = zp->control;
> + struct xmit_buffer *buf;
> + struct sk_buff *skb;
> + int pkid, count, numpackets = min(64, min(dev->quota, *budget));
> + int buffer_status;
> +
> + if (vnet_q_empty(atomic_read(&control->s2pmit))) {
> + count = 0;
> + goto empty;
> + }
> +loop:
> + count = 0;
> + while(numpackets) {
> + pkid = __nextr(atomic_read(&control->s2pmit));
> + buf = &control->s2pbufs[pkid];
> + skb = dev_alloc_skb(buf->len + 2);
> + if (likely(skb)) {
> + skb_reserve(skb, 2);
> + vnet_copy_pages_to_buf(skb_put(skb, buf->len),
> + zp->s2p_data[pkid], buf->len);
> + skb->dev = dev;
> + skb->protocol = buf->proto;
> +// skb->ip_summed = CHECKSUM_UNNECESSARY;
> + zhp->stats.rx_packets++;
> + zhp->stats.rx_bytes += buf->len;
> + netif_receive_skb(skb);
> + numpackets--;
> + (*budget)--;
> + dev->quota--;
> + count++;
> + } else
> + zhp->stats.rx_dropped++;
> + buffer_status = vnet_rx_packet(&control->s2pmit);
> + if (buffer_status & QUEUE_IS_EMPTY)
> + goto empty;
> + }
> + return 1; //please ask us again
> +empty:
> + netif_rx_complete(dev);
> + /* we might have raced against a wakup*/
> + if (!vnet_q_empty(atomic_read(&control->s2pmit))) {
> + if (netif_rx_reschedule(dev, count))
> + goto loop;
> + }
> + return 0;
> +}
> +
> +
> +static int
> +vnet_l2_poll(struct net_device *dev, int *budget)
> +{
> + struct vnet_port *zp = dev->priv;
> + struct vnet_host_port *zhp = zp->priv;
> + struct vnet_control *control = zp->control;
> + struct xmit_buffer *buf;
> + struct sk_buff *skb;
> + int pkid, count, numpackets = min(64, min(dev->quota, *budget));
> + int buffer_status;
> +
> + if (vnet_q_empty(atomic_read(&control->s2pmit))) {
> + count = 0;
> + goto empty;
> + }
> +loop:
> + count = 0;
> + while(numpackets) {
> + pkid = __nextr(atomic_read(&control->s2pmit));
> + buf = &control->s2pbufs[pkid];
> + skb = dev_alloc_skb(buf->len + 2);
> + if (likely(skb)) {
> + skb_reserve(skb, 2);
> + vnet_copy_pages_to_buf(skb_put(skb, buf->len),
> + zp->s2p_data[pkid], buf->len);
> + skb->dev = dev;
> + skb->protocol = eth_type_trans(skb, dev);
> +// skb->ip_summed = CHECKSUM_UNNECESSARY;
> + zhp->stats.rx_packets++;
> + zhp->stats.rx_bytes += buf->len;
> + netif_receive_skb(skb);
> + numpackets--;
> + (*budget)--;
> + dev->quota--;
> + count++;
> + } else
> + zhp->stats.rx_dropped++;
> + buffer_status = vnet_rx_packet(&control->s2pmit);
> + if (buffer_status & QUEUE_IS_EMPTY)
> + goto empty;
> + }
> + return 1; //please ask us again
> +empty:
> + netif_rx_complete(dev);
> + /* we might have raced against a wakup*/
> + if (!vnet_q_empty(atomic_read(&control->s2pmit))) {
> + if (netif_rx_reschedule(dev, count))
> + goto loop;
> + }
> + return 0;
> +}
> +
> +static struct net_device_stats *
> +vnet_net_stats(struct net_device *dev)
> +{
> + struct vnet_port *zp;
> + struct vnet_host_port *zhp;
> +
> + zp = dev->priv;
> + zhp = zp->priv;
> + return &zhp->stats;
> +}
> +
> +static int
> +vnet_net_change_mtu(struct net_device *dev, int new_mtu)
> +{
> + if (new_mtu <= ETH_ZLEN)
> + return -ERANGE;
> + if (new_mtu > VNET_BUFFER_SIZE-ETH_HLEN)
> + return -ERANGE;
> + dev->mtu = new_mtu;
> + return 0;
> +}
> +
> +static void
> +__vnet_common_init(struct net_device *dev)
> +{
> + dev->open = vnet_net_open;
> + dev->stop = vnet_net_stop;
> + dev->hard_start_xmit = vnet_net_xmit;
> + dev->get_stats = vnet_net_stats;
> + dev->tx_timeout = vnet_net_tx_timeout;
> + dev->watchdog_timeo = VNET_TIMEOUT;
> + dev->change_mtu = vnet_net_change_mtu;
> + dev->weight = 64;
> + //dev->features |= NETIF_F_NO_CSUM | NETIF_F_LLTX;
> + dev->features |= NETIF_F_LLTX;
> +}
> +
> +static void
> +__vnet_layer3_init(struct net_device *dev)
> +{
> + dev->mtu = ETH_DATA_LEN;
> + dev->tx_queue_len = 1000;
> + dev->flags = IFF_BROADCAST|IFF_MULTICAST|IFF_NOARP;
> + dev->type = ARPHRD_PPP;
> + dev->mtu = 1492;
> + dev->poll = vnet_l3_poll;
> + __vnet_common_init(dev);
> +}
> +
> +static void
> +__vnet_layer2_init(struct net_device *dev)
> +{
> + ether_setup(dev);
> + random_ether_addr(dev->dev_addr);
> + dev->mtu = 1492;
> + dev->poll = vnet_l2_poll;
> + __vnet_common_init(dev);
> +}
> +
> +static void
> +vnet_host_destroy(struct vnet_port *zhost)
> +{
> + struct vnet_host_port *zhp;
> + zhp = zhost->priv;
> +
> + vnet_port_detach(zhost);
> + unregister_netdev(zhp->netdev);
> + free_netdev(zhp->netdev);
> + zhp->netdev = NULL;
> + vnet_host_free(zhost);
> + kfree(zhp);
> + vnet_port_put(zhost);
> +}
> +
> +
> +
> +struct vnet_port *
> +vnet_host_create(char *name)
> +{
> + int rc;
> + struct vnet_port *port;
> + struct vnet_host_port *host;
> + char busname[BUS_ID_SIZE];
> + int minor;
> +
> + snprintf(busname, BUS_ID_SIZE, "host:%s", name);
> +
> + minor = vnet_minor_by_name(name);
> + if (minor < 0)
> + return NULL;
> + port = vnet_port_get(minor, busname);
> + if (!port)
> + goto out;
> + host = kzalloc(sizeof(struct vnet_host_port), GFP_KERNEL);
> + if (!host) {
> + kfree(port);
> + port = NULL;
> + goto out;
> + }
> + port->priv = host;
> + rc =vnet_port_hostsetup(port);
> + if (rc)
> + goto out_free_host;
> + rtnl_lock();
> + if (port->zs->linktype == 2)
> + host->netdev = alloc_netdev(0, name, __vnet_layer2_init);
> + else
> + host->netdev = alloc_netdev(0, name, __vnet_layer3_init);
> + if (!host->netdev)
> + goto out_unlock;
> + memcpy(port->mac, host->netdev->dev_addr, ETH_ALEN);
> +
> + host->netdev->priv = port;
> + port->interrupt = vnet_host_interrupt;
> + port->destroy = vnet_host_destroy;
> +
> + if (!register_netdevice(host->netdev)) {
> + /* good case */
> + rtnl_unlock();
> + return port;
> + }
> + host->netdev->priv = NULL;
> + free_netdev(host->netdev);
> + host->netdev = NULL;
> +out_unlock:
> + rtnl_unlock();
> + vnet_host_free(port);
> +out_free_host:
> + vnet_port_put(port);
> + port = NULL;
> +out:
> + return port;
> +}
> Index: linux-2.6.21/drivers/s390/guest/vnet_port_host.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.21/drivers/s390/guest/vnet_port_host.h
> @@ -0,0 +1,18 @@
> +/*
> + * Copyright (C) 2005 IBM Corporation
> + * Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + *
> + */
> +
> +#ifndef __VNET_PORTS_HOST_H
> +#define __VNET_PORTS_HOST_H
> +
> +#include <linux/netdevice.h>
> +#include "vnet_switch.h"
> +
> +struct vnet_host_port {
> + struct net_device_stats stats;
> + struct net_device *netdev;
> +};
> +extern struct vnet_port * vnet_host_create(char *name);
> +#endif
> Index: linux-2.6.21/drivers/s390/guest/vnet_switch.c
> ===================================================================
> --- /dev/null
> +++ linux-2.6.21/drivers/s390/guest/vnet_switch.c
> @@ -0,0 +1,828 @@
> +/*
> + * vnet zlswitch handling
> + *
> + * Copyright (C) 2005 IBM Corporation
> + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + * Author: Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + *
> + */
> +
> +#include <linux/device.h>
> +#include <linux/etherdevice.h>
> +#include <linux/fs.h>
> +#include <linux/if.h>
> +#include <linux/if_ether.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/miscdevice.h>
> +#include <linux/module.h>
> +#include <linux/netdevice.h>
> +#include <linux/rtnetlink.h>
> +#include <linux/pagemap.h>
> +#include <linux/spinlock.h>
> +
> +#include "vnet.h"
> +#include "vnet_port_guest.h"
> +#include "vnet_port_host.h"
> +#include "vnet_switch.h"
> +
> +#define NUM_MINORS 1024
> +
> +/* devices housekeeping, creation & destruction */
> +static LIST_HEAD(vnet_switches);
> +static rwlock_t vnet_switches_lock = RW_LOCK_UNLOCKED;
> +static struct class *zwitch_class;
> +static int vnet_major;
> +static struct device *root_dev;
> +
> +
> +/* The following functions allow ports of the switch to know about
> + * the MAC addresses of other ports. This is necessary for special
> + * hardware like OSA express which silently drops incoming packets
> + * that not match known MAC addresses and do not support promiscous
> + * mode as well. We have to register all guest MAC addresses at OSA
> + * make packet receive working */
> +
> +/* Announces the own MAC address to all other ports
> + * this function is called if a new port is added */
> +static void vnet_switch_add_mac(struct vnet_port *port)
> +{
> + struct vnet_port *other_port;
> +
> + read_lock(&port->zs->ports_lock);
> + list_for_each_entry(other_port, &port->zs->switch_ports, lh)
> + if ((other_port != port) && (other_port->set_mac))
> + other_port->set_mac(other_port,port->mac, 1);
> + read_unlock(&port->zs->ports_lock);
> +}
> +
> +/* Removes the own MAC address from all other ports
> + * this function is called if a port is detached*/
> +static void vnet_switch_del_mac(struct vnet_port *port)
> +{
> + struct vnet_port *other_port;
> +
> + read_lock(&port->zs->ports_lock);
> + list_for_each_entry(other_port, &port->zs->switch_ports, lh)
> + if (other_port->set_mac)
> + other_port->set_mac(other_port, port->mac, 0);
> + read_unlock(&port->zs->ports_lock);
> +}
> +
> +/* Learn MACs from other ports on the same zwitch and forward
> + * the MAC addresses to the set_mac function of the port.*/
> +static void __vnet_port_learn_macs(struct vnet_port *port)
> +{
> + struct vnet_port *other_port;
> +
> + if (!port->set_mac)
> + return;
> + list_for_each_entry(other_port, &port->zs->switch_ports, lh)
> + if (other_port != port)
> + port->set_mac(port, other_port->mac, 1);
> +}
> +
> +/* Unlearn MACS from other ports on the same zwitch */
> +static void __vnet_port_unlearn_macs(struct vnet_port *port)
> +{
> + struct vnet_port *other_port;
> +
> + if (!port->set_mac)
> + return;
> + list_for_each_entry(other_port, &port->zs->switch_ports, lh)
> + if (other_port != port)
> + port->set_mac(port, other_port->mac, 0);
> +}
> +
> +
> +static struct vnet_switch *__vnet_switch_by_minor(int minor)
> +{
> + struct vnet_switch *zs;
> +
> + list_for_each_entry(zs, &vnet_switches, lh) {
> + if (MINOR(zs->cdev.dev) == minor)
> + return zs;
> + }
> + return NULL;
> +}
> +
> +static struct vnet_switch *__vnet_switch_by_name(char *name)
> +{
> + struct vnet_switch *zs;
> +
> + list_for_each_entry(zs, &vnet_switches, lh)
> + if (strncmp(zs->name, name, ZWITCH_NAME_SIZE) == 0)
> + return zs;
> + return NULL;
> +}
> +
> +/* Returns a switch structure and increases the reference count. If no such
> + * switch exists a new one is created with reference count 1 */
> +static struct vnet_switch *zwitch_get(int minor)
> +{
> + struct vnet_switch *zs;
> +
> + read_lock(&vnet_switches_lock);
> + zs = __vnet_switch_by_minor(minor);
> + if (!zs) {
> + read_unlock(&vnet_switches_lock);
> + return zs;
> + }
> + get_device(&zs->dev);
> + read_unlock(&vnet_switches_lock);
> + return zs;
> +}
> +
> +/* reduces the reference count of the switch. */
> +static void zwitch_put(struct vnet_switch * zs)
> +{
> + put_device(&zs->dev);
> +}
> +
> +/* looks into the packet and searches a matching MAC address
> + * return NULL if unknown or broadcast */
> +static struct vnet_port *__vnet_find_l2(struct vnet_switch *zs, char *data)
> +{
> + //FIXME: make this a hash lookup, more macs per device?
> + struct vnet_port *port;
> +
> + if (is_multicast_ether_addr(data))
> + return NULL;
> + list_for_each_entry(port, &zs->switch_ports, lh) {
> + if (compare_ether_addr(port->mac, data)==0)
> + goto out;
> + }
> + port = NULL;
> + out:
> + return port;
> +}
> +
> +/* searches the destination for IP only interfaces. Normally routing
> + * is the way to go, but guests should see the net transparently without
> + * a hop in between*/
> +static struct vnet_port *__vnet_find_l3(struct vnet_switch *zs, char *data)
> +{
> + return NULL;
> +}
> +
> +static struct vnet_port * __vnet_find_destination(struct vnet_switch *zs,
> + char *data)
> +{
> + switch (zs->linktype) {
> + case 2:
> + return __vnet_find_l2(zs, data);
> + case 3:
> + return __vnet_find_l3(zs, data);
> + default:
> + BUG();
> + }
> +}
> +
> +/* copies len bytes of data from the memory specified by the list of
> + * pointers **from into the memory specified by the list of pointers **to
> + * with each pointer pointing to a page */
> +static void
> +vnet_switch_page_copy(void **to, void **from, int len)
> +{
> + int remaining=len;
> + int pageid = 0;
> + int amount;
> +
> + while(remaining) {
> + amount = min((int)PAGE_SIZE, remaining);
> + memcpy(to[pageid], from[pageid], amount);
> + pageid++;
> + remaining -= amount;
> + }
> +}
> +
> +/* copies to data into a buffer of destination
> + * returns 0 if ok*/
> +static int
> +vnet_unicast(struct vnet_port *destination, void **from_data, int len, int proto)
> +{
> + int pkid;
> + int buffer_status;
> + void **to_data;
> + struct vnet_control *control;
> +
> + control = destination->control;
> + spin_lock_bh(&destination->rxlock);
> + if (vnet_q_full(atomic_read(&control->s2pmit))) {
> + destination->rx_dropped++;
> + spin_unlock_bh(&destination->rxlock);
> + return -ENOBUFS;
> + }
> + pkid = __nextx(atomic_read(&control->s2pmit));
> + to_data = destination->s2p_data[pkid];
> + vnet_switch_page_copy(to_data, from_data, len);
> + control->s2pbufs[pkid].len = len;
> + control->s2pbufs[pkid].proto = proto;
> + buffer_status = vnet_tx_packet(&control->s2pmit);
> + spin_unlock_bh(&destination->rxlock);
> + if (buffer_status & QUEUE_WAS_EMPTY)
> + destination->interrupt(destination, VNET_IRQ_START_RX);
> + destination->rx_bytes += len;
> + destination->rx_packets++;
> + return 0;
> +}
> +
> +/* send packets to all ports and emulate broadcasts via unicasts*/
> +static int vnet_allcast(struct vnet_port *from_port, void **fromdata,
> + int len, int proto)
> +{
> + struct vnet_port *destination;
> + int failure = 0;
> +
> + list_for_each_entry(destination, &from_port->zs->switch_ports, lh)
> + if (destination != from_port)
> + failure |= vnet_unicast(destination, fromdata,
> + len, proto);
> + return failure;
> +}
> +
> +/* takes an incoming packet and forwards it to the right port
> + * if a failure occurs, increase the tx_dropped count of the sender*/
> +static void vnet_switch_packet(struct vnet_port *from_port,
> + void **from_data, int len, int proto)
> +{
> + struct vnet_port *destination;
> + int failure;
> +
> + read_lock(&from_port->zs->ports_lock);
> + destination = __vnet_find_destination(from_port->zs, from_data[0]);
> + /* we dont want to loop. FIXME: document when this can happen*/
> + if (destination == from_port) {
> + read_unlock(&from_port->zs->ports_lock);
> + return;
> + }
> + if (destination)
> + failure = vnet_unicast(destination, from_data, len, proto);
> + else
> + failure = vnet_allcast(from_port, from_data, len, proto);
> + read_unlock(&from_port->zs->ports_lock);
> + if (failure)
> + from_port->tx_dropped++;
> + else {
> + from_port->tx_packets++;
> + from_port->tx_bytes += len;
> + }
> +}
> +
> +static void vnet_port_release(struct device *dev)
> +{
> + struct vnet_port *port;
> +
> + port = container_of(dev, struct vnet_port, dev);
> + zwitch_put(port->zs);
> + kfree(port);
> +}
> +
> +static ssize_t vnet_port_read_mac(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct vnet_port *port;
> +
> + port = container_of(dev, struct vnet_port, dev);
> + return sprintf(buf,"%02X:%02X:%02X:%02X:%02X:%02X", port->mac[0],
> + port->mac[1], port->mac[2], port->mac[3],
> + port->mac[4], port->mac[5]);
> +}
> +
> +static ssize_t vnet_port_read_tx_bytes(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct vnet_port *port;
> +
> + port = container_of(dev, struct vnet_port, dev);
> + return sprintf(buf,"%lu", port->tx_bytes);
> +}
> +
> +static ssize_t vnet_port_read_rx_bytes(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct vnet_port *port;
> +
> + port = container_of(dev, struct vnet_port, dev);
> + return sprintf(buf,"%lu", port->rx_bytes);
> +}
> +
> +static ssize_t vnet_port_read_tx_packets(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct vnet_port *port;
> +
> + port = container_of(dev, struct vnet_port, dev);
> + return sprintf(buf,"%lu", port->tx_packets);
> +}
> +
> +static ssize_t vnet_port_read_rx_packets(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct vnet_port *port;
> +
> + port = container_of(dev, struct vnet_port, dev);
> + return sprintf(buf,"%lu", port->rx_packets);
> +}
> +
> +static ssize_t vnet_port_read_tx_dropped(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct vnet_port *port;
> +
> + port = container_of(dev, struct vnet_port, dev);
> + return sprintf(buf,"%lu", port->tx_dropped);
> +}
> +
> +static ssize_t vnet_port_read_rx_dropped(struct device *dev,
> + struct device_attribute *attr,
> + char *buf)
> +{
> + struct vnet_port *port;
> +
> + port = container_of(dev, struct vnet_port, dev);
> + return sprintf(buf,"%lu", port->rx_dropped);
> +}
> +
> +static DEVICE_ATTR(mac, S_IRUSR, vnet_port_read_mac, NULL);
> +static DEVICE_ATTR(tx_bytes, S_IRUSR, vnet_port_read_tx_bytes, NULL);
> +static DEVICE_ATTR(rx_bytes, S_IRUSR, vnet_port_read_rx_bytes, NULL);
> +static DEVICE_ATTR(tx_packets, S_IRUSR, vnet_port_read_tx_packets, NULL);
> +static DEVICE_ATTR(rx_packets, S_IRUSR, vnet_port_read_rx_packets, NULL);
> +static DEVICE_ATTR(tx_dropped, S_IRUSR, vnet_port_read_tx_dropped, NULL);
> +static DEVICE_ATTR(rx_dropped, S_IRUSR, vnet_port_read_rx_dropped, NULL);
> +
> +static int vnet_port_attributes(struct device *dev)
> +{
> + int rc;
> + rc = device_create_file(dev, &dev_attr_mac);
> + if (rc)
> + return rc;
> + rc = device_create_file(dev, &dev_attr_tx_dropped);
> + if (rc)
> + return rc;
> + rc = device_create_file(dev, &dev_attr_rx_dropped);
> + if (rc)
> + return rc;
> + rc = device_create_file(dev, &dev_attr_rx_bytes);
> + if (rc)
> + return rc;
> + rc = device_create_file(dev, &dev_attr_tx_bytes);
> + if (rc)
> + return rc;
> + rc = device_create_file(dev, &dev_attr_rx_packets);
> + if (rc)
> + return rc;
> + rc = device_create_file(dev, &dev_attr_tx_packets);
> + return rc;
> +}
> +
> +
> +//FIXME implement this
> +static int vnet_port_exists(struct vnet_switch *zs, char *name)
> +{
> + read_lock(&zs->ports_lock);
> + read_unlock(&zs->ports_lock);
> + return 0;
> +
> +}
> +
> +static struct vnet_port *vnet_port_create(struct vnet_switch *zs,
> + char *name)
> +{
> + struct vnet_port *port;
> +
> + if (vnet_port_exists(zs, name))
> + return NULL;
> +
> + port = kzalloc(sizeof(*port), GFP_KERNEL);
> + if (port) {
> + spin_lock_init(&port->rxlock);
> + spin_lock_init(&port->txlock);
> + INIT_LIST_HEAD(&port->lh);
> + port->zs = zs;
> + } else
> + return NULL;
> + port->dev.parent = &zs->dev;
> + port->dev.release = vnet_port_release;
> + strncpy(port->dev.bus_id, name, BUS_ID_SIZE);
> + if (device_register(&port->dev)) {
> + kfree(port);
> + return NULL;
> + }
> + if (vnet_port_attributes(&port->dev)) {
> + device_unregister(&port->dev);
> + kfree(port);
> + return NULL;
> + }
> + return port;
> +}
> +
> +/*------------------------ switch creation/Destruction/housekeeping---------*/
> +
> +static void zwitch_destroy_ports(struct vnet_switch *zs)
> +{
> + struct vnet_port *port, *tmp;
> +
> + list_for_each_entry_safe(port, tmp, &zs->switch_ports, lh) {
> + if (port->destroy)
> + port->destroy(port);
> + else
> + printk("No destroy function for port\n");
> + }
> +}
> +
> +
> +static void zwitch_destroy(struct vnet_switch *zs)
> +{
> + class_device_destroy(zwitch_class, zs->cdev.dev);
> + cdev_del(&zs->cdev);
> + device_unregister(&zs->dev);
> +}
> +
> +static void zwitch_release(struct device *dev)
> +{
> + struct vnet_switch *zs;
> +
> + zs = container_of(dev, struct vnet_switch, dev);
> + kfree(zs);
> +}
> +
> +static int __zwitch_get_minor(void)
> +{
> + int d, found;
> + struct vnet_switch *zs;
> +
> + for (d=0; d< NUM_MINORS; d++) {
> + found = 0;
> + list_for_each_entry(zs, &vnet_switches, lh)
> + if (MINOR(zs->cdev.dev) == d)
> + found++;
> + if (!found) break;
> + }
> + if (found) return -ENODEV;
> + return d;
> +}
> +
> +/*
> + * checks if this name already exists for a zwitch
> + */
> +static int __zwitch_check_name(char *name)
> +{
> + struct vnet_switch *zs;
> +
> + list_for_each_entry(zs, &vnet_switches, lh)
> + if (!strncmp(name, zs->name, ZWITCH_NAME_SIZE))
> + return -EEXIST;
> + return 0;
> +}
> +
> +static int zwitch_create(char *name, int linktype)
> +{
> + struct vnet_switch *zs;
> + int minor;
> + int ret;
> +
> + if ((linktype < 2) || (linktype > 3))
> + return -EINVAL;
> + zs = kzalloc(sizeof(*zs), GFP_KERNEL);
> + if (!zs) {
> + printk("Creation of %s failed: out of memory\n", name);
> + return -ENOMEM;
> + }
> + zs->linktype = linktype;
> + strncpy(zs->name, name, ZWITCH_NAME_SIZE);
> + rwlock_init(&zs->ports_lock);
> + INIT_LIST_HEAD(&zs->switch_ports);
> +
> + write_lock(&vnet_switches_lock);
> + minor = __zwitch_get_minor();
> + if (minor < 0) {
> + write_unlock(&vnet_switches_lock);
> + printk("Creation of %s failed: No free minor number\n", name);
> + kfree(zs);
> + return minor;
> + }
> + if (__zwitch_check_name(zs->name)) {
> + write_unlock(&vnet_switches_lock);
> + printk("Creation of %s failed: name exists\n", name);
> + kfree(zs);
> + return -EEXIST;
> + }
> + list_add_tail(&zs->lh, &vnet_switches);
> + write_unlock(&vnet_switches_lock);
> + strncpy(zs->dev.bus_id, name, min((int) strlen(name),
> + ZWITCH_NAME_SIZE));
> + zs->dev.parent = root_dev;
> + zs->dev.release = zwitch_release;
> + ret = device_register(&zs->dev);
> + if (ret) {
> + write_lock(&vnet_switches_lock);
> + list_del(&zs->lh);
> + write_unlock(&vnet_switches_lock);
> + printk("Creation of %s failed: no device\n",name);
> + return ret;
> + }
> + vnet_cdev_init(&zs->cdev);
> + cdev_add(&zs->cdev, MKDEV(vnet_major, minor), 1);
> + zs->class_device = class_device_create(zwitch_class, NULL,
> + zs->cdev.dev, &zs->dev, name);
> + if (IS_ERR(zs->class_device)) {
> + cdev_del(&zs->cdev);
> + write_lock(&vnet_switches_lock);
> + list_del(&zs->lh);
> + write_unlock(&vnet_switches_lock);
> + printk("Creation of %s failed: no class_device\n", name);
> + device_unregister(&zs->dev);
> + return PTR_ERR(zs->class_device);
> + }
> + return 0;
> +}
> +
> +
> +static int zwitch_delete(char *name)
> +{
> + struct vnet_switch *zs;
> +
> + write_lock(&vnet_switches_lock);
> + zs = __vnet_switch_by_name(name);
> + if (!zs) {
> + write_unlock(&vnet_switches_lock);
> + return -ENOENT;
> + }
> + list_del(&zs->lh);
> + write_unlock(&vnet_switches_lock);
> + zwitch_destroy_ports(zs);
> + zwitch_destroy(zs);
> + return 0;
> +}
> +
> +/* checks if a switch for the given minor exists
> + * if yes, create an unconnected port on this switch
> + * if no, return NULL */
> +struct vnet_port *vnet_port_get(int minor, char *port_name)
> +{
> + struct vnet_switch *zs;
> + struct vnet_port *port;
> +
> + zs = zwitch_get(minor);
> + if (!zs)
> + return NULL;
> + port = vnet_port_create(zs, port_name);
> + if (!port)
> + zwitch_put(zs);
> + return port;
> +}
> +
> +/* attaches the port to the switch. The port must be
> + * fully initialized, as it may get called immediately afterwards */
> +void vnet_port_attach(struct vnet_port *port)
> +{
> + write_lock_bh(&port->zs->ports_lock);
> + __vnet_port_learn_macs(port);
> + list_add(&port->lh, &port->zs->switch_ports);
> + write_unlock_bh(&port->zs->ports_lock);
> + vnet_switch_add_mac(port);
> + return;
> +}
> +
> +/* detaches the port from the switch. After that,
> + * no calls into the port are made */
> +void vnet_port_detach(struct vnet_port *port)
> +{
> + vnet_switch_del_mac(port);
> + write_lock_bh(&port->zs->ports_lock);
> + if (!list_empty(&port->lh))
> + list_del(&port->lh);
> + __vnet_port_unlearn_macs(port);
> + write_unlock_bh(&port->zs->ports_lock);
> +}
> +
> +/* releases all ressources allocated with vnet_port_get */
> +void vnet_port_put(struct vnet_port *port)
> +{
> + BUG_ON(!list_empty(&port->lh) &&( port->lh.next != LIST_POISON1));
> + device_unregister(&port->dev);
> +}
> +
> +/* tell the switch that new data is available */
> +void vnet_port_rx(struct vnet_port *port)
> +{
> + struct vnet_control *control;
> + int pkid, rc;
> +
> + control = port->control;
> + if (vnet_q_empty(atomic_read(&control->p2smit))) {
> + printk(KERN_WARNING "vnet_switch: Empty buffer"
> + "on interrupt\n");
> + return;
> + }
> + do {
> + pkid = __nextr(atomic_read(&control->p2smit));
> + /* fire and forget. Let the switch care about lost packets*/
> + vnet_switch_packet(port, port->p2s_data[pkid],
> + control->p2sbufs[pkid].len,
> + control->p2sbufs[pkid].proto);
> + rc = vnet_rx_packet(&control->p2smit);
> + if (rc & QUEUE_WAS_FULL) {
> + port->interrupt(port, VNET_IRQ_START_TX);
> + }
> + } while (!(rc & QUEUE_IS_EMPTY));
> + return;
> +}
> +
> +/* checks if the given address is locally attached to the switch*/
> +int vnet_address_is_local(struct vnet_switch *zs, char *address)
> +{
> + struct vnet_port *port;
> +
> + read_lock(&zs->ports_lock);
> + port = __vnet_find_destination(zs, address);
> + read_unlock(&zs->ports_lock);
> + return (port != NULL);
> +}
> +
> +
> +int vnet_minor_by_name(char *name)
> +{
> + struct vnet_switch *zs;
> + int ret;
> +
> + read_lock(&vnet_switches_lock);
> + zs = __vnet_switch_by_name(name);
> + if (zs)
> + ret = MINOR(zs->cdev.dev);
> + else
> + ret = -ENODEV;
> + read_unlock(&vnet_switches_lock);
> + return ret;
> +}
> +
> +static void vnet_root_release(struct device *dev)
> +{
> + kfree(dev);
> +}
> +
> +
> +struct command {
> + char *string1;
> + char *string2;
> +};
> +
> +/*FIXME this is ugly. Dont worry: as soon as we have finalized the interface,
> + this crap is going away. Still, it works.......*/
> +static long vnet_control_ioctl(struct file *f, unsigned int command,
> + unsigned long data)
> +{
> + char string1[BUS_ID_SIZE];
> + char string2[BUS_ID_SIZE];
> + struct command com;
> + struct vnet_port *port;
> +
> + if (!capable(CAP_NET_ADMIN))
> + return -EPERM;
> + if (copy_from_user(&com, (__user struct command*) data, sizeof(struct command)))
> + return -EFAULT;
> + if (copy_from_user(string1, (__user char *) com.string1, ZWITCH_NAME_SIZE))
> + return -EFAULT;
> + if (command >=2)
> + if (copy_from_user(string2, (__user char *) com.string2, ZWITCH_NAME_SIZE))
> + return -EFAULT;
> + if (strnlen(string1, ZWITCH_NAME_SIZE) == ZWITCH_NAME_SIZE)
> + return -EINVAL;
> + switch(command) {
> + case ADD_SWITCH:
> + return zwitch_create(string1,3);
> + case DEL_SWITCH:
> + return zwitch_delete(string1);
> + case ADD_HOST:
> + port = vnet_host_create(string1);
> + if (port) {
> + vnet_port_attach(port);
> + return 0;
> + } else
> + return -ENODEV;
> + default:
> + return -EINVAL;
> + }
> + return 0;
> +}
> +
> +static int vnet_control_open(struct inode *inode, struct file *file)
> +{
> + return 0;
> +}
> +
> +static int vnet_control_release(struct inode *inode, struct file *file)
> +{
> + return 0;
> +}
> +
> +struct file_operations vnet_control_fops = {
> + .open = vnet_control_open,
> + .release = vnet_control_release,
> + .unlocked_ioctl = &vnet_control_ioctl,
> + .compat_ioctl = &vnet_control_ioctl,
> +};
> +
> +struct miscdevice vnet_control_device = {
> + .minor = MISC_DYNAMIC_MINOR,
> + .name = "vnet",
> + .fops = &vnet_control_fops,
> +};
> +
> +int vnet_register_control_device(void)
> +{
> + return misc_register(&vnet_control_device);
> +}
> +
> +int __init vnet_switch_init(void)
> +{
> + int ret;
> + dev_t dev;
> +
> + zwitch_class = class_create(THIS_MODULE, "vnet");
> + if (IS_ERR(zwitch_class)) {
> + printk(KERN_ERR "vnet_switch: class_create failed!\n");
> + ret = PTR_ERR(zwitch_class);
> + goto out;
> + }
> + ret = alloc_chrdev_region(&dev, 0, NUM_MINORS, "vnet");
> + if (ret) {
> + printk(KERN_ERR "vnet_switch: alloc_chrdev_region failed\n");
> + goto out_class;
> + }
> + vnet_major = MAJOR(dev);
> + root_dev = kzalloc(sizeof(*root_dev), GFP_KERNEL);
> + if (!root_dev) {
> + printk(KERN_ERR "vnet_switch:allocation of device failed\n");
> + ret = -ENOMEM;
> + goto out_chrdev;
> + }
> + strncpy(root_dev->bus_id, "vnet", 5);
> + root_dev->release = vnet_root_release;
> + ret =device_register(root_dev);
> + if (ret) {
> + printk(KERN_ERR "vnet_switch: could not register device\n");
> + kfree(root_dev);
> + goto out_chrdev;
> + }
> + ret = vnet_register_control_device();
> + if (ret) {
> + printk("vnet_switch: could not create control device\n");
> + goto out_dev;
> + }
> + printk ("vnet_switch loaded\n");
> +/* FIXME ---------- remove these static defines as soon as everyone has the
> + * user tools */
> + {
> + struct vnet_port *port;
> + zwitch_create("myswitch0",2);
> + zwitch_create("myswitch1",3);
> +
> + port = vnet_host_create("myswitch0");
> + if (port)
> + vnet_port_attach(port);
> + port = vnet_host_create("myswitch1");
> + if (port)
> + vnet_port_attach(port);
> + }
> +/*-----------------------------------------------------------*/
> + return 0;
> +out_dev:
> + device_unregister(root_dev);
> +out_chrdev:
> + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS);
> +out_class:
> + class_destroy(zwitch_class);
> +out:
> + return ret;
> +}
> +
> +/* remove all existing vnet_zwitches in the system and unregister the
> + * character device from the system */
> +void vnet_switch_exit(void)
> +{
> + struct vnet_switch *zs, *tmp;
> + list_for_each_entry_safe(zs, tmp, &vnet_switches, lh) {
> + zwitch_destroy_ports(zs);
> + zwitch_destroy(zs);
> + }
> + device_unregister(root_dev);
> + misc_deregister(&vnet_control_device);
> + unregister_chrdev_region(MKDEV(vnet_major,0), NUM_MINORS);
> + class_destroy(zwitch_class);
> + printk ("vnet_switch unloaded\n");
> +}
> +
> +module_init(vnet_switch_init);
> +module_exit(vnet_switch_exit);
> +MODULE_DESCRIPTION("VNET: Virtual switch for vnet interfaces");
> +MODULE_AUTHOR("Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>");
> +MODULE_LICENSE("GPL");
> Index: linux-2.6.21/drivers/s390/guest/vnet_switch.h
> ===================================================================
> --- /dev/null
> +++ linux-2.6.21/drivers/s390/guest/vnet_switch.h
> @@ -0,0 +1,119 @@
> +/*
> + * vnet_switch - zlive insular communication knack switch
> + * infrastructure for virtual switching of Linux guests running under Linux
> + *
> + * Copyright (C) 2005 IBM Corporation
> + * Author: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + * Christian Borntraeger <borntrae-tA70FqPdS9bQT0dZR+***@public.gmane.org>
> + *
> + */
> +
> +#ifndef __VNET_SWITCH_H
> +#define __VNET_SWITCH_H
> +
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/if_ether.h>
> +#include <linux/spinlock.h>
> +
> +#include "vnet.h"
> +
> +/* defines for IOCTLs. interface should be replaced by something better */
> +#define ADD_SWITCH 0
> +#define DEL_SWITCH 1
> +#define ADD_OSA 2
> +#define DEL_OSA 3
> +#define ADD_HOST 4
> +#define DEL_HOST 5
> +
> +/* min(IFNAMSIZ, BUS_ID_SIZE)*/
> +#define ZWITCH_NAME_SIZE 16
> +
> +/* This structure describes a virtual switch for ports to userspace network
> + * interfaces, e.g. in Linux under Linux environments*/
> +struct vnet_switch {
> + struct list_head lh;
> + char name[ZWITCH_NAME_SIZE];
> + struct list_head switch_ports; /* list of ports */
> + rwlock_t ports_lock; /* lock for switch_ports */
> + struct class_device *class_device;
> + struct cdev cdev;
> + struct device dev;
> + struct vnet_port *osa;
> + int linktype; /* 2=ethernet 3=IP */
> +};
> +
> +/* description of a port of the vnet_switch */
> +struct vnet_port {
> + struct list_head lh;
> + struct vnet_switch *zs;
> + struct vnet_control *control;
> + void *s2p_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)];
> + void *p2s_data[VNET_QUEUE_LEN][(VNET_BUFFER_SIZE>>PAGE_SHIFT)];
> + char mac[ETH_ALEN];
> + void *priv;
> + int (*set_mac) (struct vnet_port *port, char mac[ETH_ALEN], int add);
> + void (*interrupt) (struct vnet_port *port, int type);
> + void (*destroy) (struct vnet_port *port);
> + struct device dev;
> + unsigned long rx_packets; /* total packets received */
> + unsigned long tx_packets; /* total packets transmitted */
> + unsigned long rx_bytes; /* total bytes received */
> + unsigned long tx_bytes; /* total bytes transmitted */
> + unsigned long rx_dropped; /* no space in receive buffer */
> + unsigned long tx_dropped; /* no space in destination buffer */
> + spinlock_t rxlock;
> + spinlock_t txlock;
> +};
> +
> +
> +static inline int
> +vnet_copy_buf_to_pages(void **data, char *buf, int len)
> +{
> + int i;
> +
> + if (len == 0)
> + return 0;
> + for (i=0; i <= ((len - 1) >> PAGE_SHIFT); i++ )
> + memcpy(data[i], buf + i*PAGE_SIZE, min(PAGE_SIZE, len - i*PAGE_SIZE));
> + return len;
> +}
> +
> +static inline int
> +vnet_copy_pages_to_buf(char *buf, void **data, int len)
> +{
> + int i;
> +
> + if (len == 0)
> + return 0;
> + for (i=0; i <= ((len -1) >> PAGE_SHIFT); i++ )
> + memcpy(buf + i*PAGE_SIZE, data[i], min(PAGE_SIZE, len - i*PAGE_SIZE));
> + return len;
> +}
> +
> +
> +/* checks if a switch with the given minor exists
> + * if yes, create a named and unconnected port on
> + * this switch with the given name. if no, return NULL */
> +extern struct vnet_port *vnet_port_get(int minor, char *port_name);
> +
> +/* attaches the port to the switch. The port must be
> + * fully initialized, as it may get data immediately afterwards */
> +extern void vnet_port_attach(struct vnet_port *port);
> +
> +/* detaches the port from the switch. After that,
> + * no calls into the port are made */
> +extern void vnet_port_detach(struct vnet_port *port);
> +
> +/* releases all ressources allocated with vnet_port_get */
> +extern void vnet_port_put(struct vnet_port *port);
> +
> +/* tell the switch that new data is available */
> +extern void vnet_port_rx(struct vnet_port *port);
> +
> +/* get the minor for a given name */
> +extern int vnet_minor_by_name(char *name);
> +
> +/* checks if the given address is locally attached to the switch*/
> +extern int vnet_address_is_local(struct vnet_switch *zs, char *address);
> +#endif
> Index: linux-2.6.21/drivers/s390/guest/Makefile
> ===================================================================
> --- linux-2.6.21.orig/drivers/s390/guest/Makefile
> +++ linux-2.6.21/drivers/s390/guest/Makefile
> @@ -6,3 +6,6 @@ obj-$(CONFIG_GUEST_CONSOLE) += guest_con
> obj-$(CONFIG_S390_GUEST) += vdev.o vdev_device.o
> obj-$(CONFIG_VDISK) += vdisk.o vdisk_blk.o
> obj-$(CONFIG_VNET_GUEST) += vnet_guest.o
> +vnet_host-objs := vnet_switch.o vnet_port_guest.o vnet_port_host.o
> +obj-$(CONFIG_VNET_HOST) += vnet_host.o
> +
> Index: linux-2.6.21/drivers/s390/net/Kconfig
> ===================================================================
> --- linux-2.6.21.orig/drivers/s390/net/Kconfig
> +++ linux-2.6.21/drivers/s390/net/Kconfig
> @@ -95,4 +95,16 @@ config VNET_GUEST
> connection.
> If you're not using host/guest support, say N.
>
> +config VNET_HOST
> + tristate "virtual networking support (HOST)"
> + depends on QETH && S390_HOST
> + help
> + This is the host part of the vnet guest network connection.
> + Say Y if you plan to host guests with network
> + connection. The host part consists of a virtual switch
> + a host device as well as a connection to the qeth
> + driver.
> + If you're not using this kernel for hosting guest, say N.
> +
> +
> endmenu
>
>
>
> -------------------------------------------------------------------------
> This SF.net email is sponsored by DB2 Express
> Download DB2 Express C - the FREE version of DB2 express and take
> control of your XML. No limits. Just data. Click to get it now.
> http://sourceforge.net/powerbar/db2/
> _______________________________________________
> kvm-devel mailing list
> kvm-devel-5NWGOfrQmneRv+***@public.gmane.org
> https://lists.sourceforge.net/lists/listinfo/kvm-devel
>
>


-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Christian Bornträger
2007-05-11 20:50:55 UTC
Permalink
On Friday 11 May 2007 22:21, Anthony Liguori wrote:
> Any feel for the performance relative to the bridging code? The
> bridging code is a pretty big bottle neck in guest=>guest communications
> in Xen at least.

Last time I checked it we had a quite decent guest to guest performance in the
gigabits/sec.
On the downside the switch is quite aggressive with dropping packages as the
inbound buffer of the virtual network adapters has space for 80 packets.
(that can be changed)

>
> > currently tested but not ready yet. We did not use the linux bridging code
to
> > allow non-root users to create virtual networks between guests.
> >
>
> Is that the primary reason? If so, that seems like a rather large
> hammer for something that a userspace suid wrapper could have addressed...

Actually there are some reasons why we did not use the bridging code:

- One thing is, that a lot of OSA network cards do not support promiscous
mode. There is also the issue that a lot of OSA cards are in layer 3 mode (we
get IP packets and no ethernet frames) so bridging wont work to the host
interface.
- non-root switches
- the performance of bridging (we copy directly from one guest buffer to
another without allocating an skb on the host)
- we considered to hook into the qeth driver (for OSA cards) to deal with
layer3 mode.

The first shot was actually a point-to-point driver (guest netif <--> host
netif). We added the switch at a later time.

Hmm, if we can make bridging work (with a decent performance) on s390 that
would reduce the maintainance work for us as this network switch is far from
being complete.

cheers

Christian

-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Carsten Otte
2007-05-11 17:36:10 UTC
Permalink
From: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>

This patches fixes the accouting of guest cpu time. As sie is executed via a
system call, all guest operations were accounted as system time. To fix this
we define a per thread "sie context". Before issuing the sie instruction we
enter this context and leave the context afterwards. sie_enter and sie_exit
call account_system_vtime, which now checks for being in sie_context. We
define the sie_context to be accounted as user time.

Possible future enhancement: We could add an additional field: "interpretion
time" to cpu stat and process time. Thus we could differentiate between user
time in the host and host user time spent for guests. The main challenge is
the necessary user space change. Therefore, we could export the interpretion
time with a new interface. To be defined.

Signed-off-By: Christian Borntraeger <cborntra-tA70FqPdS9bQT0dZR+***@public.gmane.org>
Signed-off-By: Carsten Otte <cotte-tA70FqPdS9bQT0dZR+***@public.gmane.org>

---
arch/s390/Kconfig | 1 +
arch/s390/host/s390host.c | 15 +++++++++++++++
arch/s390/kernel/process.c | 1 +
arch/s390/kernel/vtime.c | 11 ++++++++++-
include/asm-s390/thread_info.h | 2 ++
5 files changed, 29 insertions(+), 1 deletion(-)

Index: linux-2.6.21/arch/s390/kernel/vtime.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/vtime.c
+++ linux-2.6.21/arch/s390/kernel/vtime.c
@@ -97,6 +97,11 @@ void account_vtime(struct task_struct *t
account_system_time(tsk, 0, cputime);
}

+static inline int task_is_in_sie(struct thread_info *thread)
+{
+ return thread->in_sie;
+}
+
/*
* Update process times based on virtual cpu times stored by entry.S
* to the lowcore fields user_timer, system_timer & steal_clock.
@@ -114,7 +119,11 @@ void account_system_vtime(struct task_st
cputime = S390_lowcore.system_timer >> 12;
S390_lowcore.system_timer -= cputime << 12;
S390_lowcore.steal_clock -= cputime << 12;
- account_system_time(tsk, 0, cputime);
+
+ if (task_is_in_sie(tsk->thread_info) && !hardirq_count() && !softirq_count())
+ account_user_time(tsk, cputime);
+ else
+ account_system_time(tsk, 0, cputime);
}

static inline void set_vtimer(__u64 expires)
Index: linux-2.6.21/arch/s390/host/s390host.c
===================================================================
--- linux-2.6.21.orig/arch/s390/host/s390host.c
+++ linux-2.6.21/arch/s390/host/s390host.c
@@ -27,6 +27,19 @@ static int s390host_do_action(unsigned l

static DEFINE_MUTEX(s390host_init_mutex);

+static void enter_sie(void)
+{
+ account_system_vtime(current);
+ current_thread_info()->in_sie = 1;
+}
+
+static void exit_sie(void)
+{
+ account_system_vtime(current);
+ current_thread_info()->in_sie = 0;
+}
+
+
static void s390host_get_data(struct s390host_data *data)
{
atomic_inc(&data->count);
@@ -297,7 +310,9 @@ again:
schedule();

sie_kernel->sie_block.icptcode = 0;
+ enter_sie();
ret = sie64a(sie_kernel);
+ exit_sie();
if (ret)
goto out;

Index: linux-2.6.21/include/asm-s390/thread_info.h
===================================================================
--- linux-2.6.21.orig/include/asm-s390/thread_info.h
+++ linux-2.6.21/include/asm-s390/thread_info.h
@@ -55,6 +55,7 @@ struct thread_info {
struct restart_block restart_block;
struct s390host_data *s390host_data; /* s390host data */
int sie_cpu; /* sie cpu number */
+ int in_sie; /* 1 => cpu is in sie*/
};

/*
@@ -72,6 +73,7 @@ struct thread_info {
}, \
.s390host_data = NULL, \
.sie_cpu = 0, \
+ .in_sie = 0, \
}

#define init_thread_info (init_thread_union.thread_info)
Index: linux-2.6.21/arch/s390/kernel/process.c
===================================================================
--- linux-2.6.21.orig/arch/s390/kernel/process.c
+++ linux-2.6.21/arch/s390/kernel/process.c
@@ -278,6 +278,7 @@ int copy_thread(int nr, unsigned long cl
memset(&p->thread.per_info,0,sizeof(p->thread.per_info));
p->thread_info->s390host_data = NULL;
p->thread_info->sie_cpu = -1;
+ p->thread_info->in_sie = 0;

return 0;
}
Index: linux-2.6.21/arch/s390/Kconfig
===================================================================
--- linux-2.6.21.orig/arch/s390/Kconfig
+++ linux-2.6.21/arch/s390/Kconfig
@@ -519,6 +519,7 @@ config S390_HOST
bool "s390 host support (EXPERIMENTAL)"
depends on 64BIT && EXPERIMENTAL
select S390_SWITCH_AMODE
+ select VIRT_CPU_ACCOUNTING
help
Select this option if you want to host guest Linux images




-------------------------------------------------------------------------
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
Continue reading on narkive:
Loading...