Discussion:
[PATCH 0/4] mm: new flag to forbid zero page mappings for a vma
Dominik Dingel
2014-10-17 14:09:46 UTC
Permalink
s390 has the special notion of storage keys, which are a kind of per-physical-page flag
that lives outside of directly addressable memory.
These storage keys can be queried and changed with a special set of instructions.
The instructions behave quite nicely under virtualization; if there is:
- an invalid pte, the instructions work on some memory reserved in the host page table (the PGSTE)
- a valid pte, the instructions work with the real storage key (see the sketch below)
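
To make that split concrete, here is a purely illustrative C sketch. This is not the
actual arch/s390 code; the sketch_* helpers are hypothetical stand-ins, only
page_set_storage_key() and the pte macros are taken from the patches below.

/*
 * Sketch only: shows the dispatch described above. With an invalid pte
 * the key is kept in host-side per-pte storage (the PGSTE), with a valid
 * pte the real storage key of the backing page is used.
 */
static int sketch_set_guest_storage_key(struct mm_struct *mm,
					unsigned long addr, unsigned long key)
{
	pte_t *pte = sketch_lookup_pte(mm, addr);	/* hypothetical helper */

	if (pte_val(*pte) & _PAGE_INVALID)
		sketch_store_key_in_pgste(pte, key);	/* hypothetical helper */
	else
		page_set_storage_key(pte_val(*pte) & PAGE_MASK, key, 1);
	return 0;
}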

Thanks to Martin's software reference and dirty bit tracking, the kernel itself no longer issues
any storage key instructions, as a software-based approach is taken instead. On the other hand,
distributions in the wild are currently still using them.

However, for virtualized guests we still have a problem with guest pages mapped to zero pages
and with kernel samepage merging (KSM). With either one, multiple guest pages will point to the
same physical page and therefore share the same storage key.

Let's fix this by introducing a new flag which forbids new zero page mappings.
If the guest issues a storage-key-related instruction, we flag all VMAs, drop existing
zero page mappings and unmerge the guest memory.

Dominik Dingel (4):
s390/mm: refactor global pgste updates
mm: introduce new VM_NOZEROPAGE flag
s390/mm: prevent and break zero page mappings in case of storage keys
s390/mm: disable KSM for storage key enabled pages

arch/s390/Kconfig | 3 +
arch/s390/include/asm/pgalloc.h | 2 -
arch/s390/include/asm/pgtable.h | 3 +-
arch/s390/kvm/kvm-s390.c | 2 +-
arch/s390/kvm/priv.c | 17 ++--
arch/s390/mm/pgtable.c | 181 ++++++++++++++++++----------------------
include/linux/mm.h | 13 ++-
mm/huge_memory.c | 2 +-
mm/memory.c | 2 +-
9 files changed, 112 insertions(+), 113 deletions(-)
--
1.8.5.5

Dominik Dingel
2014-10-17 14:09:49 UTC
Permalink
As soon as storage keys are enabled we need to stop using zero page
mappings to prevent inconsistencies between storage keys and pgste.

Otherwise the following data corruption could happen:
1) guest enables storage key
2) guest sets storage key for not mapped page X
-> change goes to PGSTE
3) guest reads from page X
-> as X was not dirty before, the page will be zero page backed,
storage key from PGSTE for X will go to storage key for zero page
4) guest sets storage key for not mapped page Y (same logic as above)
5) guest reads from page Y
-> as Y was not dirty before, the page will be zero page backed,
storage key from PGSTE for Y will go to storage key for zero page
overwriting storage key for X

While holding the mmap sem, we are safe against changes on entries we
have already fixed. As sske and host large pages are also mutually exclusive,
we do not even need to retry the fixup_user_fault.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <***@de.ibm.com>
Signed-off-by: Martin Schwidefsky <***@de.ibm.com>
---
arch/s390/Kconfig | 3 +++
arch/s390/mm/pgtable.c | 15 +++++++++++++++
2 files changed, 18 insertions(+)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 05c78bb..4e04e63 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -1,6 +1,9 @@
config MMU
def_bool y

+config NOZEROPAGE
+ def_bool y
+
config ZONE_DMA
def_bool y

diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index ab55ba8..6321692 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1309,6 +1309,15 @@ static int __s390_enable_skey(pte_t *pte, unsigned long addr,
pgste_t pgste;

pgste = pgste_get_lock(pte);
+ /*
+ * Remove all zero page mappings; after establishing a policy to forbid
+ * zero page mappings, subsequent faults for those pages will get
+ * fresh anonymous pages.
+ */
+ if (is_zero_pfn(pte_pfn(*pte))) {
+ ptep_flush_direct(walk->mm, addr, pte);
+ pte_val(*pte) = _PAGE_INVALID;
+ }
/* Clear storage key */
pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
PGSTE_GR_BIT | PGSTE_GC_BIT);
@@ -1323,10 +1332,16 @@ void s390_enable_skey(void)
{
struct mm_walk walk = { .pte_entry = __s390_enable_skey };
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;

down_write(&mm->mmap_sem);
if (mm_use_skey(mm))
goto out_up;
+
+ for (vma = mm->mmap; vma; vma = vma->vm_next)
+ vma->vm_flags |= VM_NOZEROPAGE;
+ mm->def_flags |= VM_NOZEROPAGE;
+
walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);
mm->context.use_skey = 1;
--
1.8.5.5

Dominik Dingel
2014-10-17 14:09:48 UTC
Permalink
Add a new vma flag to allow an architecture to disable the backing
of non-present, anonymous pages with the read-only empty zero page.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <***@de.ibm.com>
Signed-off-by: Martin Schwidefsky <***@de.ibm.com>
---
include/linux/mm.h | 13 +++++++++++--
mm/huge_memory.c | 2 +-
mm/memory.c | 2 +-
3 files changed, 13 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cd33ae2..8f09c91 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -113,7 +113,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_GROWSDOWN 0x00000100 /* general info on the segment */
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
-
+#define VM_NOZEROPAGE 0x00001000 /* forbid new zero page mappings */
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */

@@ -179,7 +179,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)

/* This mask defines which mm->def_flags a process can inherit its parent */
-#define VM_INIT_DEF_MASK VM_NOHUGEPAGE
+#define VM_INIT_DEF_MASK (VM_NOHUGEPAGE | VM_NOZEROPAGE)

/*
* mapping from the currently active vm_flags protection bits (the
@@ -1293,6 +1293,15 @@ static inline int stack_guard_page_end(struct vm_area_struct *vma,
!vma_growsup(vma->vm_next, addr);
}

+static inline int vma_forbids_zeropage(struct vm_area_struct *vma)
+{
+#ifdef CONFIG_NOZEROPAGE
+ return vma->vm_flags & VM_NOZEROPAGE;
+#else
+ return 0;
+#endif
+}
+
extern struct task_struct *task_of_stack(struct task_struct *task,
struct vm_area_struct *vma, bool in_group);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de98415..c271265 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -805,7 +805,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) &&
+ if (!(flags & FAULT_FLAG_WRITE) && !vma_forbids_zeropage(vma) &&
transparent_hugepage_use_zero_page()) {
spinlock_t *ptl;
pgtable_t pgtable;
diff --git a/mm/memory.c b/mm/memory.c
index 64f82aa..1859b2b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2640,7 +2640,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_SIGBUS;

/* Use the zero-page for reads */
- if (!(flags & FAULT_FLAG_WRITE)) {
+ if (!(flags & FAULT_FLAG_WRITE) && !vma_forbids_zeropage(vma)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
vma->vm_page_prot));
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
--
1.8.5.5

Dave Hansen
2014-10-17 22:04:21 UTC
Permalink
Post by Dominik Dingel
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cd33ae2..8f09c91 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -113,7 +113,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_GROWSDOWN 0x00000100 /* general info on the segment */
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
-
+#define VM_NOZEROPAGE 0x00001000 /* forbid new zero page mappings */
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
This seems like an awfully obscure use for a very constrained resource
(VM_ flags).

Is there ever a time where the VMAs under an mm have mixed VM_NOZEROPAGE
status? Reading the patches, it _looks_ like it might be an all or
nothing thing.

Full disclosure: I've got an x86-specific feature I want to steal a flag
for. Maybe we should just define another VM_ARCH bit.

Dominik Dingel
2014-10-18 14:49:28 UTC
Permalink
On Fri, 17 Oct 2014 15:04:21 -0700
Post by Dave Hansen
Is there ever a time where the VMAs under an mm have mixed VM_NOZEROPAGE
status? Reading the patches, it _looks_ like it might be an all or
nothing thing.
Currently it is an all or nothing thing, but for a future change we might want to just
tag the guest memory instead of the complete user address space.
Post by Dave Hansen
Full disclosure: I've got an x86-specific feature I want to steal a flag
for. Maybe we should just define another VM_ARCH bit.
So you think of something like:

#if defined(CONFIG_S390)
# define VM_NOZEROPAGE VM_ARCH_1
#endif

#ifndef VM_NOZEROPAGE
# define VM_NOZEROPAGE VM_NONE
#endif

right?


Dave Hansen
2014-10-18 16:28:17 UTC
Permalink
Post by Dominik Dingel
On Fri, 17 Oct 2014 15:04:21 -0700
Post by Dave Hansen
Is there ever a time where the VMAs under an mm have mixed VM_NOZEROPAGE
status? Reading the patches, it _looks_ like it might be an all or
nothing thing.
Currently it is an all or nothing thing, but for a future change we might want to just
tag the guest memory instead of the complete user address space.
I think it's a bad idea to reserve a flag for potential future use. If
you _need_ it in the future, let's have the discussion then. For now, I
think it should probably just be stored in the mm somewhere.
Post by Dominik Dingel
Post by Dave Hansen
Full disclosure: I've got an x86-specific feature I want to steal a flag
for. Maybe we should just define another VM_ARCH bit.
#if defined(CONFIG_S390)
# define VM_NOZEROPAGE VM_ARCH_1
#endif
#ifndef VM_NOZEROPAGE
# define VM_NOZEROPAGE VM_NONE
#endif
right?
Yeah, something like that.

Paolo Bonzini
2014-10-20 18:14:53 UTC
Permalink
Post by Dave Hansen
Post by Dominik Dingel
Currently it is an all or nothing thing, but for a future change we might want to just
tag the guest memory instead of the complete user address space.
I think it's a bad idea to reserve a flag for potential future use. If
you _need_ it in the future, let's have the discussion then. For now, I
think it should probably just be stored in the mm somewhere.
I agree with Dave (I thought I disagreed, but I changed my mind while
writing down my thoughts). Just define mm_forbids_zeropage in
arch/s390/include/asm, and make it return mm->context.use_skey---with a
comment explaining how this is only for processes that use KVM, and then
only for guests that use storage keys.
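
A minimal sketch of what that could look like, assuming the mm context bits from
the patches in this thread (the exact file and naming are of course up to the s390
maintainers):

/* sketch only, e.g. somewhere in arch/s390/include/asm/pgtable.h */
#define mm_forbids_zeropage mm_forbids_zeropage
static inline int mm_forbids_zeropage(struct mm_struct *mm)
{
	/*
	 * Only KVM processes whose guests actually issued storage key
	 * instructions have use_skey set, so only those lose the zero page.
	 */
	return mm_use_skey(mm);
}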

Paolo (who was just taught what storage keys really are)

Martin Schwidefsky
2014-10-21 06:11:31 UTC
Permalink
On Mon, 20 Oct 2014 20:14:53 +0200
Post by Paolo Bonzini
Post by Dave Hansen
Post by Dominik Dingel
Currently it is an all or nothing thing, but for a future change we might want to just
tag the guest memory instead of the complete user address space.
I think it's a bad idea to reserve a flag for potential future use. If
you _need_ it in the future, let's have the discussion then. For now, I
think it should probably just be stored in the mm somewhere.
I agree with Dave (I thought I disagreed, but I changed my mind while
writing down my thoughts). Just define mm_forbids_zeropage in
arch/s390/include/asm, and make it return mm->context.use_skey---with a
comment explaining how this is only for processes that use KVM, and then
only for guests that use storage keys.
The mm_forbids_zeropage() approach will surely work for now, but I think a vma flag
is the better solution. This is analogous to VM_MERGEABLE or VM_NOHUGEPAGE;
the best solution would be to only mark those vmas that are mapped to
the guest. That we have not yet found a sensible way to do that
does not change the fact that "no-zero-page" is a per-vma property, no?

But if you insist, we can go with mm_forbids_zeropage() until we find a
clever way to distinguish the guest vmas from the qemu ones.
--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
Paolo Bonzini
2014-10-21 08:11:43 UTC
Permalink
Post by Martin Schwidefsky
Post by Paolo Bonzini
I agree with Dave (I thought I disagreed, but I changed my mind while
writing down my thoughts). Just define mm_forbids_zeropage in
arch/s390/include/asm, and make it return mm->context.use_skey---with a
comment explaining how this is only for processes that use KVM, and then
only for guests that use storage keys.
The mm_forbids_zeropage() sure will work for now, but I think a vma flag
is the better solution. This is analog to VM_MERGEABLE or VM_NOHUGEPAGE,
the best solution would be to only mark those vmas that are mapped to
the guest. That we have not found a way to do that yet in a sensible way
does not change the fact that "no-zero-page" is a per-vma property, no?
I agree it should be per-VMA. However, right now the code is
complicated unnecessarily by making it a per-VMA flag. Also, setting
the flag per VMA should probably be done in
kvm_arch_prepare_memory_region together with some kind of storage key
notifier. This is not very much like Dominik's patch. All in all,
mm_forbids_zeropage() provides a non-intrusive and non-controversial way
to fix the bug. Later on, switching to vma_forbids_zeropage() will be
trivial as far as mm/ code is concerned.
Post by Martin Schwidefsky
But if you insist we go with the mm_forbids_zeropage() until we find a
clever way to distinguish the guest vmas from the qemu ones.
Yeah, I think it is simpler for now.

Paolo

Dominik Dingel
2014-10-21 11:20:25 UTC
Permalink
On Tue, 21 Oct 2014 10:11:43 +0200
Post by Paolo Bonzini
Post by Martin Schwidefsky
Post by Paolo Bonzini
I agree with Dave (I thought I disagreed, but I changed my mind while
writing down my thoughts). Just define mm_forbids_zeropage in
arch/s390/include/asm, and make it return mm->context.use_skey---with a
comment explaining how this is only for processes that use KVM, and then
only for guests that use storage keys.
The mm_forbids_zeropage() sure will work for now, but I think a vma flag
is the better solution. This is analog to VM_MERGEABLE or VM_NOHUGEPAGE,
the best solution would be to only mark those vmas that are mapped to
the guest. That we have not found a way to do that yet in a sensible way
does not change the fact that "no-zero-page" is a per-vma property, no?
I agree it should be per-VMA. However, right now the code is
complicated unnecessarily by making it a per-VMA flag. Also, setting
the flag per VMA should probably be done in
kvm_arch_prepare_memory_region together with some kind of storage key
notifier. This is not very much like Dominik's patch. All in all,
mm_forbids_zeropage() provides a non-intrusive and non-controversial way
to fix the bug. Later on, switching to vma_forbids_zeropage() will be
trivial as far as mm/ code is concerned.
Thank you for all the feedback, will cook up a new version.

Dominik Dingel
2014-10-17 14:09:47 UTC
Permalink
Replace the s390 specific page table walker for the pgste updates
with a call to the common code walk_page_range function.
There are now two pte modification functions, one for the reset
of the CMMA state and another one for the initialization of the
storage keys.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
Signed-off-by: Martin Schwidefsky <***@de.ibm.com>
---
arch/s390/include/asm/pgalloc.h | 2 -
arch/s390/include/asm/pgtable.h | 1 +
arch/s390/kvm/kvm-s390.c | 2 +-
arch/s390/mm/pgtable.c | 153 ++++++++++++++--------------------------
4 files changed, 56 insertions(+), 102 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 9e18a61..120e126 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -22,8 +22,6 @@ unsigned long *page_table_alloc(struct mm_struct *, unsigned long);
void page_table_free(struct mm_struct *, unsigned long *);
void page_table_free_rcu(struct mmu_gather *, unsigned long *);

-void page_table_reset_pgste(struct mm_struct *, unsigned long, unsigned long,
- bool init_skey);
int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
unsigned long key, bool nq);

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 5efb2fe..1e991f6a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1750,6 +1750,7 @@ extern int vmem_add_mapping(unsigned long start, unsigned long size);
extern int vmem_remove_mapping(unsigned long start, unsigned long size);
extern int s390_enable_sie(void);
extern void s390_enable_skey(void);
+extern void s390_reset_cmma(struct mm_struct *mm);

/*
* No page table caches to initialise
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..7a33c11 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -281,7 +281,7 @@ static int kvm_s390_mem_control(struct kvm *kvm, struct kvm_device_attr *attr)
case KVM_S390_VM_MEM_CLR_CMMA:
mutex_lock(&kvm->lock);
idx = srcu_read_lock(&kvm->srcu);
- page_table_reset_pgste(kvm->arch.gmap->mm, 0, TASK_SIZE, false);
+ s390_reset_cmma(kvm->arch.gmap->mm);
srcu_read_unlock(&kvm->srcu, idx);
mutex_unlock(&kvm->lock);
ret = 0;
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 5404a62..ab55ba8 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -885,99 +885,6 @@ static inline void page_table_free_pgste(unsigned long *table)
__free_page(page);
}

-static inline unsigned long page_table_reset_pte(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, unsigned long end, bool init_skey)
-{
- pte_t *start_pte, *pte;
- spinlock_t *ptl;
- pgste_t pgste;
-
- start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- pte = start_pte;
- do {
- pgste = pgste_get_lock(pte);
- pgste_val(pgste) &= ~_PGSTE_GPS_USAGE_MASK;
- if (init_skey) {
- unsigned long address;
-
- pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
- PGSTE_GR_BIT | PGSTE_GC_BIT);
-
- /* skip invalid and not writable pages */
- if (pte_val(*pte) & _PAGE_INVALID ||
- !(pte_val(*pte) & _PAGE_WRITE)) {
- pgste_set_unlock(pte, pgste);
- continue;
- }
-
- address = pte_val(*pte) & PAGE_MASK;
- page_set_storage_key(address, PAGE_DEFAULT_KEY, 1);
- }
- pgste_set_unlock(pte, pgste);
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap_unlock(start_pte, ptl);
-
- return addr;
-}
-
-static inline unsigned long page_table_reset_pmd(struct mm_struct *mm, pud_t *pud,
- unsigned long addr, unsigned long end, bool init_skey)
-{
- unsigned long next;
- pmd_t *pmd;
-
- pmd = pmd_offset(pud, addr);
- do {
- next = pmd_addr_end(addr, end);
- if (pmd_none_or_clear_bad(pmd))
- continue;
- next = page_table_reset_pte(mm, pmd, addr, next, init_skey);
- } while (pmd++, addr = next, addr != end);
-
- return addr;
-}
-
-static inline unsigned long page_table_reset_pud(struct mm_struct *mm, pgd_t *pgd,
- unsigned long addr, unsigned long end, bool init_skey)
-{
- unsigned long next;
- pud_t *pud;
-
- pud = pud_offset(pgd, addr);
- do {
- next = pud_addr_end(addr, end);
- if (pud_none_or_clear_bad(pud))
- continue;
- next = page_table_reset_pmd(mm, pud, addr, next, init_skey);
- } while (pud++, addr = next, addr != end);
-
- return addr;
-}
-
-void page_table_reset_pgste(struct mm_struct *mm, unsigned long start,
- unsigned long end, bool init_skey)
-{
- unsigned long addr, next;
- pgd_t *pgd;
-
- down_write(&mm->mmap_sem);
- if (init_skey && mm_use_skey(mm))
- goto out_up;
- addr = start;
- pgd = pgd_offset(mm, addr);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_none_or_clear_bad(pgd))
- continue;
- next = page_table_reset_pud(mm, pgd, addr, next, init_skey);
- } while (pgd++, addr = next, addr != end);
- if (init_skey)
- current->mm->context.use_skey = 1;
-out_up:
- up_write(&mm->mmap_sem);
-}
-EXPORT_SYMBOL(page_table_reset_pgste);
-
int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
unsigned long key, bool nq)
{
@@ -1044,11 +951,6 @@ static inline unsigned long *page_table_alloc_pgste(struct mm_struct *mm,
return NULL;
}

-void page_table_reset_pgste(struct mm_struct *mm, unsigned long start,
- unsigned long end, bool init_skey)
-{
-}
-
static inline void page_table_free_pgste(unsigned long *table)
{
}
@@ -1400,13 +1302,66 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
* Enable storage key handling from now on and initialize the storage
* keys with the default key.
*/
+static int __s390_enable_skey(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ unsigned long ptev;
+ pgste_t pgste;
+
+ pgste = pgste_get_lock(pte);
+ /* Clear storage key */
+ pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
+ PGSTE_GR_BIT | PGSTE_GC_BIT);
+ ptev = pte_val(*pte);
+ if (!(ptev & _PAGE_INVALID) && (ptev & _PAGE_WRITE))
+ page_set_storage_key(ptev & PAGE_MASK, PAGE_DEFAULT_KEY, 1);
+ pgste_set_unlock(pte, pgste);
+ return 0;
+}
+
void s390_enable_skey(void)
{
- page_table_reset_pgste(current->mm, 0, TASK_SIZE, true);
+ struct mm_walk walk = { .pte_entry = __s390_enable_skey };
+ struct mm_struct *mm = current->mm;
+
+ down_write(&mm->mmap_sem);
+ if (mm_use_skey(mm))
+ goto out_up;
+ walk.mm = mm;
+ walk_page_range(0, TASK_SIZE, &walk);
+ mm->context.use_skey = 1;
+
+out_up:
+ up_write(&mm->mmap_sem);
}
EXPORT_SYMBOL_GPL(s390_enable_skey);

/*
+ * Reset CMMA state, make all pages stable again.
+ */
+static int __s390_reset_cmma(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pgste_t pgste;
+
+ pgste = pgste_get_lock(pte);
+ pgste_val(pgste) &= ~_PGSTE_GPS_USAGE_MASK;
+ pgste_set_unlock(pte, pgste);
+ return 0;
+}
+
+void s390_reset_cmma(struct mm_struct *mm)
+{
+ struct mm_walk walk = { .pte_entry = __s390_reset_cmma };
+
+ down_write(&mm->mmap_sem);
+ walk.mm = mm;
+ walk_page_range(0, TASK_SIZE, &walk);
+ up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(s390_reset_cmma);
+
+/*
* Test and reset if a guest page is dirty
*/
bool gmap_test_and_clear_dirty(unsigned long address, struct gmap *gmap)
--
1.8.5.5
Dominik Dingel
2014-10-17 14:09:50 UTC
Permalink
When storage keys are enabled unmerge already merged pages and prevent
new pages from being merged.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <***@de.ibm.com>
Signed-off-by: Martin Schwidefsky <***@de.ibm.com>
---
arch/s390/include/asm/pgtable.h | 2 +-
arch/s390/kvm/priv.c | 17 ++++++++++++-----
arch/s390/mm/pgtable.c | 15 +++++++++++++--
3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1e991f6a..a5362e4 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1749,7 +1749,7 @@ static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
extern int vmem_add_mapping(unsigned long start, unsigned long size);
extern int vmem_remove_mapping(unsigned long start, unsigned long size);
extern int s390_enable_sie(void);
-extern void s390_enable_skey(void);
+extern int s390_enable_skey(void);
extern void s390_reset_cmma(struct mm_struct *mm);

/*
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index f89c1cd..e0967fd 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -156,21 +156,25 @@ static int handle_store_cpu_address(struct kvm_vcpu *vcpu)
return 0;
}

-static void __skey_check_enable(struct kvm_vcpu *vcpu)
+static int __skey_check_enable(struct kvm_vcpu *vcpu)
{
+ int rc = 0;
if (!(vcpu->arch.sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)))
- return;
+ return rc;

- s390_enable_skey();
+ rc = s390_enable_skey();
trace_kvm_s390_skey_related_inst(vcpu);
vcpu->arch.sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE | ICTL_RRBE);
+ return rc;
}


static int handle_skey(struct kvm_vcpu *vcpu)
{
- __skey_check_enable(vcpu);
+ int rc = __skey_check_enable(vcpu);

+ if (rc)
+ return rc;
vcpu->stat.instruction_storage_key++;

if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
@@ -692,7 +696,10 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
}

if (vcpu->run->s.regs.gprs[reg1] & PFMF_SK) {
- __skey_check_enable(vcpu);
+ int rc = __skey_check_enable(vcpu);
+
+ if (rc)
+ return rc;
if (set_guest_storage_key(current->mm, useraddr,
vcpu->run->s.regs.gprs[reg1] & PFMF_KEY,
vcpu->run->s.regs.gprs[reg1] & PFMF_NQ))
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 6321692..b3311c1 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -18,6 +18,8 @@
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/swapops.h>
+#include <linux/ksm.h>
+#include <linux/mman.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -1328,18 +1330,26 @@ static int __s390_enable_skey(pte_t *pte, unsigned long addr,
return 0;
}

-void s390_enable_skey(void)
+int s390_enable_skey(void)
{
struct mm_walk walk = { .pte_entry = __s390_enable_skey };
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
+ int rc = 0;

down_write(&mm->mmap_sem);
if (mm_use_skey(mm))
goto out_up;

- for (vma = mm->mmap; vma; vma = vma->vm_next)
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (ksm_madvise(vma, vma->vm_start, vma->vm_end,
+ MADV_UNMERGEABLE, &vma->vm_flags)) {
+ rc = -ENOMEM;
+ goto out_up;
+ }
vma->vm_flags |= VM_NOZEROPAGE;
+ }
+ mm->def_flags &= ~VM_MERGEABLE;
mm->def_flags |= VM_NOZEROPAGE;

walk.mm = mm;
@@ -1348,6 +1358,7 @@ void s390_enable_skey(void)

out_up:
up_write(&mm->mmap_sem);
+ return rc;
}
EXPORT_SYMBOL_GPL(s390_enable_skey);
--
1.8.5.5
Dominik Dingel
2014-10-22 08:30:23 UTC
Permalink
As soon as storage keys are enabled we need to stop working on zero page
mappings to prevent inconsistencies between storage keys and pgste.

Otherwise the following data corruption could happen:
1) guest enables storage key
2) guest sets storage key for not mapped page X
-> change goes to PGSTE
3) guest reads from page X
-> as X was not dirty before, the page will be zero page backed,
storage key from PGSTE for X will go to storage key for zero page
4) guest sets storage key for not mapped page Y (same logic as above)
5) guest reads from page Y
-> as Y was not dirty before, the page will be zero page backed,
storage key from PGSTE for Y will go to storage key for zero page
overwriting storage key for X

While holding the mmap sem, we are safe against changes on entries we
already fixed, as every fault would need to take the mmap_sem (read).
As sske and host large pages are also mutually exclusive, we do not even
need to retry the fixup_user_fault.

As use_skey is already the condition on which we call s390_enable_skey,
we need to introduce a new flag for the mm->context on which we decide
whether zero page mappings are allowed.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
---
arch/s390/include/asm/mmu.h | 2 ++
arch/s390/include/asm/pgtable.h | 14 ++++++++++++++
arch/s390/mm/pgtable.c | 12 ++++++++++++
3 files changed, 28 insertions(+)

diff --git a/arch/s390/include/asm/mmu.h b/arch/s390/include/asm/mmu.h
index a5e6562..0f38469 100644
--- a/arch/s390/include/asm/mmu.h
+++ b/arch/s390/include/asm/mmu.h
@@ -18,6 +18,8 @@ typedef struct {
unsigned int has_pgste:1;
/* The mmu context uses storage keys. */
unsigned int use_skey:1;
+ /* The mmu context forbids zeropage mappings. */
+ unsigned int forbids_zeropage:1;
} mm_context_t;

#define INIT_MM_CONTEXT(name) \
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1e991f6a..fe3cfdf 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -481,6 +481,20 @@ static inline int mm_has_pgste(struct mm_struct *mm)
return 0;
}

+/*
+ * In the case that a guest uses storage keys
+ * faults should no longer be backed by zero pages
+ */
+#define mm_forbids_zeropage mm_forbids_zeropage
+static inline int mm_forbids_zeropage(struct mm_struct *mm)
+{
+#ifdef CONFIG_PGSTE
+ if (mm->context.forbids_zeropage)
+ return 1;
+#endif
+ return 0;
+}
+
static inline int mm_use_skey(struct mm_struct *mm)
{
#ifdef CONFIG_PGSTE
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index ab55ba8..1e06fbc 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1309,6 +1309,15 @@ static int __s390_enable_skey(pte_t *pte, unsigned long addr,
pgste_t pgste;

pgste = pgste_get_lock(pte);
+ /*
+ * Remove all zero page mappings; after establishing a policy to forbid
+ * zero page mappings, subsequent faults for those pages will get
+ * fresh anonymous pages.
+ */
+ if (is_zero_pfn(pte_pfn(*pte))) {
+ ptep_flush_direct(walk->mm, addr, pte);
+ pte_val(*pte) = _PAGE_INVALID;
+ }
/* Clear storage key */
pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
PGSTE_GR_BIT | PGSTE_GC_BIT);
@@ -1327,6 +1336,9 @@ void s390_enable_skey(void)
down_write(&mm->mmap_sem);
if (mm_use_skey(mm))
goto out_up;
+
+ mm->context.forbids_zeropage = 1;
+
walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);
mm->context.use_skey = 1;
--
1.8.5.5

Paolo Bonzini
2014-10-22 10:09:07 UTC
Permalink
Post by Dominik Dingel
As use_skey is already the condition on which we call s390_enable_skey
we need to introduce a new flag for the mm->context on which we decide
if zero page mapping is allowed.
Can you explain better why "mm->context.use_skey = 1" cannot be done
before the walk_page_range? Where does the walk or __s390_enable_skey
or (after the next patch) ksm_madvise rely on
"mm->context.forbids_zeropage && !mm->context.use_skey"?

The only reason I can think of, is that the next patch does not reset
"mm->context.forbids_zeropage" to 0 if the ksm_madvise fails. Why
doesn't it do that---or is it a bug?

Thanks, and sorry for the flurry of questions! :)

Paolo

Dominik Dingel
2014-10-22 10:32:08 UTC
Permalink
On Wed, 22 Oct 2014 12:09:31 +0200
Post by Paolo Bonzini
Post by Dominik Dingel
As use_skey is already the condition on which we call s390_enable_skey
we need to introduce a new flag for the mm->context on which we decide
if zero page mapping is allowed.
Can you explain better why "mm->context.use_skey = 1" cannot be done
before the walk_page_range? Where does the walk or __s390_enable_skey
or (after the next patch) ksm_madvise rely on
"mm->context.forbids_zeropage && !mm->context.use_skey"?
I can't, my reasoning there is wrong.
I remembered incorrectly that we use mm_use_skey in arch/s390/kvm/priv.c to
check if we need to call s390_enable_skey, but that check actually happens
via the interception bits.

So every vCPU which gets an interception for a storage key instruction
will call s390_enable_skey and wait there for the mmap_sem.
Post by Paolo Bonzini
The only reason I can think of, is that the next patch does not reset
"mm->context.forbids_zeropage" to 0 if the ksm_madvise fails. Why
doesn't it do that---or is it a bug?
You are right, this is a bug, where we will drop to userspace with -ENOMEM.

I will fix this as well.
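
One way that fix could look, sketched against the s390_enable_skey() version with
the ksm_madvise loop (just a sketch, not necessarily the final patch):

int s390_enable_skey(void)
{
	struct mm_walk walk = { .pte_entry = __s390_enable_skey };
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma;
	int rc = 0;

	down_write(&mm->mmap_sem);
	if (mm_use_skey(mm))
		goto out_up;

	mm->context.forbids_zeropage = 1;
	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		if (ksm_madvise(vma, vma->vm_start, vma->vm_end,
				MADV_UNMERGEABLE, &vma->vm_flags)) {
			/* undo the policy change before reporting the error */
			mm->context.forbids_zeropage = 0;
			rc = -ENOMEM;
			goto out_up;
		}
	}
	mm->def_flags &= ~VM_MERGEABLE;

	walk.mm = mm;
	walk_page_range(0, TASK_SIZE, &walk);
	mm->context.use_skey = 1;

out_up:
	up_write(&mm->mmap_sem);
	return rc;
}
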
Post by Paolo Bonzini
Thanks, and sorry for the flurry of questions! :)
I really appreciate your questions and remarks. Thank you!
Post by Paolo Bonzini
Paolo
Dominik Dingel
2014-10-22 08:30:21 UTC
Permalink
Replace the s390 specific page table walker for the pgste updates
with a call to the common code walk_page_range function.
There are now two pte modification functions, one for the reset
of the CMMA state and another one for the initialization of the
storage keys.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
---
arch/s390/include/asm/pgalloc.h | 2 -
arch/s390/include/asm/pgtable.h | 1 +
arch/s390/kvm/kvm-s390.c | 2 +-
arch/s390/mm/pgtable.c | 153 ++++++++++++++--------------------------
4 files changed, 56 insertions(+), 102 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 9e18a61..120e126 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -22,8 +22,6 @@ unsigned long *page_table_alloc(struct mm_struct *, unsigned long);
void page_table_free(struct mm_struct *, unsigned long *);
void page_table_free_rcu(struct mmu_gather *, unsigned long *);

-void page_table_reset_pgste(struct mm_struct *, unsigned long, unsigned long,
- bool init_skey);
int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
unsigned long key, bool nq);

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 5efb2fe..1e991f6a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1750,6 +1750,7 @@ extern int vmem_add_mapping(unsigned long start, unsigned long size);
extern int vmem_remove_mapping(unsigned long start, unsigned long size);
extern int s390_enable_sie(void);
extern void s390_enable_skey(void);
+extern void s390_reset_cmma(struct mm_struct *mm);

/*
* No page table caches to initialise
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..7a33c11 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -281,7 +281,7 @@ static int kvm_s390_mem_control(struct kvm *kvm, struct kvm_device_attr *attr)
case KVM_S390_VM_MEM_CLR_CMMA:
mutex_lock(&kvm->lock);
idx = srcu_read_lock(&kvm->srcu);
- page_table_reset_pgste(kvm->arch.gmap->mm, 0, TASK_SIZE, false);
+ s390_reset_cmma(kvm->arch.gmap->mm);
srcu_read_unlock(&kvm->srcu, idx);
mutex_unlock(&kvm->lock);
ret = 0;
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 5404a62..ab55ba8 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -885,99 +885,6 @@ static inline void page_table_free_pgste(unsigned long *table)
__free_page(page);
}

-static inline unsigned long page_table_reset_pte(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, unsigned long end, bool init_skey)
-{
- pte_t *start_pte, *pte;
- spinlock_t *ptl;
- pgste_t pgste;
-
- start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- pte = start_pte;
- do {
- pgste = pgste_get_lock(pte);
- pgste_val(pgste) &= ~_PGSTE_GPS_USAGE_MASK;
- if (init_skey) {
- unsigned long address;
-
- pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
- PGSTE_GR_BIT | PGSTE_GC_BIT);
-
- /* skip invalid and not writable pages */
- if (pte_val(*pte) & _PAGE_INVALID ||
- !(pte_val(*pte) & _PAGE_WRITE)) {
- pgste_set_unlock(pte, pgste);
- continue;
- }
-
- address = pte_val(*pte) & PAGE_MASK;
- page_set_storage_key(address, PAGE_DEFAULT_KEY, 1);
- }
- pgste_set_unlock(pte, pgste);
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap_unlock(start_pte, ptl);
-
- return addr;
-}
-
-static inline unsigned long page_table_reset_pmd(struct mm_struct *mm, pud_t *pud,
- unsigned long addr, unsigned long end, bool init_skey)
-{
- unsigned long next;
- pmd_t *pmd;
-
- pmd = pmd_offset(pud, addr);
- do {
- next = pmd_addr_end(addr, end);
- if (pmd_none_or_clear_bad(pmd))
- continue;
- next = page_table_reset_pte(mm, pmd, addr, next, init_skey);
- } while (pmd++, addr = next, addr != end);
-
- return addr;
-}
-
-static inline unsigned long page_table_reset_pud(struct mm_struct *mm, pgd_t *pgd,
- unsigned long addr, unsigned long end, bool init_skey)
-{
- unsigned long next;
- pud_t *pud;
-
- pud = pud_offset(pgd, addr);
- do {
- next = pud_addr_end(addr, end);
- if (pud_none_or_clear_bad(pud))
- continue;
- next = page_table_reset_pmd(mm, pud, addr, next, init_skey);
- } while (pud++, addr = next, addr != end);
-
- return addr;
-}
-
-void page_table_reset_pgste(struct mm_struct *mm, unsigned long start,
- unsigned long end, bool init_skey)
-{
- unsigned long addr, next;
- pgd_t *pgd;
-
- down_write(&mm->mmap_sem);
- if (init_skey && mm_use_skey(mm))
- goto out_up;
- addr = start;
- pgd = pgd_offset(mm, addr);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_none_or_clear_bad(pgd))
- continue;
- next = page_table_reset_pud(mm, pgd, addr, next, init_skey);
- } while (pgd++, addr = next, addr != end);
- if (init_skey)
- current->mm->context.use_skey = 1;
-out_up:
- up_write(&mm->mmap_sem);
-}
-EXPORT_SYMBOL(page_table_reset_pgste);
-
int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
unsigned long key, bool nq)
{
@@ -1044,11 +951,6 @@ static inline unsigned long *page_table_alloc_pgste(struct mm_struct *mm,
return NULL;
}

-void page_table_reset_pgste(struct mm_struct *mm, unsigned long start,
- unsigned long end, bool init_skey)
-{
-}
-
static inline void page_table_free_pgste(unsigned long *table)
{
}
@@ -1400,13 +1302,66 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
* Enable storage key handling from now on and initialize the storage
* keys with the default key.
*/
+static int __s390_enable_skey(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ unsigned long ptev;
+ pgste_t pgste;
+
+ pgste = pgste_get_lock(pte);
+ /* Clear storage key */
+ pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
+ PGSTE_GR_BIT | PGSTE_GC_BIT);
+ ptev = pte_val(*pte);
+ if (!(ptev & _PAGE_INVALID) && (ptev & _PAGE_WRITE))
+ page_set_storage_key(ptev & PAGE_MASK, PAGE_DEFAULT_KEY, 1);
+ pgste_set_unlock(pte, pgste);
+ return 0;
+}
+
void s390_enable_skey(void)
{
- page_table_reset_pgste(current->mm, 0, TASK_SIZE, true);
+ struct mm_walk walk = { .pte_entry = __s390_enable_skey };
+ struct mm_struct *mm = current->mm;
+
+ down_write(&mm->mmap_sem);
+ if (mm_use_skey(mm))
+ goto out_up;
+ walk.mm = mm;
+ walk_page_range(0, TASK_SIZE, &walk);
+ mm->context.use_skey = 1;
+
+out_up:
+ up_write(&mm->mmap_sem);
}
EXPORT_SYMBOL_GPL(s390_enable_skey);

/*
+ * Reset CMMA state, make all pages stable again.
+ */
+static int __s390_reset_cmma(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pgste_t pgste;
+
+ pgste = pgste_get_lock(pte);
+ pgste_val(pgste) &= ~_PGSTE_GPS_USAGE_MASK;
+ pgste_set_unlock(pte, pgste);
+ return 0;
+}
+
+void s390_reset_cmma(struct mm_struct *mm)
+{
+ struct mm_walk walk = { .pte_entry = __s390_reset_cmma };
+
+ down_write(&mm->mmap_sem);
+ walk.mm = mm;
+ walk_page_range(0, TASK_SIZE, &walk);
+ up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(s390_reset_cmma);
+
+/*
* Test and reset if a guest page is dirty
*/
bool gmap_test_and_clear_dirty(unsigned long address, struct gmap *gmap)
--
1.8.5.5

Dominik Dingel
2014-10-22 08:30:22 UTC
Permalink
Add a new function stub to allow architectures to disable, for
an mm_struct, the backing of non-present, anonymous pages with
the read-only empty zero page.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
---
include/linux/mm.h | 4 ++++
mm/huge_memory.c | 2 +-
mm/memory.c | 2 +-
3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cd33ae2..0a2022e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -56,6 +56,10 @@ extern int sysctl_legacy_va_layout;
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif

+#ifndef mm_forbids_zeropage
+#define mm_forbids_zeropage(X) (0)
+#endif
+
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de98415..357a381 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -805,7 +805,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) &&
+ if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm) &&
transparent_hugepage_use_zero_page()) {
spinlock_t *ptl;
pgtable_t pgtable;
diff --git a/mm/memory.c b/mm/memory.c
index 64f82aa..f275a9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2640,7 +2640,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_SIGBUS;

/* Use the zero-page for reads */
- if (!(flags & FAULT_FLAG_WRITE)) {
+ if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
vma->vm_page_prot));
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
--
1.8.5.5

Dominik Dingel
2014-10-22 08:30:24 UTC
Permalink
When storage keys are enabled unmerge already merged pages and prevent
new pages from being merged.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <***@de.ibm.com>
---
arch/s390/include/asm/pgtable.h | 2 +-
arch/s390/kvm/priv.c | 17 ++++++++++++-----
arch/s390/mm/pgtable.c | 15 ++++++++++++++-
3 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index fe3cfdf..20f3186 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1763,7 +1763,7 @@ static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
extern int vmem_add_mapping(unsigned long start, unsigned long size);
extern int vmem_remove_mapping(unsigned long start, unsigned long size);
extern int s390_enable_sie(void);
-extern void s390_enable_skey(void);
+extern int s390_enable_skey(void);
extern void s390_reset_cmma(struct mm_struct *mm);

/*
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index f89c1cd..e0967fd 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -156,21 +156,25 @@ static int handle_store_cpu_address(struct kvm_vcpu *vcpu)
return 0;
}

-static void __skey_check_enable(struct kvm_vcpu *vcpu)
+static int __skey_check_enable(struct kvm_vcpu *vcpu)
{
+ int rc = 0;
if (!(vcpu->arch.sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)))
- return;
+ return rc;

- s390_enable_skey();
+ rc = s390_enable_skey();
trace_kvm_s390_skey_related_inst(vcpu);
vcpu->arch.sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE | ICTL_RRBE);
+ return rc;
}


static int handle_skey(struct kvm_vcpu *vcpu)
{
- __skey_check_enable(vcpu);
+ int rc = __skey_check_enable(vcpu);

+ if (rc)
+ return rc;
vcpu->stat.instruction_storage_key++;

if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
@@ -692,7 +696,10 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
}

if (vcpu->run->s.regs.gprs[reg1] & PFMF_SK) {
- __skey_check_enable(vcpu);
+ int rc = __skey_check_enable(vcpu);
+
+ if (rc)
+ return rc;
if (set_guest_storage_key(current->mm, useraddr,
vcpu->run->s.regs.gprs[reg1] & PFMF_KEY,
vcpu->run->s.regs.gprs[reg1] & PFMF_NQ))
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 1e06fbc..798ab49 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -18,6 +18,8 @@
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/swapops.h>
+#include <linux/ksm.h>
+#include <linux/mman.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -1328,16 +1330,26 @@ static int __s390_enable_skey(pte_t *pte, unsigned long addr,
return 0;
}

-void s390_enable_skey(void)
+int s390_enable_skey(void)
{
struct mm_walk walk = { .pte_entry = __s390_enable_skey };
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ int rc = 0;

down_write(&mm->mmap_sem);
if (mm_use_skey(mm))
goto out_up;

mm->context.forbids_zeropage = 1;
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (ksm_madvise(vma, vma->vm_start, vma->vm_end,
+ MADV_UNMERGEABLE, &vma->vm_flags)) {
+ rc = -ENOMEM;
+ goto out_up;
+ }
+ }
+ mm->def_flags &= ~VM_MERGEABLE;

walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);
@@ -1345,6 +1357,7 @@ void s390_enable_skey(void)

out_up:
up_write(&mm->mmap_sem);
+ return rc;
}
EXPORT_SYMBOL_GPL(s390_enable_skey);
--
1.8.5.5

Dominik Dingel
2014-10-22 11:09:29 UTC
Permalink
As soon as storage keys are enabled we need to stop working on zero page
mappings to prevent inconsistencies between storage keys and pgste.

Otherwise the following data corruption could happen:
1) guest enables storage key
2) guest sets storage key for not mapped page X
-> change goes to PGSTE
3) guest reads from page X
-> as X was not dirty before, the page will be zero page backed,
storage key from PGSTE for X will go to storage key for zero page
4) guest sets storage key for not mapped page Y (same logic as above)
5) guest reads from page Y
-> as Y was not dirty before, the page will be zero page backed,
storage key from PGSTE for Y will go to storage key for zero page
overwriting storage key for X

While holding the mmap sem, we are safe against changes on entries we
already fixed, as every fault would need to take the mmap_sem (read).

Other vCPUs executing storage key instructions will get a one-time interception
and will also be serialized via the mmap_sem.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
---
arch/s390/include/asm/pgtable.h | 5 +++++
arch/s390/mm/pgtable.c | 13 ++++++++++++-
2 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1e991f6a..0da98d6 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -481,6 +481,11 @@ static inline int mm_has_pgste(struct mm_struct *mm)
return 0;
}

+/*
+ * In the case that a guest uses storage keys
+ * faults should no longer be backed by zero pages
+ */
+#define mm_forbids_zeropage mm_use_skey
static inline int mm_use_skey(struct mm_struct *mm)
{
#ifdef CONFIG_PGSTE
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index ab55ba8..58d7eb2 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1309,6 +1309,15 @@ static int __s390_enable_skey(pte_t *pte, unsigned long addr,
pgste_t pgste;

pgste = pgste_get_lock(pte);
+ /*
+ * Remove all zero page mappings; after establishing a policy to forbid
+ * zero page mappings, subsequent faults for those pages will get
+ * fresh anonymous pages.
+ */
+ if (is_zero_pfn(pte_pfn(*pte))) {
+ ptep_flush_direct(walk->mm, addr, pte);
+ pte_val(*pte) = _PAGE_INVALID;
+ }
/* Clear storage key */
pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
PGSTE_GR_BIT | PGSTE_GC_BIT);
@@ -1327,9 +1336,11 @@ void s390_enable_skey(void)
down_write(&mm->mmap_sem);
if (mm_use_skey(mm))
goto out_up;
+
+ mm->context.use_skey = 1;
+
walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);
- mm->context.use_skey = 1;

out_up:
up_write(&mm->mmap_sem);
--
1.8.5.5

Paolo Bonzini
2014-10-22 14:00:02 UTC
Permalink
Post by Dominik Dingel
As soon as storage keys are enabled we need to stop working on zero page
mappings to prevent inconsistencies between storage keys and pgste.
1) guest enables storage key
2) guest sets storage key for not mapped page X
-> change goes to PGSTE
3) guest reads from page X
-> as X was not dirty before, the page will be zero page backed,
storage key from PGSTE for X will go to storage key for zero page
4) guest sets storage key for not mapped page Y (same logic as above
5) guest reads from page Y
-> as Y was not dirty before, the page will be zero page backed,
storage key from PGSTE for Y will got to storage key for zero page
overwriting storage key for X
While holding the mmap sem, we are safe against changes on entries we
already fixed, as every fault would need to take the mmap_sem (read).
Other vCPUs executing storage key instructions will get a one time interception
and be serialized also with mmap_sem.
---
arch/s390/include/asm/pgtable.h | 5 +++++
arch/s390/mm/pgtable.c | 13 ++++++++++++-
2 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1e991f6a..0da98d6 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -481,6 +481,11 @@ static inline int mm_has_pgste(struct mm_struct *mm)
return 0;
}
+/*
+ * In the case that a guest uses storage keys
+ * faults should no longer be backed by zero pages
+ */
+#define mm_forbids_zeropage mm_use_skey
static inline int mm_use_skey(struct mm_struct *mm)
{
#ifdef CONFIG_PGSTE
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index ab55ba8..58d7eb2 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -1309,6 +1309,15 @@ static int __s390_enable_skey(pte_t *pte, unsigned long addr,
pgste_t pgste;
pgste = pgste_get_lock(pte);
+ /*
+ * Remove all zero page mappings,
+ * after establishing a policy to forbid zero page mappings
+ * following faults for that page will get fresh anonymous pages
+ */
+ if (is_zero_pfn(pte_pfn(*pte))) {
+ ptep_flush_direct(walk->mm, addr, pte);
+ pte_val(*pte) = _PAGE_INVALID;
+ }
/* Clear storage key */
pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
PGSTE_GR_BIT | PGSTE_GC_BIT);
@@ -1327,9 +1336,11 @@ void s390_enable_skey(void)
down_write(&mm->mmap_sem);
if (mm_use_skey(mm))
goto out_up;
+
+ mm->context.use_skey = 1;
+
walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);
- mm->context.use_skey = 1;
up_write(&mm->mmap_sem);
Dominik Dingel
2014-10-22 11:09:27 UTC
Permalink
Replace the s390 specific page table walker for the pgste updates
with a call to the common code walk_page_range function.
There are now two pte modification functions, one for the reset
of the CMMA state and another one for the initialization of the
storage keys.

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
---
arch/s390/include/asm/pgalloc.h | 2 -
arch/s390/include/asm/pgtable.h | 1 +
arch/s390/kvm/kvm-s390.c | 2 +-
arch/s390/mm/pgtable.c | 153 ++++++++++++++--------------------------
4 files changed, 56 insertions(+), 102 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 9e18a61..120e126 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -22,8 +22,6 @@ unsigned long *page_table_alloc(struct mm_struct *, unsigned long);
void page_table_free(struct mm_struct *, unsigned long *);
void page_table_free_rcu(struct mmu_gather *, unsigned long *);

-void page_table_reset_pgste(struct mm_struct *, unsigned long, unsigned long,
- bool init_skey);
int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
unsigned long key, bool nq);

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 5efb2fe..1e991f6a 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1750,6 +1750,7 @@ extern int vmem_add_mapping(unsigned long start, unsigned long size);
extern int vmem_remove_mapping(unsigned long start, unsigned long size);
extern int s390_enable_sie(void);
extern void s390_enable_skey(void);
+extern void s390_reset_cmma(struct mm_struct *mm);

/*
* No page table caches to initialise
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 81b0e11..7a33c11 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -281,7 +281,7 @@ static int kvm_s390_mem_control(struct kvm *kvm, struct kvm_device_attr *attr)
case KVM_S390_VM_MEM_CLR_CMMA:
mutex_lock(&kvm->lock);
idx = srcu_read_lock(&kvm->srcu);
- page_table_reset_pgste(kvm->arch.gmap->mm, 0, TASK_SIZE, false);
+ s390_reset_cmma(kvm->arch.gmap->mm);
srcu_read_unlock(&kvm->srcu, idx);
mutex_unlock(&kvm->lock);
ret = 0;
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 5404a62..ab55ba8 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -885,99 +885,6 @@ static inline void page_table_free_pgste(unsigned long *table)
__free_page(page);
}

-static inline unsigned long page_table_reset_pte(struct mm_struct *mm, pmd_t *pmd,
- unsigned long addr, unsigned long end, bool init_skey)
-{
- pte_t *start_pte, *pte;
- spinlock_t *ptl;
- pgste_t pgste;
-
- start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
- pte = start_pte;
- do {
- pgste = pgste_get_lock(pte);
- pgste_val(pgste) &= ~_PGSTE_GPS_USAGE_MASK;
- if (init_skey) {
- unsigned long address;
-
- pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
- PGSTE_GR_BIT | PGSTE_GC_BIT);
-
- /* skip invalid and not writable pages */
- if (pte_val(*pte) & _PAGE_INVALID ||
- !(pte_val(*pte) & _PAGE_WRITE)) {
- pgste_set_unlock(pte, pgste);
- continue;
- }
-
- address = pte_val(*pte) & PAGE_MASK;
- page_set_storage_key(address, PAGE_DEFAULT_KEY, 1);
- }
- pgste_set_unlock(pte, pgste);
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap_unlock(start_pte, ptl);
-
- return addr;
-}
-
-static inline unsigned long page_table_reset_pmd(struct mm_struct *mm, pud_t *pud,
- unsigned long addr, unsigned long end, bool init_skey)
-{
- unsigned long next;
- pmd_t *pmd;
-
- pmd = pmd_offset(pud, addr);
- do {
- next = pmd_addr_end(addr, end);
- if (pmd_none_or_clear_bad(pmd))
- continue;
- next = page_table_reset_pte(mm, pmd, addr, next, init_skey);
- } while (pmd++, addr = next, addr != end);
-
- return addr;
-}
-
-static inline unsigned long page_table_reset_pud(struct mm_struct *mm, pgd_t *pgd,
- unsigned long addr, unsigned long end, bool init_skey)
-{
- unsigned long next;
- pud_t *pud;
-
- pud = pud_offset(pgd, addr);
- do {
- next = pud_addr_end(addr, end);
- if (pud_none_or_clear_bad(pud))
- continue;
- next = page_table_reset_pmd(mm, pud, addr, next, init_skey);
- } while (pud++, addr = next, addr != end);
-
- return addr;
-}
-
-void page_table_reset_pgste(struct mm_struct *mm, unsigned long start,
- unsigned long end, bool init_skey)
-{
- unsigned long addr, next;
- pgd_t *pgd;
-
- down_write(&mm->mmap_sem);
- if (init_skey && mm_use_skey(mm))
- goto out_up;
- addr = start;
- pgd = pgd_offset(mm, addr);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_none_or_clear_bad(pgd))
- continue;
- next = page_table_reset_pud(mm, pgd, addr, next, init_skey);
- } while (pgd++, addr = next, addr != end);
- if (init_skey)
- current->mm->context.use_skey = 1;
-out_up:
- up_write(&mm->mmap_sem);
-}
-EXPORT_SYMBOL(page_table_reset_pgste);
-
int set_guest_storage_key(struct mm_struct *mm, unsigned long addr,
unsigned long key, bool nq)
{
@@ -1044,11 +951,6 @@ static inline unsigned long *page_table_alloc_pgste(struct mm_struct *mm,
return NULL;
}

-void page_table_reset_pgste(struct mm_struct *mm, unsigned long start,
- unsigned long end, bool init_skey)
-{
-}
-
static inline void page_table_free_pgste(unsigned long *table)
{
}
@@ -1400,13 +1302,66 @@ EXPORT_SYMBOL_GPL(s390_enable_sie);
* Enable storage key handling from now on and initialize the storage
* keys with the default key.
*/
+static int __s390_enable_skey(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ unsigned long ptev;
+ pgste_t pgste;
+
+ pgste = pgste_get_lock(pte);
+ /* Clear storage key */
+ pgste_val(pgste) &= ~(PGSTE_ACC_BITS | PGSTE_FP_BIT |
+ PGSTE_GR_BIT | PGSTE_GC_BIT);
+ ptev = pte_val(*pte);
+ if (!(ptev & _PAGE_INVALID) && (ptev & _PAGE_WRITE))
+ page_set_storage_key(ptev & PAGE_MASK, PAGE_DEFAULT_KEY, 1);
+ pgste_set_unlock(pte, pgste);
+ return 0;
+}
+
void s390_enable_skey(void)
{
- page_table_reset_pgste(current->mm, 0, TASK_SIZE, true);
+ struct mm_walk walk = { .pte_entry = __s390_enable_skey };
+ struct mm_struct *mm = current->mm;
+
+ down_write(&mm->mmap_sem);
+ if (mm_use_skey(mm))
+ goto out_up;
+ walk.mm = mm;
+ walk_page_range(0, TASK_SIZE, &walk);
+ mm->context.use_skey = 1;
+
+out_up:
+ up_write(&mm->mmap_sem);
}
EXPORT_SYMBOL_GPL(s390_enable_skey);

/*
+ * Reset CMMA state, make all pages stable again.
+ */
+static int __s390_reset_cmma(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
+{
+ pgste_t pgste;
+
+ pgste = pgste_get_lock(pte);
+ pgste_val(pgste) &= ~_PGSTE_GPS_USAGE_MASK;
+ pgste_set_unlock(pte, pgste);
+ return 0;
+}
+
+void s390_reset_cmma(struct mm_struct *mm)
+{
+ struct mm_walk walk = { .pte_entry = __s390_reset_cmma };
+
+ down_write(&mm->mmap_sem);
+ walk.mm = mm;
+ walk_page_range(0, TASK_SIZE, &walk);
+ up_write(&mm->mmap_sem);
+}
+EXPORT_SYMBOL_GPL(s390_reset_cmma);
+
+/*
* Test and reset if a guest page is dirty
*/
bool gmap_test_and_clear_dirty(unsigned long address, struct gmap *gmap)
--
1.8.5.5

Dominik Dingel
2014-10-22 11:09:28 UTC
Permalink
Add a new function stub to allow architectures to disable, for
an mm_struct, the backing of non-present, anonymous pages with
read-only empty zero pages.
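
A minimal sketch of how an architecture might hook this stub from its own
<asm/pgtable.h>; the helper name mm_uses_storage_keys() is made up for
illustration, and the real s390 wiring (based on the per-mm use_skey flag
introduced earlier in this series) only lands in a later patch:

/* Hypothetical architecture override, placed in <asm/pgtable.h>:
 * tell common code that this mm must never be backed by the shared
 * zero page.  Assumes the mm_context_t carries a use_skey flag.
 */
static inline int mm_uses_storage_keys(struct mm_struct *mm)
{
	return mm->context.use_skey;
}

#define mm_forbids_zeropage(mm)	mm_uses_storage_keys(mm)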

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
---
include/linux/mm.h | 4 ++++
mm/huge_memory.c | 2 +-
mm/memory.c | 2 +-
3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index cd33ae2..0a2022e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -56,6 +56,10 @@ extern int sysctl_legacy_va_layout;
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif

+#ifndef mm_forbids_zeropage
+#define mm_forbids_zeropage(X) (0)
+#endif
+
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de98415..357a381 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -805,7 +805,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) &&
+ if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm) &&
transparent_hugepage_use_zero_page()) {
spinlock_t *ptl;
pgtable_t pgtable;
diff --git a/mm/memory.c b/mm/memory.c
index 64f82aa..f275a9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2640,7 +2640,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_SIGBUS;

/* Use the zero-page for reads */
- if (!(flags & FAULT_FLAG_WRITE)) {
+ if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
vma->vm_page_prot));
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
--
1.8.5.5

Paolo Bonzini
2014-10-22 13:59:53 UTC
Permalink
Post by Dominik Dingel
Add a new function stub to allow architectures to disable, for
an mm_struct, the backing of non-present, anonymous pages with
read-only empty zero pages.
---
include/linux/mm.h | 4 ++++
mm/huge_memory.c | 2 +-
mm/memory.c | 2 +-
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cd33ae2..0a2022e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -56,6 +56,10 @@ extern int sysctl_legacy_va_layout;
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif
+#ifndef mm_forbids_zeropage
+#define mm_forbids_zeropage(X) (0)
+#endif
+
extern unsigned long sysctl_user_reserve_kbytes;
extern unsigned long sysctl_admin_reserve_kbytes;
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index de98415..357a381 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -805,7 +805,7 @@ int do_huge_pmd_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
if (unlikely(khugepaged_enter(vma, vma->vm_flags)))
return VM_FAULT_OOM;
- if (!(flags & FAULT_FLAG_WRITE) &&
+ if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm) &&
transparent_hugepage_use_zero_page()) {
spinlock_t *ptl;
pgtable_t pgtable;
diff --git a/mm/memory.c b/mm/memory.c
index 64f82aa..f275a9d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2640,7 +2640,7 @@ static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_SIGBUS;
/* Use the zero-page for reads */
- if (!(flags & FAULT_FLAG_WRITE)) {
+ if (!(flags & FAULT_FLAG_WRITE) && !mm_forbids_zeropage(mm)) {
entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
vma->vm_page_prot));
page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
Andrew Morton
2014-10-22 19:22:23 UTC
Permalink
Post by Dominik Dingel
Add a new function stub to allow architectures to disable, for
an mm_struct, the backing of non-present, anonymous pages with
read-only empty zero pages.
...
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -56,6 +56,10 @@ extern int sysctl_legacy_va_layout;
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif
+#ifndef mm_forbids_zeropage
+#define mm_forbids_zeropage(X) (0)
+#endif
Can we document this please? What it does, why it does it. We should
also specify precisely which arch header file is responsible for
defining mm_forbids_zeropage.


Dominik Dingel
2014-10-22 19:45:52 UTC
Permalink
On Wed, 22 Oct 2014 12:22:23 -0700
Post by Andrew Morton
Post by Dominik Dingel
Add a new function stub to allow architectures to disable, for
an mm_struct, the backing of non-present, anonymous pages with
read-only empty zero pages.
...
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -56,6 +56,10 @@ extern int sysctl_legacy_va_layout;
#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
#endif
+#ifndef mm_forbids_zeropage
+#define mm_forbids_zeropage(X) (0)
+#endif
Can we document this please? What it does, why it does it. We should
also specify precisely which arch header file is responsible for
defining mm_forbids_zeropage.
I will add a comment like:

/*
* To prevent common memory management code establishing
* a zero page mapping on a read fault.
* This function should be implemented within <asm/pgtable.h>.
* s390 does this to prevent multiplexing of hardware bits
* related to the physical page in case of virtualization.
*/

Okay?
Andrew Morton
2014-10-22 19:49:30 UTC
Permalink
Post by Dominik Dingel
Post by Andrew Morton
Post by Dominik Dingel
+#ifndef mm_forbids_zeropage
+#define mm_forbids_zeropage(X) (0)
+#endif
Can we document this please? What it does, why it does it. We should
also specify precisely which arch header file is responsible for
defining mm_forbids_zeropage.
/*
* To prevent common memory management code establishing
* a zero page mapping on a read fault.
* This function should be implemented within <asm/pgtable.h>.
s/function should be implemented/macro should be defined/
Post by Dominik Dingel
* s390 does this to prevent multiplexing of hardware bits
* related to the physical page in case of virtualization.
*/
Okay?
Looks great, thanks.
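
Folding the suggested wording fix into the proposed comment, the documented
stub would read roughly as follows (editorial reconstruction of the discussed
end result, not a quote of the final commit):

/*
 * To prevent common memory management code establishing
 * a zero page mapping on a read fault.
 * This macro should be defined within <asm/pgtable.h>.
 * s390 does this to prevent multiplexing of hardware bits
 * related to the physical page in case of virtualization.
 */
#ifndef mm_forbids_zeropage
#define mm_forbids_zeropage(X)	(0)
#endif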

Dominik Dingel
2014-10-22 11:09:30 UTC
Permalink
When storage keys are enabled, unmerge already merged pages and prevent
new pages from being merged.
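
For context (editorial sketch, not part of the patch): the in-kernel
ksm_madvise() call used below does per VMA what userspace would normally
request with madvise(); a minimal userspace illustration, assuming a kernel
built with CONFIG_KSM:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>

/* Ask KSM to unmerge an address range and to stop merging it in the
 * future, mirroring what s390_enable_skey() now does for every VMA of
 * the guest mm.
 */
static int unmerge_region(void *addr, size_t len)
{
	if (madvise(addr, len, MADV_UNMERGEABLE) != 0) {
		perror("madvise(MADV_UNMERGEABLE)");
		return -1;
	}
	return 0;
}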

Signed-off-by: Dominik Dingel <***@linux.vnet.ibm.com>
Acked-by: Christian Borntraeger <***@de.ibm.com>
---
arch/s390/include/asm/pgtable.h | 2 +-
arch/s390/kvm/priv.c | 17 ++++++++++++-----
arch/s390/mm/pgtable.c | 16 +++++++++++++++-
3 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 0da98d6..dfb38af 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1754,7 +1754,7 @@ static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
extern int vmem_add_mapping(unsigned long start, unsigned long size);
extern int vmem_remove_mapping(unsigned long start, unsigned long size);
extern int s390_enable_sie(void);
-extern void s390_enable_skey(void);
+extern int s390_enable_skey(void);
extern void s390_reset_cmma(struct mm_struct *mm);

/*
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index f89c1cd..e0967fd 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -156,21 +156,25 @@ static int handle_store_cpu_address(struct kvm_vcpu *vcpu)
return 0;
}

-static void __skey_check_enable(struct kvm_vcpu *vcpu)
+static int __skey_check_enable(struct kvm_vcpu *vcpu)
{
+ int rc = 0;
if (!(vcpu->arch.sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)))
- return;
+ return rc;

- s390_enable_skey();
+ rc = s390_enable_skey();
trace_kvm_s390_skey_related_inst(vcpu);
vcpu->arch.sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE | ICTL_RRBE);
+ return rc;
}


static int handle_skey(struct kvm_vcpu *vcpu)
{
- __skey_check_enable(vcpu);
+ int rc = __skey_check_enable(vcpu);

+ if (rc)
+ return rc;
vcpu->stat.instruction_storage_key++;

if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
@@ -692,7 +696,10 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
}

if (vcpu->run->s.regs.gprs[reg1] & PFMF_SK) {
- __skey_check_enable(vcpu);
+ int rc = __skey_check_enable(vcpu);
+
+ if (rc)
+ return rc;
if (set_guest_storage_key(current->mm, useraddr,
vcpu->run->s.regs.gprs[reg1] & PFMF_KEY,
vcpu->run->s.regs.gprs[reg1] & PFMF_NQ))
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 58d7eb2..82aa528 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -18,6 +18,8 @@
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/swapops.h>
+#include <linux/ksm.h>
+#include <linux/mman.h>

#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -1328,22 +1330,34 @@ static int __s390_enable_skey(pte_t *pte, unsigned long addr,
return 0;
}

-void s390_enable_skey(void)
+int s390_enable_skey(void)
{
struct mm_walk walk = { .pte_entry = __s390_enable_skey };
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ int rc = 0;

down_write(&mm->mmap_sem);
if (mm_use_skey(mm))
goto out_up;

mm->context.use_skey = 1;
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (ksm_madvise(vma, vma->vm_start, vma->vm_end,
+ MADV_UNMERGEABLE, &vma->vm_flags)) {
+ mm->context.use_skey = 0;
+ rc = -ENOMEM;
+ goto out_up;
+ }
+ }
+ mm->def_flags &= ~VM_MERGEABLE;

walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);

out_up:
up_write(&mm->mmap_sem);
+ return rc;
}
EXPORT_SYMBOL_GPL(s390_enable_skey);
--
1.8.5.5

Paolo Bonzini
2014-10-22 14:00:20 UTC
Permalink
Reviewed-by: Paolo Bonzini <***@redhat.com>

(missing R-b on patch 1 is _not_ a mistake :))

Paolo
Post by Dominik Dingel
When storage keys are enabled, unmerge already merged pages and prevent
new pages from being merged.
---
arch/s390/include/asm/pgtable.h | 2 +-
arch/s390/kvm/priv.c | 17 ++++++++++++-----
arch/s390/mm/pgtable.c | 16 +++++++++++++++-
3 files changed, 28 insertions(+), 7 deletions(-)
diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 0da98d6..dfb38af 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1754,7 +1754,7 @@ static inline pte_t mk_swap_pte(unsigned long type, unsigned long offset)
extern int vmem_add_mapping(unsigned long start, unsigned long size);
extern int vmem_remove_mapping(unsigned long start, unsigned long size);
extern int s390_enable_sie(void);
-extern void s390_enable_skey(void);
+extern int s390_enable_skey(void);
extern void s390_reset_cmma(struct mm_struct *mm);
/*
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index f89c1cd..e0967fd 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -156,21 +156,25 @@ static int handle_store_cpu_address(struct kvm_vcpu *vcpu)
return 0;
}
-static void __skey_check_enable(struct kvm_vcpu *vcpu)
+static int __skey_check_enable(struct kvm_vcpu *vcpu)
{
+ int rc = 0;
if (!(vcpu->arch.sie_block->ictl & (ICTL_ISKE | ICTL_SSKE | ICTL_RRBE)))
- return;
+ return rc;
- s390_enable_skey();
+ rc = s390_enable_skey();
trace_kvm_s390_skey_related_inst(vcpu);
vcpu->arch.sie_block->ictl &= ~(ICTL_ISKE | ICTL_SSKE | ICTL_RRBE);
+ return rc;
}
static int handle_skey(struct kvm_vcpu *vcpu)
{
- __skey_check_enable(vcpu);
+ int rc = __skey_check_enable(vcpu);
+ if (rc)
+ return rc;
vcpu->stat.instruction_storage_key++;
if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
@@ -692,7 +696,10 @@ static int handle_pfmf(struct kvm_vcpu *vcpu)
}
if (vcpu->run->s.regs.gprs[reg1] & PFMF_SK) {
- __skey_check_enable(vcpu);
+ int rc = __skey_check_enable(vcpu);
+
+ if (rc)
+ return rc;
if (set_guest_storage_key(current->mm, useraddr,
vcpu->run->s.regs.gprs[reg1] & PFMF_KEY,
vcpu->run->s.regs.gprs[reg1] & PFMF_NQ))
diff --git a/arch/s390/mm/pgtable.c b/arch/s390/mm/pgtable.c
index 58d7eb2..82aa528 100644
--- a/arch/s390/mm/pgtable.c
+++ b/arch/s390/mm/pgtable.c
@@ -18,6 +18,8 @@
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/swapops.h>
+#include <linux/ksm.h>
+#include <linux/mman.h>
#include <asm/pgtable.h>
#include <asm/pgalloc.h>
@@ -1328,22 +1330,34 @@ static int __s390_enable_skey(pte_t *pte, unsigned long addr,
return 0;
}
-void s390_enable_skey(void)
+int s390_enable_skey(void)
{
struct mm_walk walk = { .pte_entry = __s390_enable_skey };
struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ int rc = 0;
down_write(&mm->mmap_sem);
if (mm_use_skey(mm))
goto out_up;
mm->context.use_skey = 1;
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ if (ksm_madvise(vma, vma->vm_start, vma->vm_end,
+ MADV_UNMERGEABLE, &vma->vm_flags)) {
+ mm->context.use_skey = 0;
+ rc = -ENOMEM;
+ goto out_up;
+ }
+ }
+ mm->def_flags &= ~VM_MERGEABLE;
walk.mm = mm;
walk_page_range(0, TASK_SIZE, &walk);
up_write(&mm->mmap_sem);
+ return rc;
}
EXPORT_SYMBOL_GPL(s390_enable_skey);
Paolo Bonzini
2014-10-22 13:59:40 UTC
Permalink
Post by Dominik Dingel
s390 has the special notion of storage keys which are some sort of page flags
associated with physical pages and live outside of direct addressable memory.
These storage keys can be queried and changed with a special set of instructions.
- an invalid pte, then the instructions will work on memory in the host page table
- a valid pte, then the instructions will work with the real storage key
Thanks to Martin's software reference and dirty bit tracking,
the kernel does not issue any storage key instructions, as a
software based approach is now taken; on the other hand, distributions
in the wild are currently using them.
However, for virtualized guests we still have a problem with guest pages
mapped to zero pages and the kernel same page merging.
With each one multiple guest pages will point to the same physical page
and share the same storage key.
Let's fix this by introducing a new function which s390 will define to
forbid new zero page mappings. If the guest issues a storage key related
instruction we flag the mm_struct, drop existing zero page mappings
and unmerge the guest memory.
- Clearing up patch description Patch 3/4
- removing unnecessary flag in mmu_context (Paolo)
... and zero the mm_use_skey flag correctly, too. :)
Post by Dominik Dingel
- Following Dave and Paolo suggestion removing the vma flag
Thanks, the patches look good. I expect that they will either go in
through the s390 tree, or come in via Christian.

If the latter, Martin, please reply with your Acked-by.

Paolo

Martin Schwidefsky
2014-10-23 10:19:48 UTC
Permalink
On Wed, 22 Oct 2014 13:09:26 +0200
Post by Dominik Dingel
s390 has the special notion of storage keys which are some sort of page flags
associated with physical pages and live outside of direct addressable memory.
These storage keys can be queried and changed with a special set of instructions.
- an invalid pte, then the instructions will work on memory in the host page table
- a valid pte, then the instructions will work with the real storage key
Thanks to Martin's software reference and dirty bit tracking,
the kernel does not issue any storage key instructions, as a
software based approach is now taken; on the other hand, distributions
in the wild are currently using them.
However, for virtualized guests we still have a problem with guest pages
mapped to zero pages and the kernel same page merging.
With each one multiple guest pages will point to the same physical page
and share the same storage key.
Let's fix this by introducing a new function which s390 will define to
forbid new zero page mappings. If the guest issues a storage key related
instruction we flag the mm_struct, drop existing zero page mappings
and unmerge the guest memory.
- Clearing up patch description Patch 3/4
- removing unnecessary flag in mmu_context (Paolo)
- Following Dave and Paolo suggestion removing the vma flag
s390/mm: recfactor global pgste updates
mm: introduce mm_forbids_zeropage function
s390/mm: prevent and break zero page mappings in case of storage keys
s390/mm: disable KSM for storage key enabled pages
arch/s390/include/asm/pgalloc.h | 2 -
arch/s390/include/asm/pgtable.h | 8 +-
arch/s390/kvm/kvm-s390.c | 2 +-
arch/s390/kvm/priv.c | 17 ++--
arch/s390/mm/pgtable.c | 180 ++++++++++++++++++----------------------
include/linux/mm.h | 4 +
mm/huge_memory.c | 2 +-
mm/memory.c | 2 +-
8 files changed, 106 insertions(+), 111 deletions(-)
Patches look good to me and as nobody seems to disagree with the proposed
solution I will add the code to the features branch of the s390 tree.
--
blue skies,
Martin.

"Reality continues to ruin my life." - Calvin.
