linux/arch/x86/ia32/ia32entry.S


/*
* Compatibility mode system call entry point for x86-64.
*
* Copyright 2000-2002 Andi Kleen, SuSE Labs.
*/
#include <asm/dwarf2.h>
#include <asm/calling.h>
#include <asm/asm-offsets.h>
#include <asm/current.h>
#include <asm/errno.h>
#include <asm/ia32_unistd.h>
#include <asm/thread_info.h>
#include <asm/segment.h>
#include <asm/irqflags.h>
#include <linux/linkage.h>
/* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this. */
#include <linux/elf-em.h>
#define AUDIT_ARCH_I386 (EM_386|__AUDIT_ARCH_LE)
#define __AUDIT_ARCH_LE 0x40000000
#ifndef CONFIG_AUDITSYSCALL
#define sysexit_audit ia32_ret_from_sys_call
#define sysretl_audit ia32_ret_from_sys_call
#endif
.section .entry.text, "ax"
#define IA32_NR_syscalls ((ia32_syscall_end - ia32_sys_call_table)/8)
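/*
* Illustrative note (an editorial sketch, not part of the original source):
* every ia32_sys_call_table entry is an 8-byte .quad, so the entry count is
* the table's byte length divided by 8. The fast paths below reject
* out-of-range syscall numbers with an unsigned compare, roughly:
*
*	if ((unsigned long)nr > IA32_NR_syscalls - 1)
*		goto ia32_badsys;
*
* Because the compare is unsigned (ja), a number of (u32)-1 is rejected
* as well.
*/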
.macro IA32_ARG_FIXUP noebp=0
movl %edi,%r8d
.if \noebp
.else
movl %ebp,%r9d
.endif
xchg %ecx,%esi
movl %ebx,%edi
movl %edx,%edx /* zero extension */
.endm
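/*
* Illustrative note (editorial, not part of the original source): after
* IA32_ARG_FIXUP the ia32 syscall arguments sit in the x86-64 C argument
* registers, each zero-extended to 64 bits:
*
*	%ebx (arg1) -> %rdi
*	%ecx (arg2) -> %rsi	(via the xchg with %esi)
*	%edx (arg3) -> %rdx
*	%esi (arg4) -> %rcx	(via the xchg with %ecx)
*	%edi (arg5) -> %r8
*	%ebp (arg6) -> %r9	(skipped when noebp=1; the caller is then
*				 expected to have loaded %r9 itself)
*
* %edi is copied to %r8d first because the later "movl %ebx,%edi"
* overwrites it.
*/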
/* clobbers %eax */
.macro CLEAR_RREGS offset=0, _r9=rax
xorl %eax,%eax
movq %rax,\offset+R11(%rsp)
movq %rax,\offset+R10(%rsp)
movq %\_r9,\offset+R9(%rsp)
movq %rax,\offset+R8(%rsp)
.endm
/*
* Reload arg registers from stack in case ptrace changed them.
* We don't reload %eax because syscall_trace_enter() returned
* the %rax value we should see. Instead, we just truncate that
* value to 32 bits again as we did on entry from user mode.
* If it's a new value set by user_regset during entry tracing,
* this matches the normal truncation of the user-mode value.
* If it's -1 to make us punt the syscall, then (u32)-1 is still
* an appropriately invalid value.
*/
.macro LOAD_ARGS32 offset, _r9=0
.if \_r9
movl \offset+16(%rsp),%r9d
.endif
movl \offset+40(%rsp),%ecx
movl \offset+48(%rsp),%edx
movl \offset+56(%rsp),%esi
movl \offset+64(%rsp),%edi
movl %eax,%eax /* zero extension */
.endm
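/*
* Illustrative note (editorial, not part of the original source): the
* numeric offsets in LOAD_ARGS32 are relative to the argument part of the
* saved frame. Assuming the pt_regs layout of <asm/calling.h> from this
* kernel generation (ARGOFFSET = R11 = 48), a call with offset=ARGOFFSET
* reads these slots:
*
*	\offset+16 -> R9  slot (only when _r9 is nonzero)
*	\offset+40 -> RCX slot
*	\offset+48 -> RDX slot
*	\offset+56 -> RSI slot
*	\offset+64 -> RDI slot
*
* %ebx and %ebp are not reloaded here; the tracing paths below follow
* this macro with RESTORE_REST, which is assumed to restore them.
*/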
.macro CFI_STARTPROC32 simple
CFI_STARTPROC \simple
CFI_UNDEFINED r8
CFI_UNDEFINED r9
CFI_UNDEFINED r10
CFI_UNDEFINED r11
CFI_UNDEFINED r12
CFI_UNDEFINED r13
CFI_UNDEFINED r14
CFI_UNDEFINED r15
.endm
#ifdef CONFIG_PARAVIRT
ENTRY(native_usergs_sysret32)
swapgs
sysretl
ENDPROC(native_usergs_sysret32)
ENTRY(native_irq_enable_sysexit)
swapgs
sti
sysexit
ENDPROC(native_irq_enable_sysexit)
#endif
/*
* 32bit SYSENTER instruction entry.
*
* Arguments:
* %eax System call number.
* %ebx Arg1
* %ecx Arg2
* %edx Arg3
* %esi Arg4
* %edi Arg5
* %ebp user stack
* 0(%ebp) Arg6
*
* Interrupts off.
*
* This is purely a fast path. For anything complicated we use the int 0x80
* path below. Set up a complete hardware stack frame to share code
* with the int 0x80 path.
*/
ENTRY(ia32_sysenter_target)
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
CFI_DEF_CFA rsp,0
CFI_REGISTER rsp,rbp
SWAPGS_UNSAFE_STACK
movq PER_CPU_VAR(kernel_stack), %rsp
addq $(KERNEL_STACK_OFFSET),%rsp
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs, here we enable it straight after entry:
*/
ENABLE_INTERRUPTS(CLBR_NONE)
movl %ebp,%ebp /* zero extension */
pushq_cfi $__USER32_DS
/*CFI_REL_OFFSET ss,0*/
pushq_cfi %rbp
CFI_REL_OFFSET rsp,0
pushfq_cfi
/*CFI_REL_OFFSET rflags,0*/
movl 8*3-THREAD_SIZE+TI_sysenter_return(%rsp), %r10d
CFI_REGISTER rip,r10
pushq_cfi $__USER32_CS
/*CFI_REL_OFFSET cs,0*/
movl %eax, %eax
pushq_cfi %r10
CFI_REL_OFFSET rip,0
pushq_cfi %rax
cld
SAVE_ARGS 0,1,0
/* no need to do an access_ok check here because rbp has been
32bit zero extended */
1: movl (%rbp),%ebp
.section __ex_table,"a"
.quad 1b,ia32_badarg
.previous
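/*
* Illustrative note (editorial, not part of the original source): the
* __ex_table entry above pairs the address of the user-memory load at
* label 1 with a fixup address. If reading the sixth argument from the
* user stack faults, the exception-table lookup resumes execution at
* ia32_badarg, so the syscall returns -EFAULT. In C terms this is
* roughly:
*
*	if (get_user(arg6, (u32 __user *)(unsigned long)user_ebp))
*		return -EFAULT;
*
* arg6 and user_ebp are hypothetical names used only for this sketch.
*/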
GET_THREAD_INFO(%r10)
orl $TS_COMPAT,TI_status(%r10)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
CFI_REMEMBER_STATE
jnz sysenter_tracesys
cmpq $(IA32_NR_syscalls-1),%rax
ja ia32_badsys
sysenter_do_call:
IA32_ARG_FIXUP
sysenter_dispatch:
call *ia32_sys_call_table(,%rax,8)
movq %rax,RAX-ARGOFFSET(%rsp)
GET_THREAD_INFO(%r10)
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
testl $_TIF_ALLWORK_MASK,TI_flags(%r10)
jnz sysexit_audit
sysexit_from_sys_call:
andl $~TS_COMPAT,TI_status(%r10)
/* clear IF so that popfq doesn't enable interrupts early */
andl $~0x200,EFLAGS-R11(%rsp)
movl RIP-R11(%rsp),%edx /* User %eip */
CFI_REGISTER rip,rdx
RESTORE_ARGS 0,24,0,0,0,0
xorq %r8,%r8
xorq %r9,%r9
xorq %r10,%r10
xorq %r11,%r11
popfq_cfi
/*CFI_RESTORE rflags*/
popq_cfi %rcx /* User %esp */
CFI_REGISTER rsp,rcx
TRACE_IRQS_ON
ENABLE_INTERRUPTS_SYSEXIT32
#ifdef CONFIG_AUDITSYSCALL
.macro auditsys_entry_common
movl %esi,%r9d /* 6th arg: 4th syscall arg */
movl %edx,%r8d /* 5th arg: 3rd syscall arg */
/* (already in %ecx) 4th arg: 2nd syscall arg */
movl %ebx,%edx /* 3rd arg: 1st syscall arg */
movl %eax,%esi /* 2nd arg: syscall number */
movl $AUDIT_ARCH_I386,%edi /* 1st arg: audit arch */
call audit_syscall_entry
movl RAX-ARGOFFSET(%rsp),%eax /* reload syscall number */
cmpq $(IA32_NR_syscalls-1),%rax
ja ia32_badsys
movl %ebx,%edi /* reload 1st syscall arg */
movl RCX-ARGOFFSET(%rsp),%esi /* reload 2nd syscall arg */
movl RDX-ARGOFFSET(%rsp),%edx /* reload 3rd syscall arg */
movl RSI-ARGOFFSET(%rsp),%ecx /* reload 4th syscall arg */
movl RDI-ARGOFFSET(%rsp),%r8d /* reload 5th syscall arg */
.endm
.macro auditsys_exit exit
testl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),TI_flags(%r10)
jnz ia32_ret_from_sys_call
TRACE_IRQS_ON
sti
movl %eax,%esi /* second arg, syscall return value */
cmpl $0,%eax /* is it < 0? */
setl %al /* 1 if so, 0 if not */
movzbl %al,%edi /* zero-extend that into %edi */
inc %edi /* first arg, 0->1(AUDITSC_SUCCESS), 1->2(AUDITSC_FAILURE) */
call audit_syscall_exit
GET_THREAD_INFO(%r10)
movl RAX-ARGOFFSET(%rsp),%eax /* reload syscall return value */
movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
cli
TRACE_IRQS_OFF
testl %edi,TI_flags(%r10)
jz \exit
CLEAR_RREGS -ARGOFFSET
jmp int_with_check
.endm
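/*
* Illustrative note (editorial, not part of the original source): the
* setl/movzbl/inc sequence in auditsys_exit computes the first argument
* of audit_syscall_exit without a branch. In C terms, roughly:
*
*	int ret = (int)eax_value;
*	int status = (ret < 0) + 1;
*
* so status is 1 (AUDITSC_SUCCESS) for a non-negative return value and
* 2 (AUDITSC_FAILURE) for a negative one. eax_value is a hypothetical
* name for the 32-bit value held in %eax, which is also passed on as the
* second argument.
*/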
sysenter_auditsys:
CFI_RESTORE_STATE
auditsys_entry_common
movl %ebp,%r9d /* reload 6th syscall arg */
jmp sysenter_dispatch
sysexit_audit:
auditsys_exit sysexit_from_sys_call
#endif
sysenter_tracesys:
#ifdef CONFIG_AUDITSYSCALL
testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags(%r10)
jz sysenter_auditsys
#endif
SAVE_REST
CLEAR_RREGS
movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
movq %rsp,%rdi /* &pt_regs -> arg1 */
call syscall_trace_enter
LOAD_ARGS32 ARGOFFSET /* reload args from stack in case ptrace changed them */
RESTORE_REST
cmpq $(IA32_NR_syscalls-1),%rax
ja int_ret_from_sys_call /* sysenter_tracesys has set RAX(%rsp) */
jmp sysenter_do_call
CFI_ENDPROC
ENDPROC(ia32_sysenter_target)
/*
* 32bit SYSCALL instruction entry.
*
* Arguments:
* %eax System call number.
* %ebx Arg1
* %ecx return EIP
* %edx Arg3
* %esi Arg4
* %edi Arg5
* %ebp Arg2 [note: not saved in the stack frame, should not be touched]
* %esp user stack
* 0(%esp) Arg6
*
* Interrupts off.
*
* This is purely a fast path. For anything complicated we use the int 0x80
* path below. Set up a complete hardware stack frame to share code
* with the int 0x80 path.
*/
ENTRY(ia32_cstar_target)
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
CFI_DEF_CFA rsp,KERNEL_STACK_OFFSET
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
SWAPGS_UNSAFE_STACK
movl %esp,%r8d
CFI_REGISTER rsp,r8
movq PER_CPU_VAR(kernel_stack),%rsp
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs and here we enable it straight after entry:
*/
ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_ARGS 8,0,0
movl %eax,%eax /* zero extension */
movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
movq %rcx,RIP-ARGOFFSET(%rsp)
CFI_REL_OFFSET rip,RIP-ARGOFFSET
movq %rbp,RCX-ARGOFFSET(%rsp) /* this lies slightly to ptrace */
movl %ebp,%ecx
movq $__USER32_CS,CS-ARGOFFSET(%rsp)
movq $__USER32_DS,SS-ARGOFFSET(%rsp)
movq %r11,EFLAGS-ARGOFFSET(%rsp)
/*CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/
movq %r8,RSP-ARGOFFSET(%rsp)
CFI_REL_OFFSET rsp,RSP-ARGOFFSET
/* no need to do an access_ok check here because r8 has been
32bit zero extended */
/* hardware stack frame is complete now */
1: movl (%r8),%r9d
.section __ex_table,"a"
.quad 1b,ia32_badarg
.previous
GET_THREAD_INFO(%r10)
orl $TS_COMPAT,TI_status(%r10)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
CFI_REMEMBER_STATE
jnz cstar_tracesys
cmpq $(IA32_NR_syscalls-1),%rax
ja ia32_badsys
cstar_do_call:
IA32_ARG_FIXUP 1
cstar_dispatch:
call *ia32_sys_call_table(,%rax,8)
movq %rax,RAX-ARGOFFSET(%rsp)
GET_THREAD_INFO(%r10)
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
testl $_TIF_ALLWORK_MASK,TI_flags(%r10)
jnz sysretl_audit
sysretl_from_sys_call:
andl $~TS_COMPAT,TI_status(%r10)
RESTORE_ARGS 0,-ARG_SKIP,0,0,0
movl RIP-ARGOFFSET(%rsp),%ecx
CFI_REGISTER rip,rcx
movl EFLAGS-ARGOFFSET(%rsp),%r11d
/*CFI_REGISTER rflags,r11*/
xorq %r10,%r10
xorq %r9,%r9
xorq %r8,%r8
TRACE_IRQS_ON
movl RSP-ARGOFFSET(%rsp),%esp
CFI_RESTORE rsp
USERGS_SYSRET32
#ifdef CONFIG_AUDITSYSCALL
cstar_auditsys:
CFI_RESTORE_STATE
movl %r9d,R9-ARGOFFSET(%rsp) /* register to be clobbered by call */
auditsys_entry_common
movl R9-ARGOFFSET(%rsp),%r9d /* reload 6th syscall arg */
jmp cstar_dispatch
sysretl_audit:
auditsys_exit sysretl_from_sys_call
#endif
cstar_tracesys:
#ifdef CONFIG_AUDITSYSCALL
testl $(_TIF_WORK_SYSCALL_ENTRY & ~_TIF_SYSCALL_AUDIT),TI_flags(%r10)
jz cstar_auditsys
#endif
xchgl %r9d,%ebp
SAVE_REST
CLEAR_RREGS 0, r9
movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
movq %rsp,%rdi /* &pt_regs -> arg1 */
call syscall_trace_enter
LOAD_ARGS32 ARGOFFSET, 1 /* reload args from stack in case ptrace changed them */
RESTORE_REST
xchgl %ebp,%r9d
cmpq $(IA32_NR_syscalls-1),%rax
ja int_ret_from_sys_call /* cstar_tracesys has set RAX(%rsp) */
jmp cstar_do_call
END(ia32_cstar_target)
ia32_badarg:
movq $-EFAULT,%rax
jmp ia32_sysret
CFI_ENDPROC
/*
* Emulated IA32 system calls via int 0x80.
*
* Arguments:
* %eax System call number.
* %ebx Arg1
* %ecx Arg2
* %edx Arg3
* %esi Arg4
* %edi Arg5
* %ebp Arg6 [note: not saved in the stack frame, should not be touched]
*
* Notes:
* Uses the same stack frame as the x86-64 version.
* All registers except %eax must be saved (but ptrace may violate that)
* Arguments are zero extended. For system calls that want sign extension and
* take long arguments a wrapper is needed. Most calls can just be called
* directly.
* Assumes it is only called from user space and entered with interrupts off.
*/
ENTRY(ia32_syscall)
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
CFI_DEF_CFA rsp,SS+8-RIP
/*CFI_REL_OFFSET ss,SS-RIP*/
CFI_REL_OFFSET rsp,RSP-RIP
/*CFI_REL_OFFSET rflags,EFLAGS-RIP*/
/*CFI_REL_OFFSET cs,CS-RIP*/
CFI_REL_OFFSET rip,RIP-RIP
PARAVIRT_ADJUST_EXCEPTION_FRAME
SWAPGS
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs and here we enable it straight after entry:
*/
ENABLE_INTERRUPTS(CLBR_NONE)
movl %eax,%eax
pushq_cfi %rax
cld
/* Note: the registers are not zero-extended into the stack frame.
This could be a problem. */
SAVE_ARGS 0,1,0
GET_THREAD_INFO(%r10)
orl $TS_COMPAT,TI_status(%r10)
testl $_TIF_WORK_SYSCALL_ENTRY,TI_flags(%r10)
jnz ia32_tracesys
cmpq $(IA32_NR_syscalls-1),%rax
ja ia32_badsys
ia32_do_call:
IA32_ARG_FIXUP
call *ia32_sys_call_table(,%rax,8) # xxx: rip relative
ia32_sysret:
movq %rax,RAX-ARGOFFSET(%rsp)
ia32_ret_from_sys_call:
CLEAR_RREGS -ARGOFFSET
jmp int_ret_from_sys_call
ia32_tracesys:
SAVE_REST
CLEAR_RREGS
movq $-ENOSYS,RAX(%rsp) /* ptrace can change this for a bad syscall */
movq %rsp,%rdi /* &pt_regs -> arg1 */
call syscall_trace_enter
LOAD_ARGS32 ARGOFFSET /* reload args from stack in case ptrace changed them */
RESTORE_REST
cmpq $(IA32_NR_syscalls-1),%rax
ja int_ret_from_sys_call /* ia32_tracesys has set RAX(%rsp) */
jmp ia32_do_call
END(ia32_syscall)
ia32_badsys:
movq $0,ORIG_RAX-ARGOFFSET(%rsp)
movq $-ENOSYS,%rax
jmp ia32_sysret
quiet_ni_syscall:
movq $-ENOSYS,%rax
ret
CFI_ENDPROC
.macro PTREGSCALL label, func, arg
.globl \label
\label:
leaq \func(%rip),%rax
leaq -ARGOFFSET+8(%rsp),\arg /* 8 for return address */
jmp ia32_ptregs_common
.endm
CFI_STARTPROC32
PTREGSCALL stub32_rt_sigreturn, sys32_rt_sigreturn, %rdi
PTREGSCALL stub32_sigreturn, sys32_sigreturn, %rdi
PTREGSCALL stub32_sigaltstack, sys32_sigaltstack, %rdx
PTREGSCALL stub32_execve, sys32_execve, %rcx
PTREGSCALL stub32_fork, sys_fork, %rdi
PTREGSCALL stub32_clone, sys32_clone, %rdx
PTREGSCALL stub32_vfork, sys_vfork, %rdi
PTREGSCALL stub32_iopl, sys_iopl, %rsi
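/*
* Illustrative note (editorial, not part of the original source): each
* PTREGSCALL line above expands to a small global stub. For example, the
* stub32_execve line expands to:
*
*	.globl stub32_execve
*	stub32_execve:
*	leaq sys32_execve(%rip),%rax
*	leaq -ARGOFFSET+8(%rsp),%rcx
*	jmp ia32_ptregs_common
*
* The third macro argument names the register that receives the
* struct pt_regs pointer, so it reflects where that pointer sits in the
* handler's C parameter list (%rcx is the fourth argument register).
* ia32_ptregs_common then pops the stub's return address into %r11,
* saves the remaining registers and calls the handler through %rax.
*/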
ENTRY(ia32_ptregs_common)
popq %r11
CFI_ENDPROC
CFI_STARTPROC32 simple
CFI_SIGNAL_FRAME
CFI_DEF_CFA rsp,SS+8-ARGOFFSET
CFI_REL_OFFSET rax,RAX-ARGOFFSET
CFI_REL_OFFSET rcx,RCX-ARGOFFSET
CFI_REL_OFFSET rdx,RDX-ARGOFFSET
CFI_REL_OFFSET rsi,RSI-ARGOFFSET
CFI_REL_OFFSET rdi,RDI-ARGOFFSET
CFI_REL_OFFSET rip,RIP-ARGOFFSET
/* CFI_REL_OFFSET cs,CS-ARGOFFSET*/
/* CFI_REL_OFFSET rflags,EFLAGS-ARGOFFSET*/
CFI_REL_OFFSET rsp,RSP-ARGOFFSET
/* CFI_REL_OFFSET ss,SS-ARGOFFSET*/
SAVE_REST
call *%rax
RESTORE_REST
jmp ia32_sysret /* misbalances the return cache */
CFI_ENDPROC
END(ia32_ptregs_common)
.section .rodata,"a"
.align 8
ia32_sys_call_table:
.quad sys_restart_syscall
.quad sys_exit
.quad stub32_fork
.quad sys_read
.quad sys_write
.quad compat_sys_open /* 5 */
.quad sys_close
.quad sys32_waitpid
.quad sys_creat
.quad sys_link
.quad sys_unlink /* 10 */
.quad stub32_execve
.quad sys_chdir
.quad compat_sys_time
.quad sys_mknod
.quad sys_chmod /* 15 */
.quad sys_lchown16
.quad quiet_ni_syscall /* old break syscall holder */
.quad sys_stat
.quad sys32_lseek
.quad sys_getpid /* 20 */
.quad compat_sys_mount /* mount */
.quad sys_oldumount /* old_umount */
.quad sys_setuid16
.quad sys_getuid16
.quad compat_sys_stime /* stime */ /* 25 */
.quad compat_sys_ptrace /* ptrace */
.quad sys_alarm
.quad sys_fstat /* (old)fstat */
.quad sys_pause
.quad compat_sys_utime /* 30 */
.quad quiet_ni_syscall /* old stty syscall holder */
.quad quiet_ni_syscall /* old gtty syscall holder */
.quad sys_access
.quad sys_nice
.quad quiet_ni_syscall /* 35 */ /* old ftime syscall holder */
.quad sys_sync
.quad sys32_kill
.quad sys_rename
.quad sys_mkdir
.quad sys_rmdir /* 40 */
.quad sys_dup
.quad sys_pipe
.quad compat_sys_times
.quad quiet_ni_syscall /* old prof syscall holder */
.quad sys_brk /* 45 */
.quad sys_setgid16
.quad sys_getgid16
.quad sys_signal
.quad sys_geteuid16
.quad sys_getegid16 /* 50 */
.quad sys_acct
.quad sys_umount /* new_umount */
.quad quiet_ni_syscall /* old lock syscall holder */
.quad compat_sys_ioctl
.quad compat_sys_fcntl64 /* 55 */
.quad quiet_ni_syscall /* old mpx syscall holder */
.quad sys_setpgid
.quad quiet_ni_syscall /* old ulimit syscall holder */
.quad sys_olduname
.quad sys_umask /* 60 */
.quad sys_chroot
.quad compat_sys_ustat
.quad sys_dup2
.quad sys_getppid
.quad sys_getpgrp /* 65 */
.quad sys_setsid
.quad sys32_sigaction
.quad sys_sgetmask
.quad sys_ssetmask
.quad sys_setreuid16 /* 70 */
.quad sys_setregid16
.quad sys32_sigsuspend
.quad compat_sys_sigpending
.quad sys_sethostname
.quad compat_sys_setrlimit /* 75 */
.quad compat_sys_old_getrlimit /* old_getrlimit */
.quad compat_sys_getrusage
.quad compat_sys_gettimeofday
.quad compat_sys_settimeofday
.quad sys_getgroups16 /* 80 */
.quad sys_setgroups16
.quad compat_sys_old_select
.quad sys_symlink
.quad sys_lstat
.quad sys_readlink /* 85 */
.quad sys_uselib
.quad sys_swapon
.quad sys_reboot
.quad compat_sys_old_readdir
.quad sys32_mmap /* 90 */
.quad sys_munmap
.quad sys_truncate
.quad sys_ftruncate
.quad sys_fchmod
.quad sys_fchown16 /* 95 */
.quad sys_getpriority
.quad sys_setpriority
.quad quiet_ni_syscall /* old profil syscall holder */
.quad compat_sys_statfs
.quad compat_sys_fstatfs /* 100 */
.quad sys_ioperm
.quad compat_sys_socketcall
.quad sys_syslog
.quad compat_sys_setitimer
.quad compat_sys_getitimer /* 105 */
.quad compat_sys_newstat
.quad compat_sys_newlstat
.quad compat_sys_newfstat
.quad sys_uname
.quad stub32_iopl /* 110 */
.quad sys_vhangup
.quad quiet_ni_syscall /* old "idle" system call */
.quad sys32_vm86_warning /* vm86old */
.quad compat_sys_wait4
.quad sys_swapoff /* 115 */
.quad compat_sys_sysinfo
.quad sys32_ipc
.quad sys_fsync
.quad stub32_sigreturn
.quad stub32_clone /* 120 */
.quad sys_setdomainname
.quad sys_newuname
.quad sys_modify_ldt
.quad compat_sys_adjtimex
.quad sys32_mprotect /* 125 */
.quad compat_sys_sigprocmask
.quad quiet_ni_syscall /* create_module */
.quad sys_init_module
.quad sys_delete_module
.quad quiet_ni_syscall /* 130 get_kernel_syms */
.quad sys32_quotactl
.quad sys_getpgid
.quad sys_fchdir
.quad quiet_ni_syscall /* bdflush */
.quad sys_sysfs /* 135 */
.quad sys_personality
.quad quiet_ni_syscall /* for afs_syscall */
.quad sys_setfsuid16
.quad sys_setfsgid16
.quad sys_llseek /* 140 */
.quad compat_sys_getdents
.quad compat_sys_select
.quad sys_flock
.quad sys_msync
.quad compat_sys_readv /* 145 */
.quad compat_sys_writev
.quad sys_getsid
.quad sys_fdatasync
.quad compat_sys_sysctl /* sysctl */
.quad sys_mlock /* 150 */
.quad sys_munlock
.quad sys_mlockall
.quad sys_munlockall
.quad sys_sched_setparam
.quad sys_sched_getparam /* 155 */
.quad sys_sched_setscheduler
.quad sys_sched_getscheduler
.quad sys_sched_yield
.quad sys_sched_get_priority_max
.quad sys_sched_get_priority_min /* 160 */
.quad sys32_sched_rr_get_interval
.quad compat_sys_nanosleep
.quad sys_mremap
.quad sys_setresuid16
.quad sys_getresuid16 /* 165 */
.quad sys32_vm86_warning /* vm86 */
.quad quiet_ni_syscall /* query_module */
.quad sys_poll
.quad quiet_ni_syscall /* old nfsservctl */
.quad sys_setresgid16 /* 170 */
.quad sys_getresgid16
.quad sys_prctl
.quad stub32_rt_sigreturn
.quad sys32_rt_sigaction
.quad sys32_rt_sigprocmask /* 175 */
.quad sys32_rt_sigpending
.quad compat_sys_rt_sigtimedwait
.quad sys32_rt_sigqueueinfo
.quad sys_rt_sigsuspend
.quad sys32_pread /* 180 */
.quad sys32_pwrite
.quad sys_chown16
.quad sys_getcwd
.quad sys_capget
.quad sys_capset
.quad stub32_sigaltstack
.quad sys32_sendfile
.quad quiet_ni_syscall /* streams1 */
.quad quiet_ni_syscall /* streams2 */
.quad stub32_vfork /* 190 */
.quad compat_sys_getrlimit
.quad sys_mmap_pgoff
.quad sys32_truncate64
.quad sys32_ftruncate64
.quad sys32_stat64 /* 195 */
.quad sys32_lstat64
.quad sys32_fstat64
.quad sys_lchown
.quad sys_getuid
.quad sys_getgid /* 200 */
.quad sys_geteuid
.quad sys_getegid
.quad sys_setreuid
.quad sys_setregid
.quad sys_getgroups /* 205 */
.quad sys_setgroups
.quad sys_fchown
.quad sys_setresuid
.quad sys_getresuid
.quad sys_setresgid /* 210 */
.quad sys_getresgid
.quad sys_chown
.quad sys_setuid
.quad sys_setgid
.quad sys_setfsuid /* 215 */
.quad sys_setfsgid
.quad sys_pivot_root
.quad sys_mincore
.quad sys_madvise
.quad compat_sys_getdents64 /* 220 getdents64 */
.quad compat_sys_fcntl64
.quad quiet_ni_syscall /* tux */
.quad quiet_ni_syscall /* security */
.quad sys_gettid
.quad sys32_readahead /* 225 */
.quad sys_setxattr
.quad sys_lsetxattr
.quad sys_fsetxattr
.quad sys_getxattr
.quad sys_lgetxattr /* 230 */
.quad sys_fgetxattr
.quad sys_listxattr
.quad sys_llistxattr
.quad sys_flistxattr
.quad sys_removexattr /* 235 */
.quad sys_lremovexattr
.quad sys_fremovexattr
.quad sys_tkill
.quad sys_sendfile64
.quad compat_sys_futex /* 240 */
.quad compat_sys_sched_setaffinity
.quad compat_sys_sched_getaffinity
.quad sys_set_thread_area
.quad sys_get_thread_area
.quad compat_sys_io_setup /* 245 */
.quad sys_io_destroy
.quad compat_sys_io_getevents
.quad compat_sys_io_submit
.quad sys_io_cancel
.quad sys32_fadvise64 /* 250 */
.quad quiet_ni_syscall /* free_huge_pages */
.quad sys_exit_group
.quad sys32_lookup_dcookie
.quad sys_epoll_create
.quad sys_epoll_ctl /* 255 */
.quad sys_epoll_wait
.quad sys_remap_file_pages
.quad sys_set_tid_address
.quad compat_sys_timer_create
.quad compat_sys_timer_settime /* 260 */
.quad compat_sys_timer_gettime
.quad sys_timer_getoverrun
.quad sys_timer_delete
.quad compat_sys_clock_settime
.quad compat_sys_clock_gettime /* 265 */
.quad compat_sys_clock_getres
.quad compat_sys_clock_nanosleep
.quad compat_sys_statfs64
.quad compat_sys_fstatfs64
.quad sys_tgkill /* 270 */
.quad compat_sys_utimes
.quad sys32_fadvise64_64
.quad quiet_ni_syscall /* sys_vserver */
.quad sys_mbind
.quad compat_sys_get_mempolicy /* 275 */
.quad sys_set_mempolicy
.quad compat_sys_mq_open
.quad sys_mq_unlink
.quad compat_sys_mq_timedsend
.quad compat_sys_mq_timedreceive /* 280 */
.quad compat_sys_mq_notify
.quad compat_sys_mq_getsetattr
.quad compat_sys_kexec_load /* reserved for kexec */
.quad compat_sys_waitid
.quad quiet_ni_syscall /* 285: sys_altroot */
.quad sys_add_key
.quad sys_request_key
.quad sys_keyctl
.quad sys_ioprio_set
.quad sys_ioprio_get /* 290 */
.quad sys_inotify_init
.quad sys_inotify_add_watch
.quad sys_inotify_rm_watch
.quad sys_migrate_pages
.quad compat_sys_openat /* 295 */
.quad sys_mkdirat
.quad sys_mknodat
.quad sys_fchownat
.quad compat_sys_futimesat
.quad sys32_fstatat /* 300 */
.quad sys_unlinkat
.quad sys_renameat
.quad sys_linkat
.quad sys_symlinkat
.quad sys_readlinkat /* 305 */
.quad sys_fchmodat
.quad sys_faccessat
.quad compat_sys_pselect6
.quad compat_sys_ppoll
.quad sys_unshare /* 310 */
.quad compat_sys_set_robust_list
.quad compat_sys_get_robust_list
.quad sys_splice
.quad sys32_sync_file_range
.quad sys_tee /* 315 */
.quad compat_sys_vmsplice
.quad compat_sys_move_pages
.quad sys_getcpu
.quad sys_epoll_pwait
.quad compat_sys_utimensat /* 320 */
.quad compat_sys_signalfd
.quad sys_timerfd_create
.quad sys_eventfd
sys_fallocate() implementation on i386, x86_64 and powerpc

fallocate() is a new system call being proposed here which will allow applications to preallocate space to any file(s) in a file system. Each file system implementation that wants to use this feature will need to support an inode operation called ->fallocate().

Applications can use this feature to avoid fragmentation to a certain level and thus get faster access speed. With preallocation, applications also get a guarantee of space for particular file(s) - even if the system later becomes full.

Currently, glibc provides an interface called posix_fallocate() which can be used for a similar purpose. Although this has the advantage of working on all file systems, it is quite slow (since it writes zeroes to each block that has to be preallocated). Without a doubt, file systems can do this more efficiently within the kernel, by implementing the proposed fallocate() system call. It is expected that posix_fallocate() will be modified to call this new system call first and, in case the kernel/filesystem does not implement it, to fall back to the current implementation of writing zeroes to the new blocks.

ToDos:
1. Implementation on other architectures (other than i386, x86_64, and ppc). Patches for s390(x) and ia64 are already available from previous posts, but it was decided that they should be added later once fallocate is in the mainline. Hence those patches are not included in this take.
2. Changes to glibc,
   a) to support the fallocate() system call
   b) to make posix_fallocate() and posix_fallocate64() call fallocate()

Signed-off-by: Amit Arora <aarora@in.ibm.com>
2007-07-18 01:42:44 +00:00
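The fallback order described in the fallocate() commit message (try the new system call first, then fall back to the slower zero-writing path) can be sketched from user space as follows. This is only an illustration under stated assumptions: `preallocate_file` is a hypothetical helper name, not part of the patch, and it relies on the glibc `fallocate()` wrapper being available (which needs `_GNU_SOURCE`). Current glibc may already perform a similar fallback inside `posix_fallocate()` itself.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of the patch: create `path` if needed and
 * reserve its first `len` bytes. Returns the resulting file size, or -1
 * with errno set.
 */
off_t preallocate_file(const char *path, off_t len)
{
    struct stat st;
    off_t size = -1;
    int saved;
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd == -1)
        return -1;

    /* Fast path: ask the file system to reserve the blocks itself. */
    if (fallocate(fd, 0, 0, len) != 0) {
        int err;

        /* Only fall back when the call or the file system lacks support. */
        if (errno != ENOSYS && errno != EOPNOTSUPP)
            goto out;

        /* Slow path; posix_fallocate() returns the error number directly. */
        err = posix_fallocate(fd, 0, len);
        if (err != 0) {
            errno = err;
            goto out;
        }
    }

    if (fstat(fd, &st) == 0)
        size = st.st_size;
out:
    saved = errno;      /* keep the interesting errno across close() */
    close(fd);
    errno = saved;
    return size;
}
```

With mode 0 the file is extended to cover the reserved range, so calling `preallocate_file(path, 4096)` on a new file is expected to report a size of 4096 bytes.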
.quad sys32_fallocate
.quad compat_sys_timerfd_settime /* 325 */
.quad compat_sys_timerfd_gettime
flag parameters: signalfd

This patch adds the new signalfd4 syscall. It extends the old signalfd syscall by one parameter which is meant to hold a flag value. In this patch the only flag supported is SFD_CLOEXEC, which causes the close-on-exec flag for the returned file descriptor to be set.

A new name SFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC.

The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_signalfd4
# ifdef __x86_64__
#  define __NR_signalfd4 289
# elif defined __i386__
#  define __NR_signalfd4 327
# else
#  error "need __NR_signalfd4"
# endif
#endif

#define SFD_CLOEXEC O_CLOEXEC

int
main (void)
{
  sigset_t ss;
  sigemptyset (&ss);
  sigaddset (&ss, SIGUSR1);
  int fd = syscall (__NR_signalfd4, -1, &ss, 8, 0);
  if (fd == -1) {
    puts ("signalfd4(0) failed");
    return 1;
  }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1) {
    puts ("fcntl failed");
    return 1;
  }
  if (coe & FD_CLOEXEC) {
    puts ("signalfd4(0) set close-on-exec flag");
    return 1;
  }
  close (fd);

  fd = syscall (__NR_signalfd4, -1, &ss, 8, SFD_CLOEXEC);
  if (fd == -1) {
    puts ("signalfd4(SFD_CLOEXEC) failed");
    return 1;
  }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1) {
    puts ("fcntl failed");
    return 1;
  }
  if ((coe & FD_CLOEXEC) == 0) {
    puts ("signalfd4(SFD_CLOEXEC) does not set close-on-exec flag");
    return 1;
  }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 04:29:24 +00:00
.quad compat_sys_signalfd4
flag parameters: eventfd

This patch adds the new eventfd2 syscall. It extends the old eventfd syscall by one parameter which is meant to hold a flag value. In this patch the only flag supported is EFD_CLOEXEC, which causes the close-on-exec flag for the returned file descriptor to be set.

A new name EFD_CLOEXEC is introduced which in this implementation must have the same value as O_CLOEXEC.

The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_eventfd2
# ifdef __x86_64__
#  define __NR_eventfd2 290
# elif defined __i386__
#  define __NR_eventfd2 328
# else
#  error "need __NR_eventfd2"
# endif
#endif

#define EFD_CLOEXEC O_CLOEXEC

int
main (void)
{
  int fd = syscall (__NR_eventfd2, 1, 0);
  if (fd == -1) {
    puts ("eventfd2(0) failed");
    return 1;
  }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1) {
    puts ("fcntl failed");
    return 1;
  }
  if (coe & FD_CLOEXEC) {
    puts ("eventfd2(0) sets close-on-exec flag");
    return 1;
  }
  close (fd);

  fd = syscall (__NR_eventfd2, 1, EFD_CLOEXEC);
  if (fd == -1) {
    puts ("eventfd2(EFD_CLOEXEC) failed");
    return 1;
  }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1) {
    puts ("fcntl failed");
    return 1;
  }
  if ((coe & FD_CLOEXEC) == 0) {
    puts ("eventfd2(EFD_CLOEXEC) does not set close-on-exec flag");
    return 1;
  }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 04:29:25 +00:00
.quad sys_eventfd2
flag parameters add-on: remove epoll_create size param

Remove the size parameter from the new epoll_create syscall and rename the syscall itself. The updated test program follows.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_epoll_create2
# ifdef __x86_64__
#  define __NR_epoll_create2 291
# elif defined __i386__
#  define __NR_epoll_create2 329
# else
#  error "need __NR_epoll_create2"
# endif
#endif

#define EPOLL_CLOEXEC O_CLOEXEC

int
main (void)
{
  int fd = syscall (__NR_epoll_create2, 0);
  if (fd == -1) {
    puts ("epoll_create2(0) failed");
    return 1;
  }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1) {
    puts ("fcntl failed");
    return 1;
  }
  if (coe & FD_CLOEXEC) {
    puts ("epoll_create2(0) set close-on-exec flag");
    return 1;
  }
  close (fd);

  fd = syscall (__NR_epoll_create2, EPOLL_CLOEXEC);
  if (fd == -1) {
    puts ("epoll_create2(EPOLL_CLOEXEC) failed");
    return 1;
  }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1) {
    puts ("fcntl failed");
    return 1;
  }
  if ((coe & FD_CLOEXEC) == 0) {
    puts ("epoll_create2(EPOLL_CLOEXEC) does not set close-on-exec flag");
    return 1;
  }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 04:29:43 +00:00
.quad sys_epoll_create1
.quad sys_dup3 /* 330 */
flag parameters: pipe

This patch introduces the new syscall pipe2, which is like pipe but takes an additional parameter holding a flag value. This patch implements the handling of O_CLOEXEC for the flag. I did not add support for the new syscall for the architectures which have a special sys_pipe implementation. I think the maintainers of those archs have the chance to go with the unified implementation but that's up to them.

The implementation introduces do_pipe_flags. I did that instead of changing all callers of do_pipe because some of the callers are written in assembler. I would probably screw up changing the assembly code. To avoid breaking code, do_pipe is now a small wrapper around do_pipe_flags. Once all callers are changed over to do_pipe_flags the old do_pipe function can be removed.

The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_pipe2
# ifdef __x86_64__
#  define __NR_pipe2 293
# elif defined __i386__
#  define __NR_pipe2 331
# else
#  error "need __NR_pipe2"
# endif
#endif

int
main (void)
{
  int fd[2];
  if (syscall (__NR_pipe2, fd, 0) != 0) {
    puts ("pipe2(0) failed");
    return 1;
  }
  for (int i = 0; i < 2; ++i) {
    int coe = fcntl (fd[i], F_GETFD);
    if (coe == -1) {
      puts ("fcntl failed");
      return 1;
    }
    if (coe & FD_CLOEXEC) {
      printf ("pipe2(0) set close-on-exec for fd[%d]\n", i);
      return 1;
    }
  }
  close (fd[0]);
  close (fd[1]);

  if (syscall (__NR_pipe2, fd, O_CLOEXEC) != 0) {
    puts ("pipe2(O_CLOEXEC) failed");
    return 1;
  }
  for (int i = 0; i < 2; ++i) {
    int coe = fcntl (fd[i], F_GETFD);
    if (coe == -1) {
      puts ("fcntl failed");
      return 1;
    }
    if ((coe & FD_CLOEXEC) == 0) {
      printf ("pipe2(O_CLOEXEC) does not set close-on-exec for fd[%d]\n", i);
      return 1;
    }
  }
  close (fd[0]);
  close (fd[1]);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 04:29:30 +00:00
.quad sys_pipe2
flag parameters: inotify_init

This patch introduces the new syscall inotify_init1 (note: the 1 stands for the one parameter the syscall takes, as opposed to no parameter before). The values accepted for this parameter are function-specific and defined in the inotify.h header. Here the values must match the O_* flags, though. In this patch CLOEXEC support is introduced.

The following test must be adjusted for architectures other than x86 and x86-64 and in case the syscall numbers changed.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef __NR_inotify_init1
# ifdef __x86_64__
#  define __NR_inotify_init1 294
# elif defined __i386__
#  define __NR_inotify_init1 332
# else
#  error "need __NR_inotify_init1"
# endif
#endif

#define IN_CLOEXEC O_CLOEXEC

int
main (void)
{
  int fd;
  fd = syscall (__NR_inotify_init1, 0);
  if (fd == -1) {
    puts ("inotify_init1(0) failed");
    return 1;
  }
  int coe = fcntl (fd, F_GETFD);
  if (coe == -1) {
    puts ("fcntl failed");
    return 1;
  }
  if (coe & FD_CLOEXEC) {
    puts ("inotify_init1(0) set close-on-exec");
    return 1;
  }
  close (fd);

  fd = syscall (__NR_inotify_init1, IN_CLOEXEC);
  if (fd == -1) {
    puts ("inotify_init1(IN_CLOEXEC) failed");
    return 1;
  }
  coe = fcntl (fd, F_GETFD);
  if (coe == -1) {
    puts ("fcntl failed");
    return 1;
  }
  if ((coe & FD_CLOEXEC) == 0) {
    puts ("inotify_init1(IN_CLOEXEC) does not set close-on-exec");
    return 1;
  }
  close (fd);

  puts ("OK");

  return 0;
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[akpm@linux-foundation.org: add sys_ni stub]
Signed-off-by: Ulrich Drepper <drepper@redhat.com>
Acked-by: Davide Libenzi <davidel@xmailserver.org>
Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-24 04:29:32 +00:00
.quad sys_inotify_init1
preadv/pwritev: Add preadv and pwritev system calls.

This patch adds preadv and pwritev system calls. These syscalls are a pretty straightforward combination of pread and readv (same for write). They are quite useful for doing vectored I/O in threaded applications. Using lseek+readv instead opens race windows you'll have to plug with locking.

Other systems have such system calls too, for example NetBSD, check here: http://www.daemon-systems.org/man/preadv.2.html

The application-visible interface provided by glibc should look like this to be compatible with the existing implementations in the *BSD family:

  ssize_t preadv(int d, const struct iovec *iov, int iovcnt, off_t offset);
  ssize_t pwritev(int d, const struct iovec *iov, int iovcnt, off_t offset);

This prototype has one problem though: on 32-bit archs the (64-bit) offset argument is unaligned, which the syscall ABI of several archs doesn't allow. At least s390 needs a wrapper in glibc to handle this. As we'll need wrappers in glibc anyway, I've decided to push the problem to glibc entirely and use a syscall prototype which works without arch-specific wrappers inside the kernel: the offset argument is explicitly split into two 32-bit values.

The patch sports the actual system call implementation and the wire-up in the x86 system call tables. Other archs follow as separate patches.

Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <linux-api@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-04-02 23:59:23 +00:00
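The preadv/pwritev commit message says the 64-bit offset is passed as two 32-bit values. A minimal sketch of that split, and of how the halves could be recombined on the receiving side, is shown below. The helper names are hypothetical and chosen for illustration; the patch's own argument names and ordering are not shown in this listing, so none are assumed here.

```c
#include <stdint.h>

/* Illustrative helpers only; the names are not taken from the patch. */

/* Low 32 bits of a 64-bit file offset. */
uint32_t offset_low(uint64_t offset)
{
    return (uint32_t)(offset & 0xffffffffu);
}

/* High 32 bits of a 64-bit file offset. */
uint32_t offset_high(uint64_t offset)
{
    return (uint32_t)(offset >> 32);
}

/* Rebuild the 64-bit offset from its two halves. */
uint64_t offset_join(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | low;
}
```

Because each half fits in one register-sized argument, no alignment or register-pair rule of a 32-bit syscall ABI comes into play, which is the point the commit message makes.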
.quad compat_sys_preadv
.quad compat_sys_pwritev
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
perf: Do the big rename: Performance Counters -> Performance Events

Bye-bye Performance Counters, welcome Performance Events!

In the past few months the perfcounters subsystem has outgrown its initial role of counting hardware events, and has become (and is becoming) a much broader generic event enumeration, reporting, logging, monitoring, analysis facility.

Naming its core object 'perf_counter' and naming the subsystem 'perfcounters' has become more and more of a misnomer. With pending code like hw-breakpoints support the 'counter' name is less and less appropriate.

All in all, we've decided to rename the subsystem to 'performance events' and to propagate this rename through all fields, variables and API names. (in an ABI compatible fashion)

The word 'event' is also a bit shorter than 'counter' - which makes it slightly more convenient to write/handle as well.

Thanks go to Stephane Eranian who first observed this misnomer and suggested a rename.

User-space tooling and ABI compatibility is not affected - this patch should be function-invariant. (Also, defconfigs were not touched to keep the size down.)

This patch has been generated via the following script:

  FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

  sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

  for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
  done

  FILES=$(find . -name perf_event.*)

  sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

... to keep it as correct as possible. This script can also be used by anyone who has pending perfcounters patches - it converts a Linux kernel tree over to the new naming. We tried to time this change to the point in time where the amount of pending patches is the smallest: the end of the merge window.

Namespace clashes were fixed up in a preparatory patch - and some stylistic fallout will be fixed up in a subsequent patch.

( NOTE: 'counters' are still the proper terminology when we deal with hardware registers - and these sed scripts are a bit over-eager in renaming them. I've undone some of that, but in case there's something left where 'counter' would be better than 'event' we can undo that on an individual basis instead of touching an otherwise nicely automated patch. )

Suggested-by: Stephane Eranian <eranian@google.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Paul Mackerras <paulus@samba.org>
Reviewed-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: <linux-arch@vger.kernel.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-09-21 10:02:48 +00:00
.quad sys_perf_event_open
.quad compat_sys_recvmmsg
.quad sys_fanotify_init
.quad sys32_fanotify_mark
.quad sys_prlimit64 /* 340 */
.quad sys_name_to_handle_at
.quad compat_sys_open_by_handle_at
.quad compat_sys_clock_adjtime
introduce sys_syncfs to sync a single file system

It is frequently useful to sync a single file system, instead of all mounted file systems via sync(2):

 - On machines with many mounts, it is not at all uncommon for some of them to hang (e.g. unresponsive NFS server). sync(2) will get stuck on those and may never get to the one you do care about (e.g., /).
 - Some applications write lots of data to the file system and then want to make sure it is flushed to disk. Calling fsync(2) on each file introduces unnecessary ordering constraints that result in a large amount of sub-optimal writeback/flush/commit behavior by the file system.

There are currently two ways (that I know of) to sync a single super_block:

 - BLKFLSBUF ioctl on the block device: That also invalidates the bdev mapping, which isn't usually desirable, and doesn't work for non-block file systems.
 - 'mount -o remount,rw' will call sync_filesystem as an artifact of the current implementation. Relying on this little-known side effect for something like data safety sounds foolish.

Both of these approaches require root privileges, which some applications do not have (nor should they need?) given that sync(2) is an unprivileged operation.

This patch introduces a new system call syncfs(2) that takes an fd and syncs only the file system it references. Maybe someday we can

 $ sync /some/path

and not get

 sync: ignoring all arguments

The syscall is motivated by comments by Al and Christoph at the last LSF. syncfs(2) seems like an appropriate name given statfs(2).

A similar ioctl was also proposed a while back, see http://marc.info/?l=linux-fsdevel&m=127970513829285&w=2

Signed-off-by: Sage Weil <sage@newdream.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-03-10 19:31:30 +00:00
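The syncfs(2) commit message describes a call that takes an fd and syncs only the file system it refers to. A minimal user-space sketch of that usage follows, assuming the glibc `syncfs()` wrapper is available (it needs `_GNU_SOURCE`). `sync_filesystem_of` is a hypothetical helper name used for illustration only.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: flush only the file system that contains `path`.
 * Returns 0 on success, -1 with errno set on failure.
 */
int sync_filesystem_of(const char *path)
{
    int saved;
    int rc;
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;

    /* Any open descriptor on the target file system will do. */
    rc = syncfs(fd);

    saved = errno;      /* keep syncfs()'s errno across close() */
    close(fd);
    errno = saved;
    return rc;
}
```

Unlike the BLKFLSBUF and remount approaches listed in the commit message, this path needs no root privileges, only permission to open something on the file system in question.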
.quad sys_syncfs
.quad compat_sys_sendmmsg /* 345 */
ns: Wire up the setns system call

32bit and 64bit on x86 are tested and working. The rest I have looked at closely and I can't find any problems.

setns is an easy system call to wire up. It just takes two ints so I don't expect any weird architecture porting problems.

While doing this I have noticed that we have some architectures that are very slow to get new system calls.
 - cris seems to be the slowest where the last system calls wired up were preadv and pwritev.
 - avr32 is weird in that recvmmsg was wired up but never declared in unistd.h.
 - frv is behind with perf_event_open being the last syscall wired up.
 - On h8300 the last system call wired up was epoll_wait.
 - On m32r the last system call wired up was fallocate.
 - mn10300 has recvmmsg as the last system call wired up.

The rest seem to at least have syncfs wired up, which was new in 2.6.39.

v2: Most of the architecture support added by Daniel Lezcano <dlezcano@fr.ibm.com>
v3: ported to v2.6.36-rc4 by: Eric W. Biederman <ebiederm@xmission.com>
v4: Moved wiring up of the system call to another patch
v5: ported to v2.6.39-rc6
v6: rebased onto parisc-next and net-next to avoid syscall conflicts.
v7: ported to Linus's latest post 2.6.39 tree.

> arch/blackfin/include/asm/unistd.h     |    3 ++-
> arch/blackfin/mach-common/entry.S      |    1 +
Acked-by: Mike Frysinger <vapier@gentoo.org>

Oh - ia64 wiring looks good.
Acked-by: Tony Luck <tony.luck@intel.com>

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-28 02:28:27 +00:00
.quad sys_setns
Cross Memory Attach

The basic idea behind cross memory attach is to allow MPI programs doing intra-node communication to do a single copy of the message rather than a double copy of the message via shared memory.

The following patch attempts to achieve this by allowing a destination process, given an address and size from a source process, to copy memory directly from the source process into its own address space via a system call. There is also a symmetrical ability to copy from the current process's address space into a destination process's address space.

- Use of /proc/pid/mem has been considered, but there are issues with using it:
  - Does not allow for specifying iovecs for both src and dest; assuming preadv or pwritev was implemented, either the area read from or written to would need to be contiguous.
  - Currently mem_read allows only processes who are currently ptrace'ing the target and are still able to ptrace the target to read from the target. This check could possibly be moved to the open call, but it's not clear exactly what race this restriction is stopping (reason appears to have been lost)
  - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix domain socket is a bit ugly from a userspace point of view, especially when you may have hundreds if not (eventually) thousands of processes that all need to do this with each other
  - Doesn't allow for some future use of the interface we would like to consider adding (see below)
  - Interestingly reading from /proc/pid/mem currently actually involves two copies! (But this could be fixed pretty easily)

As mentioned previously, use of vmsplice instead was considered, but it has problems. Since you need the reader and writer working co-operatively, if the pipe is not drained then you block. That requires some wrapping to do non-blocking on the send side or polling on the receive. In all-to-all communication it requires ordering, otherwise you can deadlock. And in the example of many MPI tasks writing to one MPI task, vmsplice serialises the copying.

There are some cases of MPI collectives where even a single copy interface does not get us the performance gain we could. For example in an MPI_Reduce, rather than copy the data from the source we would like to instead use it directly in a mathops (say the reduce is doing a sum) as this would save us doing a copy. We don't need to keep a copy of the data from the source. I haven't implemented this, but I think this interface could in the future do all this through the use of the flags - e.g. it could specify the math operation and type, and the kernel, rather than just copying the data, would apply the specified operation between the source and destination and store it in the destination.

We don't yet have a "second user" of the interface (though I've had some nibbles from people who may be interested in using it for intra process messaging which is not MPI). This interface is something which hardware vendors are already doing for their custom drivers to implement fast local communication. And so in addition to this being useful for OpenMPI it would mean the driver maintainers don't have to fix things up when the mm changes.

There was some discussion about how much faster a true zero copy would go. Here's a link back to the email with some testing I did on that:

http://marc.info/?l=linux-mm&m=130105930902915&w=2

There is a basic man page for the proposed interface here:

http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt

This has been implemented for x86 and powerpc; other architectures should mainly (I think) just need to add syscall numbers for process_vm_readv and process_vm_writev. There are 32 bit compatibility versions for 64-bit kernels.

For arch maintainers there are some simple tests to be able to quickly verify that the syscalls are working correctly here:

http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgz

Signed-off-by: Chris Yeoh <yeohc@au1.ibm.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: David Howells <dhowells@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: <linux-man@vger.kernel.org>
Cc: <linux-arch@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-01 00:06:39 +00:00
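The single-copy idea in the Cross Memory Attach commit message (a destination process, given an address and size in a source process, copies the memory directly) might look like this from user space. This is a sketch, assuming the glibc `process_vm_readv()` wrapper from `<sys/uio.h>` with `_GNU_SOURCE`; the helper names are hypothetical. Reading another process generally requires ptrace-level permission over it, and some sandboxes block the call entirely, so callers should expect -1 in those environments.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy `len` bytes starting at `remote_addr` in
 * process `pid` into `dst` with one process_vm_readv() call.
 * Returns the number of bytes copied, or -1 with errno set.
 */
ssize_t copy_from_process(pid_t pid, void *dst, const void *remote_addr,
                          size_t len)
{
    /* One local and one remote iovec describe a single contiguous copy. */
    struct iovec local = { .iov_base = dst, .iov_len = len };
    struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };

    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}

/* Degenerate case for a quick check: the "remote" is the calling process. */
ssize_t copy_within_self(void *dst, const void *src, size_t len)
{
    return copy_from_process(getpid(), dst, src, len);
}
```

In an MPI-style setting the sender would only need to communicate its pid, the buffer address and the length; the receiver then performs the one copy itself, with no pipe or shared-memory staging buffer involved.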
.quad compat_sys_process_vm_readv
.quad compat_sys_process_vm_writev
ia32_syscall_end: