Commit Graph

115 Commits

Author SHA1 Message Date
Davidlohr Bueso
410bd5ecb2 procfs: use faster rb_first_cached()
...  such that we can avoid the tree walks to get the node with the
smallest key.  Semantically the same, as the previously used rb_first(),
but O(1).  The main overhead is the extra footprint for the cached rb_node
pointer, which should not matter for procfs.

Link: http://lkml.kernel.org/r/20170719014603.19029-14-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:49 -07:00
Alexey Dobriyan
855d97657d proc: uninline proc_create()
Save some code from ~320 invocations all clearing last argument.

	add/remove: 3/0 grow/shrink: 0/158 up/down: 45/-702 (-657)
	function                                     old     new   delta
	proc_create                                    -      17     +17
	__ksymtab_proc_create                          -      16     +16
	__kstrtab_proc_create                          -      12     +12
	yam_init_driver                              301     298      -3

		...

	cifs_proc_init                               249     228     -21
	via_fb_pci_probe                            2304    2280     -24

Link: http://lkml.kernel.org/r/20170819094702.GA27864@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-09-08 18:26:47 -07:00
Heiner Kallweit
cde1b69389 fs/proc/generic.c: switch to ida_simple_get/remove
The code can be much simplified by switching to ida_simple_get/remove.

Link: http://lkml.kernel.org/r/8d1cc9f7-5115-c9dc-028e-c0770b6bfe1f@gmail.com
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-07-10 16:32:34 -07:00
Takashi Iwai
d66bb1607e proc: Fix unbalanced hard link numbers
proc_create_mount_point() forgot to increase the parent's nlink, and
it resulted in unbalanced hard link numbers, e.g. /proc/fs shows one
less than expected.

Fixes: eb6d38d542 ("proc: Allow creating permanently empty directories...")
Cc: stable@vger.kernel.org
Reported-by: Tristan Ye <tristan.ye@suse.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2017-04-28 21:05:26 -05:00
David Howells
a528d35e8b statx: Add a system call to make enhanced file info available
Add a system call to make extended file information available, including
file creation and some attribute flags where available through the
underlying filesystem.

The getattr inode operation is altered to take two additional arguments: a
u32 request_mask and an unsigned int flags that indicate the
synchronisation mode.  This change is propagated to the vfs_getattr*()
function.

Functions like vfs_stat() are now inline wrappers around new functions
vfs_statx() and vfs_statx_fd() to reduce stack usage.

========
OVERVIEW
========

The idea was initially proposed as a set of xattrs that could be retrieved
with getxattr(), but the general preference proved to be for a new syscall
with an extended stat structure.

A number of requests were gathered for features to be included.  The
following have been included:

 (1) Make the fields a consistent size on all arches and make them large.

 (2) Spare space, request flags and information flags are provided for
     future expansion.

 (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
     __s64).

 (4) Creation time: The SMB protocol carries the creation time, which could
     be exported by Samba, which will in turn help CIFS make use of
     FS-Cache as that can be used for coherency data (stx_btime).

     This is also specified in NFSv4 as a recommended attribute and could
     be exported by NFSD [Steve French].

 (5) Lightweight stat: Ask for just those details of interest, and allow a
     netfs (such as NFS) to approximate anything not of interest, possibly
     without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
     Dilger] (AT_STATX_DONT_SYNC).

 (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
     its cached attributes are up to date [Trond Myklebust]
     (AT_STATX_FORCE_SYNC).

And the following have been left out for future extension:

 (7) Data version number: Could be used by userspace NFS servers [Aneesh
     Kumar].

     Can also be used to modify fill_post_wcc() in NFSD which retrieves
     i_version directly, but has just called vfs_getattr().  It could get
     it from the kstat struct if it used vfs_xgetattr() instead.

     (There's disagreement on the exact semantics of a single field, since
     not all filesystems do this the same way).

 (8) BSD stat compatibility: Including more fields from the BSD stat such
     as creation time (st_btime) and inode generation number (st_gen)
     [Jeremy Allison, Bernd Schubert].

 (9) Inode generation number: Useful for FUSE and userspace NFS servers
     [Bernd Schubert].

     (This was asked for but later deemed unnecessary with the
     open-by-handle capability available and caused disagreement as to
     whether it's a security hole or not).

(10) Extra coherency data may be useful in making backups [Andreas Dilger].

     (No particular data were offered, but things like last backup
     timestamp, the data version number and the DOS archive bit would come
     into this category).

(11) Allow the filesystem to indicate what it can/cannot provide: A
     filesystem can now say it doesn't support a standard stat feature if
     that isn't available, so if, for instance, inode numbers or UIDs don't
     exist or are fabricated locally...

     (This requires a separate system call - I have an fsinfo() call idea
     for this).

(12) Store a 16-byte volume ID in the superblock that can be returned in
     struct xstat [Steve French].

     (Deferred to fsinfo).

(13) Include granularity fields in the time data to indicate the
     granularity of each of the times (NFSv4 time_delta) [Steve French].

     (Deferred to fsinfo).

(14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
     Note that the Linux IOC flags are a mess and filesystems such as Ext4
     define flags that aren't in linux/fs.h, so translation in the kernel
     may be a necessity (or, possibly, we provide the filesystem type too).

     (Some attributes are made available in stx_attributes, but the general
     feeling was that the IOC flags were to ext[234]-specific and shouldn't
     be exposed through statx this way).

(15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
     Michael Kerrisk].

     (Deferred, probably to fsinfo.  Finding out if there's an ACL or
     seclabal might require extra filesystem operations).

(16) Femtosecond-resolution timestamps [Dave Chinner].

     (A __reserved field has been left in the statx_timestamp struct for
     this - if there proves to be a need).

(17) A set multiple attributes syscall to go with this.

===============
NEW SYSTEM CALL
===============

The new system call is:

	int ret = statx(int dfd,
			const char *filename,
			unsigned int flags,
			unsigned int mask,
			struct statx *buffer);

The dfd, filename and flags parameters indicate the file to query, in a
similar way to fstatat().  There is no equivalent of lstat() as that can be
emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
also no equivalent of fstat() as that can be emulated by passing a NULL
filename to statx() with the fd of interest in dfd.

Whether or not statx() synchronises the attributes with the backing store
can be controlled by OR'ing a value into the flags argument (this typically
only affects network filesystems):

 (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
     respect.

 (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
     its attributes with the server - which might require data writeback to
     occur to get the timestamps correct.

 (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
     network filesystem.  The resulting values should be considered
     approximate.

mask is a bitmask indicating the fields in struct statx that are of
interest to the caller.  The user should set this to STATX_BASIC_STATS to
get the basic set returned by stat().  It should be noted that asking for
more information may entail extra I/O operations.

buffer points to the destination for the data.  This must be 256 bytes in
size.

======================
MAIN ATTRIBUTES RECORD
======================

The following structures are defined in which to return the main attribute
set:

	struct statx_timestamp {
		__s64	tv_sec;
		__s32	tv_nsec;
		__s32	__reserved;
	};

	struct statx {
		__u32	stx_mask;
		__u32	stx_blksize;
		__u64	stx_attributes;
		__u32	stx_nlink;
		__u32	stx_uid;
		__u32	stx_gid;
		__u16	stx_mode;
		__u16	__spare0[1];
		__u64	stx_ino;
		__u64	stx_size;
		__u64	stx_blocks;
		__u64	__spare1[1];
		struct statx_timestamp	stx_atime;
		struct statx_timestamp	stx_btime;
		struct statx_timestamp	stx_ctime;
		struct statx_timestamp	stx_mtime;
		__u32	stx_rdev_major;
		__u32	stx_rdev_minor;
		__u32	stx_dev_major;
		__u32	stx_dev_minor;
		__u64	__spare2[14];
	};

The defined bits in request_mask and stx_mask are:

	STATX_TYPE		Want/got stx_mode & S_IFMT
	STATX_MODE		Want/got stx_mode & ~S_IFMT
	STATX_NLINK		Want/got stx_nlink
	STATX_UID		Want/got stx_uid
	STATX_GID		Want/got stx_gid
	STATX_ATIME		Want/got stx_atime{,_ns}
	STATX_MTIME		Want/got stx_mtime{,_ns}
	STATX_CTIME		Want/got stx_ctime{,_ns}
	STATX_INO		Want/got stx_ino
	STATX_SIZE		Want/got stx_size
	STATX_BLOCKS		Want/got stx_blocks
	STATX_BASIC_STATS	[The stuff in the normal stat struct]
	STATX_BTIME		Want/got stx_btime{,_ns}
	STATX_ALL		[All currently available stuff]

stx_btime is the file creation time, stx_mask is a bitmask indicating the
data provided and __spares*[] are where as-yet undefined fields can be
placed.

Time fields are structures with separate seconds and nanoseconds fields
plus a reserved field in case we want to add even finer resolution.  Note
that times will be negative if before 1970; in such a case, the nanosecond
fields will also be negative if not zero.

The bits defined in the stx_attributes field convey information about a
file, how it is accessed, where it is and what it does.  The following
attributes map to FS_*_FL flags and are the same numerical value:

	STATX_ATTR_COMPRESSED		File is compressed by the fs
	STATX_ATTR_IMMUTABLE		File is marked immutable
	STATX_ATTR_APPEND		File is append-only
	STATX_ATTR_NODUMP		File is not to be dumped
	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs

Within the kernel, the supported flags are listed by:

	KSTAT_ATTR_FS_IOC_FLAGS

[Are any other IOC flags of sufficient general interest to be exposed
through this interface?]

New flags include:

	STATX_ATTR_AUTOMOUNT		Object is an automount trigger

These are for the use of GUI tools that might want to mark files specially,
depending on what they are.

Fields in struct statx come in a number of classes:

 (0) stx_dev_*, stx_blksize.

     These are local system information and are always available.

 (1) stx_mode, stx_nlinks, stx_uid, stx_gid, stx_[amc]time, stx_ino,
     stx_size, stx_blocks.

     These will be returned whether the caller asks for them or not.  The
     corresponding bits in stx_mask will be set to indicate whether they
     actually have valid values.

     If the caller didn't ask for them, then they may be approximated.  For
     example, NFS won't waste any time updating them from the server,
     unless as a byproduct of updating something requested.

     If the values don't actually exist for the underlying object (such as
     UID or GID on a DOS file), then the bit won't be set in the stx_mask,
     even if the caller asked for the value.  In such a case, the returned
     value will be a fabrication.

     Note that there are instances where the type might not be valid, for
     instance Windows reparse points.

 (2) stx_rdev_*.

     This will be set only if stx_mode indicates we're looking at a
     blockdev or a chardev, otherwise will be 0.

 (3) stx_btime.

     Similar to (1), except this will be set to 0 if it doesn't exist.

=======
TESTING
=======

The following test program can be used to test the statx system call:

	samples/statx/test-statx.c

Just compile and run, passing it paths to the files you want to examine.
The file is built automatically if CONFIG_SAMPLES is enabled.

Here's some example output.  Firstly, an NFS directory that crosses to
another FSID.  Note that the AUTOMOUNT attribute is set because transiting
this directory will cause d_automount to be invoked by the VFS.

	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:26           Inode: 1703937     Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000
	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)

Secondly, the result of automounting on that directory.

	[root@andromeda ~]# /tmp/test-statx /warthog/data
	statx(/warthog/data) = 0
	results=7ff
	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
	Device: 00:27           Inode: 2           Links: 125
	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
	Access: 2016-11-24 09:02:12.219699527+0000
	Modify: 2016-11-17 10:44:36.225653653+0000
	Change: 2016-11-17 10:44:36.225653653+0000

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2017-03-02 20:51:15 -05:00
Geliang Tang
4e4a7fb7b4 proc: use rb_entry()
To make the code clearer, use rb_entry() instead of container_of() to
deal with rbtree.

Link: http://lkml.kernel.org/r/4fd1f82818665705ce75c5156a060ae7caa8e0a9.1482160150.git.geliangtang@gmail.com
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Cc: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2017-02-24 17:46:56 -08:00
Linus Torvalds
7c0f6ba682 Replace <asm/uaccess.h> with <linux/uaccess.h> globally
This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2016-12-24 11:46:01 -08:00
Seth Forshee
f97df70b1c xenfs: Use proc_create_mount_point() to create /proc/xen
Mounting proc in user namespace containers fails if the xenbus
filesystem is mounted on /proc/xen because this directory fails
the "permanently empty" test. proc_create_mount_point() exists
specifically to create such mountpoints in proc but is currently
proc-internal. Export this interface to modules, then use it in
xenbus when creating /proc/xen.

Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Signed-off-by: David Vrabel <david.vrabel@citrix.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
2016-11-17 13:52:18 +01:00
Linus Torvalds
abb5a14fa2 Merge branch 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc vfs updates from Al Viro:
 "Assorted misc bits and pieces.

  There are several single-topic branches left after this (rename2
  series from Miklos, current_time series from Deepa Dinamani, xattr
  series from Andreas, uaccess stuff from from me) and I'd prefer to
  send those separately"

* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (39 commits)
  proc: switch auxv to use of __mem_open()
  hpfs: support FIEMAP
  cifs: get rid of unused arguments of CIFSSMBWrite()
  posix_acl: uapi header split
  posix_acl: xattr representation cleanups
  fs/aio.c: eliminate redundant loads in put_aio_ring_file
  fs/internal.h: add const to ns_dentry_operations declaration
  compat: remove compat_printk()
  fs/buffer.c: make __getblk_slow() static
  proc: unsigned file descriptors
  fs/file: more unsigned file descriptors
  fs: compat: remove redundant check of nr_segs
  cachefiles: Fix attempt to read i_blocks after deleting file [ver #2]
  cifs: don't use memcpy() to copy struct iov_iter
  get rid of separate multipage fault-in primitives
  fs: Avoid premature clearing of capabilities
  fs: Give dentry to inode_change_ok() instead of inode
  fuse: Propagate dentry down to inode_change_ok()
  ceph: Propagate dentry down to inode_change_ok()
  xfs: Propagate dentry down to inode_change_ok()
  ...
2016-10-10 13:04:49 -07:00
Jan Kara
31051c85b5 fs: Give dentry to inode_change_ok() instead of inode
inode_change_ok() will be resposible for clearing capabilities and IMA
extended attributes and as such will need dentry. Give it as an argument
to inode_change_ok() instead of an inode. Also rename inode_change_ok()
to setattr_prepare() to better relect that it does also some
modifications in addition to checks.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
2016-09-22 10:56:19 +02:00
Dmitry Torokhov
c110486f6c proc: make proc entries inherit ownership from parent
There are certain parameters that belong to net namespace and that are
exported in /proc. They should be controllable by the container's owner,
but are currently owned by global root and thus not available.

Let's change proc code to inherit ownership of parent entry, and when
create per-ns "net" proc entry set it up as owned by container's owner.

Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-14 21:07:20 -07:00
Al Viro
f50752eaa0 switch all procfs directories ->iterate_shared()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2016-05-02 19:49:30 -04:00
Waiman Long
ecf1a3dfff proc: change proc_subdir_lock to a rwlock
The proc_subdir_lock spinlock is used to allow only one task to make
change to the proc directory structure as well as looking up information
in it.  However, the information lookup part can actually be entered by
more than one task as the pde_get() and pde_put() reference count update
calls in the critical sections are atomic increment and decrement
respectively and so are safe with concurrent updates.

The x86 architecture has already used qrwlock which is fair and other
architectures like ARM are in the process of switching to qrwlock.  So
unfairness shouldn't be a concern in that conversion.

This patch changed the proc_subdir_lock to a rwlock in order to enable
concurrent lookup. The following functions were modified to take a
write lock:
 - proc_register()
 - remove_proc_entry()
 - remove_proc_subtree()

The following functions were modified to take a read lock:
 - xlate_proc_name()
 - proc_lookup_de()
 - proc_readdir_de()

A parallel /proc filesystem search with the "find" command (1000 threads)
was run on a 4-socket Haswell-EX box (144 threads).  Before the patch, the
parallel search took about 39s.  After the patch, the parallel find took
only 25s, a saving of about 14s.

The micro-benchmark that I used was artificial, but it was used to
reproduce an exit hanging problem that I saw in real application.  In
fact, only allow one task to do a lookup seems too limiting to me.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Scott J Norton <scott.norton@hp.com>
Cc: Douglas Hatch <doug.hatch@hp.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-10 13:29:01 -07:00
Eric W. Biederman
eb6d38d542 proc: Allow creating permanently empty directories that serve as mount points
Add a new function proc_create_mount_point that when used to creates a
directory that can not be added to.

Add a new function is_empty_pde to test if a function is a mount
point.

Update the code to use make_empty_dir_inode when reporting
a permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.

Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-07-01 10:36:41 -05:00
David Howells
2b0143b5c9 VFS: normal filesystems (and lustre): d_inode() annotations
that's the bulk of filesystem drivers dealing with inodes of their own

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-15 15:06:57 -04:00
Al Viro
7e0e953bb0 procfs: fix race between symlink removals and traversals
use_pde()/unuse_pde() in ->follow_link()/->put_link() resp.

Cc: stable@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-02-22 11:43:12 -05:00
Linus Torvalds
50652963ea Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull misc VFS updates from Al Viro:
 "This cycle a lot of stuff sits on topical branches, so I'll be sending
  more or less one pull request per branch.

  This is the first pile; more to follow in a few.  In this one are
  several misc commits from early in the cycle (before I went for
  separate branches), plus the rework of mntput/dput ordering on umount,
  switching to use of fs_pin instead of convoluted games in
  namespace_unlock()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  switch the IO-triggering parts of umount to fs_pin
  new fs_pin killing logics
  allow attaching fs_pin to a group not associated with some superblock
  get rid of the second argument of acct_kill()
  take count and rcu_head out of fs_pin
  dcache: let the dentry count go down to zero without taking d_lock
  pull bumping refcount into ->kill()
  kill pin_put()
  mode_t whack-a-mole: chelsio
  file->f_path.dentry is pinned down for as long as the file is open...
  get rid of lustre_dump_dentry()
  gut proc_register() a bit
  kill d_validate()
  ncpfs: get rid of d_validate() nonsense
  selinuxfs: don't open-code d_genocide()
2015-02-17 14:56:45 -08:00
Alexander Kuleshov
6bee55f94f fs: proc: use PDE() to get proc_dir_entry
Use the PDE() helper to get proc_dir_entry instead of coding it directly.

Signed-off-by: Alexander Kuleshov <kuleshovmail@gmail.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-02-12 18:54:12 -08:00
Al Viro
d443b9fd56 gut proc_register() a bit
There are only 3 callers and quite a bit of that thing is executed
exactly in one of those.  Just lift it there...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-01-25 23:16:26 -05:00
Nicolas Dichtel
2fc1e948e8 fs/proc.c: use rb_entry_safe() instead of rb_entry()
Better to use existing macro that rewriting them.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Debabrata Banerjee
b208d54b75 procfs: fix error handling of proc_register()
proc_register() error paths are leaking inodes and directory refcounts.

Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Nicolas Dichtel
710585d492 fs/proc: use a rb tree for the directory entries
When a lot of netdevices are created, one of the bottleneck is the
creation of proc entries.  This serie aims to accelerate this part.

The current implementation for the directories in /proc is using a single
linked list.  This is slow when handling directories with large numbers of
entries (eg netdevice-related entries when lots of tunnels are opened).

This patch replaces this linked list by a red-black tree.

Here are some numbers:

dummy30000.batch contains 30 000 times 'link add type dummy'.

Before the patch:
  $ time ip -b dummy30000.batch
  real    2m31.950s
  user    0m0.440s
  sys     2m21.440s
  $ time rmmod dummy
  real    1m35.764s
  user    0m0.000s
  sys     1m24.088s

After the patch:
  $ time ip -b dummy30000.batch
  real    2m0.874s
  user    0m0.448s
  sys     1m49.720s
  $ time rmmod dummy
  real    1m13.988s
  user    0m0.000s
  sys     1m1.008s

The idea of improving this part was suggested by Thierry Herbelot.

[akpm@linux-foundation.org: initialise proc_root.subdir at compile time]
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: Thierry Herbelot <thierry.herbelot@6wind.com>.
Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-12-10 17:41:09 -08:00
Alexey Dobriyan
4dcc03fc45 proc: make proc_subdir_lock static
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Alexey Dobriyan
dbcdb50441 proc: add and remove /proc entry create checks
* remove proc_create(NULL, ...) check, let it oops

* warn about proc_create("", ...) and proc_create("very very long name", ...)
  proc code keeps length as u8, no 256+ name length possible

* warn about proc_create("123", ...)
  /proc/$PID and /proc/misc namespaces are separate things,
  but dumb module might create funky a-la $PID entry.

* remove post mortem strchr('/') check
  Triggering it implies either strchr() is buggy or memory corruption.
  It should be VFS check anyway.

In reality, none of these checks will ever trigger,
it is preparation for the next patch.

Based on patch from Al Viro.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-08-08 15:57:22 -07:00
Rui Xiang
cdf7e8dded proc: set attributes of pde using accessor functions
Use existing accessors proc_set_user() and proc_set_size() to set
attributes.  Just a cleanup.

Signed-off-by: Rui Xiang <rui.xiang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2014-01-23 16:37:01 -08:00
Al Viro
b26d4cd385 consolidate simple ->d_delete() instances
Rename simple_delete_dentry() to always_delete_dentry() and export it.
Export simple_dentry_operations, while we are at it, and get rid of
their duplicates

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-11-15 22:04:17 -05:00
Linus Torvalds
fd3930f70c proc: more readdir conversion bug-fixes
In the previous commit, Richard Genoud fixed proc_root_readdir(), which
had lost the check for whether all of the non-process /proc entries had
been returned or not.

But that in turn exposed _another_ bug, namely that the original readdir
conversion patch had yet another problem: it had lost the return value
of proc_readdir_de(), so now checking whether it had completed
successfully or not didn't actually work right anyway.

This reinstates the non-zero return for the "end of base entries" that
had also gotten lost in commit f0c3b5093a ("[readdir] convert
procfs").  So now you get all the base entries *and* you get all the
process entries, regardless of getdents buffer size.

(Side note: the Linux "getdents" manual page actually has a nice example
application for testing getdents, which can be easily modified to use
different buffers.  Who knew? Man-pages can be useful)

Reported-by: Emmanuel Benisty <benisty.e@gmail.com>
Reported-by: Marc Dionne <marc.c.dionne@gmail.com>
Cc: Richard Genoud <richard.genoud@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-08-19 16:26:12 -07:00
Al Viro
f0c3b5093a [readdir] convert procfs
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-06-29 12:56:32 +04:00
David Howells
c30480b92c proc: Make the PROC_I() and PDE() macros internal to procfs
Make the PROC_I() and PDE() macros internal to procfs.  This means making
PDE_DATA() out of line.  This could be made more optimal by storing
PDE()->data into inode->i_private.

Also provide a __PDE_DATA() that is inline and internal to procfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:47 -04:00
David Howells
a8ca16ea7b proc: Supply a function to remove a proc entry by PDE
Supply a function (proc_remove()) to remove a proc entry (and any subtree
rooted there) by proc_dir_entry pointer rather than by name and (optionally)
root dir entry pointer.  This allows us to eliminate all remaining pde->name
accesses outside of procfs.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Grant Likely <grant.likely@linaro.or>
cc: linux-acpi@vger.kernel.org
cc: openipmi-developer@lists.sourceforge.net
cc: devicetree-discuss@lists.ozlabs.org
cc: linux-pci@vger.kernel.org
cc: netdev@vger.kernel.org
cc: netfilter-devel@vger.kernel.org
cc: alsa-devel@alsa-project.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:46 -04:00
David Howells
4a520d2769 proc: Supply an accessor for getting the data from a PDE's parent
Supply an accessor function for getting the private data from the parent
proc_dir_entry struct of the proc_dir_entry struct associated with an inode.

ReiserFS, for instance, stores the super_block pointer in the proc directory
it makes for that super_block, and a pointer to the respective seq_file show
function in each of the proc files in that directory.

This allows a reduction in the number of file_operations structs, open
functions and seq_operations structs required.  The problem otherwise is that
each show function requires two pieces of data but only has storage for one
per PDE (and this has no release function).

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Jerry Chuang <jerry-chuang@realtek.com>
cc: Maxim Mikityanskiy <maxtram95@gmail.com>
cc: YAMANE Toshiaki <yamanetoshi@gmail.com>
cc: linux-wireless@vger.kernel.org
cc: linux-scsi@vger.kernel.org
cc: devel@driverdev.osuosl.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:42 -04:00
David Howells
270b5ac215 proc: Add proc_mkdir_data()
Add proc_mkdir_data() to allow procfs directories to be created that are
annotated at the time of creation with private data rather than doing this
post-creation.  This means no access is then required to the proc_dir_entry
struct to set this.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
cc: Neela Syam Kolli <megaraidlinux@lsi.com>
cc: Jerry Chuang <jerry-chuang@realtek.com>
cc: linux-scsi@vger.kernel.org
cc: devel@driverdev.osuosl.org
cc: linux-wireless@vger.kernel.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:41 -04:00
David Howells
271a15eabe proc: Supply PDE attribute setting accessor functions
Supply accessor functions to set attributes in proc_dir_entry structs.

The following are supplied: proc_set_size() and proc_set_user().

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Mauro Carvalho Chehab <mchehab@redhat.com>
cc: linuxppc-dev@lists.ozlabs.org
cc: linux-media@vger.kernel.org
cc: netdev@vger.kernel.org
cc: linux-wireless@vger.kernel.org
cc: linux-pci@vger.kernel.org
cc: netfilter-devel@vger.kernel.org
cc: alsa-devel@alsa-project.org
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-05-01 17:29:18 -04:00
David Howells
3cb5bf1bf9 proc: Delete create_proc_read_entry()
Delete create_proc_read_entry() as it no longer has any users.

Also delete read_proc_t, write_proc_t, the read_proc member of the
proc_dir_entry struct and the support functions that use them.  This saves a
pointer for every PDE allocated.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-29 15:42:00 -04:00
Al Viro
866ad9a747 procfs: preparations for remove_proc_entry() race fixes
* leave ->proc_fops alone; make ->pde_users negative instead
* trim pde_opener
* move relevant code in fs/proc/inode.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 15:16:51 -04:00
David Howells
ad147d011f procfs: Clean up huge if-statement in __proc_file_read()
Switch huge if-statement in __proc_file_read() around.  This then puts the
single line loop break immediately after the if-statement and allows us to
de-indent the huge comment and make it take fewer lines.  The code following
the if-statement then follows naturally from the call to dp->read_proc().

Signed-off-by: David Howells <dhowells@redhat.com>
2013-04-09 15:16:50 -04:00
David Howells
80e928f7eb proc: Kill create_proc_entry()
Kill create_proc_entry() in favour of create_proc_read_entry(), proc_create()
and proc_create_data().

Signed-off-by: David Howells <dhowells@redhat.com>
2013-04-09 14:16:39 -04:00
Al Viro
d9dda78bad procfs: new helper - PDE_DATA(inode)
The only part of proc_dir_entry the code outside of fs/proc
really cares about is PDE(inode)->data.  Provide a helper
for that; static inline for now, eventually will be moved
to fs/proc, along with the knowledge of struct proc_dir_entry
layout.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
Al Viro
ee21ed0afc procfs: kill ->write_proc()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:32 -04:00
Al Viro
b6cdc73103 procfs: don't allow to use proc_create, create_proc_entry, etc. for directories
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:13:14 -04:00
Al Viro
8ce584c741 procfs: add proc_remove_subtree()
just what it sounds like; do that only to procfs subtrees you've
created - doing that to something shared with another driver is
not only antisocial, but might cause interesting races with
proc_create() and its ilk.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-04-09 14:09:17 -04:00
Andrew Morton
87ebdc00ee fs/proc: clean up printks
- use pr_foo() throughout

- remove a couple of duplicated KERN_WARNINGs, via WARN(KERN_WARNING "...")

- nuke a few warnings which I've never seen happen, ever.

Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2013-02-27 19:10:11 -08:00
Al Viro
d3d009cb96 saner proc_get_inode() calling conventions
Make it drop the pde in *all* cases when no new reference to it is
put into an inode - both when an inode had already been set up
(as we were already doing) and when inode allocation has failed.
Makes for simpler logics in callers...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-26 02:46:14 -05:00
Al Viro
496ad9aa8e new helper: file_inode(file)
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2013-02-22 23:31:31 -05:00
Eric W. Biederman
dfb2ea45be proc: Allow proc_free_inum to be called from any context
While testing the pid namespace code I hit this nasty warning.

[  176.262617] ------------[ cut here ]------------
[  176.263388] WARNING: at /home/eric/projects/linux/linux-userns-devel/kernel/softirq.c:160 local_bh_enable_ip+0x7a/0xa0()
[  176.265145] Hardware name: Bochs
[  176.265677] Modules linked in:
[  176.266341] Pid: 742, comm: bash Not tainted 3.7.0userns+ #18
[  176.266564] Call Trace:
[  176.266564]  [<ffffffff810a539f>] warn_slowpath_common+0x7f/0xc0
[  176.266564]  [<ffffffff810a53fa>] warn_slowpath_null+0x1a/0x20
[  176.266564]  [<ffffffff810ad9ea>] local_bh_enable_ip+0x7a/0xa0
[  176.266564]  [<ffffffff819308c9>] _raw_spin_unlock_bh+0x19/0x20
[  176.266564]  [<ffffffff8123dbda>] proc_free_inum+0x3a/0x50
[  176.266564]  [<ffffffff8111d0dc>] free_pid_ns+0x1c/0x80
[  176.266564]  [<ffffffff8111d195>] put_pid_ns+0x35/0x50
[  176.266564]  [<ffffffff810c608a>] put_pid+0x4a/0x60
[  176.266564]  [<ffffffff8146b177>] tty_ioctl+0x717/0xc10
[  176.266564]  [<ffffffff810aa4d5>] ? wait_consider_task+0x855/0xb90
[  176.266564]  [<ffffffff81086bf9>] ? default_spin_lock_flags+0x9/0x10
[  176.266564]  [<ffffffff810cab0a>] ? remove_wait_queue+0x5a/0x70
[  176.266564]  [<ffffffff811e37e8>] do_vfs_ioctl+0x98/0x550
[  176.266564]  [<ffffffff810b8a0f>] ? recalc_sigpending+0x1f/0x60
[  176.266564]  [<ffffffff810b9127>] ? __set_task_blocked+0x37/0x80
[  176.266564]  [<ffffffff810ab95b>] ? sys_wait4+0xab/0xf0
[  176.266564]  [<ffffffff811e3d31>] sys_ioctl+0x91/0xb0
[  176.266564]  [<ffffffff810a95f0>] ? task_stopped_code+0x50/0x50
[  176.266564]  [<ffffffff81939199>] system_call_fastpath+0x16/0x1b
[  176.266564] ---[ end trace 387af88219ad6143 ]---

It turns out that spin_unlock_bh(proc_inum_lock) is not safe when
put_pid is called with another spinlock held and irqs disabled.

For now take the easy path and use spin_lock_irqsave(proc_inum_lock)
in proc_free_inum and spin_loc_irq in proc_alloc_inum(proc_inum_lock).

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2012-12-25 16:23:12 -08:00
Linus Torvalds
4c9a44aebe Merge branch 'akpm' (Andrew's patch-bomb)
Merge the rest of Andrew's patches for -rc1:
 "A bunch of fixes and misc missed-out-on things.

  That'll do for -rc1.  I still have a batch of IPC patches which still
  have a possible bug report which I'm chasing down."

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (25 commits)
  keys: use keyring_alloc() to create module signing keyring
  keys: fix unreachable code
  sendfile: allows bypassing of notifier events
  SGI-XP: handle non-fatal traps
  fat: fix incorrect function comment
  Documentation: ABI: remove testing/sysfs-devices-node
  proc: fix inconsistent lock state
  linux/kernel.h: fix DIV_ROUND_CLOSEST with unsigned divisors
  memcg: don't register hotcpu notifier from ->css_alloc()
  checkpatch: warn on uapi #includes that #include <uapi/...
  revert "rtc: recycle id when unloading a rtc driver"
  mm: clean up transparent hugepage sysfs error messages
  hfsplus: add error message for the case of failure of sync fs in delayed_sync_fs() method
  hfsplus: rework processing of hfs_btree_write() returned error
  hfsplus: rework processing errors in hfsplus_free_extents()
  hfsplus: avoid crash on failed block map free
  kcmp: include linux/ptrace.h
  drivers/rtc/rtc-imxdi.c: must include <linux/spinlock.h>
  mm: cma: WARN if freed memory is still in use
  exec: do not leave bprm->interp on stack
  ...
2012-12-20 20:00:43 -08:00
Xiaotian Feng
ee297209bf proc: fix inconsistent lock state
Lockdep found an inconsistent lock state when rcu is processing delayed
work in softirq.  Currently, kernel is using spin_lock/spin_unlock to
protect proc_inum_ida, but proc_free_inum is called by rcu in softirq
context.

Use spin_lock_bh/spin_unlock_bh fix following lockdep warning.

  =================================
  [ INFO: inconsistent lock state ]
  3.7.0 #36 Not tainted
  ---------------------------------
  inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
  swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
   (proc_inum_lock){+.?...}, at: proc_free_inum+0x1c/0x50
  {SOFTIRQ-ON-W} state was registered at:
     __lock_acquire+0x8ae/0xca0
     lock_acquire+0x199/0x200
     _raw_spin_lock+0x41/0x50
     proc_alloc_inum+0x4c/0xd0
     alloc_mnt_ns+0x49/0xc0
     create_mnt_ns+0x25/0x70
     mnt_init+0x161/0x1c7
     vfs_caches_init+0x107/0x11a
     start_kernel+0x348/0x38c
     x86_64_start_reservations+0x131/0x136
     x86_64_start_kernel+0x103/0x112
  irq event stamp: 2993422
  hardirqs last  enabled at (2993422):  _raw_spin_unlock_irqrestore+0x55/0x80
  hardirqs last disabled at (2993421):  _raw_spin_lock_irqsave+0x29/0x70
  softirqs last  enabled at (2993394):  _local_bh_enable+0x13/0x20
  softirqs last disabled at (2993395):  call_softirq+0x1c/0x30

  other info that might help us debug this:
   Possible unsafe locking scenario:

         CPU0
         ----
    lock(proc_inum_lock);
    <Interrupt>
      lock(proc_inum_lock);

   *** DEADLOCK ***

  no locks held by swapper/1/0.

  stack backtrace:
  Pid: 0, comm: swapper/1 Not tainted 3.7.0 #36
  Call Trace:
   <IRQ>  [<ffffffff810a40f1>] ? vprintk_emit+0x471/0x510
    print_usage_bug+0x2a5/0x2c0
    mark_lock+0x33b/0x5e0
    __lock_acquire+0x813/0xca0
    lock_acquire+0x199/0x200
    _raw_spin_lock+0x41/0x50
    proc_free_inum+0x1c/0x50
    free_pid_ns+0x1c/0x50
    put_pid_ns+0x2e/0x50
    put_pid+0x4a/0x60
    delayed_put_pid+0x12/0x20
    rcu_process_callbacks+0x462/0x790
    __do_softirq+0x1b4/0x3b0
    call_softirq+0x1c/0x30
    do_softirq+0x59/0xd0
    irq_exit+0x54/0xd0
    smp_apic_timer_interrupt+0x95/0xa3
    apic_timer_interrupt+0x72/0x80
    cpuidle_enter_tk+0x10/0x20
    cpuidle_enter_state+0x17/0x50
    cpuidle_idle_call+0x287/0x520
    cpu_idle+0xba/0x130
    start_secondary+0x2b3/0x2bc

Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-12-20 17:40:20 -08:00
Marco Stornelli
46f6955710 procfs: drop vmtruncate
Removed vmtruncate

Signed-off-by: Marco Stornelli <marco.stornelli@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-12-20 14:00:01 -05:00
Eric W. Biederman
33d6dce607 proc: Generalize proc inode allocation
Generalize the proc inode allocation so that it can be
used without having to having to create a proc_dir_entry.

This will allow namespace file descriptors to remain light
weight entitities but still have the same inode number
when the backing namespace is the same.

Acked-by: Serge E. Hallyn <serge.hallyn@ubuntu.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-11-20 04:19:19 -08:00
yan
17baa2a2c4 proc: use kzalloc instead of kmalloc and memset
Part of the memory will be written twice after this change, but that
should be negligible.

[akpm@linux-foundation.org: fix __proc_create() coding-style issues, remove unneeded zero-initialisations]
Signed-off-by: yan <clouds.yan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-10-06 03:05:18 +09:00