Commit graph

1497 commits

Author SHA1 Message Date
David Howells
4bacc9c923 overlayfs: Make f_path always point to the overlay and f_inode to the underlay
Make file->f_path always point to the overlay dentry so that the path in
/proc/pid/fd is correct and to ensure that label-based LSMs have access to the
overlay as well as the underlay (path-based LSMs probably don't need it).

Using my union testsuite to set things up, before the patch I see:

	[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
	[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
	...
	lr-x------. 1 root root 64 Jun  5 14:38 5 -> /a/foo107
	[root@andromeda union-testsuite]# stat /mnt/a/foo107
	...
	Device: 23h/35d Inode: 13381       Links: 1
	...
	[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
	...
	Device: 23h/35d Inode: 13381       Links: 1
	...

After the patch:

	[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
	[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
	...
	lr-x------. 1 root root 64 Jun  5 14:22 5 -> /mnt/a/foo107
	[root@andromeda union-testsuite]# stat /mnt/a/foo107
	...
	Device: 23h/35d Inode: 40346       Links: 1
	...
	[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
	...
	Device: 23h/35d Inode: 40346       Links: 1
	...

Note the change in where /proc/$$/fd/5 points to in the ls command.  It was
pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
(which is correct).

The inode accessed, however, is the lower layer.  The union layer is on device
25h/37d and the upper layer on 24h/36d.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-06-19 03:19:32 -04:00
Tejun Heo
46b15caa7c vfs, writeback: replace FS_CGROUP_WRITEBACK with SB_I_CGROUPWB
FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
cgroup writeback; however, different super_blocks of the same
file_system_type may or may not support cgroup writeback depending on
filesystem options.  This patch replaces FS_CGROUP_WRITEBACK with a
per-super_block flag.

super_block->s_flags carries some internal flags in the high bits but
it's exposd to userland through uapi header and running out of space
anyway.  This patch adds a new field super_block->s_iflags to carry
kernel-internal flags.  It is currently only used by the new
SB_I_CGROUPWB flag whose concatenated and abbreviated name is for
consistency with other super_block flags.

ext2_fill_super() is updated to set SB_I_CGROUPWB.

v2: Added super_block->s_iflags instead of stealing another high bit
    from sb->s_flags as suggested by Christoph and Jan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-17 12:47:39 -06:00
Dave Chinner
ce5c5d554d dax: expose __dax_fault for filesystems with locking constraints
Some filesystems cannot call dax_fault() directly because they have
different locking and/or allocation constraints in the page fault IO
path. To handle this, we need to follow the same model as the
generic block_page_mkwrite code, where the internals are exposed via
__block_page_mkwrite() so that filesystems can wrap the correct
locking and operations around the outside. 

This is loosely based on a patch originally from Matthew Willcox.
Unlike the original patch, it does not change ext4 code, error
returns or unwritten extent conversion handling.  It also adds a
__dax_mkwrite() wrapper for .page_mkwrite implementations to do the
right thing, too.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 09:18:18 +10:00
Dave Chinner
e842f29039 dax: don't abuse get_block mapping for endio callbacks
dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.

Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.

Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all it's reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
2015-06-04 09:18:18 +10:00
Tejun Heo
682aa8e1a6 writeback: implement unlocked_inode_to_wb transaction and use it for stat updates
The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place.  This patch build the
framework for the actual switching.

This patch adds a new inode flag I_WB_SWITCHING, which has two
functions.  First, the easy one, it ensures that there's only one
switching in progress for a give inode.  Second, it's used as a
mechanism to synchronize wb stat updates.

The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively.  As such, when an inode is moved from one wb to another,
the inode's portion of those stats have to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some
places.

This patch solves the problem in a similar way as memcg.  Each such
lockless stat updates are wrapped in transaction surrounded by
unlocked_inode_to_wb_begin/end().  During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
mapping->tree_lock is grabbed across the transaction.

In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
grace period to pass before actually starting to switch, which
guarantees that all stat update paths are synchronizing against
mapping->tree_lock.

This patch still doesn't implement the actual switching.

v3: Updated on top of the recent cancel_dirty_page() updates.
    unlocked_inode_to_wb_begin() now nests inside
    mem_cgroup_begin_page_stat() to match the locking order.

v2: The i_wb access transaction will be used for !stat accesses too.
    Function names and comments updated accordingly.

    s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
    s/switch_wb/switch_wbs/

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:40:20 -06:00
Tejun Heo
2a81490811 writeback: implement foreign cgroup inode detection
As concurrent write sharing of an inode is expected to be very rare
and memcg only tracks page ownership on first-use basis severely
confining the usefulness of such sharing, cgroup writeback tracks
ownership per-inode.  While the support for concurrent write sharing
of an inode is deemed unnecessary, an inode being written to by
different cgroups at different points in time is a lot more common,
and, more importantly, charging only by first-use can too readily lead
to grossly incorrect behaviors (single foreign page can lead to
gigabytes of writeback to be incorrectly attributed).

To resolve this issue, cgroup writeback detects the majority dirtier
of an inode and will transfer the ownership to it.  To avoid
unnnecessary oscillation, the detection mechanism keeps track of
history and gives out the switch verdict only if the foreign usage
pattern is stable over a certain amount of time and/or writeback
attempts.

The detection mechanism has fairly low space and computation overhead.
It adds 8 bytes to struct inode (one int and two u16's) and minimal
amount of calculation per IO.  The detection mechanism converges to
the correct answer usually in several seconds of IO time when there's
a clear majority dirtier.  Even when there isn't, it can reach an
acceptable answer fairly quickly under most circumstances.

Please see wb_detach_inode() for more details.

This patch only implements detection.  Following patches will
implement actual switching.

v2: wbc_account_io() now checks whether the wbc is associated with a
    wb before dereferencing it.  This can happen when pageout() is
    writing pages directly without going through the usual writeback
    path.  As pageout() path is single-threaded, we don't want it to
    be blocked behind a slow cgroup and ultimately want it to delegate
    actual writing to the usual writeback path.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:40:20 -06:00
Tejun Heo
52ebea749a writeback: make backing_dev_info host cgroup-specific bdi_writebacks
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback).  This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).

On the default hierarchy, blkcg implicitly enables memcg.  This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg.  This means that there may be multiple
wb's of a bdi mapped to the same blkcg.  As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state.  This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.

bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task.  Each cgwb is indexed on bdi->cgwb_tree
by its memcg id.  Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().

Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.

v3: inode_attach_wb() in account_page_dirtied() moved inside
    mapping_cap_account_dirty() block where it's known to be !NULL.
    Also, an unnecessary NULL check before kfree() removed.  Both
    detected by the kbuild bot.

v2: Updated so that wb association is per inode and wb is per memcg
    rather than blkcg.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:35 -06:00
Tejun Heo
89e9b9e07a writeback: add {CONFIG|BDI_CAP|FS}_CGROUP_WRITEBACK
cgroup writeback requires support from both bdi and filesystem sides.
Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default.  Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.

inode_cgwb_enabled() which determines whether a given inode's both bdi
and fs support cgroup writeback is added.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:35 -06:00
Tejun Heo
a212b105b0 bdi: make inode_to_bdi() inline
Now that bdi definitions are moved to backing-dev-defs.h,
backing-dev.h can include blkdev.h and inline inode_to_bdi() without
worrying about introducing circular include dependency.  The function
gets called from hot paths and fairly trivial.

This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the
function calls inline.  blockdev_superblock and noop_backing_dev_info
are EXPORT_GPL'd to allow the inline functions to be used from
modules.

While at it, make sb_is_blkdev_sb() return bool instead of int.

v2: Fixed typo in description as suggested by Jan.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-06-02 08:33:34 -06:00
Jarod Wilson
be32417796 block: export blkdev_reread_part() and __blkdev_reread_part()
This patch exports blkdev_reread_part() for block drivers, also
introduce __blkdev_reread_part().

For some drivers, such as loop, reread of partitions can be run
from the release path, and bd_mutex may already be held prior to
calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
__blkdev_reread_part for use in such cases.

CC: Christoph Hellwig <hch@lst.de>
CC: Jens Axboe <axboe@kernel.dk>
CC: Tejun Heo <tj@kernel.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Markus Pargmann <mpa@pengutronix.de>
CC: Stefan Weinhuber <wein@de.ibm.com>
CC: Stefan Haberland <stefan.haberland@de.ibm.com>
CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
CC: Fabian Frederick <fabf@skynet.be>
CC: Ming Lei <ming.lei@canonical.com>
CC: David Herrmann <dh.herrmann@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: nbd-general@lists.sourceforge.net
CC: linux-s390@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
2015-05-20 09:05:42 -06:00
Al Viro
89076bc319 get rid of assorted nameidata-related debris
pointless forward declarations, stale comments

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-15 01:10:37 -04:00
NeilBrown
8fa9dd2466 VFS/namei: make the use of touch_atime() in get_link() RCU-safe.
touch_atime is not RCU-safe, and so cannot be called on an RCU walk.
However, in situations where RCU-walk makes a difference, the symlink
will likely to accessed much more often than it is useful to update
the atime.

So split out the test of "Does the atime actually need to be updated"
into  atime_needs_update(), and have get_link() unlazy if it finds that
it will need to do that update.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-15 01:06:27 -04:00
Eric W. Biederman
1b852bceb0 mnt: Refactor the logic for mounting sysfs and proc in a user namespace
Fresh mounts of proc and sysfs are a very special case that works very
much like a bind mount.  Unfortunately the current structure can not
preserve the MNT_LOCK... mount flags.  Therefore refactor the logic
into a form that can be modified to preserve those lock bits.

Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
of the filesystem be fully visible in the current mount namespace,
before the filesystem may be mounted.

Move the logic for calling fs_fully_visible from proc and sysfs into
fs/namespace.c where it has greater access to mount namespace state.

Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
2015-05-13 21:44:11 -05:00
Al Viro
ecc087ff14 new helper: free_page_put_link()
similar to kfree_put_link()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-11 08:13:13 -04:00
Al Viro
5f2c4179e1 switch ->put_link() from dentry to inode
only one instance looks at that argument at all; that sole
exception wants inode rather than dentry.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-11 08:13:12 -04:00
Al Viro
6e77137b36 don't pass nameidata to ->follow_link()
its only use is getting passed to nd_jump_link(), which can obtain
it from current->nameidata

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:20:15 -04:00
Al Viro
680baacbca new ->follow_link() and ->put_link() calling conventions
a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body.  Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks.  Stored pointer
is ignored in all cases except the last one.

Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().

b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is.  In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:19:45 -04:00
Al Viro
61ba64fc07 libfs: simple_follow_link()
let "fast" symlinks store the pointer to the body into ->i_link and
use simple_follow_link for ->follow_link()

Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-05-10 22:18:20 -04:00
Linus Torvalds
9ec3a646fe Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull fourth vfs update from Al Viro:
 "d_inode() annotations from David Howells (sat in for-next since before
  the beginning of merge window) + four assorted fixes"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  RCU pathwalk breakage when running into a symlink overmounting something
  fix I_DIO_WAKEUP definition
  direct-io: only inc/dec inode->i_dio_count for file systems
  fs/9p: fix readdir()
  VFS: assorted d_backing_inode() annotations
  VFS: fs/inode.c helpers: d_inode() annotations
  VFS: fs/cachefiles: d_backing_inode() annotations
  VFS: fs library helpers: d_inode() annotations
  VFS: assorted weird filesystems: d_inode() annotations
  VFS: normal filesystems (and lustre): d_inode() annotations
  VFS: security/: d_inode() annotations
  VFS: security/: d_backing_inode() annotations
  VFS: net/: d_inode() annotations
  VFS: net/unix: d_backing_inode() annotations
  VFS: kernel/: d_inode() annotations
  VFS: audit: d_backing_inode() annotations
  VFS: Fix up some ->d_inode accesses in the chelsio driver
  VFS: Cachefiles should perform fs modifications on the top layer only
  VFS: AF_UNIX sockets should call mknod on the top layer only
2015-04-26 17:22:07 -07:00
Eric Sandeen
ac74d8d65c fix I_DIO_WAKEUP definition
I_DIO_WAKEUP is never directly used, but fix it up anyway.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24 15:45:34 -04:00
Jens Axboe
fe0f07d08e direct-io: only inc/dec inode->i_dio_count for file systems
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.

For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:

clat percentiles (usec):
 |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
 | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
 | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
 | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
 | 99.99th=[  165]

After:

clat percentiles (usec):
 |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
 | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
 | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
 | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
 | 99.99th=[  438]

In other setups, Robert Elliott reported seeing good performance
improvements:

https://lkml.org/lkml/2015/4/3/557

The more applications accessing the device, the worse it gets.

Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-24 15:45:28 -04:00
Linus Torvalds
54e514b91b Merge branch 'akpm' (patches from Andrew)
Merge third patchbomb from Andrew Morton:

 - various misc things

 - a couple of lib/ optimisations

 - provide DIV_ROUND_CLOSEST_ULL()

 - checkpatch updates

 - rtc tree

 - befs, nilfs2, hfs, hfsplus, fatfs, adfs, affs, bfs

 - ptrace fixes

 - fork() fixes

 - seccomp cleanups

 - more mmap_sem hold time reductions from Davidlohr

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (138 commits)
  proc: show locks in /proc/pid/fdinfo/X
  docs: add missing and new /proc/PID/status file entries, fix typos
  drivers/rtc/rtc-at91rm9200.c: make IO endian agnostic
  Documentation/spi/spidev_test.c: fix warning
  drivers/rtc/rtc-s5m.c: allow usage on device type different than main MFD type
  .gitignore: ignore *.tar
  MAINTAINERS: add Mediatek SoC mailing list
  tomoyo: reduce mmap_sem hold for mm->exe_file
  powerpc/oprofile: reduce mmap_sem hold for exe_file
  oprofile: reduce mmap_sem hold for mm->exe_file
  mips: ip32: add platform data hooks to use DS1685 driver
  lib/Kconfig: fix up HAVE_ARCH_BITREVERSE help text
  x86: switch to using asm-generic for seccomp.h
  sparc: switch to using asm-generic for seccomp.h
  powerpc: switch to using asm-generic for seccomp.h
  parisc: switch to using asm-generic for seccomp.h
  mips: switch to using asm-generic for seccomp.h
  microblaze: use asm-generic for seccomp.h
  arm: use asm-generic for seccomp.h
  seccomp: allow COMPAT sigreturn overrides
  ...
2015-04-17 09:04:38 -04:00
Andrey Vagin
6c8c90319c proc: show locks in /proc/pid/fdinfo/X
Let's show locks which are associated with a file descriptor in
its fdinfo file.

Currently we don't have a reliable way to determine who holds a lock.  We
can find some information in /proc/locks, but PID which is reported there
can be wrong.  For example, a process takes a lock, then forks a child and
dies.  In this case /proc/locks contains the parent pid, which can be
reused by another process.

$ cat /proc/locks
...
6: FLOCK  ADVISORY  WRITE 324 00:13:13431 0 EOF
...

$ ps -C rpcbind
  PID TTY          TIME CMD
  332 ?        00:00:00 rpcbind

$ cat /proc/332/fdinfo/4
pos:	0
flags:	0100000
mnt_id:	22
lock:	1: FLOCK  ADVISORY  WRITE 324 00:13:13431 0 EOF

$ ls -l /proc/332/fd/4
lr-x------ 1 root root 64 Mar  5 14:43 /proc/332/fd/4 -> /run/rpcbind.lock

$ ls -l /proc/324/fd/
total 0
lrwx------ 1 root root 64 Feb 27 14:50 0 -> /dev/pts/0
lrwx------ 1 root root 64 Feb 27 14:50 1 -> /dev/pts/0
lrwx------ 1 root root 64 Feb 27 14:49 2 -> /dev/pts/0

You can see that the process with the 324 pid doesn't hold the lock.

This information is required for proper dumping and restoring file
locks.

Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:12 -04:00
Konstantin Khlebnikov
90f31d0ea8 mm: rcu-protected get_mm_exe_file()
This patch removes mm->mmap_sem from mm->exe_file read side.
Also it kills dup_mm_exe_file() and moves exe_file duplication into
dup_mmap() where both mmap_sems are locked.

[akpm@linux-foundation.org: fix comment typo]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-17 09:04:07 -04:00
Linus Torvalds
4fc8adcfec Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull third hunk of vfs changes from Al Viro:
 "This contains the ->direct_IO() changes from Omar + saner
  generic_write_checks() + dealing with fcntl()/{read,write}() races
  (mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
  repeatedly looking at ->f_flags, which can be changed by fcntl(2),
  check ->ki_flags - which cannot) + infrastructure bits for dhowells'
  d_inode annotations + Christophs switch of /dev/loop to
  vfs_iter_write()"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
  block: loop: switch to VFS ITER_BVEC
  configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
  VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
  VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
  VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
  NFS: Don't use d_inode as a variable name
  VFS: Impose ordering on accesses of d_inode and d_flags
  VFS: Add owner-filesystem positive/negative dentry checks
  nfs: generic_write_checks() shouldn't be done on swapout...
  ocfs2: use __generic_file_write_iter()
  mirror O_APPEND and O_DIRECT into iocb->ki_flags
  switch generic_write_checks() to iocb and iter
  ocfs2: move generic_write_checks() before the alignment checks
  ocfs2_file_write_iter: stop messing with ppos
  udf_file_write_iter: reorder and simplify
  fuse: ->direct_IO() doesn't need generic_write_checks()
  ext4_file_write_iter: move generic_write_checks() up
  xfs_file_aio_write_checks: switch to iocb/iov_iter
  generic_write_checks(): drop isblk argument
  blkdev_write_iter: expand generic_file_checks() call in there
  ...
2015-04-16 23:27:56 -04:00
Linus Torvalds
eea3a00264 Merge branch 'akpm' (patches from Andrew)
Merge second patchbomb from Andrew Morton:

 - the rest of MM

 - various misc bits

 - add ability to run /sbin/reboot at reboot time

 - printk/vsprintf changes

 - fiddle with seq_printf() return value

* akpm: (114 commits)
  parisc: remove use of seq_printf return value
  lru_cache: remove use of seq_printf return value
  tracing: remove use of seq_printf return value
  cgroup: remove use of seq_printf return value
  proc: remove use of seq_printf return value
  s390: remove use of seq_printf return value
  cris fasttimer: remove use of seq_printf return value
  cris: remove use of seq_printf return value
  openrisc: remove use of seq_printf return value
  ARM: plat-pxa: remove use of seq_printf return value
  nios2: cpuinfo: remove use of seq_printf return value
  microblaze: mb: remove use of seq_printf return value
  ipc: remove use of seq_printf return value
  rtc: remove use of seq_printf return value
  power: wakeup: remove use of seq_printf return value
  x86: mtrr: if: remove use of seq_printf return value
  linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
  MAINTAINERS: CREDITS: remove Stefano Brivio from B43
  .mailmap: add Ricardo Ribalda
  CREDITS: add Ricardo Ribalda Delgado
  ...
2015-04-15 16:39:15 -07:00
Boaz Harrosh
0e3b210ce1 dax: use pfn_mkwrite to update c/mtime + freeze protection
From: Yigal Korman <yigal@plexistor.com>

[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.

A new xfstest is submitted for testing this (generic/080)

[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.

Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:20 -07:00
Zhang Zhen
f2b91d8d38 vfs: delete vfs_readdir function declaration
vfs_readdir() was replaced by iterate_dir() in commit 5c0ba4e076
("[readdir] introduce iterate_dir() and dir_context").

Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-04-15 16:35:16 -07:00
Linus Torvalds
6d50ff91d9 File locking related changes for v4.1 (pile #1)
-----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJVLoesAAoJEAAOaEEZVoIVJVoP/jVPt1TbEdA1wP/YO6dcrhtL
 wjaWjDLJDUZ6zhHYAYIZ0Psko1LOkZzIa31bqlGGdjBbwHAKLqipgwoIPsOmD0YZ
 Ja4apWH7zRSWdzmYNkvigMpHmYpFCLuU2y7+BzWLH/MlohdN7zWkI4RZNpVddE35
 m/MmIVlp8IPMbrPMR07/6mkKpqLY6FMxMEhEQvYfGzBiUyF6sJN9IdN7/4S4LK/7
 W6rN6ZSKiioAp4vOSaPkhOMoM+H2uiBemKl+ZnznlOkUwL9aw+R8O8+481VgwqQ5
 beDdZzOIW4UD/vA98AhXIP74z/zHLezVWDL5GNkMro7SZuUUJo0PzJDsuWbzNjYi
 nzMkbJp2SGmIKAUWqyKQZXOWft3KJ0x8eqPtd+kgYniQvjCPH7o1De8xRhQ5HNwf
 URJLkfBpJujjEewZ/gKldgcd6IJ6e5qhtg1L4ShNaV0jL6wtbDF4JLGOZkKb1wDB
 +QJ7lvsrqDo7EeEM06BzYcDF+Opp0IeXSR7RQzv5uYP2HZYR7angccuGVPJJqbbg
 IGSEhTGhokaiprvuE60SMPNBKXDW+gdhbN9z4XYfIL1brcV6dQzbm5RUR1duzSF1
 VYZxXECnDrDHXBqMUaF4su5E55Ivni4zK4B3j2WqYu72LMcjngr7cPqIL8U5KYTn
 nVC5Cz6nVZ0l7JoQ63p4
 =MPZM
 -----END PGP SIGNATURE-----

Merge tag 'locks-v4.1-1' of git://git.samba.org/jlayton/linux

Pull file locking related changes from Jeff Layton:
 "This set is mostly minor cleanups to the overhaul that went in last
  cycle.  The other noticeable items are the changes to the lm_get_owner
  and lm_put_owner prototypes, and the fact that we no longer need to
  use the i_lock to protect the i_flctx pointer"

* tag 'locks-v4.1-1' of git://git.samba.org/jlayton/linux:
  locks: use cmpxchg to assign i_flctx pointer
  locks: get rid of WE_CAN_BREAK_LSLK_NOW dead code
  locks: change lm_get_owner and lm_put_owner prototypes
  locks: don't allocate a lock context for an F_UNLCK request
  locks: Add lockdep assertion for blocked_lock_lock
  locks: remove extraneous IS_POSIX and IS_FLOCK tests
  locks: Remove unnecessary IS_POSIX test
2015-04-15 14:22:45 -07:00
Linus Torvalds
fa927894bb Merge branch 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull second vfs update from Al Viro:
 "Now that net-next went in...  Here's the next big chunk - killing
  ->aio_read() and ->aio_write().

  There'll be one more pile today (direct_IO changes and
  generic_write_checks() cleanups/fixes), but I'd prefer to keep that
  one separate"

* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
  ->aio_read and ->aio_write removed
  pcm: another weird API abuse
  infinibad: weird APIs switched to ->write_iter()
  kill do_sync_read/do_sync_write
  fuse: use iov_iter_get_pages() for non-splice path
  fuse: switch to ->read_iter/->write_iter
  switch drivers/char/mem.c to ->read_iter/->write_iter
  make new_sync_{read,write}() static
  coredump: accept any write method
  switch /dev/loop to vfs_iter_write()
  serial2002: switch to __vfs_read/__vfs_write
  ashmem: use __vfs_read()
  export __vfs_read()
  autofs: switch to __vfs_write()
  new helper: __vfs_write()
  switch hugetlbfs to ->read_iter()
  coda: switch to ->read_iter/->write_iter
  ncpfs: switch to ->read_iter/->write_iter
  net/9p: remove (now-)unused helpers
  p9_client_attach(): set fid->uid correctly
  ...
2015-04-15 13:22:56 -07:00
Linus Torvalds
ca2ec32658 Merge branch 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs update from Al Viro:
 "Part one:

   - struct filename-related cleanups

   - saner iov_iter_init() replacements (and switching the syscalls to
     use of those)

   - ntfs switch to ->write_iter() (Anton)

   - aio cleanups and splitting iocb into common and async parts
     (Christoph)

   - assorted fixes (me, bfields, Andrew Elble)

  There's a lot more, including the completion of switchover to
  ->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
  race fixes, etc, but that goes after #for-davem merge.  David has
  pulled it, and once it's in I'll send the next vfs pull request"

* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
  sg_start_req(): use import_iovec()
  sg_start_req(): make sure that there's not too many elements in iovec
  blk_rq_map_user(): use import_single_range()
  sg_io(): use import_iovec()
  process_vm_access: switch to {compat_,}import_iovec()
  switch keyctl_instantiate_key_common() to iov_iter
  switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
  aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
  vmsplice_to_user(): switch to import_iovec()
  kill aio_setup_single_vector()
  aio: simplify arguments of aio_setup_..._rw()
  aio: lift iov_iter_init() into aio_setup_..._rw()
  lift iov_iter into {compat_,}do_readv_writev()
  NFS: fix BUG() crash in notify_change() with patch to chown_common()
  dcache: return -ESTALE not -EBUSY on distributed fs race
  NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
  VFS: Add iov_iter_fault_in_multipages_readable()
  drop bogus check in file_open_root()
  switch security_inode_getattr() to struct path *
  constify tomoyo_realpath_from_path()
  ...
2015-04-14 15:31:03 -07:00
Al Viro
2ba48ce513 mirror O_APPEND and O_DIRECT into iocb->ki_flags
... avoiding write_iter/fcntl races.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:30:22 -04:00
Al Viro
3309dd04cb switch generic_write_checks() to iocb and iter
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success.  Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>

Conflicts:
	fs/ext4/file.c
2015-04-11 22:30:21 -04:00
Al Viro
dfea934575 Merge branch 'for-linus' into for-next 2015-04-11 22:29:51 -04:00
Al Viro
0fa6b005af generic_write_checks(): drop isblk argument
all remaining callers are passing 0; some just obscure that fact.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:48 -04:00
Omar Sandoval
22c6186ece direct_IO: remove rw from a_ops->direct_IO()
Now that no one is using rw, remove it completely.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:45 -04:00
Omar Sandoval
a95cd63115 Remove rw from dax_{do_,}io()
And use iov_iter_rw() instead.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:44 -04:00
Omar Sandoval
17f8c842d2 Remove rw from {,__,do_}blockdev_direct_IO()
Most filesystems call through to these at some point, so we'll start
here.

Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:44 -04:00
Al Viro
8436318205 ->aio_read and ->aio_write removed
no remaining users

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:43 -04:00
Al Viro
9a219bc70b kill do_sync_read/do_sync_write
all remaining instances of aio_{read,write} (all 4 of them) have explicit
->read and ->write resp.; do_sync_read/do_sync_write is never called by
__vfs_read/__vfs_write anymore and no other users had been left.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:42 -04:00
Al Viro
5d5d568975 make new_sync_{read,write}() static
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:40 -04:00
Al Viro
493c84c072 new helper: __vfs_write()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:29:37 -04:00
Al Viro
39c853ebfe Merge branch 'for-davem' into for-next 2015-04-11 22:27:19 -04:00
Al Viro
c0fec3a98b Merge branch 'iocb' into for-next 2015-04-11 22:24:41 -04:00
Al Viro
fd2f7cb5bc kill struct filename.separate
just make const char iname[] the last member and compare name->name with
name->iname instead of checking name->separate

We need to make sure that out-of-line name doesn't end up allocated adjacent
to struct filename refering to it; fortunately, it's easy to achieve - just
allocate that struct filename with one byte in ->iname[], so that ->iname[0]
will be inside the same object and thus have an address different from that
of out-of-line name [spotted by Boqun Feng <boqun.feng@gmail.com>]

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-11 22:21:24 -04:00
Al Viro
237dae8890 Merge branch 'iocb' into for-davem
trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-09 00:01:38 -04:00
Al Viro
b2edffdd91 fix mremap() vs. ioctx_kill() race
teach ->mremap() method to return an error and have it fail for
aio mappings in process of being killed

Note that in case of ->mremap() failure we need to undo move_page_tables()
we'd already done; we could call ->mremap() first, but then the failure of
move_page_tables() would require undoing whatever _successful_ ->mremap()
has done, which would be a lot more headache in general.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-04-06 17:50:59 -04:00
Jeff Layton
cae80b305e locks: change lm_get_owner and lm_put_owner prototypes
The current prototypes for these operations are somewhat awkward as they
deal with fl_owners but take struct file_lock arguments. In the future,
we'll want to be able to take references without necessarily dealing
with a struct file_lock.

Change them to take fl_owner_t arguments instead and have the callers
deal with assigning the values to the file_lock structs.

Signed-off-by: Jeff Layton <jlayton@primarydata.com>
2015-04-03 09:04:04 -04:00
Christoph Hellwig
e2e40f2c1e fs: move struct kiocb to fs.h
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2015-03-25 20:28:11 -04:00
Theodore Ts'o
a2f4870697 fs: make sure the timestamps for lazytime inodes eventually get written
Jan Kara pointed out that if there is an inode which is constantly
getting dirtied with I_DIRTY_PAGES, an inode with an updated timestamp
will never be written since inode->dirtied_when is constantly getting
updated.  We fix this by adding an extra field to the inode,
dirtied_time_when, so inodes with a stale dirtytime can get detected
and handled.

In addition, if we have a dirtytime inode caused by an atime update,
and there is no write activity on the file system, we need to have a
secondary system to make sure these inodes get written out.  We do
this by setting up a second delayed work structure which wakes up the
CPU much more rarely compared to writeback_expire_centisecs.

Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
2015-03-17 12:23:19 -04:00