Make file->f_path always point to the overlay dentry so that the path in
/proc/pid/fd is correct and to ensure that label-based LSMs have access to the
overlay as well as the underlay (path-based LSMs probably don't need it).
Using my union testsuite to set things up, before the patch I see:
[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
...
lr-x------. 1 root root 64 Jun 5 14:38 5 -> /a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
...
Device: 23h/35d Inode: 13381 Links: 1
...
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
...
Device: 23h/35d Inode: 13381 Links: 1
...
After the patch:
[root@andromeda union-testsuite]# bash 5</mnt/a/foo107
[root@andromeda union-testsuite]# ls -l /proc/$$/fd/
...
lr-x------. 1 root root 64 Jun 5 14:22 5 -> /mnt/a/foo107
[root@andromeda union-testsuite]# stat /mnt/a/foo107
...
Device: 23h/35d Inode: 40346 Links: 1
...
[root@andromeda union-testsuite]# stat -L /proc/$$/fd/5
...
Device: 23h/35d Inode: 40346 Links: 1
...
Note the change in where /proc/$$/fd/5 points to in the ls command. It was
pointing to /a/foo107 (which doesn't exist) and now points to /mnt/a/foo107
(which is correct).
The inode accessed, however, is the lower layer. The union layer is on device
25h/37d and the upper layer on 24h/36d.
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
FS_CGROUP_WRITEBACK indicates whether a file_system_type supports
cgroup writeback; however, different super_blocks of the same
file_system_type may or may not support cgroup writeback depending on
filesystem options. This patch replaces FS_CGROUP_WRITEBACK with a
per-super_block flag.
super_block->s_flags carries some internal flags in the high bits but
it's exposd to userland through uapi header and running out of space
anyway. This patch adds a new field super_block->s_iflags to carry
kernel-internal flags. It is currently only used by the new
SB_I_CGROUPWB flag whose concatenated and abbreviated name is for
consistency with other super_block flags.
ext2_fill_super() is updated to set SB_I_CGROUPWB.
v2: Added super_block->s_iflags instead of stealing another high bit
from sb->s_flags as suggested by Christoph and Jan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@fb.com>
Some filesystems cannot call dax_fault() directly because they have
different locking and/or allocation constraints in the page fault IO
path. To handle this, we need to follow the same model as the
generic block_page_mkwrite code, where the internals are exposed via
__block_page_mkwrite() so that filesystems can wrap the correct
locking and operations around the outside.
This is loosely based on a patch originally from Matthew Willcox.
Unlike the original patch, it does not change ext4 code, error
returns or unwritten extent conversion handling. It also adds a
__dax_mkwrite() wrapper for .page_mkwrite implementations to do the
right thing, too.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
dax_fault() currently relies on the get_block callback to attach an
io completion callback to the mapping buffer head so that it can
run unwritten extent conversion after zeroing allocated blocks.
Instead of this hack, pass the conversion callback directly into
dax_fault() similar to the get_block callback. When the filesystem
allocates unwritten extents, it will set the buffer_unwritten()
flag, and hence the dax_fault code can call the completion function
in the contexts where it is necessary without overloading the
mapping buffer head.
Note: The changes to ext4 to use this interface are suspect at best.
In fact, the way ext4 did this end_io assignment in the first place
looks suspect because it only set a completion callback when there
wasn't already some other write() call taking place on the same
inode. The ext4 end_io code looks rather intricate and fragile with
all it's reference counting and passing to different contexts for
modification via inode private pointers that aren't protected by
locks...
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <david@fromorbit.com>
The mechanism for detecting whether an inode should switch its wb
(bdi_writeback) association is now in place. This patch build the
framework for the actual switching.
This patch adds a new inode flag I_WB_SWITCHING, which has two
functions. First, the easy one, it ensures that there's only one
switching in progress for a give inode. Second, it's used as a
mechanism to synchronize wb stat updates.
The two stats, WB_RECLAIMABLE and WB_WRITEBACK, aren't event counters
but track the current number of dirty pages and pages under writeback
respectively. As such, when an inode is moved from one wb to another,
the inode's portion of those stats have to be transferred together;
unfortunately, this is a bit tricky as those stat updates are percpu
operations which are performed without holding any lock in some
places.
This patch solves the problem in a similar way as memcg. Each such
lockless stat updates are wrapped in transaction surrounded by
unlocked_inode_to_wb_begin/end(). During normal operation, they map
to rcu_read_lock/unlock(); however, if I_WB_SWITCHING is asserted,
mapping->tree_lock is grabbed across the transaction.
In turn, the switching path sets I_WB_SWITCHING and waits for a RCU
grace period to pass before actually starting to switch, which
guarantees that all stat update paths are synchronizing against
mapping->tree_lock.
This patch still doesn't implement the actual switching.
v3: Updated on top of the recent cancel_dirty_page() updates.
unlocked_inode_to_wb_begin() now nests inside
mem_cgroup_begin_page_stat() to match the locking order.
v2: The i_wb access transaction will be used for !stat accesses too.
Function names and comments updated accordingly.
s/inode_wb_stat_unlocked_{begin|end}/unlocked_inode_to_wb_{begin|end}/
s/switch_wb/switch_wbs/
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
As concurrent write sharing of an inode is expected to be very rare
and memcg only tracks page ownership on first-use basis severely
confining the usefulness of such sharing, cgroup writeback tracks
ownership per-inode. While the support for concurrent write sharing
of an inode is deemed unnecessary, an inode being written to by
different cgroups at different points in time is a lot more common,
and, more importantly, charging only by first-use can too readily lead
to grossly incorrect behaviors (single foreign page can lead to
gigabytes of writeback to be incorrectly attributed).
To resolve this issue, cgroup writeback detects the majority dirtier
of an inode and will transfer the ownership to it. To avoid
unnnecessary oscillation, the detection mechanism keeps track of
history and gives out the switch verdict only if the foreign usage
pattern is stable over a certain amount of time and/or writeback
attempts.
The detection mechanism has fairly low space and computation overhead.
It adds 8 bytes to struct inode (one int and two u16's) and minimal
amount of calculation per IO. The detection mechanism converges to
the correct answer usually in several seconds of IO time when there's
a clear majority dirtier. Even when there isn't, it can reach an
acceptable answer fairly quickly under most circumstances.
Please see wb_detach_inode() for more details.
This patch only implements detection. Following patches will
implement actual switching.
v2: wbc_account_io() now checks whether the wbc is associated with a
wb before dereferencing it. This can happen when pageout() is
writing pages directly without going through the usual writeback
path. As pageout() path is single-threaded, we don't want it to
be blocked behind a slow cgroup and ultimately want it to delegate
actual writing to the usual writeback path.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Cc: Greg Thelen <gthelen@google.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
For the planned cgroup writeback support, on each bdi
(backing_dev_info), each memcg will be served by a separate wb
(bdi_writeback). This patch updates bdi so that a bdi can host
multiple wbs (bdi_writebacks).
On the default hierarchy, blkcg implicitly enables memcg. This allows
using memcg's page ownership for attributing writeback IOs, and every
memcg - blkcg combination can be served by its own wb by assigning a
dedicated wb to each memcg. This means that there may be multiple
wb's of a bdi mapped to the same blkcg. As congested state is per
blkcg - bdi combination, those wb's should share the same congested
state. This is achieved by tracking congested state via
bdi_writeback_congested structs which are keyed by blkcg.
bdi->wb remains unchanged and will keep serving the root cgroup.
cgwb's (cgroup wb's) for non-root cgroups are created on-demand or
looked up while dirtying an inode according to the memcg of the page
being dirtied or current task. Each cgwb is indexed on bdi->cgwb_tree
by its memcg id. Once an inode is associated with its wb, it can be
retrieved using inode_to_wb().
Currently, none of the filesystems has FS_CGROUP_WRITEBACK and all
pages will keep being associated with bdi->wb.
v3: inode_attach_wb() in account_page_dirtied() moved inside
mapping_cap_account_dirty() block where it's known to be !NULL.
Also, an unnecessary NULL check before kfree() removed. Both
detected by the kbuild bot.
v2: Updated so that wb association is per inode and wb is per memcg
rather than blkcg.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: kbuild test robot <fengguang.wu@intel.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
cgroup writeback requires support from both bdi and filesystem sides.
Add BDI_CAP_CGROUP_WRITEBACK and FS_CGROUP_WRITEBACK to indicate
support and enable BDI_CAP_CGROUP_WRITEBACK on block based bdi's by
default. Also, define CONFIG_CGROUP_WRITEBACK which is enabled if
both MEMCG and BLK_CGROUP are enabled.
inode_cgwb_enabled() which determines whether a given inode's both bdi
and fs support cgroup writeback is added.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@fb.com>
Now that bdi definitions are moved to backing-dev-defs.h,
backing-dev.h can include blkdev.h and inline inode_to_bdi() without
worrying about introducing circular include dependency. The function
gets called from hot paths and fairly trivial.
This patch makes inode_to_bdi() and sb_is_blkdev_sb() that the
function calls inline. blockdev_superblock and noop_backing_dev_info
are EXPORT_GPL'd to allow the inline functions to be used from
modules.
While at it, make sb_is_blkdev_sb() return bool instead of int.
v2: Fixed typo in description as suggested by Jan.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@fb.com>
This patch exports blkdev_reread_part() for block drivers, also
introduce __blkdev_reread_part().
For some drivers, such as loop, reread of partitions can be run
from the release path, and bd_mutex may already be held prior to
calling ioctl_by_bdev(bdev, BLKRRPART, 0), so introduce
__blkdev_reread_part for use in such cases.
CC: Christoph Hellwig <hch@lst.de>
CC: Jens Axboe <axboe@kernel.dk>
CC: Tejun Heo <tj@kernel.org>
CC: Alexander Viro <viro@zeniv.linux.org.uk>
CC: Markus Pargmann <mpa@pengutronix.de>
CC: Stefan Weinhuber <wein@de.ibm.com>
CC: Stefan Haberland <stefan.haberland@de.ibm.com>
CC: Sebastian Ott <sebott@linux.vnet.ibm.com>
CC: Fabian Frederick <fabf@skynet.be>
CC: Ming Lei <ming.lei@canonical.com>
CC: David Herrmann <dh.herrmann@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Peter Zijlstra <peterz@infradead.org>
CC: nbd-general@lists.sourceforge.net
CC: linux-s390@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: Ming Lei <ming.lei@canonical.com>
Signed-off-by: Jens Axboe <axboe@fb.com>
touch_atime is not RCU-safe, and so cannot be called on an RCU walk.
However, in situations where RCU-walk makes a difference, the symlink
will likely to accessed much more often than it is useful to update
the atime.
So split out the test of "Does the atime actually need to be updated"
into atime_needs_update(), and have get_link() unlazy if it finds that
it will need to do that update.
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Fresh mounts of proc and sysfs are a very special case that works very
much like a bind mount. Unfortunately the current structure can not
preserve the MNT_LOCK... mount flags. Therefore refactor the logic
into a form that can be modified to preserve those lock bits.
Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
of the filesystem be fully visible in the current mount namespace,
before the filesystem may be mounted.
Move the logic for calling fs_fully_visible from proc and sysfs into
fs/namespace.c where it has greater access to mount namespace state.
Cc: stable@vger.kernel.org
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
a) instead of storing the symlink body (via nd_set_link()) and returning
an opaque pointer later passed to ->put_link(), ->follow_link() _stores_
that opaque pointer (into void * passed by address by caller) and returns
the symlink body. Returning ERR_PTR() on error, NULL on jump (procfs magic
symlinks) and pointer to symlink body for normal symlinks. Stored pointer
is ignored in all cases except the last one.
Storing NULL for opaque pointer (or not storing it at all) means no call
of ->put_link().
b) the body used to be passed to ->put_link() implicitly (via nameidata).
Now only the opaque pointer is. In the cases when we used the symlink body
to free stuff, ->follow_link() now should store it as opaque pointer in addition
to returning it.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
let "fast" symlinks store the pointer to the body into ->i_link and
use simple_follow_link for ->follow_link()
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Pull fourth vfs update from Al Viro:
"d_inode() annotations from David Howells (sat in for-next since before
the beginning of merge window) + four assorted fixes"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
RCU pathwalk breakage when running into a symlink overmounting something
fix I_DIO_WAKEUP definition
direct-io: only inc/dec inode->i_dio_count for file systems
fs/9p: fix readdir()
VFS: assorted d_backing_inode() annotations
VFS: fs/inode.c helpers: d_inode() annotations
VFS: fs/cachefiles: d_backing_inode() annotations
VFS: fs library helpers: d_inode() annotations
VFS: assorted weird filesystems: d_inode() annotations
VFS: normal filesystems (and lustre): d_inode() annotations
VFS: security/: d_inode() annotations
VFS: security/: d_backing_inode() annotations
VFS: net/: d_inode() annotations
VFS: net/unix: d_backing_inode() annotations
VFS: kernel/: d_inode() annotations
VFS: audit: d_backing_inode() annotations
VFS: Fix up some ->d_inode accesses in the chelsio driver
VFS: Cachefiles should perform fs modifications on the top layer only
VFS: AF_UNIX sockets should call mknod on the top layer only
I_DIO_WAKEUP is never directly used, but fix it up anyway.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
do_blockdev_direct_IO() increments and decrements the inode
->i_dio_count for each IO operation. It does this to protect against
truncate of a file. Block devices don't need this sort of protection.
For a capable multiqueue setup, this atomic int is the only shared
state between applications accessing the device for O_DIRECT, and it
presents a scaling wall for that. In my testing, as much as 30% of
system time is spent incrementing and decrementing this value. A mixed
read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
better latencies too. Before:
clat percentiles (usec):
| 1.00th=[ 33], 5.00th=[ 34], 10.00th=[ 34], 20.00th=[ 34],
| 30.00th=[ 34], 40.00th=[ 34], 50.00th=[ 35], 60.00th=[ 35],
| 70.00th=[ 35], 80.00th=[ 35], 90.00th=[ 37], 95.00th=[ 80],
| 99.00th=[ 98], 99.50th=[ 151], 99.90th=[ 155], 99.95th=[ 155],
| 99.99th=[ 165]
After:
clat percentiles (usec):
| 1.00th=[ 95], 5.00th=[ 108], 10.00th=[ 129], 20.00th=[ 149],
| 30.00th=[ 155], 40.00th=[ 161], 50.00th=[ 167], 60.00th=[ 171],
| 70.00th=[ 177], 80.00th=[ 185], 90.00th=[ 201], 95.00th=[ 270],
| 99.00th=[ 390], 99.50th=[ 398], 99.90th=[ 418], 99.95th=[ 422],
| 99.99th=[ 438]
In other setups, Robert Elliott reported seeing good performance
improvements:
https://lkml.org/lkml/2015/4/3/557
The more applications accessing the device, the worse it gets.
Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
do_blockdev_direct_IO() that it need not worry about incrementing
or decrementing the inode i_dio_count for this caller.
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Theodore Ts'o <tytso@mit.edu>
Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Merge third patchbomb from Andrew Morton:
- various misc things
- a couple of lib/ optimisations
- provide DIV_ROUND_CLOSEST_ULL()
- checkpatch updates
- rtc tree
- befs, nilfs2, hfs, hfsplus, fatfs, adfs, affs, bfs
- ptrace fixes
- fork() fixes
- seccomp cleanups
- more mmap_sem hold time reductions from Davidlohr
* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (138 commits)
proc: show locks in /proc/pid/fdinfo/X
docs: add missing and new /proc/PID/status file entries, fix typos
drivers/rtc/rtc-at91rm9200.c: make IO endian agnostic
Documentation/spi/spidev_test.c: fix warning
drivers/rtc/rtc-s5m.c: allow usage on device type different than main MFD type
.gitignore: ignore *.tar
MAINTAINERS: add Mediatek SoC mailing list
tomoyo: reduce mmap_sem hold for mm->exe_file
powerpc/oprofile: reduce mmap_sem hold for exe_file
oprofile: reduce mmap_sem hold for mm->exe_file
mips: ip32: add platform data hooks to use DS1685 driver
lib/Kconfig: fix up HAVE_ARCH_BITREVERSE help text
x86: switch to using asm-generic for seccomp.h
sparc: switch to using asm-generic for seccomp.h
powerpc: switch to using asm-generic for seccomp.h
parisc: switch to using asm-generic for seccomp.h
mips: switch to using asm-generic for seccomp.h
microblaze: use asm-generic for seccomp.h
arm: use asm-generic for seccomp.h
seccomp: allow COMPAT sigreturn overrides
...
Let's show locks which are associated with a file descriptor in
its fdinfo file.
Currently we don't have a reliable way to determine who holds a lock. We
can find some information in /proc/locks, but PID which is reported there
can be wrong. For example, a process takes a lock, then forks a child and
dies. In this case /proc/locks contains the parent pid, which can be
reused by another process.
$ cat /proc/locks
...
6: FLOCK ADVISORY WRITE 324 00:13:13431 0 EOF
...
$ ps -C rpcbind
PID TTY TIME CMD
332 ? 00:00:00 rpcbind
$ cat /proc/332/fdinfo/4
pos: 0
flags: 0100000
mnt_id: 22
lock: 1: FLOCK ADVISORY WRITE 324 00:13:13431 0 EOF
$ ls -l /proc/332/fd/4
lr-x------ 1 root root 64 Mar 5 14:43 /proc/332/fd/4 -> /run/rpcbind.lock
$ ls -l /proc/324/fd/
total 0
lrwx------ 1 root root 64 Feb 27 14:50 0 -> /dev/pts/0
lrwx------ 1 root root 64 Feb 27 14:50 1 -> /dev/pts/0
lrwx------ 1 root root 64 Feb 27 14:49 2 -> /dev/pts/0
You can see that the process with the 324 pid doesn't hold the lock.
This information is required for proper dumping and restoring file
locks.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Acked-by: Jeff Layton <jlayton@poochiereds.net>
Acked-by: "J. Bruce Fields" <bfields@fieldses.org>
Acked-by: Cyrill Gorcunov <gorcunov@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
This patch removes mm->mmap_sem from mm->exe_file read side.
Also it kills dup_mm_exe_file() and moves exe_file duplication into
dup_mmap() where both mmap_sems are locked.
[akpm@linux-foundation.org: fix comment typo]
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Pull third hunk of vfs changes from Al Viro:
"This contains the ->direct_IO() changes from Omar + saner
generic_write_checks() + dealing with fcntl()/{read,write}() races
(mirroring O_APPEND/O_DIRECT into iocb->ki_flags and instead of
repeatedly looking at ->f_flags, which can be changed by fcntl(2),
check ->ki_flags - which cannot) + infrastructure bits for dhowells'
d_inode annotations + Christophs switch of /dev/loop to
vfs_iter_write()"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (30 commits)
block: loop: switch to VFS ITER_BVEC
configfs: Fix inconsistent use of file_inode() vs file->f_path.dentry->d_inode
VFS: Make pathwalk use d_is_reg() rather than S_ISREG()
VFS: Fix up debugfs to use d_is_dir() in place of S_ISDIR()
VFS: Combine inode checks with d_is_negative() and d_is_positive() in pathwalk
NFS: Don't use d_inode as a variable name
VFS: Impose ordering on accesses of d_inode and d_flags
VFS: Add owner-filesystem positive/negative dentry checks
nfs: generic_write_checks() shouldn't be done on swapout...
ocfs2: use __generic_file_write_iter()
mirror O_APPEND and O_DIRECT into iocb->ki_flags
switch generic_write_checks() to iocb and iter
ocfs2: move generic_write_checks() before the alignment checks
ocfs2_file_write_iter: stop messing with ppos
udf_file_write_iter: reorder and simplify
fuse: ->direct_IO() doesn't need generic_write_checks()
ext4_file_write_iter: move generic_write_checks() up
xfs_file_aio_write_checks: switch to iocb/iov_iter
generic_write_checks(): drop isblk argument
blkdev_write_iter: expand generic_file_checks() call in there
...
Merge second patchbomb from Andrew Morton:
- the rest of MM
- various misc bits
- add ability to run /sbin/reboot at reboot time
- printk/vsprintf changes
- fiddle with seq_printf() return value
* akpm: (114 commits)
parisc: remove use of seq_printf return value
lru_cache: remove use of seq_printf return value
tracing: remove use of seq_printf return value
cgroup: remove use of seq_printf return value
proc: remove use of seq_printf return value
s390: remove use of seq_printf return value
cris fasttimer: remove use of seq_printf return value
cris: remove use of seq_printf return value
openrisc: remove use of seq_printf return value
ARM: plat-pxa: remove use of seq_printf return value
nios2: cpuinfo: remove use of seq_printf return value
microblaze: mb: remove use of seq_printf return value
ipc: remove use of seq_printf return value
rtc: remove use of seq_printf return value
power: wakeup: remove use of seq_printf return value
x86: mtrr: if: remove use of seq_printf return value
linux/bitmap.h: improve BITMAP_{LAST,FIRST}_WORD_MASK
MAINTAINERS: CREDITS: remove Stefano Brivio from B43
.mailmap: add Ricardo Ribalda
CREDITS: add Ricardo Ribalda Delgado
...
From: Yigal Korman <yigal@plexistor.com>
[v1]
Without this patch, c/mtime is not updated correctly when mmap'ed page is
first read from and then written to.
A new xfstest is submitted for testing this (generic/080)
[v2]
Jan Kara has pointed out that if we add the
sb_start/end_pagefault pair in the new pfn_mkwrite we
are then fixing another bug where: A user could start
writing to the page while filesystem is frozen.
Signed-off-by: Yigal Korman <yigal@plexistor.com>
Signed-off-by: Boaz Harrosh <boaz@plexistor.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Matthew Wilcox <matthew.r.wilcox@intel.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
vfs_readdir() was replaced by iterate_dir() in commit 5c0ba4e076
("[readdir] introduce iterate_dir() and dir_context").
Signed-off-by: Zhang Zhen <zhenzhang.zhang@huawei.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1
iQIcBAABAgAGBQJVLoesAAoJEAAOaEEZVoIVJVoP/jVPt1TbEdA1wP/YO6dcrhtL
wjaWjDLJDUZ6zhHYAYIZ0Psko1LOkZzIa31bqlGGdjBbwHAKLqipgwoIPsOmD0YZ
Ja4apWH7zRSWdzmYNkvigMpHmYpFCLuU2y7+BzWLH/MlohdN7zWkI4RZNpVddE35
m/MmIVlp8IPMbrPMR07/6mkKpqLY6FMxMEhEQvYfGzBiUyF6sJN9IdN7/4S4LK/7
W6rN6ZSKiioAp4vOSaPkhOMoM+H2uiBemKl+ZnznlOkUwL9aw+R8O8+481VgwqQ5
beDdZzOIW4UD/vA98AhXIP74z/zHLezVWDL5GNkMro7SZuUUJo0PzJDsuWbzNjYi
nzMkbJp2SGmIKAUWqyKQZXOWft3KJ0x8eqPtd+kgYniQvjCPH7o1De8xRhQ5HNwf
URJLkfBpJujjEewZ/gKldgcd6IJ6e5qhtg1L4ShNaV0jL6wtbDF4JLGOZkKb1wDB
+QJ7lvsrqDo7EeEM06BzYcDF+Opp0IeXSR7RQzv5uYP2HZYR7angccuGVPJJqbbg
IGSEhTGhokaiprvuE60SMPNBKXDW+gdhbN9z4XYfIL1brcV6dQzbm5RUR1duzSF1
VYZxXECnDrDHXBqMUaF4su5E55Ivni4zK4B3j2WqYu72LMcjngr7cPqIL8U5KYTn
nVC5Cz6nVZ0l7JoQ63p4
=MPZM
-----END PGP SIGNATURE-----
Merge tag 'locks-v4.1-1' of git://git.samba.org/jlayton/linux
Pull file locking related changes from Jeff Layton:
"This set is mostly minor cleanups to the overhaul that went in last
cycle. The other noticeable items are the changes to the lm_get_owner
and lm_put_owner prototypes, and the fact that we no longer need to
use the i_lock to protect the i_flctx pointer"
* tag 'locks-v4.1-1' of git://git.samba.org/jlayton/linux:
locks: use cmpxchg to assign i_flctx pointer
locks: get rid of WE_CAN_BREAK_LSLK_NOW dead code
locks: change lm_get_owner and lm_put_owner prototypes
locks: don't allocate a lock context for an F_UNLCK request
locks: Add lockdep assertion for blocked_lock_lock
locks: remove extraneous IS_POSIX and IS_FLOCK tests
locks: Remove unnecessary IS_POSIX test
Pull second vfs update from Al Viro:
"Now that net-next went in... Here's the next big chunk - killing
->aio_read() and ->aio_write().
There'll be one more pile today (direct_IO changes and
generic_write_checks() cleanups/fixes), but I'd prefer to keep that
one separate"
* 'for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
->aio_read and ->aio_write removed
pcm: another weird API abuse
infinibad: weird APIs switched to ->write_iter()
kill do_sync_read/do_sync_write
fuse: use iov_iter_get_pages() for non-splice path
fuse: switch to ->read_iter/->write_iter
switch drivers/char/mem.c to ->read_iter/->write_iter
make new_sync_{read,write}() static
coredump: accept any write method
switch /dev/loop to vfs_iter_write()
serial2002: switch to __vfs_read/__vfs_write
ashmem: use __vfs_read()
export __vfs_read()
autofs: switch to __vfs_write()
new helper: __vfs_write()
switch hugetlbfs to ->read_iter()
coda: switch to ->read_iter/->write_iter
ncpfs: switch to ->read_iter/->write_iter
net/9p: remove (now-)unused helpers
p9_client_attach(): set fid->uid correctly
...
Pull vfs update from Al Viro:
"Part one:
- struct filename-related cleanups
- saner iov_iter_init() replacements (and switching the syscalls to
use of those)
- ntfs switch to ->write_iter() (Anton)
- aio cleanups and splitting iocb into common and async parts
(Christoph)
- assorted fixes (me, bfields, Andrew Elble)
There's a lot more, including the completion of switchover to
->{read,write}_iter(), d_inode/d_backing_inode annotations, f_flags
race fixes, etc, but that goes after #for-davem merge. David has
pulled it, and once it's in I'll send the next vfs pull request"
* 'for-linus-1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (35 commits)
sg_start_req(): use import_iovec()
sg_start_req(): make sure that there's not too many elements in iovec
blk_rq_map_user(): use import_single_range()
sg_io(): use import_iovec()
process_vm_access: switch to {compat_,}import_iovec()
switch keyctl_instantiate_key_common() to iov_iter
switch {compat_,}do_readv_writev() to {compat_,}import_iovec()
aio_setup_vectored_rw(): switch to {compat_,}import_iovec()
vmsplice_to_user(): switch to import_iovec()
kill aio_setup_single_vector()
aio: simplify arguments of aio_setup_..._rw()
aio: lift iov_iter_init() into aio_setup_..._rw()
lift iov_iter into {compat_,}do_readv_writev()
NFS: fix BUG() crash in notify_change() with patch to chown_common()
dcache: return -ESTALE not -EBUSY on distributed fs race
NTFS: Version 2.1.32 - Update file write from aio_write to write_iter.
VFS: Add iov_iter_fault_in_multipages_readable()
drop bogus check in file_open_root()
switch security_inode_getattr() to struct path *
constify tomoyo_realpath_from_path()
...
... returning -E... upon error and amount of data left in iter after
(possible) truncation upon success. Note, that normal case gives
a non-zero (positive) return value, so any tests for != 0 _must_ be
updated.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Conflicts:
fs/ext4/file.c
Most filesystems call through to these at some point, so we'll start
here.
Signed-off-by: Omar Sandoval <osandov@osandov.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
all remaining instances of aio_{read,write} (all 4 of them) have explicit
->read and ->write resp.; do_sync_read/do_sync_write is never called by
__vfs_read/__vfs_write anymore and no other users had been left.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
All places outside of core VFS that checked ->read and ->write for being NULL or
called the methods directly are gone now, so NULL {read,write} with non-NULL
{read,write}_iter will do the right thing in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
just make const char iname[] the last member and compare name->name with
name->iname instead of checking name->separate
We need to make sure that out-of-line name doesn't end up allocated adjacent
to struct filename refering to it; fortunately, it's easy to achieve - just
allocate that struct filename with one byte in ->iname[], so that ->iname[0]
will be inside the same object and thus have an address different from that
of out-of-line name [spotted by Boqun Feng <boqun.feng@gmail.com>]
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
trivial conflict in net/socket.c and non-trivial one in crypto -
that one had evaded aio_complete() removal.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
teach ->mremap() method to return an error and have it fail for
aio mappings in process of being killed
Note that in case of ->mremap() failure we need to undo move_page_tables()
we'd already done; we could call ->mremap() first, but then the failure of
move_page_tables() would require undoing whatever _successful_ ->mremap()
has done, which would be a lot more headache in general.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
The current prototypes for these operations are somewhat awkward as they
deal with fl_owners but take struct file_lock arguments. In the future,
we'll want to be able to take references without necessarily dealing
with a struct file_lock.
Change them to take fl_owner_t arguments instead and have the callers
deal with assigning the values to the file_lock structs.
Signed-off-by: Jeff Layton <jlayton@primarydata.com>
struct kiocb now is a generic I/O container, so move it to fs.h.
Also do a #include diet for aio.h while we're at it.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Jan Kara pointed out that if there is an inode which is constantly
getting dirtied with I_DIRTY_PAGES, an inode with an updated timestamp
will never be written since inode->dirtied_when is constantly getting
updated. We fix this by adding an extra field to the inode,
dirtied_time_when, so inodes with a stale dirtytime can get detected
and handled.
In addition, if we have a dirtytime inode caused by an atime update,
and there is no write activity on the file system, we need to have a
secondary system to make sure these inodes get written out. We do
this by setting up a second delayed work structure which wakes up the
CPU much more rarely compared to writeback_expire_centisecs.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>