Commit graph

34080 commits

Author SHA1 Message Date
Linus Torvalds
6116dea80d kgdb patches for 5.8-rc3
The main change here is a fix for a number of unsafe interactions
 between kdb and the console system. The fixes are specific to kdb (pure
 kgdb debugging does not use the console system at all). On systems with
 an NMI then kdb, if it is enabled, must get messages to the user despite
 potentially running from some "difficult" calling contexts. These fixes
 avoid using the console system where we have been provided an
 alternative (safer) way to interact with the user and, if using the
 console system in unavoidable, use oops_in_progress for deadlock
 avoidance. These fixes also ensure kdb honours the console enable flag.
 
 Also included is a fix that wraps kgdb trap handling in an RCU read lock
 to avoids triggering diagnostic warnings. This is a wide lock scope but
 this is OK because kgdb is a stop-the-world debugger. When we stop the
 world we put all the CPUs into holding pens and this inhibits RCU update
 anyway.
 
 Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEELzVBU1D3lWq6cKzwfOMlXTn3iKEFAl72C5AACgkQfOMlXTn3
 iKFfmA//SCJU7zJsrKTsr6+HJY+gIuwHm70aGCNIr3EjBgTZQHQYflG6msmMHTAX
 d4qnGSkfKzC8jYJrHPpX4eU3bnqYci6GnaT/N5p9YkTGHun+kYYTz3wLzZiWxKRg
 iE4QLEwjU/dGAYyRz0CKCTRNTLTG+R79HWLL2Wi5OQiNhYiPuFAgS/NSUjpnJIuf
 fmj8jSPP/7T/m0cEUWXbLwTfolEZLIa1heqtaJq4fAftPsAk5a5TZ0NugaxUPoo4
 YS06eASIZoVcDQiehVy+gH05FyEjJGXnkFtTkAoRL/yOERKLy0WMzFZAAh6NT4St
 16Hx3Nnw+7ds7Iq8jEIpM/XJo1d3haYvAQdzy6HakAOwp7vrD/CjF45wwju78woY
 Jq54Vjvaxjaw1vlJCVrAAjdj3bAHdufBeWrBGmYO8F1HSn9eNeLS7wWbq6lEhxNd
 ObXRUFwebzYpOT6DI2TdnDg/2+xAn2oXpzk4UK9I/Vbxew8R4lOPQm4vC0V3CTME
 cHXFGV3ncjXlVRKdMAmnYcN7pMY4NCdX5vGqC/djQRwKRV1Ve8jwUCFVKRAd4zio
 wHpCFziwSaz9giZJ5I831EKsvSj9DVoPPJFgoEXIzIWF3OS0qzP6UqO2HwJNbA+e
 W4laVRzdBcMuVVa+7XWYzdAhof0hNX0Ov78dyDMcX1MkOS02O7o=
 =ovnT
 -----END PGP SIGNATURE-----

Merge tag 'kgdb-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux

Pull kgdb fixes from Daniel Thompson:
 "The main change here is a fix for a number of unsafe interactions
  between kdb and the console system. The fixes are specific to kdb
  (pure kgdb debugging does not use the console system at all). On
  systems with an NMI then kdb, if it is enabled, must get messages to
  the user despite potentially running from some "difficult" calling
  contexts. These fixes avoid using the console system where we have
  been provided an alternative (safer) way to interact with the user
  and, if using the console system in unavoidable, use oops_in_progress
  for deadlock avoidance. These fixes also ensure kdb honours the
  console enable flag.

  Also included is a fix that wraps kgdb trap handling in an RCU read
  lock to avoids triggering diagnostic warnings. This is a wide lock
  scope but this is OK because kgdb is a stop-the-world debugger. When
  we stop the world we put all the CPUs into holding pens and this
  inhibits RCU update anyway"

* tag 'kgdb-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/danielt/linux:
  kgdb: Avoid suspicious RCU usage warning
  kdb: Switch to use safer dbg_io_ops over console APIs
  kdb: Make kdb_printf() console handling more robust
  kdb: Check status of console prior to invoking handlers
  kdb: Re-factor kdb_printf() message write code
2020-06-27 08:53:49 -07:00
Linus Torvalds
ed3e00e7d6 Power management fixes for 5.8-rc3
- Make sure that the _TIF_POLLING_NRFLAG is clear before entering
    the last phase of suspend-to-idle to avoid wakeup issues on some
    x86 systems (Chen Yu, Rafael Wysocki).
 
  - Cover one more case in which the intel_pstate driver should let
    the platform firmware control the CPU frequency and refuse to
    load (Srinivas Pandruvada).
 
  - Add __init annotations to 2 functions in the power management
    core (Christophe JAILLET).
 -----BEGIN PGP SIGNATURE-----
 
 iQJGBAABCAAwFiEE4fcc61cGeeHD/fCwgsRv/nhiVHEFAl72FGESHHJqd0Byand5
 c29ja2kubmV0AAoJEILEb/54YlRxt44QAKk+ojYFCLVHz+mwniaNBRROegZrDPFS
 u9QiH4Qth5QdG6TFXJF2nhcZmkvyh/W3AXFj8px1XWanBfG8nhPb0c+wRHGIbACV
 /R7ykcQuj/h7ZlCjqevFAaUbCSw4Lt5oldIk+YyVUGrZH+uulyNOABm+cxFT6XaD
 KKnJ4hbbgnriQaMxmee+jYNphDsaTBhWWCkGj/V5x4DEyvWihkjf9skDEJ4/aP4O
 09Ug8FB1P0AxMTdaoKNfrIae27oPAb74HQOe92UbOMA/Q/k1snRr7vxGVak2/VyH
 /KHxCElM+F5F3kM6LMp4C9jtVMrcJq+1ABUmh0z2jBAw5MxGcdvjKihw6+W8Z9gC
 j7r4gDvIkXDGHlLSaWQC8otARs8gMmGdZJu+Tl8yzo7ZK7InFMAsNMBH/iOPzycQ
 9UMOpIkbDrOP2zeukULAFIuh1ow/dQopnY0Hyi0Gfezb1lTXe3rFzjPAnTRantyo
 z+c+o01pkbiiP+fBG3aUYjzkUv3wpm3cyBAlzr17wIUs2cfIo7WGMQuGQ8ZqoL0T
 QMpY5OxtzJJlvTvn9UF3oxlYbNFGq68fNnrdQ4u1zHm+K1P2mh2Dko6viHrceOyA
 DOMHuUP1Yrc7sneqGYkNbfhAfjdGeS4FTDJOpx9fgasTG1IYmk+hLOuKTi8Vttjj
 Fx6nxbo1sECm
 =Reud
 -----END PGP SIGNATURE-----

Merge tag 'pm-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm

Pull power management fixes from Rafael Wysocki:
 "These fix a recent regression that broke suspend-to-idle on some x86
  systems, fix the intel_pstate driver to correctly let the platform
  firmware control CPU performance in some cases and add __init
  annotations to a couple of functions.

  Specifics:

   - Make sure that the _TIF_POLLING_NRFLAG is clear before entering the
     last phase of suspend-to-idle to avoid wakeup issues on some x86
     systems (Chen Yu, Rafael Wysocki).

   - Cover one more case in which the intel_pstate driver should let the
     platform firmware control the CPU frequency and refuse to load
     (Srinivas Pandruvada).

   - Add __init annotations to 2 functions in the power management core
     (Christophe JAILLET)"

* tag 'pm-5.8-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
  cpuidle: Rearrange s2idle-specific idle state entry code
  PM: sleep: core: mark 2 functions as __init to save some memory
  cpufreq: intel_pstate: Add one more OOB control bit
  PM: s2idle: Clear _TIF_POLLING_NRFLAG before suspend to idle
2020-06-26 12:32:11 -07:00
Linus Torvalds
7c902e2730 Merge branch 'akpm' (patches from Andrew)
Merge misx fixes from Andrew Morton:
 "31 patches.

  Subsystems affected by this patch series: hotfixes, mm/pagealloc,
  kexec, ocfs2, lib, mm/slab, mm/slab, mm/slub, mm/swap, mm/pagemap,
  mm/vmalloc, mm/memcg, mm/gup, mm/thp, mm/vmscan, x86,
  mm/memory-hotplug, MAINTAINERS"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (31 commits)
  MAINTAINERS: update info for sparse
  mm/memory_hotplug.c: fix false softlockup during pfn range removal
  mm: remove vmalloc_exec
  arm64: use PAGE_KERNEL_ROX directly in alloc_insn_page
  x86/hyperv: allocate the hypercall page with only read and execute bits
  mm/memory: fix IO cost for anonymous page
  mm/swap: fix for "mm: workingset: age nonresident information alongside anonymous pages"
  mm: workingset: age nonresident information alongside anonymous pages
  doc: THP CoW fault no longer allocate THP
  docs: mm/gup: minor documentation update
  mm/memcontrol.c: prevent missed memory.low load tears
  mm/memcontrol.c: add missed css_put()
  mm: memcontrol: handle div0 crash race condition in memory.low
  mm/vmalloc.c: fix a warning while make xmldocs
  media: omap3isp: remove cacheflush.h
  make asm-generic/cacheflush.h more standalone
  mm/debug_vm_pgtable: fix build failure with powerpc 8xx
  mm/memory.c: properly pte_offset_map_lock/unlock in vm_insert_pages()
  mm: fix swap cache node allocation mask
  slub: cure list_slab_objects() from double fix
  ...
2020-06-26 12:19:36 -07:00
Mauro Carvalho Chehab
985098a05e docs: fix references for DMA*.txt files
As we moved those files to core-api, fix references to point
to their newer locations.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Link: https://lore.kernel.org/r/37b2fd159fbc7655dbf33b3eb1215396a25f6344.1592895969.git.mchehab+huawei@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-06-26 10:01:32 -06:00
Douglas Anderson
440ab9e10e kgdb: Avoid suspicious RCU usage warning
At times when I'm using kgdb I see a splat on my console about
suspicious RCU usage.  I managed to come up with a case that could
reproduce this that looked like this:

  WARNING: suspicious RCU usage
  5.7.0-rc4+ #609 Not tainted
  -----------------------------
  kernel/pid.c:395 find_task_by_pid_ns() needs rcu_read_lock() protection!

  other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
  3 locks held by swapper/0/1:
   #0: ffffff81b6b8e988 (&dev->mutex){....}-{3:3}, at: __device_attach+0x40/0x13c
   #1: ffffffd01109e9e8 (dbg_master_lock){....}-{2:2}, at: kgdb_cpu_enter+0x20c/0x7ac
   #2: ffffffd01109ea90 (dbg_slave_lock){....}-{2:2}, at: kgdb_cpu_enter+0x3ec/0x7ac

  stack backtrace:
  CPU: 7 PID: 1 Comm: swapper/0 Not tainted 5.7.0-rc4+ #609
  Hardware name: Google Cheza (rev3+) (DT)
  Call trace:
   dump_backtrace+0x0/0x1b8
   show_stack+0x1c/0x24
   dump_stack+0xd4/0x134
   lockdep_rcu_suspicious+0xf0/0x100
   find_task_by_pid_ns+0x5c/0x80
   getthread+0x8c/0xb0
   gdb_serial_stub+0x9d4/0xd04
   kgdb_cpu_enter+0x284/0x7ac
   kgdb_handle_exception+0x174/0x20c
   kgdb_brk_fn+0x24/0x30
   call_break_hook+0x6c/0x7c
   brk_handler+0x20/0x5c
   do_debug_exception+0x1c8/0x22c
   el1_sync_handler+0x3c/0xe4
   el1_sync+0x7c/0x100
   rpmh_rsc_probe+0x38/0x420
   platform_drv_probe+0x94/0xb4
   really_probe+0x134/0x300
   driver_probe_device+0x68/0x100
   __device_attach_driver+0x90/0xa8
   bus_for_each_drv+0x84/0xcc
   __device_attach+0xb4/0x13c
   device_initial_probe+0x18/0x20
   bus_probe_device+0x38/0x98
   device_add+0x38c/0x420

If I understand properly we should just be able to blanket kgdb under
one big RCU read lock and the problem should go away.  We'll add it to
the beast-of-a-function known as kgdb_cpu_enter().

With this I no longer get any splats and things seem to work fine.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
Link: https://lore.kernel.org/r/20200602154729.v2.1.I70e0d4fd46d5ed2aaf0c98a355e8e1b7a5bb7e4e@changeid
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-26 15:41:40 +01:00
Sumit Garg
5946d1f5b3 kdb: Switch to use safer dbg_io_ops over console APIs
In kgdb context, calling console handlers aren't safe due to locks used
in those handlers which could in turn lead to a deadlock. Although, using
oops_in_progress increases the chance to bypass locks in most console
handlers but it might not be sufficient enough in case a console uses
more locks (VT/TTY is good example).

Currently when a driver provides both polling I/O and a console then kdb
will output using the console. We can increase robustness by using the
currently active polling I/O driver (which should be lockless) instead
of the corresponding console. For several common cases (e.g. an
embedded system with a single serial port that is used both for console
output and debugger I/O) this will result in no console handler being
used.

In order to achieve this we need to reverse the order of preference to
use dbg_io_ops (uses polling I/O mode) over console APIs. So we just
store "struct console" that represents debugger I/O in dbg_io_ops and
while emitting kdb messages, skip console that matches dbg_io_ops
console in order to avoid duplicate messages. After this change,
"is_console" param becomes redundant and hence removed.

Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/1591264879-25920-5-git-send-email-sumit.garg@linaro.org
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-26 15:40:16 +01:00
Christoph Hellwig
7a0e27b2a0 mm: remove vmalloc_exec
Merge vmalloc_exec into its only caller.  Note that for !CONFIG_MMU
__vmalloc_node_range maps to __vmalloc, which directly clears the
__GFP_HIGHMEM added by the vmalloc_exec stub anyway.

Link: http://lkml.kernel.org/r/20200618064307.32739-4-hch@lst.de
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Hildenbrand <david@redhat.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dexuan Cui <decui@microsoft.com>
Cc: Jessica Yu <jeyu@kernel.org>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Will Deacon <will@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:38 -07:00
Lianbo Jiang
fd7af71be5 kexec: do not verify the signature without the lockdown or mandatory signature
Signature verification is an important security feature, to protect
system from being attacked with a kernel of unknown origin.  Kexec
rebooting is a way to replace the running kernel, hence need be secured
carefully.

In the current code of handling signature verification of kexec kernel,
the logic is very twisted.  It mixes signature verification, IMA
signature appraising and kexec lockdown.

If there is no KEXEC_SIG_FORCE, kexec kernel image doesn't have one of
signature, the supported crypto, and key, we don't think this is wrong,
Unless kexec lockdown is executed.  IMA is considered as another kind of
signature appraising method.

If kexec kernel image has signature/crypto/key, it has to go through the
signature verification and pass.  Otherwise it's seen as verification
failure, and won't be loaded.

Seems kexec kernel image with an unqualified signature is even worse
than those w/o signature at all, this sounds very unreasonable.  E.g.
If people get a unsigned kernel to load, or a kernel signed with expired
key, which one is more dangerous?

So, here, let's simplify the logic to improve code readability.  If the
KEXEC_SIG_FORCE enabled or kexec lockdown enabled, signature
verification is mandated.  Otherwise, we lift the bar for any kernel
image.

Link: http://lkml.kernel.org/r/20200602045952.27487-1-lijiang@redhat.com
Signed-off-by: Lianbo Jiang <lijiang@redhat.com>
Reviewed-by: Jiri Bohac <jbohac@suse.cz>
Acked-by: Dave Young <dyoung@redhat.com>
Acked-by: Baoquan He <bhe@redhat.com>
Cc: James Morris <jmorris@namei.org>
Cc: Matthew Garrett <mjg59@google.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-26 00:27:36 -07:00
Jan Kara
f3bdc62fd8 blktrace: Provide event for request merging
Currently blk-mq does not report any event when two requests get merged
in the elevator. This then results in difficult to understand sequence
of events like:

...
  8,0   34     1579     0.608765271  2718  I  WS 215023504 + 40 [dbench]
  8,0   34     1584     0.609184613  2719  A  WS 215023544 + 56 <- (8,4) 2160568
  8,0   34     1585     0.609184850  2719  Q  WS 215023544 + 56 [dbench]
  8,0   34     1586     0.609188524  2719  G  WS 215023544 + 56 [dbench]
  8,0    3      602     0.609684162   773  D  WS 215023504 + 96 [kworker/3:1H]
  8,0   34     1591     0.609843593     0  C  WS 215023504 + 96 [0]

and you can only guess (after quite some headscratching since the above
excerpt is intermixed with a lot of other IO) that request 215023544+56
got merged to request 215023504+40. Provide proper event for request
merging like we used to do in the legacy block layer.

Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-25 21:06:11 -06:00
Linus Torvalds
4a21185cda Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Don't insert ESP trailer twice in IPSEC code, from Huy Nguyen.

 2) The default crypto algorithm selection in Kconfig for IPSEC is out
    of touch with modern reality, fix this up. From Eric Biggers.

 3) bpftool is missing an entry for BPF_MAP_TYPE_RINGBUF, from Andrii
    Nakryiko.

 4) Missing init of ->frame_sz in xdp_convert_zc_to_xdp_frame(), from
    Hangbin Liu.

 5) Adjust packet alignment handling in ax88179_178a driver to match
    what the hardware actually does. From Jeremy Kerr.

 6) register_netdevice can leak in the case one of the notifiers fail,
    from Yang Yingliang.

 7) Use after free in ip_tunnel_lookup(), from Taehee Yoo.

 8) VLAN checks in sja1105 DSA driver need adjustments, from Vladimir
    Oltean.

 9) tg3 driver can sleep forever when we get enough EEH errors, fix from
    David Christensen.

10) Missing {READ,WRITE}_ONCE() annotations in various Intel ethernet
    drivers, from Ciara Loftus.

11) Fix scanning loop break condition in of_mdiobus_register(), from
    Florian Fainelli.

12) MTU limit is incorrect in ibmveth driver, from Thomas Falcon.

13) Endianness fix in mlxsw, from Ido Schimmel.

14) Use after free in smsc95xx usbnet driver, from Tuomas Tynkkynen.

15) Missing bridge mrp configuration validation, from Horatiu Vultur.

16) Fix circular netns references in wireguard, from Jason A. Donenfeld.

17) PTP initialization on recovery is not done properly in qed driver,
    from Alexander Lobakin.

18) Endian conversion of L4 ports in filters of cxgb4 driver is wrong,
    from Rahul Lakkireddy.

19) Don't clear bound device TX queue of socket prematurely otherwise we
    get problems with ktls hw offloading, from Tariq Toukan.

20) ipset can do atomics on unaligned memory, fix from Russell King.

21) Align ethernet addresses properly in bridging code, from Thomas
    Martitz.

22) Don't advertise ipv4 addresses on SCTP sockets having ipv6only set,
    from Marcelo Ricardo Leitner.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (149 commits)
  rds: transport module should be auto loaded when transport is set
  sch_cake: fix a few style nits
  sch_cake: don't call diffserv parsing code when it is not needed
  sch_cake: don't try to reallocate or unshare skb unconditionally
  ethtool: fix error handling in linkstate_prepare_data()
  wil6210: account for napi_gro_receive never returning GRO_DROP
  hns: do not cast return value of napi_gro_receive to null
  socionext: account for napi_gro_receive never returning GRO_DROP
  wireguard: receive: account for napi_gro_receive never returning GRO_DROP
  vxlan: fix last fdb index during dump of fdb with nhid
  sctp: Don't advertise IPv4 addresses if ipv6only is set on the socket
  tc-testing: avoid action cookies with odd length.
  bpf: tcp: bpf_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
  tcp_cubic: fix spurious HYSTART_DELAY exit upon drop in min RTT
  net: dsa: sja1105: fix tc-gate schedule with single element
  net: dsa: sja1105: recalculate gating subschedule after deleting tc-gate rules
  net: dsa: sja1105: unconditionally free old gating config
  net: dsa: sja1105: move sja1105_compose_gating_subschedule at the top
  net: macb: free resources on failure path of at91ether_open()
  net: macb: call pm_runtime_put_sync on failure path
  ...
2020-06-25 18:27:40 -07:00
Linus Torvalds
42e9c85f5c tracing: Four small fixes
- Fixed a ringbuffer bug for nested events having time go backwards
  - Fix a config dependency for boot time tracing to depend on synthetic
    events instead of histograms.
  - Fix trigger format parsing to handle multiple spaces
  - Fix bootconfig to handle failures in multiple events
 -----BEGIN PGP SIGNATURE-----
 
 iIoEABYIADIWIQRRSw7ePDh/lE+zeZMp5XQQmuv6qgUCXvUjBBQccm9zdGVkdEBn
 b29kbWlzLm9yZwAKCRAp5XQQmuv6qmLiAQD47/1T01ilYeXqJ+EG235aeQssvRa7
 RSmIAoMP+V6kHQD9G2RjnWkb3BcrdNk9zoi0LpnuMl95m5OuaMzE4PPO+ws=
 =Zbx8
 -----END PGP SIGNATURE-----

Merge tag 'trace-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace

Pull tracing fixes from Steven Rostedt:
 "Four small fixes:

   - Fix a ringbuffer bug for nested events having time go backwards

   - Fix a config dependency for boot time tracing to depend on
     synthetic events instead of histograms.

   - Fix trigger format parsing to handle multiple spaces

   - Fix bootconfig to handle failures in multiple events"

* tag 'trace-v5.8-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing/boottime: Fix kprobe multiple events
  tracing: Fix event trigger to accept redundant spaces
  tracing/boot: Fix config dependency for synthedic event
  ring-buffer: Zero out time extend if it is nested and not absolute
2020-06-25 16:16:49 -07:00
Weilong Chen
c17d1a3a8e fork: annotate data race in copy_process()
KCSAN reported data race reading and writing nr_threads and max_threads.
The data race is intentional and benign. This is obvious from the comment
above it and based on general consensus when discussing this issue. So
there's no need for any heavy atomic or *_ONCE() machinery here.

In accordance with the newly introduced data_race() annotation consensus,
mark the offending line with data_race(). Here it's actually useful not
just to silence KCSAN but to also clearly communicate that the race is
intentional. This is especially helpful since nr_threads is otherwise
protected by tasklist_lock.

BUG: KCSAN: data-race in copy_process / copy_process

write to 0xffffffff86205cf8 of 4 bytes by task 14779 on cpu 1:
  copy_process+0x2eba/0x3c40 kernel/fork.c:2273
  _do_fork+0xfe/0x7a0 kernel/fork.c:2421
  __do_sys_clone kernel/fork.c:2576 [inline]
  __se_sys_clone kernel/fork.c:2557 [inline]
  __x64_sys_clone+0x130/0x170 kernel/fork.c:2557
  do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

read to 0xffffffff86205cf8 of 4 bytes by task 6944 on cpu 0:
  copy_process+0x94d/0x3c40 kernel/fork.c:1954
  _do_fork+0xfe/0x7a0 kernel/fork.c:2421
  __do_sys_clone kernel/fork.c:2576 [inline]
  __se_sys_clone kernel/fork.c:2557 [inline]
  __x64_sys_clone+0x130/0x170 kernel/fork.c:2557
  do_syscall_64+0xcc/0x3a0 arch/x86/entry/common.c:294
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Link: https://groups.google.com/forum/#!msg/syzkaller-upstream-mo
deration/thvp7AHs5Ew/aPdYLXfYBQAJ

Reported-by: syzbot+52fced2d288f8ecd2b20@syzkaller.appspotmail.com
Signed-off-by: Zefan Li <lizefan@huawei.com>
Signed-off-by: Weilong Chen <chenweilong@huawei.com>
Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Cc: Marco Elver <elver@google.com>
[christian.brauner@ubuntu.com: rewrite commit message]
Link: https://lore.kernel.org/r/20200623041240.154294-1-chenweilong@huawei.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
2020-06-26 01:05:29 +02:00
Peter Zijlstra
b58e733fd7 rcu: Fixup noinstr warnings
A KCSAN build revealed we have explicit annoations through atomic_*()
usage, switch to arch_atomic_*() for the respective functions.

vmlinux.o: warning: objtool: rcu_nmi_exit()+0x4d: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_enter()+0x25: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_nmi_enter()+0x4f: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: rcu_dynticks_eqs_exit()+0x2a: call to __kcsan_check_access() leaves .noinstr.text section
vmlinux.o: warning: objtool: __rcu_is_watching()+0x25: call to __kcsan_check_access() leaves .noinstr.text section

Additionally, without the NOP in instrumentation_begin(), objtool would
not detect the lack of the 'else instrumentation_begin();' branch in
rcu_nmi_enter().

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
2020-06-25 08:24:32 -07:00
John Fastabend
a9b59159d3 bpf: Do not allow btf_ctx_access with __int128 types
To ensure btf_ctx_access() is safe the verifier checks that the BTF
arg type is an int, enum, or pointer. When the function does the
BTF arg lookup it uses the calculation 'arg = off / 8'  using the
fact that registers are 8B. This requires that the first arg is
in the first reg, the second in the second, and so on. However,
for __int128 the arg will consume two registers by default LLVM
implementation. So this will cause the arg layout assumed by the
'arg = off / 8' calculation to be incorrect.

Because __int128 is uncommon this patch applies the easiest fix and
will force int types to be sizeof(u64) or smaller so that they will
fit in a single register.

v2: remove unneeded parens per Andrii's feedback

Fixes: 9e15db6613 ("bpf: Implement accurate raw_tp context access via BTF")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/159303723962.11287.13309537171132420717.stgit@john-Precision-5820-Tower
2020-06-25 16:17:05 +02:00
Andy Shevchenko
504603767a console: Fix trivia typo 'change' -> 'chance'
I bet the word 'chance' has to be used in 'had a chance to be called',
but, alas, I'm not native speaker...

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200618164751.56828-7-andriy.shevchenko@linux.intel.com
2020-06-25 14:24:54 +02:00
Andy Shevchenko
bba18a1af3 console: Propagate error code from console ->setup()
Since console ->setup() hook returns meaningful error codes,
propagate it to the caller of try_enable_new_console().

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20200618164751.56828-6-andriy.shevchenko@linux.intel.com
2020-06-25 14:24:07 +02:00
Rafael J. Wysocki
10e8b11eb3 cpuidle: Rearrange s2idle-specific idle state entry code
Implement call_cpuidle_s2idle() in analogy with call_cpuidle()
for the s2idle-specific idle state entry and invoke it from
cpuidle_idle_call() to make the s2idle-specific idle entry code
path look more similar to the "regular" idle entry one.

No intentional functional impact.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Chen Yu <yu.c.chen@intel.com>
2020-06-25 13:52:53 +02:00
Peng Wang
423d02e146 sched/fair: Optimize dequeue_task_fair()
While looking at enqueue_task_fair and dequeue_task_fair, it occurred
to me that dequeue_task_fair can also be optimized as Vincent described
in commit 7d148be69e ("sched/fair: Optimize enqueue_task_fair()").

When encountering throttled cfs_rq, dequeue_throttle label can ensure
se not to be NULL, and rq->nr_running remains unchanged, so we can also
skip the early balance check.

Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/701eef9a40de93dcf5fe7063fd607bca5db38e05.1592287263.git.rocking@linux.alibaba.com
2020-06-25 13:45:44 +02:00
Kirill Tkhai
aa93cd53bc sched: Micro optimization in pick_next_task() and in check_preempt_curr()
This introduces an optimization based on xxx_sched_class addresses
in two hot scheduler functions: pick_next_task() and check_preempt_curr().

It is possible to compare pointers to sched classes to check, which
of them has a higher priority, instead of current iterations using
for_each_class().

One more result of the patch is that size of object file becomes a little
less (excluding added BUG_ON(), which goes in __init section):

$size kernel/sched/core.o
         text     data      bss	    dec	    hex	filename
before:  66446    18957	    676	  86079	  1503f	kernel/sched/core.o
after:   66398    18957	    676	  86031	  1500f	kernel/sched/core.o

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/711a9c4b-ff32-1136-b848-17c622d548f3@yandex.ru
2020-06-25 13:45:44 +02:00
Steven Rostedt (VMware)
a87e749e8f sched: Remove struct sched_class::next field
Now that the sched_class descriptors are defined in order via the linker
script vmlinux.lds.h, there's no reason to have a "next" pointer to the
previous priroity structure. The order of the sturctures can be aligned as
an array, and used to index and find the next sched_class descriptor.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191219214558.845353593@goodmis.org
2020-06-25 13:45:44 +02:00
Steven Rostedt (VMware)
c3a340f7e7 sched: Have sched_class_highest define by vmlinux.lds.h
Now that the sched_class descriptors are defined by the linker script, and
this needs to be aware of the existance of stop_sched_class when SMP is
enabled or not, as it is used as the "highest" priority when defined. Move
the declaration of sched_class_highest to the same location in the linker
script that inserts stop_sched_class, and this will also make it easier to
see what should be defined as the highest class, as this linker script
location defines the priorities as well.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20191219214558.682913590@goodmis.org
2020-06-25 13:45:44 +02:00
Steven Rostedt (VMware)
590d697963 sched: Force the address order of each sched class descriptor
In order to make a micro optimization in pick_next_task(), the order of the
sched class descriptor address must be in the same order as their priority
to each other. That is:

 &idle_sched_class < &fair_sched_class < &rt_sched_class <
 &dl_sched_class < &stop_sched_class

In order to guarantee this order of the sched class descriptors, add each
one into their own data section and force the order in the linker script.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lore.kernel.org/r/157675913272.349305.8936736338884044103.stgit@localhost.localdomain
2020-06-25 13:45:43 +02:00
Sumit Garg
2a78b85b70 kdb: Make kdb_printf() console handling more robust
While rounding up CPUs via NMIs, its possible that a rounded up CPU
maybe holding a console port lock leading to kgdb master CPU stuck in
a deadlock during invocation of console write operations. A similar
deadlock could also be possible while using synchronous breakpoints.

So in order to avoid such a deadlock, set oops_in_progress to encourage
the console drivers to disregard their internal spin locks: in the
current calling context the risk of deadlock is a bigger problem than
risks due to re-entering the console driver. We operate directly on
oops_in_progress rather than using bust_spinlocks() because the calls
bust_spinlocks() makes on exit are not appropriate for this calling
context.

Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1591264879-25920-4-git-send-email-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25 12:04:30 +01:00
Sumit Garg
e8857288bb kdb: Check status of console prior to invoking handlers
Check if a console is enabled prior to invoking corresponding write
handler.

Suggested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1591264879-25920-3-git-send-email-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25 12:04:29 +01:00
Sumit Garg
9d71b344f8 kdb: Re-factor kdb_printf() message write code
Re-factor kdb_printf() message write code in order to avoid duplication
of code and thereby increase readability.

Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/1591264879-25920-2-git-send-email-sumit.garg@linaro.org
Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
2020-06-25 12:04:29 +01:00
Yonghong Song
0d4fad3e57 bpf: Add bpf_skc_to_udp6_sock() helper
The helper is used in tracing programs to cast a socket
pointer to a udp6_sock pointer.
The return value could be NULL if the casting is illegal.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200623230815.3988481-1-yhs@fb.com
2020-06-24 18:37:59 -07:00
Yonghong Song
478cfbdf5f bpf: Add bpf_skc_to_{tcp, tcp_timewait, tcp_request}_sock() helpers
Three more helpers are added to cast a sock_common pointer to
an tcp_sock, tcp_timewait_sock or a tcp_request_sock for
tracing programs.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230811.3988277-1-yhs@fb.com
2020-06-24 18:37:59 -07:00
Yonghong Song
af7ec13833 bpf: Add bpf_skc_to_tcp6_sock() helper
The helper is used in tracing programs to cast a socket
pointer to a tcp6_sock pointer.
The return value could be NULL if the casting is illegal.

A new helper return type RET_PTR_TO_BTF_ID_OR_NULL is added
so the verifier is able to deduce proper return types for the helper.

Different from the previous BTF_ID based helpers,
the bpf_skc_to_tcp6_sock() argument can be several possible
btf_ids. More specifically, all possible socket data structures
with sock_common appearing in the first in the memory layout.
This patch only added socket types related to tcp and udp.

All possible argument btf_id and return value btf_id
for helper bpf_skc_to_tcp6_sock() are pre-calculcated and
cached. In the future, it is even possible to precompute
these btf_id's at kernel build time.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230809.3988195-1-yhs@fb.com
2020-06-24 18:37:59 -07:00
Yonghong Song
72e2b2b66f bpf: Allow tracing programs to use bpf_jiffies64() helper
/proc/net/tcp{4,6} uses jiffies for various computations.
Let us add bpf_jiffies64() helper to tracing program
so bpf_iter and other programs can use it.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230808.3988073-1-yhs@fb.com
2020-06-24 18:37:58 -07:00
Yonghong Song
c06b022957 bpf: Support 'X' in bpf_seq_printf() helper
'X' tells kernel to print hex with upper case letters.
/proc/net/tcp{4,6} seq_file show() used this, and
supports it in bpf_seq_printf() helper too.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200623230807.3988014-1-yhs@fb.com
2020-06-24 18:37:58 -07:00
Linus Torvalds
fbb58011fd for-linus-2020-06-24
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXvNBJgAKCRCRxhvAZXjc
 oulGAPoCPfCguA8TPcy4tq4byGPoThyO4XnWR6XcUDOEzhbzzAEA+s5S7iRV8W92
 p2gzbI4Kncq4dQNEtUvfPHQZDAEwTA0=
 =eZDz
 -----END PGP SIGNATURE-----

Merge tag 'for-linus-2020-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread fix from Christian Brauner:
 "This fixes a regression introduced with 303cc571d1 ("nsproxy: attach
  to namespaces via pidfds").

  The LTP testsuite reported a regression where users would now see
  EBADF returned instead of EINVAL when an fd was passed that referred
  to an open file but the file was not a namespace file.

  Fix this by continuing to report EINVAL and add a regression test"

* tag 'for-linus-2020-06-24' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  tests: test for setns() EINVAL regression
  nsproxy: restore EINVAL for non-namespace file descriptor
2020-06-24 14:19:45 -07:00
Lukasz Luba
f0b5694791 PM / EM: change name of em_pd_energy to em_cpu_energy
Energy Model framework now supports other devices than CPUs. Refactor some
of the functions in order to prevent wrong usage. The old function
em_pd_energy has to generic name. It must not be used without proper
cpumask pointer, which is possible only for CPU devices. Thus, rename it
and add proper description to warn of potential wrong usage for other
devices.

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:16:42 +02:00
Lukasz Luba
07891f15d9 PM / EM: remove em_register_perf_domain
Remove old function em_register_perf_domain which is no longer needed.
There is em_dev_register_perf_domain that covers old use cases and new as
well.

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:16:42 +02:00
Lukasz Luba
1bc138c622 PM / EM: add support for other devices than CPUs in Energy Model
Add support for other devices than CPUs. The registration function
does not require a valid cpumask pointer and is ready to handle new
devices. Some of the internal structures has been reorganized in order to
keep consistent view (like removing per_cpu pd pointers).

Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:16:27 +02:00
Luis Chamberlain
85e0cbbb8a block: create the request_queue debugfs_dir on registration
We were only creating the request_queue debugfs_dir only
for make_request block drivers (multiqueue), but never for
request-based block drivers. We did this as we were only
creating non-blktrace additional debugfs files on that directory
for make_request drivers. However, since blktrace *always* creates
that directory anyway, we special-case the use of that directory
on blktrace. Other than this being an eye-sore, this exposes
request-based block drivers to the same debugfs fragile
race that used to exist with make_request block drivers
where if we start adding files onto that directory we can later
run a race with a double removal of dentries on the directory
if we don't deal with this carefully on blktrace.

Instead, just simplify things by always creating the request_queue
debugfs_dir on request_queue registration. Rename the mutex also to
reflect the fact that this is used outside of the blktrace context.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Luis Chamberlain
b431ef837e blktrace: ensure our debugfs dir exists
We make an assumption that a debugfs directory exists, but since
this can fail ensure it exists before allowing blktrace setup to
complete. Otherwise we end up stuffing blktrace files on the debugfs
root directory. In the worst case scenario this *in theory* can create
an eventual panic *iff* in the future a similarly named file is created
prior on the debugfs root directory. This theoretical crash can happen
due to a recursive removal followed by a specific dentry removal.

This doesn't fix any known crash, however I have seen the files
go into the main debugfs root directory in cases where the debugfs
directory was not created due to other internal bugs with blktrace
now fixed.

blktrace is also completely useless without this directory, so
this ensures to userspace we only setup blktrace if the kernel
can stuff files where they are supposed to go into.

debugfs directory creations typically aren't checked for, and we have
maintainers doing sweep removals of these checks, but since we need this
check to ensure proper userspace blktrace functionality we make sure
to annotate the justification for the check.

Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Luis Chamberlain
bad8e64fb1 blktrace: fix debugfs use after free
On commit 6ac93117ab ("blktrace: use existing disk debugfs directory")
merged on v4.12 Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash, if you
happen to abuse blktrace while racing to remove / add a device.

We used to use asynchronous removal of the request_queue, and with that
the issue was easier to reproduce. Now that we have reverted to
synchronous removal of the request_queue, the issue is still possible to
reproduce, its however just a bit more difficult.

We essentially run two instances of break-blktrace which add/remove
a loop device, and setup a blktrace and just never tear the blktrace
down. We do this twice in parallel. This is easily reproduced with the
script run_0004.sh from break-blktrace [0].

We can end up with two types of panics each reflecting where we
race, one a failed blktrace setup:

[  252.426751] debugfs: Directory 'loop0' with parent 'block' already present!
[  252.432265] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  252.436592] #PF: supervisor write access in kernel mode
[  252.439822] #PF: error_code(0x0002) - not-present page
[  252.442967] PGD 0 P4D 0
[  252.444656] Oops: 0002 [#1] SMP NOPTI
[  252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
[  252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  252.456343] RIP: 0010:down_write+0x15/0x40
[  252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
               cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
               00 00 <f0> 48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89
               45 08 5d
[  252.463638] RSP: 0018:ffffa626415abcc8 EFLAGS: 00010246
[  252.464950] RAX: 0000000000000000 RBX: ffff958c25f0f5c0 RCX: ffffff8100000000
[  252.466727] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  252.468482] RBP: 00000000000000a0 R08: 0000000000000000 R09: 0000000000000001
[  252.470014] R10: 0000000000000000 R11: ffff958d1f9227ff R12: 0000000000000000
[  252.471473] R13: ffff958c25ea5380 R14: ffffffff8cce15f1 R15: 00000000000000a0
[  252.473346] FS:  00007f2e69dee540(0000) GS:ffff958c2fc80000(0000) knlGS:0000000000000000
[  252.475225] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  252.476267] CR2: 00000000000000a0 CR3: 0000000427d10004 CR4: 0000000000360ee0
[  252.477526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  252.478776] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  252.479866] Call Trace:
[  252.480322]  simple_recursive_removal+0x4e/0x2e0
[  252.481078]  ? debugfs_remove+0x60/0x60
[  252.481725]  ? relay_destroy_buf+0x77/0xb0
[  252.482662]  debugfs_remove+0x40/0x60
[  252.483518]  blk_remove_buf_file_callback+0x5/0x10
[  252.484328]  relay_close_buf+0x2e/0x60
[  252.484930]  relay_open+0x1ce/0x2c0
[  252.485520]  do_blk_trace_setup+0x14f/0x2b0
[  252.486187]  __blk_trace_setup+0x54/0xb0
[  252.486803]  blk_trace_ioctl+0x90/0x140
[  252.487423]  ? do_sys_openat2+0x1ab/0x2d0
[  252.488053]  blkdev_ioctl+0x4d/0x260
[  252.488636]  block_ioctl+0x39/0x40
[  252.489139]  ksys_ioctl+0x87/0xc0
[  252.489675]  __x64_sys_ioctl+0x16/0x20
[  252.490380]  do_syscall_64+0x52/0x180
[  252.491032]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

And the other on the device removal:

[  128.528940] debugfs: Directory 'loop0' with parent 'block' already present!
[  128.615325] BUG: kernel NULL pointer dereference, address: 00000000000000a0
[  128.619537] #PF: supervisor write access in kernel mode
[  128.622700] #PF: error_code(0x0002) - not-present page
[  128.625842] PGD 0 P4D 0
[  128.627585] Oops: 0002 [#1] SMP NOPTI
[  128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: G            E     5.7.0-rc2-next-20200420+ #164
[  128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[  128.640471] RIP: 0010:down_write+0x15/0x40
[  128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
               cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
               00 00 <f0> 48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89
               45 08 5d
[  128.650180] RSP: 0018:ffffa9c3c05ebd78 EFLAGS: 00010246
[  128.651820] RAX: 0000000000000000 RBX: ffff8ae9a6370240 RCX: ffffff8100000000
[  128.653942] RDX: 0000000000000001 RSI: ffffff8100000000 RDI: 00000000000000a0
[  128.655720] RBP: 00000000000000a0 R08: 0000000000000002 R09: ffff8ae9afd2d3d0
[  128.657400] R10: 0000000000000056 R11: 0000000000000000 R12: 0000000000000000
[  128.659099] R13: 0000000000000000 R14: 0000000000000003 R15: 00000000000000a0
[  128.660500] FS:  00007febfd995540(0000) GS:ffff8ae9afd00000(0000) knlGS:0000000000000000
[  128.662204] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  128.663426] CR2: 00000000000000a0 CR3: 0000000420042003 CR4: 0000000000360ee0
[  128.664776] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  128.666022] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  128.667282] Call Trace:
[  128.667801]  simple_recursive_removal+0x4e/0x2e0
[  128.668663]  ? debugfs_remove+0x60/0x60
[  128.669368]  debugfs_remove+0x40/0x60
[  128.669985]  blk_trace_free+0xd/0x50
[  128.670593]  __blk_trace_remove+0x27/0x40
[  128.671274]  blk_trace_shutdown+0x30/0x40
[  128.671935]  blk_release_queue+0x95/0xf0
[  128.672589]  kobject_put+0xa5/0x1b0
[  128.673188]  disk_release+0xa2/0xc0
[  128.673786]  device_release+0x28/0x80
[  128.674376]  kobject_put+0xa5/0x1b0
[  128.674915]  loop_remove+0x39/0x50 [loop]
[  128.675511]  loop_control_ioctl+0x113/0x130 [loop]
[  128.676199]  ksys_ioctl+0x87/0xc0
[  128.676708]  __x64_sys_ioctl+0x16/0x20
[  128.677274]  do_syscall_64+0x52/0x180
[  128.677823]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The common theme here is:

debugfs: Directory 'loop0' with parent 'block' already present

This crash happens because of how blktrace uses the debugfs directory
where it places its files. Upon init we always create the same directory
which would be needed by blktrace but we only do this for make_request
drivers (multiqueue) block drivers. When you race a removal of these
devices with a blktrace setup you end up in a situation where the
make_request recursive debugfs removal will sweep away the blktrace
files and then later blktrace will also try to remove individual
dentries which are already NULL. The inverse is also possible and hence
the two types of use after frees.

We don't create the block debugfs directory on init for these types of
block devices:

  * request-based block driver block devices
  * every possible partition
  * scsi-generic

And so, this race should in theory only be possible with make_request
drivers.

We can fix the UAF by simply re-using the debugfs directory for
make_request drivers (multiqueue) and only creating the ephemeral
directory for the other type of block devices. The new clarifications
on relying on the q->blk_trace_mutex *and* also checking for q->blk_trace
*prior* to processing a blktrace ensures the debugfs directories are
only created if no possible directory name clashes are possible.

This goes tested with:

  o nvme partitions
  o ISCSI with tgt, and blktracing against scsi-generic with:
    o block
    o tape
    o cdrom
    o media changer
  o blktests

This patch is part of the work which disputes the severity of
CVE-2019-19770 which shows this issue is not a core debugfs issue, but
a misuse of debugfs within blktace.

Fixes: 6ac93117ab ("blktrace: use existing disk debugfs directory")
Reported-by: syzbot+603294af2d01acfdd6da@syzkaller.appspotmail.com
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bvanassche@acm.org>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>
Cc: "James E.J. Bottomley" <jejb@linux.ibm.com>
Cc: yu kuai <yukuai3@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Luis Chamberlain
a67549c8e5 blktrace: annotate required lock on do_blk_trace_setup()
Ensure it is clear which lock is required on do_blk_trace_setup().

Suggested-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-06-24 09:15:58 -06:00
Lukasz Luba
d0351cc3b0 PM / EM: update callback structure and add device pointer
The Energy Model framework is going to support devices other that CPUs. In
order to make this happen change the callback function and add pointer to
a device as an argument.

Update the related users to use new function and new callback from the
Energy Model.

Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:14:07 +02:00
Lukasz Luba
7d9895c7fb PM / EM: introduce em_dev_register_perf_domain function
Add now function in the Energy Model framework which is going to support
new devices. This function will help in transition and make it smoother.
For now it still checks if the cpumask is a valid pointer, which will be
removed later when the new structures and infrastructure will be ready.

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:14:07 +02:00
Lukasz Luba
521b512b15 PM / EM: change naming convention from 'capacity' to 'performance'
The Energy Model uses concept of performance domain and capacity states in
order to calculate power used by CPUs. Change naming convention from
capacity to performance state would enable wider usage in future, e.g.
upcoming support for other devices other than CPUs.

Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Quentin Perret <qperret@google.com>
Signed-off-by: Lukasz Luba <lukasz.luba@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
2020-06-24 17:14:07 +02:00
Alexander Popov
feee1b8c49 gcc-plugins/stackleak: Use asm instrumentation to avoid useless register saving
The kernel code instrumentation in stackleak gcc plugin works in two stages.
At first, stack tracking is added to GIMPLE representation of every function
(except some special cases). And later, when stack frame size info is
available, stack tracking is removed from the RTL representation of the
functions with small stack frame. There is an unwanted side-effect for these
functions: some of them do useless work with caller-saved registers.

As an example of such case, proc_sys_write without() instrumentation:
    55                      push   %rbp
    41 b8 01 00 00 00       mov    $0x1,%r8d
    48 89 e5                mov    %rsp,%rbp
    e8 11 ff ff ff          callq  ffffffff81284610 <proc_sys_call_handler>
    5d                      pop    %rbp
    c3                      retq
    0f 1f 44 00 00          nopl   0x0(%rax,%rax,1)
    66 2e 0f 1f 84 00 00    nopw   %cs:0x0(%rax,%rax,1)
    00 00 00

proc_sys_write() with instrumentation:
    55                      push   %rbp
    48 89 e5                mov    %rsp,%rbp
    41 56                   push   %r14
    41 55                   push   %r13
    41 54                   push   %r12
    53                      push   %rbx
    49 89 f4                mov    %rsi,%r12
    48 89 fb                mov    %rdi,%rbx
    49 89 d5                mov    %rdx,%r13
    49 89 ce                mov    %rcx,%r14
    4c 89 f1                mov    %r14,%rcx
    4c 89 ea                mov    %r13,%rdx
    4c 89 e6                mov    %r12,%rsi
    48 89 df                mov    %rbx,%rdi
    41 b8 01 00 00 00       mov    $0x1,%r8d
    e8 f2 fe ff ff          callq  ffffffff81298e80 <proc_sys_call_handler>
    5b                      pop    %rbx
    41 5c                   pop    %r12
    41 5d                   pop    %r13
    41 5e                   pop    %r14
    5d                      pop    %rbp
    c3                      retq
    66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
    00 00

Let's improve the instrumentation to avoid this:

1. Make stackleak_track_stack() save all register that it works with.
Use no_caller_saved_registers attribute for that function. This attribute
is available for x86_64 and i386 starting from gcc-7.

2. Insert calling stackleak_track_stack() in asm:
  asm volatile("call stackleak_track_stack" :: "r" (current_stack_pointer))
Here we use ASM_CALL_CONSTRAINT trick from arch/x86/include/asm/asm.h.
The input constraint is taken into account during gcc shrink-wrapping
optimization. It is needed to be sure that stackleak_track_stack() call is
inserted after the prologue of the containing function, when the stack
frame is prepared.

This work is a deep reengineering of the idea described on grsecurity blog
  https://grsecurity.net/resolving_an_unfortunate_stackleak_interaction

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Acked-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Link: https://lore.kernel.org/r/20200624123330.83226-5-alex.popov@linux.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-24 07:48:28 -07:00
Alexander Popov
005e696df6 gcc-plugins/stackleak: Don't instrument itself
There is no need to try instrumenting functions in kernel/stackleak.c.
Otherwise that can cause issues if the cleanup pass of stackleak gcc plugin
is disabled.

Signed-off-by: Alexander Popov <alex.popov@linux.com>
Link: https://lore.kernel.org/r/20200624123330.83226-2-alex.popov@linux.com
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-06-24 07:48:13 -07:00
Sascha Ortmann
20dc3847cc tracing/boottime: Fix kprobe multiple events
Fix boottime kprobe events to report and abort after each failure when
adding probes.

As an example, when we try to set multiprobe kprobe events in
bootconfig like this:

ftrace.event.kprobes.vfsevents {
        probes = "vfs_read $arg1 $arg2,,
                 !error! not reported;?", // leads to error
                 "vfs_write $arg1 $arg2"
}

This will not work as expected. After
commit da0f1f4167 ("tracing/boottime: Fix kprobe event API usage"),
the function trace_boot_add_kprobe_event will not produce any error
message when adding a probe fails at kprobe_event_gen_cmd_start.
Furthermore, we continue to add probes when kprobe_event_gen_cmd_end fails
(and kprobe_event_gen_cmd_start did not fail). In this case the function
even returns successfully when the last call to kprobe_event_gen_cmd_end
is successful.

The behaviour of reporting and aborting after failures is not
consistent.

The function trace_boot_add_kprobe_event now reports each failure and
stops adding probes immediately.

Link: https://lkml.kernel.org/r/20200618163301.25854-1-sascha.ortmann@stud.uni-hannover.de

Cc: stable@vger.kernel.org
Cc: linux-kernel@i4.cs.fau.de
Co-developed-by: Maximilian Werner <maximilian.werner96@gmail.com>
Fixes: da0f1f4167 ("tracing/boottime: Fix kprobe event API usage")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Maximilian Werner <maximilian.werner96@gmail.com>
Signed-off-by: Sascha Ortmann <sascha.ortmann@stud.uni-hannover.de>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23 21:51:50 -04:00
Masami Hiramatsu
6784beada6 tracing: Fix event trigger to accept redundant spaces
Fix the event trigger to accept redundant spaces in
the trigger input.

For example, these return -EINVAL

echo " traceon" > events/ftrace/print/trigger
echo "traceon  if common_pid == 0" > events/ftrace/print/trigger
echo "disable_event:kmem:kmalloc " > events/ftrace/print/trigger

But these are hard to find what is wrong.

To fix this issue, use skip_spaces() to remove spaces
in front of actual tokens, and set NULL if there is no
token.

Link: http://lkml.kernel.org/r/159262476352.185015.5261566783045364186.stgit@devnote2

Cc: Tom Zanussi <zanussi@kernel.org>
Cc: stable@vger.kernel.org
Fixes: 85f2b08268 ("tracing: Add basic event trigger framework")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23 21:51:40 -04:00
Masami Hiramatsu
6c95503c29 tracing/boot: Fix config dependency for synthedic event
Since commit 726721a518 ("tracing: Move synthetic events to
a separate file") decoupled synthetic event from histogram,
boot-time tracing also has to check CONFIG_SYNTH_EVENT instead
of CONFIG_HIST_TRIGGERS.

Link: http://lkml.kernel.org/r/159262475441.185015.5300725180746017555.stgit@devnote2

Fixes: 726721a518 ("tracing: Move synthetic events to a separate file")
Reviewed-by: Tom Zanussi <zanussi@kernel.org>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
2020-06-23 21:51:22 -04:00
Maciej Żenczykowski
b338cb921e bpf: Restore behaviour of CAP_SYS_ADMIN allowing the loading of networking bpf programs
This is a fix for a regression in commit 2c78ee898d ("bpf: Implement CAP_BPF").
Before the above commit it was possible to load network bpf programs
with just the CAP_SYS_ADMIN privilege.

The Android bpfloader happens to run in such a configuration (it has
SYS_ADMIN but not NET_ADMIN) and creates maps and loads bpf programs
for later use by Android's netd (which has NET_ADMIN but not SYS_ADMIN).

Fixes: 2c78ee898d ("bpf: Implement CAP_BPF")
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: John Stultz <john.stultz@linaro.org>
Link: https://lore.kernel.org/bpf/20200620212616.93894-1-zenczykowski@gmail.com
2020-06-23 17:45:42 -07:00
Yonghong Song
c4c0bdc0d2 bpf: Set the number of exception entries properly for subprograms
Currently, if a bpf program has more than one subprograms, each program will be
jitted separately. For programs with bpf-to-bpf calls the
prog->aux->num_exentries is not setup properly. For example, with
bpf_iter_netlink.c modified to force one function to be not inlined and with
CONFIG_BPF_JIT_ALWAYS_ON the following error is seen:
   $ ./test_progs -n 3/3
   ...
   libbpf: failed to load program 'iter/netlink'
   libbpf: failed to load object 'bpf_iter_netlink'
   libbpf: failed to load BPF skeleton 'bpf_iter_netlink': -4007
   test_netlink:FAIL:bpf_iter_netlink__open_and_load skeleton open_and_load failed
   #3/3 netlink:FAIL
The dmesg shows the following errors:
   ex gen bug
which is triggered by the following code in arch/x86/net/bpf_jit_comp.c:
   if (excnt >= bpf_prog->aux->num_exentries) {
     pr_err("ex gen bug\n");
     return -EFAULT;
   }

This patch fixes the issue by computing proper num_exentries for each
subprogram before calling JIT.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-23 17:27:37 -07:00
Richard Guy Briggs
8e6cf365e1 audit: log nftables configuration change events
iptables, ip6tables, arptables and ebtables table registration,
replacement and unregistration configuration events are logged for the
native (legacy) iptables setsockopt api, but not for the
nftables netlink api which is used by the nft-variant of iptables in
addition to nftables itself.

Add calls to log the configuration actions in the nftables netlink api.

This uses the same NETFILTER_CFG record format but overloads the table
field.

  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.878:162) : table=?:0;?:0 family=unspecified entries=2 op=nft_register_gen pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.878:162) : table=firewalld:1;?:0 family=inet entries=0 op=nft_register_table pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;filter_FORWARD:85 family=inet entries=8 op=nft_register_chain pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;filter_FORWARD:85 family=inet entries=101 op=nft_register_rule pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;__set0:87 family=inet entries=87 op=nft_register_setelem pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld
  ...
  type=NETFILTER_CFG msg=audit(2020-05-28 17:46:41.911:163) : table=firewalld:1;__set0:87 family=inet entries=0 op=nft_register_set pid=396 subj=system_u:system_r:firewalld_t:s0 comm=firewalld

For further information please see issue
https://github.com/linux-audit/audit-kernel/issues/124

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-06-23 20:25:16 -04:00
Alexei Starovoitov
4e608675e7 Merge up to bpf_probe_read_kernel_str() fix into bpf-next 2020-06-23 15:33:41 -07:00