The extents code will sometimes zero out blocks and mark them as
initialized instead of splitting an extent into several smaller ones.
This optimization however, causes problems if the extent is beyond
i_size because fsck will complain if there are uninitialized blocks
after i_size as this can not be distinguished from an inode that has
an incorrect i_size field.
https://bugzilla.kernel.org/show_bug.cgi?id=15742
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
One of the most contended locks in the jbd2 layer is j_state_lock when
running dbench. This is especially true if using the real-time kernel
with its "sleeping spinlocks" patch that replaces spinlocks with
priority inheriting mutexes --- but it also shows up on large SMP
benchmarks.
Thanks to John Stultz for pointing this out.
Reviewed by Mingming Cao and Jan Kara.
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Now we have a set of nested attributes:
IFLA_VFINFO_LIST (NESTED)
IFLA_VF_INFO (NESTED)
IFLA_VF_MAC
IFLA_VF_VLAN
IFLA_VF_TX_RATE
This allows a single set to operate on multiple attributes if desired.
Among other things, it means a dump can be replayed to set state.
The current interface has yet to be released, so this seems like
something to consider for 2.6.34.
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
There was a bug reported on RHEL5 that a 10G dd on a 12G box
had a very, very slow sync after that.
At issue was the loop in write_cache_pages scanning all the way
to the end of the 10G file, even though the subsequent call
to mpage_da_submit_io would only actually write a smallish amt; then
we went back to the write_cache_pages loop ... wasting tons of time
in calling __mpage_da_writepage for thousands of pages we would
just revisit (many times) later.
Upstream it's not such a big issue for sys_sync because we get
to the loop with a much smaller nr_to_write, which limits the loop.
However, talking with Aneesh he realized that fsync upstream still
gets here with a very large nr_to_write and we face the same problem.
This patch makes mpage_add_bh_to_extent stop the loop after we've
accumulated 2048 pages, by setting mpd->io_done = 1; which ultimately
causes the write_cache_pages loop to break.
Repeating the test with a dirty_ratio of 80 (to leave something for
fsync to do), I don't see huge IO performance gains, but the reduction
in cpu usage is striking: 80% usage with stock, and 2% with the
below patch. Instrumenting the loop in write_cache_pages clearly
shows that we are wasting time here.
Eventually we need to change mpage_da_map_pages() also submit its I/O
to the block layer, subsuming mpage_da_submit_io(), and then change it
call ext4_get_blocks() multiple times.
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Synchronize access to the drivers configuration interface.
Also do not allow configuration changes during online/offline
transition.
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
z/OS may activate Optimized Latency Mode (OLM) for a connection
through an OSA Express3 adapter, which reduces the number of
allowed concurrent connections, if adapter is used in shared mode.
Create a meaningful message, if activation of an OSA-connection fails
due to an active OLM-connection on the shared OSA-adapter.
Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
OSA supports HW TX checksumming in layer 3 mode. Enable this
feature and remove software fallback used for TSO. Cleanup
checksum bits to indicate OSA can do checksumming only for
IPv4 TCP and UDP.
Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
transport may be free before ICMP proto unreachable timer expire, so
we should delete active ICMP proto unreachable timer when transport
is going away.
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
vlan/macvlan start_xmit() can inform caller of congestion with
NET_XMIT_CN return value. This doesnt mean packet was dropped.
Increment normal stat counters instead of tx_dropped.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP-MD5 sessions have intermittent failures, when route cache is
invalidated. ip_queue_xmit() has to find a new route, calls
sk_setup_caps(sk, &rt->u.dst), destroying the
sk->sk_route_caps &= ~NETIF_F_GSO_MASK
that MD5 desperately try to make all over its way (from
tcp_transmit_skb() for example)
So we send few bad packets, and everything is fine when
tcp_transmit_skb() is called again for this socket.
Since ip_queue_xmit() is at a lower level than TCP-MD5, I chose to use a
socket field, sk_route_nocaps, containing bits to mask on sk_route_caps.
Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
TCP MD5 support uses percpu data for temporary storage. It currently
disables preemption so that same storage cannot be reclaimed by another
thread on same cpu.
We also have to make sure a softirq handler wont try to use also same
context. Various bug reports demonstrated corruptions.
Fix is to disable preemption and BH.
Reported-by: Bhaskar Dutta <bhaskie@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Turn off issuance of discard requests if the device does
not support it - similar to the action we take for barriers.
This will save a little computation time if a non-discardable
device is mounted with -o discard, and also makes it obvious
that it's not doing what was asked at mount time ...
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
With RPS inclusion, skb timestamping is not consistent in RX path.
If netif_receive_skb() is used, its deferred after RPS dispatch.
If netif_rx() is used, its done before RPS dispatch.
This can give strange tcpdump timestamps results.
I think timestamping should be done as soon as possible in the receive
path, to get meaningful values (ie timestamps taken at the time packet
was delivered by NIC driver to our stack), even if NAPI already can
defer timestamping a bit (RPS can help to reduce the gap)
Tom Herbert prefer to sample timestamps after RPS dispatch. In case
sampling is expensive (HPET/acpi_pm on x86), this makes sense.
Let admins switch from one mode to another, using a new
sysctl, /proc/sys/net/core/netdev_tstamp_prequeue
Its default value (1), means timestamps are taken as soon as possible,
before backlog queueing, giving accurate timestamps.
Setting a 0 value permits to sample timestamps when processing backlog,
after RPS dispatch, to lower the load of the pre-RPS cpu.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
I mistakenly had the error path to use num_pols to decide how
many policies we need to drop (cruft from earlier patch set
version which did not handle socket policies right).
This is wrong since normally we do not keep explicit references
(instead we hold reference to the cache entry which holds references
to policies). drop_pols is set to num_pols if we are holding the
references, so use that. Otherwise we eventually BUG_ON inside
xfrm_policy_destroy due to premature policy deletion.
Signed-off-by: Timo Teras <timo.teras@iki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Now there's null check here and also again in the hook. Looking at bridge bits
which are simmilar, port structure is rcu_dereferenced right away in
handle_bridge and passed to hook. Looks nicer.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Acked-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
(Dropped the infiniband part, because Tetsuo modified the related code,
I will send a separate patch for it once this is accepted.)
This patch introduces /proc/sys/net/ipv4/ip_local_reserved_ports which
allows users to reserve ports for third-party applications.
The reserved ports will not be used by automatic port assignments
(e.g. when calling connect() or bind() with port number 0). Explicit
port allocation behavior is unchanged.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
The new function can be used to read/write large bitmaps via /proc. A
comma separated range format is used for compact output and input
(e.g. 1,3-4,10-10).
Writing into the file will first reset the bitmap then update it
based on the given input.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
(Based on Octavian's work, and I modified a lot.)
As we are about to add another integer handling proc function a little
bit of cleanup is in order: add a few helper functions to improve code
readability and decrease code duplication.
In the process a bug is also fixed: if the user specifies a number
with more then 20 digits it will be interpreted as two integers
(e.g. 10000...13 will be interpreted as 100.... and 13).
Behavior for EFAULT handling was changed as well. Previous to this
patch, when an EFAULT error occurred in the middle of a write
operation, although some of the elements were set, that was not
acknowledged to the user (by shorting the write and returning the
number of bytes accepted). EFAULT is now treated just like any other
errors by acknowledging the amount of bytes accepted.
Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
Signed-off-by: WANG Cong <amwang@redhat.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Links for each port are created in sysfs using the device
name, but this could be changed after being added to the
bridge.
As well as being unable to remove interfaces after this
occurs (because userspace tools don't recognise the new
name, and the kernel won't recognise the old name), adding
another interface with the old name to the bridge will
cause an error trying to create the sysfs link.
This fixes the problem by listening for NETDEV_CHANGENAME
notifications and renaming the link.
https://bugzilla.kernel.org/show_bug.cgi?id=12743
Signed-off-by: Simon Arlott <simon@fire.lp0.eu>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Use one set of macro's for all bridge messages.
Note: can't use netdev_XXX macro's because bridge is purely
virtual and has no device parent.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Move code around so that the ifdef for NETPOLL_CONTROLLER don't have to
show up in main code path. The control functions should be in helpers
that are only compiled if needed.
Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some RNDIS devices don't respond on the control channel until polled
on the status channel. In particular, this was reported to be the
case for the 2Wire HomePortal 1000SW.
This is roughly based on a patch by John Carr <john.carr@unrouted.co.uk>
which is reported to be needed for use with some Windows Mobile devices
and which is currently applied by Mandriva.
Reported-by: Mark Glassberg <vzeeaxwl@myfairpoint.net>
Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Tested-by: Mark Glassberg <vzeeaxwl@myfairpoint.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
ext4_freeze() used jbd2_journal_lock_updates() which takes
the j_barrier mutex, and then returns to userspace. The
kernel does not like this:
================================================
[ BUG: lock held when returning to user space! ]
------------------------------------------------
lvcreate/1075 is leaving the kernel with locks still held!
1 lock held by lvcreate/1075:
#0: (&journal->j_barrier){+.+...}, at: [<ffffffff811c6214>]
jbd2_journal_lock_updates+0xe1/0xf0
Use vfs_check_frozen() added to ext4_journal_start_sb() and
ext4_force_commit() instead.
Addresses-Red-Hat-Bugzilla: #568503
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
generic setattr implementation is no longer responsible for
quota transfer so synlinks must be handled via ext4_setattr.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
If groups_per_flex < 2, sbi->s_flex_groups[] doesn't get filled out,
and every other access to this first tests s_log_groups_per_flex;
same thing needs to happen in resize or we'll wander off into
a null pointer when doing an online resize of the file system.
Thanks to Christoph Biedl, who came up with the trivial testcase:
# truncate --size 128M fsfile
# mkfs.ext3 -F fsfile
# tune2fs -O extents,uninit_bg,dir_index,flex_bg,huge_file,dir_nlink,extra_isize fsfile
# e2fsck -yDf -C0 fsfile
# truncate --size 132M fsfile
# losetup /dev/loop0 fsfile
# mount /dev/loop0 mnt
# resize2fs -p /dev/loop0
https://bugzilla.kernel.org/show_bug.cgi?id=13549
Reported-by: Alessandro Polverini <alex@nibbles.it>
Test-case-by: Christoph Biedl <bugzilla.kernel.bpeb@manchmal.in-ulm.de>
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
allocated_meta_data is already included in 'used' variable.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Use kmemdup when some other buffer is immediately copied into the
allocated region.
A simplified version of the semantic patch that makes this change is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
statement S;
@@
- to = \(kmalloc\|kzalloc\)(size,flag);
+ to = kmemdup(from,size,flag);
if (to==NULL || ...) S
- memcpy(to, from, size);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: Roland Dreier <rolandd@cisco.com>
Combining the softlockup and hardlockup code causes watchdog.c
to build even without the hardlockup detection support.
So if an arch, that has the previous and the new nmi watchdog
implementations cohabiting, wants to know if the generic one
is in use, CONFIG_LOCKUP_DETECTOR is not a reliable check.
We need to use CONFIG_HARDLOCKUP_DETECTOR instead.
Fixes:
kernel/built-in.o: In function `touch_nmi_watchdog':
(.text+0x449bc): multiple definition of `touch_nmi_watchdog'
arch/sparc/kernel/built-in.o:(.text+0x11b28): first defined here
Signed-off-by: Don Zickus <dzickus@redhat.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
LKML-Reference: <20100514151121.GR15159@redhat.com>
[ use CONFIG_HARDLOCKUP_DETECTOR instead of CONFIG_PERF_EVENTS_NMI]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
This new config is deemed to simplify even more the lockup detector
dependencies and can make it easier to bring a smooth sorting
between archs that support the new generic lockup detector and those
that still have their own, especially for those that are in the
middle of this migration.
Instead of checking whether we have CONFIG_LOCKUP_DETECTOR +
CONFIG_PERF_EVENTS_NMI each time an arch wants to know if it needs
to build its own lockup detector, take a shortcut with this new
config. It is enabled only if the hardlockup detection part of
the whole lockup detector is on.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
CONFIG_PERF_EVENT_NMI is something that need to be enabled from the
arch. This is fine on x86 as PERF_EVENTS is builtin but if other
archs select it, they will need to handle the PERF_EVENTS dependency.
Instead, handle the dependency in the generic layer:
- archs need to tell what they support through HAVE_PERF_EVENTS_NMI
- Enable magically PERF_EVENTS_NMI if we have PERF_EVENTS and
HAVE_PERF_EVENTS_NMI.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
We kept CONFIG_DETECT_SOFTLOCKUP around for compatibility with
older configs. But it was enabled by default if CONFIG_DEBUG_KERNEL.
So if we want to enable CONFIG_LOCKUP_DETECTOR on configs that had
CONFIG_DETECT_SOFTLOCKUP, all we need is to have the same enabling
by default if CONFIG_DEBUG_KERNEL. We can then remove
CONFIG_DETECT_SOFTLOCKUP directly.
So tag CONFIG_LOCKUP_DETECTOR as default y. This is what we want for
most serious kernel debugging anyway.
And also forbid the lockup detector in S390 as it was for the
previous softlockup detector, event though the true reason for that
is not outlined.
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Don Zickus <dzickus@redhat.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Right now that means that pressing the left arrow willl make the symbol
annotation window to exit back to the main symbol histogram browser.
This is another improvement on the UI fastpath, i.e. just the arrows and
enter are enough for most browsing.
Suggested-by: Ingo Molnar <mingo@elte.hu>
Cc: Frédéric Weisbecker <fweisbec@gmail.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Tom Zanussi <tzanussi@gmail.com>
LKML-Reference: <new-submission>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Use kmemdup when some other buffer is immediately copied into the
allocated region.
A simplified version of the semantic patch that makes this change is as
follows: (http://coccinelle.lip6.fr/)
// <smpl>
@@
expression from,to,size,flag;
statement S;
@@
- to = \(kmalloc\|kzalloc\)(size,flag);
+ to = kmemdup(from,size,flag);
if (to==NULL || ...) S
- memcpy(to, from, size);
// </smpl>
Signed-off-by: Julia Lawall <julia@diku.dk>
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
In the FPU emulator code of the MIPS, the Cause bits of the FCSR register
are not currently writeable by the ctc1 instruction. In odd corner cases,
this can cause problems. For example, a case existed where a divide-by-zero
exception was generated by the FPU, and the signal handler attempted to
restore the FPU registers to their state before the exception occurred. In
this particular setup, writing the old value to the FCSR register would
cause another divide-by-zero exception to occur immediately. The solution
is to change the ctc1 instruction emulator code to allow the Cause bits of
the FCSR register to be writeable. This is the behaviour of the hardware
that the code is emulating.
This problem was found by Shane McDonald, but the credit for the fix goes
to Kevin Kissell. In Kevin's words:
I submit that the bug is indeed in that ctc_op: case of the emulator. The
Cause bits (17:12) are supposed to be writable by that instruction, but the
CTC1 emulation won't let them be updated by the instruction. I think that
actually if you just completely removed lines 387-388 [...] things would
work a good deal better. At least, it would be a more accurate emulation of
the architecturally defined FPU. If I wanted to be really, really pedantic
(which I sometimes do), I'd also protect the reserved bits that aren't
necessarily writable.
Signed-off-by: Shane McDonald <mcdonald.shane@gmail.com>
To: anemo@mba.ocn.ne.jp
To: kevink@paralogos.com
To: sshtylyov@mvista.com
Patchwork: http://patchwork.linux-mips.org/patch/1205/
Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
---
As we were using an internal dma flushing routine, this patch changes to
the DMA API flush_kernel_dcache_page(). Driver is able to compile now.
[akpm@linux-foundation.org: flush_kernel_dcache_page() comes before kunmap_atomic()]
Signed-off-by: Nicolas Ferre <nicolas.ferre@atmel.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
JFS: Free sbi memory in error path
fs/sysv: dereferencing ERR_PTR()
Fix double-free in logfs
Fix the regression created by "set S_DEAD on unlink()..." commit
* 'perf-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
perf record: Add a fallback to the reference relocation symbol
Between "clean D line..." and "invalidate I line" operations in
v7_coherent_user_range(), the memory page may get swapped out.
And the fault on "invalidate I line" could not be properly handled
causing the oops.
In ARMv6 "external abort on linefetch" replaced by "instruction cache
maintenance fault". Let's handle it as translation fault. It fixes the
issue.
I'm not sure if it's reasonable to check arch version in run-time.
Let's do it in compile time for now.
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Siarhei Siamashka <siarhei.siamashka@nokia.com>
Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Enabling CONFIG_USER_DEBUG allows NWFPE to complain about every FP
exception, which with some programs can cause the kernel message log
to fill with NWFPE debug, swamping out other messages.
This change allows NWFPE debugging to be configured at run time.
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>