* 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6:
nfs: fix memory leak in nfs_get_sb with CONFIG_NFS_V4
nfs: fix some issues in nfs41_proc_reclaim_complete()
NFS: Ensure that nfs_wb_page() waits for Pg_writeback to clear
NFS: Fix an unstable write data integrity race
nfs: testing for null instead of ERR_PTR()
NFS: rsize and wsize settings ignored on v4 mounts
NFSv4: Don't attempt an atomic open if the file is a mountpoint
SUNRPC: Fix a bug in rpcauth_prune_expired
The struct device_node *of_node pointer is moving out of dev->archdata
and into the struct device proper. of_i2c.c needs to set the of_node
pointer before the device is registered. Since the i2c subsystem
doesn't allow 2 stage allocation and registration of i2c devices, the
of_node pointer needs to be passed via the i2c_board_info structure
so that it is set prior to registration.
This patch adds of_node to struct i2c_board_info (conditional on
CONFIG_OF), sets of_node in i2c_new_device(), and modifies of_i2c.c
to use the new parameter. The calling of dev_archdata_set_node()
from of_i2c will be removed in a subsequent patch when of_node is
removed from archdata and all users are converted over.
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
This patch makes unflatten_device_tree() safe to call from any arch
setup code with the following changes:
- Make sure initial_boot_params actually points to a device tree blob
before unflattening
- Make sure the initial_boot_params->magic field is correct
- If CONFIG_OF_FLATTREE is not set, then make unflatten_device_tree()
an empty static inline function.
This patch also adds some additional debug output to the top of
unflatten_device_tree().
Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
When queueing a skb to socket, we can immediately release its dst if
target socket do not use IP_CMSG_PKTINFO.
tcp_data_queue() can drop dst too.
This to benefit from a hot cache line and avoid the receiver, possibly
on another cpu, to dirty this cache line himself.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Since commit 95766fff ([UDP]: Add memory accounting.),
each received packet needs one extra sock_lock()/sock_release() pair.
This added latency because of possible backlog handling. Then later,
ticket spinlocks added yet another latency source in case of DDOS.
This patch introduces lock_sock_bh() and unlock_sock_bh()
synchronization primitives, avoiding one atomic operation and backlog
processing.
skb_free_datagram_locked() uses them instead of full blown
lock_sock()/release_sock(). skb is orphaned inside locked section for
proper socket memory reclaim, and finally freed outside of it.
UDP receive path now take the socket spinlock only once.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Ok, version 4
Change Notes:
1) Minor cleanups, from Vlads notes
Summary:
Hey-
Recently, it was reported to me that the kernel could oops in the
following way:
<5> kernel BUG at net/core/skbuff.c:91!
<5> invalid operand: 0000 [#1]
<5> Modules linked in: sctp netconsole nls_utf8 autofs4 sunrpc iptable_filter
ip_tables cpufreq_powersave parport_pc lp parport vmblock(U) vsock(U) vmci(U)
vmxnet(U) vmmemctl(U) vmhgfs(U) acpiphp dm_mirror dm_mod button battery ac md5
ipv6 uhci_hcd ehci_hcd snd_ens1371 snd_rawmidi snd_seq_device snd_pcm_oss
snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_ac97_codec snd soundcore
pcnet32 mii floppy ext3 jbd ata_piix libata mptscsih mptsas mptspi mptscsi
mptbase sd_mod scsi_mod
<5> CPU: 0
<5> EIP: 0060:[<c02bff27>] Not tainted VLI
<5> EFLAGS: 00010216 (2.6.9-89.0.25.EL)
<5> EIP is at skb_over_panic+0x1f/0x2d
<5> eax: 0000002c ebx: c033f461 ecx: c0357d96 edx: c040fd44
<5> esi: c033f461 edi: df653280 ebp: 00000000 esp: c040fd40
<5> ds: 007b es: 007b ss: 0068
<5> Process swapper (pid: 0, threadinfo=c040f000 task=c0370be0)
<5> Stack: c0357d96 e0c29478 00000084 00000004 c033f461 df653280 d7883180
e0c2947d
<5> 00000000 00000080 df653490 00000004 de4f1ac0 de4f1ac0 00000004
df653490
<5> 00000001 e0c2877a 08000800 de4f1ac0 df653490 00000000 e0c29d2e
00000004
<5> Call Trace:
<5> [<e0c29478>] sctp_addto_chunk+0xb0/0x128 [sctp]
<5> [<e0c2947d>] sctp_addto_chunk+0xb5/0x128 [sctp]
<5> [<e0c2877a>] sctp_init_cause+0x3f/0x47 [sctp]
<5> [<e0c29d2e>] sctp_process_unk_param+0xac/0xb8 [sctp]
<5> [<e0c29e90>] sctp_verify_init+0xcc/0x134 [sctp]
<5> [<e0c20322>] sctp_sf_do_5_1B_init+0x83/0x28e [sctp]
<5> [<e0c25333>] sctp_do_sm+0x41/0x77 [sctp]
<5> [<c01555a4>] cache_grow+0x140/0x233
<5> [<e0c26ba1>] sctp_endpoint_bh_rcv+0xc5/0x108 [sctp]
<5> [<e0c2b863>] sctp_inq_push+0xe/0x10 [sctp]
<5> [<e0c34600>] sctp_rcv+0x454/0x509 [sctp]
<5> [<e084e017>] ipt_hook+0x17/0x1c [iptable_filter]
<5> [<c02d005e>] nf_iterate+0x40/0x81
<5> [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
<5> [<c02e0c7f>] ip_local_deliver_finish+0xc6/0x151
<5> [<c02d0362>] nf_hook_slow+0x83/0xb5
<5> [<c02e0bb2>] ip_local_deliver+0x1a2/0x1a9
<5> [<c02e0bb9>] ip_local_deliver_finish+0x0/0x151
<5> [<c02e103e>] ip_rcv+0x334/0x3b4
<5> [<c02c66fd>] netif_receive_skb+0x320/0x35b
<5> [<e0a0928b>] init_stall_timer+0x67/0x6a [uhci_hcd]
<5> [<c02c67a4>] process_backlog+0x6c/0xd9
<5> [<c02c690f>] net_rx_action+0xfe/0x1f8
<5> [<c012a7b1>] __do_softirq+0x35/0x79
<5> [<c0107efb>] handle_IRQ_event+0x0/0x4f
<5> [<c01094de>] do_softirq+0x46/0x4d
Its an skb_over_panic BUG halt that results from processing an init chunk in
which too many of its variable length parameters are in some way malformed.
The problem is in sctp_process_unk_param:
if (NULL == *errp)
*errp = sctp_make_op_error_space(asoc, chunk,
ntohs(chunk->chunk_hdr->length));
if (*errp) {
sctp_init_cause(*errp, SCTP_ERROR_UNKNOWN_PARAM,
WORD_ROUND(ntohs(param.p->length)));
sctp_addto_chunk(*errp,
WORD_ROUND(ntohs(param.p->length)),
param.v);
When we allocate an error chunk, we assume that the worst case scenario requires
that we have chunk_hdr->length data allocated, which would be correct nominally,
given that we call sctp_addto_chunk for the violating parameter. Unfortunately,
we also, in sctp_init_cause insert a sctp_errhdr_t structure into the error
chunk, so the worst case situation in which all parameters are in violation
requires chunk_hdr->length+(sizeof(sctp_errhdr_t)*param_count) bytes of data.
The result of this error is that a deliberately malformed packet sent to a
listening host can cause a remote DOS, described in CVE-2010-1173:
http://cve.mitre.org/cgi-bin/cvename.cgi?name=2010-1173
I've tested the below fix and confirmed that it fixes the issue. We move to a
strategy whereby we allocate a fixed size error chunk and ignore errors we don't
have space to report. Tested by me successfully
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Some drivers (e.g. iwlwifi) need to know and try
to figure it out based on other things, but making
it explicit is definitely better.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (27 commits)
sfc: Change falcon_probe_board() to fail for unsupported boards
sfc: Always close net device at the end of a disabling reset
sfc: Wait at most 10ms for the MC to finish reading out MAC statistics
sctp: Fix oops when sending queued ASCONF chunks
sctp: fix to calc the INIT/INIT-ACK chunk length correctly is set
sctp: per_cpu variables should be in bh_disabled section
sctp: fix potential reference of a freed pointer
sctp: avoid irq lock inversion while call sk->sk_data_ready()
Revert "tcp: bind() fix when many ports are bound"
net/usb: add sierra_net.c driver
cdc_ether: fix autosuspend for mbm devices
bluetooth: handle l2cap_create_connless_pdu() errors
gianfar: Wait for both RX and TX to stop
ipheth: potential null dereferences on error path
smc91c92_cs: spin_unlock_irqrestore before calling smc_interrupt()
drivers/usb/net/kaweth.c: add device "Allied Telesyn AT-USB10 USB Ethernet Adapter"
bnx2: Update version to 2.0.9.
bnx2: Prevent "scheduling while atomic" warning with cnic, bonding and vlan.
bnx2: Fix lost MSI-X problem on 5709 NICs.
cxgb3: Wait longer for control packets on initialization
...
Changes:
This is a complete re-write of the socket layer. Making the socket
implementation more aligned with the other socket layers and using more
of the support functions available in sock.c. Lots of code is copied
from af_unix (and some from af_irda).
Non-blocking mode should be working as well.
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes:
o Function cfcnfg_disconn_adapt_layer is changed to do asynchronous
disconnect, not waiting for any response from the modem. Due to this
the function cfcnfg_linkdestroy_rsp does nothing anymore.
o Because disconnect may take down a connection before a connect response
is received the function cfcnfg_linkup_rsp is checking if the client is
still waiting for the response, if not a disconnect request is sent to
the modem.
o cfctrl is no longer keeping track of pending disconnect requests.
o Added function cfctrl_cancel_req, which is used for deleting a pending
connect request if disconnect is done before connect response is received.
o Removed unused function cfctrl_insert_req2
o Added better handling of connect reject from modem.
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes:
o Added functions cfsrvl_get and cfsrvl_put.
o Added support release_client to use by socket and net device.
o Increase reference counting for in-flight packets from cfmuxl
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Changes:
o Renamed cfcnfg_del_adapt_layer to cfcnfg_disconn_adapt_layer
o Fixed typo cfcfg to cfcnfg
o Renamed linkid to channel_id
o Updated documentation in caif_dev.h
o Minor formatting changes
Signed-off-by: Sjur Braendeland <sjur.brandeland@stericsson.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
sk->sk_data_ready() of sctp socket can be called from both BH and non-BH
contexts, but the default sk->sk_data_ready(), sock_def_readable(), can
not be used in this case. Therefore, we have to make a new function
sctp_data_ready() to grab sk->sk_data_ready() with BH disabling.
=========================================================
[ INFO: possible irq lock inversion dependency detected ]
2.6.33-rc6 #129
---------------------------------------------------------
sctp_darn/1517 just changed the state of lock:
(clock-AF_INET){++.?..}, at: [<c06aab60>] sock_def_readable+0x20/0x80
but this lock took another, SOFTIRQ-unsafe lock in the past:
(slock-AF_INET){+.-...}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
1 lock held by sctp_darn/1517:
#0: (sk_lock-AF_INET){+.+.+.}, at: [<cdfe363d>] sctp_sendmsg+0x23d/0xc00 [sctp]
Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>
Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
- Add bio_batch helper primitive. This is rather generic primitive
for submitting/waiting a complex request which consists of several
bios.
- blkdev_issue_zeroout() generate number of zero filed write bios.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
The patch just convert all blkdev_issue_xxx function to common
set of flags. Wait/allocation semantics preserved.
Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
coda: move backing-dev.h kernel include inside __KERNEL__
mtd: ensure that bdi entries are properly initialized and registered
Move mtd_bdi_*mappable to mtdcore.c
btrfs: convert to using bdi_setup_and_register()
Catch filesystems lacking s_bdi
drbd: Terminate a connection early if sending the protocol fails
drbd: fix memory leak
Fix JFFS2 sync silent failure
smbfs: add bdi backing to mount session
ncpfs: add bdi backing to mount session
exofs: add bdi backing to mount session
ecryptfs: add bdi backing to mount session
coda: add bdi backing to mount session
cifs: add bdi backing to mount session
afs: add bdi backing to mount session.
9p: add bdi backing to mount session
bdi: add helper function for doing init and register of a bdi for a file system
block: ensure jiffies wrap is handled correctly in blk_rq_timed_out_timer
Most of the LSM common audit work uses LSM_AUDIT_DATA_* for the naming.
This was not so for LSM_AUDIT_NO_AUDIT which means the generic initializer
cannot be used. This patch just renames the flag so the generic
initializer can be used.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
Now there's no need to use this fuction directly because it's handled by
register_pernet_device. So to make this simple and easy to understand,
make this static to do not tempt potentional users.
Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Current socket backlog limit is not enough to really stop DDOS attacks,
because user thread spend many time to process a full backlog each
round, and user might crazy spin on socket lock.
We should add backlog size and receive_queue size (aka rmem_alloc) to
pace writers, and let user run without being slow down too much.
Introduce a sk_rcvqueues_full() helper, to avoid taking socket lock in
stress situations.
Under huge stress from a multiqueue/RPS enabled NIC, a single flow udp
receiver can now process ~200.000 pps (instead of ~100 pps before the
patch) on a 8 core machine.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
batch skb dequeueing from softnet input_pkt_queue to reduce potential lock
contention when RPS is enabled.
Note: in the worst case, the number of packets in a softnet_data may
be double of netdev_max_backlog.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Idea from Eric Dumazet.
As for placement inside of struct sock, I tried to choose a place
that otherwise has a 32-bit hole on 64-bit systems.
Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
reimplement softnet_data.output_queue as a FIFO queue to keep the
fairness among the qdiscs rescheduled.
Signed-off-by: Changli Gao <xiaosuo@gmail.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
----
include/linux/netdevice.h | 1 +
net/core/dev.c | 22 ++++++++++++----------
2 files changed, 13 insertions(+), 10 deletions(-)
Signed-off-by: David S. Miller <davem@davemloft.net>
When scanning, it is somewhat important to scan
on the correct virtual interface. All drivers
that currently implement hw_scan only support a
single virtual interface, but that may change
and then we'd want to be ready.
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Now that the mac80211 is choosing dynamic ps timeouts based on the ps-qos
network latency configuration, configure a default value of -1 as the dynamic
ps timeout in cfg80211. This value allows the mac80211 to determine the value
to be used.
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
Determine the dynamic PS timeout based on the configured ps-qos network
latency. For backwards wext compatibility, allow the dynamic PS timeout
configured by the cfg80211 to overrule the automatically determined value.
Signed-off-by: Juuso Oikarinen <juuso.oikarinen@nokia.com>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
This is used to configure APs to not bridge traffic between connected stations.
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: John W. Linville <linville@tuxdriver.com>
__sk_dst_set() might be called while no state can be integrated in a
rcu_dereference_check() condition.
So use rcu_dereference_raw() to shutup lockdep warnings (if
CONFIG_PROVE_RCU is set)
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
const qualifier on sock argument is misleading, since we can modify rxhash.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
RFC 1122 says the following:
...
Keep-alive packets MUST only be sent when no data or
acknowledgement packets have been received for the
connection within an interval.
...
The acknowledgement packet is reseting the keepalive
timer but the data packet isn't. This patch fixes it by
checking the timestamp of the last received data packet
too when the keepalive timer expires.
Signed-off-by: Flavio Leitner <fleitner@redhat.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Calls to twsk_net() are in some cases protected by reference counting
as an alternative to RCU protection. Cases covered by reference counts
include __inet_twsk_kill(), inet_twsk_free(), inet_twdr_do_twkill_work(),
inet_twdr_twcal_tick(), and tcp_timewait_state_process(). RCU is used
by inet_twsk_purge(). Locking is used by established_get_first()
and established_get_next(). Finally, __inet_twsk_hashdance() is an
initialization case.
It appears to be non-trivial to locate the appropriate locks and
reference counts from within twsk_net(), so used rcu_dereference_raw().
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
When performing a non-consuming read, a synchronize_sched() is
performed once for every cpu which is actively tracing.
This is very expensive, and can make it take several seconds to open
up the 'trace' file with lots of cpus.
Only one synchronize_sched() call is actually necessary. What is
desired is for all cpus to see the disabling state change. So we
transform the existing sequence:
for_each_cpu() {
ring_buffer_read_start();
}
where each ring_buffer_start() call performs a synchronize_sched(),
into the following:
for_each_cpu() {
ring_buffer_read_prepare();
}
ring_buffer_read_prepare_sync();
for_each_cpu() {
ring_buffer_read_start();
}
wherein only the single ring_buffer_read_prepare_sync() call needs to
do the synchronize_sched().
The first phase, via ring_buffer_read_prepare(), allocates the 'iter'
memory and increments ->record_disabled.
In the second phase, ring_buffer_read_prepare_sync() makes sure this
->record_disabled state is visible fully to all cpus.
And in the final third phase, the ring_buffer_read_start() calls reset
the 'iter' objects allocated in the first phase since we now know that
none of the cpus are adding trace entries any more.
This makes openning the 'trace' file nearly instantaneous on a
sparc64 Niagara2 box with 128 cpus tracing.
Signed-off-by: David S. Miller <davem@davemloft.net>
LKML-Reference: <20100420.154711.11246950.davem@davemloft.net>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
There has been quite a confusion in userspace about
XT_FUNCTION_MAXNAMELEN; because struct xt_entry_match used MAX-1,
userspace would have to do an awkward MAX-2 for maximum length
checking (due to '\0'). This patch adds a new define that matches the
definition of XT_TABLE_MAXNAMELEN - being the size of the actual
struct member, not one off.
Signed-off-by: Jan Engelhardt <jengelh@medozas.de>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Add suspend/resume hooks for HID drivers so these can do some
additional state adjustment when device gets suspended/resumed.
Signed-off-by: Bruno Prémont <bonbons@linux-vserver.org>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Currently, device claiming for exclusive open is done after low level
open - disk->fops->open() - has completed successfully. This means
that exclusive open attempts while a device is already exclusively
open will fail only after disk->fops->open() is called.
cdrom driver issues commands during open() which means that O_EXCL
open attempt can unintentionally inject commands to in-progress
command stream for burning thus disturbing burning process. In most
cases, this doesn't cause problems because the first command to be
issued is TUR which most devices can process in the middle of burning.
However, depending on how a device replies to TUR during burning,
cdrom driver may end up issuing further commands.
This can't be resolved trivially by moving bd_claim() before doing
actual open() because that means an open attempt which will end up
failing could interfere other legit O_EXCL open attempts.
ie. unconfirmed open attempts can fail others.
This patch resolves the problem by introducing claiming block which is
started by bd_start_claiming() and terminated either by bd_claim() or
bd_abort_claiming(). bd_claim() from inside a claiming block is
guaranteed to succeed and once a claiming block is started, other
bd_start_claiming() or bd_claim() attempts block till the current
claiming block is terminated.
bd_claim() can still be used standalone although now it always
synchronizes against claiming blocks, so the existing users will keep
working without any change.
blkdev_open() and open_bdev_exclusive() are converted to use claiming
blocks so that exclusive open attempts from these functions don't
interfere with the existing exclusive open.
This problem was discovered while investigating bko#15403.
https://bugzilla.kernel.org/show_bug.cgi?id=15403
The burning problem itself can be resolved by updating userspace
probing tools to always open w/ O_EXCL.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Matthias-Christian Ott <ott@mirix.org>
Cc: Kay Sievers <kay.sievers@vrfy.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>