578531 – [RHEL5.5] soft lockup on vlan with bonding in balance-alb mode

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.5
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Andy Gospodarek
QA Contact:	Liang Zheng
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	602197 615996
Depends On:
Blocks:	630540 640803

Reported:	2010-03-31 14:53 UTC by Yury Konovalov
Modified:	2018-12-05 15:13 UTC
CC List:	35 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	An attempt to create a VLAN interface on a bond of two bnx2 adapters in a two-switch configuration resulted in a soft lockup after a few seconds. This was caused by an incorrect use of a bonding pointer. With this update, soft lockups no longer occur and creating a VLAN interface works as expected.
Clone Of:
Environment:
Last Closed:	2011-01-13 21:23:29 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
stack trace with bonding on a pair of intel cards (142.09 KB, image/jpeg) 2010-06-11 13:08 UTC, Gianluca Cecchi	no flags	Details
suggested patch (2.67 KB, patch) 2010-07-19 14:28 UTC, Flavio Leitner	no flags	Details \| Diff
bonding-fix-alb-mode-to-balance-traffic-on-vlans.patch (1.37 KB, patch) 2010-07-19 15:52 UTC, Andy Gospodarek	no flags	Details \| Diff
bonding-fix-alb-mode-to-balance-traffic-on-vlans-updated.patch (1.52 KB, patch) 2010-07-27 20:38 UTC, Andy Gospodarek	no flags	Details \| Diff

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2011:0017	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.6 kernel security and bug fix update	2011-01-13 10:37:42 UTC

Description Yury Konovalov 2010-03-31 14:53:50 UTC

An attempt to create a VLAN interface on a bond of two bnx2 adapters in a two-switch configuration results in a soft lockup after a few seconds.

kernel-2.6.18-194.el5

How reproducible:
1) Configure bond0 in mode 6 of two bnx2 ifaces
2) Create VLAN iface on bond0
3) ping some host via vlan iface
4) Wait a few seconds until soft lockup messages appear and all connectivity with the host is lost.

Steps to Reproduce:

1. Configure network as follows

/etc/modprobe.conf 
alias eth0 bnx2
alias eth1 bnx2
alias bond0 bonding

/etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0                                                                                                 
BONDING_OPTS="mode=6 miimon=300"                                                                             
ONBOOT=yes                                                                                                   
BOOTPROTO=none

/etc/sysconfig/network-scripts/ifcfg-bond0.3094
DEVICE=bond0.3094                                                                                            
ONBOOT=yes                                                                                                   
REORDER_HDR=no                                                                                               
VLAN=yes                                                                                                     
BOOTPROTO=static                                                                                             
IPADDR=192.168.55.63                                                                                         
NETMASK=255.255.255.0

/etc/sysconfig/network-scripts/ifcfg-eth0
DEVICE=eth0                                                                                                  
ONBOOT=yes                                                                                                   
MASTER=bond0                                                                                                 
SLAVE=yes
HOTPLUG=no

/etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1
ONBOOT=yes                                                                                                   
MASTER=bond0                                                                                                 
SLAVE=yes
HOTPLUG=no

2. Start bond0
 service network start

3. Optionally, wait a few minutes to make sure bonding itself is not to blame.

4. Create vlan iface
 ifup bond0.3094

5. Create some traffic
 ping 192.168.55.62


Actual results:

Soon you'll get a backtrace message like the following:

BUG: soft lockup - CPU#0 stuck for 10s! [swapper:0]
CPU 0:
Modules linked in: 8021q ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge lockd sunrpc bonding ipv6 xfrm_nalgo crypto_api video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport sr_mod cdrom ksm(U) kvm_intel(U) kvm(U) i5000_edac usb_storage edac_mc bnx2 pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_round_robin dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_dh dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata mptsas mptscsih scsi_transport_sas mptbase shpchp qla2xxx scsi_transport_fc sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: G      2.6.18-194.el5 #1
RIP: 0010:[<ffffffff80065c23>]  [<ffffffff80065c23>] .text.lock.spinlock+0x29/0x30
RSP: 0018:ffffffff80448c80  EFLAGS: 00000286
RAX: ffffffff803fdfd8 RBX: ffff81082636a6c0 RCX: ffff81082636a000
RDX: ffff81082d0e7710 RSI: ffff81082636a000 RDI: ffff81082636a758
RBP: ffffffff80448c00 R08: 0000000000000000 R09: ffff81082e9e9070
R10: ffff81082f7a8d80 R11: 00000000000000c8 R12: ffffffff8005ec8e
R13: ffff81082636a758 R14: ffffffff8007922b R15: ffffffff80448c00
FS:  0000000000000000(0000) GS:ffffffff803cb000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000042023fe8 CR3: 0000000000201000 CR4: 00000000000026e0

Call Trace:
 <IRQ>  [<ffffffff885e6002>] :bonding:rlb_arp_recv+0xb2/0x146
 [<ffffffff800209ba>] netif_receive_skb+0x43e/0x49f
 [<ffffffff8838792c>] :bnx2:bnx2_poll_work+0x1116/0x124f
 [<ffffffff80151d39>] kobject_add+0x10c/0x19b
 [<ffffffff80081c39>] cacheinfo_cpu_callback+0xda/0x516
 [<ffffffff88387e1b>] :bnx2:bnx2_poll+0xdf/0x209
 [<ffffffff8000c88a>] net_rx_action+0xac/0x1e0
 [<ffffffff80012409>] __do_softirq+0x89/0x133
 [<ffffffff8005f2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006dba8>] do_softirq+0x2c/0x85
 [<ffffffff8006da30>] do_IRQ+0xec/0xf5
 [<ffffffff800575d0>] mwait_idle+0x0/0x4a
 [<ffffffff8005e615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80226577>] pci_mmcfg_read+0x0/0x92
 [<ffffffff80057606>] mwait_idle+0x36/0x4a
 [<ffffffff800497be>] cpu_idle+0x95/0xb8
 [<ffffffff80407807>] start_kernel+0x220/0x225
 [<ffffffff8040722f>] _sinittext+0x22f/0x236

Expected results:

The network should work normally.

Additional info:

It was fully functional on the 2.6.18-164.11.1.el5 kernel (except when a bridge is added on the VLAN interface, but that's a different story).

Hardware is IBM eServer BladeCenter HS21 with BIOS version 1.14

Comment 1 Derek Moore 2010-05-06 19:08:28 UTC

We are experiencing the same problem on multiple hardware platforms:

Kernel 2.6.18-194.el5
   - bnx2 (2.0.8b)
   - e1000 (7.3.21-k4-NAPI)


BUG: soft lockup - CPU#7 stuck for 10s! [swapper:0]
CPU 7:
Modules linked in: 8021q bridge autofs4 lockd sunrpc bonding ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st e1000e serio_raw hpilo shpchp bnx2(U) pcspkr sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: G      2.6.18-194.el5 #1
RIP: 0010:[<ffffffff80065c20>]  [<ffffffff80065c20>] .text.lock.spinlock+0x26/0x30
RSP: 0018:ffff81031feb3cb0  EFLAGS: 00000286
RAX: ffff81031feabfd8 RBX: ffff8103125b16c0 RCX: ffff8103125b1000
RDX: ffff81031b016710 RSI: ffff8103125b1000 RDI: ffff8103125b1758
RBP: ffff81031feb3c30 R08: ffff8103176f2e80 R09: ffff810316545180
R10: ffff81031feb3db8 R11: 00000000000000c8 R12: ffffffff8005ec8e
R13: ffff8103125b1758 R14: ffffffff8007922b R15: ffff81031feb3c30
FS:  0000000000000000(0000) GS:ffff81031fe283c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 000000001efbb670 CR3: 0000000310e8a000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff884ce002>] :bonding:rlb_arp_recv+0xb2/0x146
 [<ffffffff800209ba>] netif_receive_skb+0x43e/0x49f
 [<ffffffff882abe50>] :bnx2:bnx2_poll+0x1245/0x14e3
 [<ffffffff80151248>] __next_cpu+0x19/0x28
 [<ffffffff8008ccb0>] find_busiest_group+0x20d/0x621
 [<ffffffff800c907c>] free_pages_bulk+0x1f0/0x268
 [<ffffffff8000c88a>] net_rx_action+0xac/0x1e0
 [<ffffffff80012409>] __do_softirq+0x89/0x133
 [<ffffffff8005f2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006dba8>] do_softirq+0x2c/0x85
 [<ffffffff8006da30>] do_IRQ+0xec/0xf5
 [<ffffffff8005e615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff8019e040>] acpi_processor_idle_simple+0x17d/0x30e
 [<ffffffff8019df2f>] acpi_processor_idle_simple+0x6c/0x30e
 [<ffffffff8019dec3>] acpi_processor_idle_simple+0x0/0x30e
 [<ffffffff8019dec3>] acpi_processor_idle_simple+0x0/0x30e
 [<ffffffff800497be>] cpu_idle+0x95/0xb8
 [<ffffffff80078997>] start_secondary+0x498/0x4a7


BUG: soft lockup - CPU#3 stuck for 10s! [swapper:0]
CPU 3:
Modules linked in: 8021q bridge autofs4 lockd sunrpc bonding ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec i2c_core dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport st sg hpilo shpchp e1000e pcspkr bnx2(U) serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 0, comm: swapper Tainted: G      2.6.18-194.el5 #1
RIP: 0010:[<ffffffff80065c20>]  [<ffffffff80065c20>] .text.lock.spinlock+0x26/0x30
RSP: 0018:ffff81010afbfd60  EFLAGS: 00000286
RAX: ffff81010afb9fd8 RBX: ffff8103170826c0 RCX: ffff810317082000
RDX: ffff8103152f1710 RSI: ffff810317082000 RDI: ffff810317082758
RBP: ffff81010afbfce0 R08: 0000000000000000 R09: 0000000000000000
R10: ffff8103153d95c0 R11: 00000000000000c8 R12: ffffffff8005ec8e
R13: ffff810317082758 R14: ffffffff8007922b R15: ffff81010afbfce0
FS:  0000000000000000(0000) GS:ffff81031ff236c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000003516003080 CR3: 0000000000201000 CR4: 00000000000006e0

Call Trace:
 <IRQ>  [<ffffffff884c7002>] :bonding:rlb_arp_recv+0xb2/0x146
 [<ffffffff800209ba>] netif_receive_skb+0x43e/0x49f
 [<ffffffff882e33cd>] :e1000e:e1000_receive_skb+0x1b5/0x1d6
 [<ffffffff882e7b27>] :e1000e:e1000_clean_rx_irq+0x27a/0x321
 [<ffffffff882e5bc5>] :e1000e:e1000_clean+0x7c/0x29a
 [<ffffffff8000c88a>] net_rx_action+0xac/0x1e0
 [<ffffffff882e5a55>] :e1000e:e1000_intr_msi+0xd6/0xe0
 [<ffffffff80012409>] __do_softirq+0x89/0x133
 [<ffffffff8005f2fc>] call_softirq+0x1c/0x28
 [<ffffffff8006dba8>] do_softirq+0x2c/0x85
 [<ffffffff8006da30>] do_IRQ+0xec/0xf5
 [<ffffffff8005e615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff8019e040>] acpi_processor_idle_simple+0x17d/0x30e
 [<ffffffff8019df2f>] acpi_processor_idle_simple+0x6c/0x30e
 [<ffffffff8019dec3>] acpi_processor_idle_simple+0x0/0x30e
 [<ffffffff8019dec3>] acpi_processor_idle_simple+0x0/0x30e
 [<ffffffff800497be>] cpu_idle+0x95/0xb8
 [<ffffffff80078997>] start_secondary+0x498/0x4a7

Comment 2 matthew zeier 2010-05-28 04:30:52 UTC

I'm trying to reproduce this on a machine running 2.6.18-194.3.1.el5 and can't.  Yury, do you still have that problem if you upgrade to 2.6.18-194.3.1.el5?

Comment 3 Gunther Schlegel 2010-06-09 13:47:09 UTC

I can confirm this problem with 2.6.18-194.3.1.el5 on a Dell PowerEdge 1650 with two e1000 NICs.

lspci output:
00:00.0 Host bridge: Broadcom CNB20HE Host Bridge (rev 23)
00:00.1 Host bridge: Broadcom CNB20HE Host Bridge (rev 01)
00:00.2 Host bridge: Broadcom CNB20HE Host Bridge (rev 01)
00:00.3 Host bridge: Broadcom CNB20HE Host Bridge (rev 01)
00:0c.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:0f.0 Host bridge: Broadcom CSB5 South Bridge (rev 93)
00:0f.1 IDE interface: Broadcom CSB5 IDE Controller (rev 93)
00:0f.2 USB Controller: Broadcom OSB4/CSB5 OHCI USB Controller (rev 05)
00:0f.3 ISA bridge: Broadcom CSB5 LPC bridge
01:02.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02)
01:04.0 Ethernet controller: Intel Corporation 82544EI Gigabit Ethernet Controller (Copper) (rev 02)
01:08.0 PCI bridge: Intel Corporation 80303 I/O Processor PCI-to-PCI Bridge (rev 01)
01:08.1 RAID bus controller: Dell PowerEdge Expandable RAID Controller 3/Di (rev 01)

Hotfix: change to bonding mode 5.

Comment 4 Gianluca Cecchi 2010-06-11 13:04:04 UTC

I'm experiencing the same problem on a Dell 2950 and 2.6.18-194.3.1.el5
with two Intel and two Broadcom (the embedded ones) adapters.
I have bonding and VLANs too.
I upgraded (via a fresh install) from RHEL 4.5 x86_64 to RHEL 5.5 (plus all updates as of today) and am now experiencing these problems, which I did not have before.

Changing the mode from 6 to 5 works around the problem for me too.
# lspci|grep thern
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
0a:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
0a:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)

I get the same stack trace (I'm going to attach a screenshot) with bonding on top of a pair of bnx2 adapters and on top of a pair of e1000e adapters.
So it is driver independent.

Comment 5 Gianluca Cecchi 2010-06-11 13:08:06 UTC

Created attachment 423266 [details]
stack trace with bonding on a pair of intel cards

Comment 6 Roy Keene 2010-06-17 05:17:57 UTC

I had this issue when I switched to mode 6 (balance-alb).


BUG: soft lockup - CPU#1 stuck for 10s! [swapper:0]

Pid: 0, comm:              swapper
EIP: 0060:[<c061dc4c>] CPU: 1
EIP is at _spin_lock_bh+0xd/0x18
 EFLAGS: 00000286    Not tainted  (2.6.18-194.3.1.el5 #1)
EAX: c0749000 EBX: f69a3d50 ECX: f68a8d2c EDX: f69a3800
ESI: f69a3cfc EDI: f69a3d50 EBP: f68fd600 DS: 007b ES: 007b
CR0: 8005003b CR2: b7fc8000 CR3: 00742000 CR4: 000006d0
 [<f8fed0ea>] rlb_arp_recv+0x98/0x11d [bonding]
 [<c05c0aa8>] netif_receive_skb+0x3ac/0x401
 [<f8ba5df3>] tg3_poll+0x64f/0xc28 [tg3]
 [<c041eeda>] __activate_task+0x4a/0x59
 [<c04ee889>] rb_erase+0x176/0x22f
 [<c05c2995>] net_rx_action+0x9c/0x1a7
 [<c042a377>] __do_softirq+0x87/0x114
 [<c04073cf>] do_softirq+0x52/0x9c
 [<c044f158>] __do_IRQ+0x0/0xd6
 [<c04074ce>] do_IRQ+0xb5/0xc3
 [<c0405946>] common_interrupt+0x1a/0x20
 [<c0403bb0>] default_idle+0x0/0x59
 [<c0403be1>] default_idle+0x31/0x59
 [<c0403ca8>] cpu_idle+0x9f/0xb9
 =======================

Comment 7 Stuart R. Kirk 2010-07-06 22:43:43 UTC

I can also confirm similar behavior on a Cisco UCS B250-M2 blade system using the ixgbe network driver.

In our situation we had physical eth0, eth1, eth2, and eth3, which we bonded into bond0 using mode 6.  As soon as the bond0.x interface was brought up during boot, the boot process halted.  When done from the command line, a stack trace similar to the above was produced.

Comment 8 Andy Gospodarek 2010-07-13 21:30:43 UTC

Unfortunately the soft lockups don't really tell me much, as they only show the receive path not being able to take the lock.  I'll have to try to reproduce this myself.

Comment 9 Flavio Leitner 2010-07-16 20:32:58 UTC

The CPU is stuck in rlb_update_entry_from_arp() trying to take the lock via
_lock_rx_hashtbl(bond) and, for some reason, it is unable to get it, so the
watchdog fires and shows that back trace. However, this is a consequence
and not the root cause because, in this case, another CPU is holding that
lock, leaving others waiting for it long enough to trigger the watchdog.
Therefore, can you get a few outputs of sysrq+t and sysrq+w while the
problem is happening?

It could show us which CPU is holding the lock and why.
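
As a rough illustration of why a CPU that cannot take a lock ends up tripping the watchdog, the following user-space C sketch spins on a lock word but gives up after a bounded number of attempts. The `model_spinlock` type and both functions are hypothetical names made up for this sketch. The kernel's spinlocks and soft lockup detection differ in detail, and a real kernel spinlock has no such give-up limit, which is why the CPU appears stuck.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel spinlock: a single atomic flag. */
struct model_spinlock {
    atomic_flag held;
};

/*
 * Try to take the lock, giving up after max_spins attempts.
 * Returns true if the lock was acquired, false if it stayed held
 * for the whole spin budget (the "stuck" case).
 */
static bool model_spin_lock_bounded(struct model_spinlock *lock,
                                    unsigned long max_spins)
{
    assert(lock != NULL);
    for (unsigned long i = 0; i < max_spins; i++) {
        /* test_and_set returns the previous value: false means we got it. */
        if (!atomic_flag_test_and_set_explicit(&lock->held,
                                               memory_order_acquire))
            return true;
    }
    return false;
}

/* Release the lock so that a later attempt can succeed. */
static void model_spin_unlock(struct model_spinlock *lock)
{
    assert(lock != NULL);
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

If the lock is never released, or if the memory being spun on is not a valid lock at all, every attempt fails and the caller makes no progress, which is the situation the back traces in this report show.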

thanks,
fbl

Comment 10 Flavio Leitner 2010-07-16 20:57:52 UTC

I could reproduce this here, so no need to provide sysrq+t or sysrq+w.

Comment 11 Andy Gospodarek 2010-07-19 13:35:30 UTC

Flavio, can you post any new information you have?

Comment 12 Flavio Leitner 2010-07-19 14:28:36 UTC

Created attachment 432903 [details]
suggested patch

Hi Andy,

The problem has been introduced by the following patch:
[net] bonding: allow arp_ip_targets on separate vlan from bond device

and not fixed by the later patch:
[net] fixup problems with vlans and bonding

The problem happens because, in rlb_arp_recv(), the struct bonding *bond
pointer actually points to a VLAN's net_device struct, so it can either oops or
just hang on an invalid spinlock.

I can reproduce both situations following the instructions in the
ticket's summary.

The upstream fix makes rlb_arp_recv() check for the IFF_802_1Q_VLAN flag
and, if it is present, find the underlying bonding device.
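
To make the idea concrete, here is a minimal user-space sketch in C of that check. It is not the kernel patch: the `model_*` structs, the `MODEL_IFF_802_1Q_VLAN` value, and the helpers `model_real_dev()` and `resolve_bond_dev()` are hypothetical stand-ins invented for illustration. Only the logic (test for the VLAN flag, then step down to the underlying device before touching bonding data) comes from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag value for this sketch; the kernel defines its own. */
#define MODEL_IFF_802_1Q_VLAN 0x1u

/* Simplified stand-in for struct net_device. */
struct model_net_device {
    const char *name;
    unsigned int priv_flags;           /* has the VLAN flag set on VLAN devices */
    struct model_net_device *real_dev; /* underlying device, VLAN devices only */
    void *priv;                        /* driver-private data (bonding state) */
};

/* Simplified stand-in for the bonding private data and its RX hash lock. */
struct model_bonding {
    int rx_hashtbl_lock;
};

/* Stand-in for looking up the real device behind a VLAN device. */
static struct model_net_device *model_real_dev(struct model_net_device *dev)
{
    return dev->real_dev;
}

/*
 * If the receiving device is a VLAN device, step down to the device it is
 * stacked on before its priv pointer is interpreted as bonding data.
 */
static struct model_net_device *resolve_bond_dev(struct model_net_device *dev)
{
    assert(dev != NULL);
    if (dev->priv_flags & MODEL_IFF_802_1Q_VLAN)
        return model_real_dev(dev);
    return dev;
}
```

With a bond device and a VLAN device stacked on it, `resolve_bond_dev()` returns the bond device in both cases, so the bonding private data (and its lock) is always read from the right structure.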

I have the patch backported and it works in my tests.
Please review.
fbl

Comment 14 Andy Gospodarek 2010-07-19 15:36:49 UTC

I don't think this can count as a regression.  The upstream patch below that adds the code described above was added in 2008 and were it not for the code that added support for arp_ip_targets on a separate VLAN this problem would have never been seen.  Here is the upstream patch we should consider:

commit 6146b1a4da98377e4abddc91ba5856bef8f23f1e
Author: Jay Vosburgh <fubar.com>
Date:   Tue Nov 4 17:51:15 2008 -0800

    bonding: Fix ALB mode to balance traffic on VLANs

This does not diminish the importance, but finding out that something is broken after a feature was added is not a regression.

Comment 15 Andy Gospodarek 2010-07-19 15:38:43 UTC

*** Bug 615996 has been marked as a duplicate of this bug. ***

Comment 16 Andy Gospodarek 2010-07-19 15:52:01 UTC

Created attachment 432929 [details]
bonding-fix-alb-mode-to-balance-traffic-on-vlans.patch

This patch should resolve the issue based on Flavio's analysis.  It is a backport of upstream commits:

6146b1a4da98377e4abddc91ba5856bef8f23f1e bonding: Fix ALB mode to balance traffic on VLANs
2690f8d62e98779c71625dba9a0fd525d8b2263d bonding: Remove debug printk

I have added this to my rhel5 gtest repo and will post to this bug when new test kernels are available.

Comment 17 RHEL Program Management 2010-07-19 15:59:20 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 Flavio Leitner 2010-07-19 16:38:07 UTC

I thought it was a regression because VLAN over bonding was working before
and, after a new feature was added, the same setup doesn't work anymore.

Regarding the new patch, the next merges will be a bit more complicated
because the loop is not using vlan_dev_real_dev() as upstream does, so I have
added a macro to deal with it. Also, I intentionally removed pk_type->dev
initialization because it didn't make sense to my eyes for this kernel version,
but I could be overlooking something.

Comment 19 Andy Gospodarek 2010-07-23 21:29:00 UTC

(In reply to comment #18)
> Also, I intentionally removed pk_type->dev
> initialization because it didn't make sense to my eyes for this kernel version,
> but I could be overlooking something.    

Removing it was a good idea.  It was removed today as it caused a panic in a multiple bond configuration.

Comment 20 Flavio Leitner 2010-07-26 20:15:05 UTC

(In reply to comment #19)
> Removing it was a good idea.  It was removed today as it caused a panic in a
> multiple bond configuration.    

Yeah, let me know when you have the final patch ready.

Comment 25 Andy Gospodarek 2010-07-27 20:38:24 UTC

Created attachment 434836 [details]
bonding-fix-alb-mode-to-balance-traffic-on-vlans-updated.patch

Here is an updated patch.  Feedback is welcome.

Comment 27 Andy Gospodarek 2010-08-09 12:51:10 UTC

My test kernels have been updated to include a patch for this bugzilla.

http://people.redhat.com/agospoda/#rhel5

Please test them and report back your results.

Comment 31 Shyam Mani 2010-09-02 03:41:55 UTC

(In reply to comment #27)
> My test kernels have been updated to include a patch for this bugzilla.
> 
> http://people.redhat.com/agospoda/#rhel5
> 
> Please test them and report back your results.

We've been running this for a couple of weeks on one of our production boxes and haven't seen any issues whatsoever.

# uptime 
 20:39:44 up 13 days, 20:30,  1 user,  load average: 1.30, 1.52, 1.56
# uname -a
Linux foo.mozilla.net 2.6.18-212.el5.gtest.89 #1 SMP Mon Aug 16 14:01:15 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

Comment 32 Andy Gospodarek 2010-09-02 04:16:01 UTC

Glad to hear that, Shyam!

I appreciate the feedback.

Comment 36 Yury Konovalov 2010-09-07 12:02:37 UTC

Andy,
Finally, I got a chance to put 2.6.18-212.el5.gtest.89 under test on an IBM HS21 system with two NICs in a two-switch configuration. It has worked smoothly for about half a day and performs pretty well (~1883 Mbits/sec on a multi-node iperf test).
Thanks for the patch and test kernel package. It seems like this bug is fixed.

Comment 37 Yury Konovalov 2010-09-07 13:24:24 UTC

Andy,

Could you please share the kernel-headers package of your test kernel build? I would like to test the GPFS filesystem with bonding on your test kernel, and I need to build the IBM GPFS modules and some other kernel-dependent stuff. kernel-devel is not enough for me.

Comment 38 Andy Gospodarek 2010-09-07 14:23:15 UTC

Glad to know they are working, Yury.  Thanks for that feedback.

I also added all of the headers rpms to my people page here:

http://people.redhat.com/agospoda/#rhel5

Please let me know if any other rpms from that build would be helpful and are not on my people page.

Comment 39 Yury Konovalov 2010-09-08 19:22:11 UTC

Thank you, Andy, for making the headers package available. I will report if any of my ongoing stress tests fail. So far it works and performs perfectly.

Comment 40 Jarod Wilson 2010-09-10 21:39:24 UTC

in kernel-2.6.18-219.el5
You can download this test kernel from http://people.redhat.com/jwilson/el5

Detailed testing feedback is always welcomed.

Comment 42 Gianluca Cecchi 2010-09-20 13:24:56 UTC

It also works on my Dell 2950 with kernel 2.6.18-219.el5, alb bonding, and some VLANs over a bond of two Broadcom adapters:
05:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)
09:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708 Gigabit Ethernet (rev 12)

No errors at all in at least half an hour.
In dmesg:
Ethernet Channel Bonding Driver: v3.4.0 (October 7, 2008)
bonding: In ALB mode you might experience client disconnections upon reconnection of a link if the bonding module updelay parameter (200 msec) is incompatible with the forwarding delay time of the switch
bonding: MII link monitoring set to 100 ms
ADDRCONF(NETDEV_UP): bond0: link is not ready
bonding: bond0: Adding slave eth0.
bnx2: eth0: using MSI
bonding: bond0: enslaving eth0 as an active interface with a down link.
bonding: bond0: Adding slave eth1.
bnx2: eth1: using MSI
bonding: bond0: enslaving eth1 as an active interface with a down link.
bnx2: eth0 NIC Copper Link is Up, 1000 Mbps full duplex
bonding: bond0: link status up for interface eth0, enabling it in 0 ms.
bonding: bond0: link status definitely up for interface eth0.
bonding: bond0: making interface eth0 the new active one.
bonding: bond0: first active interface up!
ADDRCONF(NETDEV_CHANGE): bond0: link becomes ready
bnx2: eth1 NIC Copper Link is Up, 1000 Mbps full duplex
bonding: bond0: link status up for interface eth1, enabling it in 200 ms.
bonding: bond0: link status definitely up for interface eth1.
802.1Q VLAN Support v1.8 Ben Greear ...
All bugs added by David S. Miller ...
bond0: no IPv6 routers present
bond0.13: no IPv6 routers present
bond0.139: no IPv6 routers present
bond0.221: no IPv6 routers present
bond0.66: no IPv6 routers present
bond0.68: no IPv6 routers present
bond0.800: no IPv6 routers present

Gianluca

Comment 43 Andy Gospodarek 2010-09-20 14:03:45 UTC

Excellent!  Thanks for the feedback, Gianluca.

Comment 44 Shyam Mani 2010-09-20 14:43:55 UTC

(In reply to comment #40)
> in kernel-2.6.18-219.el5
> You can download this test kernel from http://people.redhat.com/jwilson/el5
> 
> Detailed testing feedback is always welcomed.

# uname -a
Linux foo.mozilla.net 2.6.18-219.el5 #1 SMP Thu Sep 9 17:10:23 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux
# uptime
 07:41:25 up 9 days, 10:42,  1 user,  load average: 4.77, 4.59, 4.59

All good, same production server as in comment #31 and seems to be fine.

Comment 45 Tom Coughlan 2010-09-27 14:10:48 UTC

*** Bug 602197 has been marked as a duplicate of this bug. ***

Comment 46 G Songara 2010-09-28 14:46:28 UTC

This patch works for us. Can you please let us know when this patch will be available in redhat-release, as the problem also occurs on the latest kernel, 2.6.18-194.11.4.el5?

Thanks

Comment 47 Andy Gospodarek 2010-09-28 14:58:49 UTC

(In reply to comment #46)
> This patch works for us. Can you please let us know when this patch will be
> available in redhat-release, as it fails on latest kernel also
> 2.6.18-194.11.4.el5
> 
> Thanks

Right now it will not be available until the official RHEL5.6 update ships.  As of today, I cannot say for sure when RHEL5.6 will ship.

If you need support in a RHEL5.5 update kernel, please go through our support portal:

https://access.redhat.com/home

and request that this patch is added.  You can reference this bugzilla if needed.

Comment 48 Gianluca Cecchi 2010-09-28 15:25:51 UTC

I DO have a case open for this bug, and the latest answer (received today, after I asked for a date) was:
> Any information about an official and supported update containing the fix? 
At this moment the errata which will contain the fix is not yet confirmed.

My case is 
Case Number      : 00332049 
Case Open Date   : 2010-06-11 14:42:07

As you can see from the date, it has been about 3 months with no official solution yet....
So the question is: what is the added value of an active subscription?

Excuse me for being a little sarcastic...

Gianluca

Comment 49 Andy Gospodarek 2010-09-28 19:31:50 UTC

(In reply to comment #48)
> I DO have a case opened for this bug and latest answer has been (right today
> after asking for a date):
> > Any information about an official and supported update containing the fix? 
> At this moment the errata which will contain the fix is not yet confirmed.
> 
> My case is 
> Case Number      : 00332049 
> Case Open Date   : 2010-06-11 14:42:07
> 
> If you see the date, we are about 3 months and no official solution yet....
> So the question is: what is the added value of active subscription?
> 
> Excuse me for being a little sarcastic...
> 
> Gianluca

I'm glad you opened a support ticket for this.  Opening tickets is really the only way for us to know that this fix is critical enough to paying customers that we need to add it to the currently shipping (in this case 2.6.18-194) kernel stream.  Thanks for doing that.

It can be a bit confusing, but we open a bug for each release that requires a patch.  This bug will address the problem on upcoming RHEL5.6 and bug 630540 will address the problem on already released RHEL5.5 since there was enough customer demand (not just noise in bugzilla) to fix it before RHEL5.6 shipped.  I made some noise over in bug 630540, so hopefully things will move forward with a fix in RHEL5.5.

Comment 50 Gianluca Cecchi 2010-09-28 21:22:46 UTC

(In reply to comment #49)
> I made some noise over in bug 630540, so hopefully things will move forward
> with a fix in RHEL5.5.

I saw it, thanks.
I (we) will keep waiting for a fix, then.

Comment 51 Martin Prpič 2010-11-11 14:05:09 UTC

    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
An attempt to create a VLAN interface on a bond of two bnx2 adapters in a two-switch configuration resulted in a soft lockup after a few seconds. This was caused by an incorrect use of a bonding pointer. With this update, soft lockups no longer occur and creating a VLAN interface works as expected.

Comment 54 errata-xmlrpc 2011-01-13 21:23:29 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0017.html
