[Bugme-new] [Bug 16494] New: NFS client over TCP hangs due to packet loss

Mon Aug 2 09:14:46 PDT 2010

https://bugzilla.kernel.org/show_bug.cgi?id=16494

           Summary: NFS client over TCP hangs due to packet loss
           Product: Networking
           Version: 2.5
    Kernel Version: 2.6.34.1
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
        AssignedTo: shemminger at linux-foundation.org
        ReportedBy: andyc.bluearc at gmail.com
        Regression: No


When there is enough packet loss on a TCP connection between the NFS client
and an NFS server (using NFS v3), the RPC client code attempts recovery by
shutting down the connection and then re-establishing it. After that, we see
repeated connection setups and teardowns without any intervening data
packets:

4    42.909478    172.18.0.39    10.1.6.102    TCP    1013 > nfs [SYN] Seq=0 Win=5840 Len=0 MSS=1460 TSV=108490 TSER=0 WS=0
5    42.909577    10.1.6.102    172.18.0.39    TCP    nfs > 1013 [SYN, ACK] Seq=0 Ack=1 Win=64240 Len=0 MSS=1460
6    42.909610    172.18.0.39    10.1.6.102    TCP    1013 > nfs [ACK] Seq=1 Ack=1 Win=5840 Len=0
7    42.909672    172.18.0.39    10.1.6.102    TCP    1013 > nfs [FIN, ACK] Seq=1 Ack=1 Win=5840 Len=0
8    42.909767    10.1.6.102    172.18.0.39    TCP    nfs > 1013 [ACK] Seq=1 Ack=2 Win=64240 Len=0
9    43.660083    10.1.6.102    172.18.0.39    TCP    nfs > 1013 [FIN, ACK] Seq=1 Ack=2 Win=64240 Len=0
10    43.660100    172.18.0.39    10.1.6.102    TCP    1013 > nfs [ACK] Seq=2 Ack=2 Win=5840 Len=0

This sequence then repeats after a while.
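
To make the pattern in the trace explicit, here is a rough sketch that
classifies one capture cycle as "set up and torn down with no data". The
struct seg type, the F_* flag bits and is_empty_cycle() are hypothetical
names invented for this illustration; they are not part of any capture
tool or of the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits and record type, for illustration only. */
enum { F_SYN = 1 << 0, F_ACK = 1 << 1, F_FIN = 1 << 2 };

struct seg {
    bool from_client;   /* true: 172.18.0.39 -> 10.1.6.102 */
    unsigned flags;     /* combination of F_* bits */
    unsigned len;       /* TCP payload length (Len= in the trace) */
};

/* Frames 4 to 10 of the trace above, transcribed by hand. */
static const struct seg frames_4_to_10[] = {
    { true,  F_SYN,         0 },
    { false, F_SYN | F_ACK, 0 },
    { true,  F_ACK,         0 },
    { true,  F_FIN | F_ACK, 0 },
    { false, F_ACK,         0 },
    { false, F_FIN | F_ACK, 0 },
    { true,  F_ACK,         0 },
};

/*
 * Returns true if segs[0..n) opens with a bare client SYN and the client
 * sends a FIN before any segment in either direction carries payload.
 */
bool is_empty_cycle(const struct seg *segs, size_t n)
{
    if (n == 0 || !segs[0].from_client || segs[0].flags != F_SYN)
        return false;

    for (size_t i = 1; i < n; i++) {
        if (segs[i].len > 0)
            return false;   /* data was exchanged */
        if (segs[i].from_client && (segs[i].flags & F_FIN))
            return true;    /* client closed without sending data */
    }
    return false;           /* no client FIN seen */
}
```

Calling is_empty_cycle(frames_4_to_10, 7) reports the cycle as empty: the
client's FIN in frame 7 follows the handshake directly, with Len=0
throughout.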

Here's a link to what I think the problem is: http://lkml.org/lkml/2010/7/27/42

Essentially, tcp_sendmsg() bails out at the following check because
sk_shutdown still contains SEND_SHUTDOWN:

         err = -EPIPE;
         if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
                 goto out_err;
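
The user-visible effect of a set send-shutdown flag can be reproduced from
user space. The sketch below is only an analogue: it assumes an AF_UNIX
socketpair as a stand-in for the TCP connection so that it runs without a
network, and errno_after_write_shutdown() is a made-up helper name. It does
not exercise the NFS/RPC reconnect path.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Shut down the write side of a connected stream socket, then try to send.
 * Returns the errno reported by send(), 0 if send() unexpectedly succeeds,
 * or -1 if the sockets could not be set up.
 */
int errno_after_write_shutdown(void)
{
    int fds[2];
    int result;

    /* Assumption: an AF_UNIX pair stands in for the TCP connection. */
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return -1;

    if (shutdown(fds[0], SHUT_WR) < 0) {
        result = -1;
    } else if (send(fds[0], "x", 1, MSG_NOSIGNAL) < 0) {
        /* MSG_NOSIGNAL suppresses SIGPIPE so the error can be inspected. */
        result = errno;
    } else {
        result = 0;
    }

    close(fds[0]);
    close(fds[1]);
    return result;
}
```

On Linux this should return EPIPE, the same error the check above produces
once SEND_SHUTDOWN is set on the socket.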

Here's a patch that fixes the hang. It clears the sk_shutdown flag at
connection init time:

--- /home/company/software/src/linux-2.6.34.1/net/ipv4/tcp_output.c    2010-07-27 08:46:46.917000000 +0100
+++ net/ipv4/tcp_output.c       2010-07-27 09:19:16.000000000 +0100
@@ -2522,6 +2522,13 @@
        struct tcp_sock *tp = tcp_sk(sk);
        __u8 rcv_wscale;

+       /* clear down any previous shutdown attempts so that
+        * reconnects on a socket that's been shutdown leave the
+        * socket in a usable state (otherwise tcp_sendmsg() returns
+        * -EPIPE).
+        */
+       sk->sk_shutdown = 0;
+
        /* We'll fix this up when we get a response from the other end.
         * See tcp_input.c:tcp_rcv_state_process case TCP_SYN_SENT.
         */

Whether that's the correct fix, I don't know.
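
For anyone who wants to reason about the patch without building a kernel,
here is a toy user-space model of the sequence shutdown, reconnect, send.
It is a sketch under stated assumptions, not kernel code: struct toy_sock
and the toy_* functions are invented for illustration and only mirror the
two fields the tcp_sendmsg() check reads.

```c
#include <errno.h>
#include <stdbool.h>

#define SEND_SHUTDOWN 2   /* same value as in include/net/sock.h */

/* Toy stand-in for the two struct sock fields the check reads. */
struct toy_sock {
    int sk_err;
    unsigned int sk_shutdown;
};

/* Models the check quoted from tcp_sendmsg(). */
int toy_sendmsg(const struct toy_sock *sk)
{
    if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
        return -EPIPE;
    return 0;
}

/* Models the RPC client shutting down the write side for recovery. */
void toy_shutdown_write(struct toy_sock *sk)
{
    sk->sk_shutdown |= SEND_SHUTDOWN;
}

/* Models connection init; clear_shutdown selects the patched behaviour. */
void toy_connect_init(struct toy_sock *sk, bool clear_shutdown)
{
    if (clear_shutdown)
        sk->sk_shutdown = 0;
}

/* Result of the first send after shutdown followed by reconnect. */
int toy_send_after_reconnect(bool patched)
{
    struct toy_sock sk = { 0, 0 };

    toy_shutdown_write(&sk);
    toy_connect_init(&sk, patched);
    return toy_sendmsg(&sk);
}
```

In this model toy_send_after_reconnect(false) returns -EPIPE and
toy_send_after_reconnect(true) returns 0, which is the difference the patch
is meant to make. It says nothing about whether clearing the flag at that
point is safe for other callers.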

At the time of writing, the current state of the LKML thread is here:
http://lkml.org/lkml/2010/7/29/120
