[Bugme-new] [Bug 8839] New: Runtime Memory Inconsistency on Linux kernel 2.4.21-32

bugme-daemon at bugzilla.kernel.org bugme-daemon at bugzilla.kernel.org
Thu Aug 2 05:03:28 PDT 2007


http://bugzilla.kernel.org/show_bug.cgi?id=8839

           Summary: Runtime Memory Inconsistency on Linux kernel 2.4.21-32
           Product: Memory Management
           Version: 2.5
     KernelVersion: 2.4.21-32.ELsmp
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: blocking
          Priority: P1
         Component: MTTR
        AssignedTo: akpm at osdl.org
        ReportedBy: lvenkata at in.ibm.com
                CC: lvenkata at in.ibm.com


Most recent kernel where this bug did not occur: Not tested on other kernels.
Distribution: RedHat RHEL 3.0 U3
Hardware Environment: i686 athlon i386, AMD Athlon
Software Environment: WebSphere Application driven by IBM JDK 1.4.2
Problem Description:

Symptom : WebSphere application crash

Description of Issue :

Analysis of the crash footprints indicates the following:

1) Well-guarded (locked) data structures end up holding invalid memory.
2) A linked list whose nodes are all correctly formed ends up pointing to an
"invalid" node at the point of execution.
3) This list is well guarded (locked by native locks), so we have ruled out the
possibility of it being abruptly updated by another thread.
4) We have also verified that this area of memory has not been overlaid by
another memory allocation.

In short, a native memory inconsistency occurs even though the memory in
question is guarded, not overlaid, and correctly built. This suggests a
low-level memory issue, possibly in the kernel's memory management.

Analysis and Exact details retrieved from the System Core
==========================================================

Crash Symptom : Abort owing to a panic in the Java Virtual Machine (JVM).

Reason for Panic : A pointer to one of the JVM data structures that internally
represent a Java thread holds an "invalid address".

Stack Trace of Crash
====================

#0  0xb749acdf in raise () from /lib/tls/libc.so.6
#1  0xb749c4e5 in abort () from /lib/tls/libc.so.6
#2  0xb71ebd41 in _hpiPanic (
    fmt=0xb71faf80 "JVMLH019: invalid thread sr_state %d\n")
    at /userlvl/cxia32142ifx/src/hpi/pfm/hpi_util_md.c:183
#3  0xb71f548e in tellThreadToSuspend (self=0xa0fb1bc0, tid=0x31284347,
    type=GLOB_SUSPEND) at /userlvl/cxia32142ifx/src/hpi/pfm/threads_md.c:1502
#4  0xb71f6dbb in sysThreadSingle ()
    at /userlvl/cxia32142ifx/src/hpi/pfm/threads_md.c:2542
#5  0xb74296af in __clz_tab ()
   from /opt/IBM/WebSphere/AppServer/java/jre/bin/classic/libjvm.so
#6  0x00000001 in ?? ()
#7  0x00000000 in ?? ()

Frames 2 to 5 are JVM function calls.

Frame 4 executes the following pseudocode:

tid = head;  // start from the tid pointed to by head
i = 0;
while ((i < no_of_elements in list) && (tid != NULL)) {
    if ((CHECK 1) && (CHECK 2)) {                           /*ibm at 52001*/
        if (tid == self) {
            // do something
        } else {
            if (tellThreadToSuspend(self, tid,
                                    suspendattribute) == ERROR) { /*ibm at 57783*/
                ret = ERROR;
            }
        }
    }
    prev = tid;
    tid = tid->next;  // iterate the list using the "next" field
    i++;
}


The problem is the value tid=0x31284347, which is the "invalid memory
location" that subsequently leads to the panic.

As the pseudocode shows, the code iterates from the head through the elements
of the list using tid->next.

So the invalid tid should be part of the linked list. I have pasted the entire
list below, in order; it is the list the above code iterates through, and the
"invalid tid value" is evidently not part of it.

Note : tid=systhread

========================================
List Memory

  HEAD
  systhread=0BD464B8 
  systhread=0C395F70 
  systhread=9FE5D378 
  systhread=0C3666E8 
  systhread=A0F7B598 
  systhread=0C265000 
  systhread=0C1F62E0 
  systhread=0C3415A0 
  systhread=0C4BCA78 
  systhread=A0F79348 
  systhread=A0F75888 
  systhread=A0F7AFE0 
  systhread=0BEDC178 
  systhread=0BF61570 
  systhread=0B05A6E8 
  systhread=0A949598 
  systhread=A0F7C6C0 
  systhread=A0F78D90 
  systhread=A0F77C68 
  systhread=A0F7BB50 
  systhread=A0F7EEC8 
  systhread=A0F7E910 
  systhread=A0F70088 
  systhread=A0FB2178 
  systhread=A0FB1608 
  systhread=A0FB1BC0 
  systhread=A0F7E358 
  systhread=A0F78220 
  systhread=0B963C40 
  systhread=A48C2338 
  systhread=0A950D10 
  systhread=0A8EF5D0 
  systhread=0C05EF40 
  systhread=0C130060 
  systhread=0AA86850 
  systhread=0C22D6E8 
  systhread=0C24F2D8 
  systhread=0B962AB8 
  systhread=0B9EC808 
  systhread=0BAF6A38 
  systhread=A2823A90 
  systhread=A2810E38 
  systhread=0BF2B218 
  systhread=0B9A2D88 
  systhread=0BA644D8 
  systhread=0C2823D0 
  systhread=9F942A00 
  systhread=9FBC1400 
  systhread=0B9A3518 
  systhread=0B24E6F8 
  systhread=0C22DCA0 
  systhread=0B8F9310 
  systhread=0BD00C18 
  systhread=0BC349F8 
  systhread=0C185D50 
  systhread=0B965FE0 
  systhread=0AC3D360 
  systhread=0AB0A6C0 
  systhread=0AB7C418 
  systhread=0B962310 
  systhread=0BAD7A30 
  systhread=09CBDDC8 
  systhread=A16FA328 
  systhread=A16F9AA8 
  systhread=A89866B8 
  systhread=9FE92F08 
  systhread=9FB86B70 
  systhread=9FEFE8C8 
  systhread=9FD21E68 
  systhread=0BE97830 
  systhread=0AA02318 
  systhread=0BAAE598 
  systhread=0BFA4750 
  systhread=A16F9248 
  systhread=A16F7908 
  systhread=9FB7A390 
  systhread=9F904230 
  systhread=0C383378 
  systhread=0AA29308 
  systhread=0AB6D470 
  systhread=0A36E1C0 
  systhread=0BEFB240 
  systhread=0AA20840 
  systhread=A231B858 
  systhread=A2310420 
  systhread=9FDDD180 
  systhread=9FE57368 
  systhread=9FB53708 
  systhread=9FE5DBF8 
  systhread=9FE08B28 
  systhread=0AEF4708 
  systhread=0BA66258 
  systhread=0AC003C0 
  systhread=0A6CDD00 
  systhread=0AE01D18 
  systhread=0ADBBF70 
  systhread=0AD28F78 
  systhread=0A6829E8 
  systhread=0A50D590 
  systhread=AA6031D8 
  systhread=0A222050 
  systhread=0A223240 
  systhread=09CD37D8 
  systhread=A9A148B8 
  systhread=AB8093C8 
  systhread=AB808E10 
  systhread=A9DE1B98 
  systhread=AABE65D8 
  systhread=A9D29060 
  systhread=A9D28AA8 
  systhread=AABBBD28 
  systhread=AABBB770 
  systhread=AABB95E0 
  systhread=09C7A8A8 
  systhread=AADF6310 
  systhread=AAD7EF50 
  systhread=AAD7E998 
  systhread=AAD7E3E0 
  systhread=AAD7C080 
  systhread=AAD7BAC8 
  systhread=09C6A260 
  systhread=0868DE28 
  systhread=0868C818 
  systhread=AB841F60 
  systhread=09A886B8 
  systhread=09A87148 
  systhread=090BDA48 
  systhread=B129D488 
  systhread=AE737AD8 
  systhread=0A0C03F0 
  systhread=0A0BD3B0 
  systhread=0A077878 
  systhread=AECDF240 
  systhread=B12D4518 
  systhread=0A0A35E0 
  systhread=0A097140 
  systhread=0A075548 
  systhread=0A01E950 
  systhread=09FCAB10 
  systhread=09FCA558 
  systhread=09E7F568 
  systhread=09DCD460 
  systhread=09DF6398 
  systhread=09DEBDE8 
  systhread=09DAF558 
  systhread=09D9FC90 
  systhread=09D9ABA0 
  systhread=B12DA7B0 
  systhread=08FA09C0 
  systhread=0925F140 
  systhread=08F9E080 
  systhread=08332550 
  systhread=081EF880 
  systhread=08188AB0 
  systhread=081873F0 
  systhread=08185D30 
  systhread=08184670 
  systhread=08182FB0 
  systhread=08181970 
  systhread=08180400 
  systhread=0817DA00 
  systhread=0817B270 
  systhread=08178C60 
NULL
========================================
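
The membership check described above (is the invalid tid reachable from HEAD
by following "next"?) can be expressed as a short C sketch. This is an
illustration only, not the JVM's code: node_t is a hypothetical stand-in for
sys_thread_t that keeps only the "next" link, and list_index_of and max_nodes
are names invented here. Against a core file the same walk would be done in
the debugger rather than by running code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sys_thread_t; only the "next" link matters here. */
typedef struct node {
    struct node *next;
} node_t;

/*
 * Follow "next" from head for at most max_nodes nodes and return the
 * zero-based index of target, or -1 if target is not reached.
 * The bound guards against a corrupted or cyclic list.
 */
static long list_index_of(const node_t *head, const node_t *target,
                          size_t max_nodes)
{
    size_t i = 0;
    const node_t *cur = head;

    assert(max_nodes > 0);
    while (cur != NULL && i < max_nodes) {
        if (cur == target)
            return (long)i;
        cur = cur->next;
        i++;
    }
    return -1;
}
```

With head set to the HEAD entry and target set to the suspect pointer, a
result of -1 corresponds to the observation that the value is not part of the
list.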

1) The invalid tid value of 0x31284347 is not in the above list.
2) This value was retrieved when iterating through the above list at runtime.
3) The list itself is guarded by locks in the JVM code.
4) The list nodes are not overlaid.

5) Also, from frame 4:

#4  0xb71f6dbb in sysThreadSingle ()
    at /userlvl/cxia32142ifx/src/hpi/pfm/threads_md.c:2542

(gdb) info locals

i = 49
ret = 0
tid = (sys_thread_t *) 0x31284347
self = (sys_thread_t *) 0xa0fb1bc0

"i" represents the loop induction variable in the while loop in the pseudocode
above. It suggests that the last tid processed correctly is:

systhread=9FBC1400

So 0B24E6F8->next should hold the incorrect value "0x31284347", but when we
check the next pointer:

(gdb) p *(sys_thread_t *) 0xa0fb1bc0
$2 = {ref_count = 0, pid = 7764, sys_tid = 2727795632, next = 0xa0f7e358, state
= {value = RUNNABLE, data = 0},interrupted = 0, single_threaded = FALSE,
is_system_thread = TRUE, seen_to_die = FALSE, ps_count = 0

It correctly points to the next item in the list, namely "0xa0f7e358".
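
The reasoning about "i" can be sketched in C as well. In the pseudocode above,
tid and i advance together starting from head and 0, so the tid seen at loop
index i should be the node reached after following i "next" links from head.
The sketch below is illustrative only; tnode_t, list_node_at and
tid_matches_list are hypothetical names, not JVM functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for sys_thread_t; only the "next" link is modelled. */
typedef struct tnode {
    struct tnode *next;
} tnode_t;

/*
 * Return the node reached after following `index` links from head,
 * or NULL if the list ends first.
 */
static const tnode_t *list_node_at(const tnode_t *head, size_t index)
{
    const tnode_t *cur = head;

    while (cur != NULL && index > 0) {
        cur = cur->next;
        index--;
    }
    return cur;
}

/*
 * Return nonzero if the pointer observed at loop index i is the node the
 * list itself yields at that position.
 */
static int tid_matches_list(const tnode_t *head, size_t i,
                            const tnode_t *observed)
{
    assert(head != NULL);
    return list_node_at(head, i) == observed;
}
```

A zero result from tid_matches_list for the observed tid and the recorded "i"
would correspond to the mismatch reported here: the list yields one node at
that position, while the running loop held a different value.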

Considering the above five points, there seems to be no reason for the invalid
address to "turn up" at runtime, and this clearly points to a "low level memory
management issue".


Steps to reproduce:

There is no standalone testcase that reproduces this problem. It occurs only
on a single production server and cannot be reproduced in the test
environment.
The footprints available for analysis are the core file and the WebSphere and
application logs.

Other Information
=================

For access to the footprints, such as the core file, logs and the thread dump,
please get in touch with me at lvenkata at in.ibm.com.



