[Bugme-new] [Bug 8839] New: Runtime Memory Inconsistency on Linux kernel 2.4.21-32
bugme-daemon at bugzilla.kernel.org
Thu Aug 2 05:03:28 PDT 2007
http://bugzilla.kernel.org/show_bug.cgi?id=8839
Summary: Runtime Memory Inconsistency on Linux kernel 2.4.21-32
Product: Memory Management
Version: 2.5
KernelVersion: 2.4.21-32.ELsmp
Platform: All
OS/Version: Linux
Tree: Mainline
Status: NEW
Severity: blocking
Priority: P1
Component: MTTR
AssignedTo: akpm at osdl.org
ReportedBy: lvenkata at in.ibm.com
CC: lvenkata at in.ibm.com
Most recent kernel where this bug did not occur:Not Tested on other kernels.
Distribution:RedHat RHEL 3.0 U3
Hardware Environment:i686 athlon i386, AMD Athlon
Software Environment:WebSphere Application driven by IBM JDK 1.4.2
Problem Description:
Symptom : WebSphere application crash
Description of Issue :
Analysis of the crash footprints indicates that:
1) Well-guarded (locked) data structures end up holding invalid memory.
2) A linked list whose nodes are all correctly formed ends up pointing to an
"invalid" node at the point of execution.
3) This list is well guarded (locked by native locks), which rules out the
possibility of it being abruptly updated by another thread.
4) We have also verified that this area of memory has not been overlaid by
another memory allocation.
In short, this is a native memory inconsistency that occurs even though the
memory in question is guarded, not overlaid and correctly built, which suggests
a low-level memory issue, possibly to do with memory management in the kernel.
Analysis and Exact details retrieved from the System Core
==========================================================
Crash Symptom : Abort owing to a panic in the Java Virtual Machine(JVM).
Reason for Panic : The pointer to one of the JVM data structures that
internally represent a Java thread holds an "invalid address".
Stack Trace of Crash
====================
#0 0xb749acdf in raise () from /lib/tls/libc.so.6
#1 0xb749c4e5 in abort () from /lib/tls/libc.so.6
#2 0xb71ebd41 in _hpiPanic (
fmt=0xb71faf80 "JVMLH019: invalid thread sr_state %d\n")
at /userlvl/cxia32142ifx/src/hpi/pfm/hpi_util_md.c:183
#3 0xb71f548e in tellThreadToSuspend (self=0xa0fb1bc0, tid=0x31284347,
type=GLOB_SUSPEND) at /userlvl/cxia32142ifx/src/hpi/pfm/threads_md.c:1502
#4 0xb71f6dbb in sysThreadSingle ()
at /userlvl/cxia32142ifx/src/hpi/pfm/threads_md.c:2542
#5 0xb74296af in __clz_tab ()
from /opt/IBM/WebSphere/AppServer/java/jre/bin/classic/libjvm.so
#6 0x00000001 in ?? ()
#7 0x00000000 in ?? ()
Frames 2 to 5 represent JVM function calls.
Frame 4 executes the following pseudocode:
tid = head;    // start from the tid pointed to by head
i = 0;
while ((i < no_of_elements in list) && (tid != NULL)) {
    if ((CHECK 1) && (CHECK 2)) { /*ibm at 52001*/
        if (tid == self) {
            // do something
        } else {
            if (tellThreadToSuspend(self, tid,
                                    suspendattribute) == ERROR)
            { /*ibm at 57783*/
                ret = ERROR;
            }
        }
    }
    prev = tid;
    tid = tid->next;    // iterate the list using the "next" field
    i++;
}
The problem is the current value of tid, 0x31284347, which is the "invalid
memory location" that subsequently leads to the panic.
As we can see, the code iterates from the head through the elements of the list
using tid->next.
So the invalid tid should be part of the linked list. I have pasted below, in
order, the entire list that the above code iterates through, and the invalid
tid value is evidently not part of it.
Note : tid=systhread
========================================
List Memory
HEAD
systhread=0BD464B8
systhread=0C395F70
systhread=9FE5D378
systhread=0C3666E8
systhread=A0F7B598
systhread=0C265000
systhread=0C1F62E0
systhread=0C3415A0
systhread=0C4BCA78
systhread=A0F79348
systhread=A0F75888
systhread=A0F7AFE0
systhread=0BEDC178
systhread=0BF61570
systhread=0B05A6E8
systhread=0A949598
systhread=A0F7C6C0
systhread=A0F78D90
systhread=A0F77C68
systhread=A0F7BB50
systhread=A0F7EEC8
systhread=A0F7E910
systhread=A0F70088
systhread=A0FB2178
systhread=A0FB1608
systhread=A0FB1BC0
systhread=A0F7E358
systhread=A0F78220
systhread=0B963C40
systhread=A48C2338
systhread=0A950D10
systhread=0A8EF5D0
systhread=0C05EF40
systhread=0C130060
systhread=0AA86850
systhread=0C22D6E8
systhread=0C24F2D8
systhread=0B962AB8
systhread=0B9EC808
systhread=0BAF6A38
systhread=A2823A90
systhread=A2810E38
systhread=0BF2B218
systhread=0B9A2D88
systhread=0BA644D8
systhread=0C2823D0
systhread=9F942A00
systhread=9FBC1400
systhread=0B9A3518
systhread=0B24E6F8
systhread=0C22DCA0
systhread=0B8F9310
systhread=0BD00C18
systhread=0BC349F8
systhread=0C185D50
systhread=0B965FE0
systhread=0AC3D360
systhread=0AB0A6C0
systhread=0AB7C418
systhread=0B962310
systhread=0BAD7A30
systhread=09CBDDC8
systhread=A16FA328
systhread=A16F9AA8
systhread=A89866B8
systhread=9FE92F08
systhread=9FB86B70
systhread=9FEFE8C8
systhread=9FD21E68
systhread=0BE97830
systhread=0AA02318
systhread=0BAAE598
systhread=0BFA4750
systhread=A16F9248
systhread=A16F7908
systhread=9FB7A390
systhread=9F904230
systhread=0C383378
systhread=0AA29308
systhread=0AB6D470
systhread=0A36E1C0
systhread=0BEFB240
systhread=0AA20840
systhread=A231B858
systhread=A2310420
systhread=9FDDD180
systhread=9FE57368
systhread=9FB53708
systhread=9FE5DBF8
systhread=9FE08B28
systhread=0AEF4708
systhread=0BA66258
systhread=0AC003C0
systhread=0A6CDD00
systhread=0AE01D18
systhread=0ADBBF70
systhread=0AD28F78
systhread=0A6829E8
systhread=0A50D590
systhread=AA6031D8
systhread=0A222050
systhread=0A223240
systhread=09CD37D8
systhread=A9A148B8
systhread=AB8093C8
systhread=AB808E10
systhread=A9DE1B98
systhread=AABE65D8
systhread=A9D29060
systhread=A9D28AA8
systhread=AABBBD28
systhread=AABBB770
systhread=AABB95E0
systhread=09C7A8A8
systhread=AADF6310
systhread=AAD7EF50
systhread=AAD7E998
systhread=AAD7E3E0
systhread=AAD7C080
systhread=AAD7BAC8
systhread=09C6A260
systhread=0868DE28
systhread=0868C818
systhread=AB841F60
systhread=09A886B8
systhread=09A87148
systhread=090BDA48
systhread=B129D488
systhread=AE737AD8
systhread=0A0C03F0
systhread=0A0BD3B0
systhread=0A077878
systhread=AECDF240
systhread=B12D4518
systhread=0A0A35E0
systhread=0A097140
systhread=0A075548
systhread=0A01E950
systhread=09FCAB10
systhread=09FCA558
systhread=09E7F568
systhread=09DCD460
systhread=09DF6398
systhread=09DEBDE8
systhread=09DAF558
systhread=09D9FC90
systhread=09D9ABA0
systhread=B12DA7B0
systhread=08FA09C0
systhread=0925F140
systhread=08F9E080
systhread=08332550
systhread=081EF880
systhread=08188AB0
systhread=081873F0
systhread=08185D30
systhread=08184670
systhread=08182FB0
systhread=08181970
systhread=08180400
systhread=0817DA00
systhread=0817B270
systhread=08178C60
NULL
========================================
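The membership check behind this comparison can be sketched in C. This is only an illustration, not the JVM's code: the node type is reduced to a hypothetical struct carrying just the "next" link, and list_contains and max_nodes are names made up for the sketch. The candidate address is only compared, never dereferenced, so an invalid value such as the tid above is safe to pass in.

```c
#include <stddef.h>

/* Hypothetical, reduced node type: only the "next" link is modelled. */
typedef struct sys_thread {
    struct sys_thread *next;
} sys_thread_t;

/*
 * Return 1 if target is reachable from head within max_nodes hops,
 * 0 otherwise. The bound guards against a corrupted or cyclic list.
 * target is compared by address only and is never dereferenced.
 */
int list_contains(const sys_thread_t *head,
                  const sys_thread_t *target,
                  size_t max_nodes)
{
    const sys_thread_t *node = head;
    size_t i;

    for (i = 0; i < max_nodes && node != NULL; i++) {
        if (node == target)
            return 1;
        node = node->next;
    }
    return 0;
}
```

On a core file the same walk would normally be done from the debugger rather than in-process; the function only makes the logic of the comparison explicit.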
1)The invalid tid value of 0x31284347 is not in the above list.
2)This value was retrieved when iterating through the above list at runtime.
3)The list itself is guarded by locks in the JVM code.
4)The list nodes are not overlaid.
5)Also from frame 4 :
#4 0xb71f6dbb in sysThreadSingle ()
at /userlvl/cxia32142ifx/src/hpi/pfm/threads_md.c:2542
i locals
i = 49
ret = 0
tid = (sys_thread_t *) 0x31284347
self = (sys_thread_t *) 0xa0fb1bc0
"i" represents the loop induction variable in the while loop in the pseudocode
above. It suggests that the last tid that was processed correctly is:
systhread=9FBC1400
so 0B24E6F8->next should hold the incorrect value "0x31284347". But when we
check the next field of a node in the list (here self, 0xa0fb1bc0):
(gdb) p *(sys_thread_t *) 0xa0fb1bc0
$2 = {ref_count = 0, pid = 7764, sys_tid = 2727795632, next = 0xa0f7e358,
state = {value = RUNNABLE, data = 0}, interrupted = 0, single_threaded = FALSE,
is_system_thread = TRUE, seen_to_die = FALSE, ps_count = 0
It correctly points to the next item in the list namely "0xa0f7e358"
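To cross-check which node a given value of the loop counter corresponds to, the list can be walked a fixed number of hops from the head. The following C sketch assumes the counting used in the pseudocode (i = 0 while tid is the head, both advanced once per iteration); the reduced node type and the name node_at are hypothetical and not taken from the JVM source.

```c
#include <stddef.h>

/* Hypothetical, reduced node type: only the "next" link is modelled. */
typedef struct sys_thread {
    struct sys_thread *next;
} sys_thread_t;

/*
 * Follow "next" index times from head and return the node reached,
 * or NULL if the list ends first. With index == i this is the node
 * the loop examines on iteration i; with index == i - 1 it is the
 * node whose "next" field supplied that value.
 */
const sys_thread_t *node_at(const sys_thread_t *head, size_t index)
{
    const sys_thread_t *node = head;

    while (index > 0 && node != NULL) {
        node = node->next;
        index--;
    }
    return node;
}
```

Under that assumption, comparing node_at(head, i) with the tid seen in the frame, and inspecting the "next" field of node_at(head, i - 1), pins down which link would have had to produce the unexpected value.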
Considering the above 5 points, there seems to be no reason for the invalid
memory to "turn up" at runtime, which points to a "low level memory management
issue".
Steps to reproduce:
There is no standalone testcase that reproduces this problem. It occurs only on
a single server, which is a production server, and cannot be reproduced in the
test environment.
The footprints available for analysis are the core file and the WebSphere and
application logs.
Other Information
=================
For access to the footprints, such as the core file, logs and thread dump,
please get in touch with me at lvenkata at in.ibm.com.