[Openais] [Pacemaker] Linux HA on debian sparc

william felipe_welter wfelipew at gmail.com
Thu Jun 2 20:16:55 PDT 2011


Well,

Now, with this patch, the pacemakerd process starts and brings up its other
processes (crmd, lrmd, pengine, ...), but after pacemakerd forks, the forked
pacemakerd process dies with "signal 10, Bus error". In the log, the
Pacemaker processes (crmd, lrmd, pengine, ...) cannot connect to the openais
plugin (possibly because of the death of the pacemakerd process).
This time, though, the forked pacemakerd generates a core dump when it dies.
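
From what I have read, "signal 10, Bus error" (SIGBUS) on sparc usually
indicates an unaligned memory access: sparc traps loads and stores that are
not naturally aligned, while x86 silently tolerates them. A minimal sketch
(unrelated to the pacemaker code) that reproduces the signal on sparc:

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
        char buf[8] = {0};
        /* buf + 1 is not 4-byte aligned: this load raises SIGBUS
           ("Bus error") on sparc, but works on x86 */
        uint32_t *p = (uint32_t *)(buf + 1);
        printf("%u\n", *p);
        return 0;
    }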

gdb  -c "/usr/var/lib/heartbeat/cores/root/ pacemakerd 7986"  -se
/usr/sbin/pacemakerd :
GNU gdb (GDB) 7.0.1-debian
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "sparc-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/pacemakerd...done.
Reading symbols from /usr/lib64/libuuid.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib64/libuuid.so.1
Reading symbols from /usr/lib/libcoroipcc.so.4...done.
Loaded symbols for /usr/lib/libcoroipcc.so.4
Reading symbols from /usr/lib/libcpg.so.4...done.
Loaded symbols for /usr/lib/libcpg.so.4
Reading symbols from /usr/lib/libquorum.so.4...done.
Loaded symbols for /usr/lib/libquorum.so.4
Reading symbols from /usr/lib64/libcrmcommon.so.2...done.
Loaded symbols for /usr/lib64/libcrmcommon.so.2
Reading symbols from /usr/lib/libcfg.so.4...done.
Loaded symbols for /usr/lib/libcfg.so.4
Reading symbols from /usr/lib/libconfdb.so.4...done.
Loaded symbols for /usr/lib/libconfdb.so.4
Reading symbols from /usr/lib64/libplumb.so.2...done.
Loaded symbols for /usr/lib64/libplumb.so.2
Reading symbols from /usr/lib64/libpils.so.2...done.
Loaded symbols for /usr/lib64/libpils.so.2
Reading symbols from /lib/libbz2.so.1.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libbz2.so.1.0
Reading symbols from /usr/lib/libxslt.so.1...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libxslt.so.1
Reading symbols from /usr/lib/libxml2.so.2...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libxml2.so.2
Reading symbols from /lib/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/librt.so.1
Reading symbols from /lib/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libglib-2.0.so.0...(no debugging symbols
found)...done.
Loaded symbols for /lib/libglib-2.0.so.0
Reading symbols from /usr/lib/libltdl.so.7...(no debugging symbols
found)...done.
Loaded symbols for /usr/lib/libltdl.so.7
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/libpthread.so.0...(no debugging symbols found)...done.
Loaded symbols for /lib/libpthread.so.0
Reading symbols from /lib/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/libpcre.so.3...(no debugging symbols found)...done.
Loaded symbols for /lib/libpcre.so.3
Reading symbols from /lib/libnss_compat.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/libnss_compat.so.2
Reading symbols from /lib/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libnsl.so.1
Reading symbols from /lib/libnss_nis.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/libnss_nis.so.2
Reading symbols from /lib/libnss_files.so.2...(no debugging symbols
found)...done.
Loaded symbols for /lib/libnss_files.so.2
Core was generated by `pacemakerd'.
Program terminated with signal 10, Bus error.
#0  cpg_dispatch (handle=17861288972693536769, dispatch_types=7986) at cpg.c:339
339			switch (dispatch_data->id) {
(gdb) bt
#0  cpg_dispatch (handle=17861288972693536769, dispatch_types=7986) at cpg.c:339
#1  0xf6f100f0 in ?? ()
#2  0xf6f100f4 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
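
Note that the printed arguments look suspect (dispatch_types=7986 is the
pacemakerd pid), which fits the "corrupt stack?" warning. If the innermost
frame is still trustworthy, one way to check for an alignment problem
directly from the core is (a sketch; whether the local is visible depends
on the debug info and optimization level):

    (gdb) frame 0
    (gdb) print dispatch_data
    (gdb) print (unsigned long) dispatch_data % 8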



I took a look at cpg.c and saw that dispatch_data is acquired by the
coroipcc_dispatch_get() function (defined in lib/coroipcc.c):

       do {
                error = coroipcc_dispatch_get (
                        cpg_inst->handle,
                        (void **)&dispatch_data,
                        timeout);
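
Since coroipcc_dispatch_get() hands back a pointer into the mmap'ed IPC
buffer, my suspicion is that dispatch_data (or the buffer it points into)
does not have the alignment sparc requires, so the switch on
dispatch_data->id faults. A hypothetical helper (not corosync code) that
could be dropped in for debugging:

    /* Hypothetical debug helper, not part of corosync: check that a
       pointer meets a given alignment before it is dereferenced. */
    #include <stdint.h>
    #include <stdio.h>

    static int is_aligned (const void *p, size_t align)
    {
            return ((uintptr_t) p % align) == 0;
    }

    /* usage sketch inside cpg_dispatch(), before the switch:
     *
     *   if (!is_aligned (dispatch_data, 8))
     *           fprintf (stderr, "misaligned dispatch_data: %p\n",
     *                    (void *) dispatch_data);
     */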




Summarized log:
...
Jun 02 23:12:20 corosync [CPG   ] got mcast request on 0x62500
Jun 02 23:12:20 corosync [TOTEM ] mcasted message added to pending queue
Jun 02 23:12:20 corosync [TOTEM ] Delivering f to 10
Jun 02 23:12:20 corosync [TOTEM ] Delivering MCAST message with seq 10
to pending delivery queue
Jun 02 23:12:20 corosync [TOTEM ] releasing messages up to and including f
Jun 02 23:12:20 corosync [TOTEM ] releasing messages up to and including 10
Jun 02 23:12:20 xxxxxxxxxx pacemakerd: [7986]: info: start_child:
Forked child 7991 for process lrmd
Jun 02 23:12:20 xxxxxxxxxx pacemakerd: [7986]: info:
update_node_processes: Node xxxxxxxxxx now has process list:
00000000000000000000000000100112 (was
00000000000000000000000000100102)
Jun 02 23:12:20 corosync [CPG   ] got mcast request on 0x62500
Jun 02 23:12:20 corosync [TOTEM ] mcasted message added to pending queue
Jun 02 23:12:20 corosync [TOTEM ] Delivering 10 to 11
Jun 02 23:12:20 corosync [TOTEM ] Delivering MCAST message with seq 11
to pending delivery queue
Jun 02 23:12:20 corosync [TOTEM ] releasing messages up to and including 11
Jun 02 23:12:20 xxxxxxxxxx pacemakerd: [7986]: info: start_child:
Forked child 7992 for process attrd
Jun 02 23:12:20 xxxxxxxxxx pacemakerd: [7986]: info:
update_node_processes: Node xxxxxxxxxx now has process list:
00000000000000000000000000101112 (was
00000000000000000000000000100112)
Jun 02 23:12:20 corosync [CPG   ] got mcast request on 0x62500
Jun 02 23:12:20 corosync [TOTEM ] mcasted message added to pending queue
Jun 02 23:12:20 corosync [TOTEM ] Delivering 11 to 12
Jun 02 23:12:20 corosync [TOTEM ] Delivering MCAST message with seq 12
to pending delivery queue
Jun 02 23:12:20 corosync [TOTEM ] releasing messages up to and including 12
Jun 02 23:12:20 xxxxxxxxxx pacemakerd: [7986]: info: start_child:
Forked child 7993 for process pengine
Jun 02 23:12:20 xxxxxxxxxx pacemakerd: [7986]: info:
update_node_processes: Node xxxxxxxxxx now has process list:
00000000000000000000000000111112 (was
00000000000000000000000000101112)
Jun 02 23:12:20 corosync [CPG   ] got mcast request on 0x62500
Jun 02 23:12:20 corosync [TOTEM ] mcasted message added to pending queue
Jun 02 23:12:20 corosync [TOTEM ] Delivering 12 to 13
Jun 02 23:12:20 corosync [TOTEM ] Delivering MCAST message with seq 13
to pending delivery queue
Jun 02 23:12:20 corosync [TOTEM ] releasing messages up to and including 13
Jun 02 23:12:20 xxxxxxxxxx pacemakerd: [7986]: info: start_child:
Forked child 7994 for process crmd
Jun 02 23:12:20 xxxxxxxxxx pacemakerd: [7986]: info:
update_node_processes: Node xxxxxxxxxx now has process list:
00000000000000000000000000111312 (was
00000000000000000000000000111112)
Jun 02 23:12:20 corosync [CPG   ] got mcast request on 0x62500
Jun 02 23:12:20 xxxxxxxxxx pacemakerd: [7986]: info: main: Starting mainloop
Jun 02 23:12:20 corosync [TOTEM ] mcasted message added to pending queue
Jun 02 23:12:20 corosync [TOTEM ] Delivering 13 to 14
Jun 02 23:12:20 corosync [TOTEM ] Delivering MCAST message with seq 14
to pending delivery queue
Jun 02 23:12:20 corosync [TOTEM ] releasing messages up to and including 14
Jun 02 23:12:20 corosync [CPG   ] got mcast request on 0x62500
Jun 02 23:12:20 corosync [TOTEM ] mcasted message added to pending queue
Jun 02 23:12:20 corosync [TOTEM ] Delivering 14 to 15
Jun 02 23:12:20 corosync [TOTEM ] Delivering MCAST message with seq 15
to pending delivery queue
Jun 02 23:12:20 corosync [TOTEM ] releasing messages up to and including 15
Jun 02 23:12:20 xxxxxxxxxx stonith-ng: [7989]: info: Invoked:
/usr/lib64/heartbeat/stonithd
Jun 02 23:12:20 xxxxxxxxxx stonith-ng: [7989]: info:
crm_log_init_worker: Changed active directory to
/usr/var/lib/heartbeat/cores/root
Jun 02 23:12:20 xxxxxxxxxx stonith-ng: [7989]: info: get_cluster_type:
Cluster type is: 'openais'.
Jun 02 23:12:20 xxxxxxxxxx stonith-ng: [7989]: info:
crm_cluster_connect: Connecting to cluster infrastructure: classic
openais (with plugin)
Jun 02 23:12:20 xxxxxxxxxx stonith-ng: [7989]: info:
init_ais_connection_classic: Creating connection to our Corosync
plugin
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: info: crm_log_init_worker:
Changed active directory to /usr/var/lib/heartbeat/cores/hacluster
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: info: retrieveCib: Reading
cluster configuration from: /usr/var/lib/heartbeat/crm/cib.xml
(digest: /usr/var/lib/heartbeat/crm/cib.xml.sig)
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: WARN: retrieveCib: Cluster
configuration not found: /usr/var/lib/heartbeat/crm/cib.xml
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: WARN: readCibXmlFile: Primary
configuration corrupt or unusable, trying backup...
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: get_last_sequence:
Series file /usr/var/lib/heartbeat/crm/cib.last does not exist
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile: Backup
file /usr/var/lib/heartbeat/crm/cib-99.raw not found
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: WARN: readCibXmlFile:
Continuing with an empty configuration.
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile[on-disk]
<cib epoch="0" num_updates="0" admin_epoch="0"
validate-with="pacemaker-1.2" >
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile[on-disk]
  <configuration >
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile[on-disk]
    <crm_config />
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile[on-disk]
    <nodes />
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile[on-disk]
    <resources />
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile[on-disk]
    <constraints />
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile[on-disk]
  </configuration>
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile[on-disk]
  <status />
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: readCibXmlFile[on-disk] </cib>
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: info: validate_with_relaxng:
Creating RNG parser context
Jun 02 23:12:20 xxxxxxxxxx stonith-ng: [7989]: info:
init_ais_connection_classic: Connection to our AIS plugin (9) failed:
Doesn't exist (12)
Jun 02 23:12:20 xxxxxxxxxx stonith-ng: [7989]: CRIT: main: Cannot sign
in to the cluster... terminating
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: info: Invoked:
/usr/lib64/heartbeat/crmd
Jun 02 23:12:20 xxxxxxxxxx pengine: [7993]: info: Invoked:
/usr/lib64/heartbeat/pengine
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: info: crm_log_init_worker:
Changed active directory to /usr/var/lib/heartbeat/cores/hacluster
Jun 02 23:12:20 xxxxxxxxxx pengine: [7993]: info: crm_log_init_worker:
Changed active directory to /usr/var/lib/heartbeat/cores/hacluster
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: info: main: CRM Hg Version:
e872eeb39a5f6e1fdb57c3108551a5353648c4f4

Jun 02 23:12:20 xxxxxxxxxx pengine: [7993]: debug: main: Checking for
old instances of pengine
Jun 02 23:12:20 xxxxxxxxxx pengine: [7993]: debug:
init_client_ipc_comms_nodispatch: Attempting to talk on:
/usr/var/run/crm/pengine
Jun 02 23:12:20 xxxxxxxxxx lrmd: [7991]: info: enabling coredumps
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: info: crmd_init: Starting crmd
Jun 02 23:12:20 xxxxxxxxxx pengine: [7993]: debug:
init_client_ipc_comms_nodispatch: Could not init comms on:
/usr/var/run/crm/pengine
Jun 02 23:12:20 xxxxxxxxxx lrmd: [7991]: debug: main: run the loop...
Jun 02 23:12:20 xxxxxxxxxx lrmd: [7991]: info: Started.
Jun 02 23:12:20 xxxxxxxxxx pengine: [7993]: debug: main: Init server comms
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: s_crmd_fsa: Processing
I_STARTUP: [ state=S_STARTING cause=C_STARTUP origin=crmd_init ]
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: do_fsa_action:
actions:trace: 	// A_LOG
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: do_fsa_action:
actions:trace: 	// A_STARTUP
Jun 02 23:12:20 xxxxxxxxxx pengine: [7993]: info: main: Starting pengine
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: do_startup:
Registering Signal Handlers
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: do_startup: Creating
CIB and LRM objects
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: do_fsa_action:
actions:trace: 	// A_CIB_START
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug:
init_client_ipc_comms_nodispatch: Attempting to talk on:
/usr/var/run/crm/cib_rw
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug:
init_client_ipc_comms_nodispatch: Could not init comms on:
/usr/var/run/crm/cib_rw
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: cib_native_signon_raw:
Connection to command channel failed
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug:
init_client_ipc_comms_nodispatch: Attempting to talk on:
/usr/var/run/crm/cib_callback
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug:
init_client_ipc_comms_nodispatch: Could not init comms on:
/usr/var/run/crm/cib_callback
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: cib_native_signon_raw:
Connection to callback channel failed
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: cib_native_signon_raw:
Connection to CIB failed: connection failed
Jun 02 23:12:20 xxxxxxxxxx crmd: [7994]: debug: cib_native_signoff:
Signing out of the CIB Service
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: debug: activateCibXml:
Triggering CIB write for start op
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: info: startCib: CIB
Initialization completed successfully
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: info: get_cluster_type:
Cluster type is: 'openais'.
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: info: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: info:
init_ais_connection_classic: Creating connection to our Corosync
plugin
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: info:
init_ais_connection_classic: Connection to our AIS plugin (9) failed:
Doesn't exist (12)
Jun 02 23:12:20 xxxxxxxxxx cib: [7990]: CRIT: cib_init: Cannot sign in
to the cluster... terminating
Jun 02 23:12:21 corosync [CPG   ] exit_fn for conn=0x62500
Jun 02 23:12:21 corosync [TOTEM ] mcasted message added to pending queue
Jun 02 23:12:21 corosync [TOTEM ] Delivering 15 to 16
Jun 02 23:12:21 corosync [TOTEM ] Delivering MCAST message with seq 16
to pending delivery queue
Jun 02 23:12:21 corosync [CPG   ] got procleave message from cluster
node 1377289226
Jun 02 23:12:21 corosync [TOTEM ] releasing messages up to and including 16
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: info: Invoked:
/usr/lib64/heartbeat/attrd
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: info: crm_log_init_worker:
Changed active directory to /usr/var/lib/heartbeat/cores/hacluster
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: info: main: Starting up
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: info: get_cluster_type:
Cluster type is: 'openais'.
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: info: crm_cluster_connect:
Connecting to cluster infrastructure: classic openais (with plugin)
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: info:
init_ais_connection_classic: Creating connection to our Corosync
plugin
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: info:
init_ais_connection_classic: Connection to our AIS plugin (9) failed:
Doesn't exist (12)
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: ERROR: main: HA Signon failed
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: info: main: Cluster connection active
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: info: main: Accepting
attribute updates
Jun 02 23:12:21 xxxxxxxxxx attrd: [7992]: ERROR: main: Aborting startup
Jun 02 23:12:21 xxxxxxxxxx crmd: [7994]: debug:
init_client_ipc_comms_nodispatch: Attempting to talk on:
/usr/var/run/crm/cib_rw
Jun 02 23:12:21 xxxxxxxxxx crmd: [7994]: debug:
init_client_ipc_comms_nodispatch: Could not init comms on:
/usr/var/run/crm/cib_rw
Jun 02 23:12:21 xxxxxxxxxx crmd: [7994]: debug: cib_native_signon_raw:
Connection to command channel failed
Jun 02 23:12:21 xxxxxxxxxx crmd: [7994]: debug:
init_client_ipc_comms_nodispatch: Attempting to talk on:
/usr/var/run/crm/cib_callback
...


2011/6/2 Steven Dake <sdake at redhat.com>:
> On 06/01/2011 11:05 PM, william felipe_welter wrote:
>> I recompiled my kernel without hugetlb, and the result is the same:
>>
>> My test program still prints:
>> PATH=/dev/shm/teste123XXXXXX
>> page size=20000
>> fd=3
>> ADDR_ORIG:0xe000a000  ADDR:0xffffffff
>> Erro
>>
>> And Pacemaker still fails because of the mmap error:
>> Could not initialize Cluster Configuration Database API instance error 2
>>
>
> Give the patch I posted recently a spin - corosync WFM with this patch
> on sparc64 with hugetlb set.  Please report back results.
>
> Regards
> -steve
>
>> To make sure that I have disabled hugetlb, here is my /proc/meminfo:
>> MemTotal:       33093488 kB
>> MemFree:        32855616 kB
>> Buffers:            5600 kB
>> Cached:            53480 kB
>> SwapCached:            0 kB
>> Active:            45768 kB
>> Inactive:          28104 kB
>> Active(anon):      18024 kB
>> Inactive(anon):     1560 kB
>> Active(file):      27744 kB
>> Inactive(file):    26544 kB
>> Unevictable:           0 kB
>> Mlocked:               0 kB
>> SwapTotal:       6104680 kB
>> SwapFree:        6104680 kB
>> Dirty:                 0 kB
>> Writeback:             0 kB
>> AnonPages:         14936 kB
>> Mapped:             7736 kB
>> Shmem:              4624 kB
>> Slab:              39184 kB
>> SReclaimable:      10088 kB
>> SUnreclaim:        29096 kB
>> KernelStack:        7088 kB
>> PageTables:         1160 kB
>> Quicklists:        17664 kB
>> NFS_Unstable:          0 kB
>> Bounce:                0 kB
>> WritebackTmp:          0 kB
>> CommitLimit:    22651424 kB
>> Committed_AS:     519368 kB
>> VmallocTotal:   1069547520 kB
>> VmallocUsed:       11064 kB
>> VmallocChunk:   1069529616 kB
>>
>>
>> 2011/6/1 Steven Dake <sdake at redhat.com>:
>>> On 06/01/2011 07:42 AM, william felipe_welter wrote:
>>>> Steven,
>>>>
>>>> cat /proc/meminfo
>>>> ...
>>>> HugePages_Total:       0
>>>> HugePages_Free:        0
>>>> HugePages_Rsvd:        0
>>>> HugePages_Surp:        0
>>>> Hugepagesize:       4096 kB
>>>> ...
>>>>
>>>
>>> It definitely requires a kernel compile and setting the config option to
>>> off.  I don't know the debian way of doing this.
>>>
>>> The only reason you may need this option is if you have very large
>>> memory sizes, such as 48GB or more.
>>>
>>> Regards
>>> -steve
>>>
>>>> It's 4 MB.
>>>>
>>>> How can I disable hugetlb? (By passing CONFIG_HUGETLBFS=n to the
>>>> kernel at boot?)
>>>>
>>>> 2011/6/1 Steven Dake <sdake at redhat.com <mailto:sdake at redhat.com>>
>>>>
>>>>     On 06/01/2011 01:05 AM, Steven Dake wrote:
>>>>     > On 05/31/2011 09:44 PM, Angus Salkeld wrote:
>>>>     >> On Tue, May 31, 2011 at 11:52:48PM -0300, william felipe_welter
>>>>     wrote:
>>>>     >>> Angus,
>>>>     >>>
>>>>     >>> I made a test program (based on the code in coroipcc.c), and I am
>>>>     >>> now sure that there are problems with the mmap system call on
>>>>     >>> sparc.
>>>>     >>>
>>>>     >>> Source code of my test program:
>>>>     >>>
>>>>     >>> #include <stdlib.h>
>>>>     >>> #include <stdint.h>   /* for int32_t */
>>>>     >>> #include <sys/mman.h>
>>>>     >>> #include <stdio.h>
>>>>     >>>
>>>>     >>> #define PATH_MAX  36
>>>>     >>>
>>>>     >>> int main()
>>>>     >>> {
>>>>     >>>
>>>>     >>> int32_t fd;
>>>>     >>> void *addr_orig;
>>>>     >>> void *addr;
>>>>     >>> char path[PATH_MAX];
>>>>     >>> const char *file = "teste123XXXXXX";
>>>>     >>> size_t bytes=10024;
>>>>     >>>
>>>>     >>> snprintf (path, PATH_MAX, "/dev/shm/%s", file);
>>>>     >>> printf("PATH=%s\n",path);
>>>>     >>>
>>>>     >>> fd = mkstemp (path);
>>>>     >>> printf("fd=%d \n",fd);
>>>>     >>>
>>>>     >>>
>>>>     >>> addr_orig = mmap (NULL, bytes, PROT_NONE,
>>>>     >>>               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>     >>>
>>>>     >>>
>>>>     >>> addr = mmap (addr_orig, bytes, PROT_READ | PROT_WRITE,
>>>>     >>>               MAP_FIXED | MAP_SHARED, fd, 0);
>>>>     >>>
>>>>     >>> printf("ADDR_ORIG:%p  ADDR:%p\n",addr_orig,addr);
>>>>     >>>
>>>>     >>>
>>>>     >>>   if (addr != addr_orig) {
>>>>     >>>                printf("Erro");
>>>>     >>>         }
>>>>     >>> }
>>>>     >>>
>>>>     >>> Results on x86:
>>>>     >>> PATH=/dev/shm/teste123XXXXXX
>>>>     >>> fd=3
>>>>     >>> ADDR_ORIG:0x7f867d8e6000  ADDR:0x7f867d8e6000
>>>>     >>>
>>>>     >>> Results on sparc:
>>>>     >>> PATH=/dev/shm/teste123XXXXXX
>>>>     >>> fd=3
>>>>     >>> ADDR_ORIG:0xf7f72000  ADDR:0xffffffff
>>>>     >>
>>>>     >> Note: 0xffffffff == MAP_FAILED
>>>>     >>
>>>>     >> (from man mmap)
>>>>     >> RETURN VALUE
>>>>     >>        On success, mmap() returns a pointer to the mapped area.  On
>>>>     >>        error, the value MAP_FAILED (that is, (void *) -1) is
>>>>     returned,
>>>>     >>        and errno is  set appropriately.
>>>>     >>
>>>>     >>>
>>>>     >>>
>>>>     >>> But I'm wondering: is it really necessary to call mmap twice?
>>>>     >>> What is the reason for calling mmap two times, the second time
>>>>     >>> using the address from the first?
>>>>     >>>
>>>>     >>>
>>>>     >> Well, there are 3 calls to mmap():
>>>>     >> 1) one to allocate 2 * what you need (in pages)
>>>>     >> 2) one that maps the first half of the memory to a real file
>>>>     >> 3) one that maps the second half of the memory to the same file
>>>>     >>
>>>>     >> The point is that when you write to an address past the end of the
>>>>     >> first half of memory, it is taken care of by the third mmap, which
>>>>     >> maps the address back to the top of the file for you. This means you
>>>>     >> don't have to worry about ring-buffer wrapping, which can be a
>>>>     >> headache.
>>>>     >>
>>>>     >> -Angus
>>>>     >>
>>>>     >
>>>>     > Interesting that this mmap operation doesn't work on sparc linux.
>>>>     >
>>>>     > Not sure how I can help here - the next step would be a follow-up
>>>>     > with the sparc linux mailing list. I'll do that and cc you on the
>>>>     > message - see if we get any response.
>>>>     >
>>>>     > http://vger.kernel.org/vger-lists.html
>>>>     >
>>>>     >>>
>>>>     >>>
>>>>     >>>
>>>>     >>>
>>>>     >>> 2011/5/31 Angus Salkeld <asalkeld at redhat.com
>>>>     <mailto:asalkeld at redhat.com>>
>>>>     >>>
>>>>     >>>> On Tue, May 31, 2011 at 06:25:56PM -0300, william felipe_welter
>>>>     wrote:
>>>>     >>>>> Thanks Steven,
>>>>     >>>>>
>>>>     >>>>> Now I'm trying to run on the MCP:
>>>>     >>>>> - Uninstall the pacemaker 1.0
>>>>     >>>>> - Compile and install 1.1
>>>>     >>>>>
>>>>     >>>>> But now I have problems initializing pacemakerd: "Could not
>>>>     >>>>> initialize Cluster Configuration Database API instance error 2".
>>>>     >>>>> Debugging with gdb, I see that the error is in the confdb; more
>>>>     >>>>> specifically, the error starts in coroipcc.c at this line:
>>>>     >>>>>
>>>>     >>>>>
>>>>     >>>>> 448        if (addr != addr_orig) {
>>>>     >>>>> 449                goto error_close_unlink;  <- enter here
>>>>     >>>>> 450       }
>>>>     >>>>>
>>>>     >>>>> Any idea what could cause this?
>>>>     >>>>>
>>>>     >>>>
>>>>     >>>> I tried porting a ring buffer (www.libqb.org
>>>>     >>>> <http://www.libqb.org>) to sparc and had the same failure.
>>>>     >>>> There are 3 mmap() calls and on sparc the third one keeps failing.
>>>>     >>>>
>>>>     >>>> This is a common way of creating a ring buffer, see:
>>>>     >>>>
>>>>     http://en.wikipedia.org/wiki/Circular_buffer#Exemplary_POSIX_Implementation
>>>>     >>>>
>>>>     >>>> I couldn't get it working in the short time I tried. It's probably
>>>>     >>>> worth looking at the clib implementation to see why it's failing
>>>>     >>>> (I didn't get to that).
>>>>     >>>>
>>>>     >>>> -Angus
>>>>     >>>>
>>>>
>>>>     Note, we sorted this out we believe.  Your kernel has hugetlb enabled,
>>>>     probably with 4MB pages.  This requires corosync to allocate 4MB pages.
>>>>
>>>>     Can you verify your hugetlb settings?
>>>>
>>>>     If you can turn this option off, you should have at least a working
>>>>     corosync.
>>>>
>>>>     Regards
>>>>     -steve
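For reference, the double-mapping trick Angus describes in the quoted thread
above looks roughly like this; a minimal sketch following the POSIX
circular-buffer pattern he linked (not corosync's actual code), assuming a
page-sized buffer and a file extended to that size with ftruncate():

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main (void)
    {
            size_t bytes = (size_t) sysconf (_SC_PAGESIZE);
            char path[] = "/dev/shm/ringXXXXXX";
            int fd = mkstemp (path);

            if (fd < 0 || ftruncate (fd, bytes) < 0) {
                    perror ("setup");
                    return 1;
            }

            /* 1) reserve 2*bytes of contiguous address space */
            char *base = mmap (NULL, bytes * 2, PROT_NONE,
                               MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);

            /* 2) map the file over the first half */
            char *lo = mmap (base, bytes, PROT_READ | PROT_WRITE,
                             MAP_FIXED | MAP_SHARED, fd, 0);

            /* 3) map the same file over the second half, so an access past
                  the end of the first half wraps to the start of the file */
            char *hi = mmap (base + bytes, bytes, PROT_READ | PROT_WRITE,
                             MAP_FIXED | MAP_SHARED, fd, 0);

            if (base == MAP_FAILED || lo != base || hi != base + bytes) {
                    /* one of the MAP_FIXED maps is what fails on sparc */
                    perror ("mmap");
                    return 1;
            }

            base[0] = 'A';
            printf ("wrapped read: %c\n", base[bytes]);   /* same byte */

            unlink (path);
            return 0;
    }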



-- 
William Felipe Welter
------------------------------
Consultor em Tecnologias Livres
william.welter at 4linux.com.br
www.4linux.com.br


More information about the Openais mailing list