[Openais] [call for porters to test corosync]

Steven Dake sdake at redhat.com
Fri Jun 19 07:58:31 PDT 2009


I merged your patch.  Without it, unless that variable happened to start
out as NULL (which it appears to on my system), all kinds of bad things
would happen: getpeerucred writes data to whatever that variable points
to, and ucred_free then frees random memory.
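
The failure mode above follows from the getpeerucred() contract on
Solaris: when the ucred_t pointer passed in is NULL the call allocates a
new structure, otherwise it reuses whatever the pointer refers to.  A
minimal sketch of the safe pattern is below.  The helper name peer_ids()
and the ENOSYS fallback for non-Solaris builds are assumptions made for
illustration; this is not the corosync code itself.

```c
#include <errno.h>
#include <sys/types.h>
#if defined(__sun)
#include <ucred.h>
#endif

/*
 * peer_ids() is a hypothetical helper, not a corosync function.
 * Returns 0 and fills *euid and *egid on success, -1 on failure.
 */
int peer_ids(int fd, uid_t *euid, gid_t *egid)
{
#if defined(__sun)
	ucred_t *uc = NULL;	/* NULL tells getpeerucred() to allocate */

	if (getpeerucred(fd, &uc) != 0) {
		return -1;	/* nothing was allocated, nothing to free */
	}
	*euid = ucred_geteuid(uc);
	*egid = ucred_getegid(uc);
	ucred_free(uc);		/* safe: uc came from getpeerucred() */
	return 0;
#else
	/* Assumption: other platforms are out of scope for this sketch. */
	(void)fd;
	(void)euid;
	(void)egid;
	errno = ENOSYS;
	return -1;
#endif
}
```

With the pointer initialized, a failed call leaves nothing to free, and a
successful call hands back memory that ucred_free() is allowed to release.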

If we can sort out what goes wrong after having this patch applied, then
we should be ready to go.

After applying the patch, the code works fine for me.  I don't see any
other issues.

Thanks!
-steve

On Fri, 2009-06-19 at 14:06 +0200, Wojtek Meler wrote:
> Steven Dake wrote:
> > Wojtek,
> >
> > I've done a bit more work on portability.  I ran your install test case
> > and found the directories required by corosync are not present in
> > default csw.  Corosync doesn't warn the user and proceeds without
> > actually being functional.
> >
> > Note the requirement for the ais user is now gone.  If you want to run
> > clients as the ais user (or as yourself), you can add support for that
> > user by creating a uidgid.d directory in your etc/corosync directory.
> > In that dir, place a file called
> >
> > "ais" with contents
> > uidgid {
> > 	uid: ais
> > 	gid: ais
> > }
> >
> > Finally, if you su to root from a normal user, this apparently doesn't
> > promote your privileges enough for the getpeerucred() syscall.  Instead
> > you must do su - root
> >
> > I ran 1 million cpgverifys on a single node and that appears to work.
> > cpgbench also works, although alarm() has the odd behavior of resetting
> > the alarm signal handler to the default.
> >
> > I'd be interested in your cpgbench numbers and logsysbench numbers.
> >
> > All code is currently in trunk.
> >   
> I'm also interested in benchmarks ;), but I've fetched and recompiled
> trunk and still have problems running corosync in a Solaris zone:
> 
> -bash-3.00# corosync -fp
> Jun 19 13:30:03 corosync [MAIN  ] Corosync Executive Service RELEASE 'trunk'
> Jun 19 13:30:03 corosync [MAIN  ] Copyright (C) 2002-2006 MontaVista 
> Software, Inc and contributors.
> turn off loopback: Invalid argument
> Jun 19 13:30:03 corosync [MAIN  ] Copyright (C) 2006-2008 Red Hat, Inc.
> Jun 19 13:30:03 corosync [MAIN  ] Corosync Executive Service: started 
> and ready to provide service.
> Jun 19 13:30:03 corosync [MAIN  ] Successfully read main configuration 
> file '/opt/csw/etc/corosync.conf'.
> Jun 19 13:30:03 corosync [TOTEM ] Token Timeout (1000 ms) retransmit 
> timeout (238 ms)
> Jun 19 13:30:03 corosync [TOTEM ] token hold (180 ms) retransmits before 
> loss (4 retrans)
> Jun 19 13:30:03 corosync [TOTEM ] join (50 ms) send_join (0 ms) 
> consensus (800 ms) merge (200 ms)
> Jun 19 13:30:03 corosync [TOTEM ] downcheck (1000 ms) fail to recv const 
> (50 msgs)
> Jun 19 13:30:03 corosync [TOTEM ] seqno unchanged const (30 rotations) 
> Maximum network MTU 1500
> Jun 19 13:30:03 corosync [TOTEM ] window size per rotation (50 messages) 
> maximum messages per rotation (17 messages)
> Jun 19 13:30:03 corosync [TOTEM ] send threads (0 threads)
> Jun 19 13:30:03 corosync [TOTEM ] RRP token expired timeout (238 ms)
> Jun 19 13:30:03 corosync [TOTEM ] RRP token problem counter (2000 ms)
> Jun 19 13:30:03 corosync [TOTEM ] RRP threshold (10 problem count)
> Jun 19 13:30:03 corosync [TOTEM ] RRP mode set to none.
> Jun 19 13:30:03 corosync [TOTEM ] heartbeat_failures_allowed (0)
> Jun 19 13:30:03 corosync [TOTEM ] max_network_delay (50 ms)
> Jun 19 13:30:03 corosync [TOTEM ] HeartBeat is Disabled. To enable set 
> heartbeat_failures_allowed > 0
> Jun 19 13:30:03 corosync [TOTEM ] Initializing transmit/receive 
> security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Jun 19 13:30:03 corosync [TOTEM ] Receive multicast socket recv buffer 
> size (144000 bytes).
> Jun 19 13:30:03 corosync [TOTEM ] Transmit multicast socket send buffer 
> size (144000 bytes).
> Jun 19 13:30:03 corosync [TOTEM ] The network interface [10.0.8.75] is 
> now up.
> Jun 19 13:30:03 corosync [TOTEM ] Created or loaded sequence id 
> 20.10.0.8.75 for this ring.
> Jun 19 13:30:03 corosync [TOTEM ] entering GATHER state from 15.
> Jun 19 13:30:03 corosync [SERV  ] Service initialized 'corosync extended 
> virtual synchrony service'
> Jun 19 13:30:03 corosync [SERV  ] Service initialized 'corosync 
> configuration service'
> Jun 19 13:30:03 corosync [SERV  ] Service initialized 'corosync cluster 
> closed process group service v1.01'
> Jun 19 13:30:03 corosync [SERV  ] Service initialized 'corosync cluster 
> config database access v1.01'
> Jun 19 13:30:03 corosync [SERV  ] Service initialized 'corosync profile 
> loading service'
> Jun 19 13:30:03 corosync [TOTEM ] Creating commit token because I am the 
> rep.
> Jun 19 13:30:03 corosync [TOTEM ] Saving state aru 0 high seq received 0
> Jun 19 13:30:03 corosync [TOTEM ] Storing new sequence id for ring 18
> Jun 19 13:30:03 corosync [TOTEM ] entering COMMIT state.
> Jun 19 13:30:03 corosync [TOTEM ] entering RECOVERY state.
> Jun 19 13:30:03 corosync [TOTEM ] position [0] member 10.0.8.75:
> Jun 19 13:30:03 corosync [TOTEM ] previous ring seq 20 rep 10.0.8.75
> Jun 19 13:30:03 corosync [TOTEM ] aru 0 high delivered 0 received flag 1
> Jun 19 13:30:03 corosync [TOTEM ] Did not need to originate any messages 
> in recovery.
> Jun 19 13:30:03 corosync [TOTEM ] Sending initial ORF token
> Jun 19 13:30:03 corosync [TOTEM ] entering OPERATIONAL state.
> 
> That looks OK, but after running this on another console:
> -bash-3.00# ./cpgverify
> Couldn't initialize CPG service 6
> 
> corosync wrote on the console:
> Jun 19 13:30:06 corosync [IPC   ] Invalid IPC credentials.
> 
> A second run of cpgverify hung and the corosync process dumped core:
> 
> #0  0xfef74aa7 in _lwp_kill () from /lib/libc.so.1
> #1  0xfef72250 in thr_kill () from /lib/libc.so.1
> #2  0xfef21217 in raise () from /lib/libc.so.1
> #3  0x0805366b in sigsegv_handler (num=11) at main.c:169
> #4  0xfef73e6f in __sighndlr () from /lib/libc.so.1
> #5  0xfef6a30e in call_user_handler () from /lib/libc.so.1
> #6  <signal handler called>
> #7  0xfeeb3848 in _logsys_log_vprintf (rec_ident=4294967295, 
> function_name=0xfeae20e6 "cpg_lib_init_fn", file_name=0xfeae1fe9 "cpg.c",
>     file_line=901, format=0x7 <Address 0x7 out of bounds>, ap=0x8047474 
> "") at logsys.c:1290
> #8  0xfeeb395a in _logsys_log_printf (rec_ident=4294967295, 
> function_name=0xfeae20e6 "cpg_lib_init_fn", file_name=0xfeae1fe9 "cpg.c",
>     file_line=901, format=0x7 <Address 0x7 out of bounds>) at logsys.c:1340
> #9  0xfeae0aa9 in cpg_lib_init_fn (conn=0x81da300) at cpg.c:901
> #10 0xfec62574 in coroipcs_handler_dispatch (fd=9, revent=<value 
> optimized out>, context=0x81da300) at coroipcs.c:1333
> #11 0x08053b79 in corosync_poll_handler_dispatch 
> (handle=8928880753731698688, fd=9, revent=1, context=0x81da300) at 
> main.c:589
> #12 0xfedf31a5 in poll_run (handle=8928880753731698688) at coropoll.c:393
> #13 0x080542be in main (argc=2, argv=0x8047d28) at main.c:934
> 
> 
> I dug into why the first run failed. getpeerucred returned with
> errno=EFAULT. The man page says that the ucred_t ** argument should
> point to a NULL pointer, so the variable should be initialized:
> 
> -bash-3.00# svn di
> Index: exec/coroipcs.c
> ===================================================================
> --- exec/coroipcs.c     (revision 2264)
> +++ exec/coroipcs.c     (working copy)
> @@ -673,7 +673,7 @@
>   * Solaris and some BSD systems
>   */
>         {
> -               ucred_t *uc;
> +               ucred_t *uc = NULL;
>                 uid_t euid = -1;
>                 gid_t egid = -1;
> 
> After this patch I get the core dump on the first run :) ... I pass the
> credential check but something goes wrong later...
> 
> Regards,
> Wojtek


