[Openais] [corosync] [patch] - Fix problems with long token timeout and cpg

David Teigland teigland at redhat.com
Thu Jul 2 11:37:06 PDT 2009


On Thu, Jul 02, 2009 at 11:09:26AM -0700, Steven Dake wrote:
> On Thu, 2009-07-02 at 09:27 -0500, David Teigland wrote:
> > On Thu, Jul 02, 2009 at 01:15:18PM +0200, Jan Friesse wrote:
> > > David Teigland wrote:
> > > > On Wed, Jul 01, 2009 at 01:46:03PM -0500, David Teigland wrote:
> > > >> other nodes should immediately recognize it has
> > > >> previously failed and process a complete failure for it.
> > > > 
> > > > i.e. the full equivalent of what apps (using any APIs) would see if the
> > > > node had failed via normal token timeout.
> > >
> > > More or less agree, but does this patch fix the problem for you or not?
> > 
> > I haven't tried the patch, but based on the description and a quick look at
> > the patch, I don't think it helps.  Think more broadly about what's happening
> > here, don't focus on one particular effect.
> > 
> > 1. nodes 1,2,3,4: are cluster members
> > 2. nodes 1,2,3,4: are using services A,B,C,D
> > 3. node4: ifdown eth0, kill corosync
> > 4. node4: ifup eth0, start corosync
> > 5. node4: do not start/use any services
> > 6. nodes 1,2,3: never see node4 removed from membership
> > 7. nodes 1,2,3: services A,B,C,D never see node4 removed/fail
> > 
> 
> Individual services have to protect against those sorts of restarts.
> The only other mechanism would be to break wire compatibility within
> Totem.  

I'm trying to define my specific problem for you; how/when/where you actually
fix it isn't my main concern at this point.
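
To make the scenario in steps 1-7 quoted above concrete, here is a toy
model in Python. It is only a sketch and not corosync or Totem code: the
Peer class, the "incarnation" counter and run_scenario() are hypothetical
names invented for illustration. It assumes a surviving node notices a
failure only when it has not heard from a peer for longer than the token
timeout, and it shows what changes if the survivor could also tell that
the peer has restarted.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """Toy membership view held by one surviving node (hypothetical model)."""
    token_timeout: float
    check_incarnation: bool = False
    last_seen: dict = field(default_factory=dict)    # node id -> time last heard
    incarnation: dict = field(default_factory=dict)  # node id -> last known incarnation
    events: list = field(default_factory=list)       # what services would be told

    def heard_from(self, node, now, incarnation):
        known = self.incarnation.get(node)
        if known is None:
            self.events.append(("join", node))
        elif self.check_incarnation and incarnation != known:
            # The node restarted inside the timeout window: report a
            # complete failure for the old instance before the new join.
            self.events.append(("leave", node))
            self.events.append(("join", node))
        self.last_seen[node] = now
        self.incarnation[node] = incarnation

    def tick(self, now):
        # Ordinary failure detection: silence longer than the token timeout.
        for node, seen in list(self.last_seen.items()):
            if now - seen > self.token_timeout:
                self.events.append(("leave", node))
                del self.last_seen[node]
                del self.incarnation[node]


def run_scenario(token_timeout, restart_delay, check_incarnation):
    peer = Peer(token_timeout=token_timeout, check_incarnation=check_incarnation)
    peer.heard_from(4, now=0.0, incarnation=1)             # steps 1-2: node4 is a member
    # Step 3: node4 goes silent at t=0. Step 4: it is back at t=restart_delay.
    peer.tick(now=restart_delay)
    peer.heard_from(4, now=restart_delay, incarnation=2)   # restarted instance
    return peer.events
```

With a long token timeout and a quick restart, run_scenario(10, 2, False)
records only the original join, which corresponds to steps 6 and 7: the
survivors never see node4 leave. With check_incarnation=True, or with a
token timeout shorter than the restart delay, the same run records a
leave followed by a join.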

(I'd suggest starting with a real, proper fix, without regard to compatibility
restrictions.  We'll get that working well.  Then, investigate the options for
backporting the same behavior into stable versions.  Doing that without
breaking compat will often involve some imperfect hacks.)

> This patch resolves the cpg case which is what the original bug was filed
> against.

It may resolve a problem that you're defining, but it doesn't resolve the
problem I'm defining.  Would you like bz 506255 to represent your bug or mine?
If yours, then I'll open a new bz.

Dave


