Bug(s) in StalePassiveConnectionService


Josh Hershberg <jhershbe@...>
 

Slightly newer version of the text:

Analysis of event handling in StalePassiveConnectionService

Global structures:

clientFutureMap (not technically global, but passed around): a map of old client ->SettableFuture

pendingConnectionClients: Maps a new client -> clientFutureMap (where all the old clients are to that same OVS)


handleNewPassiveConnection

pendingConnectionClients.put(newClient, new clientFutureMap)

For each old client

Add a future to the clientFutureMap

Start a ping

Start a 10 second timer (echoTimeoutFuture) Note: this seems to be a timeout to determine that the echo pings have failed. Why set it to 10 seconds when the rpc client’s timeout (whose future we listen to) is so much smaller? Also, why have this at all, why not just rely on the rpc client’s timeout?

Events:

Ping timeout

Cancellation exception is logged createStaleConnectionFutureCallback’s onFailure

Ping response

createStaleConnectionFutureCallback’s onSuccess:

Remove client from clientFutureMap

If clientFutureMap is empty,

kill the 10 second timer

Notify of new client

Remove new client from pendingConnectionClients

Note: this seems to be a bug. In this case, wouldn’t we want to continue with the old connections and ignore the new? Perhaps not. Perhaps we want to use the newest valid connection and the purpose of this is to get rid of the old, lingering connections by causing them to IOException (see the commit message and comments in the code). Perhaps this was the original thinking but it still does not make so much sense to me.

Ping failure

Presumably, the same as a timeout. However, the comment in the code (and original commit message) might indicate that the author might have thought something else, like that onSuccess would be called. Hard to understand how that could be the case.

New connection closed

Get the clientFutureMap in pendingConnectionClients for this client (why is this done in a loop?)

Call future.set(null) on all the futures for this canceled client..this should result in a call to “Ping Response” above.???? Why? Why should we not just remove it from the data structure?

10 second timer fires

Cancel the clientFutures for this new client in its clientFutureMap. Huh!?!? Wouldn’t we want to also notify about the new connection? Presumably if the 10 second timer fires (it’s called the clientEchoTimeout) wouldn’t we assume those clients are dead?


On Wed, Apr 25, 2018 at 11:42 AM, Josh Hershberg <jhershbe@...> wrote:
All, 

This bug [0] has me looking at StalePassiveConnectionService and aside from this specific bug, it seems there may be other issues. I took some notes about how this class seems to work and the questions are in bold. I have some fixes in mind (and a patch with a rough show of one of them that really needs to be reworked). 

I would REALLY appreciate at least another set of eyes on this. Please see my analysis below.


-Josh

Analysis of event handling in StalePassiveConnectionService

Global structures:

clientFutureMap (not technically global, but passed around): a map of old client ->SettableFuture

pendingConnectionClients: Maps a new client -> clientFutureMap (where all the old clients are to that same OVS)


handleNewPassiveConnection

pendingConnectionClients.put(newClient, new clientFutureMap)

For each old client

Add a future to the clientFutureMap

Start a ping

Start a 10 second timer (echoTimeoutFuture) Note: this seems to be a timeout to determine that the echo pings have failed. Why set it to 10 seconds when the rpc client’s timeout (whose future we listen to) is so much smaller? Also, why have this at all, why not just rely on the rpc client’s timeout?

Events:

Ping timeout

Cancellation exception is logged createStaleConnectionFutureCallback’s onFailure

Ping response

createStaleConnectionFutureCallback’s onSuccess:

Remove client from clientFutureMap

If clientFutureMap is empty,

kill the 10 second timer

Notify of new client

Remove new client from pendingConnectionClients

Note: this seems to be a bug. In this case, wouldn’t we want to continue with the old connections and ignore the new? Perhaps not. Perhaps we want to use the newest valid connection and the purpose of this is to get rid of the old, lingering connections by causing them to IOException (see the commit message and comments in the code). Perhaps this was the original thinking but it still does not make so much sense to me.

Ping failure

Presumably, the same as a timeout. However, the comment in the code (and original commit message) might indicate that the author might have thought something else, like that onSuccess would be called. Hard to understand how that could be the case.

New connection closed

Get the clientFutureMap in pendingConnectionClients for this client (why is this done in a loop?)

Call future.set(null) on all the futures..this should result in a call to “Ping Response” above.???? Huh??? Shouldn’t this be a cancel instead of a set(null)?? Shouldn’t we continue with the old connections in this case? This looks to me like it will literally trigger a notification of the new client...which disconnected!

10 second timer fires

Cancel the clientFutures for this new client in its clientFutureMap. Huh!?!? Wouldn’t we want to also notify about the new connection? Presumably if the 10 second timer fires (it’s called the clientEchoTimeout) wouldn’t we assume those clients are dead?




Josh Hershberg <jhershbe@...>
 

All, 

This bug [0] has me looking at StalePassiveConnectionService and aside from this specific bug, it seems there may be other issues. I took some notes about how this class seems to work and the questions are in bold. I have some fixes in mind (and a patch with a rough show of one of them that really needs to be reworked). 

I would REALLY appreciate at least another set of eyes on this. Please see my analysis below.


-Josh

Analysis of event handling in StalePassiveConnectionService

Global structures:

clientFutureMap (not technically global, but passed around): a map of old client ->SettableFuture

pendingConnectionClients: Maps a new client -> clientFutureMap (where all the old clients are to that same OVS)


handleNewPassiveConnection

pendingConnectionClients.put(newClient, new clientFutureMap)

For each old client

Add a future to the clientFutureMap

Start a ping

Start a 10 second timer (echoTimeoutFuture) Note: this seems to be a timeout to determine that the echo pings have failed. Why set it to 10 seconds when the rpc client’s timeout (whose future we listen to) is so much smaller? Also, why have this at all, why not just rely on the rpc client’s timeout?

Events:

Ping timeout

Cancellation exception is logged createStaleConnectionFutureCallback’s onFailure

Ping response

createStaleConnectionFutureCallback’s onSuccess:

Remove client from clientFutureMap

If clientFutureMap is empty,

kill the 10 second timer

Notify of new client

Remove new client from pendingConnectionClients

Note: this seems to be a bug. In this case, wouldn’t we want to continue with the old connections and ignore the new? Perhaps not. Perhaps we want to use the newest valid connection and the purpose of this is to get rid of the old, lingering connections by causing them to IOException (see the commit message and comments in the code). Perhaps this was the original thinking but it still does not make so much sense to me.

Ping failure

Presumably, the same as a timeout. However, the comment in the code (and original commit message) might indicate that the author might have thought something else, like that onSuccess would be called. Hard to understand how that could be the case.

New connection closed

Get the clientFutureMap in pendingConnectionClients for this client (why is this done in a loop?)

Call future.set(null) on all the futures..this should result in a call to “Ping Response” above.???? Huh??? Shouldn’t this be a cancel instead of a set(null)?? Shouldn’t we continue with the old connections in this case? This looks to me like it will literally trigger a notification of the new client...which disconnected!

10 second timer fires

Cancel the clientFutures for this new client in its clientFutureMap. Huh!?!? Wouldn’t we want to also notify about the new connection? Presumably if the 10 second timer fires (it’s called the clientEchoTimeout) wouldn’t we assume those clients are dead?