This gerrit: OVSDB-439 Stale connection check is failing and


Josh Hershberg <jhershbe@...>
 

Anil et al,

We seem to have hit this issue. Basically, if we power off a compute node, OVS dies abruptly with no clean connection teardown. Then, when OVS comes back up, notifyListenerForPassiveConnection is never called in this code (OvsdbConnectionService):

        List<OvsdbClient> clientsFromSameNode = getPassiveClientsFromSameNode(client);
        if (clientsFromSameNode.size() == 0) {
            notifyListenerForPassiveConnection(client);
        } else {
            STALE_PASSIVE_CONNECTION_SERVICE.handleNewPassiveConnection(client, clientsFromSameNode);
        }
    }

By then, Openflowplugin has already deleted the important parts of the nodes in the operational data store, some time after the connection dropped. Because notifyListenerForPassiveConnection is not called, those nodes are not recreated, which in turn causes all kinds of unpleasantness.
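
To make the failure mode concrete, here is a minimal single-threaded model of the branch quoted above. It is only a sketch: the class, the method names (onPassiveConnection, onCleanDisconnect), the string connection ids and the node address are all hypothetical and are not taken from OvsdbConnectionService. It assumes the controller only forgets a client when it sees a disconnect event, which is what an abrupt power-off never delivers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the dispatch branch; NOT the real OvsdbConnectionService.
final class PassiveConnectionDispatchSketch {
    // Node address -> connection ids the controller still believes are connected.
    private final Map<String, List<String>> clientsByNode = new HashMap<>();
    // Connections for which the listener was notified (nodes get created).
    final List<String> notified = new ArrayList<>();
    // Connections that were routed to the stale-connection check instead.
    final List<String> sentToStaleCheck = new ArrayList<>();

    void onPassiveConnection(String nodeAddress, String connectionId) {
        List<String> fromSameNode =
                clientsByNode.computeIfAbsent(nodeAddress, k -> new ArrayList<>());
        if (fromSameNode.isEmpty()) {
            notified.add(connectionId);          // first branch of the original snippet
        } else {
            sentToStaleCheck.add(connectionId);  // else branch of the original snippet
        }
        fromSameNode.add(connectionId);
    }

    // Only a clean teardown removes the old client from the registry.
    void onCleanDisconnect(String nodeAddress, String connectionId) {
        List<String> known = clientsByNode.get(nodeAddress);
        if (known != null) {
            known.remove(connectionId);
        }
    }

    public static void main(String[] args) {
        PassiveConnectionDispatchSketch clean = new PassiveConnectionDispatchSketch();
        clean.onPassiveConnection("10.0.0.5", "conn-1");
        clean.onCleanDisconnect("10.0.0.5", "conn-1");
        clean.onPassiveConnection("10.0.0.5", "conn-2");
        System.out.println("clean teardown, notified: " + clean.notified);

        PassiveConnectionDispatchSketch abrupt = new PassiveConnectionDispatchSketch();
        abrupt.onPassiveConnection("10.0.0.5", "conn-1");
        // Power-off: no disconnect event ever arrives for conn-1.
        abrupt.onPassiveConnection("10.0.0.5", "conn-2");
        System.out.println("abrupt power-off, notified: " + abrupt.notified
                + ", stale check: " + abrupt.sentToStaleCheck);
    }
}
```

In this model the reconnect after a clean teardown is notified, while the reconnect after a power-off lands in the stale-check path, so node recreation then depends entirely on that check reaching a conclusion.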

Here are some related links:
[0] Red Hat Bugzilla with some details: https://bugzilla.redhat.com/show_bug.cgi?id=1560872

Thanks so much,

Josh


Anil Vishnoi
 

Hi Josh,

Please see my comments inline.

On Wed, Apr 11, 2018 at 8:34 AM, Josh Hershberg <jhershbe@...> wrote:
Anil et al,

We seem to have hit this issue. Basically, if we power off a compute node, OVS dies abruptly with no clean connection teardown. Then, when OVS comes back up, notifyListenerForPassiveConnection is never called in this code (OvsdbConnectionService):

Do you see the following log message in karaf.log?

"Probe failed to OVSDB switch. Disconnecting the channel {}"

If not, then you are most probably hitting the issue that is fixed by gerrit [1]. There are two possible scenarios that the OVSDB southbound plugin can hit.

(1) OVS is already connected to the OVSDB southbound plugin, and another connection arrives from the same device. Because the OVSDB SB plugin currently doesn't allow multiple connections from the same device, it disconnects the previous active connection. This is already implemented and should be working fine.
(2) OVS was connected but is now down, while the OVSDB SB plugin still thinks it is connected. In that case the plugin checks whether the old connection is alive by sending an echo request. Because the connection is already down on the OVS side, no echo response arrives, and the check throws a TimeoutException. This scenario was not handled properly in OVSDB, and gerrit patch [1] resolves that.
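
As an illustration of scenario (2), the sketch below models the aliveness probe with a plain CompletableFuture standing in for the echo reply. It is not the code from the patch: the class name, the Verdict enum and the probe method are hypothetical, and the choice to read a timeout as "the old connection is stale" is an assumption about one reasonable way to handle the TimeoutException, not a description of what gerrit [1] does.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of an echo-based aliveness probe; NOT the OVSDB library API.
final class StaleConnectionCheckSketch {
    enum Verdict { OLD_CONNECTION_ALIVE, OLD_CONNECTION_STALE }

    // echoReply stands in for the pending reply to an echo request sent on the old connection.
    static Verdict probe(CompletableFuture<String> echoReply, long timeoutMillis) {
        try {
            echoReply.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return Verdict.OLD_CONNECTION_ALIVE;
        } catch (TimeoutException | ExecutionException e) {
            // Assumption: no reply (or a failed reply) means the peer is gone,
            // so the timeout is treated as a result rather than as an unhandled error.
            return Verdict.OLD_CONNECTION_STALE;
        } catch (InterruptedException e) {
            // Restore the interrupt flag; the probe could not reach a verdict.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("probe interrupted", e);
        }
    }

    public static void main(String[] args) {
        // A peer that answers immediately.
        Verdict alive = probe(CompletableFuture.completedFuture("echo-reply"), 100);
        // A peer that never answers, as after an abrupt power-off.
        Verdict dead = probe(new CompletableFuture<String>(), 100);
        System.out.println(alive + " " + dead);
    }
}
```

The point of the sketch is only that the timeout path yields a verdict the caller can act on; what the caller then does with the old and new connections is left out on purpose.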

Did you test this scenario with this patch applied? If you still hit the issue with it, enable debug logging for the library code and send me the log; I can have a look.

Thanks
Anil

