OVSDB still listening during graceful shutdown


Sam Hague
 

Vishal, Michael,

When changing the 3node csit to use a graceful shutdown rather than a kill -9, I noticed that the ovsdb listening port 6640 is still open while shutting down. It looks weird because, as the bundles shut down, the current connections are closed, but 6640 is still open, so the ovsdb nodes reconnect. Also, during this time any txs from ovsdb that are being processed hit exceptions. kill -9 just kills everything instantly, so we don't see this.

What looks to be happening is that the southbound-impl gets the blueprint shutdown first and starts to close connections. The library has not received the shutdown yet. Exceptions come out as part of closing the connections. A few seconds later the library gets the blueprint shutdown, but somehow the listening ports are still open. The ovsdb nodes reconnect and the southbound-impl shows logs for incoming connections. Not sure if the earlier exceptions left something hanging or the library was stuck.

Should we close those ports really quickly when shutting down? Is there a way to make the blueprint shutdown ordered?

Thanks, Sam



Vishal Thapar <vthapar@...>
 

Hi Sam,

Yes, we bring up the ports once everything is ready, so we should bring them down before everything goes away. Michael can tell more about how the order is determined by bundles, but we can figure out a way to trigger closing the port. The only issue is that HWVTEP and OVSDB both use the same port but different plugins. For the Netvirt use case it should be okay to trigger the port close from either, but from the library's perspective the port should be closed only when both are going down.
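
One way to picture the "close only when both are going down" rule is a listener that tracks which plugins still need it. The following is a minimal sketch, not the ovsdb library's code: the class SharedListenPort and its register/unregister methods are hypothetical names, and a plain java.net.ServerSocket stands in for whatever the library actually uses to listen on 6640.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ServerSocket;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: one listening socket shared by several plugins. */
final class SharedListenPort {
    private final int port;
    private final Set<String> users = new HashSet<>();
    private ServerSocket socket; // null while no plugin is registered

    SharedListenPort(int port) {
        this.port = port;
    }

    /** Registers a plugin; the first registration opens the socket. */
    synchronized void register(String plugin) {
        try {
            if (socket == null) {
                socket = new ServerSocket(port);
            }
            users.add(plugin);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Unregisters a plugin; the last one to leave closes the socket. */
    synchronized void unregister(String plugin) {
        if (users.remove(plugin) && users.isEmpty() && socket != null) {
            try {
                socket.close();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } finally {
                socket = null;
            }
        }
    }

    synchronized boolean isListening() {
        return socket != null && !socket.isClosed();
    }
}

final class SharedListenPortDemo {
    public static void main(String[] args) {
        // Port 0 asks the OS for any free port; the real listener would use 6640.
        SharedListenPort port = new SharedListenPort(0);
        port.register("southbound");
        port.register("hwvtep");
        port.unregister("southbound");
        System.out.println(port.isListening()); // true: hwvtep still needs the port
        port.unregister("hwvtep");
        System.out.println(port.isListening()); // false: last user is gone
    }
}
```

With only one plugin installed, as in the Netvirt case, the same logic closes the port as soon as that plugin unregisters.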

Is this high priority, or do we have time to discuss this at the DDF?

Regards,
Vishal.

On Fri, Sep 14, 2018 at 11:19 PM, Sam Hague <shague@...> wrote:
Vishal, Michael,

When changing the 3node csit to use a graceful shutdown rather than a kill -9, I noticed that the ovsdb listening port 6640 is still open while shutting down. It looks weird because, as the bundles shut down, the current connections are closed, but 6640 is still open, so the ovsdb nodes reconnect. Also, during this time any txs from ovsdb that are being processed hit exceptions. kill -9 just kills everything instantly, so we don't see this.

What looks to be happening is that the southbound-impl gets the blueprint shutdown first and starts to close connections. The library has not received the shutdown yet. Exceptions come out as part of closing the connections. A few seconds later the library gets the blueprint shutdown, but somehow the listening ports are still open. The ovsdb nodes reconnect and the southbound-impl shows logs for incoming connections. Not sure if the earlier exceptions left something hanging or the library was stuck.

Should we close those ports really quickly when shutting down? Is there a way to make the blueprint shutdown ordered?

Thanks, Sam




Sam Hague
 



On Sun, Sep 16, 2018 at 10:43 AM Vishal Thapar <vthapar@...> wrote:
Is this high priority, or do we have time to discuss this at the DDF?
It is in the cluster priority range. The graceful shutdown changes make this easy to reproduce. We can wait till the DDF, but I think this will be a good one to have in place. A related problem is that the openflowplugin has similar issues, where parts of ofp are going down but the switches are still connected and causing problems. I need to file some jiras there.





Michael Vorburger <vorburger@...>
 

On Mon, Sep 17, 2018 at 2:44 AM Sam Hague <shague@...> wrote:
It is in the cluster priority range. The graceful shutdown changes make this easy to reproduce. We can wait till the DDF, but I think this will be a good one to have in place. A related problem is that the openflowplugin has similar issues, where parts of ofp are going down but the switches are still connected and causing problems. I need to file some jiras there.
 
I'm not sure if / doubt that Blueprint stops things in the order you want, or how to make it do so, but perhaps the easiest is to just make it explicit? 

So a @PreDestroy method somewhere appropriate in ovsdb could call a close() (or whatever) in the library.
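
As a rough illustration of making the order explicit, the sketch below has the plugin's own shutdown method tell the library to stop listening before it drops its connections. Everything here is an assumption made for illustration: ListenerControl, Connection and PluginProvider are hypothetical names, not existing ovsdb classes. In ODL the close() method would carry @PreDestroy or be named as the Blueprint destroy-method; the annotation is omitted so the example compiles without extra dependencies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical hook the library would expose; not an existing ovsdb API. */
interface ListenerControl {
    void stopListening();
}

/** Hypothetical handle for one established ovsdb node connection. */
interface Connection {
    void disconnect();
}

/** Hypothetical stand-in for a plugin-side provider such as southbound-impl. */
final class PluginProvider implements AutoCloseable {
    private final ListenerControl library;
    private final List<Connection> connections = new ArrayList<>();

    PluginProvider(ListenerControl library) {
        this.library = library;
    }

    void addConnection(Connection connection) {
        connections.add(connection);
    }

    // This is where @PreDestroy (or the Blueprint destroy-method) would apply.
    @Override
    public void close() {
        // Step 1: stop accepting, so nodes cannot reconnect during teardown.
        library.stopListening();
        // Step 2: only now drop the established connections.
        for (Connection connection : connections) {
            connection.disconnect();
        }
        connections.clear();
    }
}

final class PluginProviderDemo {
    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        PluginProvider provider = new PluginProvider(() -> events.add("stopListening"));
        provider.addConnection(() -> events.add("disconnect ovs-1"));
        provider.addConnection(() -> events.add("disconnect ovs-2"));
        provider.close();
        System.out.println(events); // [stopListening, disconnect ovs-1, disconnect ovs-2]
    }
}
```

The point of the ordering is that no new connection can arrive between the two steps, which is the reconnect window described at the start of the thread.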

I am unfortunately not into ovsdb / library code details, so perhaps the best would be to look at this (code) together at ONS while F2F.





Vishal Thapar <vthapar@...>
 



On Mon, Sep 17, 2018 at 7:48 PM, Michael Vorburger <vorburger@...> wrote:
I'm not sure if / doubt that Blueprint stops things in the order you want, or how to make it do so, but perhaps the easiest is to just make it explicit? 

So a @PreDestroy method somewhere appropriate in ovsdb could call a close() (or whatever) in the library.
During bringup, if A depends on B, B will come up first. I expect the shutdown order to be the reverse: B can't go down while A, which depends on it, is still up. I suspect that is why we're seeing the issue. The listen socket is library code, while the plugins shut down first and trigger cleanup of their connections. We need to close the socket when ovsdb/hwvtep go down, as the bundle order means the library close will not happen till much later.
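
The expected ordering can be modelled in a few lines. This is only a sketch of the reasoning above, not of what Blueprint or OSGi actually does: BundleOrder is a hypothetical class, the bundle names are labels, and the dependency map (both plugins depend on the library) is an assumption taken from the discussion.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical model of dependency-ordered startup and reverse-ordered shutdown. */
final class BundleOrder {
    /** Dependencies first: a bundle is listed only after everything it depends on. */
    static List<String> startOrder(Map<String, List<String>> dependsOn) {
        Set<String> ordered = new LinkedHashSet<>();
        for (String bundle : dependsOn.keySet()) {
            visit(bundle, dependsOn, ordered);
        }
        return new ArrayList<>(ordered);
    }

    /** Shutdown is assumed to be the exact reverse of startup. */
    static List<String> stopOrder(Map<String, List<String>> dependsOn) {
        List<String> order = startOrder(dependsOn);
        Collections.reverse(order);
        return order;
    }

    // Depth-first walk; assumes the dependency graph has no cycles.
    private static void visit(String bundle, Map<String, List<String>> dependsOn,
                              Set<String> ordered) {
        if (ordered.contains(bundle)) {
            return;
        }
        for (String dep : dependsOn.getOrDefault(bundle, Collections.emptyList())) {
            visit(dep, dependsOn, ordered);
        }
        ordered.add(bundle);
    }
}

final class BundleOrderDemo {
    public static void main(String[] args) {
        Map<String, List<String>> dependsOn = new LinkedHashMap<>();
        dependsOn.put("southbound", Collections.singletonList("library"));
        dependsOn.put("hwvtep", Collections.singletonList("library"));
        System.out.println(BundleOrder.startOrder(dependsOn)); // [library, southbound, hwvtep]
        System.out.println(BundleOrder.stopOrder(dependsOn));  // [hwvtep, southbound, library]
    }
}
```

Under this model the library, and with it the listen socket, is always the last thing to stop, so the port stays open for as long as the plugins take to tear down their connections.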
