Re: [netvirt-dev] how to address ovsdb node connection flap
Please find my comments in line.
I would vote for option 3 which you suggested.
From: Anil Vishnoi [mailto:vishnoianil@...]
On Fri, Dec 15, 2017 at 3:55 AM, K.V Suneelu Verma <k.v.suneelu.verma@...> wrote:
I have created the following jira
client connects to only one odl controller via ha proxy
Sometimes, when a client's OVSDB connection flaps, its node goes missing from the operational datastore.
The following scenarios can occur when the connection flaps:
1) The client disconnects and connects back to the same controller after some delay.
When the client disconnects, all the ODL controllers try to clean up the node in the operational datastore.
If one controller's cleanup processing is delayed, it can delete the node after the client has already reconnected, and the client node ends up missing from the operational topology.
Just to understand it better: I believe you are talking about a race condition across the cluster nodes. Once the node disconnects, all the controllers attempt to delete the node from the data store. Meanwhile, the switch connects back to a controller, and the node added by that controller gets deleted by another controller that is still processing the entity ownership notification to clean up the data store.
[Suneelu] This race is exactly what I am talking about.
This will generate unnecessary node-removed notifications to all the consumer applications, even if the connection flap happened for only one controller node, and that can trigger a lot of reconciliation logic in the consumer applications (e.g. netvirt).
I believe cleanup only happens when the controller that receives the disconnect sees that there is only one manager, or when the EOS says that the respective entity doesn't have an owner. A timeout across the cluster is going to be a ticking bomb; it can cause the same issue at any time.
You can't make that assumption, because if the switch is connected to even one controller and gets disconnected from it, all the controller nodes will get the following notifications:
(1) Latest owner (wasOwner=true, hasOwner=false, isOwner=false)
(2) Second controller (wasOwner=false, hasOwner=false, isOwner=false)
(3) Third controller (wasOwner=false, hasOwner=false, isOwner=false)
Assuming that all the controllers are up, it's probably easy to determine that the controller that gets the notification with wasOwner=true is the owner and should delete the node. But you can't make that assumption, because it won't hold if the owner controller gets killed. If you write this logic, the other two controllers won't clean up the data store, and even though the switch is not connected to any controller, you will see that the node is still present in the data store.
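To make the failure case concrete, here is a minimal Python sketch of the "only the previous owner cleans up" rule applied to the three notifications above. The class and function names are hypothetical and only model the flags discussed in this thread; this is not the EOS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OwnershipChange:
    """Hypothetical model of the three flags discussed in this thread."""
    was_owner: bool
    is_owner: bool
    has_owner: bool


def naive_should_cleanup(change):
    """Rule under discussion: only the previous owner deletes the node."""
    return change.was_owner and not change.has_owner


def cleaners(notifications):
    """Names of the controllers that would delete the node under the naive rule."""
    return sorted(name for name, change in notifications.items()
                  if naive_should_cleanup(change))


# Switch disconnects while all three controllers are up.
normal = {
    "controller-1": OwnershipChange(was_owner=True, is_owner=False, has_owner=False),
    "controller-2": OwnershipChange(was_owner=False, is_owner=False, has_owner=False),
    "controller-3": OwnershipChange(was_owner=False, is_owner=False, has_owner=False),
}

# Owner controller is killed: it never processes its own notification.
after_owner_crash = {name: change for name, change in normal.items()
                     if name != "controller-1"}
```

With all controllers up, `cleaners(normal)` names only the previous owner. After the owner is killed, `cleaners(after_owner_crash)` is empty, so nobody removes the stale node.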
It won't, for the reason I mentioned above.
Any approach that you implement should handle these scenarios
Now, this race condition happens because the existing clustering service cannot guarantee that EOS notifications are delivered at the same time across the cluster nodes. In my opinion, to address this issue properly for production-level deployments, we need to implement the following:
[Suneelu] Totally agree.
If somehow EOS gave a notification with info like this: (wasOwner=false, hasOwner=false, isOwner=false, original-owner=[timedout|crashed|released])
then the other controllers could detect that the owner controller crashed and clean up the node from the operational datastore.
Basically, the reason for the EOS change.
I noticed that such a reason field is not present in the newly introduced cluster singleton service either.
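As a rough sketch of how such a reason field could be used, assuming it existed (it does not exist in EOS today, and every name below is hypothetical), the cleanup decision on each controller might look like this in Python:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class OwnerLossReason(Enum):
    """Hypothetical values for the proposed original-owner field."""
    RELEASED = "released"
    TIMED_OUT = "timedout"
    CRASHED = "crashed"


@dataclass(frozen=True)
class OwnershipChangeWithReason:
    was_owner: bool
    is_owner: bool
    has_owner: bool
    reason: Optional[OwnerLossReason] = None  # assumed extra field, not in EOS


def should_cleanup(change):
    """Decide whether this controller should delete the oper node."""
    if change.has_owner:
        # Some controller still owns the entity, so the switch is connected.
        return False
    if change.was_owner:
        # The previous owner released the entity and cleans up itself.
        return True
    # Non-owners clean up only when the previous owner cannot do it.
    return change.reason in (OwnerLossReason.CRASHED, OwnerLossReason.TIMED_OUT)
```

Under this assumption, a non-owner that sees `reason=CRASHED` deletes the node, while a non-owner that sees `reason=RELEASED` leaves the cleanup to the previous owner.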
(1) Implement a connection flap damping mechanism at the library level. A simple exponential decay mechanism, like the one used for IP link flapping, can be used here. The tricky part is determining that the connection flap is happening for the same OVS, especially when you are behind a NAT. To make it deterministic, you will have to look at the iid coming from that OVS to determine which switch connection is flapping. You can expose various configuration parameters to users so that they can tune it for their environment.
(2) When the switch connects to all the controllers, you can still hit the same scenario if a connection flap happens, so I would suggest letting all the controllers write the node information to the data store. Given that writing connection details to the data store is not a very frequent operation, the cost of writing it 3 times per switch will be negligible. It will also keep the solution simple.
(3) Attach a temporary data store listener specifically for the node that the controller is writing to the data store. If that listener receives a node-delete notification, the plugin can write the node to the data store again if it sees that the local node is still connected to the switch. This approach will help with the scenarios where your switch is connected to only one controller.
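Option (3) can be illustrated with a small in-memory simulation in Python. This is only a sketch of the idea: `InMemoryOperStore` and `Controller` are hypothetical stand-ins, not the MD-SAL data broker or the OVSDB plugin, and the node id is made up.

```python
class InMemoryOperStore:
    """Hypothetical stand-in for the operational data store."""

    def __init__(self):
        self._nodes = {}
        self._delete_listeners = {}

    def put(self, node_id, data):
        self._nodes[node_id] = data

    def get(self, node_id):
        return self._nodes.get(node_id)

    def on_delete(self, node_id, callback):
        # Register a per-node delete listener once per callback.
        listeners = self._delete_listeners.setdefault(node_id, [])
        if callback not in listeners:
            listeners.append(callback)

    def delete(self, node_id):
        if self._nodes.pop(node_id, None) is None:
            return
        for callback in list(self._delete_listeners.get(node_id, [])):
            callback(node_id)


class Controller:
    """Hypothetical plugin instance that re-writes its node after a stale delete."""

    def __init__(self, store):
        self.store = store
        self.connected = {}  # node_id -> connection details held locally

    def switch_connected(self, node_id, data):
        self.connected[node_id] = data
        self.store.put(node_id, data)
        self.store.on_delete(node_id, self._rewrite_if_connected)

    def switch_disconnected(self, node_id):
        self.connected.pop(node_id, None)

    def _rewrite_if_connected(self, node_id):
        data = self.connected.get(node_id)
        if data is not None:
            # The switch is still connected locally, so the delete was stale.
            self.store.put(node_id, data)
```

If another controller's delayed cleanup calls `store.delete(node_id)` while the switch is still connected here, the node is written back. Once `switch_disconnected` has run, a delete is left alone.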
In my opinion, (2) + (3) should work for all the scenarios that I mentioned above and are more deterministic. But when it comes to scale, it's good to put a preventive mechanism at the lower layer as well (option 1), so that you can avoid unnecessary data writes during warm-up time by suppressing these connection flaps at the library level.
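For option (1), a minimal Python sketch of exponential-decay flap damping could look like the following. It is modelled loosely on the penalty/half-life scheme used in route flap damping; the class name, thresholds, and half-life are illustrative assumptions, and the key is meant to be the identifier reported by the OVS rather than its source IP, so that NAT does not confuse it.

```python
class FlapDamper:
    """Hypothetical per-switch flap damper with exponentially decaying penalty."""

    def __init__(self, penalty_per_flap=1000.0, suppress_at=2000.0,
                 reuse_at=750.0, half_life_s=15.0):
        self.penalty_per_flap = penalty_per_flap
        self.suppress_at = suppress_at
        self.reuse_at = reuse_at
        self.half_life_s = half_life_s
        self._state = {}  # key -> (penalty, last_update_time, suppressed)

    def _decayed(self, key, now):
        penalty, last, suppressed = self._state.get(key, (0.0, now, False))
        # The penalty halves every half_life_s seconds.
        penalty *= 0.5 ** ((now - last) / self.half_life_s)
        return penalty, suppressed

    def record_flap(self, key, now):
        """Call on every disconnect/reconnect cycle seen for this switch."""
        penalty, suppressed = self._decayed(key, now)
        penalty += self.penalty_per_flap
        if penalty >= self.suppress_at:
            suppressed = True
        self._state[key] = (penalty, now, suppressed)

    def is_suppressed(self, key, now):
        """True while connection events for this switch should be held back."""
        penalty, suppressed = self._decayed(key, now)
        if suppressed and penalty < self.reuse_at:
            suppressed = False
        self._state[key] = (penalty, now, suppressed)
        return suppressed
```

With these example values, a single flap is not suppressed, three flaps within a couple of seconds are, and the switch becomes eligible again once the penalty has decayed below `reuse_at`. All four parameters are the kind of knobs that could be exposed as user configuration.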
[Suneelu] Agree with option 3; not so sure about option 2, where all the controllers will try to write to the device (that may bring in more races, I feel).