[integration-dev] [opendaylight-dev] ODL Clustering issue - High Availability


Daniel de la Rosa
 

Rohini and all

Please use Phosphorus SR3, since CONTROLLER-2035 is fixed in that version. In any case, @Venkatrangan Govindarajan will also get back to you if he finds anything in the logs you provided.

thanks

On Wed, Jul 27, 2022 at 5:45 AM Rohini Ambika via lists.opendaylight.org <rohini.ambika=infosys.com@...> wrote:

Hi,

 

ODL version: Phosphorus SR2

 

Thanks & Regards,

Rohini

 

From: dev@... <dev@...> On Behalf Of Ivan Hrasko
Sent: Wednesday, July 27, 2022 5:31 PM
To: integration-dev@...; dev@...; kernel-dev@...
Subject: Re: [opendaylight-dev] ODL Clustering issue - High Availability

 


Hello,

 

What is the ODL version, please?

 

Best,

 

Ivan Hraško

Senior Software Engineer

 

PANTHEON.tech

Mlynské Nivy 56, 821 05 Bratislava

Slovakia

Tel / +421 220 665 111

 

MAIL / ivan.hrasko@...

WEB / https://pantheon.tech

 


From: integration-dev@... <integration-dev@...> On Behalf Of Rohini Ambika via lists.opendaylight.org <rohini.ambika=infosys.com@...>
Sent: Wednesday, July 27, 2022 1:19 PM
To: integration-dev@...; dev@...; kernel-dev@...
Subject: [integration-dev] ODL Clustering issue - High Availability

 

Hi All,

 

As presented and discussed in the ODL TSC meeting held on Friday the 22nd at 10:30 AM IST, we are posting this email to highlight issues with ODL clustering use cases encountered during performance testing.

 

Details and configuration are as follows:

 

- Requirement: ODL clustering for high availability (HA) of data distribution
- Environment configuration:
  - 3-node ODL cluster on Kubernetes (1 master and 3 worker nodes), with one ODL instance running on each worker node
  - CPU: 8 cores
  - RAM: 20 GB
  - Java heap size: min 512 MB, max 16 GB (see the setenv sketch after this list)
  - JDK version: 11
  - Kubernetes version: 1.19.1
  - Docker version: 20.10.7
- ODL features installed to enable clustering (see the feature-install example after this list):
  - odl-netconf-clustered-topology
  - odl-restconf-all
- Devices configured: NETCONF devices, all with the same schema (tested with 250 devices; see the mount sketch after this list)
- Use case:
  - Failover/high availability:
    - Expected: If any ODL instance goes down or is restarted due to a network split or an internal error, the other instances in the cluster should remain available and functional. If the affected instance holds the master mount, the instance elected as the new master should re-register the devices and resume operations. Once the affected instance comes back up, it should rejoin the cluster as a member node and register the slave mounts.
    - Observation: When the ODL instance holding the master mount restarts, an election takes place among the remaining nodes and a new leader is elected. The new leader then tries to re-register the master mount, but fails at the point where the Akka Cluster Singleton Actor is terminated. The cluster then goes idle and fails to assign an owner for the device DOM entity. In this state, configuration of already-mounted devices fails, as do new mounts.
- JIRA reference: https://jira.opendaylight.org/browse/CONTROLLER-2035
- Akka configuration of all the nodes is attached. (We increased the gossip-interval to 5 s in akka.conf to avoid Akka AskTimedOut errors while mounting multiple devices at a time; see the sketch below.)
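
For illustration, a minimal sketch of that gossip-interval change, assuming it is applied under akka.cluster.distributed-data in configuration/initial/akka.conf (Akka also has a separate akka.cluster.gossip-interval; which of the two the attached configuration tunes is an assumption here):

    # configuration/initial/akka.conf (excerpt, sketch)
    odl-cluster-data {
      akka {
        cluster {
          distributed-data {
            # Default is 2 s; raised to 5 s to reduce AskTimedOut errors
            # while mounting many NETCONF devices at a time.
            gossip-interval = 5 s
          }
        }
      }
    }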

 
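The clustering features listed above are typically installed from the Karaf console (or pre-listed in featuresBoot in etc/org.apache.karaf.features.cfg), for example:

    opendaylight-user@root> feature:install odl-netconf-clustered-topology odl-restconf-all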

 
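The heap bounds in the environment configuration correspond to Karaf's standard JAVA_MIN_MEM/JAVA_MAX_MEM variables; a sketch of the matching bin/setenv entries:

    # bin/setenv (excerpt)
    export JAVA_MIN_MEM=512m   # initial heap (-Xms)
    export JAVA_MAX_MEM=16g    # maximum heap (-Xmx)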

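Finally, a hedged sketch of how a single NETCONF device is mounted into the clustered topology through RESTCONF's RFC 8040 endpoint (the device name, address, and credentials below are hypothetical):

    curl -u admin:admin -X PUT \
      -H 'Content-Type: application/json' \
      'http://localhost:8181/rests/data/network-topology:network-topology/topology=topology-netconf/node=device-1' \
      -d '{
        "node": [{
          "node-id": "device-1",
          "netconf-node-topology:host": "192.0.2.1",
          "netconf-node-topology:port": 830,
          "netconf-node-topology:username": "user",
          "netconf-node-topology:password": "pass",
          "netconf-node-topology:tcp-only": false
        }]
      }'
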
Requesting your support to identify whether there is any misconfiguration, or whether there is a known solution for this issue.

Please let us know if any further information is required.

 

Note: We have also tested a single ODL instance, without the clustering features enabled, in the K8s cluster. In case of a K8s node failure, the ODL instance is rescheduled on another available K8s node and operations resume.

             

 

Thanks & Regards,

Rohini

 





Rohini Ambika <rohini.ambika@...>
 

Thanks.

 

We have already tested the CONTROLLER-2035 fix with Phosphorus SR2 (we created a patch with the fix), and the issue still persists when we restart the master node multiple times (approximately after the 10th restart).

 

Thanks & Regards,

Rohini

Cell: +91.9995241298 | VoIP: +91.471.3025332

 
