Hi Andrew and Rahul:
I remember we discussed these topics in the ODL containers and Helm charts meetings. Do we know whether the expected configuration works with the ODL-on-K8s cluster setup, or whether it requires configuration changes?
Cheers, Anil
A message was sent to the group https://lists.opendaylight.org/g/dev from rohini.ambika@... that needs to be approved because the user is a new, moderated member.
Subject: FW: ODL Clustering issue - High Availability
Hi All,
As presented and discussed in the ODL TSC meeting held on Friday the 22nd at 10:30 AM IST, I am posting this email to highlight the issues with ODL clustering use cases encountered during performance testing.
Details and configuration are as follows:
* Requirement: ODL clustering for high availability (HA) of data distribution
* Environment configuration:
  * 3-node K8s cluster (1 master and 3 worker nodes) with 3 ODL instances, one running on each node
  * CPU: 8 cores
  * RAM: 20 GB
  * Java heap size: min 512 MB, max 16 GB
  * JDK version: 11
  * Kubernetes version: 1.19.1
  * Docker version: 20.10.7
* ODL features installed to enable clustering:
  * odl-netconf-clustered-topology
  * odl-restconf-all
* Devices configured: NETCONF devices, all with the same schema (tested with 250 devices)
* Use case:
  * Failover/high availability:
    * Expected: If any ODL instance goes down or restarts due to a network split or an internal error, the other instances in the cluster should remain available and functional. If the affected instance holds the master mount, the instance elected as the new master should be able to re-register the devices and resume operations. Once the affected instance comes back up, it should rejoin the cluster as a member node and register the slave mounts.
    * Observation: When the ODL instance holding the master mount restarts, an election takes place among the remaining nodes and a new leader is elected. The new leader then tries to re-register the master mount but fails partway through because the Akka Cluster Singleton actor has been terminated. The cluster therefore goes into an idle state and fails to assign an owner for the device DOM entity. In this state, configuration of already-mounted devices and of new mounts fails.
* JIRA reference: https://jira.opendaylight.org/browse/CONTROLLER-2035
* The Akka configuration of all the nodes is attached. (The gossip-interval was increased to 5s in akka.conf to avoid Akka AskTimedOut errors while mounting many devices at a time; see the sketch after this list.)
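For readers reproducing this setup, the following is a minimal per-node sketch of the steps described above: installing the two clustering features, raising the Akka gossip interval, and checking the NETCONF topology afterwards. The Karaf client path, RESTCONF port, credentials, and the exact akka.conf layout are assumptions based on a default OpenDaylight distribution, not details taken from this thread.

```bash
# Sketch only: paths, ports, and credentials are assumed defaults.

# 1. Install the clustering features on each member's Karaf instance.
./bin/client "feature:install odl-netconf-clustered-topology odl-restconf-all"

# 2. Illustrate the gossip-interval override mentioned above. In the ODL
#    distribution this setting sits inside configuration/initial/akka.conf,
#    nested under odl-cluster-data { akka { cluster { ... } } }; merge the
#    fragment below into that file by hand rather than overwriting it.
cat <<'EOF' > /tmp/akka-gossip-fragment.conf
odl-cluster-data {
  akka {
    cluster {
      gossip-interval = 5s
    }
  }
}
EOF

# 3. After mounting devices, inspect the operational NETCONF topology over
#    RFC 8040 RESTCONF to confirm the mounts are registered.
curl -u admin:admin \
  "http://localhost:8181/rests/data/network-topology:network-topology/topology=topology-netconf?content=nonconfig"
```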
Requesting your support to identify whether there is any misconfiguration or a known solution for this issue. Please let us know if any further information is required.
Note: We have tested a single ODL instance, without the clustering features enabled, in the K8s cluster. In case of a K8s node failure, the ODL instance is rescheduled on another available K8s node and operations resume.
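A quick way to observe that rescheduling behaviour is to watch the ODL pod during a node failure. The namespace, label selector, and node name below are illustrative placeholders, not values from this thread.

```bash
# Watch the ODL pod move to another node when its current node fails.
kubectl get pods -n odl -l app=opendaylight -o wide --watch

# Optionally simulate a node failure by cordoning and draining the node
# that currently hosts the pod (node name is a placeholder).
kubectl drain <node-name> --ignore-daemonsets
```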
Thanks & Regards, Rohini
Rahul Sharma <rahul.iitr@...>
Hi Anil,
Thank you for bringing this up.
A couple of questions:
1. Is the test deployment using our Helm charts (the ODL Helm chart)?
2. I see that the JIRA mentioned in the email below (https://jira.opendaylight.org/browse/CONTROLLER-2035) is already marked Resolved. Has somebody fixed it in the latest version?
Thanks, Rahul
Anil Shashikumar Belur <abelur@...>
Hi,
I believe they are using the ODL Helm charts and K8s for the cluster setup; that said, I have requested the version of ODL being used.
Rohini: Can you provide more details on the ODL version and configuration that Rahul/Andrew requested?
Rahul Sharma <rahul.iitr@...>
Hello Rohini,
Thank you for the answers.
- For the first one: when you say you tried the official Helm charts, which charts are you referring to? Can you send more details on how you deployed them (the parameters in the values.yaml that you used)?
- What was the temporary fix that reduced the occurrence of the issue? Can you point to the check-in or the configuration-parameter change? That would help in diagnosing a proper fix.
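One straightforward way to share those deployment parameters would be to export them from the running release with the standard Helm CLI; the release and chart names below are placeholders, not values from this thread.

```bash
# User-supplied (and computed) values for the deployed release.
helm get values <odl-release-name> --all -o yaml > deployed-values.yaml

# Default values shipped with the chart, for comparison.
helm show values <repo>/<opendaylight-chart> > default-values.yaml
```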
Regards, Rahul
Rohini Ambika <rohini.ambika@...>
Hi Anil,
Thanks for the response.
Please find the details below:
1. Is the test deployment using our Helm charts (the ODL Helm chart)? – We have created our own Helm chart for the ODL deployment. We have also tried the use case with the official Helm chart.
2. I see that the JIRA mentioned in the email below (https://jira.opendaylight.org/browse/CONTROLLER-2035) is already marked Resolved. Has somebody fixed it in the latest version? – This was a temporary fix from our end, and the failure rate has reduced because of it; however, we still face the issue when we restart the master node multiple times.
The ODL version used is Phosphorus SR2.
All the configurations were provided and attached in the initial mail.
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
Rohini Ambika <rohini.ambika@...>
Hello Rahul,
Please find the answers below:
- Official Helm chart @ ODL Helm Chart. Attaching the values.yml for reference.
- The fix was to restart the Owner Supervisor on failure. Check-in @ https://git.opendaylight.org/gerrit/c/controller/+/100357
We observed the same problem when we tested without a K8s setup, following the instructions at https://docs.opendaylight.org/en/stable-phosphorus/getting-started-guide/clustering.html. Instead of installing the odl-mdsal-distributed-datastore feature, we enabled the features given in the values.yml.
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
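The values.yml referenced above is attached to the original message and is not reproduced here. As a rough illustration of how such a Helm-based deployment is typically driven, the sketch below renders a values override and installs a chart with it; the chart reference, release name, and value keys are assumptions for illustration only, not taken from the attached values.yml or from the official chart.

```bash
# Sketch: write a values override and install the chart with it.
# All names and keys here are placeholders/assumptions.
cat <<'EOF' > odl-values-override.yaml
replicaCount: 3
features:
  - odl-netconf-clustered-topology
  - odl-restconf-all
javaOptions: "-Xms512m -Xmx16g"
EOF

helm install odl <repo>/<opendaylight-chart> -f odl-values-override.yaml
```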
Rohini Ambika <rohini.ambika@...>
Hello,
Did you get a chance to look into the configurations I shared?
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
Rahul Sharma <rahul.iitr@...>
Hi Rohini,
Sorry, I got pulled into other things. For this issue, we were wondering whether it was related to ODL being deployed with the Helm charts; however, considering that the problem is also reproducible when ODL runs as a cluster without K8s, the ODL clustering team can perhaps provide better input, since the problem looks to be at the application level. Let me know what you think.
Regards, Rahul
toggle quoted message
Show quoted text
Hello,
Did you get a chance to look in to the configurations shared.
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
Hello Rahul,
Please find the answers below:
- Official Helm chart @
ODL Helm Chart . Attaching the values.yml for reference
- Fix was to restart the Owner Supervisor on failure . Check-in @
https://git.opendaylight.org/gerrit/c/controller/+/100357
We observed the same problem when tested without K8s set up by following the instructions @
https://docs.opendaylight.org/en/stable-phosphorus/getting-started-guide/clustering.html. Instead of installing
odl-mdsal-distributed-datastore feature, we have enabled the features given in the values.yml.
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
[**EXTERNAL EMAIL**]
Hello Rohini,
Thank you for the answers.
-
For the 1st one: when you say you tried with the official Helm charts - which helm charts are you referring to? Can you send more details on how (parameters in values.yaml that you used) when you deployed these charts.
-
What was the Temporary fix that reduced the occurrence of the issue. Can you point to the check-in made or change in configuration parameters? Would be helpful to diagnose a proper fix.
Hi Anil,
Thanks for the response.
Please find the details below:
1. Is the Test deployment using our Helm charts (ODL Helm Chart)? – We have created our own helm chart for the ODL deployment. Have also tried
the use case with official helm chart.
2. I see that the JIRA mentioned in the below email (
https://jira.opendaylight.org/browse/CONTROLLER-2035 ) is already marked Resolved. Has somebody fixed it in the latest version. –
This was a temporary fix from our end and the failure rate has reduced due to the fix, however we are still facing the issue when we do multiple restarts of master node.
ODL version used is Phosphorous SR2
All the configurations are provided and attached in the initial mail .
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
[**EXTERNAL EMAIL**]
I belive they are using ODL Helm charts and K8s for the cluster setup, that said I have requested the version of ODL being used.
Rohoni: Can you provide more details on the ODL version, and configuration, that Rahul/Andrew requested?
Thank you for bringing this up.
On Wed, Jul 27, 2022 at 5:05 PM Anil Shashikumar Belur <abelur@...> wrote:
Hi Andrew and Rahul:
I remember we have discussed these topics in the ODL containers and helm charts meetings.
Do we know if the expected configuration would work with the ODL on K8s clusters setup or requires some configuration changes?
Rohini Ambika <rohini.ambika@...>
Hi Rahul,
Thanks for your response.
We can confirm that the issue persists without K8s when we deploy ODL as a cluster.
Could you help us connect with the ODL clustering team to proceed further?
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
From: Rahul Sharma <rahul.iitr@...>
Sent: Tuesday, August 9, 2022 2:07 AM
To: Rohini Ambika <rohini.ambika@...>
Cc: Anil Shashikumar Belur <abelur@...>; Hsia, Andrew <andrew.hsia@...>; Casey Cain <ccain@...>; Luis Gomez <ecelgp@...>; TSC <tsc@...>; John Mangan <John.Mangan@...>;
Sathya Manalan <sathya.manalan@...>; Hemalatha Thangavelu <hemalatha.t@...>; Gokul Sakthivel <gokul.sakthivel@...>; Bhaswati_Das <Bhaswati_Das@...>
Subject: Re: [opendaylight-dev] Message Approval Needed - rohini.ambika@... posted to dev@...
Hi Rohini,
Sorry, I got pulled into other things.
For this issue, we were wondering whether it is related to ODL being deployed with Helm charts; given that the problem is also reproducible when ODL runs as a cluster (without K8s), it looks to be at the application level, so perhaps the ODL Clustering team can provide better input.
Let me know what you think.
Hello,
Did you get a chance to look into the configurations shared?
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
Venkatrangan Govindarajan
Hi Rohini,
I think we have connected already; I can take a look at the issue and provide a response next week. As we discussed, please check whether your use case requires HA in the NBI. We will look at the logged JIRA ticket and get back to you.
Regards, Rangan
Rohini Ambika <rohini.ambika@...>
Hi Rangan,
Thanks. Looking forward to your response.
Hi @John Mangan, could you please confirm whether our use case requires HA in the NBI?
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
Rohini Ambika <rohini.ambika@...>
Hi Rangan,
A gentle reminder.
Thanks & Regards,
Rohini
Cell: +91.9995241298 | VoIP: +91.471.3025332
From: John Mangan <John.Mangan@...>
Sent: Wednesday, August 10, 2022 5:54 PM
To: Rohini Ambika <rohini.ambika@...>; Venkatrangan Govindarajan <gvrangan@...>
Cc: Rahul Sharma <rahul.iitr@...>; Anil Shashikumar Belur <abelur@...>; Hsia, Andrew <andrew.hsia@...>; Casey Cain <ccain@...>; Luis Gomez <ecelgp@...>; TSC <tsc@...>; Sathya
Manalan <sathya.manalan@...>; Hemalatha Thangavelu <hemalatha.t@...>; Gokul Sakthivel <gokul.sakthivel@...>; Bhaswati_Das <Bhaswati_Das@...>
Subject: RE: [OpenDaylight TSC] [opendaylight-dev] Message Approval Needed - rohini.ambika@... posted to dev@...
Hi Rohini,
The use case does not require HA on the NBI.
It involves simply bringing down the master node of a 3-node cluster (either on Kubernetes or with ODL installed directly on 3 VMs) while many NETCONF devices are registered.
During re-election of the master node we see Akka framework timeouts and failures to register the devices on the new master; some devices are re-registered and some are not.
No northbound traffic is required to reproduce the issue.
-John
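As a concrete way to drive the check John describes against the Kubernetes deployment, something along the following lines can kill the pod holding the master mounts and then count how many devices come back. It is only a sketch of the procedure, not a definitive test: the namespace, pod name, RESTCONF endpoint (RFC 8040 style path) and credentials are assumptions that have to be mapped onto the actual Helm deployment.

# Sketch of the failover check: delete the ODL pod that holds the master
# mounts, then poll the NETCONF topology until devices reconnect.
# Namespace, pod name, service URL, credentials and the RFC 8040 path are
# assumptions -- adjust them to the actual deployment.
import time
import requests
from kubernetes import client, config

NAMESPACE = "odl"                      # assumed namespace
LEADER_POD = "opendaylight-0"          # assumed pod currently holding master mounts
RESTCONF = "http://odl.example.com:8181/rests/data"  # assumed RESTCONF endpoint
TOPOLOGY = "network-topology:network-topology/topology=topology-netconf"
AUTH = ("admin", "admin")

def connected_devices():
    """Count NETCONF nodes reporting 'connected' in the operational topology."""
    resp = requests.get(f"{RESTCONF}/{TOPOLOGY}?content=nonconfig", auth=AUTH, timeout=30)
    resp.raise_for_status()
    nodes = resp.json()["network-topology:topology"][0].get("node", [])
    return sum(1 for n in nodes
               if n.get("netconf-node-topology:connection-status") == "connected")

def main():
    config.load_kube_config()
    before = connected_devices()
    print(f"devices connected before failover: {before}")

    # Kill the pod that currently holds the master mounts.
    client.CoreV1Api().delete_namespaced_pod(LEADER_POD, NAMESPACE)

    # Poll for up to ten minutes and report how many devices re-register.
    for _ in range(60):
        time.sleep(10)
        try:
            now = connected_devices()
        except requests.RequestException:
            continue  # RESTCONF may be briefly unavailable during re-election
        print(f"devices connected: {now}/{before}")
        if now >= before:
            break

if __name__ == "__main__":
    main()

Repeating the delete-and-poll cycle a few times should also show the partial re-registration mentioned above, where some devices reconnect on the new master and some do not.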