[yangtools-dev] OVSDB CSIT Failures on Oxygen


Tom Pantelis <tompantelis@...>
 



On Thu, Mar 8, 2018 at 1:35 PM, Sam Hague <shague@...> wrote:
Vishal, Tom,

Can you look at the comment from Robert on the original patch that set the codec iid and see if that moves us forward?

I think I see what he means - I can push a patch.
 

On Thu, Mar 8, 2018 at 1:03 AM, Vishal Thapar <vishal.thapar@...> wrote:
Stephen,
Any comments on thread safety of code calling serialize?

Note that these failures are on a patch that reverted the workaround we did for the yangtools issue. There are no other changes in the OVSDB code, except the checkstyle etc. fixes.

Regards,
Vishal.
-----Original Message-----
From: Robert Varga [mailto:nite@...]
Sent: 06 March 2018 16:19
To: Vishal Thapar <vishal.thapar@...>; Jamo Luhrsen <jluhrsen@...>; Sam Hague <shague@...>
Cc: ovsdb-dev@...rg; Kit Lou <klou.external@...>; yangtools-dev <yangtools-dev@...ght.org>
Subject: Re: [ovsdb-dev] OVSDB CSIT Failures on Oxygen

On 06/03/18 11:15, Robert Varga wrote:
>> Any suggestions on the same? Do we need to rewrite our serializers? If yes, inputs would be welcome. [8] is the original bug raised for this.
> The original issue was an ordering one; now it would seem you cannot
> find the module in the SchemaContext -- at this point the question is,
> what modules are in the SchemaContext -- you'll need to debug that on
> your side...

More specifically, it seems your class is not thread-safe w.r.t.
SchemaContext updates.

Bye,
Robert
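
For illustration, a minimal sketch of what "thread-safe w.r.t. SchemaContext updates" can look like, assuming the serializer caches the context it derives its codecs from. The class and method names below are hypothetical, not the actual OVSDB code; the point is that each serialize() call works against one snapshot taken from a safely published reference, instead of re-reading a mutable field that a SchemaContext update can change mid-call:

    import java.util.concurrent.atomic.AtomicReference;
    import org.opendaylight.yangtools.yang.model.api.SchemaContext;
    import org.opendaylight.yangtools.yang.model.api.SchemaContextListener;

    // Hypothetical example, not the real OVSDB serializer.
    public class ContextAwareSerializer implements SchemaContextListener {
        private final AtomicReference<SchemaContext> currentContext = new AtomicReference<>();

        @Override
        public void onGlobalContextUpdated(final SchemaContext context) {
            // Publish the new context atomically when it changes.
            currentContext.set(context);
        }

        public String serialize(final Object data) {
            // Take one snapshot for the whole operation so a concurrent update
            // cannot swap the context out from under us half-way through.
            final SchemaContext snapshot = currentContext.get();
            if (snapshot == null) {
                throw new IllegalStateException("SchemaContext not available yet");
            }
            return doSerialize(snapshot, data);
        }

        private String doSerialize(final SchemaContext context, final Object data) {
            // Placeholder for the actual codec lookup and serialization.
            return String.valueOf(data);
        }
    }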






Vishal Thapar <vishal.thapar@...>
 

Thank you Tom, the next round of beer is on me (:

 

Jamo,

This fixes the QoS regression we had. The 3-node cluster failure is a different issue. Do you want to track it as a separate bug or under this one?

 

For the 3-node one, I am seeing lots of shard exceptions on ODL1:

 

2018-03-08T19:52:09,785 | WARN  | ForkJoinPool.commonPool-worker-2 | AbstractShardBackendResolver     | 228 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | Failed to resolve shard

java.util.concurrent.TimeoutException: Shard has no current leader

        at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:129) ~[228:org.opendaylight.controller.sal-distributed-datastore:1.8.0.SNAPSHOT]

        at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$resolveBackendInfo$1(AbstractShardBackendResolver.java:112) ~[228:org.opendaylight.controller.sal-distributed-datastore:1.8.0.SNAPSHOT]

 

Is this an infra issue? Are 3-node cluster tests passing in other modules [OFplugin? NetVirt?]

 

Regards,

Vishal.
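
One quick way to tell whether this is an infra problem or a genuine leader-election failure is to ask each member for the shard's raft state over Jolokia. A rough sketch, assuming the default port and karaf credentials and the shard MBean naming used by the clustering docs/CSIT (the member address and shard name below are examples, substitute the ones from the failing job):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class ShardStateCheck {
        public static void main(String[] args) throws Exception {
            String member = args.length > 0 ? args[0] : "127.0.0.1";
            // Example shard; a healthy member reports RaftState Leader/Follower and a non-null Leader.
            String mbean = "org.opendaylight.controller:Category=Shards,"
                    + "name=member-1-shard-default-operational,type=DistributedOperationalDatastore";
            URL url = new URL("http://" + member + ":8181/jolokia/read/" + mbean);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            String auth = Base64.getEncoder()
                    .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));
            conn.setRequestProperty("Authorization", "Basic " + auth);
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                in.lines().forEach(System.out::println);
            }
        }
    }

If a member stays in Candidate or IsolatedLeader, or reports an empty Leader, the "Shard has no current leader" timeout above is a real election problem rather than a transient infra hiccup.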

 

Sam Hague
 



On Fri, Mar 9, 2018 at 12:05 AM, Vishal Thapar <vishal.thapar@...> wrote:

Is this an infra issue? Are 3-node cluster tests passing in other modules [OFplugin? NetVirt?]

Yes, it is a different bug. The problem is that it popped up recently and it only impacts the clustering suite, since that is where cluster operations are being exercised.

 


Tom Pantelis <tompantelis@...>
 



On Fri, Mar 9, 2018 at 6:57 AM, Sam Hague <shague@...> wrote:



Yes, it is a different bug. The problem is that it popped up recently and it only impacts the clustering suite, since that is where cluster operations are being exercised.



Nothing has changed in the clustering code recently, or in fact for the entire release. "Shard has no current leader" indicates the nodes didn't connect.


Sam Hague
 



On Fri, Mar 9, 2018 at 7:16 AM, Tom Pantelis <tompantelis@...> wrote:





Nothing has changed in the clustering code recently, or in fact for the entire release. "Shard has no current leader" indicates the nodes didn't connect.
But the OVSDB code has changed. Maybe it is just cleanup, but maybe it is somehow impacting clustering. It may not be related, but a few OVSDB patches went in just as the tests started failing.


Sam Hague
 



On Fri, Mar 9, 2018 at 7:20 AM, Sam Hague <shague@...> wrote:


But the OVSDB code has changed. Maybe it is just cleanup, but maybe it is somehow impacting clustering. It may not be related, but a few OVSDB patches went in just as the tests started failing.
Looks like both the Oxygen and Fluorine jobs are busted. Maybe something in the infra changed, or an mdsal/controller patch was merged into both Oxygen and Fluorine?


Tom Pantelis <tompantelis@...>
 



On Fri, Mar 9, 2018 at 7:30 AM, Sam Hague <shague@...> wrote:


Looks like both the Oxygen and Fluorine jobs are busted. Maybe something in the infra changed, or an mdsal/controller patch was merged into both Oxygen and Fluorine?


There have been no changes in clustering for quite some time. The error indicates the nodes aren't connecting, which would point to infra or something in the test setup. One thing, though: the stack trace indicates the test is using the newer tell-based protocol, which is disabled by default. If so, the test setup is enabling it in the cfg file - was that change recent? Still, there shouldn't be an issue with that.
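
For reference, the tell-based protocol toggle lives in the datastore configuration shipped with the distribution; if the CSIT setup is enabling it, it would be via something like the following (a sketch, assuming the Oxygen-era file name and property):

    # etc/org.opendaylight.controller.cluster.datastore.cfg
    # Off by default; the tell-based client is what shows up in stack traces as
    # AbstractShardBackendResolver, as in the log above.
    use-tell-based-protocol=true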


Luis Gomez
 

I think OVSDB is hitting https://jira.opendaylight.org/browse/CONTROLLER-1751. This has been seen in OFP for quite some time, but I do not think it is an infra problem (at least not the only reason) because it also happened with the old cloud provider.

You can easily identify the issue in the robot log with this phrase: "Keyword 'Check_Cluster_Is_In_Sync' failed after retrying for 5 minutes. The last error was: Index 2 has incorrect status: False"

This basically means the member does not join the cluster after it is started.
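
The SyncStatus flag that Check_Cluster_Is_In_Sync polls can also be read directly from the shard manager MBean over Jolokia. A rough sketch, assuming the default port/credentials and the config-datastore shard manager MBean name used in the clustering docs (run it against each member to see which one never reaches SyncStatus=true):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class SyncStatusCheck {
        public static void main(String[] args) throws Exception {
            String member = args.length > 0 ? args[0] : "127.0.0.1";
            // A member that never reports SyncStatus=true is the "Index N has
            // incorrect status: False" member from the robot log.
            String mbean = "org.opendaylight.controller:Category=ShardManager,"
                    + "name=shard-manager-config,type=DistributedConfigDatastore";
            URL url = new URL("http://" + member + ":8181/jolokia/read/" + mbean + "/SyncStatus");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Authorization", "Basic " + Base64.getEncoder()
                    .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8)));
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                in.lines().forEach(System.out::println);
            }
        }
    }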

I also opened a ticket with Akka for this, but so far I have not received much feedback.

BR/Luis



Jamo Luhrsen <jluhrsen@...>
 

Again, I was wondering if these 3-node OVSDB failures were just a blip
and we happened to hit it yesterday when everyone's radar was on high
alert. It seems like it.

https://jenkins.opendaylight.org/releng/user/jluhrsen/my-views/view/ovsdb%20csit/job/ovsdb-csit-3node-upstream-clustering-only-oxygen/194/robot/

Still, the failures are real and we have reported them from time to
time for a while now. As Luis said, even back on Rackspace.

Luis has a bug, I have a bug:

https://jira.opendaylight.org/browse/CONTROLLER-1768

I'm not sure how, when, or by whom these things can be addressed, but I don't
think we've ever assumed the clustering CSIT jobs were 100% stable.


JamO



Luis Gomez
 

On Mar 9, 2018, at 10:14 AM, Jamo Luhrsen <jluhrsen@...> wrote:

I'm not sure how, when, or by whom these things can be addressed, but I don't
think we've ever assumed the clustering CSIT jobs were 100% stable.
+1, this is hard to admit, but we cannot claim a stable cluster solution after 8 releases of ODL.


