Vishal Thapar <vishal.thapar@...>
Thank you Tom, next time round of beer on me (:
Jamo,
This fixes the QoS regression that we had. The 3-node cluster is a different issue. Do you want to track it as a separate bug or along with this one?
For the 3-node one, I am seeing lots of shard exceptions on ODL1:
2018-03-08T19:52:09,785 | WARN | ForkJoinPool.commonPool-worker-2 | AbstractShardBackendResolver | 228 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | Failed to resolve shard
java.util.concurrent.TimeoutException: Shard has no current leader
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:129) ~[228:org.opendaylight.controller.sal-distributed-datastore:1.8.0.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$resolveBackendInfo$1(AbstractShardBackendResolver.java:112) ~[228:org.opendaylight.controller.sal-distributed-datastore:1.8.0.SNAPSHOT]
Infra issue? Are 3 node cluster tests passing in other modules [OFplugin? Netvirt?]
Regards,
Vishal.
From: Tom Pantelis [mailto:tompantelis@...]
Sent: 09 March 2018 00:27
To: Sam Hague <shague@...>
Cc: Vishal Thapar <vishal.thapar@...>; Jamo Luhrsen <jluhrsen@...>; ovsdb-dev@...; yangtools-dev <yangtools-dev@...>; Kit Lou <klou.external@...>; Stephen Kitt <skitt@...>
Subject: Re: [yangtools-dev] [ovsdb-dev] OVSDB CSIT Failures on Oxygen
On Thu, Mar 8, 2018 at 1:44 PM, Tom Pantelis <tompantelis@...> wrote:
On Thu, Mar 8, 2018 at 1:35 PM, Sam Hague <shague@...> wrote:
Vishal, Tom,
can you look at the comment from Robert on the original patch that set the codec iid and see if that moves us forward?
I think I see what he means - I can push a patch.
On Thu, Mar 8, 2018 at 1:03 AM, Vishal Thapar <vishal.thapar@...> wrote:
Stephen,
Any comments on thread safety of code calling serialize?
Note that these failures are on a patch that reverted the workaround we did for the yangtools issue. There are no other changes in the OVSDB code, except the checkstyle etc. fixes.
Regards,
Vishal.
-----Original Message-----
From: Robert Varga [mailto:nite@...]
Sent: 06 March 2018 16:19
To: Vishal Thapar <vishal.thapar@...>; Jamo Luhrsen <jluhrsen@...>; Sam Hague <shague@...>
Cc: ovsdb-dev@...; Kit Lou <klou.external@...>; yangtools-dev <yangtools-dev@...>
Subject: Re: [ovsdb-dev] OVSDB CSIT Failures on Oxygen
On 06/03/18 11:15, Robert Varga wrote:
>> Any suggestions on the same? Do we need to rewrite our serializers? If yes, inputs would be welcome. [8] is the original bug raised for this.
> The original issue was an ordering one, now it would seem you cannot
> find the module in the SchemaContext -- at this point the question is,
> what modules are in the SchemaContext -- you'll need to debug that on
> your side...
More specifically, it seems your class is not thread-safe w.r.t.
SchemaContext updates.
Bye,
Robert
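To make Robert's point concrete, here is a minimal sketch of one way to keep such a class consistent across SchemaContext updates: derive everything from a given SchemaContext into one immutable holder and publish it atomically, so an in-flight serialize() call never mixes old and new schema. This is illustrative only, not the actual OVSDB serializer; the only real APIs assumed are yangtools' SchemaContext/Module and the onGlobalContextUpdated callback mentioned in the comment, everything else (class and method names) is hypothetical.

import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;
import org.opendaylight.yangtools.yang.model.api.Module;
import org.opendaylight.yangtools.yang.model.api.SchemaContext;

// Illustrative only -- not the real OVSDB class.
public final class SchemaAwareSerializer {

    // Everything derived from one SchemaContext lives together, immutably.
    private static final class State {
        final SchemaContext schemaContext;
        // codecs/registries built from schemaContext would also live here

        State(final SchemaContext schemaContext) {
            this.schemaContext = schemaContext;
        }
    }

    private final AtomicReference<State> state = new AtomicReference<>();

    // Wire this to the global SchemaContext update notification
    // (SchemaContextListener#onGlobalContextUpdated in that era's yangtools).
    public void onGlobalContextUpdated(final SchemaContext newContext) {
        // Single atomic publication: readers see either the old or the new
        // state, never a half-updated mixture.
        state.set(new State(newContext));

        // Robert's debugging hint: log which modules are actually present.
        final Set<Module> modules = newContext.getModules();
        modules.forEach(m -> System.out.println("SchemaContext module: " + m.getName()));
    }

    public Object serialize(final Object input) {
        // Read the reference once and use that snapshot for the whole call.
        final State snapshot = state.get();
        if (snapshot == null) {
            throw new IllegalStateException("No SchemaContext available yet");
        }
        // ... perform serialization against snapshot.schemaContext only ...
        return input;
    }
}

The key point is that serialize() never re-reads the schema-derived state mid-call and the state is never mutated in place, so a concurrent SchemaContext update cannot leave a caller looking at a partially rebuilt codec.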
I think ovsdb is hitting https://jira.opendaylight.org/browse/CONTROLLER-1751. This has been seen in OFP for quite some time, but I do not think it is an infra problem (at least not the only reason) because it also happened with the old cloud provider.
You can easily identify the issue in the robot log with this phrase: "Keyword 'Check_Cluster_Is_In_Sync' failed after retrying for 5 minutes. The last error was: Index 2 has incorrect status: False"
Which basically means the member does not join the cluster after it is started.
I also opened a ticket with Akka for this, but so far I have not received much feedback.
BR/Luis
On Mar 9, 2018, at 4:50 AM, Tom Pantelis <tompantelis@...> wrote:
On Fri, Mar 9, 2018 at 7:30 AM, Sam Hague <shague@...> wrote:
On Fri, Mar 9, 2018 at 7:20 AM, Sam Hague <shague@...> wrote:
On Fri, Mar 9, 2018 at 7:16 AM, Tom Pantelis <tompantelis@...> wrote:
On Fri, Mar 9, 2018 at 6:57 AM, Sam Hague <shague@...> wrote:
On Fri, Mar 9, 2018 at 12:05 AM, Vishal Thapar <vishal.thapar@...> wrote: Thank you Tom, next time round of beer on me (:
Jamo,
This fixes the QoS regression that we had. The 3-node cluster is a different issue. Do you want to track it as a separate bug or along with this one?
For the 3-node one, I am seeing lots of shard exceptions on ODL1:
2018-03-08T19:52:09,785 | WARN | ForkJoinPool.commonPool-worker-2 | AbstractShardBackendResolver | 228 - org.opendaylight.controller.sal-distributed-datastore - 1.8.0.SNAPSHOT | Failed to resolve shard
java.util.concurrent.TimeoutException: Shard has no current leader
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.wrap(AbstractShardBackendResolver.java:129) ~[228:org.opendaylight.controller.sal-distributed-datastore:1.8.0.SNAPSHOT]
at org.opendaylight.controller.cluster.databroker.actors.dds.AbstractShardBackendResolver.lambda$resolveBackendInfo$1(AbstractShardBackendResolver.java:112) ~[228:org.opendaylight.controller.sal-distributed-datastore:1.8.0.SNAPSHOT]
Infra issue? Are 3 node cluster tests passing in other modules [OFplugin? Netvirt?]
Yes, it is a different bug. The problem is that it popped up recently and only impacts the clustering suite, since that is where cluster operations are being done.
Nothing's changed in the clustering code recently, or in fact for the entire release. "Shard has no current leader" indicates the nodes didn't connect. But OVSDB code has changed. Maybe it was just cleanup stuff, but maybe it is somehow impacting clustering. Maybe not related, but a few patches went in on OVSDB just as the tests started failing. Looks like both the oxygen and fluorine jobs are busted. Maybe something in the infra changed, or an mdsal/controller type patch has changed and been merged in oxygen and fluorine?
There have been no changes in clustering for quite some time. The error indicates the nodes aren't connecting, which would point to infra or something in the test setup. One thing, though: the stack trace indicates the test is using the newer tell-based protocol, which is disabled by default. If so, then the test setup is enabling it in the cfg file - was that recent? Still, there shouldn't be an issue with that.
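For reference, the test setup would typically enable that by dropping a fragment into the datastore cfg on each member. If memory serves it is the key below in etc/org.opendaylight.controller.cluster.datastore.cfg, but treat the exact file name and key as an assumption to verify against the controller configuration docs:

# assumed fragment of etc/org.opendaylight.controller.cluster.datastore.cfg
# enables the newer tell-based frontend protocol (off by default)
use-tell-based-protocol=true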
On Mar 9, 2018, at 10:14 AM, Jamo Luhrsen <jluhrsen@...> wrote:
Again, I was wondering if these 3-node OVSDB failures were just a blip and we happened to hit it yesterday when everyone's radar was on high alert. It seems like it.
https://jenkins.opendaylight.org/releng/user/jluhrsen/my-views/view/ovsdb%20csit/job/ovsdb-csit-3node-upstream-clustering-only-oxygen/194/robot/
Still, the failures are real and we have reported this from time to time for a while now. As Luis said, even back in RackSpace.
Luis has a bug, I have a bug:
https://jira.opendaylight.org/browse/CONTROLLER-1768
I'm not sure how, when, or by whom these things can be addressed, but I don't think we've ever assumed clustering CSIT jobs were 100% stable.
+1, this is hard to admit but we cannot claim a stable cluster solution after 8 releases of ODL.
JamO