[genius-dev] [integration-dev] Genius CSIT intermittent 3 node failures due to OVSDB reconnect and connect issue


Vishal Thapar <vthapar@...>
 

Jira: https://jira.opendaylight.org/browse/OVSDB-438

Link to fix in HWVTEP code is in there too.

Regards,
Vishal.

On Sat, Apr 28, 2018 at 2:19 AM, Jamo Luhrsen <jluhrsen@...> wrote:
re-sending with new email for Vishal


On 4/27/18 1:46 PM, Jamo Luhrsen wrote:



On 4/27/18 11:39 AM, Faseela K wrote:

Sam was mentioning in last genius weekly meeting that there is a JIRA
already for this, and Suneelu is working on it.

@Vishal : Could you please share the JIRA?

We are hitting the issue intermittently, and when we try to debug with
ovsdb TRACE logs, it never happens.

Was this tested multiple times with TRACE logs enabled without ever
hitting the issue? If so, that leads me to believe some race condition
is timed so tightly that the little bit of slowdown we get from extra
logging is enough to avoid it. Fun :)

JamO

Thanks,

Faseela

*From:*Vishal Thapar
*Sent:* Monday, March 26, 2018 2:22 PM
*To:* B Sathwik <b.sathwik@...>; Tomáš Markovič
<tomas.markovic@...>
*Cc:* Sam Hague <shague@...>; genius-dev@...;
Faseela K <faseela.k@...>; ovsdb-dev@...;
integration-dev@...; K.V Suneelu Verma
<k.v.suneelu.verma@...>
*Subject:* RE: [integration-dev] Genius CSIT intermittent 3 node failures
due to OVSDB reconnect and connect issue

Thanks Tomas, I missed the testplan part as I was facing the exact same
issue in my patch test job and wrongly assumed the cause was the same.
After Sathwik’s change, it is indeed the same infra issue.


https://jenkins.opendaylight.org/releng/job/genius-csit-1node-gate-all-fluorine/19/console

Regards,

Vishal.

*From:*B Sathwik
*Sent:* 26 March 2018 14:19
*To:* Tomáš Markovič <tomas.markovic@...
<mailto:tomas.markovic@...>>
*Cc:* Sam Hague <shague@... <mailto:shague@...>>;
genius-dev@...
<mailto:genius-dev@...>; Vishal Thapar
<vishal.thapar@... <mailto:vishal.thapar@...>>; Faseela K
<faseela.k@... <mailto:faseela.k@...>>;
ovsdb-dev@... <mailto:ovsdb-dev@...>;
integration-dev@...
<mailto:integration-dev@...>; K.V Suneelu Verma
<k.v.suneelu.verma@... <mailto:k.v.suneelu.verma@...>>
*Subject:* RE: [integration-dev] Genius CSIT intermittent 3 node failures
due to OVSDB reconnect and connect issue

Hi,

Changed the test plan accordingly and rebuilt it.

Facing the following error. It’s an infra issue:

2: Waiting for 15 minutes to create
sandbox-genius-csit-3node-sathwikgate-all-fluorine-3.

1: CREATE_FAILED

ERROR: Failed to initialize infrastructure. Reason: Resource CREATE
failed: OverLimit: resources.vm_1_group.resources[1].resources.volume:
VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds
allowed gigabytes quota. Requested 40G, quota is 8192G and 8160G has been
consumed. (HTTP 413) (Request-ID: req-ebe75897-6320-49ea-b052-6d139ff869d1)
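
The numbers in the error above already explain the failure: only 32G of the 8192G quota is left, which is less than the 40G requested. A minimal Python sketch of that check, using only the figures from the log; the function name is hypothetical and this is not how OpenStack itself implements the quota check:

```python
def volume_fits(requested_gb: int, quota_gb: int, consumed_gb: int) -> bool:
    """Return True if the requested volume fits in the remaining quota."""
    remaining_gb = quota_gb - consumed_gb
    return requested_gb <= remaining_gb


# Figures taken from the CREATE_FAILED message above.
remaining = 8192 - 8160
print(remaining)                    # 32
print(volume_fits(40, 8192, 8160))  # False
```

So the job can only start once other volumes in the shared quota are released.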

Regards

Sathwik

*From:*Tomáš Markovič [mailto:tomas.markovic@...]
*Sent:* Monday, March 26, 2018 1:45 PM
*To:* B Sathwik <b.sathwik@... <mailto:b.sathwik@...>>
*Cc:* Sam Hague <shague@... <mailto:shague@...>>;
genius-dev@...
<mailto:genius-dev@...>; Vishal Thapar
<vishal.thapar@... <mailto:vishal.thapar@...>>; Faseela K
<faseela.k@... <mailto:faseela.k@...>>;
ovsdb-dev@... <mailto:ovsdb-dev@...>;
integration-dev@...
<mailto:integration-dev@...>; K.V Suneelu Verma
<k.v.suneelu.verma@... <mailto:k.v.suneelu.verma@...>>
*Subject:* Re: [integration-dev] Genius CSIT intermittent 3 node failures
due to OVSDB reconnect and connect issue

Also, from

*07:48:19* [ ERROR ] Expected at least 1 argument, got 0.

you can see that you are using the wrong testplan:

genius-sathwikgate-fluorine.txt / genius-sathwikgate.txt

These files do not exist, so change them to the testplan you want.

Regards,

Tomas Markovic

On Mon, Mar 26, 2018 at 9:20 AM, B Sathwik <b.sathwik@...
<mailto:b.sathwik@...>> wrote:

Hi,

Started sandbox job with ovsdb TRACE logs for 3node genius CSIT.


https://jenkins.opendaylight.org/sandbox/job/genius-csit-3node-sathwikgate-all-fluorine/

Regards

Sathwik

*From:*Sam Hague [mailto:shague@...
<mailto:shague@...>]
*Sent:* Friday, March 23, 2018 7:41 PM
*To:* B Sathwik <b.sathwik@...
<mailto:b.sathwik@...>>
*Cc:* Vishal Thapar <vishal.thapar@...
<mailto:vishal.thapar@...>>; Faseela K
<faseela.k@... <mailto:faseela.k@...>>;
integration-dev@...
<mailto:integration-dev@...>;
ovsdb-dev@...
<mailto:ovsdb-dev@...>;
genius-dev@...
<mailto:genius-dev@...>; K.V Suneelu Verma
<k.v.suneelu.verma@...
<mailto:k.v.suneelu.verma@...>>
*Subject:* Re: [integration-dev] Genius CSIT intermittent 3 node
failures due to OVSDB reconnect and connect issue

On Mar 23, 2018 12:20 AM, "B Sathwik" <b.sathwik@...
<mailto:b.sathwik@...>> wrote:

Vishal,

Suneelu was asking for the ovsdb TRACE logs for the 3 node
CSIT runs.

I need to know how to enable the same while running 3 node
CSIT jobs in sandbox.

In sandbox, simply add your custom trace settings in the
CONTROLLERDEBUGMAP param, like ovsdb:TRACE. Read the comment
on that parameter. You can add multiple log settings.
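
As a rough illustration of what a value with multiple log settings might look like, here is a small Python sketch that splits such a string into module/level pairs. It assumes entries are whitespace-separated `module:LEVEL` pairs; that separator, the function name, and the `genius:DEBUG` entry are assumptions for illustration only, and the comment on the parameter remains the authority on the real format:

```python
VALID_LEVELS = {"TRACE", "DEBUG", "INFO", "WARN", "ERROR"}


def parse_debug_map(value: str) -> dict:
    """Split an assumed 'module:LEVEL module:LEVEL' string into a dict."""
    settings = {}
    for entry in value.split():
        module, sep, level = entry.partition(":")
        if not sep or not module or level.upper() not in VALID_LEVELS:
            raise ValueError(f"bad log setting: {entry!r}")
        settings[module] = level.upper()
    return settings


print(parse_debug_map("ovsdb:TRACE"))
# {'ovsdb': 'TRACE'}
print(parse_debug_map("ovsdb:TRACE genius:DEBUG"))  # second entry is hypothetical
# {'ovsdb': 'TRACE', 'genius': 'DEBUG'}
```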

Any pointers ?

Regards

Sathwik

*From:* Vishal Thapar
*Sent:* Thursday, March 22, 2018 5:39 PM
*To:* Faseela K <faseela.k@...
<mailto:faseela.k@...>>


*Cc:* integration-dev@...
<mailto:integration-dev@...>;
ovsdb-dev@...
<mailto:ovsdb-dev@...>; B Sathwik <b.sathwik@...
<mailto:b.sathwik@...>>;
genius-dev@... <mailto:genius-dev@...>

*Subject:* RE: Genius CSIT intermittent 3 node failures due to
OVSDB reconnect and connect issue

Hi Faseela,

I didn’t say that the issue will not occur if we enhance Genius 3 node
CSIT, only that Genius 3 node CSIT isn’t configured like an actual
cluster deployment.

Yes, there is an issue in OVSDB with disconnect/connect in rapid
succession [1], and that is the issue we’re hitting in Genius 3 node
CSIT. The issue is not with create/delete of the bridge but with
connect/disconnect on the OVSDB channel. Suneelu had fixed it for HWVTEP,
but there were some open questions for OVSDB and the discussions on the
fix were still open. It would be good to revive this discussion at DDF,
where Jamo and Anil would both be there. If we have an OVSDB 3 node CSIT
where we can reproduce this reliably, we can try 2-3 options and test
which one works.

The reason it is intermittent is that the issue depends on EOS (Entity
Ownership Service). If the switch connects to a node that isn’t the
leader for the OVSDB instance, you run into this issue. Also, there are
these exceptions; I’m not sure whether they are a cause or an effect of
the issue.

Caused by: java.lang.IllegalArgumentException: Metadata not available for modification NodeModification
[identifier=(urn:opendaylight:params:xml:ns:yang:ovsdb?revision=2015-01-05)manager-entry,
modificationType=TOUCH,
childModification={(urn:opendaylight:params:xml:ns:yang:ovsdb?revision=2015-01-05)manager-entry[{(urn:opendaylight:params:xml:ns:yang:ovsdb?revision=2015-01-05)target=tcp:10.30.170.65:6640}]=NodeModification
[identifier=(urn:opendaylight:params:xml:ns:yang:ovsdb?revision=2015-01-05)manager-entry[{(urn:opendaylight:params:xml:ns:yang:ovsdb?revision=2015-01-05)target=tcp:10.30.170.65:6640}],
modificationType=DELETE, childModification={}]}]
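
The ownership dependence described above can be pictured with a toy Python sketch. It encodes only the rule stated in this mail (trouble when the switch connects to a node that is not the owner); the node names and the function are hypothetical, and this is not a model of the OVSDB plugin’s actual code:

```python
NODES = ["odl1", "odl2", "odl3"]  # hypothetical names for the 3 cluster nodes


def hits_reconnect_issue(connected_node: str, owner_node: str) -> bool:
    """Toy rule from the thread: trouble only when the switch lands on a non-owner."""
    return connected_node != owner_node


owner = "odl1"  # assumed owner of the OVSDB instance for this illustration
outcomes = {node: hits_reconnect_issue(node, owner) for node in NODES}
print(outcomes)
# {'odl1': False, 'odl2': True, 'odl3': True}
```

Whether a given run fails therefore depends on where the switch happens to connect, which would be consistent with the failures being intermittent.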

Regards,

Vishal.

[1]
https://lists.opendaylight.org/pipermail/ovsdb-dev/2018-February/004567.html

*From:* Faseela K
*Sent:* 22 March 2018 15:19
*To:* Vishal Thapar <vishal.thapar@...
<mailto:vishal.thapar@...>>
*Cc:* integration-dev@...
<mailto:integration-dev@...>;
ovsdb-dev@...
<mailto:ovsdb-dev@...>; B Sathwik <b.sathwik@...
<mailto:b.sathwik@...>>;
genius-dev@... <mailto:genius-dev@...>
*Subject:* Genius CSIT intermittent 3 node failures due to OVSDB
reconnect and connect issue

Hi Vishal,

As we have already discussed, genius 3 node CSIT is randomly failing
because the bridge does not show up in the topology/operational DS
after a delete and create of the bridge.

You were indicating that the clustered CSIT of genius will need some
enhancements (add HAPROXY?) so that this issue will not occur.

Could you please give pointers to Sathwik, so that he can start looking
into it?

Also, even if we don’t use HAPROXY, and just delete and create a bridge,
why does the ovsdb plugin have an issue detecting it?


https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/genius-csit-verify-3node-upstream/220/robot-plugin/log.html.gz

Thanks,

Faseela


_______________________________________________
integration-dev mailing list
integration-dev@...
<mailto:integration-dev@...>
https://lists.opendaylight.org/mailman/listinfo/integration-dev


_______________________________________________
ovsdb-dev mailing list
ovsdb-dev@...
https://lists.opendaylight.org/mailman/listinfo/ovsdb-dev
_______________________________________________
genius-dev mailing list
genius-dev@...
https://lists.opendaylight.org/mailman/listinfo/genius-dev


Vishal Thapar <vthapar@...>
 

There is also OVSDB-439. Anil had a patch for it on carbon, and Suneelu
recently raised one on master.

https://git.opendaylight.org/gerrit/#/c/69061/
