[OpenDaylight Infrastructure] IT-22529 Vexxhost stack instantiation failed


Robert Varga
 

On 22/07/2021 17:14, Luis Gomez wrote:
Couple of comments regarding distribution test duration:
Sorry for the late reply.

- I saw the netconf scale test can take from 6 to 10 hours, so I filed
this patch to reduce that time; otherwise we can move it to the weekend
like the longevity tests:

https://git.opendaylight.org/gerrit/c/releng/builder/+/97011
We have https://jira.opendaylight.org/browse/INTTEST-125 tracking this
problem. I already have a prototype patch, but it needs a few fixes to
restconf and finishing up.

- AR Aluminium is still running once a day; should we set this to
weekly? AFAIR we did that for the 2nd stable branch in the past.

https://jenkins.opendaylight.org/releng/view/autorelease/job/autorelease-release-aluminium-mvn35-openjdk11/
Might make sense; then again, we are about 5 weeks from desupporting
Aluminium completely...


BR/Luis


On Jul 21, 2021, at 3:46 PM, Robert Varga <nite@...> wrote:

So this again is an interplay between CSIT, Vexxhost, OpenStack and us.

There were wide-scale Jenkins problems during the past 24 hours, for
example:

1.
https://jenkins.opendaylight.org/releng/job/integration-distribution-test-silicon/390/
spent 12 hours waiting in the queue. This job normally executes within
90 minutes.

2.
https://jenkins.opendaylight.org/releng/view/distribution/job/integration-distribution-test-aluminium/507/
has 2 new failures in three-node scenarios.

3.
https://jenkins.opendaylight.org/releng/view/distribution/job/distribution-merge-managed-aluminium/4928/
took 13 hours to build.
It usually completes in 20 minutes.

https://jenkins.opendaylight.org/releng/view/distribution/job/distribution-merge-managed-aluminium/4928/console
reveals this:

01:03:35 -----END_OF_BUILD-----

This is after the end of the build, i.e. post-build actions which the ODL
community has no way of affecting.

01:03:35 [JaCoCo plugin] Loading packages..
12:21:28 [JaCoCo plugin] Done.

i.e. it took 11 hours and 18 minutes...

12:21:28 [JaCoCo plugin] Loading packages..
12:26:24 [JaCoCo plugin] Done.

... then 5 minutes ...

12:26:24 [JaCoCo plugin] Loading packages..
13:43:35 [JaCoCo plugin] Done.

... and some 77 minutes ...

13:43:35 [JaCoCo plugin] Loading packages..
14:25:10 [JaCoCo plugin] Done.

... and some 42 minutes ...

14:25:10 [JaCoCo plugin] Loading packages..
14:25:20 [JaCoCo plugin] Done.
14:25:20 [PostBuildScript] - [ERROR] An error occured during post-build processing.
14:25:32 org.jenkinsci.plugins.postbuildscript.PostBuildScriptException: java.lang.InterruptedException

and this is me, taking that job out back. It probably would still be
running, holding back retriggers (of which 7 have since passed
successfully).
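
For reference, those gaps can be double-checked by diffing the console
timestamps directly. A minimal sketch in Python (the "console.log" file name
is just a placeholder for a locally saved copy of the console output above):

from datetime import datetime, timedelta
import re

# Matches lines such as "01:03:35 [JaCoCo plugin] Loading packages.."
STAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}) \[JaCoCo plugin\]")

def jacoco_gaps(lines):
    """Yield (line, elapsed) for consecutive JaCoCo plugin log lines."""
    previous = None
    for line in lines:
        match = STAMP.match(line)
        if not match:
            continue
        current = datetime.strptime(match.group(1), "%H:%M:%S")
        if previous is not None:
            gap = current - previous
            if gap < timedelta(0):          # the log rolled past midnight
                gap += timedelta(days=1)
            yield line.rstrip(), gap
        previous = current

if __name__ == "__main__":
    with open("console.log") as log:        # hypothetical local copy
        for line, gap in jacoco_gaps(log):
            print(gap, line)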

While all of this was going on, https://status.linuxfoundation.org/
showed OpenDaylight infra to be happy as a clam for the last 24 hours.

Something is seriously not adding up here and it is Not Fun[*].

Regards,
Robert

[*] All of this delayed Phosphorus CSIT results by ~10 hours, i.e. no
new data during morning CET.


On 21/07/2021 22:22, Project Services wrote:

Hello Robert Varga,

Eric Ball has added a comment on your request IT-22529
<https://jira.linuxfoundation.org/servicedesk/customer/portal/2/IT-22529?sda_source=notification-email>:

   The resource usage is currently low, so I was able to successfully
   re-run the ovsdb job. I'm re-running the openflowplugin job now.

   I'll also submit a request to Vexxhost to expand or remove the
   memory cap. We have a limit to the total number of instances active
   at once, which should keep our resource usage from hitting any of
   the usage caps. But it seems that the memory cap doesn't take into
   account the RAM used by stacks dynamically spun up by the jobs, so a
   large number of those jobs running concurrently can still have this
   issue.

You may reply to their comment here
<https://jira.linuxfoundation.org/servicedesk/customer/portal/2/IT-22529?sda_source=notification-email>
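
To make that accounting concrete, a purely illustrative back-of-the-envelope
in Python (the job count and flavor sizes below are made-up assumptions, not
actual Vexxhost quota values) shows why an instance-count limit does not bound
the memory consumed by dynamically created stacks:

# Illustrative only; these numbers are assumptions, not real quota values.
CONCURRENT_JOBS = 20   # jobs the instance-count limit allows to run at once
VMS_PER_STACK = 3      # e.g. a three-node cluster stack per CSIT job
RAM_PER_VM_GB = 16     # assumed flavor size for each stack VM

stack_ram_gb = CONCURRENT_JOBS * VMS_PER_STACK * RAM_PER_VM_GB
print(f"dynamically spun-up stacks alone need {stack_ram_gb} GiB of RAM")
# 20 * 3 * 16 = 960 GiB that the instance-count check never sees, so a
# fixed tenant-wide memory cap can still be exhausted when jobs overlap.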

------------------------------------------------------------------------

Hello Robert Varga,

System Administrator changed the status of your request IT-22529
<https://jira.linuxfoundation.org/servicedesk/customer/portal/2/IT-22529?sda_source=notification-email>
to Waiting for customer.






