global-jjb vs. packer vs. Jenkins jobs


Robert Varga
 

Hello everyone,

as the (still current) failure to start Jenkins jobs shows, our current
way of integrating with external dependencies (global-jjb) is beyond
fragile.

The way our jobs work is that:

1) we have a base image, created by builder-packer-* jobs on a regular
basis, which rolls up distro upgrades plus some other things (like
mininet, etc.) that we need

2) the Jenkins job launches on that base image and calls two scripts from
global-jjb, both of which end up installing more things:
a) python-tools-install.sh
b) lf-env.sh

3) the actual job runs

4) some more steps run, invoking lf-env.sh to set up another Python
environment.

Now, it is clear that everything in 1) is invariant and updated in a
controlled way.

The problem is with 2), where again, everything is supposed to be
invariant for a particular version of global-jjb -- yet we reinstall
these things on every single job run.

Not only is this subject to random breakage (like now, or when pip
repositories are unavailable), it also takes around 3 minutes of each
job execution. That does not sound like much, but it is a full 30%(!) of
the runtime of yangtools-release-merge (which takes around 10 minutes).

We obviously can and must do better: global-jjb's environment-impacting
scripts must all be executed during builder-packer, so that they become
proper invariants.

For that, global-jjb needs to grow two things:

1) a way to install *all* of its dependencies without doing anything
else, for use in packer jobs

2) compatibility checks on the environment to ensure it is up to date
enough to run a particular global-jjb version's scripts
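
To make item 2) a bit more concrete, here is a minimal sketch of what such a compatibility check could look like. It is only an illustration, not anything global-jjb provides today: the REQUIRED table, its package names and minimum versions, and the function names are all hypothetical, and it relies on importlib.metadata, so it assumes Python 3.8 or newer on the image.

```python
from importlib import metadata

# Hypothetical minimums a given global-jjb release might declare.
# The names and numbers are placeholders, not real global-jjb requirements.
REQUIRED = {"pip": (20, 0), "setuptools": (40, 0)}


def parse_version(text):
    """Turn '20.3.4' into (20, 3, 4); stops at the first non-numeric piece."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def check_environment(required, installed_version=metadata.version):
    """Return a list of problems; an empty list means the image is compatible."""
    problems = []
    for name, minimum in required.items():
        try:
            found = parse_version(installed_version(name))
        except metadata.PackageNotFoundError:
            problems.append(f"{name}: not installed")
            continue
        # Pad with zeros so that '20' compares equal to (20, 0).
        padded = found + (0,) * (len(minimum) - len(found))
        if padded < minimum:
            have = ".".join(map(str, found))
            need = ".".join(map(str, minimum))
            problems.append(f"{name}: found {have}, need >= {need}")
    return problems
```

A job could call check_environment(REQUIRED) as its very first step and fail fast with a clear message if the list is non-empty, instead of installing anything at run time.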

With that, our jobs should be both faster and more reliable.

Does anybody see a reason why this would not work?

If not, I will be filing LFIT issues to get this done.

Regards,
Robert


Anil Belur
 

Greetings Robert:

lf-env.sh creates a virtual env and sets up the environment, while python-tools-install.sh installs the Python tools/utils during job runtime. Since releng/global-jjb is a repo of generic JJB templates (which can be used by any of the CI management repositories), it's up to the $project/$job to install the dependencies required for running the job.

We have discussed this in the past: installing PyPI dependencies at packer image build time comes with its own set of problems and added costs:
1. It requires maintaining a large number of packer images (if the project needs to support multiple versions of Python/PyPI deps).
2. Not all releng/global-jjb (template) scripts require all of the PyPI dependencies to be installed; they are tied to the $job or $project. Binding them all into the same env risks the deps being broken more frequently.
3. PyPI libs/modules are updated more frequently.

Thanks,
Anil


On Mon, Jan 25, 2021 at 7:44 PM Robert Varga <nite@...> wrote:
[snip]





Robert Varga
 

On 03/02/2021 00:03, Anil Belur wrote:
> Greetings Robert:

Hello Anil,

> lf-env.sh: Creates a virtual env and sets up the environment, while the
> python-tools-install.sh Installs the python tools/utils during Job
> runtime. Since releng/global-jjb is a repo of Generic JJB templates (can
> be used by any of the CI management repositories), its up to the
> $project/$job to install the dependencies required for running the job.

Understood. At the end of the day, though, we have only a few classes of
jobs and there is a ton of commonalities between them.

> We have discussed this in the past, installing PyPI dependencies during
> packer image build time, comes with its own set of problems and added costs:
> 1. This requires maintaining a large number of packer images (if the
> project needs to support multiple versions of python/PyPI deps).

I do not believe this is the case for OpenDaylight jobs. For example
each and every job I looked at performs two things:
- python-tools-install.sh (70 seconds)
- job-cost.sh (39 seconds)

> 2. All releng/global-jjb (templates) scripts do not require all of the
> PyPi dependencies to be installed and are tied down to the $job or
> $project, since this approach binding them all into the same env has a
> risk of the deps being broken more frequently.
> 3. PyPi libs/modules are updated more frequently.

While that is true, this line of reasoning completely ignores the
failure mode and recovery.

As it stands, we can be hit by any of:
- busted global-jjb
- PyPi package updates
- PyPi repository unavailability

As we have seen in these past weeks, any such failure immediately
propagates to all jobs and breaks them -- resulting in nothing working
anymore, with no real avenue for recovery without help of LF IT.

We actually went through exactly this discussion when we had Sigul
failures -- and Sigul is now part of base images.

It is deemed sufficient to update our cloud images once a month -- and
that includes all sorts of security fixes and similar. As a community we
are free to decide when to spin new images and can do that completely
without LF IT intervention.

I am sorry, but I fail to see how Python packages are special enough to justify:
- breakages occurring at completely random times
- 2-5 minutes of infra install added to *each and every job* we run[*]

I am sorry to say that the world has changed in the past 5 years and we
no longer have the attention of LF IT staff that made resolution of
these failures a matter of hours -- it really is multiple days. That
fact alone makes a huge difference when weighing pros and cons.

Regards,
Robert

[*]
Just take a good look at what
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/aaa-maven-verify-master-mvn35-openjdk11/3/console-timestamp.log.gz
did:

Total job runtime: 9m56s
Useful build time: 7m16s
Setup/teardown time: 2m40s

That's **27%** of the time spent on infra, amounting to **37%** overhead.
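
For anyone who wants to re-check those two percentages, the arithmetic can be sketched as below. The inputs are the three timings quoted above; the helper name to_seconds is just for illustration.

```python
def to_seconds(minutes, seconds):
    """Convert a 'XmYs' duration into plain seconds."""
    return minutes * 60 + seconds


total = to_seconds(9, 56)    # total job runtime
useful = to_seconds(7, 16)   # useful build time
setup = total - useful       # setup/teardown: 160 s, i.e. 2m40s

share_of_total = setup / total   # fraction of the whole run spent on infra
overhead = setup / useful        # infra time relative to useful work

print(f"infra share of runtime: {share_of_total:.0%}")
print(f"overhead on top of useful time: {overhead:.0%}")
```

The first figure divides by the total runtime, the second by the useful build time only, which is why the same 2m40s shows up as two different percentages.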






Robert Varga
 

On 05/02/2021 10:16, Robert Varga wrote:
> On 03/02/2021 00:03, Anil Belur wrote:
>> Greetings Robert:
>
> Hello Anil,

Hello again,

sorry for the self-reply, but as it happens ...

[snip]

>> 2. All releng/global-jjb (templates) scripts do not require all of the
>> PyPi dependencies to be installed and are tied down to the $job or
>> $project, since this approach binding them all into the same env has a
>> risk of the deps being broken more frequently.
>> 3. PyPi libs/modules are updated more frequently.
>
> While that is true, this line of reasoning completely ignores the
> failure mode and recovery.
>
> As it stands any of:
> - busted global-jjb
> - PyPi package updates
> - PyPi repository unavailability
>
> As we have seen in these past weeks, any such failure immediately
> propagates to all jobs and breaks them -- resulting in nothing working
> anymore, with no real avenue for recovery without help of LF IT.

... we just got hit by this.

[snip]


> I am sorry, but I fail to see how Python packages special enough to inflict:
> - breakages occurring at completely random times

A case in point:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/yangtools-maven-verify-master-mvn35-openjdk11/3761/console.log.gz
just failed with:

> [yangtools-maven-verify-master-mvn35-openjdk11] $ /bin/bash /tmp/jenkins2774821607907773560.sh
> ---> python-tools-install.sh
> Generating Requirements File
>   ERROR: Command errored out with exit status 1:
>    command: /usr/bin/python3 /home/jenkins/.local/lib/python3.6/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /tmp/tmpyskb3tqv
>        cwd: /tmp/pip-install-ibprnn2v/cryptography_c4b114d9e7a14d488215cb74c3a04315
>   Complete output (144 lines):

[...]

>   writing manifest file 'src/cryptography.egg-info/SOURCES.txt'
>   running build_ext
>   generating cffi module 'build/temp.linux-x86_64-3.6/_padding.c'
>   creating build/temp.linux-x86_64-3.6
>   generating cffi module 'build/temp.linux-x86_64-3.6/_openssl.c'
>   running build_rust
>
>       =============================DEBUG ASSISTANCE=============================
>       If you are seeing a compilation error please try the following steps to
>       successfully install cryptography:
>       1) Upgrade to the latest pip and try again. This will fix errors for most
>          users. See: https://pip.pypa.io/en/stable/installing/#upgrading-pip
>       2) Read https://cryptography.io/en/latest/installation.html for specific
>          instructions for your platform.
>       3) Check our frequently asked questions for more information:
>          https://cryptography.io/en/latest/faq.html
>       4) Ensure you have a recent Rust toolchain installed.
>       =============================DEBUG ASSISTANCE=============================
>
>   error: Can not find Rust compiler
>   ----------------------------------------
>   ERROR: Failed building wheel for cryptography
> ERROR: Could not build wheels for cryptography which use PEP 517 and cannot be installed directly
> Build step 'Execute shell' marked build as failure

python-tools-install.sh comes from global-jjb. This means jobs are
currently broken because global-jjb stopped working.

I have filed
https://jira.linuxfoundation.org/plugins/servlet/theme/portal/2/IT-21509
and we are blocked on it. Let's see what sort of KPIs that issue will have.

Bye,
Robert


Anil Belur
 

Greetings Robert:

We'll need to pin the cryptography module to < 3.4, since the Rust dependencies are broken in the upstream pyca repo.

This issue should be addressed once this is merged. 
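
In pip requirements syntax such a pin is written as cryptography<3.4. As a rough illustration of which releases that bound admits, here is a small sketch; the helper names are made up for this example, and it assumes plain dotted numeric versions (no pre-release suffixes), which is simpler than pip's real version handling.

```python
def release_tuple(version):
    """'3.3.2' -> (3, 3, 2); assumes plain dotted numeric versions only."""
    return tuple(int(part) for part in version.split("."))


def honors_pin(version, upper_bound=(3, 4)):
    """True if the version is strictly below the upper bound (here: < 3.4)."""
    return release_tuple(version) < upper_bound


for candidate in ("3.3.2", "3.4", "3.4.1"):
    print(candidate, honors_pin(candidate))
```

Only the 3.3.x release passes; 3.4 and anything later is rejected by the pin.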

Cheers,
Anil

On Mon, Feb 8, 2021 at 6:58 AM Robert Varga <nite@...> wrote:
[snip]