Re: [integration-dev] global-jjb vs. packer vs. Jenkins jobs


Robert Varga

On 03/02/2021 00:03, Anil Belur wrote:
> Greetings Robert:
Hello Anil,

> lf-env.sh: creates a virtual env and sets up the environment, while
> python-tools-install.sh installs the Python tools/utils during job
> runtime. Since releng/global-jjb is a repo of generic JJB templates
> (usable by any of the CI management repositories), it's up to the
> $project/$job to install the dependencies required for running the job.
Understood. At the end of the day, though, we have only a few classes of
jobs and there is a ton of commonality between them.

> We have discussed this in the past: installing PyPI dependencies at
> packer image build time comes with its own set of problems and added costs:
> 1. It requires maintaining a large number of packer images (if the
> project needs to support multiple versions of Python/PyPI deps).
I do not believe this is the case for OpenDaylight jobs. For example,
each and every job I looked at performs two things:
- python-tools-install.sh (70 seconds)
- job-cost.sh (39 seconds)

> 2. Not all releng/global-jjb template scripts require all of the PyPI
> dependencies to be installed; they are tied to the $job or $project, so
> binding them all into the same env risks the deps breaking more
> frequently.
> 3. PyPI libs/modules are updated more frequently.
While that is true, this line of reasoning completely ignores the
failure mode and recovery.

As it stands, any of:
- busted global-jjb
- PyPI package updates
- PyPI repository unavailability

immediately propagates to all jobs and breaks them, as we have seen in
these past weeks -- resulting in nothing working anymore, with no real
avenue for recovery without the help of LF IT.

We actually went through exactly this discussion when we had Sigul
failures -- and Sigul is now part of base images.

It is deemed sufficient to update our cloud images once a month -- and
that includes all sorts of security fixes and similar. As a community we
are free to decide when to spin new images, and we can do that completely
without LF IT intervention.

I am sorry, but I fail to see how Python packages are special enough to inflict:
- breakages occurring at completely random times
- 2-5 minutes of infra install on *each and every job* we run[*]

I am sorry to say that the world has changed in the past 5 years and we
no longer have the attention of LF IT staff that made resolution of
these failures a matter of hours -- it really is multiple days. That
fact alone makes a huge difference when weighing pros and cons.

Regards,
Robert

[*]
Just take a good look at what
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/aaa-maven-verify-master-mvn35-openjdk11/3/console-timestamp.log.gz
did:

Total job runtime: 9m56s
Useful build time: 7m16s
Setup/teardown time: 2m40s

That's **27%** of total runtime spent on infra, which amounts to **37%**
overhead on top of the useful build time.
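For clarity, the arithmetic behind those figures can be checked directly (a standalone sketch, not part of any job tooling):

```python
# Sanity-check the overhead percentages quoted above from the log timings.
def overhead(total_s: int, useful_s: int) -> tuple[float, float]:
    """Return (setup share of total runtime, overhead relative to useful time), in percent."""
    setup = total_s - useful_s
    return 100 * setup / total_s, 100 * setup / useful_s

total = 9 * 60 + 56    # 9m56s total job runtime
useful = 7 * 60 + 16   # 7m16s useful build time
share, over = overhead(total, useful)
print(f"{share:.0f}% of runtime spent on infra, {over:.0f}% overhead")
# -> 27% of runtime spent on infra, 37% overhead
```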


> Thanks,
> Anil


On Mon, Jan 25, 2021 at 7:44 PM Robert Varga <nite@...> wrote:

Hello everyone,

as the (still current) failure to start Jenkins jobs shows, our current
way of integrating with external dependencies (global-jjb) is beyond
fragile.

The way our jobs work is that:

1) we have a base image, created by builder-packer-* jobs on a regular
basis, which rolls up distro upgrades plus some other things (like
mininet, etc.) that we need

2) the Jenkins job launches on that base image and calls two scripts from
global-jjb, both of which end up installing more things:
   a) python-tools-install.sh
   b) lf-env.sh

3) the actual job runs

4) some more stuff runs, invoking lf-env.sh to set up another Python
environment.

Now, it is clear that everything in 1) is invariant and updated in a
controlled way.

The problem is with 2), where again, everything is supposed to be
invariant for a particular version of global-jjb -- yet we reinstall
these things on every single job run.

Not only is this subject to random breakage (like now, or when pip
repositories are unavailable), it also takes around 3 minutes of each
job's execution. That does not sound like much, but it is a full 30%(!)
of the runtime of yangtools-release-merge (which takes around 10 minutes).

We obviously can and must do better: global-jjb's environment-impacting
scripts must all be executed during builder-packer, so that they become
proper invariants.

For that, global-jjb needs to grow two things:

1) a way to install *all* of its dependencies without doing anything
else, for use in packer jobs

2) compatibility checks on the environment to ensure it is up-to-date
enough to run a particular global-jjb version's scripts
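Point 2) could be as simple as comparing a version stamp baked into the image against the version a script requires. To be clear, the stamp path, file format, and function names below are purely illustrative assumptions -- no such mechanism exists in global-jjb today:

```python
# Hypothetical compatibility check: the packer build writes a version
# stamp at image build time, and global-jjb scripts refuse to run on
# images that are too old. Stamp path and dotted-version format are
# assumptions, not an existing global-jjb convention.
from pathlib import Path

STAMP = Path("/etc/global-jjb-version")  # written by the packer build (hypothetical)

def parse_version(text: str) -> tuple[int, ...]:
    """Turn '0.62.1' into (0, 62, 1) for component-wise comparison."""
    return tuple(int(part) for part in text.strip().split("."))

def env_is_compatible(required: str, stamp: Path = STAMP) -> bool:
    """True if the baked-in tooling is at least the version this script needs."""
    if not stamp.exists():
        return False  # image predates the stamp mechanism entirely
    return parse_version(stamp.read_text()) >= parse_version(required)
```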

With that, our jobs should be both faster and more reliable.

Does anybody see a reason why this would not work?

If not, I will be filing LFIT issues to get this done.

Regards,
Robert


