[kernel-dev] Regression in Mg SR1


Robert Varga
 

On 01/05/2020 23:36, JamO Luhrsen wrote:

On 5/1/20 1:01 PM, Luis Gomez wrote:
FYI, this change was merged on April 27th:

https://git.opendaylight.org/gerrit/c/controller/+/87586
here's the revert:
https://git.opendaylight.org/gerrit/c/controller/+/89554
Merged.


It produced regressions in all of these suites:

https://jenkins.opendaylight.org/releng/job/controller-csit-3node-clustering-tell-all-magnesium/
https://jenkins.opendaylight.org/releng/job/controller-csit-3node-clustering-ask-all-magnesium/
https://jenkins.opendaylight.org/releng/job/controller-csit-1node-akka1-all-magnesium/
This one seems easiest to try to debug. The gist of the problem is this:

- bring up older controller version and do some configs
- copy snapshots/ and *journal/ folders off to new controller version
- start new controller version
- notice that the data/config is not there (404 on cars:cars)
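
The copy step above can be sketched roughly as follows. This is only an illustration of the procedure, not what the CSIT suite actually runs: the function name copy_persistence and both directory arguments are hypothetical, and it assumes snapshots/ and the *journal/ folders sit directly under the directory you pass in.

```python
import shutil
from pathlib import Path
from typing import List


def copy_persistence(old_root: Path, new_root: Path) -> List[Path]:
    """Copy snapshots/ and every *journal/ directory from old_root to new_root.

    Both roots are assumptions: pass whatever directory holds the
    persistence folders in your installation.
    """
    copied = []
    # snapshots/ plus anything matching *journal (e.g. journal, segmented-journal)
    candidates = [old_root / "snapshots", *sorted(old_root.glob("*journal"))]
    for src in candidates:
        if not src.is_dir():
            continue  # skip folders the old controller never created
        dst = new_root / src.name
        # dirs_exist_ok needs Python 3.8+; existing files are overwritten
        shutil.copytree(src, dst, dirs_exist_ok=True)
        copied.append(dst)
    return copied
```

The new controller should be stopped while the folders are copied and only started afterwards, as in the steps listed above.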

That's all I have though, by looking at the robot logs. Looking at the
karaf log, it's weirdly silent after the new controller boots up like
normal. All that's there are the two log statements we write to it from
robot:

2020-05-01T01:47:57,330 | INFO | pipe-log:log "ROBOT MESSAGE: Starting test controller-akka1.txt.Verify_Data_Is_Restored" | core | 123 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Starting test controller-akka1.txt.Verify_Data_Is_Restored
2020-05-01T01:51:01,859 | INFO | pipe-log:log "ROBOT MESSAGE: Starting test controller-akka1.txt.Archive_Older_Karaf_Log" | core | 123 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Starting test controller-akka1.txt.Archive_Older_Karaf_Log

I looked at the BGPCEP job and this part is quite weird:

2020-05-04T16:10:03,320 | ERROR | opendaylight-cluster-data-shard-dispatcher-39 | Shard | 298 - org.opendaylight.controller.sal-clustering-commons - 1.11.1 | member-1-shard-default-config: Log entry not found for index 0
2020-05-04T16:10:03,383 | ERROR | opendaylight-cluster-data-shard-dispatcher-39 | Shard | 298 - org.opendaylight.controller.sal-clustering-commons - 1.11.1 | member-1-shard-default-config: failed to apply payload org.opendaylight.controller.cluster.datastore.persisted.CommitTransactionPayload$Simple@3d3565da
Also, similar things are happening in the ask-all job:

2020-05-04T15:29:10,295 | INFO | opendaylight-cluster-data-shard-dispatcher-40 | Shard | 297 - org.opendaylight.controller.sal-clustering-commons - 1.11.1 | member-1-shard-prefix-configuration-shard-config (Follower): The log is not empty but the prevLogIndex 22 was not found in it - lastIndex: 21, snapshotIndex: -1, snapshotTerm: -1
2020-05-04T15:29:10,296 | INFO | opendaylight-cluster-data-shard-dispatcher-40 | Shard | 297 - org.opendaylight.controller.sal-clustering-commons - 1.11.1 | member-1-shard-prefix-configuration-shard-config (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=54, success=false, followerId=member-1-shard-prefix-configuration-shard-config, logLastIndex=21, logLastTerm=5, forceInstallSnapshot=false, needsLeaderAddress=false, payloadVersion=11, raftVersion=4, recipientRaftVersion=3]
So something is definitely off with the journal :-/
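
For anyone reading the Follower messages above: the negative reply is what the textbook Raft consistency check produces when the leader's prevLogIndex is not present in the follower's log. A minimal sketch of that check follows. It is not the controller's actual code; the function name, the dict-based log and the use of -1 for "no entry" are assumptions for illustration.

```python
from typing import Dict


def accepts_append_entries(
    log_terms: Dict[int, int],
    prev_log_index: int,
    prev_log_term: int,
    snapshot_index: int = -1,
    snapshot_term: int = -1,
) -> bool:
    """Textbook Raft check a follower applies to an AppendEntries request.

    log_terms maps each index in the follower's log to that entry's term.
    """
    if prev_log_index == -1:
        return True  # leader is replicating from the very first entry
    if prev_log_index == snapshot_index:
        # the previous entry was compacted into the snapshot
        return prev_log_term == snapshot_term
    term = log_terms.get(prev_log_index)
    # reject if the entry is missing or was written in a different term
    return term is not None and term == prev_log_term


# Follower state from the log above: lastIndex 21, no snapshot.
# The entry terms are hypothetical, only logLastTerm=5 is in the log.
follower_log = {index: 5 for index in range(22)}
print(accepts_append_entries(follower_log, prev_log_index=22, prev_log_term=5))
```

With lastIndex 21 and snapshotIndex -1, a prevLogIndex of 22 cannot be found, so the follower replies with success=false and the leader is expected to back off to an earlier index.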

The bgp jobs seem to be even more broken though. More ERRORs, etc. Not sure
if we need to look at those separately or not.
Probably yes, as they seem to indicate an inconsistency in cluster
singleton -- but that may be related (although I am not sure how).

Regards,
Robert


JamO Luhrsen
 


On 5/1/20 1:01 PM, Luis Gomez wrote:
FYI, this change was merged on April 27th:

here's the revert:
https://git.opendaylight.org/gerrit/c/controller/+/89554



It produced regressions in all of these suites:

This one seems easiest to try to debug. The gist of the problem is this:

- bring up older controller version and do some configs
- copy snapshots/ and *journal/ folders off to new controller version
- start new controller version
- notice that the data/config is not there (404 on cars:cars)

That's all I have though, by looking at the robot logs. Looking at the karaf log, it's
weirdly silent after the new controller boots up like normal. All that's there are
the two log statements we write to it from robot:

2020-05-01T01:47:57,330 | INFO | pipe-log:log "ROBOT MESSAGE: Starting test controller-akka1.txt.Verify_Data_Is_Restored" | core | 123 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Starting test controller-akka1.txt.Verify_Data_Is_Restored
2020-05-01T01:51:01,859 | INFO | pipe-log:log "ROBOT MESSAGE: Starting test controller-akka1.txt.Archive_Older_Karaf_Log" | core | 123 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Starting test controller-akka1.txt.Archive_Older_Karaf_Log
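
When karaf log records get pasted together like the robot messages above, a small helper can split them back apart and pull out the fields. This is a rough sketch that assumes the pipe-separated layout seen in these logs (timestamp, level, thread, logger, bundle, message); the names KarafRecord, split_records and parse_record are made up for illustration.

```python
import re
from typing import List, NamedTuple

TIMESTAMP = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}"
# Zero-width match in front of every "<timestamp> | " (needs Python 3.7+)
RECORD_START = re.compile(rf"(?={TIMESTAMP} \| )")


class KarafRecord(NamedTuple):
    timestamp: str
    level: str
    thread: str
    logger: str
    bundle: str
    message: str


def split_records(text: str) -> List[str]:
    """Split text into records, one per leading timestamp."""
    return [part.strip() for part in RECORD_START.split(text) if part.strip()]


def parse_record(record: str) -> KarafRecord:
    """Split one record into its six fields; the message may contain ' | '."""
    fields = [field.strip() for field in record.split(" | ", 5)]
    if len(fields) != 6:
        raise ValueError(f"unexpected karaf record layout: {record!r}")
    return KarafRecord(*fields)
```

A message that itself contains a timestamp followed by " | " would be split incorrectly, so treat this as a debugging aid rather than a real parser.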


The bgp jobs seem to be even more broken though. More ERRORs, etc. Not sure
if we need to look at those separately or not.


These suites deal to some degree with the snapshot folder, which might have changed after the mentioned patch.
Did snapshot change? I know journal did, but we addressed that here:
https://git.opendaylight.org/gerrit/c/integration/test/+/88658

I am not sure at this moment whether we should investigate the issues and repair the test (it can take a while) or just revert and try the next SR.

I would guess some folks might also have reservations about introducing this change in an SR.
Once the revert I created gives me a distribution, I'll run it through these four
jobs in the sandbox. If those all pass like expected, and we don't get any quick
fix on the CSIT side, it might make sense to merge the revert and get moving
on a new release candidate.
Thanks,
JamO

BR/Luis