Re: [kernel-dev] Regression in Mg SR1


Robert Varga
 

On 01/05/2020 23:36, JamO Luhrsen wrote:

On 5/1/20 1:01 PM, Luis Gomez wrote:
FYI this change was merged on April 27th:

https://git.opendaylight.org/gerrit/c/controller/+/87586
here's the revert:
https://git.opendaylight.org/gerrit/c/controller/+/89554
Merged.


It produced regressions in all of these suites:

https://jenkins.opendaylight.org/releng/job/controller-csit-3node-clustering-tell-all-magnesium/
https://jenkins.opendaylight.org/releng/job/controller-csit-3node-clustering-ask-all-magnesium/
https://jenkins.opendaylight.org/releng/job/controller-csit-1node-akka1-all-magnesium/
This one seems easiest to try to debug. The gist of the problem is this:

- bring up older controller version and do some configs
- copy snapshots/ and *journal/ folders off to new controller version
- start new controller version
- notice that the data/config is not there (404 on cars:cars)
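
The copy step above can be sketched roughly as follows. This is only an illustration, not what the CSIT suite actually runs: the function name and the root directories passed to it are hypothetical, and the only things taken from the steps are the directory names snapshots/ and *journal/.

```python
import fnmatch
import shutil
from pathlib import Path

# Directory name patterns taken from the steps above.
PERSISTENCE_PATTERNS = ("snapshots", "*journal")


def copy_persistence_dirs(old_root, new_root):
    """Copy snapshots/ and *journal/ from old_root into new_root.

    Both roots are hypothetical paths chosen by the caller. Returns the
    names of the directories that were copied, in sorted order.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    copied = []
    for entry in sorted(old_root.iterdir()):
        matches = any(fnmatch.fnmatch(entry.name, p) for p in PERSISTENCE_PATTERNS)
        if entry.is_dir() and matches:
            # dirs_exist_ok requires Python 3.8 or newer.
            shutil.copytree(entry, new_root / entry.name, dirs_exist_ok=True)
            copied.append(entry.name)
    return copied
```

After the copy, the new controller would be started and the cars:cars data queried; a 404 at that point is the symptom described above.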

That's all I have though, by looking at the robot logs. Looking at the
karaf log, it's weirdly silent after the new controller boots up like
normal. All that's there are the two log statements we write to it from
robot:

2020-05-01T01:47:57,330 | INFO | pipe-log:log "ROBOT MESSAGE: Starting test controller-akka1.txt.Verify_Data_Is_Restored" | core | 123 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Starting test controller-akka1.txt.Verify_Data_Is_Restored
2020-05-01T01:51:01,859 | INFO | pipe-log:log "ROBOT MESSAGE: Starting test controller-akka1.txt.Archive_Older_Karaf_Log" | core | 123 - org.apache.karaf.log.core - 4.2.6 | ROBOT MESSAGE: Starting test controller-akka1.txt.Archive_Older_Karaf_Log
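
One way to double-check that nothing but the robot markers was logged is to split each karaf line on its pipe delimiters and drop the ROBOT MESSAGE entries. This is a minimal sketch: the field names are my own labels for the six columns visible in the lines above, not an official karaf schema, and the helper names are hypothetical.

```python
def parse_karaf_line(line):
    """Split one pipe-delimited karaf log line into named fields.

    Returns None for lines that do not have the six expected columns
    (for example stack-trace continuation lines).
    """
    parts = [p.strip() for p in line.split(" | ", 5)]
    if len(parts) != 6:
        return None
    keys = ("timestamp", "level", "thread", "logger", "bundle", "message")
    return dict(zip(keys, parts))


def non_robot_entries(lines):
    """Return parsed entries whose message is not a ROBOT MESSAGE marker."""
    entries = (parse_karaf_line(line) for line in lines)
    return [e for e in entries
            if e is not None and not e["message"].startswith("ROBOT MESSAGE:")]
```

An empty result for the span between the two markers would match the "weirdly silent" observation.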
I looked at the BGPCEP job and this part is quite weird:

2020-05-04T16:10:03,320 | ERROR | opendaylight-cluster-data-shard-dispatcher-39 | Shard | 298 - org.opendaylight.controller.sal-clustering-commons - 1.11.1 | member-1-shard-default-config: Log entry not found for index 0
2020-05-04T16:10:03,383 | ERROR | opendaylight-cluster-data-shard-dispatcher-39 | Shard | 298 - org.opendaylight.controller.sal-clustering-commons - 1.11.1 | member-1-shard-default-config: failed to apply payload org.opendaylight.controller.cluster.datastore.persisted.CommitTransactionPayload$Simple@3d3565da
also, similar things are going down in the ask-all job:

2020-05-04T15:29:10,295 | INFO | opendaylight-cluster-data-shard-dispatcher-40 | Shard | 297 - org.opendaylight.controller.sal-clustering-commons - 1.11.1 | member-1-shard-prefix-configuration-shard-config (Follower): The log is not empty but the prevLogIndex 22 was not found in it - lastIndex: 21, snapshotIndex: -1, snapshotTerm: -1
2020-05-04T15:29:10,296 | INFO | opendaylight-cluster-data-shard-dispatcher-40 | Shard | 297 - org.opendaylight.controller.sal-clustering-commons - 1.11.1 | member-1-shard-prefix-configuration-shard-config (Follower): Follower is out-of-sync so sending negative reply: AppendEntriesReply [term=54, success=false, followerId=member-1-shard-prefix-configuration-shard-config, logLastIndex=21, logLastTerm=5, forceInstallSnapshot=false, needsLeaderAddress=false, payloadVersion=11, raftVersion=4, recipientRaftVersion=3]
so something is definitely off with the journal :-/
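
For reference, the follower-side check behind the "prevLogIndex 22 was not found" message can be sketched like this. It is a simplified, index-only version of the textbook Raft AppendEntries consistency check, not the controller's actual implementation: the function name is hypothetical and the term comparison that Raft also performs is left out.

```python
def follower_has_prev_entry(prev_log_index, last_index, snapshot_index):
    """Index-only sketch of the Raft AppendEntries consistency check.

    The follower can accept new entries only if it already holds the
    entry at prev_log_index, either in its log or as the last entry
    covered by its snapshot. Term matching is deliberately omitted.
    """
    if prev_log_index == snapshot_index:
        # The preceding entry is the snapshot boundary (or -1 when the
        # leader sends entries from the very start and no snapshot exists).
        return True
    return snapshot_index < prev_log_index <= last_index


# Values from the ask-all log above: prevLogIndex 22, lastIndex 21,
# snapshotIndex -1. The follower lacks entry 22, so it replies negatively.
out_of_sync = not follower_has_prev_entry(22, 21, -1)
```

With the logged values the check fails, which is consistent with the follower sending AppendEntriesReply with success=false.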

The bgp jobs seem to be even more broken though. More ERRORs, etc. Not sure
if we need to look at those separately or not.
Probably yes, as they seem to indicate an inconsistency in cluster
singleton -- but that may be related (although I am not sure how).

Regards,
Robert
