[controller-dev] many logs about log index out-of-sync


Tom Pantelis <tompantelis@...>
 



On Tue, Jan 8, 2019 at 5:40 PM Jamo Luhrsen <jluhrsen@...> wrote:
I am debugging a 3node (cluster) netvirt csit job and one of my controllers
has a bunch of the messages like this:

2019-01-03T03:04:38,335 | INFO  | opendaylight-cluster-data-shard-dispatcher-21 | Shard                            | 229
- org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): The log is not
empty but the prevLogIndex 19047 was not found in it - lastIndex: 17875, snapshotIndex: -1

2019-01-03T03:04:38,335 | INFO  | opendaylight-cluster-data-shard-dispatcher-21 | Shard                            | 229
- org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): Follower is
out-of-sync so sending negative reply: AppendEntriesReply [term=23, success=false,
followerId=member-3-shard-default-config, logLastIndex=17875, logLastTerm=4, forceInstallSnapshot=false,
payloadVersion=9, raftVersion=3]

log:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/168/odl_3/odl3_karaf.log.gz

The job has lots of csit failures, so I know something is broken somewhere.
I don't know if the above has anything to do with it or not. Maybe it's
even expected in this scenario.

Those messages can happen if a follower gets behind the leader, especially if it gets isolated and AppendEntries messages from the leader get backed up and then the follower gets a bunch of messages quickly when it reconnects to the leader. Looks like it eventually caught up to the leader although, from the output, the distro you're testing with is missing https://git.opendaylight.org/gerrit/#/c/78929/ which should speed up the sync process in that case and alleviate most of those messages.
 

Thanks,
JamO
_______________________________________________
controller-dev mailing list
controller-dev@...
https://lists.opendaylight.org/mailman/listinfo/controller-dev


JamO Luhrsen
 

On 1/8/19 3:52 PM, Tom Pantelis wrote:
On Tue, Jan 8, 2019 at 5:40 PM Jamo Luhrsen <jluhrsen@... <mailto:jluhrsen@...>> wrote:
I am debugging a 3node (cluster) netvirt csit job and one of my controllers
has a bunch of the messages like this:
2019-01-03T03:04:38,335 | INFO  | opendaylight-cluster-data-shard-dispatcher-21 | Shard                            |
229
- org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): The log is not
empty but the prevLogIndex 19047 was not found in it - lastIndex: 17875, snapshotIndex: -1
2019-01-03T03:04:38,335 | INFO  | opendaylight-cluster-data-shard-dispatcher-21 | Shard                            |
229
- org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): Follower is
out-of-sync so sending negative reply: AppendEntriesReply [term=23, success=false,
followerId=member-3-shard-default-config, logLastIndex=17875, logLastTerm=4, forceInstallSnapshot=false,
payloadVersion=9, raftVersion=3]
log:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/168/odl_3/odl3_karaf.log.gz
The job has lots of csit failures, so I know something is broken somewhere.
I don't know if the above has anything to do with it or not. Maybe it's
even expected in this scenario.
Those messages can happen if a follower gets behind the leader, especially if it gets isolated and AppendEntries messages from the leader get backed up and then the follower gets a bunch of messages quickly when it reconnects to the leader. Looks like it eventually caught up to the leader although, from the output, the distro you're testing with is missing https://git.opendaylight.org/gerrit/#/c/78929/ which should speed up the sync process in that case and alleviate most of those messages.
Thanks Tom, makes sense then as this is on one controller that came up last after
bouncing them all.

JamO

Thanks,
JamO
_______________________________________________
controller-dev mailing list
controller-dev@... <mailto:controller-dev@...>
https://lists.opendaylight.org/mailman/listinfo/controller-dev


Tom Pantelis <tompantelis@...>
 



On Tue, Jan 8, 2019 at 7:02 PM Jamo Luhrsen <jluhrsen@...> wrote:


On 1/8/19 3:52 PM, Tom Pantelis wrote:
>
>
> On Tue, Jan 8, 2019 at 5:40 PM Jamo Luhrsen <jluhrsen@... <mailto:jluhrsen@...>> wrote:
>
>     I am debugging a 3node (cluster) netvirt csit job and one of my controllers
>     has a bunch of the messages like this:
>
>     2019-01-03T03:04:38,335 | INFO  | opendaylight-cluster-data-shard-dispatcher-21 | Shard                            |
>     229
>     - org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): The log is not
>     empty but the prevLogIndex 19047 was not found in it - lastIndex: 17875, snapshotIndex: -1
>
>     2019-01-03T03:04:38,335 | INFO  | opendaylight-cluster-data-shard-dispatcher-21 | Shard                            |
>     229
>     - org.opendaylight.controller.sal-clustering-commons - 1.8.2 | member-3-shard-default-config (Follower): Follower is
>     out-of-sync so sending negative reply: AppendEntriesReply [term=23, success=false,
>     followerId=member-3-shard-default-config, logLastIndex=17875, logLastTerm=4, forceInstallSnapshot=false,
>     payloadVersion=9, raftVersion=3]
>
>     log:
>     https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-fluorine/168/odl_3/odl3_karaf.log.gz
>
>     The job has lots of csit failures, so I know something is broken somewhere.
>     I don't know if the above has anything to do with it or not. Maybe it's
>     even expected in this scenario.
>
>
> Those messages can happen if a follower gets behind the leader, especially if it gets isolated and AppendEntries
> messages from the leader get backed up and then the follower gets a bunch of messages quickly when it reconnects to the
> leader. Looks like it eventually caught up to the leader although, from the output, the distro you're testing with is
> missing https://git.opendaylight.org/gerrit/#/c/78929/ which should speed up the sync process in that case and alleviate
> most of those messages.

Thanks Tom, makes sense then as this is on one controller that came up last after
bouncing them all.

yeah it eventually synced but it went down an inefficient path that resulted in 1176 messages from the leader to sync it. Actually that patch I mentioned above was for a different case and wouldn't help in this case. I have an idea where we can optimize it to eliminate those extra messages and make it faster. 
 

JamO

>
>     Thanks,
>     JamO
>     _______________________________________________
>     controller-dev mailing list
>     controller-dev@... <mailto:controller-dev@...>
>     https://lists.opendaylight.org/mailman/listinfo/controller-dev
>


Robert Varga
 

On 09/01/2019 01:41, Tom Pantelis wrote:
> Those messages can happen if a follower gets behind the leader,
especially if it gets isolated and AppendEntries
> messages from the leader get backed up and then the follower gets
a bunch of messages quickly when it reconnects to the
> leader. Looks like it eventually caught up to the leader although,
from the output, the distro you're testing with is
> missing https://git.opendaylight.org/gerrit/#/c/78929/ which
should speed up the sync process in that case and alleviate
> most of those messages.

Thanks Tom, makes sense then as this is on one controller that came
up last after
bouncing them all.


yeah it eventually synced but it went down an inefficient path that
resulted in 1176 messages from the leader to sync it. Actually that
patch I mentioned above was for a different case and wouldn't help in
this case. I have an idea where we can optimize it to eliminate those
extra messages and make it faster. 
https://git.opendaylight.org/gerrit/79334 should help with that (sorry,
it got lost for a year).

Regards,
Robert


Tom Pantelis <tompantelis@...>
 



On Wed, Jan 9, 2019 at 4:17 AM Robert Varga <nite@...> wrote:
On 09/01/2019 01:41, Tom Pantelis wrote:
>     > Those messages can happen if a follower gets behind the leader,
>     especially if it gets isolated and AppendEntries
>     > messages from the leader get backed up and then the follower gets
>     a bunch of messages quickly when it reconnects to the
>     > leader. Looks like it eventually caught up to the leader although,
>     from the output, the distro you're testing with is
>     > missing https://git.opendaylight.org/gerrit/#/c/78929/ which
>     should speed up the sync process in that case and alleviate
>     > most of those messages.
>
>     Thanks Tom, makes sense then as this is on one controller that came
>     up last after
>     bouncing them all.
>
>
> yeah it eventually synced but it went down an inefficient path that
> resulted in 1176 messages from the leader to sync it. Actually that
> patch I mentioned above was for a different case and wouldn't help in
> this case. I have an idea where we can optimize it to eliminate those
> extra messages and make it faster. 

https://git.opendaylight.org/gerrit/79334 should help with that (sorry,
it got lost for a year).

yup - that's basically what I was going to do :)
 

Regards,
Robert