[controller-dev] OFP cluster test with "tell-based"


Robert Varga
 

On 12/02/2019 19:44, Luis Gomez wrote:
Hi everybody,

FYI I have just tried OFP cluster test with "tell-based" protocol:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-only-neon/180/robot-plugin/log.html.gz

My observations:

1) Node/port down events do not clear links in the topology, which is why all the topology check tests fail.
I think this is related to the transactions not committing within 5 seconds,
hence masters are not created.

2) some WARNs are flooding the log:

2019-02-12T00:26:30,055 | WARN | opendaylight-cluster-data-shard-dispatcher-33 | FrontendClientMetadataBuilder | 223 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-1-shard-inventory-operational: Unknown history for aborted transaction member-1-datastore-operational-fe-0-txn-30-2, ignoring

2019-02-12T00:26:30,056 | WARN | opendaylight-cluster-data-shard-dispatcher-33 | FrontendClientMetadataBuilder | 223 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-1-shard-inventory-operational: Unknown history for aborted transaction member-2-datastore-operational-fe-0-txn-19-1, ignoring

2019-02-12T00:26:30,056 | WARN | opendaylight-cluster-data-shard-dispatcher-33 | FrontendClientMetadataBuilder | 223 - org.opendaylight.controller.sal-distributed-datastore - 1.9.0.SNAPSHOT | member-1-shard-inventory-operational: Unknown history for aborted transaction member-3-datastore-operational-fe-0-txn-7-1, ignoring
This is interesting, as it starts happening for the same transaction on
all shard members and these are standalone transactions, for which the
history should always be there.

Can you re-run the test with debug on
org.opendaylight.controller.cluster.datastore.FrontendClientMetadataBuilder,
please?

3) The cluster perf test does not pass: https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/openflowplugin-csit-3node-clustering-perf-bulkomatic-only-neon/180/robot-plugin/log.html.gz

I do not know whether we are still pursuing the switch of cluster protocols; at least after this test, it does not look like a straightforward change.
I'd like to be able to ditch the old one, but it seems we need to shake
some bugs out :(
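
For anyone triaging the flood in observation 2, the WARN lines can be split on their pipe separators to see which shard logged them and which transaction each one refers to. The sketch below is only an illustration in Python: the field layout is inferred from the three lines quoted above, and `parse_warn` and `WarnRecord` are hypothetical helper names, not part of any ODL tooling.

```python
import re
from typing import NamedTuple, Optional


class WarnRecord(NamedTuple):
    timestamp: str
    level: str
    shard: str
    transaction: str


# Message format taken from the WARN lines quoted in this thread.
_MESSAGE = re.compile(
    r"(?P<shard>\S+): Unknown history for aborted transaction (?P<txn>\S+), ignoring"
)


def parse_warn(line: str) -> Optional[WarnRecord]:
    """Return the parsed record, or None if the line has a different shape."""
    # Assumed layout: timestamp | level | thread | logger | bundle | message
    fields = [field.strip() for field in line.split("|")]
    if len(fields) != 6:
        return None
    match = _MESSAGE.fullmatch(fields[5])
    if match is None:
        return None
    return WarnRecord(fields[0], fields[1], match.group("shard"), match.group("txn"))


sample = (
    "2019-02-12T00:26:30,056 | WARN | opendaylight-cluster-data-shard-dispatcher-33 | "
    "FrontendClientMetadataBuilder | 223 - org.opendaylight.controller.sal-distributed-datastore - "
    "1.9.0.SNAPSHOT | member-1-shard-inventory-operational: Unknown history for aborted "
    "transaction member-2-datastore-operational-fe-0-txn-19-1, ignoring"
)
record = parse_warn(sample)
print(record.shard, record.transaction)
```

Collecting the `transaction` values per `shard` over a whole karaf.log would show whether the same transactions are reported on every member.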

Thanks,
Robert


Luis Gomez
 

On Feb 13, 2019, at 2:22 AM, Robert Varga <nite@...> wrote:

I think this is related to the transactions not committing within 5 seconds,
hence masters are not created.
Any workaround for this?


Can you re-run the test with debug on
org.opendaylight.controller.cluster.datastore.FrontendClientMetadataBuilder,
please?
Here it is: https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-neon/1




Robert Varga
 

On 19/02/2019 02:11, Luis Gomez wrote:


I think this is related to the transactions not committing within 5 seconds,
hence masters are not created.
Any workaround for this?
Not sure... if we have messed up accounting (below), we may end up
reporting things out of whack.



Can you re-run the test with debug on
org.opendaylight.controller.cluster.datastore.FrontendClientMetadataBuilder,
please?
Here it is: https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-neon/1
Thanks, this actually provides a lead: everything works with normal
transaction chains, yet breaks down with single transactions.

Since we have module-based shards in play and multi-shard commits, the
cookie inside LocalHistoryIdentifier becomes significant in lookup --
and the single history is hard-wired to not have a cookie.

https://git.opendaylight.org/gerrit/80392 addresses that.
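
To make the lookup problem concrete, here is a toy Python model of how a cookie can turn a known history into an "unknown" one. It is only a sketch under stated assumptions: `HistoryKey` and `HistoryTable` are hypothetical stand-ins (the real LocalHistoryIdentifier carries more than this), and the cookie value 7 is arbitrary. It does not describe what the Gerrit change does.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class HistoryKey:
    """Hypothetical, simplified stand-in for LocalHistoryIdentifier."""
    history_id: int
    cookie: int = 0  # 0 models "no cookie"


class HistoryTable:
    """Toy per-shard table of known histories, keyed on the full identifier."""

    def __init__(self) -> None:
        self._known: Dict[HistoryKey, str] = {}

    def register(self, key: HistoryKey, name: str) -> None:
        self._known[key] = name

    def lookup(self, key: HistoryKey) -> Optional[str]:
        # Equality covers the cookie, so a differing cookie is a miss.
        return self._known.get(key)


table = HistoryTable()
# The single history is registered without a cookie.
table.register(HistoryKey(history_id=0), "single history")

hit = table.lookup(HistoryKey(history_id=0))
# Same history id, but the request carries a cookie (arbitrary value).
miss = table.lookup(HistoryKey(history_id=0, cookie=7))

print(hit)   # single history
print(miss)  # None
```

In this model the second lookup is the analogue of the "Unknown history" WARN: the history exists, but the key used to find it does not match the key it was stored under.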

Regards,
Robert


Luis Gomez
 



On Feb 19, 2019, at 6:16 AM, Robert Varga <nite@...> wrote:

Thanks, this actually provides a lead: everything works with normal
transaction chains, yet breaks down with single transactions.

Since we have module-based shards in play and multi-shard commits, the
cookie inside LocalHistoryIdentifier becomes significant in lookup --
and the single history is hard-wired to not have a cookie.

https://git.opendaylight.org/gerrit/80392 does that.


It looks like the WARNs are addressed, and the only issue remaining is the topology update when node/links go down:




Robert Varga
 

On 19/02/2019 19:50, Luis Gomez wrote:


1) Node/port down events do not clear links in the topology, which is
why all the topology check tests fail.
I think this is related to the transactions not committing within 5 seconds,
hence masters are not created.
Any workaround for this?
Not sure... if we have messed up accounting (below), we may end up
reporting things out of whack.
Alright, we can ditch the builder debugs, as everything there works as
it is supposed to.

The reason for non-removal of the links is captured in
https://logs.opendaylight.org/sandbox/vex-yul-odl-jenkins-2/openflowplugin-csit-3node-clustering-only-sodium/1/odl_3/odl3_karaf.log.gz,
I think, and it is a VerifyException.

Filed to https://jira.opendaylight.org/browse/CONTROLLER-1885.

Regards,
Robert


Luis Gomez
 

FYI tell-based test with latest master:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-gate-clustering-only-sodium/9/

I still see topology link information is not cleared after node/link down in some random tests.

BR/Luis

On Feb 20, 2019, at 2:35 AM, Robert Varga <nite@...> wrote:

Alright, we can ditch the builder debugs, as everything there works as
it is supposed to.

The reason for non-removal of the links is captured in
https://logs.opendaylight.org/sandbox/vex-yul-odl-jenkins-2/openflowplugin-csit-3node-clustering-only-sodium/1/odl_3/odl3_karaf.log.gz,
I think, and it is a VerifyException.

Filed to https://jira.opendaylight.org/browse/CONTROLLER-1885.

Regards,
Robert


Robert Varga
 

On 26/02/2019 21:52, Luis Gomez wrote:
FYI tell-based test with latest master:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-gate-clustering-only-sodium/9/

I still see topology link information is not cleared after node/link down in some random tests.
There does not seem to be anything out of the ordinary from the CDS
perspective, but there are some OptimisticLockFailedException (OLFE)/validation errors.

Can someone from the OFP team take a look, please?

Thanks,
Robert


Luis Gomez
 

On Feb 26, 2019, at 1:01 PM, Robert Varga <nite@...> wrote:

On 26/02/2019 21:52, Luis Gomez wrote:
FYI tell-based test with latest master:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-gate-clustering-only-sodium/9/

I still see topology link information is not cleared after node/link down in some random tests.

There does not seem to be anything out of the ordinary from CDS
perspective, but there are some OLFE/validation errors.

Can someone from the OFP team take a look, please?

Thanks,
Robert