[bugs] [Bug 3154] New: Clustering: Journal recovery error on restart

bugzilla-daemon at bugs.opendaylight.org
Thu May 7 23:19:43 UTC 2015


https://bugs.opendaylight.org/show_bug.cgi?id=3154

            Bug ID: 3154
           Summary: Clustering: Journal recovery error on restart
           Product: controller
           Version: Helium
          Hardware: All
                OS: All
            Status: CONFIRMED
          Severity: normal
          Priority: High
         Component: mdsal
          Assignee: bugs at lists.opendaylight.org
          Reporter: tpanteli at brocade.com
        Issue Type: ---

The following error was seen after a controller restart (Helium SR2):

java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Metadata not available for modification [NodeModification [identifier=(com:brocade:neutron:odl?revision=2014-10-02)subnet, modificationType=SUBTREE_MODIFIED, childModification={(com:brocade:neutron:odl?revision=2014-10-02)subnet[{(com:brocade:neutron:odl?revision=2014-10-02)id=ace19864-b874-47a9-9cef-b02afd52f37b}]=NodeModification [identifier=(com:brocade:neutron:odl?revision=2014-10-02)subnet[{(com:brocade:neutron:odl?revision=2014-10-02)id=ace19864-b874-47a9-9cef-b02afd52f37b}], modificationType=DELETE, childModification={}]}]]
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)[:1.7.0_76]
        at java.util.concurrent.FutureTask.get(FutureTask.java:188)[:1.7.0_76]
        at org.opendaylight.controller.cluster.datastore.Shard.syncCommitTransaction(Shard.java:586)[301:org.opendaylight.controller.sal-distributed-datastore:1.1.2.Helium-SR2]
        at org.opendaylight.controller.cluster.datastore.Shard.onRecoveryComplete(Shard.java:729)[301:org.opendaylight.controller.sal-distributed-datastore:1.1.2.Helium-SR2]
        at org.opendaylight.controller.cluster.raft.RaftActor.onRecoveryCompletedMessage(RaftActor.java:257)[294:org.opendaylight.controller.sal-akka-raft:1.1.2.Helium-SR2]
        at org.opendaylight.controller.cluster.raft.RaftActor.handleRecover(RaftActor.java:160)[294:org.opendaylight.controller.sal-akka-raft:1.1.2.Helium-SR2]
        at org.opendaylight.controller.cluster.common.actor.AbstractUntypedPersistentActor.onReceiveRecover(AbstractUntypedPersistentActor.java:52)[293:org.opendaylight.controller.sal-clustering-commons:1.1.2.Helium-SR2]
        at org.opendaylight.controller.cluster.datastore.Shard.onReceiveRecover(Shard.java:237)[301:org.opendaylight.controller.sal-distributed-datastore:1.1.2.Helium-SR2]

The modification is a node delete, and "Metadata not available ..." appears to
indicate that the node doesn't exist. If that's true, how did this modification
entry get into the persisted journal? A transaction's modifications should only
be written to the journal if the transaction succeeds.

The consequence of this failure is that the rest of the data failed to recover
as well. This is because we batch journal entries 5000 at a time into a single
transaction. Batching performs better, but the side effect is that one failed
modification fails the whole transaction and every entry batched into it.
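
To illustrate the batching trade-off described above, here is a minimal Java
sketch. It is not the Shard code: BatchedRecovery, Entry, Store and recover are
hypothetical names invented for this example, and the toy Store only imitates
an all-or-nothing transaction commit. The sketch assumes one possible
mitigation: replay each batch in a single commit, and if that commit fails,
replay the same batch one entry at a time so that only the offending entries
are lost.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: none of these types exist in the controller code base.
final class BatchedRecovery {

    /** A journal entry that either writes or deletes a key. */
    static final class Entry {
        final boolean delete;
        final String key;

        Entry(boolean delete, String key) {
            this.delete = delete;
            this.key = key;
        }
    }

    /** Toy store whose commit is all-or-nothing, like a single transaction. */
    static final class Store {
        private Set<String> keys = new HashSet<>();

        void commit(List<Entry> batch) {
            Set<String> working = new HashSet<>(keys); // stage changes on a copy
            for (Entry e : batch) {
                if (e.delete) {
                    if (!working.remove(e.key)) {
                        // Stand-in for the "Metadata not available" failure.
                        throw new IllegalArgumentException("No such node: " + e.key);
                    }
                } else {
                    working.add(e.key);
                }
            }
            keys = working; // publish only if every entry applied
        }

        Set<String> keys() {
            return Collections.unmodifiableSet(keys);
        }
    }

    /**
     * Replays the journal in batches of batchSize and returns the indices of
     * the entries that could not be applied.
     */
    static List<Integer> recover(List<Entry> journal, int batchSize, Store store) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<Integer> failed = new ArrayList<>();
        for (int start = 0; start < journal.size(); start += batchSize) {
            int end = Math.min(start + batchSize, journal.size());
            try {
                store.commit(journal.subList(start, end)); // fast path: one commit
            } catch (IllegalArgumentException batchFailure) {
                // Slow path: isolate the bad entries by committing one at a time.
                for (int i = start; i < end; i++) {
                    try {
                        store.commit(journal.subList(i, i + 1));
                    } catch (IllegalArgumentException entryFailure) {
                        failed.add(i);
                    }
                }
            }
        }
        return failed;
    }
}
```

With a journal of "write a", "delete missing", "write b" and a batch size of
5000, the single commit fails, the fallback applies the two writes, and index 1
is reported as failed. Whether skipping a failed entry is actually safe for the
datastore is a separate question that this sketch does not answer.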

In addition, the failed entry remains in the RaftActor's in-memory journal, so
in a 3-node cluster, if that node becomes the leader, it wipes out the other
nodes too. We need to prevent a corrupted journal (or a recovery failure) on
one node from corrupting the whole cluster.
