[mdsal-dev] Bug 7370 OOM due to suspected memory leak in akka.dispatch.Dispatcher found by hprof


Michael Vorburger <vorburger@...>
 

+openflowplugin-dev & +ovsdb-dev:

Tom,

On Mon, May 29, 2017 at 2:57 PM, Tom Pantelis <tompantelis@...> wrote:

Thanks a lot for replying, really appreciate it!

It looks like the Dispatcher was for data change notifications. I suspect a listener was hung or responding slowly so the actor's mailbox filled up with change notifications. I would suggest getting a thread dump next time.

Turn out no need to wait for next time - just figured out that we can obtain thread dumps à posteriori from an HPROF using MAT... see the [4] Bug7370_Threads.zip HTML report just attached to Bug 7370.

It shows 604 threads (a lot?), many of which are e.g. parked ForkJoinPool, and a number of them related to ovsdb and openflowplugin stuff... so what are we looking for, in this thread dump? I haven't looked thread each thread's stack yet, but this one vaguely looks like what you may mean by "a listener was hung or responding slowly" (causing "the actor's mailbox filled upwith change notifications"), could it possibly be the reason for / having something to do with this OOM:
opendaylight-cluster-data-akka.actor.default-dispatcher-16
  at sun.misc.Unsafe.park(ZJ)V (Native Method)
  at java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V (LockSupport.java:175)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt()Z (AbstractQueuedSynchronizer.java:836)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(I)V (AbstractQueuedSynchronizer.java:997)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(I)V (AbstractQueuedSynchronizer.java:1304)
  at com.google.common.util.concurrent.AbstractFuture$Sync.get()Ljava/lang/Object; (AbstractFuture.java:285)
  at com.google.common.util.concurrent.AbstractFuture.get()Ljava/lang/Object; (AbstractFuture.java:116)
  at org.opendaylight.openflowplugin.impl.util.DeviceInitializationUtils.initializeNodeInformation(Lorg/opendaylight/openflowplugin/api/openflow/device/DeviceContext;ZLorg/opendaylight/openflowplugin/openflow/md/core/sal/convertor/ConvertorExecutor;)V (DeviceInitializationUtils.java:155)
  at org.opendaylight.openflowplugin.impl.device.DeviceContextImpl.onContextInstantiateService(Lorg/opendaylight/openflowplugin/api/openflow/connection/ConnectionContext;)Z (DeviceContextImpl.java:730)
  at org.opendaylight.openflowplugin.impl.lifecycle.LifecycleServiceImpl.instantiateServiceInstance()V (LifecycleServiceImpl.java:53)
  at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceRegistrationDelegator.instantiateServiceInstance()V (ClusterSingletonServiceRegistrationDelegator.java:46)
  at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.takeOwnership()V (ClusterSingletonServiceGroupImpl.java:291)
  at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(Lorg/opendaylight/mdsal/eos/common/api/GenericEntityOwnershipChange;)V (ClusterSingletonServiceGroupImpl.java:237)
  at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(Lorg/opendaylight/mdsal/eos/common/api/GenericEntityOwnershipChange;)V (AbstractClusterSingletonServiceProviderImpl.java:145)
  at org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(Lorg/opendaylight/mdsal/eos/dom/api/DOMEntityOwnershipChange;)V (DOMClusterSingletonServiceProviderImpl.java:23)
  at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(Lorg/opendaylight/mdsal/eos/dom/api/DOMEntityOwnershipChange;)V (EntityOwnershipListenerActor.java:46)
  at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.handleReceive(Ljava/lang/Object;)V (EntityOwnershipListenerActor.java:36)
  at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(Ljava/lang/Object;)V (AbstractUntypedActor.java:26)
  at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(Ljava/lang/Object;Lscala/Function1;)Ljava/lang/Object; (UntypedActor.scala:165)
  at akka.actor.Actor$class.aroundReceive(Lakka/actor/Actor;Lscala/PartialFunction;Ljava/lang/Object;)V (Actor.scala:484)
  at akka.actor.UntypedActor.aroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V (UntypedActor.scala:95)
  at akka.actor.ActorCell.receiveMessage(Ljava/lang/Object;)V (ActorCell.scala:526)
  at akka.actor.ActorCell.invoke(Lakka/dispatch/Envelope;)V (ActorCell.scala:495)
  at akka.dispatch.Mailbox.processMailbox(IJ)V (Mailbox.scala:257)
  at akka.dispatch.Mailbox.run()V (Mailbox.scala:224)
  at akka.dispatch.Mailbox.exec()Z (Mailbox.scala:234)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec()I (ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(Lscala/concurrent/forkjoin/ForkJoinTask;)V (ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(Lscala/concurrent/forkjoin/ForkJoinPool$WorkQueue;)V (ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run()V (ForkJoinWorkerThread.java:107)




 
On Mon, May 29, 2017 at 7:52 AM, Michael Vorburger <vorburger@...> wrote:
Hi guys,

I just ran MAT([1]) over an HPROF heap dump on OOM in Bug 7370, and it (MAT) raises a "leak suspect" in akka.dispatch.Dispatcher - see the [3] java_pid19570_Leak_Suspects.zip just attached to Bug 7370 ... questions:

Is this perhaps something you jump at with an "ah that, we know about it and already fixed that in ..." ?

If not, how do we go about better understanding the root cause of this, and be able to eventually fix this?

My underlying assumption here is that isn't "normal" and not just "by design" - if it is, I'd love some education... like I'm hoping that the conclusion here isn't simply that MD SAL's data store is a dumb in-memory data base which basically just takes a huge amount of GBs to keep (all) YANG model instances on the heap - or is it?

Tx,
--
Michael Vorburger, Red Hat
vorburger@... | IRC: vorburger @freenode | ~ = http://vorburger.ch

_______________________________________________
mdsal-dev mailing list
mdsal-dev@...rg
https://lists.opendaylight.org/mailman/listinfo/mdsal-dev




Tom Pantelis <tompantelis@...>
 

yeah that looks like an issue. DeviceInitializationUtils is doing a blocking get on a Future which is usually not a good thing. And it occurred via an EOS data change and is blocking an akka Dispatcher thread. 
On a side note, there's a lot of threads with io.netty.util.concurrent.FastThreadLocalThread - not sure if that's normal.


On Mon, May 29, 2017 at 9:30 AM, Michael Vorburger <vorburger@...> wrote:
+openflowplugin-dev & +ovsdb-dev:

Tom,

On Mon, May 29, 2017 at 2:57 PM, Tom Pantelis <tompantelis@...> wrote:

Thanks a lot for replying, really appreciate it!

It looks like the Dispatcher was for data change notifications. I suspect a listener was hung or responding slowly so the actor's mailbox filled up with change notifications. I would suggest getting a thread dump next time.

Turn out no need to wait for next time - just figured out that we can obtain thread dumps à posteriori from an HPROF using MAT... see the [4] Bug7370_Threads.zip HTML report just attached to Bug 7370.

It shows 604 threads (a lot?), many of which are e.g. parked ForkJoinPool, and a number of them related to ovsdb and openflowplugin stuff... so what are we looking for, in this thread dump? I haven't looked thread each thread's stack yet, but this one vaguely looks like what you may mean by "a listener was hung or responding slowly" (causing "the actor's mailbox filled upwith change notifications"), could it possibly be the reason for / having something to do with this OOM:
opendaylight-cluster-data-akka.actor.default-dispatcher-16
  at sun.misc.Unsafe.park(ZJ)V (Native Method)
  at java.util.concurrent.locks.LockSupport.park(Ljava/lang/Object;)V (LockSupport.java:175)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt()Z (AbstractQueuedSynchronizer.java:836)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(I)V (AbstractQueuedSynchronizer.java:997)
  at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(I)V (AbstractQueuedSynchronizer.java:1304)
  at com.google.common.util.concurrent.AbstractFuture$Sync.get()Ljava/lang/Object; (AbstractFuture.java:285)
  at com.google.common.util.concurrent.AbstractFuture.get()Ljava/lang/Object; (AbstractFuture.java:116)
  at org.opendaylight.openflowplugin.impl.util.DeviceInitializationUtils.initializeNodeInformation(Lorg/opendaylight/openflowplugin/api/openflow/device/DeviceContext;ZLorg/opendaylight/openflowplugin/openflow/md/core/sal/convertor/ConvertorExecutor;)V (DeviceInitializationUtils.java:155)
  at org.opendaylight.openflowplugin.impl.device.DeviceContextImpl.onContextInstantiateService(Lorg/opendaylight/openflowplugin/api/openflow/connection/ConnectionContext;)Z (DeviceContextImpl.java:730)
  at org.opendaylight.openflowplugin.impl.lifecycle.LifecycleServiceImpl.instantiateServiceInstance()V (LifecycleServiceImpl.java:53)
  at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceRegistrationDelegator.instantiateServiceInstance()V (ClusterSingletonServiceRegistrationDelegator.java:46)
  at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.takeOwnership()V (ClusterSingletonServiceGroupImpl.java:291)
  at org.opendaylight.mdsal.singleton.dom.impl.ClusterSingletonServiceGroupImpl.ownershipChanged(Lorg/opendaylight/mdsal/eos/common/api/GenericEntityOwnershipChange;)V (ClusterSingletonServiceGroupImpl.java:237)
  at org.opendaylight.mdsal.singleton.dom.impl.AbstractClusterSingletonServiceProviderImpl.ownershipChanged(Lorg/opendaylight/mdsal/eos/common/api/GenericEntityOwnershipChange;)V (AbstractClusterSingletonServiceProviderImpl.java:145)
  at org.opendaylight.mdsal.singleton.dom.impl.DOMClusterSingletonServiceProviderImpl.ownershipChanged(Lorg/opendaylight/mdsal/eos/dom/api/DOMEntityOwnershipChange;)V (DOMClusterSingletonServiceProviderImpl.java:23)
  at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.onEntityOwnershipChanged(Lorg/opendaylight/mdsal/eos/dom/api/DOMEntityOwnershipChange;)V (EntityOwnershipListenerActor.java:46)
  at org.opendaylight.controller.cluster.datastore.entityownership.EntityOwnershipListenerActor.handleReceive(Ljava/lang/Object;)V (EntityOwnershipListenerActor.java:36)
  at org.opendaylight.controller.cluster.common.actor.AbstractUntypedActor.onReceive(Ljava/lang/Object;)V (AbstractUntypedActor.java:26)
  at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(Ljava/lang/Object;Lscala/Function1;)Ljava/lang/Object; (UntypedActor.scala:165)
  at akka.actor.Actor$class.aroundReceive(Lakka/actor/Actor;Lscala/PartialFunction;Ljava/lang/Object;)V (Actor.scala:484)
  at akka.actor.UntypedActor.aroundReceive(Lscala/PartialFunction;Ljava/lang/Object;)V (UntypedActor.scala:95)
  at akka.actor.ActorCell.receiveMessage(Ljava/lang/Object;)V (ActorCell.scala:526)
  at akka.actor.ActorCell.invoke(Lakka/dispatch/Envelope;)V (ActorCell.scala:495)
  at akka.dispatch.Mailbox.processMailbox(IJ)V (Mailbox.scala:257)
  at akka.dispatch.Mailbox.run()V (Mailbox.scala:224)
  at akka.dispatch.Mailbox.exec()Z (Mailbox.scala:234)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec()I (ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(Lscala/concurrent/forkjoin/ForkJoinTask;)V (ForkJoinPool.java:1339)
  at scala.concurrent.forkjoin.ForkJoinPool.runWorker(Lscala/concurrent/forkjoin/ForkJoinPool$WorkQueue;)V (ForkJoinPool.java:1979)
  at scala.concurrent.forkjoin.ForkJoinWorkerThread.run()V (ForkJoinWorkerThread.java:107)




 
On Mon, May 29, 2017 at 7:52 AM, Michael Vorburger <vorburger@...> wrote:
Hi guys,

I just ran MAT([1]) over an HPROF heap dump on OOM in Bug 7370, and it (MAT) raises a "leak suspect" in akka.dispatch.Dispatcher - see the [3] java_pid19570_Leak_Suspects.zip just attached to Bug 7370 ... questions:

Is this perhaps something you jump at with an "ah that, we know about it and already fixed that in ..." ?

If not, how do we go about better understanding the root cause of this, and be able to eventually fix this?

My underlying assumption here is that isn't "normal" and not just "by design" - if it is, I'd love some education... like I'm hoping that the conclusion here isn't simply that MD SAL's data store is a dumb in-memory data base which basically just takes a huge amount of GBs to keep (all) YANG model instances on the heap - or is it?

Tx,
--
Michael Vorburger, Red Hat
vorburger@... | IRC: vorburger @freenode | ~ = http://vorburger.ch

_______________________________________________
mdsal-dev mailing list
mdsal-dev@...rg
https://lists.opendaylight.org/mailman/listinfo/mdsal-dev