[All channel inactive]Switch concurrency connect to odl and all others switch disconnected problem


"陈卓文
 

Hi Openflowjava/Openflowplugin Developer:

I meet a concurrency problem in odl openflowplugin.
I create ovs as switch using script and it connected to my controller one by one:
```python
for ovs_id in range (OVS_START_NUM,OVS_NUM):
    subprocess.call('ovs-vsctl add-br bridge%d' % ovs_id, shell=True)
    subprocess.call('ovs-vsctl set-controller bridge%d tcp:%s:6633' % ovs_id, controller, shell=True)
```

And I can see: all connected switch disconnected in a time suddenly. I can see the log:
```
2018-07-24T20:55:00,333 | INFO | epollEventLoopGroup-9-42 | SystemNotificationsListenerImpl | 326 - org.opendaylight.openflowplugin.impl - 0.6.2.netease1 | ConnectionEvent: Connection closed by device, Device:/xxxx:35922, NodeId:openflow:xxx
```

I find the openflowjava code in `DelegatingInboundHandler.java`,it mean "the channel between controller and switch inactive."
```
    @Override
    public void channelInactive(final ChannelHandlerContext ctx) {
        LOG.debug("Channel inactive");
        if (!inactiveMessageSent) {
            DisconnectEventBuilder builder = new DisconnectEventBuilder();
            builder.setInfo("Channel inactive");
            consumer.consume(builder.build());
            inactiveMessageSent = true;
        }
    }
```

But I am very sure all switch and controller network link no problem. I had test many time, this issue will happen again. And I found the special log:
```
2018-07-27T19:07:37,499 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | ContextChainImpl | 326 - org.opendaylight.openflowplugin.impl - 0.6.2.netease1 | Not able to start clustering services for node openflow:86750152275523 2018-07-27T19:07:37,500 | WARN | ofppool-358 | ContextChainHolderImpl | 326 - org.opendaylight.openflowplugin.impl | Not able to set MASTER role on device openflow:86750152275523, reason: java.lang.RuntimeException: Failed to initialize device openflow:86750152275523 in 9s: java.util.concurrent.TimeoutException: Waited 9000 milliseconds for com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture@7ae82a2[status=PENDING]
```
Before this log, the controller everything well. And After this log, I can see many `Connection closed by device` log.
This log show concurrent timeout and thread status PENDING.
I find the Exception code in `DeviceContextImpl.java`:
```
            final Future<Void> initialize = initializer
                    .get()
                    .initialize(this, switchFeaturesMandatory, skipTableFeatures, writerProvider, convertorExecutor);

            try {
                initialize.get(DEVICE_INIT_TIMEOUT, TimeUnit.MILLISECONDS);
            } catch (TimeoutException ex) {
                initialize.cancel(true);
                throw new RuntimeException(String.format("Failed to initialize device %s in %ss: %s",
                        deviceInfo.toString(), String.valueOf(DEVICE_INIT_TIMEOUT / 1000), ex.toString()), ex);
            } catch (ExecutionException | InterruptedException ex) {
                throw new RuntimeException(
                        String.format("Device %s cannot be initialized: %s", deviceInfo.toString(), ex.toString()), ex);
            }
```

The device initialize process for get OFPMPDESC/OFPMPMETERFEATURES/OFPMPGROUPFEATURES... from switch. And it request isn't exceute by channel's Eventloop?

I think this is a concurrency problem in openflowplugin / openflowjava / netty ?

For more, I alse test switch connect to 6633 and 6653 port in same controller. And I saw when all switch disconnect in 6633 port, but all switch connect to 6653 port no problem.
I know 6653 and 6633 port is different netty EventLoopGroup.
Is a channel in one EventLoopGroup(6633), it device initialize failed and make all EventLoopGroup channel inactive? And another EventLoopGroup(6653) isn't affected.

I had read openflowjava and openflowplugin code very detail, but i cannot find the problem in source code.
Is it a concurrency bug in ofp using netty way? or something others?


Hope get your help. Thanks.
-----------
SimonCZW