This group is locked. No changes can be made to the group while it is locked.
Date
1 - 1 of 1
[All channel inactive]Switch concurrency connect to odl and all others switch disconnected problem
"陈卓文
Hi Openflowjava/Openflowplugin Developer:
I meet a concurrency problem in odl openflowplugin.
I create ovs as switch using script and it connected to my controller one by one:
```python
for ovs_id in range (OVS_START_NUM,OVS_NUM):
subprocess.call('ovs-vsctl add-br bridge%d' % ovs_id, shell=True)
subprocess.call('ovs-vsctl set-controller bridge%d tcp:%s:6633' % ovs_id, controller, shell=True)
```
And I can see: all connected switch disconnected in a time suddenly. I can see the log:
```
2018-07-24T20:55:00,333 | INFO | epollEventLoopGroup-9-42 | SystemNotificationsListenerImpl | 326 - org.opendaylight.openflowplugin.impl - 0.6.2.netease1 | ConnectionEvent: Connection closed by device, Device:/xxxx:35922, NodeId:openflow:xxx
```
I find the openflowjava code in `DelegatingInboundHandler.java`,it mean "the channel between controller and switch inactive."
```
@Override
public void channelInactive(final ChannelHandlerContext ctx) {
LOG.debug("Channel inactive");
if (!inactiveMessageSent) {
DisconnectEventBuilder builder = new DisconnectEventBuilder();
builder.setInfo("Channel inactive");
consumer.consume(builder.build());
inactiveMessageSent = true;
}
}
```
But I am very sure all switch and controller network link no problem. I had test many time, this issue will happen again. And I found the special log:
```
2018-07-27T19:07:37,499 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | ContextChainImpl | 326 - org.opendaylight.openflowplugin.impl - 0.6.2.netease1 | Not able to start clustering services for node openflow:86750152275523
2018-07-27T19:07:37,500 | WARN | ofppool-358 | ContextChainHolderImpl | 326 - org.opendaylight.openflowplugin.impl | Not able to set MASTER role on device openflow:86750152275523, reason: java.lang.RuntimeException: Failed to initialize device openflow:86750152275523 in 9s: java.util.concurrent.TimeoutException: Waited 9000 milliseconds for com.google.common.util.concurrent.AbstractTransformFuture$AsyncTransformFuture@7ae82a2[status=PENDING]
```
Before this log, the controller everything well. And After this log, I can see many `Connection closed by device` log.
This log show concurrent timeout and thread status PENDING.
I find the Exception code in `DeviceContextImpl.java`:
```
final Future<Void> initialize = initializer
.get()
.initialize(this, switchFeaturesMandatory, skipTableFeatures, writerProvider, convertorExecutor);
try {
initialize.get(DEVICE_INIT_TIMEOUT, TimeUnit.MILLISECONDS);
} catch (TimeoutException ex) {
initialize.cancel(true);
throw new RuntimeException(String.format("Failed to initialize device %s in %ss: %s",
deviceInfo.toString(), String.valueOf(DEVICE_INIT_TIMEOUT / 1000), ex.toString()), ex);
} catch (ExecutionException | InterruptedException ex) {
throw new RuntimeException(
String.format("Device %s cannot be initialized: %s", deviceInfo.toString(), ex.toString()), ex);
}
```
The device initialize process for get OFPMPDESC/OFPMPMETERFEATURES/OFPMPGROUPFEATURES... from switch. And it request isn't exceute by channel's Eventloop?
I think this is a concurrency problem in openflowplugin / openflowjava / netty ?
For more, I alse test switch connect to 6633 and 6653 port in same controller. And I saw when all switch disconnect in 6633 port, but all switch connect to 6653 port no problem.
I know 6653 and 6633 port is different netty EventLoopGroup.
Is a channel in one EventLoopGroup(6633), it device initialize failed and make all EventLoopGroup channel inactive? And another EventLoopGroup(6653) isn't affected.
I had read openflowjava and openflowplugin code very detail, but i cannot find the problem in source code.
Is it a concurrency bug in ofp using netty way? or something others?
Hope get your help. Thanks.
-----------
SimonCZW