[controller-dev] OpenFlow messages getting lost from MD-SAL to switch?
Michal Rehak -X (mirehak - Pantheon Technologies SRO at Cisco)
mirehak at cisco.com
Wed Apr 8 08:52:18 UTC 2015
On the way to the device, either an RPC thread or a dataChangeEvent (notification) thread enters the service in ofPlugin. The service has a blocking queue of size 100k with 10 threads working on it. If the queue is full, the incoming thread is blocked until the queue frees up a bit.
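To illustrate the blocking behaviour described above, here is a minimal sketch of a bounded queue drained by a fixed set of worker threads. It is not the ofPlugin code; the class name BoundedServiceQueue and its methods are hypothetical, and the capacity and worker count are constructor arguments rather than the 100k / 10 mentioned above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: a bounded queue whose producers block while it is full. */
public final class BoundedServiceQueue {
    // Sentinel task that tells one worker to stop.
    private static final Runnable POISON = () -> { };

    private final BlockingQueue<Runnable> queue;
    private final List<Thread> workers = new ArrayList<>();
    private final AtomicInteger processed = new AtomicInteger();

    public BoundedServiceQueue(int capacity, int workerCount) {
        queue = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < workerCount; i++) {
            Thread t = new Thread(this::drain, "worker-" + i);
            t.setDaemon(true);
            t.start();
            workers.add(t);
        }
    }

    /** Blocks the calling thread while the queue is full; false if interrupted. */
    public boolean submit(Runnable task) {
        try {
            queue.put(task);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Stops all workers after the queued tasks and returns how many tasks ran. */
    public int shutdownAndCount() {
        try {
            for (int i = 0; i < workers.size(); i++) {
                queue.put(POISON);
            }
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    private void drain() {
        try {
            for (Runnable task = queue.take(); task != POISON; task = queue.take()) {
                task.run();
                processed.incrementAndGet();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Because submit() uses put(), a burst of callers slows down to the rate at which the workers drain the queue instead of growing memory without bound.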
On the way from the device, the message pipeline is split into a data queue (for packetIn) and a control queue (all other messages) per connection. Both queues are limited to 5k. However, the data queue (unordered) simply drops incoming messages when it is full, whereas the control queue adjusts the autoread property of the netty channel configuration: as soon as more than 80% of the queue is occupied, autoread is set to false. See: org.opendaylight.openflowplugin.openflow.md.queue.WrapperQueueImpl.
The main goal is not to use up all available memory during a load burst, but rather to apply backpressure in both directions.
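The two inbound policies can be sketched roughly as follows. This is not WrapperQueueImpl itself: the class IngressQueues and the AutoRead interface are hypothetical, with AutoRead standing in for a call to Netty's channel.config().setAutoRead(boolean). The point at which reading is switched back on (occupancy back at or below 80%) is an assumption, since only the switch-off condition is stated above.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical sketch of per-connection data and control queues. */
public final class IngressQueues<M> {
    /** Stand-in for channel.config().setAutoRead(boolean) on a Netty channel. */
    public interface AutoRead {
        void set(boolean enabled);
    }

    private final BlockingQueue<M> data;
    private final BlockingQueue<M> control;
    private final int highWaterMark;
    private final AutoRead autoRead;
    private boolean reading = true;
    private long droppedData;

    public IngressQueues(int capacity, AutoRead autoRead) {
        this.data = new ArrayBlockingQueue<>(capacity);
        this.control = new ArrayBlockingQueue<>(capacity);
        this.highWaterMark = capacity * 8 / 10; // 80% of capacity
        this.autoRead = autoRead;
    }

    /** Data (packetIn) path: drop the message if the queue is full. */
    public synchronized boolean offerData(M msg) {
        boolean accepted = data.offer(msg);
        if (!accepted) {
            droppedData++;
        }
        return accepted;
    }

    /** Control path: enqueue, then stop socket reads above the high-water mark. */
    public synchronized boolean offerControl(M msg) {
        boolean accepted = control.offer(msg);
        updateAutoRead();
        return accepted;
    }

    /** Consumer side: dequeue, then resume reads once occupancy has dropped (assumed policy). */
    public synchronized M pollControl() {
        M msg = control.poll();
        updateAutoRead();
        return msg;
    }

    private void updateAutoRead() {
        boolean shouldRead = control.size() <= highWaterMark;
        if (shouldRead != reading) {
            reading = shouldRead;
            autoRead.set(shouldRead);
        }
    }

    public synchronized boolean isReading() {
        return reading;
    }

    public synchronized long droppedData() {
        return droppedData;
    }
}
```

With autoread off, the socket is no longer drained, so TCP flow control eventually slows the switch down rather than the controller dropping control messages.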
From: controller-dev-bounces at lists.opendaylight.org [controller-dev-bounces at lists.opendaylight.org] on behalf of Jim West [jimw at a-bb.net]
Sent: Friday, April 03, 2015 16:16
To: controller-dev at lists.opendaylight.org
Subject: Re: [controller-dev] OpenFlow messages getting lost from MD-SAL to switch?
I eventually found the source of the problem and modified the ODL source to get around it.
My change was in the openflowjava project: openflow-protocol-impl/src/main/java/org/opendaylight/openflowjava/protocol/impl/connection/ConnectionAdapterImpl.java
The initial DEFAULT_QUEUE_DEPTH value is 1,000. This was too small for my burst tests, so I needed to increase it to 10,000.
On another note, I've found a similar problem related to the input-queue. The input queue is defined in the openflowplugin project: openflowplugin/src/main/java/org/opendaylight/openflowplugin/openflow/md/core/ConnectionConductorImpl.java
The INGRESS_QUEUE_MAX_SIZE is set to 200???. Once I fixed the output queue problem, I discovered I was losing messages in the other direction, so I increased INGRESS_QUEUE_MAX_SIZE to 10,000.
Looking at how the OF engine is coded, there's a good chance that we lose messages (either sent or received) under load. I don't see any mechanism for providing back-pressure to the components putting entries on the queues.
* For the input queue, if we slowed the reading of data off the socket, I believe TCP would provide the necessary feedback to the switch.
* For the output queue, couldn't we have the FRM wait and retry? If we reduce the rate at which the MD-SAL listener in the FRM responds, I know MD-SAL can back up. We need a way to communicate back to the ultimate clients of the FRM (or openflowjava) that the output queue is filling up.
Have I missed something here?
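As a rough sketch of the "wait and retry" idea for the output queue: the sender below waits a bounded time for space, retries a few times, and reports failure to its caller instead of silently dropping. This is only an illustration, not FRM or openflowjava code; the class RetryingSender and its parameters are hypothetical.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: bounded wait-and-retry in front of a full output queue. */
public final class RetryingSender<M> {
    private final BlockingQueue<M> outputQueue;
    private final int maxAttempts;
    private final long waitMillis;

    public RetryingSender(BlockingQueue<M> outputQueue, int maxAttempts, long waitMillis) {
        this.outputQueue = outputQueue;
        this.maxAttempts = maxAttempts;
        this.waitMillis = waitMillis;
    }

    /**
     * Returns true once the message is queued. Returns false if the queue
     * stayed full for every attempt (or the thread was interrupted), so the
     * caller can slow down or report the failure upstream.
     */
    public boolean send(M msg) {
        try {
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                if (outputQueue.offer(msg, waitMillis, TimeUnit.MILLISECONDS)) {
                    return true;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }
}
```

The boolean result is what would let the ultimate client learn that the output queue is filling up.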
On Tue, Mar 24, 2015 at 7:39 AM, Jim West <jimw at a-bb.net> wrote:
I'm running stable/helium rc2.
My switch is a very recent build of open_vswitch.
I've noticed that when I run my system under load, FLOW_MOD messages are getting lost from the controller to the switch. This problem manifests itself in this way:
When I list my flows on the switch, they do not match what's in the ODL config store.
The messages from the controller to the switch are:
My performance tests involve hitting the switch with a lot of unlearned MAC addresses.
This causes a large number of packet-in messages, which are handled by my own Java code that uses the MD-SAL.
* I record all the flows that I believe I've programmed.
* I record all the flows that I believe I've deleted
When my tests are over, there's a discrepancy.
* Flows that I have removed (that are not in ODL's config store) are still on the switch.
* My packet traces indicate that ODL never sent a DELETE message for this stale entry.
* ODL _did_ send most other deletes
* Flows that I have programmed (and are in ODL's config store) are NOT on the switch.
I'm also seeing something similar for Meters and Groups. Sometimes I create a Meter/Group and it doesn't get into the switch. Other times I delete one and it stays (but it's gone from the Config Store).
Has anyone else seen this? This only happens when I run under load. Does anyone have suggestions on which sub-systems to turn logging on for? I have no problems with changing/instrumenting some code and re-running the tests.