In the current version, each node submits its individual statistics requests to the request scheduler, rather than submitting the whole node. This change avoids sending all of a node's statistics requests in one chunk.
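The per-request scheduling described above could be sketched roughly as below. This is a minimal illustration, not the actual patch; the class and field names (StatsRequest, RequestScheduler) are assumptions.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical sketch: enqueue each statistics request individually,
// instead of enqueuing a node and sending its whole batch at once.
class StatsRequest {
    final String nodeId;
    final String type; // e.g. "flow", "port", "table" (illustrative)
    StatsRequest(String nodeId, String type) {
        this.nodeId = nodeId;
        this.type = type;
    }
}

class RequestScheduler {
    private final Queue<StatsRequest> queue = new ArrayDeque<>();

    // Each request is a separate queue entry, so the scheduler can
    // interleave requests from different nodes rather than sending
    // one node's requests in a single chunk.
    void submit(StatsRequest req) { queue.add(req); }

    StatsRequest next() { return queue.poll(); }
}
```

With this shape, the dispatch loop pulls one request at a time, which is what smooths out the per-node CPU bursts.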
Overall, the results look better compared to the results I sent in my previous mail. I tested with 16, 32 and 64 nodes; I was able to see the topology, and ping-all worked fine as well. I also did minimal functional testing to check that stats are getting updated properly.
Following are the CPU graphs for the 16, 32 and 64 node tests:
32 nodes: <23953736.jpg>
64 nodes: <23524216.jpg>
I also tried with 128 nodes, but the controller started acting up (switch disconnections, exceptions from the Infinispan event-notification classes, CPU spikes). Overall, from these three results it looks like each time the node count doubles, CPU usage increases by about 10% on average.
Robert, can you please review the Gerrit patch and let me know if you have any suggestions?
Jan, if you have time, can you please try this patch on your machine and see if it gives similar results?
1) Added a StatisticsRequestScheduler class, which:
*.* keeps track of pending MD-SAL transaction requests using a DataTransactionListener. It only tracks requests submitted by the Statistics Manager.
*.* implements a queue where each node submits a request for scheduling
*.* whenever the count of pending transaction requests drops to zero, it picks up the next node in the queue and sends all the statistics requests for that node.
2) All the nodes periodically put themselves in the scheduler queue for execution; if a node is already present in the queue, the duplicate request is not accepted.
3) Removal of stale statistics is now done based on counters rather than timestamps, because timestamps can cause issues in a clustered environment. Removal of stale transaction IDs is still time based.
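Points 1) and 2) above could be sketched roughly as follows. This is a simplified illustration under assumed names (NodeScheduler, onTransactionCompleted), not the real StatisticsRequestScheduler code; the actual implementation hooks into the MD-SAL DataTransactionListener.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the scheduling logic described above.
class NodeScheduler {
    // LinkedHashSet keeps FIFO order and rejects duplicates, matching
    // point 2: a node already queued is not re-added.
    private final Set<String> queue = new LinkedHashSet<>();
    private final AtomicInteger pendingTransactions = new AtomicInteger();

    // Nodes enqueue themselves periodically; returns false for a duplicate.
    synchronized boolean enqueue(String nodeId) {
        return queue.add(nodeId);
    }

    // Called once per statistics request submitted to MD-SAL.
    void onRequestSubmitted() {
        pendingTransactions.incrementAndGet();
    }

    // Called from the transaction listener when a statistics transaction
    // completes; when the pending count reaches zero, the next node in
    // the queue is executed.
    void onTransactionCompleted() {
        if (pendingTransactions.decrementAndGet() == 0) {
            executeNext();
        }
    }

    private synchronized void executeNext() {
        Iterator<String> it = queue.iterator();
        if (it.hasNext()) {
            String nodeId = it.next();
            it.remove();
            // Here the real code would send all statistics requests for
            // nodeId, calling onRequestSubmitted() once per request.
        }
    }
}
```

The pending-transaction gate is what throttles the controller: no new node is serviced until the previous node's transactions have drained.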
I ran the performance test for both the existing implementation and the new changes on my laptop (i5 processor, 8 GB RAM, also running an Ubuntu Mininet VM and a few other processes). Following are the graphs for both runs. I do see a performance improvement, but it is not up to expectations.
With the existing implementation (16-node tree network):
With the new code changes (16-node tree network):
I am working on the following changes to further improve the code:
*.* As of now I submit a "node" to the scheduler queue, and whenever the scheduler picks a node, it sends all the statistics requests for that node; I believe that is the reason behind these CPU peaks. I am planning to submit individual stats requests to the scheduler queue instead, so the scheduler sends one statistics request at a time, rather than all the requests for that "node" in one shot.
*.* Modifying the statistics polling interval dynamically, based on how many nodes are connected.
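The dynamic-interval idea could look something like the sketch below. The base interval and the scaling rule here are purely assumptions for illustration, not values from the patch.

```java
// Hypothetical sketch: scale the polling interval with the node count
// so larger topologies generate statistics requests at a sustainable
// rate. BASE_INTERVAL_MS and the step size of 16 are assumed values.
class IntervalPolicy {
    static final long BASE_INTERVAL_MS = 15_000;

    static long intervalFor(int nodeCount) {
        // Double the interval roughly as the topology doubles in size.
        int scale = Math.max(1, nodeCount / 16); // assumed step size
        return BASE_INTERVAL_MS * scale;
    }
}
```

A policy like this keeps the aggregate request rate (requests per second across all nodes) roughly constant as the topology grows.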