Originally posted on: http://geekswithblogs.net/akraus1/archive/2015/01/26/161276.aspx
Since most of us use multiple threads, we get all sorts of race conditions, which we can solve with proper locking and concurrent data structures. A simple case: you have concurrent access to an object which exposes a Dictionary<string,string>, and you get exceptions back when concurrent writers add the same data to it. You then need to either coordinate the writers or use a different dictionary that allows concurrent adds. As it turns out, ConcurrentDictionary is just the collection you were searching for, and your exceptions are gone. You let the tests run. All green; check it in and submit it to the main branch.
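A minimal sketch of that kind of change (the class and key names here are illustrative, not from any real code base): Dictionary<string,string>.Add throws on duplicate keys and is not safe for concurrent writers, while ConcurrentDictionary.TryAdd is safe from any number of threads and simply returns false for a key that is already present.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class Demo
{
    static void Main()
    {
        var data = new ConcurrentDictionary<string, string>();

        Parallel.For(0, 1000, i =>
        {
            // Many threads race to add the same 100 keys; for each key exactly
            // one writer wins and the rest get false instead of an exception.
            data.TryAdd("key" + (i % 100), "value" + i);
        });

        Console.WriteLine(data.Count); // 100 distinct keys
    }
}
```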
A long time later, at the end of the project, the until now ignored performance issues tend to get identified and fixed. Someone calls you: your check-in from a long time ago is the root cause of a major performance issue, and you are showing up as Mount Everest in the VirtualAlloc reports.
How can that be? The problem with the concurrent collections, which claim to be largely lock free, is that to achieve uncontended writes they allocate their key data structures per core. This means you are allocating memory proportional to the number of cores. That is not a problem if you have relatively few instances around, but if you exchange the dictionary inside a data structure that is heavily allocated, things get much slower. You can test it for yourself with a small sample app like this:
```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Diagnostics;

class DataObject // Original version
{
    public IDictionary<string, string> Data { get; private set; }

    public DataObject()
    {
        Data = new Dictionary<string, string>();
    }
}

class FixedDataObject // "Fixed" version with ConcurrentDictionary
{
    public IDictionary<string, string> Data { get; private set; }

    public FixedDataObject()
    {
        Data = new ConcurrentDictionary<string, string>();
    }
}

class Program
{
    static void Main(string[] args)
    {
        List<DataObject> dataList = new List<DataObject>();
        List<FixedDataObject> dataListFixed = new List<FixedDataObject>();
        var sw = Stopwatch.StartNew();
        const int Runs = 300 * 1000;
        for (int i = 0; i < Runs; i++)
        {
            if (args.Length == 0) // No args: use old version
            {
                dataList.Add(new DataObject());
            }
            else // Some args passed: use "fixed" version
            {
                dataListFixed.Add(new FixedDataObject());
            }
        }
        sw.Stop();
        Console.WriteLine("{0:N0} did take {1:F2}s", Runs, sw.Elapsed.TotalSeconds);
        GC.KeepAlive(dataList);
        GC.KeepAlive(dataListFixed);
    }
}
```
Old Version
300.000 did take 0,14s
Fixed Version with Concurrent Dictionary
300.000 did take 1,08s
Update
On a 24 core server it takes 3,3s and it consumes 1,4GB of memory. Data structures which allocate proportionally to the number of cores for the sake of thread performance do not sound like a good idea to me.
This is over 7 times slower and introduces tons of GCs in your application, limiting what you can do in parallel even further. The worst part is that the bigger the server you run it on, the slower it will get (more cores means more memory to allocate!). On a slow dev machine with 2 or 4 cores you might not notice it much. But if you deploy this on real server hardware with 40 cores, you have a really bad performance problem at hand. That is the reason why you need to do performance testing on virtually every hardware configuration you support.
The fix was to go back to a plain dictionary with a lock. ConcurrentDictionary does let you pass a concurrency level to its constructor, which controls how many internal locks (and hence isolated-write segments) it allocates, but even with a reduced level it was still a factor of two slower.
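A sketch of what "a plain dictionary with a lock" can look like — SynchronizedData is an illustrative name, not the class from the real code base — plus the ConcurrentDictionary constructor overload that takes an explicit concurrency level:

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;

// Illustrative fix: a plain Dictionary guarded by a single lock object,
// instead of a per-core array of locks and buckets.
class SynchronizedData
{
    readonly Dictionary<string, string> _data = new Dictionary<string, string>();
    readonly object _sync = new object();

    public void AddOrUpdate(string key, string value)
    {
        lock (_sync)
        {
            _data[key] = value; // the indexer overwrites instead of throwing on duplicates
        }
    }

    public bool TryGetValue(string key, out string value)
    {
        lock (_sync)
        {
            return _data.TryGetValue(key, out value);
        }
    }
}

class Alternative
{
    // If you stay with ConcurrentDictionary, this overload caps the number
    // of internal locks; the capacity value here is an arbitrary example.
    static readonly ConcurrentDictionary<string, string> Small =
        new ConcurrentDictionary<string, string>(concurrencyLevel: 1, capacity: 31);
}
```

The single lock trades some write contention for a far smaller per-instance footprint, which is the right trade when you allocate hundreds of thousands of these objects.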
So beware: by simply replacing non-concurrent data structures with their lock-free but much more memory hungry counterparts, you might be making your system much slower. In effect, the bigger the server got, the slower we were loading data. That is definitely not what the customer paid for.