Sunday, December 14, 2014

Java/C# Driver serialization speed issues / retrieving large sets in an efficient manner.

So, for a bit of background: I've been tasked with moving an in-house in-memory cache out to MongoDB because memory usage was getting to extreme levels, so the idea was to shift large chunks of it to disk and only retrieve what is necessary at query time. Each object is quite hefty, with about 20-30 properties (90% of them non-complex) and a single array of 1-5 items, each of similar size to the parent object.

Currently I'm not serializing the array into Mongo, so I'm only dealing with the primary object, but serialization speeds in the C# driver are ridiculously slow: 6-7 seconds for a 65k-item retrieval, where the Java driver happily does the same in roughly 1/50th of the time. (I don't have the exact query I ran earlier on hand, but I believe 33k items were retrieved in 72ms in Java while the C# driver took 2-3 seconds. These numbers are using BsonDocument.)
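For reference, the kind of wholesale fetch being timed here looks roughly like the sketch below, assuming the 1.x C# driver API; the connection string, database name, and collection name are placeholders, not our real ones.

    using System;
    using System.Diagnostics;
    using System.Linq;
    using MongoDB.Bson;
    using MongoDB.Driver;

    class FetchTiming
    {
        static void Main()
        {
            // Placeholder connection string, database and collection names.
            var client = new MongoClient("mongodb://localhost:27017");
            var database = client.GetServer().GetDatabase("cache");
            var collection = database.GetCollection<BsonDocument>("items");

            var stopwatch = Stopwatch.StartNew();

            // Fetch the collection wholesale, forcing full enumeration of the cursor.
            var documents = collection.FindAll().ToList();

            stopwatch.Stop();
            Console.WriteLine("Fetched {0} documents in {1} ms",
                documents.Count, stopwatch.ElapsedMilliseconds);
        }
    }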

I made a custom class and serializer that only attach the raw byte array to the object, with the data resolved later on access, and the time is now 0.3 seconds to retrieve the 65k list, which is better but still not good enough. Just as a note, I'm not running any serious queries; I'm simply fetching a collection wholesale.
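As an illustration of that deferred approach, a hypothetical wrapper might look like the sketch below (this is not our actual in-house class): it just holds the raw BSON bytes and only deserializes them into a BsonDocument the first time they are accessed.

    using MongoDB.Bson;
    using MongoDB.Bson.Serialization;

    // Hypothetical wrapper: holds the raw BSON bytes for one document and
    // defers deserialization until the data is actually accessed.
    public class DeferredDocument
    {
        private readonly byte[] _rawBytes;
        private BsonDocument _document;

        public DeferredDocument(byte[] rawBytes)
        {
            _rawBytes = rawBytes;
        }

        public BsonDocument Document
        {
            get
            {
                // The expensive BSON-to-BsonDocument step only runs here,
                // the first time a caller touches the data.
                if (_document == null)
                {
                    _document = BsonSerializer.Deserialize<BsonDocument>(_rawBytes);
                }
                return _document;
            }
        }
    }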

So, my questions:

Does the Java driver do some serious deferring of its own to give the speed difference? 
Are our objects outside the general use-case for MongoDB?
Is there some secret setting in the C# driver/MongoDB that will help me retrieve large sets of objects without such ridiculous overhead?
Or is this simply a problem I can't optimize away, given that everything is on disk and out of process?

If I haven't given enough information, just let me know what you need; I haven't added more specifics because the operations we are performing are very simple and this isn't a question of query optimization.

PS: Saying 'just filter the set' is not really an option, given the way our data is structured and the way MongoDB works. The data is heavily relational, and complex filtering would have to run across multiple collections, which MongoDB can't do properly within a reasonable time. We do large amounts of the filtering on our side and then either save cached collections in Mongo or simply use it as a plain back-end store for our own sets. So retrieval of small subsets of data, or of the whole set, is our focus, and so far the numbers are not looking pretty.

The answer may simply be 'don't use Mongo', but I am hoping there are a few subtle ways to improve performance here, as there are good business reasons for using it.



While actual results can vary widely based on hardware configuration, one would normally expect to see somewhere between 30,000 and 60,000 documents retrieved per second.

Most of the work that the driver has to do is to deserialize the raw BSON byte stream into C# objects (either BsonDocument as you are using, or in some cases directly to instances of custom C# classes).
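For example, the driver can deserialize straight into a custom class instead of a BsonDocument. The sketch below uses a hypothetical CacheItem class (not the poster's real type) and relies on the driver automapping public properties by default; the collection name is also a placeholder.

    using System.Collections.Generic;
    using System.Linq;
    using MongoDB.Driver;

    // Hypothetical POCO; the driver automaps public properties to BSON fields.
    public class CacheItem
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public List<CacheItem> Children { get; set; }
    }

    public static class CacheItemStore
    {
        public static List<CacheItem> LoadAll(MongoDatabase database)
        {
            // Each BSON document is deserialized directly into a CacheItem
            // instead of going through an intermediate BsonDocument.
            var collection = database.GetCollection<CacheItem>("items");
            return collection.FindAll().ToList();
        }
    }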

The amount of work needed to deserialize a document is directly and linearly proportional to the complexity of the document. So for example, a document that is 4 times more complex should take about 4 times as long to deserialize.

Given that your documents are complex, a retrieval rate of between 11,000 and 16,500 documents per second (33,000 documents in 2-3 seconds) sounds consistent with previously seen results.

You could try fetching your documents as RawBsonDocument or LazyBsonDocument instead of BsonDocument, which would defer the deserialization of the document elements until such time as they are actually referenced. This can work to your advantage if you only refer to small parts of a document, but if you eventually end up processing the whole document you might as well deserialize it up front. If you decide to use RawBsonDocument or LazyBsonDocument note that they are IDisposable and you must call Dispose when you are done with them.
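A minimal sketch of that approach, assuming the 1.x driver API and a placeholder collection name; the using block takes care of the Dispose call for each RawBsonDocument:

    using System;
    using MongoDB.Bson;
    using MongoDB.Driver;

    public static class RawFetchExample
    {
        public static void DumpIds(MongoDatabase database)
        {
            // Fetching as RawBsonDocument defers element deserialization
            // until an element is actually accessed.
            var collection = database.GetCollection<RawBsonDocument>("items");

            foreach (var rawDocument in collection.FindAll())
            {
                // RawBsonDocument is IDisposable, so dispose each one when done.
                using (rawDocument)
                {
                    // Element values are only deserialized as they are accessed;
                    // the rest of the document stays as raw bytes.
                    Console.WriteLine(rawDocument["_id"]);
                }
            }
        }
    }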

The Java driver should have roughly comparable performance. I don't understand how you could have seen such a big speed difference when using the Java driver. Perhaps someone more familiar with the Java driver can chime in.



Thanks for the response. 

I already tried using our own version of lazy/raw BSON documents (and I just tried the built-in ones, and there is no difference between them), leaving deserialization until we need the values, but as you say that only puts the problem off rather than solving it.

Well, at least you've validated that the numbers I'm seeing are likely on par with what's expected, but it is strange if the Java and C# drivers are meant to have relatively similar speeds. I was about to do some proper tests using the Java driver, so tomorrow I'll at least be able to see whether it was some fluke or whether it really is that much faster.



Just to finish this up: there was an issue with the timer we were using in the Java tests. It looks like the default document object in the Java driver does take a little less time to return, but it's nowhere near the speed we need.

