My MongoDB model of these data in its current form ended up with 4049 documents with an avgObjSize of 47831 bytes.
The data is structured in 3 layers of nested documents (years, months, days).
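Just to illustrate the shape (the field names and leaf values here are hypothetical, my real schema is much bigger), one such document built with the driver's BsonDocument type would look roughly like this:

// Hypothetical shape only -- one document per series, values nested year -> month -> day.
var doc = new BsonDocument
{
    { "_id", "series-1" },
    { "2014", new BsonDocument
        {
            { "01", new BsonDocument
                {
                    { "15", new BsonArray { 1.23, 4.56, 7.89 } }
                }
            }
        }
    }
};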
Using a .NET 4.5 console application and the 1.9.2 driver from NuGet, the test application works with MongoCollection&lt;BsonDocument&gt;.
I did some profiling with the Concurrency Visualizer and noticed that when ramping up the cursor count, all the time was spent allocating memory (waiting for GC).
So I added the following to app.config:
<runtime>
  <gcServer enabled="true"/>
  <gcConcurrent enabled="false"/>
</runtime>
That brings the time down to less than 4 seconds (using NumberOfCursors=5, which seemed to be the best number on my PC), a more than threefold improvement!
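As a side note, if you want to verify at runtime that the setting was actually picked up, GCSettings will tell you (this check is not part of the timing itself):

using System.Runtime;

// Sanity check only: prints True when <gcServer enabled="true"/> took effect.
Console.WriteLine( "Server GC: {0}, latency mode: {1}",
    GCSettings.IsServerGC, GCSettings.LatencyMode );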
I wanted to see if I could push it further down, so I compiled my own build of the driver and profiled it. Almost all of the remaining time is spent creating the BsonDocuments, specifically in maintaining the _elements and _indexes private fields.
I replaced _indexes with this class:
internal class IndexIntoList
{
    private readonly List<BsonElement> _list;

    public IndexIntoList( List<BsonElement> list )
    {
        _list = list;
    }

    // The element list itself is the index, so there is nothing extra to maintain.
    public void Clear( ) { }
    public void Add( string name, int index ) { }

    public bool ContainsKey( string name )
    {
        foreach( var e in _list ) if( e.Name.Equals( name ) ) return true;
        return false;
    }

    public bool TryGetValue( string name, out int index )
    {
        for( int i = 0; i < _list.Count; i++ )
        {
            var e = _list[i];
            if( e.Name.Equals( name ) ) { index = i; return true; }
        }
        index = 0;
        return false;
    }
}

Using that, combined with pre-allocating _list to size=16, I was able to achieve 2.8 seconds :-)
I also tried pre-allocating _indexes and _elements to size=16, and that gave a minor improvement over the original.
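For reference, that pre-allocation is nothing more exotic than giving the two collections an initial capacity when the BsonDocument is constructed; roughly like this (the field types are inferred from the driver source, the surrounding constructor is omitted):

// Inside BsonDocument (sketch): give both backing collections an initial
// capacity of 16 so small documents never trigger a resize/rehash.
private List<BsonElement> _elements = new List<BsonElement>( 16 );
private Dictionary<string, int> _indexes = new Dictionary<string, int>( 16 );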
I realize that this optimization is heavily dependent on the BsonDocuments having few keys, since it does a linear scan for the key, but perhaps it would be worth considering only allocating a dictionary when _list exceeds a certain threshold?
In my case, it would mean not having to allocate and maintain 7M dictionaries. I didn't do any memory profiling, but watching the working set seemed to indicate significantly lower memory usage.
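To make the threshold idea concrete, here is a minimal sketch of what such a hybrid index could look like (the names and the threshold value are my own, not driver code):

// Hybrid index: linear scan for small documents, switch to a Dictionary
// only once the element count crosses a threshold.
internal class HybridIndex
{
    private const int Threshold = 16;          // hypothetical cut-over point
    private readonly List<BsonElement> _list;  // shared with the document's _elements
    private Dictionary<string, int> _dict;     // allocated lazily

    public HybridIndex( List<BsonElement> list ) { _list = list; }

    public void Add( string name, int index )
    {
        if( _dict == null && _list.Count > Threshold )
        {
            // Build the dictionary once, then keep it in sync from here on.
            _dict = new Dictionary<string, int>( _list.Count * 2 );
            for( int i = 0; i < _list.Count; i++ )
                _dict[_list[i].Name] = i;
        }
        if( _dict != null ) _dict[name] = index;
    }

    public bool TryGetValue( string name, out int index )
    {
        if( _dict != null ) return _dict.TryGetValue( name, out index );
        for( int i = 0; i < _list.Count; i++ )
        {
            if( _list[i].Name.Equals( name ) ) { index = i; return true; }
        }
        index = 0;
        return false;
    }

    public bool ContainsKey( string name )
    {
        int ignored;
        return TryGetValue( name, out ignored );
    }

    public void Clear( ) { _dict = null; }
}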
In any case, I was not aware that using the server GC could result in such a dramatic performance increase, so that was a valuable lesson :-)
For reference, here is my test code.
var client = new MongoClient( connectionString );
var srv = client.GetServer( );
var mongo = srv.GetDatabase( "xxxxxx" );
var seriesColl = mongo.GetCollection<BsonDocument>( "series" );
var cursors = seriesColl.ParallelScan( new ParallelScanArgs
{
    BatchSize = 1,
    NumberOfCursors = 5,
} );
var t = new Stopwatch( );
t.Start( );
// One task per cursor returned by ParallelScan.
var tasks = cursors
    .Select( cursor => Task.Factory.StartNew( ( c ) =>
    {
        using( c as IDisposable )
        {
            var l = new List<object>( 1024 );
            // _stuff is a shared collection field declared elsewhere; it keeps the
            // loaded documents alive so the final working-set number is meaningful.
            lock( _stuff )
                _stuff.Add( l );
            Console.WriteLine( ">{0}", Thread.CurrentThread.ManagedThreadId );
            int i = 0;
            var en = c as IEnumerator;
            while( en.MoveNext( ) )
            {
                i++;
                l.Add( en.Current );
            }
            Console.WriteLine( "<{0} : {1}", Thread.CurrentThread.ManagedThreadId, i );
        }
    }, cursor ) )
    .ToArray( );
Task.WaitAll( tasks );
t.Stop( );
GC.Collect();
Console.WriteLine( "\r\n{0:0.0}\r\n{1:0,0} bytes", t.Elapsed.TotalSeconds, Process.GetCurrentProcess().WorkingSet64 );
What about using something better suited to time series, like kdb? It's a commercial product, but it's insanely quick since it's optimised for time series and is column-based. Most of the investment banks use it.
Thanks so much for the in-depth analysis and solution. We recently pushed this change to the master branch, which will be in our next major release: https://jira.mongodb.
No problem. Sounds like there is something to look forward to in the next version! Is there an expected release date?