I am currently looking at MongoDB and a few other options for performing real-time analysis on structured log data. A document can look something like the one below; the only index is on type and timestamp, as the properties are not known in advance and can be arbitrary.
{
  type: 'purchase',
  timestamp: DATE,
  properties: {
    product: 'cake'
  }
}
For my tests I loaded 10 million documents and ran a simple aggregation to count all cakes sold; the time to return this was not that impressive. I have also tried it with 2.8 using WiredTiger. What I am looking for is advice on how this can scale to billions of records, where we need to scan hundreds of millions of rows to get a result. Who is currently doing this, and what sorts of cluster configurations do they use to make it possible? Or am I barking up the wrong tree for real-time analytics with MongoDB?
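The aggregation was roughly along the following lines (a sketch; the collection name events is illustrative):

// Count all cakes sold across the whole collection.
db.events.aggregate([
  { $match: { type: 'purchase', 'properties.product': 'cake' } },
  { $group: { _id: null, count: { $sum: 1 } } }
])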
In my case I thought the 10 million documents would fit in memory and could be processed quickly, but this does not seem to be the case. Any advice would be appreciated.
In general, the fastest way to do aggregations over large datasets is to do some pre-aggregation: as the data comes in, increment various counters so that you don't have to count across all the documents for the most frequent queries.
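A minimal sketch of that idea, assuming a separate counters collection (the collection and field names here are illustrative, not part of your schema):

// For each incoming purchase, bump a per-product counter via an upsert.
// The increment values would come from the incoming event.
db.counters.update(
  { type: 'purchase', product: 'cake' },
  { $inc: { count: 1, revenue: 1.99 } },
  { upsert: true }
)

Answering "how many cakes were sold" then becomes a single-document read instead of a scan.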
For other types of aggregations, you should most definitely be using indexes. What type of aggregation did you run that didn't perform well? If you give some realistic examples, there are probably ways to see whether any indexes would help the performance.
For example, I might say "give me the total value of all sold cakes", but as product and price are inside properties, which can be an arbitrary collection of fields, indexing up front is not really possible here. The documents look like this, and the query I have in mind is sketched after the example.
{
  type: 'purchase',
  timestamp: DATE,
  properties: {
    product: 'cake',
    price: 1.99
  }
}
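That query would be something like the following (a sketch; the collection name events is illustrative), and without an index on properties.product it has to scan every purchase document:

// Total value of all cakes sold; unindexed, so this scans the collection.
db.events.aggregate([
  { $match: { type: 'purchase', 'properties.product': 'cake' } },
  { $group: { _id: null, total: { $sum: '$properties.price' } } }
])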
First of all, it *is* possible to index dynamic attributes (a large number of them) that are not known up front: store them as key-value pairs in an array and index the key-value pairs, so that whichever attribute you search for is covered by the index (see the sketch below).
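A minimal sketch of that key-value pattern, assuming illustrative collection and field names:

// Store dynamic properties as an array of { k, v } pairs.
db.events.insert({
  type: 'purchase',
  timestamp: new Date(),
  properties: [
    { k: 'product', v: 'cake' },
    { k: 'price', v: 1.99 }
  ]
})

// One compound index then covers lookups on any attribute.
db.events.ensureIndex({ 'properties.k': 1, 'properties.v': 1 })

// Query any attribute with $elemMatch so k and v match in the same element.
db.events.find({ properties: { $elemMatch: { k: 'product', v: 'cake' } } })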
But secondly, you should consider what queries require very fast real-time responses. Any that do are candidates for pre-aggregation (i.e. incrementing some counters as the new records come in).
A second possible approach, depending on the range of time you are dealing with in "timestamp", would be aggregating things into buckets. For example, at the end of each day, calculate sums of the various combinations of attributes that you expect queries on into a daily summary collection (see the sketch below). You can then do sums across days instead of across the raw data and have it be significantly faster.
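A rough sketch of that daily roll-up, assuming a daily_summaries collection and the original sub-document schema (names are illustrative):

// Roll one day of raw purchase events up into per-product summaries.
var dayStart = ISODate('2015-01-01T00:00:00Z');
var dayEnd = ISODate('2015-01-02T00:00:00Z');

db.events.aggregate([
  { $match: { type: 'purchase', timestamp: { $gte: dayStart, $lt: dayEnd } } },
  { $group: { _id: '$properties.product',
              count: { $sum: 1 },
              revenue: { $sum: '$properties.price' } } }
]).forEach(function (doc) {
  // Append one summary document per product per day.
  db.daily_summaries.insert({
    day: dayStart,
    product: doc._id,
    count: doc.count,
    revenue: doc.revenue
  });
});

// Later queries sum across the much smaller summary collection.
db.daily_summaries.aggregate([
  { $match: { product: 'cake' } },
  { $group: { _id: null, totalRevenue: { $sum: '$revenue' } } }
])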
It really depends on what queries you expect to have to satisfy and how fast.