Andre' Gauci wrote:
Also, I cannot use MapReduce due to the 16MB limitation (if you're using Inline map-reduce). Outputting MapReduce to a collection is out of the question since this needs to be both fast (and we need to make sure we don't go over the namespace limit).
Andre' Gauci wrote:After navigating through a lot of Q&A's online I've come to a conclusion that what I'm trying to achieve needs either a document mutation or the document "schema" must be re-arranged for my specific queries.
A sample document:{
"_id" : "uBERl1oijUOsE7fGm4o6RA",
"DataCollection" : [
{
"ModuleType" : "TaskDuration",
"Value" : 200.0
},
{
"ModuleType" : "TaskPriority",
"Value" : "High priority"
},
{]
"ModuleType" : "TaskName",
"Value" : "Test task 1"
}
}
The collection contains almost one thousand documents with this same schema.
What I want to achieve is the sum of task duration, grouped by task priority.
So the result would look something like this:
{
"result": [
{ _id: "High Priority", Duration: 500 },
{ _id: "Low Priority", Duration: 100 }
]
}
I'm using the latest version of Aggregation Framework (on Mongo 2.6.1).
Below is my attempt to solve this problem:
db.runCommand({
aggregate: "mycollection",
pipeline: [
{ $unwind : "$DataCollection" },
{ $match : { $or : [{ "DataCollection.ModuleType" :"TaskDuration" }, { "DataCollection.ModuleType" : "TaskPriority" }]} },
{ $project : { mui : "$DataCollection.ModuleType", v :"$DataCollection.Value" } },
{ $group : { _id: "$_id", dc: { $addToSet : "$$CURRENT" } }},
{ $group :
{
_id: { $cond: {
if: { $eq: ["$.DataCollection","TaskPriority"] },ModuleType
then: { $ifNull: [ "$.Value","Uncategorised"] },DataCollection
else: "__NOMATCH__"
} },
val: { $sum : { $cond: [{ $eq:["$.DataCollection", "TaskDuration"] },"$ModuleType.Value" , 0]DataCollection
} }
}
}
],
allowDiskUse: true
});
I'm not sure whether what I'm trying to achieve is possible or not. Should I switch my document structure to "ModuleType": "Value" ?
Yes, in that case their TaskDuration value would be summed up.
How would you go about the problem ? Would you use $redact (as I've seen that there is a new operator that could shape your document, not sure it's relevant in this case though) ?
Here's how I solved it:
> db.t99.find()
{ "_id" : ObjectId(" 537e3081658ed684758a83b9"), "DataCollection" : [ {"ModuleType" : "TaskDuration", "Value" : 200 }, { "ModuleType" :"TaskPriority", "Value" : "High priority" }, { "ModuleType" :"TaskName", "Value" : "Test task 1" } ] }
{ "_id" : ObjectId(" 537e30d0658ed684758a83ba"), "DataCollection" : [ {"ModuleType" : "TaskDuration", "Value" : 100 }, { "ModuleType" :"TaskPriority", "Value" : "High priority" }, { "ModuleType" :"TaskName", "Value" : "Test task 2" } ] }
{ "_id" : ObjectId(" 537e310f658ed684758a83bb"), "DataCollection" : [ {"ModuleType" : "TaskDuration", "Value" : 100 }, { "ModuleType" :"TaskPriority", "Value" : "Low Priority" }, { "ModuleType" : "TaskName","Value" : "Test task 3" } ] }
{ "_id" : ObjectId(" 537e312b658ed684758a83bc"), "DataCollection" : [ {"ModuleType" : "TaskDuration", "Value" : 300 }, { "ModuleType" :"TaskPriority", "Value" : "Low Priority" }, { "ModuleType" : "TaskName","Value" : "Test task 4" } ] }
>
> unwind{ "$unwind" : "$DataCollection" }
> project
{ "$project" : {
"_id" : "$_id",
"duration" : {
"$cond" : [ {"$eq" : [ "$DataCollection.ModuleType", "TaskDuration" ] },"$DataCollection.Value", null ]
},
"priority" : {
"$cond" : [ {"$eq" : [ "$DataCollection.ModuleType", "TaskPriority" ] }, "$DataCollection.Value", null ]
}
} }
> group1{ "$group" : { "_id" : "$_id", "duration" : { "$max" : "$duration" },"priority" : { "$max" : "$priority" } } }
> group2{ "$group" : { "_id" : "$priority", "duration" : { "$sum" : "$duration"} } }
> >
> db.t99.aggregate(unwind, project, group1, group2)
{ "_id" : "High priority", "duration" : 300 }
{ "_id" : "Low Priority", "duration" : 400 }I find it easiest to understand these kinds of complex aggregations if you go backwards from the result. So if you look at this output, it's easy to understand how 'group2' works:
> db.t99.aggregate(unwind, project, group1)
{ "_id" : ObjectId(" 537e310f658ed684758a83bb"), "duration" : 100,"priority" : "Low Priority" }
{ "_id" : ObjectId(" 537e312b658ed684758a83bc"), "duration" : 300,"priority" : "Low Priority" }
{ "_id" : ObjectId(" 537e30d0658ed684758a83ba"), "duration" : 100,"priority" : "High priority" }
{ "_id" : ObjectId(" 537e3081658ed684758a83b9"), "duration" : 200,"priority" : "High priority" }
>
> group2{ "$group" : { "_id" : "$priority", "duration" : { "$sum" : "$duration"} } }
Similarly, if you look at this output, it's easy to understand how 'group1' works:
> db.t99.aggregate(unwind, project)
{ "_id" : ObjectId(" 537e3081658ed684758a83b9"), "duration" : 200,"priority" : null } { "_id" : ObjectId(" 537e3081658ed684758a83b9"), "duration" : null,"priority" : "High priority" } { "_id" : ObjectId(" 537e3081658ed684758a83b9"), "duration" : null,"priority" : null } { "_id" : ObjectId(" 537e30d0658ed684758a83ba"), "duration" : 100,"priority" : null } { "_id" : ObjectId(" 537e30d0658ed684758a83ba"), "duration" : null,"priority" : "High priority" } { "_id" : ObjectId(" 537e30d0658ed684758a83ba"), "duration" : null,"priority" : null } { "_id" : ObjectId(" 537e310f658ed684758a83bb"), "duration" : 100,"priority" : null } { "_id" : ObjectId(" 537e310f658ed684758a83bb"), "duration" : null,"priority" : "Low Priority" } { "_id" : ObjectId(" 537e310f658ed684758a83bb"), "duration" : null,"priority" : null } { "_id" : ObjectId(" 537e312b658ed684758a83bc"), "duration" : 300,"priority" : null } { "_id" : ObjectId(" 537e312b658ed684758a83bc"), "duration" : null,"priority" : "Low Priority" } { "_id" : ObjectId(" 537e312b658ed684758a83bc"), "duration" : null,"priority" : null } >
> group1{ "$group" : { "_id" : "$_id", "duration" : { "$max" : "$duration" },"priority" : { "$max" : "$priority" } } }
So the tricky part is how we build the "duration" and "priority" fields in the output from "project". Lets take a look at the input and the actual operation:
> db.t99.aggregate(unwind)
{ "_id" : ObjectId(" 537e3081658ed684758a83b9"), "DataCollection" : {"ModuleType" : "TaskDuration", "Value" : 200 } }
{ "_id" : ObjectId(" 537e3081658ed684758a83b9"), "DataCollection" : {"ModuleType" : "TaskPriority", "Value" : "High priority" } }
{ "_id" : ObjectId(" 537e3081658ed684758a83b9"), "DataCollection" : {"ModuleType" : "TaskName", "Value" : "Test task 1" } }
{ "_id" : ObjectId(" 537e30d0658ed684758a83ba"), "DataCollection" : {"ModuleType" : "TaskDuration", "Value" : 100 } }
{ "_id" : ObjectId(" 537e30d0658ed684758a83ba"), "DataCollection" : {"ModuleType" : "TaskPriority", "Value" : "High priority" } }
{ "_id" : ObjectId(" 537e30d0658ed684758a83ba"), "DataCollection" : {"ModuleType" : "TaskName", "Value" : "Test task 2" } }
{ "_id" : ObjectId(" 537e310f658ed684758a83bb"), "DataCollection" : {"ModuleType" : "TaskDuration", "Value" : 100 } }
{ "_id" : ObjectId(" 537e310f658ed684758a83bb"), "DataCollection" : {"ModuleType" : "TaskPriority", "Value" : "Low Priority" } }
{ "_id" : ObjectId(" 537e310f658ed684758a83bb"), "DataCollection" : {"ModuleType" : "TaskName", "Value" : "Test task 3" } }
{ "_id" : ObjectId(" 537e312b658ed684758a83bc"), "DataCollection" : {"ModuleType" : "TaskDuration", "Value" : 300 } }
{ "_id" : ObjectId(" 537e312b658ed684758a83bc"), "DataCollection" : {"ModuleType" : "TaskPriority", "Value" : "Low Priority" } }
{ "_id" : ObjectId(" 537e312b658ed684758a83bc"), "DataCollection" : {"ModuleType" : "TaskName", "Value" : "Test task 4" } }
> project
{ "$project" : {
"_id" : "$_id",
"duration" : {
"duration" : {
"$cond" : [ {"$eq" : [ "$DataCollection.ModuleType", "TaskDuration" ] },"$DataCollection.Value", null ]
},
"priority" : {
"$cond" : [ {"$eq" : [ "$DataCollection.ModuleType", "TaskPriority" ] }, "$DataCollection.Value", null ]
}
} }
What this does is set the "duration" field to be value of "DatacCollection.Value" only if the DataCollection.ModuleType is "TaskDuration", and null otherwise, and something similar for the 'priority' field. I do this to create the output that can be re-grouped using $max in the first $group stage.
You can use this general design pattern to query any collection that structures data using a key/value schema. (For more about the key/value schema, see this link: http://askasya.com/post/
The design pattern is:
- Use $elemMatch to select the desired documents
- $unwind the array with the key/value documents
- $project out the values you want and flatten the structure, using $cond to make other values be null
- $group by the original _id, using $max to filter out the null values generated in the previous step
- Perform whatever additional grouping, or sorting you need
Let me know if you have questions.
This looks great! Tested in MongoDB shell and works like a charm!
I like the way you're using $max to remove those nulls, pretty handy trick :)!
Also, great post from Asya. Actually puts my mind at rest since I had indexing problems previously (when using version 2.4) but know it seems that it has changed in 2.6. Great!
On a side note (don't know if here's the right place to post this question), I'm trying to build the $cond using the C# driver, however, there seems to be a problem when passing null to a new BsonArray. The BsonArray excludes the "null" value and you get a $cond with two arguments, leading to an exception being thrown when querying the database.
This is my code:
return new BsonDocument {
{ _AggregateCond,
new BsonArray() { condition,BsonValue.Create(ifYes), BsonValue.Create(ifNo) }
}
};
However, it should create the following:
{ $cond : [<<condition document>>, "$<<element>>", null]}
Any ideas?
When you're using literals you can pass a document {$literal:null}
instead of just null - would that solve things?
Btw, I have a post on similar reshaping of schema:
http://www.kamsky.org/stupid-
It shows how to generate this pipeline programmatically (in the shell)
based on known fields you want to create).
And it uses a trick of using string that starts with " " to work
with $max which also avoids the use of null.
Yep, the $max trick will be pretty handy :)!
Went over to your blog, great posts! Would find most of them useful in my use case, especially the old post about grouping of dates! :)
Tried using $literal, however still getting the same problem, outputs { } instead of { $literal : null } - so the question remains, how would I build this query fragement using the C# driver (the problem is with adding that end "null" to the BsonArray):
"$cond" : [ {"$eq" : [ "$DataCollection.ModuleType", "TaskPriority" ]}, "$DataCollection.Value", null ]Should I post this in mongodb-csharp ?
Addition to my previous post
I can't really understand how you're using " skip" instead of null. When I use " skip" I get a lot of " skip" values in the result.
However, $max works well with null values.
Am I missing something ?
I can't really understand how you're using " skip" instead of null. When I use " skip" I get a lot of " skip" values in the result.
However, $max works well with null values.
Am I missing something ?
Ah, nevermind. Turns out I had to use BsonNull.Value :)
Have a great day and thanks for your help!
I was using " skip" because all my data was strings.
One quick question: is it ever the case that there is more than one document in the DataCollection[] array that contains the key/value pair "ModuleType" : "TaskDuration"? If so, that simplifies the problem a lot.
Suppose we have multiple documents containing the same pair, what would be your approach to extract the aggregated value?
Is it still doable using the Agg framework? I'm trying to "pivot" the data using $push (as I cannot use Value can contain duplicates).
댓글 없음:
댓글 쓰기