I have a batch job that iterates over a files collection, processes each file, and sets features on it.
Currently I model this by adding the ObjectIds of all of a file's features to a "features" field on the file document. See below.
I need to be able to support "show files for a feature" and "show features in a file" types of queries. I have an index on the "features" field in the files collection. Once created, these records are never updated; they can only be removed altogether.
I have two questions:
- Is there a better way to model this linkage to better support insert throughput? I have been thinking about a separate collection (a la a "join" table in SQL) into which I insert only file and feature id pairs. But that collection could get huge, so I think my current mechanism may be better.
- Is it possible to pre-allocate "larger" file records so there are no relocations when "features" is written to the file document?
File document:
{
    "_id" : ObjectId("53adce7ce4b0b48e4ef5
    "volume" : ObjectId("53875f2de4b01ccaf406
    "features" : [
        ObjectId("53adcf8d7c60a444f7a8
        ObjectId("53adcf6b7c60a444f7a8
    ],
    ... other fields
}
Feature document:
{
    "_id" : ObjectId("53adcf8d7c60a444f7a8
    "volume" : ObjectId("53875f2de4b01ccaf406
    ... other fields
}
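Given the schema above, the two queries could look like this in the mongo shell (a sketch; `fileId` and `featureId` stand for ObjectIds you already hold, and the collection names are assumed to be "files" and "features"):

```javascript
// "Show features in a file": one lookup by _id, projecting the array.
db.files.findOne({ _id: fileId }, { features: 1 });

// "Show files for a feature": relies on the index on the "features" array.
db.files.find({ features: featureId });
```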
AFAIK, there is no relocation of a document at the time it is first written. Relocation only happens when you update a document and it ends up larger than its allocated space.
It sounds to me like you might want to embed the file ids in the feature documents too; that is what will make the "files per feature" query faster. If you know the document won't grow past 16 MB, you should be fine to do this.
How is the data in each collection inserted? Why do you feel the inserts need to be faster?
All my files are inserted first without the "features". Then, as part of later batch processing, I compute the "features" and update ($set) them on the file document. Because this update happens after the insertion, I fear the update may require a relocation, hence my question. I feel that if I can avoid this relocation, my batch job (and update performance) would be much better.
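The pattern described above might be sketched like this in the mongo shell (variable names like `fileId`, `volumeId`, and `computedFeatureIds` are placeholders, not from the original post):

```javascript
// Step 1: the file is inserted without a "features" field.
db.files.insert({ _id: fileId, volume: volumeId /* , other fields */ });

// Step 2 (later, in the batch job): $set grows the document by the size
// of the whole array, which is what can force a relocation if the record
// was not allocated with enough padding.
db.files.update(
    { _id: fileId },
    { $set: { features: computedFeatureIds } }  // array of feature ObjectIds
);
```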
To add file ids to the feature document, one option is "$addToSet". But I know that its performance degrades with a large number of entries (either existing or new). I want to avoid updating both sides if I can, because my batch job's performance is more important than the query performance.
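For reference, the $addToSet variant being discussed would look something like this (a sketch; the "files" array on the feature document is hypothetical, since the posted schema doesn't have it):

```javascript
// $addToSet must check the existing array for duplicates on every update,
// which is why it slows down as the array grows. If the batch job already
// guarantees each (file, feature) pair is written once, $push would skip
// that duplicate scan.
db.features.update(
    { _id: featureId },
    { $addToSet: { files: fileId } }
);
```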
Ok. Your schema shows a one-to-many relationship between files and features, not a many-to-many relationship. But then you say you want to update the file ids in the feature document, which your schema doesn't show. So you do want the file ids in the feature documents?
At any rate, you could theoretically pre-fill the "features" and "files" embedded arrays with "dummy" data to make sure the document size is correct, before you update the document with the proper ids. That would avoid the document-movement issue.
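The pre-filling idea could be sketched as follows (mongo shell; assumes the feature count per file is known before the insert, which the original post doesn't confirm):

```javascript
// Insert the file with a dummy array of the final size, so the record is
// allocated large enough from the start.
var placeholder = [];
for (var i = 0; i < expectedFeatureCount; i++) {
    placeholder.push(ObjectId());  // dummy ids, same size as the real ones
}
db.files.insert({ _id: fileId, volume: volumeId, features: placeholder });

// Later, the batch job overwrites the array with the real ids. Since the
// replacement is the same size, the document does not grow and should not
// be relocated.
db.files.update({ _id: fileId }, { $set: { features: computedFeatureIds } });
```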
Scott is right: you need to consider both directions. There are several ways to model such relationships, but as soon as you say you have 4-5 digits' worth of features per file (that's thousands or tens of thousands?), it seems a little crazy to embed them.
Once you say 'I have an index on the "features" field in files collection', i.e. you propose to have an indexed array with thousands of members in each document, that is THE recipe for trouble. Do *not* use this schema.
What is ultimately the use case for this? You say "show files for a feature" and "show features in a file" types of queries. How fast do the queries have to be? How fast do updates/writes have to be? You can maintain two collections, files-with-features and features-with-files, but only if you do NOT index the arrays and instead use the _id (fileId or featureId) to fetch the information.
Or you can have an indexed two-way relationship between files and features. It seems a little too normalized, but I don't see why you would worry about it becoming HUGE; huge amounts of well-indexed, easily accessible data are exactly what databases are for.
More detail about the types of queries would help here.
Thanks for your reply.
My queries are run by a user-facing application on the user's demand, so I expect them to return results pretty quickly. For updates/writes I am fine with a little more delay, because the processing happens in a background batch job. However, the updates/writes cannot be too slow either, because that would slow down the job. I think what I have in my favor is that once these features are computed (and linked to files) there are no further updates, so I can do some work up front for better query performance.
Now, my total number of files (in the database) could be in the billions, and the number of features could be in the millions. I am afraid that if I go for the "two-way relationship", that collection could be really huge even though each entry would be really small. Are there any side effects you can think of?
Considering the above, I am more inclined to try your first approach. In particular, I would create "filesToFeatures" and "featuresToFiles" collections where only "fileId" and "featureId" are indexed and used as keys, and there is no index on the large array!
Please suggest any other options based on the above description of the queries.
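The proposed collections might look like this (a sketch; the collection names come from the post, but the document shapes and variable names are assumptions):

```javascript
// Each document is keyed by a single indexed id; the big array itself
// carries no index, which avoids the huge-indexed-array problem above.
db.filesToFeatures.ensureIndex({ fileId: 1 }, { unique: true });
db.featuresToFiles.ensureIndex({ featureId: 1 }, { unique: true });

db.filesToFeatures.insert({ fileId: fileId, features: featureIds });
db.featuresToFiles.insert({ featureId: featureId, files: fileIds });

// Each query becomes one indexed point lookup:
db.filesToFeatures.findOne({ fileId: fileId });       // features in a file
db.featuresToFiles.findOne({ featureId: featureId }); // files for a feature
```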