Thursday, December 25, 2014

Indexing unknown data

I am trying to decide how best to do this, and the problem seems to be with indexing. I have a set of criteria for documents that will often be used in queries. A document looks like this:

MainDocument {
    _id,
    ... many other fields ...,
    tags: [
        "field1": 10,
        "field2": 1,
        "field3": 14,
        "field4": 2,
        "field5": 0,
        ...
    ]
}

The field names are not known in advance: both the name and its numeric value are entered by the user. A query will always use several fields at once, testing the name for equality and the value for greater-than, less-than, or equality. I read in the book that this can be done as follows:

tags: [
    {n: "field1", v: 10},
    {n: "field2", v: 1},
    {n: "field3", v: 14},
    ...
]

But this is not recommended and should only be applied after verification on real data, because it greatly inflates the index. If the index is large, it may not fit in memory and everything will slow down. As I understand it, the index size will be comparable to the data itself.
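For concreteness, here is a minimal sketch in plain Python of what this `{n, v}` layout requires of queries in MongoDB. The dictionaries are not connected to a real database; the field names and the compound index spec are my assumptions. The key point is that each condition needs its own `$elemMatch`, so the name and the value are matched against the same array element:

```python
# A main document with user-defined tags stored as {n: name, v: value} pairs.
doc = {
    "_id": 1,
    "tags": [
        {"n": "field1", "v": 10},
        {"n": "field2", "v": 1},
        {"n": "field3", "v": 14},
    ],
}

# To match "field1 == 10 AND field2 > 0", each condition gets its own
# $elemMatch so that name and value are tested on the SAME array element:
query = {
    "$and": [
        {"tags": {"$elemMatch": {"n": "field1", "v": 10}}},
        {"tags": {"$elemMatch": {"n": "field2", "v": {"$gt": 0}}}},
    ]
}

# The compound multikey index on both keys would be declared (pymongo style) as:
index_spec = [("tags.n", 1), ("tags.v", 1)]
```

Every array element produces its own index entry, which is exactly why this index grows with the data.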

Would it be better, then, to put the tags into separate documents, as in an RDBMS? That is, each field gets its own document holding a reference to the main document with the rest of the data. Then the index becomes an ordinary one, and queries are done in two steps: first query the tags, collect from them the ids of the main documents, and then build the next query against the main documents themselves. That is, two queries against two collections instead of one query against one collection.
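The two-step idea can be sketched with plain Python structures (the collection layout and field names here are illustrative assumptions, not from the post):

```python
# One document per (tag, main-document) pair, like an RDBMS join table:
tags = [
    {"name": "field1", "value": 10, "doc_id": 1},
    {"name": "field2", "value": 1,  "doc_id": 1},
    {"name": "field1", "value": 3,  "doc_id": 2},
]
# The main collection, keyed by _id:
main = {1: {"_id": 1, "data": "..."}, 2: {"_id": 2, "data": "..."}}

# Query 1: tag collection -> ids of candidate main documents
ids = {t["doc_id"] for t in tags if t["name"] == "field1" and t["value"] > 5}

# Query 2: fetch the main documents by id
result = [main[i] for i in ids]
```

An ordinary compound index on `(name, value)` in the tag collection serves query 1; query 2 is a lookup by `_id`.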

Neither option feels quite right, and I cannot decide which to choose. Maybe there are still other ways?



Sometimes you can add to the query some additional filters on known fields, build the index on those, and then check the tags with a full scan, without an index on them.



Unfortunately, in my case it is the tags that select the few main documents; there is nothing special in the main documents themselves to query on. That is, I might have 100 main documents that have tags field1 and field2 with values greater or less than some number, but only 5 of those 100 actually match, so I would have to scan all 100 main documents to find just the right 5. In my case indexing these tags is mandatory, as I understand it.



Well, then you can try the approach with structures `{tags: [{name: ..., value: ...}, {name: ..., value: ...}, ...]}`, but index not `tags` itself: index either `tags.name` alone or the compound `[tags.name, tags.value]`, if the amount of data allows. With a `tags.name` index the index will be smaller than in the compound variant, but more documents will be selected, and they will then have to be filtered by the condition on `tags.value` within the query.
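A rough simulation of that trade-off in plain Python (the data is invented): an "index" on the name alone yields a candidate set, which the value condition then has to narrow down document by document:

```python
docs = [
    {"_id": 1, "tags": [{"name": "field1", "value": 10}]},
    {"_id": 2, "tags": [{"name": "field1", "value": 2}]},
    {"_id": 3, "tags": [{"name": "field2", "value": 10}]},
]

# "Index lookup" on tags.name only: every document carrying field1 is a
# candidate, regardless of its value...
candidates = [d for d in docs
              if any(t["name"] == "field1" for t in d["tags"])]

# ...and the value condition (field1 > 5) then filters the candidates:
matched = [d for d in candidates
           if any(t["name"] == "field1" and t["value"] > 5
                  for t in d["tags"])]
```

With the compound index, only the final `matched` set would be read; with the name-only index, all of `candidates` are fetched and filtered.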



This is almost exactly the variant I described at the beginning, the one that creates huge indexes, i.e. `{n: "field1", v: 10}, {n: "field2", v: 1}`. I just could not describe it as clearly as you did.
Indexing only the tag name is certainly an option, but the index probably will not be much smaller, because my values are just numbers. Query speed will drop if we index only the names, but the index size will shrink too. So far this looks like the best option.

Maybe I can somehow change the structure of the document itself? Out of inexperience I do not know how. For example, indexes on plain arrays (`[field1, field2, field3]`) are praised as a strength of Mongo: they work quickly, and the index does not grow as much as in the case of an index on sub-document structures (I do not know what else to call them :-)).

My main documents will in fact have such arrays too, and they will constantly be used together with the tag structures, even in the same query. Now, if the tag names could be stored in an array with an index built on it, and the values kept somewhere else in the document with some way to compare them, that would help. But it seems that cannot really be done. If it could, the tags from the array and from the structure could be kept in one field, and then there would be one index instead of two.



In reality, as far as I know, there is no difference between an index on plain array values and an index on a key of sub-documents inside an array: either way, each value creates a separate entry in the index.

You can even try to really separate the tags into individual documents, something like `{_id: <tag_name>, docs: [{doc_id: <ref_doc_id>, value: <tag_value_for_doc>}, ...]}`. Depending on the queries, you may not need any additional indexes at all. It can be useful to copy into these documents some fields of the main documents, so that you do not have to go to the main collection to display a list of query results, and to update those copies lazily after each change of that data in the main document. It can also be useful to keep the list of tags in the main document, but again update the collection with tag values lazily.
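A small sketch of that inverted layout in plain Python (the names and the `doc_ids_where` helper are my own illustration, not from the post):

```python
# One document per tag, with the (doc_id, value) pairs embedded:
tag_docs = {
    "field1": {"_id": "field1",
               "docs": [{"doc_id": 1, "value": 10},
                        {"doc_id": 2, "value": 2}]},
    "field2": {"_id": "field2",
               "docs": [{"doc_id": 1, "value": 1}]},
}

def doc_ids_where(tag_name, predicate):
    """All main-document ids whose value for tag_name satisfies predicate."""
    tag = tag_docs.get(tag_name)
    if tag is None:
        return set()
    return {d["doc_id"] for d in tag["docs"] if predicate(d["value"])}

# field1 > 5 AND field2 >= 1  ->  intersect the per-tag id sets:
result = (doc_ids_where("field1", lambda v: v > 5)
          & doc_ids_where("field2", lambda v: v >= 1))
```

Each tag is fetched by `_id`, so no secondary index is needed for the lookup itself; the intersection of id sets replaces the multi-condition array query.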



Thank you for an interesting option. Before, I had only seen it done the other way around: a separate document for each tag-value pair, rather than one tag with a list of main documents and values. It is just a pity that writes and queries have to touch two collections, although some of the values from the main document really could be moved into the tag document. It turns out to be one-to-many: one tag in many main documents. For updating two collections, transactions from an RDBMS would come in handy at times.



Normally in such situations a parallel process is set up that updates the dependent collection. For example, you can use some message broker to trigger tasks that update the data of a particular document.

In theory, nothing stops you from storing the values in the main document in the convenient form `{tags: {tag_1: tag_1_value, tag_2: tag_2_value, ..., tag_N: tag_N_value}}`, building a search collection from this collection, for example with Map/Reduce, and then tracking changes in tag values and updating the search collection.
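A minimal sketch of the lazy-update idea, with Python's standard `queue.Queue` standing in for a message broker (the collection shapes and function names are assumptions for illustration):

```python
import queue

main = {}            # main collection: doc_id -> document
search = {}          # dependent "search" collection: tag_name -> {doc_id: value}
broker = queue.Queue()  # stands in for a real message broker

def update_tags(doc_id, tags):
    """Write to the main collection and notify the updater, instead of
    writing to both collections in one (non-transactional) step."""
    main[doc_id] = {"tags": tags}
    broker.put(doc_id)

def drain():
    """The parallel updater: rebuild search entries for changed documents."""
    while not broker.empty():
        doc_id = broker.get()
        for name, value in main[doc_id]["tags"].items():
            search.setdefault(name, {})[doc_id] = value

update_tags(1, {"tag_1": 42, "tag_2": 7})
drain()
```

The search collection is eventually consistent: between `update_tags` and `drain`, queries may see stale tag values, which is the price of avoiding multi-collection transactions.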




What's the difference, though? The collection will not end up smaller than the index would be. For lookups to stay fast, the minimum necessary data still has to fit in memory.

