Mongodb User Forum News: Schema design: multi-language support

There was another thread with the same subject, but my issue is a bit different. I am also looking into multi-language support of documents in MongoDB. I set out with the following schema in mind:

{
  title: {
    en: "Title in English",
    ja: "日本語のタイトル",
    fr: "Subject en Francais"
  },
  content: {
    en: "Content in English",
    ja: "日本語のコンテンツ",
    fr: "Contenu en Francais"
  }
}

I have a nice class that does that pretty much automatically, but I am running into indexing issues. I want to be able to search through texts using text indexes, and did:

db.articles.ensureIndex({ title: "text", content: "text" })

Then, when I try to find some text in my database, I don't get any results. I think this is because text indexes don't look through subdocuments. So I tried "$**" to match all fields, but that doesn't work in subdocuments either, only in the top-level document, and it would index ALL string fields. The documents I have here are rather large.
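One possible way around this, sketched under assumptions: a text index can name embedded fields by dotted path, so the per-language fields could be listed explicitly instead of indexing `title` and `content` as whole subdocuments. The helper `buildTextIndexSpec` below is hypothetical and not part of any MongoDB API; it only builds the key object one would pass to `ensureIndex`. All listed fields would share the index's default language unless a language is specified per document, and it is worth checking the list of supported text-search languages first, since not every language has stemming support.

```javascript
// Hypothetical helper: builds a text-index key spec with one dotted path
// per (field, language) pair, e.g. "title.en", "content.fr".
function buildTextIndexSpec(fields, languages) {
  const spec = {};
  for (const field of fields) {
    for (const lang of languages) {
      spec[field + "." + lang] = "text";
    }
  }
  return spec;
}

const spec = buildTextIndexSpec(["title", "content"], ["en", "fr"]);
console.log(JSON.stringify(spec));

// In the mongo shell, the spec would then be passed to the index command:
// db.articles.ensureIndex(spec)
```

The trade-off is that the index definition has to be rebuilt whenever a language is added, which may or may not be acceptable for large documents.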

Previously, I had:

{
  lang: "en",
  title: "Title",
  content: "Content"
}

and the aforementioned ensureIndex worked fine. But then I need to be able to tell the system which documents are translations of which documents.

Isn't there a way to use the <lang>: <content> schema approach for translated content, while also being able to search for text in such documents?

I'm also open to discussing implementations of multilingual content. I wrote several articles on it on my blog.

I'm no expert, but I would think you'd be better off keeping the translations in their own documents. If you need to link a group of translations back to a "master" document, add ids to the documents and simply reference them.

{
  _id: "1",
  lang: "en",
  title: "Title",
  content: "Content"
}

{
  _id: "2",
  lang: "fr",
  title: "Subject",
  content: "Contenu",
  master: "1"
}

{
  _id: "3",
  lang: "de",
  title: "Titel",
  content: "Inhalt",
  master: "1"
}
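As a rough illustration of how the `master` reference could be used, here is a minimal sketch in plain JavaScript. The helpers `translationsOf` and `versionIn` are hypothetical names introduced only for this example, and the array stands in for what a query on the collection would return.

```javascript
// Sample documents mirroring the ones above; "master" points at the original's _id.
const docs = [
  { _id: "1", lang: "en", title: "Title", content: "Content" },
  { _id: "2", lang: "fr", title: "Subject", content: "Contenu", master: "1" },
  { _id: "3", lang: "de", title: "Titel", content: "Inhalt", master: "1" }
];

// Hypothetical helper: returns the master document plus every translation of it.
// A comparable shell query would be:
// db.articles.find({ $or: [ { _id: "1" }, { master: "1" } ] })
function translationsOf(allDocs, masterId) {
  return allDocs.filter(d => d._id === masterId || d.master === masterId);
}

// Hypothetical helper: picks one language version, or null if none exists.
function versionIn(allDocs, masterId, lang) {
  const match = translationsOf(allDocs, masterId).find(d => d.lang === lang);
  return match || null;
}

console.log(versionIn(docs, "1", "de").title);
```

Because each translation is a flat document again, the original text index on `title` and `content` would apply to it unchanged.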

Check out the section on [creating a text index for multiple languages](http://docs.mongodb.org/manual/tutorial/specify-language-for-text-index/#specify-the-index-language-within-the-document) in the MongoDB Manual. If you're using version 2.6 or higher, you can specify the language for a piece of text in the same subdocument as the text and MongoDB will index the text using that language. For example, take this document in a quotes collection:

{
  _id: 1,
  language: "portuguese",
  original: "A sorte protege os audazes.",
  translation: [
    {
      language: "english",
      quote: "Fortune favors the bold."
    },
    {
      language: "spanish",
      quote: "La suerte protege a los audaces."
    }
  ]
}

If you create a text index with the command

db.quotes.ensureIndex( { original: "text", "translation.quote": "text" } )

then the value of "translation.quote" in the first array element, "Fortune favors the bold", will be indexed with English stopword and stemming rules.

How could one search for and return only the Spanish quote?

Considering the sample from Will's comment, the query for only

* Spanish language would be:

db.quotes.find({original:/sorte/},{"_id":0,"translation.quote":1})[0].translation[1]

* English language:

db.quotes.find({original:/sorte/},{"_id":0,"translation.quote":1})[0].translation[0]

Correct me if I got this wrong. Note that explain() shows a COLLSCAN, so the query is not using the defined index.

Thanks for the answer, Badi, but I was thinking in a different direction. When searching for a translation of content, which I would assume a quote would be, the user wouldn't usually know the original text. She'd be searching for something within the Spanish quote.
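One way this might be approached, as a sketch rather than a tested answer: the regex queries above do not use the text index (hence the COLLSCAN), whereas the `$text` operator, available from 2.6, does. `$text` accepts a `$language` option that selects the stemming and stop-word rules applied to the search string. It matches whole documents, though, so picking out the Spanish entry still has to happen afterwards. The `pickTranslation` helper below is a hypothetical name and does that step client-side; the sample document stands in for a query result.

```javascript
// Query object for the mongo shell: db.quotes.find(query)
// $text uses the text index; $language selects the rules for the search string.
const query = { $text: { $search: "suerte", $language: "spanish" } };

// Hypothetical helper: from a matched document, keep only the text in one language.
function pickTranslation(doc, language) {
  if (doc.language === language) {
    return doc.original;
  }
  const entry = (doc.translation || []).find(t => t.language === language);
  return entry ? entry.quote : null;
}

// The sample document from the thread, standing in for a result of find(query).
const doc = {
  _id: 1,
  language: "portuguese",
  original: "A sorte protege os audazes.",
  translation: [
    { language: "english", quote: "Fortune favors the bold." },
    { language: "spanish", quote: "La suerte protege a los audaces." }
  ]
};

console.log(pickTranslation(doc, "spanish"));
```

Looking the entry up by its `language` value also avoids relying on array positions such as `translation[1]`, which would break if the translations were stored in a different order.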

Sunday, December 28, 2014