Sunday, December 7, 2014

ClassCastException in findOne on Java driver 2.12.4

I'm doing a simple findOne(String), which is throwing an exception for only two documents in my collection:

java.lang.ClassCastException: com.mongodb.BasicDBObject cannot be cast to java.lang.Number

I don't see this exception or any problems when I use the mongo client to fetch the documents manually.  The documents are available here:

https://gist.github.com/rickosborne/6ac07af0a7526fcd0f91

On the off chance that it could be something hinky with one of the top-level keys, I've also extracted those as their own files in the gist.

Does anyone see anything kooky about those documents that might be causing the problem?  (Or could someone with a handy environment insert these documents and try a findOne against a source build of the library to get a more specific failure point?  I only see the exception on the findOne call.)

-R

Extra context for those curious: I'm working on some textual analysis and prediction using a very naive n-gram algorithm.  Given a sequence of words, like "I want to go to", can I predict the next word?  You've seen stuff like this in modern smartphone keyboard UIs.  I'm comparing a few different data structures and back-end storage solutions, and for this particular MongoDB implementation and data structure, I've set it up like this:
1. Split the phrase into lowercased words: [ "i", "want", "to", "go", "to" ]
2. Each word gets its own document.
3. Working backwards from each word, make each previous word a sub-object, and each successor word an entry in a special ">" sub-object.  For example:
{ "_id": "want",
  "i": {
    ">": {
      "a": 3,
      "to": 5,
      "you": 1
    }
  },
  "we": { ... },
  "you": { ... }
}
4. The numbers in the > objects are the counts of the times I've seen those words, based on ingesting a really large corpus.
5. For prediction:
  a. Fetch the document corresponding to the last word in the given sequence.  In the above example, let's say I was given "i want", so I looked up document "want".
  b. Traverse backwards as far as you can through the words in the sequence, which will each be nested sub-objects.  In the above example, I get to the "i" sub-object.
  c. When you run out of nested sub-objects, look for the > object, which will contain all of the words that have been seen after this sequence.  In the above example, I get the "i.>" path.
  d. Iterate over the > keys and return the one with the highest count.
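
For anyone who wants to poke at the structure without a database, the steps above can be roughly sketched with plain java.util.Map objects standing in for the MongoDB documents. This is only an illustration under assumptions: the class name NGramSketch, the MAX_CONTEXT cap, and keeping a ">" object at the document root (so a lone word still has successors) are all made up here and are not taken from the actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for the collection: word -> document.
final class NGramSketch {
    static final String NEXT = ">";     // key of the successor-count sub-object
    static final int MAX_CONTEXT = 10;  // assumed cap on how many previous words are nested

    // Step 1: split the phrase into lowercased words.
    static String[] tokenize(String phrase) {
        String trimmed = phrase.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Fetch or create a nested sub-object. A word that collides with a
    // reserved key ("_id" or ">") is not handled by this sketch.
    @SuppressWarnings("unchecked")
    static Map<String, Object> child(Map<String, Object> node, String key) {
        return (Map<String, Object>) node.computeIfAbsent(key, k -> new LinkedHashMap<String, Object>());
    }

    // Step 4: increment the count of `next` in this node's ">" object.
    static void bump(Map<String, Object> node, String next) {
        child(node, NEXT).merge(next, 1, (a, b) -> (Integer) a + (Integer) b);
    }

    // Steps 2 and 3: one document per word, previous words nested inside it.
    static void ingest(Map<String, Map<String, Object>> store, String phrase) {
        String[] w = tokenize(phrase);
        for (int i = 0; i + 1 < w.length; i++) {
            Map<String, Object> node = store.computeIfAbsent(w[i], k -> {
                Map<String, Object> doc = new LinkedHashMap<>();
                doc.put("_id", k);
                return doc;
            });
            String next = w[i + 1];
            bump(node, next);  // assumption: the root also tracks successors
            for (int j = i - 1; j >= 0 && i - j <= MAX_CONTEXT; j--) {
                node = child(node, w[j]);  // walk backwards through the phrase
                bump(node, next);
            }
        }
    }

    // Step 5: one lookup, then an in-memory traversal to the deepest known context.
    @SuppressWarnings("unchecked")
    static String predict(Map<String, Map<String, Object>> store, String phrase) {
        String[] w = tokenize(phrase);
        if (w.length == 0) return null;
        Map<String, Object> node = store.get(w[w.length - 1]);  // 5a
        if (node == null) return null;
        for (int j = w.length - 2; j >= 0; j--) {               // 5b
            Object deeper = node.get(w[j]);
            if (!(deeper instanceof Map)) break;
            node = (Map<String, Object>) deeper;
        }
        Object counts = node.get(NEXT);                          // 5c
        if (!(counts instanceof Map)) return null;
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Object> e : ((Map<String, Object>) counts).entrySet()) {  // 5d
            int c = ((Number) e.getValue()).intValue();
            if (c > bestCount) { best = e.getKey(); bestCount = c; }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Object>> store = new LinkedHashMap<>();
        ingest(store, "I want to go to the store");
        ingest(store, "I want to sleep");
        ingest(store, "we want a break");
        System.out.println(predict(store, "I want"));  // prints "to"
    }
}
```

With the three phrases in main, "they want" has no "they" sub-object, so the sketch falls back to the root-level counts of the "want" document. The driver's BasicDBObject also behaves as a Map<String, Object>, so the same traversal should carry over to a fetched document, though that is untested here.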

It's ridiculously storage-inefficient, but it makes prediction lightning-fast: one single lookup, and then an in-memory traversal, generally no more than 10 words deep.

I don't think any of that has any bearing on this particular bug, but I know someone would ask "wtf is up with those crazy documents?".

