Mongodb User Forum News: Clarification on MD5 in GridFS

There are certain aspects about the MD5 hash that are quite difficult to understand from reading the docs and the driver source.

First and probably most importantly: what is the actual use of the MD5 hash? It gets written when a whole file is persisted; however, judging from the Java driver, it is not really used when reading the file. Is it used by MongoDB internally? If so, when, and what happens if it is missing?

I have come across some third-party implementations that provide features the MongoDB driver is missing; however, they don't store the MD5. I was wondering whether this may lead to any conflicts (now or in the future).

Often only a few chunks of a file are needed. In those cases there does not seem to be any good use for the MD5 hash, unless it is used internally.

Lastly, the MD5 hash prevents uploading files to a web server in chunks and/or resuming failed uploads (unless that is handled entirely by the application server, which may introduce quite some overhead). The only workaround I can think of (while maintaining the MD5) is that the chunk uploaded last triggers the calculation of the MD5. However, what is the use of the MD5 then, if we are only calculating a hash over what has already been persisted? To check whether MongoDB lost any chunks over time? That does not seem very likely...

Interestingly, the MD5 is not listed as optional in the manual [1], but this post omits the MD5 entirely when explaining the contents of the files collection [2]. It does mention that the MD5 is compared upon insert; from looking at the Java driver, though, this seems to be the responsibility of the application using the driver.

[1] http://docs.mongodb.org/manual/reference/gridfs/

[2] https://www.mongodb.com/blog/post/building-mongodb-applications-binary-files-using-gridfs-part-2

I agree that the documentation around MD5 isn't the clearest. I'll do what I can to clarify things.

The role of the MD5 is to ensure that the file chunks that reached the server are the same as the file being uploaded. This is almost entirely a driver-side concern; the server only helps by calculating the MD5 on demand for what it has received.

Conceptually, it works like this:

1. the driver uploads chunks to the database using the same file_id
2. the driver asks the database for the MD5 of the ordered chunks with that file_id
3. the driver compares the database MD5 to the local MD5 (and indicates an error if they don't match)
4. the driver uploads the files document (at which point the file is "visible" in the GridFS collection)
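
The four steps above can be sketched roughly as follows. This is only an in-memory illustration, not any driver's actual API: the lists `chunks` and `files` stand in for the chunks and files collections, and `server_md5` is a hypothetical stand-in for the server-side command (to my knowledge called `filemd5`). The tiny `CHUNK_SIZE` is an assumption chosen to keep the example readable.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny on purpose; real GridFS chunks are much larger

# In-memory stand-ins for the chunks and files collections.
chunks = []  # each entry: {"files_id", "n", "data"}
files = []   # each entry: {"_id", "length", "chunkSize", "md5"}


def server_md5(file_id):
    """Hypothetical stand-in for the server command: hash chunks in order of n."""
    ordered = sorted((c for c in chunks if c["files_id"] == file_id),
                     key=lambda c: c["n"])
    digest = hashlib.md5()
    for chunk in ordered:
        digest.update(chunk["data"])
    return digest.hexdigest()


def upload(file_id, data):
    # 1. upload the chunks under one file_id
    for n, start in enumerate(range(0, len(data), CHUNK_SIZE)):
        chunks.append({"files_id": file_id, "n": n,
                       "data": data[start:start + CHUNK_SIZE]})
    # 2. ask the "server" for the MD5 of what it received
    remote = server_md5(file_id)
    # 3. compare it with the locally computed MD5
    local = hashlib.md5(data).hexdigest()
    if remote != local:
        raise IOError("MD5 mismatch: stored chunks differ from the source")
    # 4. only now write the files document, which makes the file visible
    files.append({"_id": file_id, "length": len(data),
                  "chunkSize": CHUNK_SIZE, "md5": local})
    return local
```

Because the files document is written last, a failed or mismatching upload leaves only orphaned chunks behind and never a visible but corrupt file.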

The convention is to include the 'md5' field in the uploaded files document, but nothing on the server side enforces this and the server is not "triggered" to calculate it. The MD5 is calculated with an ordinary database command, on request.

How the stored MD5 is used is up to the application. It could check the hash after downloading the chunks to verify them, it could use it to "fsck" the GridFS files, and so on.
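
As a rough illustration of the "fsck" idea, the sketch below recomputes the MD5 over each file's chunks and compares it with the stored `md5` field. It works on plain lists of dicts shaped like files and chunks documents; the function names `md5_of_chunks` and `fsck` are made up for this example and are not part of any driver.

```python
import hashlib


def md5_of_chunks(chunk_docs):
    """Hash chunk payloads in ascending order of their sequence number n."""
    digest = hashlib.md5()
    for doc in sorted(chunk_docs, key=lambda d: d["n"]):
        digest.update(doc["data"])
    return digest.hexdigest()


def fsck(file_docs, chunk_docs):
    """Return the _id of every file whose chunks no longer match its stored md5."""
    damaged = []
    for f in file_docs:
        if "md5" not in f:
            continue  # no hash was stored, so there is nothing to verify against
        own = [c for c in chunk_docs if c["files_id"] == f["_id"]]
        if md5_of_chunks(own) != f["md5"]:
            damaged.append(f["_id"])
    return damaged
```

A file stored without an `md5` field is simply skipped here, which matches the point that omitting the hash costs only the ability to verify later.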

That said, different drivers handle the MD5 process slightly differently. Some drivers default to calculating and checking the MD5. Others won't unless you set a parameter. You have to consult the documentation for your particular driver. If the driver doesn't have what you need for resuming partial uploads, then you could always manually insert the chunks and then a files document without computing the MD5 until the upload is complete.
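
A manual, resumable upload along those lines might look like the sketch below. It is again an in-memory approximation under stated assumptions: `stored` and `files` are plain lists standing in for the two collections, `CHUNK_SIZE` is artificially small, and `missing_chunks` and `resume_upload` are hypothetical helper names.

```python
import hashlib

CHUNK_SIZE = 4  # assumption: tiny for illustration


def missing_chunks(stored, file_id, total):
    """Return the chunk numbers that have not been stored yet for file_id."""
    present = {c["n"] for c in stored if c["files_id"] == file_id}
    return [n for n in range(total) if n not in present]


def resume_upload(stored, files, file_id, data):
    total = (len(data) + CHUNK_SIZE - 1) // CHUNK_SIZE
    # Send only the chunks that an earlier, interrupted attempt did not store.
    for n in missing_chunks(stored, file_id, total):
        stored.append({"files_id": file_id, "n": n,
                       "data": data[n * CHUNK_SIZE:(n + 1) * CHUNK_SIZE]})
    # The upload is complete: hash the stored chunks in order, then publish.
    digest = hashlib.md5()
    for chunk in sorted((c for c in stored if c["files_id"] == file_id),
                        key=lambda c: c["n"]):
        digest.update(chunk["data"])
    files.append({"_id": file_id, "length": len(data),
                  "md5": digest.hexdigest()})
    return digest.hexdigest()
```

If the client also knows the MD5 of the original file, comparing it with the returned value restores the end-to-end check; without that, the hash only describes what was persisted.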

I hope that answered your questions. If not, please let me know.

Thanks a lot for your reply. So, in order to sum it up:

- actually, nothing you described requires the MD5 to be persisted; however, without it I would lose the ability to check whether a file I retrieve from the db is still the same as when it was persisted
- if I don't check and persist the MD5, it's my own loss (as I am unable to validate anything later); the database does not care about it at all
- actually, the database does not care about GridFS at all. It just seems to be a convention that is implemented by the driver (which in turn would make the two previous points pretty much obsolete)

That's more or less right. GridFS is a driver-side feature, but the server has the necessary support for doing the server-side MD5 calculation on all the chunks. That includes -- at least recently -- the ability to calculate MD5 correctly even on chunks that are split across multiple shards, which is a nice trick. :-)

Thursday, December 25, 2014
