Sunday, December 14, 2014

map-reduce optimization in mongodb

How can I optimize the system.js function below?

function (userRoleMappingId, prefix) {
    // Fetch the mapping document once instead of calling findOne() twice
    // on the same _id.
    var mapping = db.clnUserRoleMapping.findOne({_id: userRoleMappingId});
    var roleId = mapping.fkRoleId;
    var companyId = mapping.fkCompanyId;

    var role_map = function () {
        emit(this.fkUserLoginId, {companyId: this.fkCompanyId, fkUserRoleMappingId: this._id, userName: "", fkRoleId: this.fkRoleId});
    };

    var login_map = function () {
        emit(this._id, {companyId: "", fkUserRoleMappingId: "", userName: this.userName, fkRoleId: ""});
    };

    // Merge the partial documents emitted by the two map functions,
    // keeping whichever side supplied a non-empty value for each field.
    var r = function (key, values) {
        var result = {companyId: "", userName: "", fkUserRoleMappingId: "", fkRoleId: ""};
        values.forEach(function (value) {
            if (value.userName !== "") { result.userName = value.userName; }
            if (value.companyId !== "") { result.companyId = value.companyId; }
            if (value.fkUserRoleMappingId !== "") { result.fkUserRoleMappingId = value.fkUserRoleMappingId; }
            if (value.fkRoleId !== "") { result.fkRoleId = value.fkRoleId; }
        });
        return result;
    };

    // Note: sort must go inside the options document; a fourth argument
    // to mapReduce() is silently ignored.
    db.clnUserRoleMapping.mapReduce(role_map, r, {out: {reduce: "tempClnRoleMapJoined"}, sort: {companyId: 1, userName: 1}});
    db.clnUserLogin.mapReduce(login_map, r, {out: {reduce: "tempClnRoleMapJoined"}, sort: {companyId: 1, userName: 1}});

    // If companyId is empty, users from all companies are returned;
    // otherwise results are restricted to the given company.
    var query = {
        'value.fkRoleId': {$ne: roleId},                  // exclude the given role
        'value.userName': {$regex: prefix, $options: "i"} // name search by prefix
    };
    if (companyId != "") {
        query['value.companyId'] = companyId;
    }
    return db.tempClnRoleMapJoined.find(query).limit(12).toArray();
}

Please help me.



You should not be using server-side functions. You especially should not be using server-side functions to call mapReduce.

I would strongly recommend one of two things: either change your schema so that it does not require joins, or switch to a database that natively supports joins.



First of all, thank you for responding.

I can't switch to another database: in the past I used SQL Server as my application database, and it was too slow for large data retrieval. That is the reason I switched to MongoDB. Is there any other way to include join functionality in MongoDB?

Map-reduce takes too long to execute over 10 million documents. How can we optimize the query to get higher performance? And how can I get the same functionality without using server-side functions? Please explain in a little more depth.



Could you please explain why it's not recommended to run server-side functions, particularly ones that call MR?



You won't get better performance with MongoDB if you don't take advantage of its strengths. And that usually means you have to use a different schema than you would for a relational database.

Frequently, the reason relational DB queries get slow as the data grows is very large tables that have to be joined together - some of the time, the correct solution is some degree of denormalization in your schema.

Denormalizing your schema from the start is the correct approach with MongoDB - you must design your schema to allow efficient (and simple) queries from the application.   MapReduce is for analysis - it's not meant for real time low latency queries.

Rather than creating your collections and data first and then struggling with how to get the data out that you need, *first* you must figure out what queries you'll be making all the time and then structure your data to address those queries.

It appears that you have a collection called "clnUserRoleMapping" and another called "clnUserLogin", and the only role the clnUserLogin collection plays in your aggregation is mapping userId to "userName" - this strongly suggests that the clnUserRoleMapping collection should just store userName along with (fk)userLoginId.

The whole point of a document database is *not* to normalize everything, and looking at the fields in your userRoleMapping collection, it looks like you're normalizing everything:

fkRoleId, 
fkCompanyId, 
fkUserLoginId, 
fkUserRoleMappingId (_id)


Having this sort of structure is very counter-productive in MongoDB. If you store the fields you need together, the code you have to write becomes much simpler.
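
For example - a minimal sketch, reusing the field names from your function - a denormalized clnUserRoleMapping document would carry the user name directly, and the whole two-pass map-reduce collapses into one indexed find():

// Hypothetical denormalized shape for clnUserRoleMapping: userName is
// copied in when the mapping is written, so reading it needs no join.
//   { _id: ..., fkUserLoginId: ..., fkCompanyId: ..., fkRoleId: ..., userName: "jsmith" }

// With that shape, the server-side function reduces to a single query
// (roleId, companyId and prefix as in your function):
db.clnUserRoleMapping.find({
    fkRoleId: {$ne: roleId},
    fkCompanyId: companyId,
    userName: {$regex: "^" + prefix, $options: "i"}
}).limit(12);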

To be honest, even taking the userName lookup out of the equation (which is all the two map-reduces seem to be doing), I'm still not sure what your code is trying to do.

Best I can tell, given some sort of id which corresponds to a single record in clnUserRoleMapping, plus a prefix (which is supposedly a string), you then return as an array the first 12 records from the "temp" collection, which has the following fields:

companyId, fkUserRoleMappingId, userName, fkRoleId --> the same as the original, just with the user name filled in!

What calls this function? Why? What are the userRoleMappingId and prefix that you pass into it? Whatever code is calling this function could just run a few simple queries to get the exact same information.

And on top of it all, at the end you have queries using $and - I'm 99% certain they should be using the $elemMatch operator... and I don't think I understand why "prefix" is being case-insensitively matched against userName...
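
(As a side note, if "prefix" really is meant as a prefix, I would expect the regex to be anchored - a sketch:)

// An anchored, case-sensitive prefix match can use an index on the field;
// an unanchored or case-insensitive regex forces a scan of every key.
db.tempClnRoleMapJoined.find({'value.userName': {$regex: "^" + prefix}}).limit(12);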

If you explain the structure and requirements of your application (this function of it, anyway), maybe we can suggest a way to do it in a simplified manner. As it is, I don't even see what you are getting from mapReduce that you couldn't get from a simple find (or several)...



> Could you please explain why it's not recommended to run server-side functions, particularly ones that call MR?
Sure!

Presumably someone is using MongoDB because they are thinking of scaling in the future. Scaling usually means sharding. When you shard, you don't have a single server any more, you have many servers. How would you expect server-side functions to scale? The answer is they can't.

On the docs page there is a note about server-side JS functions: "Do not store application logic in the database. There are performance limitations to running JavaScript inside of MongoDB. Application code also is typically most effective when it shares version control with the application itself."

It is simply the wrong approach with modern architectures - it probably was the right approach when there was one single database of record that everything connected to, and you grew it up rather than out.

At least MapReduce by itself can scale by running on each shard if you invoke it on a sharded collection, but still, being spawned in a JavaScript thread, it's always going to be an order of magnitude slower than running aggregations in MongoDB natively (using aggregation pipelines or just queries).
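
(For example - a minimal sketch of a hypothetical rollup over the OP's collection - the kind of summary people often reach for map-reduce to produce can be expressed as a native pipeline:)

// Users per company as a native aggregation pipeline: this executes in
// C++ inside the server rather than in a spawned JavaScript thread.
db.clnUserRoleMapping.aggregate([
    {$group: {_id: "$fkCompanyId", users: {$sum: 1}}},
    {$sort: {users: -1}}
]);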



Not sure, but this seems like a good use case for the mongodb-hadoop connector.



Only if there is actual processing to be done, maybe - so far, all I see in the map-reduce is a lookup of userName based on userId.



For your information:

The above server-side function is used to list the users under a specific company. As per your suggestion, normalizing the collections is not a good approach in MongoDB - correct me if I am wrong.

The JS function in the example above takes two parameters: 'userRoleMappingId' (this id maps the user to his specific role) and 'prefix' (used to search for users by typing any characters).

We have an application in which we normalized the database by creating multiple collections, but that seems to be wrong after going through your reply, so we will be creating a denormalized design.

Our application has multiple companies, each with multiple roles. Under each role there are many users, and these users are identified by their userRoleMappingId (the id that maps a user to their role within a specific company). We designed the database with different collections: rolemaster, userLogin, userRolemapping, menumaster, userMenumapping.

Could you please suggest how I can proceed with a denormalized design?



I'm not sure what "to scale server-side functions" means. I just run a query to retrieve the functions and then execute them, and that does not depend on whether the data is sharded or not.

I always thought the "do not use server-side functions" advice was because of security issues. I agree when you say it's pointless to run an MR trying to emulate a "real-time" SQL JOIN, even if you parallelize the MR operations and the data is sharded. What we sometimes do is run MR jobs to denormalize the data, because it's extracted from a relational DB (and that's unavoidable).

The aggregation framework is very powerful; we all know that. But we use MR because it allows us to run some (very complex) non-real-time SQL JOINs. That is: we first run all these operations to build some collections, and those collections can then give us the information by means of a couple of simple queries and (of course) calls to the aggregation framework, instead of 15 queries and some unreadable calculations. I should say that our MongoDB instance feeds a (web application) reporting tool that pulls the data on a daily basis, so we can afford to spend some time on preprocessing and all that in order to dramatically decrease the response time of each report. We did not migrate to MongoDB for scaling reasons; we did it because it helps us represent our data more naturally and decrease those response times. So my use case is a bit more complicated.
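
(To make that concrete, here is a minimal sketch of such a nightly pre-join job - collection and field names are hypothetical:)

// Hypothetical nightly denormalization job: roll orders up per customer
// into a reporting collection, which the reports then query directly.
var map = function () {
    emit(this.customerId, {orders: 1, total: this.amount});
};
var reduce = function (key, values) {
    var out = {orders: 0, total: 0};
    values.forEach(function (v) { out.orders += v.orders; out.total += v.total; });
    return out;
};
db.orders.mapReduce(map, reduce, {out: {replace: "dailyCustomerTotals"}});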

Is this approach still wrong? How can this be done better?



You should not attempt to do an operational JOIN inside MongoDB. Your data should be sufficiently de-normalized so that your applications do not have to JOIN data on-the-fly.

On the other hand, it's perfectly legitimate to want to do analytic JOINs inside MongoDB. If you have a large amount of data, and need to do pre-joins or other types of processing to compute various analytics (more for reporting purposes or answering one-off questions), then the map/reduce framework is perfectly acceptable.

It looks like you have an operational use case, which should never use joins. On the other hand, if you do have an analytic use case, I suggest you check out SlamData, an open source project that lets you execute SQL (including JOINs) directly inside MongoDB.




>  lets you execute SQL (including JOINs) directly inside MongoDB.

Technically, that's not possible. MongoDB does not do joins. You can simulate joins outside of MongoDB, which is what I'm guessing you're doing.



Yes, I use MR in order to perform a "pre-join" (i.e. denormalization) instead of an emulated join on the fly. I guess "emulate joins" would be more accurate than "execute joins".

OK, I understand that such an MR is completely valid and acceptable. Now, back to my original question: is it good practice to store those MR calls in server-side functions?

And uhm... I cannot distinguish between "analytical" and "operational" use cases, so I did not get that point.



>  lets you execute SQL (including JOINs) directly inside MongoDB.
>
> Technically, that's not possible. MongoDB does not do joins. You can simulate joins outside of MongoDB, which is what I'm guessing you're doing.

No, we perform the JOINs inside MongoDB by compiling to map/reduce. Not operational joins, but analytic joins, as part of bulk processing.

If you don't believe me, go ahead and download SlamData and give it a try. Or take a look at the source code which does it. :-)



An operational use case is one in which the latency requirements are extremely low (e.g. 10-100ms), typically used to power functionality inside an application (e.g. user clicks on button, display data).

A batch analytic use case is one in which latency tolerances are extremely high (hours or days for bulk analytics pre-processing).

MongoDB is not suitable for in-database JOINs in operational use cases. For operational use cases, you can do point-JOINs on the client side. However, for analytic use cases, MongoDB can perform in-database JOINs.
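
(For example, a client-side point-JOIN is just two indexed lookups - a sketch using the collections from the original post; someUserRoleMappingId is a stand-in for whatever id the application has in hand:)

// Client-side "point join": two indexed lookups, no in-database join.
var mapping = db.clnUserRoleMapping.findOne({_id: someUserRoleMappingId});
var login = db.clnUserLogin.findOne({_id: mapping.fkUserLoginId});
var joined = {
    companyId: mapping.fkCompanyId,
    fkRoleId: mapping.fkRoleId,
    userName: login.userName   // the field the "join" was for
};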



MongoDB map/reduce does not run *in* the server - it runs in a separate JavaScript thread. It has to deserialize BSON into JSON and then JSON back into BSON again. That's why it's so slow and not really recommended. It does not run in the server. You are using JS mapReduce functions to simulate joins. I don't really see much advantage to that - especially since SQL is a language that was meant to express a data model which is nothing like MongoDB's data model.



I'm afraid what you say is factually incorrect, as demonstrated by the freely available source code of MongoDB.

While map/reduce does incur the overhead of bidirectional BSON/JSON serialization, as well as the overhead of JavaScript interpretation, the command is executed in-server.

There are numerous reasons why people need to do joins inside MongoDB. Very few data sources can be denormalized entirely (in fact, this is not even possible for cyclic data structures, including most graph data!), and sometimes the decision of what data relates to other data is flexible instead of rigid.

Relying on JOINs for operational use cases is a really bad idea, unless all you need are point-JOINs, in which case you can do them client-side without noticeably affecting the latency of your application. On the other hand, escaping JOINs in analytic processing is impossible, regardless of how you choose to store your data.

Instead of warning people to never use some valuable features of MongoDB, I suggest merely guiding users toward the optimal solution to their problem. For operational use cases, that's denormalization at best, and client-side point JOINs at worst. For analytic use cases, which require batch processing of data to derive analytic value, that may involve MongoDB's mapReduce functionality, which has steadily improved over the years.



We should probably just agree to disagree, since you linked to a C++ file which shows that the server executes the MapReduce command by invoking (multiple) call-outs to JavaScript functions (which are *not* running in the server).

I'm warning people away from features of MongoDB that I feel they won't get good performance from in their use case - especially when using them would obscure the fact that their schema may not have been designed with a document database in mind.

I want people to have a good experience with MongoDB - I want their projects to use the best possible approach to get there. I don't have another horse in this race.

