December 20th, 2014
(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: email@example.com
If I wrote queries that needed both relational data and JSON in one query, then I would say that Postgres is the clear winner, but my brain does not work that way. Perhaps it is just habit, but I think of relational queries as one thing and working with documents as something else. What's more, I adopt very different work styles when working with relational data than when working with documents, so I have trouble imagining using both at the same time.
For now, I can still see a use for MongoDB. But if I were only allowed to use one technology for data persistence, I would pick Postgres.
1. Run your query with explain – to confirm that your queries are indeed indexed. There have been a couple of times where we discovered that a specific branch of a complex query was not hitting an index – so discovering that was useful.
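To make that check concrete: an `explain()` result is a tree of plan stages, and a branch that missed its index shows up as a `COLLSCAN` stage instead of an `IXSCAN`. Here is a minimal sketch that walks such a tree; the plan document is a hand-written stand-in for real server output, and the helper name `find_collscans` is my own.

```python
# Walk an explain() plan tree and collect any COLLSCAN stages.
def find_collscans(stage):
    found = []
    if stage.get("stage") == "COLLSCAN":
        found.append(stage)
    if "inputStage" in stage:
        found.extend(find_collscans(stage["inputStage"]))
    for child in stage.get("inputStages", []):
        found.extend(find_collscans(child))
    return found

# An $or query often produces one branch per clause; if any branch shows
# COLLSCAN instead of IXSCAN, that clause is not hitting an index.
# This plan is a made-up stand-in for a real explain() document.
winning_plan = {
    "stage": "SUBPLAN",
    "inputStage": {
        "stage": "OR",
        "inputStages": [
            {"stage": "IXSCAN", "indexName": "user_id_1"},
            {"stage": "COLLSCAN"},  # this branch missed the index
        ],
    },
}

print([s["stage"] for s in find_collscans(winning_plan)])  # ['COLLSCAN']
```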
2. One thought on slow reads with the aggregation framework and 1M+ documents is that you are bottlenecked on memory. Even though all your queries may be indexed – you may be hitting disk too often (and doing a lot of random I/O) which is expensive. Check out MMS – and specifically the page faults graph – and you will get an idea about how many times a second you hit disk. For my specific application – response time was very important – so we tried to keep our page faults close to zero. Is increasing the RAM in your instance a viable option? If so, this would definitely give you some breathing room.
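The number MMS graphs can also be computed by hand: `serverStatus().extra_info.page_faults` is a monotonically increasing counter, so two samples a known interval apart give you faults per second. A small sketch, with made-up sample values standing in for real server reads:

```python
# page_faults is a cumulative counter; the rate is the delta over time.
# Both samples below are made-up stand-ins for real serverStatus() reads.
def faults_per_second(sample_a, sample_b):
    dt = sample_b["ts"] - sample_a["ts"]
    return (sample_b["page_faults"] - sample_a["page_faults"]) / dt

a = {"ts": 0.0, "page_faults": 120}
b = {"ts": 10.0, "page_faults": 180}  # sampled ten seconds later

rate = faults_per_second(a, b)  # 6.0 faults/sec
print(rate)  # a sustained nonzero rate means the working set is spilling to disk
```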
3. You should try to verify whether your response is slow because the query processing time itself is high, or because you have several concurrent queries competing for system resources (CPU / memory etc). If it’s the latter – the fix is easy – just add more machines to your replica set (balance this with adding more RAM).
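One way to tell the two cases apart: the database profiler records per-operation server-side time in `millis`, so if per-query times stay flat while client-observed latency grows under load, the bottleneck is queueing for resources, not the query itself. A sketch with made-up profiler documents:

```python
from statistics import median

# Made-up system.profile documents: server-side time per operation, in ms.
quiet_hours = [{"millis": m} for m in (12, 14, 11, 13)]
busy_hours = [{"millis": m} for m in (13, 12, 15, 14)]

def typical_millis(profile_docs):
    return median(d["millis"] for d in profile_docs)

# Server-side time is flat (~12-14 ms) even when clients see slow responses:
# that points at contention/queueing, so spreading reads across more
# replica-set members helps more than rewriting the query.
print(typical_millis(quiet_hours), typical_millis(busy_hours))
```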
4. Check your lock %. My guess – from your brief description – is that this will not be an issue for you. If it is an issue – you __may__ need to shard your system – or do step 5.
5. I think the most scalable solution would be to rethink your data pipeline. Why do you need to apply complex aggregation queries on 1M+ documents? Can you build a processing pipeline that takes in your data – and builds “prepared views” for the queries you expect out of your aggregation pipeline? In my application – we used a stream processing pipeline that aggregated all the messages we got – in all the different ways we needed – and inserted them into different databases (to work around the database-level locking in mongo :( ). Each collection had exactly ONE type of query that would be executed – this would be an extremely simple query and would be indexed. This pattern is common – even outside of mongo.
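The “prepared views” idea boils down to maintaining one pre-aggregated summary document per expected query key as messages arrive – in mongo this would typically be an upsert with `$inc`. Here is a runnable sketch where a plain dict stands in for the summary collection; the field names (`user_id`, `amount`) are illustrative, not from the original discussion.

```python
# Sketch of a "prepared view": instead of running a big aggregation at read
# time, every incoming message increments a per-key summary document.
prepared_view = {}  # key -> {"count": n, "total": t}, standing in for a collection

def ingest(message):
    key = message["user_id"]  # the ONE query key this view serves
    doc = prepared_view.setdefault(key, {"count": 0, "total": 0})
    doc["count"] += 1
    doc["total"] += message["amount"]

for msg in [{"user_id": "u1", "amount": 5},
            {"user_id": "u2", "amount": 3},
            {"user_id": "u1", "amount": 7}]:
    ingest(msg)

# The read side is now a trivial indexed point lookup, not an aggregation:
print(prepared_view["u1"])  # {'count': 2, 'total': 12}
```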