December 16th, 2014
(written by lawrence krubner, however indented passages are often quotes). You can contact lawrence at: firstname.lastname@example.org
As much as I would like it to be ‘good enough’, Postgres’ full text search lacks the feature set to be a reasonable solution to most search use cases.
Background: I wrote a product search engine for a company called iTrackr using Postgres full text search, and later wrote the product search for milo.com using Solr. I also designed a simple (but very effective) algorithm to boost search rankings based on historical user click data, by generalizing click data onto the feature space of the document set (using mutual information). http://www.google.com/patents/US20140067786
So, why would I recommend against using Postgres’ full text search:
- It is not configurable. A search engine like Solr or ElasticSearch allows you to configure the tokenization pipeline, what is indexed, how it is indexed, etc. Postgres hard-codes it’s tokenization pipeline. You can turn stemming on and off, but you can’t, for example, lowercase text, strip out html characters, handle special cases for tokenization (camel cased or delimited text, etc). You can write your own tokenizer, but it has to be as a C extension. Could you get it to do what you want? Mostly, but only with an order of magnitude more work than with Solr or ElasticSearch.
- Postgres has no facility to make it possible to do faceting (histograms). This is a pretty big deal breaker for most kinds of eCommerce. I’ve seen a few attempts at this, but they are incredibly complex and slow compared to other search engines.
- Postgres does not give you much control over ranking. It has some good generic ranking methods, but those will only get you so far. (Query Amazon for ‘ipad -dkfjdkf’ to see how well it’s generic ranking function is. No generic ranking function can do very well compared to something tuned to the domain, and Postgres just isn’t flexible enough to extend in any reasonable way.
- It is not scalable. Solr and ElasticSearch are both very very easy to scale up (and out). Postgres’ built in replication is very heavy, as it has to stream all database changes to all slaves (yes there are things like SLONY but they are incredibly hard to deal with). In any real environment, you have to protect your database from load. It’s the the hardest thing to scale, and consulting it for all traffic is a surefire way to run into issues that are very hard to get back out of.
Now, with all that said, I would recommend it for a few cases:
- You just want to do SQL queries with ‘LIKE’ queries, but indexed.
- You need to do arbitrary relational constraints on top of text queries. Doing a query like “give me all the tickets closed in November by Hugh where the project was ProjectX and the ticket name has the word “nested loop” in it’s title OR it’s description OR it’s title is less than 15 characters long” is going to be more or less impossible in any other environment. If you were building something like Jira, or really anything that needs to support relational logic, Postgres’ full text search would be a great way to go.