I evaluated MongoDB as a search engine, but what I didn't like was the high overhead of using indexes for words. It was about 10x slower when inserting documents containing an indexed string-array field. I got real-time index updates, but I wanted some kind of bulk index update, as the overhead was too big. And that wasn't available when I checked it about a year ago.
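To be concrete, the pattern I'm describing looks roughly like this PyMongo sketch (collection and field names are made up):

    from pymongo import MongoClient, ASCENDING

    db = MongoClient().demo
    # A multikey index: MongoDB indexes every element of the "words" array.
    db.articles.create_index([("words", ASCENDING)])

    text = "Full text search in MongoDB"
    db.articles.insert_one({
        "body": text,
        # Tokenize in the app and store the words as an array field;
        # every insert then pays for maintaining the index on this array.
        "words": sorted(set(text.lower().split())),
    })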
I don't know what the plan is, but an integrated stemmer alone isn't of much value to me. Actually, I prefer to have the stemmer in my own code, so I can tweak/update it without touching the database.
Yeah... I'm a big fan of PostgreSQL's included full-text search features, but it always seemed weird to me that they went to so much trouble to include a complex multi-language stemmer. I guess it made it easy to get started with, but I immediately hit a bunch of application-specific issues: "no, I really need these specific words (maybe the name of the company, or commonly-searched-for API function names that are all fairly similar) to not be stemmed", and "I want you to index this special syntax (maybe a Twitter @-reply or hashtag) as a single token, punctuation included". Those made me start providing pre-stemmed streams of tokens.
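For what it's worth, the "pre-stemmed stream" approach ends up looking something like this sketch (the protected word list, token regex, and use of NLTK's Snowball stemmer are just illustrative, not what I actually shipped):

    import re
    from nltk.stem.snowball import SnowballStemmer

    stem = SnowballStemmer("english").stem
    PROTECTED = {"postgres", "tsvector"}     # e.g. company or API names, kept verbatim
    TOKEN = re.compile(r"[@#]\w+|\w+")       # lets '@reply' and '#tag' survive intact

    def tokens(text):
        for tok in TOKEN.findall(text.lower()):
            if tok[0] in "@#" or tok in PROTECTED:
                yield tok                    # index as-is, punctuation and all
            else:
                yield stem(tok)

    print(list(tokens("Searching #postgres with @alice for tsvector queries")))
    # ['search', '#postgres', 'with', '@alice', 'for', 'tsvector', 'queri']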
What will the implementation details of this be? Manually stemming text and adding the values to a list, as before, was far from ideal. In fact, when I full-text index, I don't even want to have to think about stem lists, not to mention the performance penalty. How will this be used?
When you can't have Apache Lucene, because you are using PHP/RoR/Django/node.js/whatever, you can always have "basic text indexing and search".
That's bad news for users of all those PHP/RoR/Django/node etc. apps, who will never get proper on-site search functionality.
The majority of lazy devs won't go for a Solr-like solution.
My environment of choice is Python. I've used Solr before, but for a project a while back I used the pure-Python Whoosh: http://packages.python.org/Whoosh/
The intention was to quickly develop the extra search pieces we needed in Python and then port them to Solr. (For example, we needed a custom scoring mechanism, and needed to experiment with spelling errors, pronunciation equivalency, etc.) However, Whoosh turned out performant enough that I never needed to touch Solr again (XML config files always make me shudder!).
So if you use Python, I strongly recommend giving Whoosh a go, especially when starting out on a project, as you'll be more productive.
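Getting started is roughly this much code (a minimal sketch; the directory and field names are mine):

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    os.makedirs("indexdir", exist_ok=True)
    ix = create_in("indexdir", Schema(path=ID(stored=True), body=TEXT))

    with ix.writer() as writer:          # commits on exiting the block
        writer.add_document(path="/docs/a", body="pure Python full text search")

    with ix.searcher() as searcher:
        query = QueryParser("body", ix.schema).parse("python search")
        print([hit["path"] for hit in searcher.search(query)])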
This is just flat-out misinformation. You can use Lucene from pretty much any environment. In Rails it is utterly trivial to integrate Solr/Lucene; it's probably about 2 or 3 lines of code. I assume it's similar for other frameworks.
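For comparison from a non-Rails environment, it's about as short in Python with pysolr (the URL, core name, and field names here are assumptions):

    import pysolr

    # Assumes a Solr core named "mycore" with an id/title schema.
    solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=10)
    solr.add([{"id": "1", "title": "hello lucene"}], commit=True)
    print([doc["title"] for doc in solr.search("title:lucene")])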
What host doesn't let you run Java software? I did a dry run a long time ago with Solr and Ubuntu under VMware Fusion in a half-gig VM (or maybe 684 MB), and my impression is that Solr won't run well in limited memory (Sphinx works fine), but it's been a while.
You can't run Solr/Lucene properly on Google App Engine (just an example). Other software may run better in such circumstances, but its quality is questionable.
We used it only for Spanish, and it worked well enough (and fast enough). We deployed on bare metal, so no hosting provider (apart from the rack space) was in the middle. If you are doing things this "difficult", you'll need at least a VPS, of course.
Thanks to rake and brew etc., Thinking Sphinx, Sunspot, and Elasticsearch/Tire are all pretty easy with Rails if you want default indexes on English-language docs. It all gets complex quickly when you start layering on multiple search strategies and indexes: n-gram search, converting ISO Latin to ASCII, etc., not to mention the S word, "scaling", and anything near real-time index updates.
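For a taste of that complexity, here's roughly what an n-gram plus ASCII-folding analyzer looks like via the Python Elasticsearch client (index name, analyzer names, and gram sizes are all just placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.indices.create(index="docs", body={
        "settings": {"analysis": {
            "filter": {"grams": {"type": "ngram", "min_gram": 2, "max_gram": 3}},
            "analyzer": {"folded_ngrams": {
                "type": "custom",
                "tokenizer": "standard",
                # asciifolding covers the ISO-Latin-to-ASCII part,
                # the ngram filter covers partial-word matching.
                "filter": ["lowercase", "asciifolding", "grams"],
            }},
        }},
        "mappings": {"properties": {
            "title": {"type": "text", "analyzer": "folded_ngrams"},
        }},
    })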
As long as there is a better one, yes. Why not? There are Elasticsearch clients that pretty much plug into ActiveRecord, and Elasticsearch is pretty darn good.
>Personally, for a lot of use cases I prefer exact string matches over BS stem indexing.
Really? I've worked on a few search projects in different spaces (venues (aka places/stores), source code, and products), and while exact string matches are often a good signal of quality, stemming and other analyzers make huge improvements in recall (and when measuring transaction volume in A/B testing, strict string matching performed substantially worse). Certainly, if you throw out the exact-match signal (i.e. only index stemmed tokens), I've seen that result in a deterioration of quality. What sort of data do you work with?
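In miniature, "keep both signals" means something like this toy scorer (the weights are invented, purely to show the shape of it):

    from nltk.stem.snowball import SnowballStemmer

    stem = SnowballStemmer("english").stem

    def score(query, doc):
        q_tokens, d_tokens = query.lower().split(), doc.lower().split()
        exact = sum(t in d_tokens for t in q_tokens)          # exact-token hits
        d_stems = {stem(w) for w in d_tokens}
        stemmed = sum(stem(t) in d_stems for t in q_tokens)   # recall via stemming
        return 2.0 * exact + 1.0 * stemmed                    # exact hits count extra

    print(score("running shoes", "running shoe store"))      # 4.0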
Wow, it's about time! Using Postgres full-text search is okay, but I still prefer ElasticSearch. Hopefully, Mongo's implementation will circumvent the need to maintain both a persistence DB and a full-text search engine. Looking forward to test-driving it!
Interesting commit, though it's got a long way to go to match up to something like PostgreSQL (or, at the other end of the scale, MarkLogic, for extremely mature structured/unstructured + text search).
From an implementation POV, I suspect the stemmer is the (logical) next step (of many) on this long road of implementing full-text search features.