I evaluated MongoDB as a search engine, but what I didn't like was the high overhead of using indexes for words. It was about 10x slower when inserting documents containing an indexed string-array field. I got real-time index updates, but I wanted some kind of bulk index update, as the overhead was too big. And that wasn't available when I checked it about a year ago.
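To be concrete, the pattern I'm describing looks roughly like this PyMongo sketch (collection and field names are made up):

    from pymongo import MongoClient, ASCENDING

    db = MongoClient().demo
    # A multikey index: MongoDB indexes every element of the "words" array.
    db.articles.create_index([("words", ASCENDING)])

    text = "Full text search in MongoDB"
    db.articles.insert_one({
        "body": text,
        # Tokenize in the app and store the words as an array field;
        # every insert then pays for maintaining the index on this array.
        "words": sorted(set(text.lower().split())),
    })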
I don't know what the plan is, but an integrated stemmer alone isn't of much value to me. Actually, I prefer to have the stemmer in my own code, so I can tweak/update it without touching the database.
Yeah... I'm a big fan of PostgreSQL's included full-text search features, but it always seemed weird to me that they went to so much trouble to include a complex multi-language stemmer. I guess it made it easy to get started with, but I immediately hit a bunch of application-specific issues: "no, I really need these specific words (maybe the name of the company, or commonly-searched-for API function names that are all fairly similar) to not be stemmed", and "I want you to index this special syntax (maybe a Twitter @-reply or hashtag) as a single token, punctuation included". Those made me start providing pre-stemmed streams of tokens.
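For what it's worth, the "pre-stemmed stream" approach ends up looking something like this sketch (the protected word list, token regex, and use of NLTK's Snowball stemmer are just illustrative, not what I actually shipped):

    import re
    from nltk.stem.snowball import SnowballStemmer

    stem = SnowballStemmer("english").stem
    PROTECTED = {"postgres", "tsvector"}     # e.g. company or API names, kept verbatim
    TOKEN = re.compile(r"[@#]\w+|\w+")       # lets '@reply' and '#tag' survive intact

    def tokens(text):
        for tok in TOKEN.findall(text.lower()):
            if tok[0] in "@#" or tok in PROTECTED:
                yield tok                    # index as-is, punctuation and all
            else:
                yield stem(tok)

    print(list(tokens("Searching #postgres with @alice for tsvector queries")))
    # ['search', '#postgres', 'with', '@alice', 'for', 'tsvector', 'queri']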
What will the implementation details of this be? Manually stemming text and adding the values to a list, as before, was far from ideal. In fact, when I full-text index, I don't even want to have to think about stem lists, not to mention the performance penalty. How will this be used?
When you can't have Apache Lucene, because you are using PHP/RoR/Django/node.js/whatever, you can always have "basic text indexing and search".
That's bad news for users of all those PHP/RoR/Django/node etc. apps, who will never get proper on-site search functionality.
The majority of lazy devs won't go for a Solr-like solution.
My environment of choice is Python. I've used Solr before, but for a project a while back I used the pure-Python Whoosh: http://packages.python.org/Whoosh/
The intention was to quickly develop the extra search pieces we needed in Python and then port them to Solr. (For example, we needed a custom scoring mechanism, and needed to experiment with spelling errors, pronunciation equivalency, etc.) However, Whoosh turned out performant enough that I never needed to touch Solr again (XML config files always make me shudder!).
So if you use Python, I strongly recommend giving Whoosh a go, especially when starting out on a project, as you'll be more productive.
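Getting started is roughly this much code (a minimal sketch; the directory and field names are mine):

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, TEXT, ID
    from whoosh.qparser import QueryParser

    os.makedirs("indexdir", exist_ok=True)
    ix = create_in("indexdir", Schema(path=ID(stored=True), body=TEXT))

    with ix.writer() as writer:          # commits on exiting the block
        writer.add_document(path="/docs/a", body="pure Python full text search")

    with ix.searcher() as searcher:
        query = QueryParser("body", ix.schema).parse("python search")
        print([hit["path"] for hit in searcher.search(query)])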
This is just flat-out misinformation. You can use Lucene from pretty much any environment. In Rails it is utterly trivial to integrate Solr/Lucene; it's probably about 2 or 3 lines of code. I assume it's similar for other frameworks.
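For comparison from a non-Rails environment, it's about as short in Python with pysolr (the URL, core name, and field names here are assumptions):

    import pysolr

    # Assumes a Solr core named "mycore" with an id/title schema.
    solr = pysolr.Solr("http://localhost:8983/solr/mycore", timeout=10)
    solr.add([{"id": "1", "title": "hello lucene"}], commit=True)
    print([doc["title"] for doc in solr.search("title:lucene")])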
What host doesn't let you run Java software? I did a dry run a long time ago with Solr and Ubuntu under VMware Fusion in a half-gig VM (or maybe 684 MB), and my impression is that Solr won't run well in limited memory (Sphinx works fine), but it's been a while.
You can't run Solr/Lucene properly on Google App Engine (just an example). Other software may run better in such circumstances, but its quality is questionable.
We used it only for Spanish, and it worked well enough (and fast enough). We deployed on bare metal, so no hosting provider (apart from the rack space) was in the middle. If you are doing things this "difficult", you'll need at least a VPS, of course.
Thanks to rake and brew etc., Thinking Sphinx, Sunspot, and Elasticsearch/Tire are all pretty easy with Rails if you want default indexes on English-language docs. It all gets complex quickly when you start layering on multiple search strategies and indexes: n-gram search, converting ISO Latin to ASCII, etc., not to mention the S word, "scaling", and anything near real-time index updates.
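For a taste of that complexity, here's roughly what an n-gram plus ASCII-folding analyzer looks like via the Python Elasticsearch client (index name, analyzer names, and gram sizes are all just placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")
    es.indices.create(index="docs", body={
        "settings": {"analysis": {
            "filter": {"grams": {"type": "ngram", "min_gram": 2, "max_gram": 3}},
            "analyzer": {"folded_ngrams": {
                "type": "custom",
                "tokenizer": "standard",
                # asciifolding covers the ISO-Latin-to-ASCII part,
                # the ngram filter covers partial-word matching.
                "filter": ["lowercase", "asciifolding", "grams"],
            }},
        }},
        "mappings": {"properties": {
            "title": {"type": "text", "analyzer": "folded_ngrams"},
        }},
    })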
As long as there is a better one, yes. Why not? There are Elasticsearch clients that pretty much plug into ActiveRecord, and Elasticsearch is pretty darn good.
>Personally, for a lot of use cases I prefer exact string matches over BS stem indexing.
Really? I've worked on a few search projects in different spaces (venues (aka places/stores), source code, and products), and while exact string matches are often a good signal of quality, stemming and other analyzers make huge improvements in recall (and when measuring transaction volume in A/B testing, strict string matching performed substantially worse). Certainly, if you throw out the exact-match signal (i.e. only index stemmed tokens), I've seen that result in a deterioration of quality. What sort of data do you work with?
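In miniature, "keep both signals" means something like this toy scorer (the weights are invented, purely to show the shape of it):

    from nltk.stem.snowball import SnowballStemmer

    stem = SnowballStemmer("english").stem

    def score(query, doc):
        q_tokens, d_tokens = query.lower().split(), doc.lower().split()
        exact = sum(t in d_tokens for t in q_tokens)          # exact-token hits
        d_stems = {stem(w) for w in d_tokens}
        stemmed = sum(stem(t) in d_stems for t in q_tokens)   # recall via stemming
        return 2.0 * exact + 1.0 * stemmed                    # exact hits count extra

    print(score("running shoes", "running shoe store"))      # 4.0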
Wow, it's about time! Using Postgres full-text search is okay, but I still prefer ElasticSearch. Hopefully, Mongo's implementation will circumvent the need to maintain both a persistence DB and a full-text search engine. Looking forward to test-driving it!
Interesting commit, though it's got a long way to go to match up to something like PostgreSQL (or, at the other end of the scale, MarkLogic, for extremely mature structured/unstructured + text search).
From an implementation POV, I suspect the stemmer is the (logical) next step (of many) on this long road of implementing full-text search features.