Search Query Syntax - complex query in one - exact phrase, prefix and fuzzy

Is it possible to have one RediSearch query that first searches by exact phrase, then by prefix, and finally by fuzzy match? All results should be returned and sorted by score.

Something like…

FT.SEARCH idx "@title:(one two) => { $weight: 20.0; $slop: 1; $inorder: false; $phonetic: true} | @title:(one* two*) => { $weight: 10.0; $slop: 10; $inorder: false; $phonetic: true} | @title:(%%one%% %%two%%) => { $weight: 5.0; $slop: 10; $inorder: false; $phonetic: true}"

Or should I send three different queries from the client and combine the results myself?

Or would a single fuzzy query do the job?

OK. I see a few things in this query. If you run your existing query through FT.EXPLAINCLI, there is some ambiguity that leads it to be parsed a little differently than you might imagine:

 1) UNION {
 2)   @title:INTERSECT {
 3)     @title:UNION {
 4)       @title:one
 5)       @title:+one(expanded)
 6)     }
 7)     @title:UNION {
 8)       @title:two
 9)       @title:+two(expanded)
10)     }
11)   } => { $weight: 20; $slop: 1; $inorder: false; }
12)   @title:UNION {
13)     @title:INTERSECT {
14)       @title:PREFIX{one*}
15)       @title:PREFIX{two*}
16)     } => { $weight: 10; $slop: 10; $inorder: false; }
17)     @title:INTERSECT {
18)       @title:FUZZY{one}
19)       @title:FUZZY{two}
20)     } => { $weight: 5; $slop: 10; $inorder: false; }
21)   }
22) }

If you notice, the prefix and fuzzy clauses are unioned together, apart from the “exact phrase” clause (more on that later), whereas you probably want all three clauses at the same level of a single union.

You can achieve this by explicitly wrapping each clause in parentheses:

> FT.EXPLAINCLI idx "(@title:(one two) => { $weight: 20.0; $slop: 1; $inorder: false; $phonetic: true}) | (@title:(one* two*) => { $weight: 10.0; $slop: 10; $inorder: false; $phonetic: true}) | (@title:(%%one%% %%two%%) => { $weight: 5.0; $slop: 10; $inorder: false; $phonetic: true})"
 1) UNION {
 2)   @title:INTERSECT {
 3)     @title:UNION {
 4)       @title:one
 5)       @title:+one(expanded)
 6)     }
 7)     @title:UNION {
 8)       @title:two
 9)       @title:+two(expanded)
10)     }
11)   } => { $weight: 20; $slop: 1; $inorder: false; }
12)   @title:INTERSECT {
13)     @title:PREFIX{one*}
14)     @title:PREFIX{two*}
15)   } => { $weight: 10; $slop: 10; $inorder: false; }
16)   @title:INTERSECT {
17)     @title:FUZZY{one}
18)     @title:FUZZY{two}
19)   } => { $weight: 5; $slop: 10; $inorder: false; }
20) }

Now, one other thing to consider is that you probably want to use exact phrase properly, by enclosing the terms in double quotes (one or both, depending on what you mean). Right now you’re getting independent clauses for each term.

> FT.EXPLAINCLI idx "(@title:\"one two\" => { $weight: 20.0; $slop: 1; $inorder: false; $phonetic: true}) | (@title:(one* two*) => { $weight: 10.0; $slop: 10; $inorder: false; $phonetic: true}) | (@title:(%%one%% %%two%%) => { $weight: 5.0; $slop: 10; $inorder: false; $phonetic: true})"
 1) UNION {
 2)   @title:EXACT {
 3)     @title:one
 4)     @title:two
 5)   } => { $weight: 20; $slop: 1; $inorder: false; }
 6)   @title:INTERSECT {
 7)     @title:PREFIX{one*}
 8)     @title:PREFIX{two*}
 9)   } => { $weight: 10; $slop: 10; $inorder: false; }
10)   @title:INTERSECT {
11)     @title:FUZZY{one}
12)     @title:FUZZY{two}
13)   } => { $weight: 5; $slop: 10; $inorder: false; }
14) }
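
If you end up building this query from user input in your application, a small helper keeps the parentheses consistent. This is only a sketch in Python: the function name `build_title_query` and its parameters are made up for illustration, the weights and slop values are copied from the query above, and it assumes the tokens are plain alphanumeric words (no escaping of RediSearch special characters is attempted).

```python
def build_title_query(tokens, field="title", exact_phrase=False):
    """Build one query string that unions three explicitly parenthesized clauses.

    Hypothetical helper: tokens are assumed to be plain words that need no escaping.
    """
    plain = " ".join(tokens)
    # First clause: either a quoted exact phrase or a plain intersection of terms.
    first = f'@{field}:"{plain}"' if exact_phrase else f"@{field}:({plain})"
    prefix = " ".join(f"{t}*" for t in tokens)
    fuzzy = " ".join(f"%%{t}%%" for t in tokens)
    clauses = [
        (first, 20.0, 1),
        (f"@{field}:({prefix})", 10.0, 10),
        (f"@{field}:({fuzzy})", 5.0, 10),
    ]
    # Wrapping each clause (with its attributes) in parentheses keeps all three
    # on the same level of the top-level union.
    parts = [
        f"({expr} => {{ $weight: {weight}; $slop: {slop}; $inorder: false; $phonetic: true}})"
        for expr, weight, slop in clauses
    ]
    return " | ".join(parts)


query = build_title_query(["one", "two"])
print(query)
```

The resulting string can be passed as the query argument of FT.SEARCH or FT.EXPLAINCLI to check that it parses as a single UNION.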

Now, to answer your softer question about just using fuzzy: I would test with your data and add the WITHSCORES argument to see how the scores are being resolved.
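
To compare scores across the three clauses, it can help to turn the flat reply into tuples. The sketch below assumes a decoded RESP2-style reply from FT.SEARCH with WITHSCORES and default options (total count first, then document id, score, and a flat field list per hit); the function name and the sample reply are hypothetical, not output from a real index.

```python
def parse_with_scores(reply):
    """Split a flat FT.SEARCH ... WITHSCORES reply into (doc_id, score, fields) tuples.

    Assumes the layout [total, id, score, [field, value, ...], id, score, ...].
    """
    total, rest = reply[0], reply[1:]
    results = []
    for i in range(0, len(rest), 3):
        doc_id, score, flat = rest[i], rest[i + 1], rest[i + 2]
        # The field list alternates names and values; pair them up.
        fields = dict(zip(flat[::2], flat[1::2]))
        results.append((doc_id, float(score), fields))
    return total, results


# Hypothetical reply, for illustration only.
sample = [
    2,
    "doc:1", "20", ["title", "one two three"],
    "doc:2", "5", ["title", "ones twos"],
]
total, results = parse_with_scores(sample)
for doc_id, score, fields in results:
    print(doc_id, score, fields["title"])
```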

@kyle - Thanks, the single UNION does the trick. I was not aware that it could be done with explicit clauses.

My understanding is that I do not need exact phrase. My title field has up to 11 words/tokens, and the tokens from the query don’t have to be next to each other.

I have tested the fuzzy clause on its own, but it gives results that are too random if the query tokens have just a few characters, like ‘on tw’. That is why I think it is much better to put a prefix clause (with a higher weight) before the fuzzy one, right?

I like explicit clauses as it’s very clear what’s happening without any real cost.

With regard to exact phrase, I only mention it because you wrote about it in the OP, but your query was not an exact phrase query.

Fuzzy is often what people think they need, but it is rarely what is actually required. In my opinion, don’t use it unless you know you have very specific reasons. So many sequences can match on fuzzy that it throws off the weights, which you then need to compensate for, etc. Usually not worth it.

@kyle - I use fuzzy as a last resort in the query (with UNION) and give it a low weight. My app needs to simulate a suggester/autocomplete, where searches happen as the user is typing. That is why I use exact, prefix and then fuzzy in one go with different weights. At least this is my understanding of how it should be done? Or are there more efficient ways?

@bogumil I would definitely check out the RediSearch suggestion engine (FT.SUGADD, etc.). There are good reasons to bifurcate your full text index from your suggestions. Search is really about managing signal to noise for the end users; having a separate suggestion system will help immensely in this effort.

@kyle - I have checked out FT.SUGADD etc., but it works only on prefixes. My doc can have multiple tokens, like "star wars trilogy", and my understanding is that it will not get suggested when running FT.SUGGET key "tri", but with the search query I can find it.

@bogumil Ah, so there is a trick.

If you manually do some tokenizing, the way you approach mid-query typeahead is to index partial queries and use payloads.

Consider this:

> FT.SUGADD typeahead "star wars trilogy" 1 PAYLOAD "star wars trilogy"
(integer) 1
> FT.SUGADD typeahead "wars trilogy" 1 PAYLOAD "star wars trilogy"
(integer) 2
> FT.SUGADD typeahead "trilogy" 1 PAYLOAD "star wars trilogy"
(integer) 3

Then as a user types:

> FT.SUGGET typeahead "tri" WITHPAYLOADS
1) "trilogy"
2) "star wars trilogy"
> FT.SUGGET typeahead "wa" WITHPAYLOADS
1) "wars trilogy"
2) "star wars trilogy"

It’s also possible to produce a trivial Lua script to do this tokenization for you, if you don’t want to push that complexity to your application.
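
If you would rather keep the tokenization in the application than in Lua, it could look roughly like the Python sketch below. The helper name `suffix_suggestions` is made up, it splits on whitespace only (a simplification of real tokenization), and it just builds the FT.SUGADD argument lists without talking to Redis.

```python
def suffix_suggestions(title, key="typeahead", score=1):
    """Return one FT.SUGADD argument list per token suffix of the title.

    Every suffix carries the full title as its payload, so a match in the
    middle of the title can still be mapped back to the whole string.
    """
    tokens = title.split()  # naive whitespace tokenization (assumption)
    return [
        ["FT.SUGADD", key, " ".join(tokens[i:]), score, "PAYLOAD", title]
        for i in range(len(tokens))
    ]


commands = suffix_suggestions("star wars trilogy")
for args in commands:
    print(args)
```

Each list can then be sent with whatever client you use, for example through a generic "execute command" call.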

One thing to note is that mid-search typeahead is only sometimes desirable. This is a personal opinion, but it lowers the signal-to-noise ratio in many search domains: it seems like a good idea, but with actual data it tends to be a bit of a mess. Note: this solution won’t work for multiple query tokens, so “star tri” won’t match, although I would question the UX of that pattern as well.

@kyle - That all makes sense, thanks. However, I need to support multiple query tokens (for example, when a user copy-pastes a few tokens into the search box, I would still like to return some data), and we have 15 million+ documents.

With the PAYLOAD, does Redis store just one copy of it if multiple suggestions carry the same payload? Just thinking about the memory impact.

Looking at the source, payloads would be duplicated on each suggestion (although this is a decent optimization request in my mind).

Your comment about the document cardinality does bring up another thing to think about with regard to typeahead: you don’t need to index every document. The pattern you probably want is actually based on queries rather than documents.

As the search box is used, you add items to the suggestion when the user clicks on items. So, let’s say the user types “Star” and then clicks on a result for “Star wars trilogy.” At this point you FT.SUGADD the user’s search query to the suggestion with an INCR argument. The upside to this pattern is that you are building a typeahead that actually learns based on user behaviour: if users who search for “star” actually click on “Star Trek Movies” instead of “Star Wars Trilogy,” the INCR will push the score up and subsequent searches will show the more popular completion payload.
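
As a rough sketch of that click handler in Python: the function name `click_to_sugadd` is hypothetical, the increment of 1 per click is an assumption, and the function only builds the FT.SUGADD arguments. How the payload of an already existing entry is treated when the same string is added again is worth checking against your RediSearch version.

```python
def click_to_sugadd(query, clicked_title, key="typeahead"):
    """Build FT.SUGADD arguments that bump the score of a searched-for string.

    INCR adds the given score (1 here, an assumption) to the existing entry
    instead of replacing it; the clicked title travels along as the payload.
    """
    return ["FT.SUGADD", key, query.strip(), 1, "INCR", "PAYLOAD", clicked_title]


args = click_to_sugadd("star", "Star Wars Trilogy")
print(args)
```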

Additionally, you can layer on multiple suggestions like a personalized typeahead. So, each user would have their own key based on previous searches, either in Lua, RedisGears, or your application logic. This way if one user who enjoys good sci-fi is most often looking for “Star Wars” over “Star Trek”, they wouldn’t repeatedly get the wrong one. A similar pattern can be used for time-based suggestions if you want a ‘hot’ typeahead.
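
If the layering is done in application logic, one simple way is to query the per-user key and the shared key separately and merge the two lists. The Python sketch below assumes the personal suggestions should always rank first, which is only one possible policy; the function name and the sample lists are made up.

```python
def merge_suggestions(personal, shared, limit=5):
    """Merge two suggestion lists, personal entries first, without duplicates."""
    seen = set()
    merged = []
    for suggestion in list(personal) + list(shared):
        if suggestion not in seen:
            seen.add(suggestion)
            merged.append(suggestion)
    return merged[:limit]


# Hypothetical results of two FT.SUGGET calls for the prefix "star".
personal = ["star wars trilogy"]
shared = ["star trek movies", "star wars trilogy", "stargate"]
print(merge_suggestions(personal, shared))
```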

Hope this helps.