query with dash is treated as negation

Hi, I am trying to run this query, which contains two constraints, and it’s returning unexpected results: the returned document contains neither the text term nor the tag.

“FT.SEARCH” “il1” “@title:v-storm @tags:{w\:couchride\:t\:suzuki}” “LIMIT” “0” “1” RETURN 2 title tags

  1. (integer) 172
  2. “p:114625”
  3. 1. “title”
     2. “2017 Harley-Davidson\xc2\xae VRSCF - V-Rod Muscle\xc2\xae”
     3. “tags”
     4. “t:english,t:unitedstates,w:couchride:t:dealer,w:couchride:t:used,t:image,w:couchride:t:cruisers,w:couchride:t:harley-davidson”

I suspect the “-” (dash) is throwing off the query parser:

FT.EXPLAIN “il1” “@title:v-storm @tags:{w\:couchride\:t\:suzuki}” “LIMIT” “0” “1” RETURN 2 title tags

“INTERSECT {\n @title:UNION {\n @title:v\n @title:+v(expanded)\n }\n NOT{\n INTERSECT {\n storm\n TAG:@tags {\n w:couchride:t:suzuki\n }\n }\n }\n}\n”

I have tried lots of alternatives - how do I escape the dash in the query so it’s treated as a regular character?

Thanks

Michael

Hi Michael,

Please try escaping the “-”, i.e.: “FT.SEARCH” “il1” “@title:v\-storm @tags:{w\:couchride\:t\:suzuki}” “LIMIT” “0” “1” RETURN 2 title tags

Let me know if it helped.

I tried that earlier, and it returned zero results.

“FT.SEARCH” “il1” “@title:v\-strom @tags:{w\:couchride\:t\:suzuki}” “LIMIT” “0” “1” RETURN 2 title tags

  1. (integer) 0

I think the double escaping works in terms of parsing the query correctly:

“FT.EXPLAIN” “il1” “@title:v\-strom @tags:{w\:couchride\:t\:suzuki}” “LIMIT” “0” “1” RETURN 2 title tags

“INTERSECT {\n @title:UNION {\n @title:v-strom\n @title:+v-strom(expanded)\n }\n TAG:@tags {\n w:couchride:t:suzuki\n }\n}\n”

This is the same explanation I get if I specify a non-dashed word in the title field of the query.

I suspect the problem might be that the indexer does not treat “v-strom” as one word?

Michael

I see the problem: at the indexing stage, v-storm was split into ‘v’ and ‘storm’, so you need to escape the dash at the indexing stage as well. See the following example:

127.0.0.1:6379> FT.ADD idx doc1 1.0 FIELDS text test-1

OK

127.0.0.1:6379> FT.SEARCH idx test-1

  1. (integer) 0

127.0.0.1:6379> FT.ADD idx doc3 1.0 FIELDS text test\-1

OK

127.0.0.1:6379> FT.SEARCH idx test\-1

  1. (integer) 1
  2. “doc3”
  3. 1. “text”
     2. “test\-1”

The issue in this case is that the escaping appears in the value and I am not sure this is the expected behaviour (will open an issue on that).
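A minimal Python sketch of the workaround described above: the same escaping is applied to the value before it is indexed and to the term before it is queried. The helper name `escape_term` and the `SEPARATORS` list are assumptions for illustration, not part of RediSearch; check the list against the tokenizer of the version you run.

```python
# Assumption: punctuation the tokenizer may split on. Adjust for your version.
SEPARATORS = ",.<>{}[]\"':;!@#$%^&*()-+=~"


def escape_term(text: str) -> str:
    """Put a backslash in front of every character listed in SEPARATORS."""
    return "".join("\\" + ch if ch in SEPARATORS else ch for ch in text)


# The same helper is used on both sides, so the indexed token and the
# query term end up identical.
indexed_value = escape_term("test-1")  # value passed to FT.ADD
query_term = escape_term("test-1")     # term passed to FT.SEARCH
print(indexed_value, query_term)
```

Note that the stored value then contains the backslash, which is the side effect discussed in this thread.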

Thanks for the explanation Meir. There are a couple of problems with this approach (as you note yourself):

a. It doesn’t work with FT.ADDHASH. I now have to parse, escape and specify in the command every field I want to be indexed.

b. The returned text is “test\-1”, so I am guessing I will have to also un-escape escaped characters when I read the results back?

Having to escape and unescape the actual content is unexpected from a developer's point of view, whereas escaping the query is a reasonable expectation, since you need to cater for special characters in the query language.
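To illustrate the extra read-side step this workaround implies, here is a small Python sketch that strips the backslashes from a returned value. The function name `unescape_value` is hypothetical, and the sketch assumes the original content contained no backslashes of its own.

```python
import re


def unescape_value(text: str) -> str:
    """Drop each backslash and keep the character that follows it."""
    return re.sub(r"\\(.)", r"\1", text)


# A value returned by the index after it was stored in escaped form.
print(unescape_value("test\\-1"))
```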

If you do open a bug, can you please paste the link here so I can follow?

Thanks again

Michael

I do agree this is not the best approach to handle this issue.

We talked about it today and we are planning to allow customising the tokenisation chars (and possibly allow more tokenisation methods), so you can tell the tokeniser not to split on ‘-’.

But this will only be possible in future releases. Until then you will have to settle for the suggested workaround.

Tracking issue: https://github.com/RedisLabsModules/RediSearch/issues/478

Hi, it looks like this issue may not have seen any movement, so I wanted to ask what the plan is and whether my understanding is correct. Just to clarify: there is currently no way to search for hyphenated values unless the hyphens have been escaped in the data, because tokenization does not add the complete string as a term, and there is also no way to specify that hyphens should not be separators, not even at index creation time. Is that correct?

So given:

127.0.0.1:20000> ft.add index doc1 1.0 fields field1 some-value

OK

127.0.0.1:20000> hgetall doc1

  1. “field1”

  2. “some-value”

It is impossible to match on “some-value”, regardless of query escaping.

Is it on the agenda to change how this works or is the recommendation still to add escaping to the data, and then do escaping in the query as well?

In the future we may add options to stop treating certain characters as separators, so that they are kept as part of the term. It is a non-trivial feature to add, but it is most definitely on the roadmap.

Hey Mark, OK, thank you for the update.

Hi, 2 years later, is there any progress on this? This is causing us issues because we’re going to be storing and searching UUIDs/GUIDs. Those values may have hyphens in them (Java UUIDs always do). We’ve decided that we’re going to have to Base64 encode those values for now, but that’s extra processing we’d much rather avoid. These reads and searches are going to be happening many times per second.
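For readers weighing the same trade-off, a rough Python sketch of the Base64 round trip mentioned above. The function names are hypothetical. The standard Base64 alphabet contains no hyphen but does contain `+`, `/` and `=`; whether the tokenizer splits on those is not verified here and is worth checking before relying on this.

```python
import base64
import uuid


def encode_uuid(value: uuid.UUID) -> str:
    """Encode the 16 raw bytes of a UUID as standard Base64 text."""
    return base64.b64encode(value.bytes).decode("ascii")


def decode_uuid(token: str) -> uuid.UUID:
    """Rebuild the UUID from its Base64 form."""
    return uuid.UUID(bytes=base64.b64decode(token))


original = uuid.UUID("123e4567-e89b-12d3-a456-426614174000")
token = encode_uuid(original)
print(token, decode_uuid(token))
```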

I am not aware of any changes. You can also check issue #478, “Escaping appears on document values”, in the RediSearch/RediSearch repository on GitHub.

On our side, we write a second field in the Redis hash that contains the same text but without the offending characters, and then search on that. Yes, it doubles the size needed for storage :(
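A minimal Python sketch of that second-field approach, under stated assumptions: the `_search` suffix, the function name, and the rule "drop everything that is not a word character or whitespace" are illustrative choices, not what we necessarily run in production.

```python
import re


def with_search_copy(fields: dict, source: str, suffix: str = "_search") -> dict:
    """Return the hash fields plus a cleaned copy of one field for searching."""
    # Remove punctuation such as '-' so the tokenizer has nothing to split on.
    cleaned = re.sub(r"[^\w\s]", "", fields[source])
    return {**fields, source + suffix: cleaned}


doc = with_search_copy({"title": "Suzuki V-Strom 650"}, "title")
print(doc)
```

The query term has to be cleaned with the same rule (so `v-strom` becomes `vstrom`) and pointed at the copied field rather than the original one.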

If you are going to use TAG instead of TEXT though (e.g. for IDs), you might be able to just escape the request instead of having to do any custom processing. I would try that first.
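As a sketch of what "just escape the request" could look like for a TAG field, the Python helper below builds the query clause and leaves the stored value untouched. The helper name `tag_query` and the field name `id` are hypothetical, and escaping every non-alphanumeric character is a conservative assumption rather than a documented requirement.

```python
def tag_query(field: str, value: str) -> str:
    """Build an @field:{value} clause with non-alphanumerics backslash-escaped."""
    escaped = "".join(ch if ch.isalnum() else "\\" + ch for ch in value)
    return "@" + field + ":{" + escaped + "}"


# Produces the same clause shape as the @tags query earlier in this thread.
print(tag_query("tags", "w:couchride:t:suzuki"))
print(tag_query("id", "123e4567-e89b-12d3-a456-426614174000"))
```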

OK, thanks Michael. I appreciate the response.