Hi, I am trying to run this query which contains two constraints and it’s returning unexpected results: the document contains neither the text term nor the tag
I see the problem: at the indexing stage, ‘v-storm’ was split into ‘v’ and ‘storm’. You need to escape the hyphen at the indexing stage as well. See the following example:
127.0.0.1:6379> FT.ADD idx doc1 1.0 FIELDS text test-1
OK
127.0.0.1:6379> FT.SEARCH idx test\-1
(integer) 0
127.0.0.1:6379> FT.ADD idx doc3 1.0 FIELDS text test\-1
OK
127.0.0.1:6379> FT.SEARCH idx test\-1
(integer) 1
"doc3"
"text"
"test\-1"
The issue in this case is that the escaping appears in the stored value, and I am not sure this is the expected behaviour (I will open an issue on that).
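To apply this workaround consistently, the escaping can be done by a small helper before every FT.ADD. A minimal sketch in Python; the separator set below is an assumption based on common tokeniser defaults, not the authoritative list:

```python
# Characters assumed to act as token separators (check your RediSearch
# version's documentation for the authoritative set).
SEPARATORS = set(" ,.<>{}[]\"':;!@#$%^&*()-+=~")

def escape_for_index(value: str) -> str:
    """Backslash-escape separator characters so the value is indexed verbatim."""
    return "".join("\\" + ch if ch in SEPARATORS else ch for ch in value)
```

The escaped form (e.g. test\-1) is what gets stored, which is why it also comes back escaped when you read the document.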
Thanks for the explanation Meir. There are a couple of problems with this approach (as you note yourself):
a. It doesn’t work with FT.ADDHASH. I now have to parse, escape, and specify in the command every field I want indexed.
b. The returned text is "test\-1", so I am guessing I will also have to un-escape escaped characters when I read the results back?
Having to escape and unescape the actual content is unexpected from a developer’s point of view, but escaping the query is a reasonable expectation, since you need to cater for special characters in the query language.
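Reading data back then needs the inverse step. A minimal unescape sketch, assuming plain single-backslash escaping as shown in the transcript above:

```python
def unescape_field(value: str) -> str:
    """Remove the backslashes added at indexing time, keeping the escaped characters."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            # Keep the character that follows the escape, drop the backslash.
            out.append(next(chars, ""))
        else:
            out.append(ch)
    return "".join(out)
```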
If you do open a bug, can you please paste the link here so I can follow?
I do agree this is not the best approach to handle this issue.
We talked about it today, and we are planning to allow customising the tokenisation characters (and possibly to allow more tokenisation methods), so you can tell the tokeniser not to split on ‘-’.
But this will only be possible in a future release. Until then, you will have to settle for the suggested workaround.
Hi, it looks like this issue may not have seen any movement, and I wanted to check what the plan is and whether my understanding is correct. To clarify: there is currently no way to search for hyphenated values unless the hyphens were escaped in the data, because tokenization does not add the complete string as a term; and there is also no way to specify (including at index creation time) that hyphens should not be treated as separators. Is that correct?
So given:
127.0.0.1:20000> ft.add index doc1 1.0 fields field1 some-value
OK
127.0.0.1:20000> hgetall doc1
"field1"
"some-value"
It is impossible to match on “some-value”, regardless of query escaping.
Is it on the agenda to change how this works or is the recommendation still to add escaping to the data, and then do escaping in the query as well?
In the future we may add options to ignore certain characters as separators and treat them as part of the term. It is a non-trivial feature to add, but it is most definitely on the roadmap.
Hi, two years later, is there any progress on this? This is causing us issues because we’re going to be storing and searching UUIDs/GUIDs, and those values may contain hyphens (Java UUIDs always do). We’ve decided we’re going to have to Base64-encode those values for now, but that’s extra processing we’d much rather avoid. These reads and searches are going to be happening many times per second.
On our side, we write a second field into the Redis hash that contains the same text but without the offending characters, and then search on that. Yes, it doubles the storage needed.
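The second-field workaround above can be sketched like this; the searchable_ prefix and the separator list are illustrative assumptions, not anything RediSearch prescribes:

```python
import re

# Separator characters to strip from the shadow copy (assumed list;
# extend it to match your tokeniser's defaults).
SEPARATORS_RE = re.compile(r"[-.,:/@ ]")

def with_shadow_fields(fields: dict) -> dict:
    """Return the original fields plus a separator-free copy of each for searching."""
    shadow = {f"searchable_{name}": SEPARATORS_RE.sub("", value)
              for name, value in fields.items()}
    return {**fields, **shadow}
```

Both the original and the shadow field go into the same hash; queries then target the searchable_* fields while reads use the originals.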
If you are going to use TAG instead of TEXT, though (e.g. for IDs), you might be able to just escape the query instead of having to do any custom processing. I would try that first.
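With a TAG field, only the query side should need escaping, since TAG values are not tokenised on hyphens by default. A sketch of building an escaped TAG query for a hyphenated ID; the field name and the choice to escape all punctuation are assumptions for illustration:

```python
import string

def escape_tag_value(value: str) -> str:
    """Backslash-escape punctuation and spaces so the TAG query matches the stored value verbatim."""
    return "".join("\\" + ch if ch in string.punctuation + " " else ch
                   for ch in value)

def tag_query(field: str, value: str) -> str:
    """Build an @field:{value} TAG query with the value escaped."""
    return "@%s:{%s}" % (field, escape_tag_value(value))
```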