Hi, I am trying to run this query which contains two constraints and it’s returning unexpected results: the document contains neither the text term nor the tag
I see the problem: at the indexing stage, ‘v-storm’ was split into ‘v’ and ‘storm’. You need to escape the hyphen at the indexing stage as well. See the following example:
127.0.0.1:6379> FT.ADD idx doc1 1.0 FIELDS text test-1
OK
127.0.0.1:6379> FT.SEARCH idx test\-1
(integer) 0
127.0.0.1:6379> FT.ADD idx doc3 1.0 FIELDS text test\-1
OK
127.0.0.1:6379> FT.SEARCH idx test\-1
(integer) 1
"doc3"
"text"
"test\-1"
The issue in this case is that the escaping appears in the stored value, and I am not sure this is the expected behaviour (I will open an issue on that).
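To apply this workaround consistently, the escaping can be done by a small helper before every FT.ADD. A minimal sketch in Python; the separator set below is an assumption based on common tokeniser defaults, not the authoritative list:

```python
# Characters assumed to act as token separators (check your RediSearch
# version's documentation for the authoritative set).
SEPARATORS = set(" ,.<>{}[]\"':;!@#$%^&*()-+=~")

def escape_for_index(value: str) -> str:
    """Backslash-escape separator characters so the value is indexed verbatim."""
    return "".join("\\" + ch if ch in SEPARATORS else ch for ch in value)
```

The escaped form (e.g. test\-1) is what gets stored, which is why it also comes back escaped when you read the document.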
Thanks for the explanation Meir. There are a couple of problems with this approach (as you note yourself):
a. It doesn’t work with FT.ADDHASH. I now have to parse, escape, and specify in the command every field I want indexed.
b. The returned text is "test\-1", so I am guessing I will also have to un-escape escaped characters when I read the results back?
Having to escape and unescape the actual content is unexpected from a developer’s point of view, but escaping the query is a reasonable expectation, since you need to cater for special characters in the query language.
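Reading data back then needs the inverse step. A minimal unescape sketch, assuming plain single-backslash escaping as shown in the transcript above:

```python
def unescape_field(value: str) -> str:
    """Remove the backslashes added at indexing time, keeping the escaped characters."""
    out = []
    chars = iter(value)
    for ch in chars:
        if ch == "\\":
            # Keep the character that follows the escape, drop the backslash.
            out.append(next(chars, ""))
        else:
            out.append(ch)
    return "".join(out)
```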
If you do open a bug, can you please paste the link here so I can follow?
I do agree this is not the best approach to handle this issue.
We talked about it today, and we are planning to allow customising the tokenisation characters (and possibly to allow more tokenisation methods), so you can tell the tokeniser not to split on ‘-’.
But this will only be possible in a future release. Until then, you will have to settle for the suggested workaround.
Hi, it looks like this issue may not have seen any movement, and I wanted to check what the plan is and whether my understanding is correct. To clarify: there is currently no way to search for hyphenated values unless the hyphens were escaped in the data, because tokenization does not add the complete string as a term; and there is also no way to specify (including at index creation time) that hyphens should not be treated as separators. Is that correct?
So given:
127.0.0.1:20000> ft.add index doc1 1.0 fields field1 some-value
OK
127.0.0.1:20000> hgetall doc1
"field1"
"some-value"
It is impossible to match on “some-value”, regardless of query escaping.
Is it on the agenda to change how this works or is the recommendation still to add escaping to the data, and then do escaping in the query as well?
In the future we may add options to ignore certain characters as separators and treat them as part of the term. It is a non-trivial feature to add, but it is most definitely on the roadmap.
Hi, two years later, is there any progress on this? This is causing us issues because we’re going to be storing and searching UUIDs/GUIDs, and those values may contain hyphens (Java UUIDs always do). We’ve decided we’re going to have to Base64-encode those values for now, but that’s extra processing we’d much rather avoid. These reads and searches are going to be happening many times per second.
On our side, we write a second field into the Redis hash that contains the same text but without the offending characters, and then search on that. Yes, it doubles the storage needed.
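The second-field workaround above can be sketched like this; the searchable_ prefix and the separator list are illustrative assumptions, not anything RediSearch prescribes:

```python
import re

# Separator characters to strip from the shadow copy (assumed list;
# extend it to match your tokeniser's defaults).
SEPARATORS_RE = re.compile(r"[-.,:/@ ]")

def with_shadow_fields(fields: dict) -> dict:
    """Return the original fields plus a separator-free copy of each for searching."""
    shadow = {f"searchable_{name}": SEPARATORS_RE.sub("", value)
              for name, value in fields.items()}
    return {**fields, **shadow}
```

Both the original and the shadow field go into the same hash; queries then target the searchable_* fields while reads use the originals.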
If you are going to use TAG instead of TEXT, though (e.g. for IDs), you might be able to just escape the query instead of having to do any custom processing. I would try that first.
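With a TAG field, only the query side should need escaping, since TAG values are not tokenised on hyphens by default. A sketch of building an escaped TAG query for a hyphenated ID; the field name and the choice to escape all punctuation are assumptions for illustration:

```python
import string

def escape_tag_value(value: str) -> str:
    """Backslash-escape punctuation and spaces so the TAG query matches the stored value verbatim."""
    return "".join("\\" + ch if ch in string.punctuation + " " else ch
                   for ch in value)

def tag_query(field: str, value: str) -> str:
    """Build an @field:{value} TAG query with the value escaped."""
    return "@%s:{%s}" % (field, escape_tag_value(value))
```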