FT.AGGREGATE with multi-fields SORTBY to replace FT.SEARCH

lpellegr · July 13, 2020, 12:49pm

Hi there,

I was recently using a search query to filter the last X documents matching a set of conditions. This is working as expected. A problem arose when I tried to sort documents by multiple fields before the conditions for FT.SEARCH are applied.

After searching a bit, I found that FT.SEARCH does not support multi-fields SORTBY but FT.AGGREGATE does. As a consequence, I tried to use FT.AGGREGATE similarly to FT.SEARCH but I noticed that by default FT.AGGREGATE does not return all fields for the matching documents. However, FT.SEARCH does return indexed and non-indexed fields by default.

Looking at the documentation it seems that I need to use LOAD with FT.AGGREGATE and to specify each field that needs to appear in the result. Since I have many fields and need to return them all, is there a specific syntax to say that all indexed and non-indexed fields must be returned similarly to FT.SEARCH, without enumerating all the fields?

My second question is about the warning when using LOAD with FT.AGGREGATE. My purpose is to make a search with a SORTBY on 3 different fields. These fields are SORTABLE. In that case, does using LOAD to return all fields has a higher cost than using FT.SEARCH?

The last question I have is how to retrieve the identifier of the documents that are matched. In my case, a query aggregation is used only for multi-fields SORTBY. Using a client library such as lettusearch the result type is a list of Document for FT.SEARCH, but a Map of String with FT.AGGREGATE. While a Document contains a getId method, I cannot find a field in the Map to retrieve the identifier of the matching documents. Should I add the document ID as a dedicated field in the document when I add a document? or is there a solution to prevent duplicating this information?

kyle · July 13, 2020, 4:16pm

Hi @lpellegr

As an alternative to LOAD you may want to consider adding the fields you’ll need to return as SORTABLE in your schema. LOAD has a high costs for performance. There is no option to load all fields in aggregation, but SORTABLE will prevent the need of doing this at the cost of extra index size.

The aggregation pipeline does not currently have a way of getting the document ID, so if you need this you’ll be required to store it in the document it self (admittedly, this is a hack that does duplicate the doc id).

How big are your typical result set sizes? If they have a low cardinality, you might be better off sorting things yourself (outside of Redis) and caching those values into pages.

lpellegr · July 13, 2020, 6:46pm

Hi @kyle,

Thanks for your answer.

Just to understand, why is FT.AGGREGATE with LOAD and a SORTING on 3 fields costing much more than an FT.SEARCH with a SORTBY on a single field? are field values loaded for all documents before any filter and/or sort with FT.AGGREGATE?

Regarding SORTABLE, we have about 60 fields per document with 3 that can be quite large. The documentation says the content of SORTABLE fields is copied in a special location. Doubling the memory usage will not be an option in our case.

How big are your typical result set sizes?

We have a few million documents. Once the search is performed, about 10 documents are returned. The response must contain documents in the order as specified by the SORTBY on 3 fields. Unfortunately, sorting outside Redis seems not feasible in this case.

kyle · July 14, 2020, 1:39pm

@lpellegr

First things first - from the docs on FT.SEARCH:
If specified, and field is a sortable field, the results are ordered by the value of this field. This applies to both text and numeric fields.

So, to use SORTBY you must have a sortable field anyway.

When you FT.AGGREGATE with LOAD you’re effectively doing an O(n) operation. From the docs:
… LOAD hurts the performance of aggregate queries considerably since every processed record needs to execute the equivalent of HMGET against a redis key, which when executed over millions of keys, amounts to very high processing times.

SORTABLE on a field is creating an ordered index (RediSearch is a search engine and by default, the indexes are a form of unordered set). As you can imagine, all these metadata does swell the required memory for the index considerably - so this explains the cost. LOAD is really designed to process data in aggregations - for just returning it is a pretty heavy operation.

With regards to your sorting - I think you are confounding data set size (10 million) and result set size (about 10). Think of this in two phases - use RediSearch to filter (not exactly what is happening, but close enough) and then sort yourself - you only need to sort 10 documents which would be trivial. You wouldn’t need to sort documents that aren’t returned (the data set size). The cost of maintaining sortable indexes on everything, from my perspective doesn’t make sense for your case: large documents, high document cardinality, very small result set size. If you were returning large results (larger than a few hundred), then yes, pay the cost for sorting everything inside RediSearch, but this doesn’t seem to be the case.

If I were architecting this solution, I think I would use FT.SEARCH and supply search clauses in the query and return the documents. Only maintain the SORTABLE fields in which you are actually sorting upon with SORTBY. Retrieve the results (~10 docs) and sort them in your application logic (which would need to be custom, but probably be more efficient than trying to maintain SORTABLE indexes or doing an O(n) LOAD operation on millions of documents).

lpellegr · July 14, 2020, 2:21pm

Thanks for your explanation and time.

You are right, I was not clear. Let me rephrase: I have a few million documents, when I perform a search (with no LIMIT) this returns a few hundred to a thousand documents. I guess this is the result set size you were asking previously. Then, for this result set, I need to sort documents according to 3 fields and returns the first 10 (and potentially “paginate” 10 by 10 in the future).

Before asking the question on this forum, I started with the solution you described, namely using FT.SEARCH and making the 3 fields used in SORTBY sortable. The problem is that the result was wrong because FT.SEARCH does not support SORTBY on multi-fields. That’s why I switched, to FT.AGGREGATE and started this thread with questions related to this command.

What I still don’t get about LOAD and FT.AGGREGATE is whether the O(n) operation applies to the data set size (a few million) or to the result set size (a few hundred to a thousand documents)? Sorry if that’s a trivial question but I am really confused.

kyle · July 15, 2020, 2:49pm

OK - two things:

FYI - When you paginate with RediSearch it’s aways a good idea to cache the results. LIMIT will still execute the entire query then only grab the the results as needed - so LIMIT only saves you the document retrieval stage, the complexity of the query is still the same no matter if you’re returning 0 or 1 million documents.

With regards to your last question, LOAD will be O(n) on the query result prior to the pipeline. So, if your query is "*" it will be a full iteration of all documents. If you take the query portion of the aggregation and run it through FT.SEARCH with LIMIT 0 0 it will tell you how many total documents are returned - that will be your n. It sounds like in your case the, it will be the like running HMGET vs a few hundred to a few thousand documents, based on your previous statement.

lpellegr · July 15, 2020, 3:04pm

Thanks for the clarification.

sefai · June 29, 2021, 9:11am

Hello guys, I am sorry for hijacking this thread but this is the closest message to my problem.

I have many TAG fields and using FT.AGGREGATE for a faceted navigation experience. I’ve trusted on this: “Fields needed for aggregations should be stored as SORTABLE , where they are available to the aggregation pipeline with very low latency.” and it seemed to work when I had 30-40 TAGs.

Now I have over 500 and some of those TAG fields are still readily available to the pipeline, but some are not. What could be the reason for that?

To be on the safe side should I add LOAD n @aggregated_field1 @aggregated_field2 … to my queries?

Kind Regards,

ashtul · July 4, 2021, 9:26am

This seems related to issue 2102 on GitHub which should be fixed soon.

Topic		Replies	Views
FT.AGGREGATE SORTBY syntax help RediSearch	7	1611	July 5, 2018
FT.AGGREGATE equivalent of FT.SEARCH RediSearch	1	1892	June 19, 2022
FT.AGGREGATE performance problems RediSearch	4	1134	December 16, 2019
FT.AGGREGATE without GROUPBY and TAGs RediSearch	1	1346	January 29, 2021
RediSearch SORTBY performance RediSearch	2	1611	November 6, 2020

FT.AGGREGATE with multi-fields SORTBY to replace FT.SEARCH

Related topics