Count-Min Sketch

Hi everybody,

Please give me an example of how to use the Count-Min Sketch datatype from RedisBloom

Thanks

Hello Pablo,

Let’s assume you want to measure the number of visitors that were directed to your website via a third party but you have millions of websites that might actually do that, you could end with many counters, each one of these also keeps the address of the origin website. That can be very expensive.

If on the other hand, you need only websites that provide significant traffic, maybe 1% of your traffic, CMS will be perfect.

CMS.INITBYPROB in RedisBloom takes two parameters.

  • Error - the error rate allowed. If you are interested in websites that are responsible for 1% of the traffic you should set this at a lower rate. Try testing with 0.5% on your data.
  • Probability - the percentage of websites with lower traffic then specified that return a high traffic value or in other words, low-hitters that pretend to be heavy-hitters. This happens when a low-hitter collide with heavy-hitters on all sketches. This behavior is expected.
    Initialize using CMS.INITBYPROB traffic 0.01 0.001 then add websites using CMS.INCRBY traffic www.google.com 1 etc’

After 1 million websites were added you can query for the count of an individual website. Any website with a count of up to 10,000 (1%) is just noise since the count is within the error rate you set. Beyond that, you can assume the count represents the real frequency and is overestimated by up to 1%.

the 0.1% probability means in this case, that 1 out of a thousand websites that are not heavy-hitter will get a count higher than our error rate (1%). As mentioned before, this is expected from probabilistic data structures.

I hope this helps,

Ariel