gopalv 13 hours ago

My favourite part of these research publications from the US Gov is the licensing.

All of the USDS work is published with "No Copyright".

The SAT filters, however, still do not support incremental building, which is one of the fun features of Bloom filters when you use them in distributed databases (you can build N of them and then OR the Bloom filters together to get a single one).

I imagine they will still be incredibly useful in cases where you can iterate over the items and do the OR the old-fashioned way, but at higher accuracy for the same size.
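
For reference, here is roughly what that OR trick looks like with plain Bloom filters. This is just a toy sketch (the BloomFilter class is illustrative, not any particular library's API); the point is that two filters built with the same size and hash functions merge with a single bitwise OR:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; filters with identical m and k can be OR-merged."""
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = 0  # a Python int stands in for the bit array

    def _positions(self, item):
        # k hash positions derived from salted SHA-256 digests
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

    def union(self, other):
        # Distributed/incremental build: OR the bit arrays of filters
        # constructed with identical parameters.
        assert (self.m, self.k) == (other.m, other.k)
        merged = BloomFilter(self.m, self.k)
        merged.bits = self.bits | other.bits
        return merged

# Build N partial filters (say, one per shard) and merge them into one.
shard_a, shard_b = BloomFilter(), BloomFilter()
shard_a.add("key-1")
shard_b.add("key-2")
combined = shard_a.union(shard_b)
assert "key-1" in combined and "key-2" in combined
```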

inasio 13 hours ago

Membership filters are very efficient filters that guarantee no false negatives, but false positives are possible (how many can be tuned based on the dataset and the filter's parameters). An obvious application could be something like checking whether passengers are on a no-fly list, where false positives could be handled by further checks. As far as I know, cuckoo filters [0] are the state of the art for this, but per this work, in principle you could build very efficient ones using a SAT (or XORSAT) solver that can generate many feasible solutions from random SAT problems.
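
As a rough sketch of that filter-then-verify pattern (fast_filter and exact_lookup below are placeholders for whatever membership filter and authoritative database you actually have):

```python
def is_flagged(passenger_id, fast_filter, exact_lookup):
    """Filter-then-verify: the membership filter has no false negatives,
    so a miss is final; a hit may be a false positive and gets confirmed
    against the slow, authoritative source."""
    if passenger_id not in fast_filter:
        return False                      # definitely not on the list
    return exact_lookup(passenger_id)     # rare, expensive confirmation

# fast_filter could be a cuckoo/XOR/SAT filter; a plain set works for testing.
watchlist = {"P123", "P456"}
print(is_flagged("P999", watchlist, lambda p: p in watchlist))  # False, no exact lookup needed
```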

- Google Scholar pointed out this link to a PDF of one of the papers cited in the repo [1]

[0] https://en.wikipedia.org/wiki/Cuckoo_filter

[1] http://t-news.cn/Floc2018/FLoC2018-pages/proceedings_paper_4...

  • thaumasiotes 10 hours ago

    > An obvious application could be something like checking whether passengers are on a no-fly list, where false positives could be handled by further checks.

    Why is this an obvious application? How does this application benefit from a "very efficient" first pass? Just the boarding process on an airplane takes 20-30 minutes; you can easily check the entire passenger manifest in an error-free way in much less time than that. People have to buy their tickets before the boarding process begins.

    • jauntywundrkind 7 hours ago

      If 99% of people aren't on the list and 1% are, and your check is super fast but produces 1% false positives, you still only end up doing a full check on about 2% of people. Which could be a huge win computationally.
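
      Concretely, with the 1%/1% numbers from above:

      ```python
      on_list = 0.01                       # fraction of passengers actually on the list
      fp_rate = 0.01                       # filter's false-positive rate
      full_checks = on_list + (1 - on_list) * fp_rate
      print(full_checks)                   # 0.0199 -> only ~2% ever hit the slow, exact check
      ```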

      Your post is really weird to me, talking about boarding times? You start out skeptical of the example, and I'm confused how you think this is anything but a fine one. Ultimately there's some service running in the cloud somewhere that needs to have these checks run against it. 2.9M people fly per day in the US, and whether the servers doing that work can do it efficiently or do it badly seems like an obvious concern to me? https://www.faa.gov/air_traffic/by_the_numbers

      I suspect the actual usage for this is in much broader, higher-traffic systems: things that watch sizable chunks of the internet for patterns and traffic. But checking passengers against no-fly lists sounds like a pretty reasonable example use to me, and the criticism seems off base and weird in a number of dimensions that straight up don't make sense.

      • FridgeSeal 7 hours ago

        Assuming the airport runs from 6am to 11pm, 2.9M people a day works out to about ~47 requests/second. Which is not terribly much.

        Even if we check them at both ends and effectively double the load, that's only ~100 requests/second. A single machine would happily handle that.
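
        Back-of-the-envelope with those same assumptions:

        ```python
        passengers_per_day = 2_900_000
        window_seconds = 17 * 3600           # 6am-11pm operating window
        rps = passengers_per_day / window_seconds
        print(round(rps), round(2 * rps))    # ~47 req/s, ~95 req/s if checked at both ends
        ```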

        • thaumasiotes 2 hours ago

          > Assuming the airport runs from 6am to 11pm

          That's a strange assumption. The airports that have significant traffic are operating 24 hours.

          Under the assumption that airports close between 11 and 6, there would be no such thing as a redeye flight.

convolvatron 14 hours ago

The reference in the repo is paywalled (US$30!). I did find this https://arxiv.org/pdf/1912.08258 which may or may not be related, but what I found interesting is that the construction looks a lot like perfect hashes.