This BIP describes a structure for compact filters on block data, for use in the [BIP 157 light client protocol](/protocol/forks/bip-0157).
The filter construction proposed is an alternative to Bloom filters, as used in BIP 37, that minimizes filter size by using Golomb-Rice coding for compression.
This document specifies one initial filter type based on this construction that enables basic wallets and applications with more advanced smart contracts.
* <code>new_bit_stream</code> instantiates a new writable bit stream
* <code>new_bit_stream(vector)</code> instantiates a new bit stream reading data from <code>vector</code>
* <code>write_bit(stream, b)</code> appends the bit <code>b</code> to the end of the stream
* <code>read_bit(stream)</code> reads the next available bit from the stream
* <code>write_bits_big_endian(stream, n, k)</code> appends the <code>k</code> least significant bits of integer <code>n</code> to the end of the stream in big-endian bit order
* <code>read_bits_big_endian(stream, k)</code> reads the next available <code>k</code> bits from the stream and interprets them as the least significant bits of a big-endian integer
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
For each block, compact filters are derived containing sets of items associated with the block (eg. addresses sent to, outpoints spent, etc.).
A set of such data objects is compressed into a probabilistic structure called a ''Golomb-coded set'' (GCS), which matches all items in the set with probability 1, and matches other items with probability <code>1/M</code> for some integer parameter <code>M</code>.
The encoding is also parameterized by <code>P</code>, the bit length of the remainder code.
Each filter defined specifies values for <code>P</code> and <code>M</code>.
The first step in the filter construction is hashing the variable-sized raw items in the set to the range <code>[0, F)</code>, where <code>F = N * M</code>. Customarily, <code>M</code> is set to <code>2^P</code>.
However, if one is able to select both Parameters independently, then [more optimal values can be selected](https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845).
Set membership queries against the hash outputs will have a false positive rate of <code>M</code>.
To avoid integer overflow, the number of items <code>N</code> MUST be <2^32 and <code>M</code> MUST be <2^32.
The items are first passed through the pseudorandom function ''SipHash'', which takes a 128-bit key <code>k</code> and a variable-sized byte vector and produces a uniformly random 64-bit output.
Implementations of this BIP MUST use the SipHash parameters <code>c = 2</code> and <code>d = 4</code>.
The 64-bit SipHash outputs are then mapped uniformly over the desired range by multiplying with F and taking the top 64 bits of the 128-bit result.
This algorithm is a faster alternative to modulo reduction, as it avoids the [expensive division operation](https://lemire.me/blog/2016/06/27/ -fast-alternative-to-the-modulo-reduction/).
Note that care must be taken when implementing this reduction to ensure the upper 64 bits of the integer multiplication are not truncated;
certain architectures and high level languages may require code that decomposes the 64-bit multiplication into four 32-bit multiplications and recombines into the result.
Instead of writing the items in the hashed set directly to the filter, greater compression is achieved by only writing the differences between successive items in sorted order.
Since the items are distributed uniformly, it can be shown that the differences resemble a [geometric distribution](https://en.wikipedia.org/wiki/Geometric_distribution).
[''Golomb-Rice coding''](https://en.wikipedia.org/wiki/Golomb_coding#Rice_coding) is a technique that optimally compresses geometrically distributed values.
We exclude all outputs that start with <code>OP_RETURN</code> in order to allow filters to easily be committed to in the future via a soft-fork.
A likely area for future commitments is an additional <code>OP_RETURN</code> output in the coinbase transaction similar to the [current witness commitment](/protocol/forks/bip-0141).
By excluding all <code>OP_RETURN</code> outputs we avoid a circular dependency between the commitment, and the item being committed to.
The parameter <code>P</code> MUST be set to <code>19</code>, and the parameter <code>M</code> MUST be set to <code>784931</code>.
Analysis has shown that if one is able to select <code>P</code> and <code>M</code> independently, then setting <code>M=1.497137 * 2^P</code> is [close to optimal](https://gist.github.com/sipa/576d5f09c3b86c3b1b75598d799fc845).
Empirical analysis also shows that was chosen as these parameters minimize the bandwidth utilized, considering both the expected number of blocks downloaded due to false positives and the size of the filters themselves.
The parameter <code>k</code> MUST be set to the first 16 bytes of the hash (in standard little-endian representation) of the block for which the filter is constructed.
This ensures the key is deterministic while still varying from block to block.
Since the value <code>N</code> is required to decode a GCS, a serialized GCS includes it as a prefix, written as a <code>CompactSize</code>.
We would like to thank bfd (from the bitcoin-dev mailing list) for bringing the basis of this BIP to our attention, Greg Maxwell for pointing us in the direction of Golomb-Rice coding and fast range optimization, Pieter Wullie for his analysis of optimal GCS parameters, and Pedro Martelletto for writing the initial indexing code for <code>btcd</code>.
Bloom Filters are perhaps the best known probabilistic data structure for testing set membership, and were introduced into the Bitcoin protocol with BIP 37.
The size of a Bloom filter is larger than the expected size of a GCS with the same false positive rate, which is the main reason the option was rejected.
[Cryptographic accumulators](https://en.wikipedia.org/wiki/Accumulator_(cryptography)) are a cryptographic data structures that enable (amongst other operations) a one way membership test.
One advantage of accumulators are that they are constant size, independent of the number of elements inserted into the accumulator.
However, current constructions of cryptographic accumulators require an initial trusted set up.
Additionally, accumulators based on the Strong-RSA Assumption require mapping set items to prime representatives in the associated group which can be preemptively expensive.
There exist data structures based on matrix solving which are even more space efficient compared to [Bloom filters](https://arxiv.org/pdf/0804.1845.pdf).
We instead opted for our GCS-based filters as they have a much lower implementation complexity and are easier to understand.
Test vectors for basic block filters on five testnet blocks, including the filters and filter headers, can be found [here](https://github.com/bitcoin/bips/blob/master/bip-0158/testnet-19.json).
The code to generate them can be found [here](https://github.com/bitcoin/bips/blob/master/bip-0158/gentestvectors.go).