FCMPs++ is the next protocol composition for Monero, featuring Full-chain Membership Proofs (FCMPs). These replace the existing ring signatures, which only offer an anonymity set of 16, with an anonymity set of every output on-chain. This resolves concerns such as the EAE, EABE, and Overseer attacks, as demonstrated within a Chainalysis video (starting from a known wallet state). It also resolves the risk of black marbles by ensuring every input's privacy set is the privacy set of all honest outputs on-chain (where an honest output is one whose state is not revealed to an adversary). In the worst (infeasible) case, where every already-spent output has been connected to its place of spend (via a chain reaction), the privacy set with FCMPs will still be all honest unspent outputs.
Transaction uniformity is an idea which notably traces back to this post by tevador. In it, tevador cites a post proposing to mandate all transactions have 16 outputs so an adversary cannot determine how many outputs a transaction actually has. tevador extends this with a proposal to mandate all transactions have a constant number of inputs.
Why?
Right now, 16-out transactions represent services doing bulk payment processing. Identifying the transactions which belong to individual users vs services splits Monero's privacy set. Ideally, all transactions look perfectly identical, zero-knowledge in all regards, with no additional info granted to a third-party observer.
Transactions with a high number of inputs presumably represent stores accepting Monero or even, with just three-to-five inputs, notable users of Monero. In comparison, transactions with just one input represent a user new to Monero (at least, a new wallet) or someone who recently made a transaction (aggregating all of their outputs, leaving them with just one).
This is further exacerbated by how some wallets may not support batch payments, allowing all transactions with more than two outputs to be identified as not having been made by those wallets. This fingerprint can be combined with others to further reduce the privacy set of users.
So why didn't Monero fix an input/output count a while ago?
The performance cost. Fixing to 16 outputs, the maximum number of outputs, is 87.5% wasteful for ~90% of transactions. Outputs not only exist on the blockchain but must also be indexed into the database and randomly read when using ring signatures. With FCMPs, outputs must be accumulated into the Merkle tree (an additional computational cost).
The same commentary exists for inputs, though it is much more difficult to support dummy inputs. The cited post by tevador notes a solution for RingCT yet notes its lack of applicability to Seraphis. Even if dummy outputs can be specified, we need to ensure the dummy outputs cannot be used to burn real outputs. tevador did propose a solution for dummy outputs under FCMPs. It proposes growing the blockchain's tree for each input, creating a tree solely for that input, with a dummy output. It's a great idea which should generally apply to all future iterations of the membership-proof design. That doesn't change that it's of non-trivial technical complexity when FCMPs++, the next hard fork being discussed, is already of notable technical complexity.
MAX_OUTPUTS is currently set to 16. This was a decision made years ago. It did not make the Monero protocol unusable. The reasoning was due to the computational burden of Bulletproofs and how it scales.
Bulletproofs uses 64 "IPA rows" per output. The notable aspect of the "IPA" is it scales with powers of 2. You can use 1 IPA row, 2 IPA rows, 4, 8, 16, 32, 64, 128, 256, 512... So what happens when you have three outputs? You use 192 IPA rows, yet the IPA scales to 128 and 256 rows (not 192). Internally, the IPA will be for 256 IPA rows, incurring the cost of 256 IPA rows, yet only 192 of them matter to us. This leaves 64 unused yet paid for. If we added a fourth output, it would fill the already-paid-for rows and not increase the cost of the Bulletproof (though we would still pay a cost for the output in general).
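This rounding behavior is easy to model (a minimal sketch, using the 64-rows-per-output figure above):

```python
# Model of Bulletproofs' IPA row padding: the rows actually used round
# up to the next power of two, per the 64-rows-per-output figure above.

IPA_ROWS_PER_OUTPUT = 64

def padded_ipa_rows(outputs: int) -> int:
    """IPA rows actually paid for, given a transaction's output count."""
    used = outputs * IPA_ROWS_PER_OUTPUT
    padded = 1
    while padded < used:
        padded *= 2
    return padded

# Three outputs use 192 rows yet pay for 256; a fourth output fills the
# already-paid-for rows without growing the IPA.
assert padded_ipa_rows(3) == 256
assert padded_ipa_rows(4) == 256
```

The same model reproduces the 17-output case: `padded_ipa_rows(17)` is 2048 rows, identical to 32 outputs.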
If we have 17 outputs, we incur a computational cost as if we had 32 outputs. This is not only incredibly wasteful but also opens a Denial-of-Service vector against Monero. If a single output takes a few ms, then specifying a transaction with 333 outputs would take an entire second of computational cost to verify. If the proof proves invalid, the node cannot recoup any fees (so solely requiring the fee be 'enough' does not work). This is before the rules of Bulletproofs would pad the outputs' computational cost to be equivalent to 512 outputs (a second and a half to verify). When Bulletproofs was introduced, the decision was made to bound the time it costs to verify a transaction to roughly 50ms (estimated; I have not explicitly re-benchmarked all of this on historical and modern hardware to provide a full historical analysis).
Please note Monero replaced Bulletproofs with Bulletproofs+ to prove the validity of outputs. The exact same principles apply.
Monero already has a MAX_INPUTS in practice. wallet2 will only create transactions which are 149,400 bytes or less. If all 16 outputs are present, this ends up as roughly 220 inputs with the current ring size. This means if you have more than 220 inputs and wish to spend your entire wallet balance in a single transaction, you will need to make a transaction consolidating your outputs prior to making your actual transaction. The current discussion is on setting an explicit limit at a much lower value.
FCMPs, one of the two proofs used for inputs, is premised on Bulletproofs. With that, it has the same scaling rules as previously described. It should be noted each input does add some additional proof data, so a three-input FCMP is still a few hundred bytes smaller than a four-input FCMP (despite paying for the same amount of IPA rows).
The naive solution is to use one FCMP per input. This ensures we never pay for IPA rows we don't actually use. The issue is an FCMP for one input is 5600 bytes, so we could support just 24 inputs under this policy. 24 inputs, at 5600 bytes each, plus the bytes for the signature and just two outputs, prevent adding a 25th input while still fitting within wallet2's 149,400-byte limit.
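The 24-input ceiling follows from simple byte accounting (a sketch; the ~10,000-byte allowance for the signature, two outputs, and overhead is my own placeholder for illustration):

```python
# Byte accounting for the one-FCMP-per-input policy, using the figures
# above: 5600 bytes per single-input FCMP, 149,400-byte wallet2 limit.

TX_SIZE_LIMIT = 149_400
FCMP_ONE_INPUT = 5_600

def max_individual_proofs(non_proof_bytes: int) -> int:
    """Single-input FCMPs that fit beside the rest of the transaction."""
    return (TX_SIZE_LIMIT - non_proof_bytes) // FCMP_ONE_INPUT

# Assuming ~10,000 bytes for the signature, outputs, and overhead (a
# hypothetical figure), only 24 proofs fit; a 25th would overflow.
assert max_individual_proofs(10_000) == 24
```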
The good news is we can prove for multiple inputs within an FCMP. While one input takes 5600 bytes, two inputs is just 6336 bytes (+736 bytes for the second input).
Each FCMP uses 384 IPA rows. This satisfies the power-of-two rule because, internally, the FCMP uses two Bulletproofs (one with 256 rows, one with 128 rows). This means an entire FCMP, a proof which evaluates membership within a Merkle tree with capacity for 210 billion outputs while verifying the claimed input is valid, is just six times as many IPA rows as a Bulletproof which verifies an output is a valid 64-bit number and not effectively a negative amount. The proximity of these, despite the ocean of difference in complexity, is something I'm incredibly proud of, building on top of Curve Trees and Eagen's posited idea of using divisors of an elliptic curve to prove scalar multiplications.
If we used a single FCMP, we could still support 134 inputs within the transaction size limit. Since an FCMP for one input takes 15ms to verify, we can extrapolate such an FCMP to take 2s to verify. This means any RPC request to publish a transaction could cause the node to devote 2s of CPU time to verifying it. This would immediately be a wide window for Denial-of-Service attacks, potentially causing every RPC to be whitelisted (explicitly or via some form of at-time-of-use payment, such as the existing credit scheme). At the very least, we're presumably discussing a multi-second Proof of Work (taking several to tens of seconds on lower-powered devices such as mobile phones) to prove the client isn't spending fewer resources than the RPC does on the request.
An alternative proposal was one I made in the FCMPs++ document itself. We can step proofs along powers of two, ensuring we never pay for unused IPA rows while minimizing bandwidth where possible. If we have 11 inputs, we would create one proof for 8 inputs, one proof for 2 inputs, and one proof for 1 input (22580 bytes, compared to 61600 bytes if individual proofs were used and 11904 bytes if a single proof is used). While this isn't optimal for bandwidth, it strikes a great balance with computational performance. Not only does it only pay for the actually used IPA rows (~69% of the padded 16-input Bulletproof), it enables batch verification.
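Stepping along powers of two is just the binary decomposition of the input count: each set bit becomes one proof (a minimal sketch):

```python
def proof_split(inputs: int) -> list[int]:
    """Split an input count into power-of-two proof sizes, largest first."""
    sizes = []
    bit = 1
    while bit <= inputs:
        if inputs & bit:     # this power of two is needed
            sizes.append(bit)
        bit <<= 1
    return sorted(sizes, reverse=True)

# 11 inputs -> one 8-input, one 2-input, and one 1-input proof.
assert proof_split(11) == [8, 2, 1]
```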
Batch verification is where multiple proofs are verified at once. Doing so reduces the cost of verifying each individual proof. When accepting transactions into the mempool, we can only define the batch as the proofs within the transaction. When syncing blocks, we can define the batch as all proofs within every transaction within the block. This can reduce proof verification time by almost 33%!
With our 11-input example, across three proofs, work can be shared across the first 384 IPA rows for all three proofs, and the next 384 IPA rows for the 2-input and 8-input proofs. This means not only are no IPA rows wasted, some computations can be merged for even greater performance benefits.
This idea quickly faded away because of how expensive FCMP proofs are regarding bandwidth in the first place. While the 11-input three-proof setup may be twice as fast as an 11-input one-proof, it's twice as large and bandwidth is generally prioritized more than computational cost.
So we do want to use a single proof, for bandwidth reasons, but we can't have checking transactions take multiple seconds. The question is what a reasonable limit is.
If we maintain parity with the IPA rows allowed by MAX_OUTPUTS, we should allow two inputs and a fractional input (16 outputs permit 1024 IPA rows, while each input requires 384). This is technically acceptable. So long as we allow consolidating inputs at all, an arbitrary amount of inputs can be consolidated with a logarithmic bound on time. If we had 256 inputs, they could be aggregated in just 8 steps (as log2(256) = 8). This would still be very negative from a user-experience perspective but creates a high-latency system, not an unusable one.
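The logarithmic bound can be sketched as follows, assuming each transaction merges up to MAX_INPUTS outputs into one output:

```python
import math

def consolidation_rounds(outputs: int, max_inputs: int) -> int:
    """How many transactions deep a consolidation chain must be."""
    rounds = 0
    while outputs > 1:
        outputs = math.ceil(outputs / max_inputs)
        rounds += 1
    return rounds

# 256 outputs with 2-input transactions: log2(256) = 8 rounds.
assert consolidation_rounds(256, 2) == 8
# With 16-input transactions: 2 rounds (the second can be the spend itself).
assert consolidation_rounds(256, 16) == 2
```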
If we attempt to not so drastically harm the user experience, I pointed to a distinct perspective. We could set MAX_INPUTS to MAX_OUTPUTS. What we have already declared as a sufficient rate for creating outputs would be declared the sufficient rate for consolidating inputs. We'd also have the benefit that an adversary cannot spam outputs to a wallet with fewer transactions than necessary to consolidate them. This holds true so long as MAX_INPUTS >= MAX_OUTPUTS, though such a clause can be argued unnecessary as even just two inputs can be sufficient to achieve logarithmic time bounds.
16 inputs, which would be the limit if we set MAX_INPUTS to the current MAX_OUTPUTS value, has the benefit of only adding a single spend interval of delay to consolidation. Right now, ~220 inputs can be aggregated in a single transaction. With 16 inputs, one can create 16 transactions aggregating 16 inputs into 16 outputs. After the 10-block lock passes (one 'spend interval'), the user has 16 outputs usable in a single transaction. The 256 inputs aggregated exceed the current ~220 inputs allowed.
I personally believe 16 inputs is still too computationally expensive. I'd actually prefer a MAX_INPUTS as low as 4, roughly 60ms to verify. It's still a decent fan-in factor, covers the vast majority of transactions, and makes sense as an eventual target for transaction uniformity.
I also acknowledge how impactful to user experience that would be. That's why my current advocacy is 8 inputs. Since I prefer adversaries not be able to create outputs faster than they can be consolidated, and since I believe 4/4 is a solid target for transaction uniformity (a long-term goal for privacy), I support reducing MAX_OUTPUTS to 8 as well.
Any MIN_INPUTS value which isn't 1 requires support for dummy inputs. I don't support including that complexity in this next hardfork.
We currently have a MIN_OUTPUTS of 2. Carrot, the wallet protocol planned to ship with FCMPs++, requires a change output. Establishing a logarithmic time-bound for executing payments requires being able to make more than one payment within a transaction. This means, under Carrot, we cannot reduce MAX_OUTPUTS past 3. We can, with the goal of transaction uniformity, set MIN_OUTPUTS to 3. I don't support padding input/output counts until we are ready to more fully commit to transaction uniformity, which I do support as a long-term goal.
We could partially aim for transaction uniformity by only allowing n inputs, and a power of two for outputs. I don't believe this actually achieves a real-world benefit to privacy, as an adversary can still distinguish low-output transactions from high-output transactions.
Once we have support for dummy inputs, the only bar to moving forward with transaction uniformity is the performance cost.
Bulletproofs, when used as a range proof, only grow by 64 bytes for each power of two. If we limit MAX_OUTPUTS to 4, I'd argue Bulletproofs is performant enough that we can move forward with transaction uniformity.
FCMPs++ is too early to comment on, in my opinion. We need to see how these affect testnets, stressnets, and mainnet to make commentary on the real-world performance impact.
There are further improvements which can be made to Full-chain Membership Proofs, however. Curve Forests would effectively remove the bandwidth growth of FCMPs with additional inputs. We'd only face the same 64-byte growth as we cross powers of two. It would, however, increase the computational cost of adding new outputs to the tree, linearly with the value of MAX_INPUTS. This is computationally cheaper overall, as the proofs themselves are also a bit cheaper, so long as a sufficient number of proofs actually use a high number of inputs. Under transaction uniformity, every proof would always use MAX_INPUTS, ensuring the trade-off for smaller and potentially cheaper proofs always results in cheaper proofs.
We can also halve the amount of commitments to variables we use within FCMPs if the original results from Generalized Bulletproofs are recovered. The original variant promised the ability to commit to twice as much data within each commitment. This was incorrect, as the resulting proof failed correctness (an academic term). The solution was either to not commit to the second half of the promised data, or to double the length of a polynomial (doubling the number of commitments to that polynomial within the proof). Needing double the commitments in total, yet halving the commitments within the proof, always has a slight edge on proof size and is the variant currently implemented. If someone could resolve the correctness issues, allowing committing to twice as much data without doubling the commitments within the proof itself, it'd halve the commitments we use (also halving the number of commitments within the proof) and notably improve our bandwidth.
We can also reduce the bandwidth used by increasing the time it takes to verify proofs. This is an option available now, referred to as the "compute factor" within a script I wrote to estimate proof size and computational cost. While this would double, or even quadruple, the proof time, optimizations to our underlying elliptic-curve arithmetic may sufficiently counterbalance this. The current implementations of Helios and Selene, the elliptic curves used by FCMPs, were written by me with generic arithmetic and were never advertised to be production-grade. They proved usable in production, but someone who implements tailored arithmetic may be able to achieve significant performance improvements (allowing decreasing the bandwidth without increasing the time to verify). I have already outlined a competition for developers to work on this task.
Finally, we can look at a distinct proof system entirely. zkFFT may be a better candidate than Bulletproofs for FCMPs and usable in a future protocol upgrade. We can also aim to implement an IVC/RCG proof such that one proof, with the size and computational cost of a proof for one input, proves for an arbitrary number of inputs. These likely have too large constants to make sense at this level yet are worth keeping in the back of one's mind.
Most of these can be targeted for the next year or two, assuming the proper resources are devoted. It's only distinct proof systems entirely which would likely take two to three years (if not more) to adopt.
As prior stated, so long as we allow two inputs and three outputs, we can achieve logarithmic time-bounds for consolidation and fulfillment of payments. Obviously, the average transaction should not have to incur any twenty-minute delay. Per Rucknium's tally, 97.67% of transactions used 8 or fewer inputs, and 97.71% of transactions used 8 or fewer outputs.
With FCMPs++, which is when these changes would take effect, transaction chaining becomes possible. If a user has 256 outputs they wish to consolidate, the user may sign the consolidation transactions in a single moment. Then, the wallet, running in the background with neither the private spend key nor even the private view key, may publish the consolidation transactions. The wallet would know which outputs were spent in the transactions it published but no more.
The same scheme is possible for fulfilling an arbitrary number of payments.
I will also note how existing services already have to deal with MAX_OUTPUTS. It is low enough I expect most services handling volume to have hit it in practice. Some services, even with decent volume, may not have hit MAX_INPUTS (currently implicit). Such services are at risk of erroneous or unexpected behavior if they have more inputs than are spendable within a single transaction and attempt to spend their entire balance (or enough of their balance that they would need that many inputs). While the proposed changes may force existing services to face this limit, it was always something which could be faced, can be faced today, can even be forced upon a service today, and should already be handled by services.
The impact to users would be minimized if a spend interval wasn't twenty minutes long and outputs could instead be spent immediately. This isn't possible: Monero will always have at least a 1-block lock, as outputs must be indexed/accumulated on-chain before they can be spent.
If we have a 1-block lock, and we have a 1-block reorganization, all transactions in the latest block will be invalidated and need to be republished if they refer to state from that block. With ring signatures, this would be via selecting a decoy present in the most recent block. With FCMPs, this would be via selecting the most recent tree to prove membership within. With ring signatures, the transaction would have to be signed again with the private spend key. With FCMPs, the transaction would solely need its membership proof re-proved, requiring neither the private spend key nor the private view key (solely the rerandomization, allowing linking the output to the input). That still requires all affected users be online, actively aware, and willing to republish their membership proofs. Monero itself would also pay a non-trivial computational cost for the reorganization (having to revert the tree state).
The point of the 10-block lock is for Monero nodes and for users to not be affected by small reorganizations which are accepted as part of Monero relying on Proof of Work. The Monero Research Lab did recently revisit this topic, doing a new analysis given Monero's current status, and came to the conclusion that 10 blocks still makes sense for now.
I believe the way this would change is if Monero adopted a Proof-of-Stake finality layer, yet I do not see the Monero community, who define the Monero project, as open to such a system.
Wallets are able to mitigate the 10-block lock by splitting their value across multiple outputs. This allows spending some outputs without locking all outputs (and the wallet's entire value with them). This has non-trivial privacy implications under the current ring signatures. The output count of the transaction is a fingerprint (due to the lack of transaction uniformity), allowing identifying potential split transactions. Once potential split transactions are identified, adversaries can look for transactions which have multiple inputs whose rings include distinct outputs from the potential split transaction as ring members. This gives a very high statistical likelihood that the transaction does spend the split transaction's outputs and that the potential split transaction actually was a split transaction. Even if a split transaction only has two outputs, the spend pattern is still an issue given Monero's current ring signatures. With FCMPs, these privacy risks are mitigated.
P2Pool represents a real-world use case of creating outputs en masse, forcing users to need to consolidate outputs en masse. I'd argue P2Pool is an inefficient design, even if I believe a peer-to-peer pool is important and should be prioritized, and that the Monero protocol should not pay for its inefficiencies. I'd love to see designs with less impact on Monero, via alternate payment structures (even if they have trade-offs/compromises), be developed.
Defining MAX_INPUTS would not prevent P2Pool from operating yet would increase the time users spend consolidating their outputs from it. This would further incentivize a P2Pool design which creates fewer outputs on-chain, for the user experience benefits.
I'll also note there is a dedicated issue for a coinbase consolidation transaction type. This resolves the P2Pool use case, while harming privacy, with no impact on the discussions and decisions here.
This section does not really flow with the rest of my commentary, yet how transactions are weighted was a topic brought up with the rest of items here. Since it's also being discussed, I want to make my position clear.
Fees in Monero are very annoying to get right. My proposal is to replace the fee field with a fee rate field. This forces wallets to at least calculate the transaction weight properly, as otherwise their transaction will fail to balance (inputs - outputs - fee = 0). The actual fee paid would, of course, be fee_rate * weight.
Weights in Monero are very annoying to get right. The weight of a Monero transaction, for the purposes of fees, is the byte-length of the transaction plus a 'clawback' for the range proofs on the outputs. The clawback covers the computational cost of a non-power-of-two Bulletproof whose cost isn't represented by the byte-length. The byte-length of the transaction is itself variable to various properties, such as the decoys selected and even the fee paid (so paying a high enough fee may increase the transaction's weight, forcing paying an even higher fee).
My proposal for a weight, for the purposes of fees, is base_fee + (inputs * input_fee) + (outputs * output_fee) + tx_extra_len, plus some clawback for the unrepresented computational cost of non-power-of-two inputs/outputs. I don't have an opinion on the exact clawback/constants used here. I solely want to advocate for the simplicity of a formula solely variable to the amount of inputs, the amount of outputs, and the length of the transaction's extra field.
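As a sketch of the proposed formula (every constant here is a hypothetical placeholder; as stated, I leave the exact clawback/constants open):

```python
# Sketch of the proposed weight/fee formulas. BASE_FEE, INPUT_FEE, and
# OUTPUT_FEE are hypothetical placeholders, not proposed values.

BASE_FEE = 100
INPUT_FEE = 40
OUTPUT_FEE = 30

def tx_weight(inputs: int, outputs: int, tx_extra_len: int) -> int:
    # A clawback for non-power-of-two input/output counts would be added
    # here; its exact form is intentionally left open.
    return BASE_FEE + inputs * INPUT_FEE + outputs * OUTPUT_FEE + tx_extra_len

def fee_paid(fee_rate: int, inputs: int, outputs: int, tx_extra_len: int) -> int:
    # Under the fee-rate proposal, the fee paid is fee_rate * weight.
    return fee_rate * tx_weight(inputs, outputs, tx_extra_len)

assert tx_weight(2, 3, 0) == 270
```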
This extends existing discussions from Seraphis on fee discretization.
These ideas are neither new nor mine alone. Transaction uniformity was not my idea and has been an active discussion for years. Fee discretization was not my idea and has been an active discussion for years. I started discussing limiting inputs in Monero Research Lab meetings a month ago, and there have been multiple discussions on this in the meetings since.
I encourage anyone who wishes to participate in this discussion to be aware of the considerations and to attend the Monero Research Lab meetings. They are Wednesdays at 17:00 UTC.
I don't think this has been discussed yet, but couldn't the node cache "diffs" of old trees for the past 10 (or any N) blocks for basically free rollbacks? For past FCMP trees, the node knows the first K leaves are the same as the current tree's, so each "diff" entry for a block would be some leaf cutoff value K and the rightmost edge of the tree at that time. If we roll back some blocks, we simply drop all outputs after K and re-assign the rightmost edge of the tree, which doesn't require any EC arithmetic.
This could mitigate the node performance holdups in favor of the 10-block-lock. Wallets could also leverage this same technique, but it might not be worth the code complexity unless reorgs are a more frequent occurrence.
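A minimal sketch of the diff scheme described above (names and types are hypothetical; a real node would store curve points and database state rather than plain lists):

```python
from dataclasses import dataclass, field

@dataclass
class TreeDiff:
    leaf_cutoff: int        # K: leaves shared with the current tree
    rightmost_edge: list    # the tree's rightmost path at that block

@dataclass
class TreeState:
    leaves: list = field(default_factory=list)
    rightmost_edge: list = field(default_factory=list)
    diffs: dict = field(default_factory=dict)  # block height -> TreeDiff

    def snapshot(self, height: int) -> None:
        self.diffs[height] = TreeDiff(len(self.leaves), list(self.rightmost_edge))

    def rollback(self, height: int) -> None:
        # No EC arithmetic: truncate the leaves and restore the edge.
        diff = self.diffs[height]
        del self.leaves[diff.leaf_cutoff:]
        self.rightmost_edge = list(diff.rightmost_edge)

state = TreeState()
state.leaves = ["out0", "out1"]
state.snapshot(height=100)
state.leaves += ["out2", "out3"]   # a later block adds outputs
state.rollback(height=100)         # that block is reorged away
assert state.leaves == ["out0", "out1"]
```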