Canto chain halt post-mortem

Deborah · February 26, 2025, 7:30pm

This somewhat-overdue summary authored by the Gravity Bridge team is a combination of our own information and information from many Canto contributors.

Canto v8 (parallel EVM) upgrade

The Canto v8 upgrade represented a significant set of security fixes and throughput improvements the community maintainers and developers supported. The development of the v8 release was started earlier in 2024 through a collaboration of several development teams.

Leading up to the deployment of the Canto v8 chain upgrade, the Gravity Bridge team expressed concerns about the level of integration and acceptance testing being prepared before the release. The v8 upgrade represented a huge jump in Cosmos-SDK versions, from v0.45 to v0.50, with significant changes in the internal workings of almost every component of the chain.

While the throughput improvements were exciting and the security improvements presented by these upgrades were definitely needed, the QA and acceptance testing burden required to smoothly perform such a large version jump in a single upgrade is difficult to overstate.

At the time of deployment, several teams which had led development and supported the chain upgrades were not working on the chain full time. As such, several of the testing and planning procedures recommended by the Gravity team did not take place.

Leadup to V8 Upgrade

On July 29th, 2024, the preliminary Canto v8 (Eden) upgrade announcement went out in the Canto validator community.

Due to Canto’s extremely short governance vote period at the time (2 hours) the coordination burden on the team managing the upgrade was high. Validators needed significant warning to show up for the upgrade, vote on the upgrade proposal, then execute the actual code change.

This communication was poor, with several false starts over the preceding days and validators concerned they would miss upgrade times that were pushed back with no warning.

V8 Upgrade

After several delays the final code for the v8 upgrade was released on August 8th and an upgrade proposal was passed setting the target height to block 10,848,200.

At block 10,846,961, on August 9th, less than two hours before the upgrade was scheduled to execute, the Canto chain stopped producing blocks due to a proposer priority mismatch.

Cosmos-SDK does not embed the block proposer as part of the consensus hash. Instead, the algorithm that decides which validator proposes a block next is implicitly synced by virtue of all participants in the network having the same starting point.
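
To make the "same starting point" requirement concrete, here is a minimal sketch of a weighted round-robin proposer rotation. It is a simplified illustration, not the real consensus code: the validator names, voting powers, tie-break rule, and the helper `proposer_sequence` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Validator:
    name: str
    power: int         # voting power
    priority: int = 0  # purely local state, not covered by any committed hash


def next_proposer(validators: list[Validator]) -> str:
    """Advance the rotation by one step and return the chosen proposer."""
    total_power = sum(v.power for v in validators)
    for v in validators:
        v.priority += v.power
    # Highest priority wins; ties are broken by name (an assumed rule) so the
    # choice stays deterministic.
    chosen = max(validators, key=lambda v: (v.priority, v.name))
    chosen.priority -= total_power
    return chosen.name


def proposer_sequence(start_priorities: dict[str, int], steps: int) -> list[str]:
    """Hypothetical helper: run the rotation from a given local starting state."""
    powers = {"A": 3, "B": 2, "C": 1}  # assumed voting powers
    validators = [Validator(n, powers[n], start_priorities[n]) for n in powers]
    return [next_proposer(validators) for _ in range(steps)]


# Two nodes that start from identical priorities agree on every proposer.
synced_1 = proposer_sequence({"A": 0, "B": 0, "C": 0}, 6)
synced_2 = proposer_sequence({"A": 0, "B": 0, "C": 0}, 6)

# A node whose local priorities differ expects a different proposer, and
# nothing on chain tells it that it is wrong.
drifted = proposer_sequence({"A": -3, "B": 2, "C": 1}, 6)

print(synced_1 == synced_2, synced_1[0], drifted[0])
```

Every node runs this kind of rotation independently, so agreement depends entirely on every node holding identical priority values.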

Because the next-proposer is local state, it’s not easy to identify what caused this desync. There’s no information available on chain. This is not a bug that has been reproduced in test conditions or experienced by any other chain to our knowledge.

We cannot rule out that this was the use of a zero-day bug to disrupt Canto. But given Canto’s infrequent upgrades and relatively small validator set, it is also possible that a validator preparing for the upgrade somehow triggered this accidentally.

Regardless of the cause, the priority mismatch was a unique issue and extremely difficult to resolve.

To roughly describe the proposer issue: Validator A thinks Validator B should propose the next block, Validator C thinks Validator D should propose the next block, and Validator D is expecting the next proposer to be Validator B. So no block is produced this round. As each round times out, the validators all move their expected next proposer to the next validator in the list.

Every few hours 66% of the chain would randomly align, with the same expected proposer, actual proposer, and other validators believing it’s the right proposer. This would produce a single block.
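
The condition for a block to get through can be sketched roughly as follows. This is an illustration under assumptions, not the consensus implementation: equal voting power is assumed, Validator B's expectation is invented because the example above does not state it, and `agreed_proposer` is a hypothetical helper.

```python
from typing import Optional


def agreed_proposer(powers: dict[str, int], expected: dict[str, str]) -> Optional[str]:
    """Return the proposer a block could be built around, or None.

    powers:   validator name -> voting power
    expected: validator name -> proposer that validator expects this round
    """
    total = sum(powers.values())
    for candidate in powers:
        # A validator only proposes if it believes it is its own turn.
        if expected[candidate] != candidate:
            continue
        support = sum(p for name, p in powers.items() if expected[name] == candidate)
        # More than two thirds of voting power must accept this proposer.
        if 3 * support > 2 * total:
            return candidate
    return None


powers = {"A": 1, "B": 1, "C": 1, "D": 1}  # assumed equal voting power

# The stuck case from the description; B's view ("A") is an assumption.
stuck = {"A": "B", "B": "A", "C": "D", "D": "B"}

# A lucky round: A, B and D all expect B, and B agrees it is its turn.
aligned = {"A": "B", "B": "B", "C": "D", "D": "B"}

print(agreed_proposer(powers, stuck), agreed_proposer(powers, aligned))
```

In the stuck case nobody who is expected to propose believes it is their turn, so the round times out. In the aligned case three of four validators back B, which clears the two-thirds threshold.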

Two hours' worth of blocks would take weeks to pass, with the chain effectively ‘down’ for the duration.

The Gravity team was on standby during the upgrade to provide support for the upgrade issues and stepped in to assist with debugging.

It took more than a day to fully identify the exact problem. The v8 development team designed and developed a temporary solution to the proposer priority issue. This solution did not actually fully re-sync the proposer data, but instead reset the selection process every 100 blocks or 20 rounds on the same block.

This got validators back to the same starting point in the proposer selection algorithm and would resync them for some time, but not forever: they would eventually repeat the proposer priority failure if the code did not frequently force a reset of their proposer value.
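
The periodic reset can be sketched as below. This is one reading of "every 100 blocks or 20 rounds on the same block", not the actual patch: the function names, the exact trigger conditions, and the choice to reset priorities to zero are assumptions.

```python
RESET_EVERY_BLOCKS = 100  # from the description above
RESET_EVERY_ROUNDS = 20   # from the description above


def should_reset(height: int, round_: int) -> bool:
    """True when every node should discard its local proposer priorities."""
    if round_ == 0:
        # Start of a new height: reset on every 100th block.
        return height % RESET_EVERY_BLOCKS == 0
    # Stuck on the same height: reset every 20 failed rounds.
    return round_ % RESET_EVERY_ROUNDS == 0


def maybe_reset(priorities: dict[str, int], height: int, round_: int) -> dict[str, int]:
    """Return zeroed priorities at a reset point, otherwise the input unchanged."""
    if should_reset(height, round_):
        return {name: 0 for name in priorities}
    return priorities


# Two nodes with different local priorities converge at a reset point.
node_1 = {"A": -3, "B": 2, "C": 1}
node_2 = {"A": 0, "B": -2, "C": 2}
stuck_height = 10_846_961
after_1 = maybe_reset(node_1, stuck_height, 20)
after_2 = maybe_reset(node_2, stuck_height, 20)
print(after_1 == after_2)
```

Because the trigger depends only on height and round, which all nodes share, every node resets at the same moment without needing to exchange its local state. Whatever caused the original drift is untouched, which is why the reset has to keep repeating.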

The first attempt at a fix, v7.0.1, ran into this repeated de-sync issue as it attempted to re-synchronize only once.

Each fix attempt took a significant amount of time to deploy as validators were not able to all be present continuously during the incident.

The v7.1.0 change allowed the chain to progress to the v8 upgrade block and the v8 upgrade to take place on August 12th. Two hours' worth of blocks had taken three days to pass.

The Canto V8 upgrade

The v8 upgrade required that the chain-id be specified as a command line arg every time on startup. Otherwise the chain would default to a different chain id. This behavior was dramatically different from previous Canto versions, and generally all Cosmos-SDK based chains which validator operators were familiar with. In all previous Canto versions the chain id would be implicitly loaded from the chain database.

This key change was mentioned in the upgrade instructions only as a soft recommendation that validators should start with the --chain-id value, without explaining the severity of the consequences of failing to do so.
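
The behavior change can be modeled roughly as follows. This is a sketch of the behavior as described here, not the node's startup code: the function names and the default value are hypothetical, and `canto_7700-1` is assumed to be the chain id stored in the database.

```python
from typing import Optional

STORED_CHAIN_ID = "canto_7700-1"            # assumed value in the chain database
BUILT_IN_DEFAULT = "placeholder-default-1"  # hypothetical; the real default is not stated


def resolve_chain_id_legacy(cli_flag: Optional[str], stored: str) -> str:
    """Pre-v8 behavior as described: fall back to the id in the chain database."""
    return cli_flag or stored


def resolve_chain_id_v8(cli_flag: Optional[str], stored: str) -> str:
    """v8 behavior as described: without the flag, a built-in default is used."""
    return cli_flag or BUILT_IN_DEFAULT


def check_chain_id(resolved: str, stored: str) -> None:
    """Hypothetical startup guard: refuse to run on a mismatched chain id."""
    if resolved != stored:
        raise ValueError(f"chain id {resolved!r} does not match database {stored!r}")


# Same operator habit (no flag), different outcome after the upgrade.
before = resolve_chain_id_legacy(None, STORED_CHAIN_ID)
after = resolve_chain_id_v8(None, STORED_CHAIN_ID)
print(before == STORED_CHAIN_ID, after == STORED_CHAIN_ID)
```

A guard like `check_chain_id` would have turned a silent disagreement into an immediate startup failure, which is far easier for an operator to notice.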

As a result many validators did not specify the chain-id. This compounded with the proposer priority problem, which was not fully fixed but simply frequently reset. Validators continued to fall out of sync on their proposer priority, resulting in nearly two days of downtime where some blocks were produced, only for the chain to fall back out of consensus based on chain-id disagreements or proposer desync.

The Canto v8 development team developed two additional fixes, which the Gravity team coordinated deployment of: one provided a stronger proposer priority resync, and the other automatically supplied the chain-id. With both in place, Canto could finally resume on August 14th.

Shortly after the chain resumed it was discovered that transactions from before the v8 upgrade could not be queried and transactions between the v8 upgrade and the final v8.1.2 version could also not be queried due to the chain-id mismatch showing up as an implied hardfork.

This discovery took more than a day, as no archive nodes were online after the chaos of hardforks on unexpected blocks. It took until August 23rd for complete documentation on getting archive nodes back into sync to be published.

The number and variety of issues regarding accessing block data prevented any cross chain bridge integrated with Canto, with the exception of Gravity Bridge, from returning to operation before August 23rd.

Ongoing changes

Since August 2024 the Gravity team has worked to support and provide technical consultation during a transition of development teams supporting the Canto blockchain. They have also worked to implement and deploy a new version of the Canto.io frontend, together with other community-based development teams, that displays lending market positions and provides a simple GUI for liquidations.

The stability of the Canto blockchain's underlying infrastructure and front end now puts the Canto community in a position to decide what to do next.

Ideas for the future of Canto

These are some ideas for how to expand / improve Canto moving forward. Most of these are simple expansions that work well within the current tech and don’t require significant redevelopment. This isn’t a roadmap, but rather an invitation for the community to provide ideas on what may become the roadmap through on chain governance votes.

Getting the lending market moving

The Note mechanism as it is currently designed isn’t well positioned to leverage illiquid collateral.

Cosmos ecosystem liquid staking tokens represent an attractive option in terms of yield, but due to their volatility will need active liquidation bots as well as a carefully set reserve ratio.

WSTETH and SDAI represent far more conservative options, with comparatively lower yields. A combination of STATOM, WSTETH, and SDAI in the lending market, with collateral ratios of 70%, 85%, and 95% respectively, seems like a reasonable next step.
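
To show what these ratios would mean in practice, here is a small worked sketch. It assumes the ratio is the maximum borrow as a fraction of collateral value; that interpretation, the function names, and the sample amounts are illustrative assumptions, not lending market parameters or code.

```python
from decimal import Decimal

# Proposed ratios from the paragraph above.
COLLATERAL_RATIOS = {
    "STATOM": Decimal("0.70"),
    "WSTETH": Decimal("0.85"),
    "SDAI": Decimal("0.95"),
}


def max_borrow(token: str, collateral_value: Decimal) -> Decimal:
    """Largest debt allowed against collateral worth `collateral_value`."""
    return collateral_value * COLLATERAL_RATIOS[token]


def drop_before_liquidation(token: str, collateral_value: Decimal, debt: Decimal) -> Decimal:
    """Fractional collateral price drop at which the debt reaches the borrow limit."""
    limit = max_borrow(token, collateral_value)
    if debt >= limit:
        return Decimal(0)  # already at or past the limit
    return 1 - debt / limit


collateral = Decimal("1000")
for token in COLLATERAL_RATIOS:
    print(token, max_borrow(token, collateral))

# Borrowing half of the STATOM limit leaves room for a sizeable price drop.
print(drop_before_liquidation("STATOM", collateral, Decimal("350")))
```

The lower ratio for STATOM leaves a larger buffer between the collateral's value and the borrow limit, which gives liquidation bots more time to act on a volatile asset.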

These will provide a practical reason to use NOTE and the Canto lending market outside of the RWA use case. As an upside, they may attract significantly more capital with higher possible yields. At the same time, the volatility of the collateral means that NOTE will be less stable than it was during the RWA era. But with fully liquid and easy-to-liquidate token types, bad debt should be avoidable.