Bus Factor / Single Admin projects


#1

There was some discussion at the FOSDEM gathering about single-admin instances and what happens in case of disaster. Bus factor (https://en.wikipedia.org/wiki/Bus_factor) is definitely something we should consider, and it's likely we should state the number of admins of an instance clearly (in the JSON and the directory) so that users are aware of the bus factor issues.

There was also a suggestion of single admin projects ‘buddying up’ to provide backup for each other when disaster strikes, but this would likely be on an independent one-to-one basis for reasons of trust.

Any ideas on how to effectively deal with this issue?


#2

As a single admin I completely agree with mitigating this risk.

There are different things to take into account…

Quality of Service

As a single admin, you can’t possibly provide services to professionals: you must limit your number of ‘customers’, or else limit the quality of service to PROVIDED AS IS, NO WARRANTY WHATSOEVER.

Not doing so exposes you to risks that you’re not ready to face: if one or more of your ‘customers’ suffers downtime, they will lose money – so you will lose credibility, at minimum (says the guy who deliberately killed his Mastodon instance a few days ago).

Resilience

You may not like banks and how they operate, but banks have solved a number of key issues with regard to secrets management. If you’re hit by a bus, who else has your secrets? The bank uses the “world” concept, where a “world” of secrets can be opened by one or more keys distributed across several people, so that when one person goes missing, the others can still access the vault. This concept is implemented as free software (Shamir’s Secret Sharing Scheme, or ssss) and could be used to ensure that among librehosters, there’s always a way to bring together two or more people to decipher your secret vault in a bus factor case. Beyond that, a piece of critical infrastructure could go down while you need a backup SSH key or something similar, and that could be made available through the librehosters network.
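To make the idea concrete, here is a minimal Python sketch of a k-of-n threshold split in the style of Shamir’s Secret Sharing. It is not the ssss tool itself and has not been audited; the function names, the choice of prime, and the 3-of-5 parameters are assumptions made for illustration only.

```python
import secrets

# Assumption: a Mersenne prime (2**521 - 1) used as the field modulus.
# It is large enough to hold a secret of up to 64 bytes as one integer.
PRIME = 2**521 - 1


def split_secret(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret must be in the range [0, PRIME)")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")
    # Random polynomial of degree threshold - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        # Horner's rule, working modulo PRIME.
        result = 0
        for coeff in reversed(coeffs):
            result = (result * x + coeff) % PRIME
        return result

    # Each share is a point (x, f(x)) with x = 1..shares; x = 0 is never handed out.
    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover_secret(points: list[tuple[int, int]]) -> int:
    """Recover the secret by Lagrange interpolation of the polynomial at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


# Hypothetical usage: a vault passphrase split among five librehosters,
# any three of whom can rebuild it.
passphrase = b"vault passphrase"
as_int = int.from_bytes(passphrase, "big")
all_shares = split_secret(as_int, threshold=3, shares=5)
rebuilt = recover_secret(all_shares[:3])
assert rebuilt.to_bytes(len(passphrase), "big") == passphrase
```

With fewer than `threshold` shares the interpolation yields an unrelated value, which is the property that lets each share be handed to a different person. For real secrets, an existing and reviewed implementation would be the safer choice.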

We could figure out a protocol according to which each librehoster would provide a shared Tomb with the critical secrets needed to access and recover from broken infrastructure. This could be detailed up to the point of creating a Docker container to access those secrets. Each librehoster would have a way to update their Tomb, and one way to access it. A combination of 3 or more librehosters would be able to open the Tomb of another librehoster in case of emergency. This could be documented in the lab, e.g., you would open an issue, three people would state their willingness to open the Tomb, then gather to share keys and open it.
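The approval step of such a protocol could be sketched roughly as follows. This is only an illustration: the `TombRequest` class, the `QUORUM` constant, and the rule that the owner cannot approve their own request are assumptions, not something agreed in this thread.

```python
from dataclasses import dataclass, field

# "3 or more librehosters" from the proposal above.
QUORUM = 3


@dataclass
class TombRequest:
    """Hypothetical record of an emergency request to open one librehoster's Tomb."""

    owner: str
    approvals: set[str] = field(default_factory=set)

    def approve(self, librehoster: str) -> None:
        # Assumption: the owner's own approval does not count toward the quorum.
        if librehoster == self.owner:
            raise ValueError("the owner cannot approve opening their own Tomb")
        # A set ignores repeated approvals from the same librehoster.
        self.approvals.add(librehoster)

    def can_open(self) -> bool:
        # The Tomb may be opened once enough distinct librehosters have agreed.
        return len(self.approvals) >= QUORUM


# Hypothetical usage, mirroring an issue where people record their agreement.
request = TombRequest(owner="alice")
for name in ("bob", "carol", "bob"):
    request.approve(name)
assert not request.can_open()  # only two distinct approvals so far
request.approve("dave")
assert request.can_open()
```

In practice the approvals would live in the issue itself; the point of the sketch is only that distinct approvals are counted and the threshold is explicit.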

Some kind of collective agency should be sought to mitigate the bus factor.