I'm the one who created this incident on our status page. I've been overly cautious in resolving this incident, but at this point I think it's causing more harm than good to keep it unresolved on there.
I think it might've prevented users from posting on our forums or sending in an email (premium support). I can imagine users looking at the status page and mistakenly thinking their problems were related to the current incident.
I've interpreted "Monitoring" as essentially meaning: "this is fixed, but we're keeping a close eye on the situation". We do not yet have a formal process for incidents such as this one (but we are working on that).
If our users are having issues, that's a problem. Looking at our own metrics, the community forum and our premium support inbox: I don't believe this to be the case.
Perhaps we should've done a better job at explaining the exact symptoms our users might be experiencing from this particular incident.
I really appreciate the context. We have an SPA with the frontend deployed on Vercel and a GraphQL backend hosted on Fly. The outage yesterday manifested as 502 errors served to users on the frontend. We had another outage alert at 08:00 PST this morning that lasted about 5-10 minutes. It seemed like the same issue, so we didn't report another incident.
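For what it's worth, the kind of automated check that caught those 502s for us is very simple. This is a minimal sketch, assuming a hypothetical health endpoint; the URL and thresholds are placeholders, not anything Fly-specific:

```python
# Minimal uptime probe: poll an endpoint and flag 5xx (or no) responses.
# HEALTH_URL is a hypothetical placeholder, not a real endpoint.
import urllib.request
import urllib.error

HEALTH_URL = "https://api.example.com/healthz"

def probe(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for a GET; map failures onto status-like codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code  # e.g. the 502s served from the edge
    except (urllib.error.URLError, TimeoutError):
        return 0       # network-level failure: treat as down

def is_outage(status: int) -> bool:
    """Anything other than a 2xx/3xx response counts as an outage signal."""
    return not (200 <= status < 400)
```

Run `is_outage(probe(HEALTH_URL))` on a schedule and page when it flips true a few times in a row; that's roughly all it takes to beat a lagging status page.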
I really like fly, and I think you all are building a great product, but it's looking likely that we're going to migrate off of it. The biggest driver of that has been communication and issues with the status page. Specifically,
- When an incident occurs, we're often among the first to report it on the forum. Over the last month, the status page has lagged pretty significantly behind the incidents. This makes it feel like we're discovering the issue before fly (I don't know if that's true, but that's the perception). Given that our automated tools are alerting us, it's disconcerting to feel like we're keeping a closer eye on our box's health than our cloud provider (again, this is perception based on communication lag, not necessarily reality).
- We have had multiple outages over the last month. In the middle of an outage, while there is an incident banner displayed at the top of the page, all systems show green with 99.98% or 99.99% uptime. That makes us not trust the numbers on the status page. This reinforces the above perception that fly's systems aren't being accurately monitored. Even now, the status page shows 100% uptime for all systems yesterday and today, which is not true.
- We emailed yesterday about our frustrations and concerns - specifically talking about the disconnect between fly's status page and the multiple outages. We explicitly called out the two points above, and how the communication up to this point has been "We've implemented a fix and are monitoring it". We asked for more details about what occurred, and what was being done to mitigate it in the future. The response was pretty boilerplate: "We're sorry you're frustrated. Here are some credits. We've implemented a fix and are monitoring it. Please let us know if you are still encountering issues."
The incidents were a problem, but the disconnect between what was communicated and what actually occurred, across multiple channels, is what's driving us to leave. Here's what likely would have convinced us to stay:
- Over-communicate during the incident. I'd prefer to see more status updates rather than fewer.
- Having clear, proactive incident notification. Even with automated monitoring, things will slip through the cracks, but everything over the last month has felt reactive.
- Make sure the status page clearly reflects reality. If the system is down and everything shows green, then I'm 1) frustrated, and 2) wondering what else is slipping through the cracks.
- Publish retro docs or incident reports after an incident. Specifically, report what changes are being made to prevent an outage going forward.
- Train the support staff to communicate directly with developers. Boilerplate emails that focus on empathizing rather than informing are generally frustrating. Especially if they don't actually answer the questions being asked. I get that it's not reasonable to expect a support person to have an in-depth technical conversation, but this is where public incident reports (or live incident pages) can be really helpful.
I think you all are making a great product, but the issues with alerting, monitoring, and communication are too impactful for our production application. I'm confident you'll figure it out, but it's unlikely that we're going to wait.
When you place a fix into production, you often just hope it resolves all the issues and doesn't create new ones.
However, you don't know whether it resolved everything, because you're only working from the symptoms reported by one user.
If another user has a similar but not identical problem, they won't post about it while the incident is still open. They don't know their case is different and isn't being worked on.
> When you place a fix into production, you often just hope it resolves all the issues and doesn't create new ones.
I hope not. Relying on "hope" when fixing prod is not a recipe for success in my book. It should ideally be possible to recreate the problem in a lesser environment, or at least get a level of comfort that the fix will work based more on fact than "hope" before applying it.
Even then, if you are relegated to the level of hope and prayer when trying to handle an incident, it still doesn't mean you should close it unless you are *certain* it's fixed.
You can mark it as mitigated or fix applied, monitoring for xx period before marking as resolved or similar, surely.
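That lifecycle can be made explicit so nobody can jump straight to "resolved". A rough sketch; the state names and allowed transitions here are illustrative, not taken from any particular status-page tool:

```python
# Illustrative incident lifecycle: a fix must sit in "monitoring" for some
# period before the incident can be marked resolved; it can also be reopened.
INCIDENT_TRANSITIONS = {
    "investigating": {"identified", "monitoring"},
    "identified":    {"monitoring"},
    "monitoring":    {"resolved", "investigating"},  # reopen if the fix didn't hold
    "resolved":      set(),
}

def advance(state: str, next_state: str) -> str:
    """Move an incident to next_state, rejecting skips like identified -> resolved."""
    if next_state not in INCIDENT_TRANSITIONS[state]:
        raise ValueError(f"cannot go from {state!r} to {next_state!r}")
    return next_state
```

The point of the structure is that "resolved" is only reachable through "monitoring", which is exactly the "fix applied, watching for xx period" gate described above.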
I wholly agree. From what I see, OP also agrees, since they will now use stricter criteria: close incidents earlier and only reopen when it's proven that there are other issues.