Major Outage: my.intuiface.com
Incident Report for Intuiface
Postmortem

Analysis of incident:

  • On 22-October, we upgraded our database to enhance Intuiface user security. The upgrade itself went well, but once the server was back in production we detected database timeouts. Investigation revealed that they were caused by heavily altered, and thus fragmented, indexes that degraded performance.
  • We reindexed all but two tables on 24-October without any issue. Reindexing of the two largest tables was deferred to an already scheduled maintenance window on 5-November.
  • On 5-November at around 6:15 AM UTC, the first table was reindexed without any issue. The second table was then reindexed, and this operation generated a very large amount of data. At around 8:00 AM UTC, the disk filled up and the database locked, preventing any further updates to the database tables. At this point, we initiated a rollback of the index.
  • Until 8:50 AM UTC the web services for license check and activation responded with an error every time a Player or Composer attempted to retrieve or activate a license. This affected all free and licensed Players and Composers during startup. It also affected some running, licensed Players that happened to run a license check (which normally occurs once a day). It was also no longer possible to connect to the My Intuiface website (my.intuiface.com) or to the Intuiface Support site.
  • At 8:50 AM UTC, we neutralized the web services for license check and activation so they would no longer reply with an error. This enabled licensed Players and Composers to start correctly. Free Players and Composers could still not launch as they require a successful license check.
  • At 10:57 AM UTC, the table index was fully rolled back, making the database accessible again and enabling us to begin cleanup to free disk space.
  • At 11:00 AM UTC, the server was back online. Players and Composers could check their licenses, and users could again connect to my.intuiface.com and the Support site.
  • After 10 minutes of monitoring, we considered the issue fixed at 11:13 AM UTC.
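
The disk exhaustion in the timeline above is the kind of failure a pre-flight check can sometimes catch. The following is a minimal Python sketch, not the procedure Intuiface used: the names `has_headroom` and `check_volume`, the `safety_factor` parameter, and the idea of estimating the rebuild size in advance are all assumptions made for illustration. It only compares an estimated working-space requirement against the free space reported by the operating system.

```python
import shutil

GB = 1024 ** 3


def has_headroom(free_bytes: int, estimated_bytes: int, safety_factor: float = 2.0) -> bool:
    """Return True if free space covers the estimate times a safety margin.

    All names here are hypothetical. The estimate itself has to come from
    the database's own tooling or from a rehearsal on a copy of the data.
    """
    if estimated_bytes < 0 or safety_factor < 1.0:
        raise ValueError("estimate must be >= 0 and safety_factor must be >= 1.0")
    return free_bytes >= estimated_bytes * safety_factor


def check_volume(path: str, estimated_bytes: int, safety_factor: float = 2.0) -> bool:
    """Apply has_headroom to the volume that contains `path`."""
    free_bytes = shutil.disk_usage(path).free
    return has_headroom(free_bytes, estimated_bytes, safety_factor)


if __name__ == "__main__":
    # Illustrative numbers only: a 140 GB estimate against 200 GB free
    # does not satisfy a 2x margin, so the operation would be postponed.
    print(has_headroom(200 * GB, 140 * GB))  # False
```

A check like this is only as good as its estimate, which is why rehearsing the operation on a copy of the production data remains the more reliable safeguard.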

NOTE: At no time was any customer data – experiences or analytics – ever at risk.

Remediation plans:

  • We will ensure that database failure or inaccessibility will no longer impact running, paid Players or the launch of paid Players and Composers. Such failures are very rare: about 5 hours in total over the last 3 years, an availability rate of better than 99.98%.
  • We will enhance our ability to inform users and customers about issues and remediation steps in real time, both via our website (www.intuiface.com) and through our status.intuiface.com server monitoring website. (NOTE: It is already possible to use our server monitoring site to sign up for notifications about ongoing issues.)
  • We will also take action to better test maintenance operations, even though in this case it was hard to predict that reindexing could lead to a major event. (The faulty reindexing consumed more than 140 GB of disk space for a single table!)
  • An internal review of the incident is currently underway to identify and take the appropriate actions.
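
The availability figure quoted in the first remediation item can be checked with a short calculation. The sketch below assumes 365-day years (an assumption; the report does not say how the period was counted) and uses a hypothetical helper name, `availability_percent`.

```python
def availability_percent(downtime_hours: float, period_hours: float) -> float:
    """Percentage of the period during which the service was available."""
    if period_hours <= 0 or not 0 <= downtime_hours <= period_hours:
        raise ValueError("need 0 <= downtime_hours <= period_hours and period_hours > 0")
    return 100.0 * (1.0 - downtime_hours / period_hours)


# 3 years of 365 days each: 26,280 hours (assumed period length).
HOURS_IN_THREE_YEARS = 3 * 365 * 24

if __name__ == "__main__":
    # 5 hours of downtime over 3 years, as stated in the report.
    print(f"{availability_percent(5, HOURS_IN_THREE_YEARS):.3f}%")  # 99.981%
```

Under these assumptions the result is consistent with the stated "better than 99.98%".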
Posted Nov 07, 2019 - 09:14 CET

Resolved
This incident has been resolved.
Posted Nov 05, 2019 - 12:12 CET
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Nov 05, 2019 - 12:04 CET
Update
We are continuing to work on a fix for this issue.
Posted Nov 05, 2019 - 12:03 CET
Update
The database was recovered and the server is working correctly again.
Any licensing issue should no longer occur.
Posted Nov 05, 2019 - 12:03 CET
Update
The database rollback is still in progress. According to database progress information, the recovery should be finished in 40 minutes.

We expect a complete return to normal operation after 11:00 AM UTC (12:00 PM CET - Central European Time).

We will provide a full postmortem analysis of the problem later today or tomorrow.
Posted Nov 05, 2019 - 11:28 CET
Update
We are still applying a rollback to our database.
We now expect the database to be working again and the server to be fully functional around 10:20 AM UTC (11:20 AM CET - Central European Time).
Posted Nov 05, 2019 - 10:42 CET
Identified
We are currently applying a rollback to our database.
We are expecting the database to be working again and the server to be fully functional around 9:38 AM UTC (10:38 AM CET - Central European Time).
Posted Nov 05, 2019 - 10:10 CET
Update
Due to a server error, licensed Players and Composers could not start properly. This was fixed at 8:50 AM UTC (9:50 AM CET - Central European Time).
Paid Players and Composers can be restarted again.
Please note that it is currently not possible to activate new licenses.

We are currently continuing our investigation.
Posted Nov 05, 2019 - 09:56 CET
Investigating
We are currently investigating.
Posted Nov 05, 2019 - 09:20 CET
This incident affected: Management Console.