At 5:30am UTC on 14-June, we initiated a planned maintenance operation. At the end of the maintenance operation, we observed a significant increase in CPU usage for the Intuiface user database, triggering a high latency for my.intuiface.com access and license activation. Eventually, the database CPU usage went to 100%, preventing both access to my.intuiface.com and license activation of Players and Composers. (Already running Players were not impacted due to an efficient fallback mechanism.)
During the incident, several attempts to identify and fix the issues were made, without definitive success, leading to rollback of the maintenance update.
How do we plan to remediate such incident in the future?
We are actively working on enhancing our performance testing tools and procedures, particularly those for environments simulating near-real-life licenses checks and license activations on a pre-production servers.
We are also improving our incident management process for long lasting incidents so that rollbacks are triggered earlier during the incident, typically after one hour duration, thus reducing their negative impact.