While the whole world was suffering from the coronavirus pandemic, there were also side effects of the global lockdowns visible in the digital world.
Worldwide, many people were forced to work from home, and companies had to provide instant remote workplaces so their employees could collaborate from their own home offices. This caused a massive load on online services. Even the large cloud service providers, although prepared for larger workloads and able to scale their services, were not prepared for such a big move to the cloud.
This boost in remote working also made companies take decisions without the preparation they were used to. One of the companies taking advantage of these changes in the landscape is Microsoft. Their Microsoft Teams solution in particular offers significant benefits for remote collaboration, teamwork and document management: things that are essential in modern workplaces.
The backbones of the Microsoft cloud solutions were suffering from this impressive growth. This was noticeable in some temporarily disabled features, which caused minor changes in the UI of some applications. While these UI changes were visible to end users, Microsoft also made more technical changes that had an impact on applications integrating with their online services. We saw that mainly Microsoft's APIs were suffering from throttling and other temporary errors. The APIs themselves contain some internal optimization around these kinds of issues, but many applications still had performance or stability problems because of the, at times, unreachable endpoints. This led to several calls with Microsoft engineers to look at the root causes of these sometimes persistent issues.
How to deal with it?
Within our P365.Provisioning solution we already had several retry mechanisms to cope with these types of issues, based on earlier cases we had dealt with. But now there were new, formerly stable, APIs causing huge delays and interruptions.
A fresh look at the updated recommendations and guidance on coping with throttling issues made us implement additional measures, such as:
- Implement incremental back-off for all Graph API calls in case of throttling errors (429) or other server errors like ‘Bad Gateway’ (502) and ‘Server Unavailable’ (503)
- Implement incremental back-off for all Client Side Object Model (CSOM) calls in case of all types of throttling errors
- Implement smart retry solutions for specific timeout errors like ‘Gateway Timeout’ (504) which first checks if, despite the timeout message, the action was completed successfully in the background before a retry of the same action is executed
- Implement an overall retry mechanism in case of failures in unexpected places
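To make the first three measures concrete, here is a minimal sketch of that retry behaviour in Python. The status codes follow the article (429 throttling, 502 Bad Gateway, 503 Server Unavailable, 504 Gateway Timeout), but all function and parameter names are illustrative assumptions, not the actual P365.Provisioning implementation:

```python
import time

# Status codes we treat as retryable, per the measures above:
# 429 (throttling), 502 (Bad Gateway), 503 (Server Unavailable).
RETRYABLE = {429, 502, 503}


def call_with_backoff(action, verify_completed=None,
                      max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run `action` (hypothetical: returns a (status, result) tuple),
    retrying with incremental back-off on throttling and server errors.

    For a 504 Gateway Timeout, first ask the optional `verify_completed`
    callback (returns (done, result)) whether the action already succeeded
    in the background, and only retry the same action if it did not.
    """
    delay = base_delay
    status = None
    for attempt in range(1, max_attempts + 1):
        status, result = action()
        if status < 400:
            return result                     # success, no retry needed
        if status == 504 and verify_completed is not None:
            done, result = verify_completed()
            if done:                          # completed despite the timeout
                return result
        if status in RETRYABLE or status == 504:
            if attempt == max_attempts:
                break
            sleep(delay)                      # wait before the next attempt
            delay *= 2                        # incremental back-off
            continue
        break                                 # non-retryable error: give up
    raise RuntimeError(f"request failed with HTTP {status}")
```

As a usage sketch, a call that first hits throttling and a server error would succeed on the third attempt: `call_with_backoff(flaky_graph_call)` sleeps 1 s, then 2 s, between retries. In a real implementation you would also honor the `Retry-After` header that the Microsoft APIs return on 429 responses instead of a fixed base delay.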
These improvements made our P365.Provisioning tooling as stable as it is right now, despite the heavy load on the Microsoft systems and the many errors that still occur today. We've managed to reach a success rate of nearly 100% for provisioning requests.
Because it's almost impossible to deal with every possible scenario, also think of smart notifications and manual interventions for the edge cases where the above measures aren't sufficient. To overcome these kinds of incidental issues, for instance a longer period of unavailable services, we've implemented a manual retry function: system administrators can now restart the whole provisioning process at any time with one click.
Interested in what P365.Provisioning and CTB can mean for your company?