Long running backup jobs for Customers in Australia

Minor incident Australia Australia East
03-22-2024 16:00 CET · 4 days, 18 hours, 37 minutes

Updates

Post-mortem

Dear valued Jedox customer,

As mentioned in our last communication from the 9th of April, we are providing additional details from the postmortem information that we have received from our Cloud Service Provider.

The newer version of the Storage Appliance OS contained a bug. After this version was installed, storage latency increased, which caused the overall issues.

The following preventive and corrective actions were taken by our Cloud Service Provider:
  - Additional QA testing for all future updates.
  - The rollout plan for future Storage Appliance OS versions has been adjusted to allow zone failover in case of any issue.
  - The Change Management process has been updated to include an additional approval process.

We apologize for any inconvenience caused. Jedox remains committed to driving corrective actions to avoid future recurrences.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

May 3, 2024 · 15:00 CEST
Post-mortem

Dear valued Jedox customers,

From March 20, 2024 at 18:07 UTC until March 27, 2024 at 09:26 UTC, the backup jobs for our cloud environments in Australia ran longer than expected.

The backups were running longer due to increased latency towards the storage pool.
As a result, customers experienced slow load times, or their environments failed to come online after the backup.

Preliminary root cause: when we checked the storage pool metrics during the investigation, we immediately noticed that the read latency was high compared to the expected average.

During a later investigation, it was identified that the cause of the latency issue was a bug in the software version upgrade of the storage pool performed by the Cloud Service Provider.

In order to mitigate the outage, the following Corrective Actions have been taken:

  • We continued our efforts to restore the faulty backup jobs.
  • We expanded the infrastructure in the region to achieve better scalability in this situation.
  • We added extra storage capacity to improve the throughput.

Additionally, we have taken the following Preventive Actions:

  • Enhanced the monitoring toolset to improve detection for write and read alarms.
  • Integrated Azure Monitoring in Grafana to improve the overall observability of our system.
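
As a rough illustration of the kind of read-latency check such monitoring might perform, the sketch below flags a window of samples whose average exceeds a multiple of the expected average. The function name, the threshold factor, and the minimum sample count are assumptions made for this example; it is not the actual alert configuration.

```python
from statistics import mean


def latency_alarm(samples_ms, baseline_ms, factor=2.0, min_samples=3):
    """Return True when the mean of the recent samples exceeds factor * baseline.

    samples_ms  -- recent latency measurements in milliseconds (hypothetical input)
    baseline_ms -- the expected average latency in milliseconds
    factor      -- assumed multiplier above which the alarm fires
    min_samples -- assumed minimum window size, to avoid alarming on a single spike
    """
    if len(samples_ms) < min_samples:
        return False
    return mean(samples_ms) > factor * baseline_ms
```

Comparing against a multiple of the baseline rather than a fixed value keeps the check meaningful if the expected average differs between storage pools.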

We are still awaiting the final details of the postmortem from our Cloud Service Provider. Upon receiving them, we will update this communication to include the final root cause and their Preventive and Corrective Actions.

Please rest assured that our top priority is to provide you with the best possible service and we apologize for any inconvenience this may have caused you.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your understanding.

April 9, 2024 · 16:32 CEST
Resolved

Dear valued customers,

We would like to provide you with another update on the situation with the latency issue which was causing long running backup jobs in Australia.

Since our previous communication yesterday, the high latency that caused the long runs of the backup jobs has not recurred, and the backup processes ran as expected.

The fix applied by the Cloud Service Provider appears to have resolved the error, and we will now close this incident.

We are still awaiting details for the postmortem from our Cloud Service Provider, and these will be shared with you via: https://status.jedox.cloud/

We apologize again for any inconvenience caused. Jedox remains committed to driving corrective actions to avoid future recurrences.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 27, 2024 · 10:26 CET
Update

Dear valued customers,

We would like to provide you with another update on the investigation of the latency issue which is causing long running backup jobs in Australia.

We can confirm that our services have been resumed and operations are running without any issues.

The latency is within the expected parameters, both during normal operations and at the peak of backup processing. We will keep monitoring the system parameters to make sure everything is running as designed. If no other events are observed, we will close this communication tomorrow no later than 11:00 UTC.

We are still awaiting details for the postmortem from our Cloud Service Provider, and these will be shared with you via: https://status.jedox.cloud/

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 26, 2024 · 16:59 CET
Update

Dear valued customers,

We would like to provide you with another update on the investigation of the latency issue which is causing long running backup jobs in Australia.

With help from the Cloud Service Provider, we have managed to identify the potential cause of the incident, which seems to be a bug in their latest software deployment on Storage.

They confirmed that a workaround was implemented several hours ago and recommended that we resume our services and monitor the situation in the affected region.

Today we will run the normal maintenance process and if the solution is sustainable, then we will close this communication tomorrow no later than 11:00 UTC.

We are still awaiting details for the postmortem from our Cloud Service Provider, and these will be shared with you via: https://status.jedox.cloud/

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 26, 2024 · 10:55 CET
Update

Dear valued customers,

We would like to provide you with another update on the investigation of the latency issue which is causing long running backup jobs in Australia.

Since our last communication, we have continued our efforts and collaboration with the Cloud Service Provider to identify a solution to the problem.

The maintenance windows were partially re-enabled in order to restore the majority of the operations that had initially been shifted to a later point in time for the affected region.

Our engineers have noticed a significant improvement in the latency levels, and we will continue working with the Site Reliability Engineers from our Cloud Service Provider to bring the latency within the normal range.

We will provide further updates as events advance, including a Postmortem within 7 days after the final fix is applied.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 25, 2024 · 18:58 CET
Update

Dear valued customers,

We would like to provide you with another update on the investigation of the latency issue which is causing long running backup jobs in Australia.

Since our last communication, we have continued our efforts and collaboration with the Cloud Service Provider to identify the root cause of the problem.

As recommended by the Site Reliability Engineers from our Cloud Service Provider, we have updated our Storage parameters to optimize performance on all affected environments.

At the moment we are carrying out extensive testing along with our Cloud Service Provider to determine whether the backup services can be resumed within the optimal parameters.

We will provide further updates as events advance, including a Postmortem within 7 days after the final fix is applied.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 25, 2024 · 10:49 CET
Investigating

Dear valued customers,

We would like to provide you with another update on the investigation of the latency issue which is causing long running backup jobs in Australia.

Since our last communication, we have intensified our efforts and collaboration with the Cloud Service Provider to identify the root cause of the problem.

The Site Reliability Engineers from the Cloud Service Provider have confirmed that, following their upgrade of the Storage in Australia on Mar 19, some parameters should be adjusted to new values to improve the Storage operation.

We have made the adjustments they recommended on one environment to test the latency behavior, and the initial results look promising.

Our engineers will continue their evaluation and rollout of the solution, based on further testing, and perform parallel investigations until the full remediation of the incident is achieved.

We will provide further updates as events advance, including a Postmortem within 7 days after the final fix is applied.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 24, 2024 · 20:37 CET
Update

Dear valued customers,

We would like to provide you with an update on the investigation of the latency issue which is causing long running backup jobs in Australia.

Since our last communication, we have continued our cooperation with the Cloud Service Provider to identify the root cause of the problem.

With their support, we have further narrowed down the scope of the investigation: after an analysis of the network traffic, it was agreed that connectivity does not seem to be the source of the issue.

They have also checked our Kubernetes services, which are in a healthy state.

Our engineers will monitor the situation and continue their investigations until the full remediation of the incident is achieved.

We will provide further updates as events advance, including a Postmortem within 7 days after the final fix is applied.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 24, 2024 · 08:52 CET
Update

Dear valued customers,

We would like to provide you with an update on the investigation of the latency issue which is causing long running backup jobs in Australia.

Since our last communication, we have continued our cooperation with the Cloud Service Provider to identify the root cause of the problem.
They have confirmed that the time when the latency occurred on our end coincides with the timing of their maintenance activities in the affected Cloud region.

With their support, we have managed to narrow down the scope of the investigation, as the Storage was excluded as a cause of the latency.

Our engineers will monitor the situation and continue their investigations until the full remediation of the incident is achieved.

We will provide further updates as events advance, including a Postmortem within 7 days after the final fix is applied.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 23, 2024 · 17:48 CET
Investigating

Dear valued customers,

We would like to provide you with an update on the investigation of the latency issue which is causing long-running backup jobs in the Australia Cloud Region.

Since our last communication, we have continued monitoring the systems and, in parallel, held several troubleshooting sessions with the engineers from the Cloud Service Provider to address the problem.

So far we are noticing an improvement in the performance metrics for the affected environments, and we will continue to work on several actions from the recovery plan until full restoration is achieved.

We will monitor the situation and will provide further updates as events advance.

We apologize for any inconvenience caused. Jedox remains committed to driving corrective actions to avoid future recurrences.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 23, 2024 · 08:20 CET
Issue

Dear valued customers,

In our previous communication we noted an improvement in the latency for the long-running jobs, which was confirmed by the experts from our Cloud Service Provider.

When the next batch of jobs started to run at 3 PM UTC, the long-running backup jobs with high latency occurred again, and we have decided to reopen the incident.

We have re-escalated the incident with our Cloud Service Provider, increasing its priority to the highest level, and we are in direct contact with their engineers to determine the final root cause of the issue.

To prevent further direct impact on the availability of customer systems, the maintenance windows will for the moment be shifted to a later point in time, so that the systems continue operating as expected.

We are continuing to monitor the situation and will provide updates as events advance.

We apologize for any inconvenience caused. Jedox remains committed to driving corrective actions to avoid future recurrences.

If you have any further questions or concerns, please do not hesitate to contact our Support Team via Jedox Customer Portal.

Thank you for your continued partnership and patience!

March 22, 2024 · 17:07 CET