Overview

The Huawei Kunlun high-performance computing cluster, running Red Hat Enterprise Linux, serves as the backbone for 17 critical databases supporting major applications, including an essential calls database. However, the cluster had a significant resource distribution problem: under high database load, a node would be fenced out. This resulted in prolonged downtime and severely impacted service availability.

Solution

After investigation and troubleshooting, we traced the issue to the cluster's heartbeat configuration. Specifically, the heartbeat token parameter was set too low, so under heavy database load nodes timed out during status updates and were fenced. Raising this parameter to a more appropriate value resolved the issue.
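To illustrate why a token value that is too low leads to fencing, here is a minimal Python sketch. It is not the tooling used in this engagement: the function names, the delay figures, and the headroom factor are all hypothetical. It also assumes the timeout is expressed in milliseconds, as it is for the `token` setting on Corosync-based Red Hat clusters; the case study itself does not name the cluster stack or the values involved.

```python
import math


def missed_heartbeats(delays_ms, token_timeout_ms):
    """Return the observed delays that exceed the token timeout.

    Each such delay represents a status update that would arrive too late,
    so the node would be treated as failed and become a fencing candidate.
    """
    return [d for d in delays_ms if d > token_timeout_ms]


def suggest_token_timeout(delays_ms, headroom=2.0, step_ms=1000):
    """Suggest a timeout: worst observed delay times a headroom factor,
    rounded up to the next multiple of step_ms.

    The headroom factor and rounding step are illustrative assumptions,
    not vendor guidance.
    """
    if not delays_ms:
        raise ValueError("need at least one observed delay")
    target = max(delays_ms) * headroom
    return int(math.ceil(target / step_ms) * step_ms)


# Hypothetical heartbeat delays (ms) measured during a database load peak.
observed = [120, 180, 950, 2400, 300]

print(missed_heartbeats(observed, token_timeout_ms=1000))  # [2400]
print(suggest_token_timeout(observed))                     # 5000
```

A larger timeout tolerates load spikes but also delays detection of a node that has truly failed, so the value is a trade-off rather than something to maximize.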

Result

Since the change was applied, the Kunlun cluster has remained stable. Service availability has held at 100%, with zero errors and zero instances of downtime. The customer's cluster now provides uninterrupted support for its critical applications and databases.

All Rights Reserved to QAST © Designed by Qeematech