TracCloudStatus: Difference between revisions
From Redrock Wiki
TracCloud is currently online and operating normally.
Latest revision as of 16:53, 25 October 2024
November:
October:
2024-10-25 - 🟡 Certain TracCloud pages not loading between 4:00pm and 4:45pm MST.
2024-10-21 - 🟡 Intermittent errors, resolved at 8:30am MST.
2024-10-17 - 🟡 Intermittent errors and slowness, resolved at 7:30am MST.
Dear Valued Customers,
We want to sincerely apologize for the recent downtime on 10/8-10/9 and provide you with a detailed explanation of what occurred, as well as the steps we're taking to ensure this doesn't happen again.
Issue Summary:
Over the affected days, we experienced intermittent connection issues with our read-only database instance. These issues resulted in a higher-than-usual rate of client-side timeouts, despite the database server reporting healthy connection states. This discrepancy between server-side and client-side perspectives made the root cause difficult to diagnose immediately.
Detailed Analysis:
The problem manifested as a significant percentage of requests (approximately 20-25%) to the read replica database timing out on the client side, even though the database instance reported successful connections. Typically, we observe a 2-to-1 ratio of master-to-slave database connections. During this period, however, a significant number of requests to the read replica failed, which we would have seen in the metrics had the database instance been reporting data correctly.
Our monitoring tools, including CloudWatch and internal health dashboards, initially showed the database instance as operational, with most metrics indicating normal behavior. However, a closer look revealed that certain critical metrics were missing or underreported during a brief window at the start of the outage. After a few minutes these specific metrics returned to "normal" values, but the connection issue persisted. This contributed greatly to the challenge of diagnosing the issue in real time. The instance remained available from the perspective of AWS and continued to report healthy status across most metrics, which further obscured the problem.
Symptoms:
- Client-side timeouts while attempting to connect to the read-only replica.
- The server-side reported successful connection attempts.
- No significant anomalies were initially detected in the database’s reported health metrics.
- The issue persisted intermittently, making replication of the problem inconsistent.
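The mismatch described above, where the server reports healthy connections while clients time out, is the kind of thing a client-side probe can surface. The following is a minimal sketch only, not the monitoring code used for TracCloud: the function names, the example host name, and the choice of a plain TCP connect as the probe are all assumptions made for illustration.

```python
import socket
from typing import Callable


def tcp_connect(host: str, port: int, timeout_s: float) -> None:
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.create_connection((host, port), timeout=timeout_s):
        pass


def failure_rate(attempt: Callable[[], None], attempts: int) -> float:
    """Run `attempt` repeatedly and return the fraction of calls that raised OSError."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            attempt()
        except OSError:  # connection timeouts are reported as OSError subclasses
            failures += 1
    return failures / attempts


# Hypothetical usage (host name and port are placeholders, not real endpoints):
# rate = failure_rate(lambda: tcp_connect("replica.example.internal", 3306, 2.0), 20)
```

Measured from the client side, a rate near the 20-25% described above would be visible even when the server's own connection metrics look normal.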
Potential Root Cause Hypothesis:
Based on the observed behavior, the most plausible hypothesis is a network-related issue between the read replica and the clients. This could involve intermittent network degradation or failure in one of the intermediate layers responsible for routing client connection attempts. It is possible that the database instance accepted connections but failed to fully establish them with the clients due to external factors, such as storage network instability or client-side network routing problems.
Without access to more detailed server logs from the underlying database infrastructure, it is challenging to pinpoint the exact cause. However, the behavior suggests that the database received and acknowledged connection attempts, but clients were unable to establish or maintain full connections, leading to the observed timeouts. AWS restricts access to certain low-level logs, particularly those related to hardware performance and infrastructure, as part of their shared responsibility model. These logs are considered sensitive and are not exposed to customers, making it difficult to definitively trace issues related to hardware failures or underlying network disruptions. Without access to these internal logs, we can only hypothesize the root cause based on the available metrics.
Mitigations Implemented:
While we already had extensive monitoring in place, the challenge was that the issue didn’t immediately show up in the standard metrics. It was only after examining certain specific, detailed metrics that the root cause became clearer. As a result, we have expanded our monitoring from 20 to 36 metrics per instance, allowing for a much deeper level of insight.
In particular, we focused on metrics where AWS reports zero in cases of underlying issues, rather than omitting data. Many of the new alarms are now configured to trigger when specific data points drop to zero, which is often a more reliable indicator of problems in these areas. By refining our alerting mechanisms and expanding the scope of metrics we track, we’re now in a better position to detect these subtle issues before they escalate.
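The zero-value alarm idea can be illustrated with a short sketch. It is an assumption-laden example rather than the actual alarm configuration: the function name, the three-period window, and the option to treat missing datapoints (represented as `None`) as breaching are all hypothetical.

```python
from typing import Optional, Sequence


def should_alarm(
    datapoints: Sequence[Optional[float]],
    periods: int = 3,
    treat_missing_as_breaching: bool = True,
) -> bool:
    """Return True if each of the last `periods` datapoints is zero (or missing)."""
    if periods <= 0:
        raise ValueError("periods must be positive")
    recent = list(datapoints)[-periods:]
    if len(recent) < periods:
        return False  # not enough history to decide

    def breaching(value: Optional[float]) -> bool:
        if value is None:  # datapoint was never reported
            return treat_missing_as_breaching
        return value == 0

    return all(breaching(value) for value in recent)
```

Requiring several consecutive zero periods is a design choice in this sketch that trades a slightly slower alert for fewer false alarms on metrics that legitimately dip to zero for a moment.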
Next Steps:
As we continue moving forward, our efforts are focused on enhancing the resilience and efficiency of our systems. It's important to highlight that we are continuously improving the robustness of our infrastructure each week, even when no specific issues prompt the work. These ongoing improvements are part of our commitment to ensuring that our systems remain not only stable but also increasingly optimized over time.
Beyond resolving this particular issue, we’re already implementing further refinements to maintain a proactive approach. Regular performance tuning, infrastructure reviews, and strategic enhancements are being made to ensure our system stays ahead of potential problems. This forward-looking effort means we're not waiting for issues to arise—we’re actively strengthening our systems and processes to prevent them before they can impact service.
2024-10-08 - 🔴 TracCloud inaccessible, starting at 10:50am MST.
September:
2024-09-25 - 🟡 Some TracCloud instances occasionally experienced 500 errors. Resolved around 10:30am MST.
2024-09-23 - 🟡 System slowness around 12:00pm MST.
2024-09-23 - 🟡 Automated emails (auto report emails, reminder emails, etc.) not sending. Resolved around 9:00am MST.
2024-09-19 - 🟡 Reason categories not working correctly. Resolved at 7:14am MST.
We wanted to inform you that the system issue experienced today has been successfully resolved. While Amazon Web Services (AWS) does not always provide direct notifications for hardware-related issues, we have identified that the root cause was likely a hardware fault affecting systems in one of AWS's availability zones in the region TracCloud resides in (Zone A). This issue impacted several unrelated systems that were all using resources in that zone.
To resolve the problem, we manually redirected operations to a different zone (Zone B), and both TracCloud and other associated systems are now functioning as expected. We are also in contact with AWS for further clarification, but at this point, the issue has been fully addressed.
We wanted to provide you with an update regarding the web performance issues some of you may have experienced between August 29th and September 12th. We understand the importance of reliable service, and we apologize for any inconvenience this may have caused.
Over the past three weeks, we encountered a convergence of technical challenges that impacted our web performance:
1. PHP Unserialization Bug: We discovered a complex bug where PHP unserialization operations were consuming 50-100% more CPU cycles after upgrading from version 8.3.10 to 8.3.11. This unexpected increase in resource usage led to slower processing times for certain functions.
2. RDS Performance Degradation: Our primary RDS instance exhibited unexplained performance degradation. Despite normal operational metrics, the instance underperformed, affecting database response times.
3. Index Changes and Aurora RDS Compatibility: Recent changes to our database indexes affected Amazon Aurora RDS differently than our on-premise MySQL version, despite their advertised compatibility. This discrepancy resulted in suboptimal query performance on the Aurora instance.
4. Shifts in Traffic Patterns: We observed a shift in traffic patterns prioritizing higher-load-inducing requests. Coupled with a natural increase in traffic due to the start of the fall semester and an expanding customer base, our systems experienced additional strain.
Actions Taken: We have addressed each of these issues methodically:
- PHP Optimization: Our development team isolated and resolved the PHP deserialization bug. We implemented optimizations to reduce CPU usage back to expected levels.
- RDS Instance Investigation: We initiated a thorough investigation into the anomalous RDS instance. To mitigate immediate impacts, we have removed this instance from our cluster while we continue to analyze the root cause.
- Database Index Adjustments: We modified our indexing strategy to better align with Aurora RDS's performance characteristics, ensuring consistent query performance across both our cloud and on-premise databases.
- Enhanced Benchmarking Tools: We've improved our benchmarking systems to more accurately track query performance before changes go live. These enhancements will help us identify potential issues earlier in the development cycle.
- Resource Scaling: In response to increased traffic and load, we've scaled up our resources to maintain optimal performance levels.
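As an illustration of the pre-release benchmarking idea in the list above, the sketch below compares query timings before and after a change and flags a regression. It is not the benchmarking system described here: the function name, the use of the median, and the 20% tolerance are assumptions chosen for the example.

```python
from statistics import median
from typing import Sequence


def is_regression(
    baseline_ms: Sequence[float],
    candidate_ms: Sequence[float],
    tolerance: float = 0.20,
) -> bool:
    """Return True if the candidate median exceeds the baseline median by more than `tolerance`."""
    if not baseline_ms or not candidate_ms:
        raise ValueError("both timing samples must be non-empty")
    # The median is less sensitive to a single slow outlier than the mean.
    return median(candidate_ms) > median(baseline_ms) * (1 + tolerance)
```

A check like this could run against both an Aurora instance and an on-premise MySQL instance, since the same index change may behave differently on each.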
Our Commitment to Reliability: We continuously monitor thousands of metrics and deploy hundreds of sensors across all TracCloud systems to track performance in real time. While challenges can occur, this extensive monitoring enables us to swiftly identify and address issues. Rest assured, each time an incident happens, we implement measures to prevent it from happening again, strengthening our systems for the future.
Moving Forward: We are committed to providing you with the highest level of service. The steps we've taken have stabilized web performance, and we will continue to monitor our systems closely. Our team is also exploring long-term solutions to prevent similar issues in the future. We appreciate your patience and understanding during this time. If you have any questions or continue to experience any issues, please don't hesitate to reach out to our support team. Thank you for your continued trust in our services.
2024-09-11 - 🔴 TracCloud 502/504 errors or slowness, resolved at 9:00am MST.
August:
2024-08-15 - 🔴 TracCloud slowness or timeout errors between 3:00pm and 3:30pm and between 4:20pm and 4:34pm MST.
2024-08-09 - 🔴 TracCloud sporadically inaccessible with 503 and 504 errors between 10:48am and 11:20am MST.
July:
2024-07-11 - 🟡 Reason drop-down on search availability widget not working on some TracCloud instances between 10:49am and 11:31am MST.
June:
May:
2024-05-31 - 🟡 TracCloud job server down, causing automated emails and processes to run late.
2024-05-22 - 🟠 502 Bad Gateway error from 11:45am to 11:51am MST.
April:
2024-04-23 - 🟡 AWS errors preventing many features from working as expected. Resolved at 7:45am MST.
2024-04-22 - 🟡 Appointment statuses not displaying in appointments over the weekend. Resolved on Monday at 8:29am MST.
2024-04-11 - 🟡 Reminder emails not sending for some instances. Resolved at 5:32pm MST.
March:
February:
2024-02-15 - 🔴 SAML error 500 preventing SSO login between 2:00pm and 2:30pm MST.
2024-02-13 - 🔴 SAML errors preventing SSO login, resolved at 8:00am MST.
2024-02-12 - 🟡 Registration lists displayed when booking appointments, logging in, etc. not functioning correctly starting at 2:57pm. Most issues resolved at 3:13pm MST; quick visit section selection and badge section selection resolved at 3:57pm.
2024-02-02 - 🔴 SAML SSO logins failing from 2:35pm to 2:49pm MST.
January:
2024-01-25 - 🔴 TracCloud displaying a 503 error between 1:51pm and 2:07pm MST.
2024-01-16 - 🟡 403 Forbidden error on certain records, resolved 7:02am MST.
2024-01-04 - 🟡 403 Forbidden error on Center, Group, and Staff records from 7:40am to 8:40am MST.
December:
2023-12-13 - 🟡 Users/kiosks being frequently logged out during system use. Resolved at 8:50am MST.
2023-12-12 - 🟠 TracCloud error 500 from 11:15 to 11:20 MST due to PHP 8.2 issue.
2023-12-07 - 🟡 TracCloud update not functioning as expected, reverted to previous version at 6:44am MST.
2023-11-27 - 🟡 Palo Alto Networks incorrectly blocked TracCloud; clients using their firewall service lost access to TracCloud until PAN whitelisted traccloud.go-redrock.com.
2023-11-02 - 🔴 TracCloud error 500 from 1:10pm to 1:50pm PST.
2023-11-01 - 🟡 TracCloud job server limitations, email address errors, and time zone issues. Resolved at 8:40am.
2023-10-31 - 🟠 TracCloud error 500 from 11:37am to 11:48am and 2:10pm to 2:15pm PST.
2023-10-27 - 🟡 TracCloud job server limitations, causing automated emails and processes to run late.
2023-10-26 - 🔴 TracCloud error 500 from 4:55pm to 5:15pm PST.
2023-10-06 - 🔴 TracCloud errors and slowness from 9:50 to 10:20am PST.
2023-10-05 - 🔴 TracCloud 500, 502, 503, 504 errors from 3:40 to 4:15pm PST.
2023-09-26 - TracCloud bad gateway error from 2:45 to 3:00pm PST.
2023-06-13 - TracCloud instances with custom URLs received error 500 from 3:15pm to 3:35pm PST.
2023-05-09 - TracCloud 503 error from 2:00pm to 2:10pm.
2023-05-08 - TracCloud slowness/errors due to failed maintenance over weekend, resolved at 1:30pm.
2022-12-05 - TracCloud connection timing out for certain ISPs due to AWS outage from 11:34 AM to 12:51 PM PST. https://health.aws.amazon.com/health/status
2022-10-18 - TracCloud running slowly due to AWS limitations. Resolved at 2:30pm MST.
2022-10-03 - TracCloud Appointment Scheduling limitations, resolved at 9:19am MST.
2022-07-28 - TracCloud down due to AWS outage from 10:11am to 11:25am MST. https://health.aws.amazon.com/health/status
2022-07-25 - TracCloud SSO limitations, resolved at 8:00am MST.