TracCloudStatusHistory
From Redrock Wiki
<div class="tcWidgetPage">
<div style="text-align:center"><b>History:</b></div>
<hr>
<div style="float: left; margin-top: 0em; margin-bottom: 1em"><b>2024-09-16</b> - TracCloud SAML logins failing beginning at 1:08pm, TracCloud down at 1:40pm, both resolved at 2:18pm MST.</div><div class="mw-collapsible mw-collapsed">
<br><br>
We wanted to inform you that the system issue experienced today has been successfully resolved. While Amazon Web Services (AWS) does not always provide direct notifications for hardware-related issues, we have identified that the root cause was likely a hardware fault affecting systems in one of AWS's availability zones in the region TracCloud resides in (Zone A). This issue impacted several unrelated systems that were all using resources in that zone.
Upon identifying this, we manually redirected operations to a different zone (Zone B), and both TracCloud and other associated systems are now functioning as expected. We are also in contact with AWS for further clarification, but at this point, the issue has been fully addressed.
</div>
<hr>
<div style="float: left; margin-top: 0em; margin-bottom: 1em"><b>2024-09-12</b> - TracCloud periodically running slowly between August 29th and September 12th.</div><div class="mw-collapsible mw-collapsed">
<br><br>
We wanted to provide you with an update regarding the web performance issues some of you may have experienced between August 29th and September 12th. We understand the importance of reliable service, and we apologize for any inconvenience this may have caused.
Over the past three weeks, we encountered a convergence of technical challenges that impacted our web performance: <b>1. PHP Unserialization Bug</b>: We discovered a complex bug where PHP unserialization operations were consuming 50-100% more CPU cycles after upgrading from version 8.3.10 to 8.3.11. This unexpected increase in resource usage led to slower processing times for certain functions. <b>2. RDS Performance Degradation</b>: Our primary RDS instance exhibited unexplained performance degradation. Despite normal operational metrics, the instance underperformed, affecting database response times. <b>3. Index Changes and Aurora RDS Compatibility</b>: Recent changes to our database indexes affected Amazon Aurora RDS differently than our on-premises MySQL version, despite their advertised compatibility. This discrepancy resulted in suboptimal query performance on the Aurora instance. <b>4. Shifts in Traffic Patterns</b>: We observed a shift in traffic patterns toward higher-load requests. Coupled with a natural increase in traffic due to the start of the fall semester and an expanding customer base, our systems experienced additional strain.
<b>Actions Taken</b>: We have addressed each of these issues methodically:
* <b>PHP Optimization</b>: Our development team isolated and resolved the PHP unserialization bug. We implemented optimizations to bring CPU usage back to expected levels.
* <b>RDS Instance Investigation</b>: We initiated a thorough investigation into the anomalous RDS instance. To mitigate immediate impacts, we have removed this instance from our cluster while we continue to analyze the root cause.
* <b>Database Index Adjustments</b>: We modified our indexing strategy to better align with Aurora RDS's performance characteristics, ensuring consistent query performance across both our cloud and on-premises databases.
* <b>Enhanced Benchmarking Tools</b>: We've improved our benchmarking systems to more accurately track query performance before changes go live. These enhancements will help us identify potential issues earlier in the development cycle.
* <b>Resource Scaling</b>: In response to increased traffic and load, we've scaled up our resources to maintain optimal performance levels.
<b>Our Commitment to Reliability</b>: We continuously monitor thousands of metrics and deploy hundreds of sensors across all TracCloud systems to track performance in real time. While challenges can occur, this extensive monitoring enables us to swiftly identify and address issues. Rest assured, each time an incident occurs, we implement measures to prevent it from happening again, strengthening our systems for the future.
<b>Moving Forward</b>: We are committed to providing you with the highest level of service. The steps we've taken have stabilized web performance, and we will continue to monitor our systems closely. Our team is also exploring long-term solutions to prevent similar issues in the future.
We appreciate your patience and understanding during this time. If you have any questions or continue to experience any issues, please don't hesitate to reach out to our support team. Thank you for your continued trust in our services.
</div>
<hr>
<b>2024-09-11</b> - TracCloud 502/504 errors or slowness, resolved at 9:00am MST.
</div>
<hr>
</div>
Revision as of 14:54, 16 September 2024
2024-08-15 - TracCloud slowness or timeout errors between 3:00pm and 3:30pm and 4:20pm to 4:34pm MST.
2024-08-09 - TracCloud sporadically inaccessible with 503 and 504 errors between 10:48am and 11:20am MST.
2024-07-11 - Reason drop-down on search availability widget not working on some TracCloud instances between 10:49am and 11:31am MST.
2024-05-31 - TracCloud job server down, causing automated emails and processes to run late.
2024-05-22 - 502 Bad Gateway error from 11:45am to 11:51am MST.
2024-04-23 - AWS errors preventing many features from working as expected. Resolved at 7:45am MST.
2024-04-22 - Appointment statuses not displaying in appointments over the weekend. Resolved on Monday at 8:29am MST.
2024-04-11 - Reminder emails not sending for some instances. Resolved at 5:32pm MST.
2024-02-15 - SAML error 500 preventing SSO login between 2:00pm and 2:30pm MST.
2024-02-13 - SAML errors preventing SSO login, resolved at 8:00am MST.
2024-02-12 - Registration lists (displayed when booking appointments, logging in, etc.) not functioning correctly starting at 2:57pm. Most issues resolved at 3:13pm MST; quick visit section selection and badge section selection resolved at 3:57pm.
2024-02-02 - SAML SSO logins failing from 2:35pm to 2:49pm MST.
2024-01-25 - TracCloud displaying a 503 error between 1:51pm and 2:07pm MST.
2024-01-16 - 403 Forbidden error on certain records, resolved 7:02am MST.
2024-01-04 - 403 Forbidden error on Center, Group, and Staff records from 7:40am to 8:40am MST.
2023-12-13 - Users/kiosks being frequently logged out during system use. Resolved at 8:50am MST.
2023-12-12 - TracCloud error 500 from 11:15 to 11:20 MST due to PHP 8.2 issue.
2023-12-07 - TracCloud update not functioning as expected, reverted to previous version at 6:44am MST.
2023-11-27 - Palo Alto Networks incorrectly blocked TracCloud, clients using their firewall service lost access to TracCloud until PAN whitelisted traccloud.go-redrock.com.
2023-11-02 - TracCloud error 500 from 1:10pm to 1:50pm PST.
2023-11-01 - TracCloud job server limitations, email address errors, and time zone issues. Resolved at 8:40am.
2023-10-31 - TracCloud error 500 from 2:10pm to 2:15pm PST.
2023-10-31 - TracCloud error 500 from 11:37am to 11:48am PST.
2023-10-27 - TracCloud job server limitations, causing automated emails and processes to run late.
2023-10-26 - TracCloud error 500 from 4:55pm to 5:15pm PST.
2023-10-06 - TracCloud errors and slowness from 9:50 to 10:20am PST.
2023-10-05 - TracCloud 500, 502, 503, 504 errors from 3:40 to 4:15pm PST.
2023-09-26 - TracCloud bad gateway error from 2:45 to 3:00pm PST.
2023-06-13 - TracCloud instances with custom URLs received error 500 from 3:15pm to 3:35pm PST.
2023-05-09 - TracCloud 503 error from 2:00pm to 2:10pm.
2023-05-08 - TracCloud slowness/errors due to failed maintenance over weekend, resolved at 1:30pm.
2022-12-05 - TracCloud connection timing out for certain ISPs due to AWS outage from 11:34 AM to 12:51 PM PST. https://health.aws.amazon.com/health/status
2022-10-18 - TracCloud running slowly due to AWS limitations. Resolved at 2:30pm MST.
2022-10-03 - TracCloud Appointment Scheduling limitations, resolved at 9:19am MST.
2022-07-28 - TracCloud Down due to AWS outage from 10:11am to 11:25am MST. https://health.aws.amazon.com/health/status
2022-07-25 - TracCloud SSO limitations, resolved at 8:00am MST.