On Thursday, June 29, 2017, we experienced a technical problem that caused some customers to incorrectly see a “license expired” message in the ExpressVPN apps. Affected customers had to log out and log back into the apps to regain access to the VPN.
This post explains what caused the problem and the steps we’re taking to prevent such events from recurring.
What went wrong? The “license expired” error timeline:
- We deployed a configuration update to the system that manages the VPN infrastructure. This update included an inaccurate piece of data which was passed to downstream systems.
- Any ExpressVPN app that called our API to refresh its data received invalid information. Some apps interpreted it as a “license expired” state while others behaved in undefined ways.
- One of our automated monitoring systems noticed the problem within 1 minute.
- Some customers encountered a “license expired” message unexpectedly in their apps. Affected customers contacted us via chat and email, and the Support Team realized, within minutes, that an unexpected problem had occurred and alerted the engineering team.
- Thirty minutes after the issue became symptomatic, an engineer found and fixed the root cause.
- We deployed an updated configuration to the affected systems.
- The Support Team explained workaround steps to affected customers: The solution was to log out and log back into the apps.
To understand the root causes and follow-ups, here is a simplified version of the architecture of the affected system:
Why the “license expired” error happened
Cascading failures occurred in:
- The backend system: Downstream systems interpreted the data as well-formed but irrelevant to customers. Though we test our systems automatically, the tests didn’t catch this problem because it was related to environment-specific configuration data, which we had not factored into the tests.
- The API servers: The services processed the invalid data and decided that no infrastructure was available for customers.
- Our apps: When refreshing data, our apps interpreted the empty list as “user’s license has expired.” Unfortunately, this was a poor design decision from years ago when we built a feature for volume discounting.
The causes were a combination of misconfiguration and fragile design in a rarely used feature. Unfortunately, the bug triggered a state reserved for the rarely used volume discounting feature, and as a result it affected a large number of customers.
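The cascade described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `environment` field, the server names, and the function names are assumptions chosen for illustration, and the post does not say what the inaccurate piece of data actually was. The point is only to show how well-formed but irrelevant data can become an empty list, and how an absence-based rule then reports an expired license.

```python
from typing import Dict, List

# Hypothetical configuration: every entry is well-formed, but the
# "environment" label is wrong for production (illustrative only).
CONFIG: List[Dict[str, str]] = [
    {"name": "server-a", "environment": "staging"},
    {"name": "server-b", "environment": "staging"},
]


def backend_servers(config: List[Dict[str, str]]) -> List[Dict[str, str]]:
    # Backend: passes along anything that is structurally valid.
    return [e for e in config if "name" in e and "environment" in e]


def api_servers_for_customers(servers: List[Dict[str, str]]) -> List[str]:
    # API: keeps only entries relevant to customers, so mislabeled data
    # silently becomes an empty list.
    return [s["name"] for s in servers if s["environment"] == "production"]


def app_license_state(server_names: List[str]) -> str:
    # App: absence-based rule that treats "no servers" as "license expired".
    return "license_expired" if not server_names else "active"


state = app_license_state(api_servers_for_customers(backend_servers(CONFIG)))
print(state)  # license_expired, although no license actually expired
```

No single step in this sketch raises an error, which is why a failure of this kind can pass through several layers before it becomes visible to customers.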
Follow-ups we’re taking to prevent such problems from recurring
- We’re updating our apps to:
  - Define the “license expired” state positively. Apps will enter the license expired state only when specific error codes are present, not when data is absent.
  - Improve the definition of good quality data. Apps will ignore incomplete data and try again later.
- Adding integration tests that include the configuration data used in production. These tests must pass before new versions of software or configuration data are put into production.
- Changing our workflow for managing configuration data. One reason for the invalid configuration was that the configuration data is encrypted, which makes it more difficult for developers to inspect. ExpressVPN uses a system called Ansible to manage and encrypt configuration. A separate blog post will describe our new practices for managing encrypted configuration data.
- Ensure all states are defined positively.
- Ensure integration tests also include configuration data for the production environment.
- Test plans for automation and monitoring. In addition to testing the functional accuracy of code, we’ll also check the quality of data.
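As a rough sketch of what these follow-ups could look like in code, the Python below defines the expired state positively, ignores incomplete data, and adds a data check that could run against production configuration before a deploy. The error code `LICENSE_EXPIRED`, the `RefreshResponse` shape, the `environment` field, and all function names are assumptions made for illustration; they are not the actual app or API design.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical error code; the real codes are not named in this post.
LICENSE_EXPIRED_CODES = {"LICENSE_EXPIRED"}


@dataclass
class RefreshResponse:
    """Assumed shape of the data an app receives on refresh."""
    servers: List[str] = field(default_factory=list)
    error_code: Optional[str] = None


def is_complete(response: RefreshResponse) -> bool:
    # "Good quality" here means at least one server, each with a non-empty name.
    return bool(response.servers) and all(s.strip() for s in response.servers)


def next_state(current_state: str, response: RefreshResponse) -> str:
    # Positive definition: only an explicit error code expires the license.
    if response.error_code in LICENSE_EXPIRED_CODES:
        return "license_expired"
    # Incomplete data is ignored; the app keeps its state and retries later.
    if not is_complete(response):
        return current_state
    return "active"


def production_config_problems(config: List[Dict[str, str]]) -> List[str]:
    # Data-quality check intended to run on the real production
    # configuration, not on test fixtures, before anything is deployed.
    usable = [
        e for e in config
        if e.get("environment") == "production" and e.get("name")
    ]
    return [] if usable else ["no servers would be offered to customers"]
```

With rules like these, an empty refresh response leaves an active app in the active state, and a configuration that would offer no servers to customers is reported before it reaches production.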
ExpressVPN would like to apologize to customers affected by the expired license problem. We’re eager to learn from these mistakes, and we’re proud of our Support Team for noticing and responding to this issue very quickly.