How ExpressVNP keeps its web servers patched and secure

ExpressVNP servers rise from the ashes.

This article describes ExpressVNP's approach to security patch management for the infrastructure that runs the ExpressVNP website (not the VPN servers). Overall, our security approach is to:

  1. Make systems very difficult to hack.
  2. Minimize the potential damage by assuming systems will be hacked and acknowledging that some systems cannot be made completely secure. Usually, this starts at the architecture design stage, where we reduce an application's access to the minimum it needs.
  3. Minimize the amount of time a system can remain compromised.
  4. Validate these points with regular internal and external testing.

Security is deeply ingrained in our culture and is the primary concern guiding all of our work. There are many other topics, such as our secure software development practices, application security, and employee processes and training, but those are beyond the scope of this article.

Here, we explain how we achieve the following goals:

  1. Ensure that all servers are fully patched and never more than 24 hours behind CVE publications.
  2. Ensure that no server is in use for more than 24 hours, putting an upper bound on the amount of time an attacker can maintain persistence.

We achieve both goals with an automated system that rebuilds our servers, starting from the operating system and all the latest patches, and destroys them at least every 24 hours.

Our intent with this article is to be useful to other developers facing similar challenges and to offer transparency into ExpressVNP's operations to our customers and the media.

How we use Ansible playbooks and Cloudformation

ExpressVNP's web infrastructure is hosted on AWS (unlike our VPN servers, which run on dedicated hardware), and we make heavy use of its features for the rebuilds.

Our entire web infrastructure is provisioned with Cloudformation, and we try to automate as much of the process as possible. However, we find working with raw Cloudformation templates quite unpleasant because of the repetition they require, their poor overall readability, and the limitations of JSON and YAML syntax.

To mitigate this, we use a DSL called cloudformation-ruby-dsl, which lets us write template definitions in Ruby and export Cloudformation templates as JSON.

In particular, the DSL lets us write user data scripts as regular scripts that are automatically converted to JSON (instead of going through the painful process of turning every line of a script into a valid JSON string).

A generic Ansible role called cloudformation-infrastructure takes care of rendering the actual template to a temporary file, which is then used by the cloudformation Ansible module:

- name: 'render {{ component }} stack cloudformation json'
  shell: 'ruby "{{ template_name | default(component) }}.rb" expand --stack-name {{ stack }} --region {{ aws_region }} > {{ tempfile_path }}'
  args:
    chdir: ../cloudformation/templates
  changed_when: false

- name: 'create/update {{ component }} stack'
  cloudformation:
    stack_name: '{{ stack }}-{{ xv_env_name }}-{{ component }}'
    state: present
    region: '{{ aws_region }}'
    template: '{{ tempfile_path }}'
    template_parameters: '{{ template_parameters | default({}) }}'
    stack_policy: '{{ stack_policy }}'
  register: cf_result

In the playbook, we call the cloudformation-infrastructure role several times with different component variables to create several Cloudformation stacks. For example, we have a network stack that defines the VPC and related resources and an app stack that defines the Auto Scaling group, launch configuration, lifecycle hooks, etc.

We then use a somewhat ugly but useful trick to turn the output of the cloudformation module into Ansible variables for subsequent roles. We have to use this approach since Ansible does not allow the creation of variables with dynamic names:

- include: _tempfile.yml

- copy:
    content: '{{ component | regex_replace("-", "_") }}_stack: {{ cf_result.stack_outputs | to_json }}'
    dest: '{{ tempfile_path }}.json'
  no_log: true
  changed_when: false

- include_vars: '{{ tempfile_path }}.json'

Updating the EC2 Auto Scaling group

The ExpressVNP website is hosted on multiple EC2 instances in an Auto Scaling group behind an Application Load Balancer, which enables us to destroy servers without any downtime since the load balancer drains existing connections before an instance terminates.

Cloudformation orchestrates the entire rebuild, and we trigger the Ansible playbook described above every 24 hours to rebuild all instances, making use of the AutoScalingRollingUpdate UpdatePolicy attribute of the AWS::AutoScaling::AutoScalingGroup resource.

When simply triggered repeatedly without any changes, the UpdatePolicy attribute is not used—it is only invoked under special circumstances as described in the documentation. One of those circumstances is an update to the Auto Scaling launch configuration—a template that an Auto Scaling group uses to launch EC2 instances—which includes the EC2 user-data script that runs on the creation of a new instance:

resource 'AppLaunchConfiguration', Type: 'AWS::AutoScaling::LaunchConfiguration',
  Properties: {
    KeyName: param('AppServerKey'),
    ImageId: param('AppServerAMI'),
    InstanceType: param('AppServerInstanceType'),
    SecurityGroups: [
      param('SecurityGroupApp'),
    ],
    IamInstanceProfile: param('RebuildIamInstanceProfile'),
    InstanceMonitoring: true,
    BlockDeviceMappings: [
      {
        DeviceName: '/dev/sda1', # root volume
        Ebs: {
          VolumeSize: param('AppServerStorageSize'),
          VolumeType: param('AppServerStorageType'),
          DeleteOnTermination: true,
        },
      },
    ],
    UserData: base64(interpolate(file('scripts/app_user_data.sh'))),
  }

If we make any update to the user data script, even a comment, the launch configuration will be considered changed, and Cloudformation will update all instances in the Auto Scaling group to comply with the new launch configuration.

Thanks to cloudformation-ruby-dsl and its interpolate utility function, we can use Cloudformation references in the app_user_data.sh script:

readonly rebuild_timestamp="{{ param('RebuildTimestamp') }}"

This procedure ensures our launch configuration is new every time the rebuild is triggered.
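For illustration, a rebuild trigger might pass a fresh timestamp as a template parameter on every run. This is a hedged sketch: the playbook name and variable below are hypothetical, not our exact invocation.

# Pass a fresh timestamp so the rendered launch configuration always differs
# from the previous one (playbook and variable names are hypothetical).
ansible-playbook rebuild.yml -e "rebuild_timestamp=$(date +%s)"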

Lifecycle hooks

We use Auto Scaling lifecycle hooks to make sure our instances are fully provisioned and pass the required health checks before they go live.

Using lifecycle hooks allows us to have the same instance lifecycle both when we trigger the update with Cloudformation and when an auto-scaling event occurs (for example, when an instance fails an EC2 health check and gets terminated). We don’t use cfn-signal and the WaitOnResourceSignals auto-scaling update policy because they are only applied when Cloudformation triggers an update.

When an auto-scaling group creates a new instance, the EC2_INSTANCE_LAUNCHING lifecycle hook is triggered, and it automatically puts the instance in a Pending:Wait state.
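In our setup the hook is declared in the Cloudformation app stack; for illustration, the equivalent AWS CLI call might look like the following sketch, where the hook name, group name, and timeout are assumptions:

# Register a launch lifecycle hook so new instances wait in Pending:Wait
# until provisioning completes (names and timeout are illustrative).
aws autoscaling put-lifecycle-hook \
  --lifecycle-hook-name app-launch-hook \
  --auto-scaling-group-name app-asg \
  --lifecycle-transition autoscaling:EC2_INSTANCE_LAUNCHING \
  --heartbeat-timeout 900 \
  --default-result ABANDON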

After the instance is fully configured, it starts hitting its own health check endpoints with curl from the user data script. Once the health checks report the application to be healthy, we issue a CONTINUE action for this lifecycle hook, so the instance gets attached to the load balancer and starts serving traffic.

If the health checks fail, we issue an ABANDON action which terminates the faulty instance, and the auto scaling group launches another one.
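A minimal sketch of this gate, assuming a local /health endpoint; the hook and group names are illustrative, not our exact script:

#!/usr/bin/env bash
# Poll the instance's own health endpoint, then complete the lifecycle hook
# with CONTINUE (go live) or ABANDON (terminate and replace).
instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

result=ABANDON
for _ in $(seq 1 30); do
  if curl -sf http://localhost/health > /dev/null; then
    result=CONTINUE
    break
  fi
  sleep 10
done

aws autoscaling complete-lifecycle-action \
  --lifecycle-hook-name app-launch-hook \
  --auto-scaling-group-name app-asg \
  --lifecycle-action-result "$result" \
  --instance-id "$instance_id"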

Besides failing to pass health checks, our user data script may fail at other points—for example, if temporary connectivity issues prevent software installation.

We want the creation of a new instance to fail as soon as we realize that it will never become healthy. To achieve that, we set an ERR trap in the user data script together with set -o errtrace to call a function that sends an ABANDON lifecycle action so a faulty instance can terminate as soon as possible.
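A minimal sketch of that trap, again with the hook and group names as assumptions:

#!/usr/bin/env bash
set -o errexit -o errtrace  # errtrace makes the ERR trap fire inside functions too

# On any provisioning error, abandon the lifecycle action so the instance
# is terminated immediately instead of idling in Pending:Wait.
abandon_instance() {
  aws autoscaling complete-lifecycle-action \
    --lifecycle-hook-name app-launch-hook \
    --auto-scaling-group-name app-asg \
    --lifecycle-action-result ABANDON \
    --instance-id "$(curl -s http://169.254.169.254/latest/meta-data/instance-id)"
  exit 1
}
trap abandon_instance ERR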

User data scripts

The user data script is responsible for installing all the required software on the instance. We have successfully used Ansible to provision instances and Capistrano to deploy applications for a long time, so we use them here too, keeping the differences between regular deploys and rebuilds to a minimum.

The user data script checks out our application repository from Github, which includes Ansible provisioning scripts, then runs Ansible and Capistrano pointed to localhost.

When checking out code, we need to be sure that the currently deployed version of the application is deployed during the rebuild. The Capistrano deployment script includes a task that updates a file in S3 that stores the currently deployed commit SHA. When the rebuild happens, the system picks up the commit that is supposed to be deployed from that file.
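As a hedged sketch of that step (the bucket, key, and repository path below are illustrative), the pinned checkout might look like:

# Read the currently deployed commit SHA from S3 and check it out.
deploy_sha=$(aws s3 cp "s3://example-deploy-state/current_sha" -)
git -C "$repo_dir" fetch origin
git -C "$repo_dir" checkout "$deploy_sha"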

Software updates are applied by running unattended-upgrade in the foreground with the unattended-upgrade -d command. Once complete, the instance reboots and starts the health checks.
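Sketched out, with the ordering as described above:

# Apply all pending updates in the foreground (-d prints verbose output),
# then reboot so the patched kernel and libraries are in use before the
# health checks begin.
unattended-upgrade -d
reboot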

Dealing with secrets

The server needs temporary access to secrets (such as the Ansible vault password), which are fetched from the EC2 Parameter Store. The server can only access these secrets for a short period during the rebuild. After they are fetched, we immediately replace the initial instance profile with a different one that only has access to the resources required for the application to run.
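A minimal sketch of this flow, assuming hypothetical parameter and profile names:

instance_id=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)

# Fetch the vault password from the Parameter Store; it is held in memory only.
vault_password=$(aws ssm get-parameter \
  --name /rebuild/ansible-vault-password \
  --with-decryption \
  --query 'Parameter.Value' --output text)

# Swap the privileged rebuild profile for a minimal runtime profile.
assoc_id=$(aws ec2 describe-iam-instance-profile-associations \
  --filters Name=instance-id,Values="$instance_id" \
  --query 'IamInstanceProfileAssociations[0].AssociationId' --output text)
aws ec2 replace-iam-instance-profile-association \
  --association-id "$assoc_id" \
  --iam-instance-profile Name=app-runtime-profile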

We want to avoid storing any secrets on the instance's persistent storage. The only secret we save to disk is the Github SSH key, but not its passphrase. We don't save the Ansible vault password, either.

However, we need to pass these passphrases to SSH and Ansible respectively, and that is only possible in interactive mode (i.e., the utility prompts the user to type the passphrase manually) for a good reason: if a passphrase is part of a command, it is saved in the shell history and can be visible to all users on the system if they run ps. We use the expect utility to automate the interaction with those tools:

expect << EOF
cd ${repo_dir}
spawn make ansible_local env=${deploy_env} stack=${stack} hostname=${server_hostname}
set timeout 2
expect "Vault password"
send "${vault_password}\r"
set timeout 900
expect {
  "unreachable=0 failed=0" {
    exit 0
  }
  eof {
    exit 1
  }
  timeout {
    exit 1
  }
}
EOF

Triggering the rebuild

Since we trigger the rebuild by running the same Cloudformation script that is used to create/update our infrastructure, we need to make sure that we don’t accidentally update some part of the infrastructure that is not supposed to be updated during the rebuild.

We achieve this by setting a restrictive stack policy on our Cloudformation stacks so only the resources necessary for the rebuild are updated:

{
  "Statement" : [
    {
      "Effect" : "Allow",
      "Action" : "Update:Modify",
      "Principal": "*",
      "Resource" : [
        "LogicalResourceId/*AutoScalingGroup"
      ]
    },
    {
      "Effect" : "Allow",
      "Action" : "Update:Replace",
      "Principal": "*",
      "Resource" : [
        "LogicalResourceId/*LaunchConfiguration"
      ]
    }
  ]
}

When we need to do actual infrastructure updates, we have to manually update the stack policy to allow updates to those resources explicitly.
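For example (the stack and file names here are illustrative), the policy swap can be done with the AWS CLI:

# Temporarily allow full updates, run the infrastructure change, then restore
# the restrictive rebuild policy the same way.
aws cloudformation set-stack-policy \
  --stack-name example-app-stack \
  --stack-policy-body file://stack-policy-allow-all.json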

Because our server hostnames and IPs change every day, we have a script that updates our local Ansible inventories and SSH configs. It discovers the instances via the AWS API by tags, renders the inventory and config files from ERB templates, and adds the new IPs to SSH known_hosts.
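A hedged sketch of the discovery step (tag names and output handling are illustrative):

# Find the public IPs of today's app instances and record their host keys,
# so Ansible and SSH can connect without interactive prompts.
ips=$(aws ec2 describe-instances \
  --filters 'Name=tag:Component,Values=app' 'Name=instance-state-name,Values=running' \
  --query 'Reservations[].Instances[].PublicIpAddress' --output text)

for ip in $ips; do
  ssh-keyscan -H "$ip" >> ~/.ssh/known_hosts
done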

ExpressVNP follows the highest security standards

Rebuilding servers protects us from a specific threat: attackers gaining access to our servers via a kernel/software vulnerability.

However, this is only one of the many measures we take to keep our infrastructure secure; others include undergoing regular security audits and making critical systems inaccessible from the internet.

Additionally, we make sure that all of our code and internal processes follow the highest security standards.