Relative performance tradeoffs of AWS-native provisioning methods

Feb 23, 2023

Engineering

There are many different ways to provision AWS services, and we use several of them to address different use cases at Stedi. We set out to benchmark the performance of each option – direct APIs, Cloud Control, CloudFormation, and Service Catalog.

When compared to direct service APIs, we found that:

  • Cloud Control introduced an additional ~5 seconds of deployment latency

  • CloudFormation introduced an additional ~13 seconds of deployment latency

  • Service Catalog introduced an additional ~33 seconds of deployment latency

This additional latency can make day-to-day operations quite painful.

How we provision resources at Stedi

Each AWS service exposes its own APIs for creating, reading, updating, and deleting (CRUD) resources, but because AWS services are built by many different teams, the ergonomics of these APIs vary greatly. For example, you would call the Lambda CreateFunction API to create a function, but the EC2 RunInstances API to create an EC2 instance.

To make it easier for developers to work with these disparate APIs in a uniform fashion, AWS launched the Cloud Control API, which exposes five normalized verbs (CreateResource, GetResource, UpdateResource, DeleteResource, ListResources) to manage the lifecycle of resources across many services. Cloud Control provides a single, consistent way of working with many different AWS services.
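
As an illustration, here is a minimal sketch of that normalized interface creating an SQS queue with the AWS SDK for JavaScript v3 – the region, queue name, and one-second polling interval are placeholder choices for the example, not part of our tooling.

```typescript
import {
  CloudControlClient,
  CreateResourceCommand,
  GetResourceRequestStatusCommand,
} from "@aws-sdk/client-cloudcontrol";

const client = new CloudControlClient({ region: "us-east-1" });

// Create an SQS queue through the normalized CreateResource verb.
const { ProgressEvent } = await client.send(
  new CreateResourceCommand({
    TypeName: "AWS::SQS::Queue",
    DesiredState: JSON.stringify({ QueueName: "example-queue" }),
  })
);

// Cloud Control operations are asynchronous: poll the request token until the
// operation leaves the pending/in-progress states.
let status = ProgressEvent?.OperationStatus;
while (status === "PENDING" || status === "IN_PROGRESS") {
  await new Promise((resolve) => setTimeout(resolve, 1_000));
  const { ProgressEvent: latest } = await client.send(
    new GetResourceRequestStatusCommand({ RequestToken: ProgressEvent!.RequestToken! })
  );
  status = latest?.OperationStatus;
}
```

Swapping the TypeName is all it takes to manage a Lambda function or a DynamoDB table in the same way, which is the normalization the direct service APIs lack.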

That said, we rarely use the ‘native’ service APIs or Cloud Control APIs directly. Instead, we typically define resources using CDK, which synthesizes AWS CloudFormation templates that are then deployed by the CloudFormation service.
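
As a sketch of that flow, the hypothetical CDK app below defines a single SQS queue; `cdk synth` turns it into a CloudFormation template, which `cdk deploy` then hands to the CloudFormation service. The stack and construct names are placeholders.

```typescript
import { App, Stack, Duration } from "aws-cdk-lib";
import { Queue } from "aws-cdk-lib/aws-sqs";

// Hypothetical stack: one queue with an explicit visibility timeout.
const app = new App();
const stack = new Stack(app, "ExampleStack");

new Queue(stack, "ExampleQueue", {
  visibilityTimeout: Duration.seconds(60),
});

// Emit the CloudFormation template (what `cdk synth` does under the hood).
app.synth();
```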

Over the past year, we’ve also begun to use AWS Service Catalog for certain use cases. Service Catalog allows us to define a set of CloudFormation templates in a single AWS account, which are then shared with many other AWS accounts for deployment on-demand. Service Catalog handles complexity such as versioning and governance, and we’ve been thrilled with the higher-order functionality it provides.
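
For a sense of what the consuming side looks like, here is a rough sketch of launching a shared product with the AWS SDK for JavaScript v3 – the product, version, and parameter names are invented for the example.

```typescript
import {
  ServiceCatalogClient,
  ProvisionProductCommand,
} from "@aws-sdk/client-service-catalog";

const serviceCatalog = new ServiceCatalogClient({ region: "us-east-1" });

// Launch a hypothetical "standard-queue" product that was shared with this
// account, pinning the template version and passing a parameter.
await serviceCatalog.send(
  new ProvisionProductCommand({
    ProductName: "standard-queue",
    ProvisioningArtifactName: "v3", // the published template version to deploy
    ProvisionedProductName: "orders-queue",
    ProvisioningParameters: [{ Key: "VisibilityTimeout", Value: "60" }],
  })
);
```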

Expectations

We expect to pay a performance penalty as we move ‘up the stack’ of value delivery – it would be unreasonable to expect a value-add layer to offer identical performance to the underlying abstractions. Cloud Control offers added value (in the form of normalization) over direct APIs; CloudFormation offers added value over direct APIs or Cloud Control (in the form of state management and dependency resolution); Service Catalog offers added value over CloudFormation (in the form of versioning, governance, and more).

Any performance hit can be broken into two categories: essential latency and incidental latency. Essential latency is the latency required to deliver the functionality, and incidental latency is the latency introduced as a result of a chosen implementation. The theoretical minimum performance hit, then, is equal to the essential latency, and the actual performance hit is equal to the essential latency plus the incidental latency.

It requires substantial investment to achieve something approaching essential latency, and such an investment isn’t sensible in anything but the most latency-sensitive use cases. But as an AWS customer, it’s reasonable to expect that the actual latency of AWS’s various layers of abstraction is within some margin that is difficult to perceive in the normal course of development work – in other words, we expect the incidental latency to be largely unnoticeable.

Reality

To test the relative performance of each provisioning method, we ran a series of performance benchmarks for managing Lambda Functions and SQS Queues. Here is a summary of the P50 (median) results:

  • Cloud Control was 744% (~5 seconds) and 1,259% (500 ms) slower than Lambda and SQS direct APIs, respectively.

  • CloudFormation was 1,736% (~13 seconds) and 21,076% (8 seconds) slower than Lambda and SQS direct APIs, respectively.

  • Service Catalog was 4,339% and 86,771% (~33 seconds, in both cases) slower than Lambda and SQS direct APIs, respectively.

The full results are below.

We experimented with Service Catalog to determine what causes its staggeringly poor performance. According to CloudTrail logs, Service Catalog triggers the underlying CloudFormation stack create/update/delete, sleeps for 30 seconds, and then polls every 30 seconds until the stack operation finishes. In practice, this means Service Catalog can never take less than 30 seconds to complete an operation, and if the CloudFormation stack isn’t finished within the first 30 seconds, Service Catalog can’t finish in under a minute.
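
Expressed as pseudocode, our reading of those CloudTrail logs looks roughly like the following – this is an inference about the observed behavior, not Service Catalog’s actual implementation.

```typescript
// Inferred behavior only: kick off the CloudFormation operation, then check on
// it at a fixed 30-second cadence.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function provisionViaServiceCatalog(
  startStackOperation: () => Promise<void>,
  stackOperationIsDone: () => Promise<boolean>
): Promise<void> {
  await startStackOperation(); // the underlying create/update/delete
  do {
    await sleep(30_000); // floor of 30 seconds per operation...
  } while (!(await stackOperationIsDone())); // ...and a miss costs another 30 seconds
}
```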

Conclusion

Our hope is that AWS tracks provisioning latency for each of these options internally and takes steps toward improving them – ideally, each provisioning method would introduce only the minimum latency overhead necessary to provide its corresponding functionality.

Full results

Lambda

All timings are in milliseconds; the delta columns show the percentage slowdown relative to the direct Lambda API.

| Service         | P10    | P50    | P90    | P99    | Delta P10 | Delta P50 | Delta P90 | Delta P99 |
|-----------------|--------|--------|--------|--------|-----------|-----------|-----------|-----------|
| Lambda          | 464    | 744    | 2,301  | 5,310  |           |           |           |           |
| Cloud Control   | 6,098  | 6,278  | 7,206  | 12,971 | 1,214%    | 744%      | 213%      | 144%      |
| CloudFormation  | 13,054 | 13,654 | 14,591 | 15,906 | 2,713%    | 1,736%    | 534%      | 200%      |
| Service Catalog | 32,797 | 33,013 | 33,389 | 34,049 | 6,967%    | 4,339%    | 1,351%    | 541%      |

Methodology:

  • Change an existing function's code via the different provisioning methods, which involves first calling UpdateFunctionCode and then polling GetFunction (see the sketch after this list).

  • In the case of CloudFormation and Service Catalog, the new code value was passed in as a parameter rather than by changing the template.

  • The "Wait" timings represent how long it took the resource to stabilize. This was determined by polling the applicable service operation every 50 milliseconds.

SQS

All timings are in milliseconds; the delta columns show the percentage slowdown relative to the direct SQS API.

| Service         | P10    | P50    | P90    | P99    | Delta P10 | Delta P50 | Delta P90 | Delta P99 |
|-----------------|--------|--------|--------|--------|-----------|-----------|-----------|-----------|
| SQS             | 34     | 38     | 45     | 51     |           |           |           |           |
| Cloud Control   | 444    | 516    | 669    | 1,023  | 1,205%    | 1,259%    | 1,382%    | 1,904%    |
| CloudFormation  | 7,417  | 8,047  | 8,766  | 11,398 | 21,714%   | 21,076%   | 19,337%   | 22,239%   |
| Service Catalog | 32,785 | 33,011 | 33,320 | 33,659 | 96,327%   | 86,771%   | 73,780%   | 65,873%   |

Methodology:

  • Change an existing queue's visibility timeout attribute via the different provisioning methods, which involves calling SetQueueAttributes.

  • In the case of CloudFormation and Service Catalog, the new visibility timeout value was passed in as a parameter rather than by changing the template (see the sketch after this list).

  • The "Wait" timings represent how long it took the resource to stabilize. This was determined by polling the applicable service operation every 50 milliseconds.
