Uncovering The Hidden Costs Behind SageMaker's Pricing

Explore the hidden costs of AWS SageMaker compared to Qwak's MLOps platform. Understand the financial and operational implications for your organization.
Yuval Fernbach
Co-founder & CTO at Qwak
May 16, 2023

AWS SageMaker is a suite of machine learning tools spanning the various stages of the machine learning lifecycle. In this post, we'll break down the individual products it bundles and their cost structure, then look at the complexity this breadth introduces and how SageMaker's pricing can easily skyrocket as a result.

Understanding the Scope of SageMaker's Pricing

In the following use case, we'll build a machine learning model that operates in real time. The model will be retrained periodically and deployed as a real-time endpoint. To protect users from a bad release, the first iteration of each new version will handle just 20% of incoming traffic; as we monitor its performance, we will gradually shift the remaining traffic over. The training data will be sourced from our analytical database.

Developing this solution in SageMaker involves a variety of tools and a significant investment of resources from ML teams.

Which AWS SageMaker products are required to build a full-blown solution?

Here is a brief outline of the services we will need to set up:

  1. Amazon EMR will connect to our analytical database and write features to Amazon S3.
  2. Amazon SageMaker Feature Store will read the data from S3 and save it to both the offline and online feature stores.
  3. Amazon Managed Workflows for Apache Airflow (MWAA) will run the process above as a daily workflow.
  4. Amazon SageMaker Training will be used to train the model.
  5. Amazon SageMaker Pipelines will let us connect to the Feature Store, create training data, and run SageMaker training jobs (a minimal sketch follows this list).
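
As a rough illustration of step 5, here is a minimal sketch of a SageMaker Pipelines definition with a single training step, using the `sagemaker` Python SDK. The IAM role ARN, S3 paths, and the choice of the built-in XGBoost image are placeholders, and a real pipeline would add processing and feature-store steps around it:

```python
# Minimal sketch: a SageMaker Pipeline with one training step.
# The role ARN and S3 paths below are placeholders.
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

# Train with the built-in XGBoost container (any estimator works here).
estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve(
        "xgboost", session.boto_region_name, version="1.5-1"
    ),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",  # placeholder
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://my-bucket/features/train/")},
)

pipeline = Pipeline(name="daily-training-pipeline", steps=[train_step])
pipeline.upsert(role_arn=role)  # create or update the pipeline definition
pipeline.start()                # kick off one execution
```

The daily MWAA workflow from step 3 would then trigger `pipeline.start()` (or the equivalent `StartPipelineExecution` API call) once the feature data lands in S3.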

The process above will allow us to automate our data processing and model training.

Some additional challenges are to be expected when delivering code to production:

  1. Create immutable objects: run a build system that installs requirements, runs tests, and validates the output artifacts.
  2. Model versioning and management: track model versions, code changes, and the data each model was trained on.
  3. Generate model metrics and efficiently assess model performance.
  4. Prepare deployment infrastructure based on the model's infrastructure metrics and business needs.
  5. Error handling and monitoring: monitor for data changes and model errors.

The system described above resembles a conventional CI/CD platform, with the added twist of data and model artifacts. It can be built on AWS with the following steps:

  1. Use a CI system such as Jenkins, GitHub Actions, or CircleCI to build a container from the trained model.
  2. Use SageMaker Pipelines to create and deploy a SageMaker endpoint (see the first sketch after this list).
  3. Save all model input and output data with SageMaker Data Capture.
  4. Create a model quality baseline using a SageMaker Model Monitor baselining job.
  5. Set up a scheduled model quality monitoring job using SageMaker Model Monitor (see the second sketch after this list).
  6. Run infrastructure monitoring directly in AWS CloudWatch, create alerts, and monitor your deployed model.
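
To make steps 2 and 3 concrete, here is a hedged sketch of an endpoint configuration that splits traffic 80/20 between the current and candidate model variants, matching the gradual rollout described earlier, with Data Capture enabled. The model names and S3 destination are placeholders:

```python
# Sketch: endpoint config with an 80/20 traffic split and Data Capture.
# Model names and the S3 destination are placeholders.
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="realtime-model-config-v2",
    ProductionVariants=[
        {
            "VariantName": "current",
            "ModelName": "my-model-v1",     # placeholder
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 2,
            "InitialVariantWeight": 0.8,    # 80% of traffic
        },
        {
            "VariantName": "candidate",
            "ModelName": "my-model-v2",     # placeholder
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.2,    # 20% of traffic, as in our use case
        },
    ],
    DataCaptureConfig={
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,
        "DestinationS3Uri": "s3://my-bucket/data-capture/",  # placeholder
        "CaptureOptions": [{"CaptureMode": "Input"}, {"CaptureMode": "Output"}],
    },
)

sm.create_endpoint(
    EndpointName="realtime-model",
    EndpointConfigName="realtime-model-config-v2",
)
```

As the candidate proves itself, traffic can be shifted gradually with `update_endpoint_weights_and_capacities` until it serves 100% of requests.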
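
Steps 4 and 5 can be sketched with the SDK's `ModelQualityMonitor`. The dataset paths, attribute names, and problem type below are assumptions for illustration:

```python
# Sketch: baseline + scheduled model quality monitoring.
# S3 paths, attribute names, and problem type are placeholders.
from sagemaker.model_monitor import (
    CronExpressionGenerator,
    EndpointInput,
    ModelQualityMonitor,
)
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

monitor = ModelQualityMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Step 4: baseline job over a validation set that includes predictions.
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/validation-with-predictions.csv",
    dataset_format=DatasetFormat.csv(header=True),
    problem_type="BinaryClassification",
    inference_attribute="prediction",
    ground_truth_attribute="label",
    output_s3_uri="s3://my-bucket/model-quality-baseline/",
    wait=True,
)

# Step 5: scheduled job comparing captured traffic against ground truth.
monitor.create_monitoring_schedule(
    monitor_schedule_name="model-quality-daily",
    endpoint_input=EndpointInput(
        endpoint_name="realtime-model",
        destination="/opt/ml/processing/input",
        inference_attribute="prediction",
    ),
    ground_truth_input="s3://my-bucket/ground-truth/",
    problem_type="BinaryClassification",
    constraints=monitor.suggested_constraints(),
    output_s3_uri="s3://my-bucket/model-quality-reports/",
    schedule_cron_expression=CronExpressionGenerator.daily(),
)
```

Each of these pieces works, but notice how many moving parts are already involved for a single model, before any CI, alerting, or cost controls.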

Time, resource, and budget considerations for establishing such a pipeline in SageMaker

Creating a platform capable of accommodating one or two models takes roughly a year and a half of dedicated engineering effort, an initial investment of approximately $330,000 before any benefits are realized.

Scaling the platform to support many models and additional use cases requires another three and a half full-time engineers, raising the overall setup cost to roughly $1,000,000.

Even by a conservative estimate, maintaining the platform consumes at least half of the engineering team's capacity: an additional 2.5 full-time engineers, or a yearly maintenance expense of $500,000.

These figures exclude the opportunity cost of waiting for the platform to be built, any licensing charges, and the human resource costs of recruiting, managing, and replacing engineers.

Altogether, the cost of building and continuously operating an ML platform that can support many models is substantial, and it's worth carefully evaluating your requirements before embarking on the effort.

| Building an ML Pipeline in SageMaker | Costs ($) |
| --- | --- |
| Creating an initial platform that can accommodate 1-2 models | $330,000 |
| Scaling the platform to handle additional usage capabilities | $1,000,000 |
| Ongoing maintenance of the platform (yearly) | $500,000 |

How is it done with Qwak?

Qwak is built to make model productionization easy and cost-effective. Data pipelines can be created directly through the UI or CLI by defining a new feature set.

Qwak Automation lets users trigger both model builds and deployments directly from model code in a GitHub repository. Each build produces an immutable, ready-to-deploy model object. Infrastructure and data monitoring are included out of the box, along with user-defined alerts and automated processes. Qwak supports a range of deployment configurations, including A/B testing, variable traffic allocation, and gradual rollouts.
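
For comparison, a model in Qwak is a single Python class. The sketch below follows the shape of Qwak's public examples; treat the exact class, decorator, and CLI names as assumptions if your SDK version differs:

```python
# Sketch of a Qwak model, modeled on Qwak's public examples; the
# QwakModel base class and @qwak.api decorator are assumed from those docs.
import pandas as pd
import qwak
from qwak.model.base import QwakModel
from sklearn.linear_model import LogisticRegression


class ChurnModel(QwakModel):
    def __init__(self):
        self.model = LogisticRegression()

    def build(self):
        # Runs once per build; every build yields an immutable,
        # versioned, ready-to-deploy model object.
        df = pd.read_csv("train.csv")  # placeholder training data
        self.model.fit(df.drop("label", axis=1), df["label"])

    @qwak.api()
    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        # Served behind the real-time endpoint after deployment.
        return pd.DataFrame(self.model.predict(df), columns=["prediction"])
```

Building and deploying then reduce to two CLI calls, `qwak models build` and `qwak models deploy realtime` (again per Qwak's examples), with traffic splitting and gradual rollouts configured as deployment options rather than hand-built pipeline steps.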

Summary 

The Qwak machine learning platform was designed to make it simple for data teams to build and deploy models to production. As SageMaker's pricing shows, an end-to-end platform yields a lower Total Cost of Ownership than solutions stitched together from numerous managed services and tools.

An all-in-one platform also cuts integration work. Building and maintaining integrations between separate tools is expensive, and an integrated platform removes the need for them, freeing up development resources. Built-in automation further reduces manual intervention and, with it, labor costs.

Overall, the reduced complexity and improved productivity of an end-to-end platform contribute to a more cost-effective machine learning solution, making it a preferable choice for many organizations.

Chat with us to see the platform live and discover how we can help simplify your AI/ML journey.

Say goodbye to complex MLOps with Qwak.