Breaking Down the Cost of Large Language Models

Explore the factors influencing the cost of Large Language Models (LLMs) and how to optimize expenses. Dive in for a detailed breakdown.

Hudson Buzby

Solutions Architect at JFrog ML

June 25, 2024

By now, most organizations have had ample opportunity to test and experiment with Large Language Models. Many teams have even had success in identifying use cases and incorporating LLMs into their existing applications and workflows, seeing tremendous results in chatbots, data parsing tasks, document summaries, and data analysis. However, now that LLMs are being integrated into production workflows, it’s time to come down to earth a bit and take a serious look at the generative AI development costs incurred by your applications. While LLMs are incredibly easy to get started with and costs can seem insignificant at first, as tokens and requests build up, the costs of Large Language Models can accumulate and require careful consideration.

In this article, we’ll discuss how large language models are priced, how you can estimate the cost of your workflow, when you should consider a managed LLM provider like OpenAI or Anthropic versus an open-source solution, how to choose the right size of LLM, and the additional costs associated with RAG pipelines or more advanced LLM workflows.

How Are LLMs Priced?

To determine large language model costs, we’ll need to distinguish between two main classes of LLMs - managed and open-source. For managed LLMs, the core unit of both processing and pricing is the token. For open-source LLMs, there are no management fees or per-token charges to pay, but there are underlying resource requirements and infrastructure costs that will be incurred. Let’s take a deeper look at each of these.

Managed LLM Providers

The first term we’ll need to define is the token. The token is the basic unit of text that large language models read and generate, and it is also the unit on which managed providers bill. Tokenization is the process of converting a string of text, like the input of the prompt, into a sequence of tokens, each represented by a numerical ID. Tokenization allows the input text to be converted, interpreted, and processed as numerical values by the language model.

One important thing to note - token counts are not consistent across different managed large language models, because each model family uses its own tokenizer, and within the same model they vary from one prompt to the next. Tokens can be as short as one character or as long as one word. They are the basic elements that the model uses to understand and generate language. Counts for the same text may be approximately close across models, but ultimately, you should treat any estimate of your input tokens as an approximation or an average.
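Since token counts are only ever an approximation, a quick estimate is often enough for early budgeting. The sketch below is a rough heuristic, not a real tokenizer: it assumes roughly 4 characters per token, a common rule of thumb for English text, and the function names are hypothetical. For exact counts you would use the tokenizer that belongs to your chosen model.

```python
import math

# Assumption: ~4 characters per token is a rule of thumb for English text.
# Real tokenizers differ by model and by content (code, other languages, etc.).
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Return a rough token estimate for a piece of text."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


def average_tokens(samples: list[str]) -> float:
    """Average the rough estimate over a set of sample prompts."""
    if not samples:
        return 0.0
    return sum(estimate_tokens(s) for s in samples) / len(samples)


if __name__ == "__main__":
    # Hypothetical sample inputs, used only to show the call pattern.
    reviews = [
        "Great product, arrived on time.",
        "The battery died after two days and support never replied.",
    ]
    print(average_tokens(reviews))
```

Averaging over a representative sample of real prompts gives a more useful planning number than measuring any single prompt.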

Managed LLM providers like OpenAI or Anthropic price their models differently depending on the number of parameters, the size of the input context window, and the complexity or capabilities of the model. They also tend to quote prices in dollars per 1M tokens, with separate rates for the tokens consumed on the input and the tokens returned on the output of the model.

Here is the pricing, at the time of writing, for a few of OpenAI’s most recent models:

Model	Input	Output	Context Window
gpt-4o	$5.00 / 1M tokens	$15.00 / 1M tokens	128K
gpt-4-turbo	$10.00 / 1M tokens	$30.00 / 1M tokens	128K
gpt-4	$30.00 / 1M tokens	$60.00 / 1M tokens	32K
gpt-3.5-turbo-0125	$0.50 / 1M tokens	$1.50 / 1M tokens	16K

Be sure to stay up to date on OpenAI’s pricing. It frequently changes and updates as new models are being released and optimizations are being made.

Here is the breakdown for Anthropic’s series of Claude models:

Model	Input	Output	Context Window
Claude - Haiku	$0.25 / 1M tokens	$1.25 / 1M tokens	200K
Claude - Sonnet	$3.00 / 1M tokens	$15.00 / 1M tokens	200K
Claude - Opus	$15.00 / 1M tokens	$75.00 / 1M tokens	200K

For current pricing details, check Anthropic’s pricing page.

As you can see above, the cost per large language model across both platforms varies drastically based on the recency of the model, as well as the size of the input context window. We’ll discuss below how you should think about choosing the right model for your specific use case.
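To make the per-token pricing concrete, here is a minimal sketch that turns the rates in the tables above into a cost per request. The dictionary keys are illustrative labels rather than official API model identifiers, and the prices are the mid-2024 figures quoted in this article, so they should be refreshed before any real use.

```python
# USD per 1M tokens as (input, output), copied from the tables above.
# Keys are illustrative labels, not necessarily official API model names.
PRICES_PER_1M = {
    "gpt-4o": (5.00, 15.00),
    "gpt-4-turbo": (10.00, 30.00),
    "gpt-4": (30.00, 60.00),
    "gpt-3.5-turbo-0125": (0.50, 1.50),
    "claude-haiku": (0.25, 1.25),
    "claude-sonnet": (3.00, 15.00),
    "claude-opus": (15.00, 75.00),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request for the given token counts."""
    input_price, output_price = PRICES_PER_1M[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Compare the same hypothetical request (1,000 in, 200 out) across models.
    for name in PRICES_PER_1M:
        print(f"{name:20s} ${request_cost(name, 1_000, 200):.6f}")
```

Running the same request shape through every model is a quick way to see how wide the price spread is before accuracy or latency enter the picture.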

Open-Source LLMs

With open-source LLMs like Llama2 or Llama3 from Meta, Mistral, or BERT, there is no per-query token rate, as you are simply pulling the model weights from a repository and deploying the model on infrastructure of your choosing. However, most open-source large language models, especially those sitting at the top of the performance list on Hugging Face’s leaderboard, tend to be very large and have significant resource requirements for deployment.

Like the menu of models from OpenAI and Anthropic, open-source developers also provide different sizes of their models, measured in millions or billions of parameters. Accuracy generally improves as the number of parameters scales, but so do latency and the resources required to serve the model.
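As a rough illustration of how parameter count drives resource needs, the memory required just to hold a model's weights is approximately the number of parameters multiplied by the bytes used per parameter. The sketch below assumes 2 bytes per parameter (16-bit weights) by default and deliberately ignores activations, the KV cache, and framework overhead, so real deployments need headroom on top of this figure. The function name is hypothetical.

```python
def weights_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory (in GB) needed to hold the model weights alone.

    bytes_per_param: 4 for 32-bit floats, 2 for 16-bit floats, 1 for 8-bit
    quantized weights. Activations and KV cache are NOT included.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for name, params in [("8B model", 8e9), ("70B model", 70e9)]:
        print(f"{name}: ~{weights_memory_gb(params):.0f} GB of weights at 16-bit")
```

This is why a 70B-parameter model cannot fit on a single commodity GPU without quantization, while an 8B model usually can.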

Let’s take a look at some of the most popular open-source models, their size of parameters, and the resources required for deployment.

Model	Size	Memory	GPUs
Mistral-7b-v0.3	7B parameters	24GB	1
Mixtral-8x7B-Instruct-v0.1	56B parameters	64GB	3
Llama3-8b	8B parameters	20GB	1
Llama3-70B	70B parameters	160GB	8
BERT	110M parameters	8GB	0 or 1

As you can see above, the resource requirements to run these models vary drastically and can get quite expensive. The Llama3-70B model comes close to maxing out the resources available on a single AWS instance. Let’s take a look at how resource requirements translate to cloud compute costs (AWS).

Even with spot instances, some of these larger models can cost thousands of dollars per month for a single-instance deployment. However, smaller models like BERT, which have far fewer parameters but can still perform very well, can be deployed for as little as $378.58/month. We also need to keep in mind that these deployments are for a single node. If we need to scale the service horizontally to handle more traffic volume, we will need to add replicas, and the large language model costs will scale linearly with them.
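The linear relationship between replicas and cost can be sketched in a few lines. This is a minimal illustration that assumes an always-on deployment and a 30-day month; the hourly rate in the usage example is a hypothetical placeholder, not a quoted AWS price.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: a 30-day billing month


def monthly_instance_cost(hourly_rate: float, replicas: int = 1,
                          days: int = DAYS_PER_MONTH) -> float:
    """Monthly USD cost of running `replicas` always-on instances."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return hourly_rate * HOURS_PER_DAY * days * replicas


if __name__ == "__main__":
    hypothetical_rate = 1.00  # placeholder USD/hour, for illustration only
    for n in (1, 2, 4):
        print(f"{n} replica(s): ${monthly_instance_cost(hypothetical_rate, n):,.2f}/month")
```

Because the cost is a straight multiple of the replica count, autoscaling policies that remove idle replicas translate directly into savings.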

Determining Large Language Model Cost

Estimating Costs for OpenAI GPT-4-turbo

So now that we are aware of the costs of our different LLM deployment options, let’s try to figure out which one we should choose. First, we’ll need to get a sense of the size of the input and output of our request and how frequently we’ll be sending it to the model. Let’s use a sentiment analysis use case as an example:

This example highlights some of the complexities around prompt estimation. While our prompt template (the section before the quote) will be relatively consistent with 44 tokens, the remaining prompt text will be free form, coming from a customer or user review. We’ll need to take an average from a selection of sample reviews. Let’s say the average token count, including the initial prompt template, comes to 150 tokens.

Now let’s imagine that we are getting a lot of reviews and our service is running at a pretty significant volume - say 30 requests per minute. If we take GPT-4-Turbo as our example managed LLM, the math would look something like this:

30 requests/minute x 150 tokens/request x 60 minutes = 270,000 tokens/hour

270,000 tokens/hour x 24 hours x $10 / 1M tokens = $64.80/day input token costs

With our input prompts estimated, we’ll need to get a sense of our output response. Again, this can vary in token size to some degree, but should be more consistent than the input prompt as we are asking for a clear, direct answer from the LLM. We see the response below.

Great, so we are left with an output of 51 tokens. Now obviously we can make the instructions in the prompt more specific so that the output includes only the positive or negative sentiment judged by the LLM, but this will also add tokens to the input, so let’s just estimate that the average output size is 45 tokens for now.

30 requests/minute x 45 tokens/request x 60 minutes = 81,000 tokens/hour

81,000 tokens/hour x 24 hours x $30 / 1M tokens = $58.32/day output token costs

For a total cost of:

$64.80 + $58.32 = $123.12/day or $3,693.60/month (30 days)
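The arithmetic above can be reproduced with a short script, which makes it easy to plug in your own numbers. This is a minimal sketch using the figures from this example (30 requests per minute, 150 input tokens, 45 output tokens, GPT-4-turbo rates); the function name and the 30-day month are assumptions.

```python
MINUTES_PER_DAY = 60 * 24
DAYS_PER_MONTH = 30  # assumption: a 30-day billing month


def daily_token_cost(requests_per_minute: float, tokens_per_request: float,
                     price_per_1m_tokens: float) -> float:
    """USD per day for one direction (input or output) of token traffic."""
    tokens_per_day = requests_per_minute * tokens_per_request * MINUTES_PER_DAY
    return tokens_per_day * price_per_1m_tokens / 1_000_000


# Figures from the sentiment analysis example in this article.
input_cost = daily_token_cost(30, 150, 10.00)   # input tokens at $10 / 1M
output_cost = daily_token_cost(30, 45, 30.00)   # output tokens at $30 / 1M
daily_total = input_cost + output_cost
monthly_total = daily_total * DAYS_PER_MONTH

if __name__ == "__main__":
    print(f"input:   ${input_cost:.2f}/day")
    print(f"output:  ${output_cost:.2f}/day")
    print(f"total:   ${daily_total:.2f}/day, ${monthly_total:,.2f}/month")
```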

To be clear, estimates for these types of workflows are highly specific to each use case. Altering one variable, such as the input prompt size or the request volume, can significantly increase or decrease the large language model cost estimate. However, even with a moderately sized request volume and a fairly small input prompt and output response, we can see that the cost of this model is not insignificant, and this is just one use case.
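To see how sensitive the estimate is to a single variable, the sketch below recomputes the monthly cost for a few variations of the example scenario. The scenario names and the doubled values are hypothetical what-ifs chosen for illustration; only the baseline comes from the example above.

```python
def monthly_managed_cost(requests_per_minute: float, input_tokens: float,
                         output_tokens: float, input_price_per_1m: float,
                         output_price_per_1m: float, days: int = 30) -> float:
    """Monthly USD cost of a managed LLM at a constant request rate."""
    requests_per_month = requests_per_minute * 60 * 24 * days
    cost_per_request = (input_tokens * input_price_per_1m
                        + output_tokens * output_price_per_1m) / 1_000_000
    return requests_per_month * cost_per_request


# Baseline is the article's example; the other rows are hypothetical what-ifs.
SCENARIOS = {
    "baseline": dict(requests_per_minute=30, input_tokens=150, output_tokens=45),
    "double input prompt": dict(requests_per_minute=30, input_tokens=300, output_tokens=45),
    "double request volume": dict(requests_per_minute=60, input_tokens=150, output_tokens=45),
}

if __name__ == "__main__":
    for name, params in SCENARIOS.items():
        cost = monthly_managed_cost(**params, input_price_per_1m=10.00,
                                    output_price_per_1m=30.00)
        print(f"{name:22s} ${cost:,.2f}/month")
```

Doubling the request volume doubles the bill, while doubling only the input prompt raises it by a smaller share, because output tokens are priced separately.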

Estimating Costs for Llama3-8b

With open-source large language models, it’s a bit easier to price out as we really don’t need to worry about input or output token count. Our main concerns here are latency, throughput and concurrency. We’ll need to make sure that we are deploying enough resources to handle the request load being sent to the model with reasonable latency, without overloading the model.

Continuing the sentiment analysis example from above, let’s assume we are working with Llama3-8b and running on an AWS g5.2xlarge. Assuming a generous latency of 2-3 seconds per request, the 8 vCPUs on the g5.2xlarge should be able to handle that request volume concurrently. However, you should be mindful of resource utilization and adjust accordingly if GPU memory or CPU usage begins to spike.
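One way to sanity-check that a single instance can keep up is Little's law, which is not part of the original estimate but is a standard queueing result: the average number of in-flight requests equals the arrival rate multiplied by the average latency. The sketch below applies it to this example; the per-replica capacity is a hypothetical placeholder that you would measure through load testing.

```python
import math


def concurrent_requests(requests_per_minute: float, latency_seconds: float) -> float:
    """Average number of in-flight requests (Little's law: L = lambda * W)."""
    arrival_rate_per_second = requests_per_minute / 60
    return arrival_rate_per_second * latency_seconds


def replicas_needed(concurrency: float, capacity_per_replica: int) -> int:
    """Replicas required if each can serve `capacity_per_replica` requests at once."""
    if capacity_per_replica < 1:
        raise ValueError("capacity_per_replica must be at least 1")
    return max(1, math.ceil(concurrency / capacity_per_replica))


if __name__ == "__main__":
    for latency in (2.0, 3.0):
        in_flight = concurrent_requests(30, latency)
        # capacity_per_replica=2 is a hypothetical, conservative placeholder.
        print(latency, in_flight, replicas_needed(in_flight, capacity_per_replica=2))
```

These are averages. Bursty traffic can push concurrency well above the mean, which is one reason to keep headroom or add a replica.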

1 g5.2xlarge x $1.212/hour (on-demand) x 24 hours = $29.09/day or $872.64/month

Or if we err on the safe side, we can add an additional replica to make sure throughput isn’t a bottleneck:

2 g5.2xlarge x $1.212/hour (on-demand) x 24 hours = $58.18/day or $1,745.28/month

Even with the additional replica node, the open-source deployment option comes in quite a bit cheaper, and we could reduce this further if we chose to deploy the LLMs on spot instances rather than on-demand.
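A useful follow-up question is at what request volume the two options cost the same. The sketch below estimates that break-even point from the figures used in this example (GPT-4-turbo rates, 150 input and 45 output tokens, a g5.2xlarge at $1.212/hour). It computes without intermediate rounding, so totals may differ by a few cents from hand-rounded figures, and it assumes a 30-day month and ignores engineering and operations overhead for self-hosting. The function names are hypothetical.

```python
def managed_cost_per_request(input_tokens: float, output_tokens: float,
                             input_price_per_1m: float,
                             output_price_per_1m: float) -> float:
    """USD cost of a single request on a managed, per-token-priced LLM."""
    return (input_tokens * input_price_per_1m
            + output_tokens * output_price_per_1m) / 1_000_000


def self_hosted_monthly_cost(hourly_rate: float, replicas: int = 1,
                             days: int = 30) -> float:
    """USD cost of always-on instances for one month."""
    return hourly_rate * 24 * days * replicas


def break_even_requests_per_minute(per_request_cost: float,
                                   monthly_hosting_cost: float,
                                   days: int = 30) -> float:
    """Constant request rate at which managed and self-hosted costs are equal."""
    requests_per_month = monthly_hosting_cost / per_request_cost
    return requests_per_month / (60 * 24 * days)


if __name__ == "__main__":
    per_request = managed_cost_per_request(150, 45, 10.00, 30.00)
    for replicas in (1, 2):
        hosting = self_hosted_monthly_cost(1.212, replicas)
        rpm = break_even_requests_per_minute(per_request, hosting)
        print(f"{replicas} replica(s): ${hosting:,.2f}/month, break-even at {rpm:.1f} req/min")
```

Under these assumptions, a single instance breaks even at roughly 7 requests per minute. Below that rate the managed option is cheaper, and above it the fixed-cost instance wins, provided it can actually sustain the load.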

So Which Should You Choose?

While the example presented above may point to the open-source deployment option as the cheaper alternative, estimating LLM costs is certainly not a one-size-fits-all exercise. We could easily flip one or two variables here and the managed LLM solution would be the far better pick. There are numerous considerations and requirements that influence your decision. Internal security restrictions may prevent sensitive data from being shared with a managed solution, the accuracy of one model may wildly outperform another model with similar resource requirements and training parameters, the request volume may be small and infrequent enough that self-hosting would never be cost-effective, serverless hosting may offer better scaling performance, or the prompt may be so simple that anything beyond a lightweight model would be significant overkill.

When it comes to estimating large language model expenses and choosing the right model for your workflows, the answer is ultimately “it depends”. You shouldn’t assume that one model or one deployment type will fit all of your use cases. You should certainly run comparison calculations like the ones outlined above in your experimentation phases, but a holistic approach is required to fully account for all your organization’s needs. When it comes to pricing your LLM workflows, you should take into account all of the following parameters:

Token size of the input prompt
Token size of the output response
Consistency of the input prompt
Expected latency
Acceptable latency
Responsiveness of autoscaling
Complexity of the prompt
Security of the input data
Complexity of the model

Additional Costs for LLM Services

In addition to the infrastructure for deploying the model itself, many LLM applications have additional costs that accompany the model. In particular, RAG (Retrieval Augmented Generation) frameworks have become extremely popular solutions for providing additional context and data to LLMs that may be outside their training scope. In a RAG framework, data is stored in some external location. This could be a vector store, relational database, in-memory cache, or an external API call. Whether you are using a managed vector solution like Pinecone or Qwak, or hosting an open-source solution like Milvus or PgVector on an EC2 instance in your AWS environment, additional costs will be incurred that should be included in the overall cost of your LLM solution.

In addition, as LLM applications become more advanced, organizations are incorporating more and more components to strengthen and enhance the accuracy of LLMs. This can include traditional machine learning models like sentiment analysis or classification, LLM gateways that route traffic or prompts to specific models, security components that limit or restrict data to privileged users, or even additional, different LLMs that serve to validate or test the output of the primary LLM’s response. All of these services and components come with additional infrastructure, network ingress, and further costs that should be accounted for in your overall generative AI cost estimation.

Conclusion

The LLM industry is rapidly evolving and iterating, and your approach to estimating large language model costs will require constant revisiting and monitoring. Advances in hardware and algorithmic efficiency are regularly being made, allowing managed LLM providers and open-source models to deliver inference faster, cheaper, and more accurately. But it’s important not to view your large language model costs as a black box. Keep track of your token utilization, request latency, and prediction accuracy to select the best model for your use case, and employ the strategies laid out in this blog post to monitor your large language model costs and avoid an eye-popping surprise bill at the end of the month.

How Qwak Can Help

Qwak is an end-to-end MLOps and Generative AI platform that manages the infrastructure required for advanced machine learning development as well as the observability and monitoring capabilities necessary for maintaining your models. Qwak provides solutions for training, experiment tracking, model registry, inference deployment - real-time, streaming, and batch - as well as monitoring, alerting, and automation. You can easily deploy open-source Large Language Models like Llama3, Mistral, and BERT with the click of a button, or you can customize the deployment to add functionality like testing, guardrails or authentication. Qwak also provides support for traditional machine learning models as well as feature stores and vector stores, so you can easily deploy complex LLM workflows all in one place.

Also, be sure to check out our new LLM platform that will include prompt templating and versioning, LLM tracing, advanced A/B testing strategies, and specific Large Language Model Cost monitoring.