Simple Question: Is Snowflake Expensive? The answer is - It Depends!
Photo by Giorgio Trovato on Unsplash
Around five years ago, I joined Snowflake in the UK as a Senior Solution Architect, helping customers across Europe and the Middle East get up to speed with this new technology from an unknown startup - called Snowflake. Back then, Snowflake was the first and only database platform to provide a pay-as-you-go service for as little as $3.00 for an entire hour of processing, and everyone was rushing to migrate existing on-premises data warehouses.
I felt like a lone voice for a long time - surely we must be careful. Snowflake is incredibly cheap - but the costs can only go up.
Five years later, about two dozen start-ups (Altimate is one I've been advising for the past year) that claim (and in some cases do) reduce your Snowflake costs by 30% or more.
This raises the question - Is Snowflake Expensive?
In this article, I'll discuss the Snowflake pricing strategy and the time/cost paradox, proving that Snowflake is not expensive and explaining why it costs so much.
Is Snowflake Expensive?
No. Nein, Het, Bù, Non. Snowflake is incredibly cheap.
Think about it: for less than the price of a cup of coffee, you get a powerful XSMALL database server for an entire hour. If halfway through, you take a break for a way more expensive coffee (good luck buying a flat white for under $3!), the machine will silently spin down, and you'll pay nothing further until you resume running queries.
This is not just "good value" - this is AMAZING!
Before I joined Snowflake, I worked at two London-based investment banks that built Oracle data warehouses. The hardware alone costs around $1m, not to mention the cost of licenses, backups, and a full-time DBA to install the software and configure the backups.
Leaving aside the cost, it took us three months from the decision to create table,
and hello world
. With Snowflake, you can set up a trial account in three minutes and even get $400 of free time (1,200 hours of computing time at $3 per hour).
So - Snowflake is definitely NOT expensive - but customers say it is. What's going on here?
How much does Snowflake charge?
Leaving aside serverless processing (an entire subject for another blog), the main cost for Snowflake customers is from Virtual Warehouses and Storage. Virtual warehouses are sized in "T-shirt" sizes. The following table summarizes the pertinent facts.
So, for $12 an hour, you get a MEDIUM-size server with 32 virtual CPUs that you can dynamically resize up or down, plus access to a massive software base, including the ability to run Java, JavaScript, Python, and packages from hundreds of third-party vendors on petabytes of data.
My benchmark testing has demonstrated that, on a MEDIUM-size virtual warehouse, you can copy an entire terabyte of data in just under an hour, and it will support up to 600 concurrent queries running for 1-2 seconds. Even then, it will automatically queue queries if the machine is over-loaded and can be configured to start up another same-size cluster using the following SQL automatically:
alter warehouse reporting_wh
set max_cluster_count = 2;
Why do Customers Find Snowflake Expensive?
"Risk comes from not knowing what you're doing" - Warren Buffet
Let's consider how a project initially uses Snowflake. Most customers correctly start with an X-SMALL virtual warehouse (perhaps two, one for development, another for production). Over time, as we add additional workloads, performance degrades, and the DBA executes the following statement:
alter warehouse reporting_wh
set warehouse_size = 'SMALL';
This now gives you double the capacity (16 CPUs, double the memory, and a super-fast SSD), and query performance improves. The DBA is immediately pleased, and everyone gets back to work.
Further time passes, and we gradually see the situation illustrated in the diagram below:
As additional workloads are added, the DBA needs to resize the virtual warehouse again, from X-SMALL to SMALL and then MEDIUM. Each time, the resources are doubled, and queries that used to take an hour suddenly take 30 minutes.
Everyone is happy. Well, most people. The project managers see ever-increasing costs and (worst still) no end in sight.
Let's consider why Snowflake is not expensive - even when scaling up.
The Snowflake Time/Cost Paradox
All truths are easy to understand once they are discovered; the point is to discover them. -- Galileo Galilei
Simple arithmetic demonstrates that if something is billed at twice the rate, it should cost twice as much.
If an X-SMALL virtual warehouse is charged $3.00 per hour and a SMALL at $6.00 per hour, surely the total cost will double? Well, no, actually. It depends.
Consider the table below, which shows the execution time and cost of copying 1.2 TB of data on a warehouse of varying sizes.
The above table shows the cost of repeatedly executing the following SQL statement on different warehouse sizes.
insert into store_sales as
select *
from sample_data.tpcds_sf10tcl.store_sales;
The table shows that as the query execution time was halved (100% faster), the query cost remained the same, but the elapsed time was reduced.
Indeed, the benchmark demonstrates you can get the same results in 60 minutes instead of four hours for the exact same cost.
This is because each time you increase the Virtual Warehouse size, you double the compute capacity, providing double the CPUs, memory, and SSD. Because Snowflake is billed by the second until suspended, you can get the same work done twice as fast for the exact cost.
This begs the question: Why do customers complain that Snowflake is expensive? Clearly, it's not.
What Problem are we trying to Solve?
"Inside every big problem is hundreds of little ones dying to get out". - Robert A. Heinlein.
The real problem we're trying to solve is a Packing Problem. This is a mathematical problem of assembling objects into containers, leaving no blank spaces.
However, the objects are workloads (batch jobs or sequences of queries), and the containers are virtual warehouses. The diagram below (Conway's Puzzle) illustrates the challenge of keeping Virtual Warehouses fully loaded and running efficiently while minimizing idle time.
We need to pack a cube with a varying number of different-shaped blocks, and the diagram below shows the step-by-step solution.
Image Credit: Wikipedia.
Imagine the single yellow cube is a large batch job executed once daily and takes four hours to run on a MEDIUM-size warehouse. The red blocks are executed once every hour and take 15 minutes, the blue shape is executed every 10 minutes and takes 1 minute to complete, and the green shapes represent jobs executed every five minutes and take around 30 seconds on an empty machine.
This means we have 4 batch jobs with varying execution frequencies and elapsed times, all executed on the same Virtual Warehouse.
The massive batch job that used to take four hours on an empty machine now takes six hours as it competes for machine resources with other jobs. Meanwhile, the different jobs execute regularly, as illustrated in the diagram below.
The problem with the above workload allocation is the mix of workload sizes where many workloads are not large enough to justify a 32 CPU (MEDIUM-size) warehouse. A query will only execute twice as fast on the next warehouse size if it has enough work to distribute across all the available CPUs.
For example, the green jobs (executed every five minutes) take just 30 seconds to complete and are unlikely to run twice as fast on a MEDIUM-sized warehouse as a SMALL machine; however, the charge rate is doubled.
Assuming the green jobs take 45 seconds on a SMALL warehouse and 30 seconds on a MEDIUM-size warehouse with an auto-suspend time of 60 seconds and are run 24x7 every five minutes, the cost of this job on an empty warehouse is:
SMALL-Size: $341,640 per year
MEDIUM-Size: $630,640 per year
A MEDIUM-sized warehouse bills for 30 seconds execution time + 60 seconds AUTO_SUSPEND time or 40 minutes per hour at $6 per hour. A SMALL warehouse bills for 45 seconds + 60 seconds AUTO_SUSPEND at $3 per hour.
To summarize, in the worst case, we pay double the annual cost for a savings of just 15 seconds per execution.
OK, so it's not quite as clear-cut as that because it’s likely that other batch jobs run concurrently, meaning the total time waiting to suspend is reduced, but the principle remains the same.
Besides, at 4am, when there’s little running on a warehouse the repeating short jobs are likely to be running on an empty machine.
To put it another way, if a job takes an hour to complete and the warehouse suspends after 60 seconds the idle time is just 1% of execution time
If, however, a job runs for 30 seconds, the idle time accounts for 66% of the total (90 seconds) of billed time.
Running on a bigger warehouse doubles the cost of idle time, which is a potentially huge cost for frequently executed short workloads.
Scaling Up Compounds the Problem
Let's consider the above scenario. The business decided the yellow job, which takes six hours to complete in a medium-sized warehouse, is unacceptable. They need the results faster, so the DBA increases the size from a MEDIUM to a LARGE warehouse.
It's reasonable to assume the Yellow batch job execution time is reduced from six hours to just three (so the business is delighted), but the cost of the Red, Blue, and Green jobs, which are all frequently executed on a machine that's oversized for what they need, has now doubled.
Now, you can see why customers find Snowflake expensive. The real problem is that customers execute small, repetitive workloads on warehouses that are too big but resize the warehouse for the largest workload.
The Solution: Altimate
At Altimate, we've spent countless hours executing hundreds of thousands of queries against millions of data points to identify and optimize warehouse workloads.
We have built multiple automated AI agents that constantly scan your warehouse workloads and automatically change your configuration to reduce operational costs.
If the problem is Snowflake cost. The answer is to talk to Altimate.
DataPilot from Altimate.ai can proactively monitor and provide you with detailed insights on how and where time can be saved. If you, too, want to leverage this feature and save development time without increasing the cost, why not get in touch?
Click on the image below to book a demo.