It's not just cloud costs that are out of control: 450 million a year for observability

author:Technical Alliance Forum

InfoQ Architecture Headline 2023-05-29 15:00 Posted in Heilongjiang

Compiled by | Chu Xingjuan, Nuclear Cola

"Which company spent $65 million on Datadog in 2022?" Datadog recently revealed on an earnings call that one customer had made a single upfront payment of as much as $65 million, which immediately drew the industry's attention: which company is spending on that scale? And do observability vendors really make this much money?

Datadog is a major force in observability. It went public in 2019 and currently has a market cap of $28 billion. The company's revenue in 2022 was $1.67 billion, or roughly $140 million per month. In a small survey on cost reduction, "AWS" and "Datadog" were the two most frequently mentioned vendors, a clear sign that infrastructure and observability costs are already high, with AWS leading in infrastructure and Datadog in observability.

On the May 4 earnings call, Datadog CFO David Obstler referred to a "non-recurring expense" (also known as a one-time expense) and said:

"Billings for the quarter were $511 million, up 15% from the year-ago quarter. In the first quarter of 2022, a customer made a large upfront payment, and there was no similarly large payment in the first quarter of 2023. Adjusting for this customer, billings growth was in the low 30s percent year-over-year."

This detail was captured by Mark Ronald Murphy, executive director of research and financial analyst at JPMorgan Chase. After making calculations, Murphy revealed that the advance payment was about $65 million (about 450 million yuan), and Datadog acknowledged the accuracy of the figure. Obstler said the company changed the billing frequency and amount, so the customer's bill would be spread more over time.

Obstler revealed, "This is a cryptocurrency company and is still our customer. They were an early optimizer, in the kind of industry we have often discussed as being the most affected and having the most room to optimize."

Datadog co-founder and CEO Olivier Pomel said the customer's vertical has been nearly wiped out over the past year, and that the customer's own revenue has shrunk to a third or a quarter of what it was. "In this case, we worked with the customer to restructure their contract with us. We want to be part of their solution, not part of the problem."

At this point, the Internet is full of speculation about which crypto company spent $65 million on Datadog in 2022.

Investor Turner Novak speculated that it was Coinbase, but wasn't quite sure. Some people online even claimed to be Coinbase employees. For example, an anonymous commenter on Hacker News claimed that the $65 million was actually a prepayment covering the next three years, but that claim could not be verified. Later, Gergely Orosz, author of The Pragmatic Engineer, posted that he had confirmed the company was Coinbase and that the payment was its bill for a single year. Let's take a look at Orosz's detailed account.

"Nobody cares about infrastructure costs anymore"

Coinbase went public in April 2021 and was valued at $85.7 billion on its first day of trading. About two years later, the company's valuation is roughly $14 billion, a drop of more than 80%. During the boom years, trading volumes surged to new highs, and Coinbase's infrastructure could barely keep up. Coinbase CEO Brian Armstrong has said:

"2021 was an incredible year for Coinbase, the kind of run you rarely see in a lifetime, and one of only a few in the entire history of business. Our monthly transacting users reached a record high of 11.4 million, a 4x year-over-year increase. Growth of 400% is unbelievable."

After going public in 2021, no one at Coinbase cared about infrastructure costs; the only goal was to keep growing. The company paid huge sums to vendors such as AWS, Snowflake and Datadog. The $65 million was in fact for Datadog usage in 2021, and Coinbase settled the bill in the first quarter of 2022.

But at the start of 2022, Coinbase's situation took a turn for the worse, requiring immediate cuts in infrastructure spending. This is because the crypto industry has suddenly cooled, and Coinbase's business has naturally been affected. As revenue dried up, the company began to turn its attention to reducing costs and increasing efficiency.

In terms of observability, Coinbase formed a dedicated team with the goal of moving this functionality from Datadog to an in-house Grafana/Prometheus/ClickHouse stack. Here is a brief introduction to these technologies:

  • Prometheus: A time series database. As a very popular open source solution for system and service monitoring, Prometheus collects metrics from configured targets (services) at given intervals and combines evaluation rules to trigger alerts.

Prometheus is primarily written in Go, but also uses Java, Python, and Ruby code. Prometheus stores time series data in an efficient, customizable format on in-memory and persistent storage media (HDDs or SSDs) with support for partitioned and federated deployments.

Prometheus is a Cloud Native Computing Foundation (CNCF) project, so it is reasonably safe to build your business on top of it: the project is expected to be maintained and supported for the foreseeable future.

Prometheus can be self-hosted, and some cloud providers also offer managed Prometheus services: both Google Cloud and AWS offer production-grade options, while Azure's service is currently only in preview.

  • Grafana: A visualization frontend for metrics. Grafana is a popular open source solution for analytics and monitoring visualization. If you need to view or drill down into metrics or alerts, Grafana is the go-to tool for tech companies.

  • ClickHouse: A log management tool. This is a fast, open source, column-oriented database management system and a popular choice for log management. ClickHouse is primarily written in C++ and is widely used throughout the industry. Cloudflare, for example, uses ClickHouse to store all of its DNS and HTTP logs (over 10 million rows per second). ClickHouse also underpins Uber's central logging platform.

Coinbase's initial reason for building in-house was not mainly to save money, but to gain complete control over its observability. Observability and reliability are Coinbase's biggest advantages over competitors in the market.

But as the cryptocurrency market cooled, cost became a core concern, and the internally operated Grafana/Prometheus setup is indeed much cheaper. The Coinbase team spent months debugging the new stack, eventually fixing all the issues and confirming that everything worked correctly.

With that, Coinbase was ready to say goodbye to Datadog, but Datadog saved the partnership at the last minute by offering Coinbase generous terms it could not refuse. Put simply, Datadog's subsequent bills will be far lower than the $65 million for 2021. After all, to borrow Brian Armstrong's description of the 2021 cryptocurrency market, bills of $65 million have been rare in the entire history of business.

To retain customers, Datadog slashes its prices

Orosz asked a Coinbase engineer who had worked with both the in-house stack and Datadog what they thought of the decision to keep Datadog. The engineer felt that staying with Datadog was the right call, given the now reasonable cost and the excellent developer experience.

Coinbase will eventually be able to design a similar experience in-house, but achieving a seamless developer experience similar to Datadog's will likely take decades of engineering.

And "expensive" is a relative concept when it comes to observability tools. For example, after a significant price cut, Coinbase now spends "only" $10 million a year on Datadog. So is $10 million a lot?

At first glance it may still seem like a lot, but digging deeper shows that platforms like Datadog can help prevent outages, detect them immediately, and shorten the resulting downtime.

In 2022, Coinbase experienced 18 outages totaling approximately 12 hours. Based on 2022 revenue, the company's average daily revenue is around $9 million. Suppose Datadog's early monitoring prevented half of the outages that would otherwise have occurred; then without Datadog, total downtime would have been 24 hours.

In addition, suppose that with Datadog, Coinbase recovers from outages about 1.5x faster (possibly because Datadog quickly correlates health metrics with logs, debugging actions and so on, helping to pinpoint root causes and speed up mitigation); then the total downtime without Datadog stretches further, to 36 hours.

By simple arithmetic, choosing Datadog spared Coinbase 24 hours of downtime, or about $9 million in revenue, so the current bill of $10 million per year looks reasonable.
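The downtime arithmetic can be sketched in a few lines of Python. This is only an illustration of the article's back-of-the-envelope estimate: the 12-hour, 36-hour, $9 million and $10 million figures come from the text, the 36-hour counterfactual is itself an assumption, and the function and constant names are made up for this sketch.

```python
# Figures quoted in the article; the 36-hour counterfactual is an assumption, not a measurement.
ACTUAL_DOWNTIME_HOURS = 12          # Coinbase's reported 2022 downtime
COUNTERFACTUAL_DOWNTIME_HOURS = 36  # assumed downtime without Datadog
DAILY_REVENUE_USD = 9_000_000       # approximate average daily revenue in 2022
DATADOG_ANNUAL_COST_USD = 10_000_000


def downtime_savings(actual_hours, counterfactual_hours, daily_revenue):
    """Revenue protected by the downtime that was avoided, assuming revenue
    accrues evenly over a 24-hour day (a simplification)."""
    hours_avoided = counterfactual_hours - actual_hours
    return hours_avoided * daily_revenue / 24


savings = downtime_savings(
    ACTUAL_DOWNTIME_HOURS, COUNTERFACTUAL_DOWNTIME_HOURS, DAILY_REVENUE_USD
)
print(f"Estimated revenue protected: ${savings:,.0f}")
print(f"Annual Datadog bill:         ${DATADOG_ANNUAL_COST_USD:,}")
```

The estimate is very sensitive to the two assumptions: change either the share of outages prevented or the recovery speed-up, and the protected revenue moves by millions of dollars.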

Observability bills in the tens of millions of dollars are not uncommon

In Datadog's case, the numbers are more complicated, because the company bills not only for observability solutions but also for security products. The earnings report did not say how many of these SaaS services the unnamed company used.

"While $65 million is a staggering number, $10 million bills are not unusual for traditional observability companies." Shahar Azulay, CEO of Groundcover, an observability alternative provider, said.

"A big company like Coinbase has been spending $10 million a year for a while," says Azulay. "It's not uncommon for companies to pay more than $10 million a year to observability providers like Splunk, Dynatrace, or Datadog, or even to multiple vendors at once, each with a bill in the double-digit millions."

Azulay adds that the key issue is how observability vendors set prices. Observability solutions monitor three types of data: logs, metrics, and traces (which follow the path of an interaction, such as an end-to-end transaction and what happens between services). The growth of these data sources is difficult to predict, especially when events like Black Friday push usage to a peak.

"It creates a lot of unpredictability and a heavy dependence on the amount of data pushed into the logs, which is the root cause of many of the pricing pain points, because you can't control and you can't know how much you're going to pay next month." What's more, Azulay says, even if the contract covers a certain pricing tier, once the company exceeds that tier, the vendor charges the higher tier's rate from that day onward.

"A specific log line can sit in a critical part of the infrastructure, such as Google's search engine or anything else that runs 1 million times a day, simply because customers use it 1 million times a day." Azulay says developers may just be pushing more log lines or data points into the system without knowing the consequences. There's a cycle in which developers create applications and build the business logic the organization needs as a product provider, and only two months later does engineering management realize: oh, this has pushed our bill up 50%.

Azulay believes the problem may land on developers, because they push too much information into the observability stack, and the cost pressure can then result in fewer data points being monitored in production. "It's a strange vicious cycle where developers want more data to troubleshoot and managers have to make trade-offs, and they have to pay a lot of money for it."

However, not all observability companies charge this way. Groundcover, which uses an eBPF agent, does not charge by the volume of data collected; it charges by the number of servers running in the production environment.
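To make the difference between the two pricing models concrete, here is a minimal Python sketch. Every number in it (rates, committed volume, host count) is hypothetical and chosen only for illustration; it does not reflect any vendor's actual price list. The volume model follows Azulay's description: once cumulative usage passes the contracted tier, the higher rate applies from that day onward.

```python
def volume_bill_cents(daily_gb, committed_gb, base_rate_cents, overage_rate_cents):
    """Volume-based bill for one month. Each day's data is billed at the base
    rate until cumulative volume exceeds the committed tier; from that day
    onward it is billed at the higher rate."""
    total = 0
    cumulative = 0
    for gb in daily_gb:
        cumulative += gb
        rate = overage_rate_cents if cumulative > committed_gb else base_rate_cents
        total += gb * rate
    return total


def per_host_bill_cents(hosts, rate_per_host_cents):
    """Host-based bill: independent of how much data the hosts emit."""
    return hosts * rate_per_host_cents


# Hypothetical months: a quiet one, and one with a traffic peak in the last 10 days.
quiet_month = [100] * 30               # 100 GB of logs per day
peak_month = [100] * 20 + [300] * 10   # logging triples during the peak

quiet = volume_bill_cents(quiet_month, committed_gb=3000, base_rate_cents=10, overage_rate_cents=25)
peak = volume_bill_cents(peak_month, committed_gb=3000, base_rate_cents=10, overage_rate_cents=25)
flat = per_host_bill_cents(hosts=50, rate_per_host_cents=2000)

print(f"volume-based, quiet month: ${quiet / 100:,.2f}")
print(f"volume-based, peak month:  ${peak / 100:,.2f}")
print(f"per-host, either month:    ${flat / 100:,.2f}")
```

Under these made-up rates the volume-based bill grows much faster than the data volume does, which is the unpredictability Azulay describes; the per-host bill only changes when servers are added or removed.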

Who is the "big sucker"?

Vendors are clearly tight-lipped about customers cutting their spend, and we were only lucky enough to identify Coinbase from the subtle clues in Datadog's statements. But Coinbase is by no means an isolated case; it reflects the overall trend of the market.

Datadog CEO Olivier Pomel confirmed that similar cost optimization initiatives are happening across customers:

"Looking at our data, looking back at what we've heard from hyperscale customers, and summarizing their views on the near future, we really don't have much confidence in what lies ahead. In other words, over the next quarter or so, large-scale cost cutting and efficiency drives will continue. So in terms of our current guidance and planning for the year, we assume that will basically remain the case for the rest of the year."

Datadog's crisis may also be ongoing. Orosz revealed that Shopify is planning to decouple from Datadog.

Orosz says that a number of large companies are building their own in-house Grafana/Prometheus technology stacks to move away from observability vendors, which is ultimately a matter of money.

"A fixed expenditure of $2 million to $5 million per year is the best reason to leave a vendor. After all, once you reach this scale, it is theoretically better to hire an internal team to take over this work yourself," Orosz said.

As a rule of thumb, the operating cost of in-house infrastructure is much lower than the price charged by a vendor. That's because vendors and enterprises tend to run on the same cloud infrastructure, be it AWS, Google Cloud Platform, or Azure. The biggest difference is that companies need to hire dedicated engineers and technicians to build and run that infrastructure.

Therefore, from a cost perspective, the final trade-off can be distilled into the following simple rule:

Infrastructure Cost + Platform Team Cost < Existing vendor cost

A platform team costs more than $1 million a year, sometimes more than $2 million. That's because the team needs at least 4 or 5 engineers plus a manager, each earning between $150,000 and $400,000 a year, depending on the local cost base.

So when a vendor bill reaches $2 million or even $3 million a year, building in-house becomes the more sensible option than outsourcing. The deciding factor is how much margin the vendor adds on top of the underlying infrastructure cost.
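The rule can be checked with a short Python sketch. The team size and salary range are the article's figures; the $1 million infrastructure cost is an assumption (the upper bound Orosz mentions for self-hosted infrastructure), salaries stand in for the full cost of an employee, and the function names are invented for this example.

```python
def platform_team_cost(engineers: int, managers: int, avg_salary_usd: int) -> int:
    """Annual salary cost of the platform team (salaries only; benefits,
    equity and tooling are ignored in this sketch)."""
    return (engineers + managers) * avg_salary_usd


def cheaper_in_house(infra_cost_usd: int, team_cost_usd: int, vendor_cost_usd: int) -> bool:
    """The article's rule: infrastructure cost + platform team cost < existing vendor cost."""
    return infra_cost_usd + team_cost_usd < vendor_cost_usd


# Assumed inputs: 5 engineers and 1 manager at the top of the quoted salary range,
# plus $1 million a year for self-hosted infrastructure.
TEAM_COST = platform_team_cost(engineers=5, managers=1, avg_salary_usd=400_000)
INFRA_COST = 1_000_000

for vendor_bill in (2_000_000, 3_000_000, 10_000_000, 65_000_000):
    if cheaper_in_house(INFRA_COST, TEAM_COST, vendor_bill):
        verdict = "build in-house"
    else:
        verdict = "stay with vendor"
    print(f"vendor bill ${vendor_bill:,}: {verdict}")
```

With these deliberately pessimistic inputs the break-even sits at $3.4 million a year; at the low end of the salary range it falls to about $1.9 million, which is in line with the $2 million to $3 million threshold quoted above.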

Orosz said he couldn't understand Coinbase's behavior: why did it wait until the vendor's bill had reached $65 million before it started thinking about building its own team?

"That's $65 million. Coinbase could use it to assemble a luxurious lineup of 10 senior engineers in the Bay Area, and even that would cost no more than $5 million a year. Then there's the infrastructure budget, which is less than $1 million a year," Orosz lamented.

Reference Links:

https://investors.datadoghq.com/static-files/18234a4f-04f9-4a9f-9679-668cd672fb7b

https://blog.pragmaticengineer.com/datadog-65m-year-customer-mystery/

https://thenewstack.io/datadogs-65m-bill-and-why-developers-should-care/