Can the M3 MacBook Pro improve efficiency? Programmers and product managers prove themselves, CTO: You win, the new computer is on the way

Compiled by | Su Mi

Produced by | CSDN（ID：CSDNnews）

At the end of October 2023, Apple unveiled the M3 series of chips and the devices built on them in a brisk 30-minute event, with Cook describing them as "extremely powerful, technologically advanced, and extremely energy efficient."

The launch caught the attention of many Apple fans. At tech companies, it's no surprise to be handed a MacBook on your first day, so once the latest MacBooks with M3 series chips went on sale, many employees wanted to be among the first to try them. incident.io, an overseas developer of incident management software, was one of the companies whose employees asked whether it would buy the new devices.

In the incident.io intranet, many employees said:

Wow, how much more would my productivity improve if my laptop looked that good?
If our machines were deep space black, maybe we would be safer 🤔

In the face of enthusiastic employee discussion, Pete, the company's CTO, responded: "If I can prove that it's worth it, then I'll buy it."

These words spurred many employees into action. A team led by incident.io product manager Lawrence Jones got to work, first building a custom Go hot-reloading tool with the help of programmers. The tool tracked telemetry from developers' laptops (data collected from various sources and subsystems within the MacBooks), and the results were loaded into OpenAI's latest GPT-4 model for analysis, ultimately helping the company gauge whether it should abandon its existing M1/M2 devices and spend money on the latest M3 MacBook Pro for its programmers.

Next, we'll take a look at exactly how Lawrence Jones did it!

The work of a programmer is difficult to quantify

It's hard for incident.io, like many teams in the industry, to quantify developer productivity, but a faster feedback loop (which fosters continuous improvement by identifying issues, improving features, and validating design choices) is a reliable way to improve it.

Internally, the most common feedback loops in local development are:

Compiling the Go monolith.
Running code generation: API clients, interfaces, etc.
Hot-reloading front-end/mobile applications (so changed code takes effect without restarting the service).

All developers work locally on their laptops in a fully fledged incident.io environment, which allows for a <30-second feedback loop between changing code and running it.

With a codebase of close to 1 million lines of code, compiling the Go application is one of the most frequent and expensive steps in these loops. Because it is also one of the most resource-intensive of these tasks, it is the one most likely to benefit from more powerful hardware.

With this in mind, the incident.io team chose Go compilation speed as the key metric for MacBook performance. All that remained was to measure it.

Get telemetry

According to Lawrence Jones, incident.io has long used codegangsta/gin (https://github.com/codegangsta/gin) as a hot-reloading tool for Go programs. It works well: whenever a code file changes, it compiles and restarts a new version of the application. However, for a number of awkward reasons, incident.io had been considering replacing it.

When Lawrence Jones and his colleagues looked at other hot-reloaders, though, none of them could provide the telemetry they wanted for tracking compile times.

For Lawrence Jones, the ideal hot-reloading tool would capture:

System specifications: platform (M1/M2/M3), total memory, etc.
Runtime metrics: operating system, memory usage, power source, battery level, etc.
Build telemetry: total duration, Go compilation stages, which files triggered the build, and more.
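
The wish list above could map onto a single event type. The following is a minimal sketch, not incident.io's actual schema: the BuildEvent type, its field names, and the JSON keys are all assumptions made for illustration, and the values in main are sample values.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BuildEvent is a hypothetical telemetry payload. Field names and JSON keys
// are illustrative and not taken from incident.io's tool.
type BuildEvent struct {
	// System specifications.
	Platform      string `json:"platform"` // e.g. "Apple M1 Pro"
	TotalMemoryGB int    `json:"total_memory_gb"`
	// Runtime metrics.
	OS           string `json:"os"`
	MemoryFree   string `json:"memory_free"` // percentage, kept as parsed text
	PowerSource  string `json:"power_source"`
	BatteryLevel int    `json:"battery_level"`
	// Build telemetry.
	DurationSeconds float64            `json:"duration_seconds"`
	StageSeconds    map[string]float64 `json:"stage_seconds"`
	TriggerFiles    []string           `json:"trigger_files"`
}

// encodeEvent serializes an event to JSON, ready to be sent somewhere.
func encodeEvent(ev BuildEvent) (string, error) {
	b, err := json.Marshal(ev)
	return string(b), err
}

// decodeEvent is the inverse of encodeEvent.
func decodeEvent(s string) (BuildEvent, error) {
	var ev BuildEvent
	err := json.Unmarshal([]byte(s), &ev)
	return ev, err
}

func main() {
	ev := BuildEvent{
		Platform:        "Apple M1 Pro",
		TotalMemoryGB:   16,
		OS:              "darwin",
		MemoryFree:      "87",
		PowerSource:     "AC Power",
		BatteryLevel:    100,
		DurationSeconds: 42.5,
		StageSeconds:    map[string]float64{"compile": 30.0, "link": 12.5},
		TriggerFiles:    []string{"main.go"},
	}
	out, err := encodeEvent(ev)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(out)
}
```

Keeping everything in one flat event makes the later CSV export and per-platform grouping straightforward.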

Since there was no off-the-shelf alternative, Lawrence Jones decided to build his own, tailored to the team's specific needs.

The main motivation for this experiment was to get telemetry, so the process started as a simple main.go, which Lawrence Jones and his fellow programmers tweaked until it could extract what it needed from the host.

Mac computers have a number of binaries that can help you do this, such as memory_pressure, docker, sysctl, and pmset, and the code is mainly about executing and parsing their output:

memoryFreeRegex := regexp.MustCompile(`^System-wide memory free percentage:\s+(\d+)%`)

// Run memory_pressure...
memoryPressure, _ := exec.Command("memory_pressure").Output()
for _, line := range strings.Split(string(memoryPressure), "\n") {
	if memoryFreeRegex.MatchString(line) {
		// ...parse result!
		ev.MemoryFree = memoryFreeRegex.FindStringSubmatch(line)[1]
	}
}
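
The same execute-and-parse pattern could cover the power metrics. The sketch below is an assumption rather than the team's code: it parses text shaped like the output of pmset -g batt, using a hard-coded sample string so it runs anywhere. The parsePowerStatus name and the sample text are hypothetical, and the exact output format should be checked on your own macOS version.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var (
	// Assumed format: Now drawing from 'AC Power'
	powerSourceRegex = regexp.MustCompile(`Now drawing from '([^']+)'`)
	// First percentage in the output is taken as the battery level.
	batteryLevelRegex = regexp.MustCompile(`(\d+)%`)
)

// parsePowerStatus extracts the power source and battery percentage from
// text shaped like the output of `pmset -g batt`. ok is false when either
// value cannot be found.
func parsePowerStatus(output string) (source string, level int, ok bool) {
	src := powerSourceRegex.FindStringSubmatch(output)
	lvl := batteryLevelRegex.FindStringSubmatch(output)
	if src == nil || lvl == nil {
		return "", 0, false
	}
	n, err := strconv.Atoi(lvl[1])
	if err != nil {
		return "", 0, false
	}
	return src[1], n, true
}

func main() {
	// Illustrative sample; a real collector would use
	// exec.Command("pmset", "-g", "batt").Output() instead.
	sample := "Now drawing from 'AC Power'\n" +
		" -InternalBattery-0 (id=1234567)\t85%; charging; 0:45 remaining present: true\n"
	source, level, ok := parsePowerStatus(sample)
	fmt.Println(source, level, ok) // prints: AC Power 85 true
}
```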

Once the system and runtime collectors were built, it was simple to wrap the Go build command to extract build-specific metrics such as stage timings (linker, compilation, etc.) and to keep track of which files triggered each build.
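
As a rough illustration of wrapping a build step to time it, here is a minimal sketch. The StageTiming type and timeStage helper are hypothetical names, and the fake build function stands in for a real invocation such as exec.Command("go", "build", "./...").Run; the article does not say how the real tool obtains per-stage timings.

```go
package main

import (
	"fmt"
	"time"
)

// StageTiming records how long one named step took and whether it failed.
type StageTiming struct {
	Name     string
	Duration time.Duration
	Err      error
}

// timeStage runs the given step and measures its wall-clock duration.
func timeStage(name string, run func() error) StageTiming {
	start := time.Now()
	err := run()
	return StageTiming{Name: name, Duration: time.Since(start), Err: err}
}

func main() {
	// Stand-in for a real build command, so the sketch runs anywhere.
	fakeBuild := func() error {
		time.Sleep(20 * time.Millisecond)
		return nil
	}
	st := timeStage("compile", fakeBuild)
	fmt.Printf("%s took %v (error: %v)\n", st.Name, st.Duration, st.Err)
}
```

Passing the step as a function keeps the timing logic independent of how the build itself is launched.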

Finally, Lawrence Jones wired the hot-reloader into the existing make run target.

"This is an invisible change for our engineering team, but it means that we now send telemetry events to the HTTP endpoint of our choice every time a build completes. We set up a Fivetran webhook endpoint receiver (Fivetran is the ETL tool incident.io uses) that pushes events to our data warehouse, and that's it!" said Lawrence Jones.

In the end, Lawrence Jones got the data he needed to judge the performance of his Mac.
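
Sending a telemetry event to an HTTP endpoint after each build might look like the sketch below. The endpoint URL is a placeholder, and the buildEvent type and newTelemetryRequest helper are assumptions for illustration, not the team's actual code. The request is built but deliberately not sent, so the example runs offline.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// buildEvent is a trimmed-down, hypothetical payload.
type buildEvent struct {
	Platform        string  `json:"platform"`
	DurationSeconds float64 `json:"duration_seconds"`
}

// newTelemetryRequest builds a JSON POST request carrying the event.
func newTelemetryRequest(endpoint string, ev buildEvent) (*http.Request, error) {
	body, err := json.Marshal(ev)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	// Placeholder URL; a real setup would point at the webhook receiver.
	req, err := newTelemetryRequest("https://example.invalid/webhook",
		buildEvent{Platform: "Apple M2 Pro", DurationSeconds: 31.4})
	if err != nil {
		fmt.Println("could not build request:", err)
		return
	}
	// A real tool would now call http.DefaultClient.Do(req) and check the
	// response status; that step is omitted here.
	fmt.Println(req.Method, req.URL.String(), req.Header.Get("Content-Type"))
}
```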

Analyze the data

Following the steps above, Lawrence Jones and his colleagues waited a few weeks to collect a large enough data set, then moved on to the analysis. As most developers know, compilation times vary even on the same machine. The Go compiler in particular caches much of the compilation process, so even the most powerful M3 compiling without a cache is much slower than an older Intel MacBook compiling with one.

This means Lawrence Jones can't simply compare average build times across platforms.

Instead, the team needs to dig into the underlying data, which is a great opportunity to use the code interpreter in OpenAI's latest GPT-4 model.

Create an OpenAI assistant

Lawrence Jones shared the process he used to analyze the data with GPT-4:

Export data to CSV: the results of select * except(payload) from developer__build_events were exported from BigQuery.
Create an "assistant", with a prompt explaining the purpose of the experiment, and give it the CSV file with the data.

Specifically, Lawrence Jones used the experimental gpt-4-1106-preview model with the code interpreter enabled for data analysis. The assistant took only a few minutes to set up, and once everything was ready, it began answering data analysis questions.

Cleanse data

Lawrence Jones collected a total of 25k build events, recorded at all times of day, on a variety of laptops, and under many different conditions.

To compare platforms as fairly as possible, Lawrence Jones removed data affected by confounders:

Failed or canceled builds: if someone exits before the compilation completes, the build is unfinished work and needs to be culled.
Builds on battery power: macOS throttles performance to extend battery life.
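
Expressed as code rather than as a prompt, the two filters above amount to something like the following sketch. The build type, its field names, and the "AC Power" label are assumptions for illustration; the success rule (exit status 0 and not canceled) follows the definition used in the prompts below.

```go
package main

import "fmt"

// build is a hypothetical, trimmed-down build event.
type build struct {
	ExitStatus      int
	Canceled        bool
	PowerSource     string // assumed values: "AC Power" or "Battery Power"
	DurationSeconds float64
}

// succeeded reports whether a build exited with status 0 and was not canceled.
func succeeded(b build) bool {
	return b.ExitStatus == 0 && !b.Canceled
}

// cleanse keeps only successful builds that ran on AC power.
func cleanse(builds []build) []build {
	var kept []build
	for _, b := range builds {
		if succeeded(b) && b.PowerSource == "AC Power" {
			kept = append(kept, b)
		}
	}
	return kept
}

func main() {
	// Made-up events, purely to exercise the filter.
	builds := []build{
		{ExitStatus: 0, PowerSource: "AC Power", DurationSeconds: 30},
		{ExitStatus: 1, PowerSource: "AC Power", DurationSeconds: 5},
		{ExitStatus: 0, Canceled: true, PowerSource: "AC Power", DurationSeconds: 2},
		{ExitStatus: 0, PowerSource: "Battery Power", DurationSeconds: 55},
	}
	fmt.Println(len(cleanse(builds)), "of", len(builds), "builds kept")
}
```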

Next, let's take a look at how many compilations are unsuccessful and how much compilation activity there is for each platform.

Ask GPT-4 a question:

Show me the build distribution by machine platform, where the platforms are sorted by M1 to M3, with Pro ahead of Max in the platform class.

Split the number of builds per platform by compilation success (exit status of 0 and not canceled) and failure.

I'm a visual learner, so I'd like to see the results visually: sort platforms from M1 to M3, with Pro before Max within each platform class.

Maybe it feels weird to talk to an AI chatbot like this, but it gives us what we want:

He can then go on to tell GPT-4:

From now on, exclude failed compilations from the dataset.

The bot confirms that the failed compilations have been culled, leaving a dataset of 12,525 successful builds to work with.

The assumption is that machines running on battery are throttled, so their build times are not a good representation of overall performance. Lawrence Jones wanted to confirm this by comparing the performance of the same platform on different power sources.

So, he asked:

Focusing only on successful builds on the Apple M1 Pro and Apple M2 Max platforms, can you visually compare the build durations for the different power sources?

The final result is as follows:

The results show that builds on AC power are usually faster.

He followed up with GPT-4:

In your opinion, does using AC power or battery power affect compilation performance?

The robot responded:

I will run a t-test on the successful builds dataset and report the p-value. A low p-value (typically less than 0.05) indicates a significant difference in build performance between AC and battery power.

...

The negative t-statistic indicates that the average build duration on AC power is lower than on battery power. In addition, the p-value is about 0.0014, which is well below the commonly used cut-off of 0.05, so we have enough evidence to reject the null hypothesis.

Given the results of this statistical test, we can conclude that using AC power or battery power does affect build performance across all platforms, with builds using AC power tending to be faster than those using battery power.

"This speaks for itself: from now on, we're only going to consider computers that have successfully compiled + run on AC power!"

Compare M1 and M2

Lawrence Jones then asked GPT-4:

Focusing only on the M1 Pro and M2 Max (as these are the platforms where we have the most data), can you plot two histograms showing the duration of successful builds, with extreme outliers removed?

Overall, the resulting graph makes it clear that the M2 Max comfortably outperforms the M1 Pro in terms of build speed.

However, it's also important to note that these MacBooks differ not just in chipset but also in memory. Lawrence Jones asked:

Can you tell me how build events are broken down by machine platform and memory configuration?

GPT-4: Here's the number of successful build events for each memory configuration within each platform:

Apple M1 Pro (16GB): 5,235 build events
Apple M2 Pro (16GB): 1,927 build events
Apple M2 Max (32GB): 3,842 build events
Apple M3 Pro (18GB): 321 build events
Apple M3 Pro (36GB): 899 build events
Apple M3 Max (36GB): 301 build events
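
A breakdown like the one above is a plain group-and-count. The sketch below shows the idea under stated assumptions: the machine type and countByConfig helper are hypothetical, and the three sample events are made up rather than taken from the dataset.

```go
package main

import "fmt"

// machine identifies a platform and memory configuration. Because it is a
// comparable struct, it can be used directly as a map key.
type machine struct {
	Platform string
	MemoryGB int
}

// countByConfig counts build events per platform and memory configuration.
// Each element of builds is the machine a build ran on.
func countByConfig(builds []machine) map[machine]int {
	counts := make(map[machine]int)
	for _, m := range builds {
		counts[m]++
	}
	return counts
}

func main() {
	// Made-up events, purely to exercise the function.
	builds := []machine{
		{"Apple M1 Pro", 16},
		{"Apple M1 Pro", 16},
		{"Apple M3 Pro", 36},
	}
	counts := countByConfig(builds)
	fmt.Println(counts[machine{"Apple M1 Pro", 16}], counts[machine{"Apple M3 Pro", 36}])
}
```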

According to Lawrence Jones, "The comparison between our M1 Pro and M2 Max is somewhat unfair because the M2 Max machines all have 32GB of RAM, which is twice as much as the M1 Pro (16GB)."

This is probably part of the reason the performance is so different (despite claims that Apple computers need very little RAM), and it is worth knowing when deciding which M3 spec, if any, to upgrade to.

Thankfully, there are 2k build events in the dataset from the M2 Pro with 16GB of memory, so Lawrence Jones went on to ask GPT-4:

Can you redraw the histogram comparison of the normalized build duration of the Apple M2 Pro 16GB and the Apple M2 Max 32GB?

The final result is as follows:

It's worth knowing that the M2 Pro and M2 Max are essentially the same chip; the main difference is that the Max has two additional "efficiency" cores. These cores are about 1/5 as powerful as the performance cores and contribute minimally to compiling Go programs, and therefore to our build durations.

This comparison is very useful: it means that 32GB has only a small impact, perhaps a slightly higher density of fast builds, but nothing very noticeable overall.

Compare different versions of M3

As it stands, the M2 Pro with 16GB of RAM seems to be the best value for money: it is almost twice as fast as the M1 and only slightly slower than the M2 Max (even though the Max has twice the memory).

Now we need to figure out whether the M3 is a noticeable improvement over the M2, and for that, Lawrence Jones also bought some new laptops:

M3 Pro, 12 cores (6 performance cores, 6 efficiency cores)

1 x 18GB RAM
1 x 36GB RAM

M3 Max, 14 cores (10 performance cores, 4 efficiency cores)

1 x 36GB

That's three laptops in total: the two Pros differ only in memory, while the Max was bought to measure how much of a difference the extra 4 P-cores make.

Starting with the Pro:

Show me the build duration histogram for the M3 Pro 18GB and M3 Pro 36GB.

It was previously concluded that memory has little impact on build performance, so it's no surprise that the charts look similar.

In fact, there is far less build data for the M3 than for the other platforms. For this reason, Lawrence Jones combined the results of the two M3 Pros:

Show me the build duration histogram of the M3 Pro 18GB and 36GB combined versus the M3 Max, removing very fast builds (<3 seconds).

Again, there's a lot less data here, but even so, the M3 Max doesn't look meaningfully better than the already fast M3 Pro.

At the very least, it's not enough to justify a 60% price increase over the base M3 Pro.
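
The histograms requested in this section come down to dropping very fast builds and counting the rest into fixed-width bins. The following is a minimal sketch of that idea, not what the code interpreter actually ran: the histogram helper, the 10-second bin width, and the sample durations are all assumptions.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// histogram counts durations into fixed-width bins, skipping any build
// faster than minSeconds. Bin k covers [k*binWidth, (k+1)*binWidth).
func histogram(durations []float64, minSeconds, binWidth float64) map[int]int {
	bins := make(map[int]int)
	for _, d := range durations {
		if d < minSeconds {
			continue // likely a fully cached build; excluded as in the prompt
		}
		bins[int(d/binWidth)]++
	}
	return bins
}

func main() {
	const binWidth = 10.0
	// Made-up durations in seconds, purely to exercise the function.
	durations := []float64{1.2, 4, 12, 14, 31, 35, 38}
	bins := histogram(durations, 3, binWidth)

	// Map iteration order is random, so sort the bin indexes before printing.
	keys := make([]int, 0, len(bins))
	for k := range bins {
		keys = append(keys, k)
	}
	sort.Ints(keys)
	for _, k := range keys {
		lo := float64(k) * binWidth
		fmt.Printf("%3.0f-%3.0fs %s\n", lo, lo+binWidth, strings.Repeat("#", bins[k]))
	}
}
```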

M1, M2, M3 are compared together

So far, Lawrence Jones has:

Established a baseline with the M1 machines.
Confirmed that the M2 machines perform comparably to one another, regardless of memory or additional cores.
Found that the M3 likewise performs similarly on the Pro and Max.

So, next, it's time to compare all of these platforms together...

Show me a build duration histogram comparing all platform and memory combinations, excluding builds that were completed in <3 seconds.

Repeat the above, but use one histogram per row and merge the M1, M2, and M3 platforms.

This clearly shows that:

Users with M1 laptops often have to wait nearly 2 minutes for a build to complete.
The M2 is a major upgrade over the M1.
The M3 is an incremental improvement over the M2.

Suggestion:

We should upgrade all M1 users to the base M3 Pro model.
M2 users do not need to upgrade.

Done!

One More Thing

Having compared machines with different amounts of RAM above, Lawrence Jones acknowledges that going from ~16GB to 32GB brought no meaningful improvement in the build durations measured.

Does memory size affect program development?

Developers will recognize that builds sometimes take a long time because of the linker, the step in which the compiled Go packages are joined together to form an executable binary.

To put it to the test, Lawrence Jones asked:

There is a build_stages column in our dataset. Can you add a new column to the dataset called "linker_time", derived from build_stages.link.duration_seconds?

GPT-4：

The linker_time column was successfully added to the dataset. The new column reflects the duration of the linker phase (link.duration_seconds) extracted from the JSON structure inside column build_stages, in seconds.
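
Deriving linker_time from the JSON in build_stages can be sketched as below. The nesting is inferred from the path build_stages.link.duration_seconds given in the prompt; the buildStages type, the linkerTime helper, and the sample JSON are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildStages mirrors only the part of the JSON we need. Other stages in
// the document are ignored by encoding/json.
type buildStages struct {
	Link struct {
		DurationSeconds float64 `json:"duration_seconds"`
	} `json:"link"`
}

// linkerTime extracts link.duration_seconds from a build_stages JSON value.
func linkerTime(raw string) (float64, error) {
	var s buildStages
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		return 0, err
	}
	return s.Link.DurationSeconds, nil
}

func main() {
	// Made-up sample value for one row of the build_stages column.
	raw := `{"compile":{"duration_seconds":30},"link":{"duration_seconds":12.5}}`
	seconds, err := linkerTime(raw)
	if err != nil {
		fmt.Println("could not parse build_stages:", err)
		return
	}
	fmt.Println("linker_time:", seconds) // prints: linker_time: 12.5
}
```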

Since the telemetry events contain the duration of the link and compile stages, Lawrence Jones can now ask GPT-4 to plot the new linker_time column:

Draw the linker_time histograms for each platform and memory combination in the order specified earlier, and color each histogram differently depending on whether the machine memory is 16-18GB or 32-36GB.

This looks completely different! This means that almost every computer with more than 30GB of RAM (M1, M2, or M3) will link in less than 20 seconds, and only machines with 18GB or less RAM will link in >20 seconds.

All of this means that, given the modest extra cost of the memory, it's worth buying 36GB of RAM, even if you think of it as just "preparing for the future".

CTO: M3 is already on its way!

Based on the tests above, the case is hard to dispute: the new space black M3 Pro laptop not only looks impressive, but also makes programmers more efficient and shortens delivery times.

incident.io CTO Pete quickly replied on the intranet: "Okay, okay, you win! That's the best argument I've ever heard for upgrading computers, and a truly impressive and enjoyable write-up. However, based on this data, I obviously expect productivity to increase by 10% in 2024."

Specifically, the company decided to upgrade its M1 machines to the base M3 Pro with 36GB of unified memory, while programmers using M2 machines will have to wait a little longer, because the M2 also performs well.

Original address: https://incident.io/blog/festive-macbooks
