Does using the final keyword in C++ improve performance?

Does using the final keyword in C++ improve performance? Many developers believe it does, but cannot offer any data to back that up. The author of this article therefore ran his own tests to find out whether the claim holds.

Original link: https://16bpp.net/blog/post/the-performance-impact-of-cpp-final-keyword/

Translator | Produced by Zheng Liyuan | Program Life (ID: coder_life)

If you choose to write code in C++, there must be a reason, and that reason is likely to be performance.

In a lot of articles about C++, we often see "performance tips and tricks" or "it's more efficient to do this" suggestions. Sometimes these suggestions will give you a reasonably detailed explanation, but more often than not, you'll find that there isn't any actual data to support these claims.

Recently I came across something curious: the final keyword. I'm a little ashamed to say I didn't know much about it before. Many blog posts claim that using final can improve performance, and that the gain is essentially free because it only requires a minor change.

However, after reading these articles, you notice something interesting: none of them provides any data. They basically rely on "trust me, bro".

In my experience, a claimed performance improvement means nothing unless it is backed by data, and even with data, you have to be able to reproduce it. So, as an engineer with a performance-focused C++ project on hand, I really wanted to verify the claim.

There is one project that I think is ideal for testing the final keyword: PSRayTracing (https://github.com/define-private-public/PSRayTracing).

A brief introduction to the project: it is a C++ ray tracer based on Peter Shirley's ray tracing book series. It is mostly meant for academic purposes, but it also incorporates my professional experience of writing C++. The goal of the project is to show readers how to (re)write C++ so that it is more performant, concise, and well structured, so it extends and improves on Dr. Shirley's original code. One of the important features of PSRayTracing is the ability to toggle code changes on and off via CMake, along with other options such as random seeding and multi-core rendering.

How is this done?

With the build system, I've added an extra option to CMakeLists.txt:

    option(WITH_FINAL "Use the `final` specifier on derived classes (faster?)" OFF)

    # ...

    if (WITH_FINAL)
        message(STATUS "Using `final` spicifer (faster?)")
        target_compile_definitions(PSRayTracing_StaticLibrary PUBLIC USE_FINAL)
    else()
        message(STATUS "Turned off use of `final` (slower?)")
    endif()

Then in C++, we can use a preprocessor to create a FINAL macro:

    #ifdef USE_FINAL
        #define FINAL final
    #else
        #define FINAL
    #endif

And, it can be easily added to any class you're interested in:

    $ rg FINAL
    RandomGenerator.hpp
    185:class RandomGenerator FINAL : public _GeneralizedRandomGenerator<std::uniform_real_distribution, rreal, RNG_ENGINE> {

    Materials/Lambertian.hpp
    8:class Lambertian FINAL : public IMaterial {

    ...

    Materials/SurfaceNormal.hpp
    7:class SurfaceNormal FINAL : public IMaterial {

    ...

    PDFs/HittablePDF.hpp
    7:class HittablePDF FINAL : public IPDF {

    ...

    Objects/Sphere.hpp
    19:class Sphere FINAL : public IHittable {

This allows us to start and stop using the final keyword at any time in the codebase.
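To make the effect of the macro concrete, here is a minimal sketch. The names `IShape`, `Circle`, `circle_is_final`, and `area_of` are hypothetical and not taken from PSRayTracing; only the `FINAL`/`USE_FINAL` pattern comes from the snippet above. The usual argument for final is that, once a class cannot be derived from, the compiler may resolve a virtual call made through that concrete type directly instead of going through the vtable. Whether a given compiler actually does so, and whether it helps, is exactly what the rest of this article measures.

```cpp
#include <type_traits>

// Same pattern as in the article: FINAL expands to `final` only when
// USE_FINAL is defined (for example by the WITH_FINAL CMake option).
#ifdef USE_FINAL
    #define FINAL final
#else
    #define FINAL
#endif

// Hypothetical interface, standing in for something like IHittable.
class IShape {
public:
    virtual ~IShape() = default;
    virtual double area() const = 0;
};

// Hypothetical implementation; it is `final` only if USE_FINAL is defined.
class Circle FINAL : public IShape {
public:
    explicit Circle(double radius) : radius_(radius) {}
    double area() const override { return 3.141592653589793 * radius_ * radius_; }

private:
    double radius_;
};

// Reports whether this build was compiled with the specifier switched on.
constexpr bool circle_is_final() { return std::is_final<Circle>::value; }

// The static type here is Circle. If Circle is final, the compiler is allowed
// to call Circle::area() directly rather than dispatching through the vtable.
double area_of(const Circle &c) { return c.area(); }
```

Compiling this once with `-DUSE_FINAL` and once without gives two builds that differ only in the specifier, which is the same kind of controlled toggle the CMake option provides.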

Of course, you might say this is overly cumbersome, and I would agree. But it is a great setup for controlled experiments: apply the final keyword throughout the code, then turn it on or off as the experiment requires.

The final keyword was applied to the implementations of almost every interface. In the architecture, there are interfaces such as IHittable, IMaterial, and ITexture. The final scene from the second book of Peter Shirley's ray tracing series contains a large number of virtual objects (over 10K).

Other scenes contain far fewer (maybe just 10).

Initial concerns

With PSRT, when testing something that might improve performance, I first try the default scene, book2::final_scene. With final enabled, the console reported the following:

    $ ./PSRayTracing -n 100 -j 2
    Scene: book2::final_scene
    ...
    Render took 58.587 seconds

Then, with the change reverted (final disabled):

    $ ./PSRayTracing -n 100 -j 2
    Scene: book2::final_scene
    ...
    Render took 57.53 seconds

I was a little confused: it was slower with final? After a few more runs, I kept seeing a very small performance drop. Those blog posts must have been fooling me...

But before giving up completely, I thought it best to pull out the verification test script and take a look. In previous revisions, this script was used mainly for fuzz testing PSRayTracing, and the repository already included a small set of known test cases.

The script took about 20 minutes to run, and this is where things started to get interesting: it reported that the run with final was slightly faster, at 11 minutes 29 seconds, versus 11 minutes 44 seconds without final.

That is only about a 2% difference in duration, but a difference of that size matters, so I decided to investigate further.
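As a quick check of that figure, the two script timings can be converted to seconds and compared. The helper name `percent_change` and the variable names are chosen here for illustration only.

```cpp
// Relative change of the run with `final` compared to the run without it,
// in percent. A negative value means the run with `final` took less time.
constexpr double percent_change(double without_final_s, double with_final_s) {
    return (with_final_s - without_final_s) / without_final_s * 100.0;
}

// Timings reported by the verification script, converted to seconds.
constexpr double without_final_s = 11 * 60 + 44;  // 11 min 44 s = 704 s
constexpr double with_final_s    = 11 * 60 + 29;  // 11 min 29 s = 689 s

// (689 - 704) / 704 * 100 is about -2.13, i.e. roughly 2% less time with `final`.
constexpr double script_change = percent_change(without_final_s, with_final_s);
```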

Large-scale testing

Dissatisfied with the above results, I created a "large test suite" that mainly raised some test parameters to strengthen the test intensity. On my dev machine, it takes 8 hours to run. Here are the details of the adjustments:

● Number of tests per scene: 10 → 30

● Image size: [320x240, 400x400, 852x480] → [720x1280, 720x720, 1280x720]

● Ray Depth: [10, 25, 50] → [20, 35, 50]

● Number of samples per pixel: [5, 10, 25] → [25, 50, 75]

I think this is more comprehensive: some test cases now take 10 seconds to complete, while others take 10 minutes. The small test suite completed about 350 test cases in just over 20 minutes, while this suite completed more than 1,150 test cases in just over 8 hours.

The performance of a C++ program depends heavily on the compiler (and the system), so to be more thorough, I ran the tests on three machines, three operating systems, and three different compilers. Each configuration was run once with final and once without. In total, the machines spent more than 125 hours computing.

The tested configurations are listed below:

● AMD Ryzen 9:

    Linux: GCC & Clang

    Windows: GCC & MSVC

● Apple M1 Mac: GCC & Clang

● Intel i7: Linux GCC

For example, one configuration is "AMD Ryzen 9, using Ubuntu Linux and GCC", and the other is "Apple M1 Mac, using macOS and Clang". Note that not all compiler versions are the same, and some are hard to obtain. Also, at the time of writing, Clang has released a new version. Here's a high-level summary of the test results:

The comparison leads to some interesting conclusions, but above all it tells us one thing: overall, using final does not guarantee faster code, and in some cases it is even slower.

While it may be interesting to compare the compilers in this test, I don't think it's fair to do so: it's only fair to compare "with final" with "without final". If you want to compare compilers (and different systems), you need a more comprehensive testing system.

Still, we observed some interesting results:

● Clang on x86_64 is slower.
● Windows performance is poor, and Microsoft's own compiler lags behind.
● Apple's chips are absolute powerhouses.

But each scene is different and contains a different number of objects marked final. It is therefore interesting to look at what percentage of test cases became faster or slower with final. Tabulating the data gives the following results:
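Percentages like these can be derived from paired timings along the following lines. This is only a sketch of the calculation: the `PairedTiming` struct and both function names are hypothetical, and the numbers in this article were produced in a Jupyter and Pandas notebook rather than with this code.

```cpp
#include <cstddef>
#include <vector>

// One test case rendered twice with identical parameters (hypothetical layout).
// Both timings are assumed to be positive.
struct PairedTiming {
    double without_final_s;  // render time with `final` turned off
    double with_final_s;     // render time with `final` turned on
};

// Percentage of test cases in which the `final` build was at least
// `threshold_percent` faster than the build without it.
double percent_cases_faster(const std::vector<PairedTiming> &cases, double threshold_percent) {
    if (cases.empty())
        return 0.0;

    std::size_t count = 0;
    for (const PairedTiming &t : cases) {
        // A positive speedup means less time was spent with `final`.
        const double speedup = (t.without_final_s - t.with_final_s) / t.without_final_s * 100.0;
        if (speedup >= threshold_percent)
            ++count;
    }
    return 100.0 * static_cast<double>(count) / static_cast<double>(cases.size());
}

// Same idea for slowdowns: test cases at least `threshold_percent` slower with `final`.
double percent_cases_slower(const std::vector<PairedTiming> &cases, double threshold_percent) {
    if (cases.empty())
        return 0.0;

    std::size_t count = 0;
    for (const PairedTiming &t : cases) {
        const double slowdown = (t.with_final_s - t.without_final_s) / t.without_final_s * 100.0;
        if (slowdown >= threshold_percent)
            ++count;
    }
    return 100.0 * static_cast<double>(count) / static_cast<double>(cases.size());
}
```

Looking at both directions matters: a configuration can have many cases that are a little faster and still have a worrying share that are much slower.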

For some C++ applications (high-frequency trading, for example), a 1% performance increase is very desirable. If more than 50% of our test cases achieve that, it seems reasonable to consider using final. But we also need to look at the other side: how much slower do things get, and how many test cases slow down?

Clang on x86_64 Linux is the clearest example: more than 90% of test cases are at least 5% slower with final! Remember that a 1% speedup is valuable for some applications; for those same applications, even a 1% slowdown is unacceptable. MSVC on Windows does not fare well either.

As mentioned above, this has a lot to do with the scene. Some scenes have only a handful of virtual objects, while others have a great many. Here is how much faster or slower each scene is, on average, with final:

I don't know much about Pandas and have had some issues with creating a multi-level indexed table (created from an array) and having it well styled and formatted. Therefore, I appended a configuration number to the end of each column name. Here's what each number means:

0 - GCC 13.2.0 AMD Ryzen 9 6900HX Ubuntu 23.10

1 - Clang 17.0.2 AMD Ryzen 9 6900HX Ubuntu 23.10

2 - MSVC 17 AMD Ryzen 9 6900HX Windows 11 Home (22631.3085)

3 - GCC 13.2.0 (w64devkit) AMD Ryzen 9 6900HX Windows 11 Home (22631.3085)

4 - Clang 15 M1 macOS 14.3 (23D56)

5 - GCC 13.2.0 (homebrew) M1 macOS 14.3 (23D56)

6 - GCC 12.3.0 i7-10750H Ubuntu 22.04.3

Here is where final shines: in certain configurations and certain scenes, performance can be up to 10% better! One example is book1::final_scene with GCC on AMD and Linux. But other scenes in the same configuration, such as fun::three_spheres, improve by only 0.5%.

However, just switching the compiler to Clang (still on AMD and Linux) degraded performance by 5% and 17%, respectively, for those two scenes! The situation with MSVC (on AMD) is more mixed: some scenes perform better with final, while others are hit hard.

Apple's M1 is a bit interesting: both the speedups and the slowdowns are relatively small, but GCC shows a significant benefit in both scenes.

In addition, whether performance goes up or down with final has almost nothing to do with the number of virtual objects in the scene.

I'm more concerned about Clang

PSRayTracing also runs on Android and iOS. Probably only a small percentage of the apps on these platforms are written in C++ for performance reasons, but Clang is the compiler used on both.

Unfortunately, I don't have a performance testing framework like I do on a desktop system, but I can do a simple "render the scene with the same parameters, one with final, one without final" test, because the application reports the process time.
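A with/without comparison of this kind only needs a wall-clock timer around the render call. The following is a minimal sketch, assuming the render is available as some callable; `seconds_taken` is a hypothetical helper, and PSRayTracing's own timing code may look different.

```cpp
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
// steady_clock is used because it is monotonic, so the measurement is not
// affected by adjustments to the system clock.
template <typename Work>
double seconds_taken(Work &&work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Building the app twice, once with `USE_FINAL` defined and once without, and timing the same render in each build gives the two numbers to compare. A single pair of runs is a much weaker measurement than the desktop test suite, so the results should be read with that in mind.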

Based on the data above, my hypothesis was that performance on both platforms would get worse with final, though it was unclear by how much. Here are the test results:

● iPhone 12: no noticeable difference. With or without final, rendering the same scene takes about 2 minutes and 36 seconds.
● Pixel 6 Pro: slower with final. Render times were 49 seconds with final and 46 seconds without. A 3-second difference may not look like much, but it amounts to a slowdown of about 6%, which is quite significant.

I don't know if this is an issue with Clang or LLVM. If it's the latter, this may have implications for other LLVM languages like Rust and Swift as well.

Plans for the future (and what I wish I could do)

Overall, I'm very happy with what this test turned up. If I could redo parts of it (or get funding to work on the project), I would like to do the following:

● Make it possible for each scene to report some metadata, such as the number of objects, materials, and so on.
● Get a better understanding of Jupyter and Pandas. I'm a C++ developer, not a data scientist, but I would like to learn how to process my measurements better and present them more clearly.
● Find a way to run automated tests on Android and iOS. Neither platform is easy to test at the moment, which is an obvious gap.
● Turn run_verfication_tests.py into more of an app (rather than a small script).
● The PNG output is starting to get bulky, and at one point I ran out of disk space. Lossless WebP might be a better render output format.
● Compare more Intel chips and use more compilers.

Conclusion

If you have just skipped to the end, here is the summary:

● There may be some benefit when using GCC.
● There is little impact on Apple chips.
● Don't use final with Clang or MSVC.
● It all depends on your configuration and platform; run your own tests to gauge whether it's worth it.

Personally, I don't plan to use the C++ final keyword as a performance optimization. The results in this article show that its effect is not consistent.