
"C is no longer a programming language"

Author | Aria Beingessner

Translator | Hirakawa

The idea in the title of this article is quite provocative. It comes from Aria Beingessner, a Swift and Rust expert, whose recent article "C is no longer a programming language" has sparked heated discussion in the technical community.

Beingessner and their friend phantomderp found that they were in high agreement on one aspect of the C language: being angry at C ABIs and trying to fix them. Although the reasons for their respective anger differ, what the author wants to express is this: "C has been elevated to a role of prestige and power, and its reign is so absolute and eternal that it has completely distorted the way we talk to each other." "Rust and Swift cannot simply speak their native and comfortable tongues; they must instead weirdly mimic C's skin and wrap themselves in it so that their flesh undulates in the same way."

Although the metaphor is sharp, it is not unfounded. Almost any program that does anything useful or interesting has to run on an operating system. That means it has to interact with that operating system, and a lot of operating systems are written in C. Therefore, the language must interact with C code, which means it must call C APIs. This is done through a foreign function interface (FFI). In other words, even if you have never written any C, you still have to handle C variables, match C data structures and layouts, and link to C functions by name and symbol. This applies not only to interactions with the operating system in any language, but also to calling one language from another.

Many people who expressed their fondness for C nonetheless acknowledged and agreed with the article's argument.

More precisely, the core of this article is not "C is no longer a programming language" but "C is not just a programming language". InfoQ has translated the original text for readers. The following is an excerpt from the original:

C is the lingua franca of programming. We all have to learn C, and so C is no longer just a programming language: it has become a protocol that every general-purpose programming language needs to speak.

This article explores only the elusive mess created by C's implementation-defined behavior; when a protocol everyone has to use is this ill-defined, the nightmare gets even bigger.

Foreign function interfaces

First, let's look at this from a technical point of view. You've finished designing your new language, Bappyscript, with first-class support for Bappy Paws/Hooves/Fins. It's a magical language that will revolutionize the way people program!

But now you need to do something useful with it, like accept user input, or output results, or anything at all observable. If you want programs written in your language to be good citizens that work well on the major operating systems, you need to interact with the operating system's interfaces. I've heard that everything on Linux is "just a file", so let's open a file on Linux.
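
And what Linux gives you, per the open(2) man page, is a C interface:

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    int open(const char *pathname, int flags);
    int open(const char *pathname, int flags, mode_t mode);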

Sorry, what? This is Bappyscript, not C. So where is the Bappyscript interface for Linux?

What do you mean Linux doesn't have a Bappyscript interface!? Well, it's a brand-new language, so surely they'd add one, right? At this point it dawns on you: it seems we have to use what they give us.

We'll need some kind of interface that lets our language call foreign functions. A foreign function interface, yes, FFI... And then you find out: what, Rust, you have a C FFI too? You too, Swift? Even Python?!

"C is no longer a programming language"

To talk to the major operating systems, every language had to learn to speak C. Then, when they needed to talk to each other, they all spoke C too.

C is now the lingua franca of programming. It is no longer just a programming language; it has become a protocol.

What does it take to interact with C?

Obviously, almost every language must learn to speak C. So what does "speaking C" mean? It means taking the C header files that describe an interface's types and functions and, in some way (as in the sketch after this list):

Match the layout of those types;

Do something with the linker to resolve the functions' symbols as pointers;

Call those functions with the appropriate ABI (e.g., put the arguments in the right registers).
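
For instance, for a hypothetical header like the following, "speaking C" means reproducing the struct's exact layout (including padding), resolving the symbol pair_sum, and invoking it with the platform's C calling convention:

    /* rad.h -- a hypothetical C interface an FFI must "speak" */
    #include <stdint.h>

    typedef struct {
        int32_t x;   /* callers must match this layout exactly, */
        int64_t y;   /* including 4 padding bytes after `x`     */
    } Pair;

    int64_t pair_sum(Pair p);  /* resolve this symbol, then call it
                                  with the platform's C ABI        */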

However, there are two problems here:

You can't really write a C parser;

C doesn't have an ABI, or even a well-defined type layout.

You can't really parse a C header file

Really, parsing the C language is basically impossible.

"But, wait! There are many tools to read C header files, such as rust-bindgen! ”

But that still doesn't get you out of it:

bindgen uses libclang to parse C and C++ header files. To modify how bindgen searches for libclang, see the clang-sys documentation. For more details on how bindgen uses libclang, see the bindgen user guide.

Anyone who spends a lot of time trying to parse C(++) header files syntactically quickly gives up and has an actual C(++) compiler do it instead. Keep in mind that it doesn't even make sense to parse C header files purely syntactically: you also need to resolve #includes, typedefs, and macros. So now you need to implement all of the platform's header-resolution logic and somehow figure out which #defines correspond to the environment you care about.
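
As a hypothetical illustration, even a tiny header can't be understood without running the preprocessor with exactly the right include paths and #defines:

    /* rad.h -- hypothetical: what is rad_handle_t? It depends. */
    #include <stdint.h>             /* must be located and resolved     */

    #ifdef RAD_BIG_HANDLES          /* set (or not) by the build system */
    typedef uint64_t rad_handle_t;
    #else
    typedef uint32_t rad_handle_t;
    #endif

    #define RAD_OK 0                /* a macro, so no symbol to link to */

    int rad_open(rad_handle_t *out);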

Take Swift as an extreme example. It has basically every advantage when it comes to C interoperability and resources.

The language was developed by Apple and effectively replaced Objective-C as the main language for defining and using system APIs on Apple platforms. In the process, I'd argue it went further than any other language in ABI stability and design.

It's also one of the best languages I've ever seen in terms of FFI support. It can natively import (Objective-)C(++) header files and generate a beautiful native Swift interface, with related types automatically "bridged" to equivalent Swift types (usually transparently, because the ABIs of those types are the same).

Swift's developers are also the builders and maintainers of Apple's Clang and LLVM projects; they are world-class experts in C and its derivatives. Doug Gregor is one of them, and here's what he thinks about C FFI:

"C is no longer a programming language"

See, even Swift doesn't want to do this. (See also Jordan Rose and John McCall's LLVM presentation on why Swift does it this way.)

So, if you don't want your compiler to invoke a C compiler to parse header files at build time, what do you do? You hand-translate them! int64_t? Write i64. long? Well... that's harder.

C doesn't actually have an ABI

Well, there's nothing to fuss about: for "portability" reasons, C's integer types are allowed to vary in size. We can bet against a weird CHAR_BIT, but we still don't know the size and alignment of long.
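
A quick check makes the problem concrete; this prints different answers on mainstream platforms (for example, sizeof(long) is 8 on x86_64 Linux but 4 on 64-bit Windows):

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        printf("CHAR_BIT:          %d\n", CHAR_BIT);
        printf("sizeof(long):      %zu\n", sizeof(long));
        printf("sizeof(long long): %zu\n", sizeof(long long));
        return 0;
    }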

"But wait! Each platform has standardized calling conventions and ABI! “

There are, and they usually define the layout of C's key primitives! (Also, some of them define more than just C calling conventions; see the AMD64 SysV ABI.)

But here's the tricky part: the ABI isn't defined by the architecture alone. Nor by the operating system. We have to work in terms of specific target triples, such as "x86_64-pc-windows-gnu" (not to be confused with "x86_64-pc-windows-msvc").

Well, how many such target triples are there?

There are 176 of them in total. I had planned to list them all for visual impact, but there were too many.

There are simply too many ABIs. And we haven't even touched on all the different calling conventions, like stdcall vs fastcall, or aapcs vs aapcs-vfp!

At the very least, all these ABIs and calling conventions are made available to everyone in the most machine-readable of formats: lengthy PDF files.

Well, at least for a given target triple, the major C compilers agree on the ABI! Of course, there are some weird C compilers out there, like clang and gcc...

These are the results of running my FFI abi-checker on x64 Ubuntu 20.04. It's a pretty important, well-behaved platform. What's being tested here is a fairly boring case: some integer arguments passed by value between two static libraries, one compiled by clang and one by gcc... And it failed!

Even for __int128 on x64 Linux, clang and gcc fail to agree on the ABI. The type is a gcc extension, but it is clearly defined and described by the AMD64 SysV ABI, in a nice PDF file.
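
A minimal sketch of the flavor of failing case (abi-checker generates these systematically; this hand-written pair is only illustrative). With five int arguments used up first, the __int128 no longer fits neatly in the remaining registers, which is exactly the kind of boundary where the two compilers diverged:

    /* callee.c -- compile with gcc:  gcc -c callee.c */
    #include <stdio.h>

    void check(int a, int b, int c, int d, int e, __int128 big) {
        printf("%llx %llx\n", (unsigned long long)(big >> 64),
                              (unsigned long long)big);
    }

    /* caller.c -- compile with clang and link: clang caller.c callee.o */
    void check(int a, int b, int c, int d, int e, __int128 big);

    int main(void) {
        __int128 big = ((__int128)0x0123456789abcdefULL << 64)
                     | 0xfedcba9876543210ULL;
        check(1, 2, 3, 4, 5, big);  /* does the callee see this value? */
        return 0;
    }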

I wrote this tool to check for bugs in rustc; I didn't expect to find the two major C compilers disagreeing on one of the most important and most familiar ABIs!

ABIs are a lie.

Trying to tame C

So, semantically parsing C header files is a terrible nightmare that only the platform's own C compiler can really do; and even if you let a C compiler tell you the types and how to interpret them, you still can't actually know the size, alignment, or calling convention of everything.

How do you interoperate with that pile of stuff?

Your first option is to surrender completely and soul-bind your language to C, in one of the following ways:

Write your compiler/runtime in C(++), so it speaks C anyway.

Have your "codegen" emit C(++) directly, so users need a C compiler.

Build your compiler on top of an established major C compiler (gcc or clang).

But that only gets you so far, because unless your language literally exposes unsigned long, you're inheriting C's portability mess.

So we come to the second option: lying, cheating, and stealing.

If all of this is an unavoidable disaster anyway, you might as well just start hand-translating the type and interface definitions into your own language. That's basically what we do every day in Rust. Yes, people use tools like rust-bindgen to automate this, but much of the time you still need to check or manually tweak those definitions, because life is too short to make someone's weirdly customized C build system portable.

Hey Rust, what's intmax_t on x64 Linux?

Hey Nim, what's long long on x64 Linux?

A lot of code has cut C out of the loop entirely and begun hard-coding the definitions of core types. After all, they're obviously just part of the platform's ABI! What are they going to do, change the size of intmax_t!? That would plainly be an ABI-breaking change.

Oh, by the way, what's that thing phantomderp is working on?

Let's talk about why we can't change intmax_t: if it went from long long (a 64-bit integer) to __int128_t (a 128-bit integer), some binaries would end up using the wrong calling/return convention. But is there a way, if code opts in, that we could upgrade the function calls in newly built applications while leaving old ones as they are? Let's write some code to test how transparent aliases can help the ABI.
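
The rough idea, as a hypothetical sketch using GNU C asm labels (phantomderp's actual proposal is a proper language feature, not this hack; the names here are made up):

    #include <stdint.h>

    /* The old 64-bit entry point must exist forever for old binaries. */
    int64_t  rad_abs_v1(int64_t x);

    /* A new 128-bit entry point is added alongside it. */
    __int128 rad_abs_v2(__int128 x);

    /* Code that opts in resolves rad_abs to the v2 symbol;
       everything else keeps resolving to v1. */
    #ifdef NEW_INTMAX_ABI
    __int128 rad_abs(__int128 x) __asm__("rad_abs_v2");
    #else
    int64_t  rad_abs(int64_t x)  __asm__("rad_abs_v1");
    #endif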

Yes, their article is really well written and addresses some very important practical problems, but... how do programming languages handle this change? How do I specify which version of intmax_t I'm interoperating with? If a C header file mentions intmax_t, which definition is it using?

The main mechanism we have for talking about platforms with different ABIs is the target triple. Do you know what a target triple is? x86_64-unknown-linux-gnu. Do you know what it covers? Basically every major desktop/server Linux distribution of the past 20 years. Ostensibly, you can compile against that target and get a binary that "works" on all of them. But that quietly stops being true if, for example, some of those platforms start compiling programs with an intmax_t larger than int64_t.

Would any platform that tried to make this change become a new target triple? x86_64-unknown-linux-gnu2? And if everything compiled for x86_64-unknown-linux-gnu can still run on it, is that even enough?

Modifying signatures without breaking the ABI

"So what, won't C never improve again?"

Say no, too, because of its poor design.

Honestly, making ABI-compatible changes is an art form. Part of the work is preparation: if you have prepared, it's much easier to make changes that don't break the ABI.

As phantomderp's article points out, projects like glibc (the g in x86_64-unknown-linux-gnu) have long been aware of this, and use mechanisms like symbol versioning to update signatures and APIs while keeping the old versions around for anything compiled against them.

So, if you have a function int32_t my_rad_symbol(int32_t) and you tell the compiler to export it as my_rad_symbol_v1, anyone who compiles against your header writes my_rad_symbol in their code but links against my_rad_symbol_v1.

Then, when you decide the function should really take int64_t, you can export int64_t my_rad_symbol(int64_t) as my_rad_symbol_v2 while still keeping the old my_rad_symbol_v1 definition around. Anyone compiling against the new header uses the v2 symbol; anything built against the old one keeps using v1!
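
With GNU symbol versioning, that looks roughly like this (a sketch using the .symver directive; a matching version script defining the version nodes is also needed at link time, and all names here are hypothetical):

    #include <stdint.h>

    /* The old definition is kept in the library forever... */
    int32_t my_rad_symbol_v1(int32_t x) { return x + 1; }
    __asm__(".symver my_rad_symbol_v1, my_rad_symbol@MYLIB_1.0");

    /* ...while `@@` marks v2 as the default for newly linked code. */
    int64_t my_rad_symbol_v2(int64_t x) { return x + 1; }
    __asm__(".symver my_rad_symbol_v2, my_rad_symbol@@MYLIB_2.0");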

But there's still a compatibility issue: anything compiled against the new header can't link against the old version of the library! The v1 library simply doesn't contain the v2 symbol. So if you want the shiny new feature, you have to accept that it won't work on older systems.

That's not a huge problem, but it makes platform vendors sad, because nobody gets to use the thing they spent so long building right away. You ship a shiny new feature and then wait years, until people consider it widespread/mature enough to depend on and to drop support for older platforms (or are willing to implement dynamic detection and fallbacks).

If you want people to be able to upgrade immediately, it becomes a forward-compatibility problem. That requires old versions to somehow cope with features they have no concept of.

Modifying types without breaking the ABI

Well, besides function signatures, what else can we modify? Can we modify a type's layout?

Yes! And no! It depends on how the type is exposed.

One of the genuinely wonderful features of C is that it lets you distinguish types with known layouts from types with unknown layouts. If a type is only forward-declared in a C header, user code that interacts with it cannot know its layout, and must always handle it opaquely through a pointer.

So you can expose an API like MyRadType* make_val() and use_val(MyRadType*), use the same symbol-versioning trick to expose make_val_v1 and use_val_v1, and, any time you want to change the layout, bump the version on everything that interacts with that type. Likewise, you keep MyRadTypeV1, MyRadTypeV2, and some typedefs around to make sure people use the "right" type.
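
A sketch of the pattern (hypothetical names):

    /* rad.h -- MyRadType is only forward-declared, so its layout is
       unknown to users; they can only ever hold pointers to it. */
    typedef struct MyRadType MyRadType;

    MyRadType *make_val(void);      /* exported as make_val_v1 */
    void use_val(MyRadType *val);   /* exported as use_val_v1  */

Because callers never learn sizeof(MyRadType), version 2 of the library is free to rearrange or grow the struct, as long as the versioned functions agree among themselves.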

Wonderful, we can change a type's layout between versions! Right? Well, most of the time.

If multiple things are built on top of your library and pass the type between each other without it being transparent to them, something bad happens:

lib1: exposes an API that takes a MyRadType* and calls use_val;

lib2: calls make_val and passes the result to lib1.

If lib1 and lib2 were compiled against different versions of your library, a value from make_val_v1 gets passed to use_val_v2! At that point, you have two options for dealing with the problem:

Sadly forbid it, and warn those who try;

Design MyRadType in a forward-compatible way, so that mixing versions is fine.

Common techniques for achieving forward compatibility are (see the sketch after this list):

Reserve unused fields for future versions;

Give every version of MyRadType a common prefix that lets you "check" which version a value is;

Add size fields so that older versions can "skip over" the new parts.
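
A sketch of these techniques combined, on a hypothetical type:

    #include <stdint.h>

    /* Every version starts with the same prefix, so any reader can
       check `version`, and use `size` to skip fields it doesn't know. */
    typedef struct {
        uint32_t size;      /* sizeof() the struct the writer used */
        uint32_t version;   /* common prefix across all versions   */
        uint64_t value;
        uint64_t reserved;  /* unused, claimed for future versions */
    } MyRadTypeV2;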

Case study: MINIDUMP_HANDLE_DATA

Microsoft is truly a master of forward compatibility; they care so much that they even keep layouts compatible across architectures. An example I came across recently is MINIDUMP_HANDLE_DATA_STREAM in Minidumpapiset.h.

This API describes a versioned list of values. The list begins with this type:
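
As it appears in Minidumpapiset.h:

    typedef struct _MINIDUMP_HANDLE_DATA_STREAM {
        ULONG32 SizeOfHeader;
        ULONG32 SizeOfDescriptor;
        ULONG32 NumberOfDescriptors;
        ULONG32 Reserved;
    } MINIDUMP_HANDLE_DATA_STREAM, *PMINIDUMP_HANDLE_DATA_STREAM;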

where:

SizeOfHeader is the size of MINIDUMP_HANDLE_DATA_STREAM itself. If more fields need to be added at the end, that's fine: older consumers can use this value to detect the "version" of the header and skip any fields they don't know about.

SizeOfDescriptor is the size of each element in the array. Again, it tells you which "version" of the element you have, so you can skip fields you don't recognize.

NumberOfDescriptors is the array length.

Reserved is a reserved field. (Minidumpapiset.h is very strict about never relying on padding bytes, since the values of padding bytes are unspecified and a minidump is a serialized binary file format. I believe they added this field so that the struct's size is a multiple of 8, removing any question of whether the array elements need padding after the header. That's what taking compatibility seriously looks like!)

In fact, Microsoft has had occasion to use this versioning scheme: they define two versions of the array element:
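
Both, as they appear in Minidumpapiset.h:

    typedef struct _MINIDUMP_HANDLE_DESCRIPTOR {
        ULONG64 Handle;
        RVA     TypeNameRva;
        RVA     ObjectNameRva;
        ULONG32 Attributes;
        ULONG32 GrantedAccess;
        ULONG32 HandleCount;
        ULONG32 PointerCount;
    } MINIDUMP_HANDLE_DESCRIPTOR, *PMINIDUMP_HANDLE_DESCRIPTOR;

    typedef struct _MINIDUMP_HANDLE_DESCRIPTOR_2 {
        ULONG64 Handle;
        RVA     TypeNameRva;
        RVA     ObjectNameRva;
        ULONG32 Attributes;
        ULONG32 GrantedAccess;
        ULONG32 HandleCount;
        ULONG32 PointerCount;
        RVA     ObjectInfoRva;   /* added in version 2 */
        ULONG32 Reserved0;       /* added in version 2 */
    } MINIDUMP_HANDLE_DESCRIPTOR_2, *PMINIDUMP_HANDLE_DESCRIPTOR_2;

    /* the "latest" version gets the convenience typedef */
    typedef MINIDUMP_HANDLE_DESCRIPTOR_2 MINIDUMP_HANDLE_DESCRIPTOR_N;
    typedef MINIDUMP_HANDLE_DESCRIPTOR_2 *PMINIDUMP_HANDLE_DESCRIPTOR_N;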

There are a few interesting things in the actual details of these structs:

The only modification was adding fields at the end;

The "latest" version gets the convenience typedef;

They are mindful of potential padding (RVA is a ULONG32 type).

When it comes to forward compatibility, Microsoft is an absolutely unstoppable beast. They are so careful about padding that they even use the same layout for 32-bit and 64-bit! (This genuinely matters, because you want a minidump processor built for one architecture to be able to handle minidumps from every architecture.)

Well, at least it's really robust, as long as you follow their rules, handle things by reference, and honor the size fields.

But at least you can work with it. At some point you simply have to say "you're using it wrong". Microsoft probably won't ever say that; they'll just do something terrible instead.

Case study: jmp_buf

I'm not very familiar with this case, but while researching ABI-breaking changes in glibc's history I came across this great article on LWN: "The glibc s390 ABI break". I'll take its account as accurate.

It turns out glibc did break a type's ABI, at least on s390. By the article's account, it caused chaos.

Specifically, they changed the layout of the state-saving type used by setjmp/longjmp (i.e., jmp_buf). Mind you, they're not complete fools: they knew it was an ABI-breaking change, so they responsibly did the symbol-versioning dance.

However, jmp_buf is not an opaque type. Things store instances of it inline, such as Perl's runtime. Needless to say, this relatively obscure type had wormed its way into plenty of binaries, and the final conclusion was that everything in Debian needed to be recompiled.

The article even discusses the possibility of bumping libc's version to deal with the situation:

In a mixed-ABI environment like Debian's, an soname bump causes two libcs to be loaded and compete for the same symbol namespace, with resolution (and thus ABI selection) determined by ELF interposition and scoping rules. It's a nightmare. It's arguably a worse solution than telling everyone to rebuild and move on.

(The post is quite good; I highly recommend reading it.)

Can you really modify intmax_t?

In my opinion, not really. Like jmp_buf, it is not an opaque type: it gets inlined into countless random structs, is assumed to have a specific representation by countless other languages and compilers, and probably appears in plenty of public interfaces that aren't under the control of libc, Linux, or even the distribution maintainers.
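
For instance, a hypothetical public header that bakes intmax_t inline:

    #include <stdint.h>

    /* If intmax_t grows from 8 to 16 bytes, the offset of `flags` and
       the size of the whole struct change: every binary that embedded
       this struct is now silently wrong. */
    struct rad_stats {
        intmax_t biggest_seen;
        uint32_t flags;
    };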

Of course, libc could apply the symbol-versioning tricks properly so that its own APIs adapt to the new definition, but changing the size of a basic data type like intmax_t would wreak havoc across the wider platform ecosystem.

I'd be happy for someone to prove me wrong, but as far as I can tell, making such a change would require a new target triple, and would not allow any binaries/libraries built for the old ABI to run on the new one. You can certainly do that, but I don't envy any distribution that tries.

Even then, there's the x64 int problem: it's such a fundamental type, and its size has gone unchanged for so long, that countless applications have surely made imperceptible assumptions about it. That's why int is 32-bit on x64 even though it "should" be 64-bit: int had been 32-bit for so long that migrating software to the new size was completely hopeless, even with a brand-new architecture and target triple.

I hope my view is wrong too. If C were just a standalone programming language, we could move forward without worry. But it isn't: it's a protocol, and a bad one, and we have to use it.

Unfortunately, C: you conquered the world, but perhaps you no longer have the beauty you once had.
