laitimes

Development tips from Google: All SQL and code, there is no need to hide it

Author | Galen B

Translate | Nuclear Coke

Planning | Tina

At least from an engineering perspective, your project is by no means as "invisible" as you might think.

Google is unique in its code management, developing on a "backbone" and putting more than 90 percent of its code in a single repository called Piper, shared by tens of thousands of software developers from dozens of offices around the world. For those open source projects that require external collaboration, the code is placed in the version management software Git, mainly the Android project and the Chrome project.

The entire repository has a tree structure, each team has its own directory, and the directory path is the namespace of the code. Each directory has a owner who is responsible for approving file changes for that directory.

This approach has been running at Google for more than 20 years. In 2015, Google's codebase contained about a billion files and a history of about 35 million commits. The code is generally submitted to the header of the trunk, ensuring that all users see the latest version of the same code, supporting file-level permission control, and 99% of the code is visible to all users. Only a small number of important profiles and confidential business-critical matters have access restrictions. All reads and writes have logs, and the administrator can find out who has read the file.

The benefits of Google's approach are obvious compared to the management of various access rights to code: anyone can browse and use the code company-wide, greatly facilitating code sharing and reuse; having a uniform version and path, there is no problem of not being able to find the latest version of the file; every code change, it is easy to undo or test its impact with pre-submission...

Even for the management of SQL, Google has followed these principles to some extent. The author of this article is a data engineer who worked as a supplier to Google for two years, during which time he found that Google's data engineers treat SQL very similarly to software engineers' attitudes toward code. He believes this attitude is very important, and that businesses large and small deserve to adopt this mindset in their data strategy. We've translated the author's article, and through his article, we'll learn how Google's sql-as-code approach can be helpful and what it can teach smaller organizations.

1

SQL is just a query language, why should Google think of it as code?

Like object-oriented code, SQL is time-consuming, difficult to debug, difficult to understand (not conducive to version control), and highly emphasized maintainability. In the case of Google's data engineering department itself, SQL can be used to create data pipelines, which in turn place special emphasis on being easy to debug and easy to repair. With these factors in mind, the higher the code concentration, the smoother the implementation of the data strategy.

So thinking of SQL as code means that we can introduce code management tools into the process, easily understand who is responsible for a particular change or maintain the SQL script, and constantly track the same author's adjustments in other related queries. This allows us to quickly find failed commits, revert changes, or apply necessary fixes. After you submit the SQL code, this part of the code can be deployed into the development environment immediately. The next step is to run the development pipeline to instantly identify and fix faults.

In addition, we regularly release test environments to ensure that the code runs through the production version through upgrades. After successful testing, the SQL code submitted between the old and new production versions can be officially enabled, keeping the chance of problems to a minimum.

What should small companies learn?

Examine your own software engineering culture, benchmark Google's more mature cultural philosophy, and experience and try out the tools of their choice, including Git, IDEs, and more. We should explicitly index all the code and take the time to convert the dedicated script into a global script, eliminating all unnecessary elements such as views, materialized views, stored procedures, and so on.

2

How does Google manage SQL code?

Google puts almost all of its code in a unified, centralized repository of code. So when changes need to be made to SQL, or when new scripts need to be created, Google's engineers create a list of changes — similar in nature to PR. After that, the change process needs to be tested and approved by other engineers. Approval passes smoothly before authors can commit code changes to the code repository.

While this form of change control is fairly common in enterprises, a major feature of Google is its high emphasis on the importance of code formatting. I myself have paid little attention to code formatting in the past, but my personal experience has made me realize that high-quality code formats can greatly reduce the difficulty of understanding and debugging, and also help reduce the time spent by other authors on code modification. Google paid particular attention to code formats, and even set up an automated mechanism to directly reject code that did not meet coding standards.

Choose a code repository and stick to it as your follow-up. Ideally, this repository should be shared across engineering teams, at least with all the SQL code in place. It should cover all data capabilities, including data engineering, analytics, business intelligence, and more.

In addition, be sure to standardize the code format. At present, there are many open source code format standardization tools on the market, which are not difficult to use, and can greatly improve the readability and maintainability of the code. Take the time to comb through the original code using existing formatting tools or internal original research tools, and ensure that all subsequent submissions must conform to the format standards. In a matter of days and weeks at most, data engineers can adapt to this new standard and take the SQL code within the company to the next level in terms of readability, writing quality, and comprehensibility.

3

Version control + multiple test environments = time savings

Code changes are everywhere, and without versioning to constrain it, it will be difficult for us to redress unexpected errors with rollbacks. In the unlikely event that the committed code breaks the pipeline, produces unexpected values, or doesn't work, we have to use version control to revert to the last normal state.

Google's code integration principles follow this line of thinking. Even if the changes are outrageous (really, sometimes the code submitted is so outrageous), a well-structured test environment can be digested without disrupting normal business. This way, the impact of SQL changes on the development environment is directly apparent, helping engineers quickly identify failures.

Of course, there are very few pieces of code that don't cause immediate failure in a development environment, but problems are frequent in production. The factors that contribute to this situation are varied, so we need to introduce a separate pre-production environment into the test system.

Google uses environment variables to manage a variety of test environments, which can be easily injected into table names through the interpretation layer.

At a minimum, you need to set up a development environment while maximizing the scope of the data infrastructure involved in code testing to minimize the chance of failure. Free and open source tools such as DBT significantly reduce the difficulty of testing through the abstraction layer, in which all tables have two versions, one for development and one for production. This way, in our daily, weekly, or even monthly release schedules, we can safely elevate all code submitted since the last release plan directly to production.

4

Extensive code access capabilities

As the saying goes, "success is also xiao he, failure is also xiao he", Google's practice of stuffing almost all code into a single code repository makes it difficult to distinguish who owns a product and who is using it. For example, without extensive access to this centralized code base, it can be difficult for software engineers to understand the downstream impact of changes when updating production-grade applications. With broad access, they can easily search for scripts, query operations, and other applications that depend on their current application and notify the appropriate engineers to collaborate on changes.

I know that many companies always want to isolate different development links in a way that code is confidential. Yes, it's true that some highly sensitive project code bases shouldn't be opened up, but there aren't many of them and they don't have much connection to other projects. Since a giant enterprise like Google is willing to give full play to the power of trust when building code architectures, other small companies really don't have to always hide.

Introduce trust and communication mechanisms in the structural design of code bases and repositories. At least from an engineering perspective, your project is by no means as "invisible" as you might think. After all, if you can't even trust the engineers on your team, can the business still run? In summary, take the initiative to bridge the boundaries between software engineering and data engineering in your business processes and encourage collaboration. Only in this way can we preemptively address the negative downstream impacts of the change before the code lands in production.

Read on