SQL Best Practices

In this post, I'm going to share with you some of the best practices for formatting SQL that I've learned and used over the past 20 years. Many of these tips I collected out of frustration in inheriting code, needing updates, chasing bugs, and reviewing code for others. Also, when it comes to training new employees, I've found it much harder for people to understand domain and code when dealing with poorly written SQL.

In each section of this article, I focus on the practice of one area. I'm going to give you an example of bad practices and best practices, and why I wrote code in a particular way. When you're reading, you might think, "This is not a high-performance code," and sometimes that can be the case because I'm trying to hone at a particular point to avoid unnecessary potential confusion. Also, sometimes I'd rather choose a slightly less performing code (depending on how often it's used, what the code's purpose is, and what performance impact it takes) if it makes the code easier to read and maintain.

I suspect that some people reading this might say, "My code is doing very well, so I don't care." I understand what it does. "To that (because I've heard this comment many times over the years), I want to dispute that claim. Discovering and reading sloppy "bad practice" code takes significantly more time than using clean code. These practices will also help reduce errors, and your code will be appreciated by everyone who reads it.

When we talk about programming languages, you can almost write statements the way you see fit, and the code will execute. Obviously, there are some exceptions to this (like indentation in Python), but that's another issue. This means that when I'm talking about best practices for SQL, one can make the argument that it's just a matter of opinion and writing style.

How you write it is important. For example, think of a book. Of course, you can write an entire book without paragraphs, line breaks, standard spacing between sentences, and so on. Yes, it's still a book, and yes, readers may understand it. However, without paragraphs, how easy would it be for you to find a particular chapter? If you were asked to edit something, how confident would you be to know that you were indeed in the right chapter? For these reasons, I've outlined some best practices here.

Note that while all of these practices can be considered "opinions," they are the mistakes I've made in 20 years, the challenges I've faced in effectively reading, editing, and understanding code, and the practices I've adopted after watching my team members face the same challenges. I will try to tell you about these practices and the logic behind using them. I'll ask you to consider these practices and see what works for you and what works for your team.

format

The first thing I want to talk about is formatting. The code should be well formatted and visually appealing, which makes it very easy to read. When debugging, troubleshooting, and modifying your code, we'll be rewarded with properly formatted code.

When it comes to formatting, there are many things to consider, such as intent, alignment, comma position, and text case. If you can read this, you've probably noticed how my code is aligned, and most of the projects are on one line. This has a big impact on readability.

Here's an example of code that's already written and rather difficult to read. In this example, you'll notice that there are a lot of things that go against best practices, like.

The number of items per line
Alignment issues
Trailing comma
Poor or lack of aliasing
Missing comments
Group by number instead of name
The position of the aggregate function in the select statement
Multiple hard-to-detect errors

In this post, I'll discuss these practices and more to help you write clean, error-free code that makes you proud.

Bad practices - many problems

Best practices

Take a look at the code below and compare it to the code above. Which one is easier to read? Which one provides the best context about the intent of the code and the meaning of some condition values? Which version offers a cleaner user interface that allows you to quickly spot bugs or avoid them altogether? I hope you have agreed that the following code helps to overcome all the challenges found in the above code.

I hear people retort, "Well, you have to write more lines of code. "It's irrelevant. Computers don't care about this problem, and can implement back lines and spaces or tabs with just a few keystrokes.

In the following sections, I will discuss these issues one by one and provide examples of good and bad practices.

Unified caliber

If you look at the "best practices" code above, notice that everything is left-aligned. All the commas, spaces, and indentations make the code very easy to read.

One item per line

My general rule is one item per line. This can be an element in your select statement, or a condition in a join statement, or a case statement. Again, take a look at the code above about formatting and see the difference in readability when writing one item per line. The key here is to be consistent. I've seen code that writes one project per line, but every once in a while there's a *join* clause where there's an and an or statement on the same line. This can be very frustrating when reading and debugging, as it's easy to overlook adding conditions because it's written on the same line.

Bad practice - set multiple case conditions on a single line

Here, we see a case statement all on one line. This is a bad practice because it makes the code difficult to read and difficult to quickly discover all the conditions being evaluated. Also, if not impossible, it can be very challenging to comment out the code correctly. I know that in this example, the 'main_reporting' is not descriptive and seems inconsistent with other values, but sometimes you are told to output values in this way and not logically align with other values.

Best practices - Multiple case conditions on multiple lines

Here we see a case statement written on multiple lines with comments to help provide clarity.

Comment the code

Please comment your code. I'll write another article about code comments in the future, but comments are important. I feel like I'm going to see a post every day on LinkedIn or somewhere else where someone says, "You don't need code comments." Code is just a comment. What does it matter? Don't you know how to read the code? Seriously, I've been listening to this for years. But the reality is this. Although code is a language, if proficient in this language, the reader can understand what the code is doing. However, the code never tells the reader why anyone wants the code to work this way. As for why someone wants to code it somehow, the possibilities are endless. Sometimes you can code around a bug in the back-end data, or maybe there's business logic that determines how the code should work.

While it's true that you can read the code and potentially consult the documentation for some tables, it's a lot more laborious than typing a few characters. Here are some examples of good and bad annotation practices.

Bad Comments -

Take a look at the code below. As we can see, the code only wants to return u.id>1000 results. In this very simple example, this is obvious. But the more important question is why would anyone want to do this?

Maybe they were test users before u.id=1000. Or, the code is filtering out all users from Michigan because for some reason it is assumed that all users less than u.id 1000 are from Michigan. This may sound like a terrible idea, there is actually code that will be written like this, but it all executes the same. The point here is that as new users, we don't know, and it's possible that in six months you won't know either.

Better comments - inline comments

Here we have an in-line review that tells us more about why we added the condition of u.id>1000. Obviously, we have some test users who should be cleared from the result set.

Bad comments - no block comments

Take a look at the code below. As we can see, this query will return users who are considered non-test users. The inline comments help us understand that our desire is to clear these test users from the result set. But we had to read a few lines of code. You might say, well, it's not worth setting the comment block at the top. It's just 8 lines of code and what's going on is obvious. Whether simple or not, the reader doesn't know the reason behind this code. But what if the code isn't that simple? You will definitely appreciate some reviews. For these reasons, and in order to have a better plan before you start writing code (more on that later).

Better annotations - block annotations

Here's the same code we just saw, and now we're going to tell the user why we want it

Want to run this code and things to be aware of.

Common Table Expressions (CTE)

A generic table expression, or CTE, is a method of creating an in-memory table of query results. This table can be used in the rest of your SQL script. The benefit of using CTE is that you can reduce code duplication, make your code more readable, and improve your ability to QA check the results.

Also notice the really good big guy reviews.

Code instances that do not use CTE

In the following code, we can see that there are two subqueries that are returning results. The two subqueries are then concatenated together to produce the final result set. Although this code will execute, there are several issues.

1. It's really hard to QA on a subquery and check the results. For example, what if we wanted to count the number of users with multiple records on the default screen? We can't simply execute some sql on a subquery. We have to copy/paste the subquery and then modify it to execute the problem. It would be better if we could avoid changing the code during the QA process.

2. If we need to take advantage of this user subquery elsewhere in the code, we have to rewrite or copy/paste this piece of code to the rest of our script. This will not be a DRY (don't repeat yourself) approach.

and expose more potential errors. How to say it? Let's say you used user subqueries in 5 places in your script. Also assume that the code you're working on isn't easy to read because it doesn't follow best practices. If you're asked to update your code and add another condition to clear up more test users, there's a good chance you'll miss out on adding that condition

At least one of the 5 uses to subqueries.

3. There are more cycles on the database. Each time a subquery is executed, a table scan is performed to return the results. Since our user subquery contains wildcard conditions, the database will have quite a bit of work to do. It's much cheaper to execute a subquery once, store it in memory, and then reuse the result set in your code as needed (CPU cycles and dollars, if you're using a cloud database).

4. Reading the entire code block and understanding what is being executed and why is more complex. While it's possible to scroll through the code, it can be difficult to easily understand what's going on. In general, if you have to scroll your code vertically on the monitor, your code is too long and should be refactored into smaller components.

Use the CTE example

Below we see an example of using CTE. While CTE can do a good job of helping to overcome some of the challenges we pointed out earlier, CTE usually doesn't stay in memory after displaying the final result set.

For example, if you were to run this entire block of code, it would execute. However, if you want to select all the results from the user's CTE after a few minutes, the data cannot be queried. To solve this problem, you can use volatile tables or temporary memory tables, which are usually present while your session (database connection) remains active. More on this topic will be covered later.

Use "SELECT *"

You should never write queries with "select *". I think the only exception to this rule is that if you try to check a table, in which case you should always limit the number of results returned. Writing queries this way is a bad idea for many reasons.

1. Database performance. It's more expensive to return unwanted columns than to just query for columns you care about.

2. Debugging challenges. Assuming you're using the CTE described in the previous section, it's very challenging to trace the source of certain attributes.

3. The table will change. Even if you really need to select all the columns, there's no guarantee that your table won't change over time. As the table changes, you'll be querying for new data that wasn't intentional, which can break code elsewhere, cause confusion, or affect the performance and cost of the database.

Bad Practice - Use SELECT*

Best practice - Select only the elements you need

alias

Aliases are very important to help the reader understand where the elements are located and the table being used. When aliases are not used or poor naming conventions are used, complexity increases and code reading/understanding decreases.

Bad practice - no alias is used on the field

Below you can see that these tables have an alias for 'u' and 'p', but the selected element doesn't take advantage of this alias. This can be very frustrating, and if more than one table contains fields with the same name, it will cause a run-time error. For example, user_id can be found in both the 'users' table and the 'preferences' table.

Best practices -- Aliases to use on fields

Below you can see that the table and the selected elements take advantage of the table's alias. This makes the code very readable to the end user. Even if you have only one table, it is a good practice to use aliases on table and field names. Good habits make good code.

Bad practice - generic aliases on CTE <

Below we have a CTE that has been created, but the assigned table name is called "cte". It's a very common name that doesn't tell the end user anything about the data in the table at all. If you're a user, you won't have any indication of what table is being used when reading the select statement after the CTE.

Best practices - specific aliases on the CTE

Below we have an already created CTE with a more descriptive name. The name gives the user some instructions to let them know what data it contains.

comma

In the selection statement, I prefer to use a leading comma instead of a trailing comma, which is a case where my personal opinion comes into play. I've seen a lot of people use trailing commas when writing selection elements, and in other languages, using trailing commas is a common practice. However, in other languages, it is not common to include a large number of arguments in a function, while in SQL, it is common to have a large number of elements selected (and declared in code). While you might think I'm being too extreme about this usage, here are a few reasons why I think leading commas are good.

1. The user interface looks clean. When you look at examples of best practices, look at how commas are arranged well. Compared to using a trailing comma, it's easy to see that a comma is missing and avoid runtime errors.

2. There is no confusion when dealing with longer case statements that wrap around rows. Looking at the example below, it is difficult to tell whether the end of a line is the end of an element or statement, or the end of an argument that specifies a function that is passed in.

To add some extra frustration, BigQuery's query formatter actually reprocesses your code to show all the trailing commas.

Wrong approach - trailing comma

In this example, we can see that no leading comma is used, which makes it difficult for us to spot the missing comma.

Best practice - Comma at the beginning

In this example, we can see that all commas are aligned, which makes it easy to guarantee that not a single comma is missed.

Wrong approach - confusion of trailing commas

In this example, we can see that the leading comma is not used. We have a comma-terminated line break, so it's hard to tell if this "max" line is really a separate statement, or if it's part of a longer statement. One would argue that you shouldn't use such a line break in your code, and while I support using line breaks in the right place (because it makes the code easier to read), you'll still run into the same problem when you look at your text-wrapped code in the editor or in a Git diff comparison.

Best practice - Application of leading commas with indentation on wrapped text

In the example below, it is easy to see that a comma is missing. You might immediately wonder if anyone has forgotten the comma, but since it's so easy to read code with a leading comma, it's more likely that the lack of a leading comma is well designed and shouldn't actually have a comma in front of the 'cast' statement. Also, by adding indentation to the cast statement, the code appears to have a more obvious intent as to why commas are not needed.

If someone performs a carriage return after "desc" on line 3, you will now have a common beginning on line 4. Typically, we want all commas to be used as leading characters. But this logic only applies to the properties (columns) that we are talking about being returned. In our case, the comma at the end of line 3 is part of a case statement, so things get very messy.

This practice has been very common in a lot of SQL code in the past, which may have something to do with the fact that SQL has been around for a long time, probably longer than most text editors with syntax highlights. Today, most people use (or should use) syntax highlighting in their editors, so uppercase letters shouldn't be necessary to discover reserved words.

Although I have a personal opinion and don't use capital letters, you may disagree and have different opinions. That's my reason.

1. I don't like my code yelling at me. In social contexts and in written communication, the use of capital letters is synonymous with yelling, so I try to avoid writing it that way.

2. The code reads less fluently. Psychological studies have shown that lowercase words are easier to recognize than uppercase words. This is because lowercase letters have more shape variations than uppercase letters. If all capital letters are used, the reading speed will be reduced by 13-20%. Also, when you mix uppercase and lowercase words

3. Holding down the shift key or locking/unlocking the capital letter key while I type is an extra keystroke action. I know, this is not a good argument.

Bad habits - capital letters

In this example, we can see that reserved words are uppercase and all other words are lowercase.

Best practices - lowercase

Grouping - Numbers and explicit fields

I almost always group with explicit field names instead of the position number in the select statement. While this didn't have any effect on the results, I found that it saved me time when it came to typing and debugging. Usually, I prefer to use numbers because it reduces the text on the page, but this has caused too many problems and spent a lot of time chasing errors at runtime.

Bad Practices - Grouped by job number

Here you can see that we are grouping by location number according to the selected project. What I don't like about this practice is.

1. If someone places an aggregate function anywhere in the select statement except the first or last item, then you have to skip a position number in group by. If you later decide to rearrange the elements you have chosen, this creates frustration.

2. You have to figure out how many elements you have, subtract the aggregated items, and then manually enter the location number.

Best practices - Group by field name

Here you can see that we are performing a grouping with explicit field names. While it may seem like this is a lot more wording and work than using position numbers, it's actually faster than typing with numbers. How to say it? Because all you have to do is copy what's in the selection statement (minus the aggregate field) and paste those values into the grouping. When you use numbers, you actually have to enter numbers.