Common SQL mistakes and how to avoid them.

In addition to DBAs and data engineers, SQL is used by analysts, data scientists, product managers, programmers, and designers. There are varying skill levels within these groups and below are some common mistakes I see.

Dividing Integers

In PostgreSQL dividing an integer by an integer will result in an integer. This is not always apparent based solely on the results.

select 10/4;
+----------+
| ?column? |
+----------+
|        2 |
+----------+

The reality is 10 divided by 4 results in 2.5, but the result from the SQL is 2. Telling PostgreSQL that your integer is actually a decimal will result in the correct answer. You can cast the value to a decimal, numeric, or a float.

select 10/4::DEC;
+--------------------+
|      ?column?      |
+--------------------+
| 2.5000000000000000 |
+--------------------+

select 10/4::NUMERIC;
+--------------------+
|      ?column?      |
+--------------------+
| 2.5000000000000000 |
+--------------------+

select 10/4::FLOAT;
+----------+
| ?column? |
+----------+
|      2.5 |
+----------+

Watch out for division by zero errors.

In math we know we cannot divide by zero and the same applies to SQL.

select 5/0;
ERROR:  22012: division by zero

So what if you are selecting from a table and a column could return a value of zero? You can simply set the value to NULL. When you divide a value by NULL the result will be NULL This can be accomplished with NULLIF.

SELECT first_name,
       last_name,
       dept,
       salary,
       round(LAG(salary) OVER (PARTITION BY dept ORDER BY salary DESC)/nullif(salary, 0), 4) next_employee_salary
FROM employee;

+------------+-----------+-------------+---------+----------------------+
| first_name | last_name |    dept     | salary  | next_employee_salary |
+------------+-----------+-------------+---------+----------------------+
| Lisa       | Teal      | Engineering | 2496.78 |               [NULL] |
| Gunner     | Greenburg | Engineering | 2169.32 |               1.1510 |
| Mark       | Thompson  | Finance     | 3638.42 |               [NULL] |
| Jack       | Doe       | Finance     | 1945.42 |               1.8702 |
| Sam        | Lopez     | Finance     | 1658.00 |               1.1734 |
| Julie      | Lopez     | Finance     | 1455.46 |               1.1392 |
| Sally      | Smith     | Sales       | 1195.98 |               [NULL] |
| Troy       | Anderson  | Sales       | 1158.34 |               1.0325 |
| Joe        | Jones     | Sales       | 1034.23 |               1.1200 |
| Jake       | Miller    | Sales       |       0 |               [NULL] |
+------------+-----------+-------------+---------+----------------------+

Without the NULLIF the SQL statement would error because Jake Miller has a salary of 0. Poor guy.

Watch for NULLs

NULL values can be tricky and I see people make mistakes when dealing with NULL values. I wrote a whole post about dealing with NULL values. The one thing to keep in mind is NULL does not equal anything. The most common mistake I see is when people are looking for rows that are not equal to a clause and they also miss rows with a NULL column.

SELECT * FROM employee;
+----+------------+-----------+-------------+---------+------------+
| id | first_name | last_name |    dept     | salary  | hire_date  |
+----+------------+-----------+-------------+---------+------------+
|  1 | Joe        | Jones     | Sales       | 1034.23 | 2015-01-15 |
|  2 | Sally      | Smith     | Sales       | 1195.98 | 2015-01-15 |
|  3 | Lisa       | Teal      | Engineering | 2496.78 | 2019-03-10 |
|  4 | Gunner     | Greenburg | Engineering | 2169.32 | 2017-09-22 |
|  5 | Jack       | Doe       | Finance     | 1945.42 | 2018-06-20 |
|  6 | Troy       | Anderson  | Sales       | 1158.34 | 2017-04-15 |
|  7 | Julie      | Lopez     | Finance     | 1455.46 | 2010-03-25 |
|  9 | Mark       | Thompson  | Finance     | 3638.42 | 2008-11-22 |
| 10 | Jake       | Miller    | Sales       |       0 | 2016-04-03 |
|  8 | Sam        | Lopez     | [NULL]      | 8383.23 | 2014-02-25 |
+----+------------+-----------+-------------+---------+------------+
(10 rows)

SELECT * FROM employee WHERE dept = 'Sales';
+----+------------+-----------+-------+---------+------------+
| id | first_name | last_name | dept  | salary  | hire_date  |
+----+------------+-----------+-------+---------+------------+
|  1 | Joe        | Jones     | Sales | 1034.23 | 2015-01-15 |
|  2 | Sally      | Smith     | Sales | 1195.98 | 2015-01-15 |
|  6 | Troy       | Anderson  | Sales | 1158.34 | 2017-04-15 |
| 10 | Jake       | Miller    | Sales |       0 | 2016-04-03 |
+----+------------+-----------+-------+---------+------------+
(4 rows)

SELECT * FROM employee WHERE dept != 'Sales';
+----+------------+-----------+-------------+---------+------------+
| id | first_name | last_name |    dept     | salary  | hire_date  |
+----+------------+-----------+-------------+---------+------------+
|  3 | Lisa       | Teal      | Engineering | 2496.78 | 2019-03-10 |
|  4 | Gunner     | Greenburg | Engineering | 2169.32 | 2017-09-22 |
|  5 | Jack       | Doe       | Finance     | 1945.42 | 2018-06-20 |
|  7 | Julie      | Lopez     | Finance     | 1455.46 | 2010-03-25 |
|  9 | Mark       | Thompson  | Finance     | 3638.42 | 2008-11-22 |
+----+------------+-----------+-------------+---------+------------+

Notice Sam Lopez's department is not Sales, but Sam is not found when saying dept != 'Sales'. Sam has a NULL department. This can be solved by using a coalesce on the department column.

SELECT * FROM employee WHERE coalesce(dept,'unknown') != 'Sales';
+----+------------+-----------+-------------+---------+------------+
| id | first_name | last_name |    dept     | salary  | hire_date  |
+----+------------+-----------+-------------+---------+------------+
|  3 | Lisa       | Teal      | Engineering | 2496.78 | 2019-03-10 |
|  4 | Gunner     | Greenburg | Engineering | 2169.32 | 2017-09-22 |
|  5 | Jack       | Doe       | Finance     | 1945.42 | 2018-06-20 |
|  7 | Julie      | Lopez     | Finance     | 1455.46 | 2010-03-25 |
|  9 | Mark       | Thompson  | Finance     | 3638.42 | 2008-11-22 |
|  8 | Sam        | Lopez     | [NULL]      | 8383.23 | 2014-02-25 |
+----+------------+-----------+-------------+---------+------------+

Format your SQL

Don't be one of those people and make sure you format your SQL. No one wants to look at a pile of unformatted code. There are many styles of formatting your SQL. There are a few different styles that I like. Here is a style that one of my co-workers prefers.

SELECT extract(YEAR FROM issued_date) AS YEAR
       , extract(MONTH FROM issued_date) AS MONTH
       , ie.customer_id
       , ie.customer_name
       , count(*) AS total_volume
    FROM customer_issue_event ie
    JOIN customer_issue_event_recipients ier ON ie.id = ier.issue_event_id
   WHERE 1 = 1
     AND issued_date >= '2017-01-01'
     AND issued_date < date_trunc('month', date(now() - interval '0 month'))
     AND test_member = FALSE
GROUP BY YEAR
       , MONTH
       , ie.customer_id
       , ie.customer_name
ORDER BY 1
       , 2;

Be careful of functions on indexed columns

When filtering data with your WHERE clause be careful when applying a function to an indexed column. A function may prevent PostgreSQL from using the index.

explain SELECT * FROM message_transactions WHERE status = 'SENT';
+--------------------------------------------------------------------------------------------------------------+
|                                                  QUERY PLAN                                                  |
+--------------------------------------------------------------------------------------------------------------+
| Index Scan using message_transactions_status_idx on message_transactions  (cost=0.56..1.81 rows=1 width=175) |
|   Index Cond: (status = 'SENT'::text)                                                                        |
+--------------------------------------------------------------------------------------------------------------+

explain SELECT * FROM message_transactions WHERE UPPER(status) = 'SENT';
+-----------------------------------------------------------------------------------------------+
|                                          QUERY PLAN                                           |
+-----------------------------------------------------------------------------------------------+
| Gather  (cost=1000.00..1227207.95 rows=190534 width=175)                                      |
|   Workers Planned: 2                                                                          |
|   ->  Parallel Seq Scan on message_transactions  (cost=0.00..1207154.55 rows=79389 width=175) |
|         Filter: (upper(status) = 'SENT'::text)                                                |
+-----------------------------------------------------------------------------------------------+

As soon as you add the UPPER function PostgreSQL cannot use the index on the status column. You can certainly create an UPPER index, but that might not be an option. Here is another example with dates.

explain SELECT * FROM message_transactions WHERE sent_date - INTERVAL ' 7 DAYS' > date(now());
+-----------------------------------------------------------------------------------+
|                                    QUERY PLAN                                     |
+-----------------------------------------------------------------------------------+
| Seq Scan on message_transactions  (cost=0.00..1731123.16 rows=12702269 width=175) |
|   Filter: ((sent_date - '7 days'::interval) > date(now()))                        |
+-----------------------------------------------------------------------------------+

explain SELECT * FROM message_transactions WHERE sent_date > date(now() - INTERVAL ' 7 DAYS');
+---------------------------------------------------------------------------------------------------------------------------+
|                                                        QUERY PLAN                                                         |
+---------------------------------------------------------------------------------------------------------------------------+
| Index Scan using message_transactions_sent_date_idx on message_transactions  (cost=0.57..179891.93 rows=523529 width=175) |
|   Index Cond: (sent_date > date((now() - '7 days'::interval)))                                                            |
+---------------------------------------------------------------------------------------------------------------------------+

Here you can see simply applying the function to the right side of the equation corrects full table scan performance issue.

Make sure BETWEEN is pulling the correct dates

Using BETWEEN to select a date range is a common practice, but you need to be mindful of what dates you are actually selecting. BETWEEN is inclusive. For example, if we want to look at orders for the month of October we might issue a SQL statement like this.

SELECT * FROM test_dates WHERE order_date BETWEEN '2019-10-01' AND '2019-11-01' ORDER BY order_date;

+----+---------------+---------------------------+
| id | customer_name |        order_date         |
+----+---------------+---------------------------+
|  1 | Bob Smith     | 2019-10-25 19:48:22       |
|  2 | Jane Doe      | 2019-10-31 23:56:12       |
|  3 | Matt Jones    | 2019-10-31 23:59:59.82802 |
|  4 | Sally Decker  | 2019-11-01 00:00:00       |
+----+---------------+---------------------------+

As you can see Sally Decker has an order date in November and that is not what we wanted. Limiting your between clause to October 31st also does not return the correct results.

SELECT * FROM est_dates WHERE order_date BETWEEN '2019-10-01' AND '2019-10-31';

+----+---------------+---------------------+
| id | customer_name |     order_date      |
+----+---------------+---------------------+
|  1 | Bob Smith     | 2019-10-25 19:48:22 |
+----+---------------+---------------------+

Often I see people simply extend and date to a timestamp, but are you sure you have the full timestamp?

SELECT * FROM test_dates WHERE order_date BETWEEN '2019-10-01' AND '2019-10-31  23:59:59' ORDER BY order_date;

+----+---------------+---------------------+
| id | customer_name |     order_date      |
+----+---------------+---------------------+
|  1 | Bob Smith     | 2019-10-25 19:48:22 |
|  2 | Jane Doe      | 2019-10-31 23:56:12 |
+----+---------------+---------------------+

In the above example, we are still missing the order from Matt Jones. Simply expanding our timestamp pulls in Matt Jones.

SELECT * FROM test_dates WHERE order_date BETWEEN '2019-10-01' AND '2019-10-31  23:59:59.99999' ORDER BY order_date;

+----+---------------+---------------------------+
| id | customer_name |        order_date         |
+----+---------------+---------------------------+
|  1 | Bob Smith     | 2019-10-25 19:48:22       |
|  2 | Jane Doe      | 2019-10-31 23:56:12       |
|  3 | Matt Jones    | 2019-10-31 23:59:59.82802 |
+----+---------------+---------------------------+

We are still missing 1 order though. This is why I like to use >= and < when dealing with timestamp ranges.

SELECT * FROM test_dates WHERE order_date >= '2019-10-01' AND order_date < '2019-11-01' ORDER BY order_date;

+----+---------------+----------------------------+
| id | customer_name |         order_date         |
+----+---------------+----------------------------+
|  1 | Bob Smith     | 2019-10-25 19:48:22        |
|  2 | Jane Doe      | 2019-10-31 23:56:12        |
|  3 | Matt Jones    | 2019-10-31 23:59:59.82802  |
|  8 | Scott Reed    | 2019-10-31 23:59:59.999999 |
+----+---------------+----------------------------+

Here you can see all 4 orders show up for the month of October.