Defend your data quality with constraints

In computer science, "garbage in, garbage out" describes how bad or nonsensical input data produces nonsense output, or "garbage".  How do you protect your database from letting garbage in? Using constraints is an excellent method for keeping your data clean.  Constraints are essentially rules that your data has to follow.

Datatypes as database constraints

Datatypes are not something people usually think of as constraints (strictly speaking, they're not), but choosing a datatype is probably the most basic way we constrain the data we use.  If you want to store an integer, and only an integer, you choose the INTEGER datatype.  PostgreSQL also offers datatypes to store IPv4, IPv6, and MAC addresses.  You could use TEXT to store these values, but TEXT will not ensure the values are valid network addresses.
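For example, the inet and macaddr types reject values that are not valid addresses.  A minimal sketch (the table and column names here are just for illustration):

    CREATE TABLE devices (
        device_name text,
        ip_address  inet,     -- stores IPv4 and IPv6 addresses
        mac_address macaddr   -- stores MAC addresses
    );

    INSERT INTO devices VALUES ('web01', '192.168.1.10', '08:00:2b:01:02:03');  -- succeeds
    INSERT INTO devices VALUES ('web02', 'not-an-ip',    '08:00:2b:01:02:03');  -- fails: invalid input syntax for type inet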

Common constraints and uncommon usage

The most common constraints are not null, check, unique, primary key, and foreign key.  Check out the PostgreSQL constraint documentation for the common usage of each.  There are also less common ways to use these constraints, such as partial unique indexes.  Let's say you want a unique constraint on an email address column, but you also have a soft delete concept.  You may choose to create a partial unique index like this:
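A sketch, assuming a users table with an email column and a deleted_at timestamp used for soft deletes:

    -- only rows that have not been soft deleted must have a unique email
    CREATE UNIQUE INDEX users_active_email_idx
        ON users (email)
        WHERE deleted_at IS NULL;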

Beyond very basic checks, such as requiring positive prices, you can expand a check constraint to multiple columns. The PostgreSQL documentation provides a good example (you want to make sure sale prices are cheaper than the original price):
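The documentation's version, which uses discounted_price for the sale price:

    CREATE TABLE products (
        product_no       integer,
        name             text,
        price            numeric CHECK (price > 0),
        discounted_price numeric CHECK (discounted_price > 0),
        CHECK (price > discounted_price)
    );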

This example enforces three data rules using check constraints: the price must be greater than 0, the sale price must also be greater than 0, and the price must be greater than the sale price.   Another uncommon use of constraints is calling a function to constrain your data.  Take, for example, a DATE column in which you only want to store dates that are the first day of the month.  First, create a function that will do just that.
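A minimal sketch of such a function (the function name is an assumption; it must be immutable to be usable in a check constraint):

    -- returns true only when the given date is the first day of its month
    CREATE OR REPLACE FUNCTION is_first_of_month(d date)
    RETURNS boolean AS $$
        SELECT date_trunc('month', d)::date = d;
    $$ LANGUAGE sql IMMUTABLE;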

Then add the function as a constraint.
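Something along these lines, again with assumed table and column names:

    ALTER TABLE subscriptions
        ADD CONSTRAINT start_date_first_of_month
        CHECK (is_first_of_month(start_date));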

Keep your data safe with constraints

Constraining your data is extremely important to your data quality (and sometimes performance).  Don't let bad data get into your database; it can be very difficult to remove.  And don't rely on application code to keep your data clean.  Your database is the foundation of your application, and the database needs to prevent bad data.

PostgreSQL connections are process based, not threaded like those of some other databases.  Like everything else, there are pros and cons to this approach.  One of the cons is that each PostgreSQL connection consumes more resources than a threaded connection would.  If you need a large number of connections to your database, you may want to look at a connection pooler like pgBouncer.

What do your sessions look like?

PostgreSQL has views that allow you to see the details surrounding your sessions.  Run the following PostgreSQL query to see how many sessions you have and what they are doing.
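A sketch of the kind of query intended, using the pg_stat_activity view:

    -- count sessions by state
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;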

Here is the result from one of my production 9.5 database clusters, along with the definition of each state.

  • active – This is a currently running query. The number of active sessions represents the number of connections you actually need at that moment.
  • idle – This is an open connection that is not currently doing anything in the database.  You will generally see these for sessions that were active, are not finished, and are waiting for the next request from your client or application.  Too many of these and you may want to look into using pgBouncer.
  • idle in transaction – This is a session where your application has issued a BEGIN, may have done some work, but is now just sitting idle and keeping a transaction open.  Too many of these can be problematic and should be investigated.  pgBouncer will not help here since these sessions are inside a transaction.

The good, the bad, and the ugly

In my example above you can see there is only one active session (mine), 331 idle sessions, and 61 idle in transaction sessions.  The number of idle and idle in transaction sessions is a concern.  The application that uses this database is quite old, with a great deal of technical debt.  Don't let your applications do this to your database.  If you are running PostgreSQL 10 or later, you may also see sessions with a NULL state.  In that case you may want to use the following query.
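A sketch of such a query, grouping on the backend_type column that was added in PostgreSQL 10:

    -- background workers show a NULL state; backend_type tells you what they are
    SELECT backend_type, state, count(*)
    FROM pg_stat_activity
    GROUP BY backend_type, state
    ORDER BY count(*) DESC;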

Here are the results for a different production cluster.

The sessions with a NULL state are all background PostgreSQL sessions.  Also, notice that one of my active sessions is a walsender.  Your application sessions will have backend_type = client backend.  On this database, the numbers are much more reasonable for a database that is not doing a great deal of work at the moment.

Preventing the bad connections

It's critical to keep your PostgreSQL sessions in check.  Fortunately, there are a couple of settings that will keep your users and applications behaving well.  The first is statement_timeout.  A statement timeout will automatically cancel queries that run longer than the allotted time.  You can set this at both a global level and for a specific session. To do it at the database level, you’d run this:
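For example, assuming a database named mydb:

    -- cancel any statement that runs longer than 60 seconds in this database
    ALTER DATABASE mydb SET statement_timeout = '60s';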

This will not allow a query to run longer than 60 seconds.  That is generally a very long time for an OLTP database, but it may be far too short for an OLAP or analytics data store.  To deal with idle in transaction sessions that have been sitting around too long, you can set idle_in_transaction_session_timeout (on PostgreSQL 9.6 and later).  This setting is similar to statement_timeout, but it acts on any session that is idle in transaction.
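For example (the five-minute value and database name are just illustrations):

    -- close any session that sits idle in transaction for more than 5 minutes
    ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '5min';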

Quick Tip:  Want to create a table just like another table (without the data)?  It's simple to do with this command:
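With placeholder table names:

    -- copies column definitions, defaults, constraints, and indexes, but no rows
    CREATE TABLE new_table (LIKE existing_table INCLUDING ALL);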

 

If you are making modifications to the PostgreSQL configuration file (postgresql.conf), you will need to reload the file for your changes to take effect.  Some settings need a full database cluster restart, but many just require a reload of the configuration file.

Option 1 - From the OS (Linux) command line
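A sketch; the data directory path varies by install:

    # send the reload signal via pg_ctl; adjust -D to your data directory
    pg_ctl reload -D /var/lib/pgsql/data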

Option 2 - From a database session
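    -- ask PostgreSQL to re-read its configuration file
    SELECT pg_reload_conf();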

Both options will reload the configuration file without restarting the database or interrupting any active sessions or open connections.

 

Do you want to hide column values in PostgreSQL?  We had that exact need: allow our DevOps engineers to run selects against a table while obfuscating some of the column values returned.  For us, this masking of data generally applies to Personally Identifiable Information (PII).  There are a couple of methods to achieve this goal.  The first is to grant SELECT only on the columns you want them to be able to read.  With this method, not only will you hide column values, you will also hide the column itself. Below is an example of limiting which columns a user can read.
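A sketch of the setup (the table name is an assumption; the column and role names follow the discussion below):

    -- a table with one column we do not want test_user to read
    CREATE TABLE column_test (
        id       integer,
        yes_read text,
        no_read  text
    );

    CREATE USER test_user;
    -- a new role gets no privileges on the table by default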

Now we should not be able to read from the table at all when we log in as the test_user role.
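For example:

    -- connected as test_user:
    SELECT * FROM column_test;
    -- fails with a permission denied error, since test_user has no grants yet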

Next, we can grant SELECT only on the ID and YES_READ columns.
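Using the assumed table name from above:

    GRANT SELECT (id, yes_read) ON column_test TO test_user;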

Now we can select the ID and YES_READ columns, but not the NO_READ column. In my opinion, the resulting error message can be confusing because we do have permissions on the table, just not on one of the columns.  This method will hide column values and is more secure than method 2.
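To see the difference, again connected as test_user:

    SELECT id, yes_read FROM column_test;            -- works
    SELECT id, yes_read, no_read FROM column_test;   -- fails with a permission denied error
    SELECT * FROM column_test;                       -- also fails, because * expands to include no_read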

Good, but not perfect

This solution was less than desirable for our DevOps engineers because they wanted to run queries "as-is", without any modifications.  So we came up with a different option that's not perfect, but closer.  We obfuscated the data with the search path, schemas, and views.  As in the previous example, we will create a table that we want to mask and a new user that we want to mask the data from.
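A sketch of the setup (all names here are assumptions):

    -- the real table, holding a PII column, lives in the public schema
    CREATE TABLE public.customers (
        id   integer PRIMARY KEY,
        name text,
        ssn  text   -- the PII we want to mask
    );

    CREATE USER devops_user;
    -- make sure the role has no direct read access to the real table
    REVOKE ALL ON public.customers FROM devops_user;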

Alternatively, you could revoke the privileges at the schema level if you are applying this logic to a broad set of tables.
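Something like this, with the same assumed role name:

    REVOKE ALL ON ALL TABLES IN SCHEMA public FROM devops_user;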

Next, we’re going to create a view that exposes a restricted version of the data.
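A sketch, reusing the assumed names from above:

    CREATE SCHEMA protect_pii;

    -- same table name and column list, but the PII column returns a constant
    CREATE VIEW protect_pii.customers AS
    SELECT id,
           name,
           'REDACTED'::TEXT AS ssn
    FROM public.customers;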

Now we can grant the permissions necessary for our DevOps engineers to read the data.  Here I am granting access to everything in the protect_pii schema, but you could also grant select on just the one table.
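With our assumed role name, that looks like:

    GRANT USAGE ON SCHEMA protect_pii TO devops_user;
    GRANT SELECT ON ALL TABLES IN SCHEMA protect_pii TO devops_user;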

Finally, if you set the user's search_path, they will try your new non-PII schema first and select from your non-PII view.  The code below tells the user's session that, for queries that do not include a schema name, it should look for the object in the protect_pii schema first, then in the user's own schema, and finally in the public schema.
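Again with our assumed role name:

    ALTER ROLE devops_user SET search_path = protect_pii, "$user", public;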

Now your users can run selects against the protected tables and see all the columns, but they will not be able to view the PII data.
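For example:

    -- connected as devops_user:
    SELECT id, name, ssn FROM customers;
    -- resolves to protect_pii.customers, so ssn comes back as 'REDACTED'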

Optionally, you could leave the PII column out of your view entirely, but then a query that explicitly lists column names could not be run "as-is".  Also be aware of the protected column's datatype.  In this example we set it to text with ::TEXT, but you may need a different datatype depending on the underlying table.
