Our data engineers have built a lot of data warehouses that run on Amazon Redshift. Redshift can be a great platform when efficiently used, but like many others, it’s difficult to get it work well. We’ve now at the stage where we’ve collected enough information on what works and what doesn’t that it’s valuable enough to pull together our best tips into this post.
Optimize Your Schema
Many data engineers start out dumping data into redshift with little thought to the schema and hoping the columnar database engine will magically work out the best way to query the database. We find that this makes it very to run predictive data analytics and your data scientists are likely to spend more time waiting for queries to run than creating new models. If you don’t want to start a battle between your data scientists and your data engineers, we recommend an approach where we implement an efficient schema based on a standard data warehouse star schema.
- Distribution keys – The distribution keys define how data is distributed amongst the different nodes. You always want to have a uniform distribution of data so your distribution key should typically be set to a key that has millions of different values.
- Sort keys – The sort keys are the closest equivalent to a traditional database index. If you have data ordered by time, then the time field should almost always be the first sort key.
- Sort key styles – We always recommend to start off by using COMPOUND sort keys as the INTERLEAVED usually are less efficient.
- Foreign keys – Although Amazon Redshift does not enforce foreign key constraints, they are used by the query optimizer, and as such we always include them in our table definitions.
Manage Your Disk Space
When you’re running predictive analytics in the data warehouse, your Amazon Redshift queries can be very disk intensive, and therefore good data engineers will also ensure that there’s always disk space good headroom. Although it’s easy to add new nodes to the data warehouse, this can become very expensive. The more data Amazon Redshift needs to read from disk, the longer it’s going to take to run any queries. You can change the encoding by using the Amazon Redshift Column Encoding Utility provided by AWS. Note that this script will not update any views table specific permissions for users so you’ll need to recreate these separately and also delete the backup tables it creates.
Optimizing Slow Queries
However well optimized your schema is, there are always going to be users who can work out new and unusual ways to run slow queries on your data warehouse. As a general guide, if a query needs to access a lot of data then it’s going to be slow. Anything that doesn’t require a lot of data should quickly execute if it’s written correctly. For each database, we manage we create a list of more specific things to look at when optimizing based on the exact schema, distribution, and sort keys.