7 Best Practices for Amazon Redshift

What is Redshift? and the benefits of optimized usage

Amazon Redshift is a cloud data warehouse service that allows for fast and cost-effective analysis of petabytes worth of data stored across the data warehouse. Using the combination of machine learning, massively parallel processing, and columnar storage on SSD disks, Redshift can deliver more than 10x the performance rate compared to other data warehouses.

However, this high-performance rate can end up costing unnecessarily or you could have trouble making the most out of your cluster performance in the form of inconsistent query performance or issues with workload scaling. These Amazon Redshift Best Practices aim to improve your planning, monitoring, and configuring to make the most out of your data.

Improve Query performance with Custom Workload Manager queue

You can use the Workload Manager to manage query performance. Redshift runs queries in a queuing model. You will likely have to configure the default WLM setting which offers one queue with five slots.

You can have two benefits with configuring the WLM

  • Scaling workloads by giving them enough resources 
  • Isolating and protecting your predictable workloads from your unpredictable workloads 

You can have up to 8 queues with a total of up to 50 slots. A query will run in a single slot, by default. Queries can be routed into queues using certain rules. Setting up your WLM the right way will eliminate queue wait times and disk-based queries.

  • Match the queue slot count to the limit of concurrent queries so you can reduce or avoid wait times.
  • Assign enough memory to your queues to reduce disk-based queries  

Using distribution keys to reduces transfer time

When creating a table, you have the power to decide how the data is spread to the slices, evenly or specifically according to the columns. By choosing joined columns for distribution, you can minimize the data transfer over the network.

Selecting a good distribution key helps with the data transfer over to the slices and how well the slices perform during query execution. These are the primary attributes to pay attention to when choosing a good distribution key.

  • High cardinality: As the number of slices in the cluster increases, a relative number of unique data values must exist.
  • Low skew: The number of times each unique value occurs in the distribution key column of the table should be ideally equal.
  • Commonly joined: Choose a distribution key that joins to the other tables, if you have a few of these then choose the one that joins to the most number of rows,

A skewed distribution key results in some slices working harder than others during query execution, unbalancing CPU or memory, and ultimately only running as fast as the slowest slice.

Boost Query throughput with Redshift Spectrum

Redshift Spectrum improves performance and query throughput by enabling you to query data directly from files on Amazon S3 through a compute layer that is both independent and elastically sized. 

  • Amazon Redshift Spectrum’s compute layer allows more processing power to execute specific queries by dispersing the workload from the main cluster.
  • Skip the loading in the ELT process and instead directly run the transform on the S3 data.
  • Avoid loading the data onto the main cluster by running queries right when data arrives at Amazon S3.

Storage Optimization using Analyze and Vacuum

The Amazon Redshift Advisor automatically analyzes the current workload management (WLM) usage and makes recommendations for better performance and throughput. It also lets you know unused tables by tracking your activity. You can take advantage of this automatic analysis provided by the advisor to optimize your tables. ANALYZE is a specific command that provides queries that help address any missing or statistic for the table. These statistics are imperative for the advisor to help come up with optimal recommendations. 

When rows are DELETED or UPDATED, the computer flags them for deletion but doesn’t actually remove them from the disk yet. Updates result in writing a new block with new data appended. So the system still scans the previous rows and continues allocating disk space for them. This increases the overall storage space of the table and thereby affects the performance. You can use the VACUUM command manually to circumvent this issue by deleting the rows and restoring the sort order.

Auto Vacuum eliminates the need for you to run the VACUUM DELETE command manually. However, the automated process gets the best results when you have sorted your data via the sort key columns. 

Compress size with Copy command

The COPY command is the ideal method used to transfer data into Amazon Redshift. Copy uses parallel architecture for faster data transfer and it allows multiple sources to be transferred from such as EMR, DunamoDB, and SSH remote hosts.

You get:

  • Files uploaded faster to Amazon S3
  • Reduction in consumption of S3 Storage
  • The loading process of files becomes faster as compression and decompression happen as files are being read.

Improving export performance with the UNLOAD command

UNLOAD lets you export SQL statements made in Redshift to S3 faster through parallel processing. The data is exported in parquet format which can also be done at faster processing speeds than text formats. The data can be compressed before being exported to S3. The larger the queries being passed through, the easier it gets. 

Using CloudWatch metrics/Totalcloud custom templates to improve performance

CloudWatch Metrics lets you set up monitoring events that can target clusters, perform health checks, automate read/write process, etc. You can achieve similar monitoring features by creating a workflow in Totalcloud’s editor. You could create start and stop workflows for Redshift to save bills or manipulate queries within tables or automate data transfer. 

A combination of automated processes with manual checkups is the healthiest maintenance strategy for your clusters.

The above steps can make querying with Amazon Redshift faster, easier and more storage efficient. Efficiently handling the tables and the data within it is the best way to make manipulating clusters worthwhile.

  7 Best Practices for Amazon Redshift

Smart Scheduling at your fingertips

Go from simple to smart, real-time AWS resource scheduling to save cost and increase team productivity.

Learn More
More Posts

You Might Also Like

Cloud Computing
How To Migrate To Azure Faster?
Migrating from on premise data centers to a cloud provider is always considered a difficult endeavor. From the cost, to the planning and resource allocation, plenty of preliminary work is gone to setting up a cloud infrastructure. Which is why, Microsoft Azure’s new program stands to benefit many organizations still on the fence about migrating to the cloud.
July 21, 2021
Cloud Computing
Everything You Need To Know About Kubernetes Scheduler
When creating a Kubernetes cluster, scheduling the pod to an available node is an important component of the process. This component works under specific rules and technicalities that I’d like to explore in this article...
September 23, 2020
Cloud Computing
20 Cloud Influencers You Should Be Following in 2020
It’s important to follow the right individuals so that you remain on the loop and always find yourself learning things that you were unaware of. These thought leaders and influencers can only be the avenues by which you meet other interesting technologists.
September 23, 2020
Cloud Automation
New In: No-code cloud management workflows for Azure, VMware & Private Cloud (in addition to AWS)
At TotalCloud, we’ve been enabling workflow-based cloud management for AWS to make it intuitive, accelerated, and no-code. Instead of programming cloud management use cases or depending on siloed solutions, we built out a platform that gives you building blocks to assemble any cloud management solution. 
September 4, 2020
Cloud Computing
List of Essential Kubernetes Tools
Kubernetes is a Container-as-a-Service with tons of unique tools to choose from. External tools play a role in integrating with different systems or maintaining control over the clusters you deploy. Manual health checks and troubleshooting is not ideal to keep a system in full health.This list of tools will provide ample support to your containers and have enough configuration to leave management flexible...
August 12, 2020
AWS Use Case Files
TotalCloud Inventory Actions: Giving a new meaning to Cloud Inventory
Learn how the TotalCloud Inventory Dashboard can become equivalent to your cloud provider’s SDK. Carry out any action on any discovered resource with Inventory Actions.
July 30, 2020