Amazon Redshift is a cloud data warehouse service that enables fast, cost-effective analysis of petabytes of data. By combining machine learning, massively parallel processing, and columnar storage on SSDs, Redshift can deliver more than 10x the performance of other data warehouses.
However, that performance can come at an unnecessary cost, and you may struggle to get the most out of your cluster, in the form of inconsistent query performance or issues with workload scaling. These Amazon Redshift best practices aim to improve how you plan, monitor, and configure your clusters so you can make the most of your data.
You can use the Workload Manager (WLM) to manage query performance. Redshift runs queries in a queuing model, and you will likely need to change the default WLM setting, which offers a single queue with five slots.
Configuring the WLM properly gives you two benefits: it eliminates queue wait times, and it prevents disk-based queries.
You can define up to 8 queues with a total of up to 50 slots. By default, a query runs in a single slot. Queries can be routed into specific queues using assignment rules.
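As a minimal sketch of queue routing, a session can be tagged with a query group so that WLM assignment rules send its queries to a matching queue. The group name `reports` below is hypothetical; it must correspond to a rule in your WLM configuration.

```sql
-- Tag this session so WLM routes its queries to the queue whose
-- assignment rule matches the (hypothetical) 'reports' query group.
SET query_group TO 'reports';

-- Queries issued now run in the matched queue's slots.
SELECT count(*) FROM sales;

-- Return to default routing for the rest of the session.
RESET query_group;
```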
When creating a table, you decide how its data is spread across the slices: evenly, or according to specific columns. By distributing on the columns used in joins, you can minimize data transfer over the network.
A good distribution key determines how much data must move between slices and how evenly the slices share the work during query execution. These are the primary attributes to pay attention to when choosing one.
A skewed distribution key leaves some slices working harder than others during query execution, unbalancing CPU and memory usage; the query ultimately runs only as fast as its slowest slice.
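For illustration, here is a sketch with two hypothetical tables distributed on their shared join column, so matching rows land on the same slice and the join avoids network redistribution:

```sql
-- Distribute both tables on the join column (customer_id) so the
-- join between them is resolved locally on each slice.
CREATE TABLE customers (
    customer_id BIGINT,
    name        VARCHAR(128)
)
DISTSTYLE KEY
DISTKEY (customer_id);

CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);
```

A column like `customer_id` only works well as a distribution key if its values are spread evenly; a column dominated by a few values would skew the slices.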
Redshift Spectrum improves performance and query throughput by enabling you to query data directly from files on Amazon S3 through a compute layer that is both independent and elastically sized.
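A minimal Spectrum setup looks like the sketch below. The schema name, Glue database, bucket path, and IAM role ARN are all hypothetical placeholders; the role must grant access to S3 and the data catalog.

```sql
-- Hypothetical external schema backed by the AWS Glue Data Catalog.
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'spectrum_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Hypothetical external table over Parquet files on S3; queries
-- against it run in Spectrum's compute layer, not on your cluster.
CREATE EXTERNAL TABLE spectrum_schema.clicks (
    user_id BIGINT,
    url     VARCHAR(2048),
    ts      TIMESTAMP
)
STORED AS PARQUET
LOCATION 's3://my-bucket/clicks/';
```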
The Amazon Redshift Advisor automatically analyzes current workload management (WLM) usage and makes recommendations for better performance and throughput. It also tracks your activity to flag unused tables. You can take advantage of this automatic analysis to optimize your tables. ANALYZE is the command that updates missing or stale statistics for a table; these statistics are imperative for the Advisor to come up with optimal recommendations.
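Refreshing statistics is a single command; `sales` below is a hypothetical table name:

```sql
-- Refresh statistics for the whole table.
ANALYZE sales;

-- Or restrict the work to the columns actually used in filters,
-- joins, and group-bys.
ANALYZE sales PREDICATE COLUMNS;
```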
When rows are deleted or updated, Redshift flags them for deletion but doesn't immediately remove them from disk; an update writes the new data to a newly appended block. The system therefore still scans the old rows and continues allocating disk space for them, which grows the table's overall storage footprint and hurts performance. You can run the VACUUM command manually to reclaim that space and restore the sort order.
Auto Vacuum eliminates the need to run the VACUUM DELETE command manually. The automated process gets the best results, however, when your data is already sorted via the sort key columns.
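The manual variants look like this, again using a hypothetical `sales` table:

```sql
-- Reclaim space from deleted rows AND restore the sort order.
VACUUM FULL sales;

-- Only reclaim space (the part Auto Vacuum runs for you).
VACUUM DELETE ONLY sales;

-- Only re-sort rows, without reclaiming deleted space.
VACUUM SORT ONLY sales;
```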
The COPY command is the ideal way to load data into Amazon Redshift. COPY uses the cluster's parallel architecture for faster data transfer, and it supports multiple sources, such as Amazon S3, EMR, DynamoDB, and SSH remote hosts.
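A minimal load from S3 might look like the sketch below; the bucket path, table, and IAM role ARN are hypothetical. Splitting the source into multiple compressed, evenly sized files lets each slice load in parallel.

```sql
-- Parallel load of gzip-compressed CSV files from a
-- hypothetical S3 prefix into a hypothetical table.
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS CSV
GZIP;
```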
UNLOAD lets you export query results from Redshift to S3 quickly through parallel processing. The data can be exported in Apache Parquet format, which unloads faster than text formats, and it can be compressed before landing in S3. The larger the result set, the greater the benefit.
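As a sketch, with a hypothetical table, bucket prefix, and IAM role:

```sql
-- Export query results to S3 as Parquet, written in parallel
-- across multiple files under the given prefix.
UNLOAD ('SELECT * FROM sales WHERE order_date >= ''2023-01-01''')
TO 's3://my-bucket/exports/sales_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
```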
CloudWatch metrics let you set up monitoring events that target clusters, perform health checks, track read/write activity, and so on. You can achieve similar monitoring by creating a workflow in Totalcloud's editor: for example, start and stop workflows for Redshift to save on your bill, or workflows that manipulate queries within tables or automate data transfer.
A combination of automated processes with manual checkups is the healthiest maintenance strategy for your clusters.
The above steps can make querying with Amazon Redshift faster, easier, and more storage-efficient. Handling your tables and the data within them efficiently is the best way to make working with your clusters worthwhile.