Amazon’s S3 has been a popular storage service for many years. One of its most sought-after features is its set of storage tiers, which let you move data from one tier to another. Different tiers come with different benefits. We’ll be looking at two of them: S3 Standard and S3 Glacier.
Moving data from Standard to Glacier is common practice, for two reasons: Glacier is one of the cheapest storage tiers available, and it is built for archiving.
Glacier is used to store data that you don’t need immediate access to. Say you’ve got older records of customers or patients and you want to keep them for an indefinite amount of time. S3 Glacier is well suited to such scenarios.
Glacier charges you in two ways: once for the data transfer, and then a storage fee that is billed monthly. One way people try to further reduce the cost of archiving is by compressing the data, so that less data is transferred and stored. The only problem is that compressing your data is a bit of a messy process. Glacier doesn’t have a built-in compression feature. The common solution is to use the S3DataNode object in Data Pipeline and its GZIP compression option to do the task. This is the messy part: you have to take your data out of S3, load it into a pipeline, run the compression, and push the result to Glacier, all manually. It’s quite a hassle.
We built an automated no-code workflow that folds the same process into one seamless flow and runs all of these tasks from the same place. With this workflow, compressing your data can become the default way you approach archiving, and it could cut your costs. You only need 1 workflow with 8 nodes to make this complex use case a reality, with no coding and no configuration on the AWS Console.
Why and how archiving can save costs
Archiving has been a familiar concept for as long as data has been stored. The idea is to store data that does not need to be accessed frequently, originally because of the sheer number of records being archived. In Amazon’s case, archiving is a cheaper alternative to regular storage because of the trade-off it offers: a lower storage price in exchange for slower, less frequent access.
This trade-off gives you more options for how your data is organized and how much you can reduce expenses. Your job is to pick the data for archival, which usually means personal records and older data sets. You could also analyze how often your data is accessed to figure out which of it is worth shipping off to Glacier.
The biggest price difference comes from the cost of API calls. S3 charges you per file moved from S3 to Glacier: $0.05 per 1,000 items. Suppose you have 1 million small files of 10 KB each. That is about 10 GB in total, so at $0.004 per GB the Glacier storage cost is about $0.04 per month, while the transfer costs $50. As you can see, transferring small files is far costlier than storing them. So we can bundle these small files into a single zip, even without compression, and then move that to Glacier.
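To make the arithmetic concrete, here is a minimal sketch that compares moving one million small files with moving a single bundle of the same total size. The two prices are the figures quoted in this article, not live AWS pricing, and the function name is only illustrative.

```python
# Prices as quoted in this article; check current AWS pricing before relying on them.
REQUEST_PRICE_PER_1000 = 0.05   # USD per 1,000 objects moved to Glacier
STORAGE_PRICE_PER_GB = 0.004    # USD per GB per month


def archive_costs(file_count, file_size_kb):
    """Return (one-time transfer cost, monthly storage cost) in USD."""
    transfer = file_count / 1000 * REQUEST_PRICE_PER_1000
    size_gb = file_count * file_size_kb / 1_000_000  # decimal units: 1 GB = 1,000,000 KB
    storage = size_gb * STORAGE_PRICE_PER_GB
    return transfer, storage


many_small = archive_costs(file_count=1_000_000, file_size_kb=10)
one_bundle = archive_costs(file_count=1, file_size_kb=10_000_000)

print(f"1,000,000 small files: transfer ${many_small[0]:.2f}, storage ${many_small[1]:.2f}/month")
print(f"1 bundled archive:     transfer ${one_bundle[0]:.5f}, storage ${one_bundle[1]:.2f}/month")
```

The storage cost is the same in both cases because the total size has not changed. Only the per-item transfer charge shrinks, which is why bundling pays off even without compression.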
How our workflow compresses data
Data compression is done by loading the collection of smaller S3 objects into a different bucket and then into the data pipeline, which bundles the small files into one large zip. The compression level ranges from 0 to 9. We use 0 here, which bundles the files without compressing them, because files like JPEG or MP4 are already compressed and would barely shrink further.
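The bundling step can be pictured with Python's standard zipfile module. This is only a local sketch of the idea, not the code the workflow runs: the function name and the sample data are made up for illustration, and the files are held in memory rather than read from S3.

```python
import io
import zipfile


def bundle_files(files, level=0):
    """Pack {name: bytes} into one in-memory ZIP archive and return its bytes.

    level uses the usual 0-9 scale; 0 stores the data without compressing it.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED,
                         compresslevel=level) as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


# Hypothetical sample data standing in for many small S3 objects.
small_files = {f"records/{i:04d}.txt": b"customer record\n" * 50 for i in range(100)}
bundle = bundle_files(small_files, level=0)

with zipfile.ZipFile(io.BytesIO(bundle)) as archive:
    print(len(archive.namelist()), "files in one archive of", len(bundle), "bytes")
```

At level 0 the archive is slightly larger than the raw data, but it counts as a single item when it is moved to Glacier.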
Text files and log files, on the other hand, can be compressed. The workflow uses a bit of custom code for this specific process (since we’ve already created it, you can simply adopt it as a template). We also configure the pipeline in our workflow to enable the compression, and then you just wait a while for it to finish. The process itself is no different from normal ZIP compression; we’re just enabling it on a cloud service.
The next section explains the process in detail. Compressing the data reduces both the overall Glacier storage cost and the data transfer cost.
I’ll expand on the workflow and what each node does. For this process, we only employ 2 AWS services: S3 and Data Pipeline.
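A quick way to see why text and logs are worth compressing while media files are not is to compare compression ratios locally. In the sketch below, the log line is invented and random bytes stand in for already-compressed data such as JPEG or MP4; both are assumptions made purely for illustration.

```python
import gzip
import os


def compression_ratio(data):
    """Compressed size divided by original size (lower means more savings)."""
    return len(gzip.compress(data, compresslevel=9)) / len(data)


# Repetitive text, similar in spirit to a log file.
log_text = b"2021-01-01 12:00:00 INFO request served in 12 ms\n" * 2000
# Random bytes have no redundancy left, much like already-compressed media.
random_bytes = os.urandom(len(log_text))

print(f"log text:     {compression_ratio(log_text):.3f}")
print(f"random bytes: {compression_ratio(random_bytes):.3f}")
```

The repetitive text shrinks to a small fraction of its size, while the random data stays at roughly its original size or grows slightly.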
Step 1: Triggering the workflow
The trigger node determines which action activates the workflow. In this case, it’s a trigger from an external application, such as a request from a web page or an app.
Step 2: A Custom Code to collate the S3 Data
This node contains custom code that collects the S3 data from your bucket and prepares it to be redirected. The sourceBucket is where the data is taken from, and the targetBucket is where the data will be moved to.
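The node's actual code isn't reproduced here, but a collation step of this kind might look roughly like the following sketch. It assumes a boto3 S3 client is passed in; the function name, the prefix parameter, and the bucket names in the usage comment are hypothetical.

```python
def copy_objects(s3, source_bucket, target_bucket, prefix=""):
    """Copy every object under prefix from source_bucket to target_bucket.

    s3 is expected to behave like a boto3 S3 client.
    Returns the list of keys that were copied.
    """
    copied = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=source_bucket, Prefix=prefix):
        # Pages with no matching objects have no "Contents" entry.
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket=target_bucket,
                Key=obj["Key"],
                CopySource={"Bucket": source_bucket, "Key": obj["Key"]},
            )
            copied.append(obj["Key"])
    return copied


# Usage (needs boto3 and AWS credentials; bucket names are placeholders):
#   import boto3
#   keys = copy_objects(boto3.client("s3"), "my-source-bucket", "my-target-bucket")
```

Passing the client in as an argument keeps the function easy to try out against a stub before pointing it at real buckets.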
Step 3: Create DataPipeline
This action node creates the data pipeline where the S3 data will be compressed.
Step 4: A custom code to push S3 Data into the pipeline
Step 5: Pipeline Definition
This node configures the compression of the S3 data that is moved into the pipeline and ensures that it is transferred to S3 Glacier.
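As a rough illustration of what such a definition contains, the sketch below builds objects in the shape that boto3's put_pipeline_definition expects: an uncompressed S3DataNode as input, an S3DataNode with gzip compression as output, and a CopyActivity linking the two. It is deliberately partial and is not the definition the workflow uses. The object ids and paths are placeholders, and a working definition would also need a schedule and a compute resource. How the output reaches Glacier is not covered here.

```python
def pipeline_object(object_id, string_fields, ref_fields=None):
    """Build one entry for the pipelineObjects list of put_pipeline_definition."""
    fields = [{"key": k, "stringValue": v} for k, v in string_fields.items()]
    # References to other pipeline objects use refValue instead of stringValue.
    fields += [{"key": k, "refValue": v} for k, v in (ref_fields or {}).items()]
    return {"id": object_id, "name": object_id, "fields": fields}


def build_definition(source_path, target_path):
    """Return a partial definition: plain input, gzip output, one copy step."""
    return [
        pipeline_object("SourceData",
                        {"type": "S3DataNode", "directoryPath": source_path}),
        pipeline_object("ArchiveData",
                        {"type": "S3DataNode", "directoryPath": target_path,
                         "compression": "gzip"}),
        pipeline_object("CompressAndCopy",
                        {"type": "CopyActivity"},
                        {"input": "SourceData", "output": "ArchiveData"}),
    ]


# Placeholder paths, for illustration only.
definition = build_definition("s3://my-target-bucket/incoming/",
                              "s3://my-target-bucket/compressed/")
print([obj["id"] for obj in definition])
```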
Step 6: Pipeline Activation
Step 7: Delay
A 600-second delay is set to allow the data transfer to happen before the next node is activated.
Step 8: Delete DataPipeline
This action node deletes the data pipeline once the compression and archiving have completed successfully.
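Steps 3 and 5 through 8 map onto a handful of Data Pipeline API calls. The sketch below strings them together with a boto3 "datapipeline" client passed in as an argument. The function name and parameters are illustrative, not the workflow's internals. One choice here is our own assumption: the pipeline is also deleted if a step fails, so nothing is left behind.

```python
import time


def run_pipeline_once(datapipeline, name, unique_id, pipeline_objects,
                      delay_seconds=600, sleep=time.sleep):
    """Create, define, activate, wait, then delete a pipeline.

    datapipeline is expected to behave like a boto3 "datapipeline" client.
    """
    pipeline_id = datapipeline.create_pipeline(name=name, uniqueId=unique_id)["pipelineId"]
    try:
        result = datapipeline.put_pipeline_definition(
            pipelineId=pipeline_id, pipelineObjects=pipeline_objects
        )
        if result.get("errored"):
            raise RuntimeError(f"invalid definition: {result.get('validationErrors')}")
        datapipeline.activate_pipeline(pipelineId=pipeline_id)
        # Fixed wait, as in Step 7; it does not verify that the run has finished.
        sleep(delay_seconds)
    finally:
        # Assumption of this sketch: clean up even when an earlier step failed.
        datapipeline.delete_pipeline(pipelineId=pipeline_id)
    return pipeline_id


# Usage (needs boto3 and AWS credentials; names are placeholders):
#   import boto3
#   run_pipeline_once(boto3.client("datapipeline"), "archive-run", "archive-run-001", objects)
```

The sleep function is injectable so the flow can be exercised against a stub without actually waiting ten minutes.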
We have many other use cases similar to this one that we solve with our workflows. Our goal with custom workflows is to give you the freedom to find your own solutions to the many complications AWS puts forth. If you have a particular use case you want to work on, you can sign up with us and try your hand at it.