S3 Cost Saving: Archiving Compressed S3 Data into Glacier


Amazon S3 has been a popular storage service for many years. One of its most sought-after features is its set of storage tiers, which let you move data between classes with different pricing and access characteristics. We’ll be looking at two such tiers: S3 Standard and S3 Glacier.

Moving data from Standard to Glacier is common practice: Glacier is one of the cheapest storage tiers available, and it is built specifically for archiving.


Glacier is meant for data you don’t need immediate access to. Say you have older customer or patient records that you want to retain indefinitely; S3 Glacier is the natural fit for such scenarios.


Glacier charges you in two ways: for the data transfer into Glacier and for the storage itself, which is billed monthly. One way people try to further reduce archiving costs is by compressing the data, so that less data is transferred and stored. The only problem is that compressing your data is a messy process; Glacier has no built-in compression feature. The common workaround is to use the S3DataNode of AWS Data Pipeline and its GZIP compression option. That’s the messy part: you have to take your data out of S3, load it into a pipeline, run the compression, and push the result to Glacier, all manually. It’s quite a hassle.


We built an automated no-code workflow that folds this same process into one seamless flow of events, handling all of these tasks from a single place. With this workflow, compressing your data becomes the default way you approach archiving, and you can cut your costs in the process. A single workflow with eight nodes makes this use case a reality, with no coding and no configuration in the AWS Console.


Why and how archiving can save costs


Archiving has been around for as long as data storage itself. The idea is to store data in a way that is not meant for frequent access, originally because of the sheer number of records being archived. In Amazon’s case, archiving is a cheaper alternative to standard storage because of the trade-off it puts forward: lower storage prices in exchange for slower, costlier retrieval.


This trade-off gives you more control over how your data is organized and how much you can reduce expenses. Your job is to pick the data for archival; this usually means personal records and older data sets. You could also analyze your data’s access activity to figure out which of it is worth shipping off to Glacier.

Most of the price difference comes from the cost of API calls. S3 charges you per object transitioned from S3 to Glacier, at $0.05 per 1,000 objects. So if you have lots of small files, say 1 million files of 10 KB each (roughly 10 GB in total), Glacier storage costs only about $0.04 per month at $0.004 per GB, while the transition requests cost $50. Transitioning small files costs far more than storing them. The fix is to bundle these small files into a single archive, even without compression, and move that one object to Glacier.
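To make the gap concrete, here is a quick back-of-the-envelope calculation using the list prices quoted above (illustrative only; check your region’s current pricing):

```python
# Rough cost comparison for archiving many small objects to Glacier.
# Prices are the illustrative figures from the text, not authoritative.

FILE_COUNT = 1_000_000                 # number of small files
FILE_SIZE_KB = 10                      # size of each file in KB
TRANSITION_PRICE_PER_1000 = 0.05       # $ per 1,000 transition requests
GLACIER_PRICE_PER_GB_MONTH = 0.004     # $ per GB-month of Glacier storage

total_gb = FILE_COUNT * FILE_SIZE_KB / (1024 * 1024)

# Transitioning every file individually: one request per object.
per_file_transition_cost = FILE_COUNT / 1000 * TRANSITION_PRICE_PER_1000

# Bundling everything into a single archive first: one request in total.
bundled_transition_cost = 1 / 1000 * TRANSITION_PRICE_PER_1000

monthly_storage_cost = total_gb * GLACIER_PRICE_PER_GB_MONTH

print(f"Data volume:               {total_gb:.1f} GB")
print(f"Per-file transitions:      ${per_file_transition_cost:.2f}")
print(f"Single-bundle transition:  ${bundled_transition_cost:.5f}")
print(f"Monthly Glacier storage:   ${monthly_storage_cost:.2f}")
```

The request cost dwarfs the storage cost, which is exactly why bundling first pays off.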


How our workflow compresses data


Compression works by staging the collection of small S3 objects in a separate bucket and feeding them into the data pipeline, which bundles the small files into one large archive. The compression level ranges from 0 to 9; we use 0 here, since files like JPEG or MP4 are already compressed close to their minimum size and bundling alone is enough. Text files and log files, on the other hand, can be compressed further. The workflow uses a bit of custom code for this specific step (since we’ve already written it, you can simply adopt it as a template). We also configure the pipeline from our workflow to enable the compression, and then you just wait a while for it to run. The process itself is no different from normal ZIP compression; we’re just enabling it on a cloud service. A local sketch of the bundling idea follows below.
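For intuition, here is a minimal local sketch of the same idea using Python’s standard library. This is illustrative only and not the workflow’s actual custom code; the `staging` directory and archive name are placeholders.

```python
import tarfile
from pathlib import Path

def bundle(files, archive_path, level=0):
    """Bundle many small files into a single .tar.gz archive.

    level=0 only concatenates the files (fine for already-compressed data
    such as JPEG/MP4); level=9 compresses hardest (worthwhile for text
    and log files).
    """
    with tarfile.open(archive_path, mode="w:gz", compresslevel=level) as tar:
        for f in files:
            tar.add(str(f), arcname=Path(f).name)

# Bundle everything under ./staging into one archive object.
bundle(sorted(Path("staging").glob("*")), "archive.tar.gz", level=0)
```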


Detailed remarks about each step follow in the next section. Compressing the data reduces both the overall Glacier storage cost and the data transfer cost.


Process


I’ll expand on the workflow and what each node does. For this process, we only employ two AWS services: S3 and Data Pipeline.





Step 1: Triggering the workflow

The trigger node determines what set of actions activates the workflow. In this case, it’s a trigger from an external application; this could be a request from a web page, an app, and so on.


Step 2: A Custom Code to collate the S3 Data

This node holds custom code that collects the S3 data from your bucket and prepares it to be redirected. The sourceBucket is where the data is taken from, and the targetBucket is where it will be moved to.
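As a rough illustration of what this node does (the actual code ships with the workflow as a template; the bucket names and prefix below are placeholders), a boto3 sketch might look like this:

```python
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "source-bucket"   # placeholder: bucket holding the small files
TARGET_BUCKET = "staging-bucket"  # placeholder: staging bucket read by the pipeline

def collate(prefix=""):
    """Copy every object under `prefix` from the source bucket into the
    staging bucket, so the pipeline has one location to read from."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=SOURCE_BUCKET, Prefix=prefix):
        for obj in page.get("Contents", []):
            s3.copy_object(
                Bucket=TARGET_BUCKET,
                Key=obj["Key"],
                CopySource={"Bucket": SOURCE_BUCKET, "Key": obj["Key"]},
            )

collate()
```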


Step 3: Create DataPipeline

This action node creates the data pipeline where the S3 data will be compressed.
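Under the hood this maps to the Data Pipeline CreatePipeline API. A minimal boto3 equivalent, with placeholder names, would be:

```python
import boto3

dp = boto3.client("datapipeline")

# Create an empty pipeline shell; its definition is added in a later step.
pipeline = dp.create_pipeline(
    name="s3-glacier-archival",         # placeholder name
    uniqueId="s3-glacier-archival-01",  # idempotency token
)
pipeline_id = pipeline["pipelineId"]
```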


Step 4: A custom code to push S3 Data into the pipeline


Step 5: Pipeline Definition

This node sets the pipeline definition: it configures the compression of the S3 data moved into the pipeline and ensures that the output is transferred on to S3 Glacier.
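The workflow assembles this definition for you. As a trimmed-down sketch of what such a definition might look like when expressed through boto3’s put_pipeline_definition, following the S3DataNode GZIP option mentioned earlier: the object ids, bucket paths, roles, and EC2 settings are placeholders, and the final transition of the archive into Glacier is configured separately.

```python
import boto3

dp = boto3.client("datapipeline")
pipeline_id = "df-0123456789ABCDEF"  # placeholder: id returned by create_pipeline

# Copy the staged objects to an archive location with gzip enabled on the
# output S3DataNode.
pipeline_objects = [
    {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
        ],
    },
    {
        "id": "InputData",
        "name": "InputData",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://staging-bucket/"},
        ],
    },
    {
        "id": "OutputData",
        "name": "OutputData",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://archive-bucket/"},
            {"key": "compression", "stringValue": "gzip"},  # the GZIP option discussed above
        ],
    },
    {
        "id": "EC2Resource",
        "name": "EC2Resource",
        "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "terminateAfter", "stringValue": "30 Minutes"},
        ],
    },
    {
        "id": "CopyToArchive",
        "name": "CopyToArchive",
        "fields": [
            {"key": "type", "stringValue": "CopyActivity"},
            {"key": "input", "refValue": "InputData"},
            {"key": "output", "refValue": "OutputData"},
            {"key": "runsOn", "refValue": "EC2Resource"},
        ],
    },
]

dp.put_pipeline_definition(pipelineId=pipeline_id, pipelineObjects=pipeline_objects)
```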


Step 6: Pipeline Activation
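This node activates the pipeline, kicking off the definition registered in the previous step. In boto3 terms it corresponds to a single call (the pipeline id is a placeholder):

```python
import boto3

dp = boto3.client("datapipeline")
pipeline_id = "df-0123456789ABCDEF"  # placeholder: id from the create step

# Start executing the registered pipeline definition.
dp.activate_pipeline(pipelineId=pipeline_id)
```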


Step 7: Delay

A 600-second delay is set to allow the data transfer to happen before the next node is activated.
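If you would rather poll than rely on a fixed delay (an optional alternative, not part of the workflow itself), the pipeline state can be checked with describe_pipelines; the @pipelineState field name follows the Data Pipeline object model:

```python
import time
import boto3

dp = boto3.client("datapipeline")
pipeline_id = "df-0123456789ABCDEF"  # placeholder

def wait_for_finish(timeout=600, interval=30):
    """Poll the pipeline state until it reports FINISHED or we time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        desc = dp.describe_pipelines(pipelineIds=[pipeline_id])
        fields = desc["pipelineDescriptionList"][0]["fields"]
        state = next((f["stringValue"] for f in fields
                      if f["key"] == "@pipelineState"), "")
        if state == "FINISHED":
            return True
        time.sleep(interval)
    return False
```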


Step 8: Delete DataPipeline

This action node deletes the data pipeline after the compression and archiving are successfully done.
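The underlying call is again a one-liner in boto3 (pipeline id is a placeholder):

```python
import boto3

dp = boto3.client("datapipeline")
pipeline_id = "df-0123456789ABCDEF"  # placeholder

# Clean up: the pipeline is only needed for this one archival run.
dp.delete_pipeline(pipelineId=pipeline_id)
```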

Conclusion

We solve many other use cases similar to this one with our workflows. Our goal with custom workflows is to give you the freedom to find your own solutions to the many complications AWS puts forth. If you have a particular use case you want to work on, you can sign up with us and try your hand at it.

