S3 Select & How It Helps Protect Data Under GDPR Compliance

Before the release of S3 Select, if you needed only a specific subset of the raw data in an S3 bucket, you had to download the entire object, decompress it, and then search it for the required records. Amazon Athena helps to some extent, but it is aimed at running analytical SQL queries across large datasets (like Big Data) residing in Amazon S3. AWS S3 Select has a different use case. It works on a single object, filters the contents on the S3 side, and returns only that sieved subset, not the entire object.

Ever since AWS announced Amazon S3 Select, several introductory articles have been doing the rounds, explaining how much faster it is than a plain S3 GET. Very few spoke about its key merits. This post discusses a few benefits of S3 Select and how useful it is for data protection under GDPR compliance.

Before that, for those who are unaware of Amazon S3 Select, here is a small intro. It is a feature of Amazon S3 that filters out only the required data from an object in an S3 bucket, so your application does not have to retrieve the entire object.

The particular data you need from an object is pulled using a standard SQL expression sent through the API, an SDK, or the AWS CLI. Say, for example, there is 1 TB of CSV data in a GZIP-ed file in an S3 bucket, and you want only a specific set of records from this huge file. You can pass the SQL expression through the AWS CLI (aws s3api select-object-content) and get the required data within minutes.
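As a rough illustration of the same idea from the AWS SDK for Python (boto3), here is a minimal sketch. The bucket name, object key, and column names in the usage note are hypothetical, and the object is assumed to be a GZIP-compressed CSV file with a header row.

```python
def build_select_request(bucket, key, expression):
    """Build the keyword arguments for an S3 SelectObjectContent call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        # Assumption: the object is a GZIP-compressed CSV with a header row.
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        # Matching rows are returned as CSV text.
        "OutputSerialization": {"CSV": {}},
    }


def collect_records(event_stream):
    """Join the Records payloads found in a SelectObjectContent event stream."""
    chunks = []
    for event in event_stream:
        # Other event types (Stats, Progress, End) carry no row data.
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Decode once at the end so multi-byte characters split across chunks survive.
    return b"".join(chunks).decode("utf-8")


def run_select(bucket, key, expression):
    """Run the query against S3. Needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helpers above work without it

    s3 = boto3.client("s3")
    response = s3.select_object_content(**build_select_request(bucket, key, expression))
    return collect_records(response["Payload"])
```

A call could look like run_select("example-bucket", "exports/customers.csv.gz", "SELECT s.name, s.city FROM S3Object s WHERE s.country = 'DE'"). Only the rows and columns named in the expression travel back to the caller.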

One of the most lauded features of S3 Select is that it simplifies scanning and filtering object content into a smaller, targeted dataset and, according to AWS, can improve the performance of that work by up to 400%.

Coming back to benefits, AWS S3 Select:

#1 Integrates with other AWS services

It works with other AWS services such as AWS Lambda, which makes it easier to pull the necessary data. For example, you can invoke a simple Lambda function that runs an S3 Select API call against a set of values and returns only the selected data from the file in S3.
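A Lambda function along those lines might look like the sketch below. The event shape (bucket, key, and expression fields) is a hypothetical one chosen for illustration, the source object is assumed to be an uncompressed CSV file with a header row, and the optional s3_client parameter exists only so the handler can be exercised without AWS access.

```python
import json


def select_rows(s3_client, bucket, key, expression):
    """Run an S3 Select query and return the matching rows as dictionaries."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        # Assumption: uncompressed CSV whose first line holds the column names.
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
        # Ask for one JSON document per line so rows are easy to parse.
        OutputSerialization={"JSON": {}},
    )
    payload = b"".join(
        event["Records"]["Payload"]
        for event in response["Payload"]
        if "Records" in event
    )
    return [json.loads(line) for line in payload.decode("utf-8").splitlines() if line]


def lambda_handler(event, context, s3_client=None):
    """Lambda entry point; the event fields used here are hypothetical."""
    if s3_client is None:
        import boto3  # available in the Lambda Python runtime

        s3_client = boto3.client("s3")
    rows = select_rows(s3_client, event["bucket"], event["key"], event["expression"])
    return {"count": len(rows), "rows": rows}
```

Because the event carries a raw SQL expression, a real deployment would want to restrict who can invoke the function or build the expression from validated values instead.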

#2 Supports CSV or JSON files “with or without” GZIP compression

This means that you can selectively query a specific set of CSV/JSON data from both GZIP-compressed and uncompressed files, making the service more flexible.
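In the SDK, the format and compression choice are expressed through the InputSerialization parameter of the select call. The helper below is a small sketch of the four combinations; the function name is made up for illustration, and it assumes CSV files have a header row and JSON files hold one document per line.

```python
def input_serialization(file_format, gzipped=False):
    """Return an InputSerialization value for a CSV or JSON object."""
    if file_format == "csv":
        # Assumption: the first line of the CSV holds the column names.
        spec = {"CSV": {"FileHeaderInfo": "USE"}}
    elif file_format == "json":
        # Assumption: newline-delimited JSON, one document per line.
        spec = {"JSON": {"Type": "LINES"}}
    else:
        raise ValueError(f"unsupported format: {file_format!r}")
    spec["CompressionType"] = "GZIP" if gzipped else "NONE"
    return spec
```

The returned dictionary can be passed as InputSerialization to boto3's select_object_content, so the same query code serves compressed and uncompressed objects.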

#3 Eliminates the need for compute resources

While using Amazon S3 Select, your applications no longer have to spend their own compute resources scanning and filtering the data in an object, because that work happens on the S3 side. This is one of the primary reasons for the increase in query performance by up to 400%. As per AWS, you need to use SELECT (the SelectObjectContent API operation) instead of GET (GetObject) to take advantage of S3 Select.

#4 Accelerates Big Data querying by 5X

S3 Select pulls only the required data and, according to AWS, uses 1/40th of the CPU compared to retrieving whole objects from S3. This makes it an ideal service if you are extensively using Big Data frameworks, like Presto, Apache Hive, and Apache Spark, and are looking to offload all that unwanted data processing. Here’s the link to the AWS SlideShare, if you want to know more.

#5 Is available on AWS SDK for Ruby, not only on AWS SDK for Java and Python or AWS CLI

With the recent announcement from AWS, you can now process selected record events asynchronously with the AWS SDK for Ruby as well, with multiple usage patterns. Get more details about this announcement here.

Availability of AWS S3 Select

At the time of writing, S3 Select is available to all AWS customers. You can use the service from the AWS SDK for Java, AWS SDK for Python, AWS CLI, and now AWS SDK for Ruby. Its pricing is based on the data scanned and the data returned.
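To make the pricing model concrete, here is a small arithmetic sketch. The per-GB rates used as defaults are illustrative assumptions, not quoted prices; check the Amazon S3 pricing page for the current rates in your region.

```python
def estimate_select_cost(
    gb_scanned,
    gb_returned,
    price_per_gb_scanned=0.002,    # assumed rate in USD, for illustration only
    price_per_gb_returned=0.0007,  # assumed rate in USD, for illustration only
):
    """Estimate the data charges of S3 Select queries in USD."""
    if gb_scanned < 0 or gb_returned < 0:
        raise ValueError("data volumes must be non-negative")
    return gb_scanned * price_per_gb_scanned + gb_returned * price_per_gb_returned
```

With these assumed rates, scanning 1,000 GB and returning 10 GB comes to roughly 2 USD for scanning plus well under a cent for the returned data. The smaller the returned subset, the closer the bill gets to the scan charge alone.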

A Pro Use Case: Protecting Data Under GDPR Compliance using Amazon S3 Select & Macie

With everyone catching up on GDPR compliance lately, we thought of sharing a simple 3-step exercise to automatically protect data under GDPR using Amazon Macie and S3 Select.

1. Store your data, including sensitive customer or PCI-DSS data, in S3 buckets in an EU region.

2. Create an S3 Select pipeline on top of the S3 bucket, through which other AWS services, whether in the same region or in other regions, query only the non-sensitive data they need, as and when they need it.

3. Before the data is queried through S3 Select, use the Amazon Macie security service to validate the data that leaves S3 through S3 Select. By doing so, you can ensure that the outgoing data is not sensitive.

This way, the data pulled by cross-region services stays GDPR compliant.
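Step 2 of the exercise above can be sketched as a small guard that builds the S3 Select expression only from columns known to be non-sensitive. The column names and the SENSITIVE_COLUMNS set are hypothetical, and this check is an extra safety net on the query side, not a replacement for the Macie validation in step 3.

```python
import re

# Hypothetical list of columns that must never leave the EU bucket.
SENSITIVE_COLUMNS = frozenset({"email", "card_number", "date_of_birth"})

# Plain identifiers only, so a column name cannot smuggle in extra SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_non_sensitive_query(requested_columns, sensitive_columns=SENSITIVE_COLUMNS):
    """Build an S3 Select expression that returns only non-sensitive columns."""
    if not requested_columns:
        raise ValueError("at least one column is required")
    for column in requested_columns:
        if not _IDENTIFIER.match(column):
            raise ValueError(f"invalid column name: {column!r}")
        if column.lower() in sensitive_columns:
            raise ValueError(f"column is marked sensitive: {column!r}")
    select_list = ", ".join(f"s.{column}" for column in requested_columns)
    return f"SELECT {select_list} FROM S3Object s"
```

A service in another region would then send only the expression returned by build_non_sensitive_query(["country", "order_total"]) to S3 Select, so the sensitive columns never leave the bucket.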

Do you have any such application of AWS S3 Select?
