S3 Select & How It Helps Protect Data Under GDPR Compliance

Before the release of S3 Select, if you needed only a specific subset of raw data from an S3 bucket, you had to download the entire object, decompress it, and then search for the required records. Amazon Athena helps to some extent, but it is geared toward analyzing specific sets of data (like Big Data) residing in Amazon S3. AWS S3 Select has a different use case. It scans the object inside S3 and returns only the filtered data you asked for, not the entire dataset.

Ever since AWS announced Amazon S3 Select, several introductory articles have been making the rounds, explaining how fast it is compared to plain S3 retrieval. Very few have spoken about its key merits. This post discusses a few benefits of S3 Select and how useful it is for data protection under GDPR compliance.

Before that, for those who are unfamiliar with Amazon S3 Select, here is a short introduction. It is an add-on capability of Amazon S3 that retrieves only the required data from an object in an S3 bucket, without retrieving the entire object.

The data you need from an object is pulled using a standard SQL expression via the API or an SDK. Say, for example, there is 1 TB of CSV data in a GZIP-compressed file in an S3 bucket, and you want to query a specific subset of it. You can use the AWS CLI to run the SQL query and get the required data within minutes.
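The same kind of query can be issued from the AWS SDK for Python. The following is a minimal sketch, not a definitive implementation: it assumes a boto3 S3 client is passed in, that the object is a GZIP-compressed CSV with a header row, and the function name `select_csv_rows` is made up for illustration.

```python
def select_csv_rows(s3_client, bucket, key, expression):
    """Run an S3 Select SQL expression against a GZIP-compressed CSV object."""
    response = s3_client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        # Assumption: the first CSV line is a header, so columns can be
        # referenced by name in the SQL expression.
        InputSerialization={
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        OutputSerialization={"CSV": {}},
    )

    chunks = []
    # The response payload is an event stream; only Records events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])

    # Join before decoding so a multi-byte character split across two
    # chunks is still decoded correctly.
    return b"".join(chunks).decode("utf-8")
```

With boto3 installed and credentials configured, a call might look like `select_csv_rows(boto3.client("s3"), "example-bucket", "exports/customers.csv.gz", "SELECT s.name, s.city FROM S3Object s WHERE s.country = 'DE'")`, where the bucket, key, and column names are hypothetical.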

One of the most lauded features of S3 Select is that it simplifies scanning and filtering object content into a smaller, targeted dataset, and can improve the performance of doing so by up to 400%.

Coming back to benefits, AWS S3 Select:

#1 Integrates with other AWS services

It is interoperable with other AWS services such as AWS Lambda, which makes it easier to pull the necessary data. For example, you can invoke a simple Lambda function that runs an S3 Select API call against a set of values to get the selected data from a file in S3.
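As a rough sketch of that Lambda pattern, the handler below filters a newline-delimited JSON object by a list of values. The event shape (`bucket`, `key`, `customer_ids`), the `customer_id` field, and the `make_handler` factory are all assumptions made for illustration; the S3 client is injected so the handler can be tried without AWS access.

```python
import json


def make_handler(s3_client):
    """Build a Lambda-style handler bound to the given S3 client."""

    def handler(event, context):
        # Hypothetical event: {"bucket": ..., "key": ..., "customer_ids": [...]}
        # Quote each value as a SQL string literal, doubling single quotes.
        id_list = ", ".join(
            "'" + str(value).replace("'", "''") + "'"
            for value in event["customer_ids"]
        )
        response = s3_client.select_object_content(
            Bucket=event["bucket"],
            Key=event["key"],
            ExpressionType="SQL",
            Expression=(
                f"SELECT * FROM S3Object s WHERE s.customer_id IN ({id_list})"
            ),
            # Assumption: the object holds one JSON record per line.
            InputSerialization={"JSON": {"Type": "LINES"}},
            OutputSerialization={"JSON": {}},
        )
        payload = b"".join(
            e["Records"]["Payload"] for e in response["Payload"] if "Records" in e
        ).decode("utf-8")
        # JSON output is newline-delimited: one matching record per line.
        records = [json.loads(line) for line in payload.splitlines() if line.strip()]
        return {"count": len(records), "records": records}

    return handler
```

In a deployed function, the module would typically end with something like `handler = make_handler(boto3.client("s3"))` so that the client is created once per execution environment.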

#2 Supports CSV or JSON files “with or without” GZIP compression

This means that you can selectively query a specific set of CSV/JSON data from both GZIP-compressed and uncompressed files, making the service more flexible.
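In the SDK, the format and compression are declared through the `InputSerialization` parameter of the request. The helper below is a small sketch that guesses those settings from an object key; the extension-based rules, the assumption that JSON objects are newline-delimited, and the name `input_serialization` are illustrative choices rather than anything S3 prescribes.

```python
def input_serialization(key):
    """Guess S3 Select InputSerialization settings from an object key."""
    name = key.lower()

    # A trailing .gz means the object is GZIP-compressed.
    compression = "NONE"
    if name.endswith(".gz"):
        compression = "GZIP"
        name = name[:-3]

    if name.endswith(".csv"):
        # Assumption: CSV objects start with a header row.
        fmt = {"CSV": {"FileHeaderInfo": "USE"}}
    elif name.endswith((".json", ".jsonl")):
        # Assumption: JSON objects contain one record per line.
        fmt = {"JSON": {"Type": "LINES"}}
    else:
        raise ValueError(f"unsupported object type: {key}")

    return {**fmt, "CompressionType": compression}
```

The returned dictionary can be passed directly as the `InputSerialization` argument of `select_object_content`.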

#3 Eliminates the need for compute resources

With Amazon S3 Select, your applications no longer have to spend their own compute resources scanning and filtering the data in an object. This is one of the primary reasons for the increase in query performance by up to 400%. As per AWS, you need to issue a SELECT request instead of a GET to take advantage of S3 Select.

#4 Accelerates Big Data querying by 5X

S3 Select pulls only the required data and uses 1/40th of the CPU compared to plain S3 retrieval. This makes it an ideal service if you make extensive use of Big Data frameworks like Presto, Apache Hive, and Apache Spark and want to offload the heavy lifting of processing all that unwanted data. The AWS SlideShare presentation has more details, if you want to know more.

#5 Is available in the AWS SDK for Ruby, in addition to the AWS SDKs for Java and Python and the AWS CLI

Following a recent announcement from AWS, you can now process selected record events asynchronously with the AWS SDK for Ruby as well, with multiple usage patterns. The AWS announcement has more details.

Availability of AWS S3 Select

S3 Select is available to all AWS customers. You can use the service from the AWS SDK for Java, AWS SDK for Python, AWS CLI and now AWS SDK for Ruby. Its pricing is based on the data scanned and the data returned.
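Because pricing depends on bytes scanned and bytes returned, it can be useful to record both for every query. The event stream returned by `select_object_content` includes a Stats event with these figures. The sketch below reads it alongside the records; the function name `read_with_stats` and the shape of the returned dictionary are assumptions for illustration.

```python
def read_with_stats(event_stream):
    """Collect record bytes and scan statistics from an S3 Select event stream."""
    data = bytearray()
    stats = {}
    for event in event_stream:
        if "Records" in event:
            # Records events carry the filtered data itself.
            data.extend(event["Records"]["Payload"])
        elif "Stats" in event:
            # The Stats event reports how much data the query touched.
            details = event["Stats"]["Details"]
            stats = {
                "scanned": details["BytesScanned"],
                "returned": details["BytesReturned"],
            }
    return bytes(data), stats
```

A caller would pass in `response["Payload"]` from a `select_object_content` call and could log the two byte counts to keep track of what each query contributes to the bill.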

A Pro Use Case: Protecting Data Under GDPR Compliance using Amazon S3 Select & Macie

With everyone catching up on GDPR compliance lately, we thought of sharing a simple 3-step exercise to automatically protect data under GDPR compliance using Amazon Macie and S3 Select.

1. Store your data, including sensitive customer or PCI-DSS data, in S3 buckets in an EU region.

2. Create an S3 Select pipeline on the S3 bucket, through which other AWS services, residing in the same region or in other regions, query only the non-sensitive data they require, as and when they need it.

3. Before querying the data with S3 Select, use the Amazon Macie security service to validate the outbound data going from S3 to S3 Select. By doing so, you can ensure that the outbound data is not sensitive.

This way, the data pulled by services in other regions contains only non-sensitive fields, which helps keep those cross-region transfers GDPR compliant.
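One way to enforce step 2 in code is to build every S3 Select expression from an allow-list of non-sensitive columns, so that a query naming a sensitive field is rejected before it reaches S3. This is only a sketch under stated assumptions: the column names, the `NON_SENSITIVE_COLUMNS` set, and the `build_safe_expression` helper are hypothetical, and the Macie validation in step 3 is not covered here.

```python
# Hypothetical allow-list of columns considered safe to export.
NON_SENSITIVE_COLUMNS = {"order_id", "country", "order_total"}


def build_safe_expression(columns):
    """Build a SELECT that can only name allow-listed, non-sensitive columns."""
    if not columns:
        raise ValueError("at least one column is required")

    blocked = [c for c in columns if c not in NON_SENSITIVE_COLUMNS]
    if blocked:
        raise ValueError(f"columns not allowed for export: {blocked}")

    # Only allow-listed names reach this point, so they are safe to
    # interpolate into the SQL text.
    projection = ", ".join(f"s.{c}" for c in columns)
    return f"SELECT {projection} FROM S3Object s"
```

The resulting string can be passed as the `Expression` of an S3 Select request, while a request for a column such as `email` raises an error instead of leaving the bucket.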

Do you use AWS S3 Select in a similar way?
