Before the release of S3 Select, if you needed only a specific subset of raw data from an S3 bucket, you had to download the entire object, unzip it, and then search for the required records. Amazon Athena helps to some extent, but it is geared toward analyzing specific data sets (such as Big Data) residing in Amazon S3. AWS S3 Select has a different use case. It retrieves just the requested columns and rows from an S3 object and returns only that sieved subset, not the entire dataset.
Ever since AWS announced Amazon S3 Select, several introductory articles have been making the rounds, explaining how fast it is compared to a regular S3 retrieval. Very few have spoken about its key merits. This post discusses a few benefits of S3 Select and how useful it is for data protection under GDPR compliance.
Before that, for those unfamiliar with Amazon S3 Select, here is a short intro. It is an add-on capability of Amazon S3 that retrieves only the required data from an object in an S3 bucket, without retrieving the entire object.
The particular data you need from an object is pulled using a standard SQL expression via the API or an SDK. Say, for example, there is 1 TB of CSV data in a GZIP-ed file in an S3 bucket, and you want to selectively query a specific set of records from this huge file. You can run the SQL query from the AWS CLI and get the required data within minutes.
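To make the SQL-over-an-object idea concrete, here is a minimal Python sketch. It assumes `client` is a boto3 S3 client (for example `boto3.client("s3")`) and that the object is a GZIP-compressed CSV with a header row. The function name, bucket, key, and column names are hypothetical and only serve as illustration.

```python
def select_rows(client, bucket, key, expression):
    """Run a SQL expression against a GZIP-compressed CSV object in S3
    and return the matching rows as CSV text."""
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        # Assumption: the object is GZIP-compressed CSV with a header row.
        InputSerialization={
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        OutputSerialization={"CSV": {}},
    )
    # The response payload is a stream of events; only "Records" events
    # carry result bytes. Join before decoding so that a multi-byte
    # character split across two events is not corrupted.
    chunks = []
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    return b"".join(chunks).decode("utf-8")


# Hypothetical usage (bucket, key, and columns are made up):
# import boto3
# rows = select_rows(
#     boto3.client("s3"),
#     "my-bucket",
#     "data/customers.csv.gz",
#     "SELECT s.name, s.city FROM S3Object s WHERE s.country = 'DE'",
# )
```

Only the rows and columns named in the expression travel over the network; the decompression and filtering happen on the S3 side.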
Coming back to the benefits of AWS S3 Select:
It is interoperable with other AWS services such as AWS Lambda, which makes it easier to pull the necessary data. For example, you can invoke a simple Lambda function that runs an S3 Select API call against a set of values and returns only the selected data from the file in S3.
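A rough sketch of such a Lambda function in Python might look like the following. It assumes a boto3 S3 client, an uncompressed CSV object with a header row, and an invocation event carrying a `country` field; the event shape, the column names, and the `make_handler` factory are all hypothetical.

```python
import json


def make_handler(client, bucket, key):
    """Build a Lambda-style handler bound to one CSV object in S3."""

    def handler(event, context):
        # event["country"] is a hypothetical input field. Single quotes are
        # doubled so the value cannot break out of the SQL string literal.
        country = str(event["country"]).replace("'", "''")
        response = client.select_object_content(
            Bucket=bucket,
            Key=key,
            ExpressionType="SQL",
            Expression=(
                "SELECT s.name, s.city FROM S3Object s "
                f"WHERE s.country = '{country}'"
            ),
            InputSerialization={
                "CSV": {"FileHeaderInfo": "USE"},
                "CompressionType": "NONE",
            },
            # Ask for one JSON document per line so rows are easy to parse.
            OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
        )
        body = b"".join(
            e["Records"]["Payload"] for e in response["Payload"] if "Records" in e
        )
        rows = [json.loads(line) for line in body.decode("utf-8").splitlines() if line]
        return {"count": len(rows), "rows": rows}

    return handler
```

In a real deployment the handler would be created once at module level, for example `handler = make_handler(boto3.client("s3"), "my-bucket", "data/customers.csv")`, and the function's role would need permission to read the object.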
You can selectively query a specific set of CSV/JSON data from both GZIP-compressed and uncompressed files, which makes the service more flexible.
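In SDK terms, the format and compression are declared per request. The helper below is a small sketch that derives the boto3 `InputSerialization` argument from an object key. The file-extension convention, the header-row setting, and the choice of JSON Lines are assumptions made for illustration, not something S3 Select infers on its own.

```python
def input_serialization_for(key):
    """Guess the InputSerialization settings from a key's file extension.

    Assumed convention: *.csv, *.json, and the same names with a .gz suffix.
    """
    lower = key.lower()
    compression = "NONE"
    if lower.endswith(".gz"):
        compression = "GZIP"
        lower = lower[: -len(".gz")]

    if lower.endswith(".csv"):
        # Assumption: the first CSV line holds column names.
        fmt = {"CSV": {"FileHeaderInfo": "USE"}}
    elif lower.endswith(".json"):
        # Assumption: one JSON document per line (JSON Lines).
        fmt = {"JSON": {"Type": "LINES"}}
    else:
        raise ValueError(f"unsupported object type: {key}")

    return {**fmt, "CompressionType": compression}
```

The returned dictionary can be passed directly as the `InputSerialization` argument of `select_object_content`.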
While using Amazon S3 Select, your applications no longer have to spend compute resources scanning and filtering the data from an object. This is one of the primary reasons for the increase in query performance of up to 400%. As per AWS, you need to issue a SELECT request instead of a GET to take advantage of S3 Select.
S3 Select pulls only the required data and uses about 1/40th of the CPU compared to retrieving and filtering the full object from S3. This makes it an ideal service if you use Big Data frameworks such as Presto, Apache Hive, and Apache Spark extensively and want to offload the heavy lifting of processing all that unwanted data. Here’s the link to the AWS SlideShare, if you want to know more.
With the recent announcement from AWS, you can now process selected record events asynchronously with the AWS SDK for Ruby as well, with multiple usage patterns. Get more details about this announcement here.
S3 Select is available to all AWS customers. You can use the service from the AWS SDK for Java, AWS SDK for Python, AWS CLI and now AWS SDK for Ruby. Its pricing is based on the data scanned and the data returned.
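Because pricing depends on both the bytes scanned and the bytes returned, a per-query estimate can be sketched from the `Stats` event that the SDK emits in the response stream. The per-GB rates are deliberately left as parameters: actual prices vary by region and over time, and treating a GB as 2^30 bytes is an assumption of this sketch.

```python
GIB = 1024 ** 3  # assumption: a billed "GB" is treated as 2**30 bytes


def stats_from_events(events):
    """Return (bytes_scanned, bytes_returned) from a select event stream,
    or None if the stream carried no Stats event."""
    for event in events:
        if "Stats" in event:
            details = event["Stats"]["Details"]
            return details["BytesScanned"], details["BytesReturned"]
    return None


def estimate_cost(bytes_scanned, bytes_returned,
                  usd_per_gb_scanned, usd_per_gb_returned):
    """Estimate the S3 Select charge for one query from caller-supplied rates."""
    scanned_cost = (bytes_scanned / GIB) * usd_per_gb_scanned
    returned_cost = (bytes_returned / GIB) * usd_per_gb_returned
    return scanned_cost + returned_cost
```

The estimate shows that a selective query over a large object still pays for the full scan, while the returned-data charge shrinks with the result size. Regular request charges are not included.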
With everyone catching up on GDPR compliance lately, we thought of sharing a simple 3-step exercise to automatically protect data under GDPR using Amazon Macie and S3 Select.
1. Dump your data, including sensitive customer or PCI-DSS data, into S3 buckets in an EU region.
2. Create an S3 Select pipeline on the S3 bucket, through which other AWS services, in the same region or in other regions, query only the non-sensitive data they need, as and when they need it.
3. Before the data is queried through S3 Select, use the Amazon Macie security service to validate the outbound data going from S3 to S3 Select. By doing so, you can ensure that the outbound data is not sensitive.
This way, all the data pulled by cross-region services will be GDPR compliant.
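Step 2 of the exercise above can be sketched as a small guard that builds the S3 Select expression only from columns not flagged as sensitive. How the sensitive list is produced (for example, from Macie findings) is outside this sketch, and the column names are hypothetical.

```python
import re

# Plain identifiers only, so a column name cannot smuggle SQL into the query.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def non_sensitive_expression(requested, sensitive):
    """Build a SELECT over the requested columns, refusing sensitive ones."""
    if not requested:
        raise ValueError("at least one column is required")

    blocked = {name.lower() for name in sensitive}
    denied = [name for name in requested if name.lower() in blocked]
    if denied:
        raise PermissionError(f"sensitive columns requested: {denied}")

    for name in requested:
        if not _IDENTIFIER.match(name):
            raise ValueError(f"invalid column name: {name!r}")

    columns = ", ".join(f"s.{name}" for name in requested)
    return f"SELECT {columns} FROM S3Object s"
```

For example, with `sensitive = ["card_number", "email"]`, a request for `["name", "city"]` yields an expression that can be passed to S3 Select, while a request that includes `card_number` is rejected before any data leaves the bucket.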
Do you have any such application of AWS S3 Select?