
A Guide to S3 Batch on AWS

· 18 min read
Alex DeBrie

AWS just announced the release of S3 Batch Operations. This is a hotly-anticipated release that was originally announced at re:Invent 2018. With S3 Batch, you can run tasks on existing S3 objects. This will make it much easier to run previously difficult tasks like retagging S3 objects, copying objects to another bucket, or processing large numbers of objects in bulk.

In this post, we'll do a deep dive into S3 Batch. You will learn when, why, and how to use S3 Batch. First, we'll do an overview of the key elements involved in an S3 Batch job. Then, we'll walk through an example by doing sentiment analysis on a group of existing objects with AWS Lambda and Amazon Comprehend.


Let's get started!

Background: What is S3 Batch and why would I use it?

Amazon S3 is an object storage service that powers a wide variety of use cases. You can use S3 to host a static website, store images (a la Instagram), save log files, keep backups, and handle many other tasks.

With S3's low price, flexibility, and scalability, you can easily find yourself with millions or billions of objects in S3. At some point down the line, you may need to modify a huge number of objects in your bucket. There are a number of reasons you might need to modify objects in bulk, such as adding tags to existing objects, updating object ACLs, copying objects to another bucket, restoring archived objects from Glacier, or running custom processing on each object.

This is where S3 Batch is helpful. You can specify a list of objects you want to modify and an operation to perform on those objects. AWS will manage the scheduling and management of your job.

Elements of an S3 Batch Job

Now that we know why we would use S3 Batch, let's understand the core elements of a Batch job.

There are four core elements to an S3 Batch Operation:

  • Manifest: A file indicating which objects should be processed in a Batch job

  • Operation: The task to be performed on each object in the job

  • Report: An output file that summarizes the results of your job

  • Role ARN: The IAM role that S3 Batch assumes to perform the operation

Let's dig into each of these more closely.

Manifest

When submitting an S3 Batch job, you must specify which objects are to be included in your job. This is done by way of a manifest.

A manifest is a CSV file where each row is an S3 object in the job. Your CSV manifest must contain fields for the object's bucket and key name.

For example, I could have a CSV with the following:

mybucket,myfirstobject
mybucket,mysecondobject
mybucket,mythirdobject
...

In the example above, the first value in each row is the bucket (mybucket) and the second value is the key to be processed.

You may optionally specify a version ID for each object. An S3 Batch job may take a long time to run. If you're using versioned buckets, it's possible that some of your objects will be overwritten with a new version between the time you start the job and the time the object is processed. Specifying version IDs ensures the job operates on the exact versions you intended.

If you wanted to use version IDs, your CSV could look as follows:

mybucket,myfirstobject,version234kjas
mybucket,mysecondobject,versionjlsflj214
mybucket,mythirdobject,versionliaosper01
...

In the example above, each line contains a version ID in addition to the bucket and key names.

In addition to a CSV manifest, you can also use an S3 Inventory report as a manifest. This will process all objects in your inventory report.

Your manifest file must be stored in an S3 bucket for S3 Batch to read. When specifying a manifest file, you must include the file's ETag. An ETag is essentially a hash of the contents of a file. S3 Batch requires the ETag so it knows it is reading the exact version of the manifest you intended.
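If you don't have the ETag handy, you can fetch it with a quick boto3 call. Here's a minimal sketch, assuming your manifest lives at mybucket/manifest.csv (both names are placeholders):

import boto3

s3 = boto3.client("s3")

# head_object returns the object's metadata, including the ETag that S3 Batch expects.
response = s3.head_object(Bucket="mybucket", Key="manifest.csv")
etag = response["ETag"].strip('"')  # ETags come back wrapped in double quotes
print(etag)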

Operation

Once you've identified which objects you want to process with the manifest, you need to specify what to do to those objects via the operation.

There are five different operations you can perform with S3 Batch:

  • PUT copy object (for copying objects into a new bucket)
  • PUT object tagging (for adding tags to an object)
  • PUT object ACL (for changing the access control list permissions on an object)
  • Initiate Glacier restore
  • Invoke Lambda function

The first four operations are common operations that AWS manages for you. You simply provide a few configuration options and you're ready to go.

The last operation -- invoking a Lambda function -- gives you more flexibility. You can do anything you want -- perform sentiment analysis on your objects, index your objects in a database, delete your objects if they meet certain conditions -- but you'll need to write the logic yourself.

Let's look a little deeper at using a Lambda function in an S3 Batch job.

Using a Lambda function in S3 Batch

If you choose a Lambda function as your operation type, S3 Batch will invoke a Lambda function for each object in your manifest.

Your Lambda function should process the object and return a result indicating whether the job succeeded or failed.

Your Lambda function will be invoked with an event with the following shape:

{
  "invocationSchemaVersion": "1.0",
  "invocationId": "<longInvocationIdValue>",
  "job": {
    "id": "a2ba35a8-da5a-4a15-8c95-62d3ef611e78"
  },
  "tasks": [
    {
      "taskId": "<longTaskIdValue>",
      "s3BucketArn": "arn:aws:s3:::mybucket",
      "s3Key": "myfirstobject",
      "s3VersionId": null
    }
  ]
}

Information about the object to be processed is available in the tasks property. This is an array, but currently S3 Batch only passes in a single object per Lambda invocation. Hopefully they will allow batches of objects in a Lambda request in the future.

After performing the work to process the object in your Lambda function, you must return a specific response. The shape of the response should be as follows:

{
  "invocationSchemaVersion": "1.0",
  "treatMissingKeysAs": "PermanentFailure",
  "invocationId": "<longInvocationIdValue>",
  "results": [
    {
      "taskId": "<longTaskIdValue>",
      "resultCode": "Succeeded",
      "resultString": "Custom message here"
    }
  ]
}

Let's walk through this.

The invocationSchemaVersion is the same value as the invocationSchemaVersion on your incoming event. Similarly, the invocationId is the same as the invocationId on your event.

In the results array, you should have an entry for each element in your tasks array from the event object. Currently, there is only one task per invocation, but this could change.

The object in your results array must include a taskId, which matches the taskId from the task. You must also provide a resultCode, indicating the result of your processing.

The valid values for resultCode are:

  • Succeeded: The operation on this object succeeded.
  • TemporaryFailure: The operation on this object failed, but the failure was temporary. S3 Batch will retry the task a few times before ultimately marking it as failed.
  • PermanentFailure: The operation on this object failed in a way that is unlikely to change. S3 Batch will not retry, and the object will be marked as failed.

You may also include a resultString property which will display a message in your report about the operation.

Finally, in the top-level object, there is a treatMissingKeysAs property that indicates the result code that should be assumed for keys that are not returned in the response. You may use any of the result codes mentioned above as the default value.
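Putting the event and response shapes together, a bare-bones handler might look something like this sketch (process_object is a hypothetical helper standing in for your own logic):

import urllib.parse


def handler(event, context):
    results = []

    for task in event["tasks"]:
        bucket = task["s3BucketArn"].split(":::")[-1]
        key = urllib.parse.unquote_plus(task["s3Key"])  # keys may arrive URL-encoded

        try:
            # process_object is a hypothetical helper -- replace with your own logic.
            message = process_object(bucket, key)
            result_code, result_string = "Succeeded", message
        except Exception as exc:
            result_code, result_string = "PermanentFailure", str(exc)

        results.append({
            "taskId": task["taskId"],
            "resultCode": result_code,
            "resultString": result_string,
        })

    return {
        "invocationSchemaVersion": event["invocationSchemaVersion"],
        "treatMissingKeysAs": "PermanentFailure",
        "invocationId": event["invocationId"],
        "results": results,
    }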

Report

S3 Batch allows you to specify a summary report for the end of your job. While the manifest and operation are required elements of an S3 Batch job, the report is optional.

Your S3 Batch report is a CSV file written to S3 with a summary of your job. You may choose to have a summary of all tasks written in the report or just the failed tasks.

I would strongly recommend using a report, at least for failed tasks. It is quite helpful for figuring out where your configuration or processing went wrong. If you are using a Lambda function operation, be sure to include a resultString message in each failed task to give yourself helpful guidance on how to resolve the issue.

Role ARN

The final part of an S3 Batch job is the IAM Role used for the operation. You will pass the ARN of an IAM role that will be assumed by S3 Batch to perform your job.

Configuring this IAM role can be tricky and can cause your job to fail in opaque ways. There are two key things you need to configure on this role:

  • IAM permissions

  • Trust policy

Let's review each of these below.

IAM role permissions

The Batch job itself needs certain permissions to run the job. For example, we discussed the manifest file above that lists the objects to be processed. The manifest file is a file on S3, and the Batch job will need permissions to read that file and initialize the job.

Further, the Batch job will need permissions to perform the specified operation. If your operation is a Lambda function, it will need the lambda:InvokeFunction permission on the specified Lambda function.

If your operation is a PUT object tagging operation, it will need the s3:PutObjectTagging permission. Likewise with the PUT object ACL or other managed operations from S3.

Finally, if you have enabled a report for your Batch job, the report will be written to a specified location on S3. Your Batch job will need s3:PutObject permissions to write that file to S3.
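Putting that together, the permissions policy for a Lambda-based job with a report might look something like the sketch below. The bucket name, account ID, and function name are placeholders; scope the resources to your own manifest, function, and report location:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectVersion"],
      "Resource": "arn:aws:s3:::mybucket/*"
    },
    {
      "Effect": "Allow",
      "Action": "lambda:InvokeFunction",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:my-function"
    },
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::mybucket/reports/*"
    }
  ]
}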

Trust policy

When you have an AWS service assume an IAM role in your account, there needs to be a trust policy indicating that your IAM role can be assumed by the specified service.

When creating your IAM role, add the following trust policy to your role so that it can be assumed by S3 Batch:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "batchoperations.s3.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

If you're using CloudFormation to create your IAM role, the trust policy will go in the AssumeRolePolicyDocument property.
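For example, a minimal CloudFormation sketch of such a role might look like this (the resource name is arbitrary, and you'd still attach the permissions policy described above):

Resources:
  S3BatchRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: batchoperations.s3.amazonaws.com
            Action: "sts:AssumeRole"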

Additional configuration options

The four elements noted above are the key parts of an S3 Batch job. However, there are a few additional configuration options worth mentioning briefly.

A Batch job must have a Priority associated with it. This can be used to indicate relative priority of jobs within your account. S3 Batch will only run so many operations across your account at a time. You can expedite a new job without cancelling a long-running existing job by indicating a higher priority for the new job.

You may submit a ClientRequestToken to make sure you don't run the same job twice. If two jobs are submitted with the same ClientRequestToken, S3 Batch won't kick off a second job. This can save you from a costly accident if you're running a large job.

Finally, you may indicate that a job must be confirmed before running by setting ConfirmationRequired to "True". With this option, you can configure a job and ensure it looks correct while still requiring additional approval before starting the job.
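If you're creating jobs programmatically rather than through the console, all of these options map onto the CreateJob API. Here's a rough boto3 sketch; the account ID, ARNs, bucket names, and ETag are placeholders:

import uuid

import boto3

s3control = boto3.client("s3control")

response = s3control.create_job(
    AccountId="123456789012",  # your AWS account ID
    RoleArn="arn:aws:iam::123456789012:role/my-s3-batch-role",
    Priority=10,
    ConfirmationRequired=True,  # require manual confirmation before the job runs
    ClientRequestToken=str(uuid.uuid4()),  # guards against submitting the same job twice
    Operation={
        "LambdaInvoke": {
            "FunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:my-function"
        }
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::mybucket/manifest.csv",
            "ETag": "exampleetag",
        },
    },
    Report={
        "Enabled": True,
        "Bucket": "arn:aws:s3:::mybucket",
        "Prefix": "reports",
        "Format": "Report_CSV_20180820",
        "ReportScope": "FailedTasksOnly",
    },
)
print(response["JobId"])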

Now that we know the basics about S3 Batch, let's make it real by running a job.

Walkthrough: Running our first S3 Batch job

Alright, we've done a lot of background talk. It's time to put it into action. In this section, you will run your first S3 Batch job. We'll use the Lambda invocation operation.

Imagine that you have a bunch of text files in S3. One day, your boss asks you to detect and record the sentiment of these text files -- is the content positive, negative, or neutral?

You can use Amazon Comprehend in your Lambda function in an S3 Batch job to handle this operation. Follow the steps below to see how.

Why use the Serverless Framework?

In this walkthrough, we'll use the Serverless Framework to deploy our Lambda functions and kick off an S3 Batch job. Before we do that, let's see why we'll use the Serverless Framework.

The Serverless Framework is a tool for developing and deploying AWS Lambda functions as part of larger serverless applications. Packaging and deploying your Lambda functions can require a lot of scripting and configuration. The Serverless Framework offloads a lot of that boilerplate and makes it easy to focus on the important work.

If you want to know more about the Serverless Framework, check out my previous post on getting started with Serverless in 2019.

In addition to assisting with AWS Lambda development, the Serverless Framework has a nice plugin architecture that allows you to extend its functionality.

I've created the serverless-s3-batch plugin to show you how this works. This plugin helps with a few things:

  • Provisioning your IAM role for an S3 Batch job

  • Reducing the boilerplate configuration around starting a job.

We'll use the plugin in the steps below.

Setting up

Note: I'm assuming your environment is configured with AWS credentials. If you need help with this, read this guide or sign up for my serverless email course above.

Let's get set up with the Serverless Framework and our sample project.

  1. Install the Serverless Framework

    If you haven't used Serverless before, install it using NPM:

    $ npm install -g serverless
  2. Create a Serverless service from the example project

    Run the following command to bring the example service onto your local machine:

    $ sls create --template-url https://github.com/alexdebrie/serverless-s3-batch/tree/master/examples/lambda-comprehend
  3. Install service dependencies

    Change into the service directory and install the dependencies:

    $ cd lambda-comprehend
    $ npm i
  4. Create your S3 bucket

    We'll need an S3 bucket for this exercise. If you don't have one, you can create one using the AWS CLI:

    $ aws s3 mb s3://<bucket-name>

    Be sure to provide your own value for <bucket-name>.

    Once you've created your bucket, export it as an environment variable:

    $ export S3_BATCH_BUCKET=<your-bucket-name>

Deploying our function and role

Now that we've done some basic set up, let's deploy our Lambda function.

The core file in a Serverless Framework project is the serverless.yml file. This is a configuration file that describes the infrastructure you want to create, from AWS Lambda functions, to API Gateway endpoints, to DynamoDB tables.

Our serverless.yml file looks as follows:

service: lambda-comprehend

plugins:
  - serverless-s3-batch

custom:
  s3batch:
    manifest: s3://${env:S3_BATCH_BUCKET}/manifest.txt
    report: s3://${env:S3_BATCH_BUCKET}/reports
    operation: detectSentiment

provider:
  name: aws
  runtime: python3.7
  stage: dev
  region: us-east-1
  iamRoleStatements:
    - Effect: Allow
      Action: "s3:GetObject"
      Resource: "arn:aws:s3:::${env:S3_BATCH_BUCKET}/*"
    - Effect: Allow
      Action: "comprehend:DetectSentiment"
      Resource: "*"

functions:
  detectSentiment:
    handler: handler.detect_sentiment

The key parts for us to note right now are:

  • The plugins block. This is where we register the serverless-s3-batch plugin that will extend the functionality of the Framework.

  • The functions block. This is where we configure our AWS Lambda function that will call Amazon Comprehend with each object in our Batch job manifest.

  • The provider block. This is where we describe additional information about our function, such as the AWS region to deploy to and some IAM policy statements to add to our Lambda function's role. In our service, we've given the function the ability to read objects in our S3 bucket and to call DetectSentiment in Comprehend.

If you want to see our function logic, you can look at the code in the handler.py file.
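The actual code ships with the example project, but the core of a sentiment-analysis handler is roughly what you'd expect. Here's a sketch of the processing step, which plugs into the request/response contract described earlier (the analyze helper name is illustrative, not the literal handler.py):

import boto3

s3 = boto3.client("s3")
comprehend = boto3.client("comprehend")


def analyze(bucket, key):
    """Illustrative helper: read a text object from S3 and return its sentiment."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    # DetectSentiment accepts up to 5,000 bytes of UTF-8 text.
    response = comprehend.detect_sentiment(Text=body, LanguageCode="en")
    return response["Sentiment"]  # POSITIVE, NEGATIVE, NEUTRAL, or MIXED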

To deploy this service, run the following command:

$ sls deploy

This is doing two main things. First, it's deploying and configuring the AWS Lambda function that you've declared in the functions block. Second, the serverless-s3-batch plugin is creating an IAM role that can be used for your S3 Batch job.

After a few minutes, your deploy should be complete and you'll see output like the following:

Serverless: Stack update finished...
Service Information
service: lambda-comprehend
stage: dev
region: us-east-1
stack: lambda-comprehend-dev
resources: 6
api keys:
  None
endpoints:
  None
functions:
  detectSentiment: lambda-comprehend-dev-detectSentiment
layers:
  None

Creating an S3 Batch job

Now that our function is deployed and our IAM statement is ready, let's create our job.

As mentioned in the overview section above, each S3 Batch job needs a manifest file that specifies the S3 objects that are to be included in the job.

In this example, there are some example files in the files/ directory. Since we're doing sentiment analysis, we've got a few different types of files.

The first, think-of-love.txt, is a famous Shakespeare piece on love:

If I should think of love
I'd think of you, your arms uplifted,
...

On the other end of the spectrum, we have the text of Alfalfa's love letter to Darla, as dictated by Buckwheat, in the 1990s movie The Little Rascals. Its contents are a little less uplifting:

Dear Darla, I hate your stinking guts. You make me vomit. You're scum between my toes! Love, Alfalfa.

Finally, we have Marc Antony's speech in Julius Caesar:

Friends, Romans, countrymen, lend me your ears;
I come to bury Caesar, not to praise him.
...

Let's create a manifest and upload both the manifest and these files to our S3 bucket. You can do so by running:

make upload

If you set your S3_BATCH_BUCKET environment variable, it should upload the files.
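For reference, the generated manifest is just a two-column CSV like the ones we saw earlier. Assuming the files are uploaded under a files/ prefix (the exact key names below are illustrative -- check the generated manifest.txt), it would look something like this:

your-bucket-name,files/think-of-love.txt
your-bucket-name,files/alfalfa-letter.txt
your-bucket-name,files/friends-romans-countrymen.txt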

Let's take one more look at our serverless.yml file. In the custom block, we've got a configuration section for the serverless-s3-batch plugin. Its contents are:

# serverless.yml
custom:
  s3batch:
    manifest: s3://${env:S3_BATCH_BUCKET}/manifest.txt
    report: s3://${env:S3_BATCH_BUCKET}/reports
    operation: detectSentiment

It's specifying the location of the manifest, where we want the report to be saved, and the Lambda function to use in our operation.

We can start our job with the following command:

npx sls s3batch create

You should see output that your job was created as well as a link to view your job in the AWS console.

Serverless: S3 Batch Job created. Job Id: 8179eff2-5b4d-4be9-9238-4eaa86711b9c
Serverless:
Serverless: View in browser: https://console.aws.amazon.com/s3/jobs/8179eff2-5b4d-4be9-9238-4eaa86711b9c

Open the link in your browser to check on your job.
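If you'd rather check on the job from code than in the console, the DescribeJob API returns the same status information. Here's a quick boto3 sketch, using the job ID from above and a placeholder account ID:

import boto3

s3control = boto3.client("s3control")

# Returns the job's status and a summary of how many tasks succeeded or failed.
response = s3control.describe_job(
    AccountId="123456789012",
    JobId="8179eff2-5b4d-4be9-9238-4eaa86711b9c",
)
job = response["Job"]
print(job["Status"], job["ProgressSummary"])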

Viewing the report

When you view the job in your browser, you should see a screen like this:

[Screenshot: the S3 Batch job overview in the AWS console]

It includes helpful information like the time it was created and the number of objects in your manifest.

If you wait a minute, you can check the details of your job in the Status section. It lets you know how much of your job is complete and how many tasks succeeded and failed.

[Screenshot: the job's Status section, showing completion percentage and task counts]

Finally, the report section at the bottom includes a link to your report:

[Screenshot: the Report section with a link to the completion report]

Your report will be located at <your-bucket>/<your-prefix>/job-<job-id>/results/<file>.csv. The results will be in CSV format, as shown below:

[Screenshot: the completion report CSV showing per-object results]

The CSV file includes a row for each object in your manifest, along with the result for each object -- whether the task succeeded or failed.

Finally, it will include a result string, if provided in the Lambda function. In our Lambda function, we returned the sentiment analysis result from Amazon Comprehend. You can see that the love poem was rated POSITIVE, the Little Rascals piece was rated NEGATIVE, and Marc Antony's speech was rated NEUTRAL.

That's it! You ran your first S3 Batch job. Feel free to tweak the parameters or to experiment on some of your own files.

Conclusion

In this post, we learned about S3 Batch. It's a powerful new feature from AWS that enables some interesting use cases on your existing S3 objects.

First, we covered the key elements of S3 Batch, including the manifest, the operation, the report, and the role ARN. Then, we walked through a real example using the Serverless Framework and Amazon Comprehend.

If you have questions or comments on this piece, feel free to leave a note below or email me directly.