Validating data on Amazon S3

This is a very short tutorial on using goodtables.io to continuously validate data hosted on Amazon S3.

Pre-requisites

Instructions

Setting up Amazon S3 bucket and read-only user

  1. Create a bucket on S3 to hold your data
    • Create the bucket in the us-west-2 region. This is a current limitation of goodtables.io that we’re working to fix.
  2. Create a new IAM user. This user will be used by goodtables.io to read your bucket.
    • Make sure you take note of the AWS Access Key ID, AWS Secret Access Key, and the User ARN.
  3. Go to your bucket’s overview page, click on the Permissions tab, and click the Bucket Policy link. The policy must grant the IAM user the following permissions:
    • s3:ListBucket: To list the bucket’s contents
    • s3:GetObject: To read the bucket’s files
    • s3:GetBucketPolicy, s3:PutBucketPolicy, s3:GetBucketLocation, and s3:PutBucketNotification: To set up the AWS Lambda function that notifies goodtables.io when a new file is added

The final bucket policy should look like:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "statement1",
            "Effect": "Allow",
            "Principal": {
                "AWS": "IAM_USER_ARN"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation",
                "s3:GetBucketPolicy",
                "s3:PutBucketPolicy",
                "s3:PutBucketNotification"
            ],
            "Resource": "arn:aws:s3:::BUCKET_NAME"
        },
        {
            "Sid": "statement2",
            "Effect": "Allow",
            "Principal": {
                "AWS": "IAM_USER_ARN"
            },
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::BUCKET_NAME/*"
        }
    ]
}

Replace IAM_USER_ARN and BUCKET_NAME with your IAM user’s ARN and your bucket’s name.
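
If you prefer not to edit the JSON by hand, the substitution can be scripted. The following is a minimal sketch using only the Python standard library. The function name build_bucket_policy and the example ARN and bucket name are hypothetical, chosen for illustration; the statements mirror the policy shown above.

```python
import json


def build_bucket_policy(iam_user_arn, bucket_name):
    """Return the bucket policy from this tutorial as a dict, with both placeholders filled in."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions: listing and notification setup.
                "Sid": "statement1",
                "Effect": "Allow",
                "Principal": {"AWS": iam_user_arn},
                "Action": [
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                    "s3:GetBucketPolicy",
                    "s3:PutBucketPolicy",
                    "s3:PutBucketNotification",
                ],
                "Resource": bucket_arn,
            },
            {
                # Object-level action: reading the files themselves.
                "Sid": "statement2",
                "Effect": "Allow",
                "Principal": {"AWS": iam_user_arn},
                "Action": ["s3:GetObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical values; replace them with your own ARN and bucket name.
    policy = build_bucket_policy(
        "arn:aws:iam::123456789012:user/goodtables-reader",
        "my-data-bucket",
    )
    print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the Bucket Policy editor. Building the policy as a dict and serializing it with json.dumps avoids quoting mistakes that are easy to make when editing the text directly.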

Setting up goodtables.io

  1. Log in to goodtables.io using your GitHub account.
  2. Go to the Manage Sources page, click on the Amazon tab, and then click the plus sign to the right of the Filter input.
  3. Fill in the Access Key Id, Secret Access Key, and Bucket Name fields with the credentials of the IAM user and the name of the bucket you created in the previous section.

We’re all set. You can now upload data to your bucket. Whenever a file is added or modified, goodtables.io will automatically validate it if it is a tabular file (CSV, XLS, ODS, …) or a tabular data package.
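
Uploads can also be scripted. The sketch below assumes the third-party boto3 library is installed and that your own AWS credentials (with write access to the bucket) are available in the environment. The IAM user created above cannot be used for this, because its policy grants no upload permission. The helper names and the TABULAR_SUFFIXES set are hypothetical, and the set covers only the formats named explicitly above.

```python
from pathlib import Path

# Only the formats named explicitly in this tutorial; that list is not exhaustive.
TABULAR_SUFFIXES = {".csv", ".xls", ".ods"}


def tabular_files(directory):
    """Return the files directly inside `directory` with a tabular suffix, sorted by path."""
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() in TABULAR_SUFFIXES
    )


def upload_tabular_files(directory, bucket_name):
    """Upload each tabular file to the bucket, using the file name as the object key."""
    import boto3  # third-party; imported here so tabular_files works without it

    s3 = boto3.client("s3")  # credentials are resolved from the environment
    for path in tabular_files(directory):
        s3.upload_file(str(path), bucket_name, path.name)
```

Calling upload_tabular_files("data", "my-data-bucket") would upload every matching file in a local data directory; both arguments here are placeholder values.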

Next steps