Detect Sensitive Data (PII) with Amazon Macie

Park Sehun
May 19, 2022

Protecting sensitive data is becoming more critical in every industry, whatever compliance regime you have to follow. Amazon Macie is a fully managed data security and data privacy service that uses machine learning and pattern matching to discover and help protect your sensitive data.

Macie classifies data by analyzing S3 buckets and objects and detecting PII (Personally Identifiable Information). It also evaluates your buckets against security best practices, such as whether they are publicly accessible.

It can scan many kinds of files, such as RDS snapshots exported to S3, Word documents, and plain text. You can add custom regular expressions so that Macie detects data formats specific to your organization. Lastly, jobs can run on a schedule, and findings can trigger alerts through Amazon EventBridge (formerly CloudWatch Events).

Demo:

First, I prepared PII data sets in my S3 bucket. The files contain fake credit card numbers and HKID (Hong Kong identity card) numbers.
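If you want to build a similar test file yourself, here is a minimal sketch for generating fake card numbers. It assumes the scanner only takes a card number seriously when it passes the Luhn checksum, which is my understanding of how Macie's managed credit card identifier works. The helper names (`luhn_check_digit`, `luhn_valid`, `fake_card_number`) and the default `4` prefix are my own choices for illustration, not anything from Macie.

```python
import random


def luhn_check_digit(partial: str) -> str:
    """Return the digit that makes `partial + digit` pass the Luhn check."""
    total = 0
    # Walk right to left. The rightmost digit of `partial` sits next to
    # the check digit, so it is the first one to be doubled.
    for i, ch in enumerate(reversed(partial)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)


def luhn_valid(number: str) -> bool:
    """Check a full number (including its check digit) with Luhn."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:  # every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def fake_card_number(rng: random.Random, prefix: str = "4", length: int = 16) -> str:
    """Build a random, Luhn-valid number that is NOT a real card."""
    body_len = length - len(prefix) - 1
    body = prefix + "".join(str(rng.randrange(10)) for _ in range(body_len))
    return body + luhn_check_digit(body)


if __name__ == "__main__":
    rng = random.Random(2022)
    for _ in range(5):
        print(fake_card_number(rng))
```

Write the generated numbers to a text or CSV file and upload it to the bucket you plan to scan.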

Then go to Amazon Macie in the AWS console and enable the service. You will see the summary and your buckets, but nothing has been scanned yet.

Let’s create a ‘Job’. Macie runs only over S3, so if you want to scan RDS, you need to export the RDS snapshots to S3 first, and the same goes for log files.

A Macie job can be scheduled to run regularly if compliance requires you to detect PII on a recurring basis and remove it from logs or databases as soon as possible.
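I created the job in the console, but the same thing can be scripted. Below is a rough sketch using boto3's `macie2` client and its `create_classification_job` call. The account ID, bucket name, and job name are placeholders, and `build_job_request` is a hypothetical helper of mine that only assembles the request, so you can inspect it before anything is sent to AWS.

```python
import uuid


def build_job_request(account_id, bucket_names, name,
                      schedule=None, sampling_percentage=100):
    """Assemble keyword arguments for macie2.create_classification_job.

    `schedule` is None for a one-time job, or "daily" / "weekly" / "monthly".
    """
    request = {
        "clientToken": str(uuid.uuid4()),  # idempotency token
        "name": name,
        "jobType": "ONE_TIME",
        "samplingPercentage": sampling_percentage,
        "s3JobDefinition": {
            "bucketDefinitions": [
                {"accountId": account_id, "buckets": list(bucket_names)}
            ]
        },
    }
    frequencies = {
        "daily": {"dailySchedule": {}},
        # The day values below are arbitrary picks for this sketch.
        "weekly": {"weeklySchedule": {"dayOfWeek": "MONDAY"}},
        "monthly": {"monthlySchedule": {"dayOfMonth": 1}},
    }
    if schedule is not None:
        if schedule not in frequencies:
            raise ValueError(f"unknown schedule: {schedule!r}")
        request["jobType"] = "SCHEDULED"
        request["scheduleFrequency"] = frequencies[schedule]
    return request


def create_job(request):
    """Send the request to Macie. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without it

    client = boto3.client("macie2")
    return client.create_classification_job(**request)["jobId"]


if __name__ == "__main__":
    # Placeholder account ID and bucket name; replace with your own.
    req = build_job_request("111122223333", ["my-pii-demo-bucket"],
                            "daily-pii-scan", schedule="daily")
    print(req)
    # create_job(req)  # uncomment once Macie is enabled in the account
```

Keeping the request builder separate from the API call makes it easy to review or version the job definition before running it.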

There are only three objects in this bucket, so the job will be free. In practice, though, Macie is commonly run over RDS database snapshots or log files to detect PII, so AWS shows an estimated cost before the job starts and you are not surprised after it runs.

You can also create a custom data identifier if the scan needs to cover data specific to your region, industry, and so on (e.g. Hong Kong passport numbers).
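As a rough illustration of a custom data identifier, the sketch below tests a regular expression locally and then prepares a request for boto3's `create_custom_data_identifier`. The pattern is an assumption on my part (one letter, `H` or `K`, followed by eight digits) and is only a stand-in for a Hong Kong passport number, so verify the real format before relying on it. The keywords and identifier name are also made up for the example.

```python
import re

# ASSUMED format for illustration only: 'H' or 'K' followed by 8 digits.
HK_PASSPORT_REGEX = r"\b[HK][0-9]{8}\b"


def find_matches(text: str) -> list:
    """Try the pattern locally before handing it to Macie."""
    return re.findall(HK_PASSPORT_REGEX, text)


def build_identifier_request() -> dict:
    """Keyword arguments for macie2.create_custom_data_identifier."""
    return {
        "name": "hk-passport-number",  # hypothetical name
        "description": "Sketch of a Hong Kong passport number identifier",
        "regex": HK_PASSPORT_REGEX,
        # Macie only reports a match if one of these words appears nearby.
        "keywords": ["passport", "travel document"],
        "maximumMatchDistance": 50,
    }


def create_identifier(request: dict) -> str:
    """Register the identifier. Needs boto3 and AWS credentials."""
    import boto3

    client = boto3.client("macie2")
    return client.create_custom_data_identifier(**request)["customDataIdentifierId"]


if __name__ == "__main__":
    print(find_matches("Passport no. K12345678, issued in Hong Kong"))
    # create_identifier(build_identifier_request())
```

Macie supports only a subset of regular expression syntax, so it is worth testing the pattern in the console against sample data as well.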

Lastly, go to ‘Findings’ to see any sensitive data Macie detected during or after the scan.

It reports 4410 credit card numbers and 1000 names found in a single file.
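The same counts can be pulled out programmatically. Here is a small sketch that tallies detections per type from a finding, assuming the finding follows the shape returned by boto3's `get_findings` (`classificationDetails` → `result` → `sensitiveData`). `SAMPLE_FINDING` is a hand-written, trimmed example that mirrors the numbers above, not real API output.

```python
from collections import Counter

# Hand-written, trimmed sample that mimics one sensitive data finding.
SAMPLE_FINDING = {
    "type": "SensitiveData:S3Object/Multiple",
    "classificationDetails": {
        "result": {
            "sensitiveData": [
                {
                    "category": "FINANCIAL_INFORMATION",
                    "detections": [{"type": "CREDIT_CARD_NUMBER", "count": 4410}],
                },
                {
                    "category": "PERSONAL_INFORMATION",
                    "detections": [{"type": "NAME", "count": 1000}],
                },
            ]
        }
    },
}


def summarize_detections(finding: dict) -> dict:
    """Return {detection type: total count} for one finding."""
    totals = Counter()
    # Policy findings have no classificationDetails, hence the .get() chain.
    result = finding.get("classificationDetails", {}).get("result", {})
    for group in result.get("sensitiveData", []):
        for detection in group.get("detections", []):
            totals[detection["type"]] += detection.get("count", 0)
    return dict(totals)


def fetch_findings(max_results: int = 50) -> list:
    """Fetch up to `max_results` findings. Needs boto3 and AWS credentials."""
    import boto3

    client = boto3.client("macie2")
    ids = client.list_findings(maxResults=max_results)["findingIds"]
    if not ids:
        return []
    return client.get_findings(findingIds=ids)["findings"]


if __name__ == "__main__":
    print(summarize_detections(SAMPLE_FINDING))
```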

Alert

Since you can schedule a Macie job to run daily, weekly, and so on, you will also want an alert whenever PII (sensitive data) is detected.

Go to EventBridge and set up a rule with an event pattern (Macie alerts and findings) and the targets to invoke when an event matches that pattern. (Targets can be SNS, SQS, etc.)
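For reference, the rule can also be created with boto3's `events` client. This is a sketch under a few assumptions: the rule name and SNS topic ARN are placeholders, and `matches` is a deliberately simplified local check of top-level fields only, not how EventBridge evaluates patterns in full.

```python
import json


def macie_finding_pattern() -> dict:
    """Event pattern that matches findings published by Macie."""
    return {"source": ["aws.macie"], "detail-type": ["Macie Finding"]}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified check: every pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


def create_alert_rule(rule_name: str, sns_topic_arn: str) -> None:
    """Create the rule and attach an SNS target. Needs boto3 and credentials."""
    import boto3

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(macie_finding_pattern()),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "macie-alert-sns", "Arn": sns_topic_arn}],
    )


if __name__ == "__main__":
    print(json.dumps(macie_finding_pattern(), indent=2))
    # Placeholder names; replace with your own rule name and topic ARN.
    # create_alert_rule("macie-pii-alert",
    #                   "arn:aws:sns:ap-east-1:111122223333:macie-alerts")
```

Note that the SNS topic's access policy has to allow `events.amazonaws.com` to publish to it, otherwise the rule matches but no notification arrives.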

In conclusion, Macie is not a free service, but because it is fully managed, you can get rid of the heavy overhead of PII detection and enjoy an automated service. You can also reduce cost by sampling: instead of scanning 10 million log files, scan 10K of them as a sample, on the assumption that users behave in consistent patterns and so the logs do too.
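One simple way to do that sampling is to pick a random subset of object keys and copy only those into the bucket Macie scans. The sketch below shows just the selection step; `sample_keys` and the key naming are hypothetical, and the copy to S3 is left out.

```python
import random


def sample_keys(keys, sample_size, seed=None):
    """Return a random subset of `keys` without duplicates."""
    keys = list(keys)
    if sample_size >= len(keys):
        return keys  # nothing to trim
    # A fixed seed makes the sample reproducible between runs.
    return random.Random(seed).sample(keys, sample_size)


if __name__ == "__main__":
    # Stand-in for a real listing of log objects.
    all_keys = [f"logs/app-{i:07d}.log" for i in range(100_000)]
    chosen = sample_keys(all_keys, 100, seed=42)
    print(len(chosen), chosen[:3])
```

Macie jobs also have their own sampling percentage setting, which is the simpler option when a whole-number percentage of the bucket is good enough.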

So how does your company traditionally detect PII? Or is it doing nothing?
