Monitoring cloud data with Trillian
Over the past few weeks, we’ve been working with Google’s open source project, Trillian, to show how new tools can help make digital services more trustworthy. Dave has written an introduction to this project, explaining why we're doing this work.
This post goes into more detail for developers and other specialists who might be interested in experimenting with Trillian. It describes how we deployed Trillian on Amazon Web Services (AWS) and used it to create a verifiable log that monitors data written to an S3 storage bucket.
Trillian enables the creation of tamper-evident data sets. This means that it’s possible to check whether data in a dataset stored in Trillian has been changed or removed. We’ve written before about how Trillian can be used to ensure data integrity, and about tools we’ve built for it in the past.
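To give a feel for what tamper-evident means in practice, here is a minimal sketch in Go of a Merkle tree hash in the style of RFC 6962, the scheme used by Certificate Transparency logs. It is not Trillian's code, and the function names (leafHash, nodeHash, merkleRoot) are illustrative. The point it shows is that the root hash commits to every leaf, so changing or removing any entry produces a different root.

```go
package main

import "crypto/sha256"

// leafHash hashes a leaf with the 0x00 domain-separation prefix from RFC 6962.
func leafHash(data []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, data...))
}

// nodeHash hashes two child hashes with the 0x01 prefix from RFC 6962.
func nodeHash(left, right [32]byte) [32]byte {
	buf := make([]byte, 0, 65)
	buf = append(buf, 0x01)
	buf = append(buf, left[:]...)
	buf = append(buf, right[:]...)
	return sha256.Sum256(buf)
}

// merkleRoot computes the Merkle tree hash over the leaves in order.
func merkleRoot(leaves [][]byte) [32]byte {
	switch n := len(leaves); n {
	case 0:
		// The hash of an empty tree is the hash of the empty string.
		return sha256.Sum256(nil)
	case 1:
		return leafHash(leaves[0])
	default:
		// Split at the largest power of two strictly less than n.
		k := 1
		for k*2 < n {
			k *= 2
		}
		return nodeHash(merkleRoot(leaves[:k]), merkleRoot(leaves[k:]))
	}
}
```

Anyone who recorded an earlier root can recompute it from the data they are shown and compare the two. If an entry has been altered or dropped, the roots will not match.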
How Trillian is normally used
Trillian is normally deployed as a gRPC server with a database storage layer. This core is paired with a ‘personality’ - an additional layer of code that adds specific functionality.
For example, Trillian is used to create the verifiable logs that underpin the Certificate Transparency project. In this case, the Certificate Transparency personality would perform additional, project-specific functions, like checking if the item to be included in the log is a valid certificate.
How we used Trillian
We wanted to minimise the set-up and maintenance of AWS infrastructure needed to run Trillian, so we decided to use Trillian as a library in a serverless context. For our use case, we used Trillian to monitor account activity data produced by CloudTrail.
We created most AWS resources using Terraform, a tool that allowed us to write our infrastructure set-up in configuration files that can be shared as a part of our project.
Our set-up consisted of the following:
CloudTrail to log account activity.
A corresponding S3 bucket where CloudTrail logs are stored. This S3 bucket is what Trillian will monitor.
A Lambda function, triggered by CloudTrail adding events to the S3 bucket. It takes the event object, processes it into a leaf, then queues it for inclusion in Trillian’s logs.
An Aurora database cluster running in serverless mode. This is where Trillian’s logs will be stored.
A temporary EC2 instance to do initial provisioning of the databases.
A second Lambda function which will run once per day. It will take batches of queued leaves and add them to Trillian’s logs. It will then sign the log and return the signed log root.
An S3 bucket to store signed log roots. Signed log roots are what Trillian uses to provide “proofs” about the data, such as verifying that nothing in the log has been changed or removed.
We also wrote a command-line tool for verifying signed log roots.
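The two Lambda steps above, turning an S3 event into a leaf and later adding queued leaves in batches, can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: objectEvent is a hypothetical, trimmed-down view of an S3 notification record, and the leaf type, buildLeaf and batch are made-up names. A real function would receive the AWS SDK's event types and hand each batch to Trillian.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
)

// objectEvent is a hypothetical, simplified view of an S3 "object created"
// record. The real notification carries more fields than these.
type objectEvent struct {
	Bucket string `json:"bucket"`
	Key    string `json:"key"`
	ETag   string `json:"etag"`
	Size   int64  `json:"size"`
}

// leaf pairs the bytes to be logged with a hash that identifies them,
// so that the same event queued twice can be recognised as a duplicate.
type leaf struct {
	Value        []byte
	IdentityHash [32]byte
}

// buildLeaf serialises the event and hashes the serialised bytes.
func buildLeaf(ev objectEvent) (leaf, error) {
	value, err := json.Marshal(ev)
	if err != nil {
		return leaf{}, err
	}
	return leaf{Value: value, IdentityHash: sha256.Sum256(value)}, nil
}

// batch splits queued leaves into groups of at most size leaves each.
func batch(leaves []leaf, size int) [][]leaf {
	if size <= 0 {
		size = 1
	}
	var out [][]leaf
	for start := 0; start < len(leaves); start += size {
		end := start + size
		if end > len(leaves) {
			end = len(leaves)
		}
		out = append(out, leaves[start:end])
	}
	return out
}
```

Serialising a struct gives a stable field order, so the same event always produces the same leaf bytes and the same identity hash.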
Decisions and issues
Overall, using Trillian as a library worked well. However, we ran into a few small issues that taught us more about how Trillian could work best in a serverless context:
Some of Trillian’s basic operations weren’t straightforward to run when it’s used as a library
We created a temporary EC2 instance to provision the database and initialise the logs. We explored provisioning the database from our Lambda functions, but eventually decided it would be cleaner and easier to do this manually.
Because the EC2 instance would only be required for initial database provisioning, we didn’t describe it in our Terraform code. Instead, we created it via the AWS console and terminated it when database provisioning was complete.
Integration between different AWS resources wasn’t always seamless
AWS has over 100 products available on its platform. They are designed to work together, and this modularity allowed us to quickly develop a custom environment for deploying Trillian into.
However, this integration isn’t always seamless. For example, we created the Aurora database cluster in a Virtual Private Cloud (VPC). Because the database was running in serverless mode, it was not possible to assign it a public IP address. Any resources that interact with the database cluster also need to be created in the VPC, necessitating additional resources like internet gateways and route tables. This meant we had to run our Lambda functions in the same VPC, which comes with a similar set of limitations.
AWS doesn’t currently provide a cryptographically secure hash of objects stored in S3
The S3 event we are storing in Trillian and the response from listing the bucket’s contents only provide a hash intended to be used as an ETag in cache invalidation. The algorithm used is MD5, which is known to be broken and isn’t particularly fast either. It would be helpful if Amazon provided a cryptographically secure hash of objects.
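A client that needs a stronger digest can compute one itself after fetching the object. The following is a minimal sketch in Go, not code from the project. The names sha256Hex and sha256HexString are illustrative, and in practice the reader would be the body of the downloaded object rather than an in-memory string.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"io"
	"strings"
)

// sha256Hex streams r through SHA-256 and returns the digest as hex,
// so large objects do not need to be held in memory.
func sha256Hex(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// sha256HexString is a convenience wrapper for in-memory content.
func sha256HexString(s string) (string, error) {
	return sha256Hex(strings.NewReader(s))
}
```

The cost of this approach is that every object has to be downloaded in full before it can be hashed, which is why a digest supplied by the storage service itself would be preferable.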
Running Trillian’s storage layer in serverless mode required some changes to Trillian’s codebase
An Aurora cluster running in serverless mode auto-scales depending on the load on your application. This is advantageous in terms of billing because you are only paying for what you need.
We noticed that our database cluster always had open connections from our Lambda functions, despite our functions only running intermittently throughout the day. It turned out that Go’s database/sql package keeps up to two idle connections open by default, so our Lambda functions were holding connections to the database open between execution runs.
We fixed this by adding a flag to Trillian that allows for setting the number of idle connections. We could then specify that no idle connections should be kept open to allow our database cluster to scale to zero.
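The underlying behaviour can be reproduced with Go's standard database/sql package alone. The sketch below does not show Trillian's flag or its storage code. It uses a made-up in-memory driver (fakeConn, fakeDriver and fakeConnector are stand-ins that exist only so the example runs without a database) and a hypothetical helper, idleAfterUse, to show what SetMaxIdleConns(0) changes.

```go
package main

import (
	"context"
	"database/sql"
	"database/sql/driver"
	"errors"
)

// fakeConn, fakeDriver and fakeConnector are stand-ins so the sketch runs
// without a real database. They are not part of Trillian or any real driver.
type fakeConn struct{}

func (fakeConn) Prepare(string) (driver.Stmt, error) { return nil, errors.New("not supported") }
func (fakeConn) Close() error                        { return nil }
func (fakeConn) Begin() (driver.Tx, error)           { return nil, errors.New("not supported") }

type fakeDriver struct{}

func (fakeDriver) Open(string) (driver.Conn, error) { return fakeConn{}, nil }

type fakeConnector struct{}

func (fakeConnector) Connect(context.Context) (driver.Conn, error) { return fakeConn{}, nil }
func (fakeConnector) Driver() driver.Driver                        { return fakeDriver{} }

// idleAfterUse borrows one connection from a fresh pool, returns it, and
// reports how many connections the pool then keeps open and idle.
// A negative maxIdle leaves the database/sql default in place.
func idleAfterUse(maxIdle int) (int, error) {
	db := sql.OpenDB(fakeConnector{})
	defer db.Close()
	if maxIdle >= 0 {
		// Zero tells the pool to close connections as soon as they are returned.
		db.SetMaxIdleConns(maxIdle)
	}
	if err := db.PingContext(context.Background()); err != nil {
		return 0, err
	}
	return db.Stats().Idle, nil
}
```

With the default settings, idleAfterUse(-1) reports one idle connection still held by the pool. With idleAfterUse(0) it reports none, which is the state a serverless database needs to see before it can scale down.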
Our work is available as an open-source project on GitHub. You can explore the code and start building verifiable logs that monitor information stored in S3 buckets.
In the next few weeks we’ll publish a post about some potential use cases for Trillian on AWS. We want other developers to experiment with the tools and we’d love to know what you think. Write to us at email@example.com.