Introduction
More organisations are starting to recognise the value of data. Getting started, however, requires storage, tools and skills. With these requirements, and especially where personal data is concerned, security-related questions start popping up. AWS offers services that address each of these problems. In this blog post we touch on our top 5 security-related topics when it comes to managing a data lake or running a data warehouse on AWS.
Top 5 data and security topics on AWS
Setting up a data lake in AWS allows you to get started quickly with the managed services that AWS offers, at a scale that has virtually no limits. However, there are a few essential things that you need to get right before you start your cloud journey with your data. For each topic below we describe the risk and give a high-level approach to mitigating or avoiding it.
1. GDPR and legal compliance
The risk: Data is considered an important and valuable asset, so data collection often means collecting as much as possible in order to get the maximum value out of it. However, collecting all the data you can get your hands on will most likely result in storing personally identifiable information (PII). Regulations like GDPR require you to comply with rules such as the right to be forgotten or the requirement to remove data after a customer leaves the platform.
How to avoid with cloud: Be selective about the data that is collected and stored. By removing personal data immediately in the ingestion phase you greatly reduce the complexity of having to remove it at a later stage. Either remove or anonymise data as it comes in. Be careful with metadata: although it might not be personal on its own, combining it with other data could still allow you to pinpoint specific people. Encryption can also help with granting selective access or implementing selective removal of data, since data encrypted under a dedicated key becomes unreadable once that key is deleted.
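To make the "remove or anonymise at ingestion" idea concrete, here is a minimal sketch using only the Python standard library. The field names (email, phone, customer_id) and the function scrub_record are hypothetical and stand in for your own schema. Note that a keyed hash is pseudonymisation rather than true anonymisation: whoever holds the key can still re-link records, so the result generally still counts as personal data.

```python
import hashlib
import hmac

# Hypothetical field names; replace them with the ones in your own schema.
DROP_FIELDS = {"email", "phone"}
PSEUDONYMISE_FIELDS = {"customer_id"}


def scrub_record(record: dict, secret_key: bytes) -> dict:
    """Return a copy of the record with direct identifiers dropped or pseudonymised."""
    cleaned = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            # Never store these fields in the data lake at all.
            continue
        if field in PSEUDONYMISE_FIELDS:
            # A keyed hash keeps records joinable without exposing the raw value.
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            cleaned[field] = digest.hexdigest()
        else:
            cleaned[field] = value
    return cleaned
```

Calling scrub_record on every incoming record before it is written means the raw identifiers never land in storage, which is what keeps later removal requests manageable.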
2. Data Generation (data integrity and validation)
The risk: How data is generated and validated determines the integrity of the data. Always beware of who is generating the data and whether this was done correctly before using it for further analysis. Dashboarding or training models with dirty data can lead to incorrect or dangerous results.
How to avoid with cloud: AWS allows you to apply fine-grained authorisation rules with services like AWS Lake Formation to stay in control of who has access to the data. Encryption adds an extra layer of access control, because reading the data also requires permission to use the key, and should be applied at all times. Additionally, validate data as it moves through the pipeline and scan it for outliers. Make sure these checks are part of your data pipeline and review their results regularly. Above all, check the content of the data itself, not just who can reach it.
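As an illustration of pipeline validation and outlier scanning, the sketch below shows one possible shape for such checks in plain Python. The function names and the record layout are assumptions, and the modified z-score (based on the median absolute deviation) is just one common outlier heuristic, not a method prescribed by AWS.

```python
from statistics import median


def validate_record(record: dict, required: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field, expected_type in required.items():
        if field not in record or record[field] is None:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


def find_outliers(values: list, threshold: float = 3.5) -> list:
    """Return the indices of values whose modified z-score exceeds the threshold."""
    med = median(values)
    # Median absolute deviation: a spread measure that outliers barely influence.
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]
```

A pipeline step could reject or quarantine records for which validate_record returns problems, and raise an alert when find_outliers flags values in a batch, rather than silently passing dirty data on to dashboards or models.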
3. (Near) Real time security compliance
The risk: Let’s say you design for security and treat it like job zero. That’s great! But your work is not over. We have seen clean environments get horribly cluttered after 6 months in use: new people join who don’t adhere to the standards, actions are performed outside of best practices, or rules are straight up ignored. This happens when the security officer has no control or lacks proper dashboards.
How to avoid with cloud: Familiarise yourself with event-based security and with dashboarding based on configuration changes in your environment. Use automated remediation on AWS with out-of-the-box configuration rules (AWS Config managed rules, for example), or implement custom rules that align with your specific security policy. And finally: get to grips with the output these rules produce and monitor it through real-time security dashboards.
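The idea of a custom rule that reacts to configuration changes can be sketched in plain Python as follows. The event shape (name, encryption_enabled, public_access) is a simplified, hypothetical stand-in and not the actual schema of any AWS service; only the COMPLIANT and NON_COMPLIANT labels loosely mirror the wording AWS Config uses for rule results.

```python
def evaluate_bucket(config: dict) -> dict:
    """Evaluate one (hypothetical) bucket configuration snapshot against two rules."""
    findings = []
    if not config.get("encryption_enabled", False):
        findings.append("encryption at rest is disabled")
    if config.get("public_access", False):
        findings.append("bucket is publicly accessible")
    return {
        "resource": config.get("name", "unknown"),
        "compliance": "NON_COMPLIANT" if findings else "COMPLIANT",
        "findings": findings,
    }


def non_compliant_resources(change_events: list) -> list:
    """Evaluate every change event and keep only the results that need attention."""
    results = [evaluate_bucket(event) for event in change_events]
    return [r for r in results if r["compliance"] == "NON_COMPLIANT"]
```

In a real setup the evaluation would run on every configuration change, with the non-compliant results feeding a dashboard or triggering an automated remediation step.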
4. Encryption
The risk: Every time you don’t encrypt something (or remove encryption), you potentially give others access to your data. It also threatens the integrity of your data, as it is now open to manipulation. Leaving data unencrypted increases the risk of data leakage, and this applies to data at rest as well as data in transit. And as we will discuss in the final topic: do you know who is managing the encryption keys?
How to avoid with cloud: Encrypting your data, both in transit and at rest, should be the default on AWS. For most services it is as simple as flipping a switch, and encryption is considered best practice in the cloud. Make sure that you apply fine-grained access policies to your keys to control who can access the data. What happens after you launch your services, however, is up to the people who handle them. More on that in the last part.
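For encryption in transit, one widely used pattern is an S3 bucket policy that denies every request not made over TLS, using the aws:SecureTransport condition key. The sketch below only builds the policy document in Python; the bucket name example-data-lake and the function name are placeholders, and attaching the policy to a bucket is left out.

```python
import json


def tls_only_bucket_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies any request not sent over TLS."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "example-data-lake" is a placeholder bucket name.
policy_json = json.dumps(tls_only_bucket_policy("example-data-lake"), indent=2)
```

Because the statement is an explicit deny, it takes precedence over any allow elsewhere, so plain HTTP access stays blocked even if someone later grants broad permissions on the bucket.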
5. (External) Access control
The risk: Encryption alone does not protect your data without proper access control. Giving users broad permissions, like administrator access, might allow them to access all the data that you have available. This isn’t just true for individual users but also for the permissions you pass to services and the permissions to use those services.
How to avoid with cloud: Always maintain the principle of least privilege: only give people, services and processes the minimum access they actually need to perform the job, and limit access to your encryption keys. This way no one gets access to services or data that they should not be working with. Additionally, you can set up services like CloudTrail to record the API calls made by your users and monitor their behaviour. Log and monitor usage of the encryption keys so that you can fine-tune your permissions and analyse access patterns.
Final words
In this post we highlighted some areas that require extra attention when you want to store (sensitive) data. Most of these items can be applied to any environment. However, as more and more data moves to the cloud, we feel that they become even more important. Using the right AWS services and features is essential to keep your data secure and your environment compliant.
In a follow-up post we will go into more depth on some of these topics and talk about possible solutions for them.