Anonymise transparently with MKYD
With the increasing use of personal data in business, marketing, healthcare, education and other fields, data privacy is becoming a key issue for companies, universities, research institutes and governments alike. Data anonymisation provides a relatively safe way to use data without compromising an individual’s privacy. At the same time, there is increasing demand for transparency in these processes. Anonymisation and transparency can appear to be opposing goals, yet it is often necessary to combine them.
Let us imagine the case of a hospital that wants to share its patients’ data with a university that is testing an experimental diagnostic method. The hospital clearly wants to keep its patients’ information confidential. However, if the diagnosis provided by the university’s research is positive, the hospital would want to identify the patients concerned so that it can inform them.
This article summarises how MakeYourData (MKYD), an anonymisation tool, achieves this apparently impossible combination using hashing.
Hashing is the process of converting a string of characters into another string of characters, usually of fixed length. For example, using the SHA-1 hashing algorithm:
- John → 5753a498f025464d72e088a9d5d6e872592d5f91
- Mary → 94f85995c7492eec546c321821aa4beca9a3e2b1
These seemingly random strings have very specific properties including:
- When a specific input is entered into the hashing algorithm, we always obtain the same output. For example, John will always be 5753a498f025464d72e088a9d5d6e872592d5f91.
- If even a small part of the input changes, the output changes completely.
- There is no easy way to reverse the algorithm. When the input is unpredictable, the best an attacker can do is try candidate inputs and check whether each one produces the correct hash. For example, SHA-256 has 2^256 possible outputs, so a brute-force search for an input matching a given hash is expected to need on the order of 2^256 attempts, roughly 10^77, which is far beyond what any computer could try.
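The first two properties are easy to check for yourself. The following is a minimal sketch using Python's standard hashlib module; the helper name sha256_hex is our own and is not part of MakeYourData.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as hexadecimal text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Same input, same output: hashing is deterministic.
first = sha256_hex("John")
second = sha256_hex("John")

# Changing a single letter gives an unrelated-looking digest.
changed = sha256_hex("john")

# Count how many of the 64 hexadecimal characters differ.
differing = sum(a != b for a, b in zip(first, changed))

print("deterministic:", first == second)
print("digest length:", len(first))
print("characters that differ after one-letter change:", differing)
```

The digest always has the same length, whatever the length of the input.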
In the case of the hospital described above, a solution would be to produce a hash based on some of the original information and ship it with the anonymised data. The university’s research team returns a diagnosis for each individual in the dataset, but it never sees the individuals’ identity information, only their hashes. The hospital can then recompute the hashes from its original data and identify the individuals with a positive diagnosis in order to inform them. Note that only the owner of the original non-anonymised dataset (in this case the hospital) can perform such an operation, as the hash itself cannot be reversed.
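To make the round trip concrete, here is a small sketch of the idea in Python. The records, the field names and the choice of hashing the name together with the birth date are all made up for illustration; they do not describe how MakeYourData builds its hash.

```python
import hashlib


def record_hash(name: str, birth_date: str) -> str:
    """Hash the identifying fields of one patient (hypothetical field choice)."""
    key = f"{name}|{birth_date}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Original dataset, kept by the hospital (made-up records).
patients = [
    {"name": "John Smith", "birth_date": "1980-02-14", "blood_pressure": 128},
    {"name": "Mary Jones", "birth_date": "1975-09-30", "blood_pressure": 141},
]

# Dataset shared with the university: identifiers removed, hash added.
shared = [
    {
        "hash": record_hash(p["name"], p["birth_date"]),
        "blood_pressure": p["blood_pressure"],
    }
    for p in patients
]

# The university returns one diagnosis per hash (simulated here with a threshold).
results = {row["hash"]: row["blood_pressure"] > 140 for row in shared}

# The hospital recomputes the hashes from its own data to find whom to inform.
lookup = {record_hash(p["name"], p["birth_date"]): p for p in patients}
to_inform = [lookup[h]["name"] for h, positive in results.items() if positive]

print(to_inform)
```

The university only ever handles the rows in shared, while the matching step uses the hospital's own copy of the original data.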
With MakeYourData, this anonymisation and hashing can be performed with a few clicks. We describe the process below:
1. Load your data into MakeYourData. The data can be in delimited text format or come from several databases, including SQL Server, Oracle and Snowflake.
2. Configure the anonymisation. In this case we will simply remove all information about the patient that is not needed to obtain a diagnosis. For example, the name and address of the patient are not needed; only the medical information is. This way, the external research team receives no information that directly identifies the patient.
3. The hospital, however, does need to be able to identify the individuals in the data. For that, we will ask MakeYourData to add a hash:
The generated dataset looks like the following:
The hash is calculated for each individual and is enough to trace a row back to the original record, provided one has the original dataset at hand.
However, if a hash is based on a small set of possible values, one can reasonably brute-force it by trying them all. For example, if the hash is based on people’s names, an attacker could take a list of the 10,000 most popular names and try them all very quickly. For common inputs, precomputed tables of hashes are already publicly available.
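The weakness described here can be sketched in a few lines of Python. The candidate list below is a tiny made-up stand-in for a real list of popular names, and the target hash is computed inside the example so that it is self-contained.

```python
import hashlib


def sha1_hex(text: str) -> str:
    """Return the SHA-1 digest of a UTF-8 string as hexadecimal text."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


# A plain, unsalted hash as it might appear in a shared dataset.
leaked_hash = sha1_hex("Mary")

# Stand-in for a list of the most popular names.
candidate_names = ["James", "John", "Linda", "Mary", "Robert"]


def dictionary_attack(target, candidates):
    """Hash every candidate and return the one that matches, if any."""
    for candidate in candidates:
        if sha1_hex(candidate) == target:
            return candidate
    return None


recovered = dictionary_attack(leaked_hash, candidate_names)
print(recovered)
```

The attack never reverses the hash; it simply guesses inputs until one matches, which is fast when the set of plausible inputs is small.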
MakeYourData provides steps to make the hashing process even more secure with the introduction of random noise and passwords when the hash is created.
Hashing with salt
The random noise mixed into the input of a hash is also known as “salt”. Adding salt improves the security of hashes because the noise becomes part of what is hashed at the moment the hash is created. In our example above, trying out names on their own will no longer give the correct hash: the noise is missing, and even a small difference in the input results in a completely different hash. For this to hold, the salt must stay with the owner of the original dataset and not be shared with the recipient.
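As an illustration, the sketch below prepends a random salt to the value before hashing it. This is one common way of salting and is only an assumption made for the example; the article does not specify how MakeYourData combines the salt with the data.

```python
import hashlib
import secrets


def salted_hash(value: str, salt: bytes) -> str:
    """Hash the salt followed by the value (illustrative scheme)."""
    return hashlib.sha256(salt + value.encode("utf-8")).hexdigest()


# Random salt, generated once and kept by the data owner.
salt = secrets.token_bytes(16)

with_salt = salted_hash("Mary", salt)
without_salt = hashlib.sha256("Mary".encode("utf-8")).hexdigest()

# An attacker hashing plain names does not reproduce the salted value.
print("plain guess matches:", with_salt == without_salt)

# The owner, who knows the salt, can recompute the same hash at any time.
print("owner can recompute:", with_salt == salted_hash("Mary", salt))
```

The same salt has to be reused for every row so that the owner can later recompute and match the hashes.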
Hashing with passwords
Adding a password improves security by making the hash much harder to guess. In addition, it can act as a signature when matching the data: only a person who knows the password can recalculate the hashes and therefore match each line with its original record.
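One standard way to build a hash that depends on a secret is HMAC, available in Python's hmac module. The sketch below uses it to show the idea; it is an assumption chosen for illustration, not a description of MakeYourData's internals, and the password shown is a placeholder.

```python
import hashlib
import hmac


def keyed_hash(value: str, password: str) -> str:
    """Compute an HMAC-SHA256 of the value, keyed by the password."""
    return hmac.new(
        password.encode("utf-8"),
        value.encode("utf-8"),
        hashlib.sha256,
    ).hexdigest()


published = keyed_hash("Mary", "hospital-secret")

# With the right password the hash can be recomputed and matched.
right = keyed_hash("Mary", "hospital-secret")
print("right password matches:", hmac.compare_digest(published, right))

# With any other password the result is different, even for the same name.
wrong = keyed_hash("Mary", "a-guess")
print("wrong password matches:", hmac.compare_digest(published, wrong))
```

In practice a long, random password gives the most protection, since a short one could itself be guessed.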
Conclusion
Hashing is a powerful technique that allows anonymisation while retaining transparency in a secure way. MakeYourData implements this technique and makes it easily accessible to business end users, enabling them to share their data and make the most of it. If you want to know more, or to download a free trial and test MKYD first hand, visit www.argusa.ch/mkyd or contact us at info@argusa.ch.