Data masking and anonymization are fundamental aspects of data protection. These techniques transform the information in a dataset so that it can no longer be tied to the people it describes. Anonymization can take different forms depending on the algorithm used: some algorithms substitute certain data for others, some hide certain data completely, and some alter values so that the original dataset becomes very hard to reconstruct. To show how each algorithm works and where it is useful, we detail the main data masking techniques below.
In our examples, we will start with the following dataset, containing a name and a salary:
Name: Brown – Salary: 95000
Name: Smith – Salary: 125000
Substitution algorithms: maintaining a realistic appearance
When a substitution algorithm is used, some of the information in the original dataset is replaced with alternative information. The replacement values look realistic, but they no longer reveal the identity of the people in the original dataset. In our example, the new data would be as follows:
Name: Green – Salary: 95000
Name: Jones – Salary: 125000
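As an illustration, substitution can be sketched as a lookup table that gives each real name a distinct realistic replacement. This is a minimal sketch, not a reference implementation: the replacement pool, the function name build_substitution_map and the seed parameter are all assumptions made for the example.

```python
import random

# Hypothetical pool of realistic replacement surnames (illustrative only).
REPLACEMENT_POOL = ["Green", "Jones", "Taylor", "Walker", "Hughes"]

def build_substitution_map(names, pool, seed=None):
    """Assign each distinct real name a distinct replacement from the pool."""
    rng = random.Random(seed)
    unique_names = sorted(set(names))
    # sample() draws without replacement, so no two names share a substitute.
    replacements = rng.sample(pool, k=len(unique_names))
    return dict(zip(unique_names, replacements))

records = [
    {"name": "Brown", "salary": 95000},
    {"name": "Smith", "salary": 125000},
]

mapping = build_substitution_map([r["name"] for r in records], REPLACEMENT_POOL, seed=42)
# Only the name is substituted; the salary is carried over unchanged.
masked = [{"name": mapping[r["name"]], "salary": r["salary"]} for r in records]
print(masked)
```

In this sketch the pool must contain at least as many names as there are distinct real names and should not contain any of the real names. The mapping itself is sensitive, since anyone who holds it can reverse the substitution.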
Randomized algorithms: shuffling the data
With this algorithm, the characters of each value in the chosen column are randomly shuffled. This makes the original value hard to recognize, although short values offer only limited protection. Based on the example dataset, we could obtain the following result:
Name: Worbn – Salary: 95000
Name: Miths – Salary: 125000
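The shuffle can be sketched in a few lines. The helper name shuffle_characters is an assumption for this example, and the choice to lowercase the letters and then capitalize the first one simply mirrors the look of the sample output above.

```python
import random

def shuffle_characters(value, rng):
    """Return the letters of value in random order, first letter capitalized."""
    chars = list(value.lower())
    rng.shuffle(chars)  # in-place random permutation of the letters
    return "".join(chars).capitalize()

records = [
    {"name": "Brown", "salary": 95000},
    {"name": "Smith", "salary": 125000},
]

rng = random.Random()  # pass a seed here only if reproducible output is needed
masked = [
    {"name": shuffle_characters(r["name"], rng), "salary": r["salary"]}
    for r in records
]
print(masked)
```

A random permutation can occasionally return the letters in their original order, so a fuller implementation might reshuffle in that case.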
Numerical variation algorithms: reproducing a result representative of the original dataset
Using a number and date variation algorithm, it is possible to create a fictitious dataset based on the numerical information in the original dataset. Each value is changed by a random amount within a chosen range (e.g. +/- 10%), which keeps the results realistic while making the original values much harder to trace. This could give the following result in our example:
Name: Brown – Salary: 102600
Name: Smith – Salary: 112500
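The variation step can be sketched as follows, assuming a uniform random change of at most 10% in either direction. The function name vary_number and the rounding to whole units are assumptions for this example.

```python
import random

def vary_number(value, max_pct, rng):
    """Shift value by a random percentage drawn from [-max_pct, +max_pct]."""
    factor = 1 + rng.uniform(-max_pct, max_pct)
    return round(value * factor)

records = [
    {"name": "Brown", "salary": 95000},
    {"name": "Smith", "salary": 125000},
]

rng = random.Random()
# Names are kept; only the numeric column is perturbed.
masked = [
    {"name": r["name"], "salary": vary_number(r["salary"], 0.10, rng)}
    for r in records
]
print(masked)
```

With a 10% range, a salary of 95000 ends up between 85500 and 104500, which is consistent with the 102600 shown in the example.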
Redaction algorithms: artificially replacing data
To hide a field entirely, it is possible to use a redaction algorithm. This replaces the real data with a constant or random unrelated string. In other words, it is a substitution algorithm in which the replacement does not attempt to look realistic. This could give the following result in our example:
Name: xxxxx – Salary: 95000
Name: xxxxx – Salary: 125000
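Redaction with a constant string is the simplest case to sketch. The helper name redact and the placeholder value are assumptions chosen to match the example above.

```python
def redact(value, placeholder="xxxxx"):
    """Ignore the real value entirely and return a fixed placeholder."""
    return placeholder

records = [
    {"name": "Brown", "salary": 95000},
    {"name": "Smith", "salary": 125000},
]

masked = [{"name": redact(r["name"]), "salary": r["salary"]} for r in records]
print(masked)
```

Using a placeholder of fixed length, rather than one "x" per original character, also avoids revealing how long the original value was.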
Masking algorithms: keeping a usable database
Closely related to redaction, the masking algorithm performs a partial redaction in which some of the information is retained during anonymization. In our example, the result could be:
Name: Bxxxx – Salary: 95000
Name: Sxxxx – Salary: 125000
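Partial masking can be sketched by keeping a prefix and replacing the remaining characters. The function name mask_keep_prefix and its keep and mask_char parameters are assumptions for this example.

```python
def mask_keep_prefix(value, keep=1, mask_char="x"):
    """Keep the first `keep` characters and mask every remaining character."""
    hidden = max(len(value) - keep, 0)
    return value[:keep] + mask_char * hidden

records = [
    {"name": "Brown", "salary": 95000},
    {"name": "Smith", "salary": 125000},
]

masked = [
    {"name": mask_keep_prefix(r["name"]), "salary": r["salary"]}
    for r in records
]
print([m["name"] for m in masked])  # ['Bxxxx', 'Sxxxx']
```

How many characters to keep is a trade-off: the more that are retained, the more usable the data remains, but the easier it becomes to guess the original value.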
Customized algorithms: to meet more specific needs
Sometimes, the algorithms listed earlier are not sufficient or do not meet specific requirements. In such cases, custom algorithms can be built, generally on request. For example, a company may need to swap certain values between rows to make the data anonymous. In our example, this could result in:
Name: Brown – Salary: 125000
Name: Smith – Salary: 95000
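One possible way to sketch such a swap is to rotate the values of one column by one position, so that each row receives the value of the next row. The function name rotate_column and the rotation approach are assumptions for this example; a custom algorithm could equally use a random permutation.

```python
def rotate_column(records, field, shift=1):
    """Move each value of `field` to another row by rotating the column."""
    values = [r[field] for r in records]
    if not values:
        return []
    shift %= len(values)
    rotated = values[shift:] + values[:shift]
    # Build new records so the originals are left untouched.
    return [{**r, field: v} for r, v in zip(records, rotated)]

records = [
    {"name": "Brown", "salary": 95000},
    {"name": "Smith", "salary": 125000},
]

masked = rotate_column(records, "salary")
print(masked)
# [{'name': 'Brown', 'salary': 125000}, {'name': 'Smith', 'salary': 95000}]
```

Because the values are only moved, column-level statistics such as the total or the average salary stay the same as in the original dataset.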
We have seen that there are many different data masking and anonymization algorithms, and that each of them produces a very different dataset. They do not all mask information in the same way, which is what allows an organization to choose the technique that best fits its specific needs and constraints.