IBM i DevOps TechTalk
Test Data Management – Part 2 Data Protection #9
by the experts at ARCAD
What are the biggest misconceptions when dealing with test data? What is driving the protection of data outside of production? What are the challenges when you anonymize data?
In this episode, Alan Ashley explains different techniques of data protection, regulations for protecting PII, and why DOT Anonymizer is the best solution to protect your PII.
The Story Behind the Mic: Podcast Transcription
R.B. – Welcome to IBM i DevOps TechTalk, where we discuss key topics and questions with ARCAD experts. I’m Ray Bernardi and I’ll be your host today. I’ll be speaking with Alan Ashley. He’s a senior Solutions architect here at ARCAD Software. We’ll be discussing the importance of the management and protection of your test data. So Alan, in some of our previous discussions, we’ve talked about test data management. You’ve talked about data anonymization and so on. Can you give us an idea of what some of the biggest misconceptions are when dealing with test data management and the protection of the test data?
A.A. – Sure. So as I’ve done some of the research on this and as I’ve studied our products and other products that are out there, one of the things that really comes to mind is that people confuse encryption and masking, assuming that they’re kind of doing the same thing. With encryption, you have a key and there’s a door, and you can open the door and see the data. With masking or anonymization, you open the door and the data is totally different. That’s one of the big misconceptions, because a lot of applications use encryption or encryption at rest to protect the data, and that’s really from a production standpoint. Now another one is anonymization and pseudonymization.
It goes along the same lines: pseudonymized data can be hidden behind a token or a key. It’s not really anonymized. It may look like it, but it’s not. Another one, and this is kind of tricky when you talk about PII, is that there’s sensitive and then there’s non-sensitive. My date of birth by itself, for example, is irrelevant.
It’s PII data, but it’s irrelevant because there’s nothing else in context. When you start adding context, then it becomes sensitive. If you had my name in the same record with my date of birth, my Social Security number, and my address, then it becomes sensitive. If it was just a date of birth and my state, it’s not really relevant.
So those are some of the misconceptions about test data management and the protection of the data that really come up when we start talking to customers and looking at what products do and what customers really want out of their products.
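The difference Alan draws between pseudonymization and anonymization can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how DOT Anonymizer or any other product works internally: the `TokenVault` class and the `anonymize` function are hypothetical names invented for this example.

```python
import secrets


class TokenVault:
    """Pseudonymization (hypothetical example): values are swapped for
    tokens, but a lookup table still links each token to the real value."""

    def __init__(self):
        self._value_to_token = {}
        self._token_to_value = {}

    def pseudonymize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value not in self._value_to_token:
            token = "TOK-" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def reidentify(self, token: str) -> str:
        # Anyone holding the vault can get the original value back.
        return self._token_to_value[token]


def anonymize(value: str) -> str:
    """Anonymization (hypothetical example): replace the value with a random
    one and keep no mapping, so there is nothing to reverse."""
    return "ANON-" + secrets.token_hex(8)


vault = TokenVault()
token = vault.pseudonymize("Henry Jones")
print(token, "->", vault.reidentify(token))  # the token leads back to the person
print(anonymize("Henry Jones"))              # no path back to the original
```

The pseudonymized value may look anonymous, but as long as the vault exists the person can be re-identified, which is why it still needs protecting.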
R.B. – So why do we need to be worried about non-production data?
A.A. – When you start building out your test data and your test data management aspect of things, you’re pulling that data from production. Now, as a former system admin for a couple of decades, I know production data is very protected. There are so many rules over it. There are audits. There are regulations. But as soon as that data comes off and becomes part of your test data on your dev system or your QA system, or maybe even a training system, that non-production data kind of gets forgotten about.
The problem is it’s still production data, just not in a production environment. So now you have to start thinking about whether your developers actually have access to this data. Are they allowed to have access? Is the data going offshore? Maybe it’s protected data and can’t be seen outside of the US. As soon as you go into these non-production environments, that often happens, and people don’t think about that.
And that has been one of the things that has come up when I speak at COMMON about this. I ask a question right off the bat: where do you get your data from for your test systems, for your dev systems? And they all go, production. Is it the same data? And they’re like, yes. So I ask, do the people that work on your test environment have access to production?
Are they allowed to see the data? And they’re like, no, not really. I say, you realize you have PII data that they can look at, and their eyes get real big. And so that’s the reason we need to start thinking about this non-production data and protecting the PII within it.
R.B. – So then is there something beyond that driving why we need to protect this non-production data, this test data? What are the forces here that are making us want to do this?
A.A. – We can thank our friends over in the EU for really kicking this off. Now, if we go back in history, there are various documents around the digital age, and even going back to newspapers. There’s an article from the Supreme Court on this, back in the early 90s, late 80s, where it starts talking about some of this protection of people’s information.
Now, at the time, it wasn’t really PII as we know it today. But like I said, this really started coming about with GDPR coming out of the EU. And from there, it has kind of exploded across the globe, because anybody that deals with anything in Europe has to follow those rules. And then we now have rules out of California, the CCPA, we have Law 25 coming out of Quebec in Canada.
They’re all following rules very similar to those that come out of the GDPR. And still, they’re looking at your production primarily. But what’s going to happen is this is going to start filtering down, because auditors are going to start finding out where this test data comes from. Now, with this protection of data, from either non-production or production itself, there are things like the right to be forgotten.
This is where you’re basically expunged from a production database. Now, there are some neat little tricks with this because you don’t want to lose the metadata that goes with it because that’s something that businesses can really use. So you kind of have to fake the person in there or anonymize them. But this goes back to our friends in the EU with GDPR.
They really started kicking this down the road. And now we’re all following along and trying to keep up with those rules.
R.B. – So it’s pretty obvious that we need to be concerned about both production and non-production data. Our test data, if you will. What can ARCAD do to help?
A.A. – Here at ARCAD, we do have a product called DOT Anonymizer. It is part of our test data management suite of tools that goes along with DOT Extract, which we’ve talked about in the past. But one of the neat things that we’ve done with DOT Anonymizer is it was actually born within ARCAD for GDPR. So we were already thinking about this protection.
So this product was already in development when GDPR came around. So we were able to jump right in and start helping customers, including ourselves internally for this. Now, one of the nice things that comes with our tool DOT Anonymizer is the methods that we can use to anonymize. It could be something as simple as just random.
It just puts random stuff in there. You can use Groovy scripts. One of my favorites is using regex, regular expressions, for that data. Now all that sounds very manual, that you’re having to go through and select this field and say, I want to do this. This is where DOT Anonymizer really helps things out.
It has the data detection part of it. So you can go through and define what you want it to look at, and it can go through and say, this meets these rules. Would you like to apply it here? So maybe it’s an email address, and it can apply the email rule to it.
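As an illustration of the random and regex approaches Alan mentions, here is a minimal Python sketch of regex-driven masking. It is an assumption-laden example, not DOT Anonymizer’s rule syntax: the patterns, the `mask_text` function, and the `example.com` replacement addresses are all made up for this illustration, and the patterns are deliberately simple.

```python
import random
import re

# Simplified, illustrative patterns; real-world rules would need to be stricter.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def mask_text(text: str, rng: random.Random) -> str:
    """Replace anything that looks like an SSN or an email address."""

    def scramble_digits(match: re.Match) -> str:
        # Keep the NNN-NN-NNNN layout but swap every digit for a random one.
        return re.sub(r"\d", lambda _: str(rng.randrange(10)), match.group(0))

    def fake_email(match: re.Match) -> str:
        # Replace the whole address with a random mailbox on a reserved domain.
        return f"user{rng.randrange(100000):05d}@example.com"

    text = SSN_PATTERN.sub(scramble_digits, text)
    text = EMAIL_PATTERN.sub(fake_email, text)
    return text


record = "SSN 123-45-6789, contact henry.jones@corp.example"
print(mask_text(record, random.Random()))
```

Keeping the original layout means downstream programs that validate field formats continue to work against the masked data.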
R.B. – So it actually finds sensitive data automatically?
A.A. – Yes, you tell it where to look. And based on the rules that you give it, it can say this looks like a Social Security number. It may not be a Social Security number. It could be something as simple as an account number that looks very similar. But it’s going to come up and say, you said to look at this with 85% accuracy.
This was a 90% match for a Social Security number, and here’s the rule that you should probably run against it. Now you can look at it and go, that’s just a random number. It’s not a Social Security number. You don’t have to apply it. But it does find these things: zip codes, names, addresses.
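The threshold idea in this exchange can be sketched as follows. This is a hypothetical Python illustration, not the product’s detection engine: the `RULES` table, the column names, and the `suggest_rules` function are assumptions for the example. It scores each column by the share of values that match a pattern and only suggests a rule when that share meets the threshold, leaving the final decision to a person.

```python
import re

# Illustrative detection rules (assumed for this example).
RULES = {
    "ssn": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "us_zip": re.compile(r"\d{5}(?:-\d{4})?"),
}


def match_rate(values, pattern):
    """Fraction of non-empty values that fully match the pattern."""
    non_empty = [v.strip() for v in values if v and v.strip()]
    if not non_empty:
        return 0.0
    hits = sum(1 for v in non_empty if pattern.fullmatch(v))
    return hits / len(non_empty)


def suggest_rules(columns, threshold=0.85):
    """Return (column, rule, rate) for every rule at or above the threshold."""
    suggestions = []
    for column_name, values in columns.items():
        for rule_name, pattern in RULES.items():
            rate = match_rate(values, pattern)
            if rate >= threshold:
                suggestions.append((column_name, rule_name, rate))
    return suggestions


sample = {
    "CUSTID": ["123-45-6789"] * 9 + ["N/A"],  # 9 of 10 values look like SSNs
    "NOTES": ["call back", "paid", "n/a"],
}
for column, rule, rate in suggest_rules(sample):
    print(f"{column}: {rate:.0%} of values match '{rule}'; apply the rule?")
```

A pattern match is only a hint. As Alan notes, a column of account numbers can look just like Social Security numbers, so the suggestion still needs human review.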
It is a big time saver, particularly when you have hundreds of thousands or millions of records and fields and different databases. I say different databases because this is what’s cool: we grew up on the IBM i, and we go way back on the IBM i, back to the System/38. Over time, what we have found is you never have just an IBM i database. There’s always a Linux server or a Windows server or something, some Oracle database in front of it. There’s always something tied into it. With DOT Anonymizer there, you can anonymize both sides at the same time.
And the nice thing is the internal mechanisms can ensure that if you’re changing the name from Henry Jones to Tom Smith, every Henry Jones becomes Tom Smith. So you end up with usable data that still works within your applications and still matches across your databases.
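One common way to get this kind of consistency is a keyed, deterministic mapping. The Python sketch below illustrates the general technique only; the source does not say how DOT Anonymizer achieves it. The name lists, the `consistent_fake_name` function, and the key are all hypothetical.

```python
import hashlib
import hmac

# Small illustrative lists; a real replacement dictionary would be far larger.
FIRST_NAMES = ["Tom", "Maria", "Priya", "Chen", "Olu", "Sara"]
LAST_NAMES = ["Smith", "Garcia", "Patel", "Wang", "Okafor", "Berg"]


def consistent_fake_name(real_name: str, secret_key: bytes) -> str:
    """Map a real name to a fake one so that the same input and key
    always produce the same output, on any system."""
    # Normalize spacing and case so 'HENRY  JONES' and 'Henry Jones' agree.
    normalized = " ".join(real_name.split()).casefold()
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


key = b"rotate-me-per-refresh"  # hypothetical key; keep it out of the test system
ibm_i_row = consistent_fake_name("Henry Jones", key)
oracle_row = consistent_fake_name("HENRY  JONES", key)
print(ibm_i_row, oracle_row, ibm_i_row == oracle_row)
```

Because the mapping depends only on the input and the key, two databases anonymized separately still join on the same fake name. With lists this small, different people can collide on the same replacement, which a larger dictionary would make rarer.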
R.B. – So this is all becoming part of your ETL (extract, transform, load) process.
A.A. – And it’s kind of a misleading term, or an overused term, because of what you’re going to get out of it. Some of it can be as simple as saving it to a tape and restoring it. When it comes to the test data management aspect of things, you would extract the data, you would anonymize the data.
You could then transfer the data and then you could load the data. And all of this can be fed right into a pipeline. So you could be using it out of a Jenkins pipeline to say once a month we’re going to kick this off, we’re going to reload our test data, we’re going to anonymize it across our entire enterprise.
So that’s how DOT Anonymizer and test data management can really come in and help put together a protected test environment.
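The extract, anonymize, transfer, and load sequence Alan describes can be sketched as a small script that a scheduler or pipeline job could invoke. This is a toy Python illustration using in-memory SQLite databases as stand-ins for production and test systems; the table layout and function names are assumptions, and none of this reflects the actual DOT Extract or DOT Anonymizer interfaces.

```python
import sqlite3


def extract(prod_conn):
    """Pull the rows needed for testing out of the production stand-in."""
    return prod_conn.execute("SELECT id, name, email FROM customers").fetchall()


def anonymize(rows):
    """Replace PII before the data leaves the production side."""
    return [
        (row_id, f"Customer {row_id}", f"customer{row_id}@example.com")
        for row_id, _name, _email in rows
    ]


def load(test_conn, rows):
    """Rebuild the test table from scratch with the anonymized rows."""
    test_conn.execute("DROP TABLE IF EXISTS customers")
    test_conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    test_conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", rows)
    test_conn.commit()


def refresh_test_data(prod_conn, test_conn):
    """Extract, anonymize, then transfer and load; returns the row count."""
    safe_rows = anonymize(extract(prod_conn))
    load(test_conn, safe_rows)  # transfer and load are one step in this toy setup
    return len(safe_rows)


# Stand-in "production" database with two made-up customers.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
prod.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Henry Jones", "henry@corp.example"), (2, "Ann Lee", "ann@corp.example")],
)
test = sqlite3.connect(":memory:")
print(refresh_test_data(prod, test), "rows refreshed")
```

The ordering is the point of the sketch: anonymization happens before the transfer, so real PII never reaches the test system.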
R.B. – Thanks, Alan. There really is a lot to think about when it comes to test data management: the policies that you need to adhere to, who has their eyes on your production data once you move it to test, whether or not you need to anonymize, and what to anonymize. There’s quite a bit to think about, and making that easy to use and easy to install,
I guess that’s key as well. So I mean, that follows right along with DevOps and what we’re always talking about. That’s why DOT Anonymizer by ARCAD is a great solution for what we’ve just discussed.
Our Hosts
Alan Ashley
Solution Architect, ARCAD Software
Alan has supported and promoted the IBM i platform for over 30 years and is the Presales Consultant for DevOps on IBM i at ARCAD Software. Prior to joining ARCAD Software, he spent many years in multiple roles within IBM, from supporting customers through HA and DR to application promotion and migrations of the IBM i to the cloud. In those roles, he saw firsthand the pains many have with Application Lifecycle Management, modernization, and data protection. His passion in those areas fits right in with the ARCAD suite of products.
Ray Bernardi
Senior Consultant, ARCAD Software
Ray is a 30-year IT veteran and currently a Pre/Post Sales Technical Support Specialist for ARCAD Software, an international ISV and IBM Business Partner. He has been involved with the development and sales of many cutting-edge software products throughout his career, with specialist knowledge in Application Lifecycle Management (ALM) products from ARCAD Software covering a broad range of functional areas including enterprise IBM i modernization and DevOps.