IBM i DevOps TechTalk
Test Data Management – Data Extraction part 1 #6
by the experts at ARCAD
The topic of this IBM i DevOps TechTalk is Test Data Management with the focus on data extraction. It features Alan Ashley, an expert in test data management, who will answer questions from our host, Ray Bernardi:
- Where/how to start a test data management plan
- What are the main benefits of data extraction
- How to manage data security & privacy concerns
- How to automate the test data management process
- How to maintain integrity between IBM i and non-IBM i databases during extraction
Listen to this episode to get started with the management of your own test data!
The Story Behind the Mic: Podcast Transcription
Ray Bernardi – Welcome to IBM i DevOps TechTalk, where we discuss key questions and topics with ARCAD experts. Today we’ll discuss test data and test data management. My name is Ray Bernardi and with me is Alan Ashley.
So Alan, I guess the first question would be: where would a person start with test data management? I mean, of course we're talking about extracting data today, but what do you need to even think about? What are the goals that you need to achieve?
Alan Ashley – When you start talking about taking your production data and relocating it for another purpose, you need to kind of lay out why. Is it going to be a test database, is it going to be training? What kind of data do you need to bring over? Is this something that you’re going to maybe have to repeat?
How often are you going to have to do this? So when you start laying out your goals, or almost your workflow of how this is going to go, you need to plan your extraction: how are you going to do it? When are you going to do it? Do you need to take security into account?
And I know the GDPR aspect of things they have in the EU, and the PII protections that go with it, have not quite made it all the way over to the States yet, at least not to that extreme. Give it time, they'll figure it out. I know some new laws out of California and Quebec are making their way around. But you also have to factor in security: what kind of data do you need to filter out, so that what leaves production doesn't contain certain information?
R.B. – I don’t think a lot of American companies know that if one person in their database is a European customer, all this applies to them.
A.A. – Exactly. And they think of it at the production level, not what happens when the data leaves production and becomes a test environment. But guess what? That still applies. When you start extracting the data, you can start filtering some of that information out so that it does not go to test. And one of the big keys to this is how are you going to automate it?
How often are you going to have to do this? So is it going to be part of a pipeline, or are you going to do it manually every two months? That's something you need to think about.
R.B. – So I think basically what you said here was in order to get going, you should understand your data requirements. You should plan your data extraction, you should ensure data privacy and security when you’re done with that, and you should use automation.
A.A. – Exactly.
R.B. – So now that we have an idea of why we're doing this, what are some of the benefits of creating these subsets of data?
A.A. – If you're building a test environment or a training environment, you can go after certain segments of your production database. So let's say you want to focus on US customers for your training environment because you're training maybe US-based personnel. If it's a global company, maybe you will need to pull data that is only French.
You can create this subset and then make it a repeatable process from there. So that's one of the benefits: you can really segment what type of data you want to use and pull in.
R.B. – That's what you meant just a minute ago when you said filter the data as you bring it out, like get rid of the European customers as you're extracting data from production. Is that what you meant?
A.A. – Yes, exactly. So that you can get the data that you need to test on, that fits your plan. Now with that, you can also really diversify the data that you get. Let's say you're testing a big release and you need all sorts of data, but not a lot of it. So you can pull a little bit from the U.S., a little bit from the EU, a little bit from APAC, and put it all together so that you can test for maybe double-byte character sets in your data fields, but without having 100 million records to go through.
And lastly, part of this, and this is becoming a bigger topic, is building environments on the fly. So you get your front end, your back-end server, your back-end database. When you have these subsets of data, you can provision those temporary environments a lot faster. It takes a lot less time to load a million records than 100 million records, and there are many companies out there that have hundreds of millions of records across multiple tables. This would save them time if they were building up these small data sets.
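To make the idea of a small, region-diverse subset concrete, here is a minimal sketch in plain Python (not an ARCAD tool) of sampling a handful of rows per region instead of copying the whole production table; the region names, sample size, and data are assumptions for illustration only.

```python
# Minimal sketch: pull a small, diverse sample from each region and combine it,
# rather than copying the full production table. Region names, sample size,
# and the rows themselves are illustrative only.
import random

production = (
    [{"region": "US",   "name": f"US customer {i}"}   for i in range(1000)] +
    [{"region": "EU",   "name": f"EU customer {i}"}   for i in range(1000)] +
    [{"region": "APAC", "name": f"APAC customer {i}"} for i in range(1000)]
)

SAMPLE_PER_REGION = 50          # a little bit from everywhere, not everything

subset = []
for region in ("US", "EU", "APAC"):
    rows = [r for r in production if r["region"] == region]
    subset.extend(random.sample(rows, SAMPLE_PER_REGION))

print(len(subset))              # 150 rows instead of 3,000
```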
R.B. – All right. Since we have this need to handle and manage test data, how exactly can ARCAD help?
A.A. – ARCAD has this nice product called DOT Extract, as part of our test data management suite of tools. The other one is DOT Anonymizer, which really plays into anonymizing data, hence the name, to protect your PII, to follow GDPR rules and things like that. With DOT Extract, what it does is it allows you to start creating your data sources, either DDL or DDS.
If you're using our Observer product, Observer DB, it can build out the referential integrity and then map it, because that's the key here. When you start creating these subsets of data, you need to have the integrity of the data, otherwise it's kind of worthless data, and DOT Extract really comes into play there because it, one, maintains and builds a repository of those database relationships to help filter the data out. Not everyone is an SQL guru, so it has a little wizard that can help you go through it and build your filter. Maybe you're reducing it by country, by product line, by sales numbers, you name it, and it will help guide you through that. Now, if you are the SQL guru, it has an option to allow you to just write your own right there.
It gives you the option to include the WHERE clause, and that will build out your filter. And with this, you can build filters on filters across multiple tables within the schema, so that when you go to do your extract, it can go through each one of those rules, each one of those filters, and it pulls the data together. And what is really great about this is it maintains that integrity. So if you have an account number that is, let's say, for example, based in the U.S., it's going to find all the related data that goes with that record.
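As a rough illustration of what a filter-driven, integrity-preserving extract boils down to, here is a minimal sketch using an in-memory SQLite database in Python. The table names, columns, and filter value are assumptions; DOT Extract itself manages the relationships through its own repository and wizard rather than hand-written queries.

```python
# Minimal sketch: filter a parent table by country and pull only the child
# rows that reference the selected parents, so the subset stays consistent.
# Table and column names (customers, orders, country) are illustrative only.
import sqlite3

src = sqlite3.connect(":memory:")          # stands in for the production database
src.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                            customer_id INTEGER REFERENCES customers(id),
                            amount REAL);
    INSERT INTO customers VALUES (1, 'Acme', 'US'), (2, 'Dupont', 'FR');
    INSERT INTO orders    VALUES (10, 1, 99.0), (11, 2, 45.0);
""")

country_filter = "US"                       # the kind of rule a wizard-built filter produces

# Parent rows first...
subset_customers = src.execute(
    "SELECT * FROM customers WHERE country = ?", (country_filter,)).fetchall()

# ...then only the child rows whose foreign key points at a selected parent,
# which is what keeps the extracted subset referentially intact.
subset_orders = src.execute(
    """SELECT o.* FROM orders o
       JOIN customers c ON c.id = o.customer_id
       WHERE c.country = ?""", (country_filter,)).fetchall()

print(subset_customers)   # [(1, 'Acme', 'US')]
print(subset_orders)      # [(10, 1, 99.0)]
```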
R.B. – But it keeps most of the data usable.
A.A. – Right. It's got to be usable and it's got to be consistent, so that you know what you're getting there. And then lastly, it stores this information, this extract, in a repository. And what is nice is now you can inject this data repeatedly. Say you have a Db2 for i database, you pull off your extracted data, and you need to build out three training environments: one for the EU, one for the U.S., maybe one for Asia-Pacific. With one extract, it can point to three injections on Db2 for i, and it will populate those at the same time.
R.B. – So you're starting to talk about a process here. Now, I've heard you say this term before, ETL (Extract, Transform, Load). What would be a typical workflow?
A.A. – So DOT Extract, when you start putting it into a workflow, is a little bit less ETL, because ETL is a true extract, transform, load; it's going to be one fluid process going through. DOT Extract is really about refining the subset of your data. So for us, a good workflow would be: you extract your data, then you anonymize your data. That's key.
So now you're protecting your PII before anybody else can get to it in your test environments. And I brought this up at a conference and I asked the crowd: where do you get your test data from? And everybody said production. Do you have PII data? And they're like, yes. Do you protect it in your test environments?
And all their eyes glazed over as they realized: we've got a problem here. So this is where you come in and anonymize, because nobody thinks about the test side of things. And then lastly, you inject it into the environment that you want to go to. So whether you're injecting from MySQL to MySQL, or PostgreSQL to PostgreSQL.
That’s where the injections would take place.
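As a minimal sketch of that extract, anonymize, inject sequence, the flow might look like the plain Python below when scripted for a pipeline. This is not how DOT Extract or DOT Anonymizer are actually invoked; the function names, the hash-based masking rule, and the sample data are assumptions.

```python
# Minimal sketch of an extract -> anonymize -> inject flow, run as one script
# so it could be dropped into a pipeline. Function names and the masking rule
# are illustrative; the real products handle this with their own tooling.
import hashlib

def extract(rows, country):
    """Pull only the subset that matches the filter (the 'extract' step)."""
    return [r for r in rows if r["country"] == country]

def anonymize(rows):
    """Mask PII before anything reaches a test environment."""
    return [{**r, "name": hashlib.sha256(r["name"].encode()).hexdigest()[:12]}
            for r in rows]

def inject(rows, targets):
    """Load the same extract into one or more target environments."""
    for target in targets:
        target.extend(rows)

production = [
    {"name": "Jane Doe", "country": "US"},
    {"name": "Jean Dupont", "country": "FR"},
]

test_env, training_env = [], []
subset = anonymize(extract(production, "US"))
inject(subset, [test_env, training_env])     # one extract, multiple injections
print(test_env)
```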
R.B. – So just I guess a final question here, Alan. What databases does this thing support?
A.A. – So currently today, DOT Extract handles Db2 for i, Microsoft SQL Server, MySQL, Oracle and PostgreSQL. That's where it sits today. And of course, all of this is still in development, so we're adding new databases down the road. I imagine MongoDB is going to be coming. I imagine some flat file support will probably come as well.
But right now we’re starting on those five databases because those are some of the most popular databases in the market today.
R.B. – Thank you, Alan. I don’t know about you, but I learned a lot about test data extraction today. I learned that first, I need to understand my data requirements, and then I really need to plan that data extraction. During that, I need to ensure data privacy and its security. And the best way to perform all this is through automation. Thanks for listening today.
Our Hosts
Alan Ashley
Solution Architect, ARCAD Software
Alan has been in support and promotion of the IBM i platform for over 30 years and holds the Presales Consultant for DevOps on IBM i role with ARCAD Software. Prior to joining ARCAD Software, he spent many years in multiple roles within IBM, from supporting customers with HA and DR, to application promotion, to migrations of IBM i to the cloud. In those roles, he saw firsthand the pains many have with Application Lifecycle Management, modernization, and data protection. His passion in those areas fits right in with the ARCAD suite of products.
Ray Bernardi
Senior Consultant, ARCAD Software
Ray is a 30-year IT veteran and currently a Pre/Post Sales Technical Support Specialist for ARCAD Software, an international ISV and IBM Business Partner. He has been involved with the development and sales of many cutting-edge software products throughout his career, with specialist knowledge in Application Lifecycle Management (ALM) products from ARCAD Software, covering a broad range of functional areas including enterprise IBM i modernization and DevOps.