An SCD is a slowly changing dimension in a data warehouse. There are three classic forms of SCD. SCD1 overwrites old values, SCD2 preserves full history by inserting new records, and SCD3 keeps limited history, typically just the prior value in an extra column.
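To make the distinction concrete, here is a minimal sketch of how a single change, a customer moving from Olympia to Tacoma, might land under each type. The table and column names (DimCustomer, CustomerID, City, and so on) are made up for illustration, not taken from any real schema:

```sql
-- SCD1: overwrite in place; history is lost.
UPDATE DimCustomer
SET City = 'Tacoma'
WHERE CustomerID = 42;

-- SCD2: expire the current row, then insert a new one; history is kept.
UPDATE DimCustomer
SET EndDate = '2014-06-01', IsCurrent = 0
WHERE CustomerID = 42 AND IsCurrent = 1;

INSERT INTO DimCustomer (CustomerID, City, StartDate, EndDate, IsCurrent)
VALUES (42, 'Tacoma', '2014-06-01', NULL, 1);

-- SCD3: keep limited history in an extra column.
UPDATE DimCustomer
SET PriorCity = City, City = 'Tacoma'
WHERE CustomerID = 42;
```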
Testing SCDs in a data warehouse is likely a topic you're either deeply interested in or one that's entirely off your radar. Over the past few months I've been doing some work in a data warehouse again and have at times looked through blogs and forums for information and experiences, and in most cases found little written from a testing perspective. I'm going to try to share some of my experiences, mindful of course of client confidentiality and steering well clear of any specific data; I'll make up data if I need to. Testing SCD2s is one topic where I've looked for experiences from other testers but found little to nothing shared.
In my experience in data warehouses, I have yet to work with an SCD3. SCD1s overwrite all of the data, and I've yet to encounter issues with a data load overwriting, so it's SCD2s that have captured my interest.
In terms of learning about SCD2s, there are places to read; the best material I've found is the Kimball books. Instead of giving a tech lesson on SCD2s, what I want to share is the testing view. The short explanation of an SCD2 is that when a tracked value changes, the existing record is expired and a new record is inserted with its effective date fields set.
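In table form, that leaves two rows behind for the changed record. A sketch of the SCD2 history for one customer, again using the hypothetical names from above:

```sql
-- Inspect the SCD2 history for a single customer (hypothetical schema).
SELECT CustomerID, City, StartDate, EndDate, IsCurrent
FROM DimCustomer
WHERE CustomerID = 42
ORDER BY StartDate;

-- Expected shape of the result:
-- CustomerID  City     StartDate   EndDate     IsCurrent
-- 42          Olympia  2012-01-15  2014-06-01  0
-- 42          Tacoma   2014-06-01  NULL        1
```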
So what about the testing perspective? That question has become something of a chant of mine on Business Intelligence projects.
As data inevitably gets reloaded, it's good to know which data is supposed to keep its history and which is supposed to be overwritten. Why? One reason is to have a sense of how data is handled on refreshes. Another is to be able to distinctly and purposefully test the SCD2s. This is where concept and reality start to break down.
So conceptually it's a good idea to know which data elements will be handled as an SCD2 versus an SCD1, but where can you get that information? Typically it's buried in the details of the ETL jobs, and in the case of the Microsoft BI stack, you have to know how to navigate SSIS to find it. That navigation isn't impossible, but repeating the investigation for every data element makes the effort a large labor spend. If a data dictionary were written that included that information, it would be easy to find.
In a warehouse where design changes are still rippling through, it's not only significant labor but an effort that might have to be repeated over and over. I have no gorgeous solutions to this. Sometimes the best I can do is understand a problem, groan at the solutions when I don't like any of them, and forge ahead as best I can.
Another reality about testing SCD2s is the same as with other software: you don't want to retest the tool; you want to test the implementation. Case in point: I don't want to retest Microsoft's SSIS components to verify that SCD2s are handled as expected. But it would certainly feel good to take perhaps the most vulnerable or most essential data element or two and step through the process to see that it was implemented without issue. So what if I decided to test just a couple of SCD2s? Where to start?
Building a small data set to test with has the advantage that you can know the data set thoroughly. Whiteboarding ideas or using Excel to map out a small data set that exercises the update handling sounds straightforward in theory. But the simple approach has a few challenges.
One issue is that knowing all the columns for a single record and creating a small sample can be downright ugly: a single record can span a great many columns, and setting or defining some of those values can be tough. Not all the values are easily readable, and setting a value incorrectly can skew the testing.
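For what it's worth, here is the kind of minimal set I would sketch on a whiteboard, with made-up names and only a handful of columns. The point is one row per scenario: a record that doesn't change, a column handled as SCD1, and a column tracked as SCD2:

```sql
-- Made-up staging rows for a first load (hypothetical names throughout).
INSERT INTO StgCustomer (CustomerID, Name, Phone, City) VALUES
    (1, 'Alda',  '555-0101', 'Olympia'),   -- will stay unchanged
    (2, 'Benny', '555-0102', 'Spokane'),   -- Phone (SCD1) will change
    (3, 'Carla', '555-0103', 'Tacoma');    -- City (SCD2) will change

-- ...run the ETL, then stage the second load:
TRUNCATE TABLE StgCustomer;
INSERT INTO StgCustomer (CustomerID, Name, Phone, City) VALUES
    (1, 'Alda',  '555-0101', 'Olympia'),   -- no change: expect no new row
    (2, 'Benny', '555-0199', 'Spokane'),   -- SCD1: expect overwrite, no new row
    (3, 'Carla', '555-0103', 'Yakima');    -- SCD2: expect old row expired, new row added
```

Running the ETL between the two loads and then looking at the dimension tells you whether each scenario behaved as designed.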
A system architect and I talked this through and concluded it would be less labor to have a tester design the test data and a developer build the test data set. A developer can more readily pull a SQL query or a small extract from the database, then pair with the tester to massage the data into a sample set ready for testing.
But after running the test set, the same challenge comes back to haunt us: once the sample data has been loaded, the SQL knowledge needed to drill in and check the results can be an issue. Each time a tester has to go back to a developer for assistance, the sense of objectivity gets rattled.
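That said, the checks themselves don't have to be deep SQL. Against the hypothetical schema above, two queries of roughly this shape would cover the essentials:

```sql
-- Did the SCD2 change expire the old row and insert a new current one?
SELECT CustomerID, City, StartDate, EndDate, IsCurrent
FROM DimCustomer
WHERE CustomerID = 3
ORDER BY StartDate;

-- Broader sanity check: every customer should have exactly one current row.
SELECT CustomerID, COUNT(*) AS CurrentRows
FROM DimCustomer
WHERE IsCurrent = 1
GROUP BY CustomerID
HAVING COUNT(*) <> 1;
```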
What about looking at the problem from a different perspective? Pairing a developer with a tester to set up a sample test can work. After running the test data, it can be more feasible to test from the outside, meaning to view the changed data as a user would: using a tool like Excel or Tableau to connect to the cube and show the data. It still requires a certain amount of know-how that isn't always a skill set I find in testers on the market.
Another alternative is to run full data loads that include such data element changes, then verify through a data reconciliation process that the new data has appeared. That is the process I've seen used. It isn't necessarily the best solution, but it was the solution we could implement.
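As a sketch of what that reconciliation can look like, still using the hypothetical names from earlier: compare the source against the current dimension rows and flag anything that doesn't line up:

```sql
-- Row-count reconciliation: distinct source customers vs. current dimension rows.
SELECT
    (SELECT COUNT(DISTINCT CustomerID) FROM SourceCustomer) AS SourceCount,
    (SELECT COUNT(*) FROM DimCustomer WHERE IsCurrent = 1)  AS CurrentDimCount;

-- Value reconciliation: source values that differ from the current dimension row,
-- i.e. changes the load failed to pick up.
SELECT s.CustomerID, s.City AS SourceCity, d.City AS DimCity
FROM SourceCustomer s
JOIN DimCustomer d
  ON d.CustomerID = s.CustomerID
 AND d.IsCurrent = 1
WHERE s.City <> d.City;
```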
I think one of my core frustrations in BI testing is that the projects tend to be highly technical; testers are often not placed on the BI team, leaving developers at the helm who don't (of course) think like testers. That makes injecting testing ideas early in the process fall by the wayside. And as on any project, when testing isn't done early, bugs are found later, and they're found by end users.