Determining the size of the sample set

I have a challenge to solve that I’ve faced before: when there is more data available than time or practicality allows me to test with, what should I build for a sample data set?

This is a long-standing, classic testing question that can be answered in a variety of ways. Sometimes the answer is resolved through a mathematical method, but quantity alone doesn’t resolve the entire question.

The sample set challenge has come up in different testing situations such as data replication, data migration, and data loading, and on projects such as Palm Pilot synchronization, relational database vendor changes (moving from Sybase to SQL Server), a customer data migration where one product was being retired and its customer data was migrated to another application, and data loading on a data warehouse project.

The sample set challenge has a few perspectives to consider. One perspective is what is practical and what I am limited to by time. Another is that even if I test with all of the data I have, the data may not include the challenging cases I would like to test with. From one point of view, “all” may feel too large, and yet from another perspective, “all” can feel not robust enough.

And yet another perspective: how large a data set would it take to feel confident? Not to mention, who has to feel confident, and how can I bring that comfort to a range of product stakeholders whose opinions can be diverse?

I start by building my own sense of comfort, which peels down to a few essentials:

A short stack of the happy path
I run a handful of fairly happy-path data through first to make sure the process and/or the product I’m working with is functioning. How do I find these records or this data? If I’m working with data that’s in a database, I’ll run a query, retrieve the data, and eyeball my way through the result set, possibly the entire collection if the amount of data is something I can sift through. If the collection is larger than that, I look through the result set to sense what I can: I look for patterns, I look for repeating data, I try to learn what normal is.
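As a rough sketch of that first scan, assuming the data sits in a SQLite file called source.db with a customers table (all names here are made up for illustration):

```python
import sqlite3
from collections import Counter

# Assumed for illustration: a SQLite file "source.db" with a
# "customers" table containing name, country, and balance columns.
conn = sqlite3.connect("source.db")
rows = conn.execute(
    "SELECT name, country, balance FROM customers LIMIT 500"
).fetchall()

# Eyeball aids: distinct counts and the most common values per column,
# a quick way to sense what "normal" looks like in the source data.
for i, col in enumerate(["name", "country", "balance"]):
    values = [row[i] for row in rows]
    print(f"{col}: {len(set(values))} distinct, "
          f"top values: {Counter(values).most_common(3)}")
```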

The challenge data set
Once I determine the happy path is working, or I flush out issues, I start stepping up the data to more challenging cases. In some cases, trying to pass NULL through could be a challenge; in other cases, it could be an unexpected entry, such as numeric data in a string field. Depending on the project, I might limit challenge data to the same boundaries as its source. For example, if I’m testing a data sync scenario and the data is restricted by the application or source system, then I might limit the entries to what could be entered through the source system.
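A minimal sketch of what that challenge data might look like, using a hypothetical (name, age, balance) record shape:

```python
# The (name, age, balance) record shape is hypothetical; the values
# probe NULL handling and type confusion.
happy_record = ("Ada Lovelace", 36, 100.00)

challenge_records = [
    (None, 36, 100.00),                      # NULL in a string field
    ("12345", 36, 100.00),                   # numeric data in a string field
    ("Ada Lovelace", None, 100.00),          # NULL in a numeric field
    ("Ada Lovelace", "thirty-six", 100.00),  # string in a numeric field
]

for record in challenge_records:
    # hand each record to the process under test here
    print(record)
```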

Unexpected entries – boundaries and beyond
I move on to unexpected entries, using data that doesn’t look the same as what I’ve been able to find at the source. For instance, on a string field such as a name field, I might not find a name with special characters, so I’ll build values that give me that type of data. If I’m testing monetary fields, I want to push the field limits to see if I can provoke any wrapping or truncating. If I’m testing error logging, I might push past the field boundaries to see whether errors are detected. Data just out of range and data hanging right at the cusp of what is allowed cover boundary testing at the field level.
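A sketch of boundary-and-beyond values, assuming a hypothetical 50-character name field and a NUMERIC(10,2) monetary field:

```python
from decimal import Decimal

# Assumed limits for illustration: a 50-character name field and a
# NUMERIC(10,2) monetary field.
NAME_MAX = 50

name_values = [
    "O'Brien",              # embedded quote
    "José Müller",          # accented characters
    "Anne-Marie O.",        # punctuation mix
    "A" * NAME_MAX,         # right at the cusp of what is allowed
    "A" * (NAME_MAX + 1),   # just out of range: rejected, truncated, or logged?
]

money_values = [
    Decimal("0.00"),
    Decimal("99999999.99"),    # at the assumed field maximum
    Decimal("100000000.00"),   # just beyond: any wrapping or truncating?
    Decimal("0.005"),          # more precision than the field holds
]
```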

Combinations
Some years ago, when I was testing data replication (essentially testing insert, update, and delete with a master/slave relationship), I found the majority of the bugs were around combinations of data rather than singular tests. In other words, bugs were found when I was inserting one record and updating another and the two elements of data were batched together in the same set of data being passed through. I can’t recall some of the specifics now, but what has lingered in my memory is not to test with too naive a data set.
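One way to keep those combinations from being hand-picked is to enumerate them; this sketch is illustrative, not the harness from that project:

```python
import itertools

operations = ["insert", "update", "delete"]

# Enumerate every ordered pair of operations batched into one replication
# set, e.g. ("insert", "update") means one record is inserted while
# another is updated in the same batch passed through together.
for combo in itertools.product(operations, repeat=2):
    print(combo)
```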

When I can, however I can (SQL queries, error logs), I try to find out what “normal” might look like in production. So if the amount of data being replicated in a sync session is 100 rows, then I want to work with not just singular row updates but updates at and beyond the size of what might be expected. I realize what’s happening is that I’m testing boundaries in another way: not the boundaries of a singular field but the boundaries of the set of data.
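A sketch of probing those set-size boundaries, with the 100-row figure and the row shape assumed for illustration:

```python
# The 100-row "normal" sync size and the row shape are assumptions
# for illustration.
EXPECTED_BATCH = 100

batch_sizes = [1, EXPECTED_BATCH - 1, EXPECTED_BATCH,
               EXPECTED_BATCH + 1, EXPECTED_BATCH * 10]

def make_batch(size):
    """Build `size` simple update rows to pass through in one set."""
    return [{"id": i, "op": "update", "value": f"v{i}"} for i in range(size)]

for size in batch_sizes:
    batch = make_batch(size)
    # hand `batch` to the replication/sync process under test here
    print(f"built a batch of {len(batch)} rows")
```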

Building sets
I build data collections. If the testing I’m executing is more about testing with data than functional testing, then I build data sets to test with. I collect different types of data in a spreadsheet, or whatever format I might be working with. I collect my challenge data, and I might take the time and labor to build multiple sets so that I have small pools of data to choose from. I might continue to spoon-feed small amounts of challenge data, or I might keep increasing the data pool to include more columns, more variety, or both, until I feel I’ve exhausted the challenge conditions.
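A minimal sketch of persisting those pools, here as CSV files with a made-up two-column shape:

```python
import csv

# Illustrative pools only: each set of data goes to its own CSV file so
# sets can be fed through the process one at a time.
pools = {
    "happy_path.csv": [("Ada Lovelace", "36"), ("Alan Turing", "41")],
    "challenge.csv": [("", "36"), ("12345", "")],  # blanks stand in for NULLs
}

for filename, rows in pools.items():
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "age"])  # header row
        writer.writerows(rows)
```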

At some point, after I’ve stepped through each of these sample sets, a certain quantity of data does feel good to see processed. I might consider the volume of the data in production, such as on the customer migration project: how many customers and how much data are we expecting overall? Or in the case of a Palm Pilot sync, what’s the volume of data a single user will sync? Or in the case of data warehouse loading, what will an incremental data load look like for a particular ETL?

Twice I’ve worked on projects with a statistician where the answer to how large a sample set should be was calculated and prescribed, but that is not normally the case. It isn’t the number or the size of the sample set that gives comfort or confidence. It’s the robustness of the testing: the extent of the challenge data and the purely error-provoking test data I want to have worked with, to know that the sample set was strong enough that moving to production feels reasonably comfortable.
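For a sense of what such a calculation can look like, one common starting point is Cochran’s sample-size formula for estimating a proportion; this sketch is illustrative and not the formula from either of those projects:

```python
import math

def sample_size(z=1.96, p=0.5, e=0.05):
    """Cochran's formula for estimating a proportion:
    n = z^2 * p * (1 - p) / e^2
    z: z-score for the confidence level (1.96 is roughly 95%)
    p: expected proportion (0.5 is the most conservative choice)
    e: acceptable margin of error"""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

print(sample_size())  # 385 records at 95% confidence, 5% margin of error
```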
