A large amount of the work in my department is associating music repertoire in our system to “claiming rights owners,” usually labels, using some tools I maintain. When I make an update or add a feature, I like to generate dummy repertoire associations for unit testing. Instead of manually finding individual records and targets, I find it useful to do it programmatically. This allows me to scale the test to however many instances I want.
Obviously, the first step is to write a query that finds repertoire in our system that meets the criteria of my test, and another query that finds rights owners that meet the same criteria. The question is, how best do I combine the two?
To avoid talking too much about the specifics of my work problem, I’ll generalize this by framing it in a purchases database. Let’s pretend we want to generate random product purchases for tests. We want the random purchases to include customers from various cities and also products connected to several brands. Depending on how our database was populated, just taking the first few entries from the CUSTOMERS table might result in them all being from the same region, and similarly taking the first few entries in the PRODUCTS table could result in them all being from the same brand. Not really a problem, but it would be better to have diverse test cases.
The first step is randomly shuffling our datasets. Here’s how you accomplish this in Oracle:
select * from customers order by dbms_random.value
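Since the point is to scale the test to a chosen number of instances, here is a minimal sketch of capping the shuffled set. The limit of 10 is an arbitrary value picked for illustration:

```sql
-- Shuffle in the inner query first, then cap the row count outside it.
-- Within a single query block Oracle assigns rownum before ORDER BY is applied,
-- so filtering and ordering in the same block would not give a random sample.
select *
from (select * from customers order by dbms_random.value)
where rownum <= 10
```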
If you are using a different RDBMS, dbms_random.value probably doesn’t exist. Pete Freitag wrote a blog post describing the appropriate shuffling methods for different RDBMSs, which you will probably find useful.
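For quick reference, these are the shuffling idioms I know of for a few other common systems. Each statement belongs to a different RDBMS, so run only the one that matches yours, and check your version's documentation if it misbehaves:

```sql
-- PostgreSQL and SQLite: random() returns a new value for each row
select * from customers order by random();

-- MySQL: the function is named rand()
select * from customers order by rand();

-- SQL Server: newid() generates a fresh GUID per row, which sorts randomly
select * from customers order by newid();
```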
Now, we need to map our shuffled data. Why not just put the shuffled datasets side by side? To do this, we join on the row numbers of our shuffled datasets:
select cust.id customer_id, prod.id product_id
from (select id, rownum joinval
      from (select id from customers order by dbms_random.value)) cust,
     (select id, rownum joinval
      from (select id from products order by dbms_random.value)) prod
where cust.joinval = prod.joinval
Again, this is Oracle style. rownum will probably have a different name or may not even be supported depending on your RDBMS of choice.
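If rownum isn't available, the same side-by-side pairing can be sketched with the row_number() window function, which Oracle and most other current systems support. This is an illustration under that assumption, written against Oracle; on another RDBMS you would swap dbms_random.value for that system's random function:

```sql
-- Number each shuffled table independently, then pair rows with equal numbers.
select cust.id as customer_id,
       prod.id as product_id
from (select id,
             row_number() over (order by dbms_random.value) as joinval
      from customers) cust
join (select id,
             row_number() over (order by dbms_random.value) as joinval
      from products) prod
  on cust.joinval = prod.joinval
```

Either way, the join returns as many pairs as the smaller of the two tables has rows. Adding a filter such as where cust.joinval <= 100 caps the number of test purchases.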
This solution works for me because my analogs of the customers and products tables are large and we have many, many customers and products, so I can feel safe that I’ll get a good mix. If you’re not so lucky or want to be really damn sure that no two customers are from the same city (or that no two products are from the same brand), you can, but things will start to get messy fast. Intuitively, we want something like this:
select b.brand_id, min(p.id) from products p, xref_products_brands b where p.id = b.prod_id group by b.brand_id
But this will always give us the same product ids! We don’t actually want the min(p.id) for each brand, we want a random product id for each brand, so this deduplication method probably doesn’t cut it for us unless we have lots of brands in our system (which we very well may). Since shuffling the data suits my needs I don’t really feel the need to delve into this problem further, but be cognizant of your own needs and data if you decide to use these methods.
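For anyone who does need a random product for each brand, one possible approach is to shuffle within each brand using a partitioned row_number() and keep the first row of each partition. This is an untested sketch that reuses the table and column names from the query above; rn is just an illustrative alias:

```sql
-- Rank the products of each brand in random order, then keep rank 1.
select brand_id, product_id
from (select b.brand_id,
             p.id as product_id,
             row_number() over (partition by b.brand_id
                                order by dbms_random.value) as rn
      from products p
      join xref_products_brands b on p.id = b.prod_id)
where rn = 1
```

Note that a product linked to more than one brand could still be picked for several of them, so this guarantees distinct brands but not necessarily distinct products.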