
In college, I took a course called “Regression and Applied Time Series.” The cornerstone of the course was a competition called the “OJ Game” in which a dozen student teams operated fictional orange juice companies over several simulated years.

The concept may be simple, but believe me, the game is not. Every variable in the system is driven by statistical formulas that incorporate dozens of complex, interconnected systems. Where do you buy your oranges? How do you get them to your factories? How much of your supply do you keep pure? Concentrate and reconstitute? Freeze and thaw (or sell frozen)?

The answers to these questions don’t come easily. They depend on things like labor, shipping, and building costs around the globe, the behavior of financial markets, weather’s role in fruit rot, macroeconomic trends that influence consumer demand, and, of course, how all of these things change over time.

There’s plenty more to the OJ game, but what brought back this memory was a recent experience in which I had to simulate data for a fictional e-commerce company. We had to create a fake data set for giving demos of the RJMetrics eCommerce Dashboard, and I volunteered to create a realistic database for a fictional clothing importer/exporter, Vandelay Industries (yes, the Seinfeld references are laid on thick in this data set). The first step was to create the database structure for this company, so I laid out a bare-bones relational database structure for e-commerce consisting of sixteen tables.

The basic DB structure as it was slapped up on my whiteboard (please excuse the lefty handwriting)

It basically looks like this: you’ve got customers placing orders that consist of items that come from inventory with stock levels maintained by restocks. Each inventory SKU is a unique combination of a product and its applicable attributes (i.e. gender, color, size). These products are supplied by vendors, and orders can be referred by affiliates. Items from orders can also be returned for refunds.
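To make the SKU idea concrete, here is a minimal sketch in Python (the original library was written in PHP, and none of this is its actual code). It builds one SKU per combination of attribute values. The function name `build_skus`, the SKU code format, and the sample product are all made up for illustration.

```python
from itertools import product

def build_skus(product_name, attributes):
    """Return one SKU dict per combination of attribute values.

    `attributes` maps an attribute name (e.g. "color") to its allowed values.
    """
    names = sorted(attributes)  # fixed ordering keeps SKU codes stable
    skus = []
    for values in product(*(attributes[n] for n in names)):
        combo = dict(zip(names, values))
        code = "-".join([product_name] + [str(combo[n]) for n in names])
        skus.append({
            "sku": code.upper().replace(" ", "_"),  # hypothetical code format
            "product": product_name,
            **combo,
        })
    return skus

# Illustrative product and attribute values, not taken from the real data set.
skus = build_skus(
    "puffy shirt",
    {"gender": ["m", "f"], "color": ["white", "black"], "size": ["s", "m", "l"]},
)
print(len(skus))  # 2 genders * 2 colors * 3 sizes = 12 SKUs
```

Each of these SKUs would then get its own row in inventory, with its stock level tracked separately.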

Basic stuff, right? When a company accumulates a set of data like this over several years, it seems so. However, simulating this data in a realistic way is actually a lot more complex than it might seem.

Like the OJ game, the interconnectedness of the system’s components makes things quite tricky. Here are some of the bumps I had to address along the way:

  • The data can’t look simulated. Each slice of the data, no matter how detailed, has to both represent some definite trend (so the demo data actually shows something interesting) and not look overly mathematical or otherwise contrived. This means random noise needs to be included alongside complex, nonlinear growth formulas.
  • To achieve interesting trends within deep-level slices, orders can’t just be random customers buying random products. Existing customers need to have a higher average order value, loyalty to brands and styles, and an increasing probability of making incremental purchases with each purchase they make.
  • Over time, new vendors, products, and affiliates need to be added to the system (at an accelerating rate). However, our analyses might require us to isolate the names of some of these things. This means they all need logical names (not random characters), which means a data bank of distinct names or an algorithm to intelligently create them.
  • To do geographic analysis, customers will need to have addresses that can then be associated with their orders. These addresses will need to have real zip codes and ideally orders should generally be in some proportion to a zip code’s relative population.
  • There obviously needs to be some underlying year-to-year growth rate, but there also needs to be a “seasonality” over the course of the year, over the course of any given week, and over the course of any given day.
  • Some products need to be more popular than others, as do certain sizes, colors and other attributes. Also some affiliates need to be more successful than others.
  • Orders can’t be placed for items that are out of stock, and a restocking process needs to exist to keep items in stock.
  • We want there to always be “fresh” data in the system, so the simulation needs to be able to run each day and update with new, fresh information that is consistent with the objectives above.
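One way to combine several of these requirements (an underlying growth rate, overlapping seasonality, and random noise) is to multiply them together. The Python sketch below shows that idea only; it is not the formula the PHP library used. Every constant here (base volume, growth rate, and all weight tables) is an assumption picked for illustration. Only the July 4, 2004 start date comes from the simulation itself.

```python
import random
from datetime import date

START = date(2004, 7, 4)      # simulation start date
BASE_ORDERS_PER_DAY = 5.0     # assumed starting volume
ANNUAL_GROWTH = 0.60          # assumed 60% year-over-year growth

# Assumed seasonality weights; each table averages out to roughly 1.0.
MONTH_WEIGHTS = [0.8, 0.7, 0.8, 0.9, 1.0, 1.0, 0.9, 1.0, 1.0, 1.1, 1.3, 1.5]
WEEKDAY_WEIGHTS = [1.1, 1.1, 1.0, 1.0, 0.9, 0.9, 1.0]  # Monday is index 0
HOUR_WEIGHTS = [1] * 8 + [3] * 4 + [5] * 2 + [3] * 4 + [6] * 4 + [2] * 2

def expected_orders(day):
    """Smooth trend: compound growth scaled by month and weekday weights."""
    years = (day - START).days / 365.25
    trend = BASE_ORDERS_PER_DAY * (1 + ANNUAL_GROWTH) ** years
    return trend * MONTH_WEIGHTS[day.month - 1] * WEEKDAY_WEIGHTS[day.weekday()]

def simulated_orders(day, rng):
    """Apply multiplicative log-normal noise so the curve does not look contrived."""
    noise = rng.lognormvariate(0, 0.15)
    return max(0, round(expected_orders(day) * noise))

def order_hours(count, rng):
    """Spread a day's orders across the 24 hours using the hourly weights."""
    return rng.choices(range(24), weights=HOUR_WEIGHTS, k=count)

rng = random.Random(2004)  # seeded so reruns reproduce the same history
n = simulated_orders(date(2008, 12, 1), rng)
hours = order_hours(n, rng)
```

Seeding the generator is one simple way to let a daily run extend the history without rewriting the days already simulated, provided earlier days are stored rather than regenerated.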

I was able to accomplish these objectives by building a nice little PHP CLI function library of about 500 lines. The script starts on July 4, 2004, and then simulates the activity of each day up until today, one day at a time. Here’s the process at a really high level:

  1. Simulate whether any new vendors, affiliates, or products will be added that day. Naturally, new products must be supplied by vendors already in the system. SKUs are created for any new products, and permutations of color, gender, and size are stocked into inventory.
  2. Check and restock any low-inventory SKUs.
  3. Based on overlapping seasonality weightings (months, days, hours), simulate the day’s purchases made by existing customers. Customers who have made more purchases historically are slightly more likely to make repeat purchases. The items (and quantities) they purchase are influenced by both the items they have purchased in the past and the overall popularity of the existing items available in the system.
  4. Based on similar seasonality weightings, simulate purchases from new customers. This involves simulating the items and quantities purchased and “creating” the new customer in the database (including e-mail address and geographic information). This customer is then a potential re-buyer in step 3 of future days.
  5. Returns of recent merchandise purchases are simulated using similar methodologies.
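The shape of that daily loop can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the PHP library itself. It covers only steps 2 through 4 (restocking, repeat purchases, new customers) and keeps state in plain dictionaries rather than database tables. The reorder threshold, restock quantity, the `1 + past orders` weighting, and all function names are hypothetical.

```python
import random
from datetime import date, timedelta

REORDER_POINT = 5   # assumed: restock when stock falls below this
RESTOCK_QTY = 20    # assumed: units added per restock

def restock(inventory):
    """Step 2: top up any SKU whose stock is below the reorder point."""
    for sku, qty in inventory.items():
        if qty < REORDER_POINT:
            inventory[sku] = qty + RESTOCK_QTY

def place_order(day, buyer, state, rng):
    """Pick an in-stock SKU weighted by popularity; None if nothing is in stock."""
    available = [s for s, q in state["inventory"].items() if q > 0]
    if not available:
        return None
    weights = [state["popularity"][s] for s in available]
    sku = rng.choices(available, weights=weights, k=1)[0]
    state["inventory"][sku] -= 1
    state["customers"][buyer] = state["customers"].get(buyer, 0) + 1
    return {"date": day, "customer": buyer, "sku": sku}

def simulate_day(day, state, rng, repeat_orders, new_orders):
    restock(state["inventory"])
    orders = []
    # Step 3: existing customers, weighted toward those with more past orders.
    for _ in range(repeat_orders):
        ids = list(state["customers"])
        if not ids:
            break
        weights = [1 + state["customers"][c] for c in ids]
        buyer = rng.choices(ids, weights=weights, k=1)[0]
        orders.append(place_order(day, buyer, state, rng))
    # Step 4: new customers get the next sequential id.
    for _ in range(new_orders):
        buyer = len(state["customers"]) + 1
        orders.append(place_order(day, buyer, state, rng))
    return [o for o in orders if o is not None]

rng = random.Random(42)
state = {"inventory": {"A": 3, "B": 0}, "popularity": {"A": 3.0, "B": 1.0}, "customers": {}}
day, all_orders = date(2004, 7, 4), []
for _ in range(7):  # one simulated week
    all_orders += simulate_day(day, state, rng, repeat_orders=1, new_orders=2)
    day += timedelta(days=1)
```

In a fuller version, the per-day order counts would come from the seasonality weightings rather than being fixed, and item choice would also reflect each customer's purchase history.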

The entire simulation process is really interesting to watch. As you can imagine, the early days simulate at a rate of several per second. However, as the company grows over time and the numbers of purchases, customers, and items expand, each day’s simulation becomes significantly more complex. By 2009, the days are taking well over a second each to simulate.

After about an hour or so, however, we end up with a really interesting data set! All in a good day’s work.

If you have a business that might benefit from hosted business intelligence solutions, contact RJMetrics and we can give you a personal tour of the RJMetrics Dashboard. Maybe you’ll even get a sneak peek at the inside data of Vandelay Industries!