Optimizing your RJMetrics Data Warehouse for Speed

My name is Javi Calvo — a mathematician, big data expert, and marketing analyst at Tutellus. Tutellus is the leading cooperative educational platform in the Spanish language. It’s a place where anyone can create and manage a course, or learn anything — like how to make the perfect mojito, or complete an entire university degree in web development, or add a professionally curated skill to your resume. Check us out!

At Tutellus, we work with an amazing amount of information, coming from very different places, and RJMetrics provides us with a great platform to put it all together. It has become a tool we regularly use for multiple purposes, like:

Having a holistic view of our data, consolidating information from multiple sources
Running complex analyses with the help of RJMetrics’s Data Analysis Services team members
Creating reports for internal and external use, turning the information into something easy to visualize and understand
Monitoring our key metrics
Keeping track of odd behavior in our database

Recently, I was asked by my team to lower the update cycle time in our RJMetrics account. Here at Tutellus, we have a tendency to monitor many business metrics and study very large data sets to understand the way our products and services are consumed. All of this analysis was taking a heavy toll on our update cycle time — conflicting with our need for fresh data for our critical metrics. At its highest point, our data warehouse was refreshed once every 15 hours. Needless to say, it was time for a clean-up.

After some maintenance, which I’ll describe below, the length of our update cycle was reduced to 6 hours, less than half of the previous length. This shortening of the analytical life cycle means that our team make decisions twice as quickly, and it has greatly enhanced our use of the platform. What follows is a summary of how I did it. For the rest of you RJMetrics users out there — I hope you find it useful!

The ETL process

Updates in RJMetrics are broken up into three phases, and closely follow the ETL process in data warehousing:

Fresh data from all data sources is synced into your data warehouse
Transformations and joins on your data are performed for complex calculations and custom reporting
Using those transformations, all dashboards and reports are refreshed with the latest data

To optimize your RJMetrics account, first I went backwards through this process. The first step is to determine which of your team’s reports are actually used, and which aren’t (and can be deleted). The second step is to clean house on your metrics and columns that were needed only for those deleted reports. Finally, we check for any unnecessary data sources that don’t need to be synced into the platform, or that are being synced inefficiently.

Removing unused reports

This is perhaps the most complicated step, since it requires talking your team members using the reports, and reaching a compromise between the usefulness of a report and the time it adds to your update cycle. RJMetrics provides an excellent way to gather, visualize, and analyze data. Any report you create is easy to understand, intuitive, and useful. But it is always created by a user with a specific goal in mind. For every report, ask yourself:

What was the original reason we created this report?
Is it still useful? Is it actively used right now?

Some of the reports that I deleted, I had created myself as the final user. Often, I create reports as a tool to visualize a problem, using RJMetrics to get a visual “feeling” of the data that I’m analyzing. An example of this might be a report created to get a first look at some unusual behavior, such as the sudden increase in the number of user actions. Visualizing and cross-referencing information is really easy in RJMetrics, so in a few minutes we can see if there is a relationship between that behavior and a user’s country, their payment system, the product they’ve purchased, and so on. Once I find the information I’m looking for, though, the report is no longer useful. I also create reports to convey an idea — supporting my more raw, less pretty formulas and tables with more intuitive graphics. After all, one of the most important skills in data science is to create a story, to be able to turn the insight you’ve discovered in your raw data into something useful in the real world. Once the story is told, it has no purpose, and can be deleted. (This aspect of big data analysis as storytelling is often overlooked — I recommend taking a look at IBM’s Big Data University introductory courses for data science for more detail here.)

In other cases, the unused reports were requested by other people on our team. A product manager may want to get insights about a product or service we offer, or a sales manager might be interested in knowing the best hour or day of the week to launch a campaign in a specific country. In most cases, these analyses don’t need to recalculated update after update — they can often be a one-time-use. After talking with the person who requested it (and documenting the results), you can usually delete the report.

Reports are important to keep track of your business metrics, but in our case, information, fresh information should always be available. Questions like “how is yesterday’s campaign performing?” demand a fast analytical life cycle. Your first step, then, is to delete all reports that are no longer useful. This includes all those you created for yourself, and those requested by others for (perhaps) a temporal purpose.

Removing unused columns and metrics

When you’ve deleted any unnecessary reports, it’s time to clean up your metrics. Going through the list of your metrics, you may find that some of them no longer have dependencies, so they can be deleted. In addition to this, you should check for any duplicated metrics. In the platform, it’s very easy to clone a metric for use in a slightly different report than the original metric. This is helpful, but if your team overdoes it, it can create redundancy and confusion.

For your calculated columns, we received a list of columns which were unused. The calculations involved in these can take a heavy toll on the update speed. While some of these might relate to analyses you are still interested in, the RJMetrics support team can pause or deactivate specific columns if you prefer. Any that aren’t used, however, should be removed.

Optimizing table synchronization

As a final step, you may find that entire database tables aren’t being used for analysis anymore. These can be dropped from your data warehouse without affecting your dashboards, decreasing your total “sync” time. There is also an option for pausing replication as well.

The replication method of your active tables is crucially important. Some methods are quite faster than others. If your table contains a date column that updates when a row changes, or contains an incremental primary key, setting your table’s replication method to either of these does wonders for efficiency.

The fastest method is Modified At, that requires a date column that is populated when a row is created, and updated each time the row changes. The column should never have a null value, and you shouldn’t delete rows from the table. It works best if the date column is indexed, as well. For very large tables, it’s important to speak with your own developer team to see if columns like these can be added to your schema.

For us, the best example was our users MongoDB collection. Following MongoDB’s design patterns, documents can include a lot of information, so we can get all we need about a user in a single query. Every time a user does something, their document is modified. With over 600,000 active users at Tutellus, replicating this collection without Modified At became very inefficient. Using the Modified At replication method here saved us hours in replication speed.

For RJMetrics users, these steps (or a combination of them) are all you need to greatly lower your own analytical cycle time. It takes some research on our end, but is greatly worth it to see fresher data and make decisions faster. If you have any questions about the process at all, the RJMetrics support team is happy to help.

	RFM Analysis for Online Businesses: Part One Salvatore Calvo\|January 12, 2017
	Understanding your Customer Concentration Jim Roddy\|May 18, 2017
	Understanding your Magento Series: Taking Stock of your Inventory Chris Schmid\|April 20, 2017
	Introducing Magento Business Intelligence Essentials Johanna Richardson\|April 10, 2017
	Quantifying Customer Loyalty Michael Kreshtool\|March 30, 2017
	Optimizing Customer Support Michael Kreshtool\|March 23, 2017

Optimizing your RJMetrics Data Warehouse for Speed

Javier Calvo | January 26, 2017

The ETL process

Removing unused reports

Removing unused columns and metrics

Optimizing table synchronization

About The Author

Javier Calvo

mwasley
2 Posts

Chris Schmid
3 Posts

Tommy Baw
1 Posts

Anita Andrews
12 Posts

Ryan Williams
6 Posts

Ben Garvey
4 Posts

The ETL process

Removing unused reports

Removing unused columns and metrics

Optimizing table synchronization

About The Author

Javier Calvo

You Might Also Enjoy

RFM Analysis for Online Businesses: Part One

Salvatore Calvo|January 12, 2017

Understanding your Customer Concentration

Jim Roddy|May 18, 2017

Understanding your Magento Series: Taking Stock of your Inventory

Chris Schmid|April 20, 2017

Introducing Magento Business Intelligence Essentials

Johanna Richardson|April 10, 2017

Quantifying Customer Loyalty

Michael Kreshtool|March 30, 2017

Optimizing Customer Support

Michael Kreshtool|March 23, 2017