When a Magento Business Intelligence customer has a question or needs assistance, they reach out to our support team. Some requests might be very simple to resolve, while others take quite a bit of time. The key to the support team running smoothly is that balance of resources — what we call triage.

This blog post will address what triage means to us here at Magento BI, what our process used to look like, and how and why we optimized the process.

So, what is triage?

At Magento BI, triage is the process we use to sort through each customer request and make sure it ends up in the right location at the right time. The triager is the analyst who controls that process.

You can think of the triager as air traffic control. At an airport, air traffic control makes sure each plane departs from and lands on the correct runway, as close to on time as possible. Air traffic control has information about which runways are backing up with traffic, where other planes are landing, and what weather might impact departures and landings. In our world, the triager sends each customer request to the proper support channel, because the triager knows:

  • how many requests are still outstanding,
  • which team members’ workloads are piling up,
  • and who is best suited for a given request.

By being aware on a global scale, the triager can appropriately address each individual request.

Triage process, v1

In order to triage each incoming request, our first triage process required two full-time, triage-only team members: a first responder and an assigner. The first responder was responsible for:

  1. greeting each customer and letting them know we received their request,
  2. resolving all simple requests, and
  3. reviewing all other requests to determine if they
    1. needed more information provided by the customer, or
    2. were defined well enough to pass along to a non-triaging team member (an analyst).

All requests that fell under option (2) were resolved by the first responder. We call those quick-solves. If a quick-solve came back with follow-up questions or tweaks, that same first responder was responsible for handling them. All requests that fell under option (3a) were worked a bit by the first responder. We called that process prepping. The first responder would prep a request until the customer had provided enough information for an analyst to take action. At that point, those requests, along with the requests that fell under option (3b) to begin with, were sent to a queue to await assignment to a non-triage analyst. We call that queue the chute.

In this two-person triage process, the only responsibility left for the assigner was to do just that: assign the requests from the chute out to the appropriate analyst.

Why, you might ask, did this require two full-time, dedicated analysts? Well, the quick-solves and prep work required of the first responder piled up so much that the first responder needed a break. The assigner role therefore provided “off days.” The assigner assigned requests from the chute, attended to the quick-solve tickets from their first-response days that popped back with questions, and continued to prep the requests from their first-response days that needed more work. The combination of the first responder’s workload and the carry-over work done while serving as the assigner meant that a short-term rotation among triage and non-triage analysts was simply not feasible.

Triage process, v2

We knew something needed to change. The two triage roles were mentally draining for the team members. The requests kept coming and you never knew if or when you would get a reprieve. (Personally, I likened it to running on a treadmill: you run and run and run and make no physical progress.)

In mid-2016, I led a project to overhaul our triage process. This project had the following goal: transform triage into a one-analyst role that could be a short-term rotation. While that goal was quite clear, we needed success criteria to guide us and help determine if the new process was, well, successful. The overall goal of the support team is to provide the best service to our customers. Therefore, our triage overhaul couldn’t just meet the goal; we also had to improve our level of support.

With that in mind, the one-person, rotational triage process we would move to had to be a process that:

  1. can be handled physically and mentally by one team member,
  2. allows for a personal first response to each request,
  3. allows quick-solves to be resolved immediately,
  4. does not increase the time to assignment, and
  5. does not increase the time to resolution.

With that goal and those success criteria, we proposed two changes to triage that would allow it to become a one-person, rotational process.

First, we removed prepping from the triager’s responsibilities. Instead of having the triager ask questions specific to each request, we wrote a series of ticket guideline articles with details relevant to a specific request type. For example, if a customer is writing to us looking for a new calculated column to be built, the relevant ticket guideline article asks questions like “what table do you need the new column(s) built on?” and “what logic do we need to create the new column(s)?” By sending a customer to the relevant help center article, not only would we cut out the need for the triager to prep, but we would also hopefully train our customers to know where to go to find what details we need in order to proceed.

Second, we developed templated messages that would allow for quicker first responses. This would cut down on the actual amount of time the triager would spend replying to each request. However, we did not want this to come off as robotic and impersonal, so we created templated messages that would quickly provide the filler text, but also allow for customization of the meat of the response.
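The idea of a templated message with a customizable core can be sketched in a few lines of Python. This is only an illustration: the template wording, the `first_response` helper, and the names used are hypothetical and are not the actual templates described in this post.

```python
from string import Template

# Hypothetical template: the greeting and sign-off are fixed filler text,
# while $body is written by the triager for each individual request.
FIRST_RESPONSE = Template(
    "Hi $name,\n\n"
    "Thanks for reaching out to support. We received your request.\n\n"
    "$body\n\n"
    "Best,\n"
    "$triager"
)


def first_response(name: str, body: str, triager: str) -> str:
    """Wrap a custom, request-specific body in the fixed filler text."""
    return FIRST_RESPONSE.substitute(name=name, body=body, triager=triager)


message = first_response(
    name="Alex",
    body="To build the new calculated column, which table should it live on?",
    triager="Sam",
)
print(message)
```

The filler is typed once and reused, so the triager's time goes into the one part of the reply that is specific to the customer.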

As a result of these two changes, the triager role, which would be filled by one analyst per day in a daily rotation, would now be responsible for:

  1. greeting each customer and letting them know we received their request,
  2. resolving the quick-solves,
  3. sending all other requests to their appropriate location (very likely the chute), and
  4. assigning tickets from the chute to analysts based on their bandwidth.
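The four responsibilities above can be sketched as a small routing routine. This is a simplified model under stated assumptions, not a description of any real ticketing system: the `Request` and `Analyst` classes are hypothetical, "bandwidth" is approximated by a count of open tickets, and every non-quick-solve is assumed to go to the chute.

```python
from dataclasses import dataclass, field


@dataclass
class Analyst:
    name: str
    open_tickets: int = 0  # assumed stand-in for bandwidth: fewer means more room


@dataclass
class Request:
    customer: str
    is_quick_solve: bool
    log: list = field(default_factory=list)


def triage(request, chute):
    """Greet the customer, then resolve a quick-solve or send the request to the chute."""
    request.log.append("greeted")
    if request.is_quick_solve:
        request.log.append("resolved")
    else:
        request.log.append("sent to chute")
        chute.append(request)


def assign_from_chute(chute, analysts):
    """Assign each waiting request to the analyst with the fewest open tickets."""
    assignments = []
    while chute:
        request = chute.pop(0)  # oldest request first
        analyst = min(analysts, key=lambda a: a.open_tickets)
        analyst.open_tickets += 1
        request.log.append(f"assigned to {analyst.name}")
        assignments.append((request.customer, analyst.name))
    return assignments


chute = []
requests = [Request("Acme", True), Request("Globex", False), Request("Initech", False)]
for r in requests:
    triage(r, chute)

analysts = [Analyst("Dana", 2), Analyst("Lee", 1)]
assignments = assign_from_chute(chute, analysts)
print(assignments)
```

In practice the triager also weighs who is best suited for a given request, which this sketch leaves out for brevity.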

Responsibility (1) addresses success criterion (2); I will address the remaining success criteria below.

So, did it work?

In one word: yes. Before getting to the data (which we have a lot of!), I’ll address the less quantifiable success criterion (1). Could the new triage process be handled by one analyst at a time? The answer is yes. With a daily rotation, each analyst triages only about once per week, leaving the rest of the week for non-triage responsibilities. On the rare occasion that a triager needs some assistance (too many meetings!), off-duty triagers are willing and able to pitch in for some tag-team first responding.

Now for the data, which addresses success criteria (3), (4) and (5). As far as resolving the quick-solves (success criterion (3)), the new process is incredibly effective. As I write this blog post, triage process v2 has been in place for almost 11 months. In that time, just over 47% of all requests submitted to us have been resolved immediately or almost immediately. (If your request isn’t resolved immediately, don’t be upset! We have gotten very good at identifying when a request will need more work and when it can be resolved right away. If we send your request to the chute, trust us that that is the best route for it in the long run.)

Regarding time to assignment (success criterion (4)), the report below, created using the SQL Report Builder in Magento BI, illustrates that not only did we not increase the time to assignment, we actually decreased it. We also reduced the weekly variability and narrowed the gaps between the median, 75th and 90th percentiles.

Assignment times

All points to the left of the red line correspond to triage v1, and all points to the right correspond to triage v2. The blue line represents the median time to assignment, the orange line represents the 75th percentile, and the green line represents the 90th percentile.
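For readers who want to see how the three lines in such a chart are derived, here is a minimal Python sketch that computes the median, 75th and 90th percentiles of time to assignment for each week. The report itself was built with the SQL Report Builder, so this is an alternative illustration only: the `percentile` and `weekly_summary` helpers are hypothetical, the sample hours are made up, and linear interpolation between ranks is an assumed convention that a given reporting tool may not share.

```python
def percentile(values, pct):
    """Percentile using linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def weekly_summary(hours_by_week):
    """Return the 50th, 75th and 90th percentiles of assignment time per week."""
    return {
        week: {p: percentile(hours, p) for p in (50, 75, 90)}
        for week, hours in hours_by_week.items()
    }


# Made-up sample data: hours from submission to assignment, grouped by week.
hours_by_week = {
    "week 1": [1, 2, 3, 4, 5],  # wide spread between fast and slow assignments
    "week 2": [2, 2, 2, 2],     # consistent assignment times
}

summary = weekly_summary(hours_by_week)
for week, stats in summary.items():
    print(week, stats)
```

When the three percentiles sit close together, as in the second made-up week, most customers are waiting about the same amount of time, which is the kind of consistency the chart is meant to show.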

Lastly, for time to resolution (success criterion (5)), we noticed a similar reduction in the week-to-week variability and a similar narrowing of the gaps between the median, 75th and 90th percentiles.

After beta testing the triage v2 process for long enough to collect a trustworthy amount of data, it was clear that not only was triage v2 manageable internally, it also allowed us to offer a superior level of support to our customers.