Survey Design 101: Choosing Survey Response Scales

How you design a survey or a form will affect the answers you get. This includes the language you use, the order of the questions, and of course the default values and ranges you use.

This article will focus on response scales. This could be in a survey (1-5 vs 1-10 scale, etc) or on a dropdown menu, and it also includes the language you use.

Think back to the last survey that you filled out that asked you your age or salary. How do the default ranges (age 20-25 vs age 20-30) affect the precision of your data?

Surveys, of course, provide a great source of insight into your visitors’ attitudes. Certain surveys also allow you to compare yourself with the competition. But as Jared Spool of UIE put it in a recent talk, because of nuances in survey design, cross-company comparison may not be that easy:

Jared Spool:

“My point with this example is that scale design and anchor choice will influence respondents’ ratings — both higher and lower. This is a key reason why I’m skeptical of the cross-company comparison data sets where each company is using a different survey instrument. So many variables are in play that legitimate comparisons are quixotic.”

So what can you do to get accurate data? It starts with understanding some of the differences and shortcomings in different types of survey responses.

First, here are some different types of response scales…

3 Types of Survey Response Scales

When designing surveys, there tend to be three different models for survey response scales:

  1. Dichotomous
  2. Rating Scales
  3. Semantic Differential Scales

1. Dichotomous Scales

Dichotomous scales have two choices that are diametrically opposed to each other. Some examples would be:

  • “Yes” or “No”
  • “True” or “False”
  • “Fair” or “Unfair”
  • “Agree” or “Disagree”

There’s no chance for nuance in a response, and there’s no way for a respondent to be neutral. But there’s actually a lot of value in the lack of a neutral option.

Sometimes, especially in long surveys, you’re subject to what’s known as the error of central tendency: when answers gradually regress to the middle of the scale, or the neutral options. A dichotomous scale will give you a clearer, binary answer, but can also fall prey to fatigue – respondents then tend to lean toward positive answers.

2. Rating Scales

Rating scales are probably what you’re most familiar with. “On a scale of 1-10, how satisfied were you with our service today?”

The three most common rating scales are:

  • 1-10 scale
  • 1-7 scale
  • Likert scale (1-5)

Is there a difference in the outcome based on which scale you choose? Totally. There’s more variance in the larger scales, so the Likert scale has become the norm.

Dr. Rob Balon advises: “always use the 1-5 scale with 5 being the positive end and 1 being the negative end. NEVER use 1 as the positive end.”

A Point On The Likert Scale

Another great point from Jared Spool’s talk is on Likert Scales, one of the most commonly used in survey design. He actually rails against the labels we use on the scales (satisfied and dissatisfied) instead of the scale itself:

Jared Spool:

“It’s about how we create the scale. We start with this neutral point in our scales. This is how a five-point Likert scale works. We add two forms to it—in this case, satisfied and dissatisfied – and then, because we think that people can’t just be satisfied or dissatisfied, we’re going to enhance those with adjectives that say “somewhat” or “extremely.” OK, but extremely satisfied is like extremely edible. It’s not that meaningful a term.

What if we made that the neutral, and we built a scale around delight and frustration? Now we’ve got something to work with here. Now we’ve got something that tells us a lot more. We should not be doing satisfaction surveys, we should be doing delight surveys. We need to change our language at its core, to make sure we’re focusing on the right thing. Otherwise, we get crap like this.”

So even if satisfied and dissatisfied are “common practices,” they may not be “best practices.” Especially in user experience research, where you’re really trying to delight customers, not just satisfy them.

3. Semantic Differential Scales

Semantic differential scales are used to gather data and “interpret based on the connotative meaning of the respondent’s answer.” These, too, usually have dichotomous words at either end of the spectrum. They generally measure more specific attitudinal responses.

According to Dr Rob Balon, CEO of The Benchmark Company, “ironically, when you factor analyze SD scales, they basically break out into two factors: positive and negative. There is really no need for seven steps.”

Which Should You Use?

It depends what type of data you want.

Dichotomous scales (“yes” vs “no”) are great for precision in your data, but they don’t allow for any sort of nuance in respondents’ answers. For instance, asking if a customer was happy with the experience (yes or no) gives you almost no insight into whether you’re improving experiences for the average customer.

Something like a Likert Scale or an NPS could be better for that because of the increased range of the scale. Although, and this is a big point, Jared Spool said, “Anytime you’re enlarging the scale to see higher-resolution data it’s probably a flag that the data means nothing.”

I think, then, that the more quantifiable the information is (behavior questions, for instance), the smaller the range should be. When you want to measure attitudes or feelings, using a 5- or 7-point semantic differential scale is a good strategy. Likert scales (satisfied vs dissatisfied) are a little generic for attitudes, and as SurveyGizmo said, “semantic differential questions are posed within the context of evaluating attitudes.”

There’s also an old technique known as a Guttman scale that puts a twist on either dichotomous or Likert scales. What you do is ask a series of questions that build on each other and escalate in intensity. A great example can be found at changingminds.org.

Jared Spool talked about the Guttman scale in its relation to customer surveys, saying, “If you’re not happy enough to recommend the product, you’re not going to be confident and you’re not going to feel it has good integrity if you’re not confident and you’re not going to have pride in it unless they have good integrity, and you’re definitely not going to be passionate about them unless they do everything else.”

This can be a useful tool for measuring satisfaction.
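
To make the cumulative idea concrete, here is a minimal scoring sketch. The item order is taken from Spool’s quote above; the function names and the yes/no answer format are assumptions made for illustration, not part of any particular survey tool.

```python
# Items ordered from easiest to hardest to agree with (order taken from the quote above).
ITEMS = ["recommend", "confident", "integrity", "pride", "passionate"]

def guttman_score(answers):
    """Count consecutive 'yes' answers, starting from the easiest item."""
    score = 0
    for item in ITEMS:
        if not answers.get(item, False):
            break
        score += 1
    return score

def is_consistent(answers):
    """True if no 'yes' appears after the first 'no' (a perfect Guttman pattern)."""
    flags = [bool(answers.get(item, False)) for item in ITEMS]
    return flags == sorted(flags, reverse=True)

# Hypothetical respondent: agrees with the first three items only.
respondent = {"recommend": True, "confident": True, "integrity": True,
              "pride": False, "passionate": False}
print(guttman_score(respondent), is_consistent(respondent))
```

A respondent who says “yes” to a harder item after saying “no” to an easier one breaks the pattern, which is usually a hint that the items are not ordered the way you assumed.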

Ordinal and Interval Scales

In a 1946 paper, S.S. Stevens described four types of measurement scales:

  • Nominal
  • Ordinal
  • Interval
  • Ratio

When it comes to response scales, there’s a long-running debate over ordinal and interval scales.

Ordinal scales are numbers that have an order, like “a runner’s finishing place in a race, the rank of a sports team and the values you get from rating scales used in surveys or questionnaires like the Single Ease Question.” (source)

With ordinal scales, if you’re asking on a scale of 1-5 how satisfied a customer was, a 4 doesn’t necessarily mean they’re twice as satisfied as a 2. The difference between a 1 and a 2 isn’t necessarily the same as the difference between a 4 and a 5.

Interval scales are when we can establish equal distances between ordinal numbers – for example, when we measure temperature in degrees Fahrenheit. The difference between 19 and 20 degrees is the same as the difference between 80 and 81.

According to Jeff Sauro, Founder of MeasuringU, rating scales can be scaled to be interval:

Jeff Sauro:

“Rating scales can be scaled to have equal intervals. For example, the Subjective Mental Effort Questionnaire (SMEQ) has values that correspond to the appropriate labels. You can see the distance between the numbers is equal, but the labels vary depending on how enough people interpreted their meaning (originally in Dutch).”

What’s The Practical Difference?

There are two arguments here.

The classic stance, from S.S. Stevens, is that you can’t compute means on anything less than interval data. As Sauro explained it, “he said that you can’t add, subtract much less compute a mean or standard deviations on anything less than interval data.” Sauro continues:

Jeff Sauro:

“This restriction is a problem for many academics and applied researchers because rating scale data is at the heart of marketing, usability and much of social sciences research. If we cannot use means and standard deviations we also cannot use most statistical tests (which use means and standard deviations in their calculations). Even most non-parametric tests convert raw values to ranks (ordinal data) and then compute the mean or median.”

However, the other argument, set forth by psychometrician Frederic Lord, says you can. According to him, it doesn’t matter where the numbers come from; you can work with them the same way. Jeff Sauro gave a great example of this:

Here are 6 task times (ratio data):

7,6,4,2,9,10

Here are 6 high temperatures in Celsius from a Northeastern US city (interval data):

7,6,4,2,9,10

Here are 6 responses to the Likelihood to Recommend Question (ordinal data):

7,6,4,2,9,10

Now here are 6 numbers that came from the back of football jerseys (nominal data):

7,6,4,2,9,10
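
The point of the example is that the arithmetic can’t tell the four cases apart. Here is a minimal sketch using Python’s standard `statistics` module; the labels and variable names are just illustrative.

```python
from statistics import mean, stdev

values = [7, 6, 4, 2, 9, 10]  # the same six numbers used in all four cases above

labels = ["task times (ratio)", "temperatures (interval)",
          "likelihood to recommend (ordinal)", "jersey numbers (nominal)"]

for label in labels:
    # The computation never looks at the label, so every row prints the same result.
    print(f"{label}: mean={mean(values):.2f}, sd={stdev(values):.2f}")
```

Whether the result is meaningful is a separate question: an average jersey number can be computed, but it tells you nothing useful.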

Does it Matter Whether Your Data is Interval or Ordinal?

Outside of academia, there’s not a lot of debate. The magnitude of a difference matters, but what matters most is evidence of improvement. Jeff Sauro explains what this means in practice:

Jeff Sauro:

“In applied research we are in most cases interested in determining which product or design generates higher scores, whether these be on satisfaction, usability or loyalty. The magnitude of the difference is also important–a 2 point difference is likely more noticeable to users than a ¼ point difference. But even if you were to commit the error and say that users were twice as satisfied on one product you’ve almost surely identified the better of two products even if the actual difference in satisfaction is more modest.”

And according to Dr Rob Balon, “outside of academia there is virtually no argument. Most online surveys utilize descriptive statistics and simple banners or cross tabs that can be analyzed using Chi-square which is a nonparametric analytical tool.”

It’s impossible to evaluate the validity of ratings of human perception anyway, so feel good about working with ordinal data in general.
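
Balon’s point about cross tabs and Chi-square can be sketched in a few lines. This is a plain-Python version of Pearson’s chi-square statistic, written out only to show what the test does; the counts are hypothetical and the function name is an assumption.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts.

    Assumes every row and every column contains at least one response.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect in this cell if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

# Hypothetical counts: rows are two designs, columns are "satisfied" / "not satisfied".
crosstab = [[60, 40],
            [45, 55]]
stat, dof = chi_square(crosstab)
print(f"chi-square = {stat:.2f}, degrees of freedom = {dof}")
```

With one degree of freedom, the usual 5% critical value is about 3.84, so a statistic above that suggests the two groups answered differently. In practice you would likely reach for a library routine such as `scipy.stats.chi2_contingency`, which also returns a p-value (and applies a continuity correction to 2x2 tables by default).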

The Limitations of Survey Scales

Even if you design the perfect survey with the appropriate scales, there are still limitations in the insight you can conceivably uncover, especially if you’re only running a limited range of surveys or conducting them sporadically (and without other forms of conversion research).

The Meaning Behind The Numbers

When you run a scale like, say, the Net Promoter Score, you get a number and you can compare that with your competitors and your past scores, but there are certainly limitations in how much it can tell you about your user experience.

I haven’t heard a better explanation of this than in Jared Spool’s talk on design and metrics:

Jared Spool:

“I was so disappointed when the people at Medium sent me this: “How likely are you to recommend writing on Medium to a friend or colleague?” It’s not even a 10-point scale. It’s an 11-point scale, because 10 was not big enough.

This is called a Net Promoter Score, and Net Promoter Scores, if you look at the industry averages that everybody wants to compare themselves to, the low end is typically in the mid-60s and the high end is typically in the mid-80s. You need a 10-point scale because, if you had a 3-point scale, you could never see a difference. Anytime you’re enlarging the scale to see higher-resolution data it’s probably a flag that the data means nothing. Here’s the deal. Would a net promoter score for a company say, like United, catch this problem?

Alton Brown bought a $50 guest pass to the United Club in LA and had to sit on the floor. I wonder what his net promoter score for that purchase would be? It probably wouldn’t tell anybody at United what the problem is.

But that’s a negative. What about the positive side?

What’s actually working well? Customers of Harley-Davidson are fond of Harley-Davidson, so fond that they actually tattoo the company’s logo on their body. This is branding in the most primal of definitions.”

It also depends what you’re selling. As Caroline Jarrett, author of Forms That Work, said:

Caroline Jarrett:

“Just from the point of view of using a Net Promoter Score as a question in a survey, we have to ask whether that question means as much to the people answering it as it might to the business. There are some things where “I’ll recommend this to a friend” is a really important thing that people would actually do. But there are other things where you’d never recommend it to a friend because you don’t do recommending, and you certainly don’t do recommending of those type of things.

So you might actually be very enthusiastic about the product, but you just might not ever feel the urge to recommend hemorrhoid cream to your pals. You know? That’s not then giving a true measure of the value of that product. I have my skepticism about Net Promoter Score.”

All of this is to say that rating scales can tell you a lot, but they can’t tell you everything. Be skeptical when people tell you there’s one question that will tell you how your company is doing.
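
To see how much a single score can hide, here is a minimal sketch of the conventional Net Promoter Score calculation on the 0-10 question: the percentage of promoters (9 or 10) minus the percentage of detractors (0 through 6). The function name and both sets of ratings are made up for illustration.

```python
def net_promoter_score(ratings):
    """NPS = % promoters (9-10) minus % detractors (0-6), on a 0-10 scale."""
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)

# Two hypothetical sets of responses that tell very different stories.
lukewarm = [7, 8, 7, 8, 7, 8]       # everyone is a "passive"
polarized = [10, 10, 10, 0, 0, 0]   # half love it, half hate it

print(net_promoter_score(lukewarm), net_promoter_score(polarized))
```

Both groups come out with the same score, even though one audience is mildly content and the other is split between fans and people who had a terrible experience. The number alone can’t tell you which situation you’re in.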

Little Tweaks, Big Differences

Almost any factor can influence the outcome of a survey (which is why Spool’s quote above talked about the difficulty of accurate benchmarking data).

GreatBrook, a research consulting firm, did an experiment with a client where they created a bunch of different surveys with the same attributes, just different scale designs. They posed the questionnaires to 10,000 people, and found some interesting things:

  • Providing a numeric scale with anchors only for the endpoints (e.g. a 1 to 5 scale was presented with verbal descriptions only for the 1 and 5 endpoints) led to more people choosing the endpoints.
  • Presenting a scale as a series of verbal descriptions (e.g. “Are you extremely satisfied, very satisfied, somewhat satisfied, somewhat dissatisfied, very dissatisfied, or extremely dissatisfied?”) led to more dispersion and less clustering of responses.
  • A “school grade” scale led to more dispersion. A school grade scale is where you ask the respondent to grade performance on an A, B, C, D, and F scale.

Using Appropriate Language and Scales

For certain information (say age) there are many ways you can ask for it. Each one produces a different level of precision.

According to MyMarketResearchMethods.com, if you want to report an average age, you would want to use a ratio scale instead of a nominal scale.

Since the ratio scale is more accurate, why do you see ranges for this question (and for questions like income)? Because they’re personal. Some people are sensitive about disclosing their exact age or income, so they’re more comfortable giving a range, such as “25 to 34 years.”

According to Dr Rob Balon, “you almost have to ask age, income, ethnicity, etc. using a nominal clustering approach. Otherwise you run a great risk of non-response error.”

Group Based on Known Characteristics

For certain information, like age or salary, you want to group based on known characteristics. In other words, the way you group incomes depends on the population you’re studying.

If it’s college students, the ranges would be much lower. If it’s a study of the general population, $20k or less is a good first rung, followed by $21k to $39k, $40k to $69k, $70k to $99k, $100k to $150k, and more than $150k.

As Dr Rob Balon advises, “for whatever population you’re studying, make sure those income breaks line up with the known characteristics of the population. Not doing this can create additional bias.”

Similarly, writing surveys in your customers’ own language is important. How do you do that? You get on the phone and talk to your customers. Or run focus groups. Or run some on-site surveys.

No matter what, you want to use the words – the phrases, jargon, emotions – that your customers are used to communicating with.

Best Practices for Demographic Insights

So with sensitive information like demographic info, how do you establish which defaults to use, which words to use, which scale to use?

Well, other than focus groups and interviews to get to know your customers better, there are some general guidelines and best practices (listed here). If you follow these, at the very least your respondents will likely have taken surveys like it before, and therefore will know how to answer things based on that context.

For example, the guideline for age ranges is the following:

  • Under 18 years
  • 18 to 24 years
  • 25 to 34 years
  • 35 to 44 years
  • 45 to 54 years
  • 55 to 64 years
  • Age 65 or older
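
If you do collect exact ages, you can always roll them up into these ranges afterwards. Here is a minimal sketch of that mapping; the bracket boundaries come from the list above, while the function name and sample ages are assumptions for illustration.

```python
from bisect import bisect_right
from collections import Counter
from statistics import mean

# Lower bound of every bracket after the first, matching the guideline list above.
LOWER_BOUNDS = [18, 25, 35, 45, 55, 65]
LABELS = ["Under 18", "18 to 24", "25 to 34", "35 to 44",
          "45 to 54", "55 to 64", "65 or older"]

def age_bracket(age):
    """Map an exact age in whole years to its guideline bracket label."""
    return LABELS[bisect_right(LOWER_BOUNDS, age)]

ages = [17, 22, 24, 31, 38, 44, 45, 67]  # hypothetical respondents
print(Counter(age_bracket(a) for a in ages))
print(f"mean age: {mean(ages):.1f}")  # only possible because exact ages were collected
```

The reverse isn’t possible: once you’ve only recorded the bracket, the exact average is gone. That is the trade-off between precision and respondent comfort.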

You can also have an experienced market research consultant come in and tell you if you’re running things well, but of course, that’s another expense entirely.

Again though, you’ll have to balance the level of specificity you want to obtain with the comfort your audience feels in answering the questions.

Conclusion

Even though it seems like default ranges for survey questions are arbitrary, a lot of thought and design goes into them. Whether you use a Likert scale, a dichotomous scale, or a semantic differential scale depends on what you’re trying to learn.

In addition, when trying to obtain sensitive information like age or income, asking for exact numbers just won’t work (non-response bias), so we use nominal clusters (18-24 etc).

The best thing you can do is, before designing a customer survey, learn survey scale best practices (there’s a link above) and learn what words, phrases, and ranges your audience will best respond to.
