Whenever items or individuals are measured, error is likely. Unintentional mistakes may
occur when something under investigation is measured and the true response is sought but not revealed.
This is common in research. Since virtually all research efforts are flawed, marketing researchers must routinely assess the accuracy of their information.
Researchers must determine measurement error, which is the difference between the information sought and the information actually obtained in the measurement process.
Measurement includes true, accurate information plus some degree of error. We can summarize this idea as follows:
Measurement results = True measurement + Measurement error
Two potential sources of error exist in the measurement process: systematic error and random error. Therefore we can extend the equation as follows:
Measurement results = True measurement + (Systematic error + Random error)
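The decomposition above can be illustrated with a small simulation. The true value, the constant bias, and the spread of the random error below are invented for illustration; in a real study only the observed scores are available, never the separate components:

```python
# Sketch of: observed score = true value + systematic error + random error.
# All figures are illustrative, not drawn from any real study.
import random

random.seed(42)
true_value = 50.0
systematic_error = 3.0  # constant bias, e.g. from a leading question

# 1000 observations: bias is added to every one; random error varies each time
observations = [true_value + systematic_error + random.gauss(0, 2)
                for _ in range(1000)]

mean_obs = sum(observations) / len(observations)
print(f"mean observed = {mean_obs:.1f}")  # near 53: bias persists, random error averages out
```

Averaging many observations cancels the random component but not the systematic one, which is why a constant bias cannot be detected from the sample alone.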
Systematic error is caused by a constant bias in the design or implementation of the measurement situation.
This type of error occurs in samples where the findings are either consistently higher or consistently lower than the actual value of the population parameter being measured.
Systematic error is also referred to as non-sampling error, because it encompasses all types of errors except those brought about by random sampling.
In relation to assessing errors in the measurement instruments, reliability and validity are two central concepts.
A reliable scale is a prerequisite to sound research. Reliability refers to the ability of a scale to produce a consistent result if repeated measurements are taken.
If a lecturer gives a group of students two different tests to measure their knowledge of marketing research, and the students’ scores from the two measures are very similar, then the measures can be said to be reliable since they replicated each other’s scores.
Reliability is the extent to which scales are free of random error and so produce consistent results.
In general, the less random error detected, the more reliable the data will be. Systematic sources of error do not lessen reliability, because they influence the measurement consistently rather than creating inconsistencies in it.
In test-retest reliability, respondents are given identical sets of scale items at two different times, under nearly equivalent conditions.
The time interval between tests is typically two to four weeks. The degree of similarity between the two measurements is determined by computing a correlation coefficient. The higher the correlation coefficient, the greater the reliability.
The test-retest approach is subject to time constraints, which creates another potential problem for the researcher.
The greater the time interval between the first and second tests, the less reliable the scale will be. Also, environmental and personal elements may change and alter the results of the second test.
Another issue is that it is often difficult to persuade the original respondents to take a second test. There may also be a carryover effect from the first measure.
This is called the halo effect. Suppose respondents were initially asked to rate the service of a shop.
Their response to a similar question asked two weeks later may be influenced by their initial response.
A third problem with test-retest reliability is that some situations can be measured only once.
Alternative-forms reliability (sometimes called “equivalent-forms reliability”) is the ability of two “equivalent” scales to obtain consistent results.
To carry out this test, researchers administer one scale to respondents and, about two weeks later, administer the second equivalent scale to the same respondents.
In theory, there should be no carryover effect, because the items are different, so scores from the first scale should not affect scores on the second.
A similar number of questions should be used on each scale to measure the topic under investigation.
After the respondents have completed the two scales, researchers compare the measurement instruments item-by-item to determine how similar they are.
The problem with alternative-forms reliability lies in constructing two scales that appear different yet have similar content.
The alternative-forms test resembles the test-retest method, except that it uses two different measurement instruments rather than administering the same instrument twice.
In internal-consistency reliability, two or more measurements of the same concept are taken at the same time and then compared to see whether they agree.
Suppose the following four statements using a Likert scale (choices range from “strongly agree” to “strongly disagree”) are used to determine consumers’ attitudes towards My Bank’s customer service: “I always enjoy visiting My Bank,” “I like the people who work at My Bank,” “My Bank satisfies my banking needs,” and “The services I receive at My Bank are excellent.”
The extent to which the four measures correlate across a sample of respondents indicates the reliability of the measures. As the correlation increases, the reliability of the measures increases.
The easiest way to test for internal consistency is to use the split-half technique. This method assumes that the items can be divided into two equivalent subsets that can be compared.
Several methods have been devised to divide the items randomly into halves and compute a measure of similarity of the total scores of the two halves across the sample.
An average split-half measure of similarity, coefficient alpha, can be obtained from a procedure that has the effect of comparing every item with every other item.
Coefficient alpha (or Cronbach’s alpha) is a technique for judging internal consistency of a measurement instrument by averaging all possible ways of splitting test items and examining their degree of correlation (Cronbach, 1951).
The closer the correlation is to 1, the higher the internal consistency (Cronbach, 1990). A score of 0.60 or less indicates that the items measure different characteristics.
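Coefficient alpha can be computed from the standard formula alpha = k/(k-1) × (1 − Σ item variances / variance of total scores). The four-item “My Bank” responses below are invented for illustration:

```python
# Cronbach's alpha for a k-item scale:
#   alpha = k/(k-1) * (1 - sum(item variances) / variance(total scores))

def variance(scores):
    """Population variance of a list of scores."""
    m = sum(scores) / len(scores)
    return sum((s - m) ** 2 for s in scores) / len(scores)

def cronbach_alpha(items):
    """items: list of k lists, each holding one item's scores across respondents."""
    k = len(items)
    totals = [sum(resp) for resp in zip(*items)]  # each respondent's total score
    item_var_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# rows = the four "My Bank" items, columns = respondents (Likert 1-5, illustrative)
items = [
    [4, 5, 3, 5, 2, 4],
    [4, 4, 3, 5, 2, 5],
    [5, 5, 2, 4, 3, 4],
    [4, 5, 3, 5, 2, 4],
]
alpha = cronbach_alpha(items)
print(f"Cronbach's alpha = {alpha:.2f}")
```

Here the items vary together across respondents, so alpha is well above the 0.60 threshold mentioned above.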
How can reliability be improved?
Here are some ways to improve reliability:
• Increase the number of measurements. Instead of using the scores from one test, sum or average the scores on several equivalent forms of the test. To do this, increase the number of test items, checking to make sure that the new items examine similar concepts.
• Use good experimental controls. To minimize non-systematic or random factors, the testing situation must be conducive to achieving consistent responses. Therefore, make sure that lighting is comfortable and consistent, measuring devices such as stopwatches work properly, the measurement scale is reliable and consistent, and test administrators know how to avoid creating bias in respondents.
• Be careful to select only items relevant to the topic for measurement. Define the study topic carefully and correctly and then write test items that will accurately reveal information about that topic.
Just because a measurement scale produces consistent results doesn’t mean it measures the right concept.
Validity is the degree to which a test measures what it is supposed to measure: “Are we measuring what we think we are measuring?”
All too often researchers think they are measuring one thing when they are actually measuring something else.
There are several ways to assess the validity of measurement instruments: content validity, criterion-related validity, construct validity, convergent validity, and discriminant validity.
One way to judge the content validity of a scale is to ask experts on the test topic to assess the scale.
Scales that pass this test are said to have content validity. This test is subjective because the personal experiences and beliefs of the experts inevitably come into play.
Content validity is the most often used validation technique, because it is not time-consuming and is easy to do.
Criterion-related validity is the ability of a scale to perform as predicted in relation to a specified criterion.
The criterion is the attribute of interest. The predictor is the respondent’s score. Suppose that a postgraduate business school tries to determine applicants’ potential by asking them all to take an admissions test.
The criterion is each applicant’s potential to succeed on the course. The predictor is the applicant’s test score.
What is important is how well the predictor determines the applicant’s potential for success in the course.
There are two types of criterion-related validity: concurrent validity and predictive validity.
Concurrent validity evaluates how well the results from one scale correspond with the result from another when the scales measure the same phenomenon at the same time.
Validity is determined by how closely the results correlate with each other. Concurrent validity also assesses how well a set of independent variables can predict the dependent variable in the light of new information.
Assume, for example, that respondents have filled in a questionnaire and that a model has been built based on relationships between the variables. Concurrent validity then concerns how well that model fits data collected from other respondents at the same time.
Predictive validity is the ability of a scale to predict a future occurrence or phenomenon. What differentiates this form of validity from concurrent validity is the time period when the tests are administered.
If a brand’s market share one year after the launch is 17 percent and the agency’s market research prior to the launch predicted a share of 16 to 19 percent with a 95 percent probability, this is an example of good predictive validity.
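The market-share example above amounts to a simple interval check; the figures below are the ones from the example:

```python
# Predictive validity sketch: did the pre-launch forecast interval
# contain the market share actually observed a year later?

predicted_low, predicted_high = 16.0, 19.0  # forecast range (%), 95% probability
actual_share = 17.0                          # observed market share (%)

within = predicted_low <= actual_share <= predicted_high
print(f"forecast held: {within}")
```

Repeating such checks across many forecasts is what establishes a research agency's predictive validity, not a single successful prediction.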
Construct validity concerns an abstract, unobservable, hypothesized concept. Constructs can be characteristics such as intelligence, aptitude, strength, love, and creativeness.
In marketing, constructs that researchers often want to measure include service quality, customer satisfaction, and brand loyalty.
Because of their abstract nature, there is no direct way to measure constructs, so researchers measure observable phenomena that theoretically demonstrate the presence of the construct.
Suppose a researcher wants to measure the quality of a company’s service. Theory states that the amount of repeat business, an observable phenomenon, reflects service quality, an unobservable phenomenon.
If this theory is valid, an instrument can be created that measures repeat business and uses the results as a measurement of service quality.
A scale has construct validity if it measures an observable phenomenon that an underlying theory correlates with the construct of interest.
Stated another way, construct validity assesses how well ideas or theories are translated into real measures.
In this case, a scale has construct validity if it can show that repeat business demonstrates service quality.
The validity of the underlying theory, that repeat business demonstrates service quality, is key to the validity of the scale.
If the theory is wrong and there is no association between the two, then the scale is not valid; that is, it won’t measure service quality even if it does measure repeat business well.
When construct validity is not found, the cause may be either a flawed measurement scale or a flaw in the underlying theory.
To avoid these problems, researchers try to establish the construct validity of a measure by relating it to many constructs rather than just one. Researchers also try to use proven theories.
Construct validity exists if both convergent and discriminant validity are present.
Convergent validity is the ability of a scale to correlate with other scales that purport to measure the same concept, the logic being that two or more measurements of the same concept using different scales should agree if they are valid measures of the concept.
If the results from different scales that claim to measure the same construct are highly correlated, then convergent validity is established.
Relationship between reliability and validity
Ideally, a measurement used by a market researcher should be reliable and valid. Figure 6.10 shows various types of reliability and validity measures.
Although this figure treats validity and reliability as being independent of each other, there is actually a one-way relationship between them.
A scale must be reliable to be valid; but it does not have to be valid to be reliable. Further, reliability is a necessary but not sufficient condition for validity, because validity also requires other factors to be satisfied (that is, supported from theory and observation).
Validity is not a necessary condition for reliability. For deeper coverage of validity and reliability see Carmines and Zeller (1979).
Generalizability refers to the extent to which it is possible to generalize from the present sample data to the universe.
Usually, a poll that is based on a nationally representative sample of a thousand or more voters is believed to possess generalizability.
In contrast, comprehensive quantitative models of consumer behaviour, such as LISREL models, are sometimes criticized for low generalizability.
This happens because the model’s parameters cannot be applied to a universe or market environment outside the one (the specific sample) that was used for building the model.
A study finding or a model may possess good validity and be reliable while its findings are not generalizable.
What are the arguments for and against the inclusion of a neutral response position in a symmetric scale?
Can random error be avoided? If so, how? If not, why not?
How is criterion validity assessed?
When assessing a measurement instrument’s construct validity, why is it necessary for the instrument to have a theoretical foundation?
What is the halo effect, and how does a researcher allow for it?
What are the necessary conditions for a study’s generalizability?
Keywords: reliability, validity, measurement error, systematic error, random error, test-retest reliability, halo effect, alternative-forms reliability, equivalent-forms reliability, internal-consistency reliability, split-half technique, coefficient alpha, Cronbach’s alpha, content validity, criterion-related validity, concurrent validity, predictive validity, construct validity, convergent validity, discriminant validity