I found this article here.
Usability concerns the efficiency, effectiveness, and satisfaction of a product. Of the three, satisfaction is typically measured through a post-task or post-session questionnaire, most likely with Likert-style ratings. According to expectancy disconfirmation theory (Oliver, 1977), satisfaction occurs when quality exceeds expectations, and conversely dissatisfaction occurs when quality falls below expectations. Expectation, in turn, was measured through a pre-task questionnaire, similarly with Likert-style ratings. This proposal of comparing expectation ratings against experience ratings was made by Albert and Dixon (2003). The authors, Rich and McGee, argue that this method needs improvement in its reliance on Likert scales, its confusion over task classification, and its requirement for many users, and they thus developed their own method, Usability Magnitude Estimation (UME). In the article they state that “UME is a subjective assessment method where participants assign usability values to targets using ratio-based number assignment”. Through this article, they aim to show a significant improvement over Albert and Dixon (2003).
They conducted a usability test of a prototype Business Intelligence application with six participants. The application helps analysts select the optimal project to invest in under given scenarios. Participants were asked to perform 10 tasks, and UME was used to measure both the expectation and the experience for each task. The data collected through the tests were then statistically analyzed. Note that participants were given instruction and practice before the test started.
They found their method to (a) be easy to administer and analyze, (b) use the same underlying ratio scale for expectation ratings and experience ratings, allowing valid comparisons, and (c) provide a theory-based and empirical strategy for prioritizing usability issues.
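Findings (b) and (c) can be made concrete with a small sketch. The task names, numbers, and the log-ratio analysis below are all hypothetical illustrations of how ratio-scale expectation and experience scores *could* be compared and turned into a priority order; they are not the authors' actual data or analysis:

```python
import math

# Hypothetical UME data: for each task, an expectation rating (given before
# attempting it) and an experience rating (given after), both on the
# open-ended ratio scale the participant chose for themselves.
tasks = {
    "task_1": {"expected": 100.0, "experienced": 50.0},
    "task_2": {"expected": 80.0,  "experienced": 160.0},
    "task_3": {"expected": 120.0, "experienced": 30.0},
}

def log_ratio(expected, experienced):
    """Log of experience/expectation: negative means the task turned out
    worse (less usable) than expected, positive means better."""
    return math.log(experienced / expected)

# Per expectancy disconfirmation theory, the most negative log-ratio marks
# the largest gap between expectation and experience, i.e. the task whose
# usability issue should be prioritized first.
scores = {name: log_ratio(t["expected"], t["experienced"])
          for name, t in tasks.items()}
priority = sorted(scores, key=scores.get)  # worst-first ordering

print(priority)
```

Because UME values are ratios rather than fixed Likert categories, dividing (or taking log-differences of) the two ratings is meaningful, which is exactly what finding (b) claims a shared ratio scale buys you.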
I generally agree with findings (b) and (c). The change brought by taking ratios of scores that the participants picked themselves definitely helps to distinguish the severity of problems. I could imagine giving scores as low as 1 and as high as 1000 for designs that are innovative in many ways but suffer from significant design flaws; one example of this is Windows 8. However, I think the article I read last week shows that Rich and McGee's method has not in fact supplanted the method developed by Albert and Dixon. It showed that participants found it hard to get UME right, especially in the early stage of testing, which means results from the early stage and from the rest could be incomparable due to the participants' learning curve. The problem may have been a lack of the “instructions” and “practice” given in this study. However, having to give instructions and practice to participants just so they can use UME calls its ease of administration into question. The article suggested two practice tasks for learning UME, which raises the number of tasks participants must perform from 10 to 12, and of those 12, only 10 are even considered reliable and useful. Therefore, the conclusion the authors draw is hardly convincing. Instead of just preaching the benefits of UME, I would prefer to see UME go head to head against a Likert scale, as in the last article I wrote about.