Ratings, NPS, and Machine Learning
This is a revised article from my old blog, originally written in December 2018.
I recently ran a small user testing session on a design I made as part of a UX course I’m currently enrolled in. As part of it, I decided to do a small survey on how the users rated their own technical know-how (the point of the app being that it should be usable by anyone, no matter their technical background). What I found interesting – given that both users were chosen precisely because they weren’t especially used to modern technology (smartphones, computers) – is that both of them rated themselves about as high as I would rate myself. Or possibly just slightly lower.
Me, being the person who regularly thinks about which technologies you can use to render smooth, interactive web experiences, and who is currently learning about modular synthesis, amongst other things. You know, the kind of person who, relative to my own reality (content creator and general technologist), knows a bit of this and that, but is hardly an expert at anything. A clear 7/10.
This suggests that rating oneself – just like conveying whether you’re good at something – is always relative to your own surroundings. Your relative knowledge. If the only people you meet are those who can hardly handle a relatively simple TV menu, and you’re a truly masterful TV handler, you may as well rate yourself 10/10 in technical know-how, because technology relative to you is only the TV, and your environment consists of people who can’t handle the TV nearly as well as you. But maybe, just maybe, you’ll have the insight that there’s actually somebody out there who knows how to handle those TV menus a little bit better, and so give yourself a 9 out of 10 instead.
I used to write resumes with these 1-10 scorings of myself, simply because it was a smart way to convey roughly what I knew in terms of software (I’m quite decent in many programs). This went quite well; there were no questions asked. I imagine whoever read it reasoned along the lines of “if he’s 7/10 at Adobe Premiere, he probably knows the ropes of video editing”. That said, Adobe Premiere, like any video editing software, is fairly narrow in scope – and if you also have a portfolio that says something about your artistic touch to complement that 7/10 rating, the two together may say just enough about your skills in general, depending on which company and position you apply to.
However, the only thing that weighs in here isn’t your knowledge versus the world’s – it’s also how you interpret the scale itself, or how you respond to the question. For example, if you’re prompted to answer a question about yourself, you may be tempted to reply in a way that sways towards your cause. Which may or may not mean that the answer itself is “positive”, but rather that it coheres with your ego. If you struggle with something and are ashamed of it, you’ll probably rate yourself a bit higher; but if you identify with being bad at technology and project that to your surroundings, then rating yourself low probably won’t be a big deal.
To make answers like these interpretable at all, market evaluations tend to use something called Net Promoter Score (NPS for short), which you may have heard of. With NPS, any answer on the 0-10 scale between 0 and 6 counts as a “detractor”, 7-8 as “passive”, and 9-10 as a “promoter”. And even if NPS isn’t explicitly meant for a scenario like this, where you evaluate yourself, both of my participants would have landed at the negative end – even though they rated themselves a 6, which at a glance may have seemed “positive” without NPS. My own score of 7-8 would have been deemed useless, because the only ones truly good at something are those in the 9-10 range. This does, however, disregard that experts in a given subject tend to be quite conservative about rating themselves as true experts – because many of them have opened Pandora’s box and no longer compare themselves with their direct surroundings, but rather with the amount of knowledge there is in the world. With what one can know about something.
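As a side note, the NPS arithmetic itself is simple enough to sketch in a few lines of Python. This is just a minimal illustration, using my two participants’ 6s and my own 7 as a tiny, unscientific sample:

```python
def nps(scores):
    """Net Promoter Score on a 0-10 scale:
    % promoters (9-10) minus % detractors (0-6)."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return round(100 * (promoters - detractors) / len(scores), 1)

# Two participants at 6 (detractors) and my own 7 (passive):
print(nps([6, 6, 7]))  # -66.7 – a rather bleak verdict on our collective know-how
```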
This relates to the Dunning-Kruger effect, formally defined as “a cognitive bias in which people of low ability have illusory superiority and mistakenly assess their cognitive ability as greater than it is”. The inverse would be something like: “a cognitive bias in which people of high ability assess their abilities as lesser than what their environment expects”. Because it may very well be that the experts assess themselves just right, even if they don’t put themselves at the top. To my mind, there are truly few – if any – subjects in which anyone can put themselves at the top. Maybe in very simple tasks, such as running around in a circle or jumping over a stick; but if we add a couple of dimensions and instead make the goal to kick a ball around a field and score as many goals as possible, the data points for what can be deemed true expertise become hard to assess. Is it the number of goals you score? Relative to certain opponents? How fast you can run? Your skill in coordinating with the team? And so forth.
Rating oneself is, however, not only about self-awareness. It may also come down to the question itself and how it’s asked – in text, the semantics; in speech, also the tonality, pacing and so on – which may imply that one answer is more appropriate than another.
I recall a company calling me a few weeks back, wanting me to take part in a survey about my experience of partnering with our municipality during a project. All the questions felt arbitrary and extremely similar to one another – possibly to capture nuances before averaging the scores (I don’t have clear insight into their methods, but I’d imagine that’s what the different categories are for). And every question was to be rated 1-10. They didn’t ask me to elaborate on anything, even though I could very well have told them straight up what I felt worked and what didn’t.
This begs the question: why? Why would you do research like that? Does it really work, and can you get any meaningful data from it? I find it hard to believe. And I sincerely hope they do more elaborate evaluations with the partners they work with more regularly, because if the results aren’t outright useless, the exercise at the very least feels close to a hoax, with results that are arbitrary at best.

If you think of the 1-6 span as negative, most of my answers were negative or passive. Yet, relative to how partnerships with them have gone in previous years, this year was actually really good. And since they didn’t run these evaluations in previous years, none of that improvement will be visible in the data. Even if they repeated the evaluation next year, my experience – and possibly my score (depending on my mood) – wouldn’t improve unless they did something about my biggest complaints, which they can’t possibly know right now, because no such data was collected. And because of the general improvement from previous years to this one, I may even have held them in extra positive regard this time around, whilst next year I may rate them lower – even though they may have improved in ways I don’t care about, or don’t even notice.
Well, I think you see where this is going.
Quantitative research can be applied to almost anything, but that doesn’t mean it should be. Of course, finding something meaningful or measurable in qualitative data isn’t easy either, and it’s much more time-consuming than concatenating some numbers. But even after some heavy googling, I didn’t find a single article about using machine learning to, say, analyze freeform text in more open-ended evaluations – from which I can only assume the technology hasn’t come far enough yet to make something actually usable out of it. To me, though, it sounds like a great way to approach these kinds of problems. Our languages – whether English, Swedish or something else entirely – are capable of capturing nuances in a whole other way, and if it became possible to turn that kind of data into numbers, or to generate a summary that points to the most common faults people highlight, I can imagine it would be a very useful tool for anyone seeking to improve their organization.
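To give a sense of what I mean – and this is just a naive sketch of the idea, not a claim about what proper tooling looks like – even something as crude as counting recurring terms across freeform answers would start to surface the most common complaints. The responses and stopword list below are made up purely for illustration:

```python
from collections import Counter
import re

# Made-up freeform survey answers, purely for illustration.
responses = [
    "The booking system was confusing and the emails arrived too late.",
    "Great people, but the booking system kept logging me out.",
    "Communication was slow; emails often arrived a week late.",
]

# Words too generic to say anything about specific complaints.
STOPWORDS = {"the", "and", "was", "but", "me", "out", "too", "a", "kept", "often"}

def common_themes(texts, top_n=5):
    """Return the most frequent non-stopword terms across all responses."""
    words = []
    for text in texts:
        words += [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    return Counter(words).most_common(top_n)

print(common_themes(responses))
# e.g. [('booking', 2), ('system', 2), ('emails', 2), ('arrived', 2), ('late', 2)]
```

Real tools would of course go much further – topic modelling, sentiment analysis, clustering – but even this crude version points a reader towards “booking” and “late emails”, rather than towards a bare 4/10.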