After reading all our winners this year, you’re probably wondering how we score each candidate. That’s a great question to have! We actually redid everything this year, and I think even the most nitpicky readers out there will appreciate how we improved our processes. There will never be a perfect scoring algorithm, but we’re proud of what we have.
As the eponymous Gary Sims would say: Let me explain.
Last year we debuted a system of objective testing to determine the quality of smartphones, and admittedly it wasn’t as great as it could be. Specifically, the system we used to rank phones was too simplistic, and led to some unexpected results. Nothing wrong, mind you, but we can do better. This year, we generated a ton more data, all with the goal of being able to better contextualize performance instead of merely ranking it. You may have noticed our deep dive reviews here and there — that’s just a taste of what we can do now.
As a refresher, all of our tests are performed in a lab run by our employees, using turnkey solutions that are time-tested by industry professionals. For example, we reached out to our friends at Imatest and SpectraCal to create our camera testing and display testing suites, respectively. Both Imatest’s proprietary imaging analysis software and SpectraCal’s CalMAN software are what bigger manufacturers use, so when we publish data from our test units, it’s very similar to what they’re seeing.
For our processor tests, we gather an array of scores from several different benchmarks, each meant to gather performance data in different situations. For example, we use Geekbench to test the CPU, 3DMark to test the GPU, and so on. We use a large battery of benchmarks to get a complete picture of the phone. The same is true for battery performance — including charge times — and audio performance as well.
After all these tests, we’re left with a huge pile of data to sift through. How do we know what’s good? How do we know what’s bad? How do we fairly score each test?
What does the data mean?
If you’re a veteran of review-reading, you’re probably aware that there’s a sort of homogeneity around scores. That’s something we aimed to break up this time around. Instead of ranking, making gut calls, or polling people on the results, we applied a new way of thinking to the data. Instead of scoring phones in relation to each other, we scored them against the limits of what human beings can actually perceive.
For each metric that could be limited by human perception (screen brightness, color accuracy, etc.), we spent countless hours researching what those limits were, and added them to our master spreadsheet. Then we determined if there were any other philosophical tweaks needed to accommodate how people use their phones. Essentially, we want to reward devices for their performance in relation to how a human perceives it, but we don’t want any outliers in any one measure to tip the scales one way or another.
For each data point, we applied an equation to assign the results a score from 0-100, but the scale rewards and punishes outliers at an exponentially decreasing rate. This way, phones with infinitesimally small audio distortion wouldn’t get a boost if you can’t hear the difference, and phones with one really low score wouldn’t be sunk if they had lots of other bright spots. Once we applied these curves to each minor data point for every major category, we normalized the scores to make every major category (camera, display, audio, etc.) worth the same overall. For our purposes, a score below 10 is bad, a score of 50 is right dead-center between good and bad, a score of 90 is excellent, and a score of 95 exceeds most people’s perception. Consequently, a score of 100 or 0 is nearly impossible to achieve.
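To make the shape of such a curve concrete, here is a rough sketch in Python. It is an illustration under assumptions, not the equation from our spreadsheet: it assumes a logistic curve pinned so that one measurement scores 50 and the perceptual limit scores 95, and it assumes a plain average for the per-category normalization. The names `perceptual_score`, `category_score`, `midpoint`, and `limit`, along with the distortion figures, are hypothetical.

```python
import math


def perceptual_score(value, midpoint, limit):
    """Map a raw measurement onto a 0-100 scale with a logistic curve.

    midpoint: measurement that should score 50 (hypothetical anchor).
    limit: measurement at the edge of human perception, which should
        score 95 (hypothetical anchor). Lower-is-better metrics work
        too, as long as limit < midpoint.
    """
    # Solve 100 / (1 + e^(-k * (limit - midpoint))) = 95 for k.
    steepness = math.log(19) / (limit - midpoint)
    exponent = -steepness * (value - midpoint)
    # Clamp so extreme outliers cannot overflow math.exp.
    exponent = max(-700.0, min(700.0, exponent))
    return 100 / (1 + math.exp(exponent))


def category_score(metric_scores):
    """Average per-metric scores so every category stays on a 0-100 scale.

    Averaging is an assumed stand-in for the normalization step.
    """
    return sum(metric_scores) / len(metric_scores)


# Made-up audio distortion figures (percent, lower is better).
for distortion in (5.0, 1.0, 0.1, 0.01):
    score = perceptual_score(distortion, midpoint=1.0, limit=0.1)
    print(f"{distortion}% distortion -> {score:.1f}")
```

With these made-up anchors, cutting distortion from 0.1% to 0.01% adds only about one point, because improvements past the perceptual limit sit on the flat part of the curve. The same flattening at the bottom end keeps a single terrible measurement from dragging a category average down without bound.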
While we won’t publish our internal scores for everything, we may refer to them from time to time to drive certain points home. There’s a lot of hyperbole out there, and we’d like to put your minds at ease: even the worst smartphones are objectively pretty decent most of the time. If something scores well against our algorithms, it means that you probably won’t be able to tell the difference between it and the one “best” product for that test.
Experience pays dividends
Of course, not every category is objectively as important as others. Who cares about screen performance if the battery won’t let you use the display for more than an hour or so? In that light, we also applied weighting modifiers to reward categories we feel are more important than others. For example, audio is mostly differentiated by the speakers, as wired audio and Bluetooth are very much settled questions. Therefore, it isn’t weighted as heavily as, say, battery life or the screen.
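As a rough sketch of how weighting modifiers like these could work, the Python below takes a weighted mean of category scores. The only thing taken from the text above is that audio counts for less than battery life or the screen. The function name `overall_score`, the specific weight values, and the phone's scores are all made up for illustration and are not our actual weightings.

```python
def overall_score(category_scores, weights):
    """Weighted mean of 0-100 category scores; the result is also 0-100."""
    total_weight = sum(weights[name] for name in category_scores)
    weighted_sum = sum(
        score * weights[name] for name, score in category_scores.items()
    )
    return weighted_sum / total_weight


# Hypothetical weights: audio counts for less than battery or display.
WEIGHTS = {
    "battery": 1.5,
    "display": 1.5,
    "camera": 1.0,
    "performance": 1.0,
    "audio": 0.5,
}

# Made-up category scores for a single phone.
phone = {
    "battery": 80,
    "display": 90,
    "camera": 70,
    "performance": 85,
    "audio": 60,
}

print(round(overall_score(phone, WEIGHTS), 1))
```

Dividing by the total weight keeps the result on the same 0-100 scale as the category scores, so a weighted score can still be read against the same "50 is the midpoint" yardstick.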
Our spreadsheet automatically shows us the winner in each category and provides a gross score. That gross score is what determines the best smartphone overall, and it also helps us pick more granular awards, like the best smartphone for China or the US. Neat, huh?
While you may not agree with some of our picks, that usually means that your constellation of needs is unique to you, which is totally fine! You may find that if you were able to play with our weightings to reflect your needs, our data would agree with you. However, we had to make a call here, and decided that our experience was a better way to guide our final weightings than using polling data.
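To show how category winners, a gross score, and a change of weightings interact, here is a small Python sketch. Everything in it is hypothetical: the phones, their scores, both sets of weights, and the function names `category_winners` and `overall_ranking` are invented for illustration and do not come from our spreadsheet.

```python
def category_winners(phones):
    """Return {category: name of the phone with the highest score}."""
    categories = next(iter(phones.values())).keys()
    return {
        category: max(phones, key=lambda name: phones[name][category])
        for category in categories
    }


def overall_ranking(phones, weights):
    """Return phone names sorted by weighted gross score, best first."""
    total_weight = sum(weights.values())

    def gross(name):
        weighted = sum(phones[name][c] * w for c, w in weights.items())
        return weighted / total_weight

    return sorted(phones, key=gross, reverse=True)


# Made-up phones and category scores, for illustration only.
PHONES = {
    "Phone A": {"battery": 90, "display": 70, "audio": 60},
    "Phone B": {"battery": 70, "display": 92, "audio": 85},
}

# Two hypothetical sets of weightings.
DEFAULT_WEIGHTS = {"battery": 2.0, "display": 1.0, "audio": 0.5}
AUDIO_FIRST_WEIGHTS = {"battery": 0.5, "display": 1.0, "audio": 2.0}

print(category_winners(PHONES))
print(overall_ranking(PHONES, DEFAULT_WEIGHTS))
print(overall_ranking(PHONES, AUDIO_FIRST_WEIGHTS))
```

In this toy data, Phone A tops the ranking when battery life is weighted heavily, while Phone B comes out ahead once audio matters most. The measurements are identical in both cases, and only the priorities change.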