Seems there is some confusion in the comment by Wilco.
The article is about AndEBench-Pro, and the comments are about AndEBench.
Addressing the comments specifically:
" It must be a typical workload that is representative " - The platform scenario part of AndEBench-Pro performs a typical Android workload, including decryption, encryption, image processing and effects, database queries, interaction with GUI elements, XML processing, etc. This scenario uses Android OS calls and is very typical of real-world apps.
" In any case a 10-line function scanning numbers represents an infinitesimally small proportion of a compilation workload. Even Dhrystone is much more complex than this " - This is a reference to CoreMark, and not really to AndEBench-Pro. Readers may want to check out http://www.eembc.org/techlit/coremark-whitepaper.pdf for actual details of CoreMark. As for Dhrystone being more complex than one of the functions in CoreMark: for a person claiming to have compiler experience, I find that comment curious. Characterization data has shown CoreMark to be more complex than Dhrystone, as well as a much better predictor of real-world processor performance.
" To make matters worse, the results are CRCd using code that is clearly trying to win the prize for being the world's most inefficient CRC implementation " - again clearly a reference to CoreMark. Since AndEBench-Pro does include a version of code based on CoreMark, I will address this comment specifically, even though that code contributes only ~10% of the native CPU score and therefore ~2.5% of the overall score. If you check the documentation for AndEBench-Pro, you will find it states that the particular workload based on CoreMark has been modified for processors that are typical in mobile, rather than trying to make it run on anything from 8b microprocessors onwards. One of the modifications mentioned is using a table-based CRC implementation. Efficiency-wise, if you have an 8b or 16b micro with limited memory, a table that is 256 bytes long may be considered less efficient than a small loop that will compile to 16b overall. For processors with long pipelines and lots of memory, a table-based approach will likely be better, except if you need to fetch values for that table from memory. I invite you to compare the execution time of the CRC loop in the original CoreMark for one value against the execution time of a LUT for one value. Or, for that matter, the execution time in a loop on a processor that does not have a cache.
" And because the test is so simplistic and inefficient, it is trivial to add some compiler optimization to target this specific code. Various compilers can now optimize the switch statement away, giving a huge speedup. Of course this optimization does not help any other code in the real world... And you see those cheated results announced with big fanfare on the EEMBC website. " - again a reference to CoreMark and not to AndEBench-Pro. It is clear you are frustrated with CoreMark. Since, according to a previous paragraph, you worked on compiler front ends, you may want to ask the compiler back-end guys before claiming the switch statement can be optimized away, or at least read the literature (e.g. check the GCC implementation described in https://gcc.gnu.org/bugzilla/show_bug.cgi?id=54742). What compilers have actually done is expose more control over tradeoffs between code size and performance to users. This optimization does help real code in the real world; see http://www.mcmanis.com/chuck/robotics/projects/servo.html for an example of motor control where this same optimization will help the interrupt routine run faster if your microcontroller has the available RAM. If you ever work with microprocessors, as in write actual applications for them, you will find that small state machines like this are widely used. This is actually a good example of how a benchmark can help push the compiler industry to make improvements that will help real users struggling with real-life problems, and not just be a marketing tool.
As for the other inflammatory comments, it may be a good idea to point to what you think is a good benchmark, since you feel that 99% of benchmarks fail the very first test of what a good benchmark is. In particular, one that you have actually written or at least participated in designing. EEMBC is providing a valuable resource for the industry by creating benchmarks that are well thought out and designed to be relevant for their targets.
In regard to benchmarking, I do agree that the average user is not interested in benchmarks in general. They often care more about the service, applications, and content. However, benchmarking is still of value to carriers, system designers, and others that are evaluating platforms.
In regard to the other benchmarks mentioned, I have referenced or used just about every one of them, but we believe that there should be some basic requirements for all benchmarks and the most important is full transparency into the benchmark development process, code, testing, and scoring. Unfortunately, most of the benchmarks in PCs and mobile devices do not provide full transparency.
As indicated in the article, no benchmark is perfect nor can it test for every possible usage model, which is why no single benchmark should be used alone. However, we should have standards for the benchmarks that are used.
Currently there is Geekbench which is available on all the platforms you mention, plus WinRT and Linux. Compared to other benchmarks, it is perhaps the least bad mobile benchmark as it uses actual workloads just like SPEC (just smaller). I say "least bad" as like SPEC it has many flaws and issues.
It's interesting to see the article claim that AndEBench sets a high standard - my 20+ years of experience in analyzing benchmarks has been that EEMBC benchmarks are among the worst. Given that AndEBench includes CoreMark, there is little doubt in my mind that this latest variant continues the long EEMBC tradition of doing benchmarking completely incorrectly.
One of the key aspects of a benchmark is that it must be relevant. It must be a typical workload that is representative and a good compiler target. 99% of benchmarks fail this test. To give a concrete example, much of the time in CoreMark is spent in an extremely inefficiently written state machine which simulates scanning of simple integer/FP numbers using a complex switch statement. As a result it is nothing but an indirect branch predictor torture test.
Having worked on various compiler frontends and written many similar scanner functions, I know actual scanner code is never written like this. In any case a 10-line function scanning numbers represents an infinitesimally small proportion of a compilation workload. Even Dhrystone is much more complex than this!
And because the test is so simplistic and inefficient, it is trivial to add some compiler optimization to target this specific code. Various compilers can now optimize the switch statement away, giving a huge speedup. Of course this optimization does not help any other code in the real world... And you see those cheated results announced with big fanfare on the EEMBC website. So much for independent verification, certification, oversight and consistency...
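To make the pattern under discussion concrete, here is a minimal sketch in C of a switch-driven number scanner running inside a loop. It is a hypothetical illustration and not CoreMark's actual code; the state names and the scan_number function are assumptions made up for this example.

```c
#include <stddef.h>

/* Hypothetical scanner states; not taken from any benchmark source. */
enum scan_state { ST_START, ST_INT, ST_FRAC_START, ST_FRAC, ST_INVALID };

static int is_digit(char c) { return c >= '0' && c <= '9'; }

/* Returns the state reached after scanning the NUL-terminated string s.
 * Every character goes back through the switch dispatch on `state`. */
static enum scan_state scan_number(const char *s)
{
    enum scan_state state = ST_START;
    for (; *s != '\0' && state != ST_INVALID; s++) {
        char c = *s;
        switch (state) {
        case ST_START:
            state = is_digit(c) ? ST_INT : ST_INVALID;
            break;
        case ST_INT:
            if (is_digit(c))
                state = ST_INT;
            else if (c == '.')
                state = ST_FRAC_START;
            else
                state = ST_INVALID;
            break;
        case ST_FRAC_START:
        case ST_FRAC:
            state = is_digit(c) ? ST_FRAC : ST_INVALID;
            break;
        default:
            state = ST_INVALID;
            break;
        }
    }
    return state;
}
```

With this sketch, scan_number("12.5") ends in ST_FRAC and scan_number("1a") ends in ST_INVALID. The point of interest is the dispatch: after most cases the next state is already known at compile time, so a compiler can in principle branch straight to that state's code instead of returning to the switch, at the cost of duplicating code. Whether that counts as a benchmark-specific trick or a generally useful optimization is exactly what this thread is arguing about.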
To make matters worse, the results are CRCd using code that is clearly trying to win the prize for being the world's most inefficient CRC implementation, taking some 20% of the total benchmark time. Do CPUs really spend 20% of their time doing CRC checks? Hint: should checks that test whether the benchmark computed the correct result perhaps be excluded from the timing of the benchmark?
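For readers following the CRC argument in this thread, here is a minimal sketch in C of the two styles being compared: a bit-at-a-time update and a 256-entry lookup table. It is not the code from CoreMark or AndEBench-Pro; the polynomial (0xA001, the reflected form of the common CRC-16 polynomial 0x8005), the zero initial value, and all function names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit-at-a-time update: tiny code, eight shift/XOR steps per input byte. */
static uint16_t crc16_bitwise(uint16_t crc, uint8_t byte)
{
    crc ^= byte;
    for (int i = 0; i < 8; i++) {
        if (crc & 1u)
            crc = (uint16_t)((crc >> 1) ^ 0xA001u);
        else
            crc = (uint16_t)(crc >> 1);
    }
    return crc;
}

/* 256 entries of 16 bits each, i.e. 512 bytes of table in this sketch. */
static uint16_t crc16_table[256];

/* Must be called once before crc16_lookup() is used. */
static void crc16_init_table(void)
{
    for (unsigned n = 0; n < 256; n++)
        crc16_table[n] = crc16_bitwise(0, (uint8_t)n);
}

/* Table-driven update: one table load and one XOR per input byte. */
static uint16_t crc16_lookup(uint16_t crc, uint8_t byte)
{
    return (uint16_t)((crc >> 8) ^ crc16_table[(crc ^ byte) & 0xFFu]);
}

/* Sanity check: both variants must agree on an arbitrary buffer.
 * Assumes crc16_init_table() has already been called. */
static void crc16_self_check(const uint8_t *data, size_t len)
{
    uint16_t a = 0, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = crc16_bitwise(a, data[i]);
        b = crc16_lookup(b, data[i]);
    }
    assert(a == b);
}
```

Both functions compute the same value, so the choice is purely a cost tradeoff: the loop needs almost no memory but does eight iterations per byte, while the table needs storage and a memory access per byte. Which one is faster depends on the target's pipeline, cache, and memory, and is best settled by timing both on the processor in question.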
With benchmarking it's garbage in, garbage out. No matter whether your methodology is perfect, a rubbish benchmark will give bogus results and will be quickly cheated as we see time and time again. The only thing that actually matters is that you start with a good benchmark. And that's where the real issue is.
Until there is some sort of killer app that benefits from faster mobile devices to the point that it might influence the average person's purchase decision, what's the point? Mobile benchmarks are essentially a fanboy you-know-what measuring contest that only a few percent of people will care about. The average person isn't going to be swayed from one phone to another just because you can tell them the other phone is 25% faster. They just won't care.
One can argue that few people have ever cared about benchmarking, and that's true, but for years average PC buyers measured the PCs they bought in megahertz and later gigahertz, and until Intel went off the rails with the P4 design that was a fairly reasonable, if not completely accurate, way to compare two CPUs. Once PCs became more than fast enough for everyday use by average people, performance no longer mattered and Intel went to their bizarre and indecipherable model number scheme.
While cross-platform benchmarking would be nice to have, there are political issues to consider. From the political side, people don't buy iOS products because of their performance. Although it might just be nice to know how they differ from Android devices, it probably won't change someone's buying decision. If Apple joins EEMBC, it will provide more motivation to develop an iOS benchmark.
For cross-platform testing today, EEMBC has its BrowsingBench benchmark, which implements a unique, accurate, and effective mechanism for measuring true page load time (and all the underlying functions required to actually load a page, including the packet transmission).
In the article, Jim also discusses future-proofing. Actually, the day that EEMBC released AndEBench-Pro, the members already started to work on the next version. EEMBC is encouraging others to join the working group and help proliferate this benchmark technology.
Lastly, I just want to point out that AndEBench-Pro is available for free download from Google Play. Give it a try and compare your device to many others already in the EEMBC database.
Another notable company is Rightware, which has the Basemark and Browsermark benchmarks. As a spinoff of Futuremark, Rightware has a similar open development effort and maintains similar testing standards. Basemark OS II is one of the most complete mobile benchmarks we have reviewed thus far. Note that we are still reviewing the other mobile benchmarks and will make our recommendations on which should be used and which should not in a follow-up article. In a similar manner, we are also reviewing the PC benchmarks. It's time to clean up benchmarking and end the use of those that don't make the cut.
Supporting multiple OS, multiple system functions, and system-level testing is a huge task. There are various entities working on various solutions. For example, Futuremark's 3DMark benchmark supports multiple OS and multiple platforms from smartphones to PCs and Futuremark is completely transparent about the process, a key requirement. However, 3DMark is just for graphics functions. EEMBC has all the system and system-level tests, but only for Android. There are many others, but in many cases they lack the transparency we are advocating. So, there is no "best of all worlds" solution yet.