Use of Silvermont validates ARM (and other low-power processor architectures) in more than one way.
Silvermont uses a distributed reservation scheme, unlike the centralized scheme in Haswell. Haswell is low power, but a distributed scheme is even more efficient, at the cost of static partitioning of the instruction queues.
It also retains direct x86 decoding rather than an x86-to-uOP conversion for most x86 opcodes (need to reconfirm this).
So all low-power cores are starting to look alike architecturally!
Was there any discussion of software support for these beasts? Operating systems have supported multiprocessors for a while now, but it doesn't seem like they have really made the best use of them. Much of their support seems to be about how to throttle down to the minimum number of active cores needed for the application load. I can see this for server farms that want to provide maximum capability while minimizing power usage, but with this number of cores it seems like we may need some fundamental architectural changes in operating systems and/or application software. Is that true, or is it just more of the same?
OS and library support for such devices has been there for a while. The problem is a lack of trained software engineers who can develop massively threaded systems. Back in the mid 90s when I was at Sybase, CC-NUMA systems with 20+ cores would be running 1k+ threads per core, for a total of 20k+ threads. We could never get enough cores. MIMD variants would saturate a 1024-core IBM SP2 easily. I used to borrow time from NASA Ames on an SP2, since 1024-core systems were a trifle hard to come by!
So a server architect can never get enough cores. Even 10k+ cores will get saturated in a large OLTP system easily. Intel is supposed to send us a couple of Xeon Phi cards; I will know more about their scalability when I get them. On a related note, we have our own 100-core device under design, and we are still trying to figure out whether it makes sense to support cache coherence on such large systems, since message-passing designs scale better than SMP designs at such large core counts.
Well, parallel processing is so easy, there's so many ways to do it :)
Actually, that's the problem: from what I've seen, there is no one approach that is best suited for all problems. And, I think it's been pretty well proven that most software developers have a hard time writing bug-free and high performance code using "traditional" techniques such as threads/locking/semaphores.
So it's not surprising there's a movement towards functional programming and "shared-nothing" programming (message passing, actors/Erlang-model, etc.). However, depending on how much data has to be passed around, that might not be the best approach.
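For readers less familiar with the shared-nothing style mentioned above, here is a minimal sketch in Go (chosen purely for illustration; the thread also mentions Erlang and Rust, which offer similar constructs). Workers communicate only over channels, never through shared mutable state, so no locks or semaphores are needed; `fanOutSum` and the worker count are names invented for this example, not anyone's production design.

```go
package main

import "fmt"

// fanOutSum splits work across nWorkers goroutines in shared-nothing
// style: each worker receives numbers on a jobs channel and sends its
// partial sum back on a results channel. No memory is shared between
// workers, so no locking is required.
func fanOutSum(nums []int, nWorkers int) int {
	jobs := make(chan int)
	results := make(chan int)

	for w := 0; w < nWorkers; w++ {
		go func() {
			partial := 0
			for n := range jobs { // receive until jobs is closed
				partial += n
			}
			results <- partial
		}()
	}

	go func() {
		for _, n := range nums {
			jobs <- n
		}
		close(jobs) // tell workers no more messages will arrive
	}()

	total := 0
	for w := 0; w < nWorkers; w++ {
		total += <-results // collect one partial sum per worker
	}
	return total
}

func main() {
	nums := make([]int, 100)
	for i := range nums {
		nums[i] = i + 1
	}
	fmt.Println(fanOutSum(nums, 4)) // 1+2+...+100 = 5050
}
```

As the comment notes, this style shines when messages are small; if large amounts of data must be passed around, copying it through channels can cost more than careful shared-memory locking.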
There are also at least two separate levels of programming that may or may not need to adapt. The operating systems control the resources provided by the hardware and parcel them out to applications. The current abstractions provided by those operating systems lean heavily on processes and threads within those processes. Applications typically use these abstractions and rely on the OS to map them onto hardware appropriately. The techniques that you list, @Tony, could be very useful in terms of giving the OS more latitude in these assignments, but they still need a synchronization mechanism (like the Ada rendezvous concept). I sympathize with @Colin's comment that more programmer training is needed, but a good model for applications to follow would make that training more effective. The OS and tools guys need to figure that model out.
The slide states Xeon binary compatibility, but the article states Xeon Phi binary compatibility. Since AVX-512 is not compatible with LNI, I suspect the slide is correct and the quote (or the quotee) is not.
Actually, older-generation OSs of the Linux/Unix type have scalability issues and are not really geared for large core counts. Current OS theory leans towards lean kernels (micro-, exo-, zero-kernels) where resource management is NOT done by the OS. This is a much better model to adopt. I am still amazed that OSs created in my Dad's days are still around! It is really time to throw these dinosaurs out.
One of the reasons Sybase scaled so well was that all scheduling, memory management and I/O management was done by a user-level kernel and not the OS. A context switch was less than 50 instructions, with zero kernel overhead.
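Go's runtime is a rough modern analogue of the user-level scheduling described above (not Sybase's engine, just an illustration): goroutines are multiplexed onto OS threads entirely in user space, so switching between them never enters the kernel, and spawning tens of thousands is routine. A minimal sketch, with `spawnMany` being a name invented for this example:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// spawnMany launches n goroutines, each of which does a tiny unit of
// work. Because goroutines are scheduled by the Go runtime in user
// space rather than by the OS, creating 20k+ of them is cheap, echoing
// the 20k+ user-level threads described in the comment above.
func spawnMany(n int) int64 {
	var count int64
	var wg sync.WaitGroup
	wg.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			defer wg.Done()
			atomic.AddInt64(&count, 1) // the "work": bump a shared counter
		}()
	}
	wg.Wait() // block until all n goroutines have finished
	return count
}

func main() {
	fmt.Println(spawnMany(20000)) // 20000
}
```

Trying the same experiment with one OS thread per task would exhaust most systems long before 20k, which is the point the comment makes about keeping scheduling out of the kernel.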
Once new-generation OSs are adopted, constructs in languages like Erlang, Rust, etc. are more than enough to take care of these issues. So the bottom line is that the tools, languages and OSs are in place; we just need adoption of the new variants rather than sticking MPI on Linux and hoping it scales.
For really high scalability, the CPU protection architecture also needs to change. Memory protection via the MMU and address-space separation has to be chucked out as well. There are better ways to protect processes from each other, and new languages also have better protection against bad pointers. Virtual caches, which obviate the need for TLBs and allow single address spaces, are ideal for large shared address spaces across large core counts. For one approach on how to do this, look at crash-safe.org.
All of this stems from the fact that we are still stuck with late-60s-style architecture propounded by Multics and its ilk. Newer approaches and architectures do not get coverage, and tirades like Linus's famous one against microkernels do not help. But change is happening: Blackberry does run on a microkernel, QNX, and not Linux. I hope they have not changed the OS too much and made it monolithic. We use L4/Genode for all our work, but other microkernel and zero-kernel options are equally viable. Folks just need to start using them. It is not easy; Linux usage is the cigarette addiction of the OS world!
It is quite true that processor manufacturers are moving towards massively parallel, multi-core processors, but at the same time applications supporting massive parallelism also need to emerge. Intel's initiative, "an educational program designed to give every new programmer on the planet the opportunity to learn how to code for parallel processors", is a really good thought and step.