Dark silicon has a few variants for special-purpose hardware: loosely-coupled auxiliary processors, tightly-coupled coprocessors or functional units, and "conservation cores", which seem to fit in between the other two, being relatively tightly coupled to the general-purpose processor but using less of the GP processor's functionality to do their work.
There is also specialization of performance using performance-asymmetric chip multiprocessors (like ARM's big.LITTLE).
Configurable memory hierarchies have also been explored academically, e.g., reducing cache size or associativity to reduce power use when doing so does not have a significant impact on performance. Cache placement and replacement policies can also impact energy efficiency. While most of the academic research in this area seems to have been on improving performance, concepts like Non-Uniform Cache Architecture can also apply to power-saving goals.
There may be some benefit to tuning the performance of particular components to avoid stalls. E.g., if application performance is limited by memory bandwidth, it may be more energy efficient to run the processor at a constant lower frequency than to use light sleep modes between bursts of memory activity.
Inexpensive fast persistent memories may also provide significant power savings by allowing power to be removed from the memory without losing state while avoiding data transfers to and from a separate persistent storage.
Your first point is not practical. On-chip caches require ECC. If ECC coverage is over 8 bits, there is 62.5% overhead; if ECC coverage is over 64 bits, there is 12.5% overhead. (One could use per-byte parity and use a write-through L1 cache with ECC in L2, but this also has power trade-offs.) You might also note that current PC processors support 128-bit wide loads and stores efficiently to provide decent performance for their SIMD extensions.
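To make the overhead figures above easy to check, here is a minimal sketch. It assumes the ECC in question is a SECDED code (extended Hamming: single-error correction, double-error detection), which the post does not state explicitly but which matches the 62.5% and 12.5% numbers. The function names are made up for illustration.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a SECDED (extended Hamming) code over data_bits.

    Assumption: SECDED is the ECC scheme meant in the post above.
    """
    r = 0
    # Hamming condition for single-error correction:
    # 2**r must cover every data bit, every check bit, and "no error".
    while (1 << r) < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


def overhead_percent(data_bits: int) -> float:
    """Storage overhead of the check bits relative to the data bits."""
    return 100.0 * secded_check_bits(data_bits) / data_bits


if __name__ == "__main__":
    for width in (8, 16, 32, 64, 128):
        print(f"{width:4d} data bits: {secded_check_bits(width)} check bits, "
              f"{overhead_percent(width):.3f}% overhead")
```

Under that assumption, byte-granularity ECC needs 5 check bits per 8 data bits, while 64-bit granularity needs 8 check bits per 64, which is why narrow ECC-protected accesses are expensive in array area.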
Most 64-bit systems define 'int' as 32 bits, so only pointers double in size (and even x86-64 does not double code size; MIPS and Power add no code size increase for full use of 64-bit features).
Even the use of 64-bit pointers is not required for a 64-bit processor, though the software interfaces supporting the use of 32-bit pointers are not broadly used.
For PCs with largish memory (between 3 and perhaps 16 GiB), using a 64-bit OS with support for 32-bit applications may be a sweet spot--allowing the OS to map all of memory in a flat space while allowing applications that do not need large amounts of memory to use smaller pointers.
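To put rough numbers on the pointer-size effect discussed above, here is a small sketch that compares hypothetical data structures under a 32-bit model (ILP32: 4-byte int, 4-byte pointer) and a common 64-bit model (LP64: 4-byte int, 8-byte pointer). The layouts and the natural-alignment padding rule are assumptions for illustration; real compilers and ABIs may differ.

```python
# Field sizes in bytes for two common C data models.
MODELS = {
    "ILP32": {"int": 4, "ptr": 4},
    "LP64": {"int": 4, "ptr": 8},
}


def struct_size(field_sizes):
    """Size of a C-style struct, assuming each field is aligned to its own size."""
    offset = 0
    max_align = 1
    for size in field_sizes:
        # Round the offset up to the field's alignment, then place the field.
        offset = (offset + size - 1) // size * size
        offset += size
        max_align = max(max_align, size)
    # Pad the tail so arrays of the struct keep every element aligned.
    return (offset + max_align - 1) // max_align * max_align


def node_size(model, layout):
    """Size of a hypothetical struct whose fields are named in layout."""
    return struct_size([MODELS[model][field] for field in layout])


# Hypothetical layouts: a plain int, a list node, a binary tree node.
LAYOUTS = {
    "int array element": ["int"],
    "list node (int, next)": ["int", "ptr"],
    "tree node (int, left, right)": ["int", "ptr", "ptr"],
}

if __name__ == "__main__":
    for name, layout in LAYOUTS.items():
        print(name, node_size("ILP32", layout), node_size("LP64", layout))
```

In this sketch, arrays of int stay the same size in both models, while the pointer-heavy nodes double once padding is counted. That fits the point that growth comes from pointers rather than from 'int' data.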
For your second point, the data size (in memory) is software controlled and 'int' tends to be defined as 32 bits (for PCs).
Also, while there would be some power savings from penalizing the performance of 64-bit operations, for PC processors this would be a small relative savings (e.g., support for aggressive out-of-order execution uses much more power).
Your third point also seems debatable. Few processors support full 64-bit physical address spaces.
While right-sizing processors is attractive, there are issues of design complexity and validation when a broad range of applications is targeted--and targeting a broad range of applications increases production volume which reduces the impact of fixed costs and moves the product up the learning curve faster.
Today's micron-scale TSVs tackle the chip-to-chip interconnect problem and reduce its power. To reduce on-chip interconnect power, one needs monolithic 3D technology. Check out www.monolithic3d.com
Revising the communication standards would go a long way toward reducing interconnect power. They date back a long way and have been revised only for speed, not power. The interfaces need a complete overhaul for overall power reduction.
Regarding interconnect: is anybody aware of the state of on-chip networks?
If interconnect is so wasteful, why not "time share" the interconnect to save real estate and power at an acceptable penalty in performance? As the ratio of chip size to feature size keeps growing, we encounter the problem of synchronizing distant subsystems. A network is a good step toward solving this.
In theory a tiny 8-bit micro could handle the keystrokes I am typing. In theory a 16-bit processor could handle the WYSIWYG display/menu interface efficiently, and in practice we use at least 32 bits to handle the live spell-check dictionary lookup, and 128 bits and more for graphics engines' text-to-pixel formatting. I haven't mentioned the mass of the Internet that will get involved when I submit this entry.
So because we use heavy tackle to do the fancy graphics and mass lookups, we don't bother much with smaller devices to handle small tasks. There are exceptions, and some stuff happens in byte-wide hardware, whatever we imagine is going on on top.
Variable-width buses on ARM allow some tweaking of power, but it is heavily application-specific and there is a cost to switching modes.
Systems that only use power when actually doing something are commonly used now, but they still have to withstand the dissipation when full speed is demanded.
I think the answers will also evolve at user level as smart ideas and friendlier devices. Maybe my laptop's webcam could decide if I am not looking at it and dim the screen; it works on a dumb timer at the moment, which means it is always switching off when I want it!
For lower-power we could sense an RF tag printed on spectacle frames.
Junk 64-bit. We need processors that have:
1) efficient addressing/fetch of 1-byte, 2-byte, 4-byte and 8-byte locations, and memory that loads and stores all these sizes equally efficiently. I now have a 64-bit operating system, and with it came the need to double my memory size to get the same performance (that's considered an improvement?).
2) 32-bit data
3) no more than a 48-bit address bus.