As jamestate mentions, if the TPU2 board (with 4 chips) draws less power than the top-end GPU, then the TPU could indeed beat the GPU on training given the power limitations of a warehouse-scale datacenter.
The Volta has a TDP of 300 W, while a single old TPU1 in 28nm had a 75 W TDP (with roughly 40 W measured busy power, per the TPU1 paper). In a power-constrained system, it seems likely that more TPU chips could be used per CPU host and that they could run at peak/boost speeds (if applicable to the new TPU2), while the GPU may be power- or thermally constrained. The TPU1 paper even specifically mentions that the GPU boost clock could not be used in the datacenter because of power limits and slow boost timing resolution.
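To make the power argument concrete, here's a rough back-of-envelope sketch in Python. The per-host power budget is hypothetical, the 40 W per-TPU2-chip figure is really TPU1's measured busy power reused as a stand-in, and the TFLOPS numbers are announced peaks (Nvidia's 120 tensor-TFLOPS claim for Volta, and Google's 180 TFLOPS per 4-chip TPU2 board, i.e. ~45 per chip), not measured training throughput:

    # Rough perf-per-watt comparison under a fixed power budget.
    # All numbers are announced peaks or stand-ins, not benchmarks.

    POWER_BUDGET_W = 3000  # hypothetical per-host accelerator power budget

    accelerators = {
        # name: (watts per chip, peak TFLOP/s per chip)
        "Volta GPU": (300, 120),  # 300 W TDP, 120 tensor-TFLOPS (announced)
        "TPU2 chip": (40, 45),    # 40 W = TPU1 measured busy power, reused
                                  # as a stand-in; 45 = 180 TFLOPS / 4 chips
    }

    for name, (watts, tflops) in accelerators.items():
        chips = POWER_BUDGET_W // watts  # whole chips that fit in the budget
        print(f"{name}: {chips} chips -> {chips * tflops} peak TFLOP/s "
              f"({tflops / watts:.2f} TFLOP/s per W)")

Under these assumptions the budget fits 10 GPUs (~1200 peak TFLOP/s) versus 75 TPU chips (~3375 peak TFLOP/s), which is the whole crux: if the per-chip power gap is anywhere near that large, the power-constrained comparison looks very different from the per-chip one.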
Of course, Volta was also just released, and Jensen admitted that the product shown was not yet in its manufacturing ramp, so Google's statements may have been made without regard to the not-yet-tested Volta. We should probably take both Google's and Nvidia's claims with a grain of salt until real benchmarks are released. There are also bound to be plenty of confounding factors related to framework optimization; hardware-software co-optimization between the TPU ISA and the TensorFlow drivers could well make it easier to achieve peak throughput across a wider variety of neural network models.