In the summer of 2006, Randy Katz spent a sabbatical at Google Inc.
The professor of computer science at the University of California, Berkeley, was setting up a new lab to study the impact of the rise of large data centers, and he wanted to get a closer look at the one on the other side of San Francisco Bay. Katz returned to Berkeley with a bagful of observations about the future of computing, programming and design.
"I brought back not just insights into technological trends and programming skills that our students need to thrive in this new commercial environment, but also how we should organize our own activities for maximum collaboration and productivity, even in engineering research groups," he said.
For a computer scientist, the experience was like being a kid in a candy shop. The sweets included more than 100,000 networked computers running like a handful of super-sized machines.
"Researchers like me are lucky to have access to a few hundred or a thousand computers [but] here was Google two years ago organizing computations across 100 times as many machines, and they have probably taken that to a factor of ten times more machines since 2006," Katz said.
"They can spread out the processing of things like Web search and advertising over multiple thousands of machines. A major building block they use is a data-intensive parallel programming paradigm called MapReduce. They have applied it very broadly across many of the things under the hood at Google."
With MapReduce, Google engineers take the data, partition it across a large number of machines and run algorithms on each piece simultaneously. Intermediate results are then transferred over the interconnect to a reduction step, which combines them. "This is like a giant sort/merge type of application," Katz explained.
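To make that flow concrete, here is a minimal sketch in Python of the partition/map/reduce pattern Katz describes, using word counting as the example job. The function names and the in-memory "shuffle" are illustrative stand-ins, not Google's code; a real MapReduce system spreads the same steps across thousands of machines.

```python
from collections import defaultdict

# Illustrative word-count job: map emits (word, 1) pairs, reduce sums them.
def map_func(document):
    for word in document.split():
        yield word.lower(), 1

def reduce_func(word, counts):
    return word, sum(counts)

def mapreduce(documents, num_partitions=4):
    # Map phase: each partition of the input could be processed on a
    # separate machine; here we simply loop over partitions in one process.
    partitions = [documents[i::num_partitions] for i in range(num_partitions)]
    intermediate = []
    for part in partitions:
        for doc in part:
            intermediate.extend(map_func(doc))

    # Shuffle/sort: group intermediate pairs by key -- the "giant
    # sort/merge" step that moves data across the interconnect.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: combine the values for each key.
    return dict(reduce_func(k, v) for k, v in sorted(grouped.items()))

if __name__ == "__main__":
    docs = ["the cloud scales out", "the data center scales out"]
    print(mapreduce(docs))
    # {'center': 1, 'cloud': 1, 'data': 1, 'out': 2, 'scales': 2, 'the': 2}
```

The appeal of the paradigm is that the programmer writes only the two small functions; the system handles partitioning, scheduling and the sort/merge in between.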
Google has not released the code behind MapReduce, but it has published papers describing it. That has enabled IBM, Yahoo and others to develop an open source version, called Hadoop, that is now widely used at other big Internet data centers. Today Hadoop powers a pioneering service at Amazon.com that has become the poster child of cloud computing, an approach many say could become the next big thing in computing.
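Hadoop exposes the same paradigm to ordinary programs. As one hedged illustration (a common pattern, not code from Google, Yahoo or Amazon), the word-count job above can be written as two small Python scripts and run through Hadoop's streaming interface, which feeds raw input lines to the mapper on stdin and key-sorted lines to the reducer.

```python
#!/usr/bin/env python
# mapper.py -- reads text lines from stdin, emits one "word<TAB>1" line per word.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word.lower()}\t1")
```

```python
#!/usr/bin/env python
# reducer.py -- Hadoop's shuffle delivers lines sorted by key, so each word's
# counts arrive contiguously and can be summed with a single running total.
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, 0
    total += int(count)
if current_word is not None:
    print(f"{current_word}\t{total}")
```

A job like this is submitted to the cluster with Hadoop's streaming JAR, pointing its -mapper and -reducer options at the two scripts; the framework takes care of splitting the input files, running the map tasks near the data and performing the sort/merge before the reducers run.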
Cloud computing essentially scales up the client/server PC style of computing to Internet proportions. It is being driven by the confluence of several trends, including mature x86 servers, multicore processors, virtualization software and widespread broadband connections.
The view, according to cloud computing proponents, is that more and more applications will run not as big blobs of CPU- and memory-hungry code on a client system, but as services in big data centers in the Internet cloud. The timing is good, given the rise of Web-savvy cellphones and TVs.
"There are something like three billion handsets now becoming first-class citizens of the Internet," said Katz. " The phones are limited in what they can do locally, so you will have an ever-increasing demand for Internet data centers" and servers for the rising tide of mobile systems.