In recent years, advances in semiconductor electronics have pushed the instrumentation of our world to unprecedented levels. Sensors are now all around us: many cell phones contain GPS receivers as well as cameras, doorways have motion detectors, stop lights sense vehicles at intersections, and satellites orbiting overhead are constantly imaging the Earth.
Additionally, we have data sourced electronically: feeds from social networking sites, crawls of Web pages, repositories of medical images, and results from computer simulations, among others. Many of the data objects from these sources are collected for analysis, archived, subjected to re-analysis, cross-correlated with other data objects, and processed to create additional, derived data sets.
The result is that we live in a data-rich world. In this article, we consider two types of data sources: stored and streaming. A stored data object is just that: information that has been archived in some way. A corpus of digital images stored on a collection of magnetic disks is an example of stored data. Streaming data objects have a real-time component; a live video feed is the canonical example.
The two types of data present different processing challenges: applications operating on stored data are often throughput-sensitive, while those operating on streaming data are often latency-sensitive. Despite these subtly different performance constraints, both require significant, scalable computing resources.
For example, an image search application operating on stored images may need to scale out depending on the number of images or complexity of the search. Similarly, an application executing a face-detection algorithm on live video may need to scale out if faces are detected and more compute-intensive face recognition algorithms are invoked.
Cloud computing technologies enable many users to share modern computing clusters while providing mechanisms for scaling applications as needed. As a result, researchers at Intel Labs are investigating the challenges that arise when cloud computing technologies are applied to rich data applications operating on either stored or streaming data, and the solutions that may address those challenges. This research program includes support of the Open Cirrus* research test bed, development of an open source software stack for operating on stored data, development of a runtime system for operating on streaming data, and exploration of the benefits of integrating optical networks into compute clusters.