IoT Apps Need Good Metadata
1/12/2017 01:00 PM EST
The lack of metadata when collecting sensor data can lead to wrong decisions and missed opportunities for analysis.
I have been working directly with many engineers creating and deploying big data solutions for file-based systems. Not everyone can stream data to the latest and greatest cloud technologies. There are a lot of legacy systems that must be maintained, and most of the time their output is a good old .txt or .csv file.
Managing file-based data sets is hard because they come from a variety of sources, the output can differ from one iteration to the next, and files are easily lost. A few preventable mistakes, once addressed, make some of these challenges easier to overcome. One key step is to save more metadata in your files.
Time and time again, I have opened a file only to see a channel name and then rows and rows of data. This may be fine if you open each and every file as soon as your test is completed, but more often than not you look at these files in bulk on a Friday afternoon in order to prepare next week’s test or to compare multiple test runs.
How are you supposed to make decisions or remember the testing parameters for a particular file? It is imperative that as much metadata be saved to files as possible so that the data can be traced, filtered and properly analyzed at a later date.
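As a rough illustration of saving metadata alongside the data, the sketch below writes a few "# key: value" lines ahead of the channel names in a .csv file and reads them back. The header convention, the function names and the example fields (test_id, operator, sample_rate_hz) are assumptions made for this example, not a standard; use whatever format your own tools can parse.

```python
import csv


def write_with_metadata(stream, metadata, channel_names, rows):
    """Write '# key: value' header lines, then an ordinary CSV table."""
    for key, value in metadata.items():
        stream.write(f"# {key}: {value}\n")
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerow(channel_names)
    writer.writerows(rows)


def read_metadata(stream):
    """Collect the leading '# key: value' lines into a dict of strings."""
    metadata = {}
    for line in stream:
        if not line.startswith("#"):
            break  # first non-comment line is the channel-name row
        key, _, value = line[1:].partition(":")
        metadata[key.strip()] = value.strip()
    return metadata
```

Because the metadata travels inside the file, it survives being copied, emailed or archived. The trade-off is that any reader of the file has to know to skip the lines that start with "#".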
According to an IDC Digital Universe Study conducted in 2014, only 22% of information in the digital universe was a candidate for analysis and only 5% was actually analyzed. Why are we spending thousands, if not hundreds of thousands, of dollars per test to collect data that is never analyzed?
Imagine taking the equivalent of a bag of M&Ms, pouring five into your palm, and assuming those five M&Ms represented the entire distribution of colors in the M&M bag. This is called making a hasty generalization, and I think it may be one reason why product quality problems seem to be increasing in the marketplace.
This is a complicated problem, but there are steps you can take in new and legacy applications to overcome the metadata problem. For new applications, include a section in the project specification about the data and the conclusions to be made from the results before you develop your application.
This step allows a team to document the small details, such as channel names, what metadata will be included, how it will be formatted, and the output of the test (a file, a stream to a database or the cloud, etc.). It also specifies the intent of collecting the data in the first place. No project should be without this information.
For legacy apps, perform an audit of all data sources being collected and check for inconsistencies. If you find any, create a patch to standardize the output or add metadata so that files produced from then on are documented more consistently. At a minimum, you will know where your current metadata weaknesses are and can use them to guide future updates or projects.
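One simple form such an audit could take is comparing the channel-name row across every file in a folder. The sketch below assumes .csv files whose first non-comment row holds the channel names; the function names and the "most common header wins" rule are choices made for this example only.

```python
import csv
from collections import Counter
from pathlib import Path


def read_header(path):
    """Return the first non-comment row of a CSV file as a tuple of names."""
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            if row and not row[0].startswith("#"):
                return tuple(name.strip() for name in row)
    return ()


def audit_headers(folder, pattern="*.csv"):
    """Return the most common header and the files that deviate from it."""
    headers = {
        path.name: read_header(path)
        for path in sorted(Path(folder).glob(pattern))
    }
    if not headers:
        return (), {}
    reference, _ = Counter(headers.values()).most_common(1)[0]
    outliers = {
        name: header for name, header in headers.items() if header != reference
    }
    return reference, outliers
```

The outlier list is a starting point for a conversation rather than a verdict: a file that deviates may come from a newer, better-documented version of the test.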
If you need to frequently reference legacy data, run the legacy files through a custom program that will extract and retroactively add metadata. You can create a new copy of the data with the additional information to keep the raw data untouched.
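A minimal sketch of such a program follows. It writes an annotated copy into a separate folder and never opens the raw file for writing. The "_annotated" suffix and the "# key: value" header lines are assumptions for this example, not an established convention.

```python
from pathlib import Path


def annotate_copy(raw_path, out_dir, metadata):
    """Write a copy of raw_path with metadata lines prepended; return its path."""
    raw_path = Path(raw_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # A distinct name keeps the copy from ever overwriting the raw file.
    out_path = out_dir / f"{raw_path.stem}_annotated{raw_path.suffix}"
    header = "".join(f"# {key}: {value}\n" for key, value in metadata.items())
    out_path.write_text(header + raw_path.read_text())
    return out_path
```

For files too large to read into memory at once, the same idea works by writing the header first and then streaming the raw file into the copy in chunks.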
It may be possible to do this by parsing file names. Most of the time, there is a ton of information encoded in a seemingly obscure file and/or folder name. It may also be possible to run analysis on the data in the file and add some statistical parameters as metadata to help filter the data to make it easier to see trends.
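To make both ideas concrete, here is a sketch that pulls fields out of a file name and computes a few summary statistics for one channel. The naming convention (project_unit_YYYYMMDD_runNN.csv) is hypothetical; the regular expression would need to match whatever scheme your legacy files actually follow.

```python
import re
import statistics
from pathlib import Path

# Hypothetical convention: <project>_<unit>_<YYYYMMDD>_run<NN>.csv
FILENAME_PATTERN = re.compile(
    r"(?P<project>[A-Za-z0-9]+)_(?P<unit>[A-Za-z0-9]+)"
    r"_(?P<date>\d{8})_run(?P<run>\d+)\.csv"
)


def metadata_from_filename(filename):
    """Return the fields encoded in a file name, or {} if it does not match."""
    match = FILENAME_PATTERN.fullmatch(Path(filename).name)
    if match is None:
        return {}
    fields = match.groupdict()
    fields["run"] = int(fields["run"])
    return fields


def summary_stats(values):
    """Return basic statistics for a non-empty list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }
```

Stored as metadata, values such as the minimum, maximum and mean let you filter hundreds of files for, say, runs that exceeded a temperature limit without opening each one.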
There is no way to cover all of the options in a short article. I would love to hear how you handled metadata proactively in new projects or what you’ve done retroactively once you realized you had a metadata problem.
--Stephanie Amrite is a senior product manager for core software at National Instruments, developing frameworks and best practices for extracting value out of sensor data collected from test, measurement and control systems. She also is a participating member in the Industrial Internet Consortium.