
HP's Top 10 Trends in BI (and HIT) for 2009: #9 Structuring the Unstructured

December 3, 2009
by Marc D. Paradis MS

Sometimes vendors do get it (mostly) right. Hewlett-Packard put together a brief white paper in February of this year laying out their view of Business Intelligence (BI) for 2009 (and beyond). I think that they got it largely right. Their #9 trend is that perennial favorite, the convergence of structured and unstructured data. Below is a summary of the trend, my thoughts on whether HP got it right and what the trend may mean for HIT.


HP Predicts: That technology advances of the past few years (and in particular some made by HP) have finally brought us to the threshold where automated processes will be able to pull structure out of unstructured data such as electronic medical records, call center logs, and emails. To be fair, HP doesn't actually predict anything here; they simply state that technologies for the management of unstructured data have moved beyond the BLOB, and that there is good reason to believe that "truth" lies closer to unstructured data than to structured data (since the structure of structured data often presupposes a particular end or goal, such as claims approval or operational efficiency).


The Verdict: 2009 is NOT the year when unstructured data comes of age. In fact, 2010 probably won't be that year either, and perhaps not even 2011. While I have seen some interesting developments in the fields of Natural Language Processing, Text Analytics, Speech Recognition and Metatagging, with memory and processing advances as drivers, the stubborn fact remains that context is the key, and that to date no one has figured out how to build a context engine even 1/1000th as robust as the human brain.


OK, I grant you, I made the 1/1000th number up. But consider how easy it was for you to read the previous sentence and effortlessly parse the (intentionally contorted) multiple phrases, clauses and lists; to understand that when I used "key", "engine" and "drivers" I was at no time referring to a physical object, or even to a unified concept of an automobile; and to interpret, weigh and deliver a value judgment against a quantitative assertion (1/1000th) on the exceedingly ambiguous and vague qualitative measure of "robustness" in the human brain. Now consider things like irony, humor, style, jargon, sarcasm, etc. These are hard enough for humans to accurately pick out of the lifeless text of an email or SMS message, let alone for a computer.


What’s more, we can easily be led astray. Facebook recently analyzed the content of its status posts, looking for words that indicated “happiness” or “sadness” in order to generate and assess a measure of Gross National Happiness. They found that people were overwhelmingly “happier” on holidays and “sadder” when celebrities died. I would posit that this finding simply reflects that people tend to exchange holiday wishes in and around holidays, and that people tend to note the sadness of a celebrity’s passing. These occasions provide only the most indirect insight into the internal emotional states of the posters: there is a big difference between posting “I am happy this Thanksgiving”, “Happy Turkey Day”, and “I bet you’re happy that Thanksgiving is over”.
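To see how shallow that kind of measure is, here is a minimal sketch (with entirely hypothetical word lists; Facebook's actual method is not public in detail) of keyword-count sentiment scoring, applied to the three Thanksgiving posts above:

```python
# Hypothetical sentiment word lists -- stand-ins for whatever Facebook used.
HAPPY_WORDS = {"happy", "glad", "great"}
SAD_WORDS = {"sad", "sorry", "miss"}

def naive_sentiment(post: str) -> int:
    """Score a post as (# distinct happy words) - (# distinct sad words)."""
    words = {w.strip(".,!?\"'").lower() for w in post.split()}
    return len(words & HAPPY_WORDS) - len(words & SAD_WORDS)

posts = [
    "I am happy this Thanksgiving",
    "Happy Turkey Day",
    "I bet you're happy that Thanksgiving is over",
]
for p in posts:
    print(naive_sentiment(p), p)
```

All three posts score identically as "happy", even though only the first one actually reports the poster's own emotional state; the counter has no notion of who is happy, or whether "happy" is sincere.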


HIT Impact: Of course, the promise and potential of being able to pull true patterns and insight out of unstructured data is too enormous for words (pun-ish irony intended). However, I think that for some time to come we will have to rely on ever-more sophisticated expert systems. Expert systems can be incredibly powerful, and in many instantiations readily out-perform humans (as with chess), but expert systems require a carefully bounded (even if very large) problem space, a long list of heuristics and, most importantly, a way to value certain end states or hypotheses (in other words, they must be able to rank a given end state or hypothesis as better or worse than at least one other). They are, in essence, brute-force approaches to a complex problem, relying on raw memory and processing speed.
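The three ingredients just named can be made concrete with a toy sketch (the rules and end states here are entirely hypothetical, not a real triage protocol): a bounded set of end states with explicit values, a list of heuristic rules, and a policy for ranking the end states the rules propose.

```python
# End states, valued from worst outcome (0) to best (2).
ENDSTATES = {"emergent": 0, "urgent": 1, "routine": 2}

# Heuristics: each rule maps observed facts to a candidate end state.
# These are illustrative only, not clinical guidance.
RULES = [
    (lambda f: f.get("chest_pain") and f.get("short_of_breath"), "emergent"),
    (lambda f: f.get("fever_c", 0.0) >= 39.0, "urgent"),
    (lambda f: True, "routine"),  # default when no other rule fires
]

def triage(facts: dict) -> str:
    """Fire every matching rule, then pick the worst (lowest-valued) end state."""
    candidates = [state for condition, state in RULES if condition(facts)]
    return min(candidates, key=ENDSTATES.get)

print(triage({"chest_pain": True, "short_of_breath": True}))  # emergent
print(triage({"fever_c": 39.5}))                              # urgent
print(triage({}))                                             # routine
```

The system works only because every question it can be asked falls inside the problem space its authors anticipated; faced with a fact its rules never mention, it silently defaults, which is exactly the brittleness the paragraph above describes.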