Everything You Know About Business Intelligence, Data Warehousing and ETL is Wrong — Part I

April 16, 2010

A History of Yesterday

For at least the last 25 years, certainly ever since researchers Barry Devlin and Paul Murphy coined the term “business data warehouse”, various vendors and technologies have been carving up, and attempting to lay exclusive claim to, overlapping slices of the data warehouse ecosystem: the sum total of the tools and methods required to support a data warehouse from source systems to end-users. You are familiar with these slices; they go by names and acronyms like: Extract, Transform & Load (ETL); Extract, Load & Transform (ELT); Data Quality (DQ); Data Profiling (DP); Master Data Management (MDM); Datamarting and Cubing; Database Federation; Data Warehouse Appliances (DWA); Business Intelligence (BI); Decision Support Systems (DSS); Executive Information Systems (EIS); Query & Reporting (Q&R); Enterprise Information Integration (EII); Advanced Analytics (AA); and Visualization, among many others. For each of these you can probably name at least two or three distinct vendors off the top of your head. The thing is, though, these are first and foremost marketing distinctions, driven by these vendors' need to differentiate themselves; secondarily, these slices are historical atavisms, reflecting sometimes decades-old technological limitations. In truth, data warehousing begins with data and it ends with data, and there is nothing in between but data. To understand how fundamentally this should impact both your strategic and operational approaches to information architecture, data governance, and vendor management, we need a quick review of the history of data warehousing.

In the beginning, there were systems, usually mainframes, optimized for the processing of business transactions. These systems were rather straightforwardly known as OLTP, or online transaction processing, systems. OLTP systems were (and still are) great at handling large numbers of concurrent transactions which require the application of complex business rules. OLTP systems were (and still are) terrible at organizing, aggregating and trending either their input or output data values; in other words, they are terrible at actually reporting on the business processes they support. The effort required to collect, clean, organize, aggregate and store these input and output data values for reporting was dear in terms of time, people and dollars. Worse, the effort was often repeated independently for each new report. It was in response to this business pain that Devlin and Murphy, in 1988, proposed an architecture for a “business data warehouse”.
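To make the mismatch concrete, here is a minimal sketch using Python's sqlite3 module and an invented single-table orders schema (all names and values are illustrative, not from any particular system). It contrasts the short, indexed writes OLTP engines excel at with the full-scan trend query that makes them groan:

```python
import sqlite3

# Hypothetical single-table OLTP schema; names and values are illustrative.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER,
    order_date  TEXT,
    amount      REAL)""")

# The OLTP sweet spot: short, indexed, single-row transactions.
with con:
    con.execute("INSERT INTO orders VALUES (1, 42, '1988-06-01', 19.95)")
    con.execute("INSERT INTO orders VALUES (2, 42, '1988-07-15', 5.00)")

# The reporting pain point: a trend query must scan and aggregate every
# row, competing with the live transactional workload for resources.
for month, total in con.execute("""
        SELECT strftime('%Y-%m', order_date) AS month, SUM(amount)
        FROM orders GROUP BY month ORDER BY month"""):
    print(month, total)
```

On real OLTP volumes, every such report meant re-running that scan (or hand-building an extract), which is exactly the repeated, expensive effort the warehouse architecture set out to eliminate.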

Their architecture made use of some new and some old technologies, most notably dimensional data schemas, which had been around since the ‘60s, and the database management systems developed in the ‘70s which were optimized to query them. Within five years of Devlin and Murphy came a series of firsts: the first database optimized for data warehousing; the first software for developing data warehouses; the first book on data warehousing; and the first publication of the 12 rules of on-line analytical processing (OLAP), which have provided the conceptual and architectural underpinnings for every OLAP system since. By 1996 the two major philosophies of data warehousing were established and doing battle to the death: Bill Inmon’s top-down, subject-oriented, non-volatile and integrated corporate information factory versus Ralph Kimball’s bottom-up, departmentally-oriented, versioned and conformed datamarts.
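As a rough illustration of the kind of dimensional schema in question, the sketch below builds a toy star schema, a central fact table of numeric measures joined to descriptive dimension tables, again with Python's sqlite3; every table and column name here is hypothetical, not drawn from Devlin and Murphy's paper:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# A toy star schema: one central fact table of measures, ringed by
# descriptive dimension tables. All names are hypothetical.
con.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INT, month INT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (date_key INT, product_key INT, units INT, revenue REAL);
INSERT INTO dim_date    VALUES (1, 1996, 1), (2, 1996, 2);
INSERT INTO dim_product VALUES (10, 'widgets'), (11, 'gadgets');
INSERT INTO fact_sales  VALUES (1, 10, 5, 50.0), (2, 10, 7, 70.0), (2, 11, 3, 90.0);
""")

# The payoff: rolling measures up by any combination of dimensions is a
# single join-and-group query, not a fresh custom extract per report.
for row in con.execute("""
        SELECT d.month, p.category, SUM(f.revenue)
        FROM fact_sales f
        JOIN dim_date    d ON d.date_key    = f.date_key
        JOIN dim_product p ON p.product_key = f.product_key
        GROUP BY d.month, p.category"""):
    print(row)
```

The design choice is the whole point: measures live in one place, descriptions in another, and new reports become new queries rather than new engineering projects.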

The chip, memory, disk, bus and software architectures of the early- to mid-‘90s severely restricted both the size and the speed of the data warehouse relative to the amount of data that was available for collection and processing. Furthermore, the implementation of a data warehouse architecture created an absolute need for the movement and manipulation of relatively large amounts of data between physical devices and logical schemas. This was the fertile soil in which a profusion of vendors and proprietary technologies germinated, each trying to define and grow into a niche from which to out-compete both their direct and next-nearest rivals. What had begun as a somewhat academic exercise in the ‘60s and ‘70s had become a crowded and growing, multi-billion-dollar, worldwide market by the turn of the millennium.

It was also around this time, in the early ‘00s, that many companies which had been relying on extremely labor-intensive processes such as custom-coded applications, manual data extracts, and analyst-maintained spreadsheets became aware of a better way to manage their data. As they looked to the consultants and vendors who could help them understand and implement this better way, they encountered and internalized the sprawling ecosystem of acronyms with which we began this editorial. The model for a data warehouse implementation was to engage a systems integration consulting firm, purchase on its recommendation several distinct and expensive best-of-breed tools, each with its own dedicated hardware, and then spend years stitching all of these pieces together while integrating them into the existing corporate business processes and IT infrastructure.

Suddenly data warehouses were big, expensive, inefficient and prone to failure. Somewhere, somehow “state-of-the-art”, tool-centric data warehousing had recreated nearly every one of the business pains which had inspired the original “business data warehouse” architecture.

If much of the ‘90s and early ‘00s was about the proliferation of specialized vendors, technologies, methodologies and proprietary hardware/software, the latter half of the ‘00s has been about the consolidation of vendors through acquisition, the integration of technologies via either metadata “glue” or operating system coupling, the convergence of methodologies, and the commoditization of data warehousing hardware and software. Unfortunately, this consolidation has been driven less by a vision of what data warehousing should be than by a defensive strategy to forestall market disruption. Over the last five years, Open Source Software (OSS), especially Free Open Source Software (F/OSS) and Hybrid Open Source Software (H/OSS), has matured from a fringe movement of academics and anti-corporate radicals into the mainstream of enterprise software development. In fact, it is essentially impossible to find an enterprise software suite or platform today which does not contain a significant amount of OSS code. Furthermore, for just about every acronym in the first paragraph of this post, there are now one or more OSS applications with anywhere from 40% to 80% of the functionality, features, performance and stability of their proprietary progenitors.

One final development completes our quick review of the history of data warehousing: the rise of The Cloud. The Cloud is an overhyped buzzword, and in many ways it is simply a repackaging and updating of old mainframe timesharing technologies from the ‘60s and ‘70s, and/or client-server technologies from the ‘80s, and/or grid computing technologies from the ‘90s, and/or virtualization technologies from the ‘00s. But this view misses the point. The Cloud is really three different on-demand, scalable, zero-latency services: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS). IaaS eliminates the need to install, configure and maintain server and network hardware, while PaaS and SaaS eliminate the need to install, configure and maintain enterprise platform and application software. All three eliminate, and this is the key, the need to purchase and maintain excess capacity as a buffer against both anticipated and unanticipated changes in future demand. This is where the bulk of the cost savings in The Cloud comes from.
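A back-of-the-envelope sketch of that capacity-buffer argument, with entirely invented numbers (no real vendor pricing), might look like this:

```python
# Back-of-the-envelope illustration of the capacity-buffer argument.
# Every figure below is an invented assumption, not vendor pricing.
peak_units = 100                 # capacity needed at the busiest hour
avg_units  = 25                  # capacity actually used on average
hours      = 24 * 365            # one year of operation

owned_cost_per_unit_hour = 0.10  # amortized cost of purchased hardware
cloud_cost_per_unit_hour = 0.15  # on-demand premium per unit-hour

# On-premise: you buy (and power, and maintain) for the peak, always.
owned = peak_units * hours * owned_cost_per_unit_hour

# Cloud: you pay a per-unit premium, but only for the units you use.
cloud = avg_units * hours * cloud_cost_per_unit_hour

print(f"owned: ${owned:,.0f} per year   cloud: ${cloud:,.0f} per year")
```

With demand this bursty, elasticity dwarfs the per-unit premium; with a flat, predictable load the arithmetic can easily flip, which is why the buffer, and not the unit price, is the real story.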

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

Comments

Marc, I appreciate the clarity of your model:

1) non-rotten data,
2) heavy-lifting, end-user "rowers",
3) a competent "coxswain", and
4) competitively matched (ideally superior in some area) data and staffing.

An old friend once shared with me, "There are a thousand ways to fail and only one to succeed." That sounded very harsh, but I think he was making the same point you did with your four components of a chain: failure of one link is failure of the whole chain.

My friend, by the way, subsequently softened and sharpened his view. He evolved the model into three requisite and sequential steps. Step one: develop mutual respect among those involved, both internally within your company and externally with your customers. Neither party can exploit the other in the short or long run.

Step two: clearly define everyone's responsibilities. Step three: with the first two steps in place, focus on results.

There's still a chain, but a chain that can be created and managed.

I think your model, Marc, is highly concordant, especially the respect for the rowers, the responsibilities including that of the coxswain, and the fact that the result is in a competitive context. "The industry grades on a curve." In business, you don't have to be perfect. (In healthcare delivery, on the other hand, you often have to perfectly follow procedures.)

How many Parts will this series be?

Yes, Joe,

Any end-user-facing IT implementation will live or die on the basis of its ability to make the end-user happy. Most end-users are very difficult to make happy. Behavior change, culture change and truly solving an unmet end-user need are the distinction between a killer app and a dead app.

All that being said, however, if the data behind a killer app is rotten (and by rotten I mean relative to end-user standards, not relative to internal IT standards), the killer app will soon commit hara-kiri.

When I was rowing, we used to have a saying about the coxswain: s/he can lose a race, but s/he can't win one. The assumption of every rower is that the coxswain is steering a perfect course, calling out a perfect strategy and perfectly assessing position and speed relative to the other shells. When all coxswains are performing this way, the best crew will win. When your coxswain is performing below perfection, the rowers have to pull that much harder, generate that much more boat speed and expend that much more energy to win the same race.

The coxswain is necessary but not sufficient, while the rowers are necessary and sufficient. In the same way, appropriate data governance and lifecycle management is necessary but not sufficient; just as the rowers are necessary and sufficient, so are the end-users who do the heavy lifting. However, the energy and coordination necessary to win without a coxswain, or to succeed without data governance, are nearly prohibitive, unless of course you are competing against another boat without a coxswain, or against users habituated to a system without data governance.

Marc,
I am looking forward to Part II. So far, this is a terrific contextualization which I've never seen brought together with such candor.

For me, the eight-hundred-pound gorilla in the room (and my spell checker suggested that, although I spelled gorilla correctly, I should consider Godzilla instead) is the people side of BI, DW, DSS, SaaS, etc. Whether it's the behavior of those entering the initial data in a workflow context, those using it in the course of overseeing departments, service lines and business units, or the CEO deciding what to do about it all (and, more importantly, how to address conflict avoidance), the BI data mechanics become only the gasoline, not the business fire.

My friends working in this world would have started with Culture. See my current point on Culture and the Perils of Guesswork. Is this the flip side of your post's coin?