date: | 2007-01-23 11:13:49 |
---|---|
category: | Data Structures and Algorithms |
See “When a Column is Not a Column ” for a rant on failed attempts to scale spreadsheets up into relational databases. Bottom Line: A spreadsheet column is rarely an RDBMS column. Often the SS column’s label is a key value, just like the SS row’s label, and the cell’s value is RDBMS data in the RDBMS row. Both models borrow terminology, but the concepts under those terms don’t map very well. It’s almost universally a mistake to transform a spreadsheet page (which is table-like) into a proper RDBMS table.
Here’s the interesting problem du jour . The customer has a collection of massive spreadsheets (20 tabs) that are prepared by 20 or so people. This collection of 20 copies of the same template must then be consolidated into one amalgam.
If the spreadsheets were just data entry vehicles, we’d be replacing them with a simple web application, and that would be straight-forward.
The essential problem (to consolidate the data) is compounded by the spreadsheets having some pretty fancy calculations and feedback. We could replace them with a web application that had enough AJAX to be similar to the Google Docs spreadsheet: it would provide a rich user interface, and the data would be centrally located.
Collaboration and Consolidation.
In this case, the collaboration is more than just consolidation. While the proximate problem is consolidation, we quickly uncovered additional use cases. There are at least three classes of actors: directors who originate the templates, managers who fill in the templates, and analysts who consolidate and report on the data.
The managers filling in the templates, in some cases, actively use them to control their business. They create alternate scenarios, they make decisions, they take action and they compare predicted results with actual results. Other managers, however, don’t do more than fill in the templates and dutifully post them to the analysts.
Our collaboration goes beyond simple consolidation. The superficial collaborative task is only the tip of a larger iceberg of use cases for the spreadsheets. What is a technology choice that supports consolidation, but preserves the other use cases?
Excel Extraction.
Some managers make extensive use of the original spreadsheet functionality; the spreadsheet’s native use cases are essential to the enterprise. In addition to the spreadsheet use cases, consolidation and refreshing history must be added to this. One of the decisions we have to make is where we allocate the additional ETL (Extract Transform Load) processing that produces consolidated information. Since we working with (against?) MS-Office products, we’re talking about some fairly complex use cases with Excel at the center.
Note that we have to strike a fine balance between two opposing forces.
We have a spectrum of platform alternatives.
A Show Stopper.
Here’s a potential show-stopper for the web alternative, “Considerations for service-side Automation of Office ”: “Microsoft does not currently recommend, and does not support, Automation of Microsoft Office applications from any unattended, non-interactive client application or component (including ASP, DCOM, and NT Services), because Office may exhibit unstable behavior and/or deadlock when run in this environment.”
Ouch.
The web application could have had a pretty nifty overview:
The end-user experience would be slightly different than today’s mass-emailing frenzy. There’d be a nice collaboration web site for upload and download. The upload would validate and provide feedback as part of the upload process.
It appears that we can’t make this work very easily.
Desktop and Excel Platforms.
I’m not a big fan of the desktop as a platform. Primarily, I hate the configuration management problem: who has what version of the application.
“But there are tools to help.”
While true, desktop deployment tools merely plaster over the essential problem: the people who “control” the desktop aren’t very disciplined. We can only make desktop software work as part of a total lock-down of the end-user’s computer. Since the client doesn’t do this (and won’t for this one business application), we can’t really make use of the desktop as a platform.
Excel is a viable programming platform. However, it, too is prone to getting out of control. The current 20-plus-tab monstrosity is packed full of macros and VB modules and doesn’t work reliably. In addition to bugs, people can easily add inter-workbook links to documents on their C: drive and in their Windows TEMP directory. The whole thing rapidly spins out of control when we try to make use of the clever features of Excel. We need something simpler and more reliable.
Enter POI and XML.
We can exploit two technologies to make a simple, reliable web-based solution. First, we have Jakarta POI , which allows us to read Excel files directly. This is pleasant, and the HSSF reliably picks apart a spreadsheet. Second, we can use XML versions of the spreadsheets, making them readable by SAX or Xerces .
Here’s the overall Compiler design pattern, and how we would implement it:
Once we’ve parsed the spreadsheet and have the Business Entities, we can then do the required transform and load operations. These will lead to the consolidated data. We can then cough out the next generation template, or a reporting pivot table, or simply redirect the user to a typical data warehouse reporting portal.
Spreadsheet as Syntax.
This leaves us with the spreadsheet document filling an interesting role in this processing. Rather than being an active platform, the spreadsheet is downgraded to a mostly passive document with a few active elements.
Once we look at a spreadsheet as a kind of syntax – a sequence of tokens – we can parse it using either of a couple of techniques. We can try to create an LR or LL kind of grammar, which may work out, depending on how complex the spreadsheet is. Often, user inputs are preceded by labels which allow us to do very simple LR(0) parsing.
We can, for example, look for the cell which contains the “Weekly Forecast” data. In the next row, a cell will have a product name, and the following cell will have a forecast number of cases sold.
The other technique is to use a more sophisticated regular expression technique where we need to see a longer sequence of cells or rows to determine the pattern. These aren’t as easy to implement because most RE processing software works with individual characters. We would need to write a RE matcher as a non-deterministic finite automaton that worked with Cells and Rows instead of characters.
Solution Outline.
Here’s a fun kind of solution. It works best if the spreadsheets are pared down to just the input sections with just enough calculation and history to facilitate creating high-quality plans. From the current spreadsheets, we would delete the various tabs that are simply reporting and consolidation within the spreadsheet.
The upload process uses a SAX application to parse, validate, extract, transform and load the spreadsheet. In the (all-too-common) situation where the spreadsheet doesn’t parse successfully, there are two kinds of feedback:
We’ll come back to spreadsheet as syntax in future posts.