Data Analysis Gets Hot

 

Overwhelmed by the volumes of data they’re producing, organizations are looking for new ways to turn it into useful information.  Because of the volumes involved, they need techniques that automate as much of the work as possible.

Both large, established vendors and small start-ups have offerings here, and some have very impressive customer lists.  We’ve been watching this space with great interest, hoping to spot some winners, but it’s clear that it’s time to do that watching more publicly (in the hope that you’ll tell us what you’re looking at, too).

What Do We Want To Do:

Typically, we’re looking for products that combine these features:

Scalability:  The databases involved (or piles of less structured data such as text or images) can be huge.  A good solution should handle a collection of any size with good performance.

 

Federation:  The ability to combine multiple, heterogeneous databases and data sources, from a variety of distributed locations, and treat them as if they were in a single homogeneous database.

 

Data Query:  Being able to ask the federated data relevant business questions, preferably in business language, rather than in a programming language or a structured query language.  This allows the business experts to formulate and change their own queries directly.  The Holy Grail is unstructured natural language queries (just ask for what you want in your native language), with the computer figuring out what to find and how to present the answer.  Today, query tools either can’t do that or can handle natural language in only a fairly limited way (limited vocabularies, required word ordering, and other tricks).

 

Data Analysis:  Analyzing the data (generally algorithmically), looking for information that fits specific conditions (rules), exceptions to rules, and interesting patterns (discovery).

 

Data Presentation:  How is the information presented to the user?  Good tools offer a variety of presentation styles including text summaries, text in context, charts, and graphs.  They should also let the user drill down to the data itself and ask variations of the query or further, more detailed questions.
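To make the federation, query, and analysis features a little more concrete, here is a minimal Python sketch.  It is our own illustration, not any vendor’s product or API: the two data sources, the field names, and the functions `federated_rows`, `total_by_region`, and `exceptions` are all hypothetical.  It treats a small relational database and a flat file as if they were one data set, asks a business question of the combined data, and then looks for exceptions to a simple rule.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational database of orders.
orders_db = sqlite3.connect(":memory:")
orders_db.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
orders_db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 120.0), (2, 15.5), (1, 980.0), (3, 42.0)],
)

# Hypothetical source 2: a flat file of customers kept somewhere else.
customers_csv = "customer_id,name,region\n1,Acme,East\n2,Bolt,West\n3,Cray,East\n"


def federated_rows():
    """Federation: yield joined rows as if both sources were one database."""
    customers = {
        int(row["customer_id"]): row
        for row in csv.DictReader(io.StringIO(customers_csv))
    }
    query = "SELECT customer_id, amount FROM orders"
    for customer_id, amount in orders_db.execute(query):
        customer = customers[customer_id]
        yield {
            "name": customer["name"],
            "region": customer["region"],
            "amount": amount,
        }


def total_by_region(rows):
    """Data query: total order amount per region."""
    totals = {}
    for row in rows:
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def exceptions(rows, limit):
    """Data analysis: rows that break the rule 'amount <= limit'."""
    return [row for row in rows if row["amount"] > limit]


region_totals = total_by_region(federated_rows())
large_orders = exceptions(federated_rows(), limit=500.0)
print(region_totals)
print(large_orders)
```

A real product would do this across distributed sources and at far larger scale; the sketch only shows how the pieces relate to one another.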

Almost all data analysis tools assume that (1) they will require some set-up (at least to tie them to their data sources) and customization and (2) their users will need some training.

But not all tools are created equal.  Some are essentially ongoing science projects that will require vast quantities of continuing professional support; others, once installed and tested, will be very accessible to ordinary knowledge workers.  You make a decision based on how much customization you need (and how much time and money you’re willing to spend).

IBM Has A Rich, Robust Offering

An excellent example of an “establishment” offering is IBM’s, which takes a soup-to-nuts approach.  It ranges from the company’s #1 best-selling UDB (DB2) database manager through its latest experiments in interfaces and XML tools.  The range includes DB2 OLAP Server, which allows business customers to ask questions against a multidimensional set of data, and DB2 Warehouse Manager (which replaces Visual Warehouse).  IBM offers two tools for federation: DB2 DataJoiner is a standalone federation engine that supports distributed two-phase commit and heterogeneous replication; DB2 Relational Connect can extend DB2 federation to other databases (Access, Informix, Oracle, SQL Server, and Sybase).

Most approaches to information and data analysis now recognize that information is not just columns of data, but also mounds of unstructured text documents.  IBM offers Intelligent Miner for Text, a toolkit for system integrators, solutions providers, and application developers.  Its text analysis tools automatically identify the language of a document, create clusters as logical views, categorize and summarize documents, and extract relevant textual information, such as proper names and multi-word terms.  It includes the IBM Text Search Engine, the NetQuestion solution for Internet and intranet text searches, and a Web crawler.  (Find more details at http://www-3.ibm.com/software/data/iminer/fordata/.)

IBM also supports a number of Research Projects in this area, including:

Visual Attribute Explorer offers a way to explore data by showing it in bar charts and coordinate plots, allowing the analyst to apply constraints and immediately see and assess the results.  It allows for the interactive discovery of both the nature of the data and the relationships between fields within the data.  It may be used independently or with IBM’s DB2 Intelligent Miner for Data.  Visual Attribute Explorer is downloadable from alphaWorks at http://www.alphaworks.ibm.com/tech/visualexplorer.

IBM’s Xperanto Project is a consolidated development effort that offers access to diverse data stores (the federation thing – see above) and supports queries in both SQL and the XQuery language.  Built on DB2 underpinnings, it supports store, search, cache, transformation, and replication features as well as integration with WebSphere.  You can view a demonstration of Xperanto at www.ibm.com/software/data/developer/demos/xperanto/.

Aleri

Aleri has applied high-speed vector processing to the problem of data analysis.  This allows Aleri customers to work with real-time transaction processing data and get a continuously updated picture of their business.

One common method of data analysis is to build a data cube and then run queries against the cube’s tables.  It might help to think of Aleri as a virtual n-dimensional cube.  Aleri doesn’t need to take the time and resources to build the cube in advance (many high-volume applications require building many data cubes); it simply analyzes the vectors (columns or rows) of whatever n-dimensional cube, of any size, it would need to create to answer a particular query.  The virtual cube is built on the fly, in real time.

This avoids segmenting and fragmenting the data, and it removes any doubt about whether all of the data is current.  In the Aleri version of data analysis there is no data except current data.
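The virtual-cube idea can be sketched in a few lines of Python.  This is our own illustration of the general technique (aggregating column vectors at query time instead of pre-building a cube); it is not Aleri’s engine or its VCL, and the data and the names `columns`, `virtual_cube`, and `append_transaction` are hypothetical.

```python
from collections import defaultdict

# Hypothetical transaction data held as column vectors (one list per column).
columns = {
    "region": ["East", "West", "East", "East", "West"],
    "product": ["A", "A", "B", "A", "B"],
    "amount": [10.0, 20.0, 5.0, 7.5, 2.5],
}


def virtual_cube(columns, dimensions, measure):
    """Sum `measure` over the requested dimensions at query time.

    Only the cells this particular query needs are computed;
    nothing is pre-built or stored between queries.
    """
    cells = defaultdict(float)
    dimension_vectors = [columns[name] for name in dimensions]
    for *key, value in zip(*dimension_vectors, columns[measure]):
        cells[tuple(key)] += value
    return dict(cells)


def append_transaction(columns, region, product, amount):
    """Add one new transaction to the end of every column vector."""
    columns["region"].append(region)
    columns["product"].append(product)
    columns["amount"].append(amount)


# A summary query, then a drill-down on the same live data.
by_region = virtual_cube(columns, ["region"], "amount")
by_region_and_product = virtual_cube(columns, ["region", "product"], "amount")

# A new transaction arrives; the next query sees it with no rebuild step.
append_transaction(columns, "West", "A", 4.0)
by_region_after = virtual_cube(columns, ["region"], "amount")

print(by_region)
print(by_region_and_product)
print(by_region_after)
```

Because every answer is computed from the current vectors, the second region query picks up the new transaction automatically, which is the point of having no data except current data.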

The single meta engine can be pointed at multiple data sources.  It maps the data, builds a relational-seeming data model, and writes the data into Aleri’s database.  The data is accessed via SQL or Aleri’s own VCL (vector command language).

 

Once you have created a query, it can be updated with real-time data or drilled down (in other words, details of the information can be computed in real time as if it were a cube).

Aleri is 2.5 years old and it thinks it’s ready to go up the ramp and find its market.  With U.S. offices in New York and Chicago and development and sales in St. Petersburg and Western Europe, Aleri is looking for $30 million in revenue in 2002.  It expects to announce its first partnership deals shortly.

With the interest in data analysis, and all that data piling up, we suspect there will be customers eager to see whether a real-time solution has finally arrived.

InfoTame

Aleri is not the only Russian-developed data analysis software we’ve seen in recent months.  A somewhat more mature product is InfoTame, with more than 20 installed customers in its native Russia (including TV stations, government bodies, and oil companies).

The InfoTame product examines very large text databases and develops “information portraits” to help users understand both their content and the hidden relationships that these document repositories may contain.  This analysis can be done regardless of the language in which the documents are written.

InfoTame can offer text summaries, access to the text in context, or graphical depictions of the patterns it has discerned. 

Recently, the California-based company booked its first US order, from a major law firm.  The firm will use InfoTame to discover links and associations buried within large volumes (~50 Gbytes, over 10 million pages) of text data.  InfoTame’s text-analysis software uncovers relationships between objects that the user is not aware of and therefore cannot look for with a common search engine.  In a day and a half of preliminary sample-data analysis, InfoTame was able to extract over 52,000 text files and uncover many useful links from only 7% of the total data available.  It would take around 100 days just to load these files manually into a database, without considering any subsequent analyses.

Wohl Associates wrote a White Paper about InfoTame’s technology and its business benefits, which you can find at http://www.infotamecorp.com/InfoTame_White_Paper_by_Amy_Wohl.pdf.  (In the interests of full disclosure, we need to report that this results in Wohl Associates receiving a small and illiquid financial interest in the firm.)



Comments or Questions: Send Email to opinions@wohl.com


Entire contents © 2001  by Amy D. Wohl. All rights reserved. Reproduction of this publication in any form without prior written permission is forbidden.