Thursday, August 19, 2010

Continuous Data Quality Management: The Cornerstone of Zero-Latency Business Analytics Part 2: One Solution

Implementing A CDQM Application

It is impossible to improve that which cannot be measured. A CDQM (Continuous Data Quality Management) tool provides a real-time, up-to-date scorecard to measure data quality within the enterprise. By checking data quality in real-time, "data fires" can be detected when they are just starting, before any real damage has occurred. Most enterprises fight fires with axes, fire hoses, trucks, and hordes of firemen, but the CDQM approach is a smoke detector. It's far less expensive to put a fire out when it's just smoldering, rather than to extinguish a blazing house fire and then remodel the entire house.

Data quality must be a constant commitment. Most companies, when implementing a data quality initiative, treat it as a massive data-cleansing project that scrubs data as part of a system upgrade or new system implementation. That approach is a lot like taking a shower at the beginning of the month and declaring, "Now I'm clean!"

Without metrics and constant measurement, there is no way to verify data quality on an ongoing basis and keep the data in the system clean. And if the quality of data is not fully understood, one cannot be confident in the decisions made based upon that data. Considering that US enterprises are expected to spend upwards of $22B by 2005 on business intelligence initiatives, doesn't it make sense to implement a system to track the quality of that data?

This is Part Two of a two-part article on the importance of maintaining data quality, based on the author's experience.

Part One defined the problem of maintaining data quality to an enterprise.

When Good Data Goes Bad

The classic "garbage in, garbage out" scenario becomes all too real when there are quality problems with the data on which important decisions are based. At Metagenix, we like to tell the story of a fictional electronic parts manufacturing company we call Huntington Corp. We pieced together Huntington from several real-world companies whose data quality issues almost did them in, and whom, for obvious reasons, we cannot name. We like to tell the Huntington story because it so clearly illustrates the need for a continuous data quality tool. Take, for example, the data issues that arose when Huntington acquired rival company, Systron, and began the process of combining the two companies' customer accounts into one customer database. What started as an innocent attempt to merge this data became a mess that almost spiraled out of control.

The problem began with the two companies' customer account numbers. Huntington's were eight digits long; Systron's were ten alphanumeric characters. When Huntington's IT department began merging the account data, it decided that Huntington's eight-digit account number would be the default format. All of Systron's accounts were converted to that format by changing alpha characters to zeroes and truncating the ten-character identifiers to eight, eliminating the validity of Systron account numbers in one stroke—with all the attendant downstream angst and confusion for employees and customers. Having a continuous data quality tool in place during this ETL process would have prevented this data nightmare from occurring.
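To make the failure mode concrete, here is a minimal sketch, in Python, of the kind of pre-load assertion such a tool could apply during the merge; the function names and the rule's form are illustrative assumptions, not the product's actual interface.

import re

def convert_account(systron_id: str) -> str:
    """Reproduce the flawed mapping: alpha characters become '0', then truncate to eight."""
    return re.sub(r"[A-Za-z]", "0", systron_id)[:8]

def find_collisions(source_ids):
    """Return pairs of source IDs whose converted form is no longer unique."""
    seen, collisions = {}, []
    for sid in source_ids:
        converted = convert_account(sid)
        if converted in seen:
            collisions.append((seen[converted], sid, converted))
        else:
            seen[converted] = sid
    return collisions

# Two distinct Systron accounts collapse to the same eight-character number.
print(find_collisions(["AB12345678", "CD12345678"]))

Run against the full Systron extract before loading, a rule like this would have reported every collision and every identifier that could not be converted losslessly.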

Another problem occurred during the Systron integration when IT combined the two companies' parts and inventory tables. As in the customer account number fiasco, the numbering schemes clashed: Systron used alphanumeric part numbers in separate relational tables for each of its divisions, while Huntington had a single master table with numeric part listings. When Huntington's IT department folded Systron's multiple alphanumeric tables into Huntington's master table, it failed to include the division number assigned to each division table in the new Huntington part number. As a result, customers calling to order Systron parts, familiar with the old Systron part numbers, were befuddled by Huntington's new ones, a disaster not fixed until it was called to IT's attention.

Other data quality issues arose even before the Systron acquisition. Huntington's accounting department was still batch processing journal and other general ledger entries on a nightly and weekly basis, and over time it was spending more and more time researching and reconciling erroneous entries. Though Huntington Corp. accounts were clearly defined in a chart of accounts by department and account type, the general ledger and other systems were accepting entries against invalid account numbers, causing subsequent delays in invoicing and payment processing. A continuous data quality tool set up to check business rules on account numbers would have ensured that entries were posted only to valid accounts.

The same type of problem happened when Huntington installed a new CRM system. While customer service representatives were doing their best to gather the customer data needed to build effective communications with the company's customers, they frequently failed to capture addresses and telephone numbers. When Huntington's marketing or customer relations departments decided to conduct campaigns or send follow-up messages, this contact data was missing—unbeknownst to these departments and to the detriment of their campaigns and messages. A continuous data quality tool would have detected that contact information was not being captured and produced an error report, allowing the missing information to be obtained in time for these customer communications. A completeness rule of that kind can be as simple as the sketch below.
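Here is a minimal sketch of such a completeness rule; the field names (address, telephone, email) and the record format are assumptions used only for illustration.

REQUIRED_CONTACT_FIELDS = ("address", "telephone", "email")

def missing_contact_fields(record: dict) -> list:
    """Return the required contact fields that are empty or absent."""
    return [f for f in REQUIRED_CONTACT_FIELDS if not str(record.get(f, "")).strip()]

# This record would be flagged before any campaign relies on it.
print(missing_contact_fields({"name": "Acme Corp", "address": "", "email": "buyer@acme.example"}))
# -> ['address', 'telephone']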

The downside of the data quality issues in these situations is fairly obvious: poor data quality negatively impacts the value of the data used to support decision-making and operations. From something as simple as an incorrect general ledger account that throws off accounts payable analysis and reconciliation, to a missing email address that results in a failure to notify a multi-million-dollar customer of a backorder, and everything in between, data quality plays a crucial role in the support structure of today's business.

The Solution

Businesses clearly need a framework for implementing and monitoring data quality on a continuous basis as part of any business intelligence initiative. At Metagenix, we have developed such a framework: a continuous data quality management tool designed to work in concert with enterprise business intelligence and database systems. Built with an interface as easy to use as a simple address checker, yet developed for robust enterprise use across departments, data warehouses, and silos, Metagenix's CDQM tool works by applying specific business rules as a continuous check of data quality. The CDQM framework consists of several interconnected processes, which we have outlined below.

Business Rules Capture Repository and Interface

This system provides a meta-data repository to capture business rules and knowledge about data across the enterprise. Business rule assertions are expressed as algorithms and functions that indicate the validity of data. These assertions can be in the context of a field, a virtual record, or an entire data source. One example of a rule is that zip codes must be five or nine digits for US addresses, and must match the expected city and state fields. Another example is that the schema for the Customers table is not expected to change.
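As a minimal sketch, the zip-code rule above might be expressed as a field-level assertion like the following; the registration call at the end is a hypothetical repository interface, not the actual Metagenix API.

import re

ZIP_PATTERN = re.compile(r"^\d{5}(-\d{4})?$")  # five digits, or nine with a hyphen

def us_zip_rule(record: dict) -> bool:
    """Assert the zip field is a valid 5- or 9-digit US zip code.
    A complete rule would also cross-check city and state against a lookup table."""
    return bool(ZIP_PATTERN.match(record.get("zip", "")))

# repository.register_rule(source="Customers", field="zip", assertion=us_zip_rule)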

Metrics Repository

This tracks the results of processes within the CDQM system and provides historical information. A complete data quality scorecard can be constructed based upon the information stored in the Metrics Repository. The interface allows a user to view the results, and slice and dice the results data. Depending upon the job function of the user, the interface will provide different views of the scorecard. For example, the CIO might be interested in which systems are generating the most quality problems, while a DBA might be interested in table schema changes that have occurred.
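A scorecard query against the Metrics Repository might look like the following minimal sketch; the table layout and column names are assumptions, since the repository is simply a standard relational database.

import sqlite3

conn = sqlite3.connect("metrics_repository.db")
conn.execute("""CREATE TABLE IF NOT EXISTS rule_results (
    checked_at TEXT, source_system TEXT, rule_name TEXT, violations INTEGER)""")

# CIO view: which systems generated the most quality problems in the last week?
worst_systems = conn.execute("""
    SELECT source_system, SUM(violations) AS total_violations
    FROM rule_results
    WHERE checked_at >= date('now', '-7 days')
    GROUP BY source_system
    ORDER BY total_violations DESC
""").fetchall()

The same table, filtered by rule_name, could serve the DBA's view of schema-change events.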

Event Processor

This is an object interface that allows external applications to communicate events of special interest to the CDQM framework. For example, an ETL job might inform the system that a file of 225,003 records, time-stamped from yesterday, was loaded into the CUSTOMERS table at 12:03 AM and that the transfer took 14 seconds.

The CDQM framework could then be used to track execution speeds, run checksums on the data, and check the timeliness of the transfers. Problems such as loading the same file twice could be recognized immediately. Likewise, recognizing an incremental increase in transfer times over the last month could spur investigation into potential difficulties in the ETL process. The Event Processor is not limited to monitoring data movements; external applications could signal a variety of events to be tracked as part of the data quality monitoring effort.
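As a minimal sketch, the load event above and the duplicate-load check might look like this; the LoadEvent class and submit() function are assumptions, not the framework's actual object interface.

from dataclasses import dataclass

@dataclass(frozen=True)
class LoadEvent:
    table: str              # e.g. "CUSTOMERS"
    record_count: int       # e.g. 225_003
    source_file: str        # identifies the file that was loaded
    duration_seconds: float # e.g. 14.0

_seen_loads = set()

def submit(event: LoadEvent) -> None:
    """Record the event; flag an immediate problem if the same file is loaded twice."""
    key = (event.table, event.source_file)
    if key in _seen_loads:
        print(f"WARNING: {event.source_file} was already loaded into {event.table}")
    _seen_loads.add(key)
    # Transfer speed and timeliness would also be written to the Metrics Repository here.

submit(LoadEvent("CUSTOMERS", 225_003, "customers_daily.dat", 14.0))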

Transaction Server

This system allows an enterprise to centralize data validation. Instead of multiple applications each implementing their own validation logic, a central set of business rules is used to determine the validity of each submitted datagram.

External applications use an object interface to transmit data to the Transaction Server, which determines the validity of the data according to the rules stored in the repository and returns a result indicating potential problems. At the same time, the Transaction Server updates the Metrics Repository with the available information about the transaction, so decision makers can determine the sources and causes of faulty data and adjust their business processes accordingly. Likewise, implementing a new business rule merely requires adjusting the meta-data in the Business Rules Capture Repository, rather than recoding potentially hundreds of applications that each handle data in a slightly different fashion.
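A minimal sketch of this request/response pattern follows; the TransactionServer class and its validate() signature are assumptions intended only to show the shape of the interaction.

from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    valid: bool
    problems: list = field(default_factory=list)

class TransactionServer:
    def __init__(self, rules: dict):
        # In the framework, rules would be loaded from the Business Rules Capture Repository.
        self.rules = rules

    def validate(self, source: str, record: dict) -> ValidationResult:
        failed = [name for name, rule in self.rules.items() if not rule(record)]
        # The outcome would also be written to the Metrics Repository, keyed by the submitting source.
        return ValidationResult(valid=not failed, problems=failed)

server = TransactionServer(rules={"zip_format": lambda r: len(r.get("zip", "")) in (5, 10)})
print(server.validate("OrderEntry", {"zip": "2770"}))  # ValidationResult(valid=False, problems=['zip_format'])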

Rules Checker

The Rules Checker mirrors the Transaction Server, but operates on a macro level. Instead of providing a real-time service, the Rules Checker is run periodically to verify compliance with the business rules across a variety of data sources. For instance, the Rules Checker might be run every night against all records in the order entry database to verify that all orders reference part numbers that exist in the catalog. Another example would be a nightly check that the domain of the Customers->Type field matches the expected values stored in the repository.

The Rules Checker updates the Metrics Repository, and can also generate events such as an email to a responsible manager when certain rules are violated. Imagine being able to come into the office each morning and receive an email indicating which records were loaded incorrectly last night and what's wrong with them!
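Here is a minimal sketch of the nightly order-versus-catalog check described above; the table and column names are assumptions chosen for illustration.

import sqlite3

conn = sqlite3.connect("order_entry.db")
orphans = conn.execute("""
    SELECT o.order_id, o.part_number
    FROM orders o
    LEFT JOIN parts_catalog p ON p.part_number = o.part_number
    WHERE p.part_number IS NULL
""").fetchall()

if orphans:
    # In the framework, this result would update the Metrics Repository and
    # could trigger an email to the responsible manager.
    print(f"{len(orphans)} orders reference part numbers not in the catalog")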

Scheduler

The Scheduler activates the Rules Checker processes according to schedules and dependencies determined by the user. For example, a user could specify that the check of the master customer name file should run every night immediately following the successful completion of an ETL job.
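A minimal sketch of that dependency, expressed as a simple schedule entry; the structure shown is an assumption, not the product's configuration format.

schedule = {
    "master_customer_name_check": {
        "frequency": "nightly",
        "depends_on": "customer_etl_job",     # run only after this ETL job...
        "condition": "depends_on_succeeded",  # ...and only if it completed successfully
        "action": "run_rules_checker",
    }
}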

The Metagenix CDQM framework built from these components is highly scalable and user-friendly. All interfaces are delivered via a web browser. The repositories are implemented in standard, ODBC-compliant relational databases. The Rules Checker, Transaction Server, and Event Processor were designed from the outset as massively parallel, high-performance systems capable of handling the large volumes of data involved.

Editor note: The information presented here is the opinion of the author, based on his experience in using a continuous data quality management tool. TEC does not endorse this specific product per se. This article is published by TEC because it contains some useful information for companies concerned with data quality management issues.

CDQM: In Summary

With an ever-increasing dependence on data for near-real-time and real-time decision-making, and with greater connectedness among databases, data warehouses, marts, and silos across the enterprise, we at Metagenix believe the call for CDQM is loud and clear.

Data becomes information when it is used in analysis upon which decisions are made. Data quality problems result in bad information, necessarily leading to bad decisions. At Metagenix, we strongly believe CDQM is the answer to the data quality problem, a problem that our technology will solve.



SOURCE:
http://www.technologyevaluation.com/research/articles/continuous-data-quality-management-the-cornerstone-of-zero-latency-business-analytics-part-2-one-solution-16778/
