Data Conversion - Many Souces, One Target
Nathan W. Pollock IES Utilities 2001s’St. SE, Cedar Rapids, Iowa 52401, USA Jeff Nash MSE Corporation 941 N. Meridian Street, Indianapolis, Indiana 46204, USA
Introduction
We have all heard many times, “...communication is the key”. This remains true in a successfd data conversion project, even more so when it is complicated. It cannot be managed by memory and conversation alone. It requires a well-organized, documented data conversion plan. The conversion specification covers all components of a data conversion project. The conversion matrix documents the relationship of existing data to its place in the target data model. When a project must combine many sources into one target system, documentation of these technical details becomes vitally important to keep everyone on schedule and synchronized. The expectation remains to deliver quality data to the users quickly. Time does not permit the luxury of careful deliberation and reiteration in favor of quick, correct decisions and implementation. A process must be followed to stage each activity where it can be accomplished most efficiently. Remove all activities from the critical path that unnecessarily delay getting converted data to users. This paper was based on the data conversion portion of the Vision IMPACT project at IES Utilities. This combines the positive experiences and those that could have been better with a focus on documentation and process. If we had to do it all over again... Documentation In the simplest terms, the conversion matrix is a database itself with three major components: the data sources, the target data model and the relationship between them. When it is well organized it serves as a list of issues to resolve during the conversion process. As questions arise over an issue, the matrix can be revisited to refresh or reevaluate. It serves as the conversion vendor’s guide in creating the translation program or conversion procedures. Missing, inconsistent or unidentifiable information is identified and will be handled during data scrub. ![]() Figure 1 Sources The most valuable step in the data conversion process is documentation of what you have. What data sources exist in your company? Where are they located? How are they used in your work flow processes? What specific information is on each source? What attribution is kept on each item? How would a human or a computer identify each item? For each item, what is its exclusive, primary or secondary source? Already, this may sound like a huge intensive project in itself, because it is a critical path item. A valuable exercise in discovering hidden information or information that is known by very few people, it can also point out what information is missing. Source documentation needs to be started early. This can even be done before a conversion project is identified, and is very helpful if completed before data modeling begins. A clear picture of what and how information is used today will insure those items are not forgotten. As new information is considered for inclusion in the data model, identifying its source must be considered as well. Once the new information is selected and the source is identified, it must be added to the conversion matrix. There are many ways to extract information from the data sources depending on how the information is stored. There are experts with tools who can “reverse engineer” data to achieve these results; a helpful step toward getting started or building much of the documentation. The results will tell you how the data was structured, but additional work is needed to list the actual data values. Studying that information and conducting user interviews can explain why it is the way it is. Field Inventory If a specific field inventory has not been identified, the source matrix may suggest one. Completion of the data model and the relationship matching in the conversion matrix can finalize those decisions. The premise of this paper suggests a limitation of information that a field inventory can provide. If a field inventory is planned, it is tempting to gather “just one more bit of information since we are out there anyway”. Some information may be acquired for “free” if the specialist is waiting for a GPS coordinate. Most items will cost more time and money, so it is important to clearly define a strict scope of work and stick to it. Judge the true usefulness of each piece of information. Will office staff be running queries and reports on that information? Will this information only be needed occasionally for specific instances and be verified by a field visit? The scope of work here has another important use; it is documentation for this source of information. Once it is complete, it can be included in the conversion matrix. Sources - Svmbolo~v and Annotation Up to this point, this could apply to converting any type of data set, Map conversion has the special consideration of graphics. What do they look like today? What should they look like when in the target system? Documenting symbology and annotation of existing sources should be done as part of the source documentation. This part of the documentation is particularly crucial for paper sources. All symbology variations need to be defined. If conversion has many sources, chances are there are many symbol and annotation variations between them (the source documentation would verify that). Some Geographic Information Systems may allow for multiple symbol and annotation variations for same item. Hand drawn paper maps almost insure this will be the case. A conversion project is the perfect opportunity to set common standards. Symbology and annotation for the new system needs to be documented. Many of these can carry over from the source documentation, but as conflicts arise how can one be selected over another. Adopting industry standards (ANSI, IEEE) can speed up the process of choosing the graphical representation of the data. This is a very subjective activity and everyone has an opinion. While this is started with the source documentation it cannot be completed until the data model is finished. Sources - Key Identifiers Each item needs at least one attribute that can uniquely identify it from all other objects. Some may need to rely on two or more attributes, but the fewer the better. It is important when selecting key attributes to select those with the most positive results and chance for fewest errors. As an example, a symbol that that is printed, viewed, reviewed and recognized by many, is preferable to a level designation, which maybe accessed by only a few people. This is even more critical if the source application allows these two to be separately entered and maintained. When these are determined, any item not conforming to the standards specification must be scrubbed before conversion. Document the key identifiers when the sources are documented. This portion of scrub can be done once they are set and must be done before conversion begins. TarRet Data Model The second most important piece of the data conversion puzzle is defining the data model. This is by no means trivial, but since there are many others who can and will cover this in better detail, here it is only stressed how important it is in the data conversion process. The data model is best designed within the confines of a specific project and must be completed before conversion can begin. If we look at data conversion as a one-way, one-time bridge from the old system to the new, the data source documentation is on the side of the old and the data model is the entrance into the new. The data source documentation can be very helpful in the data modeling exercise. It tells you what information is readily available and how it is organized. You will also discover what you do not have. Once it is decided the missing information is needed, there are several ways to obtain it: outside sources, office research, field inventory or apply rules to extrapolate it from existing information. Whichever method is chosen, the new information must be considered as another data set to be documented. The data model exercise may help complete the next conversion document, the source to target matrix. The source to target matrix begins as source documentation. It is completed when all of the data model attributes are included. Each data source attribute should have a target destination or be tagged as “do not convert”. Each target data model attribute should be matched to a source or tagged as “to be gathered later”. Matching This component of the conversion documentation ties the source matrix to the data model. Once the obvious matches have been made, a thorough check must be made of the source matrix and the data model to resolve all unmatched components. An unmatched source line means that data will not be converted. It could be ignored and discarded or it could be replaced by a more reliable source. An unmatched data model line means there is no conversion source. It may be populated by routines getting the information from its surroundings or related converted objects or it may be a placeholder for future population when users become involved. Placement rules further define how each item looks in relation to the other objects. The greatest opportunist y here is when converting from paper sources. Business rules apply to extrapolated data. This may not be explicit in source materials but it can be extended based on key information. For example, the secondary voltage class assigned to a substation transformer can be applied as the circuit voltage class to the primary side of all distribution transformers of that circuit. These may be identified in the data modeling exercise. Cross Reference Lists This is the most detailed level of your data. Unless the exact values in the data source are to be placed in the target system in every case, a cross-reference list is needed to insure the conversion vendor populates the correct information. Various tricks were used in older databases like concatenating several fields into one or using codes to represent long or complicated descriptions. The new data model may separate concatenated fields into separate descriptors or may include any number of other changes. This is a very good opportunity to cleanup any ambiguities and set definite standards. The first part of this exercise can be started at any time by first generating a domain list of every field in the database. That list can be obtained by first getting a list of all values for each particular field. Then database queries can reduce each list to the unique values only. The next step is to put real-world descriptions to all abbreviated values. This will show any duplications or omissions in information. The last part of this list is to match the source values with actual values allowed in the target data model. Naturally, this part cannot be completed until the data modeling is completed, but the source half can be very helpful in developing the fields and domains in the target half. Building the cross reference list during data modeling will complete the task and serve as reference notes to the decisions that were made. Process Approach Most likely the data conversion matrix has illustrated some imperfections, inconsistencies and blatant errors. Now data of this quality is being populated into one database. The conversion vendor will likely find conflicting information between sources. It is not hard to imagine the hundreds or thousands of questions that need to be resolved. The conversion schedule will not be extended to allow time to resolve them individually before data is delivered. A process must be developed to address the issues. How does the QA/QC process affect the release of data to users? There are two approaches that could be taken. The first is to convert all the data and resolve all conflicts before releasing it to users. It will be a long time before data is available, and a long time before benefits are realized. To get data released to users sooner, QA/QC after data conversion can be split into two activities. The first activity is quick and finite, data acceptance. Converted data from the vendor is checked so the specifications are met. Are the objects from the sources in the target system with the correct attributes. Were the business rules followed? Were all unanswered questions or discrepancies noted clearly? When those objectives are met, release the data to the users. Why would anyone release imperfect data to users? Because this is the same data that has been used for years. Only through the conversion and combination process have these questions been raised so clearly. The second QA/QC activity is the post-conversion data scrub. A centralized group can address and resolve many of the common items. The users can play a very valuable role in finding and resolving issues as well. ![]() Figure 2 QA/QC Components The QA/QC process through conversion can be broken into four distinct components: pre-conversion scrub, data conversion, data acceptance and post-conversion scrub. Each one has its own purpose and unique opportunities. Pre-conversion scrub insures the key qualities of each object in the electronic files are correct. This enables the translator to read each object correctly and get it to the proper place in the target system. The conversion matrix may need to be updated to more accurately reflect variations on the sources. Cross-reference lists may also need to be updated to handle new information. Data conversion provides many opportunities to improve data. The conversion vendor can incorporate business rules into the translation programs and digitizing process to correct known problems. Any item that cannot be clearly resolved during conversion should be noted. These notes are very usefid during the post-conversion scrub. The purpose of data acceptance is very simple; is the converted data fkom the vendor acceptable? Getting to that determination is not so simple. Did the target data set get populated as expected from the conversion specification and supporting documentation. If a data set is rejected, why? First the data must be fixed to satisfaction, then adjust the process so it does not happen again. Post-conversion scrub is very simple, fix everything that was not fixed before. Its unique opportunity is that all of the data is one system, electronic interfaces may be available to other systems for the first time and tools are available to generate concise reports. Although its purpose is simple and resolving errors maybe straightforward in most cases, it will be time consuming. A dedicated, centralized group can do a lot of this work. Other users can help discover and resolve more elusive issues. Conclusion This paper began with the statement that multiple source conversion can seem like an ugly and impossible task. It may still seem ugly, but the process approach presented here will make it more possible. The important point to remember is that a process approach, with organized, concise details outlined from the very beginning is critically important to the success of the project. Data conversion can proceed regardless of the number of source documents. Understanding why those documents are important, where they fit into the system, and how you are going to capture them is the key. Remember to be objective about what information you obtain for your new system -- be ruthless in your evaluation of sources and unfailingly committed to the project scope, and success will be guaranteed. | ||
|
|