Primary data gathering and analysis system - I. Problem formulation, data collecting and storing


Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
17.02.2011
http://www.ibm.com/developerworks/ru/library/sabir/warehouse-1/index.html

Abstract

A standard solution for primary data collection, storage and analysis is proposed. The solution is based on manual input using IBM Forms, with IBM InfoSphere Warehouse used for storage. The analysis of the collected data by means of IBM InfoSphere Warehouse analytical tools and IBM Cognos is discussed in the second part of this article [1]. The proposed approach can serve as a basis for a variety of solutions for different industries and enterprises.

Introduction

A distribution company purchases goods and distributes them to regional customers. Planning new purchases requires information from the regions on the balance of goods. The customers' representatives enter data manually, so, despite training, instructions and reminders, the input data do not meet expectations. As a result, a whole department in the central office verifies the collected data by phone together with the regional representatives.

A transportation company has an extensive fleet of cargo conveyances. Despite the presence of automatic diagnostic tools, the technical validation data of the rolling stock are maintained manually on paper forms and are only later entered into the computer. In the time between the detection of a problem and the transfer of the data from paper to the information system, a defective vehicle can be directed for loading, accidentally or intentionally. The fines specified for this situation cause losses for the transportation company and generate profits for the loading firm which can hardly be called fair.

A federal agency collects reports from the regions on a regular basis. The experts in the regions are trained, and they use these skills to provide the agency with data that make their region look decent. The real picture, of course, is blurred, and the agency operates with inaccurate data.

The list of similar examples can be continued. An offender can steal a car in one region, commit a robbery in a second, and illegally purchase a weapon in a third, and his crimes cannot be combined into one case because of small errors in the recording of the incidents. A business owner can reside in one region, open a business in another, and pay taxes in a third.

The same functional requirements unite these examples:

• Collection of primary, statistical, or reporting data from remote locations is required;
• The data must be checked at the workplace prior to sending them to the center;
• The collected data must be cleansed and stored for a period defined by regulations;
• Management statements must be produced to assess the state of affairs in the regions;
• Analysis based on the collected statistics must be performed, to identify regular patterns and to support management decisions.



In this paper we consider a standard solution for the collection, storage and analysis of primary data entered manually. Despite obvious limitations, this approach can serve as the basis of various solutions for different industries and enterprises.

System requirements

Consider the most general requirements typical of such systems. Certainly, this task formulation is somewhat artificial; at the same time, it discards the details that are specific to real systems but are not central to the typical task of primary data collection, storage and analysis.

The information systems operating in the regions are external to the "Collection and analysis of primary data" system being developed and do not interact with it. The required information must be entered manually using on-screen forms. The forms should reproduce the existing, approved paper forms for data collection as accurately as possible. Input errors should be checked before the filled e-form is sent. The completed e-form is always sent from the region to the central office, and the form is resubmitted as a whole if input errors are revealed.

The completed e-form does not have to be kept in its entirety for audit or legal purposes; only the data contained in the e-form fields are retained. Therefore there is no need to send the entire e-form to the center; the data extracted from it are sufficient.

The consumers of internal management statements are employees and executives. Management statements are provided as on-screen reports and as hard copies. The external users of reports are the staff and management of superior bodies and of cooperating organizations and agencies. Statements for external organizations are provided in hard copy.

The number of users in each region is limited to one user (a data input clerk) at any one time; that is, one workplace per region is enough to operate the system. The number of regions is 1000. The number of simultaneous users of the system in the head office can be estimated at 100 analysts. The number of approved input forms is 10 daily forms with 100 fields each.

To evaluate the data streams, the following approximate values suffice. Experience shows that, on average, one field corresponds to 3-5 KB of a form. Thus the size of one form can be estimated at 300-500 KB, and the daily flow from a single location is about 3-5 MB per day. Given that the e-forms are filled in by hand during the workday, the minimum required connection throughput must allow the transfer of about one form per hour, that is, about 1 kbit/s. The total daily flow from the regions is 3-5 GB per day. If throughput is insufficient, the peak data flow may be reduced by exploiting the difference in time zones between regions and by an approved data transmission schedule.

The storage period for online access is 5 years, after which the data are transferred to the archives. The storage period in the archives is 50 years. Backup and restore tools and procedures must be provided. The telecommunication infrastructure (active and passive network equipment, communication lines and channels) is beyond the scope of the project.
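To make these estimates easy to reproduce, the following small Java sketch recomputes them from the figures above; the 4 KB per field midpoint is our assumption within the stated 3-5 KB range.

// Back-of-the-envelope check of the sizing figures above.
// The constants come from the requirements; 4 KB/field is an assumed midpoint.
public class SizingEstimate {
    public static void main(String[] args) {
        int regions = 1000;            // number of remote locations
        int formsPerDay = 10;          // approved daily forms per region
        int fieldsPerForm = 100;       // fields per form
        double kbPerField = 4.0;       // assumed midpoint of the 3-5 KB range

        double formKb = fieldsPerForm * kbPerField;              // ~400 KB per form
        double regionMbPerDay = formsPerDay * formKb / 1024;     // ~3.9 MB/day per region
        double totalGbPerDay = regions * regionMbPerDay / 1024;  // ~3.8 GB/day in total
        double minKbitPerSec = formKb * 8 / 3600;                // one form per hour

        System.out.printf("Form size: %.0f KB%n", formKb);
        System.out.printf("Per region: %.1f MB/day%n", regionMbPerDay);
        System.out.printf("All regions: %.1f GB/day%n", totalGbPerDay);
        System.out.printf("Minimum link throughput: %.1f kbit/s%n", minKbitPerSec);
    }
}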


The proposed solution must be expandable and scalable. For example, integration of the "Collection and analysis of primary data" system with a document workflow system should be anticipated.

Project objectives

The following tasks should be performed within the project:

• Development of e-forms for approved paper forms;
• Development of e-forms for new paper forms;
• Development of storage for detailed data;
• Development of analytical tools;
• Development of reporting and visualization tools;
• Information security;
• Data back-up;
• Data archiving;
• Logging of system events.

Development of e-forms for approved paper forms

On-screen forms should be designed to correspond to the approved paper forms of statistical indicators. Data entry e-forms for both offline and online modes should be offered; online is the primary data entry mode.

Development of e-forms for new paper forms

Developers should be provided with simple and intuitive design tools for creating new data collection e-forms that extend the production system. There is no need for specially simplified form development tools; the standard tools can be used for this purpose. New forms can be developed both by internal staff and by third-party organizations. The customer must be completely independent of any external development company, so as to be able to change external developers or to give them direct access to the tools for developing and testing e-forms and applications.

Development of storage for detailed data

Since the collected detailed data are subject-oriented, integrated, time-variant and non-volatile, a data warehouse rather than an ordinary database is proposed for data storage. A traditional database is oriented toward the execution of a large number of short transactions, while analytical tasks require a relatively small number of queries over large volumes of data; a data warehouse meets these conditions.

The data warehouse should be focused not on storing individual e-forms but on storing the data from which an e-form can be prepared, so the forms' fields must be mutually agreed upon. Essentially, the e-form should be derived from these agreed data. We highly recommend using a ready-made data model, adapted to the needs of the task. In case of specific requirements, the data warehouse model will be modified jointly with the customers. It is not necessary to store the history of e-forms or the algorithms of e-form calculation and assembly.
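As a minimal illustration of storing field data rather than whole e-forms, the following Java/JDBC sketch creates one hypothetical fact table. The table, column names and connection details are our assumptions for illustration, not part of any ready-made IBM data model.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Hypothetical fact table: one row per form field value, keyed by region,
// form, field and reporting date. All names and the connection URL are assumed.
public class CreateIndicatorTable {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://dwh-host:50000/DWHDB", "user", "password");
             Statement st = con.createStatement()) {
            st.executeUpdate(
                "CREATE TABLE INDICATOR_FACT (" +
                "  REGION_ID   INTEGER NOT NULL," +
                "  FORM_ID     INTEGER NOT NULL," +
                "  FIELD_ID    INTEGER NOT NULL," +
                "  REPORT_DATE DATE    NOT NULL," +
                "  VALUE       DECIMAL(15,2)," +
                "  PRIMARY KEY (REGION_ID, FORM_ID, FIELD_ID, REPORT_DATE))");
        }
    }
}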



For long-term storage, data can be aggregated over longer periods. For example, data with a storage period of more than 5 years can be combined into 5-year aggregates; data with a storage period of more than 1 year can be combined into yearly aggregates; and data with a storage period of less than 1 year can be left as monthly values. A sketch of such a rollup follows.
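This sketch reuses the hypothetical INDICATOR_FACT table from the earlier example and assumes an INDICATOR_YEARLY target table; both names are invented for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Illustrative yearly rollup: monthly values older than one year are
// summed into yearly aggregates. Table names are assumptions.
public class YearlyRollup {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://dwh-host:50000/DWHDB", "user", "password");
             Statement st = con.createStatement()) {
            st.executeUpdate(
                "INSERT INTO INDICATOR_YEARLY " +
                "  (REGION_ID, FORM_ID, FIELD_ID, REPORT_YEAR, VALUE) " +
                "SELECT REGION_ID, FORM_ID, FIELD_ID, YEAR(REPORT_DATE), SUM(VALUE) " +
                "FROM INDICATOR_FACT " +
                "WHERE REPORT_DATE < CURRENT DATE - 1 YEAR " +
                "GROUP BY REGION_ID, FORM_ID, FIELD_ID, YEAR(REPORT_DATE)");
        }
    }
}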

Development of analytical tools

The tools for analytical calculations based on the gathered statistics should provide the following capabilities:

• Fairly simple statistical analysis, for example, calculating the efficiency of use of various resources (a query sketch follows this list);
• Scenario and forecast calculations.
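As an example of the first, "simple" level, the following sketch computes a per-region average of one indicator over the hypothetical INDICATOR_FACT table introduced earlier; the field identifier 42 is invented for illustration.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Illustrative simple statistical analysis: average value of one indicator
// per region over the last year. FIELD_ID 42 is a made-up example indicator.
public class RegionEfficiency {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://dwh-host:50000/DWHDB", "user", "password");
             PreparedStatement ps = con.prepareStatement(
                "SELECT REGION_ID, AVG(VALUE) AS REGION_AVG " +
                "FROM INDICATOR_FACT " +
                "WHERE FIELD_ID = ? AND REPORT_DATE >= CURRENT DATE - 1 YEAR " +
                "GROUP BY REGION_ID ORDER BY REGION_AVG DESC")) {
            ps.setInt(1, 42);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("Region %d: %.2f%n",
                            rs.getInt("REGION_ID"), rs.getDouble("REGION_AVG"));
                }
            }
        }
    }
}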

Development of reporting and visualization tools

Reporting and visualization tools should provide:

• On-screen report generation;
• Paper report generation;
• Report visualization in graphical form through a web interface (browser);
• Grouping of graphs into a desired set (a control panel or dashboard).

Information security

Since the system for data collection, storage and analysis will not contain sensitive data, information security will be provided by the built-in tools of the operating system, databases and data warehouses, application servers and applications.

Data back-up

Backups should be performed by means of the built-in tools of the databases and the data warehouse.

Data archiving

Data archiving is necessary for long-term storage; the retention period is currently defined as 50 years. In the future it may be necessary to reduce the reporting data to a coarser grain, that is, to combine monthly statements into yearly statistics and yearly data into aggregates covering several years.

Logging of system events

Logging of source data input should be ensured in order to eliminate possible disputes over the non-receipt of data sent by a user from a region.
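One possible shape of such a log, sketched under the same assumptions as the earlier examples (the table and column names are invented): one row is written for every e-form received from a region, so that both sides can later verify what was sent and when.

import java.sql.Connection;
import java.sql.PreparedStatement;

// Hypothetical input-event log: one row per received e-form submission.
public class InputLogger {
    public static void logSubmission(Connection con, int regionId,
                                     int formId, String status) throws Exception {
        try (PreparedStatement ps = con.prepareStatement(
                "INSERT INTO INPUT_LOG (REGION_ID, FORM_ID, RECEIVED_AT, STATUS) " +
                "VALUES (?, ?, CURRENT TIMESTAMP, ?)")) {
            ps.setInt(1, regionId);
            ps.setInt(2, formId);
            ps.setString(3, status);   // e.g. "ACCEPTED" or "REJECTED"
            ps.executeUpdate();
        }
    }
}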

Success criteria

Data collection e-forms must comply with the approved list of data input forms. The procedures performed should follow the business process of collecting, processing, storing and disseminating information agreed with the customer. Electronic and paper output forms must conform to the approved list of management statement forms.


Logging of data input processes should be ensured in order to track the timeliness of reporting. The reliability of the delivered information should be no worse than the quality of the collected data.

Architecture of the system for data collection, storage and analysis

In this paper we consider a typical task without taking into account the specific requirements of various projects. Therefore the proposed architecture is based on the simplest software configuration. Data are collected with the help of IBM Lotus Forms; storage, analysis and reporting are implemented using IBM InfoSphere Warehouse; and the architecture includes IBM Cognos software for corporate performance management and data interpretation.

The separation of the subsystems for data entry, collection, storage and analysis allows us to construct different architectures, depending on the needs of the task and the requirements of the enterprise infrastructure.

A centralized architecture for data collection, storage and analysis is represented in Pic. 1. This architecture assumes that data input can be carried out remotely, and that all servers for data acquisition (Lotus Forms), data storage (InfoSphere Warehouse), and data analysis and interpretation (Cognos) are installed in a single data center. Analysts can work both locally and remotely, using the Web interface provided by Cognos for preparing and executing analytical calculations.

A distributed Lotus Forms server architecture can be created if varying regional forms must be filled in. In this case, the initial processing of forms is implemented at the regional level, and the gathered data are sent to the central office where the data storage servers reside. A combination of large volumes of regional data and poor telecommunication lines may require a solution in which both forms processing and data storage are decentralized.

Analytical work with a large number of ad hoc queries may require a distributed infrastructure of Cognos servers. In this case, data from the centralized repository can be transmitted in advance to the regional centers where the Cognos servers are deployed. This architecture provides an acceptable response time and high-performance execution of analytical tasks in the regions, even in the absence of high-speed communication channels. The various options for the architecture of the system for data collection, storage and analysis will be discussed in more detail in a separate article.

Another advantage of the proposed modular system is the possibility of expanding its functionality. Since all modules interact through standard protocols, the system can be integrated with document management, metadata and master data management, and enterprise resource planning systems, as well as with a variety of analytical and statistical packages.



Pic. 1. Centralized architecture of the system for collecting, storing and analyzing data



Data collection

Lotus Forms is a set of products that enables organizations to use e-forms for manual data entry and to transfer the collected data to other systems [2]. Lotus Forms Server can be integrated with data repositories (e.g., IBM DB2, Oracle, and MS SQL Server) and with a variety of document management systems and document repositories (for example, IBM FileNet). The architecture of primary data collection based on Lotus Forms is shown in Pic. 2.

Pic. 2. Primary data collection using Lotus Forms

A forms designer prepares e-forms for data entry using Lotus Forms Designer. The e-forms are stored in the forms repository in XFDL format [3, 4], which was submitted to the W3C. An application developer develops the Forms application logic, the Webform Server servlets, and the mappings for Transformation Extender (TX), which associate the form fields with values in the database. A translator converts the e-form from XFDL to HTML and JavaScript for users working with a thin client (browser).


Users who have installed Lotus Forms Viewer (thick client) can work with e-forms in XFDL format, bypassing the translation to HTML. Users in the regions enter data by means of Lotus Forms Viewer or a browser. The data can pass through several stages of verification:

• On the user's computer during form filling, using the form's built-in logic;
• On the Lotus Forms Server, invoking the application logic;
• When the data are loaded into the database.

The data can be transmitted to the InfoSphere Warehouse database through the CLI, ODBC, and JDBC interfaces.
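As a minimal sketch of the JDBC path, the following code writes the validated field values extracted from one e-form into the hypothetical INDICATOR_FACT table introduced earlier. The connection URL and all names are assumptions; in this architecture the real field-to-column mapping is produced with Transformation Extender rather than hand-written.

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Map;

// Sketch: load the extracted field values of one e-form in a single
// transaction. All table, column and connection names are assumed.
public class FormDataLoader {
    public static void load(Map<Integer, Double> fieldValues, int regionId,
                            int formId, Date reportDate) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://dwh-host:50000/DWHDB", "user", "password")) {
            con.setAutoCommit(false);               // one transaction per form
            try (PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO INDICATOR_FACT " +
                    "(REGION_ID, FORM_ID, FIELD_ID, REPORT_DATE, VALUE) " +
                    "VALUES (?, ?, ?, ?, ?)")) {
                for (Map.Entry<Integer, Double> e : fieldValues.entrySet()) {
                    ps.setInt(1, regionId);
                    ps.setInt(2, formId);
                    ps.setInt(3, e.getKey());       // field identifier
                    ps.setDate(4, reportDate);
                    ps.setDouble(5, e.getValue());
                    ps.addBatch();
                }
                ps.executeBatch();
            }
            con.commit();                           // all fields or none
        }
    }
}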

Data storage

IBM InfoSphere Warehouse Enterprise Edition [5, 6] consists of the following products:

• InfoSphere Warehouse Design Studio, which includes IBM Data Server Developer Workbench, a subset of IBM Rational Data Architect components;
• InfoSphere Warehouse SQL Warehousing Tool;
• InfoSphere Warehouse Administration Console, which is part of the Integrated Solutions Console;
• DB2 Enterprise Server Edition for Linux, UNIX and Windows;
• InfoSphere Warehouse Cubing Services;
• DB2 Query Patroller;
• InfoSphere Warehouse Intelligent Miner;
• IBM Alphablox and companion documentation;
• WebSphere Application Server.

The architecture of data storage in IBM InfoSphere Warehouse is shown in Pic. 3.

Design Studio provides a common design environment for creating physical models, OLAP cubes and data mining models, for designing data flows and SQL control flows, and for building Alphablox Blox Builder analytical applications. Design Studio is based on the open source Eclipse platform. The application developer develops applications using InfoSphere Warehouse Design Studio and deploys them on the server, providing data processing in accordance with the required business logic.

SQL Warehousing Tool (SQW) is a graphical tool that replaces manual SQL coding: it generates the SQL code needed to support and administer the data warehouse. Based on the visual flow of statements modeled in Design Studio, SQW automatically generates SQL code specific to DB2. The integration of SQW with IBM WebSphere DataStage extends the development capabilities of analytical systems based on DB2.

In this project, e-forms filled out according to strict rules are the only data source, so at this stage there is no need for Extract, Transform and Load (ETL) tools such as DataStage. However, as the project evolves, other sources are expected to be connected. The ability to use ETL tools provides functional extensibility of the system without the need for radical changes.


The administrator uses the Administration Console, a WebSphere application, to deploy and manage applications created in Design Studio. The Administration Console allows you to:

• Create and manage database resources, view logs and manage SQW processes;
• Run and monitor database applications, and review their deployment history and execution statistics;
• Manage cubing services, import and export cubes and models, and execute the OLAP Metadata Optimization Advisor;
• Maintain database jobs for data mining, and load, import and export data mining models.

Pic. 3. Data storage in IBM InfoSphere Warehouse

DB2, IBM Alphablox, and WebSphere Application Server have their own administration tools, but these tools can also be launched from the Integrated Solutions Console.

The administrator uses DB2 Query Patroller to dynamically manage the flow of queries to the DB2 database. Query Patroller allows you to adjust database resource usage so that short queries or queries with the highest priority are executed first, ensuring efficient use of resources. In addition, administrators can collect and analyze information about executed queries to determine temporal patterns, frequently used tables and indexes, and resource-intensive applications.


Conclusion

The proposed solution is scalable and its functionality can be expanded. In the future, document workflow, enterprise planning, and metadata and master data management systems can be connected. The system for collecting and analyzing primary data can be easily integrated into an existing enterprise IT infrastructure; alternatively, it may be treated as a first step in the implementation of an enterprise system for data collection, storage and analysis. Various solutions for data analysis by means of IBM InfoSphere Warehouse and IBM Cognos BI will be described in the second part of this article.

The author thanks M. Barinstein, V. Ivanov, M. Ozerova, D. Savustjan, A. Son, and E. Fischukova for useful discussions.

Literature

1. Asadullaev S., "Primary data gathering and analysis system – II", 2011, http://www.ibm.com/developerworks/ru/library/sabir/warehouse-2/index.html
2. IBM Forms documentation, https://www.ibm.com/developerworks/lotus/documentation/forms/
3. Boyer J., Bray T., Gordon M., "Extensible Forms Description Language (XFDL) 4.0", 1998, http://www.w3.org/TR/1998/NOTE-XFDL-19980902
4. IBM, "XFDL 8 Specification", 2010, http://www10.lotus.com/ldd/lfwiki.nsf/xpViewCategories.xsp?lookupName=XFDL%208%20Specification
5. IBM, "InfoSphere Warehouse overview 9.7", 2010, http://publib.boulder.ibm.com/infocenter/idm/v2r2/index.jsp?topic=/com.ibm.isw.release.doc/helpindex_isw.html
6. IBM, "IBM DB2 Database for Linux, UNIX, and Windows Information Center", 2011, http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/index.jsp?topic=/com.ibm.db2.luw.doc/welcome.html


