Primary data gathering and analysis system - II. Analysis of primary data
Sabir Asadullaev, Executive IT Architect, SWG IBM EE/A
17.02.2011
http://www.ibm.com/developerworks/ru/library/sabir/warehouse-2/index.html
Abstract

A standard solution for primary data collection, storage, and analysis was proposed in [1]. That solution is based on IBM Forms for manual input and IBM InfoSphere Warehouse for storage. This article considers the analysis of the collected data by means of the IBM InfoSphere Warehouse analytical tools and IBM Cognos. The proposed approach can serve as a basis for a variety of solutions for different industries and enterprises.
Data analysis using IBM InfoSphere Warehouse

Cubing Services & Alphablox based OLAP

IBM Alphablox and the Cubing Services components are used to provide direct access to data in InfoSphere Warehouse. The Cubing Services components include metadata modeling tools for OLAP cubes, an optimizer for materialized query tables (MQTs), and a cube server for multidimensional data access (Pic. 1).
Pic. 1. Cubing Services & Alphablox based OLAP
Thanks to the full integration of the Cubing Services components with the user interface of InfoSphere Warehouse, design is performed in Design Studio, while administration and support are provided through the Administration Console. The Cubing Services cube server processes multidimensional queries expressed in the MDX query language and returns their results. In response to MDX queries, the cube server retrieves data from DB2 via SQL queries. Materialized query tables (MQTs) are used by the DB2 optimizer, which rewrites incoming SQL queries and routes them to the appropriate MQT to achieve high query performance.

IBM Alphablox provides rapid development of analytical web applications that meet the requirements of the enterprise infrastructure and are available both within the corporate intranet and beyond the enterprise firewall. Alphablox applications allow users to perform multidimensional data analysis in real time, using a standard browser as the client. Alphablox applications can work with data from multiple sources, including DB2 and Cubing Services, create structured reports, and present data in the required form with the help of filters and drill-down tools. Alphablox analytic applications are designed to improve the quality of decision making and to streamline financial reporting and analysis, planning, operational analysis, task analysis and reporting, performance analysis, and the analysis of key performance indicators (KPIs).
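To make the division of labor more concrete, here is a minimal sketch that is not taken from the product documentation: it assumes a hypothetical star schema (MART.SALES, MART.TIME) in DB2 and shows, in plain Java/JDBC, an MQT with pre-aggregated sales plus the kind of aggregate SQL that a cube server derives from MDX and that the DB2 optimizer can reroute to the MQT. The connection details, schema, and column names are assumptions.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

/**
 * Sketch: how an MQT supports the SQL that a cube server generates
 * from MDX. The schema (MART.SALES, MART.TIME) and credentials are
 * hypothetical; adjust them to your environment.
 */
public class MqtSketch {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:db2://dwhost:50000/DWHDB", "dwuser", "secret");
             Statement st = con.createStatement()) {

            // Materialized query table with pre-aggregated measures.
            st.execute(
                "CREATE TABLE MART.SALES_BY_MONTH AS ("
              + "  SELECT T.YEAR, T.MONTH, SUM(S.AMOUNT) AS AMOUNT"
              + "  FROM MART.SALES S JOIN MART.TIME T ON S.TIME_ID = T.TIME_ID"
              + "  GROUP BY T.YEAR, T.MONTH)"
              + " DATA INITIALLY DEFERRED REFRESH DEFERRED"
              + " ENABLE QUERY OPTIMIZATION");
            st.execute("REFRESH TABLE MART.SALES_BY_MONTH");

            // An MDX request such as
            //   SELECT {[Measures].[Amount]} ON COLUMNS,
            //          [Time].[2011].Children ON ROWS FROM [Sales]
            // is translated by the cube server into SQL of roughly this
            // shape; the DB2 optimizer can reroute it to the MQT above.
            ResultSet rs = st.executeQuery(
                "SELECT T.YEAR, T.MONTH, SUM(S.AMOUNT)"
              + " FROM MART.SALES S JOIN MART.TIME T ON S.TIME_ID = T.TIME_ID"
              + " WHERE T.YEAR = 2011 GROUP BY T.YEAR, T.MONTH");
            while (rs.next()) {
                System.out.printf("%d-%02d: %s%n",
                        rs.getInt(1), rs.getInt(2), rs.getBigDecimal(3));
            }
        }
    }
}
```

In the architecture described above, the MDX-to-SQL translation is performed by the Cubing Services cube server and the MDX itself is issued by Alphablox; the sketch only illustrates the underlying SQL and MQT mechanics.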
Text and data mining

Data mining algorithms embedded in InfoSphere Warehouse are used to understand the behavior of a customer or a business unit. Data discovery tools make it possible to identify hidden relationships in the data, to profile data, to browse table contents, and to visualize correlated statistics in order to identify data suitable for analysis. InfoSphere Warehouse provides the following data mining tools:
• Miningblox
• Intelligent Miner Easy Mining
• Intelligent Miner Modeling
• Intelligent Miner Scoring
• Intelligent Miner Visualization
• Unstructured text analysis
Data mining using MiningBlox & Alphablox

A typical data mining application can include the following steps:
• Selecting the data for analysis
• Starting the analysis and tracking its progress
• Viewing the results of the analysis
• Selecting, managing, or controlling the data mining tasks

The Miningblox tag library provides tags for each of these steps and is designed to perform predictive analysis using Alphablox functions. In this configuration the J2EE application server hosts the Miningblox applications, Alphablox, the Miningblox tag library, and the data warehouse applications, as well as the Administration Console (Pic. 2).
Miningblox web applications consist of Java Server Pages (JSP) that use the Alphablox and Miningblox JSP tag libraries. A JSP page that invokes Alphablox is compiled at run time on the application server; Alphablox manages the queries, and the web server returns the dynamic content. The data warehouse application contains control flows, which are invoked by the Miningblox web application. Control flows in turn contain data flows and data mining flows. The DB2 database stores both the data analyzed by the mining flows and the results, in the form of mining models and result tables.
Pic. 2. Data mining using MiningBlox & Alphablox

The Administration Console can be used to deploy and administer the data warehouse applications related to Miningblox applications. Design Studio is used to visually design data mining or text analysis flows, including preprocessing and text analysis operators. The generated SQL can be embedded in an Alphablox application, or in any other application, to invoke the data mining flow.
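As a rough sketch of that last step, the snippet below invokes such a generated SQL entry point over JDBC. The stored procedure name MININGDB.RUN_CUSTOMER_SEGMENTATION is purely hypothetical; in practice you would paste in the SQL that Design Studio generates for your particular mining flow.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

/**
 * Sketch: calling the SQL generated for a data mining flow from a
 * Java application (for example, from the logic behind an Alphablox
 * or Miningblox page). The procedure name, parameter, and connection
 * details are placeholders, not actual product APIs.
 */
public class RunMiningFlow {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                "jdbc:db2://dwhost:50000/DWHDB", "dwuser", "secret")) {
            // Hypothetical stored procedure wrapping the generated SQL
            // of a "customer segmentation" mining flow.
            try (CallableStatement call = con.prepareCall(
                    "CALL MININGDB.RUN_CUSTOMER_SEGMENTATION(?)")) {
                call.setString(1, "2011-Q1"); // e.g. the data slice to analyze
                call.execute();
            }
            // The flow writes its output (mining model and result tables)
            // back into DB2, where Alphablox or reporting tools can read it.
        }
    }
}
```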
Data mining using Intelligent Miner

To solve data mining and text analysis tasks it is necessary to develop applications using a specialized SQL API, which consists of two levels that differ in detail and abstraction:
• The Easy Mining application interface is problem-oriented and is used to perform the basic data mining tasks.
• The IM Scoring / Modeling SQL/MM API conforms to the ISO/IEC 13249-6 (SQL/MM Part 6: Data Mining) standard and makes it possible to create data mining applications for specific user requirements. This interface can be used from SQL scripts or from any JDBC, CLI, ODBC, or SQLJ application.

The Easy Mining procedures provide the core functionality of typical data mining tasks. Users only need knowledge of their subject area; they are not required to understand the intricacies of data mining in depth.

IM Scoring and IM Modeling are software development kits (SDKs). These tools are DB2 extensions and include an SQL API that allows data mining functions to be invoked from applications. With the modeling tools of IM Modeling you can use the following functions to develop analytical PMML (Predictive Model Markup Language) models: association rules, sequence rules, cluster models, regression models, and classification. The generated PMML models can be used in the IM Visualization modules or in IM Scoring.

The IM Scoring tools allow application programs to apply a PMML model to a large database, to a subset of it, or to individual rows. IM Scoring can work with the following PMML models: association rules, sequence rules, cluster models, regression models, classification, naive Bayes, neural networks, and decision trees. Forecasting models created with Intelligent Miner for Data are not covered by PMML; such models can be exported from IM for Data in XML format and then used in IM Scoring.

The results of modeling (associations, sequences, classification, clustering, and regression) can be viewed with the ready-made Java visualization tools of IM Visualization. To present the modeling results, these visualization tools can be invoked from applications or as a browser applet.

Design Studio contains editors and visual tools for data mining application development, integrated into the Eclipse environment. An application developer can visually model data mining tasks and generate SQL code to include the Intelligent Miner SQL functionality in analytical applications. For prototyping, you can use an Excel extension that allows you to prove a concept while avoiding the complexities of the SQL API. Through the web interface of the Administration Console, administrators can configure a database for data mining, manage data mining models, and optimize the performance of analytic queries.
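A minimal sketch of the scoring idea: a PMML classification model stored in the database is applied to the rows of a customer table, and the predicted class is written to a result table. The function and table names (ANALYTICS.APPLY_CLASS_MODEL, MART.CUSTOMERS, and so on) are placeholders; the actual IM Scoring UDF names and signatures must be taken from the product documentation.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

/**
 * Sketch: applying a PMML classification model to a DB2 table.
 * ANALYTICS.APPLY_CLASS_MODEL stands in for the real IM Scoring SQL
 * functions; table and column names are illustrative only.
 */
public class ScoreCustomers {
    public static void main(String[] args) throws Exception {
        try (Connection con = DriverManager.getConnection(
                 "jdbc:db2://dwhost:50000/DWHDB", "dwuser", "secret");
             Statement st = con.createStatement()) {
            // Score every customer row with the stored PMML model
            // 'CHURN_MODEL' and persist the predicted class.
            st.executeUpdate(
                "INSERT INTO MART.CUSTOMER_SCORES (CUST_ID, PREDICTED_CLASS)"
              + " SELECT C.CUST_ID,"
              + "        ANALYTICS.APPLY_CLASS_MODEL('CHURN_MODEL', C.AGE,"
              + "                                    C.INCOME, C.TENURE_MONTHS)"
              + " FROM MART.CUSTOMERS C");
        }
    }
}
```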
Text analysis

Text analysis makes it possible to extract business information from patient records, repair reports, database text fields, and call center records. This information can be used in multidimensional analysis, in reports, or as input for data mining. Text analysis covers a broad range of tasks in computer science, such as:
• Automatic grouping of documents into sets of similar documents (clustering, or unsupervised categorization)
• Automatic classification of documents into predefined categories (supervised categorization)
• Extraction of structured data from unstructured text
The text analysis functions of InfoSphere Warehouse are aimed at information extraction, which produces structured data that can then be processed for business intelligence, together with other structured information, by data mining, multidimensional analysis, and reporting tools.
InfoSphere Warehouse uses software tools based on UIMA (Unstructured Information Management Architecture [2]) for text analysis. UIMA is an open, industry-oriented, scalable, and extensible platform for the integration and deployment of text analysis solutions. The text analysis functions of InfoSphere Warehouse can perform the following tasks:
• Explore tables that contain text columns
• Extract data that matches regular expressions, such as phone numbers, e-mail addresses, social insurance numbers, or uniform resource locators (URLs); a small regular-expression sketch follows this list
• Extract data using dictionaries and classifiers, for example product names or customer names
• Extract data using UIMA-compliant components
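As an illustration of the rule-based extraction idea only (not of the InfoSphere Warehouse operators themselves), here is a small, self-contained Java sketch that pulls e-mail addresses and US-style phone numbers out of a free-text field with java.util.regex; the patterns and the sample text are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: regular-expression extraction from free text. */
public class SimpleExtractor {
    // Deliberately simple patterns; production extraction rules are richer.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern PHONE =
            Pattern.compile("\\(?\\d{3}\\)?[ .-]?\\d{3}[ .-]?\\d{4}");

    static List<String> extract(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String callCenterNote =
            "Customer jane.doe@example.com called from (555) 123-4567 "
          + "about invoice 4711; callback requested at 555-987-6543.";
        System.out.println("E-mails: " + extract(EMAIL, callCenterNote));
        System.out.println("Phones:  " + extract(PHONE, callCenterNote));
    }
}
```

In InfoSphere Warehouse, extraction of this kind runs inside text analysis flows and the extracted values are written to warehouse tables, where they become available to the mining, OLAP, and reporting tools described above.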
Data mining application development
Pic. 3. Intelligent Miner usage scenario

You can select one of the following approaches to application development, depending on the complexity of the problem and on the experience, preferences, and skills of your specialists:
• Use the examples and tutorials to quickly adapt the code to your goals.
• Use the graphical user interface of Design Studio to define the analysis process, generate code, and integrate it into the application. Include data mining steps in automated data transformation processes.
• Use the Easy Mining procedures for the basic functionality of typical mining tasks.
• Use the command-line script generator idmmkSQL as a starting point for scoring statements.
• Invoke the powerful low-level SQL/MM API from SQL scripts or from any JDBC, CLI, ODBC, or SQLJ application.

Picture 3 shows a typical scenario of using Intelligent Miner for data mining tasks. The application developer integrates the SQL functionality of Intelligent Miner into applications using the development tools of Design Studio, and the analyst then uses the data mining capabilities from within those applications.
Data analysis using IBM Cognos Business Intelligence

The proposed solution (Pic. 4) is based on the previous architecture and extends its analytic functionality by means of IBM Cognos 10 Business Intelligence [3]. IBM Cognos 10 Business Intelligence (BI) is an integrated software suite for managing enterprise performance, designed to help interpret the data that arises during the operation of an organization. Cognos 10 BI makes it possible to plot plan-versus-actual charts, create different types of reports, embed reports in a convenient portal, and build custom dashboards. Any employee of the organization can use IBM Cognos 10 BI to create business reports, analyze data, and monitor events and metrics in order to make effective business decisions. Cognos 10 BI includes the following components:
• Cognos Connection - content publishing, management, and viewing
• Cognos Administration console - viewing, organizing, and scheduling of content; administration and data protection
• Cognos Business Insight - interactive dashboards
• Cognos Business Insight Advanced - easy reporting and data exploration
• Cognos Query Studio - ad hoc queries
• Cognos Report Studio - managed reporting
• Cognos Event Studio - event management and notification
• Cognos Metric Studio - metrics and scorecarding
• Cognos Analysis Studio - business analysis
• Cognos for Microsoft Office - working with Cognos BI data in Microsoft Office
• Framework Manager - business metadata management for cube connections
• Metric Designer - data extraction
• Transformer - modeling of PowerCubes multidimensional data cubes
• Map Manager - importing maps and updating labels
• Cognos Software Development Kit - Cognos BI application development
Cognos Connection is a portal that provides a single access point to all enterprise data available in Cognos 10 BI. The portal allows users to publish, find, organize, and view data. With the appropriate access rights, users can work through the portal with a variety of applications and manage the portal's content, including schedule management and the preparation and distribution of reports.

The Cognos Administration console, together with Cognos Connection, gives system administrators the ability to administer Cognos servers, tune performance, and manage user access rights.

Cognos Business Insight allows users to create sophisticated interactive dashboards using data from Cognos and from external sources such as TM1 Websheets and CubeViews. A user can open a personal dashboard, manage reports, send the dashboard by e-mail, and participate in collective decision making.

Cognos Business Insight Advanced allows users to create simple reports and explore data from internal and external data sources, both relational and multidimensional. When analysts working with their personal dashboards want to perform deeper data analysis, they can switch to Business Insight Advanced, where it is possible to add new dimensions, conditional formatting, and complex calculations. Business Insight Advanced can be launched directly from the Cognos Connection portal.

Query Studio provides an interface for creating simple queries and reports in Cognos 10 BI. Users without special training can use Query Studio to create reports that answer simple business questions. With minimal effort, users can change the report layout, filter and sort data, add formatting, and create charts.

Report Studio is the tool that professional report authors and developers use to create sophisticated, managed, multi-page reports with composite queries against multiple relational or multidimensional databases. Using Report Studio, you can create any reports an organization requires, such as sales invoices, budgets, or weekly activity reports of any complexity.

Event Studio is the event management tool in IBM Cognos 10. It notifies users of events as they occur, so that timely and effective decisions can be made. Event Studio can be used to create agents that monitor changes in the financial and operational performance of the company and of key customers in order to identify important events. When an event occurs, the agent can send an e-mail, publish information on the portal, or prepare a report.

Metric Studio allows users to create and use balanced scorecards to track and analyze the key performance indicators (KPIs) of the organization. You can use a standard scorecard or a custom one, if it is already implemented in the company. Metric Studio translates the organization's strategy into measurable goals, which allow each employee to relate their actions to the strategic plan of the company. The scorecarding environment reveals both the activities in which the company is successful and those that need improvement; it monitors progress towards the goals and shows the current state of the business. As a result, all employees and managers of the organization can make the necessary decisions and plan their work.

Analysis Studio is intended for the exploration, analysis, and comparison of multidimensional data, and provides online analytical processing (OLAP) of various multidimensional data sources. The results of the analysis can be turned into professional-quality reports in Report Studio.
Pic. 4. Solution architecture using Lotus Forms, InfoSphere Warehouse, and Cognos BI
Managers and analysts use Analysis Studio to quickly analyze the causes of past events and to understand what actions are required to improve performance. The analysis helps users identify patterns and anomalies in large data volumes that are not obvious but do affect the business, something that other types of reports cannot provide.

Cognos for Microsoft Office allows you to work with Cognos reports directly from Microsoft Office and offers two types of client software:
1. The "smart client" requires no installation or administration and is updated automatically.
2. The COM add-in client requires installation; updates are performed by reinstalling the software.
Cognos for Microsoft Office works with reports created in Query Studio, Analysis Studio, or Report Studio, and users get full access to the contents of a report, including data, metadata, headers, footers, and pictures.

Framework Manager is a modeling tool designed to create and manage the business metadata used by the analysis and reporting tools of Cognos BI. Metadata provides a common understanding of data from different sources. OLAP cubes contain metadata for business analysis and reporting; since cube metadata can change, Framework Manager models only the minimum amount of information required to connect to a cube.

Metric Designer is a modeling tool for data extraction. Extracts are used to map and transfer data to the scorecarding environment from existing metadata sources, such as Framework Manager and Impromptu Query Definition files. Typically, a data model is optimized for storage rather than for reporting; therefore, a data model developer uses Framework Manager to create models that are optimized for the needs of business users. For example, a model can define business rules that describe the data and their relationships, dimensions, and hierarchies from a business perspective.

Transformer is used for modeling PowerCubes, the multidimensional data cubes used for business reporting in Cognos BI. After collecting all the necessary metadata from various data sources, modeling the dimensions, customizing the measures, and applying dimensional filtering, you can create PowerCubes based on this model. These cubes can then be deployed to support OLAP analysis and reporting.

Map Manager allows administrators and modelers to import maps and update map labels in Report Studio, and to add alternative names of countries and cities for the creation of multilingual texts that appear on maps.

IBM Cognos Software Development Kit is designed for creating custom reports, managing the deployment of Cognos BI components, and securing the portal and its functionality in accordance with user requirements, local legislation, and the existing IT infrastructure. The Cognos SDK includes cross-platform web services, libraries, and programming interfaces.
Enterprise planning using Cognos TM1

The analytical tools can be extended with IBM Cognos TM1 enterprise planning software [4], which provides a complete, robust, and dynamic planning environment for the timely preparation of personalized budgets and forecasts. The 64-bit OLAP engine delivers high-performance analysis of complex models, large data sets, and even streamed data. The full range of enterprise planning requirements is supported: from profitability calculation, financial analysis, and flexible modeling to revealing the contribution of each business unit.
The ability to create an unlimited number of custom scenarios allows employees, groups, departments, and companies to respond more quickly to changing conditions. Best practices based on driver-based planning and rolling forecasts can become part of the enterprise planning process. Model configuration and data access tools can present data in familiar formats. Managed teamwork provides quick, automated collection of results from different systems and business units, their assembly into a single enterprise planning process, and the presentation of results. The consistent scorecarding, reporting, and analysis environment of Cognos BI gives a complete picture, from planning and goal setting to progress measurement and reporting. Financial and production units retain full control over the processes of planning, budgeting, and forecasting, while users work with familiar interfaces (Microsoft Excel and the Cognos TM1 Web or Contributor clients).
Conclusion

The proposed solution is scalable and functionally expandable. It can be integrated with various document management, enterprise planning, metadata management, and master data management systems. The primary data gathering and analysis system can be easily integrated into an existing enterprise IT infrastructure. Alternatively, the solution may serve as a first step in the implementation of an enterprise system for collecting, storing, and analyzing data.

The author thanks M. Barinstein, V. Ivanov, M. Ozerova, D. Savustjan, A. Son, and E. Fischukova for useful discussions.
Literature
1. Asadullaev S., “Primary data gathering and analysis system - I. Problem formulation, data collecting and storing”, 2011, http://www.ibm.com/developerworks/ru/library/sabir/warehouse-1/index.html
2. Apache Software Foundation, “Apache UIMA”, 2010, http://uima.apache.org/
3. IBM, “Cognos Business Intelligence”, 2010, http://publib.boulder.ibm.com/infocenter/cfpm/v10r1m0/index.jsp?topic=/com.ibm.swg.im.cognos.wig_cr.10.1.0.doc/wig_cr_id111gtstd_c8_bi.html
4. IBM, “Cognos TM1”, 2010, http://www-01.ibm.com/software/data/cognos/products/tm1/