Best Practices for Managing Unstructured Data


Contents
How to Manage Semi-Structured Data
Enterprise Search or Text Analysis: Approaches for Unstructured Data Integration

As critical business information is increasingly found in unstructured and semi-structured data such as documents, images and emails as well as tweets and other social media data, enterprises must re-evaluate existing data-handling strategies. This E-Guide explores several best practices and approaches for managing and integrating these new types of data. Learn about the differences between using enterprise search versus text analysis for processing unstructured data.

How to Manage Semi-Structured Data
By: Mark Whitehorn, Contributor

Data can be classified as structured, semi-structured or unstructured – but what bearing do these classifications have on a company’s data-handling strategy? The short answer is that it is becoming more important in our rapidly changing IT world to be aware of different data forms and how (or if) you need to manage them.

Structured data is data that has been split into small, discrete units. Each piece of data concerns one thing (to use a good Anglo-Saxon catchall word), for example, the last name of a customer. Structured data is typically stored in tables. Continuing with our example, one column of data would list the last names of all customers, and each row would pertain to one customer. These tables, in turn, are typically stored in a relational database.

In many cases, we find that data in the real world is not structured quite as neatly as this. But we impose the database structure upon it for the simple reason that doing so makes the data easy to retrieve and query. In practice, this works well for managing most business data; stock control, finance, human resources and other corporate systems all submit fairly readily to an imposed data structure.

The problem is that some data is not amenable to rigorous structuring – and such data is becoming more and more prevalent. A great deal of data relevant to the enterprise is turning up in documents, images and emails as well as tweets and other social media data. All of these can be described as semi-structured data. The term unstructured data is also bandied around but is not, in my opinion, a viable classification. Virtually all data has some kind of structure – only random noise is truly unstructured, and it contains very little commercial value.

Our options for managing semi-structured data are:
a) Ignore it (probably fatal in a competitive climate)
b) Force it into structured relational form
c) Adopt a different storage mechanism

Let’s consider those options one by one.

Ignore it. This is a good one to rule out: so much data is being created and collected in semi-structured forms that most enterprises cannot afford to disregard the outpouring of it. Ignoring it is viable only if there is no compelling business advantage in being able to track and analyse such data.

Stay relational. Relational database engines have been significantly modified over the years to handle what are characterised by the database manufacturers as “complex data types.” XML is one example: it is considered by many to be an excellent way of holding classic semi-structured data. Most common document formats are, or can be, rendered into XML, and almost all relational engines now have an XML data type, which means that documents often can be stored in a relational database. But the additional complexity of handling semi-structured data means there will inevitably be a trade-off, and in general that will equate to slower retrieval times. However, storing the data this way does make it very easy to find all tweets that refer to your product, all emails that mention “politician” and so on.


Other examples of complex data types are those that can handle spatial and image data.
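To make the "stay relational" option concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a recommended design: SQLite has no native XML data type, so the XML is held in a plain TEXT column and parsed at query time, which also hints at the retrieval cost mentioned above. The table name, the tweet layout and the sample data are all hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical tweets rendered as XML documents (illustrative data only).
TWEETS = [
    "<tweet><user>ann</user><text>Loving my new Acme kettle</text></tweet>",
    "<tweet><user>bob</user><text>Rain again today</text></tweet>",
    "<tweet><user>cy</user><text>The ACME kettle broke after a week</text></tweet>",
]


def load(conn, docs):
    """Store each XML document as one row in a relational table."""
    conn.execute("CREATE TABLE tweet (id INTEGER PRIMARY KEY, body TEXT)")
    conn.executemany("INSERT INTO tweet (body) VALUES (?)", [(d,) for d in docs])


def mentions(conn, term):
    """Return (id, user) for tweets whose <text> element mentions term."""
    hits = []
    for row_id, body in conn.execute("SELECT id, body FROM tweet ORDER BY id"):
        root = ET.fromstring(body)  # every row is parsed at query time
        text = root.findtext("text", default="")
        if term.lower() in text.lower():
            hits.append((row_id, root.findtext("user")))
    return hits


conn = sqlite3.connect(":memory:")
load(conn, TWEETS)
print(mentions(conn, "acme"))  # [(1, 'ann'), (3, 'cy')]
```

An engine with a real XML data type would typically push the element lookup into the query itself instead of parsing every row in application code.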

Adopt a different approach. There is increasing interest in adopting alternative data management and storage mechanisms. Imagine you store patient X-rays as images. We store data so we can retrieve it later and also so we can query it, but running a query against an X-ray image is a somewhat bizarre concept because the X-ray is simply a collection of pixels.

What often happens in practice is that this and other semi-structured data comes with some attached metadata and can also undergo some form of analysis in order to generate further metadata. (In a nutshell, metadata is data about data.) In the case of an email, the attached metadata might include length, sender, recipient, time/date and so on. Automatic semantic analysis of the email could be performed and that might yield metadata about the tone of the email (as in, angry, conciliatory, praising, etc.), its grammatical construction (correct, lax, etc.) and so on.

Metadata is typically highly structured and is therefore highly susceptible to analysis. So you could then store the emails and the metadata in a relational database and query the metadata to find, not just those emails that mention your product, but more specifically those that are well-written and also positive about the product.

And now think again about X-rays, which are classic semi-structured data. While you wouldn’t query against a raw X-ray image, you can query against its metadata. The attached metadata might include patient ID, doctor ID, extensive information about how and when the X-ray was taken and so on. Automatic analysis of the image might yield metadata like diagnosis, prognosis and so on. In this solution the semi-structured data might be stored simply as image files in the file system and the structured metadata would be stored in a relational database and linked to the image. A query could then pull out all the X-rays for doctor ID 1234 that involved broken limbs and display the images.
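The X-ray arrangement can be sketched in a few lines of Python with SQLite. Everything specific here is an assumption made for illustration: the table and column names, the file paths, and the idea that some upstream image analysis has already produced a "finding" value. Only the metadata lives in the database; each row simply points at an image file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE xray_meta (
        image_path TEXT PRIMARY KEY,  -- link to the image file on disk
        patient_id INTEGER,
        doctor_id  INTEGER,
        taken_on   TEXT,              -- ISO date the X-ray was taken
        finding    TEXT               -- assumed output of image analysis
    )
""")

# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO xray_meta VALUES (?, ?, ?, ?, ?)",
    [
        ("/xrays/p001_arm.png", 1, 1234, "2011-03-02", "broken limb"),
        ("/xrays/p002_chest.png", 2, 1234, "2011-03-05", "clear"),
        ("/xrays/p003_leg.png", 3, 5678, "2011-03-07", "broken limb"),
        ("/xrays/p004_leg.png", 4, 1234, "2011-02-11", "broken limb"),
    ],
)


def xrays_for(conn, doctor_id, finding):
    """Query the structured metadata; return paths of the matching images."""
    cur = conn.execute(
        "SELECT image_path FROM xray_meta "
        "WHERE doctor_id = ? AND finding = ? ORDER BY taken_on",
        (doctor_id, finding),
    )
    return [path for (path,) in cur]


print(xrays_for(conn, 1234, "broken limb"))
# ['/xrays/p004_leg.png', '/xrays/p001_arm.png']
```

The query never touches a pixel. The application would open the returned paths to display the images.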

The bottom line is that semi-structured data is here to stay, and it offers the potential for business advantage to any company that handles and analyses it well.

About the author: Dr. Mark Whitehorn specializes in the areas of data analysis, data modeling and business intelligence (BI). Based in the UK, Whitehorn works as a consultant for a number of national and international companies and is a mentor with Solid Quality Mentors. In addition, he is a well-recognized commentator on the computer world, publishing articles, white papers and books. Whitehorn is also a senior lecturer in the School of Computing at the University of Dundee, where he teaches the masters course in BI. His academic interests include the application of BI to scientific research.

Enterprise Search or Text Analysis: Approaches for Unstructured Data Integration
By: Krish Krishnan

A great debate is raging in the industry, and it is being fanned by the adoption of "big data." The simple question is: Do we create better search techniques or do we go all the way to text analysis for integrating unstructured data? A simple answer is to say “yes” to both, but there are hidden layers of complexity in the answer, which this article will attempt to explain.

Search vs. Analysis

At a fundamental level, both search and analysis engines operate on text data. That is where the similarity ends. With search, you typically look for patterns and present the findings to the user in short order. There is no further transformation of the text. Analysis deals with the discovery of the pattern (akin to search); but, more importantly, transformations are applied to the text to create a meaningful outcome. Analysis assumes that text must be integrated and transformed before it can be analyzed. This advanced treatment of text is where complexities arise, and the field – though decades rich in terms of algorithms, research and development, and published theses – continues to be nascent and niche.

The fundamental characteristic of text is best captured in one adjective: “erose” (not to be confused with “verbose”). The word, which comes from Latin, means “irregularly notched, toothed, or indented” (from dictionary.com) and is used mostly in botany to describe the leaves of a plant. The underlying reason for this attribution is that text is long, complex and unpredictable. It is a combination of words and phrases that form contextual statements, which may contain repeatable patterns (this repeatability can also differ based on context within a single document or text). When discussing “unstructured” data, we use this lack of repeatability and the associated ambiguity to distinguish text data analysis and its outcomes from structured data, where there is great repeatability of data and a structured and formatted storage architecture that lends itself well to integration and analytics.

Applying Search for Unstructured Integration

With the available search infrastructure and algorithms, one can make the argument: in order to integrate any “unstructured data,” why not just extend search outputs? Why do we need to create a text analysis platform separately? There have been attempts at doing that, but including integration and transformation as part of search is not a good approach.

Search engines or enterprise appliances will become lethargic and slow when integration and transformation are added to the normal workload. For example, let us assume that 10,000 searches have to be done for a contract database on a content management platform for every user query. Every search transaction will create operational structures and return quick hits on a set of patterns as its output. Adding analysis-type transformation introduces great inefficiencies to this exercise. The critical reason is that analysis requires creating clarity and context around the unstructured information, and both of these operations are highly complex and processing-intensive. The additional operations will cause an immense slowdown of search.

Search engines do a lot of pattern matching, metadata-based (taxonomy and ontology) indexing and large-scale distributed data processing. Metadata and patterns are definitely nimble and agile techniques for transforming the minimal data required for search processing, but the same will not scale to support the complex nature of unstructured data analysis.

Searches are designed to process patterns for every user query and are inconsistent by design. No two users will search for the same pattern at a given time. Thus, the same algorithms are replayed over and over for multiple types of data patterns, which have a short life cycle and are efficient in spite of the processing inconsistencies.

While these are the key reasons why applying search to analyze unstructured data is not the best option, they are not the only ones. Analysis of text requires a lot of additional processing, including spelling correction, alternate spellings, synonyms, user-defined rules and much deeper processing.

Text Analysis

Let’s look at how analysis differs from search:

Text analysis advances the integration of unstructured data beyond the light indexing and pattern matching of search.

Analysis consists of multiple transformation steps, each of which needs to be run once per set of patterns, metadata terms or context.

Analysis creates multiple iterations of metadata output, as opposed to simple result sets of entire pages; these create a powerful set of indexes within the text and its context.

Analysis always processes data in a consistent manner, as opposed to search.
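The contrast between the two approaches can be sketched in Python. This is a toy illustration, not how any particular search engine or text analysis product works: the spelling and synonym tables are hypothetical stand-ins for the curated dictionaries and taxonomies a real platform would use, and the sample documents are invented.

```python
import re

# Hypothetical rules; a real platform would load these from curated
# dictionaries and taxonomies rather than hard-coding them.
SPELLING = {"recieve": "receive", "contarct": "contract"}
SYNONYMS = {"agreement": "contract", "vendor": "supplier"}

DOCS = [
    "The contract starts in May.",
    "Vendor signed the agreement.",
    "Please review the contarct today.",
    "Lunch menu for Friday.",
]


def search(docs, pattern):
    """Search: match the raw pattern and return whole documents untouched."""
    rx = re.compile(re.escape(pattern), re.IGNORECASE)
    return [d for d in docs if rx.search(d)]


def normalize(text):
    """Analysis step: lowercase, fix known misspellings, fold synonyms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    tokens = [SPELLING.get(t, t) for t in tokens]
    return [SYNONYMS.get(t, t) for t in tokens]


def analyze(docs, term):
    """Analysis: transform every document the same way, then match."""
    return [d for d in docs if term in normalize(d)]


print(len(search(DOCS, "contract")))   # 1
print(len(analyze(DOCS, "contract")))  # 3
```

Plain pattern matching finds only the first document. The transformed text also surfaces the synonym ("agreement") and the misspelling ("contarct"), at the cost of processing every document up front.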

A popular example, found in Wikipedia under Natural Language Processing, shows why context matters. The sentence "I never said she stole my money" demonstrates the importance stress can play in a sentence, and thus the inherent difficulty a natural language processor can have in parsing it.

"I never said she stole my money" (stress on "I") – Someone else said it, but I didn't.

"I never said she stole my money" (stress on "never") – I simply didn't ever say it.

"I never said she stole my money" (stress on "said") – I might have implied it in some way, but I never explicitly said it.

"I never said she stole my money" (stress on "she") – I said someone took it; I didn't say it was she.

"I never said she stole my money" (stress on "stole") – I just said she probably borrowed it.

"I never said she stole my money" (stress on "my") – I said she stole someone else's money.

"I never said she stole my money" (stress on "money") – I said she stole something, but not my money.

Depending on which word the speaker stresses, you can see how this sentence could have several different meanings. If you search for this pattern, you will get all the statements, and you then have to work out the extended meaning and interpret each one yourself. If you process this through a text analysis platform, you can create a context-oriented result set that will provide you not only the result, but also the associated context, which is far more useful.

The need for transforming data before it becomes useful for analytics and reporting is not a new thought. We have always designed the data warehouse to process data in this fashion, and call it ETL (extract, transform, load). Extending this analysis to text creates a powerful concept: textual ETL.

This need for transformation and integration of text has some interesting challenges. One challenge is the size of the data to be transformed. Let us assume that you intend to take the Internet as your data set. Is it possible to transform and analyze all the text found on the Internet? In a nutshell, it is not practical or feasible. In such a situation, you primarily rely on search and can use a subset of data from the result set for deeper analysis.

But there are other data sets, such as enterprise data, that are large in volume, complex in format and have multiple contexts, yet lend themselves to the rigors of text analysis and processing. A simple example is the contracts existing across different business divisions such as purchasing, supply chain, inventory management, logistics, transportation and human resources. Each of these contracts has a different purpose, and there may be many contracts of a type that can provide insights beyond just start and end dates. Insights include legal terms and conditions with applied context, liabilities and obligations and much more. After analysis, such text will create a powerful and rich metadata output with context that can be simply integrated into a decision-support system ecosystem.

Other challenges include the variety of formats, the volumes of text, the ambiguous nature of the data itself and lack of formal documentation, to name a few. But once the challenges are addressed, the output from such an analysis is powerful enough to support a large visualization platform for looking into text and unstructured data within the enterprise. This is where you can leverage the data that has been stored on content management platforms for years for useful output of trends and behaviors.

The major differences between a result set produced by a search system and one produced by a text analytics system are as follows:

Search

Search is oriented to process the informational needs of a single user query

The search result set is proprietary to that user and cannot be shared

Result set is temporary (under normal circumstances)

Transformation rules are repeated with every query and are minimalistic

Result set cannot be integrated with a DBMS

Search processing cannot scale for large and complex operations – context-based search has always added significant overhead

Text Analysis

Can be defined by users for processing with business rules, like an ETL tool

Produces a result set of key-value column pairs, often stored in an RDBMS

Result set can be used for further analytical processing

Result sets can be stored as snapshots for repeated processing

Transformation of data and associated context is repeatable in multiple passes of processing cycles

Text of different languages for global organizations can be stored in the same result database based on metadata integrations and rules

Text analysis can scale easily based on the infrastructure capabilities
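The text analysis side of this comparison can be sketched as a tiny "textual ETL" pass in Python. Treat it as an illustration under assumptions rather than a reference design: the regular-expression rules, the contract snippets and the contract_meta table are all hypothetical, and a real platform would apply far richer rules than three patterns.

```python
import re
import sqlite3

# Hypothetical business rules: metadata key -> regex with one capture group.
RULES = {
    "start_date": r"effective\s+(\d{4}-\d{2}-\d{2})",
    "end_date": r"expires\s+(\d{4}-\d{2}-\d{2})",
    "liability_cap": r"liability\s+capped\s+at\s+\$?([\d,]+)",
}

# Hypothetical contract snippets keyed by document id.
CONTRACTS = {
    "purchasing-001": "Effective 2012-01-01, expires 2013-01-01. "
                      "Liability capped at $50,000.",
    "logistics-007": "Effective 2012-06-15. Liability capped at $1,200,000.",
}


def extract(doc_id, text):
    """Transform: apply every rule and yield (doc_id, key, value) rows."""
    for key, pattern in RULES.items():
        match = re.search(pattern, text, re.IGNORECASE)
        if match:
            yield (doc_id, key, match.group(1))


def load(conn, contracts):
    """Load: persist the key-value rows so they can be queried repeatedly."""
    conn.execute(
        "CREATE TABLE contract_meta (doc_id TEXT, meta_key TEXT, meta_value TEXT)"
    )
    for doc_id, text in contracts.items():
        conn.executemany(
            "INSERT INTO contract_meta VALUES (?, ?, ?)", extract(doc_id, text)
        )


conn = sqlite3.connect(":memory:")
load(conn, CONTRACTS)

# Further analytical processing runs against the stored result set,
# not against the original text.
no_end_date = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT doc_id FROM contract_meta WHERE doc_id NOT IN "
        "(SELECT doc_id FROM contract_meta WHERE meta_key = 'end_date') "
        "ORDER BY doc_id"
    )
]
print(no_end_date)  # ['logistics-007']
```

Because the rows persist in a database, the same result set can be shared, snapshotted and re-queried, which is exactly what a transient search result does not offer.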

Based on the discussion here, you can discern that search is good for finding things on an ad hoc basis in a large set of data. Analysis is good for creating a platform that can be used repeatedly against a large but finite amount of textual data as related to a corporation. In order to perform text analysis and deep text mining, you need to process the text rather than extend a search engine or appliance.

A robust text analysis system will provide for the following:

Categorization

Classification

Spelling correction

Synonyms, antonyms and homonyms

Integration with taxonomies

Metadata

Business rules integration

Reprocessing capabilities

Document fracturing and processing

Each of these steps allows you to process large volumes of text and create the result database for processing. This database can be used with search to create guided search and navigation, and can be extended to machine learning using a combined search and analysis platform.

The major advantage of text analysis is the ability to track changes as they occurred or occur within the text environment, in a similar manner to tracking changes in a dimension. This is the most powerful output, the one that makes analysis such a better proposition than search, and it is called document mid-point reprocessing. You can extend this concept to emails, Excel spreadsheets and other document types very easily.

In conclusion, search and text analysis serve different purposes for processing unstructured data, and both can be effectively leveraged. Search can be used for early-stage data discovery, and text analysis can be used for the detailed analysis and downstream analytical processing. But remember this: Do not treat search as a substitute for traditional text analytics.


