Interfacing with Big Data Repositories
Boris Katz
MIT Computer Science and Artificial Intelligence Laboratory
July 18, 2013
Our Claim
As we develop storage capacity, compute platforms, and algorithms for scaling to big data, we will need to create new ways to access and interact with massive-scale data.
MIT
The Big Picture
But what about access?
[Figure labels: Data; Analysis; Big Data (unstructured, semi-structured, structured); Language (terms, relationships, descriptions); Visualization (imagery, feature spaces, graphs, diagrams); Interaction (a Q: / A: dialog)]
We can view Big Data through…
Big Data
… language-colored glasses
Query: “What diseases present with fever and a rash?”
Answer: Scarlet fever is an illness with a characteristic rash that is caused by a strep infection. Chickenpox, Fifth Disease and Systemic Lupus Erythematosus. Roseola – this is one of the most common causes of fever and rash in infants and young children. It starts out with three days of moderate to ...
… visualization-colored glasses
Language can help manipulate visualization
Query: “Rule out patients under 25.”
From Big Data to Manageable Data by understanding structure
[Diagram: Big Data becomes Manageable Data through these steps: Parse into T-expressions; Apply S-Rules; Annotate; OPV Model; Decomposition]
• Language focuses our attention on what is important in data and helps make data more manageable
START: Natural language tools
• Providing Machines with New Knowledge: NL text → semantic representation
• Explaining Computer Actions or Describing its Knowledge: semantic representation → NL text
• Testing Computer Understanding by Answering Questions: NL queries → semantic representation → NL responses, computer actions
Building blocks in the START system
• Syntactic Analysis: parse trees
• Semantic Representation: ternary expressions
• Matching and Transformational Rules
• Language Generation
• Replying
• Object–Property–Value Data Model
• Question Decomposition
Syntactic Analysis: parse trees
“Because the average sea level in the Northeast is expected to rise due to climate change, flooding will likely become more frequent and cause property damage in coastal areas to increase.”
Semantic Representation: From Parse Trees to Ternary Expressions
“Because the average sea level in the Northeast is expected to rise due to climate change, flooding will likely become more frequent and cause property damage in coastal areas to increase.”
Form: [subject relation object]
[become because expect]
[flooding become frequent]
[become has_modifier likely]
[somebody expect rise]
[level related_to sea]
[level is average]
[level in Northeast]
[frequent has_quantity more]
[level rise null]
[rise due_to change]
[change related_to climate]
[flooding cause increase]
[damage increase null]
[damage related_to property]
[damage in areas]
[areas is coastal]
…
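The bracketed notation above maps naturally onto a three-field record. The following is a minimal sketch of one way to hold such expressions in Python; the names `TExp`, `parse_texp`, and `facts_about` are hypothetical and are not part of START, and the sketch ignores the fact that real ternary expressions can nest.

```python
from typing import List, NamedTuple, Optional


class TExp(NamedTuple):
    """One ternary expression: [subject relation object]."""
    subject: str
    relation: str
    obj: Optional[str]  # None stands in for the slide's "null"

    def __str__(self) -> str:
        obj = "null" if self.obj is None else self.obj
        return f"[{self.subject} {self.relation} {obj}]"


def parse_texp(text: str) -> TExp:
    """Parse the bracketed slide notation, e.g. '[level rise null]'."""
    parts = text.strip().strip("[]").split()
    if len(parts) != 3:
        raise ValueError(f"expected three fields, got {text!r}")
    subject, relation, obj = parts
    return TExp(subject, relation, None if obj == "null" else obj)


def facts_about(term: str, texps: List[TExp]) -> List[TExp]:
    """Return every expression whose subject is the given term."""
    return [t for t in texps if t.subject == term]


# A few of the expressions listed on the slide.
SEA_LEVEL = [parse_texp(t) for t in (
    "[level related_to sea]",
    "[level is average]",
    "[level in Northeast]",
    "[level rise null]",
    "[rise due_to change]",
    "[change related_to climate]",
)]

for texp in facts_about("level", SEA_LEVEL):
    print(texp)
```

Keeping each expression as a flat, hashable tuple is a simplification chosen here so that expressions can be placed in sets and dictionaries.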
Ternary expression representation
• a versatile syntax-driven representation of language
• highlights significant semantic relations
• very efficient for indexing, matching and retrieval
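To illustrate why a fixed three-slot form lends itself to indexing and retrieval, here is a small sketch of an inverted index keyed on each slot. It is an assumption about how such an index could look, not a description of START's internals; `TExpIndex` and its methods are hypothetical names, and expressions are plain 3-tuples with the string "null" for an empty object.

```python
from collections import defaultdict

SLOTS = ("subject", "relation", "object")


class TExpIndex:
    """Inverted index: (slot name, value) -> set of ternary expressions."""

    def __init__(self):
        self._by_slot = defaultdict(set)
        self._all = set()

    def add(self, texp):
        """Register a (subject, relation, object) tuple under all three slots."""
        self._all.add(texp)
        for slot, value in zip(SLOTS, texp):
            self._by_slot[(slot, value)].add(texp)

    def find(self, subject=None, relation=None, obj=None):
        """Return expressions matching every slot that was specified."""
        wanted = [(slot, value)
                  for slot, value in zip(SLOTS, (subject, relation, obj))
                  if value is not None]
        if not wanted:
            return set(self._all)
        candidate_sets = [self._by_slot.get(key, set()) for key in wanted]
        return set.intersection(*candidate_sets)


index = TExpIndex()
for texp in [
    ("level", "related_to", "sea"),
    ("level", "is", "average"),
    ("level", "in", "Northeast"),
    ("level", "rise", "null"),
    ("rise", "due_to", "change"),
    ("change", "related_to", "climate"),
]:
    index.add(texp)

print(index.find(relation="related_to"))
print(index.find(subject="level", relation="rise"))
```

Each lookup is a set intersection over at most three posting sets, so retrieval cost depends on the number of matching expressions rather than on the size of the whole store.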
Three types of Ternary Expressions
“A young man’s friend was visiting Taiwan”
• Related to the syntactic structure of the sentence: [friend visit Taiwan] [friend related_to man] [man has_property young]
• Related to syntactic features that change from sentence to sentence: [visit has_tense past] [visit is_progressive yes] [man has_det indefinite]
• Related to lexical features of words that don’t change from sentence to sentence: [Taiwan is_proper yes] [man has_number singular]
Creating semantic representations
Matching T-Expressions
Assertion: “Average sea level in the Northeast is expected to rise higher due to climate change.”
Query: “What sea levels are expected to rise?”
The Matcher compares the two sets of T-expressions:
T-Expressions from Query      T-Expressions from Assertion
[somebody expect rise]        [somebody expect rise]
[level related_to sea]        [level related_to sea] [level is average] [level in Northeast]
[level rise null]             [level rise null] [rise due_to change] [change related_to climate]
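The comparison above can be sketched as a containment check: the query matches when each of its T-expressions is found among those of the assertion. This is a deliberately simplified illustration that uses exact equality only; the function names `unmatched` and `matches` are hypothetical, and START's actual matcher also handles the term-level and rule-based matching described on the following slides.

```python
# T-expressions from the slide, written as (subject, relation, object) tuples.
QUERY = [
    ("somebody", "expect", "rise"),
    ("level", "related_to", "sea"),
    ("level", "rise", "null"),
]

ASSERTION = [
    ("somebody", "expect", "rise"),
    ("level", "related_to", "sea"),
    ("level", "is", "average"),
    ("level", "in", "Northeast"),
    ("level", "rise", "null"),
    ("rise", "due_to", "change"),
    ("change", "related_to", "climate"),
]


def unmatched(query, assertion):
    """Return the query expressions that have no exact counterpart."""
    known = set(assertion)
    return [texp for texp in query if texp not in known]


def matches(query, assertion):
    """A query matches when none of its expressions is left unmatched."""
    return not unmatched(query, assertion)


print(matches(QUERY, ASSERTION))
print(unmatched(QUERY + [("level", "in", "Pacific")], ASSERTION))
```

The extra assertion expressions (average, Northeast, climate change) do not block the match; they are the additional detail that a reply can draw on.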
Matching in START
T-Exps from Questions are matched against T-Exps from Assertions:
• term matching: lexical match, synonym match, hyponym match
• structure matching: exact match, match via transformational S-rules
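The three kinds of term matching can be sketched as a cascade that tries the strictest test first. The lexicon below (`SYNONYMS`, `HYPERNYM`) is a toy, hand-written assumption for illustration, as are the function names; the sketch also assumes that the assertion term may be more specific than the question term, which is one reading of "hyponym match".

```python
# Toy lexicon (assumed for illustration, not START's lexical resources).
SYNONYMS = [
    {"rise", "increase"},
    {"illness", "disease"},
]
HYPERNYM = {            # term -> its more general parent term
    "chickenpox": "disease",
    "roseola": "disease",
    "disease": "condition",
}


def is_hyponym(term, ancestor):
    """True if `ancestor` is reachable from `term` by following HYPERNYM links."""
    seen = set()
    while term in HYPERNYM and term not in seen:
        seen.add(term)
        term = HYPERNYM[term]
        if term == ancestor:
            return True
    return False


def term_match(question_term, assertion_term):
    """Return the kind of match found, or None if the terms do not match."""
    if question_term == assertion_term:
        return "lexical"
    if any(question_term in group and assertion_term in group for group in SYNONYMS):
        return "synonym"
    if is_hyponym(assertion_term, question_term):
        return "hyponym"
    return None


print(term_match("disease", "disease"))
print(term_match("disease", "illness"))
print(term_match("disease", "chickenpox"))
```

With this ordering, a question about a "disease" can be answered by an assertion about "chickenpox", but not the other way around.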
Verb argument alternations and paraphrases
“Surprise”:
“The patient surprised the doctor with his fast recovery.”
“The patient’s fast recovery surprised the doctor.”
“Load”:
“The crane loaded the ship with containers.”
“The crane loaded containers onto the ship.”
“Provide”:
“Did Iran provide Syria with weapons?”
“Did Iran provide weapons to Syria?”
Verb classes and S-Rules
“Surprise”:
“The patient surprised the doctor with his fast recovery.”
“The patient’s fast recovery surprised the doctor.”
“Confuse”:
“The patient confused the doctor with his slow recovery.”
“The patient’s slow recovery confused the doctor.”
…
Emotional Reaction Verbs (semantic class): anger, confuse, disappoint, embarrass, frighten, impress, please, surprise, threaten, …
S-Rule:
If: [[subject verb object1] with object2]
Then: [object2 verb object1] [object2 related_to subject]
Provided: verb ∈ emotional reaction class
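The S-rule above can be read as a rewrite over a nested ternary expression. A minimal sketch follows, assuming the nested expression is encoded as a tuple whose first element is itself a tuple; the function name and this encoding are illustrative choices, and the verb set lists only the members named on the slide.

```python
# Members of the emotional-reaction class listed on the slide (the slide's list is open-ended).
EMOTIONAL_REACTION_VERBS = {
    "anger", "confuse", "disappoint", "embarrass", "frighten",
    "impress", "please", "surprise", "threaten",
}


def apply_emotional_reaction_rule(texp):
    """Rewrite [[subject verb object1] with object2] when the S-rule applies.

    Returns the two resulting expressions, or None if the pattern or the
    'Provided' condition is not satisfied.
    """
    inner, relation, object2 = texp
    if relation != "with" or not isinstance(inner, tuple) or len(inner) != 3:
        return None
    subject, verb, object1 = inner
    if verb not in EMOTIONAL_REACTION_VERBS:
        return None
    return [(object2, verb, object1), (object2, "related_to", subject)]


# "The patient surprised the doctor with his fast recovery."
surprised = (("patient", "surprise", "doctor"), "with", "recovery")
print(apply_emotional_reaction_rule(surprised))

# "The crane loaded the ship with containers." (load is not in the class)
loaded = (("crane", "load", "ship"), "with", "containers")
print(apply_emotional_reaction_rule(loaded))
```

The rewritten form corresponds to "The patient’s recovery surprised the doctor", so either paraphrase of the sentence can match the same stored expressions. Because the rule is conditioned on the verb class, it leaves "load" untouched.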
Language generation
As intelligent systems become more mature, they will be expected to...
• Explain their actions
• Answer complex questions
• Keep track of conversation history and state
• Engage in mixed-initiative dialog
• Offer related information of potential interest to the user
• Help users correct and refine their questions
• Indicate incomplete understanding of questions and offer partial responses
START's generator
Inputs: structural ternary expressions; ternary expressions for syntactic features; ternary expressions for lexical features; user/machine-provided task specification
Generator:
• linguistic constraints
• syntactic rules
• morphological rules
• lexical knowledge
• anaphoric reference
• heuristic defaults
Output: natural language sentence
Generator in action
Replying to a question after a match
Generate a sentence from semantic representation
Query: “How are the glucose molecules converted into pyruvate molecules?”
[Figure: semantic graph with nodes chain, reactions, converts, molecule, glucose, each, molecules, pyruvate, two, smaller, and edges labeled related-to, quantifier, into, quantity, is]
Reply: “A chain of reactions converts each molecule of glucose into two smaller molecules of pyruvate.”
Execute a procedure to obtain an answer from the data source
Query: “Who directed Gone with the Wind?”
Annotation → Script:
• get http://us.imdb.com/Details?0031381
• match regexp...
Reply: “Gone with the Wind (1939) was directed by George Cukor, Victor Fleming, and Sam Wood. Source: The Internet Movie Database”
START in action
Google in action
The Object–Property–Value data model
The object–property–value (OPV) model applies to:
• structured data:
  Record   Category   Manufacturer   Units in stock   Retail price
  25387    keyboard   Dell           56               32.25
  53289    mouse      Apple          72               39.99
• heterogeneous semi-structured information sources:
  - countries and their capitals, areas, populations, …
  - individuals and their biographies, birthdates, spouses, …
  - cities and their weather reports, maps, elevations, …
The OPV Model makes it possible to view and use large segments of the Web as a database
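As a concrete reading of the OPV model, the table above can be viewed as a set of (object, property, value) triples, with each record number as the object and each column as a property. The sketch below assumes a plain nested dictionary as the store; `OPV_STORE`, `get_value`, and `opv_triples` are hypothetical names used only for illustration.

```python
# The two records from the slide, keyed by record number (the "object").
OPV_STORE = {
    "25387": {
        "category": "keyboard",
        "manufacturer": "Dell",
        "units in stock": 56,
        "retail price": 32.25,
    },
    "53289": {
        "category": "mouse",
        "manufacturer": "Apple",
        "units in stock": 72,
        "retail price": 39.99,
    },
}


def get_value(store, obj, prop):
    """Look up the value of a property for an object; None if unknown."""
    return store.get(obj, {}).get(prop)


def opv_triples(store):
    """Flatten the store into (object, property, value) triples."""
    for obj, properties in store.items():
        for prop, value in properties.items():
            yield (obj, prop, value)


print(get_value(OPV_STORE, "25387", "manufacturer"))
for triple in opv_triples(OPV_STORE):
    print(triple)
```

The same two-argument lookup would apply to semi-structured sources, for example a country as the object and "capital" as the property, which is what lets different sources sit behind one uniform interface.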
Implementing the OPV Model: START and Omnibase
Omnibase supports START by providing access to structured and semi-structured information in databases, on the Web, etc.
[Diagram: User Questions → START → structured query → Omnibase → Data Resources (World Factbook, Wikipedia, IMDb, Internet Public Library, NASA, Big Data… etc.)]
START:
1. What does the question mean?
2. Where can the answer be found?
3. What are the object and property?
Omnibase:
1. Go to the specific data source or Web page containing the answer.
2. Extract the answer from the data source.
Answering complex questions
“How many people live in the capital of the 8th richest Asian country?”
• Syntactically decompose a complex question into a set of nested ternary expressions
• Successively resolve groups of ternary expressions containing variables
  - Answer sub-questions by replacing variables with obtained values
Sub-questions:
1. What is the 8th richest Asian country?
2. What is its capital?
3. How many people live there?
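The successive resolution of sub-questions can be sketched as a chain of object–property lookups in which each answer is bound to a variable and substituted into the next step. Everything in the knowledge base below is an invented placeholder ("Examplestan", "Example City", and the population figure are not real data), and the plan format and function name are assumptions made for this illustration.

```python
# Placeholder knowledge base: (object, property) -> value. All values are made up.
KB = {
    ("Asian countries", "8th richest"): "Examplestan",
    ("Examplestan", "capital"): "Example City",
    ("Example City", "population"): 123456,
}

# One step per sub-question: (variable to bind, object or variable, property).
PLAN = [
    ("?country", "Asian countries", "8th richest"),  # What is the 8th richest Asian country?
    ("?capital", "?country", "capital"),             # What is its capital?
    ("?answer", "?capital", "population"),           # How many people live there?
]


def resolve(plan, kb):
    """Resolve steps in order, replacing variables with values obtained so far."""
    bindings = {}
    for variable, obj, prop in plan:
        obj = bindings.get(obj, obj)  # substitute a previously bound variable
        bindings[variable] = kb[(obj, prop)]
    return bindings


bindings = resolve(PLAN, KB)
print(bindings["?country"], bindings["?capital"], bindings["?answer"])
```

Ordering the plan from the innermost sub-question outward means every variable is bound before the step that needs it.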
Replying: syntactic decomposition
START Question Answering System
START: linguistically-motivated representations and approaches
• Ternary expressions representation
• OPV Data Model: Uniform access to heterogeneous resources
• Natural language annotations
• Decomposition of complex questions
• Same representation for sentence analysis, sentence generation, and question answering
Contributions
• The START system pioneered language-based services on the Web. The public START server handles millions of questions from users all over the world.
• START provides high-precision “one-stop shopping” for information from diverse sources: structured, semi-structured, and unstructured.
• System responses can fuse information from multiple sources and multiple formats.
• Natural language interaction is a flexible and convenient way to access massive-scale data.
Recent Successes of Artificial Intelligence Applications
• Google’s Goggles
• Microsoft’s Kinect
• IBM’s Watson
• Apple’s Siri
… but are these systems truly intelligent?
These systems don’t have any knowledge or understanding about the world outside of their narrow area of expertise.
The challenge of creating a truly intelligent machine
Meeting this challenge will require moving across:
• modalities – language, vision, robotics, reasoning, …
• disciplines – AI, linguistics, cognitive science, neuroscience, …
Much work remains to be done!