
Demo & Presentation

Indexing and document searching by keywords using the replicated workers paradigm and the MapReduce model
Georgiana Tache & Abu Rahat Chowdhury
University of Louisiana at Lafayette


Outline

• MapReduce & motivation
• Challenges & basic idea
• Implementation
• Our Java way
• Results
• Demo
• Concluding remarks and future work


What is MapReduce?

• A parallel programming model meant for large clusters
• Functions borrowed from functional programming languages (e.g. Lisp):
  • Map() – processes a key/value pair to generate intermediate key/value pairs
  • Reduce() – merges all intermediate values associated with the same key
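
To make the two functions concrete, here is a minimal word-count sketch in Java. It is illustrative only: the class and method names are ours, not Google's or the project's.

    import java.util.*;

    public class WordCountSketch {
        // Map(): process one (documentId, text) pair and emit an
        // intermediate (word, 1) pair for every word in the text.
        static List<Map.Entry<String, Integer>> map(String docId, String text) {
            List<Map.Entry<String, Integer>> out = new ArrayList<>();
            for (String word : text.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) out.add(Map.entry(word, 1));
            }
            return out;
        }

        // Reduce(): merge all intermediate values associated with the same
        // key, here by summing the per-occurrence counts of one word.
        static int reduce(String word, List<Integer> counts) {
            int sum = 0;
            for (int c : counts) sum += c;
            return sum;
        }

        public static void main(String[] args) {
            // Group the intermediate pairs by key, then reduce each group.
            Map<String, List<Integer>> grouped = new HashMap<>();
            for (Map.Entry<String, Integer> p : map("doc1", "to be or not to be")) {
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
            }
            grouped.forEach((w, c) -> System.out.println(w + " -> " + reduce(w, c)));
        }
    }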


Motivation – Google's paper

"MapReduce: Simplified Data Processing on Large Clusters", Jeffrey Dean and Sanjay Ghemawat:

"Google's implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use. Hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day."


Two main challenges

• How to simulate a distributed environment to test MapReduce with our smaller test files.
• Indexing, and making sure we use the replicated workers paradigm to do the MapReduce.


Basic idea

[Figure: the MapReduce execution overview from Dean & Ghemawat's paper. The user program forks a Master and several Workers; the Master assigns map and reduce tasks; map workers read the input splits (Split 0–2) and write intermediate results to local disk; reduce workers remotely read and sort them, then write Output File 0 and Output File 1.]


Implementation overview

• 1-CPU Intel Core i3 x86 machines, 1 GB memory limit
• Java-based user programs
• Input files: 50+ randomly selected documents from different websites, saved as txt files
• Block (chunk) size: 500 bytes
• Output is shown in a GUI and, when clicked, opened in Notepad
• Built in Eclipse; the implementation uses the Java Swing library linked into the user programs


Implementation – data distribution

Input files are split into k pieces, stored as temp files in the Windows file system; for our project the blocks are typically ~500 bytes. Intermediate files created by the map tasks are also written to local disk. The GUI output ranks documents by word frequency and minimum distance between keywords, and for the evaluation part some cache files record the precision and recall values.
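
A minimal sketch of that splitting step, assuming a plain byte-oriented cut at the 500-byte boundary; the class and method names here are hypothetical, not the project's.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.*;

    public class Splitter {
        static final int BLOCK_SIZE = 500; // bytes per task, as in the project

        // Reads a document and cuts it into fixed-size byte chunks;
        // each chunk becomes one task in the work pool.
        static List<byte[]> split(Path document) throws IOException {
            byte[] all = Files.readAllBytes(document);
            List<byte[]> chunks = new ArrayList<>();
            for (int off = 0; off < all.length; off += BLOCK_SIZE) {
                int len = Math.min(BLOCK_SIZE, all.length - off);
                chunks.add(Arrays.copyOfRange(all, off, off + len));
            }
            return chunks;
        }
    }

A more careful splitter would back up to the nearest whitespace so that no word straddles two chunks.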


Implementation

Processing one block is a task, and all tasks reside in a structure called the work pool (a sketch follows). The reason for dividing a document into blocks is to ensure that each worker has an equal amount of work to process.
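
A minimal sketch of such a work pool, assuming a thread-safe blocking queue underneath; the real WorkPool.java may differ.

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    public class WorkPool<T> {
        private final BlockingQueue<T> tasks = new LinkedBlockingQueue<>();

        // The master (or a worker returning a result) adds work here.
        public void put(T task) throws InterruptedException {
            tasks.put(task);
        }

        // Workers block until a task is available.
        public T take() throws InterruptedException {
            return tasks.take();
        }

        public boolean isEmpty() {
            return tasks.isEmpty();
        }
    }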



Implementation

Each worker sequentially grabs a task from the work pool, works on it, and then puts the result back into the pool.

Coordination between the workers is done by the Master, which handles the I/O communication and fills the work pool (see the worker-loop sketch below).
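
Under those assumptions, a replicated worker's run loop could look like this sketch, reusing the WorkPool sketch from earlier; the Task interface and the poison-pill shutdown signal are our illustrative choices, not necessarily the project's.

    // Illustrative task abstraction: doing the work may yield a follow-up task.
    interface Task {
        Task POISON_PILL = () -> null;  // sentinel telling a worker to stop
        Task process();                 // perform the map or reduce step
    }

    class Worker implements Runnable {
        private final WorkPool<Task> pool;
        Worker(WorkPool<Task> pool) { this.pool = pool; }

        @Override
        public void run() {
            try {
                while (true) {
                    Task task = pool.take();              // grab the next task
                    if (task == Task.POISON_PILL) break;  // master signalled "done"
                    Task result = task.process();         // map or reduce work
                    if (result != null) pool.put(result); // feed the result back
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }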


Implementation

A worker can perform two kinds of work:
• Map() – applies a function to a list; in this case: given the words in a file chunk, find the frequencies and positions of all words
• Reduce() – combines the results previously obtained in Map()


Implementation

Each document will finally have a set of the N most frequent words that appear in it.

We keep only the first N words because the rest would be of little significance for the document, and we also want to save space when storing the indexes. A minimal top-N sketch follows.
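
Here is how that top-N selection could be written; N and the sample frequencies are made up for illustration.

    import java.util.*;

    public class TopN {
        // Returns the n words with the highest frequencies.
        static List<String> topWords(Map<String, Integer> frequencies, int n) {
            return frequencies.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(n)
                    .map(Map.Entry::getKey)
                    .toList();
        }

        public static void main(String[] args) {
            Map<String, Integer> freq =
                    Map.of("the", 42, "of", 30, "worker", 12, "mapreduce", 7);
            System.out.println(topWords(freq, 2)); // -> [the, of]
        }
    }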


Our way – map code
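
The code shown on this slide did not survive the text extraction. As a stand-in, here is a minimal sketch of what a map step over one chunk could look like, recording word positions (the frequency of a word falls out as the size of its position list); all names are hypothetical.

    import java.util.*;

    public class MapStep {
        // Given one ~500-byte chunk and the index of its first word within
        // the whole document, record the positions of every word it contains.
        static Map<String, List<Integer>> map(String chunk, int firstWordIndex) {
            Map<String, List<Integer>> positions = new HashMap<>();
            int index = firstWordIndex;
            for (String w : chunk.toLowerCase().split("\\W+")) {
                if (w.isEmpty()) continue;
                positions.computeIfAbsent(w, k -> new ArrayList<>()).add(index);
                index++;
            }
            return positions; // frequency of w == positions.get(w).size()
        }
    }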


Our way – reduce code
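
This slide's code is likewise missing from the extraction; a matching sketch of the reduce step, merging the intermediate word-to-positions maps produced by map() over all chunks of a document:

    import java.util.*;

    public class ReduceStep {
        // Merge the per-chunk partial results into one map for the document.
        static Map<String, List<Integer>> reduce(List<Map<String, List<Integer>>> partials) {
            Map<String, List<Integer>> merged = new HashMap<>();
            for (Map<String, List<Integer>> partial : partials) {
                partial.forEach((word, pos) ->
                        merged.computeIfAbsent(word, k -> new ArrayList<>()).addAll(pos));
            }
            merged.values().forEach(Collections::sort); // keep positions ordered
            return merged;
        }
    }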


Our way – Java

• Worker.java – contains the main map and reduce functions for the threads.


Our way – Java

• WorkPool.java – holds the collection of tasks; workers take tasks from it and put the solved tasks back when done.


Our way – Java

• Words.java – once a worker completes the map function, it creates an object of type Words and puts it in the work pool.


Our way – Java

• ReplicatedWorkers.java – the master, which assigns tasks to the workers.
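
A hypothetical sketch of that master role, built on the WorkPool, Task, and Worker sketches above. It assumes tasks write their results to disk and return null, so poison pills queued last (FIFO) are safe; with follow-up tasks a real master would need proper termination detection.

    import java.util.ArrayList;
    import java.util.List;

    public class ReplicatedWorkersSketch {
        public static void run(List<Task> mapTasks, int nWorkers)
                throws InterruptedException {
            WorkPool<Task> pool = new WorkPool<>();
            for (Task t : mapTasks) pool.put(t);                  // fill the pool
            for (int i = 0; i < nWorkers; i++) pool.put(Task.POISON_PILL);

            List<Thread> threads = new ArrayList<>();
            for (int i = 0; i < nWorkers; i++) {                  // fork N identical workers
                Thread t = new Thread(new Worker(pool));
                threads.add(t);
                t.start();
            }
            for (Thread t : threads) t.join();                    // wait for completion
        }
    }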


Our way – Java

• Occurrences.java – for each word, stores its number of occurrences and its positions.
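
Its shape is easy to guess from that description; a minimal sketch (ours, not the project's actual file):

    import java.util.ArrayList;
    import java.util.List;

    public class Occurrences {
        private int count;                                      // how many times the word appears
        private final List<Integer> positions = new ArrayList<>(); // where it appears

        public void add(int position) {
            count++;
            positions.add(position);
        }

        public int getCount() { return count; }
        public List<Integer> getPositions() { return positions; }
    }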


Our way – Java

• GUI.java – the graphical interface.


Our way – Java

• Evaluation.java – compares the program's output with human-judged results to find precision and recall.
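
A minimal sketch of the two measures, assuming the retrieved documents and the human-judged relevant documents are available as sets of names (a hypothetical signature, not the project's):

    import java.util.Set;

    public class EvaluationSketch {
        // precision = relevant retrieved / all retrieved
        static double precision(Set<String> retrieved, Set<String> relevant) {
            long hits = retrieved.stream().filter(relevant::contains).count();
            return retrieved.isEmpty() ? 0.0 : (double) hits / retrieved.size();
        }

        // recall = relevant retrieved / all relevant
        static double recall(Set<String> retrieved, Set<String> relevant) {
            long hits = retrieved.stream().filter(relevant::contains).count();
            return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
        }
    }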


Result – search & show


Result – workers & time



Result – evaluation



Demo


Conclusion

• Simplifies large-scale computations that fit this model
• Allows the user to focus on the problem without worrying about the details



Future work

• Incorporate parallel sorting
• Address worker failure and how to skip bad records
• Handle slow machines
• Fault tolerance and distribution


Backup slides


Site survey


Lifecycle of a MapReduce job

[Figure: timeline showing the input splits feeding Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2.]

How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?


http://static.googleusercontent.com/media/research.google.com/en/us/archive/pape
http://www0.cs.ucl.ac.uk/staff/K.Jamieson/gz06/s2013/notes/mapreduce-notes.pdf
http://research.google.com/archive/mapreduce-osdi04-slides/index.html
http://research.google.com/archive/mapreduce.html
http://userpages.uni-koblenz.de/~laemmel/MapReduce/paper.pdf
http://en.wikipedia.org/wiki/MapReduce


http://www.cs.utah.edu/~mhall/cs4961f10/CS4961-L22.ppt
http://www.cc.gatech.edu/~lingliu/courses/cs4440/notes/Whitney.ppt
http://rakaposhi.eas.asu.edu/cse494/notes/s07-map-reduce.ppt
http://www.cs.duke.edu/courses/fall11/cps216/Lectures/intro_to_mapreduce.ppt
http://cecs.wright.edu/~tkprasad/courses/cs707/L06MapReduce.ppt

