Demo & Presentation
Indexing and document searching by keywords using the replicated workers paradigm and the MapReduce model.
Georgiana Tache & Abu Rahat Chowdhury, University of Louisiana at Lafayette
Outline
• MapReduce & Motivation
• Challenges & Basic Idea
• Implementation
• Our Java Way
• Results
• Demo
• Concluding Remarks and Future Work
What is MapReduce?
• Parallel programming model meant for large clusters
• Functions borrowed from functional programming languages (e.g. Lisp)
• Map() – processes a key/value pair to generate intermediate key/value pairs
• Reduce() – merges all intermediate values associated with the same key
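To make the two functions concrete, here is a minimal in-memory sketch of Map() and Reduce() for word counting; the class and method names are illustrative, not taken from the project's code:

```java
import java.util.*;

// Minimal sketch of the Map()/Reduce() pair for word frequencies.
// All names here are illustrative assumptions.
public class MapReduceSketch {
    // Map(): one document chunk -> intermediate (word, 1) pairs
    static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : chunk.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Reduce(): merge all intermediate values that share the same key
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            counts.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return counts;
    }
}
```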
Motivation – Google's Paper
MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat.
"Google's implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use. Hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day."
Two Main Challenges
• How to simulate a distributed environment to test MapReduce with our smaller test files.
• Indexing, and making sure we use the replicated workers paradigm to do the MapReduce.
Basic Idea:
[Figure: MapReduce execution overview — the User Program forks a Master and Workers; the Master assigns map tasks over input Splits 0–2 and assigns reduce tasks; map workers read the splits and write intermediate results to local disk; reduce workers remotely read and sort them, then write Output File 0 and Output File 1.]
Implementation Overview
• 1-CPU Intel Core i3 x86 machines, 1 GB memory limit
• Java-based user programs
• Input files: 50+ randomly selected documents from different websites, saved as txt files
• Block (chunk) size: 500 bytes
• Output is shown in a GUI and, if clicked, opened in Notepad
• Made in Eclipse; implementation uses the Java Swing library linked into the user programs
Implementation – Data Distribution
• Input files are split into k pieces in temp files on the Windows file system, typically ~500-byte blocks for our project.
• Intermediate files created by map tasks are also written to local disk.
• The GUI output shows a ranking by frequency and minimum distance.
• For the evaluation part, cache files record the precision and recall values.
Implementation
Processing such a block represents a task, and all tasks reside in a structure called the Work pool. The reason for dividing a document into blocks is to ensure that each Worker has an equal amount of work to process.
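The splitting step could be sketched as follows, using the project's ~500-byte chunk size; the class and method names are assumptions for illustration (and the sketch splits by characters rather than raw bytes):

```java
import java.util.*;

// Sketch: cut a document into fixed-size blocks so that each block
// becomes one equally-sized task for the Work pool.
// BLOCK_SIZE matches the 500-byte chunk size used in the project.
public class Splitter {
    static final int BLOCK_SIZE = 500;

    static List<String> split(String document) {
        List<String> blocks = new ArrayList<>();
        for (int start = 0; start < document.length(); start += BLOCK_SIZE) {
            blocks.add(document.substring(start,
                    Math.min(start + BLOCK_SIZE, document.length())));
        }
        return blocks;
    }
}
```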
Implementation
Each worker sequentially grabs a task from the Work pool, works on it, and then puts the result back in the pool.
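The replicated-workers loop might look like the sketch below, using a blocking queue as the pool and a poison-pill task to stop the workers; all names are illustrative, not the project's actual classes:

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of replicated workers: each thread repeatedly takes a task
// from the shared pool, processes it, and adds the result to a shared
// result queue. A poison-pill string marks the end of the pool.
public class WorkerLoop {
    static final String STOP = "__STOP__";

    static void runWorkers(BlockingQueue<String> tasks,
                           Queue<String> results, int nWorkers)
            throws InterruptedException {
        Thread[] workers = new Thread[nWorkers];
        for (int i = 0; i < nWorkers; i++) {
            workers[i] = new Thread(() -> {
                try {
                    while (true) {
                        String task = tasks.take();      // grab a task
                        if (task.equals(STOP)) break;    // pool drained
                        results.add(task.toUpperCase()); // "process" it
                    }
                } catch (InterruptedException ignored) { }
            });
            workers[i].start();
        }
        for (Thread w : workers) w.join();
    }
}
```

One poison pill per worker is added so that every thread eventually sees a STOP and exits.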
The coordination between the workers is done by the Master, which handles I/O communication and also fills the Work pool.
Implementation
A worker can perform two kinds of work:
• Map() – applies a function to a list; in this case, given the words in the file chunk, it finds the frequencies and positions of all words.
• Reduce() – combines the results previously obtained in Map().
Implementation
Each document will finally have a set of the N most frequent words that appear in it.
We keep only the first N words because the rest would be of too little significance for the document, and we also want to save space when storing the indexes.
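Selecting the N most frequent words from a document's frequency map can be sketched as below; the names are illustrative, not the project's own code:

```java
import java.util.*;
import java.util.stream.*;

// Sketch: keep only the N most frequent words of a document, sorted
// by descending frequency, to bound the size of the stored index.
public class TopWords {
    static List<String> topN(Map<String, Integer> frequencies, int n) {
        return frequencies.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```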
Our Way – Map Code
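The original slide showed a screenshot of the Map code; as a hedged stand-in, a map step over one chunk might look like this sketch, which records each word's positions (and hence its frequency). All names are assumptions, not the project's actual Worker.java:

```java
import java.util.*;

// Sketch of a map step over one file chunk: for every word, record
// the positions where it occurs; frequency = number of positions.
public class MapStep {
    static Map<String, List<Integer>> mapChunk(String[] words) {
        Map<String, List<Integer>> positions = new HashMap<>();
        for (int pos = 0; pos < words.length; pos++) {
            positions.computeIfAbsent(words[pos].toLowerCase(),
                    k -> new ArrayList<>()).add(pos);
        }
        return positions; // frequency of w = positions.get(w).size()
    }
}
```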
Our Way – Reduce Code
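The original slide showed a screenshot of the Reduce code; a hedged sketch of such a reduce step follows, merging the per-chunk position maps into one document-wide map and shifting each chunk's positions by that chunk's starting offset. Names are illustrative:

```java
import java.util.*;

// Sketch of a reduce step: combine the per-chunk maps produced by
// Map() into one map for the whole document, offsetting positions
// by the starting word index of each chunk.
public class ReduceStep {
    static Map<String, List<Integer>> reduceChunks(
            List<Map<String, List<Integer>>> chunkMaps, int[] chunkOffsets) {
        Map<String, List<Integer>> merged = new HashMap<>();
        for (int c = 0; c < chunkMaps.size(); c++) {
            for (Map.Entry<String, List<Integer>> e : chunkMaps.get(c).entrySet()) {
                List<Integer> all = merged.computeIfAbsent(e.getKey(),
                        k -> new ArrayList<>());
                for (int p : e.getValue()) all.add(p + chunkOffsets[c]);
            }
        }
        return merged;
    }
}
```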
Our Way – Java
• Worker.java – has the main Map and Reduce functions for the threads.
• WorkPool.java – holds the collection of tasks; workers take tasks from it and put solved tasks back once done.
• Words.java – once a worker completes the map function, this creates an object of type Words and puts it in the WorkPool.
• ReplicatedWorkers.java – the Master, which assigns tasks to the workers.
• Occurrences.java – for each word, stores the number of occurrences and its positions.
• GUI.java – the graphical interface.
• Evaluation.java – evaluates the results against human-judged data to find precision and recall.
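The precision/recall computation that the evaluation performs could be sketched as follows, comparing the retrieved documents against a human-judged relevant set; the class and method names are assumptions, not the project's Evaluation.java:

```java
import java.util.*;

// Sketch of precision and recall over document id sets:
// precision = |retrieved ∩ relevant| / |retrieved|
// recall    = |retrieved ∩ relevant| / |relevant|
public class Eval {
    static double precision(Set<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return retrieved.isEmpty() ? 0.0 : (double) hits / retrieved.size();
    }

    static double recall(Set<String> retrieved, Set<String> relevant) {
        long hits = retrieved.stream().filter(relevant::contains).count();
        return relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
    }
}
```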
Result – Search & Show
Result – Workers & Time
Result – Evaluation
Demo
Conclusion
• Simplifies large-scale computations that fit this model
• Allows the user to focus on the problem without worrying about the details
Future Work
We can try to incorporate parallel sorting, and to address:
• Worker failure
• How to skip bad records
• Slow machines
• Fault tolerance and distribution
Backup Slides
Site Survey
Lifecycle of a MapReduce Job
[Figure: timeline of a MapReduce job — input splits feed Map Wave 1 and Map Wave 2, followed by Reduce Wave 1 and Reduce Wave 2.]
How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
• http://static.googleusercontent.com/media/research.google.com/en/us/archive/pape
• http://www0.cs.ucl.ac.uk/staff/K.Jamieson/gz06/s2013/notes/mapreduce-notes.pdf
• http://research.google.com/archive/mapreduce-osdi04-slides/index.html
• http://research.google.com/archive/mapreduce.html
• http://userpages.uni-koblenz.de/~laemmel/MapReduce/paper.pdf
• http://en.wikipedia.org/wiki/MapReduce
• http://www.cs.utah.edu/~mhall/cs4961f10/CS4961-L22.ppt
• http://www.cc.gatech.edu/~lingliu/courses/cs4440/notes/Whitney.ppt
• http://rakaposhi.eas.asu.edu/cse494/notes/s07-map-reduce.ppt
• http://www.cs.duke.edu/courses/fall11/cps216/Lectures/intro_to_mapreduce.ppt
• http://cecs.wright.edu/~tkprasad/courses/cs707/L06MapReduce.ppt