International Journal of Research in Advent Technology, Vol.2, No.6, June 2014 E-ISSN: 2321-9637
Back Index Generation Tool for E-Books: An Implicit Ontology Approach

Raj Kumar Singh1, Prateeksha Pandey2
1Assistant Professor, Department of Information Technology, Bhilai Institute of Technology, Durg, India
2Assistant Professor, Department of Computer Science & Engineering, Chhatrapati Shivaji Institute of Technology, Durg, India
Email: rajkumarsingh33@gmail.com, prateekshapandey@csitdurg.in

Abstract- Book reading is a frequent activity that each of us performs. The traditional way to navigate or search for a topic in a book is through the book index. A universal strategy for finding a page to read is to use the front index and the back index. A front index generally lists the sections and subsections with their corresponding page numbers, while a back index is a page-number-wise list of the nouns used in the book. The accuracy of the back-of-book index is now becoming an important aspect of research: researchers working in text mining fields related to books use back-of-book indexes as seed keywords for searching and topic spotting. Different types of back indexes are used by publishers, the most common being flat and hierarchical. Various automatic tools are available which generate back-of-book indexes, but they require an ontology to be defined before processing a book. An ontology formally represents knowledge as a set of concepts within a domain. The present paper demonstrates an efficient method which generates the back-of-the-book index without any predefined ontology.

Index Terms- Stanford Typed Dependency Parser, Noun Phrase Extraction, Bi-grams, Tri-grams, Hierarchical Back Index.

1. INTRODUCTION

The proposed approach automates the generation of back-of-book indexes. It converts Portable Document Format (PDF) files to text format, and the text files are then arranged according to page numbers. Next, these text pages are passed through the Stanford Typed Dependency Parser [1]; the parser generates the noun phrases that can be used as an ontology. In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, using a shared vocabulary to denote the types, properties and interrelationships of those concepts [2] [3]. The intermediate outputs are refined using string-matching techniques. Finally, the output contains individual noun phrases or combinations of noun phrases, called noun bi-grams and tri-grams, arranged in alphabetical order.
1.1. Back Index Components

Back index components include noun phrases, with or without sub-headings, and references (i.e., page numbers). If a phrase has a noun [4] (or indefinite pronoun) as its head word, or performs the same grammatical function as such a phrase, it is called a noun phrase. Noun phrases are very common cross-linguistically, and they may be the most frequently occurring phrase type. Noun phrases can be embedded inside each other; for instance, the noun phrase "some of his constituents" contains the shorter noun phrase "his constituents". In some modern theories of grammar, noun phrases with determiners are analyzed as having the determiner rather than the noun as their head; they are then referred to as determiner phrases.

2. METHODOLOGY

The overall process of back-of-book index generation can be represented by the block diagram shown in Fig. 1 below.

Fig. 1: Process of back-of-the-book index generation
2.1. Pre-processing of the E-book

This step takes the e-book as input in PDF format. E-books are then pre-processed for the further operations. Pre-processing includes:

2.1.1. Splitting of the E-book into Text Pages
This module splits the PDF e-book into the same number of text pages as the book contains. PDF-to-text conversion can be done by various tools; here a Java [5] based library, iText [6], is used to split the PDF file into a number of text pages (see the sketch after Section 2.1.2). The naming convention for the pages is "PageNumber.txt". The split process is important because it provides a unique identification of each page and its respective content.

2.1.2. Page Number Association
After the conversion from PDF to text, each page is allotted a page number. Page number allocation is important because the page numbers generated by splitting the PDF file differ from the actual page numbers at which the chapter contents start. For this purpose we have to identify the actual page number at which the chapter contents start and then supply it manually to our program.
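The paper gives no code for the splitting step, so the following is a minimal sketch assuming iText 5's text-extraction API; the input file name "book.pdf" and the output directory "pages" are illustrative, not taken from the paper.

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import java.io.File;
import java.io.FileWriter;

public class PdfSplitter {
    public static void main(String[] args) throws Exception {
        new File("pages").mkdirs();                    // output folder (assumed name)
        PdfReader reader = new PdfReader("book.pdf");
        int pageCount = reader.getNumberOfPages();
        for (int i = 1; i <= pageCount; i++) {
            // Extract the plain text of page i.
            String text = PdfTextExtractor.getTextFromPage(reader, i);
            // Store it as "PageNumber.txt" so each page keeps a unique identity.
            try (FileWriter out = new FileWriter("pages/" + i + ".txt")) {
                out.write(text);
            }
        }
        reader.close();
    }
}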
2.2. Typed Dependency Parsing

In this step the system inputs each chapter's contents to the Stanford Parser, which then generates the typed dependencies for the text. The Stanford parser is a Java program and is easily integrated with the system.

2.2.1. Typed Dependency Parser
The Stanford Typed Dependencies [7] representation was designed to provide a simple description of the grammatical relationships in a sentence that can easily be understood and effectively used by people without linguistic expertise who want to extract textual relations. In particular, rather than the phrase-structure representations that have long dominated the computational linguistics community, it represents all sentence relationships uniformly as typed dependency relations. Here is an example sentence:

"Bell, based in Los Angeles, makes and distributes electronic, computer and building products."

For this sentence, the Stanford Dependencies (SD) representation is:

nsubj(makes-8, Bell-1)
nsubj(distributes-10, Bell-1)
partmod(Bell-1, based-3)
nn(Angeles-6, Los-5)
prep_in(based-3, Angeles-6)
root(ROOT-0, makes-8)
conj_and(makes-8, distributes-10)
amod(products-16, electronic-11)
conj_and(electronic-11, computer-13)
amod(products-16, computer-13)
conj_and(electronic-11, building-15)
amod(products-16, building-15)
dobj(makes-8, products-16)
dobj(distributes-10, products-16)

Each file is parsed individually, and typed dependencies for each sentence are generated for all the pages. The process writes a separate file containing the typed dependencies, named "PageNumber.txt", in a different folder.

2.3. Noun Phrase Extraction

The output of the Typed Dependency Parser is used to extract the noun phrases that serve as seed words for back index generation. The words inside the brackets are extracted and rearranged. This process can generate either bi-grams or tri-grams.

2.3.1. Seed Words
A seed word is the first occurrence of a noun from which a noun phrase chain can be constructed. For example, in the relationship nn(Angeles-6, Los-5), the word Angeles can be treated as the seed word. Seed words play an important role in tri-gram and bi-gram generation.

2.4. Tri-grams Generation

Tri-grams are noun phrases that contain three seed words. This step uses the noun phrases generated by the typed dependency parser. If the second-word position of the first line and the second-word position of the second line of two consecutive nn relations are identical, we declare a tri-gram and rearrange the terms. The rearranging process takes all four terms of the two consecutive nn's, deletes the repeating term, and finally places the remaining terms according to their position values. After rearranging, this approach deletes the entry from the dependency file, removes all position values, and deletes both lines from the dependency relation.
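The paper does not show code for tri-gram assembly; the sketch below is our reading of the rule above combined with Steps 10-13 of the algorithm in Section 2.8: two consecutive nn(...) relations that share the same head word are merged into one tri-gram. The class name and regular expression are our own, not part of the paper's tool.

import java.util.*;
import java.util.regex.*;

public class TrigramExtractor {
    // Matches lines such as "nn(Angeles-6, Los-5)".
    private static final Pattern NN =
        Pattern.compile("nn\\((\\S+)-(\\d+), (\\S+)-(\\d+)\\)");

    public static List<String> extract(List<String> dependencyLines) {
        List<String> trigrams = new ArrayList<>();
        Matcher prev = null;
        for (String line : dependencyLines) {
            Matcher cur = NN.matcher(line.trim());
            if (!cur.matches()) { prev = null; continue; }
            // Two consecutive nn's with the same head word form a tri-gram,
            // e.g. nn(Parser-3, Stanford-1) + nn(Parser-3, Dependency-2)
            //      -> "Stanford Dependency Parser".
            // (A full implementation would order the three terms by position value.)
            if (prev != null && prev.group(1).equals(cur.group(1))) {
                trigrams.add(prev.group(3) + " " + cur.group(3) + " " + cur.group(1));
                prev = null;           // both lines are consumed
            } else {
                prev = cur;
            }
        }
        return trigrams;
    }
}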
2.5. Bi-grams Generation

Bi-grams are noun phrases that contain two seed words. The nn(...) relations remaining after the step above yield bi-grams: here the second-word position of the first line and the second-word position of the second line of two consecutive nn relations are not identical, so each relation is declared a bi-gram and rearranged. The rearranging process takes the two terms of the current line and places them according to their position values. After rearranging, this approach deletes the entry from the dependency file and removes all position values. The process runs until the file is empty.
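A corresponding sketch for the bi-gram case (again our illustration, under the same assumptions as the tri-gram sketch):

public class BigramExtractor {
    // For a leftover relation nn(head-hPos, dep-dPos), e.g. nn(Angeles-6, Los-5),
    // place the two words according to their position values: "Los Angeles".
    public static String toBigram(String head, int headPos, String dep, int depPos) {
        return depPos < headPos ? dep + " " + head : head + " " + dep;
    }
}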
2.6. Generating Hierarchy from Bi-grams

The system first selects the second word of each bi-gram and extracts all co-occurring terms, attaching them below the second word. The co-occurring terms are treated as sub-topics, and the second word is called the main topic of the hierarchy. Finally, all the co-occurring terms are allotted a page number.
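One plausible implementation of this grouping (the map-based structure and names are our assumptions, not the paper's): bi-grams are grouped by their last word, which becomes the main topic, while the first words become the sub-topics carrying their page numbers.

import java.util.*;

public class HierarchyBuilder {
    // Group bi-grams by their last word (the main topic); the co-occurring
    // first words become sub-topics carrying their page numbers.
    public static Map<String, List<String>> build(Map<String, List<Integer>> bigramPages) {
        Map<String, List<String>> hierarchy = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : bigramPages.entrySet()) {
            String[] words = e.getKey().split(" ");          // e.g. "Los Angeles"
            String mainTopic = words[words.length - 1];      // "Angeles"
            String subTopic = words[0] + " " + e.getValue(); // "Los [12, 47]"
            hierarchy.computeIfAbsent(mainTopic, k -> new ArrayList<>()).add(subTopic);
        }
        return hierarchy;
    }
}

Using a TreeMap keeps the main topics in alphabetical order, which already anticipates the ordering step of Section 2.7.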
2.7. Alphabetical Ordering

This last step uses a simple sorting technique to arrange all bi-grams and tri-grams in alphabetical order. In this process all the bi-grams and tri-grams are merged at one location, rearranged, and stored at a different location. The arranged terms are emitted in the form of a back index, with page numbers separated by commas.
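As a last illustrative fragment (ours, not the paper's code), merging the bi-gram and tri-gram lists and sorting them alphabetically:

import java.util.*;

public class IndexMerger {
    // Merge bi-gram and tri-gram entries and sort them alphabetically,
    // ignoring case, to produce the final back index.
    public static List<String> merge(List<String> bigrams, List<String> trigrams) {
        List<String> index = new ArrayList<>(bigrams);
        index.addAll(trigrams);
        index.sort(String.CASE_INSENSITIVE_ORDER);
        return index;
    }
}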
2.8. Algorithm

HBIG_Algo
Input: E-book in PDF form
Output: Hierarchical back index in text form

Step 1: Start
Step 2: For each PDF page
Step 3:   Convert it into a text file
Step 4: End loop
Step 5: For each text file
Step 6:   Generate typed dependencies for each sentence
Step 7:   Store all pages with the corresponding page number as the file name
Step 8: End loop
Step 9: For each page of typed dependencies
Step 10:   Extract only the nn(...) relations
Step 11:   If the first words inside two consecutive nn(...) are the same
Step 12:   Then
Step 13:     Select both words inside the nn's and generate a tri-gram
Step 14:     Attach a page number separated by a comma
Step 15:   Else
Step 16:     Select the nn and generate a bi-gram
Step 17:     Attach a page number separated by a comma
Step 18:   End if
Step 19:   For each bi-gram
Step 20:     Select the last word as the main topic
Step 21:     For each main topic
Step 22:       Select the co-occurring words
Step 23:       Put the co-occurring words under the main topic
Step 24:     End loop
Step 25: End loop
Step 26: Rearrange all the topics alphabetically
Step 27: End

3. WORKING EXAMPLE

The figures below show a working example of the tool.

Fig. 2: Typed dependency parsing of sample text
Fig. 3: Extraction of nn expressions
Fig. 4: Tri-gram generation
Fig. 5: Bi-gram generation
Fig. 6: HBI generation from bi-grams
Fig. 7: Generated hierarchical back index

4. OBSERVATION AND RESULTS

This paper has identified two software packages for automatic back-of-the-book indexing.

4.1. PDF Index Generator
PDF Index Generator is a powerful indexing utility for generating the back index of any e-book [8]. It parses the PDF, collects the index words and their locations in the PDF, then writes the generated index to a PDF or text file specified by the user or indexer. Its main aim is to automate the process of generating the book index instead of doing the hard work manually. PDF Index Generator produces a back index in the form of single nouns with page numbers; it does not provide bi-grams and tri-grams [9].

4.2. MCFBI
MCFBI is a meta-content framework for generating back indexes for e-books which uses part-of-speech tagging [10]. It is a simple approach that takes the help of a Part-of-Speech (POS) Tagger [11] to identify the candidate terms or nouns. The main disadvantage of this software tool is that it, too, does not provide bi-grams and tri-grams.
Both tools provide only a flat back index; they are unable to provide hierarchical back index terms.

5. DISCUSSION

The technique explained above provides a fully automatic software package which enables a user or indexer to generate a back index in hierarchical form. HBIG also includes tri-grams, which can help the reader use the book more efficiently. It takes a book in PDF form and generates a back index in either flat or hierarchical form that can be edited or modified further if required. It overcomes the disadvantages of PDF Index Generator and MCFBI.

6. ACKNOWLEDGMENT

The authors sincerely thank Dr. Arpana Rawal, Dr. Ani Thomas and Mr. Sarang Pitale for their timely, invaluable and indispensable guidance and the encouragement shown towards the work.
7. CONCLUSION

A machine-generated back index can act as implicitly available background knowledge (an ontology). Such an implicit ontology plays an important role in:
• NLP machine translation tasks
• Automatic topic spotting
• Content summarization
• Topic relevancy computation and ranking

Without the use of an explicit ontology, the publishing-cum-printing time is drastically reduced, the index remains precise and complete, the expense overhead is reduced against the equivalent manual effort, and the result is more comprehensive than flat back indexes.

REFERENCES

[1] Extraction of grammatical relations between text fragments: typed dependency parser (courtesy: The Stanford Natural Language Processing Group). http://nlp.stanford.edu:8080/parser/index.jsp
[2] Gruber, Thomas R., "A translation approach to portable ontology specifications", Knowledge Acquisition (Elsevier), vol. 5, no. 2, pp. 199–220, 1993.
[3] Fredrik Arvidsson, Annika Flycht-Eriksson, "Ontology I". Retrieved 26 November 2008.
[4] Loos, Eugene E., et al., Glossary of Linguistic Terms: What is a noun?, 2003.
[5] Oracle Corporation, http://www.java.com
[6] iText - Free / Open Source PDF Library for Java and C#, http://www.itextpdf.com/
[7] Marie-Catherine de Marneffe and Christopher D. Manning, "Stanford Typed Dependencies Manual", The Stanford Natural Language Processing Group, 2008.
[8] PDF Index Generator, http://www.pdfindexgenerator.com
[9] Saket Soni, Raj Kumar Singh, Sarang Pitale, "Flat Back Index Generation for E-Books: A Tri-gram Approach", International Journal of Innovative Research in Computer and Communication Engineering (IJIRCCE), vol. 2, no. 1, pp. 2540–2545, 2014.
[10] Tripti Sharma, Sarang Pitale, "Meta-Content Framework for Back Index Generation", International Journal on Computer Science and Engineering (IJCSE), vol. 4, no. 4, pp. 627–633, 2012.
[11] Stanford POS Tagger, http://nlp.stanford.edu/software/tagger.shtml