Ranking of Document Recommendations from Conversations using Probabilistic Latent Semantic Analysis


GRD Journals | Global Research and Development Journal for Engineering | International Conference on Innovations in Engineering and Technology (ICIET) - 2016 | July 2016

e-ISSN: 2455-5703

1P. Velvizhi  2S. Aishwarya  3R. Bhuvaneswari
1,2,3Department of Computer Science & Engineering
1,2,3K.L.N. College of Engineering, Pottapalayam, Sivagangai 630612, India

Abstract

Information retrieval from documents is traditionally done through text search; nowadays, efficient search relies on mining techniques. In this system, speech is recognized in order to search for documents. A group of conversations is recorded using Automatic Speech Recognition (ASR); the system converts speech to text using the FISHER tool, and the transcripts are stored in a database. Implicit queries are formulated in two stages: extraction and clustering. The domain of the conversations is structured through topic modeling, and the keywords of each topic are extracted with high probability. Documents are then ranked using Probabilistic Latent Semantic Analysis (PLSA): clustering the keyword set yields subsets that together cover all the recommended topics, so that a precise document recommendation can be made for each topic. PLSA ranks the retrieved documents using weighted keywords, which reduces noise when searching for a topic; enforcing both relevance and diversity ensures effective document retrieval. Finally, the retrieved text documents are converted back to speech using the eSpeak tool.

Keywords: Keyword Extraction, Topic Modeling, Word Frequency, PLSA, Document Retrieval

I. INTRODUCTION

Much unpredictable information is available as documents, databases or multimedia resources, but users' current activities do not by themselves initiate a search to access it. We therefore adopt a just-in-time retrieval system, which spontaneously recommends documents by analyzing users' current activities. Here, the activities are recorded as conversations, for instance conversations recorded in a meeting. A real-time Automatic Speech Recognition (ASR) system provides transcripts from which implicit queries for document retrieval are constructed; recommendations come from the web or a local repository. A just-in-time retrieval system must build implicit queries from conversations, which contain many more words than a typical query. For instance, suppose four people must make a list of all the items required to survive in the mountains: a short fragment of 120 seconds of conversation contains about 600 words, which splits across a variety of domains such as 'Job', 'Plan' or 'Business'. The multiplicity of topics and speech disfluencies produce ASR noise. The main objective is to maintain multiple hypotheses about users' information needs. In this paper, the first step is extracting a relevant and diverse set of keywords; these are then clustered into topic-specific queries ranked by importance. This topic-based clustering technique decreases the effect of ASR errors and increases the diversity of document recommendations. For instance, word frequency alone retrieves Wikipedia pages such as 'Job', 'Hiring' and 'Interview', whereas users would prefer results covering 'Job', 'Plan' and 'Business'. Relevance and diversity are addressed in three stages. First, keyword extraction retrieves the most used words. Second, one or several implicit queries are built: a single query tends to retrieve irrelevant documents, while multiple queries maintain the diversity constraint. Third, the results are ranked, which reorders the document list recommended to users. Previous methods for formulating implicit queries rely on word-frequency weights; other methods perform keyword extraction using topical similarity, but none of them sets a topic-diversity constraint. From the ASR output, the user's information needs must be satisfied while the number of irrelevant words is reduced. Once the keywords are extracted, clustering builds topically separated queries, which are run independently rather than as one topically mixed query; the results are then ranked and organized. The paper is organized as follows. Section II-A reviews just-in-time retrieval systems and the policies used for query formulation. Section II-B discusses keyword extraction methods. Section III presents the proposed technique for formulating implicit queries. Section IV introduces the data sets and compares keyword sets for document retrieval using crowdsourcing. Section V presents experimental results on keyword extraction and clustering.

All rights reserved by www.grdjournals.com

133



II. JUST-IN-TIME RETRIEVAL

Just-in-time retrieval marks a departure from query-based information retrieval: users' activities are monitored to detect information needs and retrieve relevant documents. For this, implicit queries are extracted from the text of the conversation (they are never explicitly shown to the user). Many just-in-time retrieval systems and query-formulation methodologies have been proposed; among them, the Automatic Content Linking Device (ACLD) was designed as a document recommendation system.

A. Query Formulation

The first system for document recommendation was Fixit, which monitors background searches over a database, establishes relationships between symptoms and faults, and provides additional information about the current state. The Remembrance Agent was integrated into a text editor, from which it performed searches. The Watson system inspects browsing history for document recommendation, formulating queries from word frequency and writing style. Before the solutions proposed in this paper, the ACLD modeled the user's information needs at regular time intervals, but that method used the entire set of words rather than an implicit query that reduces noise. The use of semantic similarity shows that, although relevance improves, its high computational cost makes it unfit for just-in-time retrieval from a large repository. This motivated the design of a keyword extraction methodology to model users' information needs. Even short conversation fragments include words related to several topics; the ASR transcript adds errors, so naive keyword selection leads to improper queries and a failure of relevance and user satisfaction. The keyword extraction method proposed here introduces diversity, together with a clustering technique that forms topically separated queries.

B. Keyword Extraction

Several methods automatically extract keywords from a text. The oldest technique ranks words by frequency. An alternative counts pairwise co-occurrence frequencies before ranking, but neither captures word meaning, so they ignore low-frequency words even when, taken together, these signal an important domain: for instance, the words 'Car', 'Wheel', 'Seat' and 'Passenger' together indicate automobiles as a single domain. To improve frequency-based methods, lexical semantic information has been proposed. Semantic relations can be captured by a manually built thesaurus such as WordNet, or by Wikipedia, while topic models such as LSA, PLSA and LDA are built automatically. Probabilistic Latent Semantic Analysis (PLSA) can rank the words of a conversation transcript with a weighted pointwise mutual information scoring function, the ranking being based on topical similarity. Dependencies among selected words have been combined with PageRank, WordNet or topical information, and the word2vec vector-space representation of words, learned with a neural network language model, has also been used. Previous work thus considered topical similarity and dependencies among words, but it did not explicitly reward diversity, so secondary topics of a conversation were lost.
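The word-frequency baseline discussed above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the stoplist, the tokenizer and the sample fragment are assumptions.

```python
from collections import Counter
import re

# Tiny illustrative stoplist; a real system would use a much fuller one.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "it",
             "for", "on", "but", "has", "not", "yet"}

def wf_keywords(transcript, k=5):
    """Rank candidate keywords of a conversation transcript by raw frequency."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [w for w, _ in counts.most_common(k)]

fragment = ("I want a new job. The job interview went well, "
            "but the company has not sent a job offer yet.")
print(wf_keywords(fragment, k=3))  # 'job' ranks first by frequency
```

As the section notes, such a ranking favors repeated words like 'job' and cannot group low-frequency words that jointly indicate one domain.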

III. IMPLICIT QUERY FORMULATION

The formulation of implicit queries is a two-stage process. The first stage is the extraction of keywords from the transcripts produced by an ASR system, for which documents must be recommended: the keywords must reflect all the topics covered in the conversation, while irrelevant words due to ASR mistakes are avoided. The second stage is a clustering technique which forms topically disjoint queries.

A. Diverse Keyword Extraction

Topic modeling is used first to build a topical representation of the fragment. Content words are then selected using topical similarity, over a diverse range of topics, as recommended by summarization techniques; the main advantage is to cover all the main topics of the conversation. The proposed algorithm selects a smaller number of keywords from each topic in order to include more topics. This is useful for two reasons: it increases the variety of recommended documents, and the algorithm selects fewer noisy words than an algorithm which ignores diversity. Diverse keyword extraction includes three steps, as viewed in Fig. 1. First, topic modeling represents the distribution over topics z for each word w, noted p(z|w); the topics are not represented explicitly but through the topic modeling technique, and the model is learned over a group of conversations from the same domain. Second, the topics are weighted for each conversation fragment, the weight β_z of topic z being obtained by averaging p(z|w) over the words of the fragment. Third, from the weights thus determined, a keyword list w = [w1, w2, …, wk] is selected so as to cover all the important topics. Topic modeling is done with Probabilistic Latent Semantic Analysis (PLSA), which maintains the distribution p(z|w) over topics z for each word w.
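The PLSA model that supplies p(z|w) can be sketched with a plain EM procedure over a document-term count matrix. This is a minimal illustration of the technique, not the system's actual training pipeline; the toy count matrix is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def plsa(counts, n_topics, n_iter=100):
    """Fit PLSA by EM on a documents-by-words count matrix.
    Returns p(w|z), p(z|d) and the per-word topic distribution p(z|w)."""
    n_docs, n_words = counts.shape
    p_w_z = rng.random((n_topics, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics)); p_z_d /= p_z_d.sum(1, keepdims=True)
    for _ in range(n_iter):
        # E-step: responsibilities p(z|d,w), shape (docs, topics, words)
        post = p_z_d[:, :, None] * p_w_z[None, :, :]
        post /= post.sum(1, keepdims=True) + 1e-12
        # M-step: re-estimate the distributions from expected counts
        nz = counts[:, None, :] * post
        p_w_z = nz.sum(0); p_w_z /= p_w_z.sum(1, keepdims=True) + 1e-12
        p_z_d = nz.sum(2); p_z_d /= p_z_d.sum(1, keepdims=True) + 1e-12
    # p(z|w): expected (topic, word) counts, normalized over topics
    p_z_w = nz.sum(0)
    p_z_w /= p_z_w.sum(0, keepdims=True) + 1e-12
    return p_w_z, p_z_d, p_z_w

# Toy corpus: 4 fragments, 4 words, with two clearly separated topics
counts = np.array([[5, 5, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 6, 4]])
p_w_z, p_z_d, p_z_w = plsa(counts, n_topics=2)
print(p_z_w.round(2))
```

On this block-structured toy matrix, EM assigns the first two words to one latent topic and the last two to the other, which is exactly the p(z|w) signal the keyword extraction step relies on.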
If a conversation fragment t mentions a set of topics Z, and each word w of t covers a subset of Z, then the objective is to find a subset of k unique words that covers all the topics. A greedy algorithm finds a near-optimal solution within a given time budget, selecting at each step the word that maximizes the reward over all topics. Since keywords introduced by ASR errors obtain low β_z values, their selection probability is reduced. The keywords are thus extracted under the desired trade-off between relevance and diversity in the keyword set; a parameter λ is set to balance these constraints.

B. Keyword Clustering

The keywords obtained by diverse keyword extraction answer the user's needs in terms of topics. Diversity of topics and reduction of the effect of noise are maintained by splitting the keyword set into several topically disjoint subsets, each of which forms an implicit query used to retrieve documents. The subsets are formed by clustering topically similar keywords, as follows. For each main topic z of the conversation, the keywords are ordered by decreasing values of β_z · p(z|w); the keywords with the highest values are assigned to topic z and ranked within it by these values.

C. Document Recommendation

For document retrieval, implicit queries are prepared for each conversation fragment; formulating multiple implicit queries improves the retrieval results. After ranking, the first d documents of each query are retrieved, and recommendation lists are prepared from the resulting document sets. Ranking documents by topical similarity to the queries constitutes the baseline algorithm.
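The greedy selection described above can be sketched as follows. The exact reward is not given in the text, so the concave topic-coverage form R(S) = Σ_z β_z · (Σ_{w∈S} p(z|w))^λ, as well as the toy probabilities, are assumptions consistent with the stated role of λ.

```python
def diverse_keywords(words, p_z_w, beta, k, lam=0.75):
    """Greedy diverse keyword selection under a concave coverage reward.

    R(S) = sum_z beta[z] * (sum_{w in S} p_z_w[w][z]) ** lam.  With lam < 1,
    the marginal gain of adding a word to an already well-covered topic
    shrinks, so the greedy choice spreads the k keywords across topics."""
    def reward(cover):
        return sum(b * (c ** lam) for b, c in zip(beta, cover))
    selected, cover = [], [0.0] * len(beta)
    for _ in range(k):
        best, best_gain = None, float("-inf")
        for w in words:
            if w in selected:
                continue
            trial = [c + p for c, p in zip(cover, p_z_w[w])]
            gain = reward(trial) - reward(cover)
            if gain > best_gain:
                best, best_gain = w, gain
        selected.append(best)
        cover = [c + p for c, p in zip(cover, p_z_w[best])]
    return selected

# Toy p(z|w): two topics, three candidate words
p_z_w = {"job": [0.9, 0.1], "hire": [0.8, 0.2], "finance": [0.1, 0.9]}
picked = diverse_keywords(list(p_z_w), p_z_w, beta=[0.5, 0.5], k=2)
print(picked)  # one keyword from each topic, not two from the same topic
```

With λ = 1 the reward is linear and the two highest-probability words of the dominant topic would be picked; the concave exponent is what makes the second pick jump to the uncovered topic.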

IV. EXPERIMENTAL RESULTS

The diverse keyword extraction technique is compared here with other approaches to query formulation and keyword extraction; it extracts more relevant keywords and reduces ASR noise. We also compare the retrieval results obtained from full keyword lists with those obtained after splitting the lists into topically separated queries; the proposed method outperforms the existing ones.

A. Keyword Extraction Methods

Several versions of the proposed diverse keyword extraction method, noted D(λ), are compared with Word Frequency (WF) and Topical Similarity (TS), for λ ∈ {.5, .75, 1}; WF and TS do not ensure diversity. Among the λ values, D(.5) scores low compared to the others. The values of λ can be clustered into three groups: .5 ≤ λ < .7 is represented by λ = .5, .7 ≤ λ ≤ .8 by λ = .75, and .8 < λ ≤ 1 by λ = 1. We therefore consider only D(.5), D(.75), D(1) and TS.

B. Topical Diversity

We compared four keyword extraction methods: WF, TS, D(.75) and D(.5). The average diversity values of D(.75) and D(.5) are similarly high, above those of WF and TS. The values for TS are low and increase gradually for large numbers of keywords, showing that TS ensures topic diversity only weakly; the values for WF are uniform throughout, so WF does not consider topics at all.

C. Relevant Keywords

Binary comparisons are performed using crowdsourcing between pairs of extraction methods (WF vs. TS, D(.5) vs. D(.75), and so on), the goal being to rank the four extraction techniques while excluding redundant comparisons. Human judgments yield the ranking D(.75) > TS > WF > D(.5).

D. Noise Reduction

The comparison values on ASR transcripts are similar to those on manual ones, apart from a degradation of WF due to ASR noise; D(.75) still outperforms TS, and the ranking remains D(.75) > TS > WF > D(.5). The keywords listed by each method are compared to detect ASR errors, with noise levels varying from 5% to 50%. D(.75) selects fewer noisy keywords than TS and WF: WF selects words with high frequency regardless of topic, while TS and D(.75) both reduce the selection probability of noisy words belonging to irrelevant topics, the advantage of D(.75) over TS being a further reduction of noisy keyword selection.

E. Retrieval Scores

The retrieval results of queries built from the keyword lists are again binary-compared using crowdsourcing. We consider two types of implicit queries: single queries and multiple queries. The entire keyword set forms a single query, while multiple queries are formed by dividing the keyword set into topically independent subsets whose results are merged into a unique document set. Compared with the previous result, WF wins the comparative relevance with 87% vs. 13%. First, single queries are built with the D(.75), TS and WF keyword extraction methods and compared for relevance. Next, multiple queries are built with the same methods and the resulting document sets are compared. Finally, the best results of multiple queries are compared with the best results of single queries.


F. Single Queries

Single-query comparisons are made with the D(.75), TS and WF methods over the fragments, and the suggested documents are listed from the transcripts. The resulting rank for document relevance is again D(.75) > WF > TS: when single queries are used, the relevance of the resulting document sets is superior with the diverse keyword extraction technique.

Compared Methods (m1 vs. m2)   Relevance m1 (%)   Relevance m2 (%)
TS vs. WF                      40                 60
WF vs. D(.75)                  42                 58
TS vs. D(.75)                  20                 80
Fig. 1: Comparative relevance of single queries, inferred as D(.75) > WF > TS
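The pairwise percentages above can be aggregated into a single ranking; the average-win-rate rule used below is an assumption for illustration, since the paper only reports pairwise outcomes.

```python
from collections import defaultdict

def rank_methods(comparisons):
    """Aggregate crowdsourced binary comparisons into a method ranking.
    `comparisons` maps (m1, m2) pairs to the percentage of judges who
    preferred m1; each method is scored by its average win rate."""
    wins = defaultdict(list)
    for (m1, m2), pct in comparisons.items():
        wins[m1].append(pct)
        wins[m2].append(100 - pct)
    return sorted(wins, key=lambda m: -sum(wins[m]) / len(wins[m]))

# Pairwise relevance percentages reported for single queries (Fig. 1)
single = {("TS", "WF"): 40, ("WF", "D(.75)"): 42, ("TS", "D(.75)"): 20}
print(rank_methods(single))  # ['D(.75)', 'WF', 'TS']
```

The aggregation reproduces the ranking inferred in Fig. 1: D(.75) > WF > TS.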

G. Multiple Queries

Binary comparisons were also performed using multiple queries, i.e. topically disjoint keyword subsets, built from the TS and D(.75) keyword lists after clustering; the two resulting methods are noted CTS and CD(.75) (C indicating clustering of keywords). Clustering is not applicable to WF because it is not based on topic modeling. CD(.75) outperforms CTS with 62% vs. 38%.

H. Single versus Multiple

Single and multiple queries built from the same keyword lists, D(.75) and TS, are compared. Using multiple queries leads to more relevant documents than single queries, i.e. CD(.75) > D(.75) and CTS > TS, with relevance scores of 65% to 35% for both D(.75) and TS.
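One plausible reading of the clustering step that produces CTS and CD(.75) is sketched below: each keyword joins its most probable topic, and keywords are ordered within a topic by β_z · p(z|w). The β_z and p(z|w) values are assumed for illustration.

```python
def cluster_keywords(keywords, p_z_w, beta):
    """Split a keyword set into topically disjoint implicit queries.
    Each keyword is assigned to its most probable topic, and within
    each topic keywords are ordered by decreasing beta[z] * p(z|w)."""
    clusters = {}
    for w in keywords:
        z = max(range(len(beta)), key=lambda t: p_z_w[w][t])
        clusters.setdefault(z, []).append(w)
    queries = []
    for z in sorted(clusters, key=lambda t: -beta[t]):  # main topics first
        queries.append(sorted(clusters[z], key=lambda w: -beta[z] * p_z_w[w][z]))
    return queries

# Assumed topic weights and per-word topic distributions
p_z_w = {"job": [0.9, 0.1], "hiring": [0.8, 0.2],
         "finance": [0.1, 0.9], "campus": [0.3, 0.7]}
queries = cluster_keywords(list(p_z_w), p_z_w, beta=[0.6, 0.4])
print(queries)  # [['job', 'hiring'], ['finance', 'campus']]
```

Each inner list is then issued as one implicit query, and the per-query document lists are merged into the final recommendation set.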

Compared Methods (m1 vs. m2)   Relevance m1 (%)   Relevance m2 (%)
CD(.75) vs. D(.75)             70                 30
CTS vs. TS                     70                 30
CD(.75) vs. CTS                60                 40
Fig. 2: Comparative relevance scores of document results using multiple queries

1) Example

The priority of CD(.75) indicates that it retrieves the most relevant results. The conversation fragment transcribed in the Appendix was submitted to the system. The keyword sets extracted by WF, TS and D(.75) are shown in Fig. 3; WF does not consider topical information ('Job', 'Work', 'Company'), so no topically relevant information is retrieved. The CTS method covers the main topics of the fragment with the keywords 'Plan', 'Interview', 'Company' and 'Business'; then 'time' and 'questions' are selected to cover the remaining topics; next, 'Resume', 'studies', 'mail' and 'school' cover all the other topics. The highest-ranked retrieval results obtained by WF, TS, D(.75), CD(.75) and CTS are shown in Fig. 5. The multiple queries (CTS and CD(.75)) retrieve the most relevant documents. WF does not recommend relevant documents; D(.75) retrieves the largest number of topics; but the multiple queries CTS and CD(.75) retrieve more relevant documents than the single queries, because TS and D(.75) do not separate the mixture of topics, which is why we move to CTS and CD(.75). Moreover, CD(.75) covers more topics than CTS.

WF:      S = {Job, Work, Interview, Plan, Company, School, Studies}
TS:      S = {Job, Hiring, Hire, Resume, Company, Preparation}
D(.75):  S = {Job, Resume, School, Time, Advice, Friends}
Fig. 3: Keyword sets obtained by three keyword extraction techniques

CTS:      Q1 = {New job, New year, Studies, Graduation, Time};  Q2 = {Hiring, Agency};  Q3 = {Campus, Finance};  Q4 = {Performance}
CD(.75):  Q1 = {Job, New year, Friends, Studies};  Q2 = {Finance, Campus, Firms, Resume};  Q3 = {Academic, Advice, Strength, Weakness}
Fig. 4: Abstract topics of the conversation fragment

WF               TS          D(.75)
Job              Interview   Resume
Job offer        Job         School
Hire             Hiring      Interview
Hiring Company   Resume      Campus

CTS         CD(.75)
Job         Finance
Interview   Job
Hiring      Advice
Company     Academic
Agency      Business
Fig. 5: Comparison of the highest-ranked results of single and multiple queries

V. CONCLUSION

A just-in-time retrieval system for conversational environments recommends documents to users according to their information needs. First, topic modeling supports the derivation of implicit queries from conversation fragments: a novel keyword extraction technique covers the maximum number of important topics of a fragment, and the keywords are then clustered into smaller, topically disjoint subsets from which implicit queries are formulated. Relative to WF and TS, the relevance of the retrieved documents improves; enforcing both relevance and diversity thus improves the keyword extraction technique. Future work includes considering n-grams in addition to individual words, as well as processing explicit queries and ranking the results so as to cover all the information needs of the user. These techniques have not yet been integrated in a working environment with human users in real-life meetings.

VI. APPENDIX

The following transcript of a conversation between two speakers, Nancy and John, was submitted to the document recommendation system. The keyword lists, queries and documents retrieved are shown in Figs. 3, 4 and 5 respectively.

Nancy: Hi. It is good to see you, John.
John: Same here, Nancy. It has been a long time since I last saw you.
Nancy: Yes, the last time we saw each other was New Year's Eve. How are you doing?
John: I am doing OK. It would be better if I had a new job right now.
Nancy: You are looking for a new job? Why?
John: I already finished my studies and graduated last week. Now, I want to get a job in the Finance field. Payroll is not exactly Finance.
Nancy: How long have you been looking for a new job?
John: I just started this week.
Nancy: Didn't you have any interviews with those firms that came to our campus last month? I believe quite a few companies came to recruit students for their Finance departments.
John: I could only get one interview with Fidelity Company because of my heavy work schedule. A month has already gone by, and I have not heard from them. I guess I did not make it.
Nancy: Don't worry, John. You always did well in school. I know your good grades will help you get a job soon. Besides, the job market is pretty good right now, and all companies need financial analysts.
John: I hope so.
Nancy: You have prepared a resume, right?
John: Yes.
Nancy: Did you mail your resume to a lot of companies? How about recruiting agencies?
John: I have sent it to a dozen companies already. No, I have not thought about recruiting agencies. But, I do look closely at the employment ads listed in the newspaper every day.
Nancy: Are there a lot of openings?
John: Quite a few. Some of them require a certain amount of experience and others are willing to train.
Nancy: My friends told me that it helps to do some homework before you go to an interview. You need to know the company well: what kind of business is it in? What types of products does it sell? How is it doing lately?
John: Yes, I know. I am doing some research on companies that I want to work for. I want to be ready whenever they call me in for an interview.
Nancy: Have you thought about questions they might ask you during the interview?
John: What types of questions do you think they will ask?
Nancy: Well, they might ask you some questions about Finance theories to test your academic understanding.
John: I can handle that.


Nancy: They might tell you about a problem and want you to come up with a solution.
John: I don't know about that. I hope I will be able to give them a decent response if the need arises.
Nancy: They will want to know you a little bit before they make a hiring decision. So, they may ask you to describe yourself. For example, what are your strengths and your weaknesses? How do you get along with people?
John: I need to work on that question. How would I describe myself? Huh!
Nancy: Also, make sure you are on time. Nothing is worse than to be late for an interview. You do not want to give them a bad impression, right from the start.
John: I know. I always plan to arrive about 10 or 15 minutes before the interview starts.
Nancy: Good decision! It seems that you are well prepared for your job search. I am sure you will find a good job in no time.
John: I hope so.
Nancy: I need to run; otherwise, I will be late for school. Good luck in your job search, John.
John: Thank you for your advice. Bye!

