
Vectorize

Convert each word to a vector.

```python
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Tokenize each sentence into individual words
tokens_list = [word_tokenize(sentence) for sentence in sentences]
```


```python
# Remove punctuation from tokens
punctuation = set(string.punctuation)
tokens_list = [[word for word in tokens if word.lower() not in punctuation]
               for tokens in tokens_list]
```

```python
# Remove stop words from tokens
stop_words = set(stopwords.words("english"))
tokens_list = [[word for word in tokens if word.lower() not in stop_words]
               for tokens in tokens_list]
```
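The same preprocessing steps can be sketched without NLTK's downloadable data. This minimal version substitutes a plain `str.split()` for `word_tokenize` and a tiny hand-written stop-word set for `stopwords.words("english")`; both substitutions are assumptions made only to keep the example self-contained.

```python
import string

# Toy input standing in for the article's `sentences` variable.
sentences = ["The model learns vectors.", "Vectors capture meaning!"]

punctuation = set(string.punctuation)
stop_words = {"the", "a", "an", "of", "in"}  # tiny stand-in list

# Tokenize, strip attached punctuation, then filter punctuation and stop words.
tokens_list = [sentence.split() for sentence in sentences]
tokens_list = [[w.strip(string.punctuation) for w in tokens] for tokens in tokens_list]
tokens_list = [[w for w in tokens
                if w and w.lower() not in punctuation and w.lower() not in stop_words]
               for tokens in tokens_list]
# tokens_list -> [['model', 'learns', 'vectors'], ['Vectors', 'capture', 'meaning']]
```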

```python
# Calculate the TF-IDF matrix
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(corpus)
words = vectorizer.get_feature_names_out()
```

```python
# Get the TF-IDF value of each word in the first document
tfidf_values = tfidf_matrix.todense().tolist()[0]
```

```python
# Sort words by TF-IDF value from high to low and keep the top 3
center_words = [words[index]
                for index in sorted(range(len(tfidf_values)),
                                    key=lambda i: tfidf_values[i],
                                    reverse=True)[:3]]
```
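The ranking step above can be seen in isolation with hand-made scores. The word list and TF-IDF values below are assumed values for illustration only, not output of the vectorizer:

```python
# Assumed words and TF-IDF scores, for illustration only.
words = ["cat", "dog", "fish", "bird"]
tfidf_values = [0.2, 0.9, 0.5, 0.1]

# Indices sorted by score, high to low, then the top 3 words.
top_indices = sorted(range(len(tfidf_values)),
                     key=lambda i: tfidf_values[i],
                     reverse=True)[:3]
center_words = [words[i] for i in top_indices]
# center_words -> ['dog', 'fish', 'cat']
```

Sorting the *indices* rather than the scores themselves keeps the link between each score and its word, so the word list can be indexed directly.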

The paragraph vectors are generated by first training a Word2Vec model on the preprocessed text data. The Word2Vec model learns high-dimensional vector representations (embeddings) of words in the text corpus. Each word in the text corpus is represented by a dense vector of real numbers, which captures its semantic meaning based on its context.

Once the Word2Vec model is trained, a paragraph vector is generated by averaging the word vectors of the words in that paragraph. The resulting paragraph vectors are also dense, high-dimensional vectors that capture the semantic meaning of the paragraph.
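The averaging step can be sketched as follows. The three-dimensional toy embeddings are assumptions; in practice they would come from the trained Word2Vec model (e.g. Gensim's `model.wv[word]`):

```python
# Toy word embeddings standing in for trained Word2Vec vectors.
word_vectors = {
    "cats": [1.0, 0.0, 2.0],
    "chase": [0.0, 2.0, 0.0],
    "mice": [2.0, 4.0, 1.0],
}

def paragraph_vector(tokens, vectors):
    """Average the word vectors of the tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    dim = len(next(iter(vectors.values())))
    if not known:
        return [0.0] * dim  # no known words: fall back to a zero vector
    # Component-wise mean across all known word vectors.
    return [sum(component) / len(known) for component in zip(*known)]

vec = paragraph_vector(["cats", "chase", "mice"], word_vectors)
# vec -> [1.0, 2.0, 1.0]
```

Words without an embedding (out-of-vocabulary tokens) are simply skipped, which is a common choice when averaging embeddings.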
