Logic.core.word_embedding package#

Logic.core.word_embedding.fasttext_data_loader module#

class FastTextDataLoader(file_path)#

Bases: object

This class is designed to load and pre-process data for training a FastText model.

It takes the file path to a data source containing movie information (synopses, summaries, reviews, titles, genres) as input. The class provides methods to read the data into a pandas DataFrame, pre-process the text data, and create training data (features and labels).

create_train_data()#

Reads data using the read_data_to_df function, pre-processes the text data, and creates training data (features and labels).

Returns:

A tuple containing two NumPy arrays: X (preprocessed text data) and y (encoded genre labels).

Return type:

tuple
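The label-encoding half of this step can be sketched as follows; the toy texts, genre strings, and the integer mapping are illustrative stand-ins, not the loader's actual internals:

```python
import numpy as np

# Toy inputs standing in for the preprocessed text and the raw genre labels.
X = np.array(["a quiet drama", "space battle epic", "courtroom drama"])
genres = ["drama", "sci-fi", "drama"]

# Encode each genre as an integer index (what sklearn's LabelEncoder does).
classes = sorted(set(genres))                   # ['drama', 'sci-fi']
label_to_id = {g: i for i, g in enumerate(classes)}
y = np.array([label_to_id[g] for g in genres])  # [0, 1, 0]

print(X, y)  # the (features, labels) pair create_train_data returns
```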

read_data_to_df()#

Reads data from the specified file path and creates a pandas DataFrame containing movie information.

This method can use an IndexReader class to access the data based on document IDs. It extracts synopses, summaries, reviews, titles, and genres for each movie, then stores the extracted data in a pandas DataFrame with appropriate column names.

Returns:

A pandas DataFrame containing movie information (synopses, summaries, reviews, titles, genres).

Return type:

pd.DataFrame
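A minimal sketch of the DataFrame this method produces; the rows are hand-written placeholders for documents fetched via an IndexReader, and the exact column names are an assumption based on the fields listed above:

```python
import pandas as pd

# Hand-written rows standing in for documents retrieved by document ID.
rows = [
    {"synopsis": "a hacker discovers reality is simulated",
     "summaries": "short summary", "reviews": "great film",
     "title": "The Matrix", "genres": "sci-fi"},
    {"synopsis": "a jury deliberates a murder case",
     "summaries": "short summary", "reviews": "a classic",
     "title": "12 Angry Men", "genres": "drama"},
]
df = pd.DataFrame(rows)
print(df.columns.tolist())
```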

Logic.core.word_embedding.fasttext_model module#

class FastText(method='skipgram')#

Bases: object

A class used to train a FastText model and generate embeddings for text data.

method#

The training method for the FastText model.

Type:

str

model#

The trained FastText model.

Type:

fasttext.FastText._FastText

analogy(word1, word2, word3)#

Perform an analogy task: word1 is to word2 as word3 is to __.

Parameters:
  • word1 (str) – The first word in the analogy.

  • word2 (str) – The second word in the analogy.

  • word3 (str) – The third word in the analogy.

Returns:

The word that completes the analogy.

Return type:

str
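Analogies of this kind are typically solved with vector arithmetic: find the vocabulary word whose vector is closest to vec(word2) - vec(word1) + vec(word3), excluding the three inputs. A sketch over a hand-built toy vocabulary (the vectors are made up for illustration; real ones come from the trained model):

```python
import numpy as np

# Tiny hand-built embedding table; real vectors come from the fastText model.
vecs = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.0]),
}

def analogy(word1, word2, word3):
    # word1 : word2 :: word3 : ?  ->  nearest vector to v2 - v1 + v3,
    # excluding the three input words themselves.
    target = vecs[word2] - vecs[word1] + vecs[word3]
    candidates = {w: v for w, v in vecs.items() if w not in (word1, word2, word3)}
    return min(candidates, key=lambda w: np.linalg.norm(candidates[w] - target))

print(analogy("man", "king", "woman"))  # -> queen
```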

get_query_embedding(query)#

Generates an embedding for the given query.

Parameters:

query (str) – The query to generate an embedding for.

Returns:

The embedding for the query.

Return type:

np.ndarray
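A common way to build such a query embedding is to average the vectors of the query's tokens (the fastText library exposes a similar operation as `get_sentence_vector`). The idea can be sketched with a toy vector table; the table and tokenization here are illustrative assumptions:

```python
import numpy as np

# Toy word vectors; in practice these come from the trained fastText model.
word_vecs = {
    "sci": np.array([1.0, 0.0]),
    "fi":  np.array([0.0, 1.0]),
}

def get_query_embedding(query):
    # Mean of the known tokens' vectors; a zero vector if none are known.
    toks = [t for t in query.lower().split() if t in word_vecs]
    if not toks:
        return np.zeros(2)
    return np.mean([word_vecs[t] for t in toks], axis=0)

print(get_query_embedding("sci fi"))  # -> [0.5 0.5]
```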

load_model(path='FastText_model.bin')#

Loads the FastText model from a file.

Parameters:

path (str, optional) – The path to load the FastText model.

prepare(dataset, mode, save=False, path='FastText_model.bin')#

Prepares the FastText model.

Parameters:
  • dataset (list of str) – The dataset to train the FastText model.

  • mode (str) – The mode to prepare the FastText model.

  • save (bool, optional) – Whether to save the model after preparing it.

  • path (str, optional) – The path to save the FastText model.

save_model(path='FastText_model.bin')#

Saves the FastText model to a file.

Parameters:

path (str, optional) – The path to save the FastText model.

train(texts)#

Trains the FastText model with the given texts.

Parameters:

texts (list of str) – The texts to train the FastText model.

preprocess_text(text, minimum_length=1, stopword_removal=True, stopwords_domain=[], lower_case=True, punctuation_removal=True)#

Preprocesses text by removing stopwords and punctuation, converting to lowercase, and filtering out tokens shorter than a minimum length. Uses nltk.corpus.stopwords.words('english') for stopwords and string.punctuation for punctuation.

Parameters:
  • text (str) – The text to be preprocessed.

  • minimum_length (int) – The minimum length of a token to keep.

  • stopword_removal (bool) – Whether to remove stopwords.

  • stopwords_domain (list) – Domain-specific stopwords to remove.

  • lower_case (bool) – Whether to convert the text to lowercase.

  • punctuation_removal (bool) – Whether to remove punctuation.
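A minimal sketch of this pipeline; the inline stopword set below stands in for nltk.corpus.stopwords.words('english'), and the length filter keeps tokens of at least minimum_length characters:

```python
import string

# Tiny stand-in for nltk's English stopword list.
STOPWORDS = {"the", "a", "an", "is", "and"}

def preprocess_text(text, minimum_length=1, stopword_removal=True,
                    stopwords_domain=[], lower_case=True,
                    punctuation_removal=True):
    if lower_case:
        text = text.lower()
    if punctuation_removal:
        text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    if stopword_removal:
        tokens = [t for t in tokens
                  if t not in STOPWORDS and t not in stopwords_domain]
    # Keep only tokens at least minimum_length characters long.
    tokens = [t for t in tokens if len(t) >= minimum_length]
    return " ".join(tokens)

print(preprocess_text("The Matrix is a sci-fi classic!"))  # -> matrix scifi classic
```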