Logic.core.word_embedding package#
Logic.core.word_embedding.fasttext_data_loader module#
- class FastTextDataLoader(file_path)#
Bases:
object
This class is designed to load and pre-process data for training a FastText model.
It takes the file path to a data source containing movie information (synopses, summaries, reviews, titles, genres) as input. The class provides methods to read the data into a pandas DataFrame, pre-process the text data, and create training data (features and labels).
- create_train_data()#
Reads data using the read_data_to_df function, pre-processes the text data, and creates training data (features and labels).
- Returns:
A tuple containing two NumPy arrays: X (preprocessed text data) and y (encoded genre labels).
- Return type:
tuple
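The label-encoding half of this step can be sketched in plain Python. The texts, genres, and mapping below are illustrative stand-ins, not the actual dataset schema or the exact encoder the class uses:

```python
# Minimal sketch of the label-encoding step inside create_train_data.
# Genre strings are mapped to integer labels, similar to what a
# scikit-learn LabelEncoder would produce.

def encode_labels(genres):
    """Map genre strings to integer labels in a deterministic order."""
    classes = sorted(set(genres))                 # stable class ordering
    mapping = {g: i for i, g in enumerate(classes)}
    return [mapping[g] for g in genres], mapping

# Hypothetical preprocessed texts and their genre labels.
texts = ["a space adventure", "a quiet love story", "laser battles"]
genres = ["sci-fi", "romance", "sci-fi"]

y, mapping = encode_labels(genres)
print(y)        # [1, 0, 1] -- integer labels aligned with texts
print(mapping)  # {'romance': 0, 'sci-fi': 1}
```

The resulting `texts` and `y` would then be converted to NumPy arrays to form the (X, y) training pair.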
- read_data_to_df()#
Reads data from the specified file path and creates a pandas DataFrame containing movie information.
You can use an IndexReader class to access the data based on document IDs. It extracts synopses, summaries, reviews, titles, and genres for each movie. The extracted data is then stored in a pandas DataFrame with appropriate column names.
- Returns:
A pandas DataFrame containing movie information (synopses, summaries, reviews, titles, genres).
- Return type:
pd.DataFrame
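The row-assembly logic can be sketched without pandas. Here a plain dict of documents stands in for the IndexReader, and the field names are illustrative; the real class and column names may differ:

```python
# Sketch of how read_data_to_df might gather per-document fields into
# column-oriented lists, ready to pass to pd.DataFrame(rows).

documents = {  # stand-in for data accessed via an IndexReader by doc ID
    "m1": {"title": "Arrival", "genres": ["sci-fi"],
           "summaries": ["aliens land"], "synopsis": ["..."], "reviews": []},
    "m2": {"title": "Heat", "genres": ["crime"],
           "summaries": ["a heist"], "synopsis": ["..."], "reviews": []},
}

fields = ("title", "genres", "summaries", "synopsis", "reviews")
rows = {field: [] for field in fields}
for doc_id in sorted(documents):              # fixed document order
    doc = documents[doc_id]
    for field in fields:
        rows[field].append(doc.get(field))

print(rows["title"])  # ['Arrival', 'Heat']
```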
Logic.core.word_embedding.fasttext_model module#
- class FastText(method='skipgram')#
Bases:
object
A class used to train a FastText model and generate embeddings for text data.
- model#
The trained FastText model.
- Type:
fasttext.FastText._FastText
- analogy(word1, word2, word3)#
Perform an analogy task: word1 is to word2 as word3 is to __.
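The standard way to answer such an analogy is vector arithmetic: find the word whose embedding is closest (by cosine similarity) to `v(word2) - v(word1) + v(word3)`, excluding the three input words. A toy sketch with hand-made 2-d vectors (real FastText vectors are learned and much higher-dimensional):

```python
import math

# Toy word vectors for illustration only.
vectors = {
    "king":  [1.0, 1.0],
    "man":   [1.0, 0.0],
    "woman": [0.0, 0.0],
    "queen": [0.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def analogy(word1, word2, word3):
    """word1 is to word2 as word3 is to the returned word."""
    v1, v2, v3 = vectors[word1], vectors[word2], vectors[word3]
    target = [b - a + c for a, b, c in zip(v1, v2, v3)]
    candidates = {w: v for w, v in vectors.items()
                  if w not in (word1, word2, word3)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(analogy("man", "king", "woman"))  # queen
```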
- get_query_embedding(query)#
Generates an embedding for the given query.
- Parameters:
query (str) – The query text to embed.
- Returns:
The embedding for the query.
- Return type:
np.ndarray
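One common way to embed a query is to average the vectors of its tokens (fasttext's own `get_sentence_vector` does something similar, with normalization). A self-contained sketch with a tiny illustrative vocabulary:

```python
# Query embedding as the mean of known token vectors.
# The vocabulary here is a stand-in for a trained FastText model.

word_vectors = {
    "space":  [1.0, 0.0, 0.0],
    "movie":  [0.0, 1.0, 0.0],
    "review": [0.0, 0.0, 1.0],
}

def get_query_embedding(query, dim=3):
    tokens = query.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:                     # no known tokens: zero vector
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

print(get_query_embedding("space movie"))  # [0.5, 0.5, 0.0]
```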
- load_model(path='FastText_model.bin')#
Loads the FastText model from a file.
- Parameters:
path (str, optional) – The path to load the FastText model.
- prepare(dataset, mode, save=False, path='FastText_model.bin')#
Prepares the FastText model.
- preprocess_text(text, minimum_length=1, stopword_removal=True, stopwords_domain=[], lower_case=True, punctuation_removal=True)#
Preprocesses text by removing stopwords and punctuation, converting to lowercase, and filtering out tokens shorter than a minimum length. For stopwords, use nltk.corpus.stopwords.words('english'); for punctuation, use string.punctuation.
- Parameters:
text (str) – text to be preprocessed
minimum_length (int) – minimum length of the token
stopword_removal (bool) – whether to remove stopwords
stopwords_domain (list) – list of domain-specific stopwords to be removed
lower_case (bool) – whether to convert to lowercase
punctuation_removal (bool) – whether to remove punctuations
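A minimal sketch of this method, using a small hardcoded stopword set so the example runs without nltk (the method itself uses nltk.corpus.stopwords.words('english')):

```python
import string

# Illustrative subset; the real implementation uses NLTK's full list.
ENGLISH_STOPWORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

def preprocess_text(text, minimum_length=1, stopword_removal=True,
                    stopwords_domain=(), lower_case=True,
                    punctuation_removal=True):
    if lower_case:
        text = text.lower()
    if punctuation_removal:
        # strip all ASCII punctuation characters
        text = text.translate(str.maketrans("", "", string.punctuation))
    stopwords = set(w.lower() for w in stopwords_domain)
    if stopword_removal:
        stopwords |= ENGLISH_STOPWORDS
    tokens = [t for t in text.split()
              if len(t) >= minimum_length and t not in stopwords]
    return " ".join(tokens)

print(preprocess_text("The Movie, a review!"))  # "movie review"
```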