Logic.core.indexer package#
Logic.core.indexer.LSH module#
- class MinHashLSH(documents, num_hashes)#
 Bases: object
- build_characteristic_matrix()#
 Build the characteristic matrix representing the presence of shingles in documents.
- Returns:
 The binary characteristic matrix.
- Return type:
 
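 A minimal sketch of how such a characteristic matrix could be built, assuming documents are plain strings and shingles are overlapping runs of `k` words. The `shingle` helper, the `k=2` default, and the list-of-lists layout (rows are shingles, columns are documents) are illustrative assumptions, not the package's actual implementation.

```python
def shingle(text, k=2):
    """Return the set of k-word shingles of a text (k=2 is an assumed default)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def build_characteristic_matrix(documents, k=2):
    """Build a binary matrix: rows are shingles, columns are documents."""
    shingle_sets = [shingle(doc, k) for doc in documents]
    # Sort the vocabulary so the row order is deterministic.
    vocabulary = sorted(set().union(*shingle_sets))
    row_of = {s: i for i, s in enumerate(vocabulary)}
    matrix = [[0] * len(documents) for _ in vocabulary]
    for col, shingles in enumerate(shingle_sets):
        for s in shingles:
            matrix[row_of[s]][col] = 1  # shingle s occurs in document col
    return matrix, vocabulary


matrix, vocabulary = build_characteristic_matrix(["a b c", "b c d"])
```

 Here the shared shingle "b c" is the only row with a 1 in both columns.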
- jaccard_score(first_set, second_set)#
 Calculate the Jaccard score for two sets.
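 The Jaccard score is the size of the intersection divided by the size of the union. A minimal sketch follows; returning 0.0 for two empty sets is an assumed convention, since the documentation does not say how that case is handled.

```python
def jaccard_score(first_set, second_set):
    """Return |A & B| / |A | B| for two collections treated as sets."""
    first_set, second_set = set(first_set), set(second_set)
    union = first_set | second_set
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(first_set & second_set) / len(union)


score = jaccard_score({1, 2, 3}, {2, 3, 4})  # 2 shared items out of 4 total
```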
- jaccard_similarity_test(buckets, all_documents)#
 Test your near-duplicate detection code based on Jaccard similarity.
- lsh_buckets(signature, bands=10, rows_per_band=10)#
 Group documents into Locality-Sensitive Hashing (LSH) buckets based on Min-Hash signatures.
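 The banding step can be sketched as follows, assuming the signature is a list of rows with one value per document. Keying each bucket by the pair (band number, band values) is an illustrative choice; the real method may derive bucket IDs differently, for example by hashing each band.

```python
from collections import defaultdict


def lsh_buckets(signature, bands=10, rows_per_band=10):
    """Group document indices whose signatures agree on a whole band."""
    num_docs = len(signature[0]) if signature else 0
    buckets = defaultdict(list)
    for band in range(bands):
        start = band * rows_per_band
        rows = signature[start:start + rows_per_band]
        if not rows:
            break  # the signature has fewer rows than bands * rows_per_band
        for doc in range(num_docs):
            # Documents with identical values in this band share a bucket.
            key = (band, tuple(row[doc] for row in rows))
            buckets[key].append(doc)
    return dict(buckets)


signature = [[1, 1, 2], [3, 3, 4], [5, 6, 5], [7, 8, 7]]
buckets = lsh_buckets(signature, bands=2, rows_per_band=2)
```

 Documents 0 and 1 collide in the first band and documents 0 and 2 in the second, so both pairs become candidates for a full Jaccard comparison.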
- min_hash_signature()#
 Perform Min-Hashing to generate hash signatures for documents.
- Returns:
 The Min-Hash signatures matrix.
- Return type:
 
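 A minimal sketch of Min-Hashing over a binary characteristic matrix (rows are shingles, columns are documents). It swaps true random row permutations for hash functions of the form (a * row + b) mod p, a common approximation. The `seed` parameter and the choice of prime are assumptions made for the example, not details taken from the package.

```python
import random


def min_hash_signature(characteristic_matrix, num_hashes, seed=0):
    """Return a num_hashes x num_docs matrix of minimum hash values."""
    num_rows = len(characteristic_matrix)
    num_docs = len(characteristic_matrix[0]) if num_rows else 0
    prime = 2_147_483_647  # 2**31 - 1, assumed larger than the row count
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime))
              for _ in range(num_hashes)]
    signature = [[float("inf")] * num_docs for _ in range(num_hashes)]
    for row in range(num_rows):
        hashes = [(a * row + b) % prime for a, b in coeffs]
        for doc in range(num_docs):
            if characteristic_matrix[row][doc]:
                # Keep the smallest hash over the rows present in this document.
                for h, value in enumerate(hashes):
                    if value < signature[h][doc]:
                        signature[h][doc] = value
    return signature


matrix = [[1, 1, 0], [0, 0, 1], [1, 1, 0]]
signature = min_hash_signature(matrix, num_hashes=5)
```

 Documents 0 and 1 contain the same shingles, so their signature columns are identical; document 2 shares none, so its column differs in every row.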
- perform_lsh()#
 Perform the entire Locality-Sensitive Hashing (LSH) process.
- Returns:
 A dictionary mapping bucket IDs to lists of document indices.
- Return type:
 dict
Logic.core.indexer.document_lengths_index module#
Logic.core.indexer.index module#
- class Index(preprocessed_documents: list)#
 Bases: object
- add_document_to_index(document: dict)#
 Add a document to all the indexes.
- Parameters:
 document (dict) – Document to add to all the indexes
- check_add_remove_is_correct()#
 Check that adding and removing a document works correctly.
- check_if_index_loaded_correctly(index_type: str, loaded_index: dict)#
 Check if the index is loaded correctly.
- check_if_indexing_is_good(index_type: str, check_word: str = 'good')#
 Checks if the indexing is good. Do not change this function. You can use this function to check if your indexing is correct.
- check_if_key_exists(index_before_add, index, key)#
 
- delete_dummy_keys(index_before_add, index, key)#
 
- index_documents()#
 Index the documents based on the document ID. In other words, create a dictionary where the key is the document ID and the value is the document.
- Returns:
 The index of the documents based on the document ID.
- Return type:
 dict
- index_genres()#
 Index the documents based on the genres.
- Returns:
 The index of the documents based on the genres. You should also store each term's tf in each document, so the index type is: {term: {document_id: tf}}
- Return type:
 dict
- index_stars()#
 Index the documents based on the stars.
- Returns:
 The index of the documents based on the stars. You should also store each term's tf in each document, so the index type is: {term: {document_id: tf}}
- Return type:
 dict
- index_summaries()#
 Index the documents based on the summaries (not first_page_summary).
- Returns:
 The index of the documents based on the summaries. You should also store each term's tf in each document, so the index type is: {term: {document_id: tf}}
- Return type:
 dict
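 The {term: {document_id: tf}} structure shared by the genre, star, and summary indexes can be sketched with one generic helper. The `index_field` function and the document layout (an "id" key plus a list of terms per field) are assumptions made for illustration; the actual preprocessed documents may be shaped differently.

```python
from collections import Counter, defaultdict


def index_field(documents, field):
    """Build {term: {document_id: tf}} for one field of the documents."""
    index = defaultdict(dict)
    for document in documents:
        terms = document.get(field) or []  # tolerate missing or None fields
        for term, tf in Counter(terms).items():
            index[term][document["id"]] = tf
    return dict(index)


documents = [
    {"id": "tt1", "genres": ["drama", "crime", "drama"]},
    {"id": "tt2", "genres": ["crime"]},
    {"id": "tt3", "genres": None},
]
genres_index = index_field(documents, "genres")
```

 A document with a missing or empty field simply contributes no postings.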
- load_index(path: str)#
 Loads the index from a file (such as a JSON file).
- Parameters:
 path (str) – Path of the file to load
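 A minimal sketch of loading an index, assuming it was stored as JSON. The `store_index` helper and the file name `genres_index.json` are hypothetical and exist only to make the round trip runnable.

```python
import json
import os
import tempfile


def store_index(index, path):
    """Write an index dictionary to a JSON file (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(path):
    """Read an index dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "genres_index.json")
    store_index({"drama": {"tt1": 2}}, index_path)
    loaded = load_index(index_path)
```

 JSON object keys are always strings, so integer document IDs would come back as strings after loading.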
Logic.core.indexer.index_reader module#
Logic.core.indexer.indexes_enum module#
Logic.core.indexer.metadata_index module#
- class Metadata_index(path='index/')#
 Bases: object
- create_metadata_index()#
 Creates the metadata index.
- get_average_document_field_length(where)#
 Returns the average field length over all documents in the index.
- Parameters:
 where (str) – The field to get the document lengths for.
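 A minimal sketch of the computation, assuming each document stores the field named by `where` as a list of terms. The standalone function and the document layout are illustrative assumptions; returning 0.0 for an empty collection is likewise an assumed convention.

```python
def average_field_length(documents, where):
    """Return the mean number of terms in the `where` field per document."""
    if not documents:
        return 0.0  # assumed convention for an empty collection
    total = sum(len(document.get(where) or []) for document in documents)
    return total / len(documents)


documents = [{"stars": ["a", "b"]}, {"stars": ["c"]}, {}]
average = average_field_length(documents, "stars")  # (2 + 1 + 0) / 3
```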
- read_documents()#
 Reads the documents.
Logic.core.indexer.tiered_index module#
- class Tiered_index(path='index/')#
 Bases: object
- convert_to_tiered_index(first_tier_threshold: int, second_tier_threshold: int, index_name)#
 Convert the current index to a tiered index.
- Parameters:
 first_tier_threshold (int)
 second_tier_threshold (int)
 index_name
- Returns:
 The tiered index, with the structure {"first_tier": dict, "second_tier": dict, "third_tier": dict}
- Return type:
 dict
- store_tiered_index(path, index_name)#
 Stores the tiered index to a file.
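 The two Tiered_index steps, converting and storing, can be sketched together. This assumes the input index has the form {term: {document_id: tf}}, that a posting goes to the first tier when its tf is at least `first_tier_threshold`, to the second tier when it is at least `second_tier_threshold`, and to the third tier otherwise. The comparison rule, the standalone functions, and the output file name are all assumptions made for illustration.

```python
import json
import os
import tempfile


def convert_to_tiered_index(index, first_tier_threshold, second_tier_threshold):
    """Split postings into three tiers by term frequency (assumed rule)."""
    tiers = {"first_tier": {}, "second_tier": {}, "third_tier": {}}
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            if tf >= first_tier_threshold:
                tier = "first_tier"
            elif tf >= second_tier_threshold:
                tier = "second_tier"
            else:
                tier = "third_tier"
            tiers[tier].setdefault(term, {})[doc_id] = tf
    return tiers


def store_tiered_index(tiered_index, path, index_name):
    """Write the tiered index as JSON; the file name is hypothetical."""
    file_path = os.path.join(path, index_name + "_tiered_index.json")
    with open(file_path, "w", encoding="utf-8") as f:
        json.dump(tiered_index, f)
    return file_path


index = {"good": {"d1": 5, "d2": 2, "d3": 1}}
tiered = convert_to_tiered_index(index, first_tier_threshold=4,
                                 second_tier_threshold=2)
with tempfile.TemporaryDirectory() as tmp:
    stored_path = store_tiered_index(tiered, tmp, "summaries")
    with open(stored_path, "r", encoding="utf-8") as f:
        reloaded = json.load(f)
```

 Keeping the tf values inside each tier lets a search consult the first tier alone and fall back to lower tiers only when it needs more results.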