Logic.core.indexer package#
Logic.core.indexer.LSH module#
- class MinHashLSH(documents, num_hashes)#
Bases:
object
- build_characteristic_matrix()#
Build the characteristic matrix representing the presence of shingles in documents.
- Returns:
The binary characteristic matrix.
- Return type:
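As a rough illustration of what a characteristic matrix looks like, the sketch below builds one from word-level shingles. The helper `shingle`, the shingle size `k=2`, and the list-of-lists layout (rows are shingles, columns are documents) are assumptions made for this example, not details confirmed by the module.

```python
def shingle(text, k=2):
    """Return the set of k-word shingles in a text (k=2 is an illustrative choice)."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def build_characteristic_matrix(documents, k=2):
    """Build a binary matrix: one row per shingle, one column per document."""
    shingle_sets = [shingle(doc, k) for doc in documents]
    # Sort the vocabulary so that row order is deterministic.
    vocabulary = sorted(set().union(*shingle_sets))
    matrix = [
        [1 if s in shingles else 0 for shingles in shingle_sets]
        for s in vocabulary
    ]
    return vocabulary, matrix


vocabulary, matrix = build_characteristic_matrix(["a b c", "b c d"])
```

Here the shared shingle `"b c"` is the only row with a 1 in both columns.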
- jaccard_score(first_set, second_set)#
Calculate the Jaccard score for two sets.
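The Jaccard score is the size of the intersection divided by the size of the union. A minimal sketch follows; returning 0.0 for two empty sets is a convention chosen here, not something the module specifies.

```python
def jaccard_score(first_set, second_set):
    """Return |A ∩ B| / |A ∪ B| for two collections treated as sets."""
    first_set, second_set = set(first_set), set(second_set)
    union = first_set | second_set
    if not union:
        # Assumption: two empty sets are scored as 0.0 to avoid dividing by zero.
        return 0.0
    return len(first_set & second_set) / len(union)


score = jaccard_score({1, 2, 3}, {2, 3, 4})  # 2 shared items out of 4 distinct items
```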
- jaccard_similarity_test(buckets, all_documents)#
Test your near-duplicate detection code based on Jaccard similarity.
- lsh_buckets(signature, bands=10, rows_per_band=10)#
Group documents into Locality-Sensitive Hashing (LSH) buckets based on Min-Hash signatures.
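One common way to do this is the banding technique: split the signature into `bands` groups of `rows_per_band` rows and place documents that agree on every row of a band in the same bucket. The sketch below assumes the signature is a list of rows (one row per hash function, one column per document) and uses a `(band, values)` tuple as the bucket ID. Both are illustrative choices, not the module's confirmed layout.

```python
from collections import defaultdict


def lsh_buckets(signature, bands=10, rows_per_band=10):
    """Group document indices by their signature slice within each band."""
    num_docs = len(signature[0]) if signature else 0
    buckets = defaultdict(list)
    for band in range(bands):
        rows = signature[band * rows_per_band:(band + 1) * rows_per_band]
        for doc in range(num_docs):
            # Documents with identical values in this band share a bucket.
            key = (band, tuple(row[doc] for row in rows))
            buckets[key].append(doc)
    return dict(buckets)


signature = [
    [1, 1, 5],
    [2, 2, 6],
    [3, 3, 7],
    [4, 9, 8],
]
buckets = lsh_buckets(signature, bands=2, rows_per_band=2)
```

Documents 0 and 1 agree on the first band, so they become a candidate pair even though they differ in the second band.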
- min_hash_signature()#
Perform Min-Hashing to generate hash signatures for documents.
- Returns:
The Min-Hash signatures matrix.
- Return type:
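To make the idea concrete, the sketch below computes Min-Hash signatures directly from shingle sets, using explicit random permutations of the shingle vocabulary in place of hash functions. The function signature, the `seed` parameter, and the list-of-rows output are assumptions for illustration. The class method itself takes no arguments and presumably works from the characteristic matrix.

```python
import random


def min_hash_signature(shingle_sets, num_hashes, seed=0):
    """Return num_hashes rows, each holding one min-hash value per document."""
    vocabulary = sorted(set().union(*shingle_sets))
    rng = random.Random(seed)
    signature = []
    for _ in range(num_hashes):
        # One random permutation of shingle ranks stands in for one hash function.
        order = list(range(len(vocabulary)))
        rng.shuffle(order)
        rank = {s: order[i] for i, s in enumerate(vocabulary)}
        signature.append([
            min((rank[s] for s in shingles), default=len(vocabulary))
            for shingles in shingle_sets
        ])
    return signature


sig = min_hash_signature([{"a", "b"}, {"a", "b"}, {"c"}], num_hashes=5)
```

Identical sets always receive identical signature columns. For two different sets, the fraction of rows on which they agree estimates their Jaccard similarity.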
- perform_lsh()#
Perform the entire Locality-Sensitive Hashing (LSH) process.
- Returns:
A dictionary mapping bucket IDs to lists of document indices.
- Return type:
Logic.core.indexer.document_lengths_index module#
Logic.core.indexer.index module#
- class Index(preprocessed_documents: list)#
Bases:
object
- add_document_to_index(document: dict)#
Add a document to all the indexes.
- Parameters:
document (dict) – Document to add to all the indexes
- check_add_remove_is_correct()#
Check whether adding and removing documents works correctly.
- check_if_index_loaded_correctly(index_type: str, loaded_index: dict)#
Check whether the index is loaded correctly.
- check_if_indexing_is_good(index_type: str, check_word: str = 'good')#
Checks if the indexing is good. Do not change this function. You can use this function to check if your indexing is correct.
- check_if_key_exists(index_before_add, index, key)#
- delete_dummy_keys(index_before_add, index, key)#
- index_documents()#
Index the documents based on the document ID. In other words, create a dictionary where the key is the document ID and the value is the document.
- Returns:
The index of the documents based on the document ID.
- Return type:
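A minimal sketch of such a mapping, assuming each preprocessed document is a dict with an `"id"` key (the key name is an assumption for this example):

```python
def index_documents(preprocessed_documents):
    """Map each document ID to the document itself."""
    # Assumption: every document carries its ID under the "id" key.
    return {doc["id"]: doc for doc in preprocessed_documents}


docs = [{"id": "tt1", "title": "A"}, {"id": "tt2", "title": "B"}]
documents_index = index_documents(docs)
```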
- index_genres()#
Index the documents based on the genres.
- Returns:
The index of the documents based on the genres. You should also store each term's tf in each document, so the index type is: {term: {document_id: tf}}
- Return type:
- index_stars()#
Index the documents based on the stars.
- Returns:
The index of the documents based on the stars. You should also store each term's tf in each document, so the index type is: {term: {document_id: tf}}
- Return type:
- index_summaries()#
Index the documents based on the summaries (not first_page_summary).
- Returns:
The index of the documents based on the summaries. You should also store each term's tf in each document, so the index type is: {term: {document_id: tf}}
- Return type:
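The genres, stars, and summaries indexes all share the {term: {document_id: tf}} shape, so one helper can sketch all three. The name `index_field`, the `"id"` key, and the assumption that each field holds a list of strings tokenized on whitespace are illustrative, not taken from the module.

```python
from collections import Counter


def index_field(documents, field):
    """Build {term: {document_id: tf}} for one field of every document."""
    index = {}
    for doc in documents:
        terms = []
        # Assumption: the field is a list of strings; a missing field counts as empty.
        for item in doc.get(field) or []:
            terms.extend(str(item).split())
        for term, tf in Counter(terms).items():
            index.setdefault(term, {})[doc["id"]] = tf
    return index


docs = [
    {"id": "tt1", "summaries": ["good movie good"]},
    {"id": "tt2", "summaries": ["good plot"]},
]
summaries_index = index_field(docs, "summaries")
```

With this input, `summaries_index["good"]` maps `tt1` to 2 and `tt2` to 1.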
- load_index(path: str)#
Loads the index from a file (such as a JSON file)
- Parameters:
path (str) – Path to load the file
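If the index is stored as JSON, loading it can be as small as the sketch below. The `save_index` helper and the file name are hypothetical and appear only to make the round trip runnable. Note that JSON turns all dictionary keys into strings, so non-string document IDs would not survive unchanged.

```python
import json


def save_index(index, path):
    """Hypothetical helper: write an index dict to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(path):
    """Read an index dict back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```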
Logic.core.indexer.index_reader module#
Logic.core.indexer.indexes_enum module#
Logic.core.indexer.metadata_index module#
- class Metadata_index(path='index/')#
Bases:
object
- create_metadata_index()#
Creates the metadata index.
- get_average_document_field_length(where)#
Returns the average length of the given field across all documents in the index.
- Parameters:
where (str) – The field to get the document lengths for.
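As a sketch of the computation, the function below sums the field lengths and divides by the number of documents. It assumes the documents are a dict keyed by ID, that the field is a list of strings, and that length is counted in whitespace-separated tokens. None of these details are confirmed by the module.

```python
def get_average_document_field_length(documents, where):
    """Average number of tokens in the `where` field over all documents."""
    if not documents:
        return 0.0
    total = 0
    for doc in documents.values():
        # Assumption: the field is a list of strings; a missing field has length 0.
        total += sum(len(str(item).split()) for item in doc.get(where) or [])
    return total / len(documents)


documents = {
    "tt1": {"stars": ["tom hanks", "meg ryan"]},
    "tt2": {"stars": ["al pacino"]},
}
average = get_average_document_field_length(documents, "stars")  # (4 + 2) / 2
```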
- read_documents()#
Reads the documents.
Logic.core.indexer.tiered_index module#
- class Tiered_index(path='index/')#
Bases:
object
- convert_to_tiered_index(first_tier_threshold: int, second_tier_threshold: int, index_name)#
Convert the current index to a tiered index.
- Parameters:
- Returns:
The tiered index with the structure {"first_tier": dict, "second_tier": dict, "third_tier": dict}
- Return type:
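The documentation does not say how the thresholds are applied. One plausible reading, sketched below, is that a posting goes to the first tier when its tf is at least `first_tier_threshold`, to the second tier when it is at least `second_tier_threshold`, and to the third tier otherwise. This rule, and passing the index in as an argument instead of `index_name`, are assumptions for illustration.

```python
def convert_to_tiered_index(index, first_tier_threshold, second_tier_threshold):
    """Split a {term: {document_id: tf}} index into three tiers by tf."""
    tiers = {"first_tier": {}, "second_tier": {}, "third_tier": {}}
    for term, postings in index.items():
        for doc_id, tf in postings.items():
            # Assumed rule: higher tf means a higher (earlier) tier.
            if tf >= first_tier_threshold:
                tier = "first_tier"
            elif tf >= second_tier_threshold:
                tier = "second_tier"
            else:
                tier = "third_tier"
            tiers[tier].setdefault(term, {})[doc_id] = tf
    return tiers


index = {"good": {"d1": 5, "d2": 2, "d3": 1}}
tiered = convert_to_tiered_index(index, first_tier_threshold=3, second_tier_threshold=2)
```

Under this reading every posting lands in exactly one tier, so a query can be answered from the first tier and fall back to lower tiers only when too few results are found.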
- store_tiered_index(path, index_name)#
Stores the tiered index to a file.