swh.indexer.metadata module
- swh.indexer.metadata.call_with_batches(f: Callable[[List[T1]], Iterable[T2]], args: List[T1], batch_size: int) → Iterator[T2]
- Calls the function f on successive batches of args, each holding up to batch_size items, and concatenates the results.
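The batching behaviour can be illustrated with a minimal sketch. This is not the library's implementation: `call_with_batches_sketch` is a hypothetical stand-in that follows the documented signature, and it assumes the last batch may be shorter than `batch_size`.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T1 = TypeVar("T1")
T2 = TypeVar("T2")


def call_with_batches_sketch(
    f: Callable[[List[T1]], Iterable[T2]],
    args: List[T1],
    batch_size: int,
) -> Iterator[T2]:
    """Hypothetical re-implementation, for illustration only."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    it = iter(args)
    while True:
        # Take the next slice of at most batch_size items.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        # Results of each call are yielded in order, so the caller sees
        # one flat stream rather than one iterable per batch.
        yield from f(batch)


calls = []


def double(batch):
    calls.append(list(batch))  # record the batch boundaries
    return [x * 2 for x in batch]


results = list(call_with_batches_sketch(double, [1, 2, 3, 4, 5], batch_size=2))
```

Because the sketch is a generator, `f` is only called as the caller consumes results, which can matter when `f` performs storage queries.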
- class swh.indexer.metadata.ExtrinsicMetadataIndexer(config=None, **kw)
- Bases: BaseIndexer[bytes, RawExtrinsicMetadata, OriginExtrinsicMetadataRow]
- Prepare and check that the indexer is ready to run.
- process_journal_objects(objects: ObjectsDict) → Dict
- Read swh message objects (content, origin, …) from the journal to:
- retrieve the associated objects from the storage backend (e.g. storage, objstorage…)
- execute the associated indexing computations 
- store the results in the indexer storage 
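The three steps listed above can be sketched with in-memory stand-ins. Everything here is hypothetical (the `fake_storage` and `fake_indexer_storage` dicts, the message shape, the summary key, and the trivial "length" computation); it only shows the retrieve, compute, store flow and is not the actual method.

```python
from typing import Dict, List

# Hypothetical stand-ins for the storage backend and the indexer storage.
fake_storage: Dict[str, bytes] = {"obj-1": b"hello", "obj-2": b"swh"}
fake_indexer_storage: Dict[str, dict] = {}


def process_journal_objects_sketch(objects: Dict[str, List[dict]]) -> Dict[str, int]:
    """Toy version of the documented flow; names and shapes are assumptions."""
    stored = 0
    for object_type, messages in objects.items():
        for message in messages:
            # 1. Retrieve the associated object from the storage backend.
            data = fake_storage.get(message["id"])
            if data is None:
                continue  # unknown object: nothing to index
            # 2. Execute the indexing computation (here, a trivial length count).
            row = {"id": message["id"], "type": object_type, "length": len(data)}
            # 3. Store the result in the indexer storage.
            fake_indexer_storage[message["id"]] = row
            stored += 1
    # Return a small summary dict, mirroring the documented Dict return value.
    return {"stored": stored}


summary = process_journal_objects_sketch(
    {"content": [{"id": "obj-1"}, {"id": "missing"}]}
)
```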
 
- index(id: bytes, data: RawExtrinsicMetadata | None, **kwargs) → List[OriginExtrinsicMetadataRow]
- Index computation for the id and associated raw data.
- Parameters:
- id – identifier or Dict object 
- data – id’s data from storage or objstorage depending on object type 
 
- Returns:
- a dict that makes sense for the persist_index_computations() method.
- Return type:
- List[OriginExtrinsicMetadataRow]
 
- class swh.indexer.metadata.ContentMetadataIndexer(config=None, **kw)
- Bases: ContentIndexer[ContentMetadataRow]
- Content-level indexer
- This indexer is in charge of:
- filtering out content already indexed in content_metadata
- reading content from objstorage with the content’s id sha1 
- computing metadata by given context 
- using the metadata_dictionary as the ‘swh-metadata-translator’ tool 
- storing the result in the content_metadata table
- Prepare and check that the indexer is ready to run.
- index(id: HashDict, data: bytes | None = None, log_suffix='unknown directory', **kwargs) → List[ContentMetadataRow]
- Index sha1s’ content and store result.
- Parameters:
- id – content’s identifier 
- data – raw content in bytes 
 
- Returns:
- dictionary representing a content_metadata. If the translation wasn’t successful, the metadata keys will be returned as None.
- Return type:
- List[ContentMetadataRow]
 
- class swh.indexer.metadata.DirectoryMetadataIndexer(*args, **kwargs)
- Bases: DirectoryIndexer[DirectoryIntrinsicMetadataRow]
- Directory-level indexer
- This indexer is in charge of:
- filtering directories already indexed in directory_intrinsic_metadata table with defined computation tool
- retrieve all entry_files in directory 
- use metadata_detector for file_names containing metadata 
- compute metadata translation if necessary and possible (depends on tool) 
- send sha1s to content indexing if possible 
- store the results for directory 
- Prepare and check that the indexer is ready to run.
- index(id: bytes, data: Directory | None = None, **kwargs) → List[DirectoryIntrinsicMetadataRow]
- Index directory by processing it and organizing result.
- Uses metadata_detector to iterate on filenames, passes them to the content indexers, then merges the results (if more than one).
- Parameters:
- id – sha1_git of the directory 
- data – should always be None 
 
- Returns:
- dictionary representing a directory_intrinsic_metadata, with keys:
- id: directory’s identifier (sha1_git)
- indexer_configuration_id (bytes): tool used 
- metadata: dict of retrieved metadata 
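The detect, translate, and merge flow and the resulting keys can be sketched as follows. This is a toy under stated assumptions: `KNOWN_METADATA_FILES`, `translate_file`, and `index_directory_sketch` are hypothetical names, the filename table stands in for metadata_detector, and the "first value wins" merge is a simplification, not the indexer's actual merge logic.

```python
import json
from typing import Any, Dict, Optional

# Hypothetical filename table standing in for metadata_detector.
KNOWN_METADATA_FILES = {b"package.json", b"codemeta.json"}


def translate_file(raw: bytes) -> Optional[Dict[str, Any]]:
    """Toy translation: parse JSON and keep only 'name' and 'version'."""
    try:
        doc = json.loads(raw)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(doc, dict):
        return None
    return {key: doc[key] for key in ("name", "version") if key in doc}


def index_directory_sketch(
    directory_id: bytes, files: Dict[bytes, bytes], tool_id: Any
) -> Dict[str, Any]:
    """Build a result with the three documented keys (illustrative only)."""
    merged: Dict[str, Any] = {}
    for filename, raw in sorted(files.items()):
        if filename not in KNOWN_METADATA_FILES:
            continue  # not a metadata file
        translated = translate_file(raw)
        if translated:
            # Naive merge: the first value seen for a key wins.
            for key, value in translated.items():
                merged.setdefault(key, value)
    return {
        "id": directory_id,
        "indexer_configuration_id": tool_id,  # opaque tool identifier
        "metadata": merged,
    }


row = index_directory_sketch(
    b"\x01" * 20,
    {
        b"README.md": b"# demo",
        b"codemeta.json": b'{"name": "demo"}',
        b"package.json": b'{"name": "demo-pkg", "version": "1.0.0"}',
    },
    tool_id="tool-1",
)
```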
 
- Return type:
- List[DirectoryIntrinsicMetadataRow]
- persist_index_computations(results: List[DirectoryIntrinsicMetadataRow]) → Dict[str, int]
- Persist the results in storage. 
- translate_directory_intrinsic_metadata(files: List[DirectoryLsEntry], log_suffix: str) → Tuple[List[Any], Any]
- Determine plan of action to translate metadata in the given root directory.
- Parameters:
- files – list of file entries, as returned by swh.storage.interface.StorageInterface.directory_ls()
- Returns:
- list of mappings used and dict with translated metadata according to the CodeMeta vocabulary 
- Return type:
- Tuple[List[Any], Any]
 
- class swh.indexer.metadata.OriginMetadataIndexer(config=None, **kwargs)
- Bases: OriginIndexer[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]
- Prepare and check that the indexer is ready to run.
- USE_TOOLS = False
 - index_list(origins: List[Origin], *, check_origin_known: bool = True, **kwargs) → List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]
 - persist_index_computations(results: List[Tuple[OriginIntrinsicMetadataRow, DirectoryIntrinsicMetadataRow]]) → Dict[str, int]
- Persist the computation resulting from the index.
- Parameters:
- results – List of results. One result is the result of the index function. 
- Returns:
- a summary dict of what has been inserted in the storage
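As a rough illustration of a summary dict with the documented Dict[str, int] shape, the sketch below counts rows per kind after deduplicating by identifier. The function name, the plain-dict rows, and the counter key names are all assumptions made for the example, not the indexer's actual keys or storage calls.

```python
from typing import Dict, List, Tuple


def persist_sketch(results: List[Tuple[dict, dict]]) -> Dict[str, int]:
    """Hypothetical persistence step returning insertion counters."""
    origin_rows: Dict[str, dict] = {}
    directory_rows: Dict[str, dict] = {}
    for origin_row, directory_row in results:
        # Keep one row per identifier: several origins may share a directory.
        origin_rows[origin_row["id"]] = origin_row
        directory_rows[directory_row["id"]] = directory_row
    summary: Dict[str, int] = {}
    if directory_rows:
        summary["directory_rows:add"] = len(directory_rows)  # illustrative key
    if origin_rows:
        summary["origin_rows:add"] = len(origin_rows)  # illustrative key
    return summary


summary = persist_sketch(
    [
        ({"id": "https://example.org/a"}, {"id": "dir-1"}),
        ({"id": "https://example.org/b"}, {"id": "dir-1"}),
    ]
)
```

Here two origins point at the same directory, so the sketch reports two origin rows but a single directory row.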