wdoc.utils.loaders package
Submodules
wdoc.utils.loaders.anki module
- wdoc.utils.loaders.anki.load_anki(verbose: bool, text_splitter: TextSplitter, loaders_temp_dir: Path, anki_profile: str | None = None, anki_deck: str | None = None, anki_notetype: str | None = None, anki_template: str | None = '{allfields}\n{image_ocr_alt}', anki_tag_filter: str | None = None, anki_tag_render_filter: str | None = None) -> list[Document]
- wdoc.utils.loaders.anki.replace_media(content: str, media: None | dict, mode: str, strict: bool = True, replace_image: bool = True, replace_links: bool = True, replace_sounds: bool = True) -> tuple[str, dict]
- Else: exclude any note whose content contains:
  - an image (<img…)
  - or a sound ([sound:…)
  - or a link (href / http)
- This is because:
  1. LLMs are non-deterministic, so I preferred to avoid the risk of botching the content.
  2. It costs fewer tokens.
The intended use is to call it first to replace each piece of media with a simple placeholder string such as [IMAGE_1], check that the placeholder is still present in the output of the LLM, then replace it back with the original media.
It uses both bs4 and regex to be sure no media is missed.
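The replace, check, and restore round trip described above can be illustrated with a minimal sketch. This is not wdoc's implementation: the helper names `mask_media` and `unmask_media`, the two regex patterns, and the placeholder numbering are all assumptions made for illustration, and the sketch uses only regex (no bs4) and handles only images and sounds.

```python
import re

# Hypothetical patterns (not taken from wdoc): HTML <img> tags and
# Anki-style [sound:...] markers.
PATTERNS = {
    "IMAGE": re.compile(r"<img\b[^>]*>", re.IGNORECASE),
    "SOUND": re.compile(r"\[sound:[^\]]+\]"),
}


def mask_media(content: str) -> tuple[str, dict]:
    """Replace each media snippet with a placeholder such as [IMAGE_1].

    Returns the masked text and a dict mapping placeholder -> original snippet.
    """
    media: dict = {}
    for label, pattern in PATTERNS.items():
        counter = 0

        def _sub(match: re.Match) -> str:
            nonlocal counter
            counter += 1
            key = f"[{label}_{counter}]"
            media[key] = match.group(0)
            return key

        content = pattern.sub(_sub, content)
    return content, media


def unmask_media(content: str, media: dict, strict: bool = True) -> str:
    """Put the original media back in place of each placeholder.

    With strict=True, a placeholder missing from the text (for example
    because an LLM dropped it) raises ValueError instead of being ignored.
    """
    for key, original in media.items():
        if key not in content:
            if strict:
                raise ValueError(f"placeholder {key} missing from text")
            continue
        content = content.replace(key, original)
    return content


note = 'Heart <img src="heart.png"> beats [sound:beat.mp3]'
masked, media = mask_media(note)
restored = unmask_media(masked, media)
```

Here `masked` is the text that would be sent to the LLM, and the strict check in `unmask_media` plays the role of verifying that every placeholder survived before the media is restored.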
wdoc.utils.loaders.epub module
wdoc.utils.loaders.json_dict module
wdoc.utils.loaders.local_audio module
wdoc.utils.loaders.local_html module
wdoc.utils.loaders.local_video module
wdoc.utils.loaders.logseq_markdown module
wdoc.utils.loaders.online_media module
wdoc.utils.loaders.online_pdf module
wdoc.utils.loaders.pdf module
wdoc.utils.loaders.powerpoint module
wdoc.utils.loaders.string module
wdoc.utils.loaders.text module
wdoc.utils.loaders.txt module
wdoc.utils.loaders.url module
wdoc.utils.loaders.word module
wdoc.utils.loaders.youtube module
Module contents
Called by the threads of batch_file_loader.py. Contains many cached functions, each of which loads one kind of document.