wdoc.utils.loaders package

Submodules

wdoc.utils.loaders.anki module

wdoc.utils.loaders.anki.cloze_stripper(clozed: str) → str

wdoc.utils.loaders.anki.load_anki(verbose: bool, text_splitter: TextSplitter, loaders_temp_dir: Path, anki_profile: str | None = None, anki_deck: str | None = None, anki_notetype: str | None = None, anki_template: str | None = '{allfields}\n{image_ocr_alt}', anki_tag_filter: str | None = None, anki_tag_render_filter: str | None = None) → list[Document]

wdoc.utils.loaders.anki.replace_media(content: str, media: None | dict, mode: str, strict: bool = True, replace_image: bool = True, replace_links: bool = True, replace_sounds: bool = True) → tuple[str, dict]
Else: exclude any note whose content contains:
  • an image (<img…)
  • a sound ([sound:…)
  • a link (href / http)

This is because:
  1. LLMs are non-deterministic, so I preferred to avoid the risk of botching the content.
  2. It costs fewer tokens.

The intended use is to call it first to replace each media item with a simple string like [IMAGE_1], check that this string is indeed present in the output of the LLM, then replace it back.

It uses both bs4 and regex to be sure of itself.
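
The placeholder round trip described above can be sketched as follows. This is a minimal illustration, not wdoc's implementation: `mask_media`, `unmask_media`, and `MEDIA_PATTERN` are hypothetical names, only regex is used (no bs4), and links are left out for brevity.

```python
import re

# Hypothetical pattern: <img ...> tags and Anki-style [sound:...] markers.
MEDIA_PATTERN = re.compile(r"<img\b[^>]*>|\[sound:[^\]]*\]")


def mask_media(content: str) -> tuple[str, dict[str, str]]:
    """Replace each media item with a placeholder and remember the mapping."""
    media: dict[str, str] = {}
    counters = {"IMAGE": 0, "SOUND": 0}

    def _swap(match: re.Match) -> str:
        kind = "IMAGE" if match.group(0).startswith("<img") else "SOUND"
        counters[kind] += 1
        placeholder = f"[{kind}_{counters[kind]}]"
        media[placeholder] = match.group(0)
        return placeholder

    return MEDIA_PATTERN.sub(_swap, content), media


def unmask_media(content: str, media: dict[str, str]) -> str:
    """Put the original media back, failing if a placeholder was dropped."""
    for placeholder, original in media.items():
        if placeholder not in content:
            raise ValueError(f"placeholder {placeholder} missing from output")
        content = content.replace(placeholder, original)
    return content


note = 'Heart anatomy <img src="heart.png"> [sound:beat.mp3]'
masked, media = mask_media(note)        # text that would be sent to the LLM
restored = unmask_media(masked, media)  # here the "LLM output" is unchanged
```

Raising when a placeholder is missing mirrors the check described above: if the LLM dropped or altered `[IMAGE_1]`, the original media cannot be restored safely.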

wdoc.utils.loaders.epub module

wdoc.utils.loaders.json_dict module

wdoc.utils.loaders.local_audio module

wdoc.utils.loaders.local_html module

wdoc.utils.loaders.local_video module

wdoc.utils.loaders.logseq_markdown module

wdoc.utils.loaders.online_media module

wdoc.utils.loaders.online_pdf module

wdoc.utils.loaders.pdf module

class wdoc.utils.loaders.pdf.OpenparseDocumentParser(path: str | Path, table_args: dict | None = {'parsing_algorithm': 'pymupdf', 'table_output_format': 'markdown'})

Bases: object

load() → list[Document]

wdoc.utils.loaders.pdf.load_pdf(path: str | Path, text_splitter: TextSplitter, file_hash: str, pdf_parsers: str | list[str] = 'pymupdf', doccheck_min_lang_prob: float = 0.5, doccheck_min_token: int = 20, doccheck_max_token: int = 10000000) → list[Document]

wdoc.utils.loaders.powerpoint module

wdoc.utils.loaders.shared module

wdoc.utils.loaders.shared.get_url_title(url: str) → str | None

Last-resort fallback, used when the loader did not return a title for the URL.

wdoc.utils.loaders.shared_audio module

wdoc.utils.loaders.shared_audio.convert_verbose_json_to_timestamped_text(transcript: dict) → str

wdoc.utils.loaders.shared_audio.is_timecode(inp: float | str) → bool

wdoc.utils.loaders.shared_audio.process_vtt_content_for_llm(vtt_content: str, remove_hour_prefix: bool = True) → str

Process VTT content to make it more suitable for LLMs by reducing timecodes and removing unnecessary formatting.

Parameters:
  • vtt_content – The VTT content to process

  • remove_hour_prefix – Whether to remove “00:” hour prefix if all content is under 99 minutes

Returns:

Processed text content optimized for LLM consumption
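
To make the idea concrete, here is a rough sketch of this kind of VTT simplification. It is not wdoc's code: `simplify_vtt` is a hypothetical name, timecodes are assumed to use the `hh:mm:ss.mmm` form, and the hour prefix is dropped only when every cue starts with `00:`, which is a simplification of the rule stated above.

```python
import re

# Cue timing line, assumed to look like "00:00:01.000 --> 00:00:03.500".
CUE_RE = re.compile(r"(\d{2}):(\d{2}):(\d{2})\.\d{3} --> \S+")
# Inline markup such as <c> or <00:00:01.500>.
TAG_RE = re.compile(r"<[^>]+>")


def simplify_vtt(vtt_content: str, remove_hour_prefix: bool = True) -> str:
    cues: list[tuple[str, str]] = []  # (start timecode, text)
    start = None
    for raw in vtt_content.splitlines():
        line = raw.strip()
        if not line:
            start = None  # a blank line ends the current cue
            continue
        match = CUE_RE.match(line)
        if match:
            start = ":".join(match.groups())  # keep only the start, no millis
            continue
        if start is None:
            continue  # header, cue identifier, or note outside a cue
        text = TAG_RE.sub("", line)
        if cues and cues[-1][0] == start:
            cues[-1] = (start, cues[-1][1] + " " + text)
        else:
            cues.append((start, text))
    if remove_hour_prefix and all(tc.startswith("00:") for tc, _ in cues):
        cues = [(tc[3:], text) for tc, text in cues]
    return "\n".join(f"{tc} {text}" for tc, text in cues)


sample = """WEBVTT

00:00:01.000 --> 00:00:03.500
<c>Hello</c> world

00:01:15.250 --> 00:01:17.000
Second line
"""
simplified = simplify_vtt(sample)
```

Each cue collapses to one line with a single short timecode, which removes the end times, milliseconds, and markup that an LLM does not need.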

wdoc.utils.loaders.shared_audio.seconds_to_timecode(inp: str | float | int) → str

Used for VTT subtitle conversion.

wdoc.utils.loaders.shared_audio.split_too_large_audio(audio_path: Path | str) → list[Path]

Whisper has a file size limit of about 25 MB. If the audio file exceeds that limit, it is split into multiple 30-minute files, and the outputs are then combined.
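
As an illustration of the splitting step only, the sketch below computes the boundaries of the 30-minute segments for a given duration. `chunk_boundaries` and `CHUNK_SECONDS` are hypothetical names; actually cutting the file requires an audio tool and is not shown here.

```python
CHUNK_SECONDS = 30 * 60  # 30-minute segments, as described above


def chunk_boundaries(
    duration: float, chunk: float = CHUNK_SECONDS
) -> list[tuple[float, float]]:
    """Return (start, end) pairs, in seconds, covering the whole duration."""
    bounds: list[tuple[float, float]] = []
    start = 0.0
    while start < duration:
        end = min(start + chunk, duration)  # the last segment may be shorter
        bounds.append((start, end))
        start = end
    return bounds


segments = chunk_boundaries(4000.0)
```

Each transcript produced from a segment would then need its timestamps shifted by the segment's start before the outputs are combined.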

wdoc.utils.loaders.shared_audio.timecode_to_second(inp: str) → int

Turns a VTT timecode into seconds.
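
The two conversions can be sketched as below. This is an assumption-laden illustration rather than wdoc's implementation: `to_timecode` and `to_seconds` are hypothetical names, the `hh:mm:ss` layout is assumed, and fractional seconds are truncated.

```python
def to_timecode(seconds: float) -> str:
    """Format a number of seconds as hh:mm:ss (fraction dropped)."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def to_seconds(timecode: str) -> int:
    """Parse hh:mm:ss, mm:ss, or ss (optionally with .mmm) into seconds."""
    seconds = 0
    for part in timecode.split(":"):
        # Each field to the right is worth 60 times less than the previous one.
        seconds = seconds * 60 + int(float(part))
    return seconds


timecode = to_timecode(3725.4)
seconds = to_seconds("01:02:05")
```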

wdoc.utils.loaders.string module

wdoc.utils.loaders.string.load_string() → list[Document]

wdoc.utils.loaders.text module

wdoc.utils.loaders.text.load_text(path: str, file_hash: str, metadata: str | dict | None = None) → list[Document]

wdoc.utils.loaders.txt module

wdoc.utils.loaders.txt.load_txt(path: str | Path, file_hash: str) → list[Document]

wdoc.utils.loaders.url module

wdoc.utils.loaders.url.md_shorten_image_name(md_image: Match) → str

Turns a markdown image link into just its name.
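
Because the function takes a `Match`, it is presumably meant to be passed as the replacement callback of `re.sub`. The sketch below shows that pattern under stated assumptions: `MD_IMAGE` and `shorten_image` are hypothetical, and the exact output format wdoc produces may differ.

```python
import re
from pathlib import PurePosixPath

# Hypothetical pattern for ![alt](target "optional title").
MD_IMAGE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)[^)]*\)")


def shorten_image(md_image: re.Match) -> str:
    """Keep the alt text but reduce the target to its file name."""
    alt, target = md_image.group(1), md_image.group(2)
    name = PurePosixPath(target.split("?")[0]).name  # drop any query string
    return f"![{alt}]({name})"


text = "See ![diagram](https://example.com/assets/img/diagram.png?raw=1) here."
shortened = MD_IMAGE.sub(shorten_image, text)
```

Shortening long image URLs this way keeps the reference recognizable while spending fewer tokens on it.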

wdoc.utils.loaders.word module

wdoc.utils.loaders.youtube module

Module contents

Called by batch_file_loader.py’s threads. Contains many cached functions used to load each document.

wdoc.utils.loaders.wrapper_load_one_doc(func: Callable) → Callable

Decorator that wraps doc_loader to catch errors cleanly.
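
A decorator of this kind can be sketched as follows. This is only an illustration: `catch_loader_errors` is a hypothetical name, and returning a short error string instead of raising is an assumption about what "cleanly" means, chosen so that calling threads can collect failures without crashing.

```python
import functools
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def catch_loader_errors(func: Callable) -> Callable:
    """Wrap a loader so that exceptions are logged and reported, not raised."""

    @functools.wraps(func)  # keep the wrapped loader's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as err:
            logger.warning("%s failed: %s", func.__name__, err)
            # Assumption: the caller tells failures apart by the return type.
            return f"{type(err).__name__}: {err}"

    return wrapper


@catch_loader_errors
def load_ok(path: str) -> list[str]:
    return [path]


@catch_loader_errors
def load_missing(path: str) -> list[str]:
    raise FileNotFoundError(path)


ok_result = load_ok(path="a.txt")
failed_result = load_missing(path="nope.txt")
```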