wdoc.utils.loaders package#

Submodules#

wdoc.utils.loaders.anki module#

wdoc.utils.loaders.anki.cloze_stripper(clozed: str) → str[source]#

wdoc.utils.loaders.anki.load_anki(verbose: bool, text_splitter: TextSplitter, loaders_temp_dir: Path, anki_profile: str | None = None, anki_deck: str | None = None, anki_notetype: str | None = None, anki_template: str | None = '{allfields}\n{image_ocr_alt}', anki_tag_filter: str | None = None, anki_tag_render_filter: str | None = None) → list[Document][source]#

wdoc.utils.loaders.anki.replace_media(content: str, media: None | dict, mode: str, strict: bool = True, replace_image: bool = True, replace_links: bool = True, replace_sounds: bool = True) → tuple[str, dict][source]#

Else: exclude any note whose content contains:

an image (<img…)
or a sound [sound:…
or a link href / http

This is because:

1 as LLMs are non-deterministic, I preferred to avoid the risk of botching the content

2 it costs fewer tokens

The intended use is to call it first to replace each media item with a simple placeholder string like [IMAGE_1], check that the placeholder is indeed present in the output of the LLM, then replace it back with the original media.

It uses both bs4 and regular expressions so that each method cross-checks the other.
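The placeholder round-trip described above can be sketched as follows. This is a minimal illustration, not wdoc's replace_media: the helper names mask_images and unmask are hypothetical, only <img> tags are handled, and a single regex is used instead of the bs4 + regex combination.

```python
import re

# Hypothetical helpers illustrating the placeholder round-trip only.
IMG_RE = re.compile(r"<img\b[^>]*>", flags=re.IGNORECASE)


def mask_images(content: str) -> tuple[str, dict]:
    """Replace each <img ...> tag with [IMAGE_n] and remember the mapping."""
    media: dict[str, str] = {}

    def _swap(match: re.Match) -> str:
        placeholder = f"[IMAGE_{len(media) + 1}]"
        media[placeholder] = match.group(0)
        return placeholder

    return IMG_RE.sub(_swap, content), media


def unmask(content: str, media: dict) -> str:
    """Put the original media back; fail if a placeholder went missing."""
    for placeholder, original in media.items():
        if placeholder not in content:
            raise ValueError(f"{placeholder} missing from LLM output")
        content = content.replace(placeholder, original)
    return content


masked, media = mask_images('Heart anatomy <img src="heart.png"> (front view)')
# ... send `masked` to the LLM, then restore the media in its answer:
restored = unmask(masked, media)
```

Raising when a placeholder is missing mirrors the "check if it's indeed present" step: a dropped placeholder means the LLM altered the content and the result should not be trusted.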

wdoc.utils.loaders.epub module#

wdoc.utils.loaders.json_dict module#

wdoc.utils.loaders.karakeep module#

Helpers for the karakeep recursive filetype.

A Karakeep instance is not a single document: one list / tag / search / the whole library fans out into many bookmarks, each carrying a link (with crawled htmlContent), free text, or an uploaded asset (pdf / image), plus a title, note, AI summary and tags. The actual fan-out into DocDicts lives in wdoc.utils.load_recursive.parse_karakeep; this module only holds the Karakeep-API-facing helpers so that load_recursive.py stays lean.

We deliberately do NOT re-implement any extraction here: a bookmark is resolved to one of Karakeep’s stored artifacts (the saved HTML, the pre-extracted text, or a downloaded stored PDF asset) and handed back to wdoc’s own local_html / txt / pdf loaders, which are far more capable and already cached. The online resource is never re-fetched, so the loader works offline and under --private mode against a local Karakeep instance.

Per-bookmark content resolution is wrapped in a joblib cache keyed on the bookmark id + its modifiedAt (youtube-loader style), so re-running wdoc on the same selection does not re-hit the Karakeep server or re-download assets.

wdoc.utils.loaders.karakeep.bookmark_web_url(bookmark: dict, api_endpoint: str | None) → str[source]#: Best-effort permalink for a bookmark (used as subitem_link).

wdoc.utils.loaders.karakeep.format_header(bookmark: dict) → str[source]#: A short, human + RAG friendly metadata header for a bookmark.

wdoc.utils.loaders.karakeep.get_karakeep_client(api_endpoint: str | None = None, api_key: str | None = None, verify_ssl: bool = True)[source]#

Return a connected KarakeepAPI client.

Credentials fall back to the karakeep-python-api standard environment variables KARAKEEP_PYTHON_API_ENDPOINT / KARAKEEP_PYTHON_API_KEY / KARAKEEP_PYTHON_API_VERIFY_SSL when the corresponding argument is absent.

Response validation is disabled so the client returns plain dicts, which is both robust to upstream schema drift and convenient for the fan-out.

Private-mode guard: this loader never reaches the live bookmarked URL, only the Karakeep instance itself. A local instance is therefore allowed under WDOC_PRIVATE_MODE; a remote endpoint is blocked.
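One way such a guard could look is sketched below. How wdoc actually decides that an endpoint is "local" is not documented here, so the loopback / private-address rule and the names is_local_endpoint and check_private_mode are assumptions.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(api_endpoint: str) -> bool:
    """Assumed rule: localhost, loopback and private-range IPs count as local."""
    host = urlparse(api_endpoint).hostname or ""
    if host == "localhost":
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name: treated as remote in this sketch
    return ip.is_loopback or ip.is_private


def check_private_mode(api_endpoint: str, private_mode: bool) -> None:
    """Raise if private mode is on and the endpoint is not local."""
    if private_mode and not is_local_endpoint(api_endpoint):
        raise PermissionError(
            f"Private mode: refusing remote Karakeep endpoint {api_endpoint}"
        )


check_private_mode("http://localhost:3000", private_mode=True)  # allowed
```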

wdoc.utils.loaders.karakeep.parse_selector(path: str) → tuple[str, object][source]#

Turn the --path selector into a (kind, value) pair.

Accepted forms:
- library / * / all -> the whole library
- favourites / favorites -> favourited bookmarks
- archived -> archived bookmarks
- tag:foo -> bookmarks with that tag (by name)
- search:query -> a Karakeep search query
- ids:ID1,ID2 (or id:) -> explicit bookmark ids
- list:Name or a bare value -> a list by name
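The accepted forms map naturally onto a prefix dispatch. The sketch below is a hypothetical re-implementation for illustration: the kind strings and the shape of each value (None, str, or list of ids) are assumptions, not the module's documented return values.

```python
def sketch_parse_selector(path: str) -> tuple[str, object]:
    """Illustrative parser for the selector forms listed above."""
    path = path.strip()
    lowered = path.lower()
    if lowered in {"library", "*", "all"}:
        return ("library", None)
    if lowered in {"favourites", "favorites"}:
        return ("favourites", None)
    if lowered == "archived":
        return ("archived", None)

    prefix, sep, rest = path.partition(":")
    prefix = prefix.lower()
    if sep:
        if prefix in {"tag", "search", "list"}:
            return (prefix, rest)
        if prefix in {"ids", "id"}:
            # Comma-separated ids, ignoring stray whitespace and empty items.
            return ("ids", [i.strip() for i in rest.split(",") if i.strip()])
    # A bare value is treated as a list name.
    return ("list", path)


examples = [sketch_parse_selector(p) for p in ("*", "tag:foo", "ids:a, b", "Reading list")]
```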

wdoc.utils.loaders.karakeep.resolve_bookmarks(client, selector: tuple[str, object]) → list[dict][source]#: Resolve a parsed selector into a list of bookmark dicts.

wdoc.utils.loaders.local_audio module#

wdoc.utils.loaders.local_html module#

wdoc.utils.loaders.local_video module#

wdoc.utils.loaders.logseq_markdown module#

wdoc.utils.loaders.online_media module#

wdoc.utils.loaders.online_pdf module#

wdoc.utils.loaders.pdf module#

class wdoc.utils.loaders.pdf.OpenparseDocumentParser(path: str | Path, table_args: dict | None = {'parsing_algorithm': 'pymupdf', 'table_output_format': 'markdown'})[source]#

Bases: object

load() → list[Document][source]#

wdoc.utils.loaders.pdf.load_pdf(path: str | Path, text_splitter: TextSplitter, file_hash: str, pdf_parsers: str | list[str] = 'pymupdf', doccheck_min_lang_prob: float = 0.5, doccheck_min_token: int = 20, doccheck_max_token: int = 10000000) → list[Document][source]#

wdoc.utils.loaders.powerpoint module#

wdoc.utils.loaders.shared module#

wdoc.utils.loaders.shared.get_url_title(url: str) → str | None[source]#: Last-resort fallback: if the loader did not provide the title of the url, try to get it with this function.

wdoc.utils.loaders.shared_audio module#

wdoc.utils.loaders.shared_audio.convert_verbose_json_to_timestamped_text(transcript: dict) → str[source]#

wdoc.utils.loaders.shared_audio.is_timecode(inp: float | str) → bool[source]#

wdoc.utils.loaders.shared_audio.process_vtt_content_for_llm(vtt_content: str, remove_hour_prefix: bool = True) → str[source]#

Process VTT content to make it more suitable for LLMs by reducing timecodes and removing unnecessary formatting.

Parameters:

vtt_content – The VTT content to process
remove_hour_prefix – Whether to remove “00:” hour prefix if all content is under 99 minutes

Returns:

Processed text content optimized for LLM consumption

wdoc.utils.loaders.shared_audio.seconds_to_timecode(inp: str | float | int) → str[source]#: used for vtt subtitle conversion

wdoc.utils.loaders.shared_audio.split_too_large_audio(audio_path: Path | str) → list[Path][source]#: Whisper has a file size limit of about 25 MB. If we hit that limit, we split the audio file into multiple 30-minute files, then combine the outputs.
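The splitting step amounts to cutting the timeline into fixed-length windows. The sketch below only computes the segment boundaries; the function name plan_segments is hypothetical, and the actual audio cutting (done with an external audio tool) is out of scope.

```python
def plan_segments(
    duration_s: float, segment_s: float = 30 * 60
) -> list[tuple[float, float]]:
    """Return (start, end) pairs in seconds covering the whole duration."""
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        segments.append((start, end))
        start = end
    return segments


# A recording of 1 h 06 min 40 s yields two full windows and a shorter last one.
plan = plan_segments(4000.0)
```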

wdoc.utils.loaders.shared_audio.timecode_to_second(inp: str) → int[source]#: turns a vtt timecode into seconds
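As an illustration of the two timecode conversions, here is a minimal sketch. The names to_seconds and to_timecode are hypothetical, and the exact output format of the real helpers (for example whether milliseconds are kept) is an assumption.

```python
def to_seconds(timecode: str) -> int:
    """'HH:MM:SS.mmm' or 'MM:SS.mmm' -> whole seconds (fraction dropped)."""
    seconds = 0.0
    for part in timecode.strip().split(":"):
        # Each colon shifts the running total by a factor of 60.
        seconds = seconds * 60 + float(part)
    return int(seconds)


def to_timecode(seconds: float) -> str:
    """Whole seconds -> 'HH:MM:SS' (assumed format, no milliseconds)."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


round_trip = to_timecode(to_seconds("01:02:03.500"))
```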

wdoc.utils.loaders.string module#

wdoc.utils.loaders.string.load_string() → list[Document][source]#

wdoc.utils.loaders.text module#

wdoc.utils.loaders.text.load_text(path: str, file_hash: str, metadata: str | dict | None = None) → list[Document][source]#

wdoc.utils.loaders.txt module#

wdoc.utils.loaders.txt.load_txt(path: str | Path, file_hash: str) → list[Document][source]#

wdoc.utils.loaders.url module#

wdoc.utils.loaders.url.md_shorten_image_name(md_image: Match) → str[source]#: turn a markdown image link into just the name
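The Match parameter suggests the function is meant to be passed to re.sub as a replacement callback. The sketch below shows that pattern; the regex, the name shorten_image and the choice to keep the alt text are assumptions, not the module's actual behaviour.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Matches ![alt](target "optional title"); deliberately simple.
MD_IMAGE_RE = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<target>[^)\s]+)[^)]*\)")


def shorten_image(match: re.Match) -> str:
    """Keep only the file name of the image target, dropping path and query."""
    name = PurePosixPath(urlparse(match.group("target")).path).name
    return f"![{match.group('alt')}]({name})"


text = "See ![diagram](https://example.com/assets/img/flow.png?v=3) below."
shortened = MD_IMAGE_RE.sub(shorten_image, text)
```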

wdoc.utils.loaders.word module#

wdoc.utils.loaders.youtube module#

wdoc.utils.loaders.zotero module#

Helpers for the zotero recursive filetype.

A Zotero library is not a single document: one collection / tag / saved search fans out into many items, each of which can carry several attachments (PDFs, linked files, web links), notes and bibliographic metadata. The actual fan-out into DocDicts lives in wdoc.utils.load_recursive.parse_zotero; this module only holds the Zotero-API-facing helpers so that load_recursive.py stays lean.

We deliberately do NOT re-implement PDF extraction here (unlike the reference project openwebui-knowledgesync-zotero-python): attachments are written to a temp file and handed back to wdoc’s own pdf/auto loaders, which are far more capable (15 parser backends) and already cached.

Connection follows pyzotero: the local Zotero HTTP API (http://localhost:23119, works offline and in --private mode) is tried first, with a fallback to the Web API (api key + numeric library id) when configured.

wdoc.utils.loaders.zotero.attachment_fulltext(zot, attachment_key: str) → str | None[source]#: Return Zotero’s pre-indexed fulltext for an attachment, or None.

wdoc.utils.loaders.zotero.attachment_to_file(zot, attachment: dict, temp_dir: Path) → tuple[str, str] | None[source]#

Materialise an attachment into something a wdoc loader can read.

Returns a (filetype, path_or_url) tuple:
- ("pdf", path) / ("auto", path) for a file on disk (downloaded or linked)
- ("url", url) for a linked web URL

or None if the attachment cannot be turned into a document.

wdoc.utils.loaders.zotero.format_bib_header(item: dict) → str[source]#: A short, human + RAG friendly bibliographic header for an item.

wdoc.utils.loaders.zotero.get_zotero_client(connection: Literal['auto', 'local', 'web'] = 'auto', library_id: str | None = None, library_type: Literal['user', 'group'] = 'user', api_key: str | None = None)[source]#

Return a connected pyzotero.zotero.Zotero instance.

Credentials fall back to the pyzotero-standard environment variables ZOTERO_LIBRARY_ID / ZOTERO_API_KEY / ZOTERO_LIBRARY_TYPE when the corresponding argument is not provided.

connection="local": only the local Zotero HTTP API (requires the Zotero desktop app running). Probed eagerly so failures are reported up-front.
connection="web": only the Web API (needs library id + api key). Blocked under WDOC_PRIVATE_MODE since it reaches out to zotero.org.
connection="auto" (default): try local, fall back to web.

wdoc.utils.loaders.zotero.item_children(zot, item: dict) → list[dict][source]#: Return the child items (attachments + notes) of a Zotero item.

wdoc.utils.loaders.zotero.item_web_url(item: dict) → str | None[source]#: Best-effort permalink for a Zotero item (used as subitem_link).

wdoc.utils.loaders.zotero.metadata_document_text(item: dict) → str[source]#: Full text for the always-on per-item metadata document.

wdoc.utils.loaders.zotero.parse_selector(path: str) → tuple[str, object][source]#

Turn the --path selector into a (kind, value) pair.

Accepted forms:
- library / * / all -> the whole library
- tag:foo,bar -> items matching those tags
- items:KEY1,KEY2 (or key:) -> explicit item keys
- search:Name -> a Zotero saved search by name
- collection:Path or a bare value -> a collection by name or nested path

wdoc.utils.loaders.zotero.resolve_items(zot, selector: tuple[str, object]) → list[dict][source]#: Resolve a parsed selector into a list of parent Zotero items.

Module contents#

Called by batch_file_loader.py’s threads. Contains many cached functions to load each document.

wdoc.utils.loaders.wrapper_load_one_doc(func: Callable) → Callable[source]#: Decorator to wrap doc_loader to catch errors cleanly
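A decorator of this kind typically looks like the sketch below. The name catch_load_errors is hypothetical, and what the real wrapper does after catching an error (return, re-raise, or report differently) is not stated here; returning an empty list is an assumption.

```python
import functools
import logging
from typing import Callable

logger = logging.getLogger(__name__)


def catch_load_errors(func: Callable) -> Callable:
    """Wrap a loader so one failing document does not crash the batch."""

    @functools.wraps(func)  # keep the wrapped loader's name and docstring
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception as err:
            logger.error("Failed to load with %s: %s", func.__name__, err)
            return []  # assumed fallback: no documents for this input

    return wrapper


@catch_load_errors
def broken_loader(path: str) -> list:
    raise FileNotFoundError(path)


result = broken_loader("missing.pdf")
```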