wdoc.utils package#

Subpackages#

Submodules#

wdoc.utils.batch_file_loader module#

Called at wdoc instance creation. It parses the combined filetype into a list of individual DocDict, each describing a document (or in some cases a list of documents, for example a whole anki database). This list is then processed in loaders.py, using multithreading or multiprocessing.

wdoc.utils.batch_file_loader.infer_filetype(path: str) → str[source]#: Heuristics to infer the ‘filetype’ argument of a --path given to wdoc.

wdoc.utils.embeddings module#

Class used to create the embeddings.
Loads and stores embeddings for each document.

wdoc.utils.embeddings.create_embeddings(modelname: ModelName, cached_embeddings: Embeddings, save_embeds_as: str | Path, load_embeds_from: str | Path | None, loaded_docs: Any, dollar_limit: int | float, private: bool) → VectorStore[source]#: For each document of loaded_docs, we check if the embeddings were already computed and present in the cache or ask the CacheBackedEmbeddings class to create them and return to wdoc.loaded_embeddings.

wdoc.utils.embeddings.faiss_custom_score_function(distance: float) → float[source]#

Scoring function for faiss to make sure it’s positive. Related issue: langchain-ai/langchain#17333

In langchain the default value is the euclidean relevance score: return 1.0 - distance / math.sqrt(2)

The output is a similarity score: it must be in [0, 1], such that 0 is the most dissimilar and 1 is the most similar document.
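As a rough illustration of why a replacement scoring function is needed, the sketch below evaluates the langchain default quoted above and compares it with a bounded alternative. The `bounded_score` helper is hypothetical and is not taken from wdoc's source; it only shows one way to map a non-negative distance into [0, 1].

```python
import math


def langchain_default_score(distance: float) -> float:
    """Euclidean relevance score quoted above as the langchain default."""
    return 1.0 - distance / math.sqrt(2)


def bounded_score(distance: float) -> float:
    """Hypothetical replacement (an assumption, not wdoc's actual formula).

    Maps distance 0 to 1.0 and decreases towards 0 as the distance grows.
    """
    return 1.0 / (1.0 + max(distance, 0.0))


for d in (0.0, 1.0, 2.0):
    print(d, langchain_default_score(d), bounded_score(d))
```

Any distance larger than sqrt(2) makes the default score negative, whereas the bounded variant stays within [0, 1].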

wdoc.utils.embeddings.load_embeddings_engine(modelname: ModelName, cli_kwargs: dict, api_base: str | None, embed_kwargs: dict, private: bool, do_test: bool) → Embeddings[source]#: Create the Embeddings class used to compute embeddings. This class is wrapped into a CacheBackedEmbeddings to add a caching layer.

wdoc.utils.embeddings.test_embeddings(embeddings: Embeddings) → None[source]#: Simple testing of embeddings to know early if something seems wrong

wdoc.utils.env module#

Sets the default values of the environment variables and parses their actual values for wdoc. Also sets some variables that are useful to access globally, like is_linux for example.

class wdoc.utils.env.EnvDataclass(__warned_unexpected__: list = <factory>, __frozen__: bool = False, WDOC_DUMMY_ENV_VAR: bool = False, WDOC_DEBUG: bool = False, WDOC_VERBOSE: bool = False, WDOC_TYPECHECKING: Literal['disabled', 'warn', 'crash']='warn', WDOC_NO_MODELNAME_MATCHING: bool = True, WDOC_ALLOW_NO_PRICE: bool = False, WDOC_OPEN_ANKI: bool = False, WDOC_STRICT_DOCDICT: bool | Literal['strip'] = False, WDOC_MAX_LOADER_TIMEOUT: int = -1, WDOC_MAX_PDF_LOADER_TIMEOUT: int = -1, WDOC_PRIVATE_MODE: bool = False, WDOC_DEBUGGER: bool = False, WDOC_EXPIRE_CACHE_DAYS: int = 0, WDOC_EMPTY_LOADER: bool = False, WDOC_BEHAVIOR_EXCL_INCL_USELESS: Literal['warn', 'crash']='warn', WDOC_IMPORT_TYPE: Literal['native', 'lazy', 'thread', 'both']='native', WDOC_LOADER_LAZY_LOADING: bool = True, WDOC_MOD_FAISS_SCORE_FN: bool = True, WDOC_FAISS_COMPRESSION: bool = True, WDOC_FAISS_BINARY: bool = False, WDOC_LLM_MAX_CONCURRENCY: int = 1, WDOC_LLM_REQUEST_TIMEOUT: int = 600, WDOC_SEMANTIC_BATCH_MAX_TOKEN_SIZE: int = 2000, WDOC_MAX_CHUNK_SIZE: int = 16000, WDOC_MAX_EMBED_CONTEXT: int = 7000, WDOC_INTERMEDIATE_ANSWER_MAX_TOKENS: int = 4000, WDOC_DEFAULT_MODEL: str = 'openrouter/deepseek/deepseek-v4-pro', WDOC_DEFAULT_EMBED_MODEL: str = 'openai/text-embedding-3-small', WDOC_DEFAULT_EMBED_DIMENSION: int | None = None, WDOC_EMBED_TESTING: bool = True, WDOC_DISABLE_EMBEDDINGS_CACHE: bool = False, WDOC_DEFAULT_QUERY_EVAL_MODEL: str = 'openrouter/deepseek/deepseek-v4-flash', WDOC_LANGFUSE_PUBLIC_KEY: str | None = None, WDOC_LANGFUSE_SECRET_KEY: str | None = None, WDOC_LANGFUSE_HOST: str | None = None, WDOC_LITELLM_TAGS: str | None = None, WDOC_LITELLM_USER: str = 'wdoc_llm', WDOC_APPLY_ASYNCIO_PATCH: bool = False, WDOC_CONTINUE_ON_INVALID_EVAL: bool = True, WDOC_WHISPER_PARALLEL_SPLITS: bool = True, WDOC_WHISPER_ENDPOINT: str | None = '', WDOC_WHISPER_API_KEY: str | None = '', WDOC_WHISPER_MODEL: str = 'whisper-1', WDOC_IN_DOCKER: bool = False)[source]#

Bases: object

This dataclass holds the env variables used by wdoc. It is frozen when env.py is done. This allows modification of env values to dynamically affect wdoc without having to restart the python execution or reimporting wdoc.

## Documentation of each environment variable:

WDOC_DEBUG
- Setting to true has the same effect as using --debug=True.
WDOC_VERBOSE
- Setting to true has the same effect as using --verbose=True.
Always set to true if WDOC_DEBUG is set to true.
WDOC_TYPECHECKING
- Setting for runtime type checking. Default value is warn. The typing is checked using [beartype](https://beartype.readthedocs.io/en/latest/) so it shouldn’t slow down the runtime.
- Possible values:
  - disabled: disable typechecking.
  - warn: print a red warning if a typecheck fails.
  - crash: crash if a typecheck fails in any function.
WDOC_NO_MODELNAME_MATCHING
- If “false”: will try to infer the model name based on a more human readable string.
For example ‘4o’ might be matched to ‘openai/gpt-4o’. Setting it to “true” is useful for exotic models, models that are fresh out of the oven, or bugs with backend parsing. As the matching can lead to issues, it is disabled by default, hence the default value is True.
WDOC_ALLOW_NO_PRICE
- if “true”, won’t crash if no price was found for the given
model. Useful if litellm has not yet updated its price table. Default is False.
WDOC_OPEN_ANKI
- if “true”, will automatically ask whether to open the anki browser if cards are
found in the sources. Only used if task is query or search. Default is False.
WDOC_STRICT_DOCDICT
- if “True”, will crash instead of printing if trying to set an unexpected argument in a DocDict.
  Otherwise, you can specify things like “anki_profile” as argument to filetype “pdf” without crashing. This makes no sense but can be useful if there’s a bug in wdoc that is not yet fixed and you want to continue in the meantime.
- If set to “False”: we print unexpected arguments in red but add them anyway.
- If set to “strip”: we print unexpected arguments in red and ignore them.
- Default is False.
WDOC_MAX_LOADER_TIMEOUT
- Number of seconds to wait before giving up on loading a document (this does not include recursive types, only the DocDict arguments).
Default is -1 to disable. Disabled if <= 0.
WDOC_MAX_PDF_LOADER_TIMEOUT
- Number of seconds to wait for each pdf loader before giving up this loader. This includes the online_pdf loader.
  Note that it probably makes PDF parsing substantially slower. Default is -1 to disable. Disabled when using --file_loader_parallel_backend=threading as python does not allow it. Also disabled if <= 0.
WDOC_DEBUGGER
- If True, will open the debugger in case of issue. Implied by --debug.
Incompatible with WDOC_IN_DOCKER. Default is False.
WDOC_IN_DOCKER
- Flag set automatically, used to modify some behaviors to avoid issues when running wdoc inside docker.
Incompatible with WDOC_DEBUGGER. Default is False
WDOC_EXPIRE_CACHE_DAYS
- If an int, will remove any cached value that is older than that many days.
Otherwise keep forever. Default is 0 to disable.
WDOC_EMPTY_LOADER
- If True, loading any kind of document will return an empty string. Used for debugging. Default is False.
WDOC_BEHAVIOR_EXCL_INCL_USELESS
- Controls what happens when an “include” or “exclude” key is found in a loader but does not actually change anything: if warn, just print in red;
if crash, raise an error. Default is warn.
WDOC_PRIVATE_MODE
- You should never set it yourself. It is set automatically if the --private argument is used, and used throughout to triple check that it’s indeed fully private.
WDOC_IMPORT_TYPE, default native
- If native will just import the packages needed by wdoc without any tricks. This is the default as it’s bug-free but can be a bit slower to start up.
- If thread, will try to use a separate thread to import packages making the startup time potentially smaller.
- If lazy, will use lazy loading on some packages, making the startup time potentially smaller.
- If both, will try to use both.
All values other than native are experimental as they rely on weird python tricks that may cause issues.
WDOC_LOADER_LAZY_LOADING, default True
- If True the function used to load documents (e.g. load_anki, load_online_pdf etc) will be imported only when needed. This
is faster but experimental for now. If False, we import all the loader functions on start.
WDOC_MOD_FAISS_SCORE_FN, default True
- If True, modify on the fly the FAISS vectorstores to change their scoring function to go from 0 to 1 instead of -1 to 1. This was inspired by [this langchain issue where users claim the default scoring function is wrong](langchain-ai/langchain#17333)
WDOC_FAISS_COMPRESSION, default True
- If True, zlib compression is applied around the pickling stage (=save_local/load_local) of the faiss index. Disable this if you want to use your faiss indexes with other software without using wdoc’s custom classes.
If False, WDOC_FAISS_BINARY must also be False. Note that you can switch the value between runs, as the uncompressed loading is used as fallback.
WDOC_FAISS_BINARY, default False
- If True, use a custom langchain vectorstore mimicking [FAISS](https://python.langchain.com/api_reference/_modules/langchain_community/vectorstores/faiss.html#FAISS) but using [binary embeddings](https://simonwillison.net/2024/Mar/26/binary-vector-search/), resulting in a 32x compression ratio and faster search without hurting performance too much.
Note that binary indexes of FAISS [only support embeddings with dimensions multiple of 8](facebookresearch/faiss), so when the dimension is not a multiple of 8 we add null dimensions. Note that if you switch this value between the index creation and index usage, you’ll probably encounter errors and should rather set it once then recreate your vectorstores.
WDOC_LLM_MAX_CONCURRENCY, default 1
- Set the max_concurrency limit to give langchain. If debug is used, it is overridden and set to 1.
Must be an int.
WDOC_LLM_REQUEST_TIMEOUT, default 600
- Sets the timeout in seconds for requests made to the LLM. This helps prevent indefinite hangs if the LLM provider is unresponsive. For example with ollama.
WDOC_MAX_CHUNK_SIZE, default 16_000
- When splitting large text into chunks, wdoc infers the maximum context size from litellm’s models metadata.
The maximum chunk size is capped by this value, as the maximum advertised context length is usually optimistic and is often at the cost of prompt adherence. Note that the chunk size inferred for query is not the same as for summary as we need a much better prompt adherence for the latter. This can also be used to avoid chunking when querying a text if you want the LLM to have the entire text as context instead of chunking.
WDOC_MAX_EMBED_CONTEXT, default: 7_000
- This variable sets the maximum token_size for document chunks when the task is query or search.
This is necessary because some large language models (LLMs) might have a larger context window than their corresponding embedding models. The actual maximum chunk size will be the minimum of WDOC_MAX_CHUNK_SIZE and WDOC_MAX_EMBED_CONTEXT.
WDOC_SEMANTIC_BATCH_MAX_TOKEN_SIZE, default: 2000
- Token size considered maximum for a single batch when doing semantic batching. The tokenizer used is the one from gpt-4o-mini as we don’t have access to most models’ tokenizers.
Each batch contains at least two intermediate answers so it’s not an absolute limitation but increasing it should reduce the cost of the “combine intermediate answers” step when querying.
WDOC_DEFAULT_MODEL, default: “openrouter/deepseek/deepseek-v4-pro”
- Default strong LLM to use. This is the strongest model: it is used to answer the query about each document
and to combine those answers. It can also be used by some retrievers etc.
WDOC_DEFAULT_QUERY_EVAL_MODEL, default: “openrouter/deepseek/deepseek-v4-flash”
- Default small LLM to use. It will be used to evaluate whether each document is relevant to the query or not.
WDOC_DEFAULT_EMBED_MODEL, default: “openai/text-embedding-3-small”
- Default model to use for embeddings.
WDOC_DEFAULT_EMBED_DIMENSION, default: none
- Default number of dimensions to ask from the embeddings provider.
WDOC_EMBED_TESTING, default: True
- If False, will skip the test of the embeddings model on simple sentences to find out if we loaded everything correctly.
WDOC_DISABLE_EMBEDDINGS_CACHE, default: False
- If True, bypasses the caching mechanism for embeddings and uses the embeddings model directly. This can be useful for debugging or when you want to ensure fresh embeddings are generated for each document.
- Note that disabling the cache only affects new queries, new documents, or during semantic batching. It will NOT affect embeddings that are loaded via load_embeds_from, as those embeddings are already pre-computed and stored.
WDOC_LANGFUSE_PUBLIC_KEY, default: None
- If present, will replace the env variable LANGFUSE_PUBLIC_KEY.
WDOC_LANGFUSE_SECRET_KEY, default: None
- If present, will replace the env variable LANGFUSE_SECRET_KEY.
WDOC_LANGFUSE_HOST, default: None
- If present, will replace the env variable LANGFUSE_HOST.
WDOC_LITELLM_TAGS, default: None
- If set to a comma-separated list of strings, they will be put as tags in the litellm LLM request via the ChatLiteLLM object.
WDOC_LITELLM_USER, default: wdoc_llm
- Passed as the user argument when creating the ChatLiteLLM object that talks to LLMs.
WDOC_CONTINUE_ON_INVALID_EVAL, default: True
- If True, instead of raising an InvalidDocEvaluationByLLMEval exception when an eval LLM returns output that can’t be parsed,
the system will print the error message in red and return “5” as the evaluation score. This allows the process to continue despite evaluation parsing failures.
- If False, the system will raise the exception as normal, which typically causes the process to terminate.
WDOC_INTERMEDIATE_ANSWER_MAX_TOKENS, default: 4000
- Sets the maximum number of tokens allowed for each intermediate answer when querying documents.
This controls how much content the LLM generates for each document before these answers are combined into the final response. Lower values may reduce costs but might lose important details, while higher values allow for more comprehensive individual document analysis.
WDOC_WHISPER_PARALLEL_SPLITS, default: True
- If True, when audio files need to be split for whisper transcription (due to size limits), the splits will be processed in parallel using joblib.
This can significantly speed up transcription of large audio files when using remote whisper services.
- If False, audio splits will be processed sequentially. It is recommended to set this to False when using a local whisper instance to avoid overwhelming the local system with concurrent requests.
WDOC_WHISPER_ENDPOINT, default: “”
- If provided, sets a custom API endpoint for Whisper transcription services. This allows you to use local Whisper instances
or alternative Whisper-compatible services instead of OpenAI’s default endpoint.
- When empty, uses the default OpenAI Whisper endpoint.
WDOC_WHISPER_API_KEY, default: “”
- If provided, sets a custom API key for Whisper transcription services. This is useful when using alternative
Whisper-compatible services that require their own authentication.
- When empty, falls back to the OPENAI_API_KEY environment variable, and if that is also unset, to the WHISPER_API_KEY environment variable. The resolution order is WDOC_WHISPER_API_KEY > OPENAI_API_KEY > WHISPER_API_KEY.
WDOC_WHISPER_MODEL, default: “whisper-1”
- Specifies which Whisper model to use for audio transcription. This can be any model supported by your Whisper endpoint.
- Common values include “whisper-1” for OpenAI’s service, or model names like “base”, “small”, “medium”, “large” for local instances.
WDOC_APPLY_ASYNCIO_PATCH, default: False
- If True, applies the nest_asyncio patch to fix the Event loop closed error that can occur with Ollama and
other async-based LLM providers. Set to False if you’re experiencing issues with asyncio or if you’re handling asyncio patching elsewhere in your application. See BerriAI/litellm#files
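A few of the numeric rules documented above can be written down as small helpers. This is only a sketch of the documented behavior: the function names are hypothetical and do not come from wdoc's source. It covers the chunk size used for query or search (the minimum of WDOC_MAX_CHUNK_SIZE and WDOC_MAX_EMBED_CONTEXT), the loader timeouts that are disabled when <= 0, and the padding of embedding dimensions to a multiple of 8 for WDOC_FAISS_BINARY.

```python
from typing import Optional


def effective_max_chunk_size(max_chunk_size: int, max_embed_context: int) -> int:
    """Chunk size cap for query/search: the smaller of the two limits."""
    return min(max_chunk_size, max_embed_context)


def timeout_or_none(seconds: int) -> Optional[int]:
    """Return None when the timeout is disabled (value <= 0)."""
    return seconds if seconds > 0 else None


def padded_dimension(dim: int, multiple: int = 8) -> int:
    """Smallest multiple of `multiple` that is >= dim (null dimensions are added)."""
    remainder = dim % multiple
    return dim if remainder == 0 else dim + (multiple - remainder)


# Values below are the defaults listed in the dataclass signature.
print(effective_max_chunk_size(16000, 7000))
print(timeout_or_none(-1))
print(padded_dimension(1533))
```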

WDOC_ALLOW_NO_PRICE: bool = False#

WDOC_APPLY_ASYNCIO_PATCH: bool = False#

WDOC_BEHAVIOR_EXCL_INCL_USELESS: Literal['warn', 'crash'] = 'warn'#

WDOC_CONTINUE_ON_INVALID_EVAL: bool = True#

WDOC_DEBUG: bool = False#

WDOC_DEBUGGER: bool = False#

WDOC_DEFAULT_EMBED_DIMENSION: int | None = None#

WDOC_DEFAULT_EMBED_MODEL: str = 'openai/text-embedding-3-small'#

WDOC_DEFAULT_MODEL: str = 'openrouter/deepseek/deepseek-v4-pro'#

WDOC_DEFAULT_QUERY_EVAL_MODEL: str = 'openrouter/deepseek/deepseek-v4-flash'#

WDOC_DISABLE_EMBEDDINGS_CACHE: bool = False#

WDOC_DUMMY_ENV_VAR: bool = False#

WDOC_EMBED_TESTING: bool = True#

WDOC_EMPTY_LOADER: bool = False#

WDOC_EXPIRE_CACHE_DAYS: int = 0#

WDOC_FAISS_BINARY: bool = False#

WDOC_FAISS_COMPRESSION: bool = True#

WDOC_IMPORT_TYPE: Literal['native', 'lazy', 'thread', 'both'] = 'native'#

WDOC_INTERMEDIATE_ANSWER_MAX_TOKENS: int = 4000#

WDOC_IN_DOCKER: bool = False#

WDOC_LANGFUSE_HOST: str | None = None#

WDOC_LANGFUSE_PUBLIC_KEY: str | None = None#

WDOC_LANGFUSE_SECRET_KEY: str | None = None#

WDOC_LITELLM_TAGS: str | None = None#

WDOC_LITELLM_USER: str = 'wdoc_llm'#

WDOC_LLM_MAX_CONCURRENCY: int = 1#

WDOC_LLM_REQUEST_TIMEOUT: int = 600#

WDOC_LOADER_LAZY_LOADING: bool = True#

WDOC_MAX_CHUNK_SIZE: int = 16000#

WDOC_MAX_EMBED_CONTEXT: int = 7000#

WDOC_MAX_LOADER_TIMEOUT: int = -1#

WDOC_MAX_PDF_LOADER_TIMEOUT: int = -1#

WDOC_MOD_FAISS_SCORE_FN: bool = True#

WDOC_NO_MODELNAME_MATCHING: bool = True#

WDOC_OPEN_ANKI: bool = False#

WDOC_PRIVATE_MODE: bool = False#

WDOC_SEMANTIC_BATCH_MAX_TOKEN_SIZE: int = 2000#

WDOC_STRICT_DOCDICT: bool | Literal['strip'] = False#

WDOC_TYPECHECKING: Literal['disabled', 'warn', 'crash'] = 'warn'#

WDOC_VERBOSE: bool = False#

WDOC_WHISPER_API_KEY: str | None = ''#

WDOC_WHISPER_ENDPOINT: str | None = ''#

WDOC_WHISPER_MODEL: str = 'whisper-1'#

WDOC_WHISPER_PARALLEL_SPLITS: bool = True#
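The documented resolution order for the Whisper API key and the comma-separated format of WDOC_LITELLM_TAGS can be sketched as follows. Both helper names are hypothetical, and the environment is passed in as a plain mapping so the example does not depend on the real process environment.

```python
from typing import Mapping, Optional


def resolve_whisper_api_key(env: Mapping[str, str]) -> Optional[str]:
    """Follow the documented order:
    WDOC_WHISPER_API_KEY > OPENAI_API_KEY > WHISPER_API_KEY.
    Empty strings are treated as unset.
    """
    for name in ("WDOC_WHISPER_API_KEY", "OPENAI_API_KEY", "WHISPER_API_KEY"):
        value = env.get(name, "")
        if value:
            return value
    return None


def parse_litellm_tags(raw: Optional[str]) -> list[str]:
    """Split a comma-separated string into a list of non-empty tags."""
    if not raw:
        return []
    return [tag.strip() for tag in raw.split(",") if tag.strip()]


example_env = {"WDOC_WHISPER_API_KEY": "", "OPENAI_API_KEY": "sk-example"}
print(resolve_whisper_api_key(example_env))
print(parse_litellm_tags("research, wdoc ,batch1"))
```

In real use the mapping would typically be `os.environ`.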

wdoc.utils.env.check_kwargs(arg: str, abbrv: str = None) → bool[source]#

wdoc.utils.errors module#

Exception classes

exception wdoc.utils.errors.InvalidDocEvaluationByLLMEval(message: str)[source]#: Bases: Exception

exception wdoc.utils.errors.MissingDocdictArguments(message: str = 'Document loader called with missing arguments')[source]#

Bases: Exception

Raised when a document loader is called with the wrong number of arguments or missing required arguments.

exception wdoc.utils.errors.TimeoutPdfLoaderError[source]#: Bases: Exception

exception wdoc.utils.errors.UnexpectedDocDictArgument(message: str)[source]#: Bases: Exception

wdoc.utils.filters module#

Filter functions for VectorStore documents.

This module provides functions to filter VectorStore documents (e.g., FAISS) on the fly, since the langchain implementation does not support native filtering. These functions allow filtering by regex patterns on document content and metadata.

wdoc.utils.filters.create_content_filter(cli_kwargs: dict) → Callable[source]#

wdoc.utils.filters.create_metadata_filter(loaded_embeddings: VectorStore, cli_kwargs: dict) → Callable[source]#

wdoc.utils.filters.filter_vectorstore(loaded_embeddings: VectorStore, cli_kwargs: dict) → VectorStore[source]#
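The idea of filtering documents by regex on content and metadata can be sketched in plain Python. Everything here is an assumption made for illustration: the `Doc` stand-in class, the `make_content_filter` helper, and the rule that a document must match every include pattern and no exclude pattern. The actual keys read from `cli_kwargs` and the exact semantics used by wdoc are not shown in this page.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Doc:
    """Stand-in for a langchain Document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def make_content_filter(include: list[str], exclude: list[str]) -> Callable[[Doc], bool]:
    """Build a predicate keeping docs that match all includes and no excludes."""
    inc = [re.compile(p) for p in include]
    exc = [re.compile(p) for p in exclude]

    def keep(doc: Doc) -> bool:
        text = doc.page_content
        if any(not p.search(text) for p in inc):
            return False
        return not any(p.search(text) for p in exc)

    return keep


docs = [
    Doc("Tuberculosis statistics, 1950", {"source": "a.pdf"}),
    Doc("Draft notes on tuberculosis", {"source": "b.txt"}),
    Doc("Unrelated cooking recipe", {"source": "c.pdf"}),
]
keep = make_content_filter(include=[r"(?i)tuberculosis"], exclude=[r"(?i)draft"])
kept = [d for d in docs if keep(d)]
# Metadata can be filtered the same way, here on the "source" value.
pdf_only = [d for d in docs if re.search(r"\.pdf$", d.metadata.get("source", ""))]
print([d.metadata["source"] for d in kept])
print([d.metadata["source"] for d in pdf_only])
```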

wdoc.utils.interact module#

Code related to the prompt (in the sense of “directly ask the user a question”)

class wdoc.utils.interact.SettingsCompleter(wdocCliSettings, wdocHistoryPrompts, wdocHistoryWords, *args, **kwargs)[source]#

Bases: Completer

get_completions(document, complete_event)[source]#

This should be a generator that yields Completion instances.

If the generation of completions is something expensive (that takes a lot of time), consider wrapping this Completer class in a ThreadedCompleter. In that case, the completer algorithm runs in a background thread and completions will be displayed as soon as they arrive.

Parameters:

document – Document instance.
complete_event – CompleteEvent instance.

wdoc.utils.interact.ask_user(settings: dict) → tuple[str, dict][source]#

## Command line manual

Available Commands:
- /help or ?
- /debug
- /settings (syntax: ‘/settings top_k=5’)

Settings keys and values:
- top_k: int > 0
- multiline: boolean
- retriever: a string containing ‘_’ separated retrievers from the following list:
  - ‘default’ to use regular embedding search
  - ‘knn’ to use KNN
  - ‘svm’ to use SVM
  - ‘multiquery’ to use Hypothetical Document Embedding search
  - ‘parent’ to use parent retriever
  To use several: ‘/settings retriever=knn_svm_default’
- relevancy: float, in the range [-1, +1]

Tips:
- Each LLM used has a nickname: use it to address specific instructions. The nicknames are “Summarizer”, “Evaluator”, “Answerer” and “Combiner”.
- In multiline mode, use ctrl+D to send the text (sometimes multiple times).
- For more information run ‘wdoc --help’
- History is saved and shared across all runs
- If you use ‘>>>>’ once in the middle of your text, the left part will be used as a query to find the documents and the right part will be the question to answer. For example: ‘tuberculosis among medical students in the 20th century >>>> what are the statistics about epidemiology of tuberculosis among medical students in the 20th century?’. This is not always useful but in some cases depending on documents and retriever it can be needed to avoid having to set top_k too high.
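The ‘>>>>’ convention and the ‘_’ separated retriever setting described above can be sketched like this. The helper names are hypothetical, and the fallback (using the whole text for both roles when the separator is absent or appears more than once) is an assumption, not a documented behavior.

```python
def split_query(text: str, separator: str = ">>>>") -> tuple[str, str]:
    """Return (retrieval_query, question).

    When the separator appears exactly once with text on both sides, the left
    part is used to find documents and the right part is the question.
    Otherwise the whole text is used for both (assumed fallback).
    """
    if text.count(separator) == 1:
        left, right = (part.strip() for part in text.split(separator))
        if left and right:
            return left, right
    whole = text.strip()
    return whole, whole


def parse_retrievers(value: str) -> list[str]:
    """Split a setting such as 'knn_svm_default' into retriever names."""
    return [name for name in value.split("_") if name]


print(split_query("tuberculosis among medical students >>>> what are the statistics?"))
print(parse_retrievers("knn_svm_default"))
```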

wdoc.utils.interact.get_toolbar_text(settings: dict) → Any[source]#

Parse settings for display in the prompt toolbar.

Parameters: settings (dict) – Dictionary containing the current settings
Returns: Formatted text suitable for display in the toolbar
Return type: Any

wdoc.utils.interact.show_help() → None[source]#

Display CLI help information.

This function displays the CLI help information by formatting and showing the docstring from the ask_user function.

Returns: None

wdoc.utils.llm module#

Code related to loading the LLM instance, with an appropriate price counting callback.

class wdoc.utils.llm.PriceCountingCallback(verbose, *args, **kwargs)[source]#

Bases: BaseCallbackHandler

source: https://python.langchain.com/docs/modules/callbacks/

on_agent_action(action: AgentAction, **kwargs: Any) → Any[source]#: Run on agent action.

on_agent_finish(finish: AgentFinish, **kwargs: Any) → Any[source]#: Run on agent end.

on_chain_end(outputs: dict[str, Any], **kwargs: Any) → Any[source]#: Run when chain ends running.

on_chain_error(error: Exception | KeyboardInterrupt, **kwargs: Any) → Any[source]#: Run when chain errors.

on_chain_start(serialized: dict[str, Any], inputs: dict[str, Any], **kwargs: Any) → Any[source]#: Run when chain starts running.

on_chat_model_start(serialized: dict[str, Any], messages: list[list[BaseMessage]], **kwargs: Any) → Any[source]#: Run when Chat Model starts running.

on_llm_end(response: LLMResult, **kwargs: Any) → Any[source]#: Run when LLM ends running.

on_llm_error(error: Exception | KeyboardInterrupt, **kwargs: Any) → Any[source]#: Run when LLM errors.

on_llm_new_token(token: str, **kwargs: Any) → Any[source]#: Run on new LLM token. Only available when streaming is enabled.

on_llm_start(serialized: dict[str, Any], prompts: list[str], **kwargs: Any) → Any[source]#: Run when LLM starts running.

on_text(text: str, **kwargs: Any) → Any[source]#: Run on arbitrary text.

on_tool_end(output: Any, **kwargs: Any) → Any[source]#: Run when tool ends running.

on_tool_error(error: Exception | KeyboardInterrupt, **kwargs: Any) → Any[source]#: Run when tool errors.

on_tool_start(serialized: dict[str, Any], input_str: str, **kwargs: Any) → Any[source]#: Run when tool starts running.

wdoc.utils.load_recursive module#

wdoc.utils.load_recursive.parse_ddg_search(cli_kwargs: dict, path: str | Path, ddg_max_results: int = 50, ddg_region: str = '', ddg_safesearch: Literal['on', 'off', 'moderate'] = 'off', **extra_args) → list[DocDict][source]#

Turn a DocDict that has filetype==ddg into the individual DocDict of the webpage of each DuckDuckGo search result, treating the path as a search query.

Parameters:

cli_kwargs – Base CLI arguments to inherit
path – The search query string
ddg_max_results – Maximum number of search results to return, default=50
ddg_region – DuckDuckGo search region, default=’’
ddg_safesearch – SafeSearch setting (“on”, “moderate”, “off”), default=’off’
**extra_args – Additional arguments to pass to each document

Returns:

List of DocDict objects, each representing a URL from search results

wdoc.utils.load_recursive.parse_json_entries(cli_kwargs: dict, path: str | Path, **extra_args) → list[DocDict | dict][source]#

Turn a DocDict that has filetype==json_entries into the individual DocDict mentioned inside the json file.

Parameters:

cli_kwargs – Base CLI arguments to inherit
path – The path to the JSON file containing document entries
**extra_args – Additional arguments to pass to each document

Returns:

List of DocDict or dict objects, each representing an entry from the JSON file

wdoc.utils.load_recursive.parse_karakeep(cli_kwargs: dict, path: str | Path, karakeep_api_endpoint: str | None = None, karakeep_api_key: str | None = None, karakeep_verify_ssl: bool = True, karakeep_content_source: Literal['auto', 'native', 'wdoc'] = 'auto', **extra_args) → list[DocDict][source]#

Turn a DocDict that has filetype==karakeep into one DocDict per loadable bookmark of the selected Karakeep source.

The path carries the selector (see parse_selector): a list name by default, or one of tag:…, search:…, ids:…, library/*, favourites, archived. Each bookmark resolves to a single sub-document:

a local_html doc for a link bookmark’s stored crawled html;

a txt doc for a text bookmark or an asset’s pre-extracted text;

a pdf doc for a downloaded stored pdf/archive asset.

A compact metadata header (title / url / author / tags / note / summary) is prepended to text and html docs. The live bookmarked url is never re-fetched.

Parameters:

cli_kwargs – Base CLI arguments to inherit
path – The Karakeep selector (see above)
karakeep_api_endpoint – Karakeep API endpoint, else KARAKEEP_PYTHON_API_ENDPOINT
karakeep_api_key – Karakeep api key, else KARAKEEP_PYTHON_API_KEY
karakeep_verify_ssl – verify the instance’s TLS certificate
karakeep_content_source – ‘auto’ (stored text/html, else stored pdf), ‘native’ (stored extracted text/html only), or ‘wdoc’ (prefer the stored pdf/archive asset, parsed by wdoc’s loaders). None of these modes re-fetches the live url.
**extra_args – Additional arguments to pass to each document

Returns:

List of DocDict objects, each a loadable sub-document

wdoc.utils.load_recursive.parse_link_file(cli_kwargs: dict, path: str | Path, **extra_args) → list[DocDict][source]#

Turn a DocDict that has filetype==link_file into the individual DocDict of each url, where there is one url per line inside the link_file file. Note that bullet points are stripped (i.e. “- [the url]” is treated the same as “the url”), and commented lines (i.e. starting with “#”) are ignored.

Parameters:

cli_kwargs – Base CLI arguments to inherit
path – The path to the link file containing URLs
**extra_args – Additional arguments to pass to each document

Returns:

List of DocDict objects, each representing a URL from the link file
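The line handling described above (one url per line, bullet points stripped, lines starting with “#” ignored) can be sketched as follows. The `parse_link_lines` helper is hypothetical and returns plain strings rather than DocDict objects; skipping blank lines is an assumption.

```python
def parse_link_lines(text: str) -> list[str]:
    """Extract one url per line, ignoring comments and stripping '- ' bullets."""
    urls: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line (assumed skipped) or commented line
        if line.startswith("- "):
            line = line[2:].strip()  # "- the url" is treated like "the url"
        urls.append(line)
    return urls


sample = """
# reading list
- https://example.com/a
https://example.com/b

# - https://example.com/ignored
"""
print(parse_link_lines(sample))
```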

wdoc.utils.load_recursive.parse_recursive_paths(cli_kwargs: dict, path: str | Path, pattern: str, recursed_filetype: str, include: list[str] | None = None, exclude: list[str] | None = None, **extra_args) → list[DocDict | dict][source]#

Turn a DocDict that has filetype==recursive_paths into the DocDict of individual files in that path.

Parameters:

cli_kwargs – Base CLI arguments to inherit
path – The directory path to search recursively
pattern – Glob pattern to match files (e.g., “*.pdf”, “**/*.txt”)
recursed_filetype – The filetype to assign to found files
include – Optional list of regex patterns to include files, default=None
exclude – Optional list of regex patterns to exclude files, default=None
**extra_args – Additional arguments to pass to each document

Returns:

List of DocDict or dict objects, each representing a found file
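A minimal sketch of the search described above, combining a glob pattern with include and exclude regexes. The `find_files` helper is hypothetical; the use of `Path.rglob`, matching the regexes against the path relative to the root, and requiring every include pattern to match are all assumptions rather than wdoc's confirmed behavior.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional


def find_files(
    root: Path,
    pattern: str,
    include: Optional[list[str]] = None,
    exclude: Optional[list[str]] = None,
) -> list[Path]:
    """Recursively glob `root`, then apply include/exclude regex filters."""
    found: list[Path] = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        if include and not all(re.search(rx, rel) for rx in include):
            continue
        if exclude and any(re.search(rx, rel) for rx in exclude):
            continue
        found.append(path)
    return found


# Small demonstration on a temporary directory tree.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.pdf", "sub/b.pdf", "sub/c.txt", "sub/draft_d.pdf"):
        (root / name).write_text("x")
    demo_names = {p.name for p in find_files(root, "*.pdf", exclude=[r"draft"])}
    demo_sub = {p.name for p in find_files(root, "*.pdf", include=[r"^sub/"])}

print(sorted(demo_names))
print(sorted(demo_sub))
```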

wdoc.utils.load_recursive.parse_toml_entries(cli_kwargs: dict, path: str | Path, **extra_args) → list[DocDict | dict][source]#

Turn a DocDict that has filetype==toml_entries into the individual DocDict mentioned inside the toml file.

Parameters:

cli_kwargs – Base CLI arguments to inherit
path – The path to the TOML file containing document entries
**extra_args – Additional arguments to pass to each document

Returns:

List of DocDict or dict objects, each representing an entry from the TOML file

wdoc.utils.load_recursive.parse_youtube_playlist(cli_kwargs: dict, path: str | Path, **extra_args) → list[DocDict][source]#

Turn a DocDict that has filetype==youtube_playlist into the individual DocDict of each youtube video that is part of that playlist.

Parameters:

cli_kwargs – Base CLI arguments to inherit
path – The YouTube playlist URL
**extra_args – Additional arguments to pass to each document

Returns:

List of DocDict objects, each representing a YouTube video from the playlist

wdoc.utils.load_recursive.parse_zotero(cli_kwargs: dict, path: str | Path, zotero_connection: Literal['auto', 'local', 'web'] = 'auto', zotero_library_id: str | None = None, zotero_library_type: Literal['user', 'group'] = 'user', zotero_api_key: str | None = None, zotero_attachment_text: Literal['wdoc', 'fulltext', 'hybrid'] = 'wdoc', zotero_include_notes: bool = False, zotero_include_metadata: bool = True, **extra_args) → list[DocDict][source]#

Turn a DocDict that has filetype==zotero into one DocDict per loadable sub-document of the selected Zotero items.

The path carries the selector (see parse_selector): a collection name or nested path by default, or one of tag:…, items:…, search:…, library/*. Each selected item fans out into:

one DocDict per attachment (a pdf/auto/url doc pointing at the downloaded/linked file, so wdoc’s own loaders handle the extraction; or a txt doc holding Zotero’s indexed fulltext when zotero_attachment_text requests it);

an always-on (zotero_include_metadata) txt doc with the item’s bibliographic header + abstract;

optionally (zotero_include_notes) one txt doc per attached note.

Parameters:

cli_kwargs – Base CLI arguments to inherit
path – The Zotero selector (see above)
zotero_connection – ‘auto’ (local then web), ‘local’, or ‘web’
zotero_library_id – Zotero numeric library id (web API), else ZOTERO_LIBRARY_ID
zotero_library_type – ‘user’ or ‘group’
zotero_api_key – Zotero api key (web API), else ZOTERO_API_KEY
zotero_attachment_text – how to get attachment text (‘wdoc’=reuse wdoc loaders on the file, ‘fulltext’=Zotero indexed fulltext, ‘hybrid’= fulltext then fall back to the file)
zotero_include_notes – also emit a doc per Zotero note
zotero_include_metadata – also emit a per-item metadata/abstract doc
**extra_args – Additional arguments to pass to each document

Returns:

List of DocDict objects, each a loadable sub-document

wdoc.utils.logger module#

Code related to logging, coloured logs, etc.

wdoc.utils.logger.setup_cli_logging() → None[source]#

Install wdoc’s stderr/stdout/file sinks on the global loguru logger.

Only the CLI entry point should call this. When wdoc is imported as a library (e.g. into an open-webui tool), we leave the host’s loguru configuration alone, so wdoc records flow through whatever handlers the host already installed.

wdoc.utils.misc module#

Miscellaneous functions etc.

class wdoc.utils.misc.ChonkieSemanticSplitter(chunk_size: int, chunk_overlap: int, length_function: Callable[[str], int])[source]#

Bases: TextSplitter

Text splitter using chonkie’s semantic chunker.

This splitter uses semantic boundaries from chonkie to create meaningful chunks, then merges them to reach the desired token count while respecting overlap. The semantic chunking is memoized for efficiency.

__init__(chunk_size: int, chunk_overlap: int, length_function: Callable[[str], int])[source]#

Initialize the semantic splitter.

Parameters:

chunk_size (int) – Maximum number of tokens per chunk.
chunk_overlap (int) – Number of tokens to overlap between chunks.
length_function (Callable[[str], int]) – Function to compute token length of text.

split_text(text: str) → list[str][source]#

Split text into chunks using semantic boundaries and token limits.

Semantic units from chonkie are merged until reaching chunk_size, with overlap handling between chunks.

Parameters: text (str) – Text to split.
Returns: List of text chunks.
Return type: List[str]

transform_documents(documents: list[Document]) → list[Document][source]#

Transform documents by splitting them into chunks.

This method splits each document’s content using semantic boundaries and creates new Document objects for each chunk, preserving the original metadata.

Parameters: documents (List[Document]) – List of documents to transform.
Returns: List of transformed document chunks.
Return type: List[Document]
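
The merge step described for split_text can be sketched as follows. This is a minimal illustration, not the class's actual implementation: the semantic units are assumed to be already split (chonkie is not called here), and merge_units and count_words are hypothetical names.

```python
from typing import Callable, List


def count_words(text: str) -> int:
    """Stand-in token counter: whitespace-separated words."""
    return len(text.split())


def merge_units(
    units: List[str],
    chunk_size: int,
    chunk_overlap: int,
    length_function: Callable[[str], int],
) -> List[str]:
    """Greedily merge pre-split units into chunks of at most chunk_size tokens."""
    chunks: List[str] = []
    current: List[str] = []
    for unit in units:
        candidate = current + [unit]
        if current and length_function("".join(candidate)) > chunk_size:
            chunks.append("".join(current))
            # Carry trailing units over as overlap, within the overlap budget.
            overlap: List[str] = []
            for prev in reversed(current):
                if length_function("".join([prev] + overlap)) > chunk_overlap:
                    break
                overlap.insert(0, prev)
            # Shrink the overlap if adding the new unit would exceed chunk_size.
            while overlap and length_function("".join(overlap + [unit])) > chunk_size:
                overlap.pop(0)
            current = overlap + [unit]
        else:
            current = candidate
    if current:
        chunks.append("".join(current))
    return chunks
```

A single unit longer than chunk_size is kept as its own oversized chunk in this sketch; a real splitter would need to split it further.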

wdoc.utils.misc.check_docs_tkn_length(docs: list[Document], identifier: Any, min_token: int = 20, max_token: int = 10000000, min_lang_prob: float = 0.5, check_language: bool = False) → float[source]#: Checks that the number of tokens in the documents is neither too low (min_token) nor too high (max_token), and that the language probability is high enough (min_lang_prob); otherwise something probably went wrong.
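
The token bounds part of such a check could look like the sketch below. It is an illustration only: check_token_bounds and count_tokens are hypothetical names, raising ValueError is an assumption about how a failure is reported, and the language probability part is left out.

```python
from typing import Callable, List


def check_token_bounds(
    texts: List[str],
    count_tokens: Callable[[str], int],
    min_token: int = 20,
    max_token: int = 10_000_000,
) -> int:
    """Hypothetical check: the total token count must lie in [min_token, max_token]."""
    total = sum(count_tokens(text) for text in texts)
    if total < min_token:
        raise ValueError(f"only {total} tokens, expected at least {min_token}")
    if total > max_token:
        raise ValueError(f"{total} tokens, expected at most {max_token}")
    return total
```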

wdoc.utils.misc.get_openrouter_metadata() → dict[source]#: Fetch the metadata from openrouter, because litellm always takes too long to add new models.

wdoc.utils.misc.get_piped_input() → Any | None[source]#: Read data from stdin/pipes. This is done when importing wdoc, to avoid any issues with parallelism and threads etc. The content is added to the commandline starting wdoc directly in __main__.py.
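
Reading piped data typically comes down to checking whether stdin is an interactive terminal. The sketch below shows that pattern with the standard library only; read_piped_input is a hypothetical name and the real function may handle more cases (binary data, for instance).

```python
import io
import sys
from typing import Optional, TextIO


def read_piped_input(stream: Optional[TextIO] = None) -> Optional[str]:
    """Return piped text, or None when the stream is an interactive terminal."""
    stream = sys.stdin if stream is None else stream
    if stream.isatty():
        # Nothing was piped in: the user is typing at a terminal.
        return None
    data = stream.read()
    return data if data else None


# Usage with an in-memory stream standing in for a pipe.
demo = read_piped_input(io.StringIO("some piped text"))
```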

wdoc.utils.misc.get_splitter(task: wdocTask, modelname: ModelName = ModelName(original='openai/gpt-4o-mini', backend='openai', model='gpt-4o-mini', sanitized='openai_gpt-4o-mini')) → TextSplitter[source]#: The text splitter used depends on the task.

wdoc.utils.misc.html_to_text(html: str, remove_image: bool = False) → str[source]#: used to strip any html present in the text files

wdoc.utils.misc.language_detector(text: str) → float[source]#

wdoc.utils.misc.log_and_time_fn(fn: Callable) → Callable[source]#

wdoc.utils.misc.optional_strip_unexp_args(func: Callable) → Callable[source]#: if the environment variable WDOC_STRICT_DOCDICT is set to ‘true’ then this automatically removes any unexpected argument before calling a loader function for a specific filetype.
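
A decorator of this kind can be sketched with inspect.signature. This is a minimal illustration under the behaviour stated above (stripping when WDOC_STRICT_DOCDICT is 'true'); strip_unexpected_kwargs and load_txt are hypothetical names, not wdoc's code.

```python
import functools
import inspect
import os
from typing import Callable


def strip_unexpected_kwargs(func: Callable) -> Callable:
    """Drop keyword arguments that func does not declare, when the env flag is set."""
    params = inspect.signature(func).parameters
    # A function taking **kwargs accepts everything, so nothing is stripped.
    accepts_any = any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        strict = os.environ.get("WDOC_STRICT_DOCDICT", "").lower() == "true"
        if strict and not accepts_any:
            kwargs = {k: v for k, v in kwargs.items() if k in params}
        return func(*args, **kwargs)

    return wrapper


@strip_unexpected_kwargs
def load_txt(path: str) -> str:
    """Hypothetical loader used only to exercise the decorator."""
    return path
```

The environment variable is read at call time rather than at decoration time, so changing it between calls takes effect immediately in this sketch.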

wdoc.utils.misc.wrapped_model_name_matcher(model: str) → str[source]#: Find the best match for a model name (wrapped to perform some checks).
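
As an illustration of best-match lookup for a model name, here is a sketch using difflib from the standard library. It is a stand-in technique, not necessarily the matching logic wdoc uses; match_model_name and the known list are hypothetical.

```python
import difflib
from typing import List


def match_model_name(model: str, known: List[str]) -> str:
    """Return model if it is known, else its closest known name."""
    if model in known:
        return model
    close = difflib.get_close_matches(model, known, n=1, cutoff=0.6)
    if not close:
        raise ValueError(f"no known model resembles {model!r}")
    return close[0]
```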

wdoc.utils.prompts module#

Prompts used by wdoc.

class wdoc.utils.prompts.ExpandedQuery(*, thoughts: str, output_queries: list[str])[source]#

Bases: BaseModel

classmethod nonempty_queries(values: dict) → dict[source]#

classmethod remove_thoughts(values) → list[str][source]#

model_config = {}#: Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

class wdoc.utils.prompts.Prompts_class(**prompts: ChatPromptTemplate)[source]#

Bases: object

enable_prompt_caching(prompt_key: str) → None[source]#

wdoc.utils.retrievers module#

Retrievers used to retrieve the appropriate embeddings for a given query.

wdoc.utils.retrievers.create_multiquery_retriever(llm: langchain_litellm.ChatLiteLLM, retriever: BaseRetriever) → BaseRetriever[source]#

wdoc.utils.retrievers.create_parent_retriever(task: wdocTask, loaded_embeddings: Any, loaded_docs: list[Document], top_k: int, relevancy: float) → BaseRetriever[source]#: https://python.langchain.com/docs/modules/data_connection/retrievers/parent_document_retriever

wdoc.utils.retrievers.create_retrievers(query_retrievers: str, loaded_embeddings, embedding_engine, llm, top_k: int, relevancy: float, task: wdocTask, loaded_docs: list[Document] | None) → BaseRetriever[source]#: Create and return list of retrievers based on query_retrievers setting.

wdoc.utils.retrievers.get_all_texts(loaded_embeddings: Embeddings) → list[str][source]#

Module contents#