wdoc.utils.tasks package#

Submodules#

wdoc.utils.tasks.parse module#

Parse document functionality.

wdoc.utils.tasks.parse.parse_doc(filetype: str = 'auto', format: Literal['text', 'split_text', 'xml', 'langchain', 'langchain_dict'] = 'text', debug: bool = False, verbose: bool = False, out_file: str | Path | None = None, **kwargs) → List[Document] | str | List[dict][source]#

# Content of wdoc/docs/parse_doc_help.md

# Parse Doc

## Description

parse_doc is the function called when you run wdoc parse_doc --path=my_path. It takes essentially the file-related arguments of wdoc and completely bypasses anything related to summarising, querying, LLMs, etc. It is therefore meant to be used as a utility that parses any input to text. For example, you can use it to quickly parse anything to send to [@simonw’s](simonw/) [llm](simonw/llm) or any other shell utility.

## Arguments

filetype: str
- Same as for wdoc
format: str, default text
- if text: returns the text, with the splits joined by a newline
- if split_text: returns the text, with indicators for the document splits
- if xml: returns the text in an XML-like format
- if langchain: returns a list of langchain Documents
- if langchain_dict: returns a list of langchain Documents as
  python dicts (easy to process as JSON, and metadata are included)
debug: bool, default False
- Same as for wdoc
verbose: bool, default False
- Same as for wdoc
out_file: str or Path, default None
- If specified, writes the output to the given file path.
- If the file exists and is binary, the function will crash.
- Otherwise, the output will be appended to the file (no overwrite).
- The output is still returned normally for programmatic use.
**kwargs
- Remaining keyword arguments are assumed to be DocDict arguments.
  The full list is at wdoc.utils.misc.filetype_arg_types or in the “DocDict arguments” section of wdoc --help.

## Return value

- Either the document’s page_content as a string, or a list of langchain Documents (each with the attributes page_content and metadata), or, with the langchain_dict format, a list of python dicts.

wdoc.utils.tasks.query module#

Chain (logic) used to query a document.

wdoc.utils.tasks.query.autoincrease_top_k(filtered_docs: list[Document], top_k: int, max_top_k: int | None) → list[Document][source]#

Check if the number of filtered documents suggests top_k should be increased.

This function evaluates the ratio of filtered documents to top_k and raises an exception if the ratio is too high (>=0.9), suggesting that more documents should be retrieved. This mechanism allows the query system to automatically increase top_k when it appears that good documents might be getting cut off due to the limit.

Parameters:

filtered_docs (List[Document]) – The list of documents that passed the LLM evaluation filtering.
top_k (int) – The current top_k value used for document retrieval.
max_top_k (Optional[int]) – The maximum allowed value for top_k. If None, no automatic increase will be attempted.

Returns:

The same list of filtered documents (unchanged).

Return type:

List[Document]

Raises:

ShouldIncreaseTopKAfterLLMEvalFiltering – When the ratio of filtered documents to top_k is >= 0.9 and top_k can still be increased (i.e., top_k < max_top_k).

Notes

This function is designed to be used in a langchain pipeline where the exception can be caught to retry the query with an increased top_k value. The function logs warnings when the ratio suggests top_k should be increased but max_top_k has been reached.
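The check and the retry it enables can be sketched as follows. This is a minimal illustration, not wdoc's implementation: `ShouldIncreaseTopK` is a hypothetical stand-in for ShouldIncreaseTopKAfterLLMEvalFiltering, `retry_with_larger_top_k` is a hypothetical caller, and doubling top_k on each retry is an assumption (the documentation does not say how top_k is increased). It also assumes top_k is greater than zero.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


class ShouldIncreaseTopK(Exception):
    """Hypothetical stand-in for ShouldIncreaseTopKAfterLLMEvalFiltering."""


def autoincrease_top_k_sketch(
    filtered_docs: list, top_k: int, max_top_k: Optional[int]
) -> list:
    # No ceiling configured: never attempt an automatic increase.
    if max_top_k is None:
        return filtered_docs
    ratio = len(filtered_docs) / top_k
    if ratio >= 0.9:
        if top_k < max_top_k:
            # Almost every retrieved document survived filtering, so good
            # documents may have been cut off by the top_k limit.
            raise ShouldIncreaseTopK(f"ratio {ratio:.2f} with top_k={top_k}")
        logger.warning("ratio %.2f is high but max_top_k is reached", ratio)
    return filtered_docs


def retry_with_larger_top_k(
    run_query: Callable[[int], list], top_k: int, max_top_k: int
) -> tuple:
    """Hypothetical caller: doubles top_k (capped) whenever the check raises."""
    while True:
        try:
            docs = run_query(top_k)
            return autoincrease_top_k_sketch(docs, top_k, max_top_k), top_k
        except ShouldIncreaseTopK:
            top_k = min(top_k * 2, max_top_k)
```

The loop terminates because the check stops raising once top_k equals max_top_k.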

wdoc.utils.tasks.query.check_intermediate_answer(ans: str) → bool[source]#: filters out the intermediate answers that are deemed irrelevant.

wdoc.utils.tasks.query.collate_relevant_intermediate_answers(list_ia: list[str]) → str[source]#: rewrite the relevant intermediate answers in a single string to be readable by the combining LLM

wdoc.utils.tasks.query.parse_eval_output(output: str) → str[source]#

Parse an LLM’s answer about whether a document is relevant or not into an integer from 0 to 10, returned as a str.

For example, it turns an LLM answer from:

''' <think> I am thinking hard about if the document is relevant to the user query on a scale of 0 (irrelevant) to 10 (very relevant). … </think>

<answer>10</answer> '''

into simply: '10'
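A minimal sketch of this kind of parsing, assuming the answer always arrives inside `<answer>` tags and that any reasoning is wrapped in `<think>` tags as in the example above. The function name `parse_eval_output_sketch` and the error handling are illustrative assumptions; the real function may accept other output shapes.

```python
import re


def parse_eval_output_sketch(output: str) -> str:
    # Drop the reasoning first, so numbers mentioned while "thinking"
    # (such as the 0 and 10 of the scale) cannot be mistaken for the score.
    without_thinking = re.sub(r"<think>.*?</think>", "", output, flags=re.DOTALL)
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", without_thinking)
    if match is None:
        raise ValueError("no <answer> tag found in the LLM output")
    score = int(match.group(1))
    if not 0 <= score <= 10:
        raise ValueError(f"score {score} is outside the 0-10 range")
    return str(score)


llm_output = """<think> On a scale of 0 (irrelevant) to 10 (very relevant)... </think>

<answer>10</answer>"""
print(parse_eval_output_sketch(llm_output))
```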

wdoc.utils.tasks.query.pbar_chain(llm: langchain_litellm.ChatLiteLLM | langchain_community.chat_models.fake.FakeListChatModel, len_func: str, **tqdm_kwargs) → RunnableLambda[source]#: create a chain that just sets a tqdm progress bar

wdoc.utils.tasks.query.pbar_closer(llm: langchain_litellm.ChatLiteLLM | langchain_community.chat_models.fake.FakeListChatModel) → RunnableLambda[source]#: close a pbar created by pbar_chain

wdoc.utils.tasks.query.retrieve_documents_for_query(retriever)[source]#

Create a retrieve documents chain for query tasks.

Parameters: retriever (object) – The retriever object to use for document retrieval.
Returns: A chain that retrieves documents using the provided retriever.
Return type: RunnableLambda

wdoc.utils.tasks.query.semantic_batching(texts: list[str], embedding_engine: Embeddings) → list[list[str]][source]#: Given a list of texts, embed them, perform hierarchical clustering, sort the list according to the leaf order, then create buckets that best contain each subtopic while keeping a reasonable number of tokens. This probably helps the LLM combine the intermediate answers into one. Note that the documents are also sorted inside each batch, so iterating over each document of each batch in order follows the optimal leaf order.

wdoc.utils.tasks.query.sieve_documents(instance) → RunnableLambda[source]#: Cap the number of retrieved documents, because when multiple retrievers are used we can end up with a lot more documents.

wdoc.utils.tasks.query.source_replace(input: str, mapping: dict) → str[source]#

Replace document identifiers in text with their corresponding numbers.

This function substitutes document IDs (like WDOC_1, WDOC_2) with their corresponding document numbers from the mapping dictionary. It processes in reverse order to avoid issues like WDOC_2 replacing part of WDOC_21.

Parameters:

input (str) – The text containing document identifiers to replace.
mapping (dict) – Dictionary mapping document IDs to document numbers.

Returns:

Text with document identifiers replaced by numbers.

Return type:

str
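The ordering problem described above can be illustrated with a short sketch. It is not wdoc's code: `source_replace_sketch` is a hypothetical name, and sorting identifiers by length (longest first) is one assumed way to get the "WDOC_21 before WDOC_2" ordering that the description calls reverse order.

```python
def source_replace_sketch(text: str, mapping: dict) -> str:
    # Longest identifiers first, so WDOC_21 is consumed before WDOC_2
    # has a chance to rewrite its prefix.
    for doc_id in sorted(mapping, key=len, reverse=True):
        text = text.replace(doc_id, str(mapping[doc_id]))
    return text


mapping = {"WDOC_2": 1, "WDOC_21": 2}
answer = "Claim A [WDOC_2], claim B [WDOC_21]."
print(source_replace_sketch(answer, mapping))
```

Replacing in insertion order instead would first turn WDOC_2 into 1, leaving `[11]` where `[2]` was intended.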

wdoc.utils.tasks.search module#

Chain (logic) used for search tasks.

wdoc.utils.tasks.search.retrieve_documents_for_search(retriever)[source]#

Create a retrieve documents chain for search tasks.

Parameters: retriever (object) – The retriever object to use for document retrieval.
Returns: A chain that retrieves documents using the provided retriever.
Return type: RunnableLambda

wdoc.utils.tasks.shared_query_search module#

Shared utilities for query and search tasks.

wdoc.utils.tasks.shared_query_search.create_evaluate_doc_chain(eval_llm, eval_llm_params: list[str], query_eval_check_number: int, eval_cache_wrapper: Callable, prompts)[source]#

Create a document evaluation chain for assessing document relevance.

This function creates a chain that evaluates documents for relevance to a query using an LLM. It handles different model configurations and caching strategies.

Parameters:

eval_llm (object) – The evaluation LLM instance
eval_llm_params (List[str]) – List of supported parameters for the evaluation LLM
query_eval_check_number (int) – Number of evaluation checks to perform
eval_cache_wrapper (Callable) – Function to wrap the evaluation for caching
prompts (object) – Prompts object containing the evaluation prompt

Returns:

A langchain chain object for document evaluation

Return type:

chain

wdoc.utils.tasks.shared_query_search.split_query_parts(query: str) → tuple[str, str][source]#

Split query into parts for embedding search and answering.

If the query contains “>>>>”, splits it into:
- query_for_embedding: part before >>>>
- query_to_answer: part after >>>>

Otherwise returns the same query for both purposes.

Parameters: query (str) – The input query string
Returns: A tuple of (query_for_embedding, query_to_answer)
Return type: Tuple[str, str]
Raises: AssertionError – If query contains more than one occurrence of “>>>>”
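The documented behaviour can be sketched as below. The name `split_query_parts_sketch` is hypothetical, and stripping whitespace around each part is an assumption that the documentation does not confirm.

```python
def split_query_parts_sketch(query: str) -> tuple[str, str]:
    separator = ">>>>"
    assert query.count(separator) <= 1, "query may contain '>>>>' at most once"
    if separator in query:
        query_for_embedding, query_to_answer = query.split(separator)
        # Assumption: surrounding whitespace is not meaningful.
        return query_for_embedding.strip(), query_to_answer.strip()
    # No separator: the same text is used for both purposes.
    return query, query


parts = split_query_parts_sketch(
    "vaccine side effects >>>> What are the most common side effects?"
)
print(parts)
```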

wdoc.utils.tasks.summarize module#

Chain (logic) used to summarize a document.

class wdoc.utils.tasks.summarize.wdocSummary(path: str, summary: str, recursive_summaries: dict[int, str], sum_reading_length: float, sum_tkn_length: int, doc_reading_length: float, doc_total_tokens: dict[str, int], doc_total_tokens_sum: int, doc_total_tokens_str: str, doc_total_cost: float | int, author: str | None, n_chunk: int)[source]#

Bases: object

Container for document summarization results with dict-like access.

This dataclass encapsulates all outputs from the document summarization process, including metrics, costs, and the summary text itself. It provides dict-like access for backward compatibility while offering better type safety and cleaner code structure.

Variables:

path (str) – Original document path or URL that was summarized.
summary (str) – Final summary text from the best recursion pass.
recursive_summaries (Dict[int, str]) – Mapping of recursion level to summary text for each pass.
sum_reading_length (float) – Estimated reading time in minutes for the final summary.
sum_tkn_length (int) – Token count of the final summary text.
doc_reading_length (float) – Original document reading time in minutes.
doc_total_tokens (Dict[str, int]) – Token usage breakdown by type (prompt, completion, internal_reasoning).
doc_total_tokens_sum (int) – Total tokens used across all operations.
doc_total_tokens_str (str) – Human-readable string representation of token usage.
doc_total_cost (Union[float, int]) – Total cost in dollars for LLM usage.
author (Optional[str]) – Document author if available in metadata.
n_chunk (int) – Number of document chunks that were processed.

get(key: str, default=None)[source]#: Get value with default like dict.get().

items()[source]#: Return field name-value pairs like dict.items().

keys()[source]#: Return field names like dict.keys().

values()[source]#: Return field values like dict.values().
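A minimal sketch of how a dataclass can offer dict-like access of this kind. `SummarySketch` is a hypothetical, reduced stand-in with only three of the fields listed above, and the `__getitem__` method is an assumption about what "dict-like access" covers beyond the four documented methods.

```python
from dataclasses import dataclass, fields


@dataclass
class SummarySketch:
    """Hypothetical, reduced stand-in for wdocSummary."""

    path: str
    summary: str
    n_chunk: int

    def keys(self) -> list:
        return [f.name for f in fields(self)]

    def values(self) -> list:
        return [getattr(self, name) for name in self.keys()]

    def items(self) -> list:
        return list(zip(self.keys(), self.values()))

    def get(self, key: str, default=None):
        # Only dataclass fields count as keys, never methods such as "keys".
        return getattr(self, key) if key in self.keys() else default

    def __getitem__(self, key: str):
        # Assumption: subscript access is part of the dict-like interface.
        if key not in self.keys():
            raise KeyError(key)
        return getattr(self, key)


result = SummarySketch(path="doc.pdf", summary="A short summary.", n_chunk=3)
print(result["summary"], result.get("author", "unknown"))
```

Code written against a plain dict, such as `result["summary"]` or `for k, v in result.items()`, keeps working with this design, while attribute access (`result.summary`) remains available for type checkers.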

wdoc.utils.tasks.summarize.summarize_documents(path: str | Path, relevant_docs: list, summary_language: str, model: ModelName, llm: langchain_litellm.ChatLiteLLM | langchain_community.chat_models.fake.FakeListChatModel, llm_verbosity: bool, summary_n_recursion: int, llm_price: dict, in_import_mode: bool, out_file: str | None, wdoc_version: str, citation_url_template: str | None = None) → wdocSummary[source]#

Orchestrate the complete document summarization process with optional recursion.

This function serves as the main entry point for document summarization. It extracts metadata from documents, performs initial summarization, optionally applies recursive summarization to condense the output further, calculates costs and reading times, and handles output formatting and file writing. Recursive summarization continues until the summary converges or reaches the specified recursion limit.

Parameters:

path (Union[str, Path]) – Source path or URL of the document being summarized. Used for metadata and identification purposes.
relevant_docs (List) – List of Document objects containing the content to summarize. Must not be empty and should contain metadata like ‘title’, ‘author’, etc.
summary_language (str) – Target language for the summary output.
model (ModelName) – Model configuration object containing backend and tokenization info.
llm (Union[langchain_litellm.ChatLiteLLM, langchain_community.chat_models.fake.FakeListChatModel]) – Language model instance for generating summaries.
llm_verbosity (bool) – If True, enables verbose logging of LLM interactions and intermediate outputs.
summary_n_recursion (int) – Maximum number of recursive summarization passes. 0 means no recursion. Each pass attempts to further condense the previous summary.
llm_price (dict) – Pricing information for token usage calculation with keys matching token types (‘prompt’, ‘completion’, ‘internal_reasoning’).
in_import_mode (bool) – If True, suppresses console output for integration scenarios.
out_file (Optional[str]) – Path to output file for saving the summary. If None, no file is written. Intermediate recursion summaries are saved with numbered extensions.
wdoc_version (str) – Version string of wdoc for metadata tracking.

Returns:

Comprehensive summary results containing all metrics, costs, and summary text. Can be accessed as a dict for backward compatibility.

Return type:

wdocSummary

Raises:

AssertionError – If relevant_docs is empty or contains unexpected data.

Notes

Recursive summarization stops early if:
- The summary text becomes identical to the previous iteration
- Token length validation fails for the recursive summary chunks
- Maximum recursion depth is reached

The function prioritizes cost transparency through detailed token tracking, and supports both interactive and programmatic usage through the in_import_mode parameter.
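The stopping conditions listed in the notes can be sketched as a simple loop. Everything here is illustrative: `recursive_summary_sketch`, `summarize_once`, and `chunks_fit` are hypothetical names, the token length validation is reduced to a caller-supplied predicate, and storing the initial summary under level 0 is an assumption about how recursive_summaries is keyed.

```python
from typing import Callable


def recursive_summary_sketch(
    text: str,
    summarize_once: Callable[[str], str],
    n_recursion: int,
    chunks_fit: Callable[[str], bool] = lambda summary: True,
) -> dict:
    # Level 0 holds the initial (non-recursive) summary.
    summaries = {0: summarize_once(text)}
    for level in range(1, n_recursion + 1):  # maximum recursion depth
        previous = summaries[level - 1]
        if not chunks_fit(previous):
            break  # token length validation failed
        candidate = summarize_once(previous)
        if candidate == previous:
            break  # converged: nothing left to condense
        summaries[level] = candidate
    return summaries


def halve(text: str) -> str:
    """Toy summarizer: keeps the first half, but never fewer than 4 characters."""
    return text[: max(4, len(text) // 2)]


levels = recursive_summary_sketch("abcdefghijklmnop", halve, n_recursion=5)
print(levels)
```

With the toy summarizer, the loop stops after level 1 because the next pass returns the same text, even though five passes were allowed.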

wdoc.utils.tasks.types module#

Module contents#