wdoc.utils.tasks package
Submodules
wdoc.utils.tasks.parse module
Parse document functionality.
- wdoc.utils.tasks.parse.parse_doc(filetype: str = 'auto', format: Literal['text', 'split_text', 'xml', 'langchain', 'langchain_dict'] = 'text', debug: bool = False, verbose: bool = False, out_file: str | Path | None = None, **kwargs) → List[Document] | str | List[dict]
# Content of wdoc/docs/parse_doc_help.md
# Parse Doc
## Description
parse_doc is the function called when you run wdoc parse_doc --path=my_path. It takes the file-related arguments of wdoc and completely bypasses anything related to summarising, querying, LLMs, etc. It is therefore meant to be used as a utility that parses any input to text. For example, you can use it to quickly parse anything and send it to [@simonw’s](simonw/) [llm](simonw/llm) or any other shell utility.
## Arguments
- filetype: str
Same as for wdoc
- format: str, default text
if text: returns the text, with the splits joined by a newline
if split_text: returns the text, with indicators for the document splits
if xml: returns the text in an XML-like format
if langchain: returns a list of langchain Documents
if langchain_dict: returns a list of langchain Documents as Python dicts (easy to serialise to JSON, and the metadata are included)
- debug: bool, default False
Same as for wdoc
- verbose: bool, default False
Same as for wdoc
- out_file: str or Path, default None
If specified, writes the output to the given file path.
If the file exists and is binary, the function will crash.
Otherwise, the output will be appended to the file (no overwrite).
The output is still returned normally for programmatic use.
- **kwargs
Remaining keyword arguments are assumed to be DocDict arguments;
the full list is at wdoc.utils.misc.filetype_arg_types or in the “DocDict arguments” section of wdoc --help.
## Return value
- Either the document’s page_content as a string, a list of langchain Documents (each with the attributes page_content and metadata), or a list of dicts when format is langchain_dict.
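To make the format argument more concrete, here is a minimal sketch of how three of the documented formats relate to each other. It does not call wdoc: the Document dataclass is a stand-in for the langchain class, and render is a hypothetical helper written for this illustration only. The split_text and xml formats are left out because their exact layout is not described here.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Stand-in for a langchain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def render(docs: list, format: str = "text"):
    """Hypothetical helper mirroring three of the documented formats."""
    if format == "text":
        # The splits are joined, separated by a newline.
        return "\n".join(d.page_content for d in docs)
    if format == "langchain":
        # The Document objects are returned unchanged.
        return docs
    if format == "langchain_dict":
        # Plain dicts are easy to serialise to JSON and keep the metadata.
        return [
            {"page_content": d.page_content, "metadata": d.metadata}
            for d in docs
        ]
    raise ValueError(f"format not covered by this sketch: {format!r}")


docs = [
    Document("first split", {"page": 1}),
    Document("second split", {"page": 2}),
]
text = render(docs, "text")
as_dicts = render(docs, "langchain_dict")
```

The langchain_dict form is the convenient one when the result has to cross a process boundary, since it can be passed straight to json.dumps.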
wdoc.utils.tasks.query module
Chain (logic) used to query a document.
- wdoc.utils.tasks.query.autoincrease_top_k(filtered_docs: list[Document], top_k: int, max_top_k: int | None) → list[Document]
Check if the number of filtered documents suggests top_k should be increased.
This function evaluates the ratio of filtered documents to top_k and raises an exception if the ratio is too high (>=0.9), suggesting that more documents should be retrieved. This mechanism allows the query system to automatically increase top_k when it appears that good documents might be getting cut off due to the limit.
- Parameters:
filtered_docs (List[Document]) – The list of documents that passed the LLM evaluation filtering.
top_k (int) – The current top_k value used for document retrieval.
max_top_k (Optional[int]) – The maximum allowed value for top_k. If None, no automatic increase will be attempted.
- Returns:
The same list of filtered documents (unchanged).
- Return type:
List[Document]
- Raises:
ShouldIncreaseTopKAfterLLMEvalFiltering – When the ratio of filtered documents to top_k is >= 0.9 and top_k can still be increased (i.e., top_k < max_top_k).
Notes
This function is designed to be used in a langchain pipeline where the exception can be caught to retry the query with an increased top_k value. The function logs warnings when the ratio suggests top_k should be increased but max_top_k has been reached.
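The check and the retry pattern described above can be sketched as follows. This is a standalone re-implementation written from the description, not wdoc's actual code: the exception class is redefined locally, query_with_retry is a hypothetical caller, and doubling top_k on each retry is an assumption made for the example.

```python
import logging

logger = logging.getLogger(__name__)


class ShouldIncreaseTopKAfterLLMEvalFiltering(Exception):
    """Signals that the query should be retried with a larger top_k."""


def autoincrease_top_k(filtered_docs: list, top_k: int, max_top_k):
    if max_top_k is None:
        # Automatic increase is disabled.
        return filtered_docs
    ratio = len(filtered_docs) / top_k
    if ratio >= 0.9:
        if top_k < max_top_k:
            raise ShouldIncreaseTopKAfterLLMEvalFiltering(
                f"{len(filtered_docs)}/{top_k} documents kept"
            )
        logger.warning("top_k should be increased but max_top_k was reached")
    return filtered_docs


def query_with_retry(run_query, top_k: int, max_top_k: int):
    """Hypothetical caller: double top_k (capped) until no exception is raised."""
    while True:
        try:
            docs = autoincrease_top_k(run_query(top_k), top_k, max_top_k)
            return docs, top_k
        except ShouldIncreaseTopKAfterLLMEvalFiltering:
            top_k = min(top_k * 2, max_top_k)


# Fake pipeline in which 12 documents are relevant in total.
docs, final_top_k = query_with_retry(
    lambda k: list(range(min(k, 12))), top_k=5, max_top_k=40
)
```

With these inputs the loop retries at top_k 5 and 10 (every retrieved document survives the filter) and settles at 20, where only 12 of the 20 slots are used.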
- wdoc.utils.tasks.query.check_intermediate_answer(ans: str) → bool
Filter out the intermediate answers that are deemed irrelevant.
- wdoc.utils.tasks.query.collate_relevant_intermediate_answers(list_ia: list[str]) → str
Rewrite the relevant intermediate answers as a single string that the combining LLM can read.
- wdoc.utils.tasks.query.parse_eval_output(output: str) → str
Parse an LLM’s answer about whether a document is relevant or not into an integer from 0 to 10, returned as a str.
For example, it turns an LLM answer from:
'''
<think> I am thinking hard about whether the document is relevant to the user query on a scale of 0 (irrelevant) to 10 (very relevant). … </think>
<answer>10</answer>
'''
into simply: '10'
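A minimal sketch of this kind of parsing, assuming the answer always arrives inside answer tags as in the example. The regular expressions and the error handling are assumptions made for illustration; the real function may accept other layouts.

```python
import re


def parse_eval_output(output: str) -> str:
    # Drop any reasoning block so that digits inside it cannot be picked up.
    without_thoughts = re.sub(
        r"<think>.*?</think>", "", output, flags=re.DOTALL
    )
    match = re.search(r"<answer>\s*(\d+)\s*</answer>", without_thoughts)
    if match is None:
        raise ValueError("no <answer> tag found")
    score = int(match.group(1))
    if not 0 <= score <= 10:
        raise ValueError(f"score out of range: {score}")
    return str(score)


raw = """<think>
Is the document relevant on a scale of 0 to 10? It matches the query closely.
</think>
<answer>10</answer>"""
result = parse_eval_output(raw)
```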
- wdoc.utils.tasks.query.pbar_chain(llm: langchain_litellm.ChatLiteLLM | langchain_community.chat_models.fake.FakeListChatModel, len_func: str, **tqdm_kwargs) → RunnableLambda
Create a chain that just sets up a tqdm progress bar.
- wdoc.utils.tasks.query.pbar_closer(llm: langchain_litellm.ChatLiteLLM | langchain_community.chat_models.fake.FakeListChatModel) → RunnableLambda
Close a progress bar created by pbar_chain.
- wdoc.utils.tasks.query.retrieve_documents_for_query(retriever)
Create a retrieve documents chain for query tasks.
- Parameters:
retriever (object) – The retriever object to use for document retrieval.
- Returns:
A chain that retrieves documents using the provided retriever.
- Return type:
RunnableLambda
- wdoc.utils.tasks.query.semantic_batching(texts: list[str], embedding_engine: Embeddings) → list[list[str]]
Given a list of texts, embed them, perform a hierarchical clustering, sort the list according to the leaf order, then create buckets that best contain each subtopic while keeping a reasonable number of tokens. This probably helps the LLM combine the intermediate answers into one. Note that the documents are also sorted inside each batch, so that iterating over each document of each batch in order follows the optimal leaf order.
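The last step, turning an ordered list into batches, can be illustrated in isolation. The sketch below assumes the texts have already been sorted by leaf order (the embedding and clustering steps are not reproduced) and uses simple greedy packing under a token budget. The helper name bucket_in_order, the max_tokens parameter, and the whitespace token counter are all hypothetical, and the real function may choose bucket boundaries differently so that they follow subtopics.

```python
def bucket_in_order(texts, max_tokens, count_tokens=lambda t: len(t.split())):
    """Greedily pack already-ordered texts into buckets under a token budget."""
    buckets = []
    current = []
    used = 0
    for text in texts:
        n = count_tokens(text)
        # Start a new bucket when adding this text would exceed the budget.
        if current and used + n > max_tokens:
            buckets.append(current)
            current, used = [], 0
        current.append(text)
        used += n
    if current:
        buckets.append(current)
    return buckets


# Texts assumed to be in leaf order already: neighbours are semantically close.
ordered = [
    "cats purr",
    "cats sleep a lot",
    "dogs bark",
    "dogs fetch sticks",
    "fish swim",
]
batches = bucket_in_order(ordered, max_tokens=6)
```

Because the order is never changed, iterating over every text of every batch in sequence reproduces the input order, which is the property the description relies on.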
- wdoc.utils.tasks.query.sieve_documents(instance) → RunnableLambda
Cap the number of retrieved documents: when multiple retrievers are used, we can end up with far more documents.
- wdoc.utils.tasks.query.source_replace(input: str, mapping: dict) → str
Replace document identifiers in text with their corresponding numbers.
This function substitutes document IDs (like WDOC_1, WDOC_2) with their corresponding document numbers from the mapping dictionary. It processes in reverse order to avoid issues like WDOC_2 replacing part of WDOC_21.
- Parameters:
input (str) – The text containing document identifiers to replace.
mapping (dict) – Dictionary mapping document IDs to document numbers.
- Returns:
Text with document identifiers replaced by numbers.
- Return type:
str
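The ordering problem mentioned above is easy to reproduce. The sketch below is a standalone illustration, not wdoc's code: it interprets "reverse order" as replacing the longest identifiers first, which is an assumption, and it names the parameter text instead of input to avoid shadowing the builtin.

```python
def source_replace(text: str, mapping: dict) -> str:
    # Longest identifiers first, so that WDOC_21 is handled before WDOC_2.
    for doc_id in sorted(mapping, key=len, reverse=True):
        text = text.replace(doc_id, str(mapping[doc_id]))
    return text


def naive_replace(text: str, mapping: dict) -> str:
    # Shortest first, to show the failure mode the ordering avoids.
    for doc_id in sorted(mapping, key=len):
        text = text.replace(doc_id, str(mapping[doc_id]))
    return text


mapping = {"WDOC_2": 1, "WDOC_21": 2}
good = source_replace("See WDOC_2 and WDOC_21.", mapping)
bad = naive_replace("See WDOC_2 and WDOC_21.", mapping)
```

In the naive version WDOC_2 is substituted inside WDOC_21 first, so the second document ends up rendered as "11" instead of "2".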
wdoc.utils.tasks.search module
Chain (logic) used for search tasks.
- wdoc.utils.tasks.search.retrieve_documents_for_search(retriever)
Create a retrieve documents chain for search tasks.
- Parameters:
retriever (object) – The retriever object to use for document retrieval.
- Returns:
A chain that retrieves documents using the provided retriever.
- Return type:
RunnableLambda
wdoc.utils.tasks.summarize module
Chain (logic) used to summarize a document.
- class wdoc.utils.tasks.summarize.wdocSummary(path: str, summary: str, recursive_summaries: dict[int, str], sum_reading_length: float, sum_tkn_length: int, doc_reading_length: float, doc_total_tokens: dict[str, int], doc_total_tokens_sum: int, doc_total_tokens_str: str, doc_total_cost: float | int, author: str | None, n_chunk: int)
Bases: object
Container for document summarization results with dict-like access.
This dataclass encapsulates all outputs from the document summarization process, including metrics, costs, and the summary text itself. It provides dict-like access for backward compatibility while offering better type safety and cleaner code structure.
- Variables:
path (str) – Original document path or URL that was summarized.
summary (str) – Final summary text from the best recursion pass.
recursive_summaries (Dict[int, str]) – Mapping of recursion level to summary text for each pass.
sum_reading_length (float) – Estimated reading time in minutes for the final summary.
sum_tkn_length (int) – Token count of the final summary text.
doc_reading_length (float) – Original document reading time in minutes.
doc_total_tokens (Dict[str, int]) – Token usage breakdown by type (prompt, completion, internal_reasoning).
doc_total_tokens_sum (int) – Total tokens used across all operations.
doc_total_tokens_str (str) – Human-readable string representation of token usage.
doc_total_cost (Union[float, int]) – Total cost in dollars for LLM usage.
author (Optional[str]) – Document author if available in metadata.
n_chunk (int) – Number of document chunks that were processed.
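One way a dataclass can offer dict-like access alongside attribute access is sketched below. SummarySketch is a hypothetical, trimmed-down stand-in with only three of the fields listed above, and the use of __getitem__ plus keys is an assumption about how such access could be provided, not a description of wdocSummary's internals.

```python
from dataclasses import dataclass, fields


@dataclass
class SummarySketch:
    """Hypothetical stand-in with a subset of the documented fields."""
    path: str
    summary: str
    n_chunk: int

    def keys(self):
        return [f.name for f in fields(self)]

    def __getitem__(self, key: str):
        # Only dataclass fields are exposed, like the keys of a dict.
        if key not in self.keys():
            raise KeyError(key)
        return getattr(self, key)


result = SummarySketch(path="paper.pdf", summary="Short summary.", n_chunk=3)
by_attribute = result.summary
by_key = result["summary"]
as_dict = dict(result)
```

Defining keys together with __getitem__ is enough for dict(result) to work, so older code that expected a plain dict can keep running.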
- wdoc.utils.tasks.summarize.summarize_documents(path: str | Path, relevant_docs: list, summary_language: str, model: ModelName, llm: langchain_litellm.ChatLiteLLM | langchain_community.chat_models.fake.FakeListChatModel, llm_verbosity: bool, summary_n_recursion: int, llm_price: dict, in_import_mode: bool, out_file: str | None, wdoc_version: str, citation_url_template: str | None = None) → wdocSummary
Orchestrate the complete document summarization process with optional recursion.
This function serves as the main entry point for document summarization. It extracts metadata from documents, performs initial summarization, optionally applies recursive summarization to condense the output further, calculates costs and reading times, and handles output formatting and file writing. Recursive summarization continues until the summary converges or reaches the specified recursion limit.
- Parameters:
path (Union[str, Path]) – Source path or URL of the document being summarized. Used for metadata and identification purposes.
relevant_docs (List) – List of Document objects containing the content to summarize. Must not be empty and should contain metadata like ‘title’, ‘author’, etc.
summary_language (str) – Target language for the summary output.
model (ModelName) – Model configuration object containing backend and tokenization info.
llm (Union[langchain_litellm.ChatLiteLLM, langchain_community.chat_models.fake.FakeListChatModel]) – Language model instance for generating summaries.
llm_verbosity (bool) – If True, enables verbose logging of LLM interactions and intermediate outputs.
summary_n_recursion (int) – Maximum number of recursive summarization passes. 0 means no recursion. Each pass attempts to further condense the previous summary.
llm_price (dict) – Pricing information for token usage calculation with keys matching token types (‘prompt’, ‘completion’, ‘internal_reasoning’).
in_import_mode (bool) – If True, suppresses console output for integration scenarios.
out_file (Optional[str]) – Path to output file for saving the summary. If None, no file is written. Intermediate recursion summaries are saved with numbered extensions.
wdoc_version (str) – Version string of wdoc for metadata tracking.
- Returns:
Comprehensive summary results containing all metrics, costs, and summary text. Can be accessed as a dict for backward compatibility.
- Return type:
wdocSummary
- Raises:
AssertionError – If relevant_docs is empty or contains unexpected data.
Notes
Recursive summarization stops early if:
- The summary text becomes identical to the previous iteration
- Token length validation fails for the recursive summary chunks
- Maximum recursion depth is reached
The function prioritizes cost transparency through detailed token tracking and supports both interactive and programmatic usage modes through the in_import_mode parameter.
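The recursion control flow can be sketched without any LLM. In the example below, recursive_summary and the condense callable are hypothetical, storing the initial summary under level 0 is an assumption, and the token length validation is left out because its exact rule is not described here. Only convergence and the recursion limit are modelled.

```python
def recursive_summary(first_summary: str, condense, n_recursion: int) -> dict:
    """Apply condense repeatedly, stopping on convergence or at the limit."""
    summaries = {0: first_summary}
    for level in range(1, n_recursion + 1):
        candidate = condense(summaries[level - 1])
        if candidate == summaries[level - 1]:
            # Converged: another pass would not change the text.
            break
        summaries[level] = candidate
    return summaries


def drop_last_sentence(text: str) -> str:
    """Toy stand-in for an LLM pass: removes the last sentence, if any."""
    parts = text.split(". ")
    return ". ".join(parts[:-1]) if len(parts) > 1 else text


passes = recursive_summary("A. B. C", drop_last_sentence, n_recursion=5)
no_recursion = recursive_summary("A. B. C", drop_last_sentence, n_recursion=0)
```

Here the loop stops after two effective passes even though five were allowed, because the third pass returns the same text as the second.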