Full Documentation#

This is all the documentation for wdoc in a single page, for easy LLM parsing.

wdoc is a sophisticated RAG system made by Olicorne, a medical student

wdoc#

I’m wdoc. I solve RAG problems.

  • wdoc, imitating Winston “The Wolf” Wolf

wdoc is a powerful RAG (Retrieval-Augmented Generation) system designed to summarize, search, and query documents across various file types. It’s particularly useful for handling large volumes of diverse document types, making it ideal for researchers, students, and professionals dealing with extensive information sources.

Created by a psychiatry resident who needed a way to get a definitive answer from multiple sources at the same time (audio recordings, video lectures, Anki flashcards, PDFs, EPUBs, etc.), wdoc was born from frustration with existing RAG solutions for querying and summarizing. Note: wdoc was coded mostly by hand, without LLM assistance, as such tools didn’t exist at the time. Claude Code will probably be used to refactor the code as it evolves.

(The online documentation can be found here)

  • Goal and project specifications: wdoc’s goal is to create perfectly useful summaries and perfectly useful sourced answers to questions over heterogeneous corpora. It’s capable of querying tens of thousands of documents across various file types at the same time. The project also includes an opinionated summary feature to help users efficiently keep up with large amounts of information. It mostly uses LangChain and LiteLLM as backends.

  • Current status: usable, tested, still under active development, tens of planned features

    • I don’t plan to stop reading anytime soon, so if you find it promising, stick around as I have many improvements planned (see roadmap section).

    • I would greatly benefit from testing by users as it’s the quickest way for me to find the many minor quick-to-fix bugs.

    • The main branch is more stable than the dev branch, which in turn offers more features.

    • Open to feature requests and pull requests. All feedback, including reports of typos, is highly appreciated.

    • Please open an issue before making a PR, as there may be ongoing improvements in the pipeline.

  • Key Features:

    • Docker Web UI: Easy deployment with a Gradio-based web interface for simplified document processing without CLI interaction.

    • High recall and specificity: it was made to find A LOT of documents using a carefully designed embedding search, then gradually aggregate the answers in semantic batches to produce a single answer that cites its sources, pointing to the exact portion of each source document.

      • Uses both an expensive and a cheap LLM to make recall as high as possible, because we can afford to fetch a lot of documents per query (via embeddings).

    • Supports virtually any LLM provider, including local ones, and even with extra layers of security for super secret stuff.

    • Aims to support any filetype and query from all of them at the same time (15+ are already implemented!)

    • Actually useful AI powered summary: get the thought process of the author instead of nebulous takeaways.

    • Actually useful AI powered queries: get the sourced indented markdown answer to your questions instead of hallucinated nonsense.

    • Extensible: this is both a tool and a library. It was even turned into an Open-WebUI Tool. Also available as a Docker web UI for easy deployment.

    • Web Search: Preliminary web search support using DuckDuckGo (via the ddgs library)

Comprehensive reference#

A single-page comprehensive reference covering every CLI argument, environment variable, filetype, and the full Python API can be found in SKILL.md.

Explanatory diagrams#

Query task workflow diagram showing the flow from user inputs through Raphael the Rephraser, VectorStore, Eve the Evaluator, Anna the Answerer, and recursive combining to final output

Summary task workflow diagram showing the flow from user inputs through loading & chunking, Sam the Summarizer, concatenation to wdocSummary output

Search task workflow diagram showing the flow from user inputs through Raphael the Rephraser, VectorStore, Eve the Evaluator to search output

Ultra short guide for people in a hurry#

Give it to me I am in a hurry!

Note: a list of examples can be found in examples.md

TL;DR for installation: when in doubt, use uvx wdoc[full]. The plain wdoc only ships PDF + URL/web loaders; everything else (youtube, audio, anki, office formats, logseq) lives in optional extras. [full] bundles all of them so you never have to think about missing dependencies. See the Direct Installation section for the full list of extras.

Quick Start with Docker: If you want an experimental web UI, check out the Docker deployment guide.

First, let’s see how to query a pdf.

link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf"

uvx wdoc[full] --path=$link --task=query --filetype="online_pdf" --query="What does it say about alphago?" --query_retrievers='basic_multiquery' --top_k=auto_200_500
  • This will:

    1. parse what’s in --path as a link to a PDF to download (otherwise the URL could simply be a webpage; in most cases you can leave --filetype at ‘auto’, the default, as heuristics are in place to detect the most appropriate parser).

    2. cut the text into chunks and create embeddings for each

    3. Take the user query, create embeddings for it (‘basic’) AND ask the default LLM to generate alternative queries and embed those

    4. Use those embeddings to search through all chunks of the text and get the 200 most appropriate documents

    5. Pass each of those documents to the smaller LLM (default: openrouter/deepseek/deepseek-v4-flash) to tell us if the document seems appropriate given the user query

    6. If more than 90% of the 200 documents are appropriate, do another search with a higher top_k and repeat until documents start to be irrelevant OR we hit 500 documents.

    7. Then each relevant doc is sent to the strong LLM (by default, openrouter/deepseek/deepseek-v4-pro) to extract relevant info and give one answer per relevant document.

    8. Then all those “intermediate” answers are ‘semantically batched’ (meaning we create embeddings, do hierarchical clustering, then create small batches containing several intermediate answers of similar semantics, and sort the batches in semantic order too). Each batch is combined into a single answer per batch of relevant docs (or, on later passes, per batch of batches).

    9. Rinse and repeat steps 7+8 (i.e. gradually aggregate batches) until we have only one answer, that is returned to the user.
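To make steps 4 to 6 more concrete, here is a minimal Python sketch of the adaptive top_k loop. It is not wdoc’s actual code: search and is_relevant are hypothetical callables standing in for the embedding search and for the cheap evaluator LLM, and the step size of 100 is an assumption (only the starting value of 200, the cap of 500 and the 90% threshold come from the example above).

```python
from typing import Callable, List


def adaptive_retrieve(
    search: Callable[[int], List[str]],   # hypothetical: returns the k best chunks
    is_relevant: Callable[[str], bool],   # hypothetical: the cheap evaluator LLM
    start_k: int = 200,
    max_k: int = 500,
    threshold: float = 0.9,
    step: int = 100,                      # assumption: the real increment may differ
) -> List[str]:
    """Grow top_k while almost everything retrieved is still relevant."""
    k = start_k
    while True:
        docs = search(k)
        kept = [d for d in docs if is_relevant(d)]
        mostly_relevant = len(docs) > 0 and len(kept) / len(docs) > threshold
        # Stop at the cap, when irrelevant chunks start to appear, or when
        # the corpus has fewer than k chunks to offer.
        if k >= max_k or not mostly_relevant or len(docs) < k:
            return kept
        k = min(k + step, max_k)
```

The chunks that survive this loop are the ones that would then be handed to the strong LLM in step 7.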

Now, let’s see how to summarize a pdf.

link="https://situational-awareness.ai/wp-content/uploads/2024/06/situationalawareness.pdf"

uvx wdoc[full] --path=$link --task=summarize --filetype="online_pdf"
  • This will:

    1. Split the text into chunks

    2. pass each chunk into the strong LLM (by default openrouter/deepseek/deepseek-v4-pro) for a very low level (=with all details) summary. The format is markdown bullet points for each idea and with logical indentation.

    3. When creating each new chunk, the LLM has access to the previous chunk for context.

    4. All summaries are then concatenated and returned to the user.

  • For extra large documents, like books, this summary can be recursively fed back to wdoc using the argument --summary_n_recursion=2, for example.

  • Those two tasks, query and summarize, can be combined with --task summarize_then_query, which will summarize the document then give you a prompt at the end to ask questions in case you want to clarify things.

  • For more, you can read examples.md.

  • Note that there is an official Open-WebUI Tool that is even simpler to use.
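The summarization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: split_into_chunks is a naive fixed-size splitter, summarize_chunk is a hypothetical callable standing in for the strong LLM, and I assume the “previous chunk” given as context is the raw text of the preceding chunk. The n_recursion parameter mimics the idea behind --summary_n_recursion, not its exact behaviour.

```python
from typing import Callable, List


def split_into_chunks(text: str, size: int) -> List[str]:
    """Naive fixed-size splitter standing in for wdoc's real chunking."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def summarize(
    text: str,
    summarize_chunk: Callable[[str, str], str],  # hypothetical LLM call: (chunk, previous) -> summary
    chunk_size: int = 1000,
    n_recursion: int = 0,
) -> str:
    """Summarize chunk by chunk, then optionally re-summarize the result."""
    summaries: List[str] = []
    previous = ""  # the first chunk has no preceding context
    for chunk in split_into_chunks(text, chunk_size):
        summaries.append(summarize_chunk(chunk, previous))
        previous = chunk
    result = "\n".join(summaries)  # concatenate all chunk summaries
    if n_recursion > 0:
        # Feed the summary back in, as one would for a very long book.
        return summarize(result, summarize_chunk, chunk_size, n_recursion - 1)
    return result
```

Each recursion pass shortens the text further, at the cost of one more round of LLM calls.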

Features#

  • 15+ filetypes: also supports combinations to load recursively or define a complex heterogeneous corpus, like a list of files, a list of links, a regex, youtube playlists etc. See Filetypes and Recursive Filetypes. All filetypes can be seamlessly combined in the same index, meaning you can query your anki collection at the same time as your work PDFs. It supports removing silence from audio files and youtube videos too! There is even a ddg filetype to search the web using DuckDuckGo.

  • 100+ LLMs and many embeddings: Supports any LLM by OpenAI, Mistral, Claude, Ollama, Openrouter, etc. thanks to litellm. The list of supported embedding engines can be found here but includes at least OpenAI (or any OpenAI API compatible model), Cohere, Azure, Bedrock, NVIDIA NIM, Hugging Face, Mistral, Ollama, Gemini, Vertex, Voyage.

  • Local and Private LLM: When in private mode, measures are taken to make sure no data leaves your computer and goes to an LLM provider: no API keys are used, all api_base are user set, caches are isolated from the rest, outgoing connections are censored by overloading python sockets, etc.

  • Advanced RAG to query lots of diverse documents:

    1. The documents are retrieved using embeddings

    2. Then a weak LLM (“Eve the Evaluator”) is used to tell which of those documents are not relevant.

    3. Then the strong LLM (“Anna the Answerer”) is used to answer the question using each individual remaining document.

    4. Then all relevant answers are combined (“Carl the Combiner”) into a single short markdown-formatted answer. Before being combined, they are batched by semantic clusters and semantic order using scipy’s hierarchical clustering and leaf ordering. This makes it easier for the LLM to combine the answers in a manner that makes sense bottom-up. Eve the Evaluator, Anna the Answerer and Carl the Combiner are the names given to each LLM in their system prompt, so you can easily add specific additional instructions to a specific step. There’s also Sam the Summarizer for summaries and Raphael the Rephraser to expand your query.

    5. Each document is identified by a unique hash and the answers are sourced, meaning you know which document each piece of information in the answer comes from.

    • Supports a special syntax like “QE >>>> QA” where QE is a question used to filter the embeddings and QA is the actual question you want answered.

  • Web Search: Preliminary support for web search using DuckDuckGo. Just do uvx wdoc web "How is Nvidia doing this month?"

  • Advanced summary:

    • Instead of unusable “high level takeaway” points, compress the reasoning, arguments, thought process etc. of the author into an easy-to-skim markdown file.

    • The summaries are then checked again n times for correct logical indentation etc.

    • The summary can be in the same language as the documents or directly translated.

  • Many tasks: See Supported tasks.

  • Trust but verify: The answer is sourced: wdoc keeps track of the hash of each document used in the answer, allowing you to verify each assertion.

  • Markdown formatted answers and summaries: using rich.

  • Sane embeddings: By default use sophisticated embeddings like multi query retrievers but also include SVM, KNN, parent retriever etc. Customizable.

  • Fully documented: Lots of docstrings, lots of in-code comments, detailed --help etc. Take a look at examples.md for a list of shell and python examples. The full help can be found in the file help.md or via uvx wdoc --help. I work hard to maintain exhaustive documentation. The complete documentation in a single page is available on the website.

  • Scriptable / Extensible: You can use wdoc as an executable or as a library. Take a look at the scripts below. There is even an open-webui Tool.

  • Strictly Typed: Runtime type checking without performance penalty thanks to the incredible beartype! Opt out using an environment flag: WDOC_TYPECHECKING="disabled / warn / crash" wdoc (by default: warn).

  • LLM (and embeddings) caching: speed things up, as well as index storing and loading (handy for large collections).

  • Good PDF parsing: PDF parsers are notoriously unreliable, so 15 (!) different loaders are used, and the best according to a parsing scorer is kept. This includes table support via openparse (no GPU needed by default) or via UnstructuredPDFLoader.

  • Langfuse support: If you set the appropriate langfuse environment variables they will be used. See this guide or this one to learn more (Note: this is disabled if using private_mode to avoid any leaks).

  • Document filtering: based on regex for document content or metadata.

  • Binary embeddings support: Custom langchain VectorStore to use binary embeddings, leading (potentially, as it depends on the embeddings model) to ~32x better compression ratio, faster search and usually negligible accuracy loss.

  • Fast: Parallel document loading, parsing, embeddings, querying, etc.

  • Shell autocompletion using python-fire

  • Notification callback: Can be used for example to get summaries on your phone using ntfy.sh.

  • Hacker mindset: I’m a friendly dev! Just open an issue if you have a feature request or anything else.
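To illustrate where the ~32x figure in the binary embeddings bullet comes from: a float32 embedding spends 32 bits per dimension, while a binary one keeps a single bit. Below is a small pure-Python sketch of the general idea. It is not the code of wdoc’s custom VectorStore, and thresholding on the sign of each dimension is an assumption on my part (it is the most common way to binarize embeddings).

```python
from typing import Sequence


def binarize(vector: Sequence[float]) -> bytes:
    """Keep only the sign of each dimension, packed 8 dimensions per byte."""
    out = bytearray((len(vector) + 7) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            out[i // 8] |= 1 << (7 - i % 8)  # most significant bit first
    return bytes(out)


def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two packed embeddings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# A 1024-dimensional float32 vector takes 1024 * 4 = 4096 bytes,
# while its binary version takes 1024 / 8 = 128 bytes.
packed = binarize([0.1] * 1024)
print(1024 * 4 / len(packed))  # compression ratio
```

Search then boils down to comparing bit patterns with a Hamming distance, which is why it tends to be faster than comparing floats.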

Tasks#

  • query: give documents and ask questions about them.

  • search: only returns the documents and their metadata. For anki it can be used to directly open cards in the browser.

  • summarize: give documents and read a summary. The summary prompt can be found in utils/prompts.py.

  • summarize_then_query: summarize the document, then allow you to query it directly.

Filetypes#

  • anki: any subset of an anki collection db. alt and title of images can be shown to the LLM, meaning that if you used the ankiOCR addon this information will help contextualize the note for the LLM.

  • auto: default, guess the filetype for you

  • epub: barely tested because epub is in general a poorly defined format

  • json_dict: a text file containing a single json dict.

  • local_audio: supports many file formats, can use either OpenAI’s whisper or deepgram’s Nova-3 model. Supports automatically removing silence etc. Note: audio files that are too large for whisper (usually >25mb) are automatically split into smaller files, transcribed, then combined. Also, audio transcripts are converted to text containing timestamps at regular intervals, making it possible to ask the LLM when something was said.

  • local_html: useful for website dumps

  • local_video: extract the audio then treat it as local_audio

  • logseq_markdown: thanks to my other project: LogseqMarkdownParser you can use your Logseq graph

  • online_media: uses youtube_dl to try to download videos/audio. If that fails, tries to intercept good url candidates using playwright to load the page. The result is then processed as local_audio (but works with video too).

  • online_pdf: via URL then treated as a pdf (see above)

  • pdf: 15 default loaders are implemented, heuristics are used to keep the best one and stop early. Table support via openparse or UnstructuredPDFLoader. Easy to add more.

  • powerpoint: .ppt, .pptx, .odp, …

  • string: the cli prompts you for a text so you can easily paste something, handy for paywalled articles!

  • text: send a text content directly as path

  • txt: .txt, markdown, etc

  • url: try many ways to load a webpage, with heuristics to find the better parsed one

  • word: .doc, .docx, .odt, …

  • youtube: the text then comes either from the yt subtitles / translation or, even better, from whisper / deepgram. Note that youtube subtitles are downloaded with the timecode (so you can ask ‘when does the author talk about such and such’) but at a lower sampling frequency (instead of one timecode per second, only one per 15s). Youtube chapters are also given as context to the LLM when summarizing, which probably helps it a lot.
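The idea of keeping timecodes at a lower sampling frequency, mentioned for local_audio and youtube, can be sketched as follows. This is a hypothetical illustration and not wdoc’s implementation: the (start_seconds, text) cue format and the [HH:MM:SS] marker style are assumptions, and only the 15 second interval comes from the text above.

```python
from typing import Iterable, List, Tuple


def format_timecode(seconds: float) -> str:
    """Render a number of seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def transcript_with_timecodes(
    cues: Iterable[Tuple[float, str]],  # assumed format: (start in seconds, text)
    interval: float = 15.0,
) -> str:
    """Join cue texts, inserting a timecode marker at most once per interval."""
    parts: List[str] = []
    next_mark = 0.0
    for start, text in cues:
        if start >= next_mark:
            parts.append(f"[{format_timecode(start)}]")
            next_mark = start + interval
        parts.append(text)
    return " ".join(parts)
```

Sparse markers like these keep the transcript compact while still letting the LLM answer “when was this said?” to within a few seconds.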

Recursive Filetypes#

  • ddg: does an online web search using DuckDuckGo. This is not an agent search, we only use wdoc over the urls fetched by DuckDuckGo and return the result. Only supported by query tasks.

  • json_entries: takes a path to a file where each line is a json dict that contains arguments to use when loading. Example: load several other recursive types. An example can be found in docs/json_entries_example.json.

  • link_file: turns a text file where each line contains a url into appropriate loader arguments. Supports any link, so for example webpages, links to pdfs and youtube links can be in the same file. Handy for summarizing lots of things!

  • recursive_paths: turns a path, a regex pattern and a filetype into all the files found recursively, treated as the specified filetype (for example many PDFs or lots of HTML files etc).

  • toml_entries: read a .toml file. An example can be found in docs/toml_entries_example.toml.

  • youtube playlists: get the link for each video then process as youtube
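As a rough illustration of the json_entries idea (one json dict of loader arguments per line), here is a minimal reader. It is a sketch, not wdoc’s loader: the real file format is shown in docs/json_entries_example.json, and the "path" and "filetype" keys below are assumptions that simply mirror the CLI arguments of the same name.

```python
import json
from typing import Any, Dict, List


def read_json_entries(lines: List[str]) -> List[Dict[str, Any]]:
    """Parse one JSON dict of loader arguments per non-empty line."""
    entries: List[Dict[str, Any]] = []
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        entry = json.loads(line)
        if not isinstance(entry, dict):
            raise ValueError(f"line {number} is not a JSON object")
        entries.append(entry)
    return entries


# Hypothetical entries mixing two filetypes in the same corpus.
example = [
    '{"path": "notes.pdf", "filetype": "pdf"}',
    "",
    '{"path": "https://example.com", "filetype": "url"}',
]
for entry in read_json_entries(example):
    print(entry["filetype"], entry["path"])
```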

Walkthrough and examples#

Refer to examples.md.

Getting started#

wdoc was mainly developed and tested on python 3.13.5 but for compatibility it is installable with python version >=3.11. If possible, try to use python 3.13.

Direct Installation#

  1. To install:

    • The recommended invocation is simply uvx wdoc[full] (see uv).

      • You can specify the dev branch like so: uvx --from git+https://github.com/thiswillbeyourgithub/wdoc@dev[full] wdoc

      • If you cloned the repository and have modified the code: uvx --from PATH/TO/WDOC[full] --refresh wdoc

    • Picking only the loaders you need: wdoc ships in a modular fashion so you don’t have to pull in heavy ML dependencies you won’t use. Plain wdoc already includes the engine plus the PDF and URL/web search loaders (the most common cases). Optional extras:

      • wdoc[youtube] – youtube videos and playlists (yt-dlp, youtube-transcript-api)

      • wdoc[audio] – local audio/video transcription (deepgram, pydub, torchaudio, ffmpeg-python)

      • wdoc[anki] – anki collection loading (ankipandas)

      • wdoc[office] – word/powerpoint/epub and other office formats (unstructured[all-docs], docx2txt, pandoc)

      • wdoc[logseq] – logseq markdown graphs

      • wdoc[fasttext] – language detection (buggy on windows, hence optional)

      • wdoc[pdftotext] – an additional pdf parser that needs system libs (sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev)

      • You can combine extras freely: uvx wdoc[youtube,audio,anki].

      • wdoc[full] is a shortcut that includes all the loader extras above (excluding fasttext and pdftotext, which need special handling). If unsure, use wdoc[full] and don’t worry about filetypes.

      • If you have problems with pdftotext or fasttext, try uvx wdoc[full,pdftotext,fasttext].

    • If you plan on contributing, you will also need wdoc[dev] for the commit hooks.

    • Claude Code users: to give Claude Code knowledge of wdoc’s CLI and Python API, install the SKILL.md reference file:

      mkdir -p ~/.claude/skills/wdoc && wget -O ~/.claude/skills/wdoc/SKILL.md https://raw.githubusercontent.com/thiswillbeyourgithub/wdoc/main/SKILL.md
      
  2. Add the API key for the backend you want as an environment variable: for example export ANTHROPIC_API_KEY="***my_key***"

  3. Launch is as easy as using uvx wdoc --task=query --path=MYDOC [ARGS]

    • If for some reason this fails, maybe try with python -m wdoc. And if everything fails, try with uvx wdoc@latest, or as a last resort clone this repo and try again after cd-ing inside it. Don’t hesitate to open an issue.

    • To get shell autocompletion: if you’re using zsh: eval $(cat shell_completions/wdoc_completion.zsh). Also provided for bash and fish. You can generate your own with uvx wdoc -- --completion MYSHELL > my_completion_file.

    • Don’t forget that if you’re using a lot of documents (notably via recursive filetypes) it can take a lot of time (depending on parallel processing too, but you then might run into memory errors).

    • Take a look at the examples.md for a list of shell and python examples.

  4. To ask questions about a local document: uvx wdoc[office] query --path="PATH/TO/YOUR/FILE" --filetype="auto"

    • If you want to reduce the startup time by directly loading the embeddings from a previous run (although the embeddings are always cached anyway): add --saveas="some/path" to the previous command to save the generated embeddings to a file, and replace it with --loadfrom "some/path" on every subsequent call.

  5. To do an online search, the idea is uvx wdoc --task=query --path='How is Nvidia doing this month?' --query='How is Nvidia doing this month' --filetype=ddg. If either path or query is missing, it is replaced by the other one. This can also be used like so: uvx wdoc web 'How is Nvidia doing this month?'.

  6. For more: read the documentation at uvx wdoc --help
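The fallback described in step 5 (a missing path or query is replaced by the other one) boils down to very little logic. The function below is a hypothetical sketch of that rule, not wdoc’s code; resolve_web_args is a made-up name, and the returned keys mirror the CLI arguments used in the command above.

```python
from typing import Dict, Optional


def resolve_web_args(
    path: Optional[str] = None,
    query: Optional[str] = None,
) -> Dict[str, str]:
    """Fill in whichever of path/query is missing with the other one."""
    if not path and not query:
        raise ValueError("at least one of path or query is required")
    return {
        "task": "query",    # online search only supports the query task
        "filetype": "ddg",
        "path": path or query,
        "query": query or path,
    }


print(resolve_web_args(query="How is Nvidia doing this month?"))
```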

Experimental Docker Interface#

You can also use the experimental docker interface to use wdoc in the browser (including on a smartphone!).

See the Docker README for detailed instructions.

Scripts made with wdoc#

FAQ#

  • Who is this for?

    • wdoc is for power users who want document querying on steroids, and in-depth AI powered document summaries.

  • What’s RAG?

    • A RAG system (retrieval augmented generation) is basically an LLM powered search through a text corpus.

  • Why make another RAG system? Can’t you use any of the others?

    • I’m Olicorne, a psychiatry resident who needed a tool to ask medical questions from a lot (tens of thousands) of documents, of different types (epub, pdf, anki database, Logseq, website dump, youtube videos and playlists, recorded conferences, audio files, etc). Existing solutions couldn’t handle this diversity and scale of content.

  • Why is wdoc better than most RAG system to ask questions on documents?

    • It uses both a strong LLM and a cheaper query_eval LLM. After finding the appropriate documents using embeddings, the query_eval LLM is used to filter out the documents that don’t seem to be about the question, then the strong LLM answers the question based on each remaining document, then combines all the answers into neat markdown. Also wdoc is very customizable.

  • Can you use wdoc on wdoc’s documentation?

    • Yes of course! uvx wdoc --task=query --path https://wdoc.readthedocs.io/en/latest/all_docs.html

  • Why can wdoc also produce summaries?

    • I have little free time so I needed a tailor-made summary feature to keep up with the news. But most summary systems are rubbish and just try to give you the high level takeaway points, and don’t handle text chunking properly. So I made my own tailor-made summarizer. The summary prompts can be found in utils/prompts.py and focus on extracting the arguments/reasoning/thought process of the author, then use markdown indented bullet points to make it easy to read. It’s really good! The prompts dataclass is not frozen so you can provide your own prompt if you want.

  • Which tasks are supported by wdoc?

  • Which LLM providers are supported by wdoc?

    • wdoc supports virtually any LLM provider thanks to litellm. It even supports local LLMs and local embeddings (see examples.md). The list of supported embedding engines can be found here but includes at least OpenAI (or any OpenAI API compatible model), Cohere, Azure, Bedrock, NVIDIA NIM, Hugging Face, Mistral, Ollama, Gemini, Vertex, Voyage.

  • What do you use wdoc for?

    • I follow heterogeneous sources to keep up with the news: youtube, website, etc. So thanks to wdoc I can automatically create awesome markdown summaries that end up straight into my Logseq database as a bunch of TODO blocks.

    • I use it to ask technical questions to my vast heterogeneous corpus of medical knowledge.

    • I use it to query my personal documents using the --private argument.

    • I sometimes use it to summarize a document then go straight to asking questions about it, all in the same command.

    • I use it to ask questions about entire youtube playlists.

    • Other use cases are the reason I made the Scripts made with wdoc section.

  • What’s up with the name?

    • One of my favorite characters (and somewhat of a role model) is Winston Wolf, and after much hesitation I decided WolfDoc would be too confusing and WinstonDoc sounds like something micro$oft would do. Also wd and wdoc were free, whereas doctools was already taken. The initial name of the project was DocToolsLLM, a play on words between ‘doctor’ and ‘tool’.

  • How can I improve the prompt for a specific task without coding?

    • Each prompt of the query task roleplays as an employee working for WDOC-CORP©, either as Eve the Evaluator (the LLM that filters out irrelevant documents), Anna the Answerer (the LLM that answers the question from a filtered document) or Carl the Combiner (the LLM that combines the answers from the Answerer into one). There’s also Sam the Summarizer for summaries and Raphael the Rephraser to expand your query. They all take orders from you if you talk to them in a prompt.

  • How can I use wdoc’s parser for my own documents?

    • If you are in the shell cli you can easily use uvx wdoc parse my_file.pdf. Add --format=langchain_dict to get the text and metadata as a list of dicts, otherwise you will only get the text. Other formats exist, including --format=xml to make it LLM friendly like files-to-prompt.

    • If you want the document using python:

      from wdoc import wdoc
      list_of_docs = wdoc.parse_doc(path=my_path)
      
    • Another example would be to use wdoc to parse an anki deck: uvx wdoc[anki] parse --filetype "anki" --anki_profile "Main" --anki_deck "mydeck::subdeck1" --anki_notetype "my_notetype" --anki_template "<header>\n{header}\n</header>\n<body>\n{body}\n</body>\n<personal_notes>\n{more}\n</personal_notes>\n<tags>{tags}</tags>\n{image_ocr_alt}" --anki_tag_filter "a::tag::regex::.*something.*" --format=text

  • What should I do if my PDFs are encrypted?

    • If you’re on linux you can try running qpdf --decrypt input.pdf output.pdf

  • How can I add my own pdf parser?

    • Write a python class and add it there: wdoc.utils.loaders.pdf_loaders['parser_name']=parser_object then call wdoc with --pdf_parsers=parser_name.

      • The class has to take a path argument in __init__ and have a load method that takes no argument and returns a List[Document]. Take a look at the OpenparseDocumentParser class for an example.

  • Can wdoc add source citations to summaries?

    • Yes! When summarizing documents that have page metadata (like PDFs), wdoc automatically adds [p.N] citations to bullet points tracking which page the information came from. For multi-file summaries, citations include the filename: [p.N, file.pdf]. You can also use --citation_url_template to turn these into clickable markdown links pointing to your own document server (e.g. --citation_url_template="https://my-site.com/docs/{source}#page={page}"). This feature was developed with Claude Code.

    • For the query task, source documents are referenced with clickable anchor links [N](#document-N) in the final answer.

  • What should I do if I keep hitting rate limits?

    • The simplest way is to add the debug argument. It will disable multithreading, multiprocessing and LLM concurrency. A less harsh alternative is to set the environment variable WDOC_LLM_MAX_CONCURRENCY to a lower value.

  • How can I run the tests?

    • Take a look at the file ./tests/run_all_tests.sh.

  • How can I query a text but without chunking? / How can I query a text with the full text as context?

    • If you set the environment variable WDOC_MAX_CHUNK_SIZE to a very high value and use a model with enough context according to litellm’s metadata, then no chunking will happen and the LLM will have the full text as context.

  • Is there a way to use wdoc with Open-WebUI?

  • Is there a web UI for wdoc?

  • Can I use shell pipes with wdoc?

    • Yes! Data sent using shell pipes (be it for strings or binary data) will be automatically saved to a temporary file which is then passed as the --path=[temp_file] argument. For example cat **/*.txt | uvx wdoc --task=query, echo $my_url | uvx wdoc parse or even cat my_file.pdf | uvx wdoc parse --filetype=pdf. For binary input it is strongly recommended to use a --filetype argument because python-magic version <=0.4.27 chokes otherwise (see that issue).

  • Can the environment variables be set at runtime?

    • Sort of. When wdoc is imported, code in wdoc/utils/env.py creates a dataclass that holds the environment variables used by wdoc. This is done primarily to ensure runtime type checking and to ensure that when an env variable is accessed inside wdoc’s code (through the dataclass) it is always compared to the one in the environment. If you change env variables while the code is running, the new value will be used inside wdoc. But that’s somewhat brittle because some env variables are used to store the default value of some function or class, and hence are only read at import time, so they will be out of sync. Additionally, wdoc will intentionally crash if it suspects the WDOC_PRIVATE_MODE env var is out of sync, just to be safe. Also note that if env vars like WDOC_LANGFUSE_PUBLIC_KEY are found, wdoc will overwrite LANGFUSE_PUBLIC_KEY with it. This is because litellm (maybe others) looks for this env variable to enable langfuse callbacks. This whole contraption makes it possible to set env variables for a specific user when using the open-webui wdoc tool. Feedback is very welcome for this feature.

  • How can I build the autodoc using sphinx?

    • The command I’ve been using is sphinx-apidoc -o docs/source/ wdoc --force, to call from the root of this repository.

  • Why can’t I load the vectorstores in other langchain projects?

    • In wdoc/utils/customs/binary_faiss_vectorstore.py, we create BinaryFAISS and CompressedFAISS. The latter is just like FAISS but with zlib compression to the pickled index and the former adds on top binary embeddings, resulting in faster and more compact embeddings. If you want to disable compression altogether, use the env variable WDOC_MOD_FAISS_COMPRESSION=false.

  • Which python version is used in the test suite?

    • The recommended python version is 3.12.11.

  • Why does the online search only support the ‘query’ task?

    • The way wdoc works for summaries is to take the “whole document”, chunk it into sequential “documents” and iteratively create the summary. But if we start with several documents (say different web pages) then the “sequence” wouldn’t make sense.
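Coming back to the question about adding your own pdf parser: the interface described there (a path argument in __init__ and a load method returning a List[Document]) can be sketched like this. It is a toy example under stated assumptions: the Document dataclass is a stand-in for langchain’s Document so that the snippet runs on its own, PlainTextPdfParser is a made-up name, and it naively treats the file as text split on form feeds rather than really parsing PDF.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List


@dataclass
class Document:
    """Stand-in for langchain's Document, so the sketch runs without it."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class PlainTextPdfParser:
    """Toy parser: reads the file as text, one Document per form feed."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def load(self) -> List[Document]:
        text = self.path.read_text(encoding="utf-8", errors="ignore")
        pages = [p for p in text.split("\f") if p.strip()]
        return [
            Document(page_content=page, metadata={"source": str(self.path), "page": n})
            for n, page in enumerate(pages, start=1)
        ]


# Registration as described in the FAQ (needs wdoc installed, untested here):
# wdoc.utils.loaders.pdf_loaders["plain_text"] = PlainTextPdfParser
# then call wdoc with --pdf_parsers=plain_text
```

For a real parser, the OpenparseDocumentParser class mentioned in that answer is the reference to follow.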

Roadmap#

This TODO list is maintained automatically by MdXLogseqTODOSync

  • Most urgent

    • figure out a good way to skip merging batches that are too large before trying to merge them

      • probably means adding an env var to store a max value, document it in the help.md

      • then check after batch creation if a batch is that large

      • if it is put it in a separate var, to be concatenated later with the rest of the answers

    • add more tests

      • add test for the private mode

      • add test for the testing models

      • add test for the recursive loader functions

      • add test for each loader

    • rewrite the python API to make it more useable. (also related to https://github.com/thiswillbeyourgithub/wdoc/issues/13)

      • pay attention to how to modify the init and main.py files

      • pay attention to how the --help flag works

      • pay attention to how the USAGE document is structured

    • support other vector databases

    • learn how to set a github action for test code coverage

    • allow anki to use anki type search queries

    • refactor the tasks to use langgraph, as it seems easier to do complex recursive tasks with it

    • use async for the langchain chains

  • Features

    • use clusters of semantic ordering instead of just the order you dumbass

    • ability to cap the retrieved documents by a number of tokens instead of a number of documents

    • Add prompt caching for claude

    • add a “fast summary” feature that does not use recursive summary if you care more about speed than overlapping summaries

    • count how many times each source is used, as it can be relevant to infer answer quality

    • add an html format output. It would display a nice UI with proper dropdowns for sources etc

    • if a model supports structured output we should make use of it to get the thinking and answer part. Opt in because some models hide their thoughts.

    • add an intermediate step for queries that asks the LLM for appropriate headers for the md output. Then for each intermediate answer attribute it a list of 1 to 3 headers (because a given intermediate answer can contain several pieces of information), then do the batch merge of intermediate answer per header.

      • this needs to be scalable and easy to add recursion to (because then we can do this for subheaders and so on)

      • the end goal is to have a scalable solution to answer queries about extremely large documents for impossibly vast questions

    • use apprise instead of ntfy for the scripts

    • add crawl4ai parser: https://github.com/unclecode/crawl4ai

    • Way to add the title (or all metadata) of a document to its own text. Enabled by default. Because this would allow searching among many documents that don’t refer to the original title (for example: material safety datasheets)

      • default value is “author” “page” “title”

      • pay attention to avoid including personal info (for example use relative paths instead of absolute paths)

    • add a /save PATH command to save the chat and metadata to a json file

    • add image support printing via icat or via the other lib you found last time, would be useful for summaries etc

    • add wdoc to tldr pages

    • add an audio backend to use the subtitles from a video file directly

    • store the anki images as ‘imagekeys’ as the idea works for other parsers too

    • investigate asking the LLM to add leading emojis to the bullet point for improved reading

    • add a key/val arg to specify the trust we have in a doc, call it context

    • add a way to open the documents automatically, based on platform dirs etc. For ex if okular is installed, open pdfs directly at the right page

      • the best way would be to create opener.py that does a bit like loader but for all filetypes and platforms

      • use a cli selector like in mnemonics creator

        • add shortcut to sort by score or by name

        • display metadata and score in a previewer

    • add an argument --whole_text to avoid chunking (this would just increase the chunk size to a super large number I guess)

    • add apprise callback support

    • add a filetype “custom_parser” and an argument “--custom_parser” containing a path to a python file. Must receive a docdict and a few other things and return a list of documents

    • add bespoke-minicheck from ollama to fact check when using RAG: https://ollama.com/library/bespoke-minicheck

      • or via their API directly : https://docs.bespokelabs.ai/bespoke-minicheck/api but they don’t seem to properly disclose what they do with the data

    • add a langchain code loader that uses aider to get the repomap

      • https://github.com/paul-gauthier/aider/issues/1043#issuecomment-2278486840

      • https://aider.chat/docs/scripting.html

    • add a pikepdf loader because it can be used to automatically decrypt pdfs

    • add a query_branching_nb argument that asks an LLM to identify a list of keywords from the intermediate answers, then look again for documents using this keyword and filtering via the weak llm

    • write a script that shows how to use bertopic on the documents of wdoc

    • add a retriever where the LLM answers without any context

    • add support for readabilipy for parsing html

      • https://github.com/alan-turing-institute/ReadabiliPy

    • add an obsidian loader

      • https://pypi.org/project/obsidiantools/

    • add a /chat command to the prompt, it would enable starting an interactive session directly with the llm

    • find a way to make it work with llm from simonw

    • make images an actual filetype

  • Enhancements

    • store the available tasks in a dataclass in misc.py

    • maybe add support for docling to parse documents?

    • when querying hard stuff the number of dropped documents after batching is non negligible, we should remove those from the list of documents to display and instead store those in another variable

    • check if using html syntax is less costly and confusing to LLMs than markdown with all those indentations. Or maybe json. It would be simple to turn that into markdown afterwards.

    • check that the task search works on things other than anki

    • create a custom retriever, derived from multiquery retriever that does actual parallel requests. Right now it’s not the case (maybe in async but I don’t plan on using async for now). This retriever seems a good part of the slow down.

    • stop using your own youtube timecode parser and instead use langchain’s chunk transcript format

    • implement usearch instead of faiss, it seems in all points faster, supports quantized embeddings, i trust their langchain implementation more

      • https://python.langchain.com/api_reference/community/vectorstores/langchain_community.vectorstores.usearch.USearch.html#langchain_community.vectorstores.usearch.USearch

    • Use an env var to drop_params of litellm

    • add more specific exceptions for file loading error. One exception for all, one for batch and one for individual loader

    • use heuristics to find the best number of clusters when doing semantic reranking

    • arg to use jina v3 embeddings for semantic batching because it allows specifying tasks that seem appropriate for that

    • add an env variable or arg to overload the backend url for whisper. Then set it always for you and mention it there: https://github.com/fedirz/faster-whisper-server/issues/5

    • find a way to set a max cost at which to crash if it exceeds a maximum cost during a query, probably via the price callback

    • anki_profile should be able to be a path

    • store wdoc’s version and indexing timestamp in the metadata of the document

    • arg --oneoff that does not trigger the chat after replying. Allowing to not hog all the RAM if run in multiple terminals for example through SSH

    • add a (high) token threshold above which two texts are not combined but just concatenated in the semantic order. It would avoid it losing context. Use a --- separator

    • compute the cost of whisper and deepgram

    • use a pydantic basemodel for output instead of a dict

      • same for summaries, it should at least contain the method to substitute the sources and then back

    • investigate storing the vectors in a sqlite3 file

    • make a plugin to llm that looks like file-to-prompt from simonw

    • Always bind a user metadata to litellm for langfuse etc

      • Add more metadata to each request to make langfuse more informative

    • add a reranker to better sort the output of the retrievers. Right now with the multiquery it returns way too many and I’m thinking it might be a bad idea to just crop at top_k as I’m doing currently

    • add a status argument that just outputs the logs location and size, the cache location and size, the number of documents etc

    • add the python magic of the file as a file metadata

    • add an env var to specify the threshold for relevant document by the query eval llm

    • find a way to return the evaluations for each document also

    • move retrievers.py in an embeddings folder

    • stop using lambda functions in the chains because it makes the code barely readable

    • when doing recursive summary: tell the model that if it’s really sure that there are no modifications to do: it should just reply “EXIT” and it would save time and money instead of waiting for it to copy back the exact content

    • add image parsing as base64 metadata from pdf

    • use multiple small chains instead of one large and complicated and hard to maintain

    • add an arg to bypass query combine, useful for small models

    • tell the llm to write a special message if the parsing failed or we got a 404 or paywall etc

      • catch this text and crash

    • add check that all metadata is only made of int float and str

    • move the code that filters embeddings inside the embeddings.py file

      • this way we can dynamically refilter using the chat prompt

    • task summary then query should keep in context both the full text and the summary

    • if there’s only one intermediate answer, pass it as answer without trying to recombine

    • filter_metadata should support an OR syntax

    • add a --show_models argument to display the list of available models

    • add a way to open the documents automatically, based on platform dirs etc. For ex if okular is installed, open pdfs directly at the right page

      • the best way would be to create opener.py that does a bit like loader but for all filetypes and platforms

    • add an image filetype: it will be either OCR’d using format and/or will be captioned using a multimodal llm, for example gpt4o mini

      • nanollava is a 0.5b that probably can be used for that with proper prompting

    • add a key/val arg to specify the trust we have in a doc, call this metadata context in the prompt

    • add an arg to return just the dict of all documents and embeddings. Notably useful to debug documents

    • use a class for the cli prompt, instead of a dumb function

    • arg to disable eval llm filtering

      • just answer 1 directly if no eval llm is set

    • display the number of documents and tokens in the bottom toolbar

    • add a demo gif

    • investigate asking the LLM to add leading emojis to the bullet point for quicker reading of summaries

    • see how easy or hard it is to use an async chain

    • ability to cap the search documents by a number of tokens instead of a number of documents

    • for anki, allow using a query instead of loading with ankipandas

    • add a “try_all” filetype that will try each filetype and keep the first that works

    • add textract extractor : https://textract.readthedocs.io/en/stable/

    • write a langchain compatible tool for agents

    • add bespoke-minicheck from ollama to fact check when using RAG: https://ollama.com/library/bespoke-minicheck

      • or via their API directly : https://docs.bespokelabs.ai/bespoke-minicheck/api but they don’t seem to properly disclose what they do with the data

Walkthrough & Examples#

Table of Contents#

  1. Walkthrough

  2. Shell Examples

  3. Python Script Examples

Heads up on installation: if you don’t want to think about which extras to install, use uvx wdoc[full] everywhere. The plain wdoc package only includes PDF and URL/web loaders, so commands that touch youtube, audio, anki, office formats (word/powerpoint/epub) or logseq need their extras. [full] bundles all of those at once. The examples below sometimes use plain uvx wdoc when the base install is enough (pdf, url, ddg), but you can always replace it with uvx wdoc[full] to be safe. See the installation section for the full list of extras.

Note that there is an official open-webui Tool that is even simpler to use.

Walkthrough#

  1. Say you want to ask a question about one pdf, that’s simple:

uvx wdoc --task="query" --path="my_file.pdf" --filetype="pdf" --model='openai/gpt-4o'

Note that you could have just let --filetype="auto" and it would have worked the same.

  • Note: By default wdoc tries to parse args as kwargs so uvx wdoc query mydocument What's the age of the captain? is parsed as uvx wdoc --task=query --path=mydocument --query "What's the age of the captain?". Likewise for summaries. This does not always work so use it only after getting comfortable with wdoc.

  1. You have several PDFs? Say you want to ask a question about any PDF contained in a folder, that’s not much more complicated:

uvx wdoc --task="query" --path="my/other_dir" --pattern="**/*pdf" --filetype="recursive_paths" --recursed_filetype="pdf" --query="My question about those documents"

So basically you give as path the path to the dir, as pattern the globbing pattern used to find the files relative to the path, set as filetype “recursive_paths” so that wdoc knows what arguments to expect, and specify as recursed_filetype “pdf” so that wdoc knows that each found file must be treated as a pdf. You can use the same idea to glob any kind of file supported by wdoc like markdown etc. You can even use “auto”! Note that you can either directly ask your question with --query="my question", or wait for an interactive prompt to pop up, or just pass the question as *args like so uvx wdoc [your kwargs] here is my question.
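As an illustration of how --path and --pattern combine, here is a small sketch using Python's standard globbing. It assumes the pattern is applied relative to the path with ordinary glob semantics. `find_files` is a hypothetical helper, not a wdoc function.

```python
import tempfile
from pathlib import Path
from typing import List


def find_files(path: str, pattern: str) -> List[Path]:
    """Return the files under `path` matching the glob `pattern`, sorted."""
    return sorted(p for p in Path(path).glob(pattern) if p.is_file())


# Build a throwaway directory tree to try the pattern on.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.pdf", "notes.md", "sub/b.pdf"):
        (root / name).touch()
    found = [p.relative_to(root).as_posix() for p in find_files(tmp, "**/*pdf")]

print(found)
```

With this pattern both the top-level PDF and the one in the subfolder are matched, while the markdown file is left out.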

  1. You want more? You can write a .json file where each line (#comments and empty lines are ignored) will be parsed as a list of arguments. For example one line could be:

{"path": "my/other_dir", "pattern": "**/*pdf", "filetype": "recursive_paths", "recursed_filetype": "pdf"}

This way you can use a single json file to specify easily any number of sources. .toml files are also supported.
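To show what such a file looks like once parsed, here is a minimal sketch of a reader for this one-JSON-object-per-line format. `parse_entries` is a hypothetical helper written for illustration, not the parser wdoc uses.

```python
import json
from typing import Any, Dict, List


def parse_entries(text: str) -> List[Dict[str, Any]]:
    """Parse one JSON object per line, skipping blank lines and # comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        entries.append(json.loads(line))
    return entries


example = """
# all my PDFs
{"path": "my/other_dir", "pattern": "**/*pdf", "filetype": "recursive_paths", "recursed_filetype": "pdf"}

{"path": "my_file.pdf", "filetype": "pdf"}
"""
entries = parse_entries(example)
for entry in entries:
    print(entry["filetype"], entry["path"])
```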

  1. You can specify a “source_tag” metadata to help distinguish between documents you imported. It is EXTREMELY recommended to include a source_tag to any document you want to save, especially if using recursive filetypes. This is because after loading all documents wdoc uses the source_tag to decide whether it should continue or crash. If you want to load 10_000 pdf in one go as I do, then it makes sense to continue if some of them failed to load, but not if a whole source_tag is missing.

  2. Now say you do this with many many documents, as I do, you of course can’t wait for the indexing to finish every time you have a question (even though the embeddings are cached). You should then add

--save_embeds_as=your/saving/path

to save all this index in a file. Then simply do

--load_embeds_from=your/saving/path

to quickly ask queries about it!
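Coming back to the source_tag advice from the earlier step, the rule "tolerate a few failed files, but stop if a whole source_tag is missing" can be sketched as follows. This is only an illustration under assumptions: `check_source_tags` and the dict layout of the documents are hypothetical, not wdoc's internals.

```python
from typing import Dict, Iterable, List, Set


def check_source_tags(
    expected_tags: Iterable[str], loaded_docs: List[Dict[str, str]]
) -> Set[str]:
    """Raise if an expected source_tag has no loaded document; return the tags seen."""
    seen = {doc["source_tag"] for doc in loaded_docs}
    missing = set(expected_tags) - seen
    if missing:
        raise RuntimeError(f"No document loaded for source_tag(s): {sorted(missing)}")
    return seen


docs = [
    {"source_tag": "lectures", "path": "a.pdf"},
    {"source_tag": "papers", "path": "b.pdf"},
]
# Passes: every expected tag has at least one loaded document,
# even if some individual files of that tag failed to load.
seen_tags = check_source_tags(["lectures", "papers"], docs)
print(sorted(seen_tags))
```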

  1. To know more about each argument supported by each filetype,

uvx wdoc --help
  1. There is a specific recursive filetype I should mention: --filetype="link_file". Basically the file designated by --path should contain in each line (#comments and empty lines are ignored) one url, that will be parsed by wdoc. I made this so that I can quickly use the “share” button on android from my browser to a text file (so it just appends the url to the file), this file is synced via syncthing to my browser and wdoc automatically summarizes them and adds them to my Logseq. Note that the url is parsed in each line, so formatting is ignored, for example it works even in a markdown bullet point list.

  2. If you want to only use local models, here’s an example with ollama:

uvx wdoc --model="ollama/qwen3:8b" --query_eval_model="ollama/qwen3:8b" --embed_model="ollama/snowflake-arctic-embed2" --task summarize --path https://situational-awareness.ai/

You can always add --private to add additional safety nets that no data will leave your local network. You can also override specific API endpoints using

--llms_api_bases='{"model": "http://localhost:11434", "query_eval_model": "http://localhost:11434", "embeddings": "http://localhost:11434"}'
  1. Now say you just want to summarize Tim Urban’s TED talk on procrastination:

uvx wdoc[youtube] --task=summary --path='https://www.youtube.com/watch?v=arj7oStGLkU' --youtube_language="en" --disable_md_printing
Output:

Summary

https://www.youtube.com/watch?v=arj7oStGLkU

  • Let me take a deep breath and summarize this TED talk about procrastination:

  • [0:00-3:40] Personal experience with procrastination in college:

    • Author’s pattern with papers: planning to work steadily but actually doing everything last minute

    • 90-page senior thesis experience:

      • Planned to work steadily over a year

      • Actually wrote 90 pages in 72 hours with two all-nighters

      • Jokingly implies it was brilliant, then admits it was ‘very, very bad’

  • [3:40-6:45] Brain comparison between procrastinators and non-procrastinators:

    • Both have a Rational Decision-Maker

    • Procrastinator’s brain also has an Instant Gratification Monkey:

      • Lives entirely in present moment

      • Only cares about ‘easy and fun’

      • Works fine for animals but problematic for humans in advanced civilization

    • Rational Decision-Maker capabilities:

      • Can visualize future

      • See big picture

      • Make long-term plans

  • [6:45-10:55] The procrastinator’s system:

    • Dark Playground:

      • Where leisure activities happen at wrong times

      • Characterized by guilt, dread, anxiety, self-hatred

    • Panic Monster:

      • Only thing monkey fears

      • Awakens near deadlines or threats of public embarrassment

      • Enables last-minute productivity

    • Personal example with TED talk preparation:

      • Procrastinated for months

      • Only started working when panic set in

  • [10:55-13:05] Two types of procrastination:

    • Deadline-based procrastination:

      • Effects contained due to Panic Monster intervention

      • Less harmful long-term

    • Non-deadline procrastination:

      • More dangerous

      • Affects important life areas without deadlines:

        • Entrepreneurial pursuits

        • Family relationships

        • Health

        • Personal relationships

      • Can cause long-term unhappiness and regrets

  • [13:05-14:04] Concluding thoughts:

    • Author believes no true non-procrastinators exist

    • Presents Life Calendar:

      • Shows 90 years in weekly boxes

      • Emphasizes limited time available

    • Call to action: need to address procrastination ‘sometime soon’

  • Key audience response moments:

    • Multiple instances of ‘(Laughter)’ noted throughout

    • Particularly strong response from PhD students relating to procrastination issues

    • Received thousands of emails after blog post about procrastination

Tokens used for https://www.youtube.com/watch?v=arj7oStGLkU: ‘4936’ (in: 4307, out: 629, cost: $0.00063)
Total cost of those summaries: 4936 tokens for $0.00063 (estimate was $0.00030)
Total time saved by those summaries: 8.8 minutes
Done summarizing.

Shell Examples#

  1. Query a simple PDF file

uvx wdoc --task=query --path="my_file.pdf" --filetype="pdf" --model='openai/gpt-4o'
  1. Recursively query multiple PDFs in a directory

uvx wdoc --task=query \
     --path="my/other_dir" \
     --pattern="**/*pdf" \
     --filetype="recursive_paths" \
     --recursed_filetype="pdf" \
     --query="My question about those documents"
  1. Summarize a YouTube video in french based on the english transcript

uvx wdoc[full] --task=summary \
     --path='https://www.youtube.com/watch?v=arj7oStGLkU' \
     --youtube_language="en" \
     --summary_language="fr" \
     --disable_md_printing
  1. Summarize a YouTube video based on the whisper transcript

uvx wdoc[youtube,audio] --task=summary \
     --path='https://www.youtube.com/watch?v=arj7oStGLkU' \
     --youtube_audio_backend="whisper" \
     --whisper_lang="en"
  1. Use local models with Ollama

uvx wdoc --model="ollama/qwen3:8b" \
     --query_eval_model="ollama/qwen3:8b" \
     --embed_model="ollama/snowflake-arctic-embed2" \
     --task summarize --path https://situational-awareness.ai/

Note: you might find that ollama models are sometimes overly optimistic about their context length. You can pass arguments to lower it like so:

uvx wdoc --model="ollama/qwen3:8b" \
     --model_kwargs='{"max_tokens": 4096}' \
     --query_eval_model="ollama/qwen3:8b" \
     --query_eval_model_kwargs='{"max_tokens": 4096}' \
     --embed_model="ollama/snowflake-arctic-embed2" \
     --task summarize --path https://situational-awareness.ai/
  1. Parse an Anki deck as text

uvx wdoc[anki] parse \
    --filetype "anki" \
    --anki_profile "Main" \
    --anki_deck "mydeck::subdeck1" \
    --anki_notetype "my_notetype" \
    --anki_template "<header>\n{header}\n</header>\n<body>\n{body}\n</body>\n<personal_notes>\n{more}\n</personal_notes>\n<tags>{tags}</tags>\n{image_ocr_alt}" \
    --anki_tag_filter "a::tag::regex::.*something.*" \
    --format=langchain_dict
  1. Query an online PDF

uvx wdoc --path="https://example.com/document.pdf" \
     --task=query \
     --filetype="online_pdf" \
     --query="What does it say about X?"
  1. Save and load embeddings for faster subsequent queries

# First run - save embeddings
uvx wdoc --task=query \
     --path="my_document.pdf" \
     --save_embeds_as="saved_embeddings.pkl"

# Subsequent runs - load embeddings
uvx wdoc --task=query \
     --load_embeds_from="saved_embeddings.pkl" \
     --query="My new question"
  1. You can even use shell pipes:

Data sent using shell pipes (be it for strings or binary data) will be automatically saved to a temporary file which is then passed as --path=[temp_file] argument. For example cat **/*.txt | uvx wdoc --task=query, echo $my_url | uvx wdoc parse or even cat my_file.pdf | uvx wdoc parse --filetype=pdf. For binary input it is strongly recommended to use a --filetype argument because python-magic version <=0.4.27 chokes otherwise (see that issue).
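The pipe handling described above boils down to "dump the piped bytes into a temporary file and use its path". Here is a minimal sketch of that idea. `piped_input_to_tempfile` is a hypothetical helper, and the in-memory stream only stands in for real piped data.

```python
import io
import tempfile
from typing import BinaryIO


def piped_input_to_tempfile(stream: BinaryIO) -> str:
    """Copy everything readable from `stream` to a temp file and return its path."""
    data = stream.read()
    # delete=False so the file survives and can be handed over as a path
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        return tmp.name


# In a real CLI the stream would be sys.stdin.buffer; a BytesIO stands in here.
temp_path = piped_input_to_tempfile(io.BytesIO(b"%PDF-1.7 fake bytes"))
print(temp_path)
```

Reading the stream as bytes keeps binary input such as a PDF intact, which is also why an explicit --filetype helps for that kind of data.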

  1. You can also search the web for results using DuckDuckGo:

It’s implemented as if ddg were a recursive filetype. Hence, the idea is to use uvx wdoc --task=query --path='How is Nvidia doing this month?' --query='How is Nvidia doing this month?' --filetype=ddg (remember: path specifies the document and query the question to ask about the documents). To make it more natural, if either path or query is missing, it is replaced by the value of the other one. It can be shortened to: uvx wdoc web 'How is Nvidia doing this month?'. With --ddg_max_results=5 you can specify the maximum number of results to get, use --ddg_region=us-US to get US-only results, and --ddg_safesearch=on to filter out NSFW results.
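The "fill in whichever of path or query is missing" rule is simple enough to sketch directly. `fill_path_and_query` is a hypothetical name used for illustration only.

```python
from typing import Optional, Tuple


def fill_path_and_query(path: Optional[str], query: Optional[str]) -> Tuple[str, str]:
    """If only one of path/query is given, reuse it for the other one."""
    if not path and not query:
        raise ValueError("at least one of path or query is required")
    return (path or query, query or path)


print(fill_path_and_query("How is Nvidia doing this month?", None))
```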

Python Script Examples#

  1. Basic document summarization

from wdoc import wdoc

# Initialize wdoc for summarization
instance = wdoc(
    task="summary",
    path="document.pdf",
    summary_language="en",  # Optional: specify output language
)

# Get summary results
results = instance.summary_results
print(f"Summary:\n{results['summary']}")
print(f"Processing cost: ${results['doc_total_cost']:.5f}")
print(f"Original reading time: {results['doc_reading_length']:.1f} minutes")
  1. Summarize with custom model settings

from wdoc import wdoc

# Use specific models for better control
instance = wdoc(
    task="summary",
    path="https://example.com/paper.pdf",
    filetype="online_pdf",
    model="openai/gpt-4o",  # Use GPT-4o for summarization
    embed_model="openai/text-embedding-3-large",  # Specify embedding model
)

results = instance.summary_results
summary_text = results['summary']
  1. Summarize a PDF with clickable page citations (developed with Claude Code)

from wdoc import wdoc

# Summarize with page citations linking to a private document server
instance = wdoc(
    task="summary",
    path="court_transcript.pdf",
    filetype="pdf",
    model="openai/gpt-4o",
    citation_url_template="https://private-site.com/cases/{source}#page={page}",
)

results = instance.summary_results
# Bullet points will contain clickable citations like:
# - **Key finding** about the case [p.42](https://private-site.com/cases/court_transcript.pdf#page=42)
# Same example from the shell
uvx wdoc --task=summary \
     --path="court_transcript.pdf" \
     --filetype="pdf" \
     --citation_url_template="https://private-site.com/cases/{source}#page={page}"
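To see how the {source} and {page} placeholders turn into the clickable citation shown in the comment of the Python example, here is a minimal sketch based on plain string formatting. `render_citation` is a hypothetical helper. Whether wdoc fills the template exactly this way is an assumption.

```python
def render_citation(template: str, source: str, page: int) -> str:
    """Build a markdown link such as [p.42](...) from the URL template."""
    url = template.format(source=source, page=page)
    return f"[p.{page}]({url})"


template = "https://private-site.com/cases/{source}#page={page}"
citation = render_citation(template, "court_transcript.pdf", 42)
print(citation)
```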

Help#

Global arguments#

  • --task: str

    • Accepted values:

      • query: means to load the input files then wait for user question.

      • search: means only return the document corresponding to the search

      • summarize: means the input will be passed through a summarization prompt.

      • summarize_then_query: summarize the text then open the prompt to allow querying directly the source document.

  • --filetype: str, default auto

    • the type of input. Depending on the value, different other parameters are needed. If json_entries is used, each line of the input file can contain any of those parameters as long as the line is valid json. You can find an example of json_entries file in wdoc/docs/json_entries_example.txt

    • Supported values and available arguments: For the details of each argument, see below

      • anki

        • Optional:

          • --anki_profile

          • --anki_deck

          • --anki_notetype

          • --anki_template

          • --anki_tag_filter

          • --anki_tag_render_filter

      • auto: will guess the appropriate filetype based on --path. Irrelevant for some filetypes, eg if --filetype=anki. It can also infer recursive filetypes, for example if the path leads to a .toml file.

      • epub

        • --path to a .epub file

      • json_dict

        • --path to a text file containing a single json dict

        • --json_dict_template

        • Optional:

          • --json_dict_exclude_keys

          • --metadata

      • local_audio

        • --path

        • --audio_backend

        • Optional:

          • --audio_unsilence

          • --whisper_prompt

          • --whisper_lang

          • --deepgram_kwargs

      • local_html

        • --path must point to a .html file

        • Optional:

          • --load_functions

      • local_video

        • --path

        • --audio_backend

        • Optional:

          • --audio_unsilence

          • --whisper_lang

          • --whisper_prompt

          • --deepgram_kwargs

      • logseq_markdown

        • --path path to the markdown file

      • online_media: load the url using youtube_dl to download a media (video or audio) then treat it as filetype=local_audio.

        • If youtube_dl fails to find the media, wdoc falls back to a playwright browser and tries to download any requested element that looks like a possible media.

        • Same arguments as local_audio with extra arguments:

          • --online_media_url_regex

          • --online_media_resourcetype_regex

      • online_pdf

        • Same arguments as for --filetype=pdf. Note that online_pdf is handled a bit differently than pdf: we first try to download the file and parse it with filetype=pdf, and only as a last resort do we use langchain’s integrated OnlinePDFLoader, as it’s far slower.

      • pdf

        • --path is the filepath to pdf

        • Optional:

          • --pdf_parsers

          • --doccheck_min_lang_prob

          • --doccheck_min_token

          • --doccheck_max_token

      • powerpoint

        • --path to a .ppt or .pptx etc

      • string: no parameters needed, will provide a field where you must type or paste the string

      • text (For text input as argument, not to be mistaken with txt)

        • --path is directly the text content.

        • Optional:

          • --metadata

      • txt (For text present in a txt file, not to be mistaken with text)

        • --path is path to a .txt file

      • url

        • --path must be a valid http(s) link

        • Optional:

          • --title, otherwise we try to detect it ourselves.

      • word

        • --path to a .doc, .docx, etc

      • youtube

        • --path must link to a youtube video. Note: --yt_* is automatically parsed as --youtube_*

        • Optional:

          • --youtube_language

          • --youtube_translations

          • --youtube_audio_backend

          • --whisper_prompt

          • --whisper_lang

          • --deepgram_kwargs

    • Recursive types:

      • ddg

        • --path is the search query for DuckDuckGo.

        • --ddg_max_results

        • --ddg_region, for example us-US

        • --ddg_safesearch

      • json_entries

        • --path is path to a text file that contains a json for each line containing at least a filetype and a path key/value but can contain any parameters described here

      • recursive_paths

        • --path is the starting path

        • --pattern is the globbing pattern to append

        • --exclude and --include can be a list of regex applying to found paths (include is run first then exclude, if the pattern is only lowercase it will be case insensitive)

        • --recursed_filetype is the filetype to use for each of the found path

      • youtube_playlist

        • --path must link to a youtube playlist

      • link_file

        • --path must point to a file where each line is a link that will be summarized.

        • --out_file path to text file where the summary will be added (appended). Links that have already been summarized in out_file will be skipped (the out_file is never overwritten). If a line is a markdown link, it will be parsed as a link. Empty lines and lines starting with # are ignored.
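The link_file rules above (one link per line, markdown formatting tolerated, comments and empty lines ignored, already summarized links skipped) can be sketched as follows. This is a rough illustration: `URL_RE` and `links_to_process` are hypothetical, and checking the out_file with a plain substring test is a simplifying assumption.

```python
import re
from typing import List

# Hypothetical pattern: a URL ends at whitespace or at markdown punctuation.
URL_RE = re.compile(r"https?://[^\s<>()\[\]]+")


def links_to_process(link_file_text: str, out_file_text: str = "") -> List[str]:
    """Return the URLs of the link file that do not already appear in the out file."""
    links = []
    for line in link_file_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = URL_RE.search(line)
        if match and match.group(0) not in out_file_text:
            links.append(match.group(0))
    return links


link_file = """
# reading list
- [an article](https://example.com/a)

https://example.com/b
* https://example.com/c
"""
already_done = "- https://example.com/b : already summarized"
todo = links_to_process(link_file, already_done)
print(todo)
```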


  • --model: str, default to value of WDOC_DEFAULT_MODEL

    • Keep in mind that given that the default backend used is litellm the part of model name before the slash (/) is the backend name (also called provider). If the backend is testing/ then it will be parsed as testing/testing and a fake LLM will be used for debugging purposes. It answers like a normal LLM but costs 0 and makes no sense. Note that it will automatically set the query_eval_model to testing/testing too. If the value is not part of the model list of litellm, will use fuzzy matching to find the best match.

  • --model_kwargs: dict, default None

    • dictionary of keyword arguments to pass to the model. For example {'temperature': 0}. Note that changing the kwargs will sometimes keep reusing the cache, use disable_llm_cache to avoid that.
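The provider/model naming convention described for --model can be illustrated with a few lines of Python. This is a minimal sketch: `split_model_name` is a hypothetical helper, and the fuzzy matching against litellm's model list is deliberately left out.

```python
from typing import Tuple


def split_model_name(model: str) -> Tuple[str, str]:
    """Split 'provider/model' on the first slash."""
    provider, sep, name = model.partition("/")
    if not sep:
        raise ValueError(f"expected 'provider/model', got {model!r}")
    if provider == "testing":
        # Per the description above, a testing/ backend is parsed as testing/testing.
        name = "testing"
    return provider, name


for value in ("openai/gpt-4o", "ollama/qwen3:8b", "testing/"):
    print(split_model_name(value))
```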


  • --embed_model: str, default to value of WDOC_DEFAULT_EMBED_MODEL

    • Name of the model to use for embeddings. Must contain a ‘/’. Everything before the slash is the backend and everything after it is the model name. Available backends: openai, sentencetransformers, huggingface

    • Note:

      • the device used by default for huggingface is ‘cpu’ and not ‘cuda’

      • If you change this, the embedding cache will usually need to be recomputed with the new model (the hash used to check for previous values includes the name of the model)

  • --embed_model_kwargs: dict, default None

    • dictionary of keyword arguments to pass to the embedding.

  • --save_embeds_as: str, default "{user_dir}/latest_docs_and_embeddings"

    • Only used if task is query. Saves the loaded documents and embeddings to a file in the specified directory. This can then be loaded again with --load_embeds_from to avoid recomputing embeddings. Both the document splits and their embeddings are saved there, and the location is always overwritten (i.e. no ‘updating’ of the previously saved documents and embeddings). In the default value, “{user_dir}” is automatically replaced by the path to the default cache folder for the current user, so the previous session can always be reloaded quickly with --load_embeds_from. Should not be specified at the same time as --load_embeds_from as --load_embeds_from will take priority.

  • --load_embeds_from: str, default None

    • path to the file saved using --save_embeds_as. If loading the embeddings fails, wdoc will crash instead of creating new embeddings, out of safety. Should not be specified at the same time as --save_embeds_as as --load_embeds_from will take priority.

  • --top_k: Union[int, str], default auto_200_500

    • number of chunks to look for when querying. It is high because the eval model is used to refilter the documents after the embeddings first pass. If top_k is a string, the format assumed is “auto_N_M” where N is the starting top_k and M is the max top_k value. If the number of filtered documents is more than 90% of top_k, top_k will gradually increase up to M (with N and M being int, and 0<N<M). This way you are sure not to miss any document.
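Here is a rough sketch of how an auto top_k value could be parsed and grown. Several things are assumptions and not wdoc's documented behaviour: the helper names, the fixed `step` of 100, and reading "filtered documents" as the documents kept by the eval model.

```python
from typing import Tuple, Union


def parse_top_k(top_k: Union[int, str]) -> Tuple[int, int]:
    """Return (start, maximum) for an int or an 'auto_N_M' string."""
    if isinstance(top_k, int):
        return top_k, top_k
    prefix, n, m = top_k.split("_")
    start, maximum = int(n), int(m)
    if prefix != "auto" or not 0 < start < maximum:
        raise ValueError(f"expected 'auto_N_M' with 0 < N < M, got {top_k!r}")
    return start, maximum


def next_top_k(current: int, maximum: int, n_kept: int, step: int = 100) -> int:
    """Grow top_k when more than 90% of the retrieved chunks were kept."""
    if n_kept > 0.9 * current and current < maximum:
        return min(current + step, maximum)
    return current


start, maximum = parse_top_k("auto_200_500")
print(start, maximum)
print(next_top_k(start, maximum, n_kept=190))
```

The intuition is that if almost everything retrieved turns out to be relevant, more relevant chunks are probably sitting just past the cutoff, so the search is widened.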


  • --query: str, default None

    • if str, will be directly used for the first query if task in ["query", "search", "summarize_then_query"]

  • --query_retrievers: str, default "basic_multiquery"

    • must be a string that specifies which retriever will be used for queries depending on which keyword is inside this string.

    • Possible values (can be combined if separated by _):

      • basic: cosine similarity retriever

      • multiquery: retriever that uses the LLM to reformulate your query to get different perspectives. This uses the strong LLM and, as it requires complex output parsing, for now it is recommended not to use that retriever with average models.

      • knn: knn

      • svm: svm

      • parent: parent chunk

  • --query_eval_model: str, default to value of WDOC_DEFAULT_QUERY_EVAL_MODEL

    • Cheaper and quicker model than model. Used for intermediate steps in the RAG, not used in other tasks. If the value is not part of the model list of litellm, will use fuzzy matching to find the best match. None to disable.

  • --query_eval_model_kwargs: dict, default None

    • dictionary of keyword arguments to pass to the query_eval_model. For example {'temperature': 0}. Note that changing the kwargs will sometimes keep reusing the cache; use disable_llm_cache to avoid that.

  • --query_eval_check_number: int, default 3

    • number of passes to do with the eval llm to check if the document is indeed relevant to the question. The document will not be processed further if the mean answer from the eval llm is too low. For eval llms that don’t support setting n, multiple completions will be called, which costs more. It happens that some models are incorrectly reported as having a modifiable n parameter when they actually don’t. In this case, instead of crashing, wdoc will notify you and replicate the received value n times.

  • --query_relevancy: float, default -0.5

    • threshold under which a document cannot be considered relevant by embeddings alone. Keep in mind that the score is a similarity, so it goes from -1 (most different) to +1 (most similar), although if you set WDOC_MOD_FAISS_SCORE_FN to True it will then go from 0 to 1.


  • --summary_n_recursion: int, default 0

    • after summarizing, will go over the summary that many times to fix indentation, repetitions etc.

      • 0 means disabled.

      • 1 means that the original summary will be checked once.

      • 2 means that the original summary will be checked, then the checked version will be checked again. We stop when equilibrium is reached (meaning the summary did not change).

    • If --out_file is used, each intermediate summary will be saved with the name {out_file}.n.md with n being the n-1th recursive summary.
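
The recursion with early stop at equilibrium can be sketched like this. It is a minimal illustration, not wdoc’s implementation: recursive_check is a hypothetical name and toy_check stands in for the LLM pass that fixes indentation and repetitions.

```python
def recursive_check(summary, check, n_recursion):
    """Apply `check` up to n_recursion times, stopping early at equilibrium."""
    versions = [summary]
    for _ in range(n_recursion):
        revised = check(versions[-1])
        if revised == versions[-1]:
            break  # equilibrium: this pass changed nothing
        versions.append(revised)
    return versions


def toy_check(text):
    # Toy stand-in for the LLM pass: collapse one doubled space per call.
    return text.replace("  ", " ", 1)


versions = recursive_check("a  b  c", toy_check, n_recursion=5)
print(versions)  # ['a  b  c', 'a b  c', 'a b c']
```

With n_recursion=0 the loop body never runs and only the original summary is returned, which matches “0 means disabled”.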

  • --summary_language: str, default "the same language as the document"

    • When writing a summary, the LLM will write using the language specified in this argument. If it’s [same as input], the LLM will not translate.


  • --llm_verbosity: bool, default False

    • if True, will print the intermediate reasoning steps of LLMs. If debug is set, llm_verbosity is also set to True.

  • --debug: bool, default False or WDOC_DEBUG if set

    • if True will enable langchain tracing, increase verbosity, disable multithreading for summaries and loading files, display a warning if an error is encountered when loading a file, and automatically trigger the debugger on exceptions (except if wdoc is running in docker). Note that the parallel processing will not be disabled if you manually set --file_loader_n_jobs, allowing you to debug parallel processing issues. Because in some situations LLM calls are refused because of rate limiting, this can be used to slowly but always get your answer. It implies --verbose=True. If you just want to open the debugger in case of issue, see below at WDOC_DEBUGGER. This is incompatible with running wdoc in docker. When in debugging mode, the default loading_failure is warn, but if you specify loading_failure=crash it will be honored.

  • --verbose: bool, default False or WDOC_VERBOSE if set Increase verbosity. Implied if --debug is set.

  • --dollar_limit: int, default 5

    • If the estimated price is above this limit, stop instead. Note that the cost estimate for the embeddings is using the openai tokenizer, which is not universal. This only applies to the summary and to the embeddings, not to queries. This check is skipped if the api_base url are changed using llms_api_bases. Note that the cost is assumed to be 0 for embeddings if we don’t find the price using litellm.

  • --notification_callback: Callable, default None

    • a function that must take as input a string and return the same string. Inside it you can do whatever you want with it. This can be used for example to send notification on your phone using ntfy.sh to get summaries.
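
A callback of that shape could look like the sketch below. The factory make_ntfy_callback, the topic URL and the injectable send argument are all hypothetical. The only external behavior relied on is that ntfy accepts a message as the body of an HTTP POST to the topic URL.

```python
import urllib.request


def make_ntfy_callback(topic_url, send=None):
    """Build a callback that forwards text to ntfy and returns it unchanged."""

    def _post(text):
        # ntfy takes the message as the raw body of a POST to the topic URL.
        request = urllib.request.Request(
            topic_url, data=text.encode("utf-8"), method="POST"
        )
        with urllib.request.urlopen(request, timeout=10):
            pass

    sender = send if send is not None else _post

    def callback(text: str) -> str:
        try:
            sender(text)
        except Exception as err:
            # A failed notification should not lose the summary itself.
            print(f"notification failed: {err}")
        return text  # must return the same string it received

    return callback


# Offline demonstration: a list stands in for the network call.
outbox = []
callback = make_ntfy_callback(
    "https://ntfy.sh/my_hypothetical_topic", send=outbox.append
)
print(callback("summary ready"))  # summary ready
```

Dropping the send argument makes the callback post to the topic for real. The resulting function is what would be handed over as notification_callback.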

  • --disable_llm_cache: bool, default False

    • disables caching for LLM. All caches are stored in the usual cache folder for your system. This does not disable caching for documents.

  • --file_loader_parallel_backend: str, default "loky"

    • joblib.Parallel backend to use when loading files. loky and multiprocessing refer to multiprocessing whereas threading refers to multithreading. The number of jobs can be specified with --file_loader_n_jobs, but it is a loader-specific kwarg. To use neither multiprocessing nor threading, you can set --file_loader_n_jobs=1.

  • --file_loader_n_jobs: int, default -1

    • number of jobs to use when loading files in parallel (threads or process, depending on --file_loader_parallel_backend). Set to 1 to disable parallel processing (as it can result in out of memory error if using threads and overly recursive calls). Automatically set to 1 if --debug is set or if there’s only one document to load. If -1, means use as many as possible (this is joblib’s default).

  • --private: bool, default False

    • add extra check that your data will never be sent to another server: for example check that the api_base was modified and used, check that no api keys are used, check that embedding models are local only. It will also use a separate cache from non private. Note that in the current implementation, this disables any callbacks to langfuse. If you only want to override some API endpoints, take a look at the argument --llms_api_bases. Note that the values of llms_api_bases are whitelisted when using private.

  • --llms_api_bases: dict, default None

    • a dict with keys in ["model", "query_eval_model", "embeddings"]. The corresponding value will be used to change the url of the endpoint. This is needed to use local LLMs, for example using ollama, lmstudio, etc. If you want to be sure not to leak any information to a remote server, you can use --private. Note that the values of llms_api_bases are whitelisted when using private.
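
As an illustration, a dict pointing the two LLMs at a local ollama server might look like this. The check_api_bases helper is hypothetical (wdoc does its own validation), and the URL assumes ollama’s default port on localhost.

```python
ALLOWED_KEYS = {"model", "query_eval_model", "embeddings"}


def check_api_bases(api_bases):
    """Reject unknown keys and values that are not http(s) URLs."""
    unknown = set(api_bases) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    for key, url in api_bases.items():
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"{key}: not an http(s) url: {url!r}")
    return api_bases


# Assumed local endpoint: ollama listens on port 11434 by default.
llms_api_bases = check_api_bases(
    {
        "model": "http://localhost:11434",
        "query_eval_model": "http://localhost:11434",
    }
)
print(sorted(llms_api_bases))  # ['model', 'query_eval_model']
```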

  • --oneoff: bool, default False

    • If True, will not ask for a prompt but quit right away. This is useful for example if you run several cli calls in parallel and don’t want them to take all the RAM after they’re done.

  • --version: bool, default False

    • display the version and exit

  • --cli_kwargs: dict, optional

    • Any remaining keyword argument will be parsed as a loader specific argument (see the “DocDict arguments” section below). Any unrecognized key or inappropriate value type will result in a crash.

DocDict arguments#

Also referred to as "loader specific arguments", these are
expected by a subset of loader functions. For example only loader
functions expecting audio files in their `path` argument can
receive an `audio_backend` argument.
Those arguments are validated by a `DocDict` object that allows
to check which argument is expected by loader functions instead of
wdoc. For example `--out_file` is not expected by any loader but by
`wdoc`'s `__init__` method.

Those arguments can be set at cli time but can also be used
when using recursive_paths filetype combination to have arguments specific
to a loader. They apply depending on the value of `--filetype`.
An unexpected argument for a given filetype will result in a crash.
  • --path: str or Path

    • Used by most loaders. For example for --filetype=youtube the path must point to a youtube video.

  • --pdf_parsers: str or List[str], default: pymupdf

    • list of strings or comma-separated list of strings where each string is a key of the dict pdf_loaders in ./utils/loaders.py. Case insensitive. The parsers are used in the order of this list. Not all parsers are tried. Instead, after each parsing we use fasttext and heuristics based on the doccheck_* args to rank the quality of the parsing. We stop if 1 parsing is good enough, or take the best if 3 parsings worked. Note that online_pdf is handled a bit differently than pdf: we first try to download it then parse it with filetype=pdf, and as a last resort we use langchain’s integrated OnlinePDFLoader as it’s far slower.

    Currently implemented:

    • Okayish metadata:

      • pymupdf

      • pdfplumber

    • Few metadata:

      • pdfminer

      • pypdfloader

      • pypdfium2

      • openparse (also has table support but quite slow)

    • pdftotext (fastest and most basic but can be unavailable depending on your install)

    • Very slow but theoretically the best are from unstructured:

      • unstructured_fast

      • unstructured_elements_fast

      • unstructured_hires

      • unstructured_elements_hires

      • unstructured_fast_clean_table

      • unstructured_elements_fast_clean_table

      • unstructured_hires_clean_table

      • unstructured_elements_hires_clean_table

    Notes, to the best of my knowledge:

      • ‘fast’ means not AI based, as opposed to ‘hires’

      • ‘elements’ means the parser returns each element of the pdf instead of collating them in the rendering

      • ‘clean’ means it tries to remove the extra whitespace

      • ‘table’ means it will try to infer table structure (AI based)

  • --anki_profile: str

    • The name of the profile

  • --anki_deck: str

    • The beginning of the deckname, e.g. science::physics::freshman_year::lesson1. Note that we only look at decks; filtered decks are not taken into account (so a card of deck ‘A’ that is temporarily in ‘B::filtered_deck’ will still be considered as part of ‘A’).

  • --anki_notetype: str

    • If it’s part of the card’s notetype, that notetype will be kept. Case insensitive. Note that suspended cards are always ignored.

  • --anki_template: str

    • The template to use for the anki card. For example if you have a notetype with fields “fieldA”, “fieldB”, “fieldC” then you could set --anki_template="Question:{fieldA}\nAnswer:{fieldB}". The field “fieldC” would not be used and each document would look like your template. Notes:

    • ‘{tags}’ can be used to include a ‘\n* ‘ separated string of the tag list. Use --anki_tag_render_filter to restrict which tags can be shown (to avoid privacy leakage). Example of what the tag formatting looks like:

      Anki tags:
      '''
      * my::tag1
      * my_othertag
      '''

    • ‘{allfields}’ can be used to automatically format all fields (not including tags). It will be replaced by “fieldA: ‘fieldAContent’\n\nfieldB: ‘fieldBContent’” etc. The ‘ quotes are added.

    • The default value is ‘{allfields}\n{image_ocr_alt}’.

    • ‘{image_ocr_alt}’ if present will be replaced by any text present in the ‘title’ or ‘alt’ field of an html image. This is usually OCR so can be useful for the LLM.
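
The placeholder rules above can be approximated with Python’s str.format. This is a simplified sketch: render_card is a hypothetical function, and it omits the “Anki tags:” wrapper shown in the tag formatting example.

```python
def render_card(fields, tags, template="{allfields}\n{image_ocr_alt}", image_ocr_alt=""):
    """Fill an anki_template-style string from a card's fields and tags."""
    # {allfields}: every field as "name: 'content'", separated by blank lines.
    allfields = "\n\n".join(f"{name}: '{content}'" for name, content in fields.items())
    values = dict(fields)  # each field is available under its own name
    values.update(
        allfields=allfields,
        tags="\n* ".join(tags),  # '\n* ' separated tag list
        image_ocr_alt=image_ocr_alt,
    )
    return template.format(**values)


card = {"fieldA": "What is RAG?", "fieldB": "Retrieval-Augmented Generation", "fieldC": "unused"}
print(render_card(card, ["my::tag1"], "Question:{fieldA}\nAnswer:{fieldB}"))
# Question:What is RAG?
# Answer:Retrieval-Augmented Generation
```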

  • --anki_tag_filter: str. Only keep the cards that have tags matching this regex.

  • --anki_tag_render_filter: str Only the tags that match this regex will be put in the template. Careful, this does not mean “only keep cards that have tags matching this filter” but rather “only mention the tags matching this filter in the final document”.

  • --json_dict_template: str. String that must contain {key} and {value}, which will be replaced by the content of the json dict so that each document corresponds to a single key/value pair rendered through the template.

  • --json_dict_exclude_keys: list of strings all those keys will be ignored.
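
How the two json_dict arguments interact can be sketched as follows. json_to_documents is a hypothetical helper, and using str.format for the substitution is an assumption of this sketch.

```python
import json


def json_to_documents(raw_json, template, exclude_keys=()):
    """Turn each key/value pair of a JSON dict into one templated document."""
    if "{key}" not in template or "{value}" not in template:
        raise ValueError("template must contain {key} and {value}")
    data = json.loads(raw_json)
    return [
        template.format(key=key, value=value)
        for key, value in data.items()
        if key not in exclude_keys  # excluded keys produce no document
    ]


docs = json_to_documents(
    '{"aspirin": "analgesic", "internal_id": 42, "ibuprofen": "NSAID"}',
    template="{key} is described as: {value}",
    exclude_keys=["internal_id"],
)
print(docs)
# ['aspirin is described as: analgesic', 'ibuprofen is described as: NSAID']
```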

  • --metadata: str either as a string that will be parsed as a json dict, or as a dict.

  • --audio_backend: str

    • either ‘whisper’ or ‘deepgram’ to transcribe audio. Not taken into account for the filetype “youtube”. Taken into account if filetype is “local_audio” or “local_video”.

  • --audio_unsilence: bool, default to True.

    • When processing audio files, remove silence before transcribing.

  • --whisper_lang: str

    • if using whisper to transcribe an audio file, this is the language specified to whisper

  • --whisper_prompt: str

    • if using whisper to transcribe an audio file, this is the prompt given to whisper

  • --deepgram_kwargs: dict

    • if using deepgram for transcription, those arguments will be used.

Note: --yt_* is automatically parsed as --youtube_

  • --youtube_language: List[str]

    • For youtube. e.g. ["fr-orig", "fr","en"] to use french transcripts if possible and english otherwise.

    • If unset, wdoc lists the video’s available subtitle tracks and picks the first one ending in -orig (youtube’s original-language track, e.g. fr-orig for a french video). If no -orig track exists, it falls back to ["en", "en-US", "en-UK"].

  • --youtube_translation: str

    • For youtube. e.g. en to use the transcripts after translation to english (translation provided by youtube)

  • --youtube_audio_backend: str Either ‘youtube’, ‘whisper’ or ‘deepgram’. Default is ‘youtube’.

    • If ‘youtube’: will take the youtube transcripts as text content.

    • If ‘whisper’: wdoc will download the audio from the youtube link, and whisper will be used to turn the audio into text. whisper_prompt and whisper_lang will be used if set.

    • If ‘deepgram’ will download the audio from the youtube link, and deepgram will be used to turn the audio into text. --deepgram_kwargs will be used if set.

  • --include: str

    • Only active if --filetype is ‘recursive_paths’. --include can be a list of regex that must be present in the document PATH (not content!). --exclude can be a list of regex that, if present in the PATH, will exclude it. Exclude is run AFTER include.

  • --exclude: str

    • See --include
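
A sketch of the include-then-exclude order on a list of paths follows. filter_paths is a hypothetical name, and two points are assumptions rather than documented behavior: that every include regex must be found in the path, and that patterns are searched anywhere in the path rather than anchored at its start.

```python
import re


def filter_paths(paths, include=(), exclude=()):
    """Keep paths matching the include regexes, then drop the excluded ones."""
    inc = [re.compile(pattern) for pattern in include]
    exc = [re.compile(pattern) for pattern in exclude]
    # Step 1: include (assumed here: all include patterns must be present).
    kept = [p for p in paths if all(r.search(p) for r in inc)]
    # Step 2: exclude runs AFTER include, on what is left.
    return [p for p in kept if not any(r.search(p) for r in exc)]


paths = ["docs/a.pdf", "docs/draft_b.pdf", "notes/c.txt"]
print(filter_paths(paths, include=[r"\.pdf$"], exclude=["draft"]))  # ['docs/a.pdf']
```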

Other specific arguments#

  • --out_file: str or Path, default None

    • For summaries: if wdoc creates a summary and out_file is given, the summary will be written to this file. Note that the file is not erased and wdoc will simply append to it.

    • For queries: If provided, the final answer and intermediate answers will be appended to this file in addition to being displayed in the terminal.

    • If --summary_n_recursion is used, additional files will be created with the name {out_file}.n.md with n being the n-1th recursive summary.

  • --citation_url_template: str, default None

    • Optional URL template for turning page citations into clickable markdown links in summaries. When set, citations like [p.42] become [p.42](https://your-site.com/doc.pdf#page=42).

    • Available placeholders: {page} (page number), {source} (source file path or label).

    • Example: --citation_url_template="https://private-site.com/docs/{source}#page={page}"

    • Note: even without this template, summaries of documents with page metadata (e.g. PDFs) will automatically include [p.N] citations on bullet points. For multi-file summaries, citations include the filename: [p.N, file.pdf].

    • This feature was developed with Claude Code.
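
The citation rewriting can be illustrated with a small regex substitution. This sketch is not wdoc’s code: link_citations is a hypothetical name and it only handles the single-file [p.N] form, not the multi-file [p.N, file.pdf] form.

```python
import re

# Matches [p.42] but not a citation that is already followed by a (url).
CITATION = re.compile(r"\[p\.(\d+)\](?!\()")


def link_citations(summary, template, source):
    """Turn [p.N] citations into markdown links built from the template."""

    def _link(match):
        page = match.group(1)
        url = template.format(page=page, source=source)
        return f"[p.{page}]({url})"

    return CITATION.sub(_link, summary)


template = "https://private-site.com/docs/{source}#page={page}"
print(link_citations("- Key finding [p.42]", template, source="doc.pdf"))
# - Key finding [p.42](https://private-site.com/docs/doc.pdf#page=42)
```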

  • --filter_metadata: list or str, default None

    • list of regex string to use as metadata filter when querying. Format: [kvb][+-]your_regex

    For example:

    • Keep only documents that contain anki in any value of any of its metadata dict: --filter_metadata=v+anki <- at least the filetype key will have as value anki

    • Keep only documents that contain anki_profile as a key in its metadata dict: --filter_metadata=k+anki_profile <- because will contain the key anki_profile

    • Keep only data that have a certain source_tag value: --filter_metadata=b+source_tag:my_source_tag_regex

    Notes:

    • Each filter must be a regex string beginning with k, v or b (for key, value or both), followed by either + or -:

      • +: at least one metadata should match

      • -: exclude (no metadata should match)

    • If the string starts with k, it will filter based on the keys of the metadata. If it starts with v, it will filter based on the values. If it starts with b, it requires a : to be present: everything left of the : is a regex matching a key, and everything right of the : is a regex matching the value of the matched key.

    • Filters are only relevant for task related to queries and are ignored for summaries.

    • Smartcasing is used: if the filter is its own lowercase version then insensitive casing will be used, otherwise not.

    • The function used to check the matching is pattern.match

    • The filtering is not done at the search time but before it. We first scan all the corresponding documents, then delete the useless embeddings from the docstore. This makes the whole search faster. But the embeddings are not saved afterwards so they are not lost, just not present in memory for this prompt.
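
The filter syntax, smartcasing and pattern.match semantics described in these notes can be sketched as below. All function names are hypothetical and metadata values are simply converted with str(), which is an assumption.

```python
import re


def compile_smart(pattern):
    """Smartcasing: an all-lowercase pattern is matched case-insensitively."""
    flags = re.IGNORECASE if pattern == pattern.lower() else 0
    return re.compile(pattern, flags)


def matches(spec, metadata):
    """Evaluate one [kvb][+-]regex filter against a metadata dict."""
    target, sign, body = spec[0], spec[1], spec[2:]
    if target not in ("k", "v", "b") or sign not in ("+", "-"):
        raise ValueError(f"bad filter: {spec!r}")
    if target == "b":
        # Left of ':' matches a key, right of ':' matches that key's value.
        key_pat, value_pat = (compile_smart(p) for p in body.split(":", 1))
        hit = any(
            key_pat.match(str(k)) and value_pat.match(str(v))
            for k, v in metadata.items()
        )
    else:
        pat = compile_smart(body)
        items = metadata.keys() if target == "k" else metadata.values()
        hit = any(pat.match(str(item)) for item in items)
    return hit if sign == "+" else not hit


def apply_filters(docs, specs):
    """Keep only the metadata dicts that satisfy every filter."""
    return [doc for doc in docs if all(matches(spec, doc) for spec in specs)]


card = {"filetype": "anki", "anki_profile": "Main", "source_tag": "cards"}
paper = {"filetype": "pdf", "source_tag": "papers"}
print(apply_filters([card, paper], ["v+anki"]) == [card])             # True
print(apply_filters([card, paper], ["b+source_tag:pap"]) == [paper])  # True
```

Because pattern.match anchors at the start of the string, a filter such as v+anki would not match a value like "my anki deck"; prefix the regex with .* for a match anywhere.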

  • --filter_content: list or str, default None

    • Like --filter_metadata but filters through the page_content of each document instead of the metadata. Syntax: [+-]your_regex. Examples:

    • Keep only the documents that contain wdoc: --filter_content=+.*wdoc.*

    • Discard the documents that contain wdoc: --filter_content=-.*wdoc.*

  • --embed_instruct: bool, default None

    • when loading an embedding model using the HuggingFace backend, whether to wrap the input sentence using the instruct framework or not.

  • --load_functions: List[str], default None

    • list of strings that, when evaluated in python, result in a list of callables. The first must take one input of type string and the last function must return one string.

    For example in the filetypes local_html this can be used to specify lambda functions that modify the text before running BeautifulSoup. Useful to decode html stored in .js files. Do tell me if you want more of this.
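
One way to picture this is a list of lambda source strings that are evaluated and then applied in order. build_pipeline is a hypothetical name and chaining the callables one after another is an assumption of this sketch. Since the strings are evaluated as python, only pass code you trust.

```python
def build_pipeline(load_functions):
    """Evaluate each source string to a callable and chain them in order."""
    funcs = [eval(source) for source in load_functions]  # trusted input only
    if not all(callable(func) for func in funcs):
        raise TypeError("every entry must evaluate to a callable")

    def pipeline(text):
        for func in funcs:
            text = func(text)
        return text

    return pipeline


# Toy case: html stored as a string assignment inside a .js file.
load_functions = [
    "lambda text: text.split('=', 1)[1]",       # drop the 'var page =' prefix
    "lambda text: text.strip().strip(';\"')",   # drop the trailing ; and the quotes
]
pipeline = build_pipeline(load_functions)
print(pipeline('var page = "<p>hello</p>";'))  # <p>hello</p>
```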

  • --ddg_max_results: int, default 50

    • Number of results to ask from DuckDuckGo when using --filetype=ddg.

  • --ddg_region: str, default "" (empty, meaning no specific region)

    • Region to ask DuckDuckGo results from. For example us-US.

  • --ddg_safesearch: str, default off

    • Either on, moderate or off.

  • --doccheck_min_lang_prob: float, default 0.5

    • float between 0 and 1. A document is considered invalid if the highest language probability estimated by fasttext’s langdetect is below this threshold. For example, setting it to 0.9 means that only documents that fasttext thinks have at least a 90% probability of being in some language are valid.

  • --doccheck_min_token: int, default 50

    • if we find fewer than that many tokens in a document, crash.

  • --doccheck_max_token: int, default 10_000_000

    • if we find more than that many tokens in a document, crash.

  • --online_media_url_regex: str

    • a regex that if matching a request’s url, will consider the request to be leading to a media. We then try to fetch those media using youtube_dl. The default is already a sensible value.

  • --online_media_resourcetype_regex: str

    • Same as --online_media_url_regex but checking request.resource_type

  • --source_tag: str, default None

    • a string that will be added to the document metadata at the key source_tag. Useful when using filetype combination. It is EXTREMELY recommended to include a source_tag for any document you want to save, especially if using recursive filetypes. This is because after loading all documents wdoc uses the source_tag to see if it should continue or crash. If you want to load 10_000 pdfs in one go as I do, then it makes sense to continue if some failed to load, but not if a whole source_tag is missing.

  • --loading_failure: str, default warn

    • either crash or warn. Determines what to do with exceptions happening when loading a document. This can be set per document if a recursive_paths filetype is used. If using wdoc_doc_file it is by default set to crash. When using wdoc parse, the default value is crash.

Environment variables#

  • WDOC_DEBUG

    • Setting to true has the same effects as using --debug=True.

  • WDOC_VERBOSE

    • Setting to true has the same effects as using --verbose=True. Always set to true if WDOC_DEBUG is set to true.

  • WDOC_TYPECHECKING

    • Setting for runtime type checking. Default value is warn. The typing is checked using beartype so shouldn’t slow down the runtime.

    • Possible values:

      • disabled: disable typechecking.

      • warn: print a red warning if a typechecking fails.

      • crash: crash if a typechecking fails in any function.

  • WDOC_NO_MODELNAME_MATCHING

    • If “false”: will try to infer the model name based on a more human readable string. For example ‘4o’ might be matched to ‘openai/gpt-4o’. Useful for exotic models, models that are fresh out of the oven, or bugs with backend parsing. As it can lead to issues, it was decided to disable the matching by default, hence the default value is True.

  • WDOC_ALLOW_NO_PRICE

    • if “true”, won’t crash if no price was found for the given model. Useful if litellm has not yet updated its price table. Default is False.

  • WDOC_OPEN_ANKI

    • if “true”, will automatically ask whether to open the anki browser if cards are found in the sources. Only used if task is query or search. Default is False.

  • WDOC_STRICT_DOCDICT

    • if “True”, will crash instead of printing if trying to set an unexpected argument in a DocDict. Otherwise, you can specify things like “anki_profile” as argument to filetype “pdf” without crashing. This makes no sense but can be useful if there’s a bug in wdoc that is not yet fixed and you want to continue in the meantime.

    • If set to “False”: we print in red unexpected arguments but add them anyway.

    • If set to “strip”: we print in red unexpected arguments and ignore them. Default is False.

  • WDOC_MAX_LOADER_TIMEOUT

    • Number of seconds to wait before giving up on loading a document (this does not include recursive types, only the DocDict arguments). Default is -1 to disable. Disabled if <= 0.

  • WDOC_MAX_PDF_LOADER_TIMEOUT

    • Number of seconds to wait for each pdf loader before giving up on this loader. This includes the online_pdf loader. Note that it probably makes PDF parsing substantially slower. Default is -1 to disable. Disabled when using --file_loader_parallel_backend=threading as python does not allow it. Also disabled if <= 0.

  • WDOC_DEBUGGER

    • If True, will open the debugger in case of issue. Implied by --debug. Incompatible with WDOC_IN_DOCKER. Default is False.

  • WDOC_IN_DOCKER

    • Flag set automatically, used to modify some behaviors to avoid issues when running wdoc inside docker. Incompatible with WDOC_DEBUGGER. Default is False

  • WDOC_EXPIRE_CACHE_DAYS

    • If an int, will remove any cached value that is older than that many days. Otherwise keep forever. Default is 0 to disable.

  • WDOC_EMPTY_LOADER

    • If True, loading any kind of document will return an empty string. Used for debugging. Default is False.

  • WDOC_BEHAVIOR_EXCL_INCL_USELESS

    • If an “include” or “exclude” key is found in a loader but does not actually change anything, if warn then just print in red but if crash then raise an error. Default is warn.

  • WDOC_PRIVATE_MODE

    • You should never set it yourself. It is set automatically if the --private argument is used, and used throughout to triple check that it’s indeed fully private.

  • WDOC_IMPORT_TYPE, default native

    • If native will just import the packages needed by wdoc without any tricks. This is the default as it’s bug-free but can be a bit slower to start up.

    • If thread, will try to use a separate thread to import packages making the startup time potentially smaller.

    • If lazy, will use lazy loading on some packages, making the startup time potentially smaller.

    • If both, will try to use both. All other than native are experimental as they rely on weird python tricks that may cause issues.

  • WDOC_LOADER_LAZY_LOADING, default True

    • If True, the functions used to load documents (e.g. load_anki, load_online_pdf etc) will be imported only when needed. This is faster but experimental for now. If False, we import all the loader functions on start.

  • WDOC_MOD_FAISS_SCORE_FN, default True

  • WDOC_FAISS_COMPRESSION, default True

    • If True, zlib compression is applied around the pickling stage (=save_local/load_local) of the faiss index. Disable this if you want to use your faiss indexes with other software without using wdoc’s custom classes. If False, WDOC_FAISS_BINARY must also be False. Note that you can switch value between runs, as the uncompressed loading is used as fallback.

  • WDOC_FAISS_BINARY, default False

    • If True, use a custom langchain vectorstore mimicking FAISS but using binary embeddings, resulting in a 32x compression ratio and faster search without hurting performance too much. Note that binary indexes of FAISS only support embeddings whose dimension is a multiple of 8, so when that is not the case we add null dimensions. Note that if you switch this value between the index creation and index usage, you’ll probably encounter errors and should rather set it once then recreate your vectorstores.
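
The 32x figure follows from storing one bit per dimension instead of a 32-bit float. The sketch below illustrates the idea in pure Python; binarizing by sign is an assumption (the text does not say how wdoc binarizes), and binarize and hamming are hypothetical names.

```python
def binarize(vector):
    """Pack a float vector into bytes, one bit per dimension (1 if x > 0)."""
    bits = [1 if x > 0 else 0 for x in vector]
    bits += [0] * (-len(bits) % 8)  # pad with null dimensions to a multiple of 8
    packed = bytearray()
    for start in range(0, len(bits), 8):
        byte = 0
        for bit in bits[start:start + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


def hamming(a, b):
    """Number of differing bits, the usual distance between binary codes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


code = binarize([0.5, -1.0, 0.2])       # bits 101 padded to 10100000
print(code)                             # b'\xa0'
float_bytes = 1536 * 4                  # a 1536-dimension float32 embedding
binary_bytes = len(binarize([0.1] * 1536))
print(float_bytes // binary_bytes)      # 32
```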

  • WDOC_LLM_MAX_CONCURRENCY, default 1

    • Set the max_concurrency limit to give langchain. If debug is used, it is overridden and set to 1. Must be an int.

  • WDOC_LLM_REQUEST_TIMEOUT, default 600

    • Sets the timeout in seconds for requests made to the LLM. This helps prevent indefinite hangs if the LLM provider is unresponsive. For example with ollama.

  • WDOC_MAX_CHUNK_SIZE, default 32_000

    • When splitting large text into chunks, wdoc infers the maximum context size from litellm’s models metadata. The maximum chunk size is capped by this value, as the maximum advertised context length is usually optimistic and is often at the cost of prompt adherence. Note that the chunk size inferred for query is not the same as for summary as we need a much better prompt adherence for the latter. This can also be used to avoid chunking when querying a text if you want the LLM to have the entire text as context instead of chunking.

  • WDOC_MAX_EMBED_CONTEXT, default: 7_000

    • This variable sets the maximum token_size for document chunks when the task is query or search. This is necessary because some large language models (LLMs) might have a larger context window than their corresponding embedding models. The actual maximum chunk size will be the minimum of WDOC_MAX_CHUNK_SIZE and WDOC_MAX_EMBED_CONTEXT.

  • WDOC_SEMANTIC_BATCH_MAX_TOKEN_SIZE, default: 2000

    • Token size considered maximum for a single batch when doing semantic batching. The tokenizer used is the one from gpt-4o-mini as we don’t have access to most models’ tokenizers. Each batch contains at least two intermediate answers so it’s not an absolute limitation but increasing it should reduce the cost of the “combine intermediate answers” step when querying.

  • WDOC_DEFAULT_MODEL, default: "openrouter/deepseek/deepseek-v4-pro"

    • Default strong LLM to use. This is the strongest model: it will be used to answer the query about each document and to combine those answers. It can also be used by some retrievers etc.

  • WDOC_DEFAULT_QUERY_EVAL_MODEL, default: "openrouter/deepseek/deepseek-v4-flash"

    • Default small LLM to use. It will be used to evaluate whether each document is relevant to the query or not.

  • WDOC_DEFAULT_EMBED_MODEL, default: "openai/text-embedding-3-small"

    • Default model to use for embeddings.

  • WDOC_DEFAULT_EMBED_DIMENSION, default: none

    • Default number of dimension to ask from the embeddings provider.

  • WDOC_EMBED_TESTING, default: True

    • If False, will skip the test of the embeddings model on simple sentences to find out if we loaded everything correctly.

  • WDOC_DISABLE_EMBEDDINGS_CACHE, default: False

    • If True, bypasses the caching mechanism for embeddings and uses the embeddings model directly. This can be useful for debugging or when you want to ensure fresh embeddings are generated for each document.

    • Note that disabling the cache only affects new queries, new documents, or during semantic batching. It will NOT affect embeddings that are loaded via load_embeds_from, as those embeddings are already pre-computed and stored.

  • WDOC_LANGFUSE_PUBLIC_KEY, default: None

    • If present, will replace the env variable LANGFUSE_PUBLIC_KEY.

  • WDOC_LANGFUSE_SECRET_KEY, default: None

    • If present, will replace the env variable LANGFUSE_SECRET_KEY.

  • WDOC_LANGFUSE_HOST, default: None

    • If present, will replace the env variable LANGFUSE_HOST.

  • WDOC_LITELLM_TAGS, default: None

    • If a comma separated list of string: they will be put as tags in the litellm LLM request via the ChatLiteLLM object.

  • WDOC_LITELLM_USER, default: wdoc_llm

    • Put as user argument when creating ChatLiteLLM object that talks to LLMs.

  • WDOC_CONTINUE_ON_INVALID_EVAL, default: True

    • If True, instead of raising an InvalidDocEvaluationByLLMEval exception when an eval LLM returns output that can’t be parsed, the system will print the error message in red and return “5” as the evaluation score. This allows the process to continue despite evaluation parsing failures.

    • If False, the system will raise the exception as normal, which typically causes the process to terminate.

  • WDOC_INTERMEDIATE_ANSWER_MAX_TOKENS, default: 4000

    • Sets the maximum number of tokens allowed for each intermediate answer when querying documents. This controls how much content the LLM generates for each document before these answers are combined into the final response. Lower values may reduce costs but might lose important details, while higher values allow for more comprehensive individual document analysis.

  • WDOC_WHISPER_PARALLEL_SPLITS, default: True

    • If True, when audio files need to be split for whisper transcription (due to size limits), the splits will be processed in parallel using joblib. This can significantly speed up transcription of large audio files when using remote whisper services.

    • If False, audio splits will be processed sequentially. It is recommended to set this to False when using a local whisper instance to avoid overwhelming the local system with concurrent requests.

  • WDOC_WHISPER_ENDPOINT, default: ""

    • If provided, sets a custom API endpoint for Whisper transcription services. This allows you to use local Whisper instances or alternative Whisper-compatible services instead of OpenAI’s default endpoint.

    • When empty, uses the default OpenAI Whisper endpoint.

  • WDOC_WHISPER_API_KEY, default: ""

    • If provided, sets a custom API key for Whisper transcription services. This is useful when using alternative Whisper-compatible services that require their own authentication.

    • When empty, uses the default OPENAI_API_KEY environment variable.

  • WDOC_WHISPER_MODEL, default: "whisper-1"

    • Specifies which Whisper model to use for audio transcription. This can be any model supported by your Whisper endpoint.

    • Common values include “whisper-1” for OpenAI’s service, or model names like “base”, “small”, “medium”, “large” for local instances.

  • WDOC_APPLY_ASYNCIO_PATCH, default: False

    • If True, applies the nest_asyncio patch to fix the Event loop closed error that can occur with Ollama and other async-based LLM providers. Set to False if you’re experiencing issues with asyncio or if you’re handling asyncio patching elsewhere in your application. See https://github.com/BerriAI/litellm/pull/7625/files

Parse Doc#

Description#

parse_doc is the function called when you run wdoc parse_doc --path=my_path. It accepts essentially the same file-related arguments as wdoc and completely bypasses everything related to summarising, querying, and LLMs. It is therefore meant to be used as a utility that parses any input to text. For example, you can use it to quickly parse a document and send the text to @simonw’s llm or any other shell utility.

Arguments#

  • filetype: str

    • Same as for wdoc

  • format: str, default text

    • if text: returns the text, with the splits joined by a newline

    • if split_text: returns the text, with indicators for the document splits

    • if xml: returns the text in an XML-like format

    • if langchain: returns a list of langchain Documents

    • if langchain_dict: returns a list of langchain Documents as Python dicts (easy to serialize to JSON, and metadata are included)

  • debug: bool, default False

    • Same as for wdoc

  • verbose: bool, default False

    • Same as for wdoc

  • out_file: str or Path, default None

    • If specified, writes the output to the given file path.

    • If the file exists and is binary, the function will crash.

    • Otherwise, the output will be appended to the file (no overwrite).

    • The output is still returned normally for programmatic use.

  • **kwargs

    • Remaining keyword arguments are assumed to be DocDict arguments. The full list is available at wdoc.utils.misc.filetype_arg_types or in the “DocDict arguments” section of wdoc --help.
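
Because out_file appends to existing files and fails on binary ones, it can be worth checking a target path before handing it to parse_doc. The sketch below is one way to do that. The NUL-byte heuristic and the helper names are assumptions made for this example. How wdoc itself decides that a file is binary is not documented here.

```python
from pathlib import Path


def sample_is_binary(sample: bytes) -> bool:
    """Heuristic (an assumption): data containing a NUL byte is treated as binary."""
    return b"\x00" in sample


def check_out_file(path, sample_size: int = 8192) -> Path:
    """Return `path` as a Path if appending text to it looks safe.

    Raises ValueError when the file already exists and looks binary.
    A path that does not exist yet is always accepted.
    """
    path = Path(path)
    if path.is_file():
        with open(path, "rb") as handle:
            if sample_is_binary(handle.read(sample_size)):
                raise ValueError(f"{path} looks binary; refusing to append text to it")
    return path
```

Since existing text files are appended to rather than overwritten, delete or rotate the file first if you want a fresh output.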

Return value#

  • Depending on format, either the document’s page_content as a string, or a list of langchain Documents (each with page_content and metadata attributes, or their dict equivalents for langchain_dict).
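
Since the return type depends on format, calling code may want to normalise it. The following sketch collapses any of the shapes described above into one string. It assumes that parse_doc is reachable from Python as wdoc.parse_doc and that langchain_dict items carry a "page_content" key. Both are assumptions to verify against your installed version, and the helper names are hypothetical.

```python
from typing import Any, List, Union


def to_plain_text(result: Union[str, List[Any]], separator: str = "\n") -> str:
    """Collapse a parse_doc result (str, Documents, or dicts) into one string."""
    if isinstance(result, str):
        return result
    parts = []
    for doc in result:
        if isinstance(doc, dict):
            # Assumption: langchain_dict items expose a "page_content" key.
            parts.append(doc["page_content"])
        else:
            # langchain Documents expose page_content as an attribute.
            parts.append(doc.page_content)
    return separator.join(parts)


def parse_to_text(path: str, **kwargs) -> str:
    """Parse a file with wdoc and return plain text (requires wdoc installed)."""
    # Imported lazily so the rest of this sketch loads without wdoc.
    from wdoc import wdoc

    # Assumption: parse_doc can be called as wdoc.parse_doc from Python.
    return to_plain_text(wdoc.parse_doc(path=path, format="langchain", **kwargs))
```

For example, parse_to_text("my_file.pdf") would return the joined text of every split, where my_file.pdf is a placeholder path.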

Docker Setup#

This directory contains an experimental dockerized Gradio web interface for wdoc, designed for easy deployment and use.

Prerequisites#

This setup assumes you have already cloned the wdoc repository:

git clone https://github.com/thiswillbeyourgithub/wdoc.git
cd wdoc/docker

Note: No pre-built Docker images are provided. You’ll either build the image locally from the cloned repository or install it from PyPI, depending on the value of COMPILE_OR_INSTALL (see below).

Quick Start#

All commands below should be run from the docker subdirectory of the wdoc repository.

  1. Configure environment variables: Copy and edit the environment file (both files are in the ./docker directory):

    cp custom_env.example custom_env
    # Manually edit custom_env to add your API keys (ANTHROPIC_API_KEY, etc.)
    
  2. Start the service:

    sudo docker compose up
    
  3. Access the web interface: Open your browser to http://localhost:7618
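
If you start the service from a script, a small check can confirm that the interface is reachable before you open a browser. This is a minimal sketch using only the Python standard library. The function name gui_is_up is hypothetical, and the default URL is simply the address given in step 3.

```python
import urllib.request


def gui_is_up(url: str = "http://localhost:7618", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers successfully at `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses:
        # the server refused the connection, timed out, or returned an error.
        return False


if __name__ == "__main__":
    print("wdoc GUI reachable:", gui_is_up())
```

The first start can take a while because the image has to be built, so a False result shortly after docker compose up is not necessarily a failure.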

Architecture#

  • Build modes: The Docker image can be built in two ways controlled by the COMPILE_OR_INSTALL build argument:

    • compile (default): Installs wdoc from the local repository source in editable mode. Use this for development or when you need the latest changes.

    • install: Installs wdoc from PyPI. Use this for a stable, released version.

    To change the build mode, set the environment variable before building:

    COMPILE_OR_INSTALL=install sudo docker compose up -d --build
    
  • Container user: Runs as non-root user wdoc (UID:GID 1000:1000) for security

  • Port: Exposes Gradio on port 7618 (mapped from internal port 7860)

  • Volumes (relative to the ./docker directory):

    • ./vectorstore: Persistent storage for document embeddings

    • ./wdoc_cache: LLM cache to reduce API costs and improve performance

Troubleshooting#

Permission Errors#

If you encounter permission errors on first startup, particularly related to the cache directory, this is typically because Docker created the volume directories with root ownership.

Solution: From the docker directory, change ownership to match the container’s user (UID:GID 1000:1000):

# Make sure you're in the docker directory
cd wdoc/docker

# Fix permissions
sudo chown -R 1000:1000 ./vectorstore ./wdoc_cache

# Or if the directories don't exist yet:
mkdir -p ./vectorstore ./wdoc_cache
sudo chown -R 1000:1000 ./vectorstore ./wdoc_cache
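
To confirm whether ownership is actually the problem, you can inspect the directories before changing anything. The sketch below reports which of the given paths are not owned by the expected UID:GID. The function name is hypothetical, and only the top-level directories are checked, not their contents.

```python
import os
from pathlib import Path


def ownership_mismatches(paths, uid: int = 1000, gid: int = 1000) -> list:
    """Return the existing paths whose owner is not uid:gid.

    Paths that do not exist are skipped, since Docker will create them.
    """
    mismatched = []
    for raw in paths:
        path = Path(raw)
        if not path.exists():
            continue
        info = path.stat()
        if (info.st_uid, info.st_gid) != (uid, gid):
            mismatched.append(str(path))
    return mismatched


if __name__ == "__main__":
    # Run from the docker directory, like the commands above.
    for path in ownership_mismatches(["./vectorstore", "./wdoc_cache"]):
        print(f"{path} is not owned by 1000:1000")
```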

Alternative: If you’re running with a different user ID, you can modify docker-compose.yml to use your current user:

user: "${UID}:${GID}"

Then run with:

sudo docker compose up

Checking Logs#

To view the application logs:

sudo docker compose logs -f wdoc-gui

Rebuilding After Changes#

If you’ve modified gui.py or Dockerfile:

sudo docker compose down
sudo docker compose build --no-cache
sudo docker compose up

Configuration#

Environment Variables#

Create a custom_env file in the docker directory with your configuration:

# Required: API keys for your LLM provider
ANTHROPIC_API_KEY=sk-ant-...
# Or for other providers:
# OPENAI_API_KEY=...
# GEMINI_API_KEY=...

# Optional: Default models
WDOC_DEFAULT_MODEL=openai/gpt-4o-mini
WDOC_DEFAULT_EMBED_MODEL=openai/text-embedding-3-small

# Optional: Langfuse integration (if using)
LANGFUSE_PUBLIC_KEY=pk-...
LANGFUSE_SECRET_KEY=sk-...
LANGFUSE_HOST=https://cloud.langfuse.com
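
A missing or commented-out API key is an easy mistake to make in this file. The sketch below parses text in the KEY=VALUE form shown above and lists which provider keys are set. It is a simplified parser written for this example (no quote handling, no variable expansion), and the function names are hypothetical. The key names are the three that appear in the example file.

```python
API_KEY_NAMES = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")


def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def configured_api_keys(text: str) -> list:
    """Return the provider key names that have a non-empty value."""
    env = parse_env_file(text)
    return [name for name in API_KEY_NAMES if env.get(name)]


if __name__ == "__main__":
    with open("custom_env", encoding="utf-8") as handle:
        found = configured_api_keys(handle.read())
    print("API keys set:", ", ".join(found) if found else "none")
```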

Volume Paths#

You can customize volume paths using the environment variables referenced in docker-compose.yml:

VECTORSTORE_PATH=/your/custom/path/vectorstore sudo docker compose up -d
CACHE_PATH=/your/custom/path/cache sudo docker compose up -d

Security Notes#

  • The container runs as a non-root user for improved security

  • Security option no-new-privileges prevents privilege escalation

  • No unnecessary capabilities are granted

  • Network access is controlled (uses host.docker.internal for local services like Langfuse)

For Developers#

Building Locally#

From the docker directory:

# Build from local source (default)
sudo docker build -t wdoc-gui -f Dockerfile ..

# Or build from PyPI
sudo docker build -t wdoc-gui -f Dockerfile --build-arg COMPILE_OR_INSTALL=install ..

# Run the container
sudo docker run -p 7618:7860 \
  -v $(pwd)/vectorstore:/app/vectorstore \
  -v $(pwd)/wdoc_cache:/home/wdoc/.cache/wdoc \
  --env-file custom_env \
  wdoc-gui

Modifying the GUI#

The Gradio interface is defined in docker/gui.py. After making changes, rebuild the container to see them take effect.

Understanding COMPILE_OR_INSTALL#

  • compile mode: The Dockerfile copies your local wdoc source code and installs it in editable mode (pip install -e). This means:

    • Code changes in the repository affect the Docker image after rebuild

    • Useful for development and testing

    • Includes unreleased features/fixes

  • install mode: The Dockerfile installs wdoc from PyPI. This means:

    • You get the latest stable release

    • Independent of your local source code

    • Faster builds (no need to copy source files)

Additional Resources#


This Docker setup was created with assistance from aider.chat