Walkthrough & Examples

Table of Contents

Walkthrough
Shell Examples
Python Script Examples

Heads up on installation: if you don’t want to think about which extras to install, use uvx wdoc[full] everywhere. The plain wdoc package only includes PDF and URL/web loaders, so commands that touch youtube, audio, anki, office formats (word/powerpoint/epub) or logseq need their extras. [full] bundles all of those at once. The examples below sometimes use plain uvx wdoc when the base install is enough (pdf, url, ddg), but you can always replace it with uvx wdoc[full] to be safe. See the installation section for the full list of extras.

Note that there is an official open-webui Tool that is even simpler to use.

Walkthrough

Say you want to ask a question about one PDF. That’s simple:

uvx wdoc --task="query" --path="my_file.pdf" --filetype="pdf" --model='openai/gpt-4o'

Note that you could have just set --filetype="auto" and it would have worked the same.

Note: By default wdoc tries to parse positional args as kwargs, so uvx wdoc query mydocument What's the age of the captain? is parsed as uvx wdoc --task=query --path=mydocument --query "What's the age of the captain?". The same goes for summaries. This does not always work, so use it only once you are comfortable with wdoc.
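The shorthand can be pictured with a small sketch. This is not wdoc's actual parser: the function name expand_positional and the rule "first word is the task, second is the path, everything else is the query" are assumptions inferred only from the example above.

```python
def expand_positional(args: list[str]) -> dict[str, str]:
    """Map bare positional words to the keyword form (illustrative only)."""
    if len(args) < 2:
        raise ValueError("need at least a task and a path")
    task, path, *rest = args
    kwargs = {"task": task, "path": path}
    if rest:
        # Everything after the path is joined back into a single question.
        kwargs["query"] = " ".join(rest)
    return kwargs


shorthand = ["query", "mydocument", "What's", "the", "age", "of", "the", "captain?"]
expanded = expand_positional(shorthand)
print(expanded)
```

Because the shell splits the question into separate words before any program sees them, the explicit --query form remains the more predictable choice.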

You have several PDFs? Say you want to ask a question about every PDF contained in a folder. That’s not much more complicated:

uvx wdoc --task="query" --path="my/other_dir" --pattern="**/*pdf" --filetype="recursive_paths" --recursed_filetype="pdf" --query="My question about those documents"

In short: --path is the directory, --pattern is the glob pattern used to find files relative to that path, --filetype="recursive_paths" tells wdoc which arguments to expect, and --recursed_filetype="pdf" tells it to treat each file it finds as a PDF. The same approach works for any kind of file supported by wdoc, such as markdown, and you can even use “auto”! Note that you can ask your question directly with --query="my question", wait for the interactive prompt to pop up, or pass the question as *args like so: uvx wdoc [your kwargs] here is my question.
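Before launching a long run, it can help to preview which files a pattern would pick up. The sketch below uses Python's pathlib globbing, which is an assumption: wdoc's own matching may differ in edge cases. The helper name preview_matches is hypothetical, and the demo runs on a throwaway directory.

```python
import tempfile
from pathlib import Path


def preview_matches(root: str, pattern: str = "**/*pdf") -> list[str]:
    """Return the files under `root` matched by `pattern`, as relative paths."""
    base = Path(root)
    return sorted(
        p.relative_to(base).as_posix()
        for p in base.glob(pattern)
        if p.is_file()
    )


# Demo on a temporary directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.pdf", "sub/b.pdf", "notes.txt"):
        (root / name).touch()
    matches = preview_matches(tmp)

print(matches)
```

Note that "**/*pdf" has no dot before "pdf", so it matches any file name ending in those three letters.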

You want more? You can write a .json file in which each line is parsed as one set of arguments (#comments and empty lines are ignored). For example, one line could be:

{"path": "my/other_dir", "pattern": "**/*pdf", "filetype": "recursive_paths", "recursed_filetype": "pdf"}

This way you can use a single json file to easily specify any number of sources. .toml files are also supported.
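Such a file can be generated rather than written by hand. The sketch below is only an illustration: the source list, the tag values, and the helper name render_entries are hypothetical, and putting source_tag inside each entry assumes it is accepted like any other argument.

```python
import json

# Hypothetical sources; the keys mirror the example line above.
sources = [
    {"path": "my/other_dir", "pattern": "**/*pdf", "filetype": "recursive_paths",
     "recursed_filetype": "pdf", "source_tag": "papers"},
    {"path": "my_file.pdf", "filetype": "pdf", "source_tag": "single_file"},
]


def render_entries(entries: list[dict]) -> str:
    """Render one JSON object per line, preceded by a comment and an empty line."""
    lines = ["# one JSON object per line", ""]
    lines += [json.dumps(entry) for entry in entries]
    return "\n".join(lines) + "\n"


text = render_entries(sources)
print(text)

# Reading it back the way the text describes: skip comments and empty lines.
parsed = [
    json.loads(line)
    for line in text.splitlines()
    if line.strip() and not line.lstrip().startswith("#")
]
```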

You can specify a “source_tag” metadata to help distinguish between the documents you imported. It is EXTREMELY recommended to add a source_tag to any document you want to save, especially when using recursive filetypes. This is because, after loading all documents, wdoc uses the source_tag to decide whether to continue or crash. If you want to load 10_000 PDFs in one go as I do, it makes sense to continue when a few of them fail to load, but not when a whole source_tag is missing.

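The continue-or-crash rule can be sketched as follows. This is not wdoc's code: check_source_tags is a hypothetical helper, and representing each loaded document as a dict with a "source_tag" key is an assumption made for the example.

```python
from collections import Counter


def check_source_tags(expected_tags: list[str], loaded_docs: list[dict]) -> dict[str, int]:
    """Tolerate individual failures, but fail if a whole source_tag loaded nothing."""
    loaded = Counter(doc["source_tag"] for doc in loaded_docs)
    missing = sorted(set(expected_tags) - set(loaded))
    if missing:
        raise RuntimeError(f"no document loaded for source_tag(s): {', '.join(missing)}")
    return dict(loaded)


# Two of three hypothetical "papers" files loaded: acceptable, so we continue.
docs = [{"source_tag": "papers"}, {"source_tag": "papers"}, {"source_tag": "notes"}]
counts = check_source_tags(["papers", "notes"], docs)
print(counts)
```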
Now say you do this with many many documents, as I do, you of course can’t wait for the indexing to finish every time you have a question (even though the embeddings are cached). You should then add

--save_embeds_as=your/saving/path

to save all this index in a file. Then simply do

--load_embeds_from=your/saving/path

to quickly ask queries about it!
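A small wrapper can pick between the two flags automatically. The sketch below only builds the command and does not run it; build_query_command is a hypothetical helper, and checking whether the saved index already exists on disk is an assumption about how you might organize your workflow.

```python
import shlex
from pathlib import Path


def build_query_command(index_path: str, source_path: str, question: str) -> list[str]:
    """Load the saved index if it exists, otherwise index the source and save it."""
    cmd = ["uvx", "wdoc", "--task=query", f"--query={question}"]
    if Path(index_path).exists():
        cmd.append(f"--load_embeds_from={index_path}")
    else:
        cmd += [f"--path={source_path}", f"--save_embeds_as={index_path}"]
    return cmd


cmd = build_query_command("your/saving/path", "my/other_dir", "My question")
print(shlex.join(cmd))
```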

To learn more about each argument supported by each filetype, run:

uvx wdoc --help

There is a specific recursive filetype I should mention: --filetype="link_file". The file designated by --path should contain one URL per line (#comments and empty lines are ignored), and each URL will be parsed by wdoc. I made this so that I can use the “share” button on Android to send a URL from my browser to a text file (it just appends the URL to the file). This file is synced via syncthing to my computer, and wdoc automatically summarizes the links and adds them to my Logseq. Note that the URL is extracted from each line, so formatting is ignored: it works even in a markdown bullet list.

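The per-line behavior described above can be approximated in a few lines. This is an illustration, not wdoc's implementation: the regular expression and the helper name extract_urls are assumptions, and real URL detection may be more permissive.

```python
import re

# Simplified URL pattern (an assumption): stop at whitespace, brackets or quotes.
URL_RE = re.compile(r'https?://[^\s<>()\[\]"]+')


def extract_urls(text: str) -> list[str]:
    """Return the first URL of each line, skipping comments and empty lines."""
    urls = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        match = URL_RE.search(stripped)
        if match:
            urls.append(match.group(0))
    return urls


sample = """\
# reading list
- https://situational-awareness.ai/

* [Tim Urban's talk](https://www.youtube.com/watch?v=arj7oStGLkU)
"""
found = extract_urls(sample)
print(found)
```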
If you want to only use local models, here’s an example with ollama:

uvx wdoc --model="ollama/qwen3:8b" --query_eval_model="ollama/qwen3:8b" --embed_model="ollama/snowflake-arctic-embed2" --task summarize --path https://situational-awareness.ai/

You can always add --private, which adds extra safety nets to ensure that no data leaves your local network. You can also override specific API endpoints using

--llms_api_bases='{"model": "http://localhost:11434", "query_eval_model": "http://localhost:11434", "embeddings": "http://localhost:11434"}'
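Hand-writing JSON inside shell quotes is error-prone, so one option is to generate the flag. The sketch below assumes every role points at the same local Ollama server on its default port 11434; the variable names are hypothetical and the keys are the ones shown above.

```python
import json
import shlex

OLLAMA_URL = "http://localhost:11434"  # Ollama's default port; adjust to your setup

api_bases = {
    "model": OLLAMA_URL,
    "query_eval_model": OLLAMA_URL,
    "embeddings": OLLAMA_URL,
}

# shlex.quote wraps the JSON in single quotes so the shell passes it through intact.
flag = "--llms_api_bases=" + shlex.quote(json.dumps(api_bases))
print(flag)
```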

Now say you just want to summarize Tim Urban’s TED talk on procrastination:

uvx wdoc[youtube] --task=summary --path='https://www.youtube.com/watch?v=arj7oStGLkU' --youtube_language="en" --disable_md_printing

Example output:

Summary

https://www.youtube.com/watch?v=arj7oStGLkU

- Let me take a deep breath and summarize this TED talk about procrastination:
- [0:00-3:40] Personal experience with procrastination in college:
  - Author’s pattern with papers: planning to work steadily but actually doing everything last minute
  - 90-page senior thesis experience:
    - Planned to work steadily over a year
    - Actually wrote 90 pages in 72 hours with two all-nighters
    - Jokingly implies it was brilliant, then admits it was ‘very, very bad’
- [3:40-6:45] Brain comparison between procrastinators and non-procrastinators:
  - Both have a Rational Decision-Maker
  - Procrastinator’s brain also has an Instant Gratification Monkey:
    - Lives entirely in present moment
    - Only cares about ‘easy and fun’
    - Works fine for animals but problematic for humans in advanced civilization
  - Rational Decision-Maker capabilities:
    - Can visualize future
    - See big picture
    - Make long-term plans
- [6:45-10:55] The procrastinator’s system:
  - Dark Playground:
    - Where leisure activities happen at wrong times
    - Characterized by guilt, dread, anxiety, self-hatred
  - Panic Monster:
    - Only thing monkey fears
    - Awakens near deadlines or threats of public embarrassment
    - Enables last-minute productivity
  - Personal example with TED talk preparation:
    - Procrastinated for months
    - Only started working when panic set in
- [10:55-13:05] Two types of procrastination:
  - Deadline-based procrastination:
    - Effects contained due to Panic Monster intervention
    - Less harmful long-term
  - Non-deadline procrastination:
    - More dangerous
    - Affects important life areas without deadlines:
      - Entrepreneurial pursuits
      - Family relationships
      - Health
      - Personal relationships
    - Can cause long-term unhappiness and regrets
- [13:05-14:04] Concluding thoughts:
  - Author believes no true non-procrastinators exist
  - Presents Life Calendar:
    - Shows 90 years in weekly boxes
    - Emphasizes limited time available
  - Call to action: need to address procrastination ‘sometime soon’
- Key audience response moments:
  - Multiple instances of ‘(Laughter)’ noted throughout
  - Particularly strong response from PhD students relating to procrastination issues
  - Received thousands of emails after blog post about procrastination

Tokens used for https://www.youtube.com/watch?v=arj7oStGLkU: ‘4936’ (in: 4307, out: 629, cost: $0.00063)
Total cost of those summaries: 4936 tokens for $0.00063 (estimate was $0.00030)
Total time saved by those summaries: 8.8 minutes
Done summarizing.

Shell Examples

Query a simple PDF file

uvx wdoc --task=query --path="my_file.pdf" --filetype="pdf" --model='openai/gpt-4o'

Recursively query multiple PDFs in a directory

uvx wdoc --task=query \
     --path="my/other_dir" \
     --pattern="**/*pdf" \
     --filetype="recursive_paths" \
     --recursed_filetype="pdf" \
     --query="My question about those documents"

Summarize a YouTube video in French based on the English transcript

uvx wdoc[full] --task=summary \
     --path='https://www.youtube.com/watch?v=arj7oStGLkU' \
     --youtube_language="en" \
     --summary_language="fr" \
     --disable_md_printing

Summarize a YouTube video based on the whisper transcript

uvx wdoc[youtube,audio] --task=summary \
     --path='https://www.youtube.com/watch?v=arj7oStGLkU' \
     --youtube_audio_backend="whisper" \
     --whisper_lang="en"

Use local models with Ollama

uvx wdoc --model="ollama/qwen3:8b" \
     --query_eval_model="ollama/qwen3:8b" \
     --embed_model="ollama/snowflake-arctic-embed2" \
     --task summarize --path https://situational-awareness.ai/

Note: you might find that ollama models are sometimes overly optimistic about their context length. You can pass arguments to lower it like so:

uvx wdoc --model="ollama/qwen3:8b" \
     --model_kwargs='{"max_tokens": 4096}' \
     --query_eval_model="ollama/qwen3:8b" \
     --query_eval_model_kwargs='{"max_tokens": 4096}' \
     --embed_model="ollama/snowflake-arctic-embed2" \
     --task summarize --path https://situational-awareness.ai/

Parse an Anki deck as text

uvx wdoc[anki] parse \
    --filetype "anki" \
    --anki_profile "Main" \
    --anki_deck "mydeck::subdeck1" \
    --anki_notetype "my_notetype" \
    --anki_template "<header>\n{header}\n</header>\n<body>\n{body}\n</body>\n<personal_notes>\n{more}\n</personal_notes>\n<tags>{tags}</tags>\n{image_ocr_alt}" \
    --anki_tag_filter "a::tag::regex::.*something.*" \
    --format=langchain_dict

Query an online PDF

uvx wdoc --path="https://example.com/document.pdf" \
     --task=query \
     --filetype="online_pdf" \
     --query="What does it say about X?"

Save and load embeddings for faster subsequent queries

# First run - save embeddings
uvx wdoc --task=query \
     --path="my_document.pdf" \
     --save_embeds_as="saved_embeddings.pkl"

# Subsequent runs - load embeddings
uvx wdoc --task=query \
     --load_embeds_from="saved_embeddings.pkl" \
     --query="My new question"

You can even use shell pipes:

Data sent through shell pipes (be it strings or binary data) is automatically saved to a temporary file, which is then passed as the --path=[temp_file] argument. For example cat **/*.txt | uvx wdoc --task=query, echo $my_url | uvx wdoc parse or even cat my_file.pdf | uvx wdoc parse --filetype=pdf. For binary input it is strongly recommended to pass a --filetype argument, because python-magic versions <=0.4.27 choke otherwise (see that issue).
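The piping mechanism can be pictured as "spool the bytes to a temporary file, then hand over its path". The sketch below illustrates that idea only and is not wdoc's code: spool_to_tempfile is a hypothetical helper, and the literal bytes stand in for what a real pipe would deliver through sys.stdin.buffer.read().

```python
import os
import tempfile


def spool_to_tempfile(data: bytes, suffix: str = "") -> str:
    """Write piped bytes to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as handle:
        handle.write(data)
        return handle.name


# Stand-in for piped input (hypothetical content).
demo_path = spool_to_tempfile(b"%PDF-1.4 example bytes", suffix=".pdf")
print(f"--path={demo_path}")

# Confirm the bytes survived unchanged, then clean up.
with open(demo_path, "rb") as fh:
    round_trip = fh.read()
os.remove(demo_path)
```

Writing in binary mode is what lets the same path handle both text and binary input.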

You can also search the web for results using DuckDuckGo:

It is implemented as if ddg were a recursive filetype. The idea is to use uvx wdoc --task=query --path='How is Nvidia doing this month?' --query='How is Nvidia doing this month' --filetype=ddg (remember: path specifies the documents and query the question to ask about them). To make this more natural, if either path or query is missing, it is replaced by the value of the other. The command can be shortened to uvx wdoc web 'How is Nvidia doing this month?'. Use --ddg_max_result=5 to set the maximum number of results, --ddg_region=us-US to get US-only results, and --ddg_safesearch=on to filter out NSFW results.
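The fallback between path and query amounts to a two-line rule. The sketch below is illustrative only: fill_missing is a hypothetical name, and raising an error when both values are absent is an assumption, since the text does not say what happens in that case.

```python
from typing import Optional


def fill_missing(path: Optional[str], query: Optional[str]) -> tuple[str, str]:
    """If one of path/query is missing, reuse the value of the other."""
    if not path and not query:
        raise ValueError("need at least one of path or query")
    return (path or query, query or path)


question = "How is Nvidia doing this month?"
resolved = fill_missing(question, None)
print(resolved)
```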

Query a Zotero collection (one selection fans out into all its attachments + metadata)

# Local Zotero app running: query a nested collection (no api key needed)
uvx wdoc[zotero] --task=query \
    --filetype="zotero" \
    --path="Research/ML/Papers" \
    --query="What do these papers say about attention?"

# Whole library via the Web API, parsed with notes included
ZOTERO_LIBRARY_ID=123456 ZOTERO_API_KEY=xxxx uvx wdoc[zotero] parse \
    --filetype="zotero" \
    --path="library" \
    --zotero_connection="web" \
    --zotero_include_notes=True \
    --format=langchain_dict

# Items matching a tag, using Zotero's pre-indexed fulltext instead of re-parsing
uvx wdoc[zotero] parse \
    --filetype="zotero" \
    --path="tag:to-read" \
    --zotero_attachment_text="fulltext" \
    --format=langchain_dict

The selector passed to --path can also be items:KEY1,KEY2 for explicit item keys or search:MySavedSearch for a saved search. See the zotero section of the help for every option.

Query a Karakeep list (one selection fans out into all its bookmarks)

# Query a list by name (creds read from KARAKEEP_PYTHON_API_* env vars)
KARAKEEP_PYTHON_API_ENDPOINT=https://karakeep.example.com/api/v1/ \
KARAKEEP_PYTHON_API_KEY=xxxx uvx wdoc[karakeep] --task=query \
    --filetype="karakeep" \
    --path="list:Reading" \
    --query="What do these articles say about RAG?"

# Bookmarks matching a tag, using only Karakeep's stored content (no asset download)
uvx wdoc[karakeep] parse \
    --filetype="karakeep" \
    --path="tag:to-read" \
    --karakeep_content_source="native" \
    --karakeep_api_endpoint="https://karakeep.example.com/api/v1/" \
    --format=langchain_dict

# Whole library, preferring the stored pdf/archive asset parsed by wdoc's loaders
uvx wdoc[karakeep] parse \
    --filetype="karakeep" \
    --path="library" \
    --karakeep_content_source="wdoc" \
    --format=langchain_dict

The selector passed to --path can also be search:terms, ids:ID1,ID2, favourites or archived. See the karakeep section of the help for every option.

Python Script Examples

Basic document summarization

from wdoc import wdoc

# Initialize wdoc for summarization
instance = wdoc(
    task="summary",
    path="document.pdf",
    summary_language="en",  # Optional: specify output language
)

# Get summary results
results = instance.summary_results
print(f"Summary:\n{results['summary']}")
print(f"Processing cost: ${results['doc_total_cost']:.5f}")
print(f"Original reading time: {results['doc_reading_length']:.1f} minutes")

Summarize with custom model settings

from wdoc import wdoc

# Use specific models for better control
instance = wdoc(
    task="summary",
    path="https://example.com/paper.pdf",
    filetype="online_pdf",
    model="openai/gpt-4o",  # Use GPT-4o for summarization
    embed_model="openai/text-embedding-3-large",  # Specify embedding model
)

results = instance.summary_results
summary_text = results['summary']

Summarize a PDF with clickable page citations (developed with Claude Code)

from wdoc import wdoc

# Summarize with page citations linking to a private document server
instance = wdoc(
    task="summary",
    path="court_transcript.pdf",
    filetype="pdf",
    model="openai/gpt-4o",
    citation_url_template="https://private-site.com/cases/{source}#page={page}",
)

results = instance.summary_results
# Bullet points will contain clickable citations like:
# - **Key finding** about the case [p.42](https://private-site.com/cases/court_transcript.pdf#page=42)

# Same example from the shell
uvx wdoc --task=summary \
     --path="court_transcript.pdf" \
     --filetype="pdf" \
     --citation_url_template="https://private-site.com/cases/{source}#page={page}"
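To see how a citation link like the one in the comment above could be produced, the sketch below expands the template with Python's str.format. That wdoc fills {source} and {page} this way is an assumption, and render_citation is a hypothetical helper.

```python
template = "https://private-site.com/cases/{source}#page={page}"


def render_citation(template: str, source: str, page: int) -> str:
    """Expand the URL template and wrap it as a markdown page citation."""
    url = template.format(source=source, page=page)
    return f"[p.{page}]({url})"


citation = render_citation(template, "court_transcript.pdf", 42)
print(citation)
# [p.42](https://private-site.com/cases/court_transcript.pdf#page=42)
```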