# Help

### Table of contents
- [Global arguments](#global-arguments)
- [DocDict arguments](#docdict-arguments)
- [Other specific arguments](#other-specific-arguments)
- [Runtime flags / environment variables](#runtime-flags)

# Global arguments

* `--task`: str
    * Accepted values:
        * `query`: load the input files, then wait for the user's question.
        * `search`: only return the document corresponding to the search.
        * `summarize`: pass the input through a summarization prompt.
        * `summarize_then_query`: summarize the text, then open the prompt to allow querying the source document directly.

* `--filetype`: str, default `auto`
    * The type of input. Depending on the value, different other parameters are needed. If `json_entries` is used, each line of the input file can contain any of those parameters as long as they are written as json. You can find an example of a `json_entries` file in `wdoc/docs/json_entries_example.txt`.

    * Supported values and available arguments:
        *For the details of each argument, [see below](#loader-specific-arguments)*
        * `anki`
            * Optional:
                * `--anki_profile`
                * `--anki_deck`
                * `--anki_notetype`
                * `--anki_template`
                * `--anki_tag_filter`
                * `--anki_tag_render_filter`
        * `auto`: will guess the appropriate filetype based on `--path`. Irrelevant for some filetypes, e.g. if `--filetype`=anki. It can also infer recursive filetypes, for example if the `path` leads to a `.toml` file.
        * `epub`
            * `--path` to a .epub file
        * `json_dict`
            * `--path` to a text file containing a single json dict
            * `--json_dict_template`
            * Optional:
                * `--json_dict_exclude_keys`
                * `--metadata`
        * `local_audio`
            * `--path`
            * `--audio_backend`
            * Optional:
                * `--audio_unsilence`
                * `--whisper_prompt`
                * `--whisper_lang`
                * `--deepgram_kwargs`
        * `local_html`
            * `--path` must point to a .html file
            * Optional:
                * `--load_functions`
        * `local_video`
            * `--path`
            * `--audio_backend`
            * Optional:
                * `--audio_unsilence`
                * `--whisper_lang`
                * `--whisper_prompt`
                * `--deepgram_kwargs`
        * `logseq_markdown`
            * `--path` path to the markdown file
        * `online_media`: load the url using youtube_dl to download a media (video or audio), then treat it as `filetype=local_audio`.
            * If youtube_dl fails to find the media, a playwright browser is tried instead, and any requested element that looks like a possible media is downloaded.
            * Same arguments as `local_audio`, with extra arguments:
                * `--online_media_url_regex`
                * `--online_media_resourcetype_regex`
        * `online_pdf`
            * Same arguments as for `--filetype=pdf`.
              Note that `online_pdf` is handled a bit differently than `pdf`: we first try to download the file and parse it with `filetype=pdf`, and only as a last resort use langchain's integrated OnlinePDFLoader, as it is far slower.
        * `pdf`
            * `--path` is the filepath to the pdf
            * Optional:
                * `--pdf_parsers`
                * `--doccheck_min_lang_prob`
                * `--doccheck_min_token`
                * `--doccheck_max_token`
        * `powerpoint`
            * `--path` to a .ppt or .pptx etc
        * `string`: no parameters needed, will provide a field where you must type or paste the string
        * `text` (For text input as argument, not to be mistaken with `txt`)
            * `--path` is directly the text content.
            * Optional:
                * `--metadata`
        * `txt` (For text present in a txt file, not to be mistaken with `text`)
            * `--path` is the path to a .txt file
        * `url`
            * `--path` must be a valid http(s) link
            * Optional:
                * `--title`, otherwise we try to detect it ourselves.
        * `word`
            * `--path` to a .doc, .docx, etc
        * `youtube`
            * `--path` must link to a youtube video
            *Note: `--yt_*` is automatically parsed as `--youtube_`*
            * Optional:
                * `--youtube_language`
                * `--youtube_translations`
                * `--youtube_audio_backend`
                * `--whisper_prompt`
                * `--whisper_lang`
                * `--deepgram_kwargs`
        * **Recursive types**:
            * `ddg`
                * `--path` is the search query for DuckDuckGo.
                * `--ddg_max_results`
                * `--ddg_region`, for example `us-US`
                * `--ddg_safesearch`
            * `json_entries`
                * `--path` is the path to a text file containing one json per line, each with at least a filetype and a path key/value, and optionally any of the parameters described here
            * `recursive_paths`
                * `--path` is the starting path
                * `--pattern` is the globbing patterns to append
                * `--exclude` and `--include` can be a list of regex applying to found paths (include is run first then exclude; if the pattern is only lowercase it will be case insensitive)
                * `--recursed_filetype` is the filetype to use for each of the found paths
            * `youtube_playlist`
                * `--path` must link to a youtube playlist
            * `link_file`
                * `--path` must point to a file where each line is a link that will be summarized.
                * `--out_file` path to the text file where the summary will be added (appended). Links that have already been summarized in out_file will be skipped (the out_file is never overwritten). If a line is a markdown link like [this](link) then it will be parsed as a link. Empty lines and lines starting with # are ignored.

---

* `--model`: str, default to value of WDOC_DEFAULT_MODEL
    * Keep in mind that, since the default backend is litellm, the part of the model name before the slash (/) is the backend name (also called provider).
      If the backend is `testing/` then it will be parsed as `testing/testing` and a fake LLM will be used for debugging purposes. It answers like a normal LLM but costs 0 and makes no sense. Note that it will automatically set the query_eval_model to `testing/testing` too.
      If the value is not part of litellm's model list, fuzzy matching is used to find the best match.
* `--model_kwargs`: dict, default `None`
    * dictionary of keyword arguments to pass to the model. For example `{'temperature': 0}`. Note that changing the kwargs will sometimes keep reusing the cache; use `disable_llm_cache` to avoid that.

---

* `--embed_model`: str, default to value of WDOC_DEFAULT_EMBED_MODEL
    * Name of the model to use for embeddings. Must contain a '/'. Everything before the slash is the backend and everything after it is the model name. Available backends: openai, sentencetransformers, huggingface
    * Note:
        * the device used by default for huggingface is 'cpu' and not 'cuda'
        * If you change this, the embedding cache will usually need to be recomputed with new elements (the hash used to check for previous values includes the name of the model)
* `--embed_model_kwargs`: dict, default `None`
    * dictionary of keyword arguments to pass to the embedding.

* `--save_embeds_as`: str, default `"{user_dir}/latest_docs_and_embeddings"`
    * only used if task is query.
      Saves the loaded documents and embeddings to a file in the specified directory. This can then be loaded again with `--load_embeds_from` to avoid recomputing embeddings. Both the document splits and their embeddings are saved there, and the location is always overwritten (i.e. no 'updating' of the previously saved documents and embeddings). In the default value, "{user_dir}" is automatically replaced by the path to the default cache folder for the current user. This way the previous session can always be reloaded quickly with `--load_embeds_from`.
      Should not be specified at the same time as `--load_embeds_from` as `--load_embeds_from` will take priority.
* `--load_embeds_from`: str, default `None`
    * path to the file saved using `--save_embeds_as`.
      If loading the embeddings fails, `wdoc` will crash instead of creating new embeddings, out of safety.
      Should not be specified at the same time as `--save_embeds_as` as `--load_embeds_from` will take priority.

* `--top_k`: Union[int, str], default `auto_200_500`
    * number of chunks to look for when querying. It is high because the eval model is used to refilter the documents after the first pass by the embeddings.
      If top_k is a string, the format assumed is "auto_N_M" where N is the starting top_k and M is the max top_k value. If the number of filtered documents is more than 90% of top_k, top_k will gradually increase up to M (with N and M being int, and 0