Reference
- DocsScraper.base_url_segment
- DocsScraper.check_robots_txt
- DocsScraper.clean_url
- DocsScraper.crawl
- DocsScraper.create_URL_map
- DocsScraper.create_output_dirs
- DocsScraper.docs_in_url
- DocsScraper.find_duplicates
- DocsScraper.find_urls_html!
- DocsScraper.find_urls_xml!
- DocsScraper.generate_embeddings
- DocsScraper.get_base_url
- DocsScraper.get_header_path
- DocsScraper.get_html_content
- DocsScraper.get_package_name
- DocsScraper.get_urls!
- DocsScraper.insert_parsed_data!
- DocsScraper.l2_norm_columns
- DocsScraper.load_chunks_sources
- DocsScraper.make_chunks_sources
- DocsScraper.make_knowledge_packs
- DocsScraper.nav_bar
- DocsScraper.parse_robots_txt!
- DocsScraper.parse_url_to_blocks
- DocsScraper.postprocess_chunks
- DocsScraper.process_code
- DocsScraper.process_docstring!
- DocsScraper.process_generic_node!
- DocsScraper.process_headings!
- DocsScraper.process_hostname
- DocsScraper.process_hostname!
- DocsScraper.process_node!
- DocsScraper.process_non_crawl_urls
- DocsScraper.process_paths
- DocsScraper.process_text
- DocsScraper.remove_duplicates
- DocsScraper.remove_short_chunks
- DocsScraper.remove_urls_from_index
- DocsScraper.report_artifact
- DocsScraper.resolve_url
- DocsScraper.roll_up_chunks
- DocsScraper.save_embeddings
- DocsScraper.text_before_version
- DocsScraper.url_package_name
- DocsScraper.urls_for_metadata
- DocsScraper.validate_args
- PromptingTools.Experimental.RAGTools.get_chunks
DocsScraper.base_url_segment — Method

base_url_segment(url::String)

Return the base URL and the first path segment if all other checks fail.
DocsScraper.check_robots_txt — Method

check_robots_txt(user_agent::AbstractString, url::AbstractString)

Check the robots.txt of a URL and return a boolean indicating whether user_agent is allowed to crawl the input url, along with any sitemap URLs.
Arguments
- user_agent: user agent attempting to crawl the webpage
- url: input URL string
DocsScraper.clean_url — Method

clean_url(url::String)

Strip the URL of any http://, https://, or www. prefixes.
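The prefix stripping can be sketched with a single anchored regex; this is an illustrative approximation, not DocsScraper's actual implementation:

```julia
# Sketch: strip an optional scheme and "www." prefix from a URL.
# The real clean_url may handle more cases.
clean_url_sketch(url::String) = replace(url, r"^(https?://)?(www\.)?" => "")

clean_url_sketch("https://www.example.com/docs")  # "example.com/docs"
```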
DocsScraper.crawl — Method

crawl(input_urls::Vector{<:AbstractString})

Crawl the input URLs and return hostname_url_dict, a dictionary mapping hostnames to vectors of URLs.
DocsScraper.create_URL_map — Method

create_URL_map(sources::Vector{String}, output_file_path::AbstractString, index_name::AbstractString)

Create a CSV file containing each URL along with its estimated package name.
Arguments
- sources: List of scraped sources
- output_file_path: Path to the directory in which the CSV will be created
- index_name: Name of the created index
DocsScraper.create_output_dirs — Method

create_output_dirs(parent_directory_path::String, index_name::String)

Create index_name, Scraped_files and Index directories inside parent_directory_path. Return the path to index_name.
DocsScraper.docs_in_url — Method

docs_in_url(url::AbstractString)

If the base URL is of the form docs.package_name.domain_extension, return the middle word, i.e., package_name.
DocsScraper.find_duplicates — Method

find_duplicates(chunks::AbstractVector{<:AbstractString})

Find duplicates in a list of chunks using SHA-256 hashes. Returns a bit vector of the same length as the input list, where true indicates a duplicate (a second instance of the same text).
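Hash-based duplicate detection can be sketched as follows (an illustration of the idea using the SHA package, not DocsScraper's actual implementation):

```julia
using SHA  # registered package providing sha256

# Sketch: flag every second (and later) occurrence of identical text.
function find_duplicates_sketch(chunks::AbstractVector{<:AbstractString})
    seen = Set{String}()
    duplicates = falses(length(chunks))
    for (i, chunk) in enumerate(chunks)
        h = bytes2hex(sha256(chunk))  # content fingerprint
        if h in seen
            duplicates[i] = true      # later instances are duplicates
        else
            push!(seen, h)
        end
    end
    return duplicates
end
```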
DocsScraper.find_urls_html! — Method

find_urls_html!(url::AbstractString, node::Gumbo.HTMLElement, url_queue::Vector{<:AbstractString})

Recursively find <a> tags and extract their URLs.
Arguments
- url: The initial input URL
- node: The HTML node of type Gumbo.HTMLElement
- url_queue: Vector in which extracted URLs will be appended
DocsScraper.find_urls_xml! — Method

find_urls_xml!(url::AbstractString, url_queue::Vector{<:AbstractString})

Identify URLs via a regex pattern in XML files and push them into url_queue.
Arguments
- url: url from which all other URLs will be extracted
- url_queue: Vector in which extracted URLs will be appended
DocsScraper.generate_embeddings — Method

generate_embeddings(chunks::Vector{SubString{String}};
model_embedding::AbstractString = MODEL_EMBEDDING,
embedding_dimension::Int = EMBEDDING_DIMENSION, embedding_bool::Bool = EMBEDDING_BOOL,
index_name::AbstractString = "")

Deserialize chunks and sources to generate embeddings. Returns the path to the tar.gz file of the created index.

Note: We recommend passing index_name. This will be the name of the generated index.
Arguments
- chunks: Vector of scraped chunks
- model_embedding: Embedding model
- embedding_dimension: Embedding dimensions
- embedding_bool: If true, the generated embeddings are boolean; otherwise Float32
- index_name: Name of the index. Default: an "index" name generated by gensym
DocsScraper.get_base_url — Method

get_base_url(url::AbstractString)

Extract the base URL.
DocsScraper.get_header_path — Method

get_header_path(d::Dict)

Concatenate the h1, h2, h3 keys from the metadata of a Dict.
Examples
d = Dict("metadata" => Dict{Symbol,Any}(:h1 => "Axis", :h2 => "Attributes", :h3 => "yzoomkey"), "heading" => "yzoomkey")
get_header_path(d)
# Output: "Axis/Attributes/yzoomkey"

DocsScraper.get_html_content — Method

get_html_content(root::Gumbo.HTMLElement)

Return the main content of the HTML. If not found, return the whole HTML to parse.
Arguments
- root: The HTML root from which content is extracted
DocsScraper.get_package_name — Method

get_package_name(url::AbstractString)

Return the name of the package from the package URL.
DocsScraper.get_urls! — Method

get_urls!(url::AbstractString,
url_queue::Vector{<:AbstractString})

Extract URLs inside HTML or XML files.
Arguments
- url: url from which all other URLs will be extracted
- url_queue: Vector in which extracted URLs will be appended
DocsScraper.insert_parsed_data! — Method

insert_parsed_data!(heading_hierarchy::Dict{Symbol,Any},
parsed_blocks::Vector{Dict{String,Any}},
text_to_insert::AbstractString,
text_type::AbstractString)

Insert the text into the parsed_blocks vector.
Arguments
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
- text_to_insert: Text to be inserted
- text_type: Type of the text being inserted: a heading, a code block, or plain text
DocsScraper.l2_norm_columns — Method

l2_norm_columns(mat::AbstractMatrix)

Normalize the columns of the input embedding matrix.
DocsScraper.l2_norm_columns — Method

l2_norm_columns(vect::AbstractVector)

Normalize the input embedding vector.
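Column-wise L2 normalization can be illustrated in a few lines (a sketch of the operation, not the actual implementation):

```julia
using LinearAlgebra  # stdlib, provides norm

# Sketch: divide each column of the embedding matrix by its L2 norm,
# so every column ends up with unit length.
function l2_norm_columns_sketch(mat::AbstractMatrix)
    norms = norm.(eachcol(mat))          # one L2 norm per column
    return mat ./ reshape(norms, 1, :)   # broadcast the division column-wise
end
```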
DocsScraper.load_chunks_sources — Method

load_chunks_sources(target_path::AbstractString)

Return chunks and sources by reading the .jls files in joinpath(target_path, "Scraped_files").
DocsScraper.make_chunks_sources — Method

make_chunks_sources(hostname_url_dict::Dict{AbstractString,Vector{AbstractString}}, target_path::String;
max_chunk_size::Int=MAX_CHUNK_SIZE, min_chunk_size::Int=MIN_CHUNK_SIZE)

Parse URLs from hostname_url_dict and save the chunks.
Arguments
- hostname_url_dict: Dict with key being hostname and value being a vector of URLs
- target_path: Knowledge pack path
- max_chunk_size: Maximum chunk size
- min_chunk_size: Minimum chunk size
DocsScraper.make_knowledge_packs — Function

make_knowledge_packs(crawlable_urls::Vector{<:AbstractString} = String[];
single_urls::Vector{<:AbstractString} = String[],
max_chunk_size::Int = MAX_CHUNK_SIZE, min_chunk_size::Int = MIN_CHUNK_SIZE,
model_embedding::AbstractString = MODEL_EMBEDDING, embedding_dimension::Int = EMBEDDING_DIMENSION, custom_metadata::AbstractString = "",
embedding_bool::Bool = EMBEDDING_BOOL, index_name::AbstractString = "",
target_path::AbstractString = "", save_url_map::Bool = true)

Entry point to crawl, parse and generate embeddings. Returns the path to the tar.gz file of the created index.

Note: We recommend passing index_name. This will be the name of the generated index.
Arguments
- crawlable_urls: URLs that should be crawled to find more links
- single_urls: Single page URLs that should just be scraped and parsed. The crawler won't look for more URLs
- max_chunk_size: Maximum chunk size
- min_chunk_size: Minimum chunk size
- model_embedding: Embedding model
- embedding_dimension: Embedding dimensions
- custom_metadata: Custom metadata like ecosystem name if required
- embedding_bool: If true, the generated embeddings are boolean; otherwise Float32
- index_name: Name of the index. Default: an "index" name generated by gensym
- target_path: Path to the directory where the index folder will be created
- save_url_map: If true, creates a CSV of crawled URLs with their associated package names
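A typical call might look like the following; the URL and paths here are hypothetical placeholders, and an embedding API key plus network access are needed at runtime, so treat this as an illustrative sketch:

```julia
using DocsScraper

# Hypothetical documentation site to crawl; replace with your own.
crawlable_urls = ["https://docs.example.jl.org/stable/"]

# Crawls the URLs, chunks and embeds the pages, and returns the path to
# the packaged tar.gz index.
index_path = make_knowledge_packs(crawlable_urls;
    index_name = "example",
    target_path = joinpath(pwd(), "knowledge_packs"))
```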
DocsScraper.nav_bar — Method

nav_bar(url::AbstractString)

Julia doc websites tend to have the package name under the ".docs-package-name" class in the HTML tree.
DocsScraper.parse_robots_txt! — Method

parse_robots_txt!(robots_txt::String)

Parse the robots.txt string and return the rules and the sitemap URLs.
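The essence of robots.txt parsing, collecting per-agent Disallow rules and Sitemap entries, can be sketched like this (a simplified illustration, not the actual implementation):

```julia
# Sketch: map each user agent to its disallowed paths and collect sitemaps.
function parse_robots_txt_sketch(robots_txt::String)
    rules = Dict{String,Vector{String}}()   # user-agent => disallowed paths
    sitemaps = String[]
    agent = "*"
    for line in split(robots_txt, '\n')
        line = strip(line)
        if startswith(lowercase(line), "user-agent:")
            agent = String(strip(line[12:end]))
        elseif startswith(lowercase(line), "disallow:")
            push!(get!(rules, agent, String[]), String(strip(line[10:end])))
        elseif startswith(lowercase(line), "sitemap:")
            push!(sitemaps, String(strip(line[9:end])))
        end
    end
    return rules, sitemaps
end
```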
DocsScraper.parse_url_to_blocks — Method

parse_url_to_blocks(url::AbstractString)

Initiator and main function to parse HTML from the URL. Return a Vector of Dicts containing Heading/Text/Code along with a Dict of respective metadata.
DocsScraper.postprocess_chunks — Method

postprocess_chunks(chunks::AbstractVector{<:AbstractString}, sources::AbstractVector{<:AbstractString};
min_chunk_size::Int=MIN_CHUNK_SIZE, skip_code::Bool=true, paths::Union{Nothing,AbstractVector{<:AbstractString}}=nothing,
websites::Union{Nothing,AbstractVector{<:AbstractString}}=nothing)

Post-process the input list of chunks and their corresponding sources by removing short chunks and duplicates.
DocsScraper.process_code — Method

process_code(node::Gumbo.HTMLElement)

Process code snippets. If the current node is a code block, return the text inside the code block with backticks.
Arguments
- node: The root HTML node
DocsScraper.process_docstring! — Function

process_docstring!(node::Gumbo.HTMLElement,
heading_hierarchy::Dict{Symbol,Any},
parsed_blocks::Vector{Dict{String,Any}},
child_new::Bool=true,
prev_text_buffer::IO=IOBuffer(write=true))

Process a node of class docstring.
Arguments
- node: The root HTML node
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
- child_new: Bool specifying whether the current block (child) is part of the previous block. If it is not, a new insertion is created in parsed_blocks
- prev_text_buffer: IO buffer containing the previous text
DocsScraper.process_generic_node! — Function

process_generic_node!(node::Gumbo.HTMLElement,
heading_hierarchy::Dict{Symbol,Any},
parsed_blocks::Vector{Dict{String,Any}},
child_new::Bool=true,
prev_text_buffer::IO=IOBuffer(write=true))

Process nodes that are neither headings nor code blocks.
Arguments
- node: The root HTML node
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
- child_new: Bool specifying whether the current block (child) is part of the previous block. If it is not, a new insertion is created in parsed_blocks
- prev_text_buffer: IO buffer containing the previous text
DocsScraper.process_headings! — Method

process_headings!(node::Gumbo.HTMLElement,
heading_hierarchy::Dict{Symbol,Any},
parsed_blocks::Vector{Dict{String,Any}})

Process headings. If the current node is a heading, insert it directly into parsed_blocks.
Arguments
- node: The root HTML node
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
DocsScraper.process_hostname! — Method

process_hostname!(url::AbstractString, hostname_dict::Dict{AbstractString,Vector{AbstractString}})

Add url to its hostname entry in hostname_dict.
Arguments
- url: URL string
- hostname_dict: Dict with key being hostname and value being a vector of URLs
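Grouping URLs by hostname can be sketched as follows (an illustration with a simplified hostname regex, not the actual implementation):

```julia
# Sketch: extract the hostname from a URL and append the URL to that
# hostname's bucket in the dictionary.
function process_hostname_sketch!(url::AbstractString,
        hostname_dict::Dict{String,Vector{String}})
    hostname = match(r"^(?:https?://)?([^/]+)", url).captures[1]
    push!(get!(hostname_dict, String(hostname), String[]), String(url))
    return hostname_dict
end
```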
DocsScraper.process_hostname — Method

process_hostname(url::AbstractString)

Return the hostname of an input URL.
DocsScraper.process_node! — Function

process_node!(node::Gumbo.HTMLElement,
heading_hierarchy::Dict{Symbol,Any},
parsed_blocks::Vector{Dict{String,Any}},
child_new::Bool=true,
prev_text_buffer::IO=IOBuffer(write=true))

Process an HTML node.
Arguments
- node: The root HTML node
- heading_hierarchy: Dict used to store metadata
- parsed_blocks: Vector of Dicts to store parsed text and metadata
- child_new: Bool specifying whether the current block (child) is part of the previous block. If it is not, a new insertion is created in parsed_blocks
- prev_text_buffer: IO buffer containing the previous text
DocsScraper.process_node! — Method

Multiple dispatch for process_node! when the node is of type Gumbo.HTMLText.
DocsScraper.process_non_crawl_urls — Method

process_non_crawl_urls(
single_urls::Vector{<:AbstractString}, visited_url_set::Set{AbstractString},
hostname_url_dict::Dict{AbstractString, Vector{AbstractString}})

Check whether each URL in single_urls is scrapable. If yes, add it to a Dict of URLs to scrape.
Arguments
- single_urls: Single page URLs that should just be scraped and parsed. The crawler won't look for more URLs
- visited_url_set: Set of visited URLs, to avoid duplication
- hostname_url_dict: Dict with key being the hostname and the values being the URLs
DocsScraper.process_paths — Method

process_paths(url::AbstractString; max_chunk_size::Int=MAX_CHUNK_SIZE, min_chunk_size::Int=MIN_CHUNK_SIZE)

Process the folders provided in paths. In each, take all HTML files, scrape them, chunk them and postprocess them.
DocsScraper.process_text — Method

remove_dashes(text::AbstractString)

Remove all dashes ('-') from a given string.
DocsScraper.remove_duplicates — Method

remove_duplicates(chunks::AbstractVector{<:AbstractString}, sources::AbstractVector{<:AbstractString})

Remove chunks that are duplicated in the input list of chunks and their corresponding sources.
DocsScraper.remove_short_chunks — Method

remove_short_chunks(chunks::AbstractVector{<:AbstractString}, sources::AbstractVector{<:AbstractString};
min_chunk_size::Int=MIN_CHUNK_SIZE, skip_code::Bool=true)

Remove chunks shorter than min_chunk_size from the input list of chunks and their corresponding sources.
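The filtering can be sketched as below; the exact skip_code semantics (keeping short chunks that contain code fences) are an assumption for illustration, not necessarily what DocsScraper does:

```julia
# Sketch: keep a chunk if it is long enough, or (when skip_code is set)
# if it contains a fenced code block, on the assumption that short code
# chunks are still worth keeping.
function remove_short_chunks_sketch(chunks::Vector{String}, sources::Vector{String};
        min_chunk_size::Int = 40, skip_code::Bool = true)
    keep = [length(c) >= min_chunk_size || (skip_code && occursin("```", c))
            for c in chunks]
    return chunks[keep], sources[keep]
end
```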
DocsScraper.remove_urls_from_index — Function

remove_urls_from_index(index_path::AbstractString, prefix_urls::Vector{<:AbstractString})

Remove chunks and sources corresponding to URLs starting with any of prefix_urls.
DocsScraper.report_artifact — Method

report_artifact(fn_output)

Print artifact information.
DocsScraper.resolve_url — Method

resolve_url(base_url::String, extracted_url::String)

Check the extracted URL against the base URL. Return an empty string if the extracted URL belongs to a different domain. Return the complete URL if it contains directory-traversal paths or belongs to the same domain as base_url.
Arguments
- base_url: URL of the page from which other URLs are being extracted
- extracted_url: URL extracted from the base_url
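The same-domain check can be sketched as follows; this simplified version handles only absolute and root-relative links, whereas the real resolve_url also resolves directory-traversal paths:

```julia
# Sketch: keep absolute links on the same host, resolve root-relative
# links against the base host, and drop everything else.
function resolve_url_sketch(base_url::String, extracted_url::String)
    base_host = match(r"^https?://([^/]+)", base_url).captures[1]
    if startswith(extracted_url, "http")
        m = match(r"^https?://([^/]+)", extracted_url)
        m === nothing && return ""
        return m.captures[1] == base_host ? extracted_url : ""  # cross-domain => ""
    elseif startswith(extracted_url, "/")
        return "https://" * base_host * extracted_url  # root-relative path
    end
    return ""
end
```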
DocsScraper.roll_up_chunks — Method

roll_up_chunks(parsed_blocks::Vector{Dict{String,Any}}, url::AbstractString; separator::String="<SEP>")

Roll up chunks that share the same header path, so they can later be split by <SEP> to reach the desired length.
DocsScraper.save_embeddings — Method

save_embeddings(index_name::AbstractString, embedding_dimension::Int,
embedding_bool::Bool, model_embedding::AbstractString, target_path::AbstractString,
chunks::AbstractVector{<:AbstractString}, sources::Vector{String},
full_embeddings, custom_metadata::AbstractString, max_chunk_size::Int)

Save the generated embeddings along with a .txt file containing the artifact info.
Arguments
- index_name: Name of the index. Default: an "index" name generated by gensym
- embedding_dimension: Embedding dimensions
- embedding_bool: If true, the generated embeddings are boolean; otherwise Float32
- model_embedding: Embedding model
- target_path: Path to the index folder
- chunks: Vector of scraped chunks
- sources: Vector of scraped sources
- full_embeddings: Generated embedding matrix
- custom_metadata: Custom metadata like ecosystem name if required
- max_chunk_size: Maximum chunk size
DocsScraper.text_before_version — Method

text_before_version(url::AbstractString)

Return the text before "stable", "dev", or any version in the URL. Doc websites generally place package names before their versions.
DocsScraper.url_package_name — Method

url_package_name(url::AbstractString)

Return the text if the URL itself contains the package name with a ".jl" or "_jl" suffix.
DocsScraper.urls_for_metadata — Method

urls_for_metadata(sources::Vector{String})

Return a Dict of package names with their associated URLs.

Note: Due to their large number, URLs are stripped down to the package name; package subpaths are not included in the metadata.
DocsScraper.validate_args — Function

validate_args(crawlable_urls::Vector{<:AbstractString} = String[];
single_urls::Vector{<:AbstractString} = String[], target_path::AbstractString = "", index_name::AbstractString = "")

Validate the arguments. Return an error if both crawlable_urls and single_urls are empty. Create a target path if the input path is invalid. Create a gensym index name if the input index name is invalid.
Arguments
- crawlable_urls: URLs that should be crawled to find more links
- single_urls: Single page URLs that should just be scraped and parsed. The crawler won't look for more URLs
- target_path: Path to the directory where the index folder will be created
- index_name: Name of the index. Default: an "index" name generated by gensym
PromptingTools.Experimental.RAGTools.get_chunks — Method

RT.get_chunks(chunker::DocParserChunker, url::AbstractString;
verbose::Bool=true, separators=["\n\n", ". ", "\n", " "], max_chunk_size::Int=MAX_CHUNK_SIZE)
Extract chunks from HTML files by parsing the HTML content, rolling up chunks by headers, and splitting them by separators to reach the desired length.
Arguments
- chunker: DocParserChunker
- url: URL of the webpage from which to extract chunks
- verbose: Bool to print the log
- separators: Chunk separators
- max_chunk_size: Maximum chunk size