ColBERT: Efficient, late-interaction retrieval systems in Julia!
ColBERT.jl is a pure Julia package for the ColBERT information retrieval system [1, 2, 3], allowing developers to integrate this powerful neural retrieval algorithm into their own downstream tasks. ColBERT (contextualized late interaction over BERT) has emerged as a state-of-the-art approach for efficient and effective document retrieval, thanks to its ability to leverage contextualized embeddings from pre-trained language models like BERT.
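At query time, ColBERT scores a document by summing, for each query token embedding, its maximum similarity over the document's token embeddings (the MaxSim operator of late interaction). Here's a minimal sketch of that scoring rule in plain Julia; the matrices Q and D and the function maxsim_score are illustrative toy names, not part of the ColBERT.jl API:
using LinearAlgebra

# Late-interaction (MaxSim) scoring: Q and D hold L2-normalized token
# embeddings, one column per token (embedding_dim × num_tokens).
function maxsim_score(Q::AbstractMatrix, D::AbstractMatrix)
    S = D' * Q                        # cosine similarities: doc tokens × query tokens
    return sum(maximum(S; dims = 1))  # best doc token per query token, summed
end

# toy example with random unit vectors
Q = mapslices(normalize, randn(Float32, 128, 8); dims = 1)    # 8 query tokens
D = mapslices(normalize, randn(Float32, 128, 150); dims = 1)  # 150 document tokens
maxsim_score(Q, D)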
Inspired by the original Python implementation of ColBERT, ColBERT.jl lets you bring this capability to your Julia applications, whether you're working on natural language processing tasks, information retrieval systems, or other areas where relevant document retrieval is crucial. Our package provides a simple and intuitive interface for using ColBERT in Julia, making it easy to get started with this powerful algorithm.
Get Started
To install the package, simply clone the repository and dev it:
julia> ] dev .
Consult the README of the GitHub repository for a small example. In this guide, we'll index a collection of 1000 documents.
Dataset and preprocessing
We'll go through an example using the lifestyle/dev split of the LoTTE dataset. To download the dataset, you can use the examples/lotte.sh script. We'll work with the first 1000 documents of the dataset:
$ cd examples
$ ./lotte.sh
$ head -n 1000 downloads/lotte/lifestyle/dev/collection.tsv > 1kcollection.tsv
$ wc -l 1kcollection.tsv
1000 1kcollection.tsv
The 1kcollection.tsv file has documents in the format pid \t <document text>, where pid is the unique ID of the document. For now, the package only supports collections with one document per line. So, we'll simply remove the pid from each document in 1kcollection.tsv and save the resulting file of documents as 1kcollection.txt. Here's a simple Julia script you can use to do this preprocessing with the CSV.jl package:
using CSV

# read the (pid, text) pairs from the TSV file
file = CSV.File("1kcollection.tsv"; delim = '\t', header = [:pid, :text],
    types = Dict(:pid => Int, :text => String), debug = true, quoted = false)

# write one document per line, dropping the pid column
open("1kcollection.txt", "w") do io
    for doc in file.text
        write(io, doc * "\n")
    end
end
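As a quick, optional sanity check, you can confirm that the new file has exactly one line per document:
julia> countlines("1kcollection.txt")   # should report 1000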
We now have our collection of documents to index!
The ColBERTConfig
To start off, make sure you've downloaded a ColBERT checkpoint somewhere on your system; for this example, I'll download the colbert-ir/colbertv2.0 checkpoint to $HOME/models:
git lfs install
cd $HOME/models
git clone https://huggingface.co/colbert-ir/colbertv2.0
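Optionally, you can verify that the clone succeeded by listing the checkpoint directory from Julia; you should see the model weights, configuration and tokenizer files from the Hugging Face repository:
julia> readdir(joinpath(homedir(), "models", "colbertv2.0"))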
The next step is to create a configuration object containing all the parameters used during indexing and searching with ColBERT. All this information is contained in a type called ColBERTConfig. Creating a ColBERTConfig is easy; it has sensible defaults for most users, and you can change the settings using simple keyword arguments. In this example, we'll create a config for the collection 1kcollection.txt we just created, and we'll also use CUDA.jl for GPU support (you can use any GPU backend supported by Flux.jl)!
julia> using ColBERT, CUDA, Random;
julia> Random.seed!(0) # global seed for a reproducible index
julia> config = ColBERTConfig(
use_gpu = true,
checkpoint = "/home/codetalker7/models/colbertv2.0", # local path to the colbert checkpoint
collection = "./1kcollection.txt", # local path to the collection
doc_maxlen = 300, # max length beyond which docs are truncated
index_path = "./1kcollection_index/", # local directory to save the index in
chunksize = 200 # number of docs to store in a chunk
);
You can read more about a ColBERTConfig from its docstring.
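If you want to see every tunable field beyond the keyword arguments we set above, standard Julia reflection and help work directly in the REPL (these are generic Julia tools, not ColBERT.jl-specific functions):
julia> propertynames(config)    # names of all fields in the config we just built

julia> @doc ColBERTConfig       # full docstring, including all supported kwargs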
Building the index
Building the index is even easier than creating a config; just build an Indexer and call the index function. I used an NVIDIA GeForce RTX 2080 Ti card to build the index:
julia> indexer = Indexer(config);
julia> @time index(indexer)
[ Info: # of sampled PIDs = 636
[ Info: Encoding 636 passages.
[ Info: avg_doclen_est = 233.25157232704402 length(local_sample) = 636
[ Info: Creating 4096 clusters.
[ Info: Estimated 233251.572327044 embeddings.
[ Info: Saving the index plan to ./1kcollection_index/plan.json.
[ Info: Saving the config to the indexing path.
[ Info: Training the clusters.
[ Info: Iteration 1/20, max delta: 0.26976448
[ Info: Iteration 2/20, max delta: 0.17742664
[ Info: Iteration 3/20, max delta: 0.16281573
[ Info: Iteration 4/20, max delta: 0.120501295
[ Info: Iteration 5/20, max delta: 0.08808214
[ Info: Iteration 6/20, max delta: 0.14226294
[ Info: Iteration 7/20, max delta: 0.07096822
[ Info: Iteration 8/20, max delta: 0.081315234
[ Info: Iteration 9/20, max delta: 0.06760075
[ Info: Iteration 10/20, max delta: 0.07043305
[ Info: Iteration 11/20, max delta: 0.060436506
[ Info: Iteration 12/20, max delta: 0.048092205
[ Info: Iteration 13/20, max delta: 0.052080974
[ Info: Iteration 14/20, max delta: 0.055756018
[ Info: Iteration 15/20, max delta: 0.057068985
[ Info: Iteration 16/20, max delta: 0.05717972
[ Info: Iteration 17/20, max delta: 0.02952642
[ Info: Iteration 18/20, max delta: 0.025388952
[ Info: Iteration 19/20, max delta: 0.034007154
[ Info: Iteration 20/20, max delta: 0.047712516
[ Info: Got bucket_cutoffs_quantiles = [0.25, 0.5, 0.75] and bucket_weights_quantiles = [0.125, 0.375, 0.625, 0.875]
[ Info: Got bucket_cutoffs = Float32[-0.023658333, -9.9312514f-5, 0.023450013] and bucket_weights = Float32[-0.044035435, -0.010775891, 0.010555617, 0.043713447]
[ Info: avg_residual = 0.031616904
[ Info: Saving codec to ./1kcollection_index/centroids.jld2, ./1kcollection_index/avg_residual.jld2, ./1kcollection_index/bucket_cutoffs.jld2 and ./1kcollection_index/bucket_weights.jld2.
[ Info: Building the index.
[ Info: Loading codec from ./1kcollection_index/centroids.jld2, ./1kcollection_index/avg_residual.jld2, ./1kcollection_index/bucket_cutoffs.jld2 and ./1kcollection_index/bucket_weights.jld2.
[ Info: Encoding 200 passages.
[ Info: Saving chunk 1: 200 passages and 36218 embeddings. From passage #1 onward.
[ Info: Saving compressed codes to ./1kcollection_index/1.codes.jld2 and residuals to ./1kcollection_index/1.residuals.jld2
[ Info: Saving doclens to ./1kcollection_index/doclens.1.jld2
[ Info: Saving metadata to ./1kcollection_index/1.metadata.json
[ Info: Encoding 200 passages.
[ Info: Saving chunk 2: 200 passages and 45064 embeddings. From passage #201 onward.
[ Info: Saving compressed codes to ./1kcollection_index/2.codes.jld2 and residuals to ./1kcollection_index/2.residuals.jld2
[ Info: Saving doclens to ./1kcollection_index/doclens.2.jld2
[ Info: Saving metadata to ./1kcollection_index/2.metadata.json
[ Info: Encoding 200 passages.
[ Info: Saving chunk 3: 200 passages and 50956 embeddings. From passage #401 onward.
[ Info: Saving compressed codes to ./1kcollection_index/3.codes.jld2 and residuals to ./1kcollection_index/3.residuals.jld2
[ Info: Saving doclens to ./1kcollection_index/doclens.3.jld2
[ Info: Saving metadata to ./1kcollection_index/3.metadata.json
[ Info: Encoding 200 passages.
[ Info: Saving chunk 4: 200 passages and 49415 embeddings. From passage #601 onward.
[ Info: Saving compressed codes to ./1kcollection_index/4.codes.jld2 and residuals to ./1kcollection_index/4.residuals.jld2
[ Info: Saving doclens to ./1kcollection_index/doclens.4.jld2
[ Info: Saving metadata to ./1kcollection_index/4.metadata.json
[ Info: Encoding 200 passages.
[ Info: Saving chunk 5: 200 passages and 52304 embeddings. From passage #801 onward.
[ Info: Saving compressed codes to ./1kcollection_index/5.codes.jld2 and residuals to ./1kcollection_index/5.residuals.jld2
[ Info: Saving doclens to ./1kcollection_index/doclens.5.jld2
[ Info: Saving metadata to ./1kcollection_index/5.metadata.json
[ Info: Running some final checks.
[ Info: Checking if all files are saved.
[ Info: Found all files!
[ Info: Collecting embedding ID offsets.
[ Info: Saving the indexing metadata.
[ Info: Building the centroid to embedding IVF.
[ Info: Loading codes for each embedding.
[ Info: Sorting the codes.
[ Info: Getting unique codes and their counts.
[ Info: Saving the IVF.
151.833047 seconds (78.15 M allocations: 28.871 GiB, 41.12% gc time, 0.51% compilation time: <1% of which was recompilation)
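At this point the index directory contains everything the searcher needs: the indexing plan and config, the centroid/residual codec, and the per-chunk codes, residuals, doclens and metadata files mentioned in the logs above. You can inspect it directly:
julia> readdir("./1kcollection_index/")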
Searching
Once you've built the index for your collection of docs, it's time to perform a query search. This involves creating a Searcher from the path of the index:
julia> using ColBERT, CUDA;
julia> searcher = Searcher("1kcollection_index");
Next, simply feed a query to the search function, and get the top-k best documents for your query:
julia> query = "what is 1080 fox bait poisoning?";
julia> @time pids, scores = search(searcher, query, 10) # second run statistics
0.136773 seconds (1.95 M allocations: 240.648 MiB, 0.00% compilation time)
([999, 383, 386, 323, 547, 385, 384, 344, 963, 833], Float32[8.754782, 7.6871076, 6.8440857, 6.365711, 6.323611, 6.1222105, 5.92911, 5.708316, 5.597268, 5.4987035])
You can now use these pids to see which documents best match your query:
julia> print(readlines("1kcollection.txt")[pids[1]])
Tl;dr - Yes, it sounds like a possible 1080 fox bait poisoning. Can't be sure though. The traditional fox bait is called 1080. That poisonous bait is still used in a few countries to kill foxes, rabbits, possums and other mammal pests. The toxin in 1080 is Sodium fluoroacetate. Wikipedia is a bit vague on symptoms in animals, but for humans they say: In humans, the symptoms of poisoning normally appear between 30 minutes and three hours after exposure. Initial symptoms typically include nausea, vomiting and abdominal pain; sweating, confusion and agitation follow. In significant poisoning, cardiac abnormalities including tachycardia or bradycardia, hypotension and ECG changes develop. Neurological effects include muscle twitching and seizures... One might safely assume a dog, especially a small Whippet, would show symptoms of poisoning faster than the 30 mins stated for humans. The listed (human) symptoms look like a good fit to what your neighbour reported about your dog. Strychnine is another commonly used poison against mammal pests. It affects the animal's muscles so that contracted muscles can no longer relax. That means the muscles responsible of breathing cease to operate and the animal suffocates to death in less than two hours. This sounds like unlikely case with your dog. One possibility is unintentional pet poisoning by snail/slug baits. These baits are meant to control a population of snails and slugs in a garden. Because the pelletized bait looks a lot like dry food made for dogs it is easily one of the most common causes of unintentional poisoning of dogs. The toxin in these baits is Metaldehyde and a dog may die inside four hours of ingesting these baits, which sounds like too slow to explain what happened to your dog, even though the symptoms of this toxin are somewhat similar to your case. Then again, the malicious use of poisons against neighbourhood dogs can vary a lot. In fact they don't end with just pesticides but also other harmful matter, like medicine made for humans and even razorblades stuck inside a meatball, have been found in baits. It is quite impossible to say what might have caused the death of your dog, at least without autopsy and toxicology tests. The 1080 is just one of the possible explanations. It is best to always use a leash when walking dogs in populated areas and only let dogs free (when allowed by local legislation) in unpopulated parks and forests and suchlike places.
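To glance at all ten hits with their scores, you can zip the pids and scores together; this loop only uses the pids, scores, and 1kcollection.txt from above, truncating each document to its first 100 characters for readability:
julia> docs = readlines("1kcollection.txt");

julia> for (pid, score) in zip(pids, scores)
           println("pid=$pid  score=$score\n", first(docs[pid], 100), "...")
       end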