ColBERT.ColBERTConfig
ColBERT.Indexer
ColBERT._add_marker_row
ColBERT._binarize
ColBERT._bucket_indices
ColBERT._cids_to_eids!
ColBERT._compute_avg_residuals!
ColBERT._integer_ids_and_mask
ColBERT._load_model
ColBERT._load_tokenizer
ColBERT._load_tokenizer_config
ColBERT._packbits
ColBERT._sample_embeddings
ColBERT._sample_pids
ColBERT._unbinarize
ColBERT._unpackbits
ColBERT.binarize
ColBERT.compress
ColBERT.compress_into_codes!
ColBERT.decompress
ColBERT.decompress_residuals
ColBERT.doc
ColBERT.encode_passages
ColBERT.encode_queries
ColBERT.extract_tokenizer_type
ColBERT.index
ColBERT.index
ColBERT.kmeans_gpu_onehot!
ColBERT.load_codec
ColBERT.load_config
ColBERT.load_hgf_pretrained_local
ColBERT.mask_skiplist!
ColBERT.save
ColBERT.save_chunk
ColBERT.save_codec
ColBERT.setup
ColBERT.tensorize_docs
ColBERT.tensorize_queries
ColBERT.train
ColBERT.ColBERTConfig — Type

ColBERTConfig(; use_gpu::Bool, rank::Int, nranks::Int, query_token_id::String,
    doc_token_id::String, query_token::String, doc_token::String, checkpoint::String,
    collection::String, dim::Int, doc_maxlen::Int, mask_punctuation::Bool,
    query_maxlen::Int, attend_to_mask_tokens::Bool, index_path::String,
    index_bsize::Int, nbits::Int, kmeans_niters::Int, nprobe::Int, ncandidates::Int)
Structure containing config for running and training various components.
Arguments
- `use_gpu`: Whether to use a GPU or not. Default is `false`.
- `rank`: The index of the running GPU. Default is `0`. For now, the package only allows this to be `0`.
- `nranks`: The number of GPUs used in the run. Default is `1`. For now, the package only supports one GPU.
- `query_token_id`: Unique identifier for query tokens (defaults to `[unused0]`).
- `doc_token_id`: Unique identifier for document tokens (defaults to `[unused1]`).
- `query_token`: Token used to represent a query token (defaults to `[Q]`).
- `doc_token`: Token used to represent a document token (defaults to `[D]`).
- `checkpoint`: The path to the HuggingFace checkpoint of the underlying ColBERT model. Defaults to `"colbert-ir/colbertv2.0"`.
- `collection`: Path to the file containing the documents. Default is `""`.
- `dim`: The dimension of the document embedding space. Default is `128`.
- `doc_maxlen`: The maximum length of a document before it is trimmed to fit. Default is `220`.
- `mask_punctuation`: Whether or not to mask punctuation tokens in the document. Default is `true`.
- `query_maxlen`: The maximum length of queries, after which they are trimmed.
- `attend_to_mask_tokens`: Whether or not to attend to mask tokens in the query. Default is `false`.
- `index_path`: Path to save the index files.
- `index_bsize`: Batch size used for some parts of indexing.
- `chunksize`: Custom size of a chunk, i.e. the number of passages for which data is stored in one chunk. Default is `missing`, in which case `chunksize` is determined from the size of the `collection` and `nranks`.
- `passages_batch_size`: The number of passages sent as a batch to encoding functions. Default is `300`.
- `nbits`: Number of bits used to compress residuals.
- `kmeans_niters`: Number of iterations used for k-means clustering.
- `nprobe`: The number of nearest centroids to fetch during a search. Default is `2`. Also see `retrieve`.
- `ncandidates`: The number of candidates to get during candidate generation in search. Default is `8192`. Also see `retrieve`.
Returns
A `ColBERTConfig` object.
Examples
Most users will just want to use the defaults for most settings. Here's a minimal example:
julia> using ColBERT;
julia> config = ColBERTConfig(
use_gpu = true,
collection = "/home/codetalker7/documents",
index_path = "./local_index"
);
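The keyword arguments become fields of the returned struct, so any setting can be read back directly. Continuing the example above (`use_gpu` as set above; `dim` and `nbits` at their defaults, matching the `load_config` output further below):
julia> config.use_gpu, config.dim, config.nbits
(true, 128, 2)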
ColBERT.Indexer — Method

Indexer(config::ColBERTConfig)

Type representing a ColBERT indexer.

Arguments

- `config`: The `ColBERTConfig` used to build the index.

Returns

An `Indexer` wrapping a `ColBERTConfig` along with the trained ColBERT model.
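Examples

A minimal sketch, reusing the `config` from the `ColBERTConfig` example above; constructing the `Indexer` loads the model from `config.checkpoint`, so this assumes the checkpoint is available locally or via HuggingFace:
julia> using ColBERT;
julia> indexer = Indexer(config);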
ColBERT._add_marker_row — Method

_add_marker_row(data::AbstractMatrix{T}, marker::T) where {T}

Add a row containing `marker` as the second row of `data`.

Arguments

- `data`: The matrix in which the row is to be added.
- `marker`: The marker to be added.

Returns

A matrix equal to `data`, with the second row filled with `marker`.
Examples
julia> using ColBERT: _add_marker_row;
julia> x = ones(Float32, 5, 5)
5×5 Matrix{Float32}:
1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0
julia> _add_marker_row(x, zero(Float32))
6×5 Matrix{Float32}:
1.0 1.0 1.0 1.0 1.0
0.0 0.0 0.0 0.0 0.0
1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 1.0 1.0
ColBERT._binarize — Method

Examples
julia> using ColBERT: _binarize;
julia> using Flux, CUDA, Random;
julia> Random.seed!(0);
julia> nbits = 5;
julia> data = rand(0:2^nbits - 1, 100, 200000) |> Flux.gpu
100×200000 CuArray{Int64, 2, CUDA.DeviceMemory}:
12 23 11 6 5 2 27 1 0 4 15 8 24 … 4 25 22 18 4 0 15 16 3 25 4 13
2 11 29 8 31 3 15 1 8 1 22 22 10 25 25 1 12 21 13 27 20 23 24 9 14
27 4 4 15 4 9 19 4 3 10 27 14 3 10 8 18 19 12 9 29 23 8 15 30 21
2 7 4 5 25 16 27 23 5 24 26 19 9 22 1 21 12 31 20 4 31 26 21 25 6
21 18 25 9 9 17 6 20 16 13 14 2 2 28 13 11 9 22 4 2 22 27 24 9 31
3 26 22 8 24 23 29 19 13 3 2 20 14 … 22 18 18 5 16 5 9 3 21 19 17 23
3 13 5 9 8 12 24 26 8 10 14 1 21 14 25 18 5 1 4 13 0 14 11 16 8
22 20 22 6 25 1 29 23 9 21 13 27 6 11 21 4 31 14 14 5 27 17 6 27 19
9 2 7 2 16 1 23 15 2 17 30 18 4 26 5 20 31 18 8 20 13 23 26 29 25
0 6 20 8 0 18 9 28 8 30 6 2 21 0 7 25 23 19 2 6 27 13 3 6 22
17 2 0 13 26 6 7 8 14 20 11 9 17 … 29 4 28 22 1 10 29 20 11 20 30 8
28 5 0 30 1 26 23 9 29 9 29 2 15 27 8 13 11 27 6 11 7 19 4 7 28
8 9 16 29 22 8 9 19 30 20 4 0 1 1 25 14 16 17 26 28 31 25 4 22 23
10 9 31 22 20 15 1 9 26 2 0 1 27 23 21 15 22 29 29 1 24 30 22 17 22
13 8 23 9 1 6 2 28 18 1 15 5 12 28 27 3 6 22 3 20 24 3 2 2 29
28 22 19 7 20 28 25 13 3 13 17 31 28 … 18 17 19 6 20 11 31 9 28 9 19 1
23 1 7 14 6 14 0 9 1 9 12 30 24 23 2 13 9 0 20 17 4 16 22 27 11
4 19 8 31 14 30 2 13 27 16 29 10 30 29 25 28 31 13 11 8 12 30 13 10 7
18 26 30 6 31 6 15 11 10 31 21 24 11 19 19 29 17 13 5 3 28 29 31 22 13
14 29 18 14 25 10 28 28 15 8 5 14 5 10 17 13 23 0 26 25 13 15 26 3 5
0 4 24 23 20 16 25 9 17 27 15 0 10 … 5 18 2 2 30 17 8 11 27 11 15 27
15 2 22 8 6 8 16 2 8 24 26 15 30 27 12 28 31 26 18 4 10 5 16 23 16
20 20 29 24 1 9 18 31 16 3 9 17 31 8 4 4 15 13 16 0 10 31 28 8 29
2 3 2 23 15 21 6 8 21 7 17 15 17 7 15 19 25 3 2 11 26 16 12 11 27
13 21 22 20 15 0 22 2 30 14 14 20 26 13 23 14 18 0 24 21 17 8 11 26 22
⋮ ⋮ ⋮ ⋱ ⋮ ⋮
9 7 1 1 28 28 10 16 23 18 26 9 7 … 14 5 12 3 6 25 20 5 13 3 20 10
28 25 21 8 31 4 25 7 27 26 19 4 9 15 26 2 23 14 16 29 17 11 29 12 18
4 15 20 2 3 10 6 9 13 22 5 28 21 12 11 12 14 14 9 13 31 12 6 9 21
9 24 2 4 27 14 4 15 19 2 14 30 3 17 5 6 2 23 15 11 1 0 10 0 28
20 0 26 8 21 7 1 7 22 10 10 5 31 23 5 20 11 29 12 25 14 13 5 25 15
2 9 27 28 25 7 27 30 20 5 10 2 28 … 21 19 22 30 24 0 10 19 10 30 22 9
10 2 31 10 12 13 16 10 5 28 16 4 16 3 1 31 20 19 16 19 30 31 14 5 20
14 2 20 19 16 25 4 1 15 31 22 17 8 12 19 9 29 30 20 13 19 14 18 7 22
20 3 27 23 9 21 20 10 14 3 5 26 22 19 19 11 3 22 19 24 12 27 12 28 17
1 27 27 10 8 29 17 14 19 6 6 12 6 10 6 24 29 26 11 2 25 7 6 1 28
11 19 5 1 7 19 8 17 27 4 4 7 0 … 13 29 0 15 15 2 2 6 24 0 5 18
17 31 31 23 24 18 0 31 6 22 20 31 23 16 5 8 17 6 20 23 21 26 15 27 30
1 6 30 31 8 3 28 31 10 23 23 24 26 12 30 10 3 25 24 12 20 8 7 14 11
26 22 23 21 24 7 2 19 10 27 21 14 7 7 27 1 29 7 23 30 24 12 9 12 14
28 26 8 28 10 18 23 28 10 19 31 26 17 18 20 23 8 31 15 18 10 24 28 7 23
1 7 15 22 23 0 21 19 28 10 15 13 7 … 21 15 16 1 16 9 25 23 1 24 20 5
21 7 30 30 5 0 27 26 6 7 30 2 16 2 16 6 9 6 4 12 4 12 18 28 17
11 16 0 20 20 13 18 19 21 7 24 4 26 1 26 7 16 0 2 3 2 22 27 25 15
9 20 31 24 14 29 28 26 29 31 7 28 12 28 0 12 3 17 7 0 30 25 22 23 20
19 21 30 16 15 20 31 2 2 8 27 20 29 27 13 2 27 8 17 19 15 9 22 3 27
13 17 6 4 9 1 18 2 21 27 13 14 12 … 28 21 4 2 11 13 31 13 25 25 29 21
2 17 19 15 17 1 12 0 11 9 16 1 13 25 21 28 22 7 13 3 20 7 6 26 21
13 6 5 11 12 2 2 3 4 16 30 14 19 16 5 5 19 17 3 31 26 19 2 11 15
20 30 21 30 13 26 7 9 11 18 3 0 15 3 14 15 1 9 16 1 16 0 2 2 11
3 24 6 16 10 3 7 17 0 30 9 14 1 29 4 8 4 17 14 27 0 17 12 4 19
julia> _binarize(data, nbits)
5×100×200000 CuArray{Bool, 3, CUDA.DeviceMemory}:
[:, :, 1] =
0 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 0 0 … 0 0 0 1 1 1 1 0 0 1 1 1 1 1 1 0 1 0 1
0 1 1 1 0 1 1 1 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 1 0 0 1 0 0 0 1 0 1 0 1 0 0 1
1 0 0 0 1 0 0 1 0 0 0 1 0 0 1 1 1 1 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 1 0 1 1 0
1 0 1 0 0 0 0 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0 0 1 0 0 1 1 0 0 1 1 0 1 0 1 0 0
0 0 1 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 1 0 0 1 0 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0
[:, :, 2] =
1 1 0 1 0 0 1 0 0 0 0 1 1 1 0 0 1 1 0 … 0 0 1 1 1 1 0 0 0 1 1 0 0 1 1 1 0 0 0
1 1 0 1 1 1 0 0 1 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 1 1 0
1 0 1 1 0 0 1 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 1 1 0
0 1 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 1 1
1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1
[:, :, 3] =
1 1 0 0 1 0 1 0 1 0 0 0 0 1 1 1 1 0 0 … 1 0 1 1 1 1 0 1 0 1 0 0 1 0 0 1 1 1 0
1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 0 1 1 1 1 0 0 1
0 1 1 1 0 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 1 0 1 1 1
1 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 1 1 0 1 1 0 1 1 0 1 1 1 0 1 1 0 0 0 0 0
0 1 0 0 1 1 0 1 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 1 1 0 0 1 0 1 1 0 1 0 1 0
;;; …
[:, :, 199998] =
1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 1 0 1 1 … 0 0 0 0 0 1 1 1 0 0 0 1 0 0 1 0 0 0 0
0 0 1 0 0 1 1 1 1 1 0 0 0 1 1 0 1 0 1 1 1 0 1 0 1 1 0 0 0 1 1 1 1 0 1 1 1 0
0 0 1 1 0 0 0 1 0 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 1
1 1 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 1 1 1 0 1 0 0 1 0 1 1 1 0 1 0 0 1 0 0 0 1
1 1 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0
[:, :, 199999] =
0 1 0 1 1 1 0 1 1 0 0 1 0 1 0 1 1 0 0 … 1 1 0 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0
0 0 1 0 0 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 1 1 0 1 0 0 0 1 1 0 1 1 1 0
1 0 1 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 1 1 1 1 0 1 0 1 0 0 0 1
0 1 1 1 1 0 0 1 1 0 1 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 1 0 0 1 1 0 0 1 1 1 0 0
0 0 1 1 0 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 0
[:, :, 200000] =
1 0 1 0 1 1 0 1 1 0 0 0 1 0 1 1 1 1 1 … 0 0 1 0 0 0 1 0 1 1 1 1 0 1 1 1 1 1 1
0 1 0 1 1 1 0 1 0 1 0 0 1 1 0 0 1 1 0 0 1 0 0 1 1 1 1 1 0 0 1 0 1 0 0 1 1 1
1 1 1 1 1 1 0 0 0 1 0 1 1 1 1 0 0 1 1 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 0 0
1 1 0 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 1 0 0 0 1 0 1 1 1 0 0 0 1 0 1 0 0 1 1 0
0 0 1 0 1 1 0 1 1 1 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 1 0 1 0 1 1 1 1 0 0 1
ColBERT._bucket_indices — Method

Examples
julia> using ColBERT: _bucket_indices;
julia> using Flux, CUDA, Random; Random.seed!(0);
julia> data = rand(50, 50) |> Flux.gpu
50×50 CuArray{Float32, 2, CUDA.DeviceMemory}:
0.455238 0.828104 0.735106 0.042069 … 0.916387 0.10078 0.00907127
0.547642 0.100748 0.993553 0.0275458 0.0954245 0.351846 0.548682
0.773354 0.908416 0.703694 0.839846 0.613082 0.605597 0.660227
0.940585 0.932748 0.150822 0.920883 0.754362 0.843869 0.0453409
0.0296477 0.123079 0.409406 0.672372 0.19912 0.106127 0.945276
0.746943 0.149248 0.864755 0.116243 … 0.541295 0.224275 0.660706
0.746801 0.743713 0.64608 0.446445 0.951642 0.583662 0.338174
0.97667 0.722362 0.692789 0.646206 0.089323 0.305554 0.454803
0.329335 0.785124 0.254097 0.271299 0.320879 0.000438984 0.161356
0.672001 0.532197 0.869579 0.182068 0.289906 0.068645 0.142121
0.0997382 0.523732 0.315933 0.935547 … 0.819027 0.770597 0.654065
0.230139 0.997278 0.455917 0.566976 0.0180972 0.275211 0.0619634
0.631256 0.709048 0.810256 0.754144 0.452911 0.358555 0.116042
0.096652 0.454081 0.715283 0.923417 0.498907 0.781054 0.841858
0.69801 0.0439444 0.27613 0.617714 0.589872 0.708365 0.0266968
0.470257 0.654557 0.351769 0.812597 … 0.323819 0.621386 0.63478
0.114864 0.897316 0.0243141 0.910847 0.232374 0.861399 0.844008
0.984812 0.491806 0.356395 0.501248 0.651833 0.173494 0.38356
0.730758 0.970359 0.456407 0.8044 0.0385577 0.306404 0.705577
0.117333 0.233628 0.332989 0.0857914 0.224095 0.747571 0.387572
⋮ ⋱
0.908402 0.609104 0.108874 0.430905 … 0.00564743 0.964602 0.541285
0.570179 0.10114 0.210174 0.945569 0.149051 0.785343 0.241959
0.408136 0.221389 0.425872 0.204654 0.238413 0.583185 0.271998
0.526989 0.0401535 0.686314 0.534208 0.29416 0.488244 0.747676
0.129952 0.716592 0.352166 0.584363 0.0850619 0.161153 0.243575
0.0256413 0.0831649 0.179467 0.799997 … 0.229072 0.711857 0.326977
0.939913 0.21433 0.223666 0.914527 0.425202 0.129862 0.766065
0.600877 0.516631 0.753827 0.674017 0.665329 0.622929 0.645962
0.223773 0.257933 0.854171 0.259882 0.298119 0.231662 0.824881
0.268817 0.468576 0.218589 0.835418 0.802857 0.0159643 0.0330232
0.408092 0.361884 0.849442 0.527004 … 0.0500168 0.427498 0.70482
0.740789 0.952265 0.722908 0.0856596 0.507305 0.32629 0.117663
0.873501 0.587707 0.894573 0.355338 0.345011 0.0693833 0.457268
0.758824 0.162728 0.608327 0.902837 0.492069 0.716635 0.459272
0.922832 0.950539 0.51935 0.52672 0.725665 0.36443 0.936056
0.239929 0.3754 0.247219 0.92438 … 0.0763809 0.737196 0.712317
0.76676 0.182714 0.866055 0.749239 0.132254 0.755823 0.0869469
0.378313 0.0392607 0.93354 0.908511 0.733769 0.552135 0.351491
0.811121 0.891591 0.610976 0.0427439 0.0258436 0.482621 0.193291
0.109315 0.474986 0.140528 0.776382 0.609791 0.49946 0.116989
julia> bucket_cutoffs = sort(rand(5)) |> Flux.gpu
5-element CuArray{Float32, 1, CUDA.DeviceMemory}:
0.42291805
0.7075339
0.8812783
0.89976573
0.9318977
julia> _bucket_indices(data, bucket_cutoffs)
50×50 CuArray{Int64, 2, CUDA.DeviceMemory}:
1 2 2 0 1 0 2 0 0 2 0 1 1 0 … 0 0 0 1 1 0 2 2 4 0 4 0 0
1 0 5 0 1 4 1 2 0 0 5 1 0 0 0 0 1 2 4 2 0 0 0 2 0 0 1
2 4 1 2 1 0 5 0 1 1 0 0 0 1 2 5 1 1 1 1 1 1 0 5 1 1 1
5 5 0 4 0 0 1 2 4 0 4 1 0 0 5 5 4 2 1 0 2 0 1 0 2 2 0
0 0 0 1 0 0 1 1 0 2 0 1 2 0 1 0 2 0 2 0 2 1 1 5 0 0 5
2 0 2 0 1 0 1 0 2 4 2 2 0 2 … 0 1 0 4 0 5 0 0 0 2 1 0 1
2 2 1 1 1 0 3 0 2 0 1 1 5 0 2 0 0 0 0 1 0 5 5 1 5 1 0
5 2 1 1 2 5 0 0 1 3 0 1 0 1 0 0 0 0 0 1 4 0 1 0 0 0 1
0 2 0 0 1 1 0 5 2 0 2 2 2 2 0 0 5 5 0 0 2 2 0 2 0 0 0
1 1 2 0 2 4 5 5 1 0 2 2 2 0 0 0 1 1 1 0 0 1 1 2 0 0 0
0 1 0 5 0 0 2 0 2 0 0 3 0 0 … 1 2 0 5 0 1 2 0 0 0 2 2 1
0 5 1 1 2 1 0 1 1 0 0 1 1 0 5 0 0 2 2 0 3 1 1 4 0 0 0
1 2 2 2 2 1 1 5 0 0 0 1 0 5 0 1 1 0 0 0 2 0 2 0 1 0 0
0 1 2 4 1 2 1 2 0 2 2 0 0 0 0 1 0 1 0 1 3 1 1 1 1 2 2
1 0 0 1 4 0 2 2 5 4 0 3 0 1 3 0 0 0 0 5 0 1 2 0 1 2 0
1 1 0 2 0 1 5 3 1 2 5 2 1 2 … 1 1 2 0 0 0 2 1 2 3 0 1 1
0 3 0 4 0 0 0 0 0 0 0 0 0 1 1 1 1 2 0 1 0 2 3 0 0 2 2
5 1 0 1 2 0 2 0 0 2 0 0 1 0 1 4 0 2 0 0 0 0 1 0 1 0 0
2 5 1 2 0 1 0 2 5 1 1 1 5 0 1 1 0 0 2 0 1 0 4 0 0 0 1
0 0 0 0 0 2 3 1 0 1 1 0 1 2 0 1 1 1 1 0 0 0 5 1 0 2 0
⋮ ⋮ ⋮ ⋱ ⋮ ⋮
4 1 0 1 4 1 2 0 1 0 0 1 0 2 … 0 0 0 0 0 2 0 2 0 1 0 5 1
1 0 0 5 2 2 5 0 0 3 5 0 1 5 1 2 0 1 2 0 0 0 1 0 0 2 0
0 0 1 0 0 1 4 0 0 1 0 5 1 5 1 1 2 0 2 0 1 1 2 4 0 1 0
1 0 1 1 0 0 0 0 1 0 0 0 0 4 0 0 1 0 3 5 0 1 1 1 0 1 2
0 2 0 1 0 0 2 0 2 1 1 2 1 1 0 0 0 1 1 1 0 0 1 2 0 0 0
0 0 0 2 5 2 2 0 0 5 5 4 1 0 … 0 0 2 1 5 0 1 0 1 0 0 2 0
5 0 0 4 0 1 0 0 0 1 2 2 0 0 1 0 0 0 1 1 4 0 5 1 1 0 2
1 1 2 1 1 1 0 0 0 0 0 2 1 0 0 5 0 1 0 0 1 2 0 0 1 1 1
0 0 2 0 0 1 1 4 0 2 2 0 5 1 1 1 1 1 5 0 3 2 2 1 0 0 2
0 1 0 2 2 1 1 0 1 0 1 0 0 2 5 0 1 0 5 0 0 2 2 0 2 0 0
0 0 2 1 0 1 1 1 1 2 4 0 1 2 … 1 1 1 1 0 0 5 1 0 0 0 1 1
2 5 2 0 0 0 2 0 2 0 0 0 0 0 4 0 5 5 0 2 0 0 0 0 1 0 0
2 1 3 0 1 1 0 0 4 0 0 1 1 0 1 1 0 4 1 1 0 2 0 3 0 0 1
2 0 1 4 1 0 0 1 0 2 1 0 0 0 5 1 0 0 1 1 0 0 2 0 1 2 1
4 5 1 1 1 1 0 0 0 1 1 0 5 2 5 0 2 2 1 1 1 5 2 1 2 0 5
0 0 0 4 2 1 0 3 0 3 2 0 1 2 … 0 1 0 2 0 0 2 5 2 0 0 2 2
2 0 2 2 1 0 0 3 1 1 0 5 2 0 2 0 2 0 5 1 0 0 1 0 0 2 0
0 0 5 4 1 0 2 2 2 0 1 1 2 5 0 0 0 0 1 0 0 1 0 1 2 1 0
2 3 1 0 0 2 0 0 5 0 5 0 1 1 0 0 5 2 0 1 0 5 2 1 0 1 0
0 1 0 2 1 0 2 2 1 0 1 4 1 1 5 1 0 1 4 1 1 1 1 1 1 1 0
ColBERT._cids_to_eids! — Method

_cids_to_eids!(eids::Vector{Int}, centroid_ids::Vector{Int},
    ivf::Vector{Int}, ivf_lengths::Vector{Int})

Get the set of embedding IDs contained in `centroid_ids`.
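Examples

A toy sketch illustrating the inverted-file (IVF) layout this function assumes: `ivf` stores the embedding IDs of all centroids back to back, and `ivf_lengths[c]` is the number of embedding IDs belonging to centroid `c`. The output shown is hypothetical, assuming IDs are copied over in centroid order:
julia> using ColBERT: _cids_to_eids!;
julia> ivf = [3, 7, 1, 9, 4, 2];         # embedding IDs, grouped by centroid
julia> ivf_lengths = [2, 1, 3];          # centroids 1, 2, 3 own 2, 1 and 3 IDs respectively
julia> centroid_ids = [1, 3];
julia> eids = zeros(Int, sum(ivf_lengths[centroid_ids]));
julia> _cids_to_eids!(eids, centroid_ids, ivf, ivf_lengths);
julia> eids
5-element Vector{Int64}:
 3
 7
 9
 4
 2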
ColBERT._compute_avg_residuals! — Method

_compute_avg_residuals!(
    nbits::Int, centroids::AbstractMatrix{Float32},
    heldout::AbstractMatrix{Float32}, codes::AbstractVector{UInt32})

Compute the average residuals and other statistics of the held-out sample embeddings.

Arguments

- `nbits`: The number of bits used to compress the residuals.
- `centroids`: A matrix containing the centroids computed using a $k$-means clustering algorithm on the sampled embeddings. Has shape `(D, indexer.num_partitions)`, where `D` is the embedding dimension (`128`) and `indexer.num_partitions` is the number of clusters.
- `heldout`: A matrix containing the held-out embeddings, computed using `_heldout_split`.
- `codes`: The array used to store the codes for each held-out embedding.

Returns

A tuple `bucket_cutoffs, bucket_weights, avg_residual`, which will be used in compression/decompression of residuals.
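Examples

A minimal sketch with random data; real usage passes centroids produced by k-means over the sampled embeddings and a held-out split obtained from `_heldout_split`:
julia> using ColBERT: _compute_avg_residuals!;
julia> using Random; Random.seed!(0);
julia> nbits = 2;
julia> centroids = rand(Float32, 128, 500);
julia> heldout = rand(Float32, 128, 1000);
julia> codes = Vector{UInt32}(undef, size(heldout, 2));
julia> bucket_cutoffs, bucket_weights, avg_residual = _compute_avg_residuals!(
           nbits, centroids, heldout, codes);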
ColBERT._integer_ids_and_mask — Method

_integer_ids_and_mask(
    tokenizer::TextEncoders.AbstractTransformerTextEncoder,
    batch_text::AbstractVector{String})

Run `batch_text` through `tokenizer` to get matrices of tokens and attention mask.

Arguments

- `tokenizer`: The tokenizer to be used to tokenize the texts.
- `batch_text`: The list of texts to tokenize.

Returns

A tuple `integer_ids, bitmask`, where `integer_ids` is a `Matrix` containing token IDs and `bitmask` is the attention mask.
Examples
julia> using ColBERT: _integer_ids_and_mask, load_hgf_pretrained_local;
julia> tokenizer = load_hgf_pretrained_local("/home/codetalker7/models/colbertv2.0/:tokenizer");
julia> batch_text = [
"hello world",
"thank you!",
"a",
"this is some longer text, so length should be longer",
"this is an even longer document. this is some longer text, so length should be longer",
];
julia> integer_ids, bitmask = _integer_ids_and_mask(tokenizer, batch_text);
julia> integer_ids
20×5 Matrix{Int32}:
102 102 102 102 102
7593 4068 1038 2024 2024
2089 2018 103 2004 2004
103 1000 1 2071 2020
1 103 1 2937 2131
1 1 1 3794 2937
1 1 1 1011 6255
1 1 1 2062 1013
1 1 1 3092 2024
1 1 1 2324 2004
1 1 1 2023 2071
1 1 1 2937 2937
1 1 1 103 3794
1 1 1 1 1011
1 1 1 1 2062
1 1 1 1 3092
1 1 1 1 2324
1 1 1 1 2023
1 1 1 1 2937
1 1 1 1 103
julia> bitmask
20×5 BitMatrix:
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 0 1 1
0 1 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
ColBERT._load_model — Method

_load_model(cfg::HF.HGFConfig; path_model::AbstractString,
trainmode::Bool = false, lazy::Bool = false, mmap::Bool = true)
Local model loader.
ColBERT._load_tokenizer — Method

_load_tokenizer(cfg::HF.HGFConfig; path_tokenizer_config::AbstractString,
path_special_tokens_map::AbstractString, path_tokenizer::AbstractString)
Local tokenizer loader.
ColBERT._load_tokenizer_config — Method

_load_tokenizer_config(path_config)
Load tokenizer config locally.
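A minimal sketch; the path is machine-specific, mirroring the other local-loading examples on this page:
julia> using ColBERT;
julia> tokenizer_config = ColBERT._load_tokenizer_config("/home/codetalker7/models/colbertv2.0/tokenizer_config.json");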
ColBERT._packbits — Method

Examples
julia> using ColBERT: _packbits;
julia> using Random; Random.seed!(0);
julia> bitsarray = rand(Bool, 2, 128, 200000);
julia> _packbits(bitsarray)
32×200000 Matrix{UInt8}:
0x2e 0x93 0x5a 0xbd 0xd1 0x89 0x2c 0x39 0x6a … 0xed 0xdb 0x45 0x95 0xf8 0x64 0x57 0x5b 0x06
0x3f 0x45 0x0c 0x2a 0x14 0xdb 0x16 0x2b 0x00 0x70 0xba 0x3c 0x40 0x56 0xa6 0xbe 0x33 0x3d
0xbd 0x61 0xa3 0xa7 0xb4 0xe7 0x1e 0xf8 0xa7 0xf0 0x70 0xaf 0xc0 0xeb 0xa3 0x34 0x6d 0x81
0x15 0x9d 0x02 0xa5 0x7b 0x84 0xde 0x2f 0x28 0xa7 0xf2 0x51 0xb3 0xe7 0x01 0xbf 0x6f 0x5a
0xaf 0x76 0x8f 0x55 0x81 0x2f 0xa5 0xcc 0x03 0xe7 0xea 0x17 0xf2 0x07 0x45 0x40 0x40 0xd8
0xd2 0xd4 0x25 0xcc 0x41 0xc6 0x87 0x7e 0xfd … 0x5a 0xe6 0xed 0x28 0x26 0x8b 0x39 0x3b 0x4b
0xb3 0xbe 0x08 0xdb 0x73 0x3d 0x58 0x04 0xda 0x7b 0xf7 0xab 0x1f 0x2d 0x7b 0x71 0x12 0xdf
0x6f 0x86 0x20 0x90 0xa5 0x0f 0xc7 0xeb 0x79 0x19 0x92 0x74 0x59 0x4b 0xfe 0xe2 0xb9 0xef
0x4b 0x93 0x7c 0x02 0x4f 0x40 0xad 0xe3 0x4f 0x9c 0x9c 0x69 0xd1 0xf8 0xd9 0x9e 0x00 0x70
0x77 0x5d 0x05 0xa6 0x2c 0xaa 0x9d 0xf6 0x8d 0xa9 0x4e 0x46 0x70 0xd9 0x47 0x80 0x06 0x7e
0x6e 0x7e 0x0f 0x3c 0xe7 0xaf 0x12 0xbf 0x0a … 0x3f 0xaf 0xe8 0x57 0x26 0x4b 0x2c 0x3f 0x01
0x72 0xb1 0xea 0xde 0x97 0x1d 0xf4 0x4c 0x89 0x47 0x98 0xc5 0xb6 0x47 0xaf 0x95 0xb1 0x74
0xc6 0x2b 0x51 0x95 0x30 0xab 0xdc 0x29 0x79 0x5c 0x7b 0xc3 0xf4 0x6a 0xa6 0x09 0x39 0x96
0xeb 0xef 0x6f 0x70 0x8d 0x1f 0xb9 0x95 0x4e 0xd0 0xf5 0x68 0x0a 0x04 0x63 0x5b 0x45 0xf5
0xef 0xca 0xb7 0xd4 0x31 0x14 0x34 0x96 0x0c 0x1e 0x6a 0xce 0xf2 0xa3 0xa0 0xbe 0x92 0x9c
0xda 0x91 0x53 0xd1 0x43 0xfa 0x59 0x7a 0x0c … 0x0f 0x7a 0xa0 0x4a 0x19 0xc6 0xd3 0xbb 0x7a
0x9a 0x81 0xdb 0xee 0xce 0x7e 0x4a 0xb5 0x2a 0x3c 0x3e 0xaa 0xdc 0xa6 0xd5 0xae 0x23 0xb2
0x82 0x2b 0xab 0x06 0xfd 0x8a 0x4a 0xba 0x80 0xb6 0x1a 0x62 0xa0 0x29 0x97 0x61 0x6e 0xf7
0xb8 0xe6 0x0d 0x21 0x38 0x3a 0x97 0x55 0x58 0x46 0x01 0xe1 0x82 0x34 0xa3 0xfa 0x54 0xb3
0x09 0xc7 0x2f 0x7b 0x82 0x0c 0x26 0x4d 0xa4 0x1e 0x64 0xc2 0x55 0x41 0x6b 0x14 0x5c 0x0b
0xf1 0x2c 0x3c 0x0a 0xf1 0x76 0xd4 0x57 0x42 … 0x44 0xb1 0xac 0xb4 0xa2 0x40 0x1e 0xbb 0x44
0xf8 0x0d 0x6d 0x09 0xb0 0x80 0xe3 0x5e 0x18 0xb3 0x43 0x22 0x82 0x0e 0x50 0xfb 0xf6 0x7b
0xf0 0x32 0x02 0x28 0x36 0x00 0x4f 0x84 0x2b 0xe8 0xcc 0x89 0x07 0x2f 0xf4 0xcb 0x41 0x53
0x53 0x9b 0x01 0xf3 0xb2 0x13 0x6a 0x43 0x88 0x22 0xd8 0x33 0xa2 0xab 0xaf 0xe1 0x02 0xf7
0x59 0x60 0x4a 0x1a 0x9c 0x29 0xb1 0x1b 0xea 0xe9 0xd6 0x07 0x78 0xc6 0xdf 0x16 0xff 0x87
0xba 0x98 0xff 0x98 0xc3 0xa3 0x7d 0x7c 0x75 … 0xfe 0x75 0x4d 0x43 0x8e 0x5e 0x32 0xb0 0x97
0x7b 0xc9 0xcf 0x4c 0x99 0xad 0xf1 0x0e 0x0d 0x9f 0xf2 0x92 0x75 0x86 0xd6 0x08 0x74 0x8d
0x7c 0xd4 0xe7 0x53 0xd3 0x23 0x25 0xce 0x3a 0x19 0xdb 0x14 0xa2 0xf1 0x01 0xd4 0x27 0x20
0x2a 0x63 0x51 0xcd 0xab 0xc3 0xb5 0xc1 0x74 0xa5 0xa4 0xe1 0xfa 0x13 0xab 0x1f 0x8f 0x9a
0x93 0xbe 0xf4 0x54 0x2b 0xb9 0x41 0x9d 0xa8 0xbf 0xb7 0x2b 0x1c 0x09 0x36 0xa5 0x7b 0xdc
0xdc 0x93 0x23 0xf8 0x90 0xaf 0xfb 0xd1 0xcc … 0x54 0x09 0x8c 0x14 0xfe 0xa7 0x5d 0xd7 0x6d
0xaf 0x93 0xa2 0x29 0xf9 0x5b 0x24 0xd5 0x2a 0xf1 0x7f 0x3a 0xf5 0x8f 0xd4 0x6e 0x67 0x5b
ColBERT._sample_embeddings — Method

_sample_embeddings(bert::HF.HGFBertModel, linear::Layers.Dense,
tokenizer::TextEncoders.AbstractTransformerTextEncoder,
dim::Int, index_bsize::Int, doc_token::String,
skiplist::Vector{Int}, collection::Vector{String})
Compute embeddings for the PIDs sampled by `_sample_pids`.

The embedding array has shape `(D, N)`, where `D` is the embedding dimension (`128`, after applying the linear layer of the ColBERT model) and `N` is the total number of embeddings over all documents.
Arguments

- `bert`: The pre-trained BERT component of ColBERT.
- `linear`: The pre-trained linear component of ColBERT.
- `tokenizer`: The tokenizer to be used.
- `dim`: The embedding dimension.
- `index_bsize`: The batch size to be used to run the transformer. See `ColBERTConfig`.
- `doc_token`: The document token. See `ColBERTConfig`.
- `skiplist`: List of tokens to skip.
- `collection`: The underlying collection of passages to get the samples from.

Returns

A tuple containing the average document length (i.e. the number of attended tokens) computed from the sampled documents, and the embedding matrix for the local samples. The matrix has shape `(D, N)`, where `D` is the embedding dimension (`128`) and `N` is the total number of embeddings over all the sampled passages.
ColBERT._sample_pids — Method

_sample_pids(num_documents::Int)

Sample PIDs from the collection to be used to compute clusters using a $k$-means clustering algorithm.

Arguments

- `num_documents`: The total number of documents in the collection. It is assumed that each document has an ID (aka PID) in the range of integers between `1` and `num_documents` (both inclusive).

Returns

A `Set` of `Int`s containing the sampled PIDs.
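Examples

A small sketch; the number of PIDs sampled is an internal heuristic based on the collection size:
julia> using ColBERT: _sample_pids;
julia> using Random; Random.seed!(0);
julia> pids = _sample_pids(10000);
julia> pids isa Set{Int} && all(1 .<= collect(pids) .<= 10000)
true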
ColBERT._unbinarize — Method

Examples
julia> using ColBERT: _binarize, _unbinarize;
julia> using Flux, CUDA, Random;
julia> Random.seed!(0);
julia> nbits = 5;
julia> data = rand(0:2^nbits - 1, 100, 200000) |> Flux.gpu;
julia> binarized_data = _binarize(data, nbits);
julia> unbinarized_data = _unbinarize(binarized_data);
julia> isequal(unbinarized_data, data)
true
ColBERT._unpackbits — Method

Examples

julia> using ColBERT: _packbits, _unpackbits;
julia> using Random; Random.seed!(0);
julia> dim, nbits = 128, 2;
julia> bitsarray = rand(Bool, nbits, dim, 200000);
julia> packedbits = _packbits(bitsarray);
julia> unpackedarray = _unpackbits(packedbits, nbits);
julia> isequal(bitsarray, unpackedarray)
true
ColBERT.binarize — Method

binarize(dim::Int, nbits::Int, bucket_cutoffs::Vector{Float32},
    residuals::AbstractMatrix{Float32})

Convert a matrix of residual vectors into a matrix of integer residual vectors using `nbits` bits.
Arguments

- `dim`: The embedding dimension (see `ColBERTConfig`).
- `nbits`: Number of bits to compress the residuals into.
- `bucket_cutoffs`: Cutoffs used to determine residual buckets.
- `residuals`: The matrix of residuals to be compressed.

Returns

An `AbstractMatrix{UInt8}` of compressed integer residual vectors.
Examples
julia> using ColBERT: binarize;
julia> using Statistics, Random;
julia> Random.seed!(0);
julia> dim, nbits = 128, 2; # encode residuals in 2 bits
julia> residuals = rand(Float32, dim, 200000);
julia> quantiles = collect(0:(2^nbits - 1)) / 2^nbits;
julia> bucket_cutoffs = Float32.(quantile(residuals, quantiles[2:end]))
3-element Vector{Float32}:
0.2502231
0.5001043
0.75005275
julia> binarize(dim, nbits, bucket_cutoffs, residuals)
32×200000 Matrix{UInt8}:
0xb4 0xa2 0x0f 0xd5 0xe2 0xd3 0x03 0xbe 0xe3 … 0x44 0xf5 0x8c 0x62 0x59 0xdc 0xc9 0x9e 0x57
0xce 0x7e 0x23 0xd8 0xea 0x96 0x23 0x3e 0xe1 0xfb 0x29 0xa5 0xab 0x28 0xc3 0xed 0x60 0x90
0xb1 0x3e 0x96 0xc9 0x84 0x73 0x2c 0x28 0x22 0x27 0x6e 0xca 0x19 0xcd 0x9f 0x1a 0xf4 0xe4
0xd8 0x85 0x26 0xe2 0xf8 0xfc 0x59 0xef 0x9a 0x51 0xcf 0x06 0x09 0xec 0x0f 0x96 0x94 0x9d
0xa7 0xfe 0xe2 0x9a 0xa1 0x5e 0xb0 0xd3 0x98 0x41 0x64 0x7b 0x0c 0xa6 0x69 0x26 0x35 0x05
0x12 0x66 0x0c 0x17 0x05 0xff 0xf2 0x35 0xc0 … 0xa6 0xb7 0xda 0x20 0xb4 0xfe 0x33 0xfc 0xa1
0x1b 0xa5 0xbc 0xa0 0xc7 0x1c 0xdc 0x43 0x12 0x38 0x81 0x12 0xb1 0x53 0x52 0x50 0x92 0x41
0x5b 0xea 0xbe 0x84 0x81 0xed 0xf5 0x83 0x7d 0x4a 0xc8 0x7f 0x95 0xab 0x34 0xcb 0x35 0x15
0xd3 0x0a 0x18 0xc8 0xea 0x34 0x31 0xcc 0x79 0x39 0x3c 0xec 0xe2 0x6a 0xb2 0x59 0x62 0x74
0x1b 0x01 0xee 0xe7 0xda 0xa9 0xe4 0xe6 0xc5 0x75 0x10 0xa1 0xe1 0xe5 0x50 0x23 0xfe 0xa3
0xe8 0x38 0x28 0x7c 0x9f 0xd5 0xf7 0x69 0x73 … 0x4e 0xbc 0x52 0xa0 0xca 0x8b 0xe9 0xaf 0xae
0x2a 0xa2 0x12 0x1c 0x03 0x21 0x6a 0x6e 0xdb 0xa3 0xe3 0x62 0xb9 0x69 0xc0 0x39 0x48 0x9a
0x76 0x44 0xce 0xd7 0xf7 0x02 0xbd 0xa1 0x7f 0xee 0x5d 0xea 0x9e 0xbe 0x78 0x51 0xbc 0xa3
0xb2 0xe6 0x09 0x33 0x5b 0xd1 0xad 0x1e 0x9e 0x2c 0x36 0x09 0xd3 0x60 0x81 0x0f 0xe0 0x9e
0xb8 0x18 0x94 0x0a 0x83 0xd0 0x01 0xe1 0x0f 0x76 0x35 0x6d 0x87 0xfe 0x9e 0x9c 0x69 0xe8
0x8c 0x6c 0x24 0xf5 0xa9 0xe2 0xbd 0x21 0x83 … 0x1d 0x77 0x11 0xea 0xc1 0xc8 0x09 0xd7 0x4b
0x97 0x23 0x9f 0x7a 0x8a 0xd1 0x34 0xc6 0xe7 0xe2 0xd0 0x46 0xab 0xbe 0xb3 0x92 0xeb 0xd8
0x10 0x6f 0xce 0x60 0x17 0x2a 0x4f 0x4a 0xb3 0xde 0x79 0xea 0x28 0xa7 0x08 0x68 0x81 0x9c
0xae 0xc9 0xc8 0xbf 0x48 0x33 0xa3 0xca 0x8d 0x78 0x4e 0x0e 0xe2 0xe2 0x23 0x08 0x47 0xe6
0x41 0x29 0x8e 0xff 0x66 0xcc 0xd8 0x58 0x59 0x92 0xd8 0xef 0x9c 0x3c 0x51 0xd4 0x65 0x64
0xb5 0xc4 0x2d 0x30 0x14 0x54 0xd4 0x79 0x62 … 0xff 0xc1 0xed 0xe4 0x62 0xa4 0x12 0xb7 0x47
0xcf 0x9a 0x9a 0xd7 0x6f 0xdf 0xad 0x3a 0xf8 0xe5 0x63 0x85 0x0f 0xaf 0x62 0xab 0x67 0x86
0x3e 0xc7 0x92 0x54 0x8d 0xef 0x0b 0xd5 0xbb 0x64 0x5a 0x4d 0x10 0x2e 0x8f 0xd4 0xb0 0x68
0x7e 0x56 0x3c 0xb5 0xbd 0x63 0x4b 0xf4 0x8a 0x66 0xc7 0x1a 0x39 0x20 0xa4 0x50 0xac 0xed
0x3c 0xbc 0x81 0x67 0xb8 0xaf 0x84 0x38 0x8e 0x6e 0x8f 0x3b 0xaf 0xae 0x03 0x0a 0x53 0x55
0x3d 0x45 0x76 0x98 0x7f 0x34 0x7d 0x23 0x29 … 0x24 0x3a 0x6b 0x8a 0xb4 0x3c 0x2d 0xe2 0x3a
0xed 0x41 0xe6 0x86 0xf3 0x61 0x12 0xc5 0xde 0xd1 0x26 0x11 0x36 0x57 0x6c 0x35 0x38 0xe2
0x11 0x57 0x82 0x9b 0x19 0x1f 0x56 0xd7 0x06 0x1e 0x2b 0xd9 0x76 0xa1 0x68 0x27 0xb1 0xde
0x89 0xb3 0xeb 0x86 0xbb 0x57 0xda 0xd3 0x5b 0x0e 0x79 0x4c 0x8c 0x57 0x3d 0xf0 0x98 0xb7
0xbf 0xc2 0xac 0xf0 0xed 0x69 0x0e 0x19 0x12 0xfe 0xab 0xcd 0xfc 0x72 0x76 0x5c 0x58 0x8b
0xe9 0x7b 0xf6 0x22 0xa0 0x60 0x23 0xc9 0x33 … 0x77 0xc7 0xdf 0x8a 0xb9 0xef 0xe3 0x03 0x8a
0x6b 0x26 0x08 0x53 0xc3 0x17 0xc4 0x33 0x2e 0xc6 0xb8 0x1e 0x54 0xcd 0xeb 0xb9 0x5f 0x38
ColBERT.compress — Method

compress(centroids::Matrix{Float32}, bucket_cutoffs::Vector{Float32},
    dim::Int, nbits::Int, embs::AbstractMatrix{Float32})

Compress a matrix of embeddings into a compact representation.

All embeddings are compressed to their nearest centroid IDs and their quantized residual vectors (where the quantization is done in `nbits` bits). If `emb` denotes an embedding and `centroid` is its nearest centroid, the residual vector is defined to be `emb - centroid`.
Arguments
- `centroids`: The matrix of centroids.
- `bucket_cutoffs`: Cutoffs used to determine residual buckets.
- `dim`: The embedding dimension (see `ColBERTConfig`).
- `nbits`: Number of bits to compress the residuals into.
- `embs`: The input embeddings to be compressed.
Returns
A tuple containing a vector of codes and the compressed residuals matrix.
Examples
julia> using ColBERT: compress;
julia> using Random; Random.seed!(0);
julia> nbits, dim = 2, 128;
julia> embs = rand(Float32, dim, 100000);
julia> centroids = embs[:, randperm(size(embs, 2))[1:10000]];
julia> bucket_cutoffs = Float32.(sort(rand(2^nbits - 1)))
3-element Vector{Float32}:
0.08594067
0.0968812
0.44113323
julia> @time codes, compressed_residuals = compress(
centroids, bucket_cutoffs, dim, nbits, embs);
4.277926 seconds (1.57 k allocations: 4.238 GiB, 6.46% gc time)
ColBERT.compress_into_codes! — Method

compress_into_codes!(codes::AbstractVector{UInt32},
    centroids::AbstractMatrix{Float32}, embs::AbstractMatrix{Float32})

Compress a matrix of embeddings into a vector of codes using the given `centroids`, where the code for each embedding is its nearest centroid ID. The codes are written in-place into `codes`.
Arguments

- `codes`: The vector in which the codes are to be stored.
- `centroids`: The matrix of centroids.
- `embs`: The matrix of embeddings to be compressed.

Returns

A `Vector{UInt32}` of codes, where each code corresponds to the nearest centroid ID for the embedding.
Examples
julia> using ColBERT: compress_into_codes!;
julia> using Flux, CUDA, Random;
julia> Random.seed!(0);
julia> centroids = rand(Float32, 128, 500) |> Flux.gpu;
julia> embs = rand(Float32, 128, 10000) |> Flux.gpu;
julia> codes = zeros(UInt32, size(embs, 2)) |> Flux.gpu;
julia> @time compress_into_codes!(codes, centroids, embs);
0.003489 seconds (4.51 k allocations: 117.117 KiB)
julia> codes
10000-element CuArray{UInt32, 1, CUDA.DeviceMemory}:
0x00000194
0x00000194
0x0000000b
0x000001d9
0x0000011f
0x00000098
0x0000014e
0x00000012
0x000000a0
0x00000098
0x000001a7
0x00000098
0x000001a7
0x00000194
⋮
0x00000199
0x000001a7
0x0000014e
0x000001a7
0x000001a7
0x000001a7
0x000000ec
0x00000098
0x000001d9
0x00000098
0x000001d9
0x000001d9
0x00000012
ColBERT.decompress — Method

Examples
julia> using ColBERT: compress, decompress;
julia> using Random; Random.seed!(0);
julia> nbits, dim = 2, 128;
julia> embs = rand(Float32, dim, 100000);
julia> centroids = embs[:, randperm(size(embs, 2))[1:10000]];
julia> bucket_cutoffs = Float32.(sort(rand(2^nbits - 1)))
3-element Vector{Float32}:
0.08594067
0.0968812
0.44113323
julia> bucket_weights = Float32.(sort(rand(2^nbits)));
4-element Vector{Float32}:
0.10379179
0.25756857
0.27798286
0.47973529
julia> @time codes, compressed_residuals = compress(
centroids, bucket_cutoffs, dim, nbits, embs);
4.277926 seconds (1.57 k allocations: 4.238 GiB, 6.46% gc time)
julia> @time decompressed_embeddings = decompress(
dim, nbits, centroids, bucket_weights, codes, compressed_residuals);
0.237170 seconds (276.40 k allocations: 563.049 MiB, 50.93% compilation time)
ColBERT.decompress_residuals — Method

Examples
julia> using ColBERT: binarize, decompress_residuals;
julia> using Statistics, Flux, CUDA, Random;
julia> Random.seed!(0);
julia> dim, nbits = 128, 2; # encode residuals in 2 bits
julia> residuals = rand(Float32, dim, 200000);
julia> quantiles = collect(0:(2^nbits - 1)) / 2^nbits;
julia> bucket_cutoffs = Float32.(quantile(residuals, quantiles[2:end]))
3-element Vector{Float32}:
0.2502231
0.5001043
0.75005275
julia> bucket_weights = Float32.(quantile(residuals, quantiles .+ 0.5 / 2^nbits))
4-element Vector{Float32}:
0.1250611
0.37511465
0.62501323
0.87501866
julia> binary_residuals = binarize(dim, nbits, bucket_cutoffs, residuals);
julia> decompressed_residuals = decompress_residuals(
dim, nbits, bucket_weights, binary_residuals)
128×200000 Matrix{Float32}:
0.125061 0.625013 0.875019 0.375115 0.625013 0.875019 … 0.375115 0.125061 0.375115 0.625013 0.875019
0.375115 0.125061 0.875019 0.375115 0.125061 0.125061 0.625013 0.875019 0.625013 0.875019 0.375115
0.875019 0.625013 0.125061 0.375115 0.625013 0.375115 0.375115 0.375115 0.125061 0.375115 0.375115
0.625013 0.625013 0.125061 0.875019 0.875019 0.875019 0.375115 0.875019 0.875019 0.625013 0.375115
0.625013 0.625013 0.875019 0.125061 0.625013 0.625013 0.125061 0.875019 0.375115 0.125061 0.125061
0.875019 0.875019 0.125061 0.625013 0.625013 0.375115 … 0.625013 0.125061 0.875019 0.125061 0.125061
0.125061 0.875019 0.625013 0.375115 0.625013 0.375115 0.625013 0.125061 0.625013 0.625013 0.375115
0.875019 0.375115 0.125061 0.875019 0.875019 0.625013 0.125061 0.875019 0.875019 0.375115 0.625013
0.375115 0.625013 0.625013 0.375115 0.125061 0.875019 0.375115 0.875019 0.625013 0.125061 0.125061
0.125061 0.875019 0.375115 0.625013 0.375115 0.125061 0.875019 0.875019 0.625013 0.375115 0.375115
0.875019 0.875019 0.375115 0.125061 0.125061 0.875019 … 0.125061 0.375115 0.375115 0.875019 0.625013
0.625013 0.125061 0.625013 0.875019 0.625013 0.375115 0.875019 0.625013 0.125061 0.875019 0.875019
0.125061 0.375115 0.625013 0.625013 0.125061 0.125061 0.125061 0.875019 0.625013 0.125061 0.375115
0.625013 0.375115 0.375115 0.125061 0.625013 0.875019 0.875019 0.875019 0.375115 0.375115 0.875019
0.375115 0.125061 0.625013 0.625013 0.875019 0.875019 0.625013 0.125061 0.375115 0.375115 0.375115
0.875019 0.625013 0.125061 0.875019 0.875019 0.875019 … 0.875019 0.125061 0.625013 0.625013 0.625013
0.875019 0.625013 0.625013 0.625013 0.375115 0.625013 0.625013 0.375115 0.625013 0.375115 0.375115
0.375115 0.875019 0.125061 0.625013 0.125061 0.875019 0.375115 0.625013 0.375115 0.375115 0.375115
0.625013 0.875019 0.625013 0.375115 0.625013 0.375115 0.625013 0.625013 0.625013 0.875019 0.125061
0.625013 0.875019 0.875019 0.625013 0.625013 0.375115 0.625013 0.375115 0.125061 0.125061 0.125061
0.625013 0.625013 0.125061 0.875019 0.375115 0.875019 … 0.125061 0.625013 0.875019 0.125061 0.375115
0.125061 0.375115 0.875019 0.375115 0.375115 0.875019 0.375115 0.875019 0.125061 0.875019 0.125061
0.375115 0.625013 0.125061 0.375115 0.125061 0.875019 0.875019 0.875019 0.875019 0.875019 0.625013
0.125061 0.375115 0.125061 0.125061 0.125061 0.875019 0.625013 0.875019 0.125061 0.875019 0.625013
0.875019 0.375115 0.125061 0.125061 0.875019 0.125061 0.875019 0.625013 0.125061 0.625013 0.375115
0.625013 0.375115 0.875019 0.125061 0.375115 0.875019 … 0.125061 0.125061 0.125061 0.125061 0.125061
0.375115 0.625013 0.875019 0.625013 0.125061 0.375115 0.375115 0.375115 0.375115 0.375115 0.125061
⋮ ⋮ ⋱ ⋮
0.875019 0.375115 0.375115 0.625013 0.875019 0.375115 0.375115 0.875019 0.875019 0.125061 0.625013
0.875019 0.125061 0.875019 0.375115 0.875019 0.875019 0.875019 0.875019 0.625013 0.625013 0.875019
0.125061 0.375115 0.375115 0.625013 0.375115 0.125061 0.625013 0.125061 0.125061 0.875019 0.125061
0.375115 0.375115 0.625013 0.625013 0.875019 0.375115 0.875019 0.125061 0.375115 0.125061 0.625013
0.875019 0.125061 0.375115 0.375115 0.125061 0.125061 … 0.375115 0.875019 0.375115 0.625013 0.125061
0.625013 0.125061 0.625013 0.125061 0.875019 0.625013 0.375115 0.625013 0.875019 0.875019 0.625013
0.875019 0.375115 0.875019 0.625013 0.875019 0.375115 0.375115 0.375115 0.125061 0.125061 0.875019
0.375115 0.875019 0.625013 0.875019 0.375115 0.875019 0.375115 0.125061 0.875019 0.375115 0.625013
0.125061 0.375115 0.125061 0.625013 0.625013 0.875019 0.125061 0.625013 0.375115 0.125061 0.875019
0.375115 0.375115 0.125061 0.375115 0.375115 0.375115 … 0.625013 0.625013 0.625013 0.875019 0.375115
0.125061 0.375115 0.625013 0.625013 0.125061 0.125061 0.625013 0.375115 0.125061 0.625013 0.875019
0.375115 0.875019 0.875019 0.625013 0.875019 0.875019 0.875019 0.375115 0.125061 0.125061 0.875019
0.625013 0.125061 0.625013 0.375115 0.625013 0.375115 0.375115 0.875019 0.125061 0.625013 0.375115
0.125061 0.875019 0.625013 0.125061 0.875019 0.375115 0.375115 0.875019 0.875019 0.375115 0.875019
0.625013 0.625013 0.875019 0.625013 0.625013 0.375115 … 0.375115 0.125061 0.875019 0.625013 0.625013
0.875019 0.625013 0.125061 0.125061 0.375115 0.375115 0.625013 0.625013 0.125061 0.125061 0.875019
0.875019 0.125061 0.875019 0.125061 0.875019 0.625013 0.125061 0.375115 0.875019 0.625013 0.625013
0.875019 0.125061 0.625013 0.875019 0.625013 0.625013 0.875019 0.875019 0.375115 0.375115 0.125061
0.625013 0.875019 0.625013 0.875019 0.875019 0.375115 0.375115 0.375115 0.375115 0.375115 0.625013
0.375115 0.875019 0.625013 0.625013 0.125061 0.125061 … 0.375115 0.875019 0.875019 0.875019 0.625013
0.625013 0.625013 0.375115 0.125061 0.125061 0.125061 0.625013 0.875019 0.125061 0.125061 0.625013
0.625013 0.875019 0.875019 0.625013 0.625013 0.625013 0.875019 0.625013 0.625013 0.125061 0.125061
0.875019 0.375115 0.875019 0.125061 0.625013 0.375115 0.625013 0.875019 0.875019 0.125061 0.625013
0.875019 0.625013 0.125061 0.875019 0.875019 0.875019 0.375115 0.875019 0.375115 0.875019 0.125061
0.625013 0.375115 0.625013 0.125061 0.125061 0.375115 … 0.875019 0.625013 0.625013 0.875019 0.625013
0.625013 0.625013 0.125061 0.375115 0.125061 0.375115 0.125061 0.625013 0.875019 0.375115 0.875019
0.375115 0.125061 0.125061 0.375115 0.875019 0.125061 0.875019 0.875019 0.625013 0.375115 0.125061
ColBERT.doc — Method

doc(bert::HF.HGFBertModel, linear::Layers.Dense,
    integer_ids::AbstractMatrix{Int32}, bitmask::AbstractMatrix{Bool})

Compute the hidden state of the BERT and linear layers of ColBERT for documents.

Arguments

- `bert`: The pre-trained BERT component of the ColBERT model.
- `linear`: The pre-trained linear component of the ColBERT model.
- `integer_ids`: An array of token IDs to be fed into the BERT model.
- `bitmask`: An array of corresponding attention masks. Should have the same shape as `integer_ids`.

Returns

An array `D` containing the normalized embeddings for each token in each document. It has shape `(D, L, N)`, where `D` is the embedding dimension (`128` for the linear layer of ColBERT), and `(L, N)` is the shape of `integer_ids`, i.e. `L` is the maximum length of any document and `N` is the total number of documents.
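Examples

A minimal sketch, loading the model locally as in the other examples on this page (the path is machine-specific) and using `tensorize_docs` to build the token IDs and attention mask:
julia> using ColBERT: doc, load_hgf_pretrained_local, tensorize_docs;
julia> tokenizer, bert, linear = load_hgf_pretrained_local("/home/codetalker7/models/colbertv2.0/");
julia> integer_ids, bitmask = tensorize_docs("[unused1]", tokenizer, ["hello world", "thank you!"]);
julia> D = doc(bert, linear, integer_ids, bitmask);
julia> size(D, 1)    # the embedding dimension from the linear layer
128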
ColBERT.encode_passages — Method

encode_passages(bert::HF.HGFBertModel, linear::Layers.Dense,
tokenizer::TextEncoders.AbstractTransformerTextEncoder,
passages::Vector{String}, dim::Int, index_bsize::Int,
doc_token::String, skiplist::Vector{Int})
Encode a list of document passages.

The given `passages` are run through the underlying BERT model and the linear layer to generate the embeddings, after doing relevant document-specific preprocessing.
Arguments

- `bert`: The pre-trained BERT component of the ColBERT model.
- `linear`: The pre-trained linear component of the ColBERT model.
- `tokenizer`: The tokenizer to be used.
- `passages`: A list of strings representing the passages to be encoded.
- `dim`: The embedding dimension.
- `index_bsize`: The batch size to be used for running the transformer.
- `doc_token`: The document token.
- `skiplist`: A list of tokens to skip.

Returns

A tuple `embs, doclens` where:

- `embs::AbstractMatrix{Float32}`: The full embedding matrix. Of shape `(D, N)`, where `D` is the embedding dimension and `N` is the total number of embeddings across all the passages.
- `doclens::AbstractVector{Int}`: A vector of document lengths for each passage, i.e. the total number of attended tokens for each document passage.
Examples
julia> using ColBERT: load_hgf_pretrained_local, ColBERTConfig, encode_passages;
julia> using CUDA, Flux, Transformers, TextEncodeBase;
julia> config = ColBERTConfig();
julia> dim = config.dim
128
julia> index_bsize = 128; # this is the batch size to be fed in the transformer
julia> doc_maxlen = config.doc_maxlen
220
julia> doc_token = config.doc_token_id
"[unused1]"
julia> tokenizer, bert, linear = load_hgf_pretrained_local("/home/codetalker7/models/colbertv2.0/");
julia> process = tokenizer.process;
julia> truncpad_pipe = Pipeline{:token}(
TextEncodeBase.trunc_and_pad(doc_maxlen - 1, "[PAD]", :tail, :tail),
:token);
julia> process = process[1:4] |> truncpad_pipe |> process[6:end];
julia> tokenizer = TextEncoders.BertTextEncoder(
tokenizer.tokenizer, tokenizer.vocab, process; startsym = tokenizer.startsym,
endsym = tokenizer.endsym, padsym = tokenizer.padsym, trunc = tokenizer.trunc);
julia> bert = bert |> Flux.gpu;
julia> linear = linear |> Flux.gpu;
julia> passages = readlines("./downloads/lotte/lifestyle/dev/collection.tsv")[1:1000];
julia> punctuations_and_padsym = [string.(collect("!\"#\$%&'()*+,-./:;<=>?@[\\]^_`{|}~"));
                                  tokenizer.padsym];
julia> skiplist = [lookup(tokenizer.vocab, sym)
for sym in punctuations_and_padsym];
julia> @time embs, doclens = encode_passages(
bert, linear, tokenizer, passages, dim, index_bsize, doc_token, skiplist) # second run stats
[ Info: Encoding 1000 passages.
25.247094 seconds (29.65 M allocations: 1.189 GiB, 37.26% gc time, 0.00% compilation time)
(Float32[-0.08001435 -0.10785186 … -0.08651956 -0.12118215; 0.07319974 0.06629379 … 0.0929825 0.13665271; … ; -0.037957724 -0.039623592 … 0.031274226 0.063107446; 0.15484622 0.16779025 … 0.11533891 0.11508792], [279, 117, 251, 105, 133, 170, 181, 115, 190, 132 … 76, 204, 199, 244, 256, 125, 251, 261, 262, 263])
ColBERT.encode_queries — Method

encode_queries(bert::HF.HGFBertModel, linear::Layers.Dense,
tokenizer::TextEncoders.AbstractTransformerTextEncoder,
queries::Vector{String}, dim::Int,
index_bsize::Int, query_token::String, attend_to_mask_tokens::Bool,
skiplist::Vector{Int})
Encode a list of query passages.

Arguments

- `bert`: The pre-trained BERT component of the ColBERT model.
- `linear`: The pre-trained linear component of the ColBERT model.
- `tokenizer`: The tokenizer to be used.
- `queries`: A list of strings representing the queries to be encoded.
- `dim`: The embedding dimension.
- `index_bsize`: The batch size to be used for running the transformer.
- `query_token`: The query token.
- `attend_to_mask_tokens`: Whether to attend to `"[MASK]"` tokens.
- `skiplist`: A list of tokens to skip.

Returns

An array containing the embeddings for each token in the query.
Examples
julia> using ColBERT: load_hgf_pretrained_local, ColBERTConfig, encode_queries;
julia> using CUDA, Flux, Transformers, TextEncodeBase;
julia> config = ColBERTConfig();
julia> dim = config.dim
128
julia> index_bsize = 128; # this is the batch size to be fed in the transformer
julia> query_maxlen = config.query_maxlen
32
julia> query_token = config.query_token_id
"[unused0]"
julia> tokenizer, bert, linear = load_hgf_pretrained_local("/home/codetalker7/models/colbertv2.0/");
julia> process = tokenizer.process;
julia> truncpad_pipe = Pipeline{:token}(
TextEncodeBase.trunc_or_pad(query_maxlen - 1, "[PAD]", :tail, :tail),
:token);
julia> process = process[1:4] |> truncpad_pipe |> process[6:end];
julia> tokenizer = TextEncoders.BertTextEncoder(
tokenizer.tokenizer, tokenizer.vocab, process; startsym = tokenizer.startsym,
endsym = tokenizer.endsym, padsym = tokenizer.padsym, trunc = tokenizer.trunc);
julia> bert = bert |> Flux.gpu;
julia> linear = linear |> Flux.gpu;
julia> skiplist = [lookup(tokenizer.vocab, tokenizer.padsym)]
1-element Vector{Int64}:
1
julia> attend_to_mask_tokens = config.attend_to_mask_tokens
false
julia> queries = [
"what are white spots on raspberries?",
"here is another query!",
];
julia> @time encode_queries(bert, linear, tokenizer, queries, dim, index_bsize,
query_token, attend_to_mask_tokens, skiplist);
[ Info: Encoding 2 queries.
0.029858 seconds (27.58 k allocations: 781.727 KiB, 0.00% compilation time)
ColBERT.extract_tokenizer_type — Method

extract_tokenizer_type(tkr_type::AbstractString)
Extract tokenizer type from config.
ColBERT.index — Method

index(indexer::Indexer)

Build an index given the configuration stored in `indexer`.

Arguments

- `indexer`: An `Indexer` which is used to build the index on disk.
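Examples

Continuing the `Indexer` example above, a minimal sketch:
julia> index(indexer);    # builds and saves the index at config.index_path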
ColBERT.index — Method

index(index_path::String, bert::HF.HGFBertModel, linear::Layers.Dense,
tokenizer::TextEncoders.AbstractTransformerTextEncoder,
collection::Vector{String}, dim::Int, index_bsize::Int,
doc_token::String, skiplist::Vector{Int}, num_chunks::Int,
chunksize::Int, centroids::AbstractMatrix{Float32},
bucket_cutoffs::AbstractVector{Float32}, nbits::Int)
Build the index for the `collection`.

The documents are processed in batches of size `chunksize` (see `setup`). Embeddings and document lengths are computed for each batch (see `encode_passages`), and they are saved to disk along with relevant metadata (see `save_chunk`).
Arguments

- `index_path`: Path where the index is to be saved.
- `bert`: The pre-trained BERT component of the ColBERT model.
- `linear`: The pre-trained linear component of the ColBERT model.
- `tokenizer`: Tokenizer to be used.
- `collection`: The collection to index.
- `dim`: The embedding dimension.
- `index_bsize`: The batch size used for running the transformer.
- `doc_token`: The document token.
- `skiplist`: List of tokens to skip.
- `num_chunks`: Total number of chunks.
- `chunksize`: The maximum size of a chunk.
- `centroids`: Centroids used to compute the compressed representations.
- `bucket_cutoffs`: Cutoffs used to compute the residuals.
- `nbits`: Number of bits to encode the residuals in.
ColBERT.kmeans_gpu_onehot! — Method

Examples
julia> using ColBERT, Flux, CUDA, Random;
julia> d, n, k = 100, 2000000, 50000 # dimensions, number of points, number of clusters
(100, 2000000, 50000)
julia> data = rand(Float32, d, n) |> Flux.gpu; # around 800MB
julia> centroids = data[:, randperm(n)[1:k]];
julia> point_bsize = 1000; # adjust according to your GPU/CPU memory
julia> @time assignments = ColBERT.kmeans_gpu_onehot!(
data, centroids, k; max_iters = 2, point_bsize = point_bsize)
[ Info: Iteration 1/2, max delta: 0.6814487
[ Info: Iteration 2/2, max delta: 0.28856403
76.381827 seconds (5.76 M allocations: 606.426 MiB, 4.25% gc time, 0.11% compilation time)
2000000-element Vector{Int32}:
24360
10954
29993
22113
19024
32192
33033
32738
19901
5142
23567
12686
18894
23919
7325
29809
27885
31122
1457
9823
41315
14311
21975
48753
16162
7809
33018
22410
26646
2607
34833
⋮
15216
26424
21939
9252
5071
14570
22467
37881
28239
8775
31290
4625
7561
7645
7277
36069
49799
39307
10595
7639
18879
12754
1233
29389
24772
47907
29380
1345
4781
35313
30000
julia> centroids
100×50000 CuArray{Float32, 2, CUDA.DeviceMemory}:
0.573378 0.509291 0.40079 0.614619 0.593501 0.532985 0.79016 0.573517 … 0.544782 0.666605 0.537127 0.490516 0.74021 0.345155 0.613033
0.710199 0.301702 0.570302 0.302831 0.378944 0.28444 0.577703 0.327737 0.27379 0.352727 0.413396 0.49565 0.685949 0.534816 0.540361
0.379057 0.424286 0.771943 0.411402 0.319783 0.550557 0.64573 0.679135 0.702826 0.846835 0.608924 0.376951 0.431148 0.642033 0.697345
0.694464 0.435644 0.422319 0.532234 0.521483 0.627431 0.501389 0.359163 0.328353 0.350925 0.485843 0.437292 0.354213 0.185923 0.427814
0.221736 0.506781 0.352585 0.678622 0.333673 0.50622 0.463275 0.591525 0.572961 0.473792 0.369353 0.400138 0.733724 0.477619 0.254028
0.619385 0.51777 0.40583 0.445265 0.224872 0.677207 0.713577 0.620289 … 0.389378 0.487728 0.675865 0.250588 0.614895 0.668617 0.235178
0.591426 0.395195 0.538931 0.744411 0.533349 0.338823 0.345266 0.327421 0.373282 0.36309 0.681582 0.646208 0.404389 0.251627 0.341416
0.583477 0.423426 0.247412 0.446173 0.280856 0.614167 0.533047 0.573224 0.45711 0.445103 0.697702 0.474529 0.616773 0.460811 0.286667
0.49608 0.685452 0.424273 0.683325 0.581213 0.684903 0.382428 0.529762 0.734883 0.71177 0.414117 0.417863 0.543535 0.610839 0.488656
0.626167 0.540865 0.677231 0.596885 0.378552 0.398865 0.518733 0.497296 0.661245 0.594468 0.288819 0.29435 0.467833 0.722748 0.663824
0.619386 0.579229 0.441548 0.386045 0.564118 0.646701 0.632154 0.612795 … 0.617854 0.597241 0.490215 0.308035 0.349091 0.486332 0.32071
0.315375 0.457891 0.642345 0.361314 0.410211 0.380876 0.844302 0.496581 0.726295 0.21279 0.555863 0.468077 0.448128 0.497228 0.688524
0.302116 0.55576 0.22489 0.50484 0.561481 0.461971 0.605235 0.627733 0.570166 0.536869 0.647504 0.458224 0.27462 0.553473 0.268046
0.745733 0.403701 0.468518 0.418122 0.533233 0.579005 0.837422 0.538135 0.704916 0.666066 0.571446 0.500032 0.585166 0.555079 0.39484
0.576735 0.590597 0.312162 0.330425 0.45483 0.279067 0.577954 0.539739 0.644922 0.185377 0.681872 0.36546 0.619736 0.755231 0.818024
0.548489 0.695465 0.835756 0.478009 0.412736 0.416005 0.118124 0.626901 … 0.313572 0.754964 0.659507 0.677611 0.479118 0.3991 0.622777
0.285406 0.381637 0.338189 0.544162 0.477955 0.546904 0.309153 0.439008 0.563208 0.346864 0.448714 0.383776 0.55155 0.3148 0.467101
0.823076 0.652229 0.504614 0.400098 0.357104 0.448227 0.24265 0.696984 0.485136 0.637487 0.643558 0.705938 0.632451 0.424837 0.766686
0.421668 0.343106 0.530787 0.528398 0.24584 0.699929 0.214073 0.419076 0.331078 0.35033 0.354848 0.46255 0.475431 0.715539 0.688314
0.779925 0.724435 0.638462 0.482254 0.521571 0.715278 0.621099 0.556042 0.308391 0.492443 0.36217 0.408848 0.73595 0.540198 0.698907
0.356398 0.544033 0.543013 0.462401 0.402219 0.387093 0.323547 0.373834 … 0.645622 0.674534 0.723415 0.353287 0.613711 0.38006 0.554985
0.658572 0.401115 0.25994 0.483548 0.52677 0.712259 0.774561 0.438474 0.376936 0.297307 0.455176 0.23899 0.608517 0.76084 0.382525
0.525316 0.362833 0.361821 0.383153 0.248305 0.401027 0.554528 0.278677 0.415318 0.512563 0.401782 0.674682 0.666895 0.663432 0.378345
0.580109 0.489022 0.255441 0.590038 0.488305 0.51133 0.508364 0.416333 0.262037 0.348079 0.564498 0.360297 0.702012 0.324764 0.249475
0.723813 0.548868 0.550225 0.438456 0.455546 0.714484 0.0994013 0.465583 0.590603 0.414145 0.583897 0.41563 0.411714 0.271341 0.440918
0.62465 0.664534 0.342419 0.648037 0.719117 0.665314 0.256789 0.325002 … 0.636772 0.235229 0.472394 0.656942 0.414241 0.216398 0.799625
0.409948 0.493941 0.522245 0.38117 0.235328 0.310665 0.557497 0.621436 0.413982 0.577326 0.645292 0.225434 0.430032 0.450371 0.375822
0.372894 0.635165 0.494829 0.440398 0.380812 0.755357 0.473521 0.487604 0.349699 0.659922 0.626307 0.437899 0.488775 0.404058 0.64511
0.288256 0.491838 0.338052 0.466105 0.363578 0.456235 0.425795 0.453427 0.226024 0.429285 0.604995 0.403821 0.33844 0.254136 0.42694
0.314443 0.319862 0.56776 0.652814 0.626939 0.234881 0.274685 0.531139 0.270967 0.547521 0.664938 0.451628 0.531532 0.592488 0.525191
0.493068 0.306231 0.562287 0.454218 0.199483 0.57302 0.238318 0.567198 … 0.297332 0.460382 0.285109 0.411792 0.356838 0.340022 0.414451
0.53873 0.258357 0.402785 0.269083 0.594396 0.505856 0.690911 0.738276 0.737582 0.369145 0.409122 0.336054 0.358317 0.392364 0.561769
0.617347 0.639471 0.333155 0.370546 0.526723 0.293309 0.247984 0.660384 0.647745 0.286011 0.681676 0.624425 0.580846 0.402701 0.297121
0.496282 0.378267 0.270501 0.475257 0.516464 0.356405 0.175957 0.539904 0.236559 0.58985 0.578107 0.543669 0.563102 0.71473 0.43457
0.297402 0.476382 0.426692 0.283131 0.626477 0.220255 0.372191 0.615784 0.374197 0.55345 0.495846 0.331621 0.645283 0.578616 0.389071
0.734077 0.371284 0.826699 0.684061 0.272948 0.693993 0.528874 0.304462 … 0.525932 0.395874 0.500069 0.559787 0.460612 0.798967 0.580689
⋮ ⋮ ⋱ ⋮
0.295452 0.589387 0.339522 0.383816 0.63141 0.505792 0.66544 0.479078 0.448193 0.774786 0.607631 0.349403 0.689084 0.619 0.251087
0.342872 0.684608 0.66651 0.402659 0.424726 0.591997 0.391954 0.667982 … 0.459421 0.376128 0.301928 0.538294 0.530345 0.458879 0.59855
0.449909 0.409996 0.149798 0.576651 0.290799 0.635566 0.437937 0.511792 0.648198 0.661462 0.61996 0.644484 0.636402 0.527594 0.407358
0.782475 0.421017 0.69657 0.691838 0.382575 0.805573 0.364693 0.597721 0.652466 0.666937 0.693412 0.490323 0.514455 0.380534 0.427285
0.314463 0.420641 0.364206 0.348991 0.59921 0.746625 0.617284 0.697596 0.342617 0.45338 0.363351 0.660113 0.674676 0.376416 0.721194
0.402126 0.588711 0.323173 0.388439 0.34814 0.491494 0.545984 0.648734 0.430481 0.378938 0.309212 0.382807 0.632475 0.367792 0.376823
0.555737 0.668767 0.490702 0.663971 0.250589 0.445352 0.172075 0.673576 … 0.322794 0.644713 0.394593 0.572583 0.687199 0.662051 0.3559
0.793682 0.698499 0.67152 0.46898 0.656144 0.353421 0.803591 0.633019 0.803097 0.640827 0.365467 0.679615 0.642185 0.685466 0.296224
0.428538 0.528681 0.438861 0.625715 0.591183 0.629757 0.456717 0.50485 0.405746 0.437458 0.368839 0.446011 0.488281 0.471933 0.514202
0.485429 0.738783 0.287516 0.463954 0.188286 0.544762 0.37223 0.58192 0.585194 0.489835 0.506583 0.464377 0.645507 0.804297 0.786932
0.29249 0.586557 0.608833 0.663233 0.576919 0.267828 0.308029 0.712437 0.533969 0.421972 0.476979 0.530931 0.47962 0.528001 0.621458
0.279038 0.445135 0.177712 0.515837 0.300508 0.281383 0.400402 0.651 … 0.58635 0.443282 0.657886 0.697657 0.552504 0.329047 0.399654
0.832609 0.485713 0.600559 0.699044 0.714713 0.606326 0.273329 0.440225 0.623437 0.667127 0.41734 0.767461 0.702767 0.601694 0.506635
0.297328 0.287248 0.36852 0.657753 0.698171 0.719895 0.238376 0.638514 0.343874 0.373995 0.511818 0.377467 0.389039 0.522639 0.686664
0.301796 0.737757 0.635025 0.666437 0.393605 0.346305 0.547774 0.689093 0.519264 0.361948 0.718109 0.475808 0.573496 0.514178 0.598478
0.549563 0.248966 0.364826 0.57668 0.590149 0.533822 0.664503 0.553704 0.284555 0.591084 0.316526 0.660029 0.516786 0.824489 0.689313
0.247931 0.238425 0.23728 0.516849 0.732181 0.405793 0.724634 0.5149 … 0.380765 0.696078 0.41157 0.642839 0.384414 0.493493 0.552407
0.606629 0.601705 0.319954 0.533014 0.382539 0.410641 0.29247 0.506377 0.615707 0.501867 0.475531 0.405969 0.333115 0.358202 0.502586
0.583896 0.619858 0.593031 0.451623 0.58986 0.349512 0.536081 0.298436 0.396871 0.239656 0.406909 0.541055 0.416507 0.547856 0.424243
0.691322 0.50077 0.323869 0.500225 0.420282 0.436531 0.703267 0.541637 0.539365 0.725134 0.693945 0.676646 0.556313 0.374397 0.583554
0.701328 0.488743 0.35439 0.613276 0.493706 0.399695 0.728355 0.467517 0.261417 0.575774 0.37854 0.490462 0.461564 0.556492 0.424225
0.718797 0.550606 0.565344 0.561342 0.355202 0.578364 0.786034 0.562179 … 0.289592 0.183233 0.524043 0.335948 0.333167 0.476679 0.65326
0.701058 0.380252 0.444291 0.532477 0.540552 0.696061 0.403728 0.58757 0.520714 0.510013 0.547041 0.564867 0.532286 0.501574 0.595203
0.365637 0.531816 0.565021 0.602144 0.548403 0.764079 0.365481 0.613074 0.360902 0.527056 0.375336 0.544605 0.689852 0.837963 0.459323
0.288392 0.268179 0.332016 0.689326 0.234238 0.23735 0.756387 0.532537 0.403286 0.471491 0.602447 0.429769 0.293544 0.437438 0.349532
0.664517 0.31624 0.59785 0.230114 0.376591 0.773395 0.752942 0.636399 0.326092 0.72005 0.333086 0.339832 0.325618 0.461294 0.524966
0.222333 0.305546 0.673752 0.762977 0.307967 0.312146 0.663083 0.58212 … 0.69865 0.643548 0.640484 0.755733 0.496422 0.649607 0.720769
0.411979 0.370252 0.237112 0.311196 0.610508 0.447023 0.506591 0.213862 0.721287 0.373431 0.594912 0.621447 0.43674 0.258687 0.560904
0.617416 0.641325 0.560164 0.313925 0.490977 0.337085 0.714373 0.506699 0.253813 0.470016 0.584523 0.447376 0.51011 0.270167 0.484992
0.623836 0.324357 0.734953 0.790519 0.455406 0.52695 0.403097 0.446101 0.633619 0.403004 0.694153 0.717927 0.47924 0.576069 0.253169
0.73859 0.344694 0.183747 0.69547 0.458342 0.481904 0.737565 0.720339 0.447743 0.619669 0.367867 0.34662 0.607812 0.251007 0.509758
0.530767 0.332264 0.550998 0.364326 0.722955 0.580428 0.490779 0.426905 … 0.793421 0.713281 0.779156 0.54861 0.674266 0.21644 0.493613
0.343766 0.379023 0.630344 0.744247 0.567047 0.377182 0.73119 0.615484 0.761156 0.264631 0.510148 0.481783 0.453394 0.410757 0.335559
0.568994 0.332011 0.631839 0.455666 0.631383 0.453398 0.654253 0.276721 0.268318 0.658483 0.523244 0.549092 0.485578 0.342858 0.436086
0.686312 0.268361 0.414777 0.437959 0.617892 0.582933 0.649577 0.342277 0.70994 0.435503 0.24157 0.668377 0.412632 0.667489 0.544822
0.446142 0.527333 0.160024 0.325712 0.330222 0.368513 0.661516 0.431168 0.44104 0.665175 0.286649 0.534375 0.67307 0.571995 0.3261
ColBERT.load_codec — Method

load_codec(index_path::String)
Load compression/decompression information from the index path.
Arguments
- `index_path`: The path of the index.
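Examples

A minimal sketch, assuming an index was previously built and saved at `./local_index`:
julia> using ColBERT: load_codec;
julia> codec = load_codec("./local_index");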
ColBERT.load_config — Method

load_config(index_path::String)
Load a `ColBERTConfig` from disk.

Arguments

- `index_path`: The path of the directory where the config resides.
Examples
julia> using ColBERT;
julia> config = ColBERTConfig(
use_gpu = true,
collection = "/home/codetalker7/documents",
index_path = "./local_index"
);
julia> ColBERT.save(config);
julia> ColBERT.load_config("./local_index")
ColBERTConfig(true, 0, 1, "[unused0]", "[unused1]", "[Q]", "[D]", "colbert-ir/colbertv2.0", "/home/codetalker7/documents", 128, 220, true, 32, false, "./local_index", 64, 2, 20, 2, 8192)
ColBERT.load_hgf_pretrained_local
— Method
load_hgf_pretrained_local(dir_spec::AbstractString;
    path_config::Union{Nothing, AbstractString} = nothing,
    path_tokenizer_config::Union{Nothing, AbstractString} = nothing,
    path_special_tokens_map::Union{Nothing, AbstractString} = nothing,
    path_tokenizer::Union{Nothing, AbstractString} = nothing,
    path_model::Union{Nothing, AbstractString} = nothing,
    kwargs...
)
Local model loader. Honors the load_hgf_pretrained interface, where you can request specific files to be loaded, e.g., my/dir/to/model:tokenizer or my/dir/to/model:config.
Arguments
dir_spec::AbstractString: Directory specification (the item after the colon is optional), e.g., my/dir/to/model or my/dir/to/model:tokenizer.
path_config::Union{Nothing, AbstractString}: Path to the config file.
path_tokenizer_config::Union{Nothing, AbstractString}: Path to the tokenizer config file.
path_special_tokens_map::Union{Nothing, AbstractString}: Path to the special tokens map file.
path_tokenizer::Union{Nothing, AbstractString}: Path to the tokenizer file.
path_model::Union{Nothing, AbstractString}: Path to the model file.
kwargs...: Additional keyword arguments passed on to the _load_model function, such as mmap, lazy, and trainmode.
Examples
julia> using ColBERT, CUDA;
julia> dir_spec = "/home/codetalker7/models/colbertv2.0/";
julia> tokenizer, model, linear = load_hgf_pretrained_local(dir_spec);
ColBERT.mask_skiplist!
— Method
mask_skiplist(tokenizer::TextEncoders.AbstractTransformerTextEncoder,
    integer_ids::AbstractMatrix{Int32}, skiplist::Union{Missing, Vector{Int64}})
Create a mask for the given integer_ids, based on the provided skiplist. If the skiplist is not missing, then any token IDs in the list will be filtered out along with the padding token. Otherwise, all tokens are included in the mask.
Arguments
tokenizer: The underlying tokenizer.
integer_ids: An Array of token IDs for the documents.
skiplist: A list of token IDs to skip in the mask.
Returns
An array of booleans indicating whether the corresponding token ID is included in the mask or not. The array has the same shape as integer_ids, i.e., (L, N), where L is the maximum length of any document in integer_ids and N is the number of documents.
Examples
In this example, we'll mask out all punctuation tokens as well as the pad symbol of a tokenizer.
julia> using ColBERT: mask_skiplist, load_hgf_pretrained_local, tensorize_docs;
julia> using TextEncodeBase
julia> tokenizer = load_hgf_pretrained_local("/home/codetalker7/models/colbertv2.0/:tokenizer");
julia> punctuations_and_padsym = [string.(collect("!\"#\$%&'()*+,-./:;<=>?@[\\]^_`{|}~"));
           tokenizer.padsym];
julia> skiplist = [lookup(tokenizer.vocab, sym)
for sym in punctuations_and_padsym]
33-element Vector{Int64}:
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1064
1065
1066
1067
1
julia> batch_text = [
"no punctuation text",
"this, batch,! of text contains puncts! but is larger so that? the other text contains pad symbol;"
];
julia> integer_ids, _ = tensorize_docs("[unused1]", tokenizer, batch_text);
julia> integer_ids
27×2 Matrix{Int32}:
102 102
3 3
2054 2024
26137 1011
6594 14109
14506 1011
3794 1000
103 1998
1 3794
1 3398
1 26137
1 16650
1 1000
1 2022
1 2004
1 3470
1 2062
1 2009
1 1030
1 1997
1 2061
1 3794
1 3398
1 11688
1 6455
1 1026
1 103
julia> decode(tokenizer, integer_ids)
27×2 Matrix{String}:
" [CLS]" " [CLS]"
" [unused1]" " [unused1]"
" no" " this"
" pun" " ,"
"ct" " batch"
"uation" " ,"
" text" " !"
" [SEP]" " of"
" [PAD]" " text"
" [PAD]" " contains"
" [PAD]" " pun"
" [PAD]" "cts"
" [PAD]" " !"
" [PAD]" " but"
" [PAD]" " is"
" [PAD]" " larger"
" [PAD]" " so"
" [PAD]" " that"
" [PAD]" " ?"
" [PAD]" " the"
" [PAD]" " other"
" [PAD]" " text"
" [PAD]" " contains"
" [PAD]" " pad"
" [PAD]" " symbol"
" [PAD]" " ;"
" [PAD]" " [SEP]"
julia> mask_skiplist(tokenizer, integer_ids, skiplist)
27×2 BitMatrix:
1 1
1 1
1 1
1 0
1 1
1 0
1 0
1 1
0 1
0 1
0 1
0 1
0 0
0 1
0 1
0 1
0 1
0 1
0 0
0 1
0 1
0 1
0 1
0 1
0 1
0 0
0 1
ColBERT.save
— Method
save(config::ColBERTConfig)
Save a ColBERTConfig to disk in JSON.
Arguments
config: The ColBERTConfig to save.
Examples
julia> using ColBERT;
julia> config = ColBERTConfig(
use_gpu = true,
collection = "/home/codetalker7/documents",
index_path = "./local_index"
);
julia> ColBERT.save(config);
ColBERT.save_chunk
— Method
save_chunk(
    index_path::String, codes::AbstractVector{UInt32}, residuals::AbstractMatrix{UInt8},
    chunk_idx::Int, passage_offset::Int, doclens::AbstractVector{Int})
Save a single chunk of compressed embeddings and their relevant metadata to disk.
The codes and compressed residuals for the chunk are saved in files named <chunk_idx>.codes.jld2 and <chunk_idx>.residuals.jld2 respectively. The document lengths are saved in a file named doclens.<chunk_idx>.jld2. Relevant metadata, including the number of documents in the chunk, the number of embeddings, and the passage offset, is saved in a file named <chunk_idx>.metadata.json.
Arguments
index_path: The path of the index.
codes: The codes for the chunk.
residuals: The compressed residuals for the chunk.
chunk_idx: The index of the current chunk being saved.
passage_offset: The index of the first passage in the chunk.
doclens: The document lengths vector for the current chunk.
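For illustration, here's a minimal sketch of saving one chunk. The shapes below are assumptions for the example (1000 embeddings across 100 passages, with 32 bytes of compressed residuals per embedding, as for dim = 128 and nbits = 2), not requirements of the function:
julia> using ColBERT: save_chunk;
julia> codes = rand(UInt32, 1000);           # one centroid code per embedding
julia> residuals = rand(UInt8, 32, 1000);    # 32 bytes/embedding, e.g. dim = 128 with nbits = 2
julia> doclens = fill(10, 100);              # 100 passages of 10 tokens each (sums to 1000)
julia> save_chunk("./local_index", codes, residuals, 1, 0, doclens);  # chunk 1; offset convention assumed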
ColBERT.save_codec
— Method
save_codec(
    index_path::String, centroids::Matrix{Float32}, bucket_cutoffs::Vector{Float32},
    bucket_weights::Vector{Float32}, avg_residual::Float32)
Save compression/decompression information to the index path.
Arguments
index_path: The path of the index.
centroids: The matrix of centroids of the index.
bucket_cutoffs: Cutoffs used to determine buckets during residual compression.
bucket_weights: Weights used to determine the decompressed values during decompression.
avg_residual: The average residual value, computed from the heldout set (see _compute_avg_residuals).
ColBERT.setup
— Method
setup(collection::Vector{String}, avg_doclen_est::Float32,
    num_clustering_embs::Int, chunksize::Union{Missing, Int}, nranks::Int)
Initialize the index by computing some indexing-specific estimates and the index plan.
The number of chunks into which the document embeddings will be stored is computed from the number of documents and the size of a chunk. The number of clusters to be used for indexing is computed, and is proportional to $16\sqrt{\text{Estimated number of embeddings}}$.
Arguments
collection: The collection of documents to index.
avg_doclen_est: The estimated average document length (in tokens) of the collection.
num_clustering_embs: The number of embeddings to be used for computing the clusters.
chunksize: The size of a chunk to be used. Can be Missing.
nranks: Number of GPUs. Currently this can only be 1.
Returns
A Dict containing the indexing plan.
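To make the plan arithmetic concrete, here's a small sketch of the estimates described above; the exact rounding used by the package may differ:
julia> num_documents = 100_000;
julia> avg_doclen_est = 10.0f0;
julia> chunksize = 25_000;
julia> cld(num_documents, chunksize)                # number of chunks
4
julia> num_embeddings_est = num_documents * avg_doclen_est;
julia> floor(Int, 16 * sqrt(num_embeddings_est))    # clusters ∝ 16√(estimated embeddings)
16000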
ColBERT.tensorize_docs
— Method
tensorize_docs(doc_token_id::String,
    tokenizer::TextEncoders.AbstractTransformerTextEncoder,
    batch_text::Vector{String})
Convert a collection of documents to tensors in the ColBERT format.
This function adds the document marker token at the beginning of each document and then converts the text data into integer IDs and masks using the tokenizer.
Arguments
doc_token_id: The document marker token (e.g., [unused1]) to be prepended to each document.
tokenizer: The tokenizer which is used to convert text data into integer IDs.
batch_text: A vector of document texts that will be converted into tensors of token IDs.
Returns
A tuple containing the following:
integer_ids: A Matrix of token IDs of shape (L, N), where L is the length of the largest document in batch_text, and N is the number of documents in the batch being considered.
integer_mask: A Matrix of attention masks, of the same shape as integer_ids.
Examples
julia> using ColBERT: tensorize_docs, load_hgf_pretrained_local;
julia> using Transformers, Transformers.TextEncoders, TextEncodeBase;
julia> tokenizer = load_hgf_pretrained_local("/home/codetalker7/models/colbertv2.0/:tokenizer");
# configure the tokenizer's maxlen and padding/truncation
julia> doc_maxlen = 20;
julia> process = tokenizer.process
Pipelines:
target[token] := TextEncodeBase.nestedcall(string_getvalue, source)
target[token] := Transformers.TextEncoders.grouping_sentence(target.token)
target[(token, segment)] := SequenceTemplate{String}([CLS]:<type=1> Input[1]:<type=1> [SEP]:<type=1> (Input[2]:<type=2> [SEP]:<type=2>)...)(target.token)
target[attention_mask] := (NeuralAttentionlib.LengthMask ∘ Transformers.TextEncoders.getlengths(512))(target.token)
target[token] := TextEncodeBase.trunc_and_pad(512, [PAD], tail, tail)(target.token)
target[token] := TextEncodeBase.nested2batch(target.token)
target[segment] := TextEncodeBase.trunc_and_pad(512, 1, tail, tail)(target.segment)
target[segment] := TextEncodeBase.nested2batch(target.segment)
target[sequence_mask] := identity(target.attention_mask)
target := (target.token, target.segment, target.attention_mask, target.sequence_mask)
julia> truncpad_pipe = Pipeline{:token}(
TextEncodeBase.trunc_and_pad(doc_maxlen - 1, "[PAD]", :tail, :tail),
:token);
julia> process = process[1:4] |> truncpad_pipe |> process[6:end];
julia> tokenizer = TextEncoders.BertTextEncoder(
tokenizer.tokenizer, tokenizer.vocab, process; startsym = tokenizer.startsym,
endsym = tokenizer.endsym, padsym = tokenizer.padsym, trunc = tokenizer.trunc);
julia> batch_text = [
"hello world",
"thank you!",
"a",
"this is some longer text, so length should be longer",
"this is an even longer document. this is some longer text, so length should be longer",
];
julia> integer_ids, bitmask = tensorize_docs(
"[unused1]", tokenizer, batch_text)
(Int32[102 102 … 102 102; 3 3 … 3 3; … ; 1 1 … 1 2023; 1 1 … 1 2937], Bool[1 1 … 1 1; 1 1 … 1 1; … ; 0 0 … 0 1; 0 0 … 0 1])
julia> integer_ids
20×5 Matrix{Int32}:
102 102 102 102 102
3 3 3 3 3
7593 4068 1038 2024 2024
2089 2018 103 2004 2004
103 1000 1 2071 2020
1 103 1 2937 2131
1 1 1 3794 2937
1 1 1 1011 6255
1 1 1 2062 1013
1 1 1 3092 2024
1 1 1 2324 2004
1 1 1 2023 2071
1 1 1 2937 2937
1 1 1 103 3794
1 1 1 1 1011
1 1 1 1 2062
1 1 1 1 3092
1 1 1 1 2324
1 1 1 1 2023
1 1 1 1 2937
julia> bitmask
20×5 Matrix{Bool}:
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 0 1 1
0 1 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 1 1
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
0 0 0 0 1
julia> TextEncoders.decode(tokenizer, integer_ids)
20×5 Matrix{String}:
"[CLS]" "[CLS]" "[CLS]" "[CLS]" "[CLS]"
"[unused1]" "[unused1]" "[unused1]" "[unused1]" "[unused1]"
"hello" "thank" "a" "this" "this"
"world" "you" "[SEP]" "is" "is"
"[SEP]" "!" "[PAD]" "some" "an"
"[PAD]" "[SEP]" "[PAD]" "longer" "even"
"[PAD]" "[PAD]" "[PAD]" "text" "longer"
"[PAD]" "[PAD]" "[PAD]" "," "document"
"[PAD]" "[PAD]" "[PAD]" "so" "."
"[PAD]" "[PAD]" "[PAD]" "length" "this"
"[PAD]" "[PAD]" "[PAD]" "should" "is"
"[PAD]" "[PAD]" "[PAD]" "be" "some"
"[PAD]" "[PAD]" "[PAD]" "longer" "longer"
"[PAD]" "[PAD]" "[PAD]" "[SEP]" "text"
"[PAD]" "[PAD]" "[PAD]" "[PAD]" ","
"[PAD]" "[PAD]" "[PAD]" "[PAD]" "so"
"[PAD]" "[PAD]" "[PAD]" "[PAD]" "length"
"[PAD]" "[PAD]" "[PAD]" "[PAD]" "should"
"[PAD]" "[PAD]" "[PAD]" "[PAD]" "be"
"[PAD]" "[PAD]" "[PAD]" "[PAD]" "longer"
ColBERT.tensorize_queries
— Method
tensorize_queries(query_token::String, attend_to_mask_tokens::Bool,
    tokenizer::TextEncoders.AbstractTransformerTextEncoder, batch_text::Vector{String})
Convert a collection of queries to tensors of token IDs and attention masks.
This function adds the query marker token at the beginning of each query text and then converts the text data into integer IDs and masks using the tokenizer.
Arguments
query_token: The query marker token (e.g., [unused0]) to be prepended to each query.
attend_to_mask_tokens: Whether or not to attend to mask tokens in the query.
tokenizer: The tokenizer which is used to convert text data into integer IDs.
batch_text: A vector of query texts that will be converted into tensors of token IDs.
Returns
A tuple integer_ids, integer_mask containing the token IDs and the attention mask. Each of these two matrices has shape (L, N), where L is the maximum query length specified by the config (see ColBERTConfig), and N is the number of queries in batch_text.
Examples
In this example, we first load the tokenizer locally, and then configure it to truncate or pad each sequence to the maximum query length specified by the config. Note that, at the time of writing this package, configuring tokenizers in Transformers.jl doesn't have a clean interface, so we have to configure the tokenizer manually.
julia> using ColBERT: tensorize_queries, load_hgf_pretrained_local;
julia> using Transformers, Transformers.TextEncoders, TextEncodeBase;
julia> tokenizer = load_hgf_pretrained_local("/home/codetalker7/models/colbertv2.0/:tokenizer");
# configure the tokenizer's maxlen and padding/truncation
julia> query_maxlen = 32;
julia> process = tokenizer.process;
julia> truncpad_pipe = Pipeline{:token}(
TextEncodeBase.trunc_or_pad(query_maxlen - 1, "[PAD]", :tail, :tail),
:token);
julia> process = process[1:4] |> truncpad_pipe |> process[6:end];
julia> tokenizer = TextEncoders.BertTextEncoder(
tokenizer.tokenizer, tokenizer.vocab, process; startsym = tokenizer.startsym,
endsym = tokenizer.endsym, padsym = tokenizer.padsym, trunc = tokenizer.trunc);
julia> batch_text = [
"what are white spots on raspberries?",
"what do rabbits eat?",
"this is a really long query. I'm deliberately making this long"*
"so that you can actually see that this is really truncated at 32 tokens"*
"and that the other two queries are padded to get 32 tokens."*
"this makes this a nice query as an example."
];
julia> integer_ids, bitmask = tensorize_queries(
           "[unused0]", false, tokenizer, batch_text)
(Int32[102 102 102; 2 2 2; … ; 104 104 8792; 104 104 2095], Bool[1 1 1; 1 1 1; … ; 0 0 1; 0 0 1])
julia> integer_ids
32×3 Matrix{Int32}:
102 102 102
2 2 2
2055 2055 2024
2025 2080 2004
2318 20404 1038
7517 4522 2429
2007 1030 2147
20711 103 23033
2362 104 1013
20969 104 1046
1030 104 1006
103 104 1050
104 104 9970
104 104 2438
104 104 2024
104 104 2147
104 104 6500
104 104 2009
104 104 2018
104 104 2065
104 104 2942
104 104 2157
104 104 2009
104 104 2024
104 104 2004
104 104 2429
104 104 25450
104 104 2013
104 104 3591
104 104 19205
104 104 8792
104 104 2095
julia> bitmask
32×3 Matrix{Bool}:
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 1 1
1 0 1
1 0 1
1 0 1
1 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
0 0 1
julia> TextEncoders.decode(tokenizer, integer_ids)
32×3 Matrix{String}:
"[CLS]" "[CLS]" "[CLS]"
"[unused0]" "[unused0]" "[unused0]"
"what" "what" "this"
"are" "do" "is"
"white" "rabbits" "a"
"spots" "eat" "really"
"on" "?" "long"
"ras" "[SEP]" "query"
"##p" "[MASK]" "."
"##berries" "[MASK]" "i"
"?" "[MASK]" "'"
"[SEP]" "[MASK]" "m"
"[MASK]" "[MASK]" "deliberately"
"[MASK]" "[MASK]" "making"
"[MASK]" "[MASK]" "this"
"[MASK]" "[MASK]" "long"
"[MASK]" "[MASK]" "##so"
"[MASK]" "[MASK]" "that"
"[MASK]" "[MASK]" "you"
"[MASK]" "[MASK]" "can"
"[MASK]" "[MASK]" "actually"
"[MASK]" "[MASK]" "see"
"[MASK]" "[MASK]" "that"
"[MASK]" "[MASK]" "this"
"[MASK]" "[MASK]" "is"
"[MASK]" "[MASK]" "really"
"[MASK]" "[MASK]" "truncated"
"[MASK]" "[MASK]" "at"
"[MASK]" "[MASK]" "32"
"[MASK]" "[MASK]" "token"
"[MASK]" "[MASK]" "##san"
"[MASK]" "[MASK]" "##d"
ColBERT.train
— Method
train(sample::AbstractMatrix{Float32}, heldout::AbstractMatrix{Float32},
    num_partitions::Int, nbits::Int, kmeans_niters::Int)
Compute centroids using a $k$-means clustering algorithm, and store the compression information on disk.
The average residuals and other compression data are computed via the _compute_avg_residuals function.
Arguments
sample: The matrix of sampled embeddings used to compute clusters.
heldout: The matrix of sample embeddings used to compute the residual information.
num_partitions: The number of clusters to compute.
nbits: The number of bits used to encode the residuals.
kmeans_niters: The maximum number of iterations in the $k$-means algorithm.
Returns
A Dict containing the residual codec, i.e., information used to compress/decompress residuals.
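For illustration, here's a minimal sketch with random data. The dim × n embedding layout and the parameter choices (1024 partitions, nbits = 2, 20 k-means iterations) are assumptions for the example; in an actual indexing run these matrices come from the sampled document embeddings (see _sample_embeddings), which are unit-normalized rather than uniformly random:
julia> using ColBERT: train;
julia> sample = rand(Float32, 128, 10_000);   # embeddings used to fit centroids (assumed dim × n layout)
julia> heldout = rand(Float32, 128, 1_000);   # embeddings held out for residual statistics
julia> codec = train(sample, heldout, 1024, 2, 20);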