Find all off-targets

Fastest method with symbolic alignments and prefix hashes

CHOPOFF.build_prefixHashDBFunction
name::String,
    genomepath::String,
    motif::Motif,
    storage_dir::String,
    hash_len::Int = min(length_noPAM(motif) - motif.distance, 16);
    reuse_saved::Bool = true)

Prepare prefixHashDB index for future searches using search_prefixHashDB.

Will return a path to the database location, same as storage_dir. If interested with searches within distance 4, preferably use prefix_len of 8 or 9. You can also play with hash_len parameter, but keeping it at 16 should be close to optimal.

Arguments

name - Your preferred name for this index for easier identification.

genomepath - Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.

motif - Motif defines what kind of gRNA to search for.

storage_dir - Folder path to the where index will be saved with name linearDB.bin and many prefix files.

hash_len - Length of the hash in bp. At maximum 16.

reuse_saved - Whether to reuse paths that were saved for Cas9 distance 4 and prefix 16.

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "prefixHashDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a prefixHashDB
build_prefixHashDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "prefixHashDB", "results.csv")
search_prefixHashDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.search_prefixHashDBFunction
search_prefixHashDB(
    storage_dir::String, 
    guides::Vector{LongDNA{4}}, 
    output_file::String;
    distance::Int = 3,
    early_stopping::Vector{Int} = Int.(floor.(exp.(0:distance))))

Find all off-targets for guides within distance of dist using prefixHashDB located at storage_dir. Uses early stopping to stop searching when a guide passes a limit on number of off-targets. This method does not keep track of the off-target locations and does not filter overlapping off-targets, therefore it might hit the early stopping condition a little earlier than intended.

Assumes your guides do not contain PAM, and are all in the same direction as you would order for the lab e.g.:

5' - ...ACGTCATCG NGG - 3'  -> will be input: ...ACGTCATCG
     guide        PAM
    
3' - CCN GGGCATGCT... - 5'  -> will be input: ...AGCATGCCC
     PAM guide

Arguments

output_file - Path and name for the output file, this will be comma separated table, therefore .csv extension is preferred. This search will create intermediate files which will have same name as output_file, but with a sequence prefix. Final file will contain all those intermediate files.

distance - Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.

early_stopping - Integer vector. Early stopping condition. For example for distance 2, we need vector with 3 values e.g. [1, 1, 5]. Which means we will search with "up to 1 offtargets within distance 0", "up to 1 offtargets within distance 1"...

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "prefixHashDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a prefixHashDB
build_prefixHashDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "prefixHashDB", "results.csv")
search_prefixHashDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)

Prefix-Suffix partial alignment

CHOPOFF.build_linearDBFunction
build_linearDB(
    name::String,
    genomepath::String,
    motif::Motif,
    storage_dir::String,
    prefix_len::Int = 7)

Prepare linearDB index for future searches using search_linearDB.

Will return a path to the database location, same as storage_dir. When this database is used for the guide off-target scan it is similar to linear in performance, hence the name. There is an optimization that if the alignment becomes impossible against the prefix we don't search the off-targets grouped inside the prefix. Therefore, it is advantageous to select much larger prefix than maximum search distance, however in that case number of files also grows. For example, if interested with searches within distance 4, preferably use prefix length of 7 or 8.

Arguments

name - Your preferred name for this index for easier identification.

genomepath - Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.

motif - Motif defines what kind of gRNA to search for.

storage_dir - Folder path to the where index will be saved with name linearDB.bin and many prefix files.

prefix_len - Size of the prefix by which off-targets are indexed. Prefix of 8 or larger will be the fastest, however it will also result in large number of files.

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "linearDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a linearDB
build_linearDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "linearDB", "results.csv")
search_linearDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.search_linearDBFunction
search_linearDB(
    storage_dir::String, 
    guides::Vector{LongDNA{4}}, 
    output_file::String;
    distance::Int = 4)

Find all off-targets for guides within distance of dist using linearDB located at storage_dir.

Assumes your guides do not contain PAM, and are all in the same direction as you would order from the lab e.g.:

5' - ...ACGTCATCG NGG - 3'  -> will be input: ...ACGTCATCG
     guide        PAM
    
3' - CCN GGGCATGCT... - 5'  -> will be input: ...AGCATGCCC
     PAM guide

Arguments

output_file - Path and name for the output file, this will be comma separated table, therefore .csv extension is preferred. This search will create intermediate files which will have same name as output_file, but with a sequence prefix. Final file will contain all those intermediate files.

distance - Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "linearDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a linearDB
build_linearDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "linearDB", "results.csv")
search_linearDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.build_motifDBFunction
build_motifDB(
    name::String,
    genomepath::String,
    motif::Motif,
    storage_dir::String,
    prefix_len::Int = 7;
    skipmer_size::Int = Int(floor(length_noPAM(motif) / (motif.distance + 3))))

Prepare motifDB index for future searches using search_motifDB.

Will return a path to the database location, same as storage_dir. When this database is used for the guide off-target scan it is similar to search_linearDB, however additional filter is applied on top of prefix filtering. Suffixes are used for next filter, similarly to pigeon hole principle - depending on the size of the skipmer skipmer_size. For example, Cas9 off-target within distance 4 (d) might be 20bp long. We skip prefix_len of 7, and are left with 13bp which can be split into 3 skipmer (r) of size 4, 1bp will be left unused. However when searching within distance of 4 and for prefix where initial alignment was of distance 3 (m) and adjustment parameter is 0 (a). We are obliged to find at least r - (d - m + a) which is 3 - (4 - 3 + 0) = 2 this many skipmers inside the off-targets.

There exist also another approach which builds on the idea that it might be more efficient to find at least two kmers of smaller size (named 01*0 seed) rather than one larger kmer (pigeon hole principle). You can use the adjust option for that during search_linearDB step.

Be sure to understand implications of using motifDB as using wrong parameters on skipmer_size might result in leaky filtering in relation to the assumed distance dist and adjustment adjust during search step in search_motifDB.

Arguments

name - Your preferred name for this index for easier identification.

genomepath - Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.

motif - Motif defines what kind of gRNA to search for.

storage_dir - Folder path to the where index will be saved and many prefix files.

prefix_len - Size of the prefix by which off-targets are indexed. Prefix of 8 or larger will be the fastest, however it will also result in large number of files.

skipmer_size - Size of the skipmer as described above. Be careful when setting this too large!

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "motifDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a motifDB
build_motifDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "motifDB", "results.csv")
search_motifDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.search_motifDBFunction
search_motifDB(
    storage_dir::String, 
    guides::Vector{LongDNA{4}}, 
    output_file::String;
    distance::Int = 3,
    adjust::Int = 0)

Find all off-targets for guides within distance of distance using motifDB located at storage_dir.

Assumes your guides do not contain PAM, and are all in the same direction as you would order from the lab e.g.:

5' - ...ACGTCATCG NGG - 3'  -> will be input: ...ACGTCATCG
     guide        PAM
    
3' - CCN GGGCATGCT... - 5'  -> will be input: ...AGCATGCCC
     PAM guide

Arguments

storage_dir - Directory containing motifDB.

guides - a vector of gRNAs without PAM.

output_file - File to which write detailed results. This search will create intermediate files which will have same name as output_file, but with a sequence prefix. Final file will contain all those intermediate files, and other files with sequence prefix will be deleted.

distance - Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.

adjust - This will be crucial parameter for tightening second layer of filtering after, the initial prefix alignment. For example, Cas9 off-target within distance 4 (d) might be 20bp long. We skip prefix_len of 7, and are left with 13bp which can be split into 3 skipmers (r) of size 4, 1bp will be left unused. However when searching within distance of 4 and for prefix where initial alignment was of distance 3 (m) and adjustment parameter is 0 (a). We are obliged to find at least r - (d - m + a) which is 3 - (4 - 3 + 0) = 2 this many skipmers inside the off-targets.

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "motifDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a motifDB
build_motifDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "motifDB", "results.csv")
search_motifDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)

Vantage-Point tree

CHOPOFF.build_treeDBFunction
build_treeDB(
    name::String,
    genomepath::String,
    motif::Motif,
    storage_dir::String,
    prefix_len::Int = 7)

Build a Vantage Point tree DB of offtargets for the given motif, DB groups off-targets by their prefixes, each prefix has its own Vantage Point tree.

Will return a path to the database location, same as storage_dir.

There is an optimization that if the alignment becomes impossible against the prefix we don't search the off-targets grouped inside the prefix. Therefore it is advantageous to select larger prefix than maximum search distance, however in that case number of files also grows.

Arguments

name - Your preferred name for this index for easier identification.

genomepath - Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.

motif - Motif defines what kind of gRNA to search for.

storage_dir - Folder path to the where index will be saved with name linearDB.bin and many prefix files.

prefix_len - Size of the prefix by which off-targets are indexed. Prefix of 8 or larger will be the fastest, however it will also result in large number of files.

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "treeDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a treeDB
build_treeDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "treeDB", "results.csv")
search_treeDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.inspect_treeDBFunction
inspect_treeDB(
    storage_dir::String; 
    levels::Int = 5, 
    inspect_prefix::String = "")

See small part of the full vantage point tree of the treeDB.

TreeDB can be split based on the distance to the radius (r) into inside (left <= r) and right (outside > r) nodes.

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "treeDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a treeDB
build_treeDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "treeDB", "results.csv")
search_treeDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)


# finally, view some part of the database!
inspect_treeDB(db_path; inspect_prefix = "CCGTCGC")
CHOPOFF.search_treeDBFunction
search_treeDB(
    storage_dir::String, 
    guides::Vector{LongDNA{4}}, 
    output_file::String;
    distance::Int = 3)

Search previously build treeDB database for the off-targets of the guides.

Assumes your guides do not contain PAM, and are all in the same direction as you would order from the lab e.g.:

5' - ...ACGTCATCG NGG - 3'  -> will be input: ...ACGTCATCG
     guide        PAM

3' - CCN GGGCATGCT... - 5'  -> will be input: ...AGCATGCCC
     PAM guide

Arguments

dist - Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.

detail - Path and name for the output file. This search will create intermediate files which will have same name as detail, but with a sequence prefix. Final file will contain all those intermediate files. Leave detail empty if you are only interested in off-target counts returned by the treeDB.

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "treeDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a treeDB
build_treeDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "treeDB", "results.csv")
search_treeDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)

Experiments with FM-index

CHOPOFF.build_PathTemplatesFunction
build_PathTemplates(
    motif::Motif; 
    storagepath::String = "", 
    mismatch_only::Bool = false, 
    restrict_to_len::Int = length_noPAM(motif),
    withPAM::Bool = false)

Builds up a PathTemplates object. Stores shortened version of all possible paths within each distance d mapped on the graph of all possible alignments of sequence of length len. Then one can use templates_to_sequences_extended or templates_to_sequences and map guide sequence to all possible alignments quickly.

Arguments

len - length of the sequence (e.g. guide - without PAM)

d - Maximal distance on which to build the graph.

storagepath - If not empty "", will save the object under given path.

mismatch_only - Whether to skip insertions/deletions.

restrict_to_len - To which length ambiguity should be expanded and after which length ambiguity should be collapsed if possible. For example: ACTG and ANNN with restriction to length 2, would result in these seqeunces: ACNN AANN AGNN ATNN This length does not includes PAM - it applies directly to the guide seqeunce. The default is full length of the guide and its maximal distance.

withPAM - Whether to include PAM in the paths. Default is false.

CHOPOFF.build_fmiDBFunction
build_fmiDB(
    genomepath::String,
    storage_dir::String)

Prepare FM-index for future searches.

Arguments

genomepath - Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.

storage_dir - Path to the DIRECTORY where index with many files will be saved.

Examples

# prepare libs
using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
fmi_dir = joinpath(tdir, "fmi")
mkpath(fmi_dir)

# use CHOPOFF example genome
genome = joinpath(vcat(splitpath(dirname(pathof(CHOPOFF)))[1:end-1], 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a fmiDB!
build_fmiDB(genome, fmi_dir)
CHOPOFF.search_fmiDBFunction
search_fmiDB(
    guides::Vector{LongDNA{4}}, 
    mpt::PathTemplates, 
    motif::Motif, 
    fmidbdir::String,
    output_file::String; 
    distance::Int = 2)

Search FM-index for off-targets using brute-force enumeration method.

Experimental! Proof-of-concept!

This method uses PathTemplates build on top of Motif to enumerate all possible off-target sequences, next, these sequences are found in the genome using FM-index. This method is impractically slow above distance of 2.

Arguments

guides - a vector of gRNAs without PAM.

mpt - PathTemplates object that contains abstraction for all possible alignments

motif - Motif defines what kind of gRNA to search for. Has to be compatible with mpt.

fmidbdir - Path to the folder where FM-index was build using build_fmi.

output_file - Where output will be saved.

distance - Search distance, maximum of 2 is practical.

Examples

# prepare libs
using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
fmi_dir = joinpath(tdir, "fmi")
mkpath(fmi_dir)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))
# build FM-index
build_fmiDB(genome, fmi_dir)
motif = Motif("Cas9"; distance = 1)
mpt = build_PathTemplates(motif; withPAM = true) # its important to add PAM here!

# prepare output folder
res_dir = joinpath(tdir, "results")
mkpath(res_dir)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(res_dir, "results.csv")
search_fmiDB(guides, mpt, fmi_dir, res_path; distance = 1)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 1)
CHOPOFF.build_pamDBFunction
build_pamDB(
    fmidbdir::String, 
    motif::Motif; 
    storage_path::String = "")

Find locations of all the PAM for a given motif in the genome.

Example of what position we store for traditional Cas9 (NGG, CCN) and Cas12a (TTTV, BAAA):

NGG  CCN TTTN  NAAA
^      ^    ^  ^

Arguments

fmidbdir - Path to directory with the FM-index build with build_fmiDB.

motif - Motif defines which PAM we will locate in the genome.

storage_path - Path to the DIRECTORY where index will be saved.

Examples

# prepare libs
using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
fmi_dir = joinpath(tdir, "fmi")
mkpath(fmi_dir)

# use CHOPOFF example genome
genome = joinpath(
    vcat(
        splitpath(dirname(pathof(CHOPOFF)))[1:end-1], 
        "test", "sample_data", "genome", "semirandom.fa"))
# build FM-index
build_fmiDB(genome, fmi_dir)

# build a pamDB
pamDB = build_pamDB(fmi_dir, Motif("Cas9"))
CHOPOFF.search_fmiDB_seedFunction
search_fmiDB_seed(
    guides::Vector{LongDNA{4}},
    fmidbdir::String, 
    genomepath::String, 
    pamDB::PAMinFMI,
    output_file::String; 
    distance::Int = 2)

Search FM-index for off-targets using 01*0 seed method. Read more here: publication and pdf.

Experimental! Proof-of-concept!

Arguments

guides - a vector of gRNAs without PAM.

fmidbdir - Path to the folder where FM-index was build using build_fmi.

genomepath - Path to the genome used to build the FM-index.

pamDB - object build with build_pamDB that contains locations of the PAM inside the genome.

output_file - Where output will be saved.

distance - Search distance, maximum of 2 is practical.

Examples

# prepare libs
using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
fmi_dir = joinpath(tdir, "fmi")
mkpath(fmi_dir)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))
# build FM-index
build_fmiDB(genome, fmi_dir)

# build a pamDB
motif = Motif("Cas9"; distance = 1)
pamDB = build_pamDB(fmi_dir, motif)

# prepare PathTemplates
mpt = build_PathTemplates(motif)

# prepare output folder
res_dir = joinpath(tdir, "results")
mkpath(res_dir)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(res_dir, "results.csv")
search_fmiDB_seed(guides, fmi_dir, genome, pamDB, res_path; distance = 1)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 1)
CHOPOFF.build_binaryFuseFilterDBFunction
build_binaryFuseFilterDB(
    name::String, 
    genomepath::String, 
    motif::Motif,
    storage_dir::String;
    seed::UInt64 = UInt64(0x726b2b9d438b9d4d),
    max_iterations::Int = 10,
    precision::DataType = UInt32)

Prepare hashDB index for future searches using search_hashDB.

Arguments

name - Your preferred name for this index to ease future identification.

genomepath - Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.

motif - Motif defines what kind of gRNA to search for and at what maxium distance.

storage_dir - Directory to the where many files needed by the database will be saved. Naming of the files follows this pattern: BinaryFuseFilterDB_ + chromsome + .bin Each unique motif has its own file naming created.

seed - Optional. Seed is used during hashing for randomization.

max_iterations - When finding hashing structure for binary fuse filter it might fail sometimes, we will retry max_iterations number of times though.

precision- The higher the precision the larger the database, but also chances for error decrease dramatically. We support UInt8, UInt16, and UInt32.

restrict_to_len - Restrict lengths of the motif for the purpose of checking its presence in the genome. Allows for significant speedups when expanding all possible sequences for each guide, as we will expand up to the specified length here. For example, default setting for Cas9, would restrict standard 20bp to 16bp for the genome presence check, for distance of 4 that would be 8 bases (4bp from the 20 - 16, and 4 bases because of the potential extension) that have to be actually aligned in the genome.

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "binaryFuseFilterDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a binaryFuseFilterDB
build_binaryFuseFilterDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "binaryFuseFilterDB", "results.csv")
search_binaryFuseFilterDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.search_binaryFuseFilterDBFunction
search_binaryFuseFilterDB(
    bffddbir::String, 
    fmidbdir::String,
    genomepath::String, 
    guides::Vector{LongDNA{4}}, 
    output_file::String;
    distance::Int = 0)

Find all off-targets for guides within distance of dist using BinaryFuseFilterDB located at storage_dir.

Assumes your guides do not contain PAM, and are all in the same direction as you would order from the lab e.g.:

5' - ...ACGTCATCG NGG - 3'  -> will be input: ...ACGTCATCG
     guide        PAM
    
3' - CCN GGGCATGCT... - 5'  -> will be input: ...AGCATGCCC
     PAM guide

Arguments

bffdbdir - Folder location where BinaryFuseFilterDB is stored at.

fmidbdir - Folder location where FM-index is build.

guides - Vector of your guides, without PAM.

output_file - Path and name for the output file, this will be comma separated table, therefore .csv extension is preferred. This search will create intermediate files which will have same name as output_file, but with a sequence prefix. Final file will contain all those intermediate files.

distance - Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.

Examples

using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "binaryFuseFilterDB")
mkpath(db_path)

# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a binaryFuseFilterDB
build_binaryFuseFilterDB(
    "samirandom", genome, 
    Motif("Cas9"), 
    db_path)

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
    
# finally, make results!
res_path = joinpath(tdir, "binaryFuseFilterDB", "results.csv")
search_binaryFuseFilterDB(db_path, guides, res_path; distance = 3)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))

# filter results by close proximity
res = filter_overlapping(res, 23)

# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)

VCF

CHOPOFF.build_vcfDBFunction
build_vcfDB(
    name::String, 
    genomepath::String, 
    vcfpath::String,
    motif::Motif,
    storage_path::String,
    hash_len::Int = min(length_noPAM(motif) - motif.distance, 16);
    reuse_saved::Bool = true,
    variant_overlaps = true)

Builds a database of all potential off-targets that overlap any of the variants in the VCF file. It supports combinations of variants that are close to each other, will report all possible combinations of variants. This database uses simialr principles to prefixHashDB, also utilizes hashed prefix of specific length. In case of troubles with loading of VCF files, the only fields that we use are ID, CHROM, POS, REF, ALT, so its often possible to remove INFO field and other unnecesary fields which may cause troubles.

Arguments

name - Your preferred name for this index for future identification.

genomepath - Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.

vcfpath - Path to the VCF file, it has to be compatible with your genome.

motif - Motif defines what kind of gRNA to search for.

storage_path - Path to the where index will be saved. Normally, it includes ".bin" extension.

hash_len - length of the prefix that is stored inside the hash

reuse_saved - Whether to reuse paths that were saved for Cas9 distance 4 and prefix 16.

variant_overlaps - Whether to check for all potential combinations of alternate alleles for nearby variants. Only use with small VCF files! Preferably only run for specific variants.

Examples

# prepare libs
using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "vcfDB.bin")

# use CHOPOFF example genome and vcf file
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))
vcf = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "artificial.vcf"))

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)

# example VCF file
build_vcfDB(
    "samirandom", genome, vcf,
    Motif("Cas9"; distance = 2, ambig_max = 3),
    db_path)

# search using vcfDB
output_file = joinpath(tdir, "output.csv")
search_vcfDB(db_path, guides, output_file)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(output_file))
CHOPOFF.search_vcfDBFunction
search_vcfDB(
    storage_path::String, 
    guides::Vector{LongDNA{4}}, 
    output_file::String;
    distance::Int = 3,
    early_stopping::Vector{Int} = Int.(floor.(exp.(0:distance))))

Find all off-targets for guides within distance of dist using vcfDB located at storage_dir. Uses early stopping to stop searching when a guide passes a limit on number of off-targets. This method does not keep track of the off-target locations and does not filter overlapping off-targets, therefore it might hit the early stopping condition a little earlier than intended. Especially, when variants have multiple ALT and multiple variants are overlapping off-targets, this function will report each combination of the overlapping variants. Each of these combinations will also count towards early stopping condition.

Assumes that your guides do not contain PAM, and are all in the same direction as you would order for the lab e.g.:

5' - ...ACGTCATCG NGG - 3'  -> will be input: ...ACGTCATCG
     guide        PAM
    
3' - CCN GGGCATGCT... - 5'  -> will be input: ...AGCATGCCC
     PAM guide

Arguments

output_file - Path and name for the output file, this will be comma separated table, therefore .csv extension is preferred. This search will create intermediate files which will have same name as output_file, but with a sequence prefix. Final file will contain all those intermediate files.

distance - Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.

early_stopping - Integer vector. Early stopping condition. For example for distance 2, we need vector with 3 values e.g. [1, 1, 5]. Which means we will search with "up to 1 offtargets within distance 0", "up to 1 offtargets within distance 1"...

Examples

# prepare libs
using CHOPOFF, BioSequences

# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "vcfDB.bin")

# use CHOPOFF example genome and vcf file
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "genome", "semirandom.fa"))
vcf = joinpath(vcat(chopoff_path, 
    "test", "sample_data", "artificial.vcf"))

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)

# example VCF file
build_vcfDB(
    "samirandom", genome, vcf,
    Motif("Cas9"; distance = 2, ambig_max = 3),
    db_path)

# search using vcfDB
output_file = joinpath(tdir, "output.csv")
search_vcfDB(db_path, guides, output_file)

# load results
using DataFrames, CSV
res = DataFrame(CSV.File(output_file))