Find all off-targets
Fastest method with symbolic alignments and prefix hashes
CHOPOFF.build_prefixHashDB
— Functionname::String,
genomepath::String,
motif::Motif,
storage_dir::String,
hash_len::Int = min(length_noPAM(motif) - motif.distance, 16);
reuse_saved::Bool = true)
Prepare prefixHashDB index for future searches using search_prefixHashDB
.
Will return a path to the database location, same as storage_dir
. If interested with searches within distance 4, preferably use prefix_len
of 8 or 9. You can also play with hash_len
parameter, but keeping it at 16 should be close to optimal.
Arguments
name
- Your preferred name for this index for easier identification.
genomepath
- Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.
motif
- Motif defines what kind of gRNA to search for.
storage_dir
- Folder path to the where index will be saved with name linearDB.bin
and many prefix files.
hash_len
- Length of the hash in bp. At maximum 16.
reuse_saved
- Whether to reuse paths that were saved for Cas9 distance 4 and prefix 16.
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "prefixHashDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a prefixHashDB
build_prefixHashDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "prefixHashDB", "results.csv")
search_prefixHashDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.search_prefixHashDB
— Functionsearch_prefixHashDB(
storage_dir::String,
guides::Vector{LongDNA{4}},
output_file::String;
distance::Int = 3,
early_stopping::Vector{Int} = Int.(floor.(exp.(0:distance))))
Find all off-targets for guides
within distance of dist
using prefixHashDB located at storage_dir
. Uses early stopping to stop searching when a guide passes a limit on number of off-targets. This method does not keep track of the off-target locations and does not filter overlapping off-targets, therefore it might hit the early stopping condition a little earlier than intended.
Assumes your guides do not contain PAM, and are all in the same direction as you would order for the lab e.g.:
5' - ...ACGTCATCG NGG - 3' -> will be input: ...ACGTCATCG
guide PAM
3' - CCN GGGCATGCT... - 5' -> will be input: ...AGCATGCCC
PAM guide
Arguments
output_file
- Path and name for the output file, this will be comma separated table, therefore .csv
extension is preferred. This search will create intermediate files which will have same name as output_file
, but with a sequence prefix. Final file will contain all those intermediate files.
distance
- Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.
early_stopping
- Integer vector. Early stopping condition. For example for distance 2, we need vector with 3 values e.g. [1, 1, 5]. Which means we will search with "up to 1 offtargets within distance 0", "up to 1 offtargets within distance 1"...
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "prefixHashDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a prefixHashDB
build_prefixHashDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "prefixHashDB", "results.csv")
search_prefixHashDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
Prefix-Suffix partial alignment
CHOPOFF.build_linearDB
— Functionbuild_linearDB(
name::String,
genomepath::String,
motif::Motif,
storage_dir::String,
prefix_len::Int = 7)
Prepare linearDB index for future searches using search_linearDB
.
Will return a path to the database location, same as storage_dir
. When this database is used for the guide off-target scan it is similar to linear in performance, hence the name. There is an optimization that if the alignment becomes impossible against the prefix we don't search the off-targets grouped inside the prefix. Therefore, it is advantageous to select much larger prefix than maximum search distance, however in that case number of files also grows. For example, if interested with searches within distance 4, preferably use prefix length of 7 or 8.
Arguments
name
- Your preferred name for this index for easier identification.
genomepath
- Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.
motif
- Motif defines what kind of gRNA to search for.
storage_dir
- Folder path to the where index will be saved with name linearDB.bin
and many prefix files.
prefix_len
- Size of the prefix by which off-targets are indexed. Prefix of 8 or larger will be the fastest, however it will also result in large number of files.
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "linearDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a linearDB
build_linearDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "linearDB", "results.csv")
search_linearDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.search_linearDB
— Functionsearch_linearDB(
storage_dir::String,
guides::Vector{LongDNA{4}},
output_file::String;
distance::Int = 4)
Find all off-targets for guides
within distance of dist
using linearDB located at storage_dir
.
Assumes your guides do not contain PAM, and are all in the same direction as you would order from the lab e.g.:
5' - ...ACGTCATCG NGG - 3' -> will be input: ...ACGTCATCG
guide PAM
3' - CCN GGGCATGCT... - 5' -> will be input: ...AGCATGCCC
PAM guide
Arguments
output_file
- Path and name for the output file, this will be comma separated table, therefore .csv
extension is preferred. This search will create intermediate files which will have same name as output_file
, but with a sequence prefix. Final file will contain all those intermediate files.
distance
- Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "linearDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a linearDB
build_linearDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "linearDB", "results.csv")
search_linearDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.build_motifDB
— Functionbuild_motifDB(
name::String,
genomepath::String,
motif::Motif,
storage_dir::String,
prefix_len::Int = 7;
skipmer_size::Int = Int(floor(length_noPAM(motif) / (motif.distance + 3))))
Prepare motifDB index for future searches using search_motifDB
.
Will return a path to the database location, same as storage_dir
. When this database is used for the guide off-target scan it is similar to search_linearDB
, however additional filter is applied on top of prefix filtering. Suffixes are used for next filter, similarly to pigeon hole principle - depending on the size of the skipmer skipmer_size
. For example, Cas9 off-target within distance 4 (d) might be 20bp long. We skip prefix_len
of 7, and are left with 13bp which can be split into 3 skipmer (r) of size 4, 1bp will be left unused. However when searching within distance of 4 and for prefix where initial alignment was of distance 3 (m) and adjustment parameter is 0 (a). We are obliged to find at least r - (d - m + a) which is 3 - (4 - 3 + 0) = 2 this many skipmers inside the off-targets.
There exist also another approach which builds on the idea that it might be more efficient to find at least two kmers of smaller size (named 01*0 seed) rather than one larger kmer (pigeon hole principle). You can use the adjust
option for that during search_linearDB
step.
Be sure to understand implications of using motifDB
as using wrong parameters on skipmer_size
might result in leaky filtering in relation to the assumed distance dist
and adjustment adjust
during search step in search_motifDB
.
Arguments
name
- Your preferred name for this index for easier identification.
genomepath
- Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.
motif
- Motif defines what kind of gRNA to search for.
storage_dir
- Folder path to the where index will be saved and many prefix files.
prefix_len
- Size of the prefix by which off-targets are indexed. Prefix of 8 or larger will be the fastest, however it will also result in large number of files.
skipmer_size
- Size of the skipmer as described above. Be careful when setting this too large!
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "motifDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a motifDB
build_motifDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "motifDB", "results.csv")
search_motifDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.search_motifDB
— Functionsearch_motifDB(
storage_dir::String,
guides::Vector{LongDNA{4}},
output_file::String;
distance::Int = 3,
adjust::Int = 0)
Find all off-targets for guides
within distance of distance
using motifDB located at storage_dir
.
Assumes your guides do not contain PAM, and are all in the same direction as you would order from the lab e.g.:
5' - ...ACGTCATCG NGG - 3' -> will be input: ...ACGTCATCG
guide PAM
3' - CCN GGGCATGCT... - 5' -> will be input: ...AGCATGCCC
PAM guide
Arguments
storage_dir
- Directory containing motifDB.
guides
- a vector of gRNAs without PAM.
output_file
- File to which write detailed results. This search will create intermediate files which will have same name as output_file, but with a sequence prefix. Final file will contain all those intermediate files, and other files with sequence prefix will be deleted.
distance
- Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.
adjust
- This will be crucial parameter for tightening second layer of filtering after, the initial prefix alignment. For example, Cas9 off-target within distance 4 (d) might be 20bp long. We skip prefix_len
of 7, and are left with 13bp which can be split into 3 skipmers (r) of size 4, 1bp will be left unused. However when searching within distance of 4 and for prefix where initial alignment was of distance 3 (m) and adjustment parameter is 0 (a). We are obliged to find at least r - (d - m + a)
which is 3 - (4 - 3 + 0) = 2
this many skipmers inside the off-targets.
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "motifDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a motifDB
build_motifDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "motifDB", "results.csv")
search_motifDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
Vantage-Point tree
CHOPOFF.build_treeDB
— Functionbuild_treeDB(
name::String,
genomepath::String,
motif::Motif,
storage_dir::String,
prefix_len::Int = 7)
Build a Vantage Point tree DB of offtargets for the given motif
, DB groups off-targets by their prefixes, each prefix has its own Vantage Point tree.
Will return a path to the database location, same as storage_dir
.
There is an optimization that if the alignment becomes impossible against the prefix we don't search the off-targets grouped inside the prefix. Therefore it is advantageous to select larger prefix than maximum search distance, however in that case number of files also grows.
Arguments
name
- Your preferred name for this index for easier identification.
genomepath
- Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.
motif
- Motif defines what kind of gRNA to search for.
storage_dir
- Folder path to the where index will be saved with name linearDB.bin
and many prefix files.
prefix_len
- Size of the prefix by which off-targets are indexed. Prefix of 8 or larger will be the fastest, however it will also result in large number of files.
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "treeDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a treeDB
build_treeDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "treeDB", "results.csv")
search_treeDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.inspect_treeDB
— Functioninspect_treeDB(
storage_dir::String;
levels::Int = 5,
inspect_prefix::String = "")
See small part of the full vantage point tree of the treeDB.
TreeDB can be split based on the distance to the radius (r) into inside (left <= r) and right (outside > r) nodes.
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "treeDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a treeDB
build_treeDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "treeDB", "results.csv")
search_treeDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
# finally, view some part of the database!
inspect_treeDB(db_path; inspect_prefix = "CCGTCGC")
CHOPOFF.search_treeDB
— Functionsearch_treeDB(
storage_dir::String,
guides::Vector{LongDNA{4}},
output_file::String;
distance::Int = 3)
Search previously build treeDB database for the off-targets of the guides
.
Assumes your guides do not contain PAM, and are all in the same direction as you would order from the lab e.g.:
5' - ...ACGTCATCG NGG - 3' -> will be input: ...ACGTCATCG
guide PAM
3' - CCN GGGCATGCT... - 5' -> will be input: ...AGCATGCCC
PAM guide
Arguments
dist
- Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.
detail
- Path and name for the output file. This search will create intermediate files which will have same name as detail, but with a sequence prefix. Final file will contain all those intermediate files. Leave detail
empty if you are only interested in off-target counts returned by the treeDB.
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "treeDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a treeDB
build_treeDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "treeDB", "results.csv")
search_treeDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
Experiments with FM-index
CHOPOFF.build_PathTemplates
— Functionbuild_PathTemplates(
motif::Motif;
storagepath::String = "",
mismatch_only::Bool = false,
restrict_to_len::Int = length_noPAM(motif),
withPAM::Bool = false)
Builds up a PathTemplates object. Stores shortened version of all possible paths within each distance d
mapped on the graph of all possible alignments of sequence of length len
. Then one can use templates_to_sequences_extended
or templates_to_sequences
and map guide sequence to all possible alignments quickly.
Arguments
len
- length of the sequence (e.g. guide - without PAM)
d
- Maximal distance on which to build the graph.
storagepath
- If not empty "", will save the object under given path.
mismatch_only
- Whether to skip insertions/deletions.
restrict_to_len
- To which length ambiguity should be expanded and after which length ambiguity should be collapsed if possible. For example: ACTG and ANNN with restriction to length 2, would result in these seqeunces: ACNN AANN AGNN ATNN This length does not includes PAM - it applies directly to the guide seqeunce. The default is full length of the guide and its maximal distance.
withPAM
- Whether to include PAM in the paths. Default is false.
CHOPOFF.build_fmiDB
— Functionbuild_fmiDB(
genomepath::String,
storage_dir::String)
Prepare FM-index for future searches.
Arguments
genomepath
- Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.
storage_dir
- Path to the DIRECTORY where index with many files will be saved.
Examples
# prepare libs
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
fmi_dir = joinpath(tdir, "fmi")
mkpath(fmi_dir)
# use CHOPOFF example genome
genome = joinpath(vcat(splitpath(dirname(pathof(CHOPOFF)))[1:end-1],
"test", "sample_data", "genome", "semirandom.fa"))
# build a fmiDB!
build_fmiDB(genome, fmi_dir)
CHOPOFF.search_fmiDB
— Functionsearch_fmiDB(
guides::Vector{LongDNA{4}},
mpt::PathTemplates,
motif::Motif,
fmidbdir::String,
output_file::String;
distance::Int = 2)
Search FM-index for off-targets using brute-force enumeration method.
Experimental! Proof-of-concept!
This method uses PathTemplates
build on top of Motif
to enumerate all possible off-target sequences, next, these sequences are found in the genome using FM-index. This method is impractically slow above distance of 2.
Arguments
guides
- a vector of gRNAs without PAM.
mpt
- PathTemplates object that contains abstraction for all possible alignments
motif
- Motif defines what kind of gRNA to search for. Has to be compatible with mpt
.
fmidbdir
- Path to the folder where FM-index was build using build_fmi
.
output_file
- Where output will be saved.
distance
- Search distance, maximum of 2 is practical.
Examples
# prepare libs
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
fmi_dir = joinpath(tdir, "fmi")
mkpath(fmi_dir)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build FM-index
build_fmiDB(genome, fmi_dir)
motif = Motif("Cas9"; distance = 1)
mpt = build_PathTemplates(motif; withPAM = true) # its important to add PAM here!
# prepare output folder
res_dir = joinpath(tdir, "results")
mkpath(res_dir)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(res_dir, "results.csv")
search_fmiDB(guides, mpt, fmi_dir, res_path; distance = 1)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 1)
CHOPOFF.build_pamDB
— Functionbuild_pamDB(
fmidbdir::String,
motif::Motif;
storage_path::String = "")
Find locations of all the PAM for a given motif in the genome.
Example of what position we store for traditional Cas9 (NGG, CCN) and Cas12a (TTTV, BAAA):
NGG CCN TTTN NAAA
^ ^ ^ ^
Arguments
fmidbdir
- Path to directory with the FM-index build with build_fmiDB
.
motif
- Motif
defines which PAM we will locate in the genome.
storage_path
- Path to the DIRECTORY where index will be saved.
Examples
# prepare libs
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
fmi_dir = joinpath(tdir, "fmi")
mkpath(fmi_dir)
# use CHOPOFF example genome
genome = joinpath(
vcat(
splitpath(dirname(pathof(CHOPOFF)))[1:end-1],
"test", "sample_data", "genome", "semirandom.fa"))
# build FM-index
build_fmiDB(genome, fmi_dir)
# build a pamDB
pamDB = build_pamDB(fmi_dir, Motif("Cas9"))
CHOPOFF.search_fmiDB_seed
— Functionsearch_fmiDB_seed(
guides::Vector{LongDNA{4}},
fmidbdir::String,
genomepath::String,
pamDB::PAMinFMI,
output_file::String;
distance::Int = 2)
Search FM-index for off-targets using 01*0 seed method. Read more here: publication and pdf.
Experimental! Proof-of-concept!
Arguments
guides
- a vector of gRNAs without PAM.
fmidbdir
- Path to the folder where FM-index was build using build_fmi
.
genomepath
- Path to the genome used to build the FM-index.
pamDB
- object build with build_pamDB
that contains locations of the PAM inside the genome.
output_file
- Where output will be saved.
distance
- Search distance, maximum of 2 is practical.
Examples
# prepare libs
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
fmi_dir = joinpath(tdir, "fmi")
mkpath(fmi_dir)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build FM-index
build_fmiDB(genome, fmi_dir)
# build a pamDB
motif = Motif("Cas9"; distance = 1)
pamDB = build_pamDB(fmi_dir, motif)
# prepare PathTemplates
mpt = build_PathTemplates(motif)
# prepare output folder
res_dir = joinpath(tdir, "results")
mkpath(res_dir)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(res_dir, "results.csv")
search_fmiDB_seed(guides, fmi_dir, genome, pamDB, res_path; distance = 1)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 1)
CHOPOFF.build_binaryFuseFilterDB
— Functionbuild_binaryFuseFilterDB(
name::String,
genomepath::String,
motif::Motif,
storage_dir::String;
seed::UInt64 = UInt64(0x726b2b9d438b9d4d),
max_iterations::Int = 10,
precision::DataType = UInt32)
Prepare hashDB index for future searches using search_hashDB
.
Arguments
name
- Your preferred name for this index to ease future identification.
genomepath
- Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.
motif
- Motif defines what kind of gRNA to search for and at what maxium distance.
storage_dir
- Directory to the where many files needed by the database will be saved. Naming of the files follows this pattern: BinaryFuseFilterDB_ + chromsome + .bin Each unique motif has its own file naming created.
seed
- Optional. Seed is used during hashing for randomization.
max_iterations
- When finding hashing structure for binary fuse filter it might fail sometimes, we will retry max_iterations
number of times though.
precision
- The higher the precision the larger the database, but also chances for error decrease dramatically. We support UInt8, UInt16, and UInt32.
restrict_to_len
- Restrict lengths of the motif
for the purpose of checking its presence in the genome. Allows for significant speedups when expanding all possible sequences for each guide, as we will expand up to the specified length here. For example, default setting for Cas9, would restrict standard 20bp to 16bp for the genome presence check, for distance of 4 that would be 8 bases (4bp from the 20 - 16, and 4 bases because of the potential extension) that have to be actually aligned in the genome.
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "binaryFuseFilterDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a binaryFuseFilterDB
build_binaryFuseFilterDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "binaryFuseFilterDB", "results.csv")
search_binaryFuseFilterDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
CHOPOFF.search_binaryFuseFilterDB
— Functionsearch_binaryFuseFilterDB(
bffddbir::String,
fmidbdir::String,
genomepath::String,
guides::Vector{LongDNA{4}},
output_file::String;
distance::Int = 0)
Find all off-targets for guides
within distance of dist
using BinaryFuseFilterDB located at storage_dir
.
Assumes your guides do not contain PAM, and are all in the same direction as you would order from the lab e.g.:
5' - ...ACGTCATCG NGG - 3' -> will be input: ...ACGTCATCG
guide PAM
3' - CCN GGGCATGCT... - 5' -> will be input: ...AGCATGCCC
PAM guide
Arguments
bffdbdir
- Folder location where BinaryFuseFilterDB is stored at.
fmidbdir
- Folder location where FM-index is build.
guides
- Vector of your guides, without PAM.
output_file
- Path and name for the output file, this will be comma separated table, therefore .csv
extension is preferred. This search will create intermediate files which will have same name as output_file
, but with a sequence prefix. Final file will contain all those intermediate files.
distance
- Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.
Examples
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "binaryFuseFilterDB")
mkpath(db_path)
# use CHOPOFF example genome
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
# build a binaryFuseFilterDB
build_binaryFuseFilterDB(
"samirandom", genome,
Motif("Cas9"),
db_path)
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# finally, make results!
res_path = joinpath(tdir, "binaryFuseFilterDB", "results.csv")
search_binaryFuseFilterDB(db_path, guides, res_path; distance = 3)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(res_path))
# filter results by close proximity
res = filter_overlapping(res, 23)
# summarize results into a table of counts by distance
summary = summarize_offtargets(res; distance = 3)
VCF
CHOPOFF.build_vcfDB
— Functionbuild_vcfDB(
name::String,
genomepath::String,
vcfpath::String,
motif::Motif,
storage_path::String,
hash_len::Int = min(length_noPAM(motif) - motif.distance, 16);
reuse_saved::Bool = true,
variant_overlaps = true)
Builds a database of all potential off-targets that overlap any of the variants in the VCF file. It supports combinations of variants that are close to each other, will report all possible combinations of variants. This database uses simialr principles to prefixHashDB
, also utilizes hashed prefix of specific length. In case of troubles with loading of VCF files, the only fields that we use are ID, CHROM, POS, REF, ALT, so its often possible to remove INFO field and other unnecesary fields which may cause troubles.
Arguments
name
- Your preferred name for this index for future identification.
genomepath
- Path to the genome file, it can either be fasta or 2bit file. In case of fasta also prepare fasta index file with ".fai" extension.
vcfpath
- Path to the VCF file, it has to be compatible with your genome.
motif
- Motif defines what kind of gRNA to search for.
storage_path
- Path to the where index will be saved. Normally, it includes ".bin" extension.
hash_len
- length of the prefix that is stored inside the hash
reuse_saved
- Whether to reuse paths that were saved for Cas9 distance 4 and prefix 16.
variant_overlaps
- Whether to check for all potential combinations of alternate alleles for nearby variants. Only use with small VCF files! Preferably only run for specific variants.
Examples
# prepare libs
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "vcfDB.bin")
# use CHOPOFF example genome and vcf file
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
vcf = joinpath(vcat(chopoff_path,
"test", "sample_data", "artificial.vcf"))
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# example VCF file
build_vcfDB(
"samirandom", genome, vcf,
Motif("Cas9"; distance = 2, ambig_max = 3),
db_path)
# search using vcfDB
output_file = joinpath(tdir, "output.csv")
search_vcfDB(db_path, guides, output_file)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(output_file))
CHOPOFF.search_vcfDB
— Functionsearch_vcfDB(
storage_path::String,
guides::Vector{LongDNA{4}},
output_file::String;
distance::Int = 3,
early_stopping::Vector{Int} = Int.(floor.(exp.(0:distance))))
Find all off-targets for guides
within distance of dist
using vcfDB located at storage_dir
. Uses early stopping to stop searching when a guide passes a limit on number of off-targets. This method does not keep track of the off-target locations and does not filter overlapping off-targets, therefore it might hit the early stopping condition a little earlier than intended. Especially, when variants have multiple ALT and multiple variants are overlapping off-targets, this function will report each combination of the overlapping variants. Each of these combinations will also count towards early stopping condition.
Assumes that your guides do not contain PAM, and are all in the same direction as you would order for the lab e.g.:
5' - ...ACGTCATCG NGG - 3' -> will be input: ...ACGTCATCG
guide PAM
3' - CCN GGGCATGCT... - 5' -> will be input: ...AGCATGCCC
PAM guide
Arguments
output_file
- Path and name for the output file, this will be comma separated table, therefore .csv
extension is preferred. This search will create intermediate files which will have same name as output_file
, but with a sequence prefix. Final file will contain all those intermediate files.
distance
- Defines maximum levenshtein distance (insertions, deletions, mismatches) for which off-targets are considered.
early_stopping
- Integer vector. Early stopping condition. For example for distance 2, we need vector with 3 values e.g. [1, 1, 5]. Which means we will search with "up to 1 offtargets within distance 0", "up to 1 offtargets within distance 1"...
Examples
# prepare libs
using CHOPOFF, BioSequences
# make a temporary directory
tdir = tempname()
db_path = joinpath(tdir, "vcfDB.bin")
# use CHOPOFF example genome and vcf file
chopoff_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(chopoff_path,
"test", "sample_data", "genome", "semirandom.fa"))
vcf = joinpath(vcat(chopoff_path,
"test", "sample_data", "artificial.vcf"))
# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(chopoff_path,
"test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)
# example VCF file
build_vcfDB(
"samirandom", genome, vcf,
Motif("Cas9"; distance = 2, ambig_max = 3),
db_path)
# search using vcfDB
output_file = joinpath(tdir, "output.csv")
search_vcfDB(db_path, guides, output_file)
# load results
using DataFrames, CSV
res = DataFrame(CSV.File(output_file))