Alignment-free filters for gRNAs

CHOPOFF.build_hashDB — Function

build_hashDB(
    name::String, 
    genomepath::String, 
    motif::Motif;
    storage_path::String = "",
    seed::UInt64 = UInt64(0x726b2b9d438b9d4d),
    max_iterations::Int = 10,
    max_count::Int = 10,
    precision::DataType = UInt16)

Prepare hashDB index for future searches using search_hashDB.

Arguments

name - Your preferred name for this index to ease future identification.

genomepath - Path to the genome file, which can be either a FASTA or a 2bit file. For FASTA, also prepare a FASTA index file with the ".fai" extension.

motif - Motif defines what kind of gRNA to search for.

storage_path - Path to where the index will be saved.

seed - Optional. Seed is used during hashing for randomization.

max_iterations - Constructing the hashing structure for a binary fuse filter can occasionally fail; construction is retried up to max_iterations times.

max_count - Off-targets occurring more often than this count are all put into one bin. Set this to the smallest number of off-targets at distance 1 that you consider acceptable.

precision - The higher the precision, the larger the database, but the chance of error decreases dramatically. Supported types are UInt8, UInt16, and UInt32.
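
To give a feel for the precision trade-off, here is a minimal sketch that assumes the textbook estimate for fingerprint-based filters such as binary fuse filters: with k-bit fingerprints, the false-positive rate is roughly 2^-k. The helper names `fingerprint_bits` and `approx_fpr` are hypothetical and not part of CHOPOFF, and the error rates CHOPOFF achieves in practice may differ.

```julia
# Textbook estimate for a k-bit fingerprint filter: false-positive rate ~ 2^-k.
# This is an approximation, not a measurement of CHOPOFF itself.
fingerprint_bits(::Type{T}) where {T<:Unsigned} = 8 * sizeof(T)
approx_fpr(::Type{T}) where {T<:Unsigned} = 2.0^(-fingerprint_bits(T))

for T in (UInt8, UInt16, UInt32)
    println(rpad(string(T), 8), fingerprint_bits(T), " bits, approx. FPR = ", approx_fpr(T))
end
```

Each step up in precision doubles the space per stored fingerprint while shrinking the estimated error rate by several orders of magnitude.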

Examples

# prepare libs
using CHOPOFF, BioSequences

# use CHOPOFF example genome
genome = joinpath(
    vcat(splitpath(dirname(pathof(CHOPOFF)))[1:end-1], 
    "test", "sample_data", "genome", "semirandom.fa"))

# finally, build a hashDB
build_hashDB(
    "samirandom", genome, 
    Motif("Cas9"; distance = 1, ambig_max = 0))

CHOPOFF.search_hashDB — Function

search_hashDB(
    db::HashDB,
    guides::Vector{LongDNA{4}},
    right::Bool)

Estimate off-target counts for guides using the hashDB db.

A probabilistic filter guarantees that a sequence that is in the set is always reported as present (no false negatives), but with low probability it may report a sequence as present when it is not (false positive). If both columns in the results are 0, it is guaranteed that this gRNA has no off-targets in the genome!
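
The one-sided guarantee can be illustrated with a toy fingerprint filter. This is only a sketch: CHOPOFF uses binary fuse filters, not the structure below, and the names `build_toy_filter`, `maybe_contains`, and `fingerprint` are hypothetical.

```julia
# Toy filter that stores only an 8-bit fingerprint of each sequence.
# Illustrative only; this is not how CHOPOFF stores its data.
fingerprint(seq::AbstractString) = hash(seq) % UInt8

build_toy_filter(seqs) = Set(fingerprint(s) for s in seqs)

maybe_contains(filter::Set{UInt8}, seq::AbstractString) = fingerprint(seq) in filter

stored = ["ACGTACGTACGTACGTACGT", "TTTTCCCCGGGGAAAATTTT"]
toy_filter = build_toy_filter(stored)

# Stored sequences are always found: no false negatives.
println(all(maybe_contains(toy_filter, s) for s in stored))

# None of these 6-mers were stored, so every hit is a false positive.
queries = vec([join(p) for p in Iterators.product(fill("ACGT", 6)...)])
false_hits = count(q -> maybe_contains(toy_filter, q), queries)
println(false_hits, " false positives out of ", length(queries), " queries")
```

A negative answer from such a filter is always trustworthy, which is why a result of 0 can be treated as a guarantee while a positive count remains an estimate.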

Also, the maximum count for each off-target in the database is capped at the max_count specified when building the hashDB. This means that counts larger than max_count are no longer estimated correctly. It's likely you would not care about those guides anyhow.

The right argument specifies the direction in which the database is checked. With right = true, the database is checked from unique off-targets, which occur once, towards increasingly frequent off-targets until max_count is reached. This may report lower than real off-target counts (underestimate) for some sequences, but it will not reject any gRNA that should not be rejected, so it is suitable for filtering out gRNAs we do not need. The left approach (right = false, high counts to low counts) is also supported and can be used for ordering gRNAs to the best of the database's ability; it may overestimate counts for some gRNAs. When a gRNA is reported as off-target free, this is guaranteed to be true in both cases (low-to-high and high-to-low).
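
The effect of the scan direction and of the max_count cap can be sketched with a toy model. It assumes one bin per capped count, uses plain Sets, and injects false positives by hand. The names `bins`, `scan`, and the guide labels are hypothetical, and the real result layout of search_hashDB is not modeled here.

```julia
# Toy model: bins[c] holds the sequences whose capped count is c.
# Plain Sets never give false positives, so some are injected by hand below.
max_count = 3
true_counts = Dict("guideA" => 1, "guideB" => 2, "guideC" => 7)

bins = [Set{String}() for _ in 1:max_count]
for (seq, c) in true_counts
    push!(bins[min(c, max_count)], seq)   # counts above max_count share the last bin
end

# Emulate filter false positives for guideB (true count 2) in bins 1 and 3.
push!(bins[1], "guideB")
push!(bins[3], "guideB")

# Return the first bin that reports the sequence, scanning low-to-high
# (right = true) or high-to-low (right = false); 0 means no bin matched.
function scan(bins, seq; right::Bool)
    order = right ? eachindex(bins) : reverse(eachindex(bins))
    for c in order
        seq in bins[c] && return c
    end
    return 0
end

println(scan(bins, "guideB"; right = true))    # low-to-high: underestimates (1)
println(scan(bins, "guideB"; right = false))   # high-to-low: overestimates (3)
println(scan(bins, "guideC"; right = true))    # true count 7 is capped at 3
println(scan(bins, "guideZ"; right = false))   # absent from every bin: 0
```

In this model a sequence absent from every bin yields 0 in both directions, matching the guarantee that an off-target-free result holds regardless of the direction chosen.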

Examples

# prepare libs
using CHOPOFF, BioSequences

# use CHOPOFF example genome
CHOPOFF_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(CHOPOFF_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a hashDB
db = build_hashDB(
    "samirandom", genome, 
    Motif("Cas9"; distance = 1, ambig_max = 0))

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(CHOPOFF_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)

# finally, get results!
hdb_res = search_hashDB(db, guides, false)

CHOPOFF.build_dictDB — Function

build_dictDB(
    name::String, 
    genomepath::String, 
    motif::Motif;
    storage_path::String = "")

Prepare dictDB index for future searches using search_dictDB.

Arguments

name - Your preferred name for this index to ease future identification.

genomepath - Path to the genome file, which can be either a FASTA or a 2bit file. For FASTA, also prepare a FASTA index file with the ".fai" extension.

motif - Motif defines what kind of gRNA to search for.

storage_path - Path to where the index will be saved.

Examples

# prepare libs
using CHOPOFF, BioSequences

# use CHOPOFF example genome
genome = joinpath(vcat(splitpath(dirname(pathof(CHOPOFF)))[1:end-1], 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a dictDB!
db = build_dictDB(
    "samirandom", genome, 
    Motif("Cas9"; distance = 1, ambig_max = 0))

CHOPOFF.search_dictDB — Function

search_dictDB(
    db::DictDB,
    guides::Vector{LongDNA{4}})

Summarize off-target counts for guides using dictDB.

This is a simple dictionary storing all possible off-targets and their counts for a given Motif. If you find no off-targets using this method, it is guaranteed that this gRNA has no off-targets in the genome! Beware that the dictionary can be very big (e.g. human genome ~8Gb).
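
The idea of an exact count dictionary can be sketched in a few lines. This toy version counts every k-mer of a short string and ignores PAM sites, the reverse strand, and mismatches, all of which a real Motif would define. `count_kmers` and `toy_db` are hypothetical names, not CHOPOFF API.

```julia
# Toy exact-count dictionary over all k-mers of a short sequence.
# Ignores PAM, reverse strand, and mismatches; for illustration only.
function count_kmers(genome::AbstractString, k::Int)
    counts = Dict{String,Int}()
    for i in 1:(length(genome) - k + 1)
        kmer = genome[i:i+k-1]
        counts[kmer] = get(counts, kmer, 0) + 1
    end
    return counts
end

toy_genome = "ACGTACGTTTACGT"
toy_db = count_kmers(toy_genome, 4)

println(get(toy_db, "ACGT", 0))   # occurs at positions 1, 5 and 11
println(get(toy_db, "GGGG", 0))   # absent, so the exact count is 0
```

Because every key is stored explicitly, lookups are exact in both directions, at the cost of memory that grows with the number of distinct sequences.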

Examples

# prepare libs
using CHOPOFF, BioSequences

# use CHOPOFF example genome
CHOPOFF_path = splitpath(dirname(pathof(CHOPOFF)))[1:end-1]
genome = joinpath(vcat(CHOPOFF_path, 
    "test", "sample_data", "genome", "semirandom.fa"))

# build a dictDB
db = build_dictDB(
    "samirandom", genome, 
    Motif("Cas9"; distance = 1, ambig_max = 0))

# load up example gRNAs
guides_s = Set(readlines(joinpath(vcat(CHOPOFF_path, 
    "test", "sample_data", "guides.txt"))))
guides = LongDNA{4}.(guides_s)

# finally, get results!
res = search_dictDB(db, guides)