Local alignment with kbo

Published November 19, 2024

I like to create new tools to solve already solved problems. This time I’ve created a new tool for reference-based alignment of bacterial genomes, and computation of sequence identity and coverage of some reference gene sequences in a query assembly. The tool is called kbo and is available from https://github.com/tmaklin/kbo (written in Rust).

kbo comes with some interesting algorithmic bioinformatics that converts plain k-mer matching between a database and a query sequence into local alignments by looking at the longest common prefix length in each k-mer and its best match in a database (see the preprint by colleagues from Helsinki).

Overview

The basic idea is that if the longest common prefix length – a k-bounded matching statistic – of each k-mer match and their order in the query are known, then the following events are identifiable by looking at the (ordered) matching statistic vector:

  • Matching region between query and reference: matching statistic has a high value.
  • Region in query not present in reference: matching statistic has random value drawn from noise distribution.
  • Single character mismatch or insertion in query: matching statistic decreases linearly from k to 0 and immediately goes back to a high value.
  • Insertion in reference or jump from one region in query to another: matching statistic decreases linearly from k to 1 and immediately goes back to a high value.

These are described in documentation for the derandomize and translate modules, the kbo crate documentation, and will be turned into a pre-print in early 2025.

This post contains a guide and examples for the two main use cases I have in mind for kbo. You can replicate them using the kbo command-line interface.

Installation

Run in browser

An in-browser demo version is available at maklin.fi/kbo.

You can find the source code for the browser version in a separate git repository.

Run locally

Download a precompiled binary from here, or install the command-line interface using cargo:

cargo install kbo-cli

and then call kbo to run the tool.

I am also developing a browser version at https://github.com/tmaklin/kbo-gui which will hopefully be available soon.

Finding genes of interest in a query

Assume you have sequenced and assembled a bacterial genome and are interested in knowing whether the assembly contains some gene of interest or not. Typically, this would be a task for blast, but we want to use kbo find.

Setup:

  • A database containing 19 interesting genes from GitHub.
  • A genome which has variants of these genes from the ENA.

Run the following to see where the genes are located in the query

kbo find -r db.fasta GCA_000013265.1.fasta.gz

This will produce the output

This replaces the query.contig column with the name of the contig (click to expand)
query	ref	q.start	q.end	strand	length	mismatches	gap_opens	identity	coverage	query.contig	ref.contig
GCA_000013265.1.fasta.gz	db.fasta	56646	57164	+	519	0	0	100.00	1.03	ENA|CP000244|CP000244.1 Escherichia coli UTI89 plasmid pUTI89, complete sequence.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2165476	2166423	+	948	0	0	100.00	1.88	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2900668	2910288	-	9621	0	0	100.00	19.06	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2910329	2912929	-	2601	0	0	100.00	5.15	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2912939	2913808	-	870	0	0	100.00	1.72	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2913838	2914086	-	249	0	0	100.00	0.49	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2914090	2916485	-	2396	0	0	100.00	4.75	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2916533	2921329	-	4797	1	0	99.98	9.50	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2921379	2924411	-	3033	0	0	100.00	6.01	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2924455	2930955	-	6501	0	0	100.00	12.88	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2930966	2938886	-	7921	0	0	100.00	15.69	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2938948	2944751	-	5804	0	0	100.00	11.50	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2944782	2947241	-	2460	0	0	100.00	4.87	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2947254	2949483	-	2230	0	0	100.00	4.42	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta
GCA_000013265.1.fasta.gz	db.fasta	2949518	2950030	-	513	0	0	100.00	1.02	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	db.fasta

Note that the output doesn’t tell which genes were found, only where in the query there are matches in the reference. If this is inconvenient for you, add the --detailed flag which performs the analysis gene-by-gene at the cost of some extra runtime:

kbo find --detailed --threads 4 -r db.fasta GCA_000013265.1.fasta.gz
This replaces the query.contig column with the name of the contig (click to expand)
query	ref	q.start	q.end	strand	length	mismatches	gap_opens	identity	coverage	query.contig	ref.contig
GCA_000013265.1.fasta.gz	db.fasta	56646	57164	+	519	00	100.00	100.00	ENA|CP000244|CP000244.1 Escherichia coli UTI89 plasmid pUTI89, complete sequence.	clbS-like_4ce09a
GCA_000013265.1.fasta.gz	db.fasta	2165476	2165688	+	213	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbR|locus_tag=ECOK1_RS11410|product="colibactin biosynthesis LuxR family transcriptional regulator ClbR"|protein_id=WP_000357141.1
GCA_000013265.1.fasta.gz	db.fasta	2165689	2166423	+	735	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbA|locus_tag=ECOK1_RS11415|product="colibactin biosynthesis phosphopantetheinyl transferase ClbA"|protein_id=WP_001217110.1
GCA_000013265.1.fasta.gz	db.fasta	2900668	2910288	-	9621	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbB|locus_tag=ECOK1_RS11405|product="colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbB"|protein_id=WP_001518711.1
GCA_000013265.1.fasta.gz	db.fasta	2910329	2912929	-	2601	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbC|locus_tag=ECOK1_RS11400|product="colibactin polyketide synthase ClbC"|protein_id=WP_001297908.1
GCA_000013265.1.fasta.gz	db.fasta	2912939	2913808	-	870	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbD|locus_tag=ECOK1_RS11395|product="colibactin biosynthesis dehydrogenase ClbD"|protein_id=WP_000982270.1
GCA_000013265.1.fasta.gz	db.fasta	2913838	2914086	-	249	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbE|locus_tag=ECOK1_RS11390|product="colibactin biosynthesis aminomalonyl-acyl carrier protein ClbE"|protein_id=WP_001297917.1
GCA_000013265.1.fasta.gz	db.fasta	2914090	2915220	-	1131	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbF|locus_tag=ECOK1_RS11385|product="colibactin biosynthesis dehydrogenase ClbF"|protein_id=WP_000337350.1
GCA_000013265.1.fasta.gz	db.fasta	2915217	2916485	-	1269	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbG|locus_tag=ECOK1_RS11380|product="colibactin biosynthesis acyltransferase ClbG"|protein_id=WP_000159201.1
GCA_000013265.1.fasta.gz	db.fasta	2916533	2921329	-	4797	10	99.98	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbH|locus_tag=ECOK1_RS11375|product="colibactin non-ribosomal peptide synthetase ClbH"|protein_id=WP_001304254.1
GCA_000013265.1.fasta.gz	db.fasta	2921379	2924411	-	3033	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbI|locus_tag=ECOK1_RS11370|product="colibactin polyketide synthase ClbI"|protein_id=WP_000829570.1
GCA_000013265.1.fasta.gz	db.fasta	2924455	2930955	-	6501	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbJ|locus_tag=ECOK1_RS11365|product="colibactin non-ribosomal peptide synthetase ClbJ"|protein_id=WP_001468003.1
GCA_000013265.1.fasta.gz	db.fasta	2929045	2930331	-	1287	20	99.84	19.91	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbK|locus_tag=ECOK1_RS11360|product="colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbK"|protein_id=WP_000222467.1
GCA_000013265.1.fasta.gz	db.fasta	2930966	2937430	-	6465	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbK|locus_tag=ECOK1_RS11360|product="colibactin hybrid non-ribosomal peptide synthetase/type I polyketide synthase ClbK"|protein_id=WP_000222467.1
GCA_000013265.1.fasta.gz	db.fasta	2934698	2935984	-	1287	20	99.84	19.80	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbJ|locus_tag=ECOK1_RS11365|product="colibactin non-ribosomal peptide synthetase ClbJ"|protein_id=WP_001468003.1
GCA_000013265.1.fasta.gz	db.fasta	2937423	2938886	-	1464	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbL|locus_tag=ECOK1_RS11355|product="colibactin biosynthesis amidase ClbL"|protein_id=WP_001297937.1
GCA_000013265.1.fasta.gz	db.fasta	2938948	2940387	-	1440	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbM|locus_tag=ECOK1_RS11350|product="precolibactin export MATE transporter ClbM"|protein_id=WP_000217768.1
GCA_000013265.1.fasta.gz	db.fasta	2940384	2944751	-	4368	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbN|locus_tag=ECOK1_RS11345|product="colibactin non-ribosomal peptide synthetase ClbN"|protein_id=WP_001327259.1
GCA_000013265.1.fasta.gz	db.fasta	2944782	2947241	-	2460	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbO|locus_tag=ECOK1_RS11340|product="colibactin polyketide synthase ClbO"|protein_id=WP_001029878.1
GCA_000013265.1.fasta.gz	db.fasta	2947254	2948768	-	1515	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbP|locus_tag=ECOK1_RS11335|product="precolibactin peptidase ClbP"|protein_id=WP_002430641.1
GCA_000013265.1.fasta.gz	db.fasta	2948761	2949483	-	723	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbQ|locus_tag=ECOK1_RS11330|product="colibactin biosynthesis thioesterase ClbQ"|protein_id=WP_000065646.1
GCA_000013265.1.fasta.gz	db.fasta	2949518	2950030	-	513	00	100.00	100.00	ENA|CP000243|CP000243.1 Escherichia coli UTI89, complete genome.	clbS|locus_tag=ECOK1_RS11325|product="colibactin self-protection protein ClbS"|protein_id=WP_000290498.1

kbo isn’t massively better for this task than blast but I believe the command-line syntax is somewhat easier to memorize. And it’s interesting that we can extract local alignments from pure k-mer matching by adding the k-bounded matching statistics.

My internal roadmap thinks of this as the main use case for the eventual browser version, which could support things like bundling databases with the executable to distribute analyses you need to run often.

Reference-based alignment

Next, we have the case of generating a reference-based alignment of two bacterial genome assemblies. For me, this was the original motivation to write kbo, as the existing tools are either somewhat slow (snippy) or require some megabytes of temporary disk space per query (snippy and ska2). Both factors scale poorly when running reference-based alignment for tens of thousands of genomes, although it’s still possible with the exisintg tools.

kbo can handle this task very easily, and takes a few seconds to run the alignment with no indexing step or temporary disk space required.

Setup

  • A genome from the ENA.
  • Another genome from the ENA.

With the genomes downloaded, run

kbo map -r GCA_000013265.1.fasta.gz GCA_000333215.1.fasta.gz > alignment.fasta

The file alignment.fasta will contain an alignment of GCA_000333215 against GCA_000013265. If you need the nucleotide sequence of the reference in the file, you need to add it to the file first in the same format that kbo will output

echo ">GCA_000013265.1.fasta.gz" > alignment.fasta
gunzip -c GCA_000013265.1.fasta.gz | grep -v ">" | tr -d '\n' | sed 's/$/\n/g' >> alignment.fasta

kbo map -r GCA_000013265.1.fasta.gz GCA_000333215.1.fasta.gz > alignment.fasta

This will be added as an option to the command-line interface and the browser version in the future.

The main use case for kbo map is to generate reference-based alignments of large numbers of queries to build a phylogeny since kbo only takes ~a second per genome comparison and uses no disk space. You can use the file to extract SNPs but some quirks in the algorithm mean that the bases are not always called correctly, so ska or snippy should be preferred if this is important to you (and compare different approaches first).

TODO list

  • Finish the browser version (more cumbersome than it seems).
  • Write up the algorithmic parts and put it out as a pre-print.

If you want to contribute code, documentation, or something else feel free to do so in the GitHub repos:

Reading

Thanks to Jarno Alanko and Simon Puglisi from the University of Helsinki for introducing me to k-bounded matching statistics and especially Jarno for writing the sbwt Rust crate that I used within kbo.

Some papers if you are interested in the internals: