readfish plugins API¶
readfish.plugins.abc
module¶
Abstract Base Classes for readfish plugins
These classes define the expected structures and type information for readfish plugins. These are expanded on in the Developer’s guide.
Validation is left to the author of any plugin that inherits from either AlignerABC or CallerABC.
Things we suggest that are validated:
required keys - Keys that must be present in the TOML
correctly typed values - Values that have been passed in are correctly parsed
available input files - Check the existence of paths
writable outputs - Check permissions on output files
sufficient space/RAM/resource - Check Disk space at least
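As a rough illustration of the checks suggested above, the sketch below validates a plugin's keyword arguments using only the standard library. The function name `validate_plugin_config`, the key names `reference` and `output`, and the 1 GiB free-space default are assumptions made for this example; none of them are part of readfish.

```python
import os
import shutil
from pathlib import Path


def validate_plugin_config(config: dict, min_free_bytes: int = 1 << 30) -> list[str]:
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: the key names used here are illustrative only.
    """
    errors = []

    # Required keys: these must be present in the parsed TOML table.
    for key in ("reference", "output"):
        if key not in config:
            errors.append(f"missing required key: {key!r}")
    if errors:
        return errors

    # Correctly typed values: both paths are expected to be strings.
    for key in ("reference", "output"):
        if not isinstance(config[key], str):
            errors.append(f"{key!r} must be a string, got {type(config[key]).__name__}")
    if errors:
        return errors

    # Available input files: the reference must already exist.
    if not Path(config["reference"]).is_file():
        errors.append(f"input file not found: {config['reference']}")

    # Writable outputs: check permissions on the directory that will hold the output.
    out_dir = Path(config["output"]).resolve().parent
    if not os.access(out_dir, os.W_OK):
        errors.append(f"output directory is not writable: {out_dir}")
    # Sufficient resources: check free disk space at least.
    elif shutil.disk_usage(out_dir).free < min_free_bytes:
        errors.append(f"less than {min_free_bytes} bytes free in {out_dir}")

    return errors
```

Returning a list of messages rather than raising on the first failure lets a plugin report every problem in one pass.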
- class readfish.plugins.abc.AlignerABC(debug_log, **kwargs)[source]¶
Bases:
ABC
Aligner base class.
- abstract describe(regions, barcodes)[source]¶
Informatively describe the Aligner and how it is setup, to be logged for the user. For example reference size, reference file etc.
- Returns:
A string containing (preferably pretty formatted) information about the aligner
- Return type:
- abstract disconnect()[source]¶
Aligner disconnection method; this will be called after readfish’s main loop finishes
- abstract property initialised: bool¶
Is this aligner instance initialised.
This property should indicate whether the class is initialised and capable of aligning data. If it returns False, readfish will be paused until it evaluates to True.
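The pausing behaviour described for `initialised` can be pictured as a simple polling loop. The sketch below is not readfish's implementation: `SlowAligner`, `wait_until_initialised`, the polling interval, and the poll limit are hypothetical names and values, used only to show how a caller might wait on such a property.

```python
import time


class SlowAligner:
    """Hypothetical aligner stand-in that becomes ready after a few polls."""

    def __init__(self, polls_until_ready: int) -> None:
        self._remaining = polls_until_ready

    @property
    def initialised(self) -> bool:
        # Pretend an index is still loading until the countdown reaches zero.
        if self._remaining > 0:
            self._remaining -= 1
            return False
        return True


def wait_until_initialised(aligner, poll_interval: float = 0.01, max_polls: int = 1000) -> int:
    """Block until `aligner.initialised` is True; return the number of sleeps taken."""
    polls = 0
    while not aligner.initialised:
        if polls >= max_polls:
            raise TimeoutError("aligner did not initialise in time")
        time.sleep(poll_interval)
        polls += 1
    return polls
```

A real plugin would typically report readiness from background state, such as an index finishing loading, rather than from a countdown.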
- class readfish.plugins.abc.CallerABC(debug_log, **kwargs)[source]¶
Bases:
ABC
Caller base class.
- abstract basecall(chunks, signal_dtype, daq_values)[source]¶
Basecall live data from the Read Until API.
- Parameters:
- Returns:
Yields Result classes with the Result.channel, Result.read_id, and Result.seq fields set.
- Return type:
Iterable[Result]
- abstract describe()[source]¶
Informatively describe the Caller and how it is setup, to be logged for the user. For example the name of the caller, any connections made for basecalling, models used etc.
- Returns:
A string containing (preferably pretty formatted) information about the caller
- Return type:
readfish.plugins.utils
module¶
- readfish.plugins.utils.TARGET_INTERVAL¶
alias of
TargetInterval
- class readfish.plugins.utils.Strand(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
Enum
Enum representing the forward and reverse strand of DNA for alignments
- forward = '+'¶
Forward strand
- reverse = '-'¶
Reverse strand
- readfish.plugins.utils.get_contig_lengths(al)[source]¶
Get the lengths of all contigs in the reference genome provided by an Aligner instance.
- Parameters:
al (AlignerABC) – An Aligner instance representing the reference genome.
- Returns:
A dictionary mapping contig names to their respective lengths.
- Return type:
- readfish.plugins.utils.is_empty(item)[source]¶
Check if an item is empty.
This function checks whether the given item is empty. An item is considered empty if it is an empty Container (Set, Tuple, Dict, List etc.). For primitive types (Float, Int, String etc.), this function considers them non-empty.
- Parameters:
item (Any) – The item to check for emptiness.
- Returns:
True if the item is empty, False otherwise.
- Examples:
- Return type:
>>> is_empty(42)
False
>>> is_empty("Hello, world!")
False
>>> is_empty([])
True
>>> is_empty({})
True
>>> is_empty([1, 2, 3])
False
>>> is_empty({"a": 1, "b": 2})
False
>>> is_empty([[], [], []])
False
>>> is_empty([{}, {}])
False
>>> is_empty(None)
False
- readfish.plugins.utils.count_dict_elements(d)[source]¶
Recursively count all the bottom elements of an arbitrarily nested dictionary. If the bottom element v is a list, return the length of the list, else return 1 for each v at the bottom of the list.
Note - This will break for nested lists, i.e will only count the list as one, ignoring the sublists - see the last doctest for an example. This is not a problem for the current use case, but may be in the future.
- Parameters:
d (dict[Any]) – Dictionary to count elements of, may or may not be nested
- Returns:
Count of elements at lowest point in tree
- Return type:
>>> simple_dict = {"a": 1, "b": 2, "c": 3}
>>> count_dict_elements(simple_dict)
3
>>> string_dict = {"a": 1, "b": {"x": ["10", "2000"]}, "c": {"y": {"z": [30, 40, 50]}}}
>>> count_dict_elements(string_dict)
6
>>> nested_dict = {"a": 1, "b": {"x": [10, 20]}, "c": {"y": {"z": [30, 40, 50]}}}
>>> count_dict_elements(nested_dict)
6
>>> empty_dict = {"a": {}, "b": {"x": {}}, "c": {"y": {"z": []}}}
>>> count_dict_elements(empty_dict)
0
>>> mixed_dict = {"a": 1, "b": {"x": [10, 20]}, "c": {"y": {"z": [30, 40, 50], "w": 7.0}}}
>>> count_dict_elements(mixed_dict)
7
>>> empty_list_dict = {"a": [], "b": [{}], "c": [[], [], []]}
>>> count_dict_elements(empty_list_dict)
0
# Nested lists are not counted properly
>>> nested_list_dict = {"a": [], "b": [{}], "c": [[1, 2, [1, 2]], [], []]}
>>> count_dict_elements(nested_list_dict)
1
- readfish.plugins.utils.sum_target_coverage(targets, genomes)[source]¶
Recursively find the coverage of the range of a set of Targets. ASSUMES the bottom elements are in the form dict[chromosome_name, tuple[float, float]] or tuple[int, int], i.e. genomic coordinates.
If there are no targets, return 0.
- Parameters:
- Returns:
sum of distance covered by ranges of targets in d.
- Return type:
- readfish.plugins.utils.coord_validator(row)[source]¶
Validates and converts the ‘start’ and ‘end’ fields in the given row dictionary to integers. If conversion is not possible, or if ‘start’ is greater than ‘end’, appends appropriate error messages to a list and returns the list along with the row dictionary. The error messages are intended to be collected and converted to a ValueError as part of a BaseExceptionGroup.
- Parameters:
row (dict[str, str]) – A dictionary containing ‘start’ and ‘end’ fields, presumably as strings.
- Returns:
A tuple containing the possibly modified row dictionary and a list of error messages.
- Example:
- Return type:
>>> row = {'start': '10', 'end': '5'}
>>> coord_validator(row)
({'start': 10, 'end': 5}, ['{target_specification_format} {line_number} start > end (10 > 5)'])
>>> row = {'start': 'a', 'end': '20'}
>>> coord_validator(row)
({'start': 'a', 'end': 20}, ["{target_specification_format} {line_number} start coordinate 'a' could not be converted to an integer"])
>>> row = {'start': '10', 'end': '20'}
>>> coord_validator(row)
({'start': 10, 'end': 20}, [])
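The validate-and-collect pattern described for `coord_validator` can be sketched as below. This is a stand-alone approximation written to reproduce the documented examples, not readfish's source; `coord_validator_sketch` and `fill_placeholders` are hypothetical names, and the way the `{target_specification_format}` and `{line_number}` placeholders are filled in is an assumption.

```python
def coord_validator_sketch(row: dict) -> tuple[dict, list[str]]:
    """Convert 'start' and 'end' to int where possible and collect error messages."""
    errors = []
    for field in ("start", "end"):
        try:
            row[field] = int(row[field])
        except (TypeError, ValueError):
            # The placeholders are left unformatted so a caller can fill them in later.
            errors.append(
                "{target_specification_format} {line_number} "
                f"{field} coordinate {row[field]!r} could not be converted to an integer"
            )
    # Only compare the coordinates when both conversions succeeded.
    if not errors and row["start"] > row["end"]:
        errors.append(
            "{target_specification_format} {line_number} "
            f"start > end ({row['start']} > {row['end']})"
        )
    return row, errors


def fill_placeholders(errors: list[str], fmt: str, line_number: int) -> list[ValueError]:
    """Assumed follow-up step: turn collected messages into ValueError instances."""
    return [
        ValueError(
            msg.replace("{target_specification_format}", fmt).replace("{line_number}", str(line_number))
        )
        for msg in errors
    ]
```

For example, `fill_placeholders(coord_validator_sketch({'start': '10', 'end': '5'})[1], "bed", 3)` gives a single `ValueError` whose message is `bed 3 start > end (10 > 5)`. Grouping these into an exception group is omitted here.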
- readfish.plugins.utils.strand_validator(row)[source]¶
Validates the ‘strand’ field in the given row dictionary to be either ‘+’, ‘-’, or ‘.’. If the ‘strand’ field is ‘.’, it is converted to ‘+-’. If the ‘strand’ field is not one of the mentioned valid values, an error message is added to a list of errors, and the list of error messages along with the modified row dictionary are returned.
- Parameters:
row (dict[str, str]) – A dictionary containing a ‘strand’ field.
- Returns:
A tuple containing the possibly modified row dictionary and a list of error messages.
- Example:
- Return type:
>>> row = {'strand': '.'}
>>> strand_validator(row)
({'strand': '+-'}, [])
>>> row = {'strand': 'x'}
>>> strand_validator(row)
({'strand': 'x'}, ["{target_specification_format} {line_number} strand 'x' not one of ['+', '-', '.']"])
>>> row = {'strand': '+'}
>>> strand_validator(row)
({'strand': '+'}, [])
Refer to http://genome.ucsc.edu/FAQ/FAQformat#format1 for more details on the strand field in BED format.
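The strand normalisation described above can be approximated in a few lines. The sketch below is an assumption-laden stand-in that reproduces the documented examples; `strand_validator_sketch` and `VALID_STRANDS` are hypothetical names rather than readfish identifiers.

```python
VALID_STRANDS = ["+", "-", "."]


def strand_validator_sketch(row: dict) -> tuple[dict, list[str]]:
    """Check the 'strand' field and expand '.' to both strands ('+-')."""
    errors = []
    strand = row.get("strand")
    if strand not in VALID_STRANDS:
        # Leave the row untouched and report the problem instead.
        errors.append(
            "{target_specification_format} {line_number} "
            f"strand {strand!r} not one of {VALID_STRANDS}"
        )
    elif strand == ".":
        row["strand"] = "+-"
    return row, errors
```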
- readfish.plugins.utils.row_checker(row, mode='csv')[source]¶
Validates the given row dictionary based on the mode and returns the row along with any errors found during the validation.
The mode alters the behaviour. If the mode is ‘csv’, the row is allowed to only contain the contig. If it is “bed”, the first 6 columns of the BED format are required.
Refer to http://genome.ucsc.edu/FAQ/FAQformat#format1 for more details on the strand field in BED format.
- Parameters:
- Returns:
A tuple containing the validated (and possibly corrected) row and a list of error messages encountered during validation.
- Return type:
>>> row = {'chrom': 'chr1', 'start': '1000', 'end': '2000', 'strand': '+'}
>>> row_checker(row, mode='csv') # No errors, valid row
({'chrom': 'chr1', 'start': 1000, 'end': 2000, 'strand': '+'}, [])
>>> row = {'chrom': 'chr1', 'start': '2000', 'end': '1000', 'strand': '+'}
>>> # Coord validator will report an error due to start > end
>>> row_checker(row, mode='csv')
({'chrom': 'chr1', 'start': 2000, 'end': 1000, 'strand': '+'}, ['{target_specification_format} {line_number} start > end (2000 > 1000)'])
>>> row = {'chrom': None, 'start': '1000', 'end': '2000', 'strand': '+'}
>>> # Chromosome value is missing, an error will be reported
>>> row_checker(row, mode='csv')
({'chrom': None, 'start': 1000, 'end': 2000, 'strand': '+'}, ['{target_specification_format} {line_number} has no chromosome value'])
>>> row = ["chr1", 0, 1000, "+"]
>>> # The row is a list rather than a dictionary, an error will be reported
>>> row_checker(row, mode='csv')
(['chr1', 0, 1000, '+'], ['Input row is not a valid dictionary'])
- Raises:
ValueError – If the mode is neither ‘csv’ nor ‘bed’.
- class readfish.plugins.utils.Decision(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
Enum
Decision readfish has made about a read after Alignment
- single_off = 'single_off'¶
The read aligned to a single location that is not in a target region
- multi_on = 'multi_on'¶
The read aligned to multiple locations, where at least one alignment is within a target region
- multi_off = 'multi_off'¶
The read aligned to multiple locations, none of which were in a target region
- no_map = 'no_map'¶
The read was basecalled but did not align
- no_seq = 'no_seq'¶
The read did not basecall
- above_max_chunks = 'above_max_chunks'¶
Too many signal chunks have been collected for this read
- below_min_chunks = 'below_min_chunks'¶
Fewer signal chunks for this read collected than required
- duplex_override = 'duplex_override'¶
Potential second half of a duplex read
- first_read_override = 'first_read_override'¶
Read sequenced as translocated portion was of unknown length at start of readfish
- class readfish.plugins.utils.Action(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
Enum
Action to take for a read.
This enum class represents different actions that can be taken for a read during sequencing. Each action has a corresponding string value used for logging.
- Variables:
unblock – Send an unblock command to the sequencer.
stop_receiving – Allow the read to finish sequencing.
proceed – Sample another chunk of data.
- Example:
Define an Action:
>>> action = Action.unblock
Access the string value of an Action:
>>> action.value 'unblock'
- unblock = 'unblock'¶
Send an unblock command to the sequencer
- stop_receiving = 'stop_receiving'¶
Allow the read to finish sequencing
- proceed = 'proceed'¶
Sample another chunk of data
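To show how a Decision and an Action relate, the sketch below maps a few decisions to actions with a lookup table. The enums are local stand-ins that copy only some of the documented members, and the `POLICY` table and `choose_action` helper are hypothetical; how readfish actually configures this mapping is not covered in this section.

```python
from enum import Enum


class Decision(Enum):
    """Local stand-in listing a subset of the decisions documented above."""

    single_off = "single_off"
    multi_on = "multi_on"
    no_map = "no_map"
    no_seq = "no_seq"


class Action(Enum):
    """Local stand-in mirroring the documented Action members."""

    unblock = "unblock"
    stop_receiving = "stop_receiving"
    proceed = "proceed"


# Hypothetical policy for an enrichment-style experiment.
POLICY = {
    Decision.single_off: Action.unblock,
    Decision.multi_on: Action.stop_receiving,
    Decision.no_map: Action.proceed,
    Decision.no_seq: Action.proceed,
}


def choose_action(decision: Decision) -> Action:
    # Fall back to sampling another chunk when the policy has no entry.
    return POLICY.get(decision, Action.proceed)
```

Because each member's value is its own name, `Action("unblock")` looks up the same member as `Action.unblock`, which is convenient when actions are read from text configuration.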
- class readfish.plugins.utils.Result(channel, read_id, seq, decision=Decision.no_seq, barcode=None, basecall_data=None, alignment_data=None)[source]¶
Bases:
object
Result holder
This should be progressively filled with data from the basecaller, barcoder, and then the aligner.
- Parameters:
channel (int) – The channel that this read is being sequenced on
read_id (str) – The read ID assigned to this read by MinKNOW
seq (str) – The basecalled sequence for this read
decision (Decision) – The Decision that has been made, this will be used to determine the Action
barcode (str | None) – The barcode that has been assigned to this read
basecall_data (Any | None) – Any extra data that the basecaller may want to send to the aligner
alignment_data (list[_AlignmentAttribute | _AlignmentProperty] | None) – Any extra alignment data
- class readfish.plugins.utils.Targets(value=NOTHING, padding=0)[source]¶
Bases:
object
Class representation of target regions of a genome.
This class is responsible for parsing and managing target regions specified either through a TOML file or provided as a list of strings.
- Variables:
Note
Example:
Using a list of targets:
>>> targets = Targets.from_parsed_toml(["chr1,100,200,+"])
Using a .bed file:
targets = Targets.from_parsed_toml("/path/to/targets.bed")
- classmethod from_parsed_toml(targets)[source]¶
Create the target array from the targets that have been read from the provided TOML file
- Parameters:
targets (List[str] | str) – The targets array or a target file, containing one target per line
- Raises:
ValueError – Raised if the supplied target is a file that cannot be parsed
ValueError – If we fail to initialise class
- Returns:
Initialised targets class
- Return type:
- check_coord(contig, strand, coord)[source]¶
Check to see if a coordinate is within any of the target regions
- Parameters:
contig – The target contig name
strand – The strand that the alignment is to
coord – The coordinate to be checked
- Raises:
ValueError – If the strand passed is not recognised
- Returns:
Boolean representing whether the coordinate is within a target region or not
>>> targets = Targets(["chr1,10,20,+", "chr1,15,30,+"])
>>> targets.check_coord('chr1', "+", 15)
True
>>> targets.check_coord('chr1', "+", 5)
False
>>> targets.check_coord('chr1', "-", 15)
False
>>> targets.check_coord('chr1', "+", 31) # Example where coord (31) is in reversed target interval (+ve strand) Should fail
False
>>> targets.check_coord('chr1', "-", 41) # Example where coord (41) is in reversed target interval (-ve strand) Should fail
False
>>> targets.check_coord('chr1', "unknown_strand", 15)
Traceback (most recent call last):
...
ValueError: Unexpected strand unknown_strand
- get_offset(strand)[source]¶
Get the start and end padding offsets for a given strand.
- Parameters:
strand (Strand) – The strand for which to get the offsets.
- Returns:
A tuple containing the start and end offsets.
- Examples:
>>> targets = Targets(["chr1,10,20,+", "chr1,15,30,+"], padding=10)
>>> targets.get_offset(Strand.forward)
(-10, 0)
>>> targets.get_offset(Strand.reverse)
(0, 10)
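The offsets in the example above depend only on the strand and the padding, which the sketch below reproduces with a local `Strand` stand-in. `get_offset_sketch` and `pad_interval` are hypothetical names, and adding the offsets directly to an interval's start and end is an assumption about how they might be applied, not a description of readfish's internals.

```python
from enum import Enum


class Strand(Enum):
    """Local stand-in for the documented Strand enum."""

    forward = "+"
    reverse = "-"


def get_offset_sketch(strand: Strand, padding: int) -> tuple[int, int]:
    """Return (start_offset, end_offset) for the given strand."""
    if strand is Strand.forward:
        # Forward strand: extend the region before its start.
        return (-padding, 0)
    # Reverse strand: extend the region beyond its end.
    return (0, padding)


def pad_interval(start: int, end: int, strand: Strand, padding: int) -> tuple[int, int]:
    """Assumed usage: add the offsets to an interval's coordinates."""
    start_offset, end_offset = get_offset_sketch(strand, padding)
    return (start + start_offset, end + end_offset)
```

Under these assumptions a forward-strand target at 10 to 20 with a padding of 10 becomes 0 to 20, while the same target on the reverse strand becomes 10 to 30.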
- iter_targets()[source]¶
Iterate over the target intervals of a _Conditions, yielding TARGET_INTERVAL objects.
This method iterates over the target intervals stored in the Targets object and yields TARGET_INTERVAL objects representing each target interval.
- Returns:
Generator that yields TARGET_INTERVAL objects.
- Example:
>>> targets = Targets(["chr1,10,20,+", "chr1,15,30,+"])
>>> for target in targets.iter_targets():
...     print(target)
TargetInterval(chromosome='chr1', start=10, end=30, strand=<Strand.forward: '+'>)
>>> targets = Targets(["chr1,10,20,+", "chr2,5,15,-"])
>>> for target in targets.iter_targets():
...     print((target.chromosome, target.start, target.end, target.strand))
('chr1', 10, 20, <Strand.forward: '+'>)
('chr2', 5, 15, <Strand.reverse: '-'>)
>>> targets = Targets(["chr1,10,20,+", "chr2,5,15,-", "chr1,25,35,-"])
>>> for target in targets.iter_targets():
...     print(target.chromosome, target.start, target.end, target.strand)
chr1 10 20 Strand.forward
chr2 5 15 Strand.reverse
chr1 25 35 Strand.reverse
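The first example above shows two overlapping targets on the same contig and strand coming back as a single interval. One common way to get that result is a sort-and-sweep merge, sketched below on plain tuples. `merge_targets` is a hypothetical helper, not a readfish function, and treating touching intervals as mergeable is an assumption.

```python
from collections import defaultdict


def merge_targets(targets):
    """Merge overlapping (contig, start, end, strand) tuples per contig and strand."""
    grouped = defaultdict(list)
    for contig, start, end, strand in targets:
        grouped[(contig, strand)].append((start, end))

    merged = []
    for (contig, strand), intervals in grouped.items():
        # Sorting by start means each interval only needs comparing with the current run.
        intervals.sort()
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end:
                # Overlapping or touching: extend the current interval.
                cur_end = max(cur_end, end)
            else:
                merged.append((contig, cur_start, cur_end, strand))
                cur_start, cur_end = start, end
        merged.append((contig, cur_start, cur_end, strand))
    return merged
```

With this sketch, `merge_targets([("chr1", 10, 20, "+"), ("chr1", 15, 30, "+")])` returns `[("chr1", 10, 30, "+")]`, while intervals on different contigs or strands are left separate.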
- class readfish.plugins.utils.PreviouslySentActionTracker(last_actions=NOTHING)[source]¶
Bases:
object
A class to keep track of the last action sent from a channel.
This class provides methods to add and retrieve the last action sent for each channel.
- Parameters:
last_actions – A dictionary mapping channel IDs to the last sent action for that channel.
- Example:
Initialize a PreviouslySentActionTracker:
>>> tracker = PreviouslySentActionTracker()
Add an action for channel number 1:
>>> from readfish.plugins.utils import Action
>>> action = Action.unblock
>>> tracker.add_action(1, action)
Retrieve the last action for a channel:
>>> retrieved_action = tracker.get_action(1)
>>> retrieved_action
<Action.unblock: 'unblock'>
Retrieve the last action for a channel that hasn’t sent any actions:
>>> no_action = tracker.get_action(2)
>>> no_action is None
True
- class readfish.plugins.utils.DuplexTracker(previous_alignments=NOTHING, previous_decision=NOTHING)[source]¶
Bases:
object
Wrapper class to keep track of the alignment location of the latest read seen on a channel and the previous decision made, tracking whether we made a duplex override on the last read. Specifically, we store a list of tuples of any target contig names and strands that were aligned to, keyed to channel number, and the previous decision for a read made on that channel. The decision should only be updated when a read has been finalised and should not be seen again, i.e. a Stop receiving or Unblock has been sent to MinKNOW. No maps are specified as (*, *).
- get_previous_decision(channel)[source]¶
Get the previous decision seen on this channel.
- Parameters:
channel (int) – The channel number.
- Returns:
Previously seen decision
- Return type:
>>> dt = DuplexTracker()
>>> dt.get_previous_decision(1) is None
True
>>> dt.set_decision(1, Decision.duplex_override)
>>> dt.get_previous_decision(1)
<Decision.duplex_override: 'duplex_override'>
- set_decision(channel, decision)[source]¶
Set the previous decision for a given channel number.
- Parameters:
i.e. we won’t see the read again.
>>> dt = DuplexTracker()
>>> dt.set_decision(1, Decision.no_map)
>>> dt.previous_decision[1]
<Decision.no_map: 'no_map'>
- get_previous_alignments(channel)[source]¶
Retrieves last alignments, including no maps seen on the given channel.
- Parameters:
channel (int) – The channel number to lookup the previous action for
read_id – Read ID of the current alignment
- Returns:
A list of (contig_name, strand) tuples for the last alignments seen on this channel, or None if nothing has been recorded for it
- Return type:
>>> dt = DuplexTracker()
>>> dt.get_previous_alignments(1) is None
True
>>> dt.set_alignments(1, [("contig1", Strand.forward), ("contig2", Strand.reverse)])
>>> dt.get_previous_alignments(1)
[('contig1', <Strand.forward: '+'>), ('contig2', <Strand.reverse: '-'>)]
- set_alignments(channel, alignments)[source]¶
Add an alignment that has been seen for a channel.
- Parameters:
channel (int) – The channel number to set the alignment for.
target_name – The name of the target contig aligned to
strand – The strand we have aligned to.
>>> dt = DuplexTracker()
>>> dt.set_alignments(1, [("contig3", Strand.forward), ("contig4", Strand.reverse)])
>>> dt.previous_alignments[1]
[('contig3', <Strand.forward: '+'>), ('contig4', <Strand.reverse: '-'>)]
- possible_duplex(channel, target_name, strand)[source]¶
Compare the current alignment target_name and strand for a given channel with the previous alignment target_name and strand.
If the strand is opposite and the target is the same, return True, else False.
- Parameters:
channel – Channel number to fetch alignment for
target_name – The name of the target contig for the current alignment
strand – The strand of the current alignment
- Returns:
True if the strand is opposite and target contig the same
>>> dt = DuplexTracker()
>>> dt.set_alignments(1, [("contig5", Strand.forward)])
>>> dt.possible_duplex(1, "contig5", Strand.reverse)
True
>>> dt.possible_duplex(1, "contig6", Strand.reverse)
False
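The duplex check described above compares the current alignment with what was last stored for the channel. The sketch below is a simplified stand-in written to match the documented examples; `DuplexTrackerSketch` and the local `Strand` enum are hypothetical, and returning False for a channel with no stored alignments is an assumption.

```python
from enum import Enum


class Strand(Enum):
    """Local stand-in for the documented Strand enum."""

    forward = "+"
    reverse = "-"


class DuplexTrackerSketch:
    """Remember the last (contig, strand) alignments seen on each channel."""

    def __init__(self) -> None:
        self.previous_alignments = {}

    def set_alignments(self, channel: int, alignments: list) -> None:
        self.previous_alignments[channel] = alignments

    def possible_duplex(self, channel: int, target_name: str, strand: Strand) -> bool:
        # A duplex candidate maps to the same contig on the opposite strand.
        opposite = Strand.reverse if strand is Strand.forward else Strand.forward
        previous = self.previous_alignments.get(channel) or []
        return (target_name, opposite) in previous
```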
readfish.plugins.dorado
module¶
Dorado plugin module
Extension of pyBaseCaller that maintains a connection to the basecaller
- class readfish.plugins.dorado.Caller(run_information=None, sample_rate=None, debug_log=None, **kwargs)[source]¶
Bases:
CallerABC
- basecall(reads, signal_dtype, daq_values=None)[source]¶
Basecall live data from minknow RPC
- Parameters:
reads (Iterable[tuple[int, minknow_api.data_pb2.GetLiveReadsResponse.ReadData]]) – List or generator of tuples containing (channel, MinKNOW.rpc.Read)
signal_dtype (npt.DTypeLike) – Numpy dtype of the raw data
daq_values (dict[int, namedtuple]) – Dictionary mapping channel to offset and scaling values. If not provided default values of 1.0 and 0.0 are used.
- Yield:
- Return type:
- describe()[source]¶
Describe the Dorado Caller
- Returns:
Description of parameters passed to this Dorado Caller plugin
- Return type:
- validate()[source]¶
Validate the parameters passed to Dorado to ensure they will initialise py Basecall Client correctly
- Currently checks:
That the socket file exists
That the Socket file has the correct permissions
That the version of py basecall client lib installed matches the system version
- Returns:
None, if the parameters pass all the checks
- Return type:
None
readfish.plugins.mappy
module¶
readfish.plugins._no_op
module¶
A no operation plugin module, used for pass through behaviour.
This module implements a basic Aligner and Caller that do nothing and the minimum required behaviours respectively.
They are here for when readfish expects an action that may not be required.
For example, if using a signal based alignment approach, that module can replace the Caller and completely remove the extra alignment step. To achieve this, the _no_op.Caller will only iterate the raw data from the Read Until API and yield the minimal Result structs for the targets script to use:
Result(
channel=<channel number>,
read_number=<read_number>,
read_id=<read_id>,
seq="",
)
The seq field will always be empty.
This is of little (essentially no) use outside of an unblock all or something completely random where you don’t want or need any sequence.
In addition, the _no_op.Aligner will pass through the iterable from the caller module without modifying/adding anything.
This behaviour can be useful if a plugin can complete its entire decision in a single step.
- class readfish.plugins._no_op.Aligner(*args, **kwargs)[source]¶
Bases:
AlignerABC
- class readfish.plugins._no_op.Caller(*args, **kwargs)[source]¶
Bases:
CallerABC
- basecall(chunks, *args, **kwargs)[source]¶
Create a minimal Result instance from live data from the Read Until API.
This will use the actual channel, read number, and read ID but will set an empty string for the seq field.
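The pass-through behaviour of this module can be pictured with two small generators. The sketch below does not import readfish: `MinimalResult` is a stand-in for Result holding only the fields listed in this module's description, and the `number` and `id` attributes on each read object are assumed names used for illustration.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Iterable, Iterator


@dataclass
class MinimalResult:
    """Stand-in for readfish's Result, holding only the fields shown above."""

    channel: int
    read_number: int
    read_id: str
    seq: str = ""


def no_op_basecall(chunks: Iterable[tuple]) -> Iterator[MinimalResult]:
    """Yield one result per (channel, read) pair without basecalling anything."""
    for channel, read in chunks:
        # `read.number` and `read.id` are assumed attribute names for this sketch.
        yield MinimalResult(channel=channel, read_number=read.number, read_id=read.id)


def pass_through_align(results: Iterable[MinimalResult]) -> Iterator[MinimalResult]:
    """Hand every result on unchanged, mirroring the pass-through aligner idea."""
    yield from results


# Example input: one fake read on channel 1.
example_chunks = [(1, SimpleNamespace(number=7, id="read-a"))]
example_results = list(pass_through_align(no_op_basecall(example_chunks)))
```

Because both steps are generators, nothing is computed until the consumer iterates, so the pair adds very little overhead when no sequence is needed.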