The BARCOSEL output page provides a FASTA file containing the selected
barcodes. It also shows a graphical diagnostic plot of the position-wise
and global nucleotide balance within the selected set.
Selection of barcodes is based on two criteria:
1) Every pair of barcodes must differ by at least a user-defined number
of nucleotide mismatches. This is a strict requirement. The default
threshold for this minimum Hamming distance is 3. In order to tolerate
M sequencing errors (substitutions) and still detect the correct
barcode, the minimum barcode distance must be 2M+1. That is, with the
default threshold of 3 it is guaranteed that one sequencing error can be
corrected for each barcode.
2) The nucleotide proportions in the selected barcode set should be as
well balanced as possible. This is the goal of the optimization
process. The objective function includes both the global and the
position-wise nucleotide balances. In addition to the single nucleotides
A, C, G, and T, the balance between the two nucleotide groups (A,C) and
(G,T) is also maximized because of Illumina's nucleotide detection
system.
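The first criterion can be checked on any barcode set with a few lines of
Python. This is only a minimal sketch for illustration: the function names
are hypothetical and are not part of BARCOSEL.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_pairwise_distance(barcodes) -> int:
    """Smallest Hamming distance over all pairs of barcodes in the set."""
    return min(hamming(a, b) for a, b in combinations(barcodes, 2))


def correctable_errors(min_distance: int) -> int:
    """Largest M that satisfies min_distance >= 2M + 1."""
    return (min_distance - 1) // 2
```

With the default threshold, correctable_errors(3) gives 1. A threshold of 4
still corrects only one error, and a threshold of 5 is needed to correct two.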
Please note that both the content of the candidate barcode sequences and
the number of barcodes to be selected affect how well the optimization
criteria can be satisfied:
1) If no set of barcodes can be found in which the Hamming distance
between every pair of barcodes is at least the given threshold, an
error message is shown to the user.
2) The optimal position-wise nucleotide balance depends on the number of
barcodes to be selected. Only when this number is a multiple of four can
there be an equal number of A, C, G, and T in every barcode position,
resulting in perfect balance. In this case the solution is also found
quickly.
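The multiple-of-four condition can be illustrated by counting nucleotides
per position. The sketch below uses hypothetical helper names and only
tests for perfect position-wise balance. It does not reproduce the
objective function used by BARCOSEL.

```python
from collections import Counter


def position_counts(barcodes):
    """Per-position nucleotide counts, one Counter per barcode position."""
    return [Counter(column) for column in zip(*barcodes)]


def is_perfectly_balanced(barcodes) -> bool:
    """True if A, C, G and T occur equally often at every position."""
    n = len(barcodes)
    if n % 4 != 0:
        # n/4 copies of each nucleotide per position is impossible.
        return False
    return all(
        counts[base] == n // 4
        for counts in position_counts(barcodes)
        for base in "ACGT"
    )
```

For example, the four barcodes ACGT, CGTA, GTAC and TACG are perfectly
balanced, while any three of them are not.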
Selecting an optimal set of barcodes is a combinatorial problem. We use
a mixed-integer linear programming approach to solve it. The details can
be found in the manuscript.
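For intuition only, the selection problem can be written as an exhaustive
search over all candidate subsets. This is not the mixed-integer linear
programming formulation used by BARCOSEL. The imbalance score is a
simplified stand-in for the real objective function, all names are
hypothetical, and the search is feasible only for very small candidate
sets.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def imbalance(selection):
    """Illustrative score: sum over positions of (max count - min count)."""
    total = 0
    for column in zip(*selection):
        counts = Counter(column)
        per_base = [counts[base] for base in "ACGT"]
        total += max(per_base) - min(per_base)
    return total


def best_subset(candidates, k, min_distance=3):
    """Try every k-subset; return (score, subset) or None if none is feasible."""
    best = None
    for subset in combinations(candidates, k):
        feasible = all(
            hamming(a, b) >= min_distance for a, b in combinations(subset, 2)
        )
        if not feasible:
            continue
        score = imbalance(subset)
        if best is None or score < best[0]:
            best = (score, subset)
    return best
```

The number of subsets grows combinatorially with the number of candidates,
which is why an optimization solver is used instead of enumeration.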
MANDATORY INPUT PARAMETERS:
INPUT BARCODES:
Barcode candidates should be in multi-FASTA format, where each barcode sequence
is on a single line directly after the line containing the barcode identifier. For example:
>id1
ACCAAGGT
>id2
CGGATTAC
...
The length of the barcode sequence must be the same for all candidates. For practical
reasons, the length of the barcode sequence should be less than 256 nucleotides.
At the moment, the number of candidate sequences is not limited.
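A candidate file can be checked against these requirements before upload.
The following is a minimal sketch with a hypothetical function name. It
assumes that sequences contain only A, C, G and T, which the description
above does not state explicitly.

```python
def read_barcodes(text):
    """Parse two-line FASTA records into a list of (identifier, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 2 != 0:
        raise ValueError("each identifier line must be followed by one sequence line")
    records = []
    for header, seq in zip(lines[0::2], lines[1::2]):
        if not header.startswith(">"):
            raise ValueError(f"expected an identifier line, got {header!r}")
        seq = seq.upper()
        # Assumption: only the four standard nucleotides are allowed.
        if set(seq) - set("ACGT"):
            raise ValueError(f"unexpected characters in sequence {seq!r}")
        records.append((header[1:], seq))
    # All candidates must have the same length.
    if len({len(seq) for _, seq in records}) > 1:
        raise ValueError("all barcodes must have the same length")
    return records
```

Calling read_barcodes on the two-record example above returns the pairs
("id1", "ACCAAGGT") and ("id2", "CGGATTAC").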
NUMBER OF BARCODES TO BE SELECTED:
The number of selected barcodes must be at least 1 and at most the number of barcodes in the candidate set.
ADVANCED INPUT PARAMETERS:
INITIAL BARCODES FASTA FILE:
Optional initial barcode set which will be expanded.
HAMMING DISTANCE:
Minimum number of nucleotide mismatches between any two barcode sequences. In order to tolerate M sequencing errors
and still detect the correct barcode, the minimum barcode distance must be 2M+1. That is, the default threshold of 3
allows one sequencing error to be corrected.
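The 2M+1 rule is what makes the following kind of read assignment
unambiguous. This is a sketch of a generic nearest-barcode lookup, not
part of BARCOSEL, and the function name is hypothetical.

```python
def decode(read, barcodes, max_errors=1):
    """Return the unique barcode within max_errors mismatches, else None."""
    matches = [
        barcode
        for barcode in barcodes
        if sum(x != y for x, y in zip(read, barcode)) <= max_errors
    ]
    # With minimum distance 2*max_errors + 1 at most one barcode can match.
    return matches[0] if len(matches) == 1 else None
```

With the barcodes AAAAAA, AAATTT and TTTAAA (minimum distance 3), the read
AAAAAT is assigned to AAAAAA, while AATTAA has two or more mismatches to
every barcode and is left unassigned.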
TIME LIMIT IN SECONDS:
Maximum time for searching for the optimal solution. Usually, when a perfect solution exists, it is found within a few seconds.
If the time limit is exceeded, the best solution found so far is returned. Increase the maximum time if you
think a better solution should exist and the optimum was not found with the default parameters. In this case you might also
want to try restricting the maximum depth of the search and changing the basis crash value.
Because of how the search engine works, the search sometimes continues even after the global optimum has been found. This happens
when the number of selected barcodes is not a multiple of four, since in this case perfect nucleotide balance does not exist
and the cost function cannot be zero. The algorithm then uses the whole allowed time.
MAXIMUM DEPTH OF THE SEARCH:
This is a parameter of lpsolve. The default is 0, which means there is no restriction on the branch-and-bound search depth. You can limit
the depth (e.g. try 20) if you did not get an optimal solution with the default values. The effect is to explore a wider search area within the given
time limit. When you change this parameter, you might also want to increase the time limit.
BASIS CRASH:
This is a parameter of lpsolve. Its values select different initializations for the solver. Try different values if you were
not happy with the previous results and you think there should be a better solution among your candidate barcodes.