Patterns / Matrices to be aligned
Insert the matrices / patterns to be aligned. Matrices must be in the frequency matrix format (only integers are accepted): the columns correspond to the four nucleotides (A, C, G and T) and are separated by spaces or tabs, and the number of rows equals the length of the matrix element. For example, if the element is 9 bp long, the matrix should have a header line followed by 9 rows. IUPAC-coded nucleotide sequences (A, C, G, T, M, R, W, S, Y, K, V, H, D, B and N) can also be inserted; the tool converts these sequences into matrices automatically. In this conversion, an unambiguous nucleotide (A, C, G or T) creates a row with a single count in the corresponding column (e.g. G is converted to 0, 0, 1 and 0), whereas an ambiguous nucleotide creates a row with more than one count (e.g. S is converted to 0, 1, 1 and 0).
For example:
>my_mat1
2 14 2 0
2 1 15 0
17 0 1 0
0 1 17 0
18 0 0 0
17 0 0 1
0 0 1 17
18 0 0 0
1 2 0 15
>my_seq1
TATATATAT
etc.
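The sequence-to-matrix conversion described above can be sketched roughly as follows. This is only an illustration, not Matlign's own code: the function name iupac_to_matrix is hypothetical, and it assumes that every nucleotide allowed by an ambiguity code receives a count of exactly 1 (which agrees with the G and S examples given above, but is an extrapolation for the other codes).

```python
# Column order follows the matrix format described above: A, C, G, T.
# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def iupac_to_matrix(seq):
    """Return one [A, C, G, T] count row per character of an IUPAC sequence.

    Assumption: each base allowed by a code gets a count of 1.
    """
    rows = []
    for ch in seq.upper():
        allowed = IUPAC[ch]  # raises KeyError for characters outside the IUPAC set
        rows.append([1 if base in allowed else 0 for base in "ACGT"])
    return rows


# Print the rows for a short sequence in the same layout as a matrix entry.
for row in iupac_to_matrix("GSN"):
    print(*row)
```

Under this assumption, a sequence such as TATATATAT becomes a 9-row matrix with a single count per row.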

Scores
The score for a position in the alignment is calculated using the specified substitution matrix and penalties. It is also possible to modify how the score is calculated, which gives an additional way to adapt the tool to the needs of different analyses.
Several other distance measures can also be chosen to score a position in the alignment of patterns/matrices (i.e. to measure the similarity of two rows in the dynamic programming). The available measures are Spearman's correlation (a corrected version that accounts for equal (tied) values and an uncorrected version that does not), Pearson correlation, Kendall's tau correlation and the normalized Euclidean distance. These measures can be combined either by multiplying their raw values or by multiplying their z-score transformed values.
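As an illustration of scoring two rows with one of these measures, the sketch below computes the Pearson correlation between two matrix rows. It is not Matlign's implementation: the function name is hypothetical, and returning 0.0 for a row with no variation (for instance a row with equal counts in all four columns) is an assumption made here to avoid a division by zero.

```python
import math


def pearson(row_a, row_b):
    """Pearson correlation between two equally long matrix rows."""
    n = len(row_a)
    mean_a = sum(row_a) / n
    mean_b = sum(row_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(row_a, row_b))
    var_a = sum((a - mean_a) ** 2 for a in row_a)
    var_b = sum((b - mean_b) ** 2 for b in row_b)
    if var_a == 0 or var_b == 0:
        # Assumption: a flat row carries no signal and is scored as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Compare the first two rows of the example matrix my_mat1.
print(pearson([2, 14, 2, 0], [2, 1, 15, 0]))
```

Identical rows give a correlation of 1 (up to floating-point rounding), and rows whose counts peak in different columns give lower or negative values.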

Alignment parameters

The score for an alignment is calculated as the sum of character match scores and the penalties for gaps.

Character match scores: Each nucleotide matched with another nucleotide gains a score according to the substitution score matrix. In the simplest case, the diagonal scores (i.e. two identical characters matching) are positive and all other scores are zero. However, one can apply more complex scores and, e.g., give a small positive score to transitions (A<->G, C<->T).
Pattern matrices are converted to probability matrices by dividing the count of each character by the total sum of the row. Sequences can be seen as a matrix where probabilities are either 0 or 1. For the match of sites i and j in the two sequences, the score is defined as:

score(i, j) = Σk Σl f1i(k) · f2j(l) · Sk,l

where f1i(k) is the probability of character k at site i of sequence 1, f2j(l) is the probability of character l at site j of sequence 2, and Sk,l is the score of matching characters k and l.
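The scoring described above can be sketched as follows, reading the score as the expected substitution score of the two sites, i.e. the sum over all character pairs (k, l) of f1i(k) * f2j(l) * Sk,l. This reading, the function names and the example substitution matrix (1 for identical characters, 0.25 for transitions, 0 otherwise) are assumptions for illustration, not Matlign's actual code or defaults.

```python
def to_probabilities(count_row):
    """Divide each count by the row total to get character probabilities."""
    total = sum(count_row)  # assumed to be non-zero
    return [count / total for count in count_row]


def match_score(row1, row2, subst):
    """Expected substitution score for matching two count rows (A, C, G, T)."""
    p1 = to_probabilities(row1)
    p2 = to_probabilities(row2)
    return sum(
        p1[k] * p2[l] * subst[k][l]
        for k in range(4)
        for l in range(4)
    )


# Hypothetical substitution matrix, rows and columns in the order A, C, G, T.
SUBST = [
    [1.0, 0.0, 0.25, 0.0],   # A (transition partner: G)
    [0.0, 1.0, 0.0, 0.25],   # C (transition partner: T)
    [0.25, 0.0, 1.0, 0.0],   # G (transition partner: A)
    [0.0, 0.25, 0.0, 1.0],   # T (transition partner: C)
]

# G (0 0 1 0) against S (0 1 1 0): half of the probability mass matches G.
print(match_score([0, 0, 1, 0], [0, 1, 1, 0], SUBST))  # 0.5
```

With this example matrix, matching A against G scores 0.25, because only the transition term contributes.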

Gap open and extension penalty: An internal gap that is n characters long is penalised using the function 'gap opening score' + (n-1) * 'gap extension score'. Both scores should be negative, and opening is typically penalised more heavily than extension (i.e. the gap opening score is the more negative of the two).
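The gap penalty function above translates directly into code. In this small sketch the function name and the example values (-4 for opening, -1 for extension) are hypothetical and are not Matlign defaults.

```python
def gap_penalty(n, gap_open, gap_extend):
    """Penalty for an internal gap of n characters (n >= 1)."""
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (n - 1) * gap_extend


# A 3-character internal gap with hypothetical scores: -4 + 2 * (-1)
print(gap_penalty(3, -4.0, -1.0))  # -6.0
```

Because the opening score applies only once, one long gap costs less than several short gaps of the same total length whenever opening is penalised more heavily than extension.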

Maximal terminal gap length: Terminal gaps are not penalised. The parameter 'maximal terminal gap length' defines the maximum length of terminal gaps, i.e. how much the sequences/matrices have to overlap. This value should always be positive.


Number of permutations
This parameter sets the number of permutations used to assess the FDR of the pair-wise alignment of two patterns or matrices. FDR is calculated as the average number of pair-wise alignments having an equal or higher score in the randomisations divided by the number of pair-wise alignments in the true data set having an equal or higher score (#FP[avg random data] / #P[true data]).
If permutations are performed, only pattern alignments with an FDR below the threshold are shown.
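The FDR ratio described above can be sketched as follows. The function name, the way the scores are stored (one list of pair-wise alignment scores for the true data and one list per permutation) and the example numbers are assumptions for illustration only; how Matlign generates the permuted data is not covered here.

```python
def fdr(score, true_scores, permuted_score_sets):
    """#FP[avg random data] / #P[true data] for a given alignment score."""
    n_true = sum(1 for s in true_scores if s >= score)
    if n_true == 0:
        raise ValueError("no true alignment reaches this score")
    # Count alignments reaching the score in each permutation, then average.
    counts = [
        sum(1 for s in perm if s >= score)
        for perm in permuted_score_sets
    ]
    avg_false = sum(counts) / len(counts)
    return avg_false / n_true


# Hypothetical scores: four true alignments and two permutations.
true_scores = [9.0, 7.5, 4.0, 3.0]
permutations = [
    [5.0, 3.0, 2.0, 1.0],
    [8.0, 4.5, 2.0, 1.0],
]
print(fdr(7.5, true_scores, permutations))  # 0.25
```

In this example two true alignments score at least 7.5, while on average half an alignment per permutation does, giving 0.5 / 2 = 0.25.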

Examples
Examples 1-3. To demonstrate how Matlign can facilitate the interpretation of pattern prediction results, we first analysed a set of DNA sequences with different pattern prediction tools and then analysed the outputs of these tools using Matlign. The analysed data are a set of upstream sequences of genes known to be co-regulated (according to SCPD). The target genes (PDR3, SNQ2, PDR15, YOR1, HXT9, HXT11 and PDR5) are regulated by the PDR3 transcription factor, which binds to the TCCGYGGA element. In the pattern prediction analyses, we first retrieved the 800 bp region upstream of each gene and then predicted the patterns using three different tools (Examples 1-3). Example 1 contains patterns reported by POCO, Example 2 contains patterns reported by oligo-analysis and Example 3 contains patterns reported by MotifSampler. In each analysis, the default (or recommended) parameters were used. In Examples 1 and 2, the best 50 predictions of each tool were then analysed using Matlign. Example 3 contains 100 predictions, created by 50 runs of MotifSampler, each reporting the best two patterns. The Matlign results show that several similar patterns are indeed reported. For example, almost 60 of the 100 MotifSampler predictions are indistinguishable, and the 50 predictions made by POCO can be reduced to several patterns.
Example 4. To demonstrate how Matlign can be used to perform meta-analyses, the clustering results of Examples 1-3 were further analysed. From the obtained results, the best patterns of each tool were chosen using the maximal silhouette value and then pooled across the tools. The results of this analysis show that the tools agree reasonably well in their predictions. Node 22 contains the meta consensus pattern of the different tools, i.e. it is the first node that combines all the tools and is located in the silhouette selection. The pattern obtained from this node is ayTCCGCGGArm (see below), which matches well the binding site of PDR3 (TCCGYGGA).
Example 5. To demonstrate how Matlign can be used to annotate putative patterns, we downloaded all cis-elements from JASPAR, added noise to them (in this example, each position of the element contains 25% noise) and counted the number of times the dimmed pattern matched the original one. Here, the first 123 matrices are the dimmed ones (jaspar_n prefix) and the next 123 matrices are the original ones (jaspar_t prefix). In the results, see how many times jaspar_nX matches jaspar_tX.



This tool was developed by Matti Kankainen, University of Helsinki and Ari Löytynoja, European Bioinformatics Institute
© 2006 University of Helsinki
© 2006 European Bioinformatics Institute