/***************************************************************************
*   Copyright (C) 2007 by Ari Loytynoja and Matti Kankainen                *
*   ari@ebi.ac.uk / matti.kankainen@helsinki.fi                            *
*                                                                          *
*   This program is free software; you can redistribute it and/or modify   *
*   it under the terms of the GNU General Public License as published by   *
*   the Free Software Foundation; either version 2 of the License, or      *
*   (at your option) any later version.                                    *
*                                                                          *
*   This program is distributed in the hope that it will be useful,        *
*   but WITHOUT ANY WARRANTY; without even the implied warranty of         *
*   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the          *
*   GNU General Public License for more details.                           *
*                                                                          *
*   You should have received a copy of the GNU General Public License      *
*   along with this program; if not, write to the                          *
*   Free Software Foundation, Inc.,                                        *
*   59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.              *
/**************************************************************************/
PARAMETER TYPES:
       -input=         String                                  no default
       -out=           String                                  no default
       -noise=         Int (0/1)                               Default = 0     (no noise printing)
       -norm=          Int (0-6, or a combination of Ints)     Default = 5     (evolutionary score)
       -spacer=        Int (0/1)                               Default = 1     (allow spacers)
       -gopen=         Int (-inf-inf)                          Default = -10
       -gext=          Int (-inf-inf)                          Default = -1
       -mode=          Int (0/1)                               Default = 0     (report best matches)
       -zscore=        Int (0/1)                               Default = 0     (use original scores)
       -random=        Int (0-inf)                             Default = 0     (do not permute)
       -freqat=        Float (0.0-1.0)                         Default = 0.5   (AT/GC ratio is similar)
       -freqcg=        Float (0.0-1.0)                         Default = 0.5   (AT/GC ratio is similar)
       -pseudo=        Float (0.0-1.0)                         Default = 0     (do not add pseudo-counts)
       -term=          Int (0-inf)                             Default = 5
       -match=         Int (-inf-inf)                          Default = 10
       -transi=        Int (-inf-inf)                          Default = -4
       -transv=        Int (-inf-inf)                          Default = -4
/**************************************************************************/
EXAMPLE
	GET BEST PAIRS:
		matlign -input=tmpJ7pU2Z -transv=-4 -transi=-4 -match=5 -gopen=-10
		-gext=-1 -term=5 -norm=235 -out=tmpJ7pU2Z -spacer=1 -mode=0 
		-zscore=1 -random=100 -pseudo=0 -freqat=0.5 -freqcg=0.5
	CLUSTER MOTIFS:
		matlign -input=tmp9of8oY -transv=-4 -transi=-4 -match=5 -gopen=-10 
		-gext=-1 -term=5 -norm=235 -out=tmp9of8oY -spacer=1  -mode=1 
		-zscore=1 -random=0 -pseudo=0 -freqat=0.5 -freqcg=0.5
/**************************************************************************/
PARAMETERS
	-input= 	
		The name of the file that contains the input motifs. Motifs can
		be described as PFMs (position frequency matrices) and/or CSs
		(consensus strings). Both forms are converted to frequency
		matrices by dividing the count for each character by the total
		sum of the row, using the given AT/CG frequencies and
		pseudo-weights.

		A PFM is a matrix whose rows represent the nucleotide positions of
		the motif and whose columns give the number of occurrences of each
		nucleotide (A, C, G, T) at that position. For example, the
		jaspar_n0001 PFM has ten rows (plus the header), one for each of
		its nucleotide positions. Note that the columns of a PFM must be
		separated by tab characters (\t)!
                >jaspar_n0001
                26      36      12      26
                26      31      12      30
                46      13      13      28
                36      13      13      38
                43      12      12      33
                38      13      12      38
                42      13      13      31
                28      13      13      46
                42      13      19      26
                26      13      35      27
		
		A CS is a string that can consist of the characters A,C,G,T,R,Y,S,W,
		K,M,B,D,H,V and N. Of these, A,C,G and T are nucleotides and the
		rest represent degenerate nucleotides (= a combination of A,C,G
		and T; see http://bioinformatics.org/sms/iupac.html). The tool
		treats a CS as a matrix in which the frequency of each nucleotide
		is either 0 or 1.
                >test_pattern1
                AACTNWSTAA
                >test_pattern2
                CCGGWWNAA
                >and so on
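
		The 0/1 matrix view of a CS described above can be illustrated with
		a short Python sketch. It is not taken from the matlign source: the
		function name and the IUPAC table are written out here for
		illustration only, and the A, C, G, T column order is assumed to
		follow the PFM layout. Any rescaling by -freqat/-freqcg or -pseudo
		is left out.

```python
# IUPAC nucleotide codes mapped to the nucleotides they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_indicator_matrix(cs):
    """Return one [A, C, G, T] row of 0/1 values per character of cs.

    Hypothetical helper: it only mirrors the description in this README.
    """
    rows = []
    for ch in cs.upper():
        allowed = IUPAC[ch]  # KeyError for characters outside the IUPAC set
        rows.append([1 if nt in allowed else 0 for nt in "ACGT"])
    return rows


if __name__ == "__main__":
    pattern = "AACTNWSTAA"  # test_pattern1 from the example above
    for ch, row in zip(pattern, consensus_to_indicator_matrix(pattern)):
        print(ch, row)
```

		Each row has a 1 for every nucleotide the character allows, so N
		yields four 1s and W (A or T) yields 1s in the first and last column.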

	-out=   	
		prefix that is common for all the output files. Matlign creates six 
		output files (prefix.pswm, prefix.mtrx, prefix.tree, prefix.vara, 
		prefix.fdrs, prefix.freq).

		pswm = PFM file describing the motifs with relative frequencies 
			 (includes both input PFMs/CSs and PFMs that are clustered).
		mtrx = pair-wise distance matrix of each input motif/cluster pair. 
			 Distance is the score calculated using the chosen distance 
			 function. Hierarchical clustering is done by choosing the 
			 motif/cluster pair with the highest value.
		tree = motif/cluster joining events.
		vara = silhouette values for each cluster created. Can be used to 
			 find the optimal cluster set.
		fdrs = false discovery rate (FDR) values for each motif/cluster 
			 pair. 
		freq = count matrix file showing the number of PFMs that were
			 joined into each cluster.

	-noise= 	
		Print progress information to STDOUT. 1 = print, 0 = don't print.

	-norm=
		Score function. 0=Kendall's tau correlation, 1=Spearman's rank
		correlation (a version where equal ranks are not corrected), 2=
		Spearman's rank correlation (a version where equal ranks are
		corrected), 3=Pearson's correlation, 4=normalized Euclidean distance
		(values scaled between -1 and 1), 5=evolutionary score, 6=the method
		used in the YSRA tool. Functions can be combined by concatenating
		the function numbers. For example, -norm=5 is the evolutionary
		score and -norm=1 is Spearman's uncorrected correlation, whereas
		-norm=15 uses both functions.
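
		As an illustration of how a concatenated -norm value reads, the
		following Python sketch splits a value into the score functions it
		names. It is only a reading aid under the assumption that each digit
		selects one function; parse_norm is a hypothetical helper, and how
		matlign combines the selected scores internally is not described
		here.

```python
# Score functions as listed for -norm= in this README.
SCORE_FUNCTIONS = {
    "0": "Kendall's tau correlation",
    "1": "Spearman's rank correlation (equal ranks not corrected)",
    "2": "Spearman's rank correlation (equal ranks corrected)",
    "3": "Pearson's correlation",
    "4": "normalized Euclidean distance",
    "5": "evolutionary score",
    "6": "YSRA method",
}


def parse_norm(value):
    """Split a -norm value such as 235 into the functions it names.

    Assumption: every digit of the value selects one score function.
    """
    names = []
    for digit in str(value):
        if digit not in SCORE_FUNCTIONS:
            raise ValueError("unknown score function: %r" % digit)
        names.append(SCORE_FUNCTIONS[digit])
    return names


if __name__ == "__main__":
    for name in parse_norm(235):  # the value used in the EXAMPLE section
        print(name)
```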

	-spacer=
		Use spacers. 1 = use, 0 = don't use. If spacers are allowed, the
		alignment of two motifs can contain one internal spacer.

	-gopen=
		Gap open penalty. Penalty given to the motif pair for opening a gap
		in its alignment. An internal gap that is n characters long is
		penalised using the function 'gap opening score' + (n-1) * 'gap
		extension score'. Gap opening and extension scores should be
		negative, with opening typically penalised more heavily (i.e.,
		given a more negative value) than extension.

	-gext=
		Gap extension penalty. Penalty given to the motif pair for extending
		the gap. An internal gap that is n characters long is penalised
		using the function 'gap opening score' + (n-1) * 'gap extension
		score'. Gap opening and extension scores should be negative, with
		opening typically penalised more heavily (i.e., given a more
		negative value) than extension.
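
		The gap penalty function given under -gopen= and -gext= can be
		written out as a small Python sketch. The function name is
		hypothetical; the default values are the documented defaults of
		-gopen and -gext.

```python
def internal_gap_penalty(n, gopen=-10, gext=-1):
    """Penalty for an internal gap of n characters.

    Implements 'gap opening score' + (n-1) * 'gap extension score'.
    """
    if n < 1:
        raise ValueError("gap length must be at least 1")
    return gopen + (n - 1) * gext


if __name__ == "__main__":
    for length in (1, 2, 5):
        print(length, internal_gap_penalty(length))
```

		With the defaults, a gap of one character costs the opening score
		alone, and each further character adds one extension score.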

	-mode=
		Clustering mode. 1 = cluster motifs, 0 = don't cluster. If clustering
		mode is off, the best matching motif for each motif is shown. If
		clustering mode is on, hierarchical clustering is applied to the
		motifs.

	-zscore=
		Use Z-score transformed scores. 1 = Z-scores used, 0 = raw scores
		used. 

	-random=
		Number of permutations. If 0, no permutations are performed. If
		higher than zero, permutations are performed and FDR values are
		calculated.

	-freqat=
		prior frequency for A and T nucleotides. Used to convert CSs to 
		relative frequencies and to create consensus strings from relative
		frequencies. Use 0.5, if CS counts should not be re-scaled. For 
		example, if -freqat=0.5, then the character R(=A or G) is converted 
		to 0.5, 0, 0.5, 0. If -freqat=0.6, then the character R(=A or G) is 
		converted to 0.6, 0, 0.4, 0.

		If a pseudo-weight is also defined, PFMs are converted to relative
		frequencies by adding the pseudo-weight according to the given prior
		frequency. Relative frequency of PFMij = [PFMij + pseudo-weight *
		prior-frequency] / [SUM(PFMi) + pseudo-weight].

	-freqcg=
		prior frequency for C and G nucleotides. Used to convert CSs to 
		relative frequencies and to create consensus strings from relative
		frequencies. Use 0.5, if CS counts should not be re-scaled. For 
		example, if -freqcg=0.5, then the character R(=A or G) is converted 
		to 0.5, 0, 0.5, 0. If -freqcg=0.6, then the character R(=A or G) is 
		converted to 0.4, 0, 0.6, 0.

		If a pseudo-weight is also defined, PFMs are converted to relative
		frequencies by adding the pseudo-weight according to the given prior
		frequency. Relative frequency of PFMij = [PFMij + pseudo-weight *
		prior-frequency] / [SUM(PFMi) + pseudo-weight].

	-pseudo=
		Pseudo-weight. The pseudo-weight determines the pseudo-count, which
		is shared between the motif columns according to their prior
		frequencies. Pseudo-weights/counts make the input motifs less exact,
		which is useful, for example, when the motifs contain some
		uncertainty. Used to convert PFMs/CSs to relative frequency matrices.
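
		The conversion formula given under -freqat= and -freqcg= can be
		sketched in Python as below. This is an illustration, not the
		matlign implementation: the function name is hypothetical, and the
		per-nucleotide priors are passed in directly as four values that
		sum to 1. How matlign derives them from -freqat and -freqcg is an
		assumption left to the caller (for example freqat/2 each for A and
		T, and freqcg/2 each for C and G).

```python
def relative_frequencies(counts, pseudo=0.0, priors=(0.25, 0.25, 0.25, 0.25)):
    """Convert one PFM row of A, C, G, T counts to relative frequencies.

    Implements [PFMij + pseudo * prior_j] / [SUM(PFMi) + pseudo].
    priors holds one prior frequency per nucleotide (assumed to sum to 1).
    """
    total = sum(counts) + pseudo
    if total <= 0:
        raise ValueError("row has no counts and no pseudo-weight")
    return [(c + pseudo * p) / total for c, p in zip(counts, priors)]


if __name__ == "__main__":
    row = [26, 36, 12, 26]  # first row of the jaspar_n0001 example
    print(relative_frequencies(row))              # no pseudo-counts
    print(relative_frequencies(row, pseudo=1.0))  # pseudo-weight of 1
```

		Because the priors sum to 1, each converted row still sums to 1; a
		larger pseudo-weight pulls the row towards the prior frequencies.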

	-term=
		maximum length of the terminal gaps. 

	-match=
		Score for matching nucleotides. Needed when using the evolutionary
		score (Viterbi algorithm).

	-transi=
		Score for transitions (A<->G, C<->T). Needed when using the
		evolutionary score (Viterbi algorithm).

	-transv=
		Score for transversions (A<->C, A<->T, C<->G, G<->T). Needed when
		using the evolutionary score (Viterbi algorithm).

		Character match scores: each nucleotide matched with another
		nucleotide gains a score according to the substitution score matrix.
		In the simplest case, the diagonal scores (i.e., two identical
		characters matching) are positive and all the rest are zero. However,
		one can apply more complex scores and, e.g., give a small positive
		score for transitions (A<->G,C<->T).
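
		A substitution score matrix of this kind can be sketched in Python
		from the -match, -transi and -transv values. The function names are
		hypothetical and the sketch scores single nucleotides only; how
		matlign applies the scores to frequency columns is not described
		here.

```python
PURINES = {"A", "G"}  # C and T are the pyrimidines


def substitution_score(x, y, match=10, transi=-4, transv=-4):
    """Score for aligning nucleotide x with nucleotide y."""
    if x == y:
        return match
    # A change within purines or within pyrimidines is a transition.
    if (x in PURINES) == (y in PURINES):
        return transi
    return transv


def substitution_matrix(match=10, transi=-4, transv=-4):
    """Return the full 4x4 matrix as a dict keyed by (x, y) pairs."""
    return {
        (x, y): substitution_score(x, y, match, transi, transv)
        for x in "ACGT"
        for y in "ACGT"
    }


if __name__ == "__main__":
    # Small positive score for transitions, as suggested above.
    matrix = substitution_matrix(match=10, transi=2, transv=-4)
    for x in "ACGT":
        print(x, [matrix[(x, y)] for y in "ACGT"])
```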
/**************************************************************************/



