Input boxes for DNA upstream sequences
Paste or upload your DNA cluster sequences here. Sequences must be in FASTA format. The maximum cluster size is 31,000 characters, and sequences shorter than 4 bp are discarded. If you wish to input two clusters, click the button below the text areas to enable the second one. The reverse complement of each input sequence is generated automatically. If you don't know where to find upstream sequences, try these sites: TAIR for Arabidopsis thaliana and ENSMART for Homo sapiens, Mus musculus, etc. TIP! Headers must always be separated from the sequence data by a newline/linefeed character.
For example:
>GeneID information, header information, etc
TTCAGCTAGCTGCTACGATCGTACGTA
TAACACACATGCAT
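
The input rules above can be illustrated with a short Python sketch. It is not POCO's own code: the function names and the `MIN_LENGTH` constant are hypothetical, and the sketch only assumes what is stated above (FASTA input, sequences shorter than 4 bp discarded, reverse complement derived from the input).

```python
# Illustrative sketch only; names are hypothetical, not taken from POCO.
MIN_LENGTH = 4  # sequences shorter than this are discarded
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks).upper()))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may span several lines; join them later.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks).upper()))
    return records


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def load_cluster(text):
    """Parse a cluster and drop sequences shorter than MIN_LENGTH."""
    return [(h, s) for h, s in parse_fasta(text) if len(s) >= MIN_LENGTH]
```

Note that the two sequence lines in the example above belong to the same record and are joined into one sequence.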

Sequence length
This is the length of the sequences in the selected background model and in the input sequences. The maximum is 3000 (800 when using Saccharomyces cerevisiae). The search is automatically performed in both directions (sense and antisense). Input sequences longer than this value are cut to contain only the required number of bases, and shorter sequences are used as they are. When the default value (auto) is used, POCO detects the shortest sequence within the input sets and uses its length for the analysis. TIP! To obtain reliable results, use the same length as your input upstream sequences.
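
A minimal Python sketch of the length handling described above. The function names are hypothetical. The text does not say which end of a longer sequence is kept, so the sketch assumes the last bases (those nearest the gene in an upstream sequence) are retained; treat that as an assumption, not as POCO's documented behaviour.

```python
# Illustrative sketch only; names and the trimming direction are assumptions.
MAX_LENGTH = 3000  # 800 when using Saccharomyces cerevisiae


def resolve_length(sequences, requested="auto", max_length=MAX_LENGTH):
    """Return the analysis length: the shortest input for 'auto', else the request."""
    if requested == "auto":
        length = min(len(s) for s in sequences)
    else:
        length = int(requested)
    return min(length, max_length)


def trim(seq, length):
    """Cut longer sequences to `length` bases; shorter ones are returned unchanged."""
    # ASSUMPTION: the end of the sequence is kept.
    if len(seq) > length:
        return seq[len(seq) - length:]
    return seq
```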

Pattern length
Please select the maximum length of the patterns you wish to search for. The maximum length is also the maximum number of non-wildcard nucleotides (A, C, G and T) in the search. The minimum length is four. The program automatically searches for shorter patterns as well as patterns that combine wildcard and non-wildcard nucleotides (a pattern must have at least four non-wildcard nucleotides).
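
The constraint on wildcard patterns can be sketched in Python as follows. The wildcard symbol `N` and both function names are assumptions made for illustration; the text above does not state which symbol POCO uses.

```python
import re

# Illustrative sketch only; the wildcard symbol and names are assumptions.
WILDCARD = "N"
BASES = "ACGT"


def is_valid_pattern(pattern, max_length):
    """A pattern needs between four and `max_length` non-wildcard nucleotides."""
    if any(c not in BASES + WILDCARD for c in pattern):
        return False
    informative = sum(1 for c in pattern if c in BASES)
    return 4 <= informative <= max_length


def pattern_to_regex(pattern):
    """Translate a wildcard pattern into a regular expression over A, C, G, T."""
    return re.compile("".join("[ACGT]" if c == WILDCARD else c for c in pattern))
```

For instance, with a maximum pattern length of 8, `ACNNGT` is accepted (four non-wildcard nucleotides) while `ACNG` is rejected (only three).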

Patterns to report
You can limit the results to contain only a certain number of top patterns.

Min occurrence
Please enter the minimum number of sequences that must contain a given pattern. When using two clusters, a pattern that passes this threshold in either cluster is analyzed, but it can be reported as enriched only in the cluster where it passed the threshold. This value can be used to remove patterns that occur many times in only a few sequences, or patterns that are not found in the majority of sequences. An enriched pattern is reported only if it passes this threshold, whereas depleted patterns can have any occurrence (even zero). When the default value (auto) is used, POCO detects the number of sequences within the input sets and uses the size of the smaller input set.
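
The key point above is that the threshold counts sequences containing a pattern, not the total number of hits. A small Python sketch, with hypothetical function names, assuming the threshold is inclusive and looking at the sense strand only for brevity:

```python
# Illustrative sketch only; names and the inclusive comparison are assumptions.
def sequences_with_pattern(sequences, pattern):
    """Count sequences containing the pattern at least once."""
    return sum(1 for s in sequences if pattern in s)


def passes_min_occurrence(sequences, pattern, min_occurrence):
    """True if enough sequences contain the pattern (ASSUMPTION: '>=')."""
    return sequences_with_pattern(sequences, pattern) >= min_occurrence


cluster = ["ACGTACGT", "TTTTACGT", "GGGGGGGG", "ACGTACGTACGT"]
# "ACGT" occurs several times in some sequences but is counted once per sequence.
n_sequences = sequences_with_pattern(cluster, "ACGT")
```

Here `n_sequences` is 3, so the pattern would pass a threshold of 3 but not a threshold of 4.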

Background organism
Please select one from the drop-down list. Backgrounds contain the upstream regions of almost every gene in the selected genome. Currently available species are: Anopheles gambiae, Arabidopsis thaliana, Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans and Saccharomyces cerevisiae. The suffix full denotes the complete sequence set (as it is in ENSEMBL or TAIR), whereas the suffix clean denotes a filtered sequence set from which sequences containing letters other than A, C, G or T have been removed.

Number of sequences to pick out
This is the number of sequences that each bootstrap sample (or cluster) will contain. The program uses random sampling with replacement, which means that a sequence from the selected dataset can be included in a sample zero or more times. Bootstrap samples are used to calculate background means and standard deviations. For example, if your input cluster contains 20 sequences and the value entered here is 10, any given sequence can appear in a sample between 0 and 10 times. TIP! Use a value that is less than or equal to the size of the input cluster. With two clusters, a good starting point is the size of your smaller cluster. When the default value (auto) is used, POCO detects the number of sequences within the input sets and uses the size of the smaller input set for the analysis.
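
Sampling with replacement and the resulting background statistics can be sketched in Python as below. This is not POCO's implementation: the function name is hypothetical, and the statistic being averaged (the number of sequences per sample that contain the pattern) is an assumption chosen for illustration.

```python
import random
import statistics


# Illustrative sketch only; the function name and counted statistic are assumptions.
def bootstrap_background(background, pattern, sample_size, n_samples, seed=None):
    """Return (mean, standard deviation) of per-sample pattern counts.

    Each sample draws `sample_size` sequences with replacement, so the same
    sequence can appear in a sample zero or more times.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_samples):
        sample = rng.choices(background, k=sample_size)  # with replacement
        counts.append(sum(1 for s in sample if pattern in s))
    # statistics.stdev needs at least two samples.
    return statistics.mean(counts), statistics.stdev(counts)
```

A call such as `bootstrap_background(sequences, "ACGT", sample_size=10, n_samples=100, seed=1)` corresponds to picking out 10 sequences in each of 100 generated samples.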

Number of samples generated
This is the number of bootstrap samples generated in the bootstrap simulation. Increasing this value increases the time you have to wait, but also the accuracy of your results, because more samples are drawn.

Your email address
Enter your email address if you wish to receive an email notification when your analysis is ready. The email will contain a link to the result pages, which are kept on our server for a few days. NOTE! This parameter is optional, but if you don't use it you must wait until the results appear in your browser.

Pattern to search for
The anchoring pattern to be used. Enter the pattern in whose vicinity you wish to search for other over-represented and/or distinctly represented patterns.

Sequence length in front
The number of nucleotides located in front of the anchoring pattern in your input sequences that are included in the analysis. If the background type is motif sequences, this is also the number of nucleotides included in the analysis from the background sequences. If the background type is random sequence, the sequence length analyzed in the background sequences is sequence length in front + sequence length behind + search pattern length.

Background type
Choose the type of background to be used in the analysis. The background can be generated either from sequences containing the anchoring pattern or from randomly selected sequences.

Sequence length behind
The number of nucleotides located behind the anchoring pattern in your input sequences that are included in the analysis. If the background type is motif sequences, this is also the number of nucleotides included in the analysis from the background sequences. If the background type is random sequence, the sequence length analyzed in the background sequences is sequence length in front + sequence length behind + search pattern length.
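
How the front and behind lengths define the analyzed regions around an anchoring pattern can be sketched in Python. The function names are hypothetical, and the sketch assumes every occurrence of the anchor in a sequence yields its own pair of flanking regions, which the text above does not confirm.

```python
# Illustrative sketch only; names and per-occurrence handling are assumptions.
def anchored_windows(seq, anchor, front, behind):
    """Return (front_region, behind_region) for each occurrence of the anchor."""
    windows = []
    start = seq.find(anchor)
    while start != -1:
        end = start + len(anchor)
        # Regions are clipped at the sequence boundaries.
        windows.append((seq[max(0, start - front):start], seq[end:end + behind]))
        start = seq.find(anchor, start + 1)
    return windows


def random_background_length(front, behind, anchor):
    """Length analyzed in the background when the background type is random sequence."""
    return front + behind + len(anchor)
```

For example, with the anchor `TATA`, 3 nucleotides in front and 2 behind, the sequence `AAACCCTATAGGGTTT` yields the regions `CCC` and `GG`, and the random-sequence background length is 3 + 2 + 4 = 9.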


This tool was developed by Matti Kankainen, University of Helsinki
Contact the Webmaster.
© 2005 University of Helsinki