Input boxes for DNA upstream sequences
Paste or upload your DNA cluster sequences here. Sequences must be in FASTA format, and the maximum cluster size is 62,000 characters. To input two clusters, first click the button below the text areas to enable the second one. Input sequences are automatically converted into reverse-complementary form. If you do not know where to find upstream sequences, try these sites: TAIR for Arabidopsis thaliana and ENSMART for Homo sapiens, Mus musculus, etc. TIP! Headers must always be separated from sequences by a newline (linefeed) character.
For example:
>GeneID or some other information...
TTCAGCTAGCTGCTACGATCGTACGTA
TAACACACATGCAT
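
A minimal sketch in Python of how input of this shape could be read and checked before pasting. It is not POBO's own parser: the names parse_fasta and within_size_limit are hypothetical, and counting every pasted character against the 62,000 limit is an assumption.

```python
MAX_CLUSTER_CHARS = 62_000  # limit quoted above; what exactly is counted is an assumption


def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            chunks.append(line.upper())  # sequence lines are joined together
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def within_size_limit(text):
    """True if the pasted cluster does not exceed the character limit."""
    return len(text) <= MAX_CLUSTER_CHARS


example = """>GeneID or some other information...
TTCAGCTAGCTGCTACGATCGTACGTA
TAACACACATGCAT
"""
records = parse_fasta(example)
```

Here the two sequence lines are joined into one 41-base sequence under a single header, which is why the header must sit on its own line.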

Search consensus patterns
Please select the pattern (transcription factor element) you wish to search for within your sequences from the drop-down list; the pattern will appear in the right-hand text box. The patterns in the drop-down list come from the PLACE, TRANSFAC and JASPAR databases. Please visit their web pages to find out more about the patterns.
With the Find Keywords button you can create a drop-down list that contains only the patterns matching the words you type. To list all patterns, type *. To enable or disable Search consensus patterns, press the button next to the title.
Examples:
ID, Ac
TATABOX1, S000108
TATABOX2, S000109
TATABOX3, S000110
TATABOX4, S000111
TATABOX5, S000203
TATABOXOSPAL, S000400
TATAPVTRNALEU, S000340
V$TATA_C, M00216
V$TATA_01, M00252
Alternatively, you may enter your own pattern into the text box. Patterns must contain only IUPAC nucleotide letters.
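
To illustrate what an IUPAC pattern stands for, here is a short Python sketch that expands the standard IUPAC nucleotide codes into a regular expression and reports where it matches. It only shows the idea and does not describe how POBO searches internally. The function names and the pattern TATAWA are made up for the example and are not taken from PLACE, TRANSFAC or JASPAR.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression string."""
    try:
        return "".join(IUPAC[ch] for ch in pattern.upper())
    except KeyError as err:
        raise ValueError(f"not an IUPAC nucleotide letter: {err.args[0]!r}") from None


def find_pattern(pattern, sequence):
    """Return 0-based start positions of all matches, overlapping ones included."""
    # The lookahead lets matches that overlap each other all be reported.
    regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence.upper())]


print(find_pattern("TATAWA", "GGTATAAATATATAGG"))  # [2, 8]
```

A pattern containing any other letter, for example TAXA, is rejected with a ValueError, which mirrors the IUPAC-only rule above.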

Search matrix patterns
Please type or paste the matrix of the pattern (transcription factor cis-element) you wish to search for. Matrices must have 5 rows (a title row and one row for each nucleotide: A, C, G and T), separated by linefeeds. The columns of the matrix correspond to the nucleotide positions in the pattern. Please note that after the first row the matrix must contain only numbers and no letters, so in some cases you have to delete consensus sequences or column headers from the matrix. To enable or disable the Search matrix patterns area, press the button next to the title.
Examples:
>name
0 6 0 0
0 0 6 0
0 2 4 0
6 0 0 0
0 0 0 6
>name
1.20 -1.6 -1.6 -1.6
0.00 -1.6 0.96 -1.6
-1.6 -1.6 0.96 0.00
0.00 0.00 -1.6 0.59
-1.6 0.00 0.59 0.00
0.00 0.00 0.00 0.00

Cut-off for matrix similarity
When using matrices you must set a threshold value, "Cut-off for matrix similarity", for detection. Under a matrix each pattern has a score, which is calculated by summing the matrix values of its nucleotides, one value per position. Patterns whose scores reach the threshold are searched for in the sequences, whereas the others are ignored. When setting the threshold, be aware that some matrices are in log scale and can contain negative values.
In the first example the length of the matrix is 5 bp. Reading its four columns as A, C, G and T, CGGAT scores 28, CGCAT scores 26, and CGAAT and CGTAT score 24. Setting the threshold to 24 would therefore search for the CG[ACGT]AT pattern, because all of its variants score at least 24. Alternatively, setting the value to 26 or 27 would search for the CG[CG]AT or CGGAT pattern, respectively. The value 29 would give 0 occurrences for the matrix, because every pattern scores less than that.
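
The scoring can be checked by hand with a few lines of Python. This is only a sketch under stated assumptions, not POBO's implementation: each data row is read as one position with columns in the order A, C, G, T (the reading that gives the CG[CG]AT pattern for the first example), and a pattern is kept when its score is at least the cut-off. All function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"  # assumed column order of the matrix


def parse_matrix(text):
    """Split matrix text into its name and a list of rows, one row per position."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    name = lines[0].lstrip(">").strip()
    rows = [[float(v) for v in ln.split()] for ln in lines[1:]]
    if any(len(row) != 4 for row in rows):
        raise ValueError("each row needs four numbers (A, C, G, T)")
    return name, rows


def score(word, rows):
    """Sum the matrix value of each nucleotide at its position."""
    return sum(rows[i][BASES.index(base)] for i, base in enumerate(word))


def words_at_or_above(rows, cutoff):
    """List every word of matrix length whose score is at least the cut-off (assumed rule)."""
    words = ("".join(p) for p in product(BASES, repeat=len(rows)))
    return sorted(w for w in words if score(w, rows) >= cutoff)


first_example = """>name
0 6 0 0
0 0 6 0
0 2 4 0
6 0 0 0
0 0 0 6
"""
name, rows = parse_matrix(first_example)
for cutoff in (24, 26, 27, 29):
    print(cutoff, words_at_or_above(rows, cutoff))
```

For this matrix the best word, CGGAT, scores 28, so any cut-off above 28 leaves nothing to search for. Enumerating all words is only practical for short matrices, since their number grows as 4 to the power of the matrix length.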

Background organism
Please select one from the drop-down list. Backgrounds contain the upstream regions of almost every gene in the selected genome. Currently available species are: Anopheles gambiae, Arabidopsis thaliana, Homo sapiens, Mus musculus, Drosophila melanogaster, Caenorhabditis elegans and Saccharomyces cerevisiae. The suffix "full" marks the complete sequence set (as it is in ENSEMBL or in TAIR), whereas the suffix "clean" marks a filtered sequence set from which sequences containing letters other than A, C, G or T have been removed.

Number of samples to generate
This is the number of artificial sequence clusters that POBO will generate. Increasing this value increases both the time you have to wait and the accuracy of your results. Values that are too high can make the p-value look extremely statistically significant even though the result is not. In this case it is recommended to look at the picture and judge the result yourself. TIP! The recommended value is 100 - 1000.

Number of sequences to pick-out
This is the number of sequences that each pseudo-cluster will contain. The program uses random sampling with replacement. This means that a sequence from the input dataset can be included in a pseudo-cluster zero, one or several times. For example, if your input cluster contains 20 sequences and the value entered here is 10, any one sequence can appear in a pseudo-cluster anywhere from 0 to 10 times. TIP! Use a value that is less than or equal to the size of the input cluster. In the case of two clusters, a good starting point could be the size of your smaller cluster.
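
Sampling with replacement can be sketched in Python as follows. This shows the general technique only, not POBO's own code. The function name make_pseudo_clusters and its parameters are hypothetical: n_samples plays the role of "Number of samples to generate" and n_pick the role of "Number of sequences to pick-out".

```python
import random
from collections import Counter


def make_pseudo_clusters(sequences, n_samples, n_pick, seed=None):
    """Build n_samples pseudo-clusters, each holding n_pick sequences drawn with replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [rng.choices(sequences, k=n_pick) for _ in range(n_samples)]


# Twenty placeholder sequence names standing in for an input cluster.
cluster = [f"seq{i}" for i in range(1, 21)]
pseudo_clusters = make_pseudo_clusters(cluster, n_samples=100, n_pick=10, seed=1)

# How often each input sequence was drawn into the first pseudo-cluster.
print(Counter(pseudo_clusters[0]))
```

Because every draw is made from the full cluster, the same sequence may be picked more than once and others may be left out entirely.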

Sequence length
This is the length of the sequences in the selected background model and in the input sequences. The maximum is 3000 (800 when using Saccharomyces cerevisiae). The search is automatically performed in both directions (sense and antisense). Please note that longer input sequences are cut to the requested number of bases, and shorter sequences are used as they are. TIP! To obtain reliable results, use the same length as your input upstream sequences.
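
The effect of the length setting and of searching both directions can be sketched in Python. This is an illustration under assumptions rather than POBO's code: it is assumed here that a long sequence is cut by keeping its first bases, which the text above does not specify, and all function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the antisense strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def trim(seq, length):
    """Cut longer sequences to `length` bases; shorter ones are returned unchanged.

    Keeping the first bases is an assumption made for this sketch.
    """
    return seq[:length]


def count_overlapping(word, seq):
    """Count occurrences of word in seq, overlapping ones included."""
    count, start = 0, seq.find(word)
    while start != -1:
        count += 1
        start = seq.find(word, start + 1)
    return count


def count_both_strands(word, seq, length):
    """Count a word on the sense and antisense strands of the trimmed sequence."""
    seq = trim(seq.upper(), length)
    return count_overlapping(word, seq) + count_overlapping(word, reverse_complement(seq))


sequence = "TTCGGATTTATCCGTT"
print(count_both_strands("CGGAT", sequence, 3000))  # 2: one hit on each strand
print(count_both_strands("CGGAT", sequence, 8))     # 1: the antisense hit is cut off
```

Note that a word equal to its own reverse complement is counted once per strand by this sketch, so such hits appear twice.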


This tool was developed by Matti Kankainen, University of Helsinki
© 2004 University of Helsinki