1. How does Procom work?
2. What are ANCHOR, INTERSECTION, and SUBTRACTION organisms?
3. What are the parameters used in BLASTP?
4. Why do I have to specify both "Intersection" and "Subtraction" E-values?
5. Where did you download the proteomes? Are the genomes complete?
6. Contact and citation

1. How does Procom work?

(1) Choose one organism as the ANCHOR;
(2) Choose 0 or more organisms as INTERSECTION organisms, to identify matches between the ANCHOR and each INTERSECTION organism;
(3) Choose 0 or more organisms as SUBTRACTION organisms, to subtract matches between the ANCHOR and each SUBTRACTION organism;
(4) BLASTP: each ANCHOR protein serves as the query, the proteins of each INTERSECTION or SUBTRACTION organism serve as the database, and the E-value thresholds are specified by the user;
(5) Pick the matches between the ANCHOR and each of the INTERSECTION organisms, and identify the ones in common;
(6) Remove the matches between the ANCHOR and each of the SUBTRACTION organisms from (5).
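
Steps (5) and (6) above amount to set operations on ANCHOR protein IDs. The following is a minimal Python sketch of that logic, not Procom's actual code. The function name `select_anchor_proteins`, the organism labels, and the protein IDs are hypothetical, and treating an empty INTERSECTION choice as "keep every ANCHOR protein" is an assumption.

```python
def select_anchor_proteins(anchor_ids, intersection_hits, subtraction_hits):
    """Return ANCHOR IDs matched in every INTERSECTION organism
    and in no SUBTRACTION organism.

    intersection_hits / subtraction_hits map an organism label to the set
    of ANCHOR protein IDs that had a BLASTP match in that organism.
    """
    # Assumption: with no INTERSECTION organism, start from all ANCHOR proteins.
    selected = set(anchor_ids)

    # Step (5): keep only the proteins matched in every INTERSECTION organism.
    for matched in intersection_hits.values():
        selected &= set(matched)

    # Step (6): remove the proteins matched in any SUBTRACTION organism.
    for matched in subtraction_hits.values():
        selected -= set(matched)

    return selected


# Hypothetical data for illustration only.
anchor = {"A1", "A2", "A3", "A4"}
inter = {"org_X": {"A1", "A2", "A3"}, "org_Y": {"A2", "A3", "A4"}}
sub = {"org_Z": {"A3"}}

print(select_anchor_proteins(anchor, inter, sub))  # {'A2'}
```

In this toy case A2 and A3 are common to both INTERSECTION organisms, and A3 is then removed because it also matches the SUBTRACTION organism.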

2. What are ANCHOR, INTERSECTION, and SUBTRACTION organisms?

ANCHOR:
(1) ANCHOR is used as the query in the BLASTP comparisons;
(2) The output IDs and sequences are from the ANCHOR;
(3) ANCHOR is often the organism the user is working on or familiar with.

INTERSECTION:
(1) INTERSECTION is used as the database in the BLASTP comparisons;
(2) The output ANCHOR proteins must have a match in all INTERSECTION organisms that are chosen;
(3) INTERSECTION organisms often share the trait of interest with the ANCHOR.

SUBTRACTION:
(1) SUBTRACTION is used as the database in the BLASTP comparisons;
(2) The output ANCHOR proteins should NOT have a match in any of the SUBTRACTION organisms that are chosen;
(3) SUBTRACTION organisms do not share the trait of interest with the ANCHOR.

3. What are the parameters used in BLASTP?

E=1 V=1 B=1 -filter SEG+XNU

E=1: Only the matches with E-value <= 1 are reported.
V=1: One-line descriptions are reported for at most one database sequence.
B=1: High-scoring segment pairs (HSPs) are reported for at most one database sequence.
-filter SEG+XNU: Masks low-complexity regions with the SEG and XNU filters.

The BLASTP output file is parsed, and the name of each query (ANCHOR) protein is retrieved when its E-value is lower than the threshold specified by the user. The collections of query protein names are then compared with each other: names common to all intersection organisms are kept, and names found for any subtraction organism are removed.
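
The thresholding step can be sketched in Python as follows. This is an illustration under stated assumptions, not the parser Procom uses: the input is taken to be already-parsed (query name, E-value) pairs, the function name `queries_passing` and the sample values are hypothetical, and the strict `<` comparison simply follows the wording "lower than" above.

```python
def queries_passing(best_hits, threshold):
    """Collect the query (ANCHOR) protein names whose E-value is below threshold.

    best_hits: iterable of (query_name, e_value) pairs, one per reported hit,
    assumed to have been extracted from the BLASTP output beforehand.
    """
    # Assumption: strict comparison, following "lower than" in the text.
    return {name for name, e_value in best_hits if e_value < threshold}


# Hypothetical parsed hits against one database organism.
hits = [("A1", 1e-30), ("A2", 0.5), ("A3", 1e-8)]

print(sorted(queries_passing(hits, threshold=1e-5)))  # ['A1', 'A3']
```

One such collection is produced per intersection or subtraction organism, and the collections are then compared with each other as described above.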

4. Why do I have to specify both "Intersection" and "Subtraction" E-values?

For both intersection and subtraction organisms, the user specifies an E-value threshold. The lower the threshold, the more stringent the criterion for a match between the anchor and the intersection/subtraction organism. Because the final output consists of the anchor proteins that match the intersection organisms but do not match the subtraction organisms, the two thresholds affect the final list in opposite directions: a lower "Intersection" E-value keeps fewer proteins, whereas a higher "Subtraction" E-value removes more. To obtain a stringent list of proteins, it is therefore recommended to choose a low "Intersection" E-value and a high "Subtraction" E-value. Conversely, a high "Intersection" E-value and a low "Subtraction" E-value will generate a looser list of proteins.
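
A small worked example may make the opposite effects of the two thresholds concrete. The Python sketch below uses one intersection and one subtraction organism; the function name `final_list`, the protein IDs, and all E-values are made up for illustration, and the strict `<` comparison is an assumption.

```python
def final_list(inter_hits, sub_hits, inter_e, sub_e):
    """inter_hits / sub_hits map an ANCHOR protein ID to its E-value
    against one intersection / subtraction organism (hypothetical data)."""
    kept = {p for p, e in inter_hits.items() if e < inter_e}
    removed = {p for p, e in sub_hits.items() if e < sub_e}
    return kept - removed


inter_hits = {"A1": 1e-50, "A2": 1e-12, "A3": 1e-4, "A4": 1e-40}
sub_hits = {"A1": 1e-3, "A2": 1e-30}

# Low "Intersection" E-value, high "Subtraction" E-value: stringent list.
stringent = final_list(inter_hits, sub_hits, inter_e=1e-10, sub_e=1e-2)

# High "Intersection" E-value, low "Subtraction" E-value: loose list.
loose = final_list(inter_hits, sub_hits, inter_e=1e-3, sub_e=1e-20)

print(sorted(stringent))  # ['A4']
print(sorted(loose))      # ['A1', 'A3', 'A4']
```

With the stringent settings, A3 fails the intersection threshold and both A1 and A2 are subtracted; with the loose settings, A3 is kept and only A2 is subtracted.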

5. Where did you download the proteomes? Are the genomes complete?

We downloaded the protein sequences from the following sites. The genomes followed by a PubMed link are published and are considered "complete". The genomes that are not yet published are followed by the depth of coverage, whenever available. The user should interpret the results with caution when selecting "incomplete" genomes.

Anopheles gambiae [PubMed]
   ftp://ftp.ensembl.org/pub/current_mosquito/data/fasta/pep/
Arabidopsis thaliana [PubMed]
   ftp://tairpub:tairpub@ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/OLD
Aspergillus nidulans [Coverage: 13X; Percentage: 96%]
   http://www.broad.mit.edu/cgi-bin/annotation/aspergillus/download_license.cgi
Brugia malayi [Coverage: 5.1X]
   ftp://ftp.tigr.org/private/euk/b_malayi_fh574/ (license needed)
Caenorhabditis briggsae [PubMed]
   ftp://ftp.wormbase.org/pub/wormbase/briggsae-current_release/gff_db_load_files/run_25/
Caenorhabditis elegans [PubMed]
   ftp://ftp.wormbase.org/pub/wormbase/elegans/WS131/wormpep131.tar.gz
Chlamydomonas reinhardtii [Coverage: 8X]
   http://genome.jgi-psf.org/chlre2/chlre2.download.ftp.html
Ciona intestinalis [PubMed]
   http://genome.jgi-psf.org/ciona4/ciona4.download.ftp.html
Cryptococcus neoformans [Coverage: 10.5X]
   ftp://ftp.tigr.org/private/euk/c_neoformans_64hr/
Danio rerio [Coverage: 5.7X]
   ftp://ftp.ensembl.org/pub/current_zebrafish/data/fasta/pep/
Dictyostelium discoideum [PubMed]
   http://dictybase.org/db/cgi-bin/dictyBase/download/download.pl
Drosophila melanogaster [PubMed]
   ftp://ftp.ensembl.org/pub/current_fly/data/fasta/pep/
Encephalitozoon cuniculi [PubMed]
   ftp://ftp.ncbi.nlm.nih.gov/genomes/Encephalitozoon_cuniculi/
Entamoeba histolytica [PubMed]
   ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/e_histolytica/annotation_dbs/EHA1.pep [Newly Added 2005-12-5]
Fugu rubripes [PubMed]
   ftp://ftp.ensembl.org/pub/current_fugu/data/fasta/pep/
Gallus gallus [Coverage: 6.6X]
   ftp://ftp.ensembl.org/pub/current_chicken/data/fasta/pep/
Giardia lamblia
   http://gmod.mbl.edu/perl/site/giardia?page=download_tool&file=orfs_aa&type=orfs&noheader=T [Newly Added 2005-12-5]
Guillardia theta [PubMed]
   http://www.ebi.ac.uk/integr8/FtpSearch.do;jsessionid=1302BCF1222FD416655F54964C6F0C7C?orgTaxID=55529
Homo sapiens [PubMed]
   ftp://ftp.ensembl.org/pub/current_human/data/fasta/pep/
Leishmania major [Coverage: 10X]
   ftp://ftp.sanger.ac.uk/pub/databases/L.major_sequences/LEISHPEP/GeneDB_protein_database_270404
   ftp://ftp.sanger.ac.uk/pub/databases/L.major_sequences/LEISHPEP/GeneDB_Protein_database_100505 [New Version]
Mus musculus [PubMed]
   ftp://ftp.ensembl.org/pub/current_mouse/data/fasta/pep/
Neurospora crassa [PubMed]
   http://www.broad.mit.edu/cgi-bin/annotation/neurospora/download_license.cgi
Oryza sativa [PubMed]
   ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/pseudomolecules/version_2.0/all_chrs/
Plasmodium falciparum [PubMed]
   http://www.plasmodb.org/restricted/data/P_falciparum/WG/cds.aa/
Rattus norvegicus [PubMed]
   ftp://ftp.ensembl.org/pub/current_rat/data/fasta/pep/
Saccharomyces cerevisiae [PubMed]
   ftp://genome-ftp.stanford.edu/pub/yeast/data_download/sequence/genomic_sequence/orf_protein/
Schizosaccharomyces pombe [PubMed]
   ftp://ftp.sanger.ac.uk/pub/yeast/pombe/Protein_data/pompep/
Tetrahymena thermophila
   ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/t_thermophila/Gene_Predictions/
Thalassiosira pseudonana [PubMed]
   http://genome.jgi-psf.org/thaps1/thaps1.download.ftp.html
Toxoplasma gondii [Coverage: 8X]
   http://toxodb.org/restricted/data/Genome/pep/Tg10x_TwinScan_20040527.gz (license needed)
Trypanosoma brucei [PubMed (ChrI)] [PubMed (ChrII)]
   ftp://ftp.tigr.org/private/euk/t_brucei_fnzm1/annotation_dbs/TBA1.pep (license needed)
   ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/t_brucei/annotation_dbs/TBA1.pep [New Version]
Trypanosoma cruzi [Coverage: 19X]
   ftp://ftp.tigr.org/private/euk/t_cruzi_q98122/annotation_dbs/TCA1.pep (license needed)

6. Contact and citation

The authors who developed Procom are: Jin Billy Li, Miao Zhang, Susan K. Dutcher, and Gary D. Stormo.

Please email "billy [AT] ural.wustl.edu" for questions and comments.

Please cite:
Li, J.B., Zhang, M., Dutcher, S.K., and Stormo, G.D. (2005) Procom: a web-based tool to compare multiple eukaryotic proteomes. Bioinformatics. 21: 1693-1694. [Abstract] [PDF]


Last update: Jin Billy Li, December 6, 2005