Over 40% of Stratagene’s GeneConnection™ cDNA gene clusters are novel— and more
than 50% are estimated to be full-length clones
The GeneZoo™ or the GeneJungle™?
Jason Goncalves • Jean-Michel Lelias • Joe Sorge
Stratagene
The GeneConnection™ discovery cDNA clone collection
goes beyond normal cDNA collections. Stratagene has searched through many
different human tissues and has identified thousands of cDNAs that are not found
in the UniGene database (our GeneJungle™).* We have systematically
sequenced the ends of our clones and have used a rigorous cDNA clustering
algorithm to determine how many of our clones are unique and how many can be
found in UniGene (those found in UniGene are called our GeneZoo™). About
40% of our clusters are in the GeneJungle and 60% are in the GeneZoo. The
average insert size of our cDNAs is 1.7 Kb, from which we predict that about half of
the clones contain full-length open reading frames. We have taken
extreme care to guarantee clone identity and purity. Stratagene’s clones can be
found on our GeneConnection Virtual Lab using several different free search
strategies, including searching by gene name, keyword, UniGene ID, or nucleotide
or protein sequence (www.stratagene.com). A gel image of a restriction digest of
each clone can be viewed on our website. Stratagene’s GeneConnection discovery
clone collection has been tracked and curated with precision and consistency in
mind from its inception, and the GeneJungle collection provides an untapped
resource for new discoveries.
The raw sequence of the human genome is now mostly complete, but it will be
quite some time before all of the genes and their exon structures are known.
Gene finding software programs are inaccurate at spotting exon-intron junctions.
They can split groups of contiguous exons into two putative genes when they
really belong to one gene. They can also group exons into one gene when they
really belong to two or more genes. Thus the importance of high quality cDNA
collections is rising once again. cDNAs are an excellent way of confirming gene
structure, and Stratagene has a unique, well-characterized collection of
1.7 Kb average length human cDNA clones. Approximately 40% of Stratagene’s cDNA
clusters cannot be found in UniGene, and based on our initial in-house analysis
of known sequences, about 50% of Stratagene’s clones contain full-length open
reading frames, which agrees with our predicted estimate for the collection.
Inherent Problems with Clustering ESTs
ESTs (Expressed Sequence Tags) are DNA sequences derived from sequencing the
5′ and/or 3′ termini of cDNA inserts using vector-specific sequencing primers.
Each EST sequence is
typically 300 to 600 nucleotides in length. The dbEST database currently
contains about 2.1 million human ESTs. UniGene is an attempt to reduce these
sequences to a non-redundant set of gene-oriented “sequence clusters”. In theory
there should be one cluster for each underlying gene. Build #116 of UniGene
(July 3, 2000) has produced 81,967 unique clusters. However, UniGene is plagued
with many artifacts. The ESTs found in dbEST have come from many different
sources, with widely varying degrees of sequence quality. The quality of the
cDNA libraries from which these ESTs were derived is variable. Sometimes two
cDNAs can be fused together in one vector. This creates UniGene artifacts
whereby two unrelated cDNAs are placed into the same UniGene cluster. Genomic
DNA contamination is common, as are products resulting from aberrant
transcription and termination. Splice variants are of course common, and it is
difficult for the UniGene clustering algorithms to differentiate splice variants
from different gene products.
Table 1
Common UniGene Artifacts and Stratagene’s Solutions
| UniGene Artifact | Stratagene Solution |
|---|---|
| Fusion cDNAs from 2 different genes. Tends to cause separate genes to cluster together and underestimates gene number. | Did not ligate cDNAs, rather annealed them to vector to create library. Eliminates fusions. |
| Genomic DNA inserts contaminating cDNA library. Tends to create artificial clusters and overestimates the number of genes. | Did not ligate adapters onto ends of cDNA, thus genomic DNA false inserts are very rare. |
| Falsely separating 5′ and 3′ ends of cDNAs into different clusters. Tends to overestimate the number of genes. | Only cluster sequences having a polyA tail, thereby not counting 5′ and 3′ ends twice. |
| Falsely separating splice variants into separate clusters. Tends to overestimate the number of genes. | Only use sequence contiguous to the polyA tail, minimizing the appearance of splice variants in the data. |
Stratagene has made unique cDNA libraries. Much has been done
to minimize artifacts such as fusion inserts and genomic DNA inserts
(Table 1). Moreover, the cDNAs have been highly normalized, yielding a
very low level of clone redundancy. We have systematically sequenced the
3′ ends of our clones and analyzed the resulting sequences. In the
analysis, we remove sequences lacking poly-A tails. This effectively
eliminates an artifact typically found in UniGene: if 5′ and 3′ sequences
from the same gene are non-overlapping in UniGene, then UniGene will put
these 5′ and 3′ sequences into two different clusters, overestimating the
number of human genes. Because we only cluster 3′ end sequences bearing
poly-A tails, we do not see this 5′/3′ splitting artifact (Table 1).
Moreover, because the sequences that we have clustered to date are
contiguous with the poly-A tails, we are less likely to be confused by
splice variants, which occur less frequently in the 3′ untranslated
regions of transcripts (Table 1). The 3′ sequence has been demonstrated
to be an effective gene-specific identification tag. Because the 3′ UTRs
are not as conserved as the coding sequences, it is easier to distinguish
between individual genes and paralogous gene family members that may have
sequence homology in their coding sequences (1).

Fig.2
Our clustering algorithm is rigorous. We first identify commonly
repeated sequence elements in the data set. We then require that for any
two sequences to cluster, they must match at 96% identity or better over
100 or more base pairs, and the percentage of alignable sequence must be
greater than 90%. Figure 2 shows how the percentage of alignable sequence
is determined. Two sequences are compared and aligned to maximize the
number of matching base pairs. At each end of the local alignment, the
shorter of the two unaligned sequences is used to calculate the number of
alignable bases. The number of alignable bases is simply the sum of the
local alignment length and the lengths of these shorter unaligned
sequences flanking the local alignment. Alignments of commonly repeated or
low complexity sequences are discarded. The algorithm will not cluster
sequences from different gene family members, since the untranslated
regions tend not to align. In contrast, sequencing artifacts are ignored,
since they generally do not drop the percent identity below 96%. Our
algorithm would place splice variants into different clusters; however,
since we only use the sequence contiguous to the poly-A tail to perform
the clustering, splicing is not a significant factor. We could choose to
ignore splice variants when clustering (Figure 2) by eliminating
internally unpaired sequence from the computation of alignable length.
However, the algorithm we have chosen is more rigorous, and we rely
instead on there being little splicing near the 3′ ends.
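The clustering criteria above can be illustrated with a short Python sketch. It is not Stratagene's implementation: the `LocalAlignment` record, its field names, and the half-open coordinate convention are assumptions made for illustration, and the percentage of alignable sequence is assumed to be the local alignment length divided by the number of alignable bases, which is our reading of the description of Figure 2.

```python
from dataclasses import dataclass


@dataclass
class LocalAlignment:
    """Hypothetical record of one pairwise local alignment (half-open coordinates)."""
    len_a: int           # total length of sequence A
    len_b: int           # total length of sequence B
    a_start: int         # alignment start in A
    a_end: int           # alignment end in A (exclusive)
    b_start: int         # alignment start in B
    b_end: int           # alignment end in B (exclusive)
    aligned_length: int  # length of the local alignment
    identity: float      # fraction of identical bases within the alignment


def alignable_bases(aln: LocalAlignment) -> int:
    """Alignment length plus, at each end, the shorter unaligned flank."""
    left_flank = min(aln.a_start, aln.b_start)
    right_flank = min(aln.len_a - aln.a_end, aln.len_b - aln.b_end)
    return aln.aligned_length + left_flank + right_flank


def percent_alignable(aln: LocalAlignment) -> float:
    """Assumed definition: aligned length as a percentage of alignable bases."""
    return 100.0 * aln.aligned_length / alignable_bases(aln)


def should_cluster(aln: LocalAlignment) -> bool:
    """Thresholds from the text: >=96% identity, >=100 bp, >90% alignable."""
    return (
        aln.identity >= 0.96
        and aln.aligned_length >= 100
        and percent_alignable(aln) > 90.0
    )


# Two reads that overlap end to end: no unaligned flank on either side.
overlap = LocalAlignment(500, 400, 100, 500, 0, 400, 400, 0.98)
# Two reads that agree for 200 bp and then diverge for at least 250 bp.
diverging = LocalAlignment(500, 450, 0, 200, 0, 200, 200, 0.99)
```

In this sketch `overlap` clusters because nothing alignable is left unaligned, while `diverging` does not: 250 bases that could have aligned are unpaired, so only 200 of 450 alignable bases are in the alignment.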

Fig.1
We have also clustered our sequences together with 1.7 million human
EST sequences that are included in the human UniGene Database (Build
#116). Those clusters that contained a UniGene representative (Stratagene
GeneZoo) were also compared with 41,472 sequence-verified IMAGE clones
from Research Genetics, and 9,182 Unigem 2.0 clones from Incyte Genomics.
Figure 1 shows that most clusters, except for those in Stratagene’s
GeneJungle, and except for 354 Incyte clones, fall within the 81,967
UniGene clusters. Interestingly, Incyte’s 9,182 Unigem clones collapse
into 8,298 UniGene clusters when referenced against UniGene build #116,
plus 354 non-UniGene clusters. Research Genetics’ 41,472 IMAGE clones
collapse into 31,521 unique clusters when referenced against UniGene build
#116 (Table 2).
Table 2
Relative Overlaps of Various Human cDNA Clone Collections
Clustering was based on UniGene build #116 for sequences that match UniGene.
For the 354 Incyte Unigem 2.0 clones that lie outside of UniGene it is
assumed that each of the 354 represents an individual cluster.
| Clone Set | # of Clones | # of Unique Clusters* | % of Clusters Found in UniGene #116 | % of Clusters Found in Research Genetics SV IMAGE Set | % of Clusters Found in Unigem 2.0 | % of Clusters Found in Stratagene’s GeneConnection 1.0 Set | % of Clusters Found in UniGene, Research Genetics SV, or Unigem 2.0 |
|---|---|---|---|---|---|---|---|
| UniGene Build #116 | N.A. | 81,967 | 100% | 38% | 10% | 11% | 100% |
| Research Genetics IMAGE | 41,472 | 31,521 | 100% | 100% | 22% | 21% | 100% |
| Unigem 2.0 | 9,182 | 8,652 | 96% | 81% | 100% | 39% | 100% |
| Stratagene’s GeneConnection 1.0 Discovery Set | 25,321 | 15,724 | 59% | 45% | 23% | 100% | 59% |
Table 2 shows that Stratagene’s GeneConnection 1.0 Discovery set has a
substantial proportion of clusters not found in UniGene (about 41% of the
Stratagene clusters are in the GeneJungle). This suggests that
Stratagene’s libraries contain rare sequences not commonly found in other
cDNA libraries. With an average insert size of 1.7 Kb, the Discovery clone
set contains over 50% full-length human cDNAs. This suggests that out of
25,321 clones we currently have over 12,500 full-length sequence-tagged
cDNAs, and over 5,000 of these full-length cDNAs have never been reported
publicly. Stratagene is expanding its collections and updating its website
with additional sequences on a regular basis.
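The estimates in this paragraph can be checked with simple arithmetic on the Table 2 figures. The sketch below is one possible reading, not the authors' stated calculation: it treats 50% as a lower bound on the full-length fraction and assumes that the share of clusters outside UniGene (41%) applies equally to the full-length clones. The variable names are illustrative.

```python
# Figures taken from Table 2 and the surrounding text.
clones = 25_321                # clones in the GeneConnection 1.0 Discovery set
clusters = 15_724              # unique clusters formed by those clones
fraction_in_unigene = 0.59     # clusters also found in UniGene build #116
full_length_fraction = 0.50    # lower bound ("over 50%") on full-length clones

# GeneJungle share: clusters with no UniGene counterpart.
jungle_fraction = 1 - fraction_in_unigene
jungle_clusters = round(clusters * jungle_fraction)

# Full-length clones, and the subset assumed to lie in the GeneJungle.
# Assumption: the GeneJungle share of clusters carries over to clones.
full_length_clones = round(clones * full_length_fraction)
novel_full_length = round(full_length_clones * jungle_fraction)

print(f"GeneJungle clusters:      {jungle_clusters:,}")
print(f"Full-length clones:       {full_length_clones:,}")
print(f"Novel full-length clones: {novel_full_length:,}")
```

Under these assumptions the numbers are consistent with the text: the full-length count exceeds 12,500 and the novel full-length count exceeds 5,000.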
Searching for Stratagene Clones
To find clones within Stratagene’s collection, the GeneConnection website (www.stratagene.com)
allows searches by keyword, gene name, accession number, UniGene number, or DNA
or protein sequence. Stratagene has annotated all clones to optimize searches
using key words, so that a clone similar to a characterized gene can also be
found, for example “ESTs, Highly similar to protein-tyrosine-phosphatase” or
“Zinc finger protein homologous to Zfp-36 in mouse (ZFP36)”.

Fig.4
Searches can be carried out with nucleic acid or protein search
queries. Search results will show the degree of match as both a % identity
of the aligned bases and as the quality score of the match (Figure 4).
Similar genes can be found this way. For example, if you want to
find homologs to a gene of interest, obtain the sequence of your gene of
interest and paste it into the search window on the GeneConnection search
page (see Figure 3). Several search engines are available. Nucleotide
sequence target data can be searched with a nucleotide sequence query using
simple BLASTN. Nucleotide sequence target data can be first translated into
all 6 reading frames and then searched with a query DNA sequence that is
also translated into all 6 reading frames using the search engine TBLASTX.
Nucleotide sequence target data can be translated into all 6 reading
frames and searched with a query protein sequence using TBLASTN. If a
match is found to a Stratagene clone, the clone information will be
displayed. The Stratagene clones have all been restriction mapped, and size
estimates are available for all clones. Sample restriction gel data are
available on the website for all clones. If the DNA sequence of the
Stratagene clone is within the UniGene set (a GeneZoo clone), its DNA
sequence will be revealed in the search report. If the DNA sequence of the
Stratagene clone is outside of UniGene (a GeneJungle clone), its DNA
sequence is provided upon standard purchase of the clone.
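The six-frame translation that TBLASTN and TBLASTX rely on can be sketched in a few lines of Python. This is only the translation step, not a BLAST search, and it is an illustration rather than the code behind the GeneConnection website. It uses the standard genetic code; the function names and the use of "X" for codons containing ambiguous bases are choices made for this sketch.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")   # "X" for codons with ambiguous bases
        for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> dict:
    """Translate three forward frames (+1..+3) and three reverse frames (-1..-3)."""
    forward = seq.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
```

A protein query (TBLASTN) is compared against all six translations of each target sequence, while TBLASTX translates the nucleotide query in the same way before comparing.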

Fig.3
Since Stratagene has only entered 3′ sequence
information into its clustering database, we have designed an automatic
“indirect match” strategy to help locate clones having homologous sequences
in the UniGene database. Even if you enter a coding sequence or a 5′
sequence, matches can be found through a bridging database. We have taken
our 3′ sequences and BLASTed them against all the
sequences in the UniGene database. When identities above a certain threshold
are found, the UniGene sequence is placed into an “indirect database” with a
link to the homologous Stratagene clone. When you perform sequence searches
at our GeneConnection website, the program automatically searches both our
direct sequence data and the indirect database sequence data. Both types of
matches are shown in separate sections of the search report. Indirect match
reports show the alignment between your query sequence and the indirect (UniGene)
sequence, with a link to the Stratagene clone name and number.
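The bridging idea can be sketched as a simple lookup structure. The code below is a hypothetical illustration, not the website's implementation. It assumes the BLAST comparisons have already been run and are supplied as tuples; the identity threshold is left as a parameter because the text does not state its value, and all identifiers in the example are made up.

```python
def build_indirect_index(hits, min_identity):
    """Map UniGene sequence IDs to the Stratagene clones they bridge to.

    hits: iterable of (stratagene_clone_id, unigene_sequence_id, identity)
          from comparing Stratagene 3' sequences against UniGene sequences.
    min_identity: threshold for accepting a bridge (value not given in the text).
    """
    index = {}
    for clone_id, unigene_id, identity in hits:
        if identity >= min_identity:
            index.setdefault(unigene_id, set()).add(clone_id)
    return index


def search_report(direct_hits, unigene_hits, indirect_index):
    """Combine direct and indirect matches into separate report sections.

    direct_hits: Stratagene clone IDs matched directly by the query.
    unigene_hits: UniGene sequence IDs matched by the query.
    """
    indirect = {
        unigene_id: sorted(indirect_index[unigene_id])
        for unigene_id in unigene_hits
        if unigene_id in indirect_index
    }
    return {"direct": sorted(direct_hits), "indirect": indirect}


# Made-up identifiers and a placeholder threshold, for illustration only.
precomputed = [
    ("clone_A", "unigene_seq_1", 0.99),
    ("clone_B", "unigene_seq_1", 0.97),
    ("clone_C", "unigene_seq_2", 0.80),   # below threshold, not bridged
]
index = build_indirect_index(precomputed, min_identity=0.95)
report = search_report(["clone_D"], ["unigene_seq_1", "unigene_seq_2"], index)
```

Here a query that matches only the coding region of `unigene_seq_1` still leads to `clone_A` and `clone_B`, even though only their 3′ sequences are in the direct database.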
All of our GeneConnection discovery clones can be purchased as a bacterial
stab culture. We use the XL10-Gold® strain, which is T1-phage
resistant, thus minimizing the threat of T1-phage contamination. The clones in
our collection are categorized as either GeneZoo clones or GeneJungle clones and
are priced accordingly. Refer to the website for price
information; special discount prices are available for large volume orders.
Conclusions
While other clone collections may provide a defined subset of human cDNAs,
Stratagene’s GeneJungle goes beyond the familiar territory of UniGene. If you
have a desire to discover new genes or gene families, the GeneJungle is an
exciting place to explore. With an average insert size of 1.7 Kb, the
probability of finding a previously undiscovered, full-length human cDNA is high
since we estimate that half of our clones contain full-length open reading
frames. All clones have been restriction mapped, so you can obtain an estimate
of the insert size before ordering a clone. The clones have been carefully
sequenced and tracked, assuring that you receive the clone you ordered.
REFERENCE
1. Wilcox, AS, et al. (1991) Nucleic Acids Research 19(8): 1837-1843.
* Patents pending