Structural & Functional Genomics: The Information Landscape
Mark S. Boguski, M.D., Ph.D.
National Center for Biotechnology Information (NCBI)
National Library of Medicine
National Institutes of Health
Bethesda, Maryland
Current Topics in Genome Analysis, 4 November 1997


Growth of Biomedical Information (1): MEDLINE®
[Chart: cumulative MEDLINE articles by year, 1965-1995; y-axis 0 to 1.0E+07]
10 million articles, with 325,000 additional per year from 3,400 biomedical journals


Growth of Biomedical Information (2): “G5” MeSH subset of MEDLINE
[Chart: cumulative molecular biology & genetics literature by year, 1965-1995; y-axis 0 to 1.0E+06]
900,000 articles, accumulating at a more rapid rate than MEDLINE as a whole


Growth of Biomedical Information (3): GenBank DNA sequences
[Chart: GenBank sequences by year, 1965-1995; y-axis 0 to 1.2E+06; annotations: “Rapid DNA sequencing invented”, “EST sequencing begun”]
1,766,000 sequences; 1,160,000,000 bases; 30,000 species


Growth of Biomedical Information (4): Mapped Human Genes
[Chart: mapped human genes by year, 1965-1995; y-axis 0 to 18,000]
About one-fifth of all protein-coding genes. There will be twice this number shortly.


The Cross-over to Functional Genomics (1)
[Chart: publications vs. DNA sequences by year, 1975-1995; y-axis 0 to 1,200,000]


The Cross-over to Functional Genomics (2)
“In the past we have had functions in search of sequences. In the future, pathology and physiology will become ‘functionators’ for the sequences.”
Daniel C. Tosteson, Dean, Harvard Medical School, March 26, 1997


Growth of Biomedical Information, 1965-1996
[Six-panel chart, x-axis 1965-1995 in each panel: MEDLINE (0 to 1.0E+07); “G5” Literature (0 to 1.0E+06); DNA Sequences (0 to 1.2E+06); Proteins (0 to 3.0E+05); Mapped Genes (0 to 18,000); 3-D Structures (0 to 4,500)]


The National Center for Biotechnology Information
Public Law 100-607
Created by Congress in 1988 with a mandate to...
• Create automated systems for knowledge about molecular biology, biochemistry and genetics
• Perform research into advanced methods of analyzing and interpreting molecular biology data
• Enable biotechnology researchers and medical care personnel to use the systems and methods developed


Informatics on the World Wide Web
“[There are] 150 million Web pages now in existence.… We can expect a billion Web pages by 2000. Some of them will even be worth reading.”
WIRED Magazine, March 1997


World Wide Web: Information Cornucopia?
Among 30 million Web pages in cyberspace, touted as the road to real-time, up-to-the-minute information resources, 5 million pages have been neither checked nor updated in the past year, and 75,000 have languished untouched since 1994.
OR
The Wall Street Journal, March 1996


U.S. National Library of Medicine
Founded 1836


The National Center for Biotechnology Information
• Builders and providers of GenBank®, BLAST, Entrez and many other data and software resources
• NCBI is also a center for basic research and training in computational biology and bioinformatics


Building and Distributing GenBank®
[Diagram: author-direct submission, journal scanning, high-throughput sequencing centers, and European & Japanese collaborators feed GenBank, which is distributed via Email, Anonymous ftp, CD-ROM, Entrez and BLAST]
NCBI services 1.6 million web hits and >200,000 intellectual queries per day from approximately 37,000 individual users


Biotechnology Information
Gene Structure & Function
> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACACTGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAATCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTAACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGGTTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAATTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTGGTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGACGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGCTACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGAACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGGTAAGAAGATCGCGAACATCTAGTAGA
> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGKKVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNEPDEAEQDCIEFGKKIANI
The power of computing on the data
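The relationship between the two sequences on this slide can be sketched in a few lines of Python. This is only an illustration, assuming the standard genetic code and a reading frame that starts at the first ATG; the function name `translate_first_orf` is hypothetical and is not taken from any NCBI tool.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_first_orf(dna):
    """Translate from the first ATG up to (not including) the first stop codon."""
    dna = dna.upper()
    start = dna.find("ATG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(dna) - 2, 3):
        amino = CODON_TABLE.get(dna[pos:pos + 3], "X")  # X for ambiguous codons
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


# First 32 bases of the DNA sequence on the slide.
prefix = "AATTCATGAAAATCGTATACTGGTCTGGTACC"
print(translate_first_orf(prefix))
```

Applied to the prefix above, the sketch yields the first nine residues of the protein sequence shown on the slide (MKIVYWSGT).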


Ataxia telangiectasia: 18 years and 5 minutes
New England Journal of Medicine 333:645-7; 1995


B.L.A.S.T. = Basic Local Alignment Search Tool


Comparative Analysis of Genes
Cell, Vol. 75, 1027-1038, December 3, 1993, Copyright © 1993 by Cell Press
The Human Mutator Gene Homolog MSH2 and Its Association with Hereditary Nonpolyposis Colon Cancer
Richard Fishel,* Mary Kay Lescoe,* M. R. S. Rao,§ Neal G. Copeland,† Nancy Jenkins,† Judy Garber,‡ Michael Kane,§ and Richard Kolodner§
*Department of Microbiology and Molecular Genetics, Markey Center for Molecular Genetics, University of Vermont Medical School
Homology to bacterial and yeast genes sheds new light on human disease process
[Fragments of article text visible on the slide: “can give rise to mismatched bases”, “example, the deamination of 5-thymine and and, therefore, a G”, “1980). Second, misincorporation”, “DNA replication”]
Portion of DNA mismatch repair protein sequence:
Human   638 RHACVEVQDEIAFIPNDVYFEKDKQMFHIITGPNMGGKSTYIRQTGVIVLMAQIGCFVPC 697
Yeast   657 RHPVLEMQDDISFISNDVTLESGKGDFLIITGPNMGGKSTYIRQVGVISLMAQIGCFVPC 716
E.coli  584 RHPVVEQVLNEPFIANPLNLSPQRR-MLIITGPNMGGKSTYMRQTALIALMAYIGSYVPA 642


Comparative analysis of genomes
E. coli: 4.6 Mb, 4286 genes
H. influenzae: 1.83 Mb, 1703 genes
M. genitalium: 0.58 Mb, 468 genes
254 genes necessary (and sufficient?) for cellular life*
*Mushegian & Koonin, Proc. Natl. Acad. Sci. 93:10268, 1996


Comparative Analysis of Genomes
“What is true for E. coli is also true for elephant.” Jacques Monod, c. 1961
“What is true for yeast is also true for human.” David Botstein, 1988


Bassett et al. (1997) Nat. Gen. 15:439-44


Sequence Conservation among Human and Rodent mRNAs
[Figure: sequence conservation among human, mouse and rat mRNAs by region (5’ UTR, coding sequence, 3’ UTR); values shown: 93%, 85%, 86%]
Makalowski et al. (1996) Genome Res. 6:846


Molecular Evolution
[Tree: Human, Fly, Worm, Yeast, Bacteria, with divergence times of 540 Myr, 1000 Myr and 3000 Myr; disease labels: pancreatic carcinoma, Alzheimer’s disease, ataxia telangiectasia, colon cancer]
Common ancestry allows us to infer similar function


“Homology...
... is the central concept for all of biology. Whenever we say that a mammalian hormone is the ‘same’ hormone as a fish hormone, that a human gene sequence is the ‘same’ as a sequence in a chimp or a mouse, that a HOX gene is the ‘same’ in a mouse, a fruit fly, a frog, and a human -- even when we argue that discoveries about a worm, a fruit fly, a frog, a mouse, or a chimp have relevance to the human condition -- we have made a bold and direct statement about homology. The aggressive confidence of modern biomedical science implies that we know what we are talking about.”
David B. Wake


“Dinosaur DNA”


BLAST Search with “Dinosaur DNA”
BLASTN 1.1.7MP [23-Nov-90]
Query: "Dinosaur DNA" from Crichton's JURASSIC PARK, p. 103 nt 1-1200
Database: GenBank Release 65.0 (complete), October 1990
39,533 sequences; 49,179,285 total residues.
Sequences producing high-scoring segment pairs:
>Plasmid pBR322, complete genome.
length = 4361
Score = 328, Matches = 95% (68/71), Query strand = Plus
Expect = 9.7e-18, Poisson P = 9.7e-18
Query:  721 CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA 780
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 2581 CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA 2640
Query:  781 GCGCTCTCCTG 791
            || | ||| ||
Sbjct: 2641 GCTCCCTCGTG 2651
Score = 320, Matches = 93% (68/73), Query strand = Plus
Expect = 4.5e-17, Poisson P = 4.5e-38
Query:  530 GCTTCCGGCGGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGG 589
            || || ||  ||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1026 GCATCGGGATGCCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGG 1085
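The match count reported for the first segment pair (68/71) can be checked directly from the aligned rows. The sketch below simply counts identical positions between two equal-length aligned strings; the helper name `count_identities` is an assumption made for illustration and is not part of BLAST.

```python
def count_identities(row_a, row_b):
    """Return (identical positions, alignment length) for two aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = sum(1 for a, b in zip(row_a, row_b) if a == b and a != "-")
    return matches, len(row_a)


# Rows of the first high-scoring segment pair from the slide.
# The first 60 positions are identical in query and subject.
shared = "CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA"
query = shared + "GCGCTCTCCTG"   # query positions 721-791
sbjct = shared + "GCTCCCTCGTG"   # pBR322 positions 2581-2651

matches, length = count_identities(query, sbjct)
print(f"{matches}/{length} = {100 * matches / length:.1f}% identical")
```

The count agrees with the 68/71 on the slide; the 95% printed in the BLAST report is that ratio with the fractional part dropped.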


Dot matrix analysis of Dinosaur DNA
Window Size = 35; Min. % Score = 100; Scoring Matrix: DNA Matrix
[Dot plot: Dinosaur DNA (y-axis, 200 to 1200) against pBR322, complete genome (x-axis, 500 to 4000); matching segments labeled A, B, C1, C2]
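A dot matrix of this kind can be sketched as a sliding-window comparison: a dot is placed at (i, j) when the window starting at position i in one sequence matches the window starting at j in the other at or above the minimum percent score. The code below is a naive illustration under that assumption, using simple identity scoring rather than the scoring matrix of the program that produced the slide; the function name and the toy sequences are hypothetical.

```python
def dot_matrix(seq_a, seq_b, window=35, min_percent=100):
    """Return (i, j) window start positions that match at >= min_percent identity."""
    needed = window * min_percent / 100
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical bases between the two windows.
            matches = sum(
                a == b
                for a, b in zip(seq_a[i:i + window], seq_b[j:j + window])
            )
            if matches >= needed:
                hits.append((i, j))
    return hits


# Toy sequences with a small window so the result is easy to inspect.
for hit in dot_matrix("ACGTACGT", "TTACGTAA", window=4, min_percent=100):
    print(hit)
```

With a window of 35 and a 100% threshold, as on the slide, only exact 35-base matches produce dots, which is why the copied plasmid segments stand out as clean diagonals. This brute-force version is quadratic in sequence length and is meant only to show the idea.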


Rejected by Science
Rejected by Nature
Rejected by Cell
Published in BioTechniques 12(5):668-9; 1992


Dr. Crichton’s reply:


Another Dinosaur gene. Or is it?


Entrez (1992)
[Diagram: MEDLINE abstracts, nucleotide sequences and protein sequences connected by links. Link labels: term frequency statistics; literature citations in sequence databases (shown twice); nucleotide sequence similarity; coding region features; amino acid sequence similarity. Legend: explicit link, emergent link]


MEDLINE® Text Neighboring
[Two article titles shown: “Genetic Analysis of Cancer in Families” and “The Genetic Predisposition to Cancer”]
• Common terms could indicate similar subject matter
• Statistical method
• Weights based on term frequencies within document and within the database as a whole
• Some terms are better than others
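The weighting idea in the bullets above can be illustrated with a generic TF-IDF cosine similarity. This is a textbook stand-in, not the actual neighboring algorithm used by Entrez; the function names and the third (unrelated) title are assumptions added for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Weight each term by its in-document count and its rarity across all docs."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {term: count * math.log(n_docs / doc_freq[term])
             for term, count in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


titles = [
    "genetic analysis of cancer in families",
    "the genetic predisposition to cancer",
    "crystal structure of a bacterial enzyme",  # hypothetical unrelated title
]
vecs = tfidf_vectors(titles)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In this toy collection the two cancer genetics titles score higher against each other than either does against the unrelated title, because they share the terms "genetic" and "cancer". A term that occurred in every document would get weight zero, which is one way of reading "some terms are better than others".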


Entrez (1994)
[Diagram: MEDLINE (text similarity), nucleotide sequence (sequence similarity), protein sequence (sequence similarity) and 3-D structure (structural similarity), joined by cross-references; decorative sequence strings omitted]


Entrez (1996)
[Diagram: MEDLINE, nucleotide sequences, protein sequences, genomes and structures, joined by links; decorative sequence strings omitted]


GenBank Genomes Division


The Human Genome Project, 2005
“The final, irreducible product of the Human Genome project will be…
• The genetic map
• The sequence (map)
• The peer-reviewed scientific literature.”
Maynard V. Olson, 1995


The Sequence: Gene 1, Gene 2, Gene 3, …etc..., Gene 10^5
The Literature


Human Chromosome 7: Global Views
[Figure: aligned maps of chromosome 7: SHGC7_RH (0.00 to 5913.38), MIT_Chr7 (0.00 to 600.18), Hsap-7 and seq_7 (0 to 150000K), GENETHON_7 (0.00 to 162.64), CHLC_7 (0.00 to 255.54), and the cytogenetic map (7p22 to 7q36); marker and gene labels include D7S2477, sWSS1757, PSMC2, PTPRZ, BPGM, TBXAS1 and CLCN1]


Chromosome 7: Local Views of T-Cell Receptor β Locus
[Figure: local views of the Stanford, MIT, NHGRI, Genethon and CHLC maps around the T-cell receptor β locus; gene labels include TCRB, PRSS1 and KEL]


Growth of GenBank
[Chart: total entries (sequence records) by GenBank release date, Dec-92 to Apr-97; y-axis 0 to 1,600,000]


Growth of GenBank
[Chart: total entries and EST entries (sequence records) by GenBank release date, Dec-92 to Apr-97; y-axis 0 to 1,600,000]


What are “Expressed Sequence Tags” (ESTs)?
• Partial, inaccurate cDNA sequences obtained by rapid survey sequencing of various cDNA libraries derived from various tissues, developmental stages, tumors, etc.
• There are now >750,000 human ESTs and >200,000 mouse ESTs available
• Ref: Trends in Biochem. Sci. 20:295; 1995
• URL: http://www.ncbi.nlm.nih.gov/dbEST/


Evolution of EST Research Applications
• Gene Discovery (1991-present)
  Species-jumping & Comparative Genomics
• Gene Mapping (1995-present)
  - 16,000 human genes mapped to date
  - Mouse gene map in progress
  - Rat gene map planned
• Gene Expression (1997-)
  Microarrays of cDNA clones & large-scale expression analysis


One Gene, Many Sequences
[Diagram: known genes are represented by genomic (contiguous), genomic (segmented), mRNA variant 1 and mRNA variant 2 sequences; ESTs by multiple 5’ ESTs and 3’ ESTs; an anchor sequence is used for STS development]
Ref: Nature Genetics 10:369-371; 1995


For the public
For Physicians
For Scientists


Link to OMIM -- Online Mendelian Inheritance in Man


Gene Map User Feedback
From: user@aol.com
Date: Tue, 29 Oct 1996 17:51:22 -0500
To: info@ncbi.nlm.nih.gov
Subject: awsome
I am a 6th grade student on Long iSLAND AND iAM LEARNING ALL ABOUT THIS AND i THINK IS IS MAGNIFECENT THAT YOU GUYS CAN DO ALL THIS


Functional Cloning and Positional Cloning
[Diagram: functional cloning: disease → function → gene → map; positional cloning: disease → map → gene → function]


Standard Positional Cloning
Family studies → chromosome interval → large-insert clones → candidate genes → disease mutation
Steps: genetic mapping, physical mapping, transcript mapping, gene sequencing
[Sequence comparison showing codons Met, Val, Ser, Leu, Gln, Pro, Cys; a C-to-T change converts the Gln codon to STOP (*)]


“Positional Candidate” Approach
Family studies → chromosome interval → candidate gene → mutation
Steps: genetic mapping, mutation detection
[Sequence comparison showing codons Met, Val, Ser, Leu, Gln, Pro, Cys; a C-to-T change converts the Gln codon to STOP (*)]
Recent example (July ‘97): Parkinson’s Disease


What’s this gene?


The MEN1 Group
Cloning of the gene for Multiple Endocrine Neoplasia, Type I
Science 276:404-407; 1997
Absent: Lance Liotta, Bruce Roe


PowerBLAST* analysis of the MEN1 gene: human and mouse EST coverage
[Figure: MEN1 gene (positions 1666 to 7666) with the Menin mRNA and coding region, aligned to 26 human ESTs and 17 mouse ESTs]
EST accessions shown: AA211377, AA157873, AA261796, AA209475, AA211877, H14300, H38333, H14313, AA236383, R32679, H91638, AA105533, AA048913, AA000099, W89897, AA031132, AA275378, AA168218, AA049922
*J. Zhang & T. Madden (1997) Genome Res. 7:649-56


National Cancer Institute Cancer Genome Anatomy Project
PROJECT GOAL: To achieve a comprehensive molecular characterization of normal, precancerous, and malignant cells
Components: Tumor Gene Index; Transcript “Profiling”; Physical DNA resources
All data & material made available without restrictions


NCI Tumor Gene Index cDNA Libraries
• Initial target tissues are breast, prostate, colon, lung and ovarian tumors
• Microdissected as well as “bulk” specimens
• Normalized and non-normalized libraries
• Interplay between gene discovery and pathophysiologic analysis of cancer progression
• R & D of methods to assess gene expression in archival, embedded tissue


Laser Capture Microdissection
Emmert-Buck et al. Science 274:998; 1996


Laser Capture Microdissection: in situ breast carcinoma
[Image labels: slide, coverslip]
Emmert-Buck et al. (1996) Science 274:998


Laser Capture Microdissection: glomerulus, normal kidney
[Image labels: slide, coverslip]
Emmert-Buck et al. (1996) Science 274:998


NCI CGAP
[Pipeline diagram with labels: tumor tissues; microdissections; cDNA libraries; cDNA sequences; clone arrays; Tumor Gene Index; Physical DNA Resources]


National Cancer Institute CGAP Web Site


Carcinogenesis and Tumor Progression
[Timeline: Year 0, Year 3, Year 6, Year 10, with a diagnosis marker, ending in malignant tumors with metastases; microdissection and expression profile feed detection, diagnosis, prognosis and therapy]


Gene Expression & Glass Microarrays


The Arrayer


The Reader...
... is basically a computer-controlled inverted scanning fluorescent confocal microscope with a triple laser illumination system


Large-scale Analysis of Gene Expression
Use of a cDNA microarray to analyze gene expression patterns in human cancer
J. DeRisi, L. Penland, P.O. Brown, M.L. Bittner, P.S. Meltzer, M. Ray, Y. Chen, Y.A. Su, and J.M. Trent, Nature Genetics 14, 457-460 (1996)


15,000 cDNA Clones to Array
GenBank + dbEST (500,000) → UniGene (53,000) → 15K (15,000)
Selection criteria:
1. Correspondence to functionally cloned human gene
2. Significant protein similarity
3. Included on the Human Transcript Map
4. Of specific research interest
5. Physical cDNA clone available


The “15K” Collaboration
[Diagram labels: Expression Data; Stanford; cDNA arrays; NHGRI; NCBI; IMAGE Clone IDs; Research Genetics]


15K Set: Portion of the FASTA file
>M63960 Human protein phosphatase-1 catalytic subunit mRNA, complete cds /cds=(29,1021) /gb=M63960 /gi=190515 /len=1367
GGGCAAGGAGCTGCTGGCTGGACGGCGGCATGTCCGACAGCGAGAAGCTCAACCTGGACTCGATCATCGGGCGCCTGCTGGAAGTGCAGGGCTCGCGGCCTGGCAAGAATGTACAGCTGACAGAGAACGAGATCCGCGGTCTGTGCCTG
...etc.
>W37493 zc10g02.s1 Homo sapiens cDNA, 3' end /contact=Wilson,RK /clone=321938 /clone_end=3' /gb=W37493 /gi=1319087 /len=348
GAGAATCCANCTTTGACCTTTATTCAAGAGACCAGATGGGTTGCCCCAGGATCCGGCTGCCAGCCTGAGGCCAAGCACGGCTGGAGACCCACGACCTGGCCTGCCGTTGCCCTGAGCTGCAGCCTCGGCCCCAGGATCCTGCTCACAGT
...etc.
>H42556 yo63c10.r1 Homo sapiens cDNA, 5' end /contact=Wilson,RK /clone=182610 /clone_end=5' /gb=H42556 /gi=918608 /len=544
GTGTGACCAGACATGCAACCGNCATCTATGGTTTCTACGNATGNAGTGNCAAGCAGNACGNCTNACAACATCAAACTGTGGNAAAACCTTCACTGNACTGNCTTCAACTGNCCTGNCCCATCGCGGNCCATAGTGGACGTAAAAGATCTTCTGNCTGNCCACGGAGGCCTGTTCCCCGGACCTGNCAGTTCTATGGNAGCAGATTCGG
...etc.
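The definition lines in this file carry `/key=value` annotations after the free-text description. A minimal sketch of pulling those apart is shown below; the function name `parse_defline` and the regular expression are assumptions based only on the three headers on the slide, not a documented format specification.

```python
import re

# Matches annotations such as /gb=M63960 or /clone_end=3'
FIELD = re.compile(r"/(\w+)=([^\s/]+)")


def parse_defline(line):
    """Split a FASTA definition line into accession, description and fields."""
    body = line.lstrip(">").strip()
    accession, _, rest = body.partition(" ")
    first_field = FIELD.search(rest)
    description = rest[:first_field.start()].strip() if first_field else rest.strip()
    fields = dict(FIELD.findall(rest))
    return accession, description, fields


header = (">M63960 Human protein phosphatase-1 catalytic subunit mRNA, "
          "complete cds /cds=(29,1021) /gb=M63960 /gi=190515 /len=1367")
accession, description, fields = parse_defline(header)
print(accession, fields["len"], fields["cds"])
```

Values are kept as strings; a caller that needs the sequence length as a number would convert `fields["len"]` with `int()`.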


Sample 15K Cluster Report
TITLE: Human protein phosphatase-1 catalytic subunit mRNA, complete cds
CLONE: 488948
FLAGS: Gene SwissProt Mapped
CLUST: Hs.1001
GENES: J04759 X70848 M63960
3'EST: AA115517 AA004413 H97499 W02143 W37493
5'EST: AA115516 N42323 N41606 H42559 H42556
//


15K Cluster Report: Sample 2
TITLE: ESTs, Highly similar to HYPOTHETICAL 34.9 KD PROTEIN IN FRE2-JEN1 INTERGENIC REGION [Saccharomyces cerevisiae]
CLONE: 310438
FLAGS: SwissProt Mapped
CLUST: Hs.10018
3'EST: R44498 N33766 T93144 N48585 N99970
5'EST: N47271 R14393 T56201 R68523 W30909
//


15K Cluster Report: Sample 3
TITLE: ESTs
CLONE: 302998
FLAGS: Mapped
CLUST: Hs.24297
3'EST: H97384 AA147620 N24156 AA157374 H41351 R98414 H91639 N51774 R28488 H14277 H38296 N90077
5'EST: W38527 AA147612 AA157873 H91638 R32679 H38333 H14300 N36190
//
This is the MEN1 Gene
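Records in this layout (a tag, a colon, a value, and `//` as terminator) are straightforward to read programmatically. The parser below is a sketch that assumes exactly the seven tags seen in the three sample reports and treats untagged lines as continuations; the function name is hypothetical.

```python
import re

TAG = re.compile(r"^(TITLE|CLONE|FLAGS|CLUST|GENES|3'EST|5'EST):\s*(.*)$")
LIST_TAGS = {"FLAGS", "GENES", "3'EST", "5'EST"}  # space-separated values


def parse_cluster_reports(text):
    """Parse '//'-terminated cluster reports into a list of dictionaries."""
    records, current, last_tag = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            if current:
                records.append({
                    tag: value.split() if tag in LIST_TAGS else value
                    for tag, value in current.items()
                })
            current, last_tag = {}, None
            continue
        match = TAG.match(line)
        if match:
            last_tag = match.group(1)
            current[last_tag] = match.group(2)
        elif last_tag:
            # Untagged line: continuation of the previous field.
            current[last_tag] += " " + line
    return records


sample = """TITLE: Human protein phosphatase-1 catalytic subunit mRNA,
complete cds
CLONE: 488948
FLAGS: Gene SwissProt Mapped
CLUST: Hs.1001
GENES: J04759 X70848 M63960
3'EST: AA115517 AA004413 H97499 W02143 W37493
5'EST: AA115516 N42323 N41606 H42559 H42556
//
"""
reports = parse_cluster_reports(sample)
print(reports[0]["CLUST"], reports[0]["GENES"])
```

With the records in dictionary form, questions such as "which clusters have no GENES entry" (Samples 2 and 3 above) reduce to a simple filter.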


Informatics Issues in Large-Scale Studies of Gene Expression
• Resources (arrayed genes, probes)
• Laboratory information management system
• Information retrieval & query systems


Expression Array Database


Click here.


Plot of green/red intensity ratios


Click here.


UniGene Cluster Report


GenBank record


Data Mining in Gene Expression Arrays
NHGRI / NCBI
• Finding associations
  – Which genes tend to be expressed together?
• Finding sequential patterns
  – Which genes are expressed in succession, i.e. in a pathway?
• Clustering the data
  – Does this set of genes have any common features?
• Classifying the data
  – Has this expression pattern been observed in other experiments?
• Predicting values
  – Can we extrapolate to other conditions under which these genes may be expressed?
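As one illustration of the first question (which genes tend to be expressed together?), expression profiles can be compared pairwise with a correlation coefficient. The sketch below assumes each gene has a vector of log2 green/red ratios across experiments and uses Pearson correlation with an arbitrary threshold; the gene names, values and threshold are all hypothetical, and this is not a description of the NHGRI/NCBI software.

```python
import math


def log2_ratio(green, red):
    """Log2 of the green/red intensity ratio for one spot."""
    return math.log2(green / red)


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is flat)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y) if var_x and var_y else 0.0


def coexpressed_pairs(profiles, threshold=0.9):
    """Return (gene, gene, r) for every pair correlated at or above threshold."""
    genes = sorted(profiles)
    pairs = []
    for index, first in enumerate(genes):
        for second in genes[index + 1:]:
            r = pearson(profiles[first], profiles[second])
            if r >= threshold:
                pairs.append((first, second, r))
    return pairs


# Hypothetical log2 ratios for three genes across four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
for first, second, r in coexpressed_pairs(profiles):
    print(first, second, round(r, 3))
```

Pairwise correlation of this kind is also a common starting point for the clustering question in the list, since a correlation matrix can serve as the similarity measure for a clustering method.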


Data Mining in Gene Expression Arrays
[Diagram: gene expression data linked to MEDLINE, PubMed, online journals, GenBank nucleotide sequences, protein sequences, genomes and structures; decorative sequence strings omitted]


PubMed
• An information retrieval system, based on Entrez “neighboring” technology
• Includes all 10 million articles in MEDLINE
• Available without charge via the World Wide Web
• Links to full-text, online journals available from various publishers
• So easy, anyone can use it.


NCBI Director instructs V.P. on use of PubMed.


Just what we’re looking for.


Click here.


Full text of article available online


Citations linked back to Entrez/PubMed system


Acknowledgements
UniGene: Greg Schuler (NCBI), David Lipman (NCBI), Webb Miller (Penn State U.), Alejandro Schaeffer (NCBI)
Transcript Mapping: Tom Hudson (MIT/Whitehead), David Cox (Stanford U.), Jean Weissenbach (Genethon), David Bentley (Sanger Centre), Michael James (Oxford), et al.
Digital Differential Display: John Spouge (NCBI)
15K cDNA Microarray: Paul Meltzer (NHGRI), Jeff Trent (NHGRI), Mike Bittner (NHGRI), Yidong Chen (NHGRI), Olga Ermolaeva (NHGRI/NCBI), Lou Staudt (NCI), Pat Brown (Stanford U.), Mike Eisen (Stanford U.), Jim Hudson (Res. Genetics)
NCI-CGAP: Bob Strausberg (NCI), Carol Dahl (NCI), Lance Liotta (NCI), Mike Emmert-Buck (NCI), David Krizman (NCI), David Lipman (NCBI), Ken Katz (NCBI), Carolyn Tolstoshev (NCBI)


http://www.ncbi.nlm.nih.gov
The National Center for Biotechnology Information
Directed by Dr. David J. Lipman


Another Dinosaur gene. Or is it?


BLAST Search with “Lost World” DNA
Query: Sequence from THE LOST WORLD, page 135 (1435 bases)
Database: Non-redundant GenBank+EMBL+DDBJ+PDB sequences
316,522 sequences; 481,803,458 total letters
Searching..................................................done
Sequences producing significant alignments:                              High Score  E Value
gb|M26209|CHKRERYF1 Chicken erythroid-specific transcription fa...       783    0.0
gb|M76564|XELGATAC X.laevis GATA-binding protein (XGATA-2) gene...       670    0.0
gb|M76563|XELGATAB X.laevis GATA-binding protein (XGATA-1B ) ge...       248    1e-63
dbj|D13518|RATGATA1 Rat mRNA for transcription factor GATA-1, c...       71.9   2e-10
emb|X95701|HSGATA6PR H.sapiens mRNA for GATA-6 DNA-binding protein       65.9   1e-08
gb|U66075|HSU66075 Human transcription factor hGATA-6 mRNA, com...       65.9   1e-08
gb|U91328|HSU91328 Human hereditary haemochromatosis region, hi...       60.0   6e-07
emb|X00257|SCCDC28 Yeast CDC28 (cell division control) gene              60.0   6e-07
emb|X99254|PFPRIMSSU P.falciparum gene encoding primase, small ...       60.0   6e-07
emb|Z36028|SCYBR159W S.cerevisiae chromosome II reading frame O...       60.0   6e-07


Alignment of “Lost World” ORF with GATA-1
Score = 607 bits (1637), Expect = e-174
Identities = 304/318 (95%), Positives = 304/318 (95%)
Gaps = 14/318 (4%)
QUERY    1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60
           MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG
P17678   1 MEFVALGGPDAGSPTPFPDEAGAFLGLGGGERTEAGGLLASYPPSGRVSLVPWADTGTLG 60
QUERY   61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECVMARKNCGAT 120
           TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV    NCGAT
P17678  61 TPQWVPPATQMEPPHYLELLQPPRGSPPHPSSGPLLPLSSGPPPCEARECV----NCGAT 116
QUERY  121 ATPLWRRDGTGHYLCNWASACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCSHERENCQT 180
           ATPLWRRDGTGHYLCN   ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS    NCQT
P17678 117 ATPLWRRDGTGHYLCN---ACGLYHRLNGQNRPLIRPKKRLLVSKRAGTVCS----NCQT 169
QUERY  181 STTTLWRRSPMGDPVCNNIHACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 240
           STTTLWRRSPMGDPVCN   ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG
P17678 170 STTTLWRRSPMGDPVCN---ACGLYYKLHQVNRPLTMRKDGIQTRNRKVSSKGKKRRPPG 226
QUERY  241 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 300
           GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF
P17678 227 GGNPSATAGGGAPMGGGGDPSMPPPPPPPAAAPPQSDALYALGPVVLSGHFLPFGNSGGF 286
Callouts on the slide mark the four query-only insertions: MARK, WAS, HERE, NIH
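The four gapped regions are where the query carries residues that the database protein (P17678) lacks. A short sketch can read those residues straight out of the alignment rows; the function name is hypothetical, and the excerpts below are the short stretches around each gap copied from the alignment on the slide.

```python
def inserted_segments(query_row, subject_row):
    """Collect query residues that are aligned against gaps ('-') in the subject."""
    segments, current = [], []
    for q, s in zip(query_row, subject_row):
        if s == "-":
            current.append(q)
        elif current:
            segments.append("".join(current))
            current = []
    if current:
        segments.append("".join(current))
    return segments


# (query excerpt, subject excerpt) around each gap in the alignment.
excerpts = [
    ("CVMARKNCGAT", "CV----NCGAT"),
    ("CNWASACG", "CN---ACG"),
    ("CSHERENCQT", "CS----NCQT"),
    ("CNNIHACG", "CN---ACG"),
]
words = [seg for q, s in excerpts for seg in inserted_segments(q, s)]
print(" ".join(words))
```

The lengths of the four insertions (4 + 3 + 4 + 3) add up to the 14 gap positions reported in the alignment header.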


Mark Boguski
