We spoke recently with folks at NCGR and it occurred to me to ask:
what sort of data formats are there for genome data? I'm curious what the data schema look like and what the popular/standard data formats are.

Anyone know about this sort of thing?

    -- Owen

Owen Densmore
http://backspaces.net - http://redfish.com - http://friam.org
Owen -
The default simplest data format for pure sequence data is the "fasta" format (.fna extension), in the form:

>gb|U00096|U00096 Escherichia coli K-12 MG1655 complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG

There can be other symbols besides A, C, G, T, such as N (for any of ACGT . . .). There is a page here with some more info:

http://home.cc.umanitoba.ca/~psgendb/formats.html

The more interesting part is the meta-data (e.g., "gene here," "promoter site here," who did the sequencing, etc.). One very generic format for this kind of stuff is the GenBank format, example here:

http://home.cc.umanitoba.ca/~psgendb/X54090.gen.html

Beyond that, things tend to get "proprietary" and/or viewer/browser dependent pretty fast . . .

There are some interesting "browser" systems here (try "genome browser" for example):

http://genome.ucsc.edu/

There is some discussion of some meta-data formats here:

http://genome.ucsc.edu/ENCODE/submission.html

tom

On Dec 1, 2005, at 6:08 PM, Owen Densmore wrote:

> We spoke recently with folks at NCGR and it occurred to me to ask:
> what sort of data formats are there for genome data? I'm curious
> what the data schema look like and what the popular/standard data
> formats are.
>
> Anyone know about this sort of thing?
>
>     -- Owen
>
> Owen Densmore
> http://backspaces.net - http://redfish.com - http://friam.org
>
> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at Mission Cafe
> lectures, archives, unsubscribe, maps at http://www.friam.org
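For anyone who wants to poke at the fasta layout described above, here is a minimal sketch in Python. It assumes only the convention shown in the example (a header line starting with ">" followed by one or more sequence lines); the function name parse_fasta is made up for illustration, and real files have variations this does not handle.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its joined sequence.

    Hypothetical helper for illustration; not taken from any library.
    """
    records: Dict[str, str] = {}
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are wrapped; join them and normalize case.
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


example = """>gb|U00096|U00096 Escherichia coli K-12 MG1655 complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
"""
records = parse_fasta(example.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

The header is kept as one opaque string here. Splitting it on "|" to pull out the database tag and accession would be a further assumption about how a particular source writes its headers.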
In reply to this post by Owen Densmore
Hi Owen-
A short answer is, there are more formats than you'd care to imagine, suited for different purposes, as "genome data" can mean a lot of different things to different people. In any case, one of the best places for starting to feel your way around is at the NCBI site via their Entrez search system.

http://www.ncbi.nlm.nih.gov/Entrez/

It is basically a text document search engine with some semantic indexing, and you can get the results back in a variety of formats (some of which have different information contents, some of which are syntactic variants). Entrez is used as the search interface to genomic and many other types of biologically important data at the NCBI. You'll also find links out to many of the "boutique" information resources that serve it up in oh-so-many different dialects.

There are many standards out there, but NCBI's are probably the strongest de facto standards (at least on this side of the pond - the European equivalent EBI and Japanese DDBJ sometimes collaborate and sometimes go in their own directions).

You may also be interested in checking out the biojava/bioperl/biopython/bioruby (and I'm sure I've left someone out) open source projects that supply libraries for manipulating many of these formats as well as doing analyses and other things. (Oh yeah, biosql is there too...)

Finally, there is a lot of work in "ontology" building, and the W3C has formed a special group to explore use of semantic web technologies in the life sciences, with a lot of the big players already getting involved.

http://www.w3.org/2001/sw/hcls/

Hope this helps more than it muddies the waters...

Andrew Farmer
(NCGR worker/FRIAM lurker)

On Thu, 1 Dec 2005, Owen Densmore wrote:

> We spoke recently with folks at NCGR and it occurred to me to ask:
> what sort of data formats are there for genome data? I'm curious
> what the data schema look like and what the popular/standard data
> formats are.
>
> Anyone know about this sort of thing?
>
>     -- Owen
>
> Owen Densmore
> http://backspaces.net - http://redfish.com - http://friam.org

--
Andrew Farmer
adf at ncgr.org
(505) 995-4464
Database Administrator/Software Developer
National Center for Genome Resources
---
"To live in the presence of great truths and eternal laws, to be led by permanent ideals- that is what keeps a man patient when the world ignores him, and calm and unspoiled when the world praises him."
  -Balzac
---
In reply to this post by Owen Densmore
Giles Bowkett wrote:
>Hey man -- question came up on a list I'm on. Got a sec?
>
>---------- Forwarded message ----------
>From: Owen Densmore <owen at backspaces.net>
>Date: Dec 1, 2005 7:08 PM
>Subject: [FRIAM] genome data formats
>To: The Friday Morning Applied Complexity Friam <friam at redfish.com>
>
>We spoke recently with folks at NCGR and it occurred to me to ask:
>what sort of data formats are there for genome data? I'm curious
>what the data schema look like and what the popular/standard data
>formats are.
>
>Anyone know about this sort of thing?
>
>    -- Owen
>
>Owen Densmore
>http://backspaces.net - http://redfish.com - http://friam.org
>
>--
>Giles Bowkett = Giles Goat Boy
>http://www.gilesgoatboy.org/

---------- Forwarded message ----------
From: Stefan Amshey <[hidden email]>
Date: Dec 2, 2005 11:23 AM
Subject: Re: Fwd: [FRIAM] genome data formats
To: Giles Bowkett <gilesb at gmail.com>

Hey Giles - yeah, I can kind of answer that.

The short answer is "extensibly". At the DNA sequence level you're typically dealing with data in the form of "strings" (ATCG...AUCG...etc), so the same techniques for storing data and moving it in and out of databases apply: XML, or even just flatfiles, and a few simple formats for very specific purposes (like FASTA - a formatting convention for storing sets of sequences). (A former, disgruntled colleague used to march around on bad days with a shit-eating grin chanting, "Load, unload, parse, repeat!". It was kind of disturbing. Every career has its down side, I guess.)

The long answer: The real value of genomic data is not found in the DNA sequence alone - that part's basically meaningless without context. What adds value to a data set is how you cross reference sequence data with other kinds of information, particularly information about the function of whatever product the DNA sequence codes for. I call this "annotation" information - kind of like encyclopedia entries, but for genes (DNA sequence, name of gene product, description, metabolic function, references to scientific literature, associated keywords, identities of splice variants/activity regulators...etc).

You can infer similarity in function of a gene you're interested in by its similarity in DNA sequence to other, known genes. Often you can make this inference even across species, which is kind of where the idea of common ancestry of all life comes from - you've probably heard it before in macro-biology: form follows function. A hint about what the product of a gene might do goes a long way when you're dealing with molecules because it's not like you can just pick them up and squint at them.

I can explain more if that's not clear, but in fairness one of the great challenges (and, uh, careers, if I do say so myself) in the biotech industry is putting together different information from different sources and finding new ways of combining it and cross-referencing it so that useful inferences can be drawn by people who know what to look for. There is a lot of redundant information out there, of course, but each data set will typically have some unique content.

/S

--
Giles Bowkett = Giles Goat Boy
http://www.gilesgoatboy.org/
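To make the "annotation" idea above a bit more concrete, here is a rough sketch in Python of a record that pairs a sequence with the kinds of fields listed (product name, description, keywords, literature), plus a deliberately naive similarity lookup. Everything in it is an assumption for illustration: the class GeneAnnotation, the functions percent_identity and closest_annotated, and the toy gene entries are all made up. Real similarity searches align the sequences first, whereas this only compares position by position.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GeneAnnotation:
    """Hypothetical annotation record: a sequence plus its context."""
    gene_id: str
    sequence: str
    product: str = ""
    description: str = ""
    keywords: List[str] = field(default_factory=list)
    literature: List[str] = field(default_factory=list)


def percent_identity(a: str, b: str) -> float:
    """Naive ungapped identity over the shorter of the two sequences."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / n


def closest_annotated(query: str,
                      known: List[GeneAnnotation]) -> Optional[GeneAnnotation]:
    """Return the known record most similar to the query, or None if empty."""
    return max(known,
               key=lambda g: percent_identity(query, g.sequence),
               default=None)


# Toy, made-up entries; not real genes.
known_genes = [
    GeneAnnotation("toyA", "ATGGCCATTG", product="toy kinase",
                   keywords=["signalling"]),
    GeneAnnotation("toyB", "ATGTTTCCCA", product="toy transporter",
                   keywords=["membrane"]),
]

hit = closest_annotated("ATGGCCATTA", known_genes)
if hit is not None:
    print(hit.gene_id, hit.product)
```

The point is only the shape of the thing: the sequence is one field among several, and the useful answer ("this probably does something like a kinase") comes from the fields attached to the closest known match.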