genome data formats

Owen Densmore

genome data formats

Administrator

We spoke recently with folks at NCGR and it occurred to me to ask:
what sort of data formats are there for genome data? I'm curious
what the data schema look like and what the popular/standard data
formats are.

Anyone know about this sort of thing?

-- Owen

Owen Densmore
http://backspaces.net - http://redfish.com - http://friam.org

Tom Carter

genome data formats

Owen -

The simplest default data format for pure sequence data is the
"FASTA" format (.fna extension for nucleotide sequences; .fasta and
.fa are also common), in the form:

>gb|U00096|U00096 Escherichia coli K-12 MG1655 complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC
ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG

There can be other symbols besides A, C, G, T, such as N (for any of
A, C, G, or T) and other ambiguity codes. There is a page here with
some more info:

http://home.cc.umanitoba.ca/~psgendb/formats.html
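
As a rough illustration of how little structure FASTA carries, here is a
minimal reader sketch in Python. It assumes the simple layout shown above
(a ">" header line followed by sequence lines); the function name
parse_fasta is made up for this example, and real projects would more
likely use one of the Bio* libraries.

```python
import io

def parse_fasta(handle):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

# First two sequence lines of the E. coli example quoted above.
example = """>gb|U00096|U00096 Escherichia coli K-12 MG1655 complete genome
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC
TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA
"""

records = list(parse_fasta(io.StringIO(example)))
for header, seq in records:
    # Count symbols outside A, C, G, T (for example N).
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    print(header.split()[0], len(seq), ambiguous)
```

The header is kept as an opaque string here; anything beyond the first
word (the identifier) is free text by convention.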

The more interesting part is the meta-data (e.g., "gene here,"
"promoter site here," who did the sequencing, etc.). One very
generic format for this kind of stuff is the GenBank format, example
here:

http://home.cc.umanitoba.ca/~psgendb/X54090.gen.html
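
To give a feel for how the meta-data sits in a GenBank flat file, here is
a small Python sketch that pulls the feature table out of one record. The
record below is made up for illustration (it is not the X54090 entry), the
function name is hypothetical, and the parser ignores multi-line
qualifiers and other details that a real parser has to handle.

```python
def parse_genbank_features(text):
    """Return a list of {'key', 'location', 'qualifiers'} dicts for one record."""
    features, in_table = [], False
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            in_table = True
            continue
        if not in_table or not line.strip():
            continue
        if not line.startswith(" "):
            break  # next top-level section (e.g. ORIGIN) ends the table
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if indent == 5:
            # Feature key lines are indented by five spaces.
            key, _, location = stripped.partition(" ")
            features.append(
                {"key": key, "location": location.strip(), "qualifiers": {}}
            )
        elif stripped.startswith("/") and features:
            # Qualifier lines look like /name="value".
            name, _, value = stripped[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
    return features

# Made-up record, for illustration only.
record = """\
LOCUS       EXAMPLE01                120 bp    DNA     linear   BCT 01-DEC-2005
DEFINITION  Made-up record for illustration only.
FEATURES             Location/Qualifiers
     source          1..120
                     /organism="Escherichia coli"
     gene            10..90
                     /gene="exaA"
     CDS             10..90
                     /gene="exaA"
                     /product="hypothetical protein"
ORIGIN
//
"""

features = parse_genbank_features(record)
for feature in features:
    print(feature["key"], feature["location"], feature["qualifiers"])
```

The point is only that "gene here" and "promoter site here" become keyed
entries with a location and a handful of qualifiers.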

Beyond that, things tend to get "proprietary" and/or viewer/browser
dependent pretty fast . . . There are some interesting "browser"
systems here (try "genome browser" for example):

http://genome.ucsc.edu/

There is some discussion of some meta-data formats here:

http://genome.ucsc.edu/ENCODE/submission.html

tom

On Dec 1, 2005, at 6:08 PM, Owen Densmore wrote:

> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at Mission Cafe
> lectures, archives, unsubscribe, maps at http://www.friam.org

Andrew Farmer

genome data formats

In reply to this post by Owen Densmore

Hi Owen-
A short answer is, there are more formats than you'd care to imagine,
suited for different purposes, as "genome data" can mean a lot of
different things to different people. In any case, one of the best places
for starting to feel your way around is at the NCBI site via their
Entrez search system. http://www.ncbi.nlm.nih.gov/Entrez/

It is basically a text document search engine with
some semantic indexing, and you can get the results back in a variety of
formats (some of which have different information contents, some of
which are syntactic variants). Entrez is used as the search interface
to genomic and many other types of biologically important data at the NCBI.
You'll also find links out to many of the "boutique" information resources
that serve it up in oh-so-many different dialects.
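
As a sketch of "the same record in different formats": NCBI also exposes
Entrez programmatically through its E-utilities service, which is not
mentioned above and is brought in here as an assumption. The Python below
only builds request URLs (it makes no network call); the helper name
efetch_url is made up, and the accession is the E. coli one from Tom's
FASTA example.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "efetch" endpoint (an assumption of this
# sketch; the thread itself only points at the Entrez web interface).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(accession, rettype="fasta", db="nuccore"):
    """Build an efetch URL asking for one record in the given return type."""
    params = {"db": db, "id": accession, "rettype": rettype, "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)

# Same accession, two different formats.
fasta_url = efetch_url("U00096")
genbank_url = efetch_url("U00096", rettype="gb")
print(fasta_url)
print(genbank_url)
```

Only the rettype parameter changes between the two requests, which is the
sense in which some formats are just different views of one record.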

There are many standards out there, but NCBI's are probably the strongest
de facto standards (at least on this side of the pond; the European
equivalent EBI and Japanese DDBJ sometimes collaborate and sometimes go
in their own directions).

You may also be interested in checking out the biojava/bioperl/biopython/bioruby
(and I'm sure I've left someone out) open source projects that supply
libraries for manipulating many of these formats as well as doing analyses
and other things. (Oh yeah, biosql is there too...)

Finally, there is a lot of work in "ontology" building, and the W3C has
formed a special group to explore use of semantic web technologies in
the life sciences, with a lot of the big players already getting involved.
http://www.w3.org/2001/sw/hcls/

Hope this helps more than it muddies the waters...

Andrew Farmer (NCGR worker/FRIAM lurker)

On Thu, 1 Dec 2005, Owen Densmore wrote:

--

Andrew Farmer
adf at ncgr.org
(505) 995-4464
Database Administrator/Software Developer
National Center for Genome Resources

---
"To live in the presence of great truths and eternal laws,
to be led by permanent ideals-
that is what keeps a man patient when the world ignores him,
and calm and unspoiled when the world praises him."
-Balzac
---

Giles Bowkett

Fwd: Fwd: genome data formats

In reply to this post by Owen Densmore

Giles Bowkett wrote:

>Hey man -- question came up on a list I'm on. Got a sec?
>
>---------- Forwarded message ----------
>From: Owen Densmore <owen at backspaces.net>
>Date: Dec 1, 2005 7:08 PM
>Subject: [FRIAM] genome data formats
>To: The Friday Morning Applied Complexity Friam <friam at redfish.com>
>
>--
>Giles Bowkett = Giles Goat Boy
>http://www.gilesgoatboy.org/
>
>

---------- Forwarded message ----------
From: Stefan Amshey <[hidden email]>
Date: Dec 2, 2005 11:23 AM
Subject: Re: Fwd: [FRIAM] genome data formats
To: Giles Bowkett <gilesb at gmail.com>

Hey Giles - yeah, I can kind of answer that. The short answer is
"extensibly". At the DNA sequence level you're typically dealing with
data in the form of "strings" (ATCG...AUCG...etc) so the same techniques
for storing data and moving it in and out of databases apply: XML, or
even just flatfiles, and a few simple formats for very specific purposes
(like FASTA - a formatting convention for storing sets of sequences). (A
former, disgruntled colleague used to march around on bad days with a
shit-eating grin chanting, "Load, unload, parse, repeat!" It was kind
of disturbing. Every career has its down side, I guess.)

The long answer: The real value of genomic data is not found in the DNA
sequence alone - that part's basically meaningless without context.
What adds value to a data set is how you cross reference sequence data
with other kinds of information, particularly information about the
function of whatever product the DNA sequence codes for. I call this
"annotation" information - kind of like encyclopedia entries, but for
genes (dna sequence, name of gene product, description, metabolic
function, references to scientific literature, associated keywords,
identities of splice variants/activity regulators...etc). You can infer
similarity in function of a gene you're interested in by its similarity
in DNA sequence to other, known genes. Often you can make this
inference even across species, which is kind of where the idea of common
ancestry of all life comes from - you've probably heard it before in
macro-biology: form follows function. A hint about what the product of
a gene might do goes a long way when you're dealing with molecules
because it's not like you can just pick them up and squint at them. I
can explain more if that's not clear, but in fairness one of the great
challenges (and, uh, careers, if I do say so myself) in the biotech
industry is putting together different information from different
sources and finding new ways of combining it and cross-referencing it so
that useful inferences can be drawn by people who know what to look
for. There is a lot of redundant information out there, of course, but
each data set will typically have some unique content.
/S
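
The idea of an annotation entry, and of inferring function from sequence
similarity, can be sketched in a few lines of Python. Everything below is
hypothetical: the gene names, sequences, and descriptions are invented, and
the similarity score is a crude k-mer overlap (Jaccard index) standing in
for the alignment tools (such as BLAST) that real pipelines use.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """One 'encyclopedia entry' for a gene (fields are illustrative)."""
    gene: str
    product: str
    description: str
    keywords: list = field(default_factory=list)

def kmers(seq, k=4):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def similarity(a, b, k=4):
    """Jaccard similarity of k-mer sets: 0.0 (disjoint) to 1.0 (same set)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / len(ka | kb)

def best_match(query, annotated, k=4):
    """Return (score, Annotation) for the most similar annotated sequence."""
    scored = ((similarity(query, seq, k), ann) for seq, ann in annotated)
    return max(scored, key=lambda pair: pair[0])

# Invented reference set: (sequence, annotation) pairs.
known = [
    ("ATGAAACGCATTAGCACCACCATTACC",
     Annotation("exaA", "example kinase", "made-up entry", ["kinase"])),
    ("ATGGCGTTTCTGGATCCGAAAGGTCTG",
     Annotation("exaB", "example transporter", "made-up entry", ["membrane"])),
]

# Unannotated query differing from the first sequence at one position.
query = "ATGAAACGCATTAGCACCACGATTACC"

score, guess = best_match(query, known)
print(f"closest known gene: {guess.gene} ({guess.product}), score {score:.2f}")
```

The guess is only a hint about function, which is the sense in which the
annotation, not the raw sequence, carries the value.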

--
Giles Bowkett = Giles Goat Boy
http://www.gilesgoatboy.org/