Petri nets and XML in Bioinformatics
It is widely recognized that the exchange, distribution, and integration of biological data are the keys to improving bioinformatics and genome biology in the post-genomic era. However, the problem of exchanging and integrating biological data has not been solved satisfactorily. This paper presents XML and Web Services technologies and their use as an appropriate solution to the problem of bioinformatics data exchange and integration.
Recently, more and more genomes have been sequenced and annotated, and the data of proteins and gene interactions are accumulating. Biological data are mostly digital and stored in a wide variety of formats in heterogeneous systems.
XML Databases for Bioinformatics
Biological data exist all over the world as various web services, which provide biologists with much useful information. However, to make use of them, users must access the web service databases one by one. Comparing many different kinds of data is therefore a cumbersome task. In practice, a large part of the work of biologists today consists of distributing local data, querying multiple remote heterogeneous data sources, and integrating the retrieved data manually.
Many communities have devoted a large amount of work to the exchange and integration of biological data [1]. However, the overall problem of data integration has not been solved satisfactorily.
The difficulties in bioinformatics data exchange and integration stem from several technical issues. Data integration consists of wrapping data sources and either loading the retrieved data into a data warehouse or returning it to the user. Today, database federation is a principal technology for solving the data integration problem [3]. Database federation offers the promise of a unified view of these disparate data and detailed querying through a single easy-to-use interface available via the World Wide Web. There are two approaches to implementing database federation: concrete federation and virtual federation. However, using database federation technology to integrate data has some shortcomings.
First, the data retrieved through concrete federation are not always the most recent. Second, because the data retrieved through virtual federation are web pages (HTML format), parsing the result documents is an arduous task. Third, a client of virtual federation must be tied directly to the upstream web service.
Changes to the web service interface make it difficult to maintain the federated database. Web Services, a kind of service-oriented architecture, have been used worldwide to exchange and integrate data in e-commerce. However, little has been reported about their use in bioinformatics. The flat file (FF) format is a popular data format for distributing nucleotide data and other biological data.
However, it is very difficult to parse the FF format to extract the information of interest.
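To give a feel for why flat files are awkward to parse, here is a minimal sketch of a hand-written parser. The record layout is a simplified, EMBL-like format invented for illustration (two-letter line codes, continuation lines, a sequence block, and a `//` terminator); the function name `parse_flat_record` and the sample record are assumptions, not part of any real tool or database.

```python
def parse_flat_record(text):
    """Parse one simplified, EMBL-like flat-file record into a dict.

    Assumed layout (hypothetical, for illustration only):
      - a two-letter line code, three spaces, then the value
      - a repeated line code continues the previous value
      - sequence lines follow the SQ line and start with whitespace
      - '//' ends the record
    """
    record = {"sequence": ""}
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break
        if in_sequence:
            # Sequence lines mix residues, blanks, and a trailing position count.
            record["sequence"] += "".join(ch for ch in line if ch.isalpha())
            continue
        code, value = line[:2], line[5:].strip()
        if code == "SQ":
            in_sequence = True
        elif code in record:
            record[code] += " " + value
        else:
            record[code] = value
    return record


SAMPLE = """\
ID   X00001
DE   Example gene, hypothetical
DE   second description line
SQ   Sequence 20 BP;
     acgtacgtac gtacgtacgt        20
//
"""

record = parse_flat_record(SAMPLE)
print(record["ID"], record["DE"], record["sequence"])
```

Even for this toy layout, the parser has to track state (inside or outside the sequence block), rely on fixed column positions, and special-case continuation lines. Each of these rules is implicit in the format rather than declared in the data.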
XML in Bioinformatics, Relevance and Uses
XML has some features that overcome the disadvantages of the FF format. XML provides a generic way to represent structured and typed data, which makes it easy to write a script for parsing an XML document. Web Services likewise have features for addressing the problems of information exchange, data integration, and distributed applications.
This seems to stem from the common misconception that DTDs and "schema" are distinct.
In fact, DTDs are a kind of schema. From Wikipedia, "An XML schema is a description of a type of XML document, typically expressed in terms of constraints on the structure and content of documents". XML is an ideal format to present complex data in small to medium size e. It is much clearer and much less error-prone.
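As a rough illustration of the "clearer and less error-prone" point, the sketch below parses a small XML record with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`entry`, `feature`, `start`, `end`) are made up for this example and do not follow any particular bioinformatics schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the element names are assumptions for illustration.
XML_RECORD = """\
<entry id="X00001">
  <description>Example gene, hypothetical</description>
  <feature type="CDS" start="3" end="14"/>
  <feature type="exon" start="1" end="20"/>
  <sequence>acgtacgtacgtacgtacgt</sequence>
</entry>
"""


def parse_xml_record(xml_text):
    """Extract the fields of one <entry> element into a plain dict."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "description": root.findtext("description"),
        "sequence": root.findtext("sequence"),
        # Nested, typed data is addressed by name rather than by column position.
        "features": [
            (f.get("type"), int(f.get("start")), int(f.get("end")))
            for f in root.findall("feature")
        ],
    }


parsed = parse_xml_record(XML_RECORD)
print(parsed["id"], parsed["features"])
```

The parser itself handles nesting, quoting, and whitespace, so the script only names the fields it wants. A malformed document fails loudly with a parse error instead of silently yielding shifted columns.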
On the other hand, XML may not be appropriate for very large data. It is also a little overkill for simple, structureless data that can be represented in TAB-delimited formats. As Daniel said, it avoids a lot of pitfalls in parsing formats, especially those formats without a formal spec e. However, as an outsider, I also see the following factors that hamper the adoption of XML in Bioinformatics.
After googling a few XML parser benchmarks e. A few parsers e.
Without any evidence, I tend to believe parsing XML is slower than parsing a plain text file. Parsing XML is almost certainly slower than parsing specialized binary formats, probably by a lot. This could be a concern for large data sets. Other factors may be Unix unfriendliness and technical complexity, but perhaps once we get used to XML, these are not major concerns. I do not know. I know there are tools to convert XML to a line-based format (I have used them).
But when we want to open multiple XML files without creating temporary files, it becomes a little painful, though solvable. Another thing I mean by "complexity" is that it is overkill for very simple data. This is 'Unix unfriendliness' just ill defined, and technical complexity?
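For what it is worth, the XML-to-line-based conversion mentioned above can be sketched in a few lines with the standard library. This is only an illustration, not one of the tools referred to: the `entry`, `description`, and `sequence` names and the function `xml_to_lines` are assumptions, and the in-memory streams stand in for several XML files read without temporary files.

```python
import io
import sys
import xml.etree.ElementTree as ET


def xml_to_lines(stream, record_tag="entry"):
    """Yield one tab-delimited line per record element in an XML stream."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == record_tag:
            yield "\t".join([
                elem.get("id", ""),
                elem.findtext("description", default=""),
                elem.findtext("sequence", default=""),
            ])
            elem.clear()  # drop the record's children once it has been emitted


# Two hypothetical documents, held in memory instead of temporary files.
doc_a = io.BytesIO(
    b"<entries><entry id='A1'><description>first</description>"
    b"<sequence>acgt</sequence></entry></entries>"
)
doc_b = io.BytesIO(
    b"<entries><entry id='B7'><description>second</description>"
    b"<sequence>ttga</sequence></entry></entries>"
)

lines = [line for doc in (doc_a, doc_b) for line in xml_to_lines(doc)]
sys.stdout.write("\n".join(lines) + "\n")
```

Once the records are one per line, the usual line-oriented Unix tools (`cut`, `sort`, `join`, and so on) apply again, at the cost of flattening whatever nesting the XML carried.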