======================
Info theory boosts clustering
======================

The emerging field of clustering aims to help scientists analyse mountains of data, such as genome sequences, astronomical observations and market behaviour, by automatically grouping like pieces of data. Princeton University researchers have taken a fresh approach to the clustering problem, using information theory to generalise the process and remove the need to define ahead of time what makes pieces of data similar to each other. The method determines how much information pieces of data have in common, regardless of the nature of that information, and it boils down to finding the best trade-off between maximising the apparent relatedness of pieces of data and minimising the number of bits needed to describe them. The method can be used with any kind of data and performs better than previous clustering algorithms, the researchers say.

See http://www.pnas.org/cgi/content/abstract/0507432102v1 and Technology Research Magazine - December 20, 2005:
http://unu-merit.nl/i&tweekly/ref.php?nid=2456

Unfortunately, only subscribers can see the full text of the article. So far. But here's a link to the lead author's homepage, which has much of the same material:
http://www.princeton.edu/~nslonim/

And this is helpful, too:
http://www.genomics.princeton.edu/biophysics-theory/Clustering/web-content/index.html

A paper of related interest is:

Estimating mutual information and multi-information in large networks
Noam Slonim, Gurinder S. Atwal, Gašper Tkačik, and William Bialek
Joseph Henry Laboratories of Physics, and Lewis-Sigler Institute for Integrative Genomics,
Princeton University, Princeton, New Jersey 08544
{nslonim,gatwal,gtkacik,wbialek}@princeton.edu

Abstract: We address the practical problems of estimating the information relations that characterize large networks.
Building on methods developed for analysis of the neural code, we show that reliable estimates of mutual information can be obtained with manageable computational effort. The same methods allow estimation of higher-order, multi-information terms. These ideas are illustrated by analyses of gene expression, financial markets, and consumer preferences. In each case, information-theoretic measures correlate with independent, intuitive measures of the underlying structures in the system.

This is found at http://www.genomics.princeton.edu/biophysics-theory/DirectMI/0502017.pdf

-tj

--
==============================================
J. T. Johnson
Institute for Analytic Journalism    www.analyticjournalism.com
505.577.6482(c)    505.473.9646(h)
http://www.jtjohnson.com             tom at jtjohnson.com

"You never change things by fighting the existing reality.
To change something, build a new model that makes the
existing model obsolete." -- Buckminster Fuller
==============================================
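The quantity at the heart of both papers is mutual information. As a rough illustration of what the abstract says is being estimated, here is a minimal plug-in estimator in Python; this is the naive empirical estimate, not the bias-corrected method the paper develops, and the function name and toy samples are mine:

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Naive plug-in estimate of I(X;Y) in bits from paired samples.

    The Slonim et al. paper is about doing this reliably in large,
    undersampled networks; this sketch just shows the quantity itself.
    """
    n = len(xs)
    pxy = Counter(zip(xs, ys))   # empirical joint counts
    px = Counter(xs)             # empirical marginal counts
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), with counts cancelled
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

# Perfectly dependent binary variables: one full bit of information
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))  # 1.0
# Variables whose sample counts factorise: zero bits
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.0
```

In practice this estimator is badly biased when the number of samples is small relative to the number of distinct (x, y) pairs, which is exactly the regime the paper addresses.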
On 12/23/05, J T Johnson <tom at jtjohnson.com> wrote:
> Unfortunately, only subscribers can see the full text of the article. So
> far. But here's a link to the lead author's homepage, which has much of the
> same material.
> http://www.princeton.edu/~nslonim/
> And this is helpful, too:
> http://www.genomics.princeton.edu/biophysics-theory/Clustering/web-content/index.html

The [Download the paper here] link on this page appears to be the arXiv preprint of the PNAS paper. There's a GPL'ed Matlab implementation of the algorithm in the left sidebar if you'd rather try first and RTFM later.

It reads very nicely, but most clustering algorithms seem reasonable on paper. You really need to take it for a spin and see if the results make sense.

-- rec --
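The trade-off the article describes, rewarding clusters that stay informative while penalising the bits needed to describe them, can be scored for a toy example. This is a hand-rolled sketch of an information-bottleneck-style objective, not the GPL'ed Matlab code; the `beta` weight and the layout of `p_xy` are assumptions made for illustration:

```python
import math
from collections import defaultdict

def ib_score(p_xy, clusters, beta=5.0):
    """Score a hard clustering of the rows of a joint distribution.

    p_xy[x][y] is the joint probability of item x and feature y;
    clusters[x] is the cluster assigned to row x.  The score is
    I(C;Y) - H(C)/beta: informativeness about Y minus a description cost.
    """
    p_cy = defaultdict(float)
    for x, row in enumerate(p_xy):
        for y, p in enumerate(row):
            p_cy[(clusters[x], y)] += p
    p_c, p_y = defaultdict(float), defaultdict(float)
    for (c, y), p in p_cy.items():
        p_c[c] += p
        p_y[y] += p
    i_cy = sum(p * math.log2(p / (p_c[c] * p_y[y]))
               for (c, y), p in p_cy.items() if p > 0)
    h_c = -sum(p * math.log2(p) for p in p_c.values() if p > 0)
    return i_cy - h_c / beta

# Rows 0,1 share one feature profile, rows 2,3 another.
p = [[0.2, 0.05], [0.2, 0.05], [0.05, 0.2], [0.05, 0.2]]
print(ib_score(p, [0, 0, 1, 1]))  # grouping like rows scores higher
print(ib_score(p, [0, 1, 0, 1]))  # mixing unlike rows scores lower
```

The real algorithm searches over assignments to optimise such a functional; this only evaluates two candidate clusterings, which is enough to see why grouping similar rows wins.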