I'm taking the Stanford Machine Learning class, and it brought up a problem I've thought about before: when any linear-algebraic process reduces the dimensionality of a data set, you lose the "names" or "labels" of the reduced features.
Specifically, if the data set has highly correlated features, such as the square footage of a house and its number of floors, a dimensionality-reduction algorithm is very likely to detect that correlation and merge the two into a single new reduced feature.
A difficulty arises: what do you name the new, reduced features?
This occurs big time with DTMs (document-term matrices), which are often reduced via SVD to a much smaller lexicon than the entire dictionary. But once this reduction occurs, there is no term name for the resulting linear combination of the original dictionary entries.
One solution is to "undo" the dimensionality reduction, to revert to an approximation of the initial terms. If your dimensionality reduction is REALLY tight, that works OK.
But is there another solution that can create credible new terms from the original ones? For example, would semantic-network approaches help? I could see the initial feature names forming a semantic web of triples, yielding a navigation technique in which the original term names are not lost, yet the relationship between them and the new reduced set becomes visible.
In the Stanford ML class, we discuss feature sets of 10,000 terms being reduced to 100-500 terms using PCA (Principal Component Analysis) with "99% variance retained", i.e. only 1% squared projection error (not regression error). It would be fascinating to retain the initial terms in a web of some sort. I know some search classifiers use this type of K-means clustering, but alas, they do lose the original terms.
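As a minimal sketch of the problem (toy data and feature names of my own invention, plain NumPy rather than the class's Octave), note that each retained principal component is just a weighted combination of the originals, so the best "name" PCA itself offers is the list of loadings:

```python
import numpy as np

# Hypothetical correlated housing features: floors roughly tracks sq. ft.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
floors = np.round(sqft / 1500) + rng.normal(0, 0.2, size=200)
X = np.column_stack([sqft, floors])
names = ["sqft", "floors"]

# Standardize, then PCA via SVD.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var = s**2 / np.sum(s**2)           # fraction of variance per component

# Keep the smallest k with >= 99% variance retained.
k = int(np.searchsorted(np.cumsum(var), 0.99) + 1)

# The only "name" a component has is its loadings on the original terms.
for i in range(k):
    combo = " + ".join(f"{Vt[i, j]:+.2f}*{names[j]}" for j in range(len(names)))
    print(f"PC{i+1} ({var[i]:.1%} variance): {combo}")
```

The loadings are honest but unreadable at 10,000 terms, which is exactly the naming problem above.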
-- Owen
============================================================ FRIAM Applied Complexity Group listserv Meets Fridays 9a-11:30 at cafe at St. John's College lectures, archives, unsubscribe, maps at http://www.friam.org |
On 11/29/2011 8:49 PM, Owen Densmore wrote:
Reserve a forbidden character (e.g. \001) as a delimiter and append the original strings upon the term reduction, forming a lexicon of those unique strings. Then you don't need to remember the index -> string relationships of the original encoding.

Alternatively, to make a denser encoding, one could take the integers corresponding to the terms' row or column indices, form a tuple or list of indices, and hash on that to get the new identifier. Could accumulate that stuff recursively if you want to know the history of the encodings.

Marcus
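A minimal sketch of both suggestions (the term names and indices are hypothetical; SHA-1 is just one convenient stable hash):

```python
import hashlib

DELIM = "\x01"  # reserved "forbidden" character used as the delimiter

def merged_name(original_terms):
    """Carry the original strings along: join them on the reserved delimiter."""
    return DELIM.join(original_terms)

def merged_id(indices):
    """Denser alternative: hash the tuple of original row/column indices.
    Sorting makes the identifier independent of merge order."""
    key = ",".join(str(i) for i in sorted(indices))
    return hashlib.sha1(key.encode()).hexdigest()[:12]

name = merged_name(["sqft", "floors"])
print(name.split(DELIM))    # the original strings are recoverable by splitting
print(merged_id([12, 344])) # short stable identifier for the combination
```

Since a merged name is itself a string, merges compose, which gives the recursive history of the encodings for free.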
On Tue, Nov 29, 2011 at 8:49 PM, Owen Densmore <[hidden email]> wrote:
We always used to call them reduced dimensions 1, 2, 3, ..., because they never stuck around long enough to get familiar.
Opening lines of the abstract for a Hadley Wickham talk in Pittsburgh this week:
It may be that your class problems are perfect data sets for the perfect reduction methods they ask you to apply to them; that's never happened to me. -- rec --
On 11/30/11 2:01 PM, Roger Critchlow wrote:
And every time you fix something in the data prep, all your carefully chosen names go down the tubes, along with whatever amazing theories you attached to them.

If you have "dust" and "sucker" and get the nameless integer 944 for the combination, make (Dust,Sucker) or the fissionable "dust/sucker" rather than agonizing over the possibility of "staubsauger" or "vacuum". There must be more to this question?

Marcus
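A sketch of that compound-naming idea (the feature names are of course hypothetical):

```python
def fuse(*parts):
    """Compound name for a merged feature, e.g. 'dust' + 'sucker' -> 'dust/sucker'.
    Results are plain strings, so later merges compose and the name stays
    fissionable: split on '/' to recover the constituents."""
    return "/".join(parts)

print(fuse("dust", "sucker"))               # dust/sucker
print(fuse(fuse("dust", "sucker"), "bag"))  # dust/sucker/bag
```

The point is that the mechanical compound carries provenance, with no need to agonize over a semantically "correct" new word.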