I'm taking the Stanford Machine Learning class, and it brought up a problem I've thought about before: when any linear-algebraic process reduces the dimensionality of a data set, you lose the "names" or "labels" of the reduced features.
Specifically, if the data set has highly correlated features, such as the square footage of a house and its number of floors, a dimensionality-reduction algorithm is very likely to detect that correlation and merge the two into a single new reduced feature.
A difficulty arises: what do you name the new, reduced features?
This occurs big time with DTMs (document-term matrices), which are often reduced via SVD to a much smaller lexicon than the entire dictionary. But once this reduction occurs, there is no term name for the resulting linear combination of the original dictionary entries.
One solution is to "undo" the dimensionality reduction, to revert to an approximation of the initial terms. If your dimensionality reduction is REALLY tight, that works OK.
But is there another solution that can create credible new terms from the original ones? For example, would semantic-network approaches help? I could see the initial feature names forming a semantic web of triples, yielding a navigation technique in which the original term names are not lost, yet the relationship between them and the new reduced set becomes visible.
In the Stanford ML class, we discuss feature sets of 10,000 terms being reduced to 100-500 terms using PCA (Principal Component Analysis) with "99% variance retained", i.e. only 1% squared projection error (not regression error). It would be fascinating to retain the initial terms in a web of some sort. I know some search classifiers use this type of K-means clustering, but alas, they do lose the original terms.
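As a minimal sketch of the problem (toy data and feature names of my own invention, plain NumPy rather than the class's Octave), note that each retained principal component is just a weighted combination of the originals, so the best "name" PCA itself offers is the list of loadings:

```python
import numpy as np

# Hypothetical correlated housing features: floors roughly tracks sq. ft.
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 4000, size=200)
floors = np.round(sqft / 1500) + rng.normal(0, 0.2, size=200)
X = np.column_stack([sqft, floors])
names = ["sqft", "floors"]

# Standardize, then PCA via SVD.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
var = s**2 / np.sum(s**2)           # fraction of variance per component

# Keep the smallest k with >= 99% variance retained.
k = int(np.searchsorted(np.cumsum(var), 0.99) + 1)

# The only "name" a component has is its loadings on the original terms.
for i in range(k):
    combo = " + ".join(f"{Vt[i, j]:+.2f}*{names[j]}" for j in range(len(names)))
    print(f"PC{i+1} ({var[i]:.1%} variance): {combo}")
```

The loadings are honest but unreadable at 10,000 terms, which is exactly the naming problem above.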
-- Owen
============================================================ FRIAM Applied Complexity Group listserv Meets Fridays 9a-11:30 at cafe at St. John's College lectures, archives, unsubscribe, maps at http://www.friam.org |
On 11/29/2011 8:49 PM, Owen Densmore wrote:
Reserve a forbidden character (e.g. \001) as a delimiter and append the original strings upon the term reduction, forming a lexicon of those unique strings. Then you don't need to remember the index -> string relationships of the original encoding.

Alternatively, to make a denser encoding, one could take the integers corresponding to the terms' row or column indices, form a tuple or list of indices, and hash on that to get the new identifier. Could accumulate that stuff recursively if you want to know the history of the encodings.

Marcus
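A minimal sketch of both suggestions (the term names and indices are hypothetical; SHA-1 is just one convenient stable hash):

```python
import hashlib

DELIM = "\x01"  # reserved "forbidden" character used as the delimiter

def merged_name(original_terms):
    """Carry the original strings along: join them on the reserved delimiter."""
    return DELIM.join(original_terms)

def merged_id(indices):
    """Denser alternative: hash the tuple of original row/column indices.
    Sorting makes the identifier independent of merge order."""
    key = ",".join(str(i) for i in sorted(indices))
    return hashlib.sha1(key.encode()).hexdigest()[:12]

name = merged_name(["sqft", "floors"])
print(name.split(DELIM))    # the original strings are recoverable by splitting
print(merged_id([12, 344])) # short stable identifier for the combination
```

Since a merged name is itself a string, merges compose, which gives the recursive history of the encodings for free.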
On Tue, Nov 29, 2011 at 8:49 PM, Owen Densmore <[hidden email]> wrote:
We always used to call them reduced dimensions 1, 2, 3, ..., because they never stuck around long enough to get familiar.
Opening lines of the abstract for a Hadley Wickham talk in Pittsburgh this week:
It may be that your class problems are perfect data sets for the perfect reduction methods they ask you to apply to them; that's never happened to me. -- rec --
On 11/30/11 2:01 PM, Roger Critchlow wrote:
And every time you fix something in the data prep, all your carefully chosen names go down the tubes, along with whatever amazing theories you attached to them.

If you have "dust" and "sucker" and get the nameless integer 944 for the combination, make (Dust,Sucker) or the fissionable "dust/sucker" rather than agonizing over the possibility of "staubsauger" or "vacuum". There must be more to this question?

Marcus
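A sketch of that compound-naming idea (the feature names are of course hypothetical):

```python
def fuse(*parts):
    """Compound name for a merged feature, e.g. 'dust' + 'sucker' -> 'dust/sucker'.
    Results are plain strings, so later merges compose and the name stays
    fissionable: split on '/' to recover the constituents."""
    return "/".join(parts)

print(fuse("dust", "sucker"))               # dust/sucker
print(fuse(fuse("dust", "sucker"), "bag"))  # dust/sucker/bag
```

The point is that the mechanical compound carries provenance, with no need to agonize over a semantically "correct" new word.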