Friam

hi friam - how do I calculate the fractal dimension of repetitive text?

Classic

List

Threaded

10 messages Options

Giles Bowkett

hi friam - how do I calculate the fractal dimension of repetitive text?

Howdy all - it's been a long time, but I was just planning a return to NM to see family for the holiday and reading Mandelbrot's book on finance, and the pairing reminded me of FRIAM.

I have something I'm puzzling with, which you guys might find interesting. I want to calculate the fractal dimension of repetitive text, specifically code.

I'm building a system to do automated refactoring. It's essentially a compiler. I see this as a business project, but the fractal dimensionality, I don't see any actual business in that. It's just a mild personal obsession. I read Mandelbrot's "Fractal Geometry of Nature" when I was a teenager and it changed my life.

So anyway, consider a code base at a tech company. Typically, these code bases are measured in lines of code. It's a one-dimensional measurement, so in a sense, a "line" whose length is determined by the number of lines of code. However, lines of code exhibit self-similarity, and the self-similarity increases with the clumsiness of the syntax of the programming language, and with the incompetence or indifference of the programmers who wrote it. This comes from tons of personal experience, and from the fact that the less competent a programmer is, or the less they care about the company, or the more confusing a language's syntax is, the more likely a programmer is to just copy and paste some code, instead of actually figuring out what it does.

I think the "line" formed by measuring the number of these lines of code would be more strictly be considered a curve, as in the Koch curve, which is a highly self-similar curve with a fractal or Hausdorff dimension greater than one; its fractal dimension comes from its self-similar filagrees. Fractals 101.

http://en.wikipedia.org/wiki/Koch_snowflake
http://en.wikipedia.org/wiki/Hausdorff_dimension

You create the Koch curve by taking a line, inserting a deviation, and then dividing the result into fourths, and repeating the process on each fourth. You end up with this thing that looks like a snowflake or a fuzzy Star of David.

Many programmers write their code by copying a line and inserting some minor variation.

E.g., say I have this:

this.moduleOverlay.data.info.specifics.details.render(this.display, "hello world", this);

That's pretty much a real-world example. Say I need the application this comes from to display "howdy world" a little later on. Even though I think it's kind of a horrible thing to do, for expediency's sake, I might copy this code, paste it, and just modify the literal:

this.moduleOverlay.data.info.specifics.details.render(this.display, "howdy world", this);

Now imagine this happening very frequently in a company with a very high rate of copy-paste. The company creates its entire code base by copying lines and inserting deviations. You very soon have a very self-similar corpus of text. The recursivity in the Koch curve also has an analog, because programmers will not just copy-paste line by line, but also copy-paste giant blocks of code, instead of refactoring to objects. So although the copying is not a recursive process in most cases, you do have repetition at multiple scales.

The result is that the code base, considered as a whole, has some degree of fractal self-similarity. If we take lines of code as our metric, we're going to have some kind of fractal dimensionality in the result, something greater than 1 but less than 2, just like the coastline of Britain. (again, http://en.wikipedia.org/wiki/Hausdorff_dimension)

In reality, however, lines of code is a ridiculous metric, because it's meaningless. If somebody offers you a consulting gig and tells you they have a thousand lines of code, is that highly repetitive code, or entirely unique code? You could technically be talking about the same line repeated a thousand times. (In fact I worked at a company in Santa Fe which was just about that dumb, and I know if Carl is still on here, he knows who I'm talking about.) In order to do any kind of useful analysis on code, you need to turn the code into a parse tree, and analyze the tree. For instance, my automated refactoring code is still in its infancy, but can detect simple kinds of repetition and similarity, and does so by converting code into parse trees, and then comparing the trees. It's downright tautological to say that if you're looking at trees which contain subtrees, and those subtrees are equal or similar to other subtrees (both subtrees within the same tree and subtrees found in other trees), and further that the same patterns of repetition shaped both the macro and micro scales of the tree, you are talking about fractals.

But I can't find anything on how to calculate the fractal dimension of a tree! The best thing I've found is something on how to calculate the fractal dimension of a network, which I'm guessing could be clumsily coerced into a method for trees. There's plenty out there about how to use fractals to **generate** an irregular tree, so I may be able to use something like that and go backwards.

OK, and I found this:

http://www.trusoft-international.com/

But it costs about $250, and I want something Unix-y or RESTful so I can use it programmatically.

--
Giles Bowkett
http://gilesbowkett.com

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org

Jochen Fromm-5

Re: hi friam - how do I calculate the fractal dimension of repetitive text?

Hi Giles,

"Copy and Paste" is not something only unexperienced
programmers do. Let us admit it, programmers do it
all the time. It is a basic tool which is applied
on many scales, from the highest level of the application
to the lowest level of the individual function.
Similar or duplicated code is not completely
bad, it can even be faster than entirely unique
code (for instance if you code a loop directly
in Assembly language, the branching command
can be eliminated). If repetitive code is good or
bad depends on the judgement of the programmer -
whether he wants performance or elegance.

There is a Harvard Business Review article
named "When Should a Process Be Art" which
says that some creative processes naturally
resist definition and standardization.
Programming is such a creative process.
The programmer adjusts the raw material until
it matches the desired requiremewnts. It is an
artistic process where the quality can not be
measured by counting the lines of code, see
http://4loc.wordpress.com/2010/11/22/progress-in-the-software-world/

I doubt that refactoring can be automated.
The book from Martin Fowler is useful:
"Refactoring: Improving the Design of Existing Code"
A list of patterns from this book can be found here
http://www.refactoring.com/catalog/

I am not sure if it makes sense to define a
fractal dimension for text or code. The box
counting method is a common method to
determine the fractal dimension of images
and graphics. We divide the space up into a grid
of boxes of size x, and count the number of boxes
of that scale that would contain a part of the
attractor.

If we divide the code into lines, and count the
number of lines which repeat themselves, we would get
a number between 0% and 100% (only repetition:
100%, no repetition: 0%). This is not a fractal
dimension. And if we are honest, we must consider the
whole stack, which begins on the deepest level of
machine code. Ruby for example is written in C, and
C in Assembly language. Many programs which look
elegant just hide the messy part under the hood.
They are based on large frameworks which contain
all the messy and ugly code.

-J.

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org

Russell Standish

Re: hi friam - how do I calculate the fractal dimension of repetitive text?

In reply to this post by Giles Bowkett

I had an idea how you might do this.

You can start by counting the number of repeated characters in your
program. Then count all pairs of characters that
appear more than once. Then all quadruplets, octuplets and so on. This
will give you some scale dependent distribution of code similarity. If
this turns out to exhibit a power law, you can get the Hausdorf
dimension from that.

A variant you could try is considering some threshold value (eg 10%),
and consider a repeat to be any string that is within that threshold
(x string length) as measured by the Hamming distance.

You might find the algorithm as described is computationally intensive
(it looks to be O(n^2) at first glance), so you might want to exploit
some structural properties of the C++ code (eg using method boundaries
and line boundaries to help in framing at the appropriate scales).

The above technique could also be applied to genomic data, and also
the genomes of digitial organisms from such systems as Tierra or Avida.

One possible hypothesis to test is that evolved software will exhibit
a power law scaling, whereas engineered (and finely refactored) code
does not. This would be in line with Stephan Halloy's observations of
scaling laws in natural systems - see
http://pandora.nla.gov.au/pan/10178/20040710-0000/www.complexity.org.au/ci/vol06/halloy/index.html
.

Cheers

On Fri, Dec 10, 2010 at 06:27:13PM -0800, Giles Bowkett wrote:

> Howdy all - it's been a long time, but I was just planning a return to NM to
> see family for the holiday and reading Mandelbrot's book on finance, and the
> pairing reminded me of FRIAM.
>
> I have something I'm puzzling with, which you guys might find interesting. I
> want to calculate the fractal dimension of repetitive text, specifically
> code.
>
> I'm building a system to do automated refactoring. It's essentially a
> compiler. I see this as a business project, but the fractal dimensionality,
> I don't see any actual business in that. It's just a mild personal
> obsession. I read Mandelbrot's "Fractal Geometry of Nature" when I was a
> teenager and it changed my life.
>
> So anyway, consider a code base at a tech company. Typically, these code
> bases are measured in lines of code. It's a one-dimensional measurement, so
> in a sense, a "line" whose length is determined by the number of lines of
> code. However, lines of code exhibit self-similarity, and the
> self-similarity increases with the clumsiness of the syntax of the
> programming language, and with the incompetence or indifference of the
> programmers who wrote it. This comes from tons of personal experience, and
> from the fact that the less competent a programmer is, or the less they care
> about the company, or the more confusing a language's syntax is, the more
> likely a programmer is to just copy and paste some code, instead of actually
> figuring out what it does.
>
> I think the "line" formed by measuring the number of these lines of code
> would be more strictly be considered a curve, as in the Koch curve, which is
> a highly self-similar curve with a fractal or Hausdorff dimension greater
> than one; its fractal dimension comes from its self-similar filagrees.
> Fractals 101.
>
> http://en.wikipedia.org/wiki/Koch_snowflake
> http://en.wikipedia.org/wiki/Hausdorff_dimension
>
> You create the Koch curve by taking a line, inserting a deviation, and then
> dividing the result into fourths, and repeating the process on each fourth.
> You end up with this thing that looks like a snowflake or a fuzzy Star of
> David.
>
> Many programmers write their code by copying a line and inserting some minor
> variation.
>
> E.g., say I have this:
>
> this.moduleOverlay.data.info.specifics.details.render(this.display, "hello
> world", this);
>
> That's pretty much a real-world example. Say I need the application this
> comes from to display "howdy world" a little later on. Even though I think
> it's kind of a horrible thing to do, for expediency's sake, I might copy
> this code, paste it, and just modify the literal:
>
> this.moduleOverlay.data.info.specifics.details.render(this.display, "howdy
> world", this);
>
> Now imagine this happening very frequently in a company with a very high
> rate of copy-paste. The company creates its entire code base by copying
> lines and inserting deviations. You very soon have a very self-similar
> corpus of text. The recursivity in the Koch curve also has an analog,
> because programmers will not just copy-paste line by line, but also
> copy-paste giant blocks of code, instead of refactoring to objects. So
> although the copying is not a recursive process in most cases, you do have
> repetition at multiple scales.
>
> The result is that the code base, considered as a whole, has some degree of
> fractal self-similarity. If we take lines of code as our metric, we're going
> to have some kind of fractal dimensionality in the result, something greater
> than 1 but less than 2, just like the coastline of Britain. (again,
> http://en.wikipedia.org/wiki/Hausdorff_dimension)
>
> In reality, however, lines of code is a ridiculous metric, because it's
> meaningless. If somebody offers you a consulting gig and tells you they have
> a thousand lines of code, is that highly repetitive code, or entirely unique
> code? You could technically be talking about the same line repeated a
> thousand times. (In fact I worked at a company in Santa Fe which was just
> about that dumb, and I know if Carl is still on here, he knows who I'm
> talking about.) In order to do any kind of useful analysis on code, you need
> to turn the code into a parse tree, and analyze the tree. For instance, my
> automated refactoring code is still in its infancy, but can detect simple
> kinds of repetition and similarity, and does so by converting code into
> parse trees, and then comparing the trees. It's downright tautological to
> say that if you're looking at trees which contain subtrees, and those
> subtrees are equal or similar to other subtrees (both subtrees within the
> same tree and subtrees found in other trees), and further that the same
> patterns of repetition shaped both the macro and micro scales of the tree,
> you are talking about fractals.
>
> But I can't find anything on how to calculate the fractal dimension of a
> tree! The best thing I've found is something on how to calculate the fractal
> dimension of a network, which I'm guessing could be clumsily coerced into a
> method for trees. There's plenty out there about how to use fractals to
> **generate** an irregular tree, so I may be able to use something like that
> and go backwards.
>
> OK, and I found this:
>
> http://www.trusoft-international.com/
>
> But it costs about $250, and I want something Unix-y or RESTful so I can use
> it programmatically.
>
> --
> Giles Bowkett
> http://gilesbowkett.com

> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at cafe at St. John's College
> lectures, archives, unsubscribe, maps at http://www.friam.org

--

----------------------------------------------------------------------------
Prof Russell Standish Phone 0425 253119 (mobile)
Mathematics
UNSW SYDNEY 2052 [hidden email]
Australia http://www.hpcoders.com.au
----------------------------------------------------------------------------

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org

Giles Bowkett

Re: hi friam - how do I calculate the fractal dimension of repetitive text?

In reply to this post by Jochen Fromm-5

Hi Jochen, first of all, apologies for the delayed reply, I kinda got knocked out by a nasty sinus infection right after I sent that e-mail. I will admit to having copied and pasted in the past, and I probably will in the future, but I'm not proud of it. I follow the Kent Beck rule on this, which is basically, if you use the same code twice, it's no big deal, but if you do it three times or more, you should refactor. I think I actually got that from the Fowler book which as you say is very good.

In terms of a fractal dimension for code, here's an example:

for (var i = 0; i < foo; i++) {

console.log(bar);

for (var j = 0; j < bar; j++) {

console.log(baz);

console.log(quux);

}

So here you have a piece of code which contains another piece of code highly similar to itself. The inner loop is like a version of the outer loop copied and then slightly altered. You could imagine an infinite series of for loops, each with a tiny tweak but otherwise highly similar to the containing loop. Kind of like how the Koch curve contains an infinite number of nested Koch curves, in a sense, but the difference is that the Koch curve is a regular fractal, while this code, if it has fractal dimension, represents an irregular fractal.

On Fri, Dec 10, 2010 at 11:14 PM, Jochen Fromm <[hidden email]> wrote:

Hi Giles,

"Copy and Paste" is not something only unexperienced
programmers do. Let us admit it, programmers do it all the time. It is a basic tool which is applied on many scales, from the highest level of the application
to the lowest level of the individual function.
Similar or duplicated code is not completely bad, it can even be faster than entirely unique
code (for instance if you code a loop directly
in Assembly language, the branching command can be eliminated). If repetitive code is good or bad depends on the judgement of the programmer - whether he wants performance or elegance.

There is a Harvard Business Review article named "When Should a Process Be Art" which
says that some creative processes naturally resist definition and standardization.
Programming is such a creative process.
The programmer adjusts the raw material until it matches the desired requiremewnts. It is an
artistic process where the quality can not be measured by counting the lines of code, see
http://4loc.wordpress.com/2010/11/22/progress-in-the-software-world/

I doubt that refactoring can be automated.
The book from Martin Fowler is useful:
"Refactoring: Improving the Design of Existing Code"
A list of patterns from this book can be found here
http://www.refactoring.com/catalog/

I am not sure if it makes sense to define a
fractal dimension for text or code. The box counting method is a common method to
determine the fractal dimension of images
and graphics. We divide the space up into a grid of boxes of size x, and count the number of boxes of that scale that would contain a part of the attractor.

If we divide the code into lines, and count the number of lines which repeat themselves, we would get
a number between 0% and 100% (only repetition: 100%, no repetition: 0%). This is not a fractal dimension. And if we are honest, we must consider the whole stack, which begins on the deepest level of machine code. Ruby for example is written in C, and C in Assembly language. Many programs which look elegant just hide the messy part under the hood.
They are based on large frameworks which contain all the messy and ugly code.

-J.

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org

--
Giles Bowkett
http://gilesbowkett.com

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org

Giles Bowkett

Re: hi friam - how do I calculate the fractal dimension of repetitive text?

In reply to this post by Russell Standish

Hey Russell,

You might find the algorithm as described is computationally intensive

Quite an understatement!

so you might want to exploit
some structural properties of the C++ code (eg using method boundaries
and line boundaries to help in framing at the appropriate scales).

I think that's the way to go, although I'll probably look at other languages first. For example, is there a power law relationship between the similarity of functions within a file and the similarity of files? My refactoring code has some stuff to measure repetition and similarity between functions, and will probably have more soon (very probably if you measure "soon" in years). Basically I'll probably look at code as trees of tokens rather than as literal strings, because the computation would otherwise just be massive.

One possible hypothesis to test is that evolved software will exhibit
a power law scaling, whereas engineered (and finely refactored) code
does not.

Yeah, this is really my base assumption. I have this weird zoo of "evolved" code, kind of an embarrassment of riches in the sample data department. I've been collecting samples for a couple years now. I'm almost certain the power law scaling is there, it's really almost just a question of how you define the scales. measuring repetition from line to line is very easy to do, but code also exhibits patterns which I'm not sure how I'd quantify. Example:

File A:

name = info.name;

address = info.address;

console.log(name);

console.log(address);

File B:

zip = info.zip;

phone = info.phone;

console.log(zip);

console.log(phone);

Obviously there's a lot of similarity here, both between individual lines and between files. it's pretty easy to think of ways to capture these similarities.

File C:

penguin = info.penguin;

console.log(penguin);

walrus = info.walrus;

console.log(walrus);

This represents the equivalent of a mild topological transformation (if I'm using my terms right), plus there's also this metalinguistic thing where the theme of Files A and B are similar to each other, but different from File C, and also that the grouping of terms by theme is a kind of similarity if we introduce a File D which doesn't observe it:

File D:

orca = info.orca;

console.log(orca);

apartment = info.apartment;

console.log(apartment);

You could actually capture that using the Python Natural Language ToolKit, which has some access to semantic clustering of some kind. And you can obtain it through latent semantic analysis, although that's nontrivial to say the least.

Anyway, bit of a tangent there.

On Mon, Dec 13, 2010 at 3:12 PM, Russell Standish <[hidden email]> wrote:

I had an idea how you might do this.

You can start by counting the number of repeated characters in your
program. Then count all pairs of characters that
appear more than once. Then all quadruplets, octuplets and so on. This
will give you some scale dependent distribution of code similarity. If
this turns out to exhibit a power law, you can get the Hausdorf
dimension from that.

A variant you could try is considering some threshold value (eg 10%),
and consider a repeat to be any string that is within that threshold
(x string length) as measured by the Hamming distance.

You might find the algorithm as described is computationally intensive
(it looks to be O(n^2) at first glance), so you might want to exploit
some structural properties of the C++ code (eg using method boundaries
and line boundaries to help in framing at the appropriate scales).

The above technique could also be applied to genomic data, and also
the genomes of digitial organisms from such systems as Tierra or Avida.

One possible hypothesis to test is that evolved software will exhibit
a power law scaling, whereas engineered (and finely refactored) code
does not. This would be in line with Stephan Halloy's observations of
scaling laws in natural systems - see
http://pandora.nla.gov.au/pan/10178/20040710-0000/www.complexity.org.au/ci/vol06/halloy/index.html
.

Cheers

On Fri, Dec 10, 2010 at 06:27:13PM -0800, Giles Bowkett wrote:
> Howdy all - it's been a long time, but I was just planning a return to NM to
> see family for the holiday and reading Mandelbrot's book on finance, and the
> pairing reminded me of FRIAM.
>
> I have something I'm puzzling with, which you guys might find interesting. I
> want to calculate the fractal dimension of repetitive text, specifically
> code.
>
> I'm building a system to do automated refactoring. It's essentially a
> compiler. I see this as a business project, but the fractal dimensionality,
> I don't see any actual business in that. It's just a mild personal
> obsession. I read Mandelbrot's "Fractal Geometry of Nature" when I was a
> teenager and it changed my life.
>
> So anyway, consider a code base at a tech company. Typically, these code
> bases are measured in lines of code. It's a one-dimensional measurement, so
> in a sense, a "line" whose length is determined by the number of lines of
> code. However, lines of code exhibit self-similarity, and the
> self-similarity increases with the clumsiness of the syntax of the
> programming language, and with the incompetence or indifference of the
> programmers who wrote it. This comes from tons of personal experience, and
> from the fact that the less competent a programmer is, or the less they care
> about the company, or the more confusing a language's syntax is, the more
> likely a programmer is to just copy and paste some code, instead of actually
> figuring out what it does.
>
> I think the "line" formed by measuring the number of these lines of code
> would be more strictly be considered a curve, as in the Koch curve, which is
> a highly self-similar curve with a fractal or Hausdorff dimension greater
> than one; its fractal dimension comes from its self-similar filagrees.
> Fractals 101.
>
> http://en.wikipedia.org/wiki/Koch_snowflake
> http://en.wikipedia.org/wiki/Hausdorff_dimension
>
> You create the Koch curve by taking a line, inserting a deviation, and then
> dividing the result into fourths, and repeating the process on each fourth.
> You end up with this thing that looks like a snowflake or a fuzzy Star of
> David.
>
> Many programmers write their code by copying a line and inserting some minor
> variation.
>
> E.g., say I have this:
>
> this.moduleOverlay.data.info.specifics.details.render(this.display, "hello
> world", this);
>
> That's pretty much a real-world example. Say I need the application this
> comes from to display "howdy world" a little later on. Even though I think
> it's kind of a horrible thing to do, for expediency's sake, I might copy
> this code, paste it, and just modify the literal:
>
> this.moduleOverlay.data.info.specifics.details.render(this.display, "howdy
> world", this);
>
> Now imagine this happening very frequently in a company with a very high
> rate of copy-paste. The company creates its entire code base by copying
> lines and inserting deviations. You very soon have a very self-similar
> corpus of text. The recursivity in the Koch curve also has an analog,
> because programmers will not just copy-paste line by line, but also
> copy-paste giant blocks of code, instead of refactoring to objects. So
> although the copying is not a recursive process in most cases, you do have
> repetition at multiple scales.
>
> The result is that the code base, considered as a whole, has some degree of
> fractal self-similarity. If we take lines of code as our metric, we're going
> to have some kind of fractal dimensionality in the result, something greater
> than 1 but less than 2, just like the coastline of Britain. (again,
> http://en.wikipedia.org/wiki/Hausdorff_dimension)
>
> In reality, however, lines of code is a ridiculous metric, because it's
> meaningless. If somebody offers you a consulting gig and tells you they have
> a thousand lines of code, is that highly repetitive code, or entirely unique
> code? You could technically be talking about the same line repeated a
> thousand times. (In fact I worked at a company in Santa Fe which was just
> about that dumb, and I know if Carl is still on here, he knows who I'm
> talking about.) In order to do any kind of useful analysis on code, you need
> to turn the code into a parse tree, and analyze the tree. For instance, my
> automated refactoring code is still in its infancy, but can detect simple
> kinds of repetition and similarity, and does so by converting code into
> parse trees, and then comparing the trees. It's downright tautological to
> say that if you're looking at trees which contain subtrees, and those
> subtrees are equal or similar to other subtrees (both subtrees within the
> same tree and subtrees found in other trees), and further that the same
> patterns of repetition shaped both the macro and micro scales of the tree,
> you are talking about fractals.
>
> But I can't find anything on how to calculate the fractal dimension of a
> tree! The best thing I've found is something on how to calculate the fractal
> dimension of a network, which I'm guessing could be clumsily coerced into a
> method for trees. There's plenty out there about how to use fractals to
> **generate** an irregular tree, so I may be able to use something like that
> and go backwards.
>
> OK, and I found this:
>
> http://www.trusoft-international.com/
>
> But it costs about $250, and I want something Unix-y or RESTful so I can use
> it programmatically.
>
> --
> Giles Bowkett
> http://gilesbowkett.com

> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at cafe at St. John's College
> lectures, archives, unsubscribe, maps at http://www.friam.org

--

----------------------------------------------------------------------------
Prof Russell Standish Phone 0425 253119 (mobile)
Mathematics
UNSW SYDNEY 2052 [hidden email]
Australia http://www.hpcoders.com.au
----------------------------------------------------------------------------

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org

--
Giles Bowkett
http://gilesbowkett.com

Eric Charles

Re: hi friam - how do I calculate the fractal dimension of repetitive text?

In reply to this post by Russell Standish

Hey all,
I'm definitely just a lurker relative to this thread, but could you answer a few quick questions so I can lurk more effectively?

The goal is to find the fractal dimension of a given bunch of computer code (perhaps within a part of a program, perhaps within a database of several thousand programs).

At this point it is clear to me (and it wasn't for some odd reason before this last set of examples) that you could be interested in the fractal dimension for at least two distinct reasons: 1) you are interested in something about computer code itself, 2) you are interested in something about what a given person or group of people do when coding.

If you were interested in "the code itself", then you might, for example, get a sample of code widely agreed to be very pretty, and a sample widely agreed to be very ugly, and compare fractal dimensionality.

If you were interested in "what coders do", then you might, for example, watch someone writing a program that had to perform many tasks, and see how the fractal dimensionality changes as they tweak the code towards a final project.

Are these the types of comparisons you have in mind? Is one type of question more interesting? Is there some third type of question motivating the project?

Thanks,
Eric

P.S. A third reason to be interested that just occurred to me would be comparison across different languages. Surely some languages encourage more fractal code than others... right?

On Wed, Dec 15, 2010 08:11 PM, Giles Bowkett <[hidden email]> wrote:

Hey Russell,

You might find the algorithm as described is computationally intensive

Quite an understatement!

so you might want to exploit
some structural properties of the C++ code (eg using method boundaries
and line boundaries to help in framing at the appropriate scales).

I think that's the way to go, although I'll probably look at other languages first. For example, is there a power law relationship between the similarity of functions within a file and the similarity of files? My refactoring code has some stuff to measure repetition and similarity between functions, and will probably have more soon (very probably if you measure "soon" in years). Basically I'll probably look at code as trees of tokens rather than as literal strings, because the computation would otherwise just be massive.

One possible hypothesis to test is that evolved software will exhibit
a power law scaling, whereas engineered (and finely refactored) code
does not.

Yeah, this is really my base assumption. I have this weird zoo of "evolved" code, kind of an embarrassment of riches in the sample data department. I've been collecting samples for a couple years now. I'm almost certain the power law scaling is there, it's really almost just a question of how you define the scales. measuring repetition from line to line is very easy to do, but code also exhibits patterns which I'm not sure how I'd quantify. Example:

File A:

name = <a href="http://info.name" onclick="window.open('http://info.name');return false;">info.name;

address = info.address;

console.log(name);

console.log(address);

File B:

zip = info.zip;

phone = info.phone;

console.log(zip);

console.log(phone);

Obviously there's a lot of similarity here, both between individual lines and between files. it's pretty easy to think of ways to capture these similarities.

File C:

penguin = info.penguin;

console.log(penguin);

walrus = info.walrus;

console.log(walrus);

This represents the equivalent of a mild topological transformation (if I'm using my terms right), plus there's also this metalinguistic thing where the theme of Files A and B are similar to each other, but different from File C, and also that the grouping of terms by theme is a kind of similarity if we introduce a File D which doesn't observe it:

File D:

orca = info.orca;

console.log(orca);

apartment = info.apartment;

console.log(apartment);

You could actually capture that using the Python Natural Language ToolKit, which has some access to semantic clustering of some kind. And you can obtain it through latent semantic analysis, although that's nontrivial to say the least.

Anyway, bit of a tangent there.

On Mon, Dec 13, 2010 at 3:12 PM, Russell Standish <r.standish@...> wrote:
I had an idea how you might do this.

You can start by counting the number of repeated characters in your
program. Then count all pairs of characters that
appear more than once. Then all quadruplets, octuplets and so on. This
will give you some scale dependent distribution of code similarity. If
this turns out to exhibit a power law, you can get the Hausdorf
dimension from that.

A variant you could try is considering some threshold value (eg 10%),
and consider a repeat to be any string that is within that threshold
(x string length) as measured by the Hamming distance.

You might find the algorithm as described is computationally intensive
(it looks to be O(n^2) at first glance), so you might want to exploit
some structural properties of the C++ code (eg using method boundaries
and line boundaries to help in framing at the appropriate scales).

The above technique could also be applied to genomic data, and also
the genomes of digitial organisms from such systems as Tierra or Avida.

One possible hypothesis to test is that evolved software will exhibit
a power law scaling, whereas engineered (and finely refactored) code
does not. This would be in line with Stephan Halloy's observations of
scaling laws in natural systems - see
<a href="http://pandora.nla.gov.au/pan/10178/20040710-0000/www.complexity.org.au/ci/vol06/halloy/index.html" target="" onclick="window.open('http://pandora.nla.gov.au/pan/10178/20040710-0000/www.complexity.org.au/ci/vol06/halloy/index.html');return false;">http://pandora.nla.gov.au/pan/10178/20040710-0000/www.complexity.org.au/ci/vol06/halloy/index.html
.

Cheers

On Fri, Dec 10, 2010 at 06:27:13PM -0800, Giles Bowkett wrote:
> Howdy all - it's been a long time, but I was just planning a return to NM to
> see family for the holiday and reading Mandelbrot's book on finance, and the
> pairing reminded me of FRIAM.
>
> I have something I'm puzzling with, which you guys might find interesting. I
> want to calculate the fractal dimension of repetitive text, specifically
> code.
>
> I'm building a system to do automated refactoring. It's essentially a
> compiler. I see this as a business project, but the fractal dimensionality,
> I don't see any actual business in that. It's just a mild personal
> obsession. I read Mandelbrot's "Fractal Geometry of Nature" when I was a
> teenager and it changed my life.
>
> So anyway, consider a code base at a tech company. Typically, these code
> bases are measured in lines of code. It's a one-dimensional measurement, so
> in a sense, a "line" whose length is determined by the number of lines of
> code. However, lines of code exhibit self-similarity, and the
> self-similarity increases with the clumsiness of the syntax of the
> programming language, and with the incompetence or indifference of the
> programmers who wrote it. This comes from tons of personal experience, and
> from the fact that the less competent a programmer is, or the less they care
> about the company, or the more confusing a language's syntax is, the more
> likely a programmer is to just copy and paste some code, instead of actually
> figuring out what it does.
>
> I think the "line" formed by measuring the number of these lines of code
> would be more strictly be considered a curve, as in the Koch curve, which is
> a highly self-similar curve with a fractal or Hausdorff dimension greater
> than one; its fractal dimension comes from its self-similar filagrees.
> Fractals 101.
>
> <a href="http://en.wikipedia.org/wiki/Koch_snowflake" target="" onclick="window.open('http://en.wikipedia.org/wiki/Koch_snowflake');return false;">http://en.wikipedia.org/wiki/Koch_snowflake
> <a href="http://en.wikipedia.org/wiki/Hausdorff_dimension" target="" onclick="window.open('http://en.wikipedia.org/wiki/Hausdorff_dimension');return false;">http://en.wikipedia.org/wiki/Hausdorff_dimension
>
> You create the Koch curve by taking a line, inserting a deviation, and then
> dividing the result into fourths, and repeating the process on each fourth.
> You end up with this thing that looks like a snowflake or a fuzzy Star of
> David.
>
> Many programmers write their code by copying a line and inserting some minor
> variation.
>
> E.g., say I have this:
>
> this.moduleOverlay.data.info.specifics.details.render(this.display, "hello
> world", this);
>
> That's pretty much a real-world example. Say I need the application this
> comes from to display "howdy world" a little later on. Even though I think
> it's kind of a horrible thing to do, for expediency's sake, I might copy
> this code, paste it, and just modify the literal:
>
> this.moduleOverlay.data.info.specifics.details.render(this.display, "howdy
> world", this);
>
> Now imagine this happening very frequently in a company with a very high
> rate of copy-paste. The company creates its entire code base by copying
> lines and inserting deviations. You very soon have a very self-similar
> corpus of text. The recursivity in the Koch curve also has an analog,
> because programmers will not just copy-paste line by line, but also
> copy-paste giant blocks of code, instead of refactoring to objects. So
> although the copying is not a recursive process in most cases, you do have
> repetition at multiple scales.
>
> The result is that the code base, considered as a whole, has some degree of
> fractal self-similarity. If we take lines of code as our metric, we're going
> to have some kind of fractal dimensionality in the result, something greater
> than 1 but less than 2, just like the coastline of Britain. (again,
> <a href="http://en.wikipedia.org/wiki/Hausdorff_dimension" target="" onclick="window.open('http://en.wikipedia.org/wiki/Hausdorff_dimension');return false;">http://en.wikipedia.org/wiki/Hausdorff_dimension)
>
> In reality, however, lines of code is a ridiculous metric, because it's
> meaningless. If somebody offers you a consulting gig and tells you they have
> a thousand lines of code, is that highly repetitive code, or entirely unique
> code? You could technically be talking about the same line repeated a
> thousand times. (In fact I worked at a company in Santa Fe which was just
> about that dumb, and I know if Carl is still on here, he knows who I'm
> talking about.) In order to do any kind of useful analysis on code, you need
> to turn the code into a parse tree, and analyze the tree. For instance, my
> automated refactoring code is still in its infancy, but can detect simple
> kinds of repetition and similarity, and does so by converting code into
> parse trees, and then comparing the trees. It's downright tautological to
> say that if you're looking at trees which contain subtrees, and those
> subtrees are equal or similar to other subtrees (both subtrees within the
> same tree and subtrees found in other trees), and further that the same
> patterns of repetition shaped both the macro and micro scales of the tree,
> you are talking about fractals.
>
> But I can't find anything on how to calculate the fractal dimension of a
> tree! The best thing I've found is something on how to calculate the fractal
> dimension of a network, which I'm guessing could be clumsily coerced into a
> method for trees. There's plenty out there about how to use fractals to
> **generate** an irregular tree, so I may be able to use something like that
> and go backwards.
>
> OK, and I found this:
>
> <a href="http://www.trusoft-international.com/" target="" onclick="window.open('http://www.trusoft-international.com/');return false;">http://www.trusoft-international.com/
>
> But it costs about $250, and I want something Unix-y or RESTful so I can use
> it programmatically.
>
> --
> Giles Bowkett
> <a href="http://gilesbowkett.com" target="" onclick="window.open('http://gilesbowkett.com');return false;">http://gilesbowkett.com

> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at cafe at St. John's College
> lectures, archives, unsubscribe, maps at <a href="http://www.friam.org" target="" onclick="window.open('http://www.friam.org');return false;">http://www.friam.org

--

----------------------------------------------------------------------------
Prof Russell Standish Phone 0425 253119 (mobile)
Mathematics
UNSW SYDNEY 2052 hpcoder@...
Australia <a href="http://www.hpcoders.com.au" target="" onclick="window.open('http://www.hpcoders.com.au');return false;">http://www.hpcoders.com.au
----------------------------------------------------------------------------

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at <a href="http://www.friam.org" target="" onclick="window.open('http://www.friam.org');return false;">http://www.friam.org

--
Giles Bowkett
<a href="http://gilesbowkett.com" onclick="window.open('http://gilesbowkett.com');return false;">http://gilesbowkett.com
============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org

Nick Thompson

Re: hi friam - how do I calculate the fractal dimension of repetitive text?

This relates to a discussion I keep trying to start in the evo-devo seminar about a possible metaphor between the modularity in biological development/evolution and modularity in the history of computer coding. Apparently, from the resistance I am getting there is no useful metaphor.

Nick

From: [hidden email] [mailto:[hidden email]] On Behalf Of ERIC P. CHARLES
Sent: Wednesday, December 15, 2010 7:05 PM
To: Giles Bowkett
Cc: The Friday Morning Applied Complexity Coffee Group
Subject: Re: [FRIAM] hi friam - how do I calculate the fractal dimension of repetitive text?

Hey Russell,

You might find the algorithm as described is computationally intensive

Quite an understatement!

so you might want to exploit
some structural properties of the C++ code (eg using method boundaries
and line boundaries to help in framing at the appropriate scales).

One possible hypothesis to test is that evolved software will exhibit
a power law scaling, whereas engineered (and finely refactored) code
does not.

File A:

name = info.name;

address = info.address;

console.log(name);

console.log(address);

File B:

zip = info.zip;

phone = info.phone;

console.log(zip);

console.log(phone);

Obviously there's a lot of similarity here, both between individual lines and between files. it's pretty easy to think of ways to capture these similarities.

File C:

penguin = info.penguin;

console.log(penguin);

walrus = info.walrus;

console.log(walrus);

File D:

orca = info.orca;

console.log(orca);

apartment = info.apartment;

console.log(apartment);

Anyway, bit of a tangent there.

On Mon, Dec 13, 2010 at 3:12 PM, Russell Standish <[hidden email]> wrote:

On Fri, Dec 10, 2010 at 06:27:13PM -0800, Giles Bowkett wrote:

--

----------------------------------------------------------------------------
Prof Russell Standish Phone 0425 253119 (mobile)
Mathematics
UNSW SYDNEY 2052 [hidden email]
Australia http://www.hpcoders.com.au
----------------------------------------------------------------------------

--
Giles Bowkett
http://gilesbowkett.com

============================================================

FRIAM Applied Complexity Group listserv

Meets Fridays 9a-11:30 at cafe at St. John's College

lectures, archives, unsubscribe, maps at http://www.friam.org

Giles Bowkett

Re: hi friam - how do I calculate the fractal dimension of repetitive text?

In reply to this post by Eric Charles

At this point it is clear to me (and it wasn't for some odd reason before this last set of examples) that you could be interested in the fractal dimension for at least two distinct reasons: 1) you are interested in something about computer code itself, 2) you are interested in something about what a given person or group of people do when coding.

If you were interested in "the code itself", then you might, for example, get a sample of code widely agreed to be very pretty, and a sample widely agreed to be very ugly, and compare fractal dimensionality.

If you were interested in "what coders do", then you might, for example, watch someone writing a program that had to perform many tasks, and see how the fractal dimensionality changes as they tweak the code towards a final project.

Are these the types of comparisons you have in mind? Is one type of question more interesting? Is there some third type of question motivating the project?

Well, I'm working on building automated refactoring tools in my spare time. Imagine a sort of Roomba that works on code lint instead of physical dust. So far it's a) awesome and b) time-consuming. I think that's likely to continue. I did my first thing in this in 2008, it's kind of an on-again off-again thing. So my interest arose out of the comparisons and contrasts between the code itself in its current shape and the code itself in the shape it will or would take in future, after processing through automated refactoring. So my tools are nowhere near finished but I've done enough refactoring to know that abstracting out repetition and near-repetition (i.e. highly similar methods or functions or objects) is a big part of it. So if the repetition in code scales, then it's a fractal, basically, and that means you can measure its fractal dimension, and that means you can compute objective measures of cruft. By which I mean (ideally) useful objective measures of cruft. I already have stuff that can tell you "you have 44 methods/functions which are highly similar to one another" and group clusters of similar code, but on any kind of nontrivial code base the report for that could end up almost as complex as the code itself.

Fractal dimensionality probably isn't actually the most useful metric, you're probably way better off in practical terms with one of those hub-spoke node graphs like Google Wonder Wheel (http://www.google.com/search?hl=en&tbs=ww:1&q=fractal+dimension&btnG=Search), but I really like fractals.

Also, I chopped off the languages question by accident, yes, I think it's very very likely that languages differ both by degrees of repetition and by specific patterns of repetition, and indeed a big part of my perspective on this is shaped by the fact that I write a lot of Ruby and the Ruby community is sometimes a little obsessive about eliminating repetition. Also I wouldn't be surprised if it turned out to be mathematically provable that Scheme and Scheme-like Lisps enable more repetition-elimination than any other language, through their macro systems, although I have absolutely zero proof. Just speaking off the top of my head there.

Nick said:

Sounds like a useful metaphor to me. I think the paper Russell linked is relevant:

http://pandora.nla.gov.au/pan/10178/20040710-0000/www.complexity.org.au/ci/vol06/halloy/index.html

--
Giles Bowkett

http://gilesbowkett.com

Russell Standish

Re: hi friam - how do I calculate the fractal dimension of repetitive text?

I spoke to the guy who runs Complexity International, and he should
get the journal website back online shortly, so you won't need to use
the mirror. The paper won't change though :)

On Thu, Dec 16, 2010 at 01:37:30PM -0800, Giles Bowkett wrote:
>
> Sounds like a useful metaphor to me. I think the paper Russell linked is
> relevant:
>
> http://pandora.nla.gov.au/pan/10178/20040710-0000/www.complexity.org.au/ci/vol06/halloy/index.html
>
> --
> Giles Bowkett
> http://gilesbowkett.com

> ============================================================
> FRIAM Applied Complexity Group listserv
> Meets Fridays 9a-11:30 at cafe at St. John's College
> lectures, archives, unsubscribe, maps at http://www.friam.org

--

----------------------------------------------------------------------------
Prof Russell Standish Phone 0425 253119 (mobile)
Mathematics
UNSW SYDNEY 2052 [hidden email]
Australia http://www.hpcoders.com.au
----------------------------------------------------------------------------

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org

Jochen Fromm-5

Fractal Dimension of Map-Reduce Patterns

In reply to this post by Giles Bowkett

Hi Giles,

Fractal dimensions are a fascinating topic. What do you
think, perhaps one can define a fractal dimension for certain
map-reduce patterns that characterizes certain flows,
streams, waves, whirls or vortices? That would be cool.

Has anyone considered the connection between Map-Reduce
and CA/NN/ABM in general? On the one hand Map-Reduce
algorithms can be used to calculate very large Cellular
Automata (CA), Neural Networks (NN) and perhaps Agent
Based Models (ABM). On the other hand, CA, NN and ABM
may shed some light on possible Map-Reduce patterns..

Map-Reduce is a basic pattern of distributed computation,
which can be found in CA and NN. Neurons in NN collect
informations through their synapses (reduce) and send
it to other neurons through their axon (map).
Cells in CA do the same, the collect information from
their neighbors in the previous timestep and process
the input (reduce), then they send it to to their
neighbors for the next timestep (map).

If we have only mapping, or if mapping prevails,
then we get a propagating outgoing wave, which ensures
activities. (This corresponds roughly to liveliness properties
in distributed algorithms, and Eppstein's CA classification
"Contraction impossible").
If we have only reducing, or if reducing prevails,
then we get an incoming wave, which ensures stability.
(This corresponds to safety property in distributed
algorithms, and Eppstein's CA classification
"Expansion impossible").

If we have both mapping and reducing, expansion and
contraction, we get all kinds of interesting complex
patterns. For instance if we embed a 2-dimension
manifold in a 3-dimensional space we get many
interesting patterns and strange attractors (Roessler,
Lorenz, etc). I wonder if it is possible to define a fractal
dimension for it in general, or if certain kinds of
"map-reduce" patterns may result in certain kinds
of streams, flows, vortices or whirls?

-J.

----- Original Message -----
From: Giles Bowkett
To: The Friday Morning Applied Complexity Coffee Group
Sent: Thursday, December 16, 2010 1:42 AM
Subject: Re: [FRIAM] hi friam - how do I calculate the fractal dimension of
repetitive text?

[...] So here you have a piece of code which contains another piece of code
highly similar to itself. The inner loop is like a version of the outer loop
copied and then slightly altered. You could imagine an infinite series of
for loops, each with a tiny tweak but otherwise highly similar to the
containing loop. Kind of like how the Koch curve contains an infinite number
of nested Koch curves, in a sense, but the difference is that the Koch curve
is a regular fractal, while this code, if it has fractal dimension,
represents an irregular fractal.

============================================================
FRIAM Applied Complexity Group listserv
Meets Fridays 9a-11:30 at cafe at St. John's College
lectures, archives, unsubscribe, maps at http://www.friam.org