Author:
Thomas Hansen
polterguy@hotmail.com
Date: 28. June 2002
What is a genetic tree?
A question difficult to answer without doing lots of arm-waving and "you know
what I mean?" stuff...
But a Genetic Tree Structure might be defined as an n'node Tree Structure with
Binary Tree capabilities!
When I first thought of genetic trees it was to solve a specific problem and
that problem was to syntax check an Edifact file. An Edifact file is a
relational file (XML is a relational file structure) **WITHOUT!** closing tags.
The fact that they don't have closing tags raises a lot of problems, e.g. how
do I know which level to go to when I am finished with a segment group(*1)?!?
I am NOT going to explain the exact approach I had to take to solve the Edifact
problem, but I AM going to explain how I solved the "linear/relational"
problem...
The way I solved it was by coming up with a tree structure that could be
presented both relational like a tree, and linearly like a list. To do this I
took an n'node tree structure and added up linear information. This could have
been done in several ways, I could have used plain index numbers for instance,
but I wanted to have more power and flexibility then that approach.
The solution I came up with was to assign a genetic code to the
ROOT NODE (Alpha Node) which represented some kind of "minimum value" of
possible combinations of genetic sequences, I choose the ASCII character 'a'.
Then when I added a child to the Alpha Node, I took the 'a' and added up
another 'a' so that the genetic code for the Alpha Node's first child became
'aa', then for every node I added as a child to the Alpha Node I increased the
last character of the DNA sequence by one (Alpha's second child has 'ab',
alpha's third child has 'ac' etc...
The whole point is that a child to any given node inherits it's parent node's
DNA and adds up the character representing it's number in it's siblings flock,
e.g. Alpha's second child's third child will get 'abc', se illustration 1.1.
1.1
In the figure the first part of the string is the explaining text, the number
in parentheses is the index number of the given node, and the DNA= xx is the
DNA sequence of the given node. (yes I know I forgot the parentheses at the
deepest level of the illustration...Opps! ;) too lazy to fix)
If we explore the possibilities of a Genetic Tree Structure for a while we will find many really interesting traits that can be exploited by the implementer:
Especially the last trait gives us interesting possibilities, if you can compare
two DNA sequences to see which one is before the other then you can Binary
Partition the tree structure up into a "before this node part" and an "after
this node part"...
You get a fully-fledged Binary Tree Structure without restricting the number of
nodes in each level (plain English:) "You can do a 'binary' search into a tree
where each node may have an 'unlimited' number of children...
The observant reader may be suspicious to the last statement since there are a
limited number of characters from a-z (26 to be exact) which will restrict the
number of siblings a node can have to 25, but really, you don't need to use
character sequences as DNA building blocks, I have merely used them here for
illustration purposes. (it's easier to explain with a string-compare)
If you need a DNA sequence with a higher resolution, you can use e.g. a WORD
structure giving you the possibility to have each node having up to 65 535
children nodes, or even a DOUBLE WORD giving you more then 4 billion
children....
Mark also one very nice feature, you don't have to insert nodes according to
their values like you have to do in a Binary Tree Structure.
Plus the lookup function will go FASTER then in a Binary Tree (as long as you
have the genetic code for what you are looking for) because a Binary Tree will
probably have a much deeper depth then a Genetic Tree and the lookup into which
children node you're going to is just an offset into an array of nodes, off
course you MUST have the genetic code for what you are looking for or the
lookup will have to traverse potentially the WHOLE TREE to find what it's
looking for...
Fact is I believe they in practise for tree structures with very many siblings may outperform a hash table, but that's NOT the point, the point is that in a Genetic Tree Struct you have data stored BOTH linearly AND relationally!!!
Ok, this is quite awesome, what's the catch?!?
Yes there is a catch, not one but MANY indeed, although I haven't researched
all of the possible scenarios what's pretty clear to me is that if you generate
the DNA key on the fly while inserting (as opposed to after you have a complete
tree) you will have difficulties to insert nodes in-between other sibling
nodes...
Why is that so?!?
Because you would invalidate ALL DNA sequences to the 'younger' siblings and to
the children to the 'younger' siblings too meaning that you would either get a
corrupt tree structure or you would need to traverse the tree updating all DNA
sequences to the 'younger' siblings and their children etc... (claims a lot of
resources)
e.g. if you insert at 'acb' and you have a node 'acb' from before and an 'acc'
node you would invalidate the DNA sequence to the former 'acb', 'acc', all of
'acb''s children/grandchildren etc... and the same for 'acc'... (MARK! The
sequence up till 'aca' or the sequence from 'ab' and out would NOT be affected)
However there are several workarounds this problem, you can either restrict the
insertion points to consequently being a 'back_insert_child()' /
'insert_youngest_sibling()' which would completely eliminate the problem by
narrowing the possibilities, or you could generate the DNA sequence for the
tree AFTER you are done inserting nodes through some kind of
'validate_DNA_for_tree()' function.
There are other bad scenarios too, but most of them have a foundation in the
fact of that the insertion point is either restrictive or potentially
expensive...
MARK!
Only the 'Family Tree' of the 'younger siblings' of the insertion point will be
affected by an insert...
What are they good for?!?
As I said earlier, if you need to be able to have data (works best with
'static' or 'read once' data) in BOTH a relational and a linear representation,
Genetic Trees are the way to go.
E.g. take an XML Schema file for instance, you read it once in some kind of
'read_schema()' function and you need to know which node is a children to which
other node (relational part), plus you need to know which node comes before
another node (linear part) because you must have XML tags in a certain order
for the XML document to be valid against the Schema.
But I have only used them in the Edifact domain, so I wouldn't have a clue
about the applications they could be of use in other then the ones mentioned.
But if you need to represent data in both a linear and a relational way without
managing duplicated data structures, Genetic Trees are the way to go.
(*1) A segment group in an Edifact file is "ONE LEVEL" of segments where a segment is "one tag" containing data specific to that tag, e.g. the "BODY" tag of an HTML document is one level in an HTML document even though the BODY tag can contain inner levels e.g. a "P" element (P=paragraph) or an "I" element (I=Italics)
I have in fact implemented a Template Container Class for manipulating Genetic
Tree Structures, I haven't been working too much on the classes, roughly about
24 hours of programming, but I think they are versatile and stable enough to
get the idea, and maybe to do a sample implementation of an "n'node tree
structure with Binary Tree capabilities".
They are implemented in C++, built and tested with Microsoft Visual Studio 7.0.
(I really missed partial template specialization)
But I think they will compile in most of todays compilers.
They use the "iterator design pattern" (STL way), and allthough I have tried it
with a couple of the std algorithms, I guess they will tremble upon some of
them, especially since the end iterator is weakly implemented, good ideas or
refactoring of the end iterator is greatly appreciated!! (it's a special value
today)
All though they are implemented in C++, I think most OOP capable programmers
with some "iterator knowledge" and basic knowledge about generic containers
will be able to port it to their favourite language...
Download sample Generic Genetic Tree Template library for
C++ (gen_tree.h) and sample use of gen_tree.h (gen_tree_tester.cpp) contained
in 'gen_tree.zip'
If you would like to comment about Genetic Tree Structures, my implementation
of them(gen_tree.h), my bad English or anything else presented in this article,
please don't hesitate to mail me at polterguy@hotmail.com
BTW!
If you're looking for a very good template based C++ Windows API GUI library have a look at my main project SmartWin
And if you're interested in other things I'm thinking about at the moment, check out my blog HERE or my current GUI library here... ;)
Thomas Hansen
"By the time we've finished with him, he won't know whether he's Number Six or the cube root of infinity!"