Categorization limitations

by Peter Small


There are several important areas of information that search engines and basic stigmergic systems cannot cope with very well. These are areas marked by ambiguity, high volatility, subjective content, or high rates of technological change, and areas that involve classifying people or services. These areas are a search engine's blind spots.

In these areas, there is a need for multiple levels of categorization and a means of being able to compare conflicting or alternative information.

This calls for a completely different mind set. It requires a system outside of the confining metaphors of database technology and an enhanced form of stigmergy.

Search engine and stigmergic system blind spots

RingWeb and The Open Source Directory are examples of basic stigmergic systems used to create open source directories. They are complementary to algorithmic searches and highly effective in many subject areas. However, there are certain areas of information where they are no more efficient than conventional algorithmic search engines. It is in these problem subject areas that we shall need a more sophisticated form of stigmergic system than those used by RingWeb and The Open Source Directory.

Let's look at a few searches to see what kind of deficiencies are exposed.

If we use Google, for example, to find references to 'Javascript', it produces a list of 850,000 URLs. To make a similar search with WebRings, we go through four levels of classification to reach JavaScript: Computers & Internet > Internet > Web Design > JavaScript

This will present us with 16 rings covering 175 sites. This is far more efficient than Google, giving us only 175 sites to consider. The rings can be searched independently for key words to cut this list down even further.

The Open Source Directory is even more efficient, as it has a hierarchy of editors to monitor and judge the quality of the results. Again going through four levels of classification - Computers: Programming: Languages: JavaScript - we are presented with a selected choice of only 35 sites to look through.
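The four-level route described above can be pictured as a walk down a nested structure. The following is only a minimal sketch: the `directory` layout and the `walk` helper are hypothetical, and only the category path and the count of 35 sites come from the text.

```python
# Hypothetical miniature of a hierarchical directory. Only the category path
# and the figure of 35 sites are taken from the article; the data layout is
# an assumption made for illustration.
directory = {
    "Computers": {
        "Programming": {
            "Languages": {
                "JavaScript": {"sites": 35},
            },
        },
    },
}


def walk(tree, path):
    """Follow a category path level by level and return the node reached."""
    node = tree
    for level in path:
        node = node[level]  # raises KeyError if the category does not exist
    return node


path = ["Computers", "Programming", "Languages", "JavaScript"]
result = walk(directory, path)
print(len(path), "levels lead to", result["sites"], "sites")
```

A lookup of this kind only works when the subject has a place in the hierarchy; a path with no matching category simply fails.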

Now let's try searches in a more complex area of information. Suppose you had a friend who'd been diagnosed with rectal cancer with metastasis (spreading to other parts of the body). Suppose you'd been asked to see if you could find a treatment trial she could take part in to increase her chances of recovery.

Using Google to search for 'Cancer' produces 12,700,000 references.

You can cut this down through the sensible use of key words. For example:

Searching for:

rectal cancer produces 127,000 references

rectal cancer treatment produces 87,100 references

rectal cancer treatment trials produces 21,000 references

rectal cancer with metastasis treatment trials produces 3,120 references

rectal cancer with metastasis treatment trials UK produces 625 references

Even here, though, the number of references is still huge, and there is no way to cross reference them or make comparisons. You cannot be sure you've chosen the correct key words and you cannot know what information is missing. There is also the problem of what other factors have to be taken into consideration, and whether or not you will be able to understand some of the highly technical papers you will come across.
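To see how much each added key word narrows the search, the counts quoted above can be compared step by step. This is a small illustrative sketch: the figures are those given in the text, while the `reduction_factors` helper is a hypothetical name introduced here.

```python
# Reference counts as quoted in the text for each successive query.
counts = [
    ("cancer", 12_700_000),
    ("rectal cancer", 127_000),
    ("rectal cancer treatment", 87_100),
    ("rectal cancer treatment trials", 21_000),
    ("rectal cancer with metastasis treatment trials", 3_120),
    ("rectal cancer with metastasis treatment trials UK", 625),
]


def reduction_factors(rows):
    """Return how many times smaller each result set is than the previous one."""
    return [prev / cur for (_, prev), (_, cur) in zip(rows, rows[1:])]


# Pair every refined query with the factor by which it shrank the result set.
for (query, n), factor in zip(counts[1:], reduction_factors(counts)):
    print(f"{query!r}: {n:,} references ({factor:.1f}x fewer)")
```

The first refinement cuts the list by a factor of 100, but later key words each remove far less, and the final list still holds 625 references to read through.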

You'd expect better search results from The Open Directory and WebRings, because their results are selected and edited by humans. However, what you find is that The Open Directory does not have a hierarchical route to a "Cancer" category.

The situation isn't much better in the 'Web Rings' Directory either. There you'll find only a three-level hierarchy path (Health: Conditions and Diseases: Cancer), which leads you to rings that list 1,581 references but cover only 18 of the 420 different types of cancer - none of them dealing specifically with rectal cancer.

What has gone wrong with the stigmergic systems? Why didn't they perform as well as they did with Javascript?

The reason is that Javascript is one of the less complicated subject areas to classify. The stigmergic systems used by The Open Source Directory and WebRings could easily sort out the wheat from the chaff and present you with a choice selection of hand picked reference sources.

However, when it comes to something like cancer, where there are 420 different types, each of which has many variations, there are simply too many variables to take into consideration to conduct a reasonably efficient search. This causes the system to break down completely.

In summary, then, there are certain areas of information that cannot be handled efficiently by either a conventional search engine or a simple stigmergic system. This would be where:

1) there are too many variables to be classified in a hierarchical system.

or

2) the content is highly subjective or technically complex.

Figure 1 - There are some areas of information where search engines cannot produce efficient search results

Clearly, there is no evidence of a satisfactory way of searching through subject matter like cancer - certainly not by using conventional methods and techniques. Even stigmergic systems seem to be hopeless for dealing with the problem.

Directories and indexed lists would be helpful but, when you start adding levels to a hierarchy (equivalent to adding a new key word), the size of the directory starts to increase exponentially. This imposes a practical limit on the extent of the categorization.
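The exponential growth is easy to check with a little arithmetic. The sketch below assumes, purely for illustration, a uniform hierarchy in which every category splits into the same number of subcategories; the branching factor of 10 and the `leaf_categories` helper are assumptions, not figures from any real directory.

```python
def leaf_categories(branching, levels):
    """Number of bottom-level categories in a uniform hierarchy.

    Each of the `levels` steps multiplies the count by `branching`,
    so the total is branching ** levels.
    """
    return branching ** levels


# Assumed branching factor: 10 subcategories under every category.
for levels in range(1, 7):
    print(f"{levels} levels -> {leaf_categories(10, levels):,} bottom-level categories")
```

Under this assumption, four levels already require 10,000 bottom-level categories and six levels require a million, each of which someone has to create and maintain.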

The answer lies in a completely different approach: using the techniques that have evolved in nature to solve similar problems of selection within a vast range of possibilities.

The solution is to turn the idea of a hierarchy on its head. Start with the classifications at the bottom of a hierarchy and then 'grow' the hierarchy upwards. This may sound crazy, but it is the ingenious and elegantly simple solution that nature has come up with.
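One possible reading of 'growing' a hierarchy upwards can be sketched as follows. This is only an illustration under stated assumptions, not the system the article goes on to describe: the `parent_of` links, the category names in them, and the `grow_path` helper are all hypothetical. The idea shown is that each classification records only the group immediately above it, and the full path emerges by climbing those links from the bottom.

```python
# Hypothetical bottom-up links: each category names only the group directly
# above it. Nobody designs the whole tree in advance.
parent_of = {
    "rectal cancer trials": "rectal cancer",
    "rectal cancer": "cancer",
    "cancer": "health",
}


def grow_path(leaf, links):
    """Climb from a bottom-level category to the top, one link at a time."""
    path = [leaf]
    seen = {leaf}
    while path[-1] in links:
        parent = links[path[-1]]
        if parent in seen:  # guard against accidental circular links
            break
        path.append(parent)
        seen.add(parent)
    return path


print(" < ".join(grow_path("rectal cancer trials", parent_of)))
```

In this sketch a new bottom-level classification can be added with a single link, without first deciding where it belongs in a complete top-down scheme.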