
Yahoo Content Analyser Anomaly

The other day, while ranting about SEO killing creative writing, I created the following piece, which I used as sample text.

Vailankanni: Beggars & Pilgrims
Vailankanni, also known as Velanganni, is in Tamil Nadu in India.
Vailankanni is famous for being a pilgrim center.
The Virgin Mary has been seen here three times.
Pilgrims come from all over India to pray at the chapel of "Our Lady of Health Vailankanni"
Early in the morning you can hear the bells call the faithful to their Catholic rites.
Imagine the scene as the pilgrims bathe themselves in the sea before the sun comes up.
In Vailankanni there are many beggars who prey on the pilgrims.
There are also stalls selling holy souvenirs and fish fryers on the right of the beach road.

I was using the Yahoo Content Analyser to extract key terms from the "Vailankanni" text, and while looking at the XML output I noticed that it looked incomplete - not in a malformed-XML kind of way, but in a content kind of way.

<?xml version="1.0" encoding="UTF-8" ?>
< ResultSet xmlns:xsi="http://www >
<Result>india</Result>
<Result>fish fryers</Result>
<Result>pilgrims</Result>
<Result>tamil nadu</Result>
<Result>beggars</Result>
<Result>pilgrim center</Result>
<Result>virgin mary</Result>
<Result>catholic rites</Result>
<Result>bathe</Result>
<Result>souvenirs</Result>
<Result>prey</Result>
<Result>bells</Result>
<Result>imagine</Result>
<Result>fish</Result>
</ResultSet>

Now the odd thing about this XML file is that there is no mention of the keyword "Vailankanni" - which is strange, as it is the highest-density keyword (37.50%) in the entire text. It should be in there somewhere, bracketed by <Result> tags, but it isn't.
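To make the check repeatable rather than eyeballing the XML, here is a minimal Python sketch that parses a response like the one above and looks for a given keyword. The ResultSet opening tag is truncated in my listing, so the copy below leaves the namespace attribute and the XML declaration out; that cleaned-up copy, and the helper names extracted_terms and contains_term, are my own assumptions for illustration, not part of the Yahoo service.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the response shown above. The xmlns attribute is
# truncated in the original listing, so it is omitted here (assumption).
RESPONSE = """<ResultSet>
<Result>india</Result>
<Result>fish fryers</Result>
<Result>pilgrims</Result>
<Result>tamil nadu</Result>
<Result>beggars</Result>
<Result>pilgrim center</Result>
<Result>virgin mary</Result>
<Result>catholic rites</Result>
<Result>bathe</Result>
<Result>souvenirs</Result>
<Result>prey</Result>
<Result>bells</Result>
<Result>imagine</Result>
<Result>fish</Result>
</ResultSet>"""


def extracted_terms(xml_text):
    """Return the text of every <Result> element, lower-cased."""
    root = ET.fromstring(xml_text)
    return [el.text.strip().lower() for el in root.iter("Result") if el.text]


def contains_term(keyword, terms):
    """True if the keyword appears in any extracted term, including
    inside a multi-word phrase such as 'pilgrim center'."""
    keyword = keyword.lower()
    return any(keyword in term for term in terms)


if __name__ == "__main__":
    terms = extracted_terms(RESPONSE)
    for keyword in ("Vailankanni", "pilgrims", "beggars"):
        status = "found" if contains_term(keyword, terms) else "MISSING"
        print(f"{keyword}: {status}")
```

The substring match is deliberate: it would still catch the keyword if the service had returned it as part of a longer phrase.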

The Yahoo Content Analyser does find the next-highest-density keyword, "pilgrims" (32.25%), and the third-highest, "beggars" (18.75%). I'm not concerned with ordering or density when using Yahoo like this - just with the keywords extracted by the content analysis service.
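For anyone who wants to sanity-check the ranking themselves, here is a rough Python sketch that counts how often each word occurs in the sample text. It assumes density means a word's share of all words in the text, which is only one possible definition; the percentages quoted above came from a different tool, so don't expect this to reproduce them exactly. The function names are hypothetical.

```python
import re
from collections import Counter

# The sample text from the top of this post.
SAMPLE = """Vailankanni: Beggars & Pilgrims
Vailankanni, also known as Velanganni, is in Tamil Nadu in India.
Vailankanni is famous for being a pilgrim center.
The Virgin Mary has been seen here three times.
Pilgrims come from all over India to pray at the chapel of "Our Lady of Health Vailankanni"
Early in the morning you can hear the bells call the faithful to their Catholic rites.
Imagine the scene as the pilgrims bathe themselves in the sea before the sun comes up.
In Vailankanni there are many beggars who prey on the pilgrims.
There are also stalls selling holy souvenirs and fish fryers on the right of the beach road.
"""


def word_counts(text):
    """Count lower-cased alphabetic words; return (Counter, total words)."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words), len(words)


def density(word, text):
    """Share of all words taken up by `word`, as a percentage.
    Assumed definition: occurrences / total word count * 100."""
    counts, total = word_counts(text)
    return 100.0 * counts[word.lower()] / total if total else 0.0


if __name__ == "__main__":
    counts, total = word_counts(SAMPLE)
    for word in ("vailankanni", "pilgrims", "beggars"):
        print(f"{word}: {counts[word]} of {total} words "
              f"({density(word, SAMPLE):.2f}%)")
```

Whatever definition of density you prefer, the relative order of these three keywords comes out the same in this count, which is all that matters for the argument here.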

The strange case of the missing "Vailankanni" from the results makes me wonder about the accuracy of other web sites using the Yahoo REST services.

For example, Tag Cloud uses the Yahoo content extraction service to automagically tag pages and build tag clouds, using automatic semantic extraction to build tags for the description fields.

If Yahoo is leaving out a 37.50% keyword when constructing the results, then anything we build on top of these results is like a castle built on sand - shaky foundations with a guarantee of eventual collapse.

I'd like to hear from other people who have played with the Yahoo content extraction REST service - do you have this problem? What is the problem? Anyone got any ideas?

Sure as hell I haven't .. and my enquiring mind wants to know.

