New language auto-detection over Blogs

We are pleased to announce that improved language detection for blogs in the UGC Metabase will launch in two weeks. We’re also introducing new blog lists sorted by language, so you can see all the English, French, German, Chinese and other blogs in our index.

We’re also adding a new date field showing the time we indexed a particular post. This is in addition to the publish date already provided, which is copied from the original XML/RSS feed.

 

1. Improved language detection at post level 

Blog feeds normally state which language they are in. However, this isn’t always reliable: blog publishing platforms typically have a default language setting, and bloggers do not always update it to reflect the language they actually write in. As a result, a significant portion of blog feeds declare the wrong language.

We’ve been working hard in the background to produce a more reliable approach to language detection. We’ll be rolling this out next month as the basis for setting the post’s language, as provided in the <language> tag. Only when this approach cannot confidently determine the language will we fall back to the language tag provided in the original XML.
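The fallback rule described above can be sketched in a few lines of Python. This is only an illustration, not our production code: the function name, the confidence score and the 0.9 threshold are all assumptions made for the example, since the detector's actual scoring is not described here.

```python
from typing import Optional

# Hypothetical cut-off; what counts as "confident" is an assumption for this sketch.
CONFIDENCE_THRESHOLD = 0.9


def choose_post_language(
    detected: Optional[str],
    confidence: float,
    declared: Optional[str],
    threshold: float = CONFIDENCE_THRESHOLD,
) -> Optional[str]:
    """Prefer the detected language; fall back to the feed-declared one.

    detected   -- language guessed from the post text (None if no guess)
    confidence -- detector's confidence in that guess, assumed to be in 0.0..1.0
    declared   -- language stated in the original XML/RSS feed (may be None)
    """
    if detected is not None and confidence >= threshold:
        return detected
    # Detection was not confident enough: use the original feed's language.
    return declared
```

With these assumed values, a post detected as French at 0.97 confidence in a feed declared as English would be tagged French, while the same guess at 0.40 confidence would keep the feed's English.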

 

2. New language tagging at feed level

In addition, we are adding a new <feedLanguage> tag showing the language of the blog feed. This complements the existing <language> tag described above, which applies at post level.

Adding language categorisation at feed level makes it possible to organise the index better by language. For example, we can identify exactly which blogs are in French, which are in English, and so on, and provide and manage these as lists.

The new language tag will appear in the UGC XML as follows:

<feedLink>http://blog.moreover.com/feed/</feedLink> 
<feedLanguage>English</feedLanguage>
<generator>http://wordpress.org/?v=MU</generator>
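For customers who want to pick up the new tag, here is a minimal Python sketch that reads <feedLanguage> from the fragment above using the standard library. The <source> wrapper element is an assumption made so the fragment parses as well-formed XML; the actual parent element in the UGC XML is not shown in this post.

```python
import xml.etree.ElementTree as ET

# The three tags exactly as shown in the example above.
fragment = """
<feedLink>http://blog.moreover.com/feed/</feedLink>
<feedLanguage>English</feedLanguage>
<generator>http://wordpress.org/?v=MU</generator>
"""

# <source> is a hypothetical wrapper used only to make the fragment parseable.
root = ET.fromstring(f"<source>{fragment}</source>")


def feed_language(element, default=None):
    """Return the text of <feedLanguage>, or default if the tag is absent or empty."""
    text = element.findtext("feedLanguage")
    if text and text.strip():
        return text.strip()
    return default


print(feed_language(root))
```

Returning a default when the tag is missing keeps the same code working on items delivered before the new tag is rolled out.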

 

3. Introducing a new Harvest Date field

Lastly, we’re adding a new <itemHarvestDate> field to the feed. This gives the time Moreover actually indexed the item. We already pass on the publish date of the post, as provided in the original XML/RSS feed. The new harvest date complements that tag and can provide, for example, additional information about indexing latency across the feeds.

The new harvest date tag will appear in the UGC XML as follows:

<pubDate>2009-02-11 14:26:06.0</pubDate>
<itemHarvestDate>2009-03-13 18:38:21.0</itemHarvestDate>
<validDate>2009-03-13 18:37:18.0</validDate>

All times are shown in GMT.
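As an illustration of the latency use case, the sketch below computes the gap between <pubDate> and <itemHarvestDate> in Python. The function names are hypothetical, the timestamp format is inferred from the example values above, and GMT is treated as UTC.

```python
from datetime import datetime, timedelta, timezone

# Format inferred from the example values, e.g. "2009-03-13 18:38:21.0".
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_gmt(value: str) -> datetime:
    """Parse a UGC XML timestamp and mark it as GMT (treated here as UTC)."""
    return datetime.strptime(value.strip(), TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def indexing_latency(pub_date: str, harvest_date: str) -> timedelta:
    """Time elapsed between the publish date and the harvest date."""
    return parse_gmt(harvest_date) - parse_gmt(pub_date)


# Values taken from the example tags above.
latency = indexing_latency("2009-02-11 14:26:06.0", "2009-03-13 18:38:21.0")
print(latency.days, "days,", latency.seconds, "seconds")
```

Because the publish date comes from the blog's own feed, it may be inaccurate, so a latency computed this way can occasionally be very large or even negative.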

 

We believe in being open and transparent about our crawling performance, and are confident about our technology. We invite comparison with other, similar services (for example, see Technorati and a recent comment on ReadWriteWeb), and welcome any feedback you, as customers and users, have.


February 19, 2009 Brian Mackie

Filed under: aggregation,metadata,Moreover Technologies,Products,social media,Social Media Metabase




