Plone conference 10-26-06: Afternoon sessions – part 2

October 26, 2006 at 5:09 pm (Uncategorized)

3:30 A Needle in a Haystack: Discovering relationships in your content

Ben Saller

Haystack – relates similar content, summarizes incoming info, and visualizes relationships between content.

Language has inherent ambiguity and can be misinterpreted. We figure out meanings by keeping a lot of context. We are efficient at navigating webs of possible meanings & selecting likely interpretations. We are able to recover from errors and reevaluate prior choices.

Computers and language – Google considers search an AI problem. (ha! bonus points for using “grok” in a sentence.)

Haystack1 does probabilistic analysis of individual texts. It looks at a given text, figures out a term frequency distribution, and extracts likely keywords. It relates items by common keywords. This maps well into Plone because we can suggest or automatically apply extracted keywords to content.
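To make the keyword idea concrete for myself, here is a minimal sketch of frequency-based keyword extraction and relating items by shared keywords. This is not Haystack's code: the function names, the tiny stop-word list, and the sample documents are all made up for illustration, and plain word counts stand in for whatever probabilistic analysis Haystack1 really does.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "with", "it"}


def extract_keywords(text, top_n=3):
    """Return the top_n most frequent non-stop-word terms in text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [term for term, _ in counts.most_common(top_n)]


def related_items(docs, top_n=3):
    """Map each pair of document ids to the keywords they share."""
    keywords = {doc_id: set(extract_keywords(text, top_n)) for doc_id, text in docs.items()}
    ids = sorted(keywords)
    pairs = {}
    for i, first in enumerate(ids):
        for second in ids[i + 1:]:
            shared = keywords[first] & keywords[second]
            if shared:
                pairs[(first, second)] = shared
    return pairs


# Made-up sample content, keyed by a hypothetical item id.
docs = {
    "news-1": "Plone search and the Plone catalog index content for search.",
    "news-2": "The catalog stores an index of content; the catalog is fast.",
    "news-3": "Lunch is served in the hotel lobby. Lunch is free.",
}

print(related_items(docs))
```

In a Plone setting the extracted terms would presumably be offered as suggested keywords on the content item, which is the part the talk said maps well.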

Haystack2 was written to contextualize the keyword output of the Haystack1 frequency analysis.

HS2 uses your corpus, your data, to help determine which concepts are vital.
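The talk did not spell out how HS2 uses the corpus, so this is only a stand-in: plain TF-IDF, a common way to let the rest of your content decide which terms matter in one item. Terms that show up everywhere get a low weight and terms concentrated in a few items get a high one. The function names and sample corpus are my own assumptions, not anything from Haystack.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(corpus):
    """Return {doc_id: {term: weight}} using plain TF-IDF (an assumed stand-in)."""
    tokens = {doc_id: tokenize(text) for doc_id, text in corpus.items()}

    # Count how many documents each term appears in.
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))

    n_docs = len(corpus)
    weights = {}
    for doc_id, words in tokens.items():
        counts = Counter(words)
        weights[doc_id] = {
            term: (count / len(words)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return weights


# Made-up two-item corpus: "plone" is everywhere, so it carries no weight.
corpus = {"a": "plone catalog", "b": "plone workflow"}
weights = tf_idf(corpus)
print(weights["a"])
```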

WordNet groups English words into sets of synonyms called synsets. It’s a huge database.

MultiSemCor takes a large text corpus, applies a lot of statistical analysis, and tags it. With it you can figure out how terms are used in context.

With WordNet you can take articles about, say, angels and devils and connect them as both being about spiritual beings.
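A rough sketch of the angels-and-devils idea: walk up "is a kind of" links until two terms meet. The table below is hand-written and hypothetical, standing in for WordNet; the real WordNet hierarchy is far larger and its chains differ from these.

```python
# Hypothetical, hand-written "is a kind of" links standing in for WordNet.
HYPERNYMS = {
    "angel": "spiritual being",
    "devil": "spiritual being",
    "spiritual being": "entity",
    "tractor": "vehicle",
    "vehicle": "entity",
}


def ancestors(term):
    """Return the chain of broader concepts above term, nearest first."""
    chain = []
    while term in HYPERNYMS:
        term = HYPERNYMS[term]
        chain.append(term)
    return chain


def shared_concept(first, second):
    """Return the nearest broader concept both terms fall under, or None."""
    second_chain = set(ancestors(second))
    for concept in ancestors(first):
        if concept in second_chain:
            return concept
    return None


print(shared_concept("angel", "devil"))
print(shared_concept("angel", "tractor"))
```

The nearer the shared concept, the more closely related the two items probably are, which is roughly how articles about angels and devils end up grouped together while an article about tractors does not.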

(Sigh. Another person who makes itty bitty slide graphics. I really need to start sitting up front with the good students.)

This type of system isn’t going to give you 100% accuracy. Maybe 80% or so. What you can get is a top level view of your content’s relationships.

Practical applications – personalization (affinity, matching), coverage (which concepts are underrepresented in your corpus), visualization (hierarchies & clusters).

Issues – if any phase fails it can disrupt later phases. It’s harder to go back and reevaluate old info based on new info (but Haystack will try). WordNet lacks coverage of proper nouns and many domain-specific senses.

——————-

OK another one without any proofing. Did run spell check for once so there should be less pain overall.
