Wednesday, November 18, 2009

Discovering Content by Mining the Entity Web

I had a blast last night presenting to CS students at the University of Washington. For those who missed the talk, the video is embedded below.

Abstract: Unstructured natural language text found in blogs, news, and other web content is rich with semantic relations linking entities (people, places, and things). At Evri, we are building a system that automatically reads web content much the way humans do. The system can be thought of as an army of 7th grade grammar students armed with a really large dictionary. The dictionary, or knowledge base, consists of relatively static information mined from structured and semi-structured public repositories such as Wikipedia, Crunchbase, and Amazon. This knowledge base is in turn used by a highly distributed search and indexing infrastructure to perform a deep linguistic analysis of many millions of documents, culminating in a large set of semantic relationships that express clause-level subject-verb-object (SVO) relationships. This highly expressive, exacting, and scalable index makes possible a new generation of content discovery applications.

The slides used in the talk are available HERE. It is best to download the slides and follow along, as they are difficult to see in the video. The talk is broken up into multiple parts; Part 1 is shown below. Links to all parts are as follows: Part 1, Part 2, Part 3, Part 4, Part 5, and Part 6.

Monday, November 09, 2009

Automated Action Based Content Generation

I am republishing this article in its entirety. The original was published on the Evri blog HERE.


I'm excited to announce The Attack Machine, an exploration site available via Evri's experimental Garden, which harnesses the power of the Evri API to automatically generate content oriented around an action, or verb.

Content on The Attack Machine is automatically generated leveraging Evri's deep linguistic analysis of news, blog and other web content. The Attack Machine is an example application highlighting the power of leveraging a grammatical clause level understanding of verbs. The entire Attack Machine site could be straightforwardly mapped to other verbs, such as love, hate, punch, cherish, destroy, etc.

Content on The Attack Machine homepage is generated by an algorithm which runs every few minutes and looks for the top attackers and victims in particular categories such as animals, locations, weapons, and politicians.
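As a rough illustration of the kind of ranking described above, here is a minimal sketch that counts how often each entity appears as an attacker within each category and keeps the most frequent ones. This is not Evri's algorithm: the function name `top_by_category`, the `(category, entity)` input shape, and the sample data are all assumptions made up for this example.

```python
from collections import Counter, defaultdict

def top_by_category(pairs, n=3):
    """Rank entities by frequency within each category.

    pairs: iterable of (category, entity) tuples, one per observed attack.
    Returns {category: [(entity, count), ...]} with at most n entries each.
    """
    counts = defaultdict(Counter)
    for category, entity in pairs:
        counts[category][entity] += 1
    return {cat: ctr.most_common(n) for cat, ctr in counts.items()}

# Made-up sample data: one tuple per attack mention, attacker side only.
sample_attacks = [
    ("animals", "alligator"),
    ("animals", "shark"),
    ("animals", "alligator"),
    ("politicians", "politician_a"),
]

top_attackers = top_by_category(sample_attacks, n=2)
print(top_attackers["animals"])
```

The same function could be run a second time over the victim side of each attack to produce the top victims per category.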

Evri's system can be thought of as an army of 7th grade grammar students armed with a really large dictionary, or knowledgebase. These 7th grade grammar student algorithms scour news, blog, and other web content and break every sentence down into its key grammatical parts: the subject, the verb, and the object. The attackers on The Attack Machine are, in essence, the grammatical subjects; the verb is always attack or a related verb such as kill, assault, or maim; and the victims are the grammatical objects. So, for example, if a blog post contains a simple sentence or title like "Israel attacks Hamas.", the post will appear on the attacker page for Israel and on the victim page for Hamas.
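The mapping from subject, verb, and object to attacker and victim pages can be sketched as follows. This is only an illustration under stated assumptions, not Evri's implementation: the `Relation` type, the `ATTACK_VERBS` set (seeded with the verbs named in this post), and the `index_relations` function are hypothetical, and the relations are assumed to have been extracted from the text already.

```python
from collections import defaultdict
from typing import NamedTuple

# Hypothetical verb set, seeded with the examples named in this post.
ATTACK_VERBS = {"attack", "kill", "assault", "maim"}

class Relation(NamedTuple):
    subject: str  # grammatical subject
    verb: str     # verb, assumed already reduced to its base form
    obj: str      # grammatical object
    post: str     # title or identifier of the source post

def index_relations(relations):
    """Group posts into attacker pages (by subject) and victim pages (by object)."""
    attacker_pages = defaultdict(list)
    victim_pages = defaultdict(list)
    for rel in relations:
        if rel.verb in ATTACK_VERBS:
            attacker_pages[rel.subject].append(rel.post)
            victim_pages[rel.obj].append(rel.post)
    return attacker_pages, victim_pages

relations = [
    Relation("Israel", "attack", "Hamas", "Israel attacks Hamas."),
    Relation("Israel", "praise", "Hamas", "An unrelated post"),
]
attacker_pages, victim_pages = index_relations(relations)
print(attacker_pages["Israel"])
print(victim_pages["Hamas"])
```

Swapping `ATTACK_VERBS` for a different verb set is all it would take, in this sketch, to build the equivalent pages for another verb such as love or destroy.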

In a blog post titled Evri's Garden Sprouts Some Search I discuss in detail the underlying search mechanism our scientists and engineers use to power higher level API and application functionality. One of the key higher level API functionalities The Attack Machine leverages is an API resource called Get relations about an entity. This API resource is used, for example, to populate the center column in the individual attack pages such as this one on alligator attacks:

and more specifically, the following REST API call is used:

This API call allows us to automatically identify the key people, places, organizations and things involved in attacks, in addition to getting the articles which correspond to the latest attacks. Now if the user clicks on a specific person, place, or thing, the following API call is used:

So in this way, we can populate the data for all attacks by alligators in Florida.

Finally, The Attack Machine leverages the Evri API to generate unique natural language content, which benefits the reader and also helps page SEO. For example, the first and third paragraphs in the first column of the screenshot above are formulated automatically from API output. In addition, the Evri knowledgebase is leveraged to minimize human editorial contribution. For example, the second paragraph above is written by an editor and linked to one of three handles: a specific entity; a narrow category for an entity, such as animal or politician; or a higher level category, such as person, organism, or location. Simple logic is then applied at page generation time to select the natural language content: use the entity handle if one exists; if not, use the narrow category handle; and if that does not exist either (we have thousands of narrow categories, so not all of them have editorial text), fall back to the higher level category handle, of which there are only a handful.
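The selection logic described above amounts to a most-specific-first lookup. A minimal sketch follows, assuming editorial text is stored in a dictionary keyed by a (level, name) pair; the function name `select_editorial_text`, the key format, and the placeholder strings are hypothetical and not part of the Evri API.

```python
def select_editorial_text(entity, narrow_category, broad_category, handles):
    """Return the most specific editorial text available, or None.

    Lookup order: entity handle, then narrow category handle,
    then higher level category handle.
    """
    lookup_order = (
        ("entity", entity),
        ("narrow", narrow_category),
        ("broad", broad_category),
    )
    for key in lookup_order:
        if key in handles:
            return handles[key]
    return None

# Placeholder editorial text; no entity-level handle exists for "alligator".
editorial_handles = {
    ("narrow", "animal"): "Placeholder text about animal attacks.",
    ("broad", "organism"): "Placeholder text about attacks by organisms.",
}

print(select_editorial_text("alligator", "animal", "organism", editorial_handles))
```

Because the higher level categories are few, writing text for each of them guarantees that every page gets some editorial paragraph, while more specific text takes over wherever an editor has written it.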

That's all for now. If you have any questions on how to use the Evri API for similar applications, or any other feedback, please let us know on our API forum.