<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/rss2full.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/itemcontent.css" type="text/css" media="screen"?><!-- generator="wordpress/2.1.2" --><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>AI Articles</title>
	<link>http://ai-depot.com/articles</link>
	<description>AI Depot</description>
	<pubDate>Fri, 29 Aug 2008 16:41:32 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.1.2</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/AiArticles" type="application/rss+xml" /><feedburner:emailServiceId>828911</feedburner:emailServiceId><feedburner:feedburnerHostname>http://www.feedburner.com</feedburner:feedburnerHostname><item>
		<title>Artificial Intelligence in Games</title>
		<link>http://feeds.feedburner.com/~r/AiArticles/~3/156120896/</link>
		<comments>http://ai-depot.com/articles/artificial-intelligence-in-games/#comments</comments>
		<pubDate>Thu, 13 Sep 2007 21:01:47 +0000</pubDate>
		<dc:creator>alexjc</dc:creator>
		
		<category><![CDATA[review]]></category>

		<guid isPermaLink="false">http://ai-depot.com/articles/artificial-intelligence-in-games/</guid>
		<description><![CDATA[There&#8217;s a new site on Game AI called AiGameDev.com (feed).  It features daily articles in a blog-like format, including reviews, editorials, and tutorials, not forgetting regular community discussions on game AI.
Here are some interesting posts made recently.
Machine Learning in Games
Game developers are increasingly keen to try ML techniques, but it does take some know-how [...]]]></description>
		<content:encoded><![CDATA[<p>There&#8217;s a new site on <a href="http://aigamedev.com/">Game AI</a> called <tt>AiGameDev.com</tt> (<a href="http://aigamedev.com/feed/">feed</a>).  It features daily articles in a blog-like format, including reviews, editorials, and tutorials, not forgetting regular community discussions on game AI.</p>
<p>Here are some interesting posts made recently.</p>
<h3>Machine Learning in Games</h3>
<p>Game developers are increasingly keen to try ML techniques, but it does take some know-how and experience.</p>
<ul>
<li><a href="http://aigamedev.com/architecture/learn-realistic">The Secret to Building Game AI that Learns Realistically</a></li>
<li><a href="http://aigamedev.com/design/alternatives-online-learning">Alternatives to Online Learning for Actor Behaviors</a></li>
<li><a href="http://aigamedev.com/editorial/beauty-beast">Game AI: Beauty and the Beast</a></li>
</ul>
<h3>Hierarchical Planning &amp; Behavior Trees</h3>
<p>Planners have proven themselves very effective in recent games, and hierarchical planners are the next logical step.</p>
<ul>
<li><a href="http://aigamedev.com/hierarchical-logic/bt-overview">Understanding Behavior Trees</a></li>
<li><a href="http://aigamedev.com/alive/game-ai-technical">A Technical Overview of Game::AI++</a></li>
<li><a href="http://aigamedev.com/alive/game-ai-motivation">The Motivation Behind Game::AI++</a></li>
</ul>
<h3>Conference Coverage</h3>
<p><tt>AiGameDev.com</tt> also has summaries of recent events, very useful for catching up with what went on&#8230;</p>
<ul>
<li><a href="http://aigamedev.com/coverage/aiide-2007-posters">AIIDE &#8216;07 Conference Coverage</a></li>
<li><a href="http://aigamedev.com/coverage/aiide-2007-papers">Pushing the Limits of Game AI Technology</a></li>
<li><a href="http://aigamedev.com/coverage/applyai-roundtable">Apply AI 2007 Roundtable Report</a></li>
</ul>
<p>Be sure to check out the site/blog if you&#8217;re interested in <a href="http://aigamedev.com/">game AI</a>!</p>
]]></content:encoded>
			<wfw:commentRss>http://ai-depot.com/articles/artificial-intelligence-in-games/feed/</wfw:commentRss>
		<feedburner:origLink>http://ai-depot.com/articles/artificial-intelligence-in-games/</feedburner:origLink></item>
		<item>
		<title>More AI Content &amp; Format Preference Poll</title>
		<link>http://feeds.feedburner.com/~r/AiArticles/~3/109714323/</link>
		<comments>http://ai-depot.com/articles/more-ai-content-format-preference-poll/#comments</comments>
		<pubDate>Tue, 17 Apr 2007 09:59:39 +0000</pubDate>
		<dc:creator>alexjc</dc:creator>
		
		<category><![CDATA[announcement]]></category>
<category>content</category><category>knowledge</category><category>poll</category><category>web 2.0</category>
		<guid isPermaLink="false">http://ai-depot.com/articles/more-ai-content-format-preference-poll/</guid>
		<description><![CDATA[Most of my time these days is spent updating the artificial intelligence knowledge part of the website.  Two brand-new overviews of AI techniques have been published recently, complete with pseudo-code and graphics:

Minimax Search
Decision Tree

You can subscribe to articles like these by signing up for the combined artificial intelligence feed.  Be sure to let [...]]]></description>
		<content:encoded><![CDATA[<p>Most of my time these days is spent updating the <a href="http://ai-depot.com/knowledge/">artificial intelligence knowledge</a> part of the website.  Two brand-new overviews of AI techniques have been published recently, complete with pseudo-code and graphics:</p>
<ul>
<li><a href="/knowledge/minimax_search">Minimax Search</a></li>
<li><a href="/knowledge/decision_tree">Decision Tree</a></li>
</ul>
<p>You can <a href="/site/subscribe">subscribe</a> to articles like these by signing up for the combined artificial intelligence feed.  Be sure to let me know (using the form below) if you have a particular topic at heart so we can bump it up the list of things to write about.</p>
<p>On a related note, I&#8217;m exploring ways to liven up this kind of content and drag it kicking and screaming into the world of Web 2.0.  I would appreciate your feedback on the subject:</p>
<div id="polls-2" class="wp-polls">
<form id="polls_form_2" action="/articles/wp-rss2.php" method="post">
<input type="hidden" name="poll_id" value="2" />
<h3>What format do you prefer to complement text articles?</h3>
<div id="polls-2-ans" class="wp-polls-ans">
<ul class="wp-polls-ul">
<li>
<input type="radio" id="poll-answer-6" name="poll_2" value="6" /> <label for="poll-answer-6">Animation (slides)</label></li>
<li>
<input type="radio" id="poll-answer-7" name="poll_2" value="7" /> <label for="poll-answer-7">Audio (podcast)</label></li>
<li>
<input type="radio" id="poll-answer-8" name="poll_2" value="8" /> <label for="poll-answer-8">Interaction (demo)</label></li>
<li>
<input type="radio" id="poll-answer-9" name="poll_2" value="9" /> <label for="poll-answer-9">Video (webcast)</label></li>
</ul>
<p style="text-align: center;">
<input type="button" name="vote" value="   Vote   " class="Buttons" onclick="poll_vote(2);" onkeypress="poll_result(2);" /></p>
<p style="text-align: center;"><a href="#ViewPollResults" onclick="poll_result(2); return false;" onkeypress="poll_result(2); return false;" title="View Results Of This Poll">View Results</a></p>
</div></form>
</div>
<p>Any comments are also welcome on the subject.</p>
]]></content:encoded>
			<wfw:commentRss>http://ai-depot.com/articles/more-ai-content-format-preference-poll/feed/</wfw:commentRss>
		<feedburner:origLink>http://ai-depot.com/articles/more-ai-content-format-preference-poll/</feedburner:origLink></item>
		<item>
		<title>What’s Your Biggest Question about Artificial Intelligence?</title>
		<link>http://feeds.feedburner.com/~r/AiArticles/~3/107397616/</link>
		<comments>http://ai-depot.com/articles/whats-your-biggest-question-about-artificial-intelligence/#comments</comments>
		<pubDate>Sat, 07 Apr 2007 19:15:32 +0000</pubDate>
		<dc:creator>alexjc</dc:creator>
		
		<category><![CDATA[announcement]]></category>
<category>question</category><category>survey</category>
		<guid isPermaLink="false">http://ai-depot.com/articles/whats-your-biggest-question-about-artificial-intelligence/</guid>
		<description><![CDATA[Is there anything you want to know about artificial intelligence?  If so, fill in our AI survey online; it&#8217;s only one question.  You can be as brief or expressive as you like.
As artificial intelligence enthusiasts and developers, we&#8217;re interested in hearing what you have to say.  There are lots of ideas and [...]]]></description>
		<content:encoded><![CDATA[<p>Is there anything you want to know about artificial intelligence?  If so, fill in our <a href="http://ai-depot.com/site/survey">AI survey</a> online; it&#8217;s only one question.  You can be as brief or expressive as you like.</p>
<p>As artificial intelligence enthusiasts and developers, we&#8217;re interested in hearing what you have to say.  There are lots of ideas and content in the pipeline for the <b>AI Depot</b>, but we&#8217;re particularly interested in what you want from us.</p>
<ul>
<li><a href="http://ai-depot.com/site/survey">http://ai-depot.com/site/survey</a></li>
</ul>
<p>Those of you who fill in the questionnaire will get exclusive access to the new content before anyone else!  You also get the satisfaction of knowing you helped out in shaping the future direction of this site.</p>
<p>We&#8217;re looking forward to hearing from you, but be sure to subscribe to our new <a href="http://ai-depot.com/feed/">artificial intelligence feed</a> to hear from us daily!</p>
]]></content:encoded>
			<wfw:commentRss>http://ai-depot.com/articles/whats-your-biggest-question-about-artificial-intelligence/feed/</wfw:commentRss>
		<feedburner:origLink>http://ai-depot.com/articles/whats-your-biggest-question-about-artificial-intelligence/</feedburner:origLink></item>
		<item>
		<title>The Easy Way to Extract Useful Text from Arbitrary HTML</title>
		<link>http://feeds.feedburner.com/~r/AiArticles/~3/107068258/</link>
		<comments>http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/#comments</comments>
		<pubDate>Thu, 05 Apr 2007 17:24:05 +0000</pubDate>
		<dc:creator>alexjc</dc:creator>
		
		<category><![CDATA[tutorial]]></category>
<category>machine learning</category><category>neural network</category><category>python</category><category>scraping</category><category>statistics</category><category>text mining</category>
		<guid isPermaLink="false">http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/</guid>
		<description><![CDATA[
You&#8217;ve finally got your hands on the diverse collection of HTML documents you needed.  But the content you&#8217;re interested in is hidden amidst adverts, layout tables or formatting markup, and other various links.  Even worse, there&#8217;s visible text in the menus, headers and footers that you want to filter out.  If you [...]]]></description>
		<content:encoded><![CDATA[<p><img src="http://ai-depot.com/articles/wp-content/uploads/2007/04/statistics.png" alt="[Statistical Text Mining]" /></p>
<p>You&#8217;ve finally got your hands on the diverse collection of HTML documents you needed.  But the content you&#8217;re interested in is hidden amidst adverts, layout tables or formatting markup, and other various links.  Even worse, there&#8217;s visible text in the menus, headers and footers that you want to filter out.  If you don&#8217;t want to write a complex scraping program for each type of HTML file, there is a solution.</p>
<p>This article shows you how to write a relatively simple script to extract text paragraphs from large chunks of HTML code, without knowing its structure or the tags used.  It works on news articles and blog pages with worthwhile text content, among others&#8230;</p>
<p>Do you want to find out how statistics and machine learning can save you time and effort <a href="http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/">mining text</a>?</p>
<p><a id="more-90"></a></p>
<p>The concept is rather simple: use information about the density of text vs. HTML code to work out if a line of text is worth outputting.  (This isn&#8217;t a novel idea, but it works!)  The basic process works as follows:</p>
<ol>
<li>Parse the HTML code and keep track of the number of bytes processed.</li>
<li>Store the text output on a per-line, or per-paragraph basis.</li>
<li>Associate with each text line the number of bytes of HTML required to describe it.</li>
<li>Compute the text density of each line by calculating the ratio of text to bytes.</li>
<li>Then decide if the line is part of the content by using a neural network.</li>
</ol>
<p>You can get pretty good results just by checking if the line&#8217;s density is above a fixed threshold (or the average), but the system makes fewer mistakes if you use machine learning &#8212; not to mention that it&#8217;s easier to implement!</p>
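<p>As a rough illustration of steps 4 and 5, with a fixed threshold standing in for the neural network, the sketch below assumes each line has already been reduced to a pair of (text, HTML byte count).  The <tt>text_density</tt> and <tt>keep_dense_lines</tt> names and the sample figures are made up for this example.</p>

```python
def text_density(text, html_bytes):
    """Ratio of visible text characters to the HTML bytes that encoded them."""
    if html_bytes == 0:
        return 0.0
    return len(text) / float(html_bytes)


def keep_dense_lines(lines, threshold=0.5):
    """Keep the text of every (text, html_bytes) pair above the threshold."""
    return [text for text, html_bytes in lines
            if text_density(text, html_bytes) > threshold]


# Hypothetical sample: a menu entry wrapped in lots of markup, then a paragraph.
sample = [
    ("Home", 60),
    ("A long paragraph of article text that is mostly words.", 70),
]
kept = keep_dense_lines(sample)
```

<p>Here the menu entry scores well under the threshold and is dropped, while the paragraph is kept.</p>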
<p>Let&#8217;s take it from the top&#8230;</p>
<h3>Converting the HTML to Text</h3>
<p>What you need is the core of a text-mode browser, which is already set up to read files with HTML markup and display raw text.  By reusing existing code, you won&#8217;t have to spend too much time handling invalid HTML documents, which are very common, as you&#8217;ll quickly realise.</p>
<p>As a quick example, we&#8217;ll be using <a href="http://python.org/">Python</a> along with a few built-in modules: <tt>htmllib</tt> for the parsing and <tt>formatter</tt> for outputting formatted text.  This is what the top-level function looks like:</p>

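<div class="wp_syntax"><div class="code"><pre class="python"><span style="color: #7777ff;font-weight:bold;">import</span> <span style="color: #808040;">formatter</span>
<span style="color: #7777ff;font-weight:bold;">import</span> <span style="color: #808040;">htmllib</span>
<span style="color: #7777ff;font-weight:bold;">import</span> <span style="color: #808040;">StringIO</span>
<span style="color: #7777ff;font-weight:bold;">from</span> <span style="color: #808040;">formatter</span> <span style="color: #7777ff;font-weight:bold;">import</span> AbstractFormatter
&nbsp;
<span style="color: #7777ff;font-weight:bold;">def</span> extract_text<span style="color: black;">&#40;</span>html<span style="color: black;">&#41;</span>: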
    <span style="color: #808080; font-style: italic;"># Derive from formatter.AbstractWriter to store paragraphs.</span>
    writer = LineWriter<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #808080; font-style: italic;"># Default formatter sends commands to our writer.</span>
    <span style="color: #808040;">formatter</span> = AbstractFormatter<span style="color: black;">&#40;</span>writer<span style="color: black;">&#41;</span>
    <span style="color: #808080; font-style: italic;"># Derive from htmllib.HTMLParser to track parsed bytes.</span>
    <span style="color: #808040;">parser</span> = TrackingParser<span style="color: black;">&#40;</span>writer, <span style="color: #808040;">formatter</span><span style="color: black;">&#41;</span>
    <span style="color: #808080; font-style: italic;"># Give the parser the raw HTML data.</span>
    <span style="color: #808040;">parser</span>.<span style="color: black;">feed</span><span style="color: black;">&#40;</span>html<span style="color: black;">&#41;</span>
    <span style="color: #808040;">parser</span>.<span style="color: black;">close</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
    <span style="color: #808080; font-style: italic;"># Filter the paragraphs stored and output them.</span>
    <span style="color: #7777ff;font-weight:bold;">return</span> writer.<span style="color: black;">output</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>The TrackingParser itself overrides the callback functions for parsing start and end tags, as they are given the current parse index in the buffer. You don&#8217;t have access to that normally, unless you start diving into frames in the call stack &#8212; which isn&#8217;t the best approach!  Here&#8217;s what the class looks like:</p>

<div class="wp_syntax"><div class="code"><pre class="python"><span style="color: #7777ff;font-weight:bold;">class</span> TrackingParser<span style="color: black;">&#40;</span><span style="color: #808040;">htmllib</span>.<span style="color: #808040;">HTMLParser</span><span style="color: black;">&#41;</span>:
    <span style="color: #48488b;">&quot;&quot;</span><span style="color: #48488b;">&quot;Try to keep accurate pointer of parsing location.&quot;</span><span style="color: #48488b;">&quot;&quot;</span>
    <span style="color: #7777ff;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, writer, *args<span style="color: black;">&#41;</span>:
        <span style="color: #808040;">htmllib</span>.<span style="color: #808040;">HTMLParser</span>.<span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, *args<span style="color: black;">&#41;</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">writer</span> = writer
    <span style="color: #7777ff;font-weight:bold;">def</span> parse_starttag<span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, i<span style="color: black;">&#41;</span>:
        index = <span style="color: #808040;">htmllib</span>.<span style="color: #808040;">HTMLParser</span>.<span style="color: black;">parse_starttag</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, i<span style="color: black;">&#41;</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">writer</span>.<span style="color: black;">index</span> = index
        <span style="color: #7777ff;font-weight:bold;">return</span> index
    <span style="color: #7777ff;font-weight:bold;">def</span> parse_endtag<span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, i<span style="color: black;">&#41;</span>:
        <span style="color: #A0A020;">self</span>.<span style="color: black;">writer</span>.<span style="color: black;">index</span> = i
        <span style="color: #7777ff;font-weight:bold;">return</span> <span style="color: #808040;">htmllib</span>.<span style="color: #808040;">HTMLParser</span>.<span style="color: black;">parse_endtag</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, i<span style="color: black;">&#41;</span></pre></div></div>

<p>The <tt>LineWriter</tt> class does the bulk of the work when called by the default formatter.  If you have any improvements or changes to make, most likely they&#8217;ll go here.  This is where we&#8217;ll add our machine learning code later.  But you can keep the implementation rather simple and still get good results.  Here&#8217;s the simplest possible code:</p>

<div class="wp_syntax"><div class="code"><pre class="python"><span style="color: #7777ff;font-weight:bold;">class</span> Paragraph:
    <span style="color: #7777ff;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #A0A020;">self</span>.<span style="color: black;">text</span> = <span style="color: #48488b;">''</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">bytes</span> = <span style="color: #454580;">0</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">density</span> = <span style="color: #454580;">0.0</span>
&nbsp;
<span style="color: #7777ff;font-weight:bold;">class</span> LineWriter<span style="color: black;">&#40;</span><span style="color: #808040;">formatter</span>.<span style="color: black;">AbstractWriter</span><span style="color: black;">&#41;</span>:
    <span style="color: #7777ff;font-weight:bold;">def</span> <span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, *args<span style="color: black;">&#41;</span>:
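        <span style="color: #808080; font-style: italic;"># Parse position; updated by TrackingParser as tags are read.</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">index</span> = <span style="color: #454580;">0</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">last_index</span> = <span style="color: #454580;">0</span>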
        <span style="color: #A0A020;">self</span>.<span style="color: black;">lines</span> = <span style="color: black;">&#91;</span>Paragraph<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#93;</span>
        <span style="color: #808040;">formatter</span>.<span style="color: black;">AbstractWriter</span>.<span style="color: #0000cd;">__init__</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #7777ff;font-weight:bold;">def</span> send_flowing_data<span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, data<span style="color: black;">&#41;</span>:
        <span style="color: #808080; font-style: italic;"># Work out the length of this text chunk.</span>
        t = <span style="color: #A0A020;">len</span><span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>
        <span style="color: #808080; font-style: italic;"># We've parsed more text, so increment index.</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">index</span> += t
        <span style="color: #808080; font-style: italic;"># Calculate the number of bytes since last time.</span>
        b = <span style="color: #A0A020;">self</span>.<span style="color: black;">index</span> - <span style="color: #A0A020;">self</span>.<span style="color: black;">last_index</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">last_index</span> = <span style="color: #A0A020;">self</span>.<span style="color: black;">index</span>
        <span style="color: #808080; font-style: italic;"># Accumulate this information in current line.</span>
        l = <span style="color: #A0A020;">self</span>.<span style="color: black;">lines</span><span style="color: black;">&#91;</span><span style="color: #454580;">-1</span><span style="color: black;">&#93;</span>
        l.<span style="color: black;">text</span> += data
        l.<span style="color: black;">bytes</span> += b
&nbsp;
    <span style="color: #7777ff;font-weight:bold;">def</span> send_paragraph<span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, blankline<span style="color: black;">&#41;</span>:
        <span style="color: #48488b;">&quot;&quot;</span><span style="color: #48488b;">&quot;Create a new paragraph if necessary.&quot;</span><span style="color: #48488b;">&quot;&quot;</span>
        <span style="color: #7777ff;font-weight:bold;">if</span> <span style="color: #A0A020;">self</span>.<span style="color: black;">lines</span><span style="color: black;">&#91;</span><span style="color: #454580;">-1</span><span style="color: black;">&#93;</span>.<span style="color: black;">text</span> == <span style="color: #48488b;">''</span>:
            <span style="color: #7777ff;font-weight:bold;">return</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">lines</span><span style="color: black;">&#91;</span><span style="color: #454580;">-1</span><span style="color: black;">&#93;</span>.<span style="color: black;">text</span> += <span style="color: #48488b;">'\n'</span> * <span style="color: black;">&#40;</span>blankline<span style="color: #454580;">+1</span><span style="color: black;">&#41;</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">lines</span><span style="color: black;">&#91;</span><span style="color: #454580;">-1</span><span style="color: black;">&#93;</span>.<span style="color: black;">bytes</span> += <span style="color: #454580;">2</span> * <span style="color: black;">&#40;</span>blankline<span style="color: #454580;">+1</span><span style="color: black;">&#41;</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">lines</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>Paragraph<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #7777ff;font-weight:bold;">def</span> send_literal_data<span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>, data<span style="color: black;">&#41;</span>:
        <span style="color: #A0A020;">self</span>.<span style="color: black;">send_flowing_data</span><span style="color: black;">&#40;</span>data<span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #7777ff;font-weight:bold;">def</span> send_line_break<span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #A0A020;">self</span>.<span style="color: black;">send_paragraph</span><span style="color: black;">&#40;</span><span style="color: #454580;">0</span><span style="color: black;">&#41;</span></pre></div></div>

<p>This code doesn&#8217;t output anything yet; it just gathers the data.  We now have a list of paragraphs, we know their length, and we know roughly how many bytes of HTML were necessary to create them.  Let&#8217;s see what emerges from our statistics.</p>
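<p>The <tt>htmllib</tt> module used above is no longer available in Python 3.  As a minimal sketch of the same gathering step, assuming the standard <tt>html.parser</tt> module instead, the class below counts markup and text as it goes.  The <tt>DensityParser</tt> name, the <tt>BLOCK_TAGS</tt> set and the way closing tags are sized are assumptions made for illustration, not a port of the code above.</p>

```python
from html.parser import HTMLParser

# Tags treated as paragraph breaks; this set is an assumption for the sketch.
BLOCK_TAGS = {"p", "div", "br", "li", "td", "tr",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class Paragraph:
    def __init__(self):
        self.text = ""
        self.bytes = 0


class DensityParser(HTMLParser):
    """Collect paragraphs and the approximate amount of HTML around them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = [Paragraph()]

    def _new_paragraph(self):
        # Only start a new paragraph once the current one holds some text.
        if self.lines[-1].text.strip():
            self.lines.append(Paragraph())

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._new_paragraph()
        # get_starttag_text() returns the raw source of the tag just parsed.
        self.lines[-1].bytes += len(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Approximate the closing tag: its name plus three delimiter characters.
        self.lines[-1].bytes += len(tag) + 3
        if tag in BLOCK_TAGS:
            self._new_paragraph()

    def handle_data(self, data):
        # Text counts towards both the output and the bytes consumed.
        self.lines[-1].text += data
        self.lines[-1].bytes += len(data)
```

<p>After calling <tt>feed()</tt> with the page source and then <tt>close()</tt>, the <tt>lines</tt> attribute holds one <tt>Paragraph</tt> per block of text, each with its text and an approximate byte count.</p>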
<h3>Examining the Data</h3>
<p>Luckily, there are some patterns in the data.   In the raw output below, you&#8217;ll notice there are definite spikes in the number of HTML bytes required to encode lines of text, notably around the title, both sidebars, headers and footers.</p>
<div class="image"> <img src="http://ai-depot.com/articles/wp-content/uploads/2007/04/textvsbytes.png" alt="Graph of Text Output vs. HTML Bytes" /></div>
<p>While the number of HTML bytes spikes in places, it remains below average for quite a few lines.  On these lines, the text output is rather high.  Calculating the <strong>density</strong> of text to HTML bytes gives us a better understanding of this relationship.</p>
<div class="image"><img src="http://ai-depot.com/articles/wp-content/uploads/2007/04/density.png" alt="Graph of Text Density per Line" /></div>
<p>The patterns are more obvious in this density value, so it gives us something concrete to work with.</p>
<h3>Filtering the Lines</h3>
<p>The simplest way we can filter lines now is by comparing the density to a fixed threshold, such as 50% or the <em>average</em> density.  Finishing the <tt>LineWriter</tt> class:</p>

<div class="wp_syntax"><div class="code"><pre class="python">    <span style="color: #7777ff;font-weight:bold;">def</span> compute_density<span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #48488b;">&quot;&quot;</span><span style="color: #48488b;">&quot;Calculate the density for each line, and the average.&quot;</span><span style="color: #48488b;">&quot;&quot;</span>
        total = <span style="color: #454580;">0.0</span>
        <span style="color: #7777ff;font-weight:bold;">for</span> l <span style="color: #7777ff;font-weight:bold;">in</span> <span style="color: #A0A020;">self</span>.<span style="color: black;">lines</span>:
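            <span style="color: #808080; font-style: italic;"># Empty paragraphs have zero bytes; avoid dividing by zero.</span>
            l.<span style="color: black;">density</span> = <span style="color: #A0A020;">len</span><span style="color: black;">&#40;</span>l.<span style="color: black;">text</span><span style="color: black;">&#41;</span> / <span style="color: #A0A020;">float</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">max</span><span style="color: black;">&#40;</span>l.<span style="color: black;">bytes</span>, <span style="color: #454580;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>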
            total += l.<span style="color: black;">density</span>
        <span style="color: #808080; font-style: italic;"># Store for optional use by the neural network.</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">average</span> = total / <span style="color: #A0A020;">float</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">len</span><span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span>.<span style="color: black;">lines</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
&nbsp;
    <span style="color: #7777ff;font-weight:bold;">def</span> output<span style="color: black;">&#40;</span><span style="color: #A0A020;">self</span><span style="color: black;">&#41;</span>:
        <span style="color: #48488b;">&quot;&quot;</span><span style="color: #48488b;">&quot;Return a string with the useless lines filtered out.&quot;</span><span style="color: #48488b;">&quot;&quot;</span>
        <span style="color: #A0A020;">self</span>.<span style="color: black;">compute_density</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        output = <span style="color: #808040;">StringIO</span>.<span style="color: #808040;">StringIO</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        <span style="color: #7777ff;font-weight:bold;">for</span> l <span style="color: #7777ff;font-weight:bold;">in</span> <span style="color: #A0A020;">self</span>.<span style="color: black;">lines</span>:
            <span style="color: #808080; font-style: italic;"># Check density against threshold.</span>
            <span style="color: #808080; font-style: italic;"># Custom filter extensions go here.</span>
            <span style="color: #7777ff;font-weight:bold;">if</span> l.<span style="color: black;">density</span> &gt; <span style="color: #454580;">0.5</span>:
                output.<span style="color: black;">write</span><span style="color: black;">&#40;</span>l.<span style="color: black;">text</span><span style="color: black;">&#41;</span>
        <span style="color: #7777ff;font-weight:bold;">return</span> output.<span style="color: black;">getvalue</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>This rough filter typically gets most of the lines right.  Header, footer and sidebar text is usually stripped as long as it&#8217;s not too long.  However, if there are long copyright notices, comments, or descriptions of other stories, then those are output too.  Also, if there are short lines around inline graphics or adverts within the text, these are not output.</p>
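<p>The average-density variant of the threshold can be sketched as follows.  This is only an illustration: the <tt>filter_by_average</tt> function and its list of (text, HTML byte count) pairs are hypothetical stand-ins for the <tt>LineWriter</tt> data.</p>

```python
def filter_by_average(paragraphs):
    """Keep paragraphs whose density is above the average density."""
    # Paragraphs with no bytes get a density of zero rather than an error.
    densities = [len(text) / float(nbytes) if nbytes else 0.0
                 for text, nbytes in paragraphs]
    if not densities:
        return []
    average = sum(densities) / len(densities)
    return [text for (text, _), density in zip(paragraphs, densities)
            if density > average]


# Hypothetical sample: two menu entries and one paragraph of body text.
page = [
    ("Home", 40),
    ("About", 50),
    ("Real body text of the article.", 40),
]
kept = filter_by_average(page)
```

<p>Because the cut-off adapts to each page, it needs no tuning per site, but a page that is almost all navigation will still let some of that navigation through.</p>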
<p>To fix this, we need a more complex filtering heuristic.  But instead of spending days working out the logic manually, we&#8217;ll just grab loads of information about each line and use machine learning to find patterns for us.</p>
<h3>Supervised Machine Learning</h3>
<p>Here&#8217;s an example of an interface for tagging lines of text as content or not:</p>
<div class="image"><img src="http://ai-depot.com/articles/wp-content/uploads/2007/04/training.png" alt="Training From News Articles" /></div>
<p>The idea of supervised learning is to provide examples for an algorithm to learn from.  In our case, we give it a set documents that were tagged by humans, so we know which line must be output and which line must be filtered out.  For this we&#8217;ll use a simple neural network known as the perceptron.  It takes floating point inputs and filters the information through weighted connections between &#8220;neurons&#8221; and outputs another floating point number.  Roughly speaking, the number of neurons and layers affects the ability to approximate functions precisely; we&#8217;ll use both single-layer perceptrons (SLP) and multi-layer perceptrons (MLP) for prototyping.</p>
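<p>To make the perceptron idea concrete, here is a minimal single-layer version in plain Python.  It is a teaching sketch under simple assumptions (sigmoid output, delta-rule training, made-up density examples), not the network used in the rest of the article.</p>

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class Perceptron:
    """Single-layer perceptron: weighted sum of the inputs through a sigmoid."""

    def __init__(self, n_inputs):
        self.weights = [0.0] * n_inputs
        self.bias = 0.0

    def run(self, inputs):
        total = self.bias + sum(w * x for w, x in zip(self.weights, inputs))
        return sigmoid(total)

    def train(self, examples, epochs=1000, rate=0.5):
        """Apply delta-rule updates over (inputs, target) pairs."""
        for _ in range(epochs):
            for inputs, target in examples:
                out = self.run(inputs)
                # Gradient of the squared error through the sigmoid.
                delta = (target - out) * out * (1.0 - out)
                self.weights = [w + rate * delta * x
                                for w, x in zip(self.weights, inputs)]
                self.bias += rate * delta


# Hypothetical tagged examples: density as the only input, 1 means content.
examples = [([0.1], 0.0), ([0.2], 0.0), ([0.8], 1.0), ([0.9], 1.0)]
net = Perceptron(1)
net.train(examples)
```

<p>After training, <tt>net.run([0.9])</tt> lands above 0.5 and <tt>net.run([0.1])</tt> below it, so thresholding the output at 0.5 separates the two groups.</p>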
<p>To get the neural network to learn, we need to gather some data.  This is where the earlier <tt>LineWriter.output()</tt> function comes in handy; it gives us a central point to process all the lines at once and make a global decision about which lines to output.  Starting with intuition and experimenting a bit, we discover that the following data is useful to decide how to filter a line:</p>
<ul>
<li>Density of the <strong>current</strong> line.</li>
<li>Number of HTML bytes of the line.</li>
<li>Length of output text for this line.</li>
<li>These three values for the <strong>previous</strong> line,</li>
<li>&#8230; and the same for the <strong>next</strong> line.</li>
</ul>
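<p>The list above can be turned into a nine-value input vector per line.  The sketch below is one possible way to do it; the <tt>Line</tt> tuple, its field names, and the zero padding for the first and last lines are assumptions made for illustration.</p>

```python
from collections import namedtuple

# Hypothetical stand-in for the line objects collected while parsing.
Line = namedtuple('Line', ['density', 'html_bytes', 'text_length'])

# Neutral values used when a line has no previous or next neighbour (an assumption).
PADDING = Line(0.0, 0, 0)

def line_features(lines, index):
    """Return the 9 inputs for lines[index]: current, previous, then next line."""
    previous = lines[index - 1] if index > 0 else PADDING
    following = lines[index + 1] if index + 1 < len(lines) else PADDING
    features = []
    for line in (lines[index], previous, following):
        features.extend([line.density, float(line.html_bytes), float(line.text_length)])
    return features

lines = [Line(0.1, 200, 20), Line(0.8, 500, 400), Line(0.7, 300, 210)]
print(line_features(lines, 0))
```

<p>In practice the byte counts and text lengths would probably need scaling to a range similar to the density before being fed to a network.</p>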
<p>For the implementation, we&#8217;ll be using Python to interface with <em>FANN</em>, the <a href="http://leenissen.dk/fann/">Fast Artificial Neural Network</a> Library.  The essence of the learning code goes like this:</p>

<div class="wp_syntax"><div class="code"><pre class="python"><span style="color: #7777ff;font-weight:bold;">from</span> pyfann <span style="color: #7777ff;font-weight:bold;">import</span> fann, libfann
&nbsp;
<span style="color: #808080; font-style: italic;"># This creates a new single-layer perceptron with 1 output and 3 inputs.</span>
obj = libfann.<span style="color: black;">fann_create_standard_array</span><span style="color: black;">&#40;</span><span style="color: #454580;">2</span>, <span style="color: black;">&#40;</span><span style="color: #454580;">3</span>, <span style="color: #454580;">1</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
ann = fann.<span style="color: black;">fann_class</span><span style="color: black;">&#40;</span>obj<span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Load the data we described above.</span>
patterns = fann.<span style="color: black;">read_train_from_file</span><span style="color: black;">&#40;</span><span style="color: #48488b;">'training.txt'</span><span style="color: black;">&#41;</span>
ann.<span style="color: black;">train_on_data</span><span style="color: black;">&#40;</span>patterns, <span style="color: #454580;">1000</span>, <span style="color: #454580;">1</span>, <span style="color: #454580;">0.0</span><span style="color: black;">&#41;</span>
&nbsp;
<span style="color: #808080; font-style: italic;"># Then test it with different data.</span>
<span style="color: #7777ff;font-weight:bold;">for</span> datin, datout <span style="color: #7777ff;font-weight:bold;">in</span> validation_data:
    result = ann.<span style="color: black;">run</span><span style="color: black;">&#40;</span>datin<span style="color: black;">&#41;</span>
    <span style="color: #7777ff;font-weight:bold;">print</span> <span style="color: #48488b;">'Got:'</span>, result, <span style="color: #48488b;">' Expected:'</span>, datout</pre></div></div>
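<p>The training file read above is plain text.  FANN expects a header line with the number of samples, inputs, and outputs, followed by alternating lines of input values and expected output values.  Here is a minimal sketch for writing such a file; the helper name and the sample values are made up for the example.</p>

```python
import os
import tempfile

def write_fann_training_file(path, samples):
    """Write (inputs, outputs) pairs in FANN's plain-text training format."""
    num_inputs = len(samples[0][0])
    num_outputs = len(samples[0][1])
    with open(path, 'w') as handle:
        # Header: number of samples, inputs per sample, outputs per sample.
        handle.write('%d %d %d\n' % (len(samples), num_inputs, num_outputs))
        for inputs, outputs in samples:
            handle.write(' '.join(repr(float(v)) for v in inputs) + '\n')
            handle.write(' '.join(repr(float(v)) for v in outputs) + '\n')

# Two made-up tagged lines: (density, html_bytes, text_length) -> 1.0 content, 0.0 not.
samples = [([0.8, 500, 400], [1.0]), ([0.1, 200, 20], [0.0])]
path = os.path.join(tempfile.mkdtemp(), 'training.txt')
write_fann_training_file(path, samples)

with open(path) as handle:
    contents = handle.read()
print(contents)
```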

<p>Trying out different data and different network structures is a rather mechanical process.  With too many neurons, the network may fit the documents you trained on too closely (overfitting); with too few, it cannot solve the problem well.  Here are the results, varying the number of lines used (1L-3L) and the number of attributes per line (1A-3A):</p>
<div class="image"><img src="http://ai-depot.com/articles/wp-content/uploads/2007/04/comparison.png" alt="Neural Network Comparison Chart" /></div>
<p>The interesting thing to note is that 0.5 is already a pretty good guess at a fixed threshold (see first set of columns).  The learning algorithm cannot find a much better solution when comparing the density alone (1 Attribute, second set of columns).  With 3 Attributes, the next SLP does better overall, though it produces more false negatives.  Using multiple lines also increases the performance of the single-layer perceptron (fourth set of columns).  And finally, using a more complex neural network structure works best overall, making 80% fewer errors in filtering the lines.</p>
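<p>To compare configurations like these, the network outputs on held-out lines can be tallied into false positives and false negatives.  The following is only a sketch: the function name and the sample results are hypothetical, and 1 is assumed to mean &#8220;content&#8221;.</p>

```python
def confusion_counts(results, threshold=0.5):
    """Count outcomes for (network_output, expected) pairs; expected 1 means content."""
    counts = {'tp': 0, 'fp': 0, 'tn': 0, 'fn': 0}
    for output, expected in results:
        predicted = output > threshold
        actual = expected > 0.5
        if predicted and actual:
            counts['tp'] += 1      # content kept
        elif predicted and not actual:
            counts['fp'] += 1      # junk kept (pollutes the output)
        elif actual:
            counts['fn'] += 1      # content dropped
        else:
            counts['tn'] += 1      # junk dropped
    return counts

# Made-up outputs from a validation run.
results = [(0.9, 1), (0.6, 0), (0.55, 1), (0.3, 0), (0.2, 1)]
counts = confusion_counts(results)
error_rate = (counts['fp'] + counts['fn']) / float(len(results))
print(counts, error_rate)
```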
<p><em>Note that you can tweak how the error is calculated if you want to punish false positives more than false negatives.</em></p>
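<p>One simple way to express that preference is to weight the two kinds of mistake differently when scoring results and then pick the threshold with the lowest weighted error.  This sketch works on network outputs after training rather than changing the training error inside FANN; the cost values and function names are assumptions.</p>

```python
def weighted_error(results, threshold, fp_cost=3.0, fn_cost=1.0):
    """Sum the cost of mistakes over (network_output, expected) pairs."""
    error = 0.0
    for output, expected in results:
        predicted = output > threshold
        actual = expected > 0.5
        if predicted and not actual:
            error += fp_cost       # junk kept
        elif actual and not predicted:
            error += fn_cost       # content dropped
    return error

def best_threshold(results, candidates, **costs):
    """Return the candidate threshold with the lowest weighted error."""
    return min(candidates, key=lambda t: weighted_error(results, t, **costs))

# Made-up outputs from a validation run.
results = [(0.9, 1), (0.6, 0), (0.55, 1), (0.3, 0), (0.2, 1)]
print(best_threshold(results, [0.3, 0.5, 0.7]))
```

<p>With false positives costing three times as much, the higher threshold wins on this toy data because it drops the one junk line that scored 0.6.</p>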
<h3>Conclusion</h3>
<p>Extracting text from arbitrary HTML files doesn&#8217;t necessarily require scraping the file with custom code.  You can use statistics to get pretty amazing results, and machine learning to get even better ones.  By tweaking the threshold, you can avoid the worst false positives that pollute your text output.  But it&#8217;s not so bad in practice; where the neural network makes mistakes, even humans have trouble classifying those lines as &#8220;content&#8221; or not.</p>
<p>Now all you have to figure out is what to do with that clean text content!</p>
]]></content:encoded>
			<wfw:commentRss>http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/feed/</wfw:commentRss>
		<feedburner:origLink>http://ai-depot.com/articles/the-easy-way-to-extract-useful-text-from-arbitrary-html/</feedburner:origLink></item>
		<item>
		<title>AI Knowledge At Your Fingertips</title>
		<link>http://feeds.feedburner.com/~r/AiArticles/~3/107068259/</link>
		<comments>http://ai-depot.com/articles/ai-knowledge-at-your-fingertips/#comments</comments>
		<pubDate>Fri, 30 Mar 2007 19:44:11 +0000</pubDate>
		<dc:creator>alexjc</dc:creator>
		
		<category><![CDATA[announcement]]></category>
<category>ai-depot</category><category>artificial intelligence</category><category>knowledge</category><category>resources</category><category>web 2.0</category>
		<guid isPermaLink="false">http://ai-depot.com/content/2007/03/ai-knowledge-at-your-fingertips/</guid>
		<description><![CDATA[
Forgive the promotion, but there have been more changes on the site &#8212; and I&#8217;m quite proud of the result!  The artificial intelligence knowledge warehouse has undergone a facelift, brought into the modern world of Web 2.0.  All the content is accessible via topic tags, with the list of items sorted by their [...]]]></description>
		<content:encoded><![CDATA[<p><img src='http://ai-depot.com/content/wp-content/uploads/2007/03/knowledge.png' alt='Artificial Intelligence Knowledge' /></p>
<p>Forgive the promotion, but there have been more changes on the site &#8212; and I&#8217;m quite proud of the result!  The <a href="http://ai-depot.com/knowledge/">artificial intelligence knowledge</a> warehouse has undergone a facelift, brought into the modern world of Web 2.0.  All the content is accessible via topic tags, with the list of items sorted by their popularity.  Each item can be rated individually using the traditional voting button, and you can post comments about each submission.  Where available, an introduction to the topic is also displayed.</p>
<p>I&#8217;ve added quite a number of links and topic descriptions already, and will continue to do so over the next few weeks.  Feel free to contribute if you have a favourite site, or a personal link to promote.  You can easily submit links to <a href="http://ai-depot.com/knowledge/">artificial intelligence resources</a> via the submit tab menu, and instantly make your contribution available to thousands of AI enthusiasts daily.</p>
<p>Current content submissions, and those in the pipeline, include:</p>
<ul>
<li>
<p>Artificial intelligence tutorials, essays and articles.</p>
</li>
<li>
<p>Lecture slides, or white papers relating to AI.</p>
</li>
<li>
<p>Online books or videos discussing artificial intelligence.</p>
</li>
</ul>
<p>On a related note, be sure to get your latest <a href="http://ai-depot.com/news/">Artificial Intelligence News</a> over here.  If you subscribe to the RSS feed for the <a href="http://feeds.feedburner.com/AiNews">front page ai news</a>, there are usually 2-4 AI stories daily.  The voting system is running smoothly, so the content stays much more relevant!</p>
<p>Ok, enough public announcements for today.  Back to more normal AI commentaries in a few days&#8230; Stay tuned for even more improvements and big changes on the <b>AI Depot</b>!</p>
]]></content:encoded>
			<wfw:commentRss>http://ai-depot.com/articles/ai-knowledge-at-your-fingertips/feed/</wfw:commentRss>
		<feedburner:origLink>http://ai-depot.com/articles/ai-knowledge-at-your-fingertips/</feedburner:origLink></item>
	</channel>
</rss>
