Web Document Analysis 2005
Tuesday, August 30, 2005
What does document analysis give us?
The 3rd Web Document Analysis Workshop closed with an interesting discussion around the provocative question:
What does document analysis give us, how can we take advantage of it and how can we encourage it?
The question was inspired largely by the content of Dan Lopresti's excellent invited talk ('The case of the missing dimension(s)'). Dan observed that traditional systems treat web documents as linear sequences of tokens, when they are in fact encodings of two-dimensional documents.
Much of the discussion focused on search: how would document analysis affect search results? A number of responses to this were proposed including:
* The interpretation of tabular material.
For example, if you were interested in climatic information about cities in Korea, you might use the query 'average rainfall seoul pusan'. Thomas Breuel pointed out, quite correctly, that issuing this search would most likely produce a page with the desired tabular data. In later discussions with Robert Dale and Vanessa Long, we talked about the notion of search result quality: relevancy is not the same as quality. In the case of the search for climatic information, imagine a system that, given such a query, could produce a statistical summary of the results found in all the tables (e.g. giving the mean and variance in a super table).
* Title and other block segmentation.
Here the desire is to ensure that adjacency in the linear stream of tokens is not confused with token adjacency in the document. For example, a system should not treat the last word of a title or section heading as the first word of a phrase that continues into the initial tokens of the following paragraph.
* Accurate PDF search.
PDF documents, and other layout-weak document encodings, are commonly returned in search results. These documents pose significant challenges at very low levels. Consequently, a reasonable number of standard document analysis processes need to be run against each document prior to indexing.
* Document zoning.
This is something of particular interest to blog or message board search engines. Web pages are generally made up of a number of functional elements (including title, navigation, adverts, main content). Indexers have no recognition of the significance of these areas, which is why in some cases a result takes you to a page that does not contain the query terms that got you there. The blogosphere offers a good example: TypePad blogs include a list of recently updated blogs. This list changes constantly and is almost guaranteed to be different from how it appeared at index time.
* Sub-page documents.
Similar to document zoning, the problem of sub-page documents is familiar to blog search engine implementers. It addresses the fact that the basic unit of content is not the web page, but some smaller unit (e.g. a blog post). In addition, the web page contains many such elements which all need to be indexed individually.
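The title and block segmentation point in the list above can be made concrete with a small sketch. It compares phrase (bigram) extraction over a flat token stream with extraction that respects block boundaries. The set of block-level tags, the class name BlockTextParser and the sample page are assumptions made for illustration; none of them come from the workshop discussion.

```python
from html.parser import HTMLParser

# Tags treated as block boundaries. This set is an assumption for illustration.
BLOCK_TAGS = {"title", "h1", "h2", "h3", "p", "div", "li", "td"}


class BlockTextParser(HTMLParser):
    """Collects tokens per block-level element instead of as one stream."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks, each a list of tokens
        self._current = []  # tokens of the block currently being read

    def _flush(self):
        if self._current:
            self.blocks.append(self._current)
            self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.extend(data.split())

    def close(self):
        super().close()
        self._flush()


def _parse(html):
    parser = BlockTextParser()
    parser.feed(html)
    parser.close()
    return parser.blocks


def linear_bigrams(html):
    """Bigrams over the flat token stream, ignoring layout."""
    tokens = [t for block in _parse(html) for t in block]
    return list(zip(tokens, tokens[1:]))


def block_bigrams(html):
    """Bigrams that never cross a block boundary."""
    return [(a, b) for block in _parse(html) for a, b in zip(block, block[1:])]


if __name__ == "__main__":
    page = "<h1>Average Rainfall</h1><p>Seoul is wet in July</p>"
    print(("Rainfall", "Seoul") in linear_bigrams(page))
    print(("Rainfall", "Seoul") in block_bigrams(page))
```

In the flat stream the heading's last word pairs with the paragraph's first word; with block boundaries respected, that spurious phrase never reaches the index.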
There was recognition that discussion of search applications makes broad assumptions about use cases and user expectations, which have been drilled into the consumers of such interfaces. The example of a search result returning a summary of tabular data illustrates this point and hints at the potential for new interfaces, new user experiences and new user expectations in the search space.
Document analysis researchers often view the problem of analysing web pages as a very partitioned space - the web documents must be consumed as is. The second part of the discussion looked at what can be done to assist in the analysis of online documents. A big part of this problem is the inclusion of information in the markup which will help with various tasks. In the case of certain layout elements (e.g. titles) that information is already present. However, for many of the issues raised above, there is no clear standard. It was recognized that there are a number of ad-hoc inclusions (e.g. comments to indicate where ads appear, or where navigation appears). These inclusions may be taken advantage of opportunistically but do not represent a stable path to success.
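As a rough sketch of what taking advantage of such ad-hoc comments might look like, the following drops text that falls between marker comments before indexing. The marker names (nav-start, nav-end, ads-start, ads-end) are hypothetical: no standard exists, and real pages use whatever their authors happened to choose, which is exactly why the approach is fragile.

```python
from html.parser import HTMLParser

# Hypothetical zone names. A page would mark a zone with comments such as
# <!-- nav-start --> and <!-- nav-end -->; these are not any standard.
SKIP_ZONES = {"nav", "ads"}


class ZoneFilter(HTMLParser):
    """Keeps only text that lies outside the marked navigation/ad zones."""

    def __init__(self):
        super().__init__()
        self.skipping = None  # name of the zone being skipped, if any
        self.kept = []        # tokens that would be passed to the indexer

    def handle_comment(self, data):
        marker = data.strip().lower()
        for zone in SKIP_ZONES:
            if marker == f"{zone}-start" and self.skipping is None:
                self.skipping = zone
            elif marker == f"{zone}-end" and self.skipping == zone:
                self.skipping = None

    def handle_data(self, data):
        if self.skipping is None:
            self.kept.extend(data.split())


def indexable_text(html):
    """Return the text of the page with marked zones removed."""
    parser = ZoneFilter()
    parser.feed(html)
    parser.close()
    return " ".join(parser.kept)


if __name__ == "__main__":
    page = (
        "<div><!-- nav-start --><a>Home</a> <a>Archive</a><!-- nav-end --></div>"
        "<p>Rainfall in Seoul</p>"
        "<!-- ads-start --><p>Buy umbrellas</p><!-- ads-end -->"
    )
    print(indexable_text(page))
```

A page that omits the markers, or spells them differently, passes through unfiltered, so this only helps where a site's conventions are already known.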
As with the inclusion of any novel information, adding in this data is going to be challenging from the human behaviour point of view, though it was recognized that structured blogging and microformats were a start.
I was encouraged to write these notes sooner rather than later by Abdel Belaid (thanks), but do recognize that these are not minutes of the meetings and include my own personal bias and some subsequent conversations with others. This content will be posted both on the WDA2005 blog and on my own blog. Please comment on the WDA blog only.
Sunday, August 28, 2005
We would like to thank everyone who attended the workshop and made it a success. As mentioned, we will put the PDF versions of the presented papers online in the near future (subscribe to the RSS feed for this blog to receive the notification).
Thursday, August 25, 2005
Ethan tells me that the weather in Seoul is not all sunshine. Pack a light jacket and something warm to wear.
Wednesday, August 24, 2005
The workshop is going to be held in room 'Lily' from 9-5 on the 28th of August. Here is a map of the location: map. Note that it is on the 3rd floor of the Olympic Parktel. The official ICDAR page states that it is the 4th floor, but the maps provided state the 3rd floor. Either way - see you at 'Lily.'
The schedule for the day will be as follows:
9:00 Welcome and Opening Remarks.
9:15 Invited Speaker: Dan Lopresti
10:30 Session 1, 3 talks
* Using Computer Vision to Detect Web Browser Display Errors: Liu, Doermann
* Link-Based Clustering for Finding Subrelevant Web Pages: Masada, Takasu, Adachi
* Indexing the Blogosphere One Post at a Time: Glance
12-1:30 lunch break
1:30 Session 2, 3 talks
* Mining Tables on the Web for Finding Attributes of a Specified Topic: Kise, Ohmae
* PACE: an Experimental Web-based Audiovisual Application using FDL: Caillet, Carrive, Brunie, Roisin
* EMD based Visual Similarity for Detection of Phishing Webpages: Fu, Wenyin, Deng
4:45-5 Wrap up
Thursday, August 11, 2005
Invited Talk: Title and Abstract
Our invited speaker is Dan Lopresti, Lehigh University.
Web Document Analysis: the Case of the Missing Dimension(s)
Web documents are inherently multidimensional; yet, they are frequently processed as though they were a one-dimensional stream of data. This over-simplification has proven remarkably effective in what can only be termed the infancy of the Web. Will this continue to hold true for much longer? We as document analysis researchers know better.
In this talk, I will discuss some of the opportunities I see for applying and adapting techniques from the field of document image analysis to Web documents. I will also present a proposal for melding vexing problems from document analysis research with a certain key need in Web-based security in a way that could prove immensely beneficial to both communities.
Wednesday, August 10, 2005
Intelliseek Sponsors WDA2005
Intelliseek was proud to sponsor the previous workshop and once again has stepped up with sponsorship for this meeting.
Intelliseek's business revolves around the application of text mining to online and internal data to deliver insights into a number of product and brand related areas. Their technology relies heavily on a number of document analysis sub-systems, in particular those involving web content.
(Matt Hurst, co-chair, is employed by Intelliseek.)
Monday, August 01, 2005
Registration is now available for WDA2005! We are following the same format as CBDAR (here).
- Download this form.
- Fill out the form and
- Fax it to the ICDAR secretariat at +82 42 472 7459.