Web Document Analysis 2005
Friday, July 22, 2005

Workshop Outline
Now that we have reviewed the submitted papers, we can describe how the workshop will be structured. We will be posting a detailed schedule on this blog in the near future. The workshop will include:

• Introduction by Ethan and Matt,
• Invited talk by Dan Lopresti,
• 6 papers on Web Document Analysis,
• Discussion session.

We also hope to include a social event, either a lunch or a dinner.

Our intention was to have the registration information up today. However, we are currently talking with ICDAR and the other workshop chairs to see if we can centralize this process.

For the accepted papers, we require the camera-ready version by August 12th. Please email it to mhurst at intelliseek dot com.

The papers that have been selected are as follows:

Indexing the Blogosphere One Post at a Time

Natalie Glance (Intelliseek Applied Research Center)

In order to perform analysis over weblogs, we must first identify the appropriate unit of a weblog that corresponds to a document. We argue in the paper that, for weblogs, the correct unit is the weblog post. A weblog post is a structured document with the following fields: date, timestamp, title, content, permalink and author. We present our approach for segmenting weblogs into posts, which breaks down into several steps: (1) automatic feed discovery; (2) feed-guided segmentation, using the weblog feed and HTML; and (3) model-based weblog segmentation.
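The post structure described in the abstract can be sketched as a simple record type. The field names come straight from the abstract; the types, the example values, and the class name are illustrative assumptions, not the paper's actual data model.

```python
from dataclasses import dataclass


@dataclass
class WeblogPost:
    """One weblog post, the unit of analysis proposed in the paper."""
    date: str        # publication date, e.g. "2005-07-22"
    timestamp: str   # time of day the post was published
    title: str
    content: str     # body text (or HTML) of the post
    permalink: str   # stable URL identifying this post
    author: str


# Hypothetical example instance.
post = WeblogPost(
    date="2005-07-22",
    timestamp="09:00",
    title="Workshop Outline",
    content="Now that we have reviewed the submitted papers...",
    permalink="http://example.com/2005/07/workshop-outline.html",
    author="Matt",
)
print(post.title)
```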

Link-Based Clustering for Finding Subrelevant Web Pages

Tomonari Masada (National Institute of Informatics),
Atsuhiro Takasu (National Institute of Informatics),
Jun Adachi (National Institute of Informatics)

We propose a new Web page clustering method. Typical search engines only provide relevant pages, i.e., the pages matching users' needs. However, we design our clustering method to provide non-relevant pages as search results when they refer to relevant pages and help users anticipate the contents of those relevant pages. We call such pages subrelevant. As it is difficult to improve Web search performance, we use subrelevancy to relax the criterion as to what kind of pages should appear in search results with the least drawback, i.e., one click away from a relevant page. Our clustering method is based on three concepts: THP, out-degree path length, and threshold parameter. We use clustering results to modify the feature vectors of Web pages. Hence, each clustering result induces a reranking of search results. We expect the reranking to raise the ranks of subrelevant pages. In experiments with the NTCIR-3 Web task test collection, our clustering improved average precision by 13 percent compared with the baseline.

Using Computer Vision to Detect Web Browser Display Errors

Xu Liu (University of Maryland, College Park),
David Doermann (University of Maryland, College Park)

As the functionality and complexity of the WWW continue to grow, so does the need for WWW quality assurance and testing. Although there have been numerous approaches to automated Web testing, existing techniques mainly analyze textual information, and the final judgment on correctness of layout is via human observation. The motivation of this paper is to employ computer vision techniques to detect Web display errors. To do this, we analyze images of the rendered pages rather than the HTML and attempt to discover errors. Our approach includes page segmentation, dynamic matching and outlier identification. We show that the approach successfully detects layout errors in the Opera browser on Microsoft Websites, while minimizing false alarms.
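The outlier-identification step can be sketched roughly as follows: once segmentation and matching have paired up page blocks between a reference rendering and a test rendering, blocks whose displacement deviates strongly from the page's typical displacement are flagged as likely layout errors. The function name, the median-based consensus, and the threshold are all illustrative assumptions, not the paper's actual implementation.

```python
def flag_layout_outliers(matched_blocks, threshold=20):
    """matched_blocks: list of ((x_ref, y_ref), (x_test, y_test)) pairs
    giving the top-left corner of each matched block in the reference
    and test renderings. Returns indices of suspected layout errors."""
    # Displacement of each matched block between the two renderings.
    dxs = [xt - xr for (xr, _), (xt, _) in matched_blocks]
    dys = [yt - yr for (_, yr), (_, yt) in matched_blocks]
    # Use the median displacement as the "consensus" page shift, so a
    # uniform scroll offset does not trigger false alarms.
    med_dx = sorted(dxs)[len(dxs) // 2]
    med_dy = sorted(dys)[len(dys) // 2]
    outliers = []
    for i, ((xr, yr), (xt, yt)) in enumerate(matched_blocks):
        if abs((xt - xr) - med_dx) > threshold or \
           abs((yt - yr) - med_dy) > threshold:
            outliers.append(i)
    return outliers


# Three blocks shifted uniformly by 2px, one block drifted 240px.
blocks = [((10, 10), (10, 12)), ((10, 100), (10, 102)),
          ((10, 200), (10, 202)), ((10, 300), (250, 302))]
print(flag_layout_outliers(blocks))  # -> [3]
```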

Mining Tables on the Web for Finding Attributes of a Specified Topic

Koichi Kise (Osaka Prefecture University),
Nobuhiro Ohmae (Osaka Prefecture University)

Finding attribute-value pairs from a huge collection of HTML pages is a fundamental task for information extraction from the Web. This paper presents an unsupervised method of mining Web tables for finding attributes of a topic specified by the user. The proposed method is based on the assumption that the occurrence of text strings representing attributes is biased to the first rows and columns in tables. The χ²-test is employed to find attribute candidates based on the assumption. Identification of attribute rows and columns using the candidates enables us to improve the accuracy of extraction. The experimental results using 2,700 pages show that precision of extraction is 80%.
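The χ²-test at the core of this method can be illustrated with a toy computation. The statistic itself is standard Pearson χ²; the observed and expected counts below are invented for illustration and are not from the paper's experiments.

```python
def chi_square_statistic(observed, expected):
    """Pearson's chi-square statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical example: a candidate attribute string occurs 18 times
# in first rows/columns and only 2 times elsewhere, where an unbiased
# placement would put 10 occurrences in each region.
observed = [18, 2]
expected = [10, 10]
stat = chi_square_statistic(observed, expected)
print(stat)  # -> 12.8

# 12.8 far exceeds the 5% critical value of 3.84 (1 degree of freedom),
# so the string's placement is biased toward first rows/columns and it
# is kept as an attribute candidate.
```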

PACE: an Experimental Web-Based Audiovisual Application using FDL

Marc Caillet (INRIA Rhône-Alpes, INA),
Jean Carrive (INA),
Vincent Brunie (INA),
Cécile Roisin (INRIA)

This paper describes the PACE experimental multimedia application, which aims at providing automatic tools for Web browsing of television show collections; experiments are currently in progress with a collection of fifty-four Le Grand Echiquier shows. PACE is being built within the FERIA framework and relies on multiple automatic analysis tools. It is thus flexible enough to easily adapt to other collections. Emphasis is then placed on the brand new audiovisual document description language FDL, as it is the core part of FERIA, with particular attention paid to how it operates in PACE.

EMD based Visual Similarity for Detection of Phishing Webpages

Yingjie Fu (City University of Hong Kong),
Liu Wenyin (City University of Hong Kong),
Xiaotie Deng (City University of Hong Kong)

Phishing has become a severe problem in the Internet society. We propose an effective phishing webpage detection approach using EMD (Earth Mover's Distance) based visual similarity of webpages. The suspected webpage and the protected webpage are first preprocessed into low-resolution images. The image-level colors and coordinate features are used to represent the image signatures. We then use the EMD method to calculate the signature distance of the two images as their visual similarity. When the visual similarity value is higher than a threshold, we classify the suspected webpage as a phishing webpage of the protected one. As our approach is based on image-level color and coordinate features rather than HTML, webpage obfuscation scams are cracked. Large-scale experiments with 10,279 suspected webpages are carried out to show high classification precision, phishing recall and applicable time performance for an online enterprise solution.
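To give a feel for the metric, here is a heavily simplified sketch of EMD on 1-D histograms. The paper works with 2-D signatures combining color and coordinate features and solves the general transportation problem; the special case below, where EMD reduces to the area between cumulative distributions, is only meant to show the intuition that EMD measures the "work" needed to morph one distribution into another.

```python
def emd_1d(hist_a, hist_b):
    """Earth Mover's Distance between two 1-D histograms of equal
    total mass, with unit ground distance between adjacent bins.
    In this special case EMD equals the sum of absolute differences
    of the cumulative distributions."""
    assert abs(sum(hist_a) - sum(hist_b)) < 1e-9  # equal total mass
    emd, cum_a, cum_b = 0.0, 0.0, 0.0
    for a, b in zip(hist_a, hist_b):
        cum_a += a
        cum_b += b
        emd += abs(cum_a - cum_b)
    return emd


# Two toy 3-bin color histograms: shifting half the mass one bin over
# costs 0.5 * 2 bins of cumulative mismatch = 1.0 units of work.
print(emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))  # -> 1.0
```

A suspected page whose distance to a protected page falls below a chosen threshold would then be flagged as a possible phishing copy.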