Research at Piffany

CaeSAR Algorithm

Though the Internet originated among the scientific community, it has evolved into a much broader tool bringing a cornucopia of information to citizens of the world. The Web, however, is not under any centralized command and there are natural concerns over the quality of information from content providers. There are two key issues for quality: relevancy and authority. My research at Piffany addressed the first, relevancy. Is information available on the web for all age and education levels? A cancer victim is not likely to benefit from access to JAMA (Journal of the American Medical Association) because JAMA articles are written with medical doctors in mind. However, non-medical professionals demand access to medical information that is relevant to their personal health.

Most people get their information on the Web from a commercial search engine like Google. Google’s PageRank algorithm uses popularity as discovered through the hyperlink structure of the Web to establish the authority of a web page. Each web page inherits the authority of those pages that link to it. Ultimately, sites with the most links from pages that themselves have the most links become the most authoritative. Google implicitly assumes that the most popular document is the most relevant. Authority as dictated by search engines do not consider the subjective nature of the Web. Bill Clinton’s favorite online news source is likely not the same as George Bush’s favorite. The first 100 search results returned by Google for the word “Volcano” range in difficulty from kid-friendly to PH.D. level.

Criteria Specific Authority Ranking (CaeSAR) is an algorithm I invented at Piffany Inc. While it is similar to other authority based ranking algorithms in that it uses link structure to establish the popularity and credibility of a web page, it distinguishes itself by making the authority of links context dependent. For example, in determining the credibility of a web page on volcanoes, a link from another page on the same topic will be valued higher than a link from a thematically more distant site. Similarly, the popularity and suitability score of a web page for a particular age group will benefit more from links which originate from other pages targeting the same audience, than from pages directed towards much younger or much older users. The CeaSAR algorithm thus determines the popularity and credibility of a web page among its peers, as opposed to averaging link authority over all user demographics. Its merits are twofold: First, it increases the effectiveness of authority ranking itself by using authority selectively; and second, it decreases the uncertainty of ranking scores computed from individual documents. The figure below illustrates the latter at the example of a set of web pages with various readability or difficulty levels. Before applying CaeSAR, the readability scores vary considerably within clusters of documents, suggesting high noise levels, and the difficulty of some documents cannot be determined at all. After applying CaeSAR, fluctuations within clusters are averaged out while the contrast between different clusters is increased, and former gaps of readability evaluation are filled in.

Illustration of age-specific clustering of nodes using the CaeSAR algorithm. Color is used to describe the reading difficulty level of each node as described in the legend shown on the right. Notice the yellow node top-left in the "before" column. This could be an image, video, or audio file that has no text information associated with it. The age-authority of the nodes linking to this unknown node aides in assigning its age-context, as shown in the "after" column.

CaeSAR can be implemented in several ways. One possibility is to represent the link structure as a matrix whose rows and columns are labelled by the document number, and whose elements M(i,j) are non-zero only if document j links to document i. The principal eigenvector of this matrix, which can be obtained computationally through iterative application of the matrix, beginning with an arbitrary vector of non-zero elements, contains the relative ranking scores of the documents. Context-specificity is introduced by weighting the matrix elements according to some context-specific criterion such as the readability pertaining to a specific user age. An alternative implementation of the CaeSAR algorithm is a random walk through the network of interlinked web pages, in which the probability of arriving at a certain web page by following a link depends on the context-specific authority of the page that the link originates from.

Consider the effectiveness of the CaeSAR algorithm in identifying Internet domains associated with a particular age group. The data is taken from a corpus of one million cached websites that were generated from seed lists of differing origin. Reasonable measures were taken to establish that the reading difficulty distribution among documents was approximately evenly distributed among all age groups, so that no particular age group had an intrinsic statistical advantage over any other at being selected. The list in the table below illustrates the top ten documents ranked by age-specific authority; calibrated here to correspond to advanced elementary school reading difficulty and adult. Inspection of the domain names themselves give a good idea of what kind of website will be found at these addresses, and by clicking through to each domain address you will find that those domains in the left column (kids) are sites designed specifically for kids, and that those domains in the right column (adults) are sites designed specifically for adults. Again, the results are not indicative of the actual reading difficulty scores for each document, but instead cluster documents of similar reading difficulty level to give an overall age-impression for any page in the cluster.

Top ten domains ranked by age-specific authority ranking using CaeSAR.

US Patents related to the CaESAR algorithm

  • 8161040
  • 8983943
  • 9514193
  • 9984162
  • 10289646

References

A. Langville and C. Meyer, Google’s PageRank and Beyond: The Science of Search Engine Rankings (Princeton University Press, New York, 2006).

L. Page and S. Brin, Computer Networks and ISDN Systems 33, 107 (1998).

L. Page and S. Brin, "The Anatomy of a Large-Scale Hypertextual Web Search Engine"