Text Characteristics of English Language University Web Sites
The nature of the contents of academic Web sites is of direct relevance to the new field of scientific Web intelligence, and to designers of search engines and topic-specific crawlers. We analyze word frequencies in national academic Webs using the Web sites of three English-speaking nations: Australia, New Zealand and the U.K. Strong regularities were found in page size and word frequency distributions, but with significant anomalies. At least 26% of pages contain no words. High frequency words include university names and acronyms, Internet terminology, and computing product names, which are not always words in common usage away from the Web. A minority of low frequency words are spelling mistakes; other common types include non-words, proper names, foreign language terms and computer science variable names. Based upon these findings, recommendations for data cleansing and filtering are made, particularly for clustering applications.
The world’s university Web sites contain an enormous quantity of information, ranging from preprints to administrative and recreational pages, created by faculty, students and support staff (Middleton, McConnell & Davidson, 1999). This goldmine can easily be exploited for many academic-related purposes, such as exploring topics through online articles and course notes, identifying active scholars and their publications from their personal home pages, and finding courses in online prospectuses. A commercial search engine such as Google is likely to mediate between users and university Web sites, particularly for new or infrequent information needs. Yet there is much information on the Web that is not apparent from its individual pages and that can only be identified through large-scale investigations of factors such as the relationships between documents. This concept will be familiar to information scientists, particularly those versed in relational...