The goal of this project is to study the main issues in the design and implementation of a complete system (including hardware, software and network components) to crawl a large web collection for spatial and temporal mining. This study will be supported by simulation tools and a system prototype.
Although several important similar efforts are under way at Internet scale, the distinguishing feature of this project is its focus on web resources in the Portuguese top-level domain and in the Portuguese language.
Web resources will be tagged with geographical references and used to extract significant information over a long period of time. Link structure, content structure, web server availability and response time are examples of such information. The target Web space must be visited (or sampled) at an appropriate rate to study how several statistical indicators computed from the extracted information evolve.
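As an illustration of how server availability and response time might be recorded on each visit, here is a minimal sketch. The names `Probe`, `probe` and `http_fetch` are hypothetical and do not come from the project itself. The sketch assumes that any `OSError` (which includes URL errors and socket timeouts in Python's standard library) means the server was unavailable.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Probe:
    """One observation of a web resource (hypothetical record layout)."""
    url: str
    available: bool
    response_time: Optional[float]  # seconds; None when the fetch failed


def http_fetch(url: str, timeout: float = 10.0) -> None:
    """Fetch a URL over HTTP and discard the body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def probe(url: str, fetch: Callable[[str], None] = http_fetch) -> Probe:
    """Time a single fetch; treat any OSError as 'server unavailable'."""
    start = time.monotonic()
    try:
        fetch(url)
    except OSError:
        return Probe(url, False, None)
    return Probe(url, True, time.monotonic() - start)
```

The `fetch` argument is injectable so that the timing logic can be exercised without network access. Repeating such probes at a fixed interval would yield the kind of time series from which the indicators mentioned above could be derived.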
Since it is impossible to anticipate all the information and indicators that will need to be extracted, the core of the planned system will be a large repository holding a structured database of the collected web resources. This approach allows new information extractors to be applied to all registered resources.
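The idea of applying new extractors to already stored resources can be sketched as follows. This is only an in-memory illustration under stated assumptions, not the project's design: `Resource`, `Repository`, `run_extractor` and `count_links` are hypothetical names, and a real repository would be a persistent database.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List


@dataclass
class Resource:
    """A stored web resource plus the results of extractors run on it."""
    url: str
    mime_type: str
    content: bytes
    results: Dict[str, object] = field(default_factory=dict)


class Repository:
    """In-memory stand-in for the structured resource database."""

    def __init__(self) -> None:
        self._resources: List[Resource] = []

    def add(self, resource: Resource) -> None:
        self._resources.append(resource)

    def __iter__(self) -> Iterator[Resource]:
        return iter(self._resources)

    def run_extractor(self, name: str,
                      extractor: Callable[[Resource], object]) -> None:
        """Apply an extractor to every stored resource, saving its output."""
        for resource in self._resources:
            resource.results[name] = extractor(resource)


def count_links(resource: Resource) -> int:
    """Toy extractor: count opening anchor tags (a rough approximation)."""
    return resource.content.lower().count(b"<a ")
```

Because the repository keeps the raw content, an extractor written later can be run over the whole collection with a single `run_extractor` call, without crawling again.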
Due to the dynamic nature of content and technology, and the exponential growth of the Web, scalability, modularity and extensibility are key issues of the proposed system architecture.
The design and implementation of a structured and dynamic large-scale document database raises interesting problems that must be solved on small-scale clusters running a multithreaded single system image.
We are not planning to develop an exhaustive set of information extractors, but rather to build a limited set of significant examples. This set of examples will enable a precise definition of the functional, implementation and evaluation requirements for third-party development of specific information extractors at different scales.
We performed two crawls of the Portuguese web, one in 2001 and another in 2002. The following are some simple statistics on the crawled space.
The first graphic shows the relative quantity of resources by MIME type; the second shows the relative volume of the same resources.
The next graphic shows the relative quantity of resources by MIME type and last modification date.
The following graphic shows the same breakdown with respect to resource volume.
The final graphic shows the use of specific tags in HTML pages.
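The per-MIME-type breakdowns described above (relative quantity and relative volume) could be computed along these lines. This is a minimal sketch: the function name `mime_shares` and the `(mime_type, size_in_bytes)` record layout are assumptions made for illustration, and the sample values in the usage note are invented, not crawl results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mime_shares(
    records: Iterable[Tuple[str, int]]
) -> Dict[str, Tuple[float, float]]:
    """Map each MIME type to (share of resource count, share of total bytes).

    `records` is an iterable of (mime_type, size_in_bytes) pairs.
    """
    counts: Dict[str, int] = defaultdict(int)
    volumes: Dict[str, int] = defaultdict(int)
    for mime_type, size in records:
        counts[mime_type] += 1
        volumes[mime_type] += size

    total_count = sum(counts.values())
    total_volume = sum(volumes.values())
    return {
        mime_type: (
            counts[mime_type] / total_count,
            # Guard against a collection made only of empty resources.
            volumes[mime_type] / total_volume if total_volume else 0.0,
        )
        for mime_type in counts
    }
```

For example, calling `mime_shares([("text/html", 100), ("text/html", 300), ("image/gif", 600)])` attributes two thirds of the resources but only 40% of the bytes to `text/html`, which is the kind of contrast between quantity and volume that the first two graphics illustrate.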
SIRE - Scalable Information Retrieval Environment.
Inference through Data Mining.
OCLC - Web Characterization - Online Computer Library Center, Inc.
José João Almeida.
http://marco.uminho.pt/~macedo/netcensus Modified: Wed Sep 25 00:02:52 2002