NetCensus's Project Homepage

NetCensus
Spatio-temporal web mining.

Project Summary

The goal of this project is to study the main issues in the design and implementation of a complete system (including hardware, software and network components) for crawling a large web collection for spatial and temporal mining. This study will be supported by simulation tools and a system prototype.

Although several important similar efforts are under way at Internet scale, this project focuses specifically on web resources in the Portuguese top-level domain and in the Portuguese language.

Web resources will be tagged with geographical references and used to extract significant information over a long period of time. Link structure, content structure, web server availability and response time are examples of such information. The target Web space must be visited (or sampled) at an appropriate rate in order to study how several statistical indicators computed from the extracted information evolve.
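As an illustration of what studying the evolution of an indicator could look like, the sketch below averages server response time per month and per geographical tag. It is only a minimal example under stated assumptions: the function name `mean_response_by_period`, the tuple layout of an observation, and the sample values are all hypothetical and are not taken from the project.

```python
from collections import defaultdict
from statistics import mean


def mean_response_by_period(observations):
    """Average response time per (month, region).

    Each observation is assumed to be a tuple
    (visit date as "YYYY-MM-DD", geographical tag, response time in ms).
    """
    groups = defaultdict(list)
    for visited_on, region, response_ms in observations:
        period = visited_on[:7]  # keep only "YYYY-MM"
        groups[(period, region)].append(response_ms)
    # Sort keys so that periods appear in chronological order.
    return {key: mean(values) for key, values in sorted(groups.items())}


# Made-up sample observations, for illustration only.
sample = [
    ("2003-01-05", "Lisboa", 100.0),
    ("2003-01-20", "Lisboa", 200.0),
    ("2003-02-01", "Porto", 50.0),
]
indicators = mean_response_by_period(sample)
```

The same grouping would apply to any other indicator (for example, the number of outgoing links per page), as long as each visit is recorded with its date and geographical tag.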

Since it is impossible to anticipate all the information and indicators that will need to be extracted, the core of the planned system will be a large repository holding a structured database of the collected web resources. This approach allows new information extractors to be applied to every resource already registered.
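The idea of a repository that outlives its extractors can be sketched as follows. This is a hypothetical, in-memory illustration, not the project's design: the names `Resource`, `Repository`, `register_extractor` and `count_links`, as well as the chosen fields, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Resource:
    """One stored visit to a web resource (hypothetical fields)."""
    url: str
    fetched_at: str   # timestamp of the visit
    location: str     # geographical tag
    content: str      # raw page content


Extractor = Callable[[Resource], object]


class Repository:
    """Keeps every collected resource so extractors can be added later."""

    def __init__(self) -> None:
        self._resources: List[Resource] = []
        self._extractors: Dict[str, Extractor] = {}

    def store(self, resource: Resource) -> None:
        self._resources.append(resource)

    def register_extractor(self, name: str, extractor: Extractor) -> None:
        self._extractors[name] = extractor

    def run(self, name: str) -> List[Tuple[str, str, object]]:
        """Apply one extractor to all stored resources, old and new."""
        extractor = self._extractors[name]
        return [(r.url, r.fetched_at, extractor(r)) for r in self._resources]


def count_links(resource: Resource) -> int:
    # Crude link count, for illustration only: counts opening anchor tags.
    return resource.content.count("<a ")


repo = Repository()
repo.store(Resource("http://example.pt/", "2003-01-05T10:00:00", "Lisboa",
                    '<a href="x">x</a><a href="y">y</a>'))
# The extractor is registered after the resource was collected.
repo.register_extractor("link_count", count_links)
results = repo.run("link_count")
```

Because the raw resources are kept rather than only the derived values, an extractor written long after a crawl can still be run over everything collected so far.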

Because content and technology change constantly and the Web is growing exponentially, scalability, modularity and extensibility are key requirements of the proposed system architecture.

The design and implementation of a structured, dynamic, large-scale document database raises interesting problems that must be solved on small-scale clusters running a multithreaded single system image.

We are not planning to develop an exhaustive set of information extractors, but rather to build a limited set of significant examples. This set of examples will enable a precise definition of the functional, implementation and evaluation requirements for third-party development of specific information extractors at different scales.