Authors: Rafael León Sanz, José Manuel Rojo, Ramón Galán
It is virtually impossible to test Web mining applications based on supervised learning thoroughly with purely empirical data, because human intervention is needed to generate the training and test sets. Moreover, because of the heterogeneous nature of the Internet, it is difficult to extend the results obtained with these sets to other Web pages. We propose using the computer-based Bootstrap paradigm to design a test environment in which such applications can be checked with greater confidence. In addition, it is possible to go further: by varying the characteristics of the sample, one gains a better understanding of the performance of the mining application.
The mining of social networks and blogs is a good example of where our methodology can be applied, since it is often challenging to find a sufficiently large sample for a particular topic. To show the application of our methodology, we have tested several models used as Web page classifiers based on pattern analysis. These classifiers have been widely studied, but their evaluation and comparison are not yet well developed. We demonstrate that our test technique makes it possible to gain a deeper understanding of their performance.
The Bootstrap approach is a powerful tool for all Internet-related work: it makes it possible to create test environments that simulate real conditions with less human effort, providing excellent test beds for applications whose working conditions are difficult to replicate.
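As a rough illustration of the general bootstrap idea referred to above, the following minimal sketch resamples a small labeled test set with replacement to estimate a percentile interval for a classifier's accuracy. It is not the authors' test environment: the function name `bootstrap_accuracy`, its parameters, and the choice of a percentile interval are assumptions made for this example only.

```python
import random


def bootstrap_accuracy(labels, predictions, n_resamples=1000, alpha=0.05, seed=0):
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if not labels or len(labels) != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal in length")

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(labels)
    # 1 where the classifier was right, 0 where it was wrong
    correct = [int(y == p) for y, p in zip(labels, predictions)]

    scores = []
    for _ in range(n_resamples):
        # Draw n items with replacement from the original test set
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        scores.append(sum(sample) / n)

    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper


if __name__ == "__main__":
    # Toy data: 100 hand-labeled pages, classifier correct on 80 of them
    labels = [1] * 80 + [0] * 20
    predictions = [1] * 100
    print(bootstrap_accuracy(labels, predictions))
```

The width of the resulting interval gives a sense of how much confidence a small hand-labeled sample supports; changing the composition of the toy data mimics, in a very simplified way, varying the characteristics of the sample.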