MDR is a Web mining system that automatically identifies and extracts regularly structured data records (e.g., products and data tables) from Web pages. See the paper below for details:
We provide only an executable (.exe) version of the system (without source), which runs on Windows PCs. The program is free for scientific use. Please contact us if you are planning to use the software for commercial purposes. The software may not be distributed without prior permission of the authors.
Download and Install
- Download the MDR program here
- Extract the files in the zip file to a directory. You are on your way.
Note: A more robust and efficient implementation of the algorithm is in production use in an e-commerce company.
If you have downloaded MDR, please send us an email so that we can put you on our mailing list and inform you of any new versions and bug fixes.
How to use
- Click on “mdr.exe”. You will get a small interface window.
- Type or paste a URL (including http://) or a local path into the Combo Box. The Combo Box contains a list of the URLs that you have added; at the beginning it may be empty.
- If you are interested in extracting tables (or data arranged in rows and columns), click on “Extract” in the Table section.
- If you are interested in extracting other types of data records, click on “Extract” in the “Data Records (other types)” section. We separate the two functions for efficiency reasons.
- After the execution, the output file will be displayed in an IE window. It contains the extracted tables or data regions and data records.
- You will notice that the output window contains some uninteresting records. Simple cleanup could remove them, but we have not done it, as it is not much of a research issue.
- There are a few output files in the directory that are for our debugging purposes. You don’t have to worry about them, and you don’t have to delete them. If you want to understand them, please send us an email.
- If MDR cannot successfully extract the data records in a page, one reason could be that the tags in the page are not well formed enough for MDR to build a correct tag tree. Although we tried to fix some of these errors, we did not spend enough time on this. Most of the pages that we tested have reasonably well-formed tags. Please send us the pages on which you encounter problems with MDR. We hope to improve it over time.
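If you suspect that malformed tags are the cause, a quick check of the page before sending it to us can help. The following is a minimal sketch, not MDR's tag tree builder: it uses Python's standard `html.parser` module, and the names `TagBalanceChecker` and `find_tag_problems` are hypothetical helpers introduced only for this illustration. It is deliberately strict and will also report tags that HTML allows to stay unclosed (for example `<li>` or `<td>`).

```python
from html.parser import HTMLParser

# Elements that never take a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TagBalanceChecker(HTMLParser):
    """Hypothetical helper: tracks open tags on a stack and records mismatches."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open (non-void) tags
        self.problems = []   # human-readable descriptions of mismatches

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.problems.append(f"unexpected </{tag}>")


def find_tag_problems(html):
    """Return a list of tag-nesting problems found in the given HTML text."""
    checker = TagBalanceChecker()
    checker.feed(html)
    checker.close()
    return checker.problems + [f"unclosed <{tag}>" for tag in checker.stack]


if __name__ == "__main__":
    print(find_tag_problems("<table><tr><td>a</td></tr></table>"))
    print(find_tag_problems("<table><tr><td>a</tr></table>"))
```

An empty list means that the tags nest cleanly under this strict check. A non-empty list does not prove that MDR will fail on the page, only that the markup may be harder to turn into a correct tag tree.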
Only display the data regions with “$” sign: When dealing with e-commerce websites, most data records of interest are merchandise. If this option is checked, MDR outputs only those data regions in which the data records are merchandise. (Here we assume that every merchandise item has a price with a “$” sign.) As a result, some data regions that also contain regularly patterned data records will not be displayed.
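The effect of this option can be illustrated with a small sketch. It is an assumption-laden illustration, not MDR's code: a data region is modeled as a plain list of record strings, the function name `regions_with_prices` is hypothetical, and the rule "every record in the region must contain a “$”" is one possible reading of the description above.

```python
def regions_with_prices(regions):
    """Keep only the regions in which every record text contains a '$' sign.

    `regions` is a list of regions; each region is a list of record strings.
    Both the representation and the 'every record' rule are assumptions.
    """
    return [
        region
        for region in regions
        if region and all("$" in record for record in region)
    ]


if __name__ == "__main__":
    regions = [
        ["Camera $199", "Lens $89"],   # product listing: kept
        ["Home", "About us", "Help"],  # navigation links: dropped
    ]
    print(regions_with_prices(regions))
```

In this example the navigation region also has a regular pattern, but it is dropped because its records carry no price.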
There is one threshold parameter in the algorithm, which may affect the extraction results of the system.
Similarity Threshold: Only when the similarity value of two tag strings is higher than the Similarity Threshold are the two sub-trees represented by these two tag strings considered as having a similar pattern. The data records represented by these two tag strings can then be extracted. The default value of the Similarity Threshold is 60%, which was obtained from pilot studies and works well for most Web pages. If you cannot get the expected data records after clicking “Extract”, try changing the Similarity Threshold to a larger value.
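To make the role of the threshold concrete, here is a minimal sketch of a threshold test on two tag strings. The exact similarity measure is not specified on this page, so the sketch assumes a similarity derived from normalized edit distance over sequences of tag names; the function names `edit_distance`, `similarity`, and `is_similar` are hypothetical and are not part of the MDR program.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of tag names."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Similarity in [0, 1]: 1 minus the length-normalized edit distance (assumed measure)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def is_similar(a, b, threshold=0.60):
    """True only when the similarity is strictly higher than the threshold."""
    return similarity(a, b) > threshold


if __name__ == "__main__":
    row1 = ["tr", "td", "img", "td", "a"]
    row2 = ["tr", "td", "img", "td", "a", "b"]
    menu = ["ul", "li", "li"]
    print(similarity(row1, row2), is_similar(row1, row2))
    print(similarity(row1, menu), is_similar(row1, menu))
```

Under this assumed measure, the two table rows differ by one tag out of six, which gives a similarity of about 83% and passes the 60% default, while the row and the menu share no tags and are rejected.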