Text Classification Program Rubryx. Help file

Rubryx – a blend of experience and knowledge

Rubryx Short Manual

Introduction

Rubryx is a program for pattern-based classification of web sites. It classifies large volumes of specialized textual information and generates web catalogs, electronic libraries, and reference systems on the basis of expert information and full-text analysis.

System requirements

Windows 95/98/Me/NT/2000/XP
Pentium 100 MHz (Pentium III or higher is recommended)

How to work with the program

1. Make a list of classes. In the main window, use the “Add” and “Delete” buttons to make a list of classes.
2. Select a few pattern documents for each class. In the main window, choose a class and double-click to enter it. The dialog window “Selection of class patterns” appears. Using the “Add” button, make a list of a few documents (4-6) that fully represent the class. Press OK. The program automatically generates the vocabulary from the selected patterns. This can take a few minutes. Do the same for each class.
3. Define the catalog of sites.
4. Choose the index. The index ranges from 1 to 100 and is determined empirically. For an initial value, consult the Statistics button.
5. Press “Start”.

Practical hints

The aim of the program is to classify documents as efficiently as possible. This requires an accurate selection of the classes and of the threshold value of the index K. The classes should be selected so that their intersection is minimized and the greatest possible share of the documents is covered. Index K should be chosen so that unrelated documents are not included in a class (K value is too small) and suitable documents are not rejected (K value is too big).

Several preliminary classifications may be required. For these, take a sample of approximately 1 per cent of the whole set of documents. For example, if there are 100 thousand web sites to classify, 1000 sites are enough for preliminary experiments.
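
The 1 per cent sample can be drawn outside Rubryx before the catalog of sites is defined. The following Python sketch is only an illustration and not part of the program: the function name preliminary_sample, its parameters fraction and seed, and the example.com addresses are all assumptions made for this example.

```python
import random


def preliminary_sample(sites, fraction=0.01, seed=0):
    """Return a random sample holding about `fraction` of `sites`.

    Hypothetical helper: Rubryx itself is not known to provide this.
    """
    if not sites:
        return []
    # Keep at least one site so that very small lists still give a sample.
    size = max(1, round(len(sites) * fraction))
    # A fixed seed makes the preliminary experiment repeatable.
    rng = random.Random(seed)
    return rng.sample(sites, size)


# Placeholder addresses standing in for the real list of 100 thousand sites.
all_sites = [f"http://example.com/site{i}" for i in range(100_000)]
sample = preliminary_sample(all_sites)
print(len(sample))  # 1000
```

Drawing the sample at random, rather than taking the first 1000 entries, is a design choice of this sketch: it avoids a sample dominated by whatever order the site list happens to be stored in.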
A sample of 1000 sites is representative, and classifying a sample of that size on an up-to-date computer takes only a few moments.

During classification, some documents may be excluded from all classes. These documents should be studied carefully: new classes may need to be added to the list, and some of the remaining documents may simply not fit the catalog being generated. If a large number of the same documents fall into different classes, the subject matter of the catalog has been poorly divided into classes.

Once sample classification gives good results, the whole set of documents can be classified. As a result, you get groups of web sites with high-quality information, one group for each class.

How to create a new dictionary

To tune the program to a new domain, a special dictionary must be created. The dictionary is stored in three text files:

WordList.txt – the dictionary of one-word terms
WordLst2.txt – the dictionary of two-word terms
WordLst3.txt – the dictionary of three-word terms

You can create these files with the ordinary text editor Notepad or with any editor that saves files in plain text format in ANSI encoding. Sample dictionary files for the domain "Computational Linguistics" are included in the delivery.

Where to buy the program

Rubryx is a shareware product. A demo of the program can be downloaded from the mirror sites www.sowsoft.com/rubryx/index.htm and www.rubryx.narod.ru/index.htm
The demo allows 30 starts and works for 30 days. The full version of the program is sold via the service Regsoft.com. It costs $50. The ordering and payment conditions are available at
http://www.regsoft.net/purchase.php3?productid=44711
or
http://www.regsoft.net/purchase_nonsecure.php3?productid=44711

All questions and comments may be sent to:
rubryx@sowsoft.com
vladimir_polyakov@yahoo.com

Rubryx Community: KSU, FCCL, MSLU, MISA, WJAELA, NLP Registry of DFKI, Elsnet, Prof. Kenji Kita

Copyright © 2001-2002. All rights reserved
