A New Ontology-Supported and Hybrid Recommending Information System for Scholars

Sheng-Yuan Yang
Dept. of Computer and Communication Engineering
St. John's University
Tamsui, Taipei County, Taiwan, R.O.C.
ysy@mail.sju.edu.tw

Chun-Liang Hsu
Dept. of Electrical Engineering
St. John's University
Tamsui, Taipei County, Taiwan, R.O.C.
liang@mail.sju.edu.tw

Abstract—A new ontology-supported and hybrid recommending information system for scholars is proposed. Not only can it quickly integrate documents of a specific domain, but it can also extract important information from them through hybrid filtering technology to perform information integration and recommendation ranking. The experimental outcomes show that the reliability and validity measurements of the whole system reach high-level information recommendation performance. Furthermore, this paper discusses the advantages and shortcomings of constructing a recommendation system with different approaches, and accordingly derives a design philosophy of customized services for recommendation systems.

Keywords—Ontology, Webpage Crawlers, Webpage Classifiers, Information Extractor, Information Recommender.

I. INTRODUCTION

Nowadays, most search engines adopt keyword-based querying, which suffers from two problems: the keywords entered by users are often incomplete and do not clearly indicate the users' query demands, and many keywords are the same words with different meanings in different fields. When the system does not classify the query request and specify its field [7], it produces many complicated cross-field query results, so that information demanders may spend much time filtering out the available information. In the technical literature, many systems employ the concept of ontology as the core technology for solving these problems, for example, WebSifter II [10], OBIGA [2], and Swoogle [3]. An ontology provides a complete semantic model in which all related entities, attributes, and base knowledge of a specified domain can be shared and reused, thereby solving the problems of common sharing and communication. Describing the structure of knowledge content through an ontology establishes the knowledge core of a specified domain and supports learning related information, communication, and access, and even the induction of new knowledge. This paper likewise relies on the concept of ontology to resolve the complicated query outcomes returned when the same words carry different meanings, or different words carry the same meaning, among information sources.

Data mining employs the techniques of statistical analysis, information classification, or machine learning to provide important decision bases for information systems, according to the significant reasons, relationships, or potential rule models found in huge amounts of data. The six common data mining methods are description, estimation, clustering, classification, prediction, and association [16].
This paper also combines a recommender system with the classification and association rules of SPSS Clementine [16], a data mining tool, to mine useful and important information from huge amounts of data, and then explores the relative merits of constructing information recommenders with different structures and approaches, from which a design philosophy of customized services for recommendation systems is derived.

Figure 1. Architecture of the ontology-supported and hybrid recommending information system for scholars

Figure 1 illustrates the architecture of the ontology-supported and hybrid recommending information system for scholars. First, the system invokes the webpage crawler [20] to fetch related webpages through the Google or Yahoo search engine (or both, as defined by the user via the integrated interface). The system then classifies the webpages with the webpage classifier [21], supported by the ontology database, and invokes the information extractor to obtain significant webpage information. Finally, the information recommender is triggered to make integrated recommendations, and the recommended results are shown to users through the integrated interface. The experimental outcomes show that the reliability and validity measurements of the whole system reach high-level information recommendation performance.

In short, this paper investigates a proper mechanism for processing web documents so as to quickly integrate specific domain information, and goes one step further to extract significant recommendation information for information integration and recommendation ranking. The scholar information domains of artificial intelligence, fuzzy theory, and artificial neural networks are chosen as the target application of the proposed system and are used for explanation in the remaining sections.

II. BACKGROUND KNOWLEDGE AND DEVELOPING TECHNIQUES

A. Ontology

An ontology [1] provides a complete semantic model with sharing and reuse characteristics that help solve the problems of common sharing and communication; hence, an ontology is a powerful tool for constructing and maintaining an information system [19]. This paper adopted Protégé [6] to construct the scholar ontology. Protégé's most distinctive feature is that it uses multiple components to edit and build ontologies and guides knowledge workers in constructing ontology-based knowledge management systems; furthermore, users can export the ontology to different formats such as RDF(S), OWL, and XML, or import it directly into databases such as MySQL and MS SQL Server, which gives it better support than other tools [4].
B. Regular Expression

A regular expression is a character sequence describing a specified pattern, which can be used to search for matches in another character sequence. Regular expressions support wildcards, character sets, and quantifiers as specification devices [8]. Java supplies two supporting classes for this purpose: Pattern, which defines a regular expression, and Matcher, which conducts pattern matching against another character sequence. In the webpage classifier, regularization means that the system removes meaningless characters, including consecutive blank spaces, line feeds, tab characters, punctuation marks, and so on, from the documents being classified, in order to raise the precision rate of classification [20].

C. Processing of Classification

The concept of TF was first proposed by Salton and McGill [14], while IDF was proposed by Spark-Jones [15]. The motivation behind the two methodologies is that the importance of a term appearing in a document is not always the same, and may even differ across articles; combining the two measures therefore captures the importance of a feature term. A traditional statistical classifier must be accompanied by a proper feature-extraction methodology that fetches the proper features from the training webpages in order to attain classification precision; hence, the quality of the term feature set decides the classification precision. The related literature offers many solutions, such as the extracting mechanism of [15] based on squared correlation and entropy; Tan [17] proposed a fairer feature-selection method in which all kinds of documents are treated fairly and the input dimension can be further reduced. In these approaches, whether grounded in information theory or in machine learning, the computation works over symbols (or numbers), so the processes and their outcomes are difficult to understand. This paper instead adapts a domain ontology to support the classification process, which not only raises the classifier's efficiency but also makes the process and outcomes of classification easy to understand.

D. Developing Techniques

The development tool of our system is Borland JBuilder 2006, an integrated development environment for Java with a fine human-machine interface and code-debugging mechanism that enables fast integration of code blocks during development and accordingly reduces development time. In addition, Java [11] provides many functions and methods for integrating web applications and databases; from the viewpoint of extensibility, Java is the optimal choice for solving cross-platform problems. The system adopts MS SQL Server, one of the most widely used relational database management systems, as the backend ontology-based knowledge-sharing platform; SQL (Structured Query Language) is the query language used to retrieve data from the database.

III. SYSTEM ARCHITECTURE

A. Webpage Crawler

Figure 2 shows the operational structure of the webpage crawler [20]. First, Keyword Keying transfers the input characters into URI code and embeds them into the search engine's query URL. Next, Search Engine Linking declares a URL object, appends the query to the encoded URI, and reads the returned content line by line, outputting it as a text file for later analysis. Regular Processing then applies different regular expressions to find matching URLs, so as to fetch all related hyperlinks satisfying the conditions and output them into a text file for further processing. Pure Text Extracting reads the hyperlinked files line by line in an iterative loop, deletes the HTML tags from the source files, and retains only the text content for further processing and analysis. Finally, Content Filtering judges whether each webpage falls within the queried range: it consults the ontology database and compares the fetched concepts against the webpage content, and if the page matches any of the conditions, its content, URL, and related titles are stored by the system to support further processing and analysis.

Figure 2. Operational structure of the webpage crawler
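As an illustration of these steps, the following minimal Java sketch strings together Keyword Keying, Search Engine Linking, Regular Processing, and Pure Text Extracting. The query-URL format and the hyperlink pattern are our assumptions (and a live search engine may refuse such programmatic requests); it sketches the flow and is not the system's actual code.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal crawler sketch: encode the keyword, fetch the result page line
// by line, extract hyperlinks with Pattern/Matcher, and strip HTML tags.
public final class CrawlerSketch {
    private static final Pattern HREF =
        Pattern.compile("href=\"(https?://[^\"]+)\"");   // illustrative pattern

    public static void main(String[] args) throws Exception {
        String keyword = URLEncoder.encode("fuzzy ontology", "UTF-8"); // Keyword Keying
        URL query = new URL("https://www.google.com/search?q=" + keyword);

        StringBuilder page = new StringBuilder();                      // Search Engine Linking
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(query.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) page.append(line).append('\n');
        }

        List<String> links = new ArrayList<>();                        // Regular Processing
        Matcher m = HREF.matcher(page);
        while (m.find()) links.add(m.group(1));

        String pureText = page.toString().replaceAll("<[^>]+>", " ");  // Pure Text Extracting
        System.out.println(links.size() + " links; " + pureText.length() + " chars of text");
    }
}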
B. Webpage Classifier

Figure 3. Operational structure of the webpage classifier

Figure 3 illustrates the operational structure of the webpage classifier [21]. Its source text originates from the webpage crawler described above. First, the system filters out non-semantic characters, such as consecutive spaces and the Tab, \n, and \t characters, which only divide or beautify the content and have nothing to do with semantic expression. The initialization task loads the stop-word database, the ontology database, the formal database, and the document array for subsequent processing. The system then employs the CKIP (Chinese Knowledge and Information Processing) segmentation system [12] as an assistant tool for word segmentation, whose output contains the segmented words and their corresponding attribute tags. Preprocessing comprises: Format Transformation, which deletes attribute tags, replaces full-width characters with half-width ones, and so on; Segmentation Fixing, which employs the domain ontology to fix segmentation errors specific to the domain; Stemming, which reduces different word forms to their stems; and Stop Word Filtering, which not only uses a "stop list" to store the words that ought to be excluded but also employs the attribute tags produced by segmentation to assist in the stop-word deletion process. Duplication Processing performs a complete cross-comparison between the documents and the system database to delete duplicate webpages, avoiding repeated backend operations and accordingly enhancing performance.
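As a concrete illustration of the last two steps, the following minimal Java sketch filters a segmented word list against a stop list and rejects duplicate documents. For brevity it compares SHA-256 content fingerprints instead of the full pairwise cross-comparison described above, and the tiny stop list is only a placeholder.

import java.security.MessageDigest;
import java.util.*;

// Sketch of Stop Word Filtering and Duplication Processing.
public final class PreprocessSketch {
    private static final Set<String> STOP_LIST = Set.of("的", "是", "and", "the");

    // Keep only segmented words that are not on the stop list.
    static List<String> filterStopWords(List<String> segmented) {
        List<String> kept = new ArrayList<>();
        for (String w : segmented) if (!STOP_LIST.contains(w)) kept.add(w);
        return kept;
    }

    // Reject a document whose normalized content was already stored.
    static boolean isDuplicate(String text, Set<String> seenDigests) throws Exception {
        byte[] d = MessageDigest.getInstance("SHA-256")
                                .digest(text.trim().getBytes("UTF-8"));
        return !seenDigests.add(Base64.getEncoder().encodeToString(d));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(filterStopWords(List.of("ontology", "the", "crawler")));
        Set<String> seen = new HashSet<>();
        System.out.println(isDuplicate("same page", seen)); // false: first occurrence
        System.out.println(isDuplicate("same page", seen)); // true: duplicate rejected
    }
}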
TF/IDF Processing carries out word-weight calculation with TF and IDF values. The IDF process includes Related Document Collection, which accumulated 500 webpages from members of the Taiwanese Association for Artificial Intelligence as the document specimen, and IDF Calculation, which deletes duplicate vocabulary items from the document specimen and then counts the appearances of each remaining term in the specimen as its IDF. Cosine Similarity [5] compares the TF/IDF differences of each vocabulary item against the class specimens to calculate the similarity to each class. The system then goes one step further with three processes for discriminating the most similar documents in the domains. First, Title Weight Raising raises the weight of keywords that appear in the title. Second, Hierarchical Weight Raising assigns keyword weights according to their level in the ontology hierarchy, to help discriminate the most similar keywords in the domains. Finally, Threshold Filtering removes vocabulary items whose TF/IDF values fall below a threshold (set to 7 in this system) before they are delivered into the cosine-similarity calculation, so that noise does not distort the similarity computation [9]. The system finally stores the resulting classifications into the proper data folders for convenient later processing.
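The paper does not spell out the exact weighting variant, so for concreteness we state the standard TF/IDF and cosine-similarity forms of [9], [14], [15] on which the above steps build; the notation is our reconstruction, not the system's literal formulas:

    w_{t,d} = tf_{t,d} \times \log\frac{N}{df_t}, \qquad
    \mathrm{sim}(d,c) = \frac{\sum_{t} w_{t,d}\, w_{t,c}}{\sqrt{\sum_{t} w_{t,d}^{2}}\,\sqrt{\sum_{t} w_{t,c}^{2}}}

where tf_{t,d} is the frequency of term t in document d, df_t is the number of specimen documents containing t, N is the specimen size (500 above), and c ranges over the class specimen vectors.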
C. Information Extractor

Figure 4. Architecture of the information extractor

Figure 4 illustrates the architecture of the information extractor. It combines the data from the webpage crawler and the webpage classifier, reading each document's HTML source code, URL, assigned classification, and corresponding file name in order to extract significant information. Its preprocessing work mainly comprises URL Fixing and Sub-webpage Crawling. The former addresses sub-webpage URLs written in a website's internal hyperlink form: feeding such a link directly to the webpage crawler would return an empty webpage, so the system uses the getHost() method of Java's URL API to obtain the host URL of the webpage and combines it with the internal hyperlink to form the correct sub-webpage address. The latter addresses the fact that specific information may reside in sub-webpages, so the system must crawl down to the specific sub-webpages at the next level. Regular Processing then employs the regular expression technique to analyze and extract the three types of significant information the system needs: course information, website recommendations, and academic activities. After regular processing, the system outputs the Significant Information Files as text files organized by class for convenient later processing.

D. Information Recommender

The information recommender selects only the data folders of the relevant classification, so that it can quickly, precisely, and effectively query the corresponding scholar information, and then integrates and ranks the significant information for recommendation to users. Figure 5 illustrates the architecture of the information recommender. For Website Recommendation, the system uses the number of duplicated hyperlinks as the ranking base; that is, a website's weight is higher when many people of that class recommend it. For Course Information, the system consults whether similar course information exists in the classic scholar patterns; if so, its weight is raised, and a course likewise gains weight if it appears among the significant vocabulary of its classification. Academic Activities recommendation works the same way as course information: the system uses the classic scholar patterns and the class keywords to carry out the weight-promotion process. The system also consults the scholar ontology database to retrieve the keywords of each class for convenient information recommendation. The classic pattern of each class derives from 50 classic members of the Taiwanese Association for Artificial Intelligence and the Taiwan Fuzzy Systems Association, whose personal information was extracted from their webpages as the Classic Scholar Pattern specimens. Finally, the information recommender delivers the recommended information, ordered by keyword weight, to users through the integrated interface.

Figure 5. Architecture of the information recommender

E. Integrated Interface

Figure 6 illustrates the integrated interface of the system. It not only serves as a communication bridge but also presents the operation processes of the webpage crawler, webpage classifier, information extractor, and information recommender, respectively. The interface also provides an IDF processing function with which users can conveniently add related IDF data by themselves. Users can click the tabs in the upper-left area of the interface screen to view the processing procedure of each system stage and thereby understand its operation in depth.

Figure 6. Integrated interface of this system

F. Construction of Ontology Database

Our ontology is a knowledge-sharing database constructed for a specific domain; that is, the built ontology database of scholars is used to support the whole system operation.
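Because the ontology is ultimately imported into MS SQL Server (the transfer procedure is described below), components such as the crawler's Content Filtering and the information recommender can look up class keywords with plain SQL through JDBC. The following minimal sketch assumes a hypothetical ScholarConcept table and connection settings, since the paper does not document the actual schema, and assumes Microsoft's JDBC driver is on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

// Fetch the keywords of one ontology class from the MS SQL database.
// Table, columns, and connection string are hypothetical illustrations.
public final class OntologyLookup {
    public static List<String> keywordsOf(String domainClass) throws Exception {
        String url = "jdbc:sqlserver://localhost:1433;databaseName=ScholarOntology";
        List<String> keywords = new ArrayList<>();
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                 "SELECT keyword FROM ScholarConcept WHERE class_name = ?")) {
            ps.setString(1, domainClass);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) keywords.add(rs.getString("keyword"));
            }
        }
        return keywords;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(keywordsOf("Fuzzy Theory"));
    }
}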
The construction of the mentioned ontology database comprised two stages: the first is the statistical survey and analysis of scholar-related concepts, and the second is the construction of the ontology database itself. The procedures are detailed below.

Figure 7. The ontology structure of scholars

First of all, the system conducted a statistical survey of the homepages of related scholars to fetch the related concepts and their synonyms appearing in those homepages. Figure 7 shows the structure of the scholar domain ontology in Protégé. In application, the webpage crawler conveniently interprets these related concepts and compares them with the content of each queried webpage, checking whether the comparison outcomes correspond to any of the matching conditions for webpage querying. Figure 8 shows the second stage of the scholar ontology construction, whose main work is to transfer the ontology built with Protégé into the MS SQL database. The procedure is as follows: (1) export an XML file from the Protégé knowledge base and import it into MS Excel for correction; (2) import the MS Excel file into MS SQL Server to finish the ontology construction of this system.

Figure 8. Ontology database transferring procedures of scholars

IV. SYSTEM DISPLAY AND PERFORMANCE VERIFICATION

A. System Display

Figure 9. Execution screen of the webpage crawler

Figure 9 shows the execution screen of the ontology-supported webpage crawler, while Figure 10 displays the information recommendation results.

Figure 10. Result screen of the information recommender

B. Verification and Comparison of System Performance

Information recommendation means choosing the optimal recommendations from a group of related information sets; the question is analogous to whether sampling specimens can represent the sampled population within a huge amount of data. In the sampling-survey domain, reliability is usually employed to measure the precision of the sampling system itself, while validity emphasizes whether the measurement correctly reflects the properties of the thing being measured. This paper employs the mathematical model provided by J.P. Peter [13] in 1979, and cited by many papers since, to represent the definitions of reliability and validity [22].
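In Peter's model, the observed variance V_o of a measurement decomposes into a common (true) component V_co and an error component V_e. With the symbols used in TABLES III and IV below, the reliability r_tt and validity Val reported in this paper are consistent with the standard definitions

    r_{tt} = \frac{V_o - V_e}{V_o} = 1 - \frac{V_e}{V_o}, \qquad Val = \frac{V_{co}}{V_o}.

For example, V_e = 1 and V_o = 12 give r_{tt} ≈ 0.92, matching the first row of TABLE III. This reading is our reconstruction of [13] and [22], verified against the tabulated values rather than stated explicitly in the paper.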
In this experiment, we randomly chose 100 data items from the personal webpages of members of the Taiwanese Association for Artificial Intelligence and carried out three separate recommendation experiments. The significant information recommendations were assessed by domain experts, yielding observed values, true values, error values, and the related variances. The reliabilities and validities of information recommendation in the different professional domains are shown in TABLE I, and the total average results in TABLE II. The average reliability and validity were 0.88 and 0.78, respectively. The technical literature [18] regards 0.7 and 0.5 as the regular-level values of reliability and validity, respectively, which verifies that our experimental results achieve high-level information recommendation.

TABLE I. RESULTS OF THREE RECOMMENDATION RUNS
(rtt = reliability, Val = validity; AI = Artificial Intelligence, Fuzzy = Fuzzy Theory, ANN = Artificial Neural Network)

No. | Information Recommendation | AI rtt | AI Val | Fuzzy rtt | Fuzzy Val | ANN rtt | ANN Val
1   | Course Information         | 0.95   | 0.70   | 0.92      | 0.90      | 0.86    | 0.83
1   | Academic Activities        | 0.98   | 0.92   | 0.93      | 0.89      | 0.78    | 0.66
1   | Website Recommendation     | 0.96   | 0.78   | 0.89      | 0.73      | 0.91    | 0.73
1   | Average                    | 0.96   | 0.80   | 0.91      | 0.84      | 0.85    | 0.74
2   | Course Information         | 0.78   | 0.75   | 0.93      | 0.88      | 0.76    | 0.60
2   | Academic Activities        | 0.93   | 0.84   | 0.90      | 0.73      | 0.73    | 0.50
2   | Website Recommendation     | 0.98   | 0.96   | 0.90      | 0.80      | 0.88    | 0.87
2   | Average                    | 0.90   | 0.85   | 0.91      | 0.80      | 0.79    | 0.66
3   | Course Information         | 0.99   | 0.87   | 0.70      | 0.53      | 0.78    | 0.63
3   | Academic Activities        | 0.96   | 0.90   | 0.92      | 0.85      | 0.69    | 0.62
3   | Website Recommendation     | 0.97   | 0.89   | 0.95      | 0.90      | 0.83    | 0.76
3   | Average                    | 0.97   | 0.89   | 0.86      | 0.76      | 0.77    | 0.67

TABLE II. TOTAL AVERAGE RESULTS

Performance         | Artificial Intelligence | Fuzzy Theory | Artificial Neural Network | Total Average
Average Reliability | 0.94                    | 0.89         | 0.80                      | 0.88
Average Validity    | 0.85                    | 0.80         | 0.69                      | 0.78

In addition, to explore the relative merits of constructing information recommenders with different structures and approaches, this paper also combined the data mining tool SPSS Clementine with the domain ontology to mine useful and important information from the huge amount of data, and employed Java to develop another information recommender for scholars [22], which can recommend suitably important information to scholars.
Figure 11 shows the operational structure of this recommender. It likewise employs the CKIP segmentation system as a front-end assistant tool, which segments the webpage contents of scholars and filters out most of the stop words. The CKIP segmentation system can make segmentation errors on some terminologies of specific domains, and those errors would seriously harm the precision rate of backend word matching. The Segmentation Fixing function was designed to solve those problems; in detail it includes loading the stop-word database, deleting non-semantic characters, filtering stop words, re-segmenting with the support of the domain ontology, and computing term-frequency statistics.

Figure 11. Operational structure of the recommender with SPSS Clementine

Figure 12. Recommendation results of the recommender with SPSS Clementine

The SPSS Clementine system takes as input the text files processed by Segmentation Fixing and combines them with the domain ontology database as the basis of classification statistics and recommendation-matching analysis. The processing divides into two stages: the first stage judges whether a scholar webpage fits the specific domain, i.e., content-based filtering (the same function as the Content Filtering of the webpage crawler described in Section III.A); the second stage extracts the significant information from the scholar webpage as the important recommending base. The recommending mechanism comprises not only general normality recommending but also an optimal recommending module. For example, if a duplicated lesson list exists between a scholar and some related scholars, those significant courses become the optimal information for "Courses" recommending, i.e., collaborative filtering (playing the same role as the Classic Scholar Pattern of the information recommender described in Section III.D). Finally, Recommending Display shows the recommendation results mined by SPSS Clementine through a user interface implemented in Java, as shown in Figure 12.
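A minimal Java sketch of the optimal-recommending (collaborative-filtering) idea just described: courses that also appear in the lesson lists of the classic scholars of the same class have their weights promoted. The +1-per-matching-scholar scheme is our illustrative assumption; the paper does not specify the exact weight update.

import java.util.List;
import java.util.Set;

// Promote a course's weight once for each classic scholar of the same
// class whose lesson list also contains that course.
public final class CourseOverlapSketch {
    static int promotedWeight(String course, int baseWeight,
                              List<Set<String>> classicScholarCourses) {
        int weight = baseWeight;
        for (Set<String> courses : classicScholarCourses)
            if (courses.contains(course)) weight++;  // duplication with a classic scholar
        return weight;
    }

    public static void main(String[] args) {
        List<Set<String>> classics = List.of(
            Set.of("Fuzzy Control", "Neural Networks"),
            Set.of("Fuzzy Control", "Expert Systems"));
        System.out.println(promotedWeight("Fuzzy Control", 1, classics)); // prints 3
    }
}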
The experiment used the same set of domain ontology, webpage specimens, and verification rules (i.e., reliability and validity), and the system presented almost the same outputs in the related experiments. TABLE III lists the reliabilities of some scholars' recommended information on "Course Information" and "Academic Activities", which average 0.856 and 0.756, respectively. TABLE IV lists the corresponding validities, which likewise average 0.856 and 0.756. Finally, it merits attention that the Professional Classification in the last column is shown accurately for each scholar, which proves that the recommendation system built with SPSS Clementine has its accuracy and availability.

TABLE III. RESULTS OF THE RELIABILITY OF CLASSIFICATION INFORMATION

Professor           | Course Information (Ve, Vo, rtt) | Academic Activities (Ve, Vo, rtt) | Professional Classification
C.S. Ho (何正信)     | 1, 12, 0.92                      | 2, 7, 0.71                        | AI
T.W. Kuo (郭大維)    | 0, 2, 1                          | 0, 1, 1                           | AI
S.Y. Yang (楊勝源)   | 0, 5, 1                          | 0, 1, 1                           | AI
S.M. Chen (陳錫明)   | 7, 11, 0.36                      | 3, 7, 0.57                        | Fuzzy
W.L. Hsu (許聞廉)    | 0, 1, 1                          | 2, 4, 0.5                         | AI
Average rtt         | 0.856                            | 0.756                             |

TABLE IV. RESULTS OF THE VALIDITY OF CLASSIFICATION INFORMATION

Professor           | Course Information (Vco, Vo, Val) | Academic Activities (Vco, Vo, Val) | Professional Classification
C.S. Ho (何正信)     | 11, 12, 0.92                      | 5, 7, 0.71                         | AI
T.W. Kuo (郭大維)    | 2, 2, 1                           | 1, 1, 1                            | AI
S.Y. Yang (楊勝源)   | 5, 5, 1                           | 1, 1, 1                            | AI
S.M. Chen (陳錫明)   | 4, 11, 0.36                       | 4, 7, 0.57                         | Fuzzy
W.L. Hsu (許聞廉)    | 1, 1, 1                           | 2, 4, 0.5                          | AI
Average Val         | 0.856                             | 0.756                              |

Comparing TABLES II, III, and IV, the recommending reliability and validity of our recommendation system (0.88 and 0.78) are not only better than the regular-level values in the literature (0.7 and 0.5) but also exceed those of the recommender built with SPSS Clementine (0.806 on average). The comparison shows that the SPSS Clementine approach offers availability and good system performance and can reduce the overall time of system construction. Its shortcomings, however, are that the learning curve of SPSS Clementine is steep and long, and that its input and output limitations cannot completely fit specific applications for customized services. Our approach happens to solve these problems and indeed delivers excellent system performance, thereby yielding a design philosophy of customized services for recommendation systems.

V. CONCLUSION AND DISCUSSION

This paper has focused on developing an ontology-supported and hybrid recommending information system for scholars. Not only can it quickly integrate specific domain documents, but it can also extract important information from them through hybrid filtering technology to perform information integration and recommendation ranking. The experimental outcomes show that the reliability and validity measurements of the whole system reach high-level information recommendation performance. Furthermore, this paper has discussed the advantages and shortcomings of constructing a recommendation system with different approaches, and accordingly provided the design philosophy of customized services for recommendation systems.
This paper continues to employ the related techniques of the ontology-supported webpage crawler, OntoCrawler [20], and the ontology-supported webpage classifier, OntoClassifier [21], previously published by our laboratory. The present system employs the CKIP segmentation system (described in Section III.B) to remedy the original classifier's drawback of processing only English documents. Furthermore, it largely modifies the classification rules to adopt TF/IDF weighting and the cosine-similarity method for calculating the degree of webpage similarity, which improves the performance of ontology-supported document classification and accordingly enhances the precision rate of classification. From the viewpoint of the ontology-supported webpage crawler, its functions were already quite complete; in this paper we only modified part of the webpage-fetching mechanism and the output file format so that it combines conveniently and effectively with the operation of each backend sub-system.

In the future, we will focus on the practicable modularization of each sub-system. Currently the sub-systems are not fully modular, so the system cannot operate on different domains by modifying the ontologies alone; parts of the program code still have to be added or modified to achieve this ultimate goal. With practicable modularization of each sub-system, the system could provide a more complete human-machine interface that not only assists users in adding domain ontologies but also lets the system operate properly on those ontologies, taking one step further toward extending the practical applications of this system.

ACKNOWLEDGMENT

The authors would like to thank Ssu-Hsien Lu, Ting-An Chen, Chi-Feng Wu, Hung-Chun Chiang, Hsieh-I Lo, and You-Jen Chang for their assistance in the related sub-system implementations and experiments. This work was partially supported by the National Science Council, Taiwan, R.O.C., under Grants NSC-98-2221-E-129-012 and NSC-99-2623-E-129-002-ET, and by the Ministry of Education, Taiwan, R.O.C., under Grant Skill of Taiwan (1) Word No. 0990045921s.

REFERENCES

[1] B. Chandrasekaran, J.R. Josephson, and V.R. Benjamins, "What Are Ontologies, and Why Do We Need Them?," IEEE Intelligent Systems and Their Applications, 14(1), 1999, pp. 20-26.
[2] Y.J. Chen and V.W. Soo, "Ontology-Based Information Gathering Agents," Proc. of the 2001 International Conference on Web Intelligence, Maebashi TERRSA, Japan, 2001, pp. 423-427.
[3] L. Ding, T. Finin, A. Joshi, R. Pan, R.S. Cost, Y. Peng, P. Reddivari, V. Doshi, and J. Sachs, "Swoogle: A Search and Metadata Engine for the Semantic Web," Proc. of the 13th ACM International Conference on Information and Knowledge Management, Washington D.C., USA, 2004, pp. 652-659.
[4] A.J. Duineveld, R. Stoter, M.R. Weiden, B. Kenepa, and V.R. Benjamins, "WonderTools? A Comparative Study of Ontological Engineering Tools," International Journal of Human-Computer Studies, 52(6), 2000, pp. 1111-1133.
[5] E. Garcia, "Cosine Similarity and Term Weight Tutorial: An Information Retrieval Tutorial on Cosine Similarity Measures, Dot Products and Term Weight Calculations," available at http://www.miislita.com/information-retrieval-tutorial/cosine-similarity-tutorial.html, 2006.
[6] J.H. Gennari et al., "The Evolution of Protégé: An Environment for Knowledge-Based Systems Development," International Journal of Human-Computer Studies, 58, 2003, pp. 89-123.
[7] W.E. Grosso, H. Eriksson, R.W. Fergerson, J.H. Gennari, S.W. Tu, and M.A. Musen, "Knowledge Modeling at the Millennium: The Design and Evolution of Protege-2000," SMI Technical Report, SMI-1999-0801, CA, USA, 1999.
[8] S.Y. Hsu, Building a Semantic Ontology with Song Ci Segmentation, Master Thesis, College of Science, National Chiao Tung University, HsinChu, Taiwan, 2006.
[9] T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization," Proc. of the 14th International Conference on Machine Learning, 1996, pp. 143-151.
[10] L. Kerschberg, W. Kim, and A. Scime, "WebSifter II: A Personalizable Meta-Search Agent Based on Weighted Semantic Taxonomy Tree," Proc. of the International Conference on Internet Computing, Las Vegas, Nevada, 2001, pp. 14-20.
[11] Y.C. Lo, The Art of Java, GrandTech Computer Graphic Systems Incorporation, Taipei, Taiwan, 2003, pp. 6-1~6-66.
[12] W.Y. Ma and K.J. Chen, "Introduction to CKIP Chinese Word Segmentation System for the First International Chinese Word Segmentation Bakeoff," Proc. of ACL, Second SIGHAN Workshop on Chinese Language Processing, Sapporo, Japan, 2003, pp. 168-171.
[13] J.P. Peter, "Reliability: A Review of Psychometric Basics and Recent Marketing Practices," Journal of Marketing Research, 16, 1979, pp. 6-17.
[14] G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill Book Co., New York, 1983.
[15] K. Spark-Jones, "A Statistical Interpretation of Term Specificity and Its Application in Retrieval," Journal of Documentation, 28(5), 1972, pp. 111-121.
[16] SPSS Taiwan Corporation, Handout of Training Course on SPSS Clementine, Taipei, Taiwan, 2006.
[17] C.C. Tan, An Intelligent Web-Page Classifier with Fair Feature-Subnet Selection, Master Thesis, Dept. of Electronic Engineering, National Taiwan University of Science and Technology, Taipei, Taiwan, 2000.
[18] T.X. Wu, "The Reliability and Validity of Attitude and Behavior Research: Theory, Application, and Self-Examination," Public Opinion Monthly, 1985, pp. 29-53.
[19] S.Y. Yang, Y.C. Chu, and C.S. Ho, "Ontology-Based Solution Integration for Internet FAQ Systems," Proc. of the 6th Conference on Artificial Intelligence and Applications, Kaohsiung, Taiwan, 2001, pp. 52-57.
[20] S.Y. Yang, T.A. Chen, and C.F. Wu, "Ontology-Supported Focused Crawler for Scholar's Webpage," Proc. of the 2008 International Conference on Advanced Information Technology, TaiChung, Taiwan, 2008, p. 55.
[21] S.Y. Yang, H.C. Chiang, and C.S. Wu, "Ontology-Supported Webpage Classifier for Scholar's Webpages," Proc. of the Nineteenth International Conference on Information Management, NanTou, Taiwan, 2008, p. 113.
[22] S.Y. Yang, H.I. Lo, and Y.J. Chang, "Ontology-Supported Web Recommender for Scholar Information," Proc. of the 2009 International Conference on Advanced Information Technology, TaiChung, Taiwan, 2009.