Component Search and Reuse: An Ontology-based Approach Awny Alnusair and Tian Zhao Department of Computer Science University of Wisconsin-Milwaukee, USA falnusair, tzhao@uwm. edu Abstract cation domain, and to represent the relationships that hold among these concepts. Due to their solid formal and rea- In order to realize the full potential of software reuse, soning foundation, ontologies can play an important role effective search techniques are indeed essential. In this pa- in domain engineering reuse; they can be used to structure per, we propose a semantic-based approach for identifying and build a source-code knowledge base that can be used by Ind retrieving relevant components from a reuse repository. software agents(e. g, search engine) and certainly serve as This approach relies on building a knowledge base accord- a basis for semantic queries [ 9]. Therefore, ontologies have ing to an ontology model that includes a source-code on- been successfully used by various researchers to improve many aspects of the software engineering processes [12] ogy. Due to the indexing and knowledge population mecha Towards this end, we have developed an ontology model nisms we used, our proof-of-concept supports various kinds that includes an enhanced software representation ontol of search techniques. However, our experiments show evi- This ontology is further extended with additional dence that only pure semantic search that exploits domain component-specific knowledge and automatically popu- nowledge tends to improve precision Based on a usability lated with ontological instances representing various pro- case study, we argue that semantic search is indeed usable gram elements of a given library. These instances are fur ther annotated with respect to concepts from a domain specific ontology Ontology-based component search is thus performed by 1. Introduction the semantic matching of user requests expressed using terms from the domain ontology with component descrip Systematic software reuse enables developers to use ex tions found in the populated knowledge-base. Furthermore isting software components for constructing new quality the knowledge po ation oftware systems. Thus, speeding up the development pro- in our approach still allows searching the knowledge base cess and reducing costs and risks. Failure modes analysis using the most familiar keyword search. This is particularly useful when domain ontologies or semantic component an- of the reuse process shows that in order to be reused, a soft- notations are lacking or incomplete. Users are thus able to ware component must be findable and certainly understand able 51. On one hand, understanding the component's func- keyword queries, type-based queries or a mixture of all tionality as well as its relationships with other components in a reuse library is usually hampered due to the lack of quality descriptions of library services. On the other hand, 2. Ontology model for component reuse finding suitable components is still a significant barrier for exploiting systematic software reuse At the core of our ontology model for object-oriented In this paper, we present an approach for describing, component retrieval is a Source Code Representation On retrieving, and exploring the various relationships among tology(referred to afterwards as SCRO). This ontology pro- software components in object-oriented reuse libraries. In vides a base model for capturing the relationships and de order to provide a formal and precise representation of li- pendencies among source-code artifacts. It models major brary code, our approach relies on ontologizing software concepts and features of object-oriented programs, includ- ing encapsulation, class and interface inheritance, met Ontologies provide means to explicitly describe con- overloading, method overriding, and method signature in- pts, objects, properties and other entities in a given appli- formation. SCRO,'s knowledge is represented using the
Component Search and Reuse: An Ontology-based Approach Awny Alnusair and Tian Zhao Department of Computer Science University of Wisconsin-Milwaukee, USA {alnusair, tzhao}@uwm.edu Abstract In order to realize the full potential of software reuse, effective search techniques are indeed essential. In this paper, we propose a semantic-based approach for identifying and retrieving relevant components from a reuse repository. This approach relies on building a knowledge base according to an ontology model that includes a source-code ontology, a component ontology, and a domain-specific ontology. Due to the indexing and knowledge population mechanisms we used, our proof-of-concept supports various kinds of search techniques. However, our experiments show evidence that only pure semantic search that exploits domain knowledge tends to improve precision. Based on a usability case study, we argue that semantic search is indeed usable and practical. 1. Introduction Systematic software reuse enables developers to use existing software components for constructing new quality software systems. Thus, speeding up the development process and reducing costs and risks. Failure modes analysis of the reuse process shows that in order to be reused, a software component must be findable and certainly understandable [5]. On one hand, understanding the component’s functionality as well as its relationships with other components in a reuse library is usually hampered due to the lack of quality descriptions of library services. On the other hand, finding suitable components is still a significant barrier for exploiting systematic software reuse. In this paper, we present an approach for describing, retrieving, and exploring the various relationships among software components in object-oriented reuse libraries. In order to provide a formal and precise representation of library code, our approach relies on ontologizing software knowledge. Ontologies provide means to explicitly describe concepts, objects, properties and other entities in a given application domain, and to represent the relationships that hold among these concepts. Due to their solid formal and reasoning foundation, ontologies can play an important role in domain engineering reuse; they can be used to structure and build a source-code knowledge base that can be used by software agents (e.g., search engine) and certainly serve as a basis for semantic queries [9]. Therefore, ontologies have been successfully used by various researchers to improve many aspects of the software engineering processes [12]. Towards this end, we have developed an ontology model that includes an enhanced software representation ontology. This ontology is further extended with additional component-specific knowledge and automatically populated with ontological instances representing various program elements of a given library. These instances are further annotated with respect to concepts from a domainspecific ontology. Ontology-based component search is thus performed by the semantic matching of user requests expressed using terms from the domain ontology with component descriptions found in the populated knowledge-base. Furthermore, the knowledge population and indexing mechanisms used in our approach still allows searching the knowledge base using the most familiar keyword search. This is particularly useful when domain ontologies or semantic component annotations are lacking or incomplete. Users are thus able to retrieve components using purely semantic-based queries, keyword queries, type-based queries or a mixture of all. 2. Ontology model for component reuse At the core of our ontology model for object-oriented component retrieval is a Source Code Representation Ontology (referred to afterwards as SCRO). This ontology provides a base model for capturing the relationships and dependencies among source-code artifacts. It models major concepts and features of object-oriented programs, including encapsulation, class and interface inheritance, method overloading, method overriding, and method signature information. SCRO’s knowledge is represented using the
OWL-DL ontology language. OWL is a web-based con- components in a given domain. Therefore, we extends ceptual modelling language used for capturing relationship SCROs semantic representations of API structures and en- semantics among domain concepts, OWL-DL is a subset rich it with additional component-specific descriptions re of oWL based on Description Logic(DL) and has desir- quired to uniquely identify and retrieve an API compe able computational properties for automated reasoning sys- nent. The result is a COMPonent REpresentation ontol- tems. OWL-DLs reasoning support enables inferring addi- ogy(referred to afterwards as COMPRE). In addition to tional knowledge and computing the classification hierar- the concepts, axioms, and properties inherited from SCRO, chy(subsumption reasoning). SCRO defines various OWL COMPRe defines its own class hierarchy and relations concepts that map directly to source-code elements and col- for semantic component descriptions. For instance, the lectively represent the most important concepts found in Component concept represents software components in object-oriented programs. Furthermore, SCRO defines var- general and subsumes other OWL classes that represents ious OWL object properties, datatype properties, and onto- Java-specific components such as Method, classType logical axioms to represent the various relationships among and Interface Type. A fragment of the ontology tax ontological concepts. SCRO is precise, well documented, onomy is shown on the left pane of Figure l and the com and designed with ontology reuse in mind. The availability plete ontology can be examined online [1 of SCRO online [1], allows researchers to reuse or extend its Moreover, comPre defines various ontological ax- representational components to support any semantic-based ioms and object properties that represent relationships application that requires source-code knowledge among software components. These properties link com- ponents with their corresponding semantic descriptions 2.l. Domain-specific ontology specified in the domain-specific ontology. For example, Domain ontologies describe concepts and structures re- corresponding inverse properties are used to annotate in- lated to a particular domain(e.g. finance, shopping, dividual software components with domain terms repre medicine, graphics, or the object-oriented programs domain senting the expected inputs and outputs of the component as specified in SCRO). In our approach to component re- Moreover, COMPRE defines the dependson symmetric trieval, we use a domain ontology that conceptualizes each object property that defines dependency relationship be- software library we need to reuse. This ontology provides tween components. describedBy is also defined to li a common vocabulary with unambiguous and conceptually a component to a domain concept that best describes the sound terms that can be used to annotate software compo- purpose or the nature of the component nents. Annotations in this context serves two key purposes In addition to pure semantic search that is based on Firstly, both software providers and users can communicate using a shared and common vocabulary provided by this component annotations with respect to a domain onto ogy, COMPRE also defines various datatype properties to ontology. Thus enabling a precise retrieval of API compo- model other metadata about components. These proper nents. Secondly, it brings difterent perspectives to typical ties are provided to enable metadata keyword queries that program comprehension tasks. Users can familiarize them- can be used when semantic annotations are lacking or in- selves with terminology and conceptual knowledge that is complete. For instance, the hasInput Terms and the typically implicit in the problem domain. To this end, we have developed a mini-ontology for data words describing the component's input and its expected retrieval in the Semantic Web applications domain. This on tology (referred to afterwards as Swonto) has been used outpu during the evaluation of our approach(cf. Section 4)and serves as a proof of concept that component search can be 23. Knowledge population significantly enhanced through the use of domain ontolo- gies. A small fragment of the ontology's taxonomy is shown on the upper right pane of Figure I and the complete onto- to populate the knowledge base with ontological instances ogy is found online [1] that represent various ontological concepts and their corre- 2.2. Component-specific ontology sponding relationships. Therefore, we have built a knowl- edge extractor subsystem for the Java programming lan In the context of component retrieval, we certainly need guage. Our subsystem performs a comprehensive parsing of profound semantic description of the components in- the Java bytecode and captures every ontology concept that represents a source code element and generates instances ner working structure and its interrelationships with other of all ontological properties defined in our ontologies for http://www.w3.org/2004/owl/ those program elements. The generated semantic instances
OWL-DL1 ontology language. OWL is a web-based conceptual modelling language used for capturing relationship semantics among domain concepts, OWL-DL is a subset of OWL based on Description Logic (DL) and has desirable computational properties for automated reasoning systems. OWL-DLs reasoning support enables inferring additional knowledge and computing the classification hierarchy (subsumption reasoning). SCRO defines various OWL concepts that map directly to source-code elements and collectively represent the most important concepts found in object-oriented programs. Furthermore, SCRO defines various OWL object properties, datatype properties, and ontological axioms to represent the various relationships among ontological concepts. SCRO is precise, well documented, and designed with ontology reuse in mind. The availability of SCRO online [1], allows researchers to reuse or extend its representational components to support any semantic-based application that requires source-code knowledge. 2.1. Domain-specific ontology Domain ontologies describe concepts and structures related to a particular domain (e.g. finance, shopping, medicine, graphics, or the object-oriented programs domain as specified in SCRO). In our approach to component retrieval, we use a domain ontology that conceptualizes each software library we need to reuse. This ontology provides a common vocabulary with unambiguous and conceptually sound terms that can be used to annotate software components. Annotations in this context serves two key purposes. Firstly, both software providers and users can communicate using a shared and common vocabulary provided by this ontology. Thus enabling a precise retrieval of API components. Secondly, it brings different perspectives to typical program comprehension tasks. Users can familiarize themselves with terminology and conceptual knowledge that is typically implicit in the problem domain. To this end, we have developed a mini-ontology for data retrieval in the Semantic Web applications domain. This ontology (referred to afterwards as SWONTO) has been used during the evaluation of our approach (cf. Section 4) and serves as a proof of concept that component search can be significantly enhanced through the use of domain ontologies. A small fragment of the ontology’s taxonomy is shown on the upper right pane of Figure 1 and the complete ontology is found online [1]. 2.2. Component-specific ontology In the context of component retrieval, we certainly need a profound semantic description of the component’s inner working structure and its interrelationships with other 1http://www.w3.org/2004/OWL/ components in a given domain. Therefore, we extends SCROs semantic representations of API structures and enrich it with additional component-specific descriptions required to uniquely identify and retrieve an API component. The result is a COMPonent REpresentation ontology (referred to afterwards as COMPRE). In addition to the concepts, axioms, and properties inherited from SCRO, COMPRE defines its own class hierarchy and relations for semantic component descriptions. For instance, the Component concept represents software components in general and subsumes other OWL classes that represents Java-specific components such as Method, ClassType, and InterfaceType. A fragment of the ontologys taxonomy is shown on the left pane of Figure 1 and the complete ontology can be examined online [1]. Moreover, COMPRE defines various ontological axioms and object properties that represent relationships among software components. These properties link components with their corresponding semantic descriptions specified in the domain-specific ontology. For example, hasDomainInput and hasDomainOutput and their corresponding inverse properties are used to annotate individual software components with domain terms representing the expected inputs and outputs of the component. Moreover, COMPRE defines the dependsOn symmetric object property that defines dependency relationship between components. describedBy is also defined to link a component to a domain concept that best describes the purpose or the nature of the component. In addition to pure semantic search that is based on component annotations with respect to a domain ontology, COMPRE also defines various datatype properties to model other metadata about components. These properties are provided to enable metadata keyword queries that can be used when semantic annotations are lacking or incomplete. For instance, the hasInputTerms and the hasOutputTerms are used to assign meaningful keywords describing the component’s input and its expected output. 2.3. Knowledge population Once the ontology structure is specified, one next needs to populate the knowledge base with ontological instances that represent various ontological concepts and their corresponding relationships. Therefore, we have built a knowledge extractor subsystem for the Java programming language. Our subsystem performs a comprehensive parsing of the Java bytecode and captures every ontology concept that represents a source code element and generates instances of all ontological properties defined in our ontologies for those program elements. The generated semantic instances 2
are serialized using RDF. RDF is web-based language few terms describing its purpose. The third triple, however, suitable for describing resources and provides an extensible tags the same input parameter with a meaningful concept data model for representing machine-processable semantics (QueryText)from the domain ontology. Thus, giving the of data. For each application framework parsed, we thus parameter an agreed-upon and meaningful description other generate an RDF ontology that represents the instantiated than terms or the semantically vague String type. knowledge base for the framework at hand. This knowledge base is managed by Jena [8], an open source Java frame 1. create[.] scro: hasInput Type String work for building Semantic Web applications 2. create[.] compre: hasInputTerms The process of generating semantic instances for the query string concepts and relations specified in SCRO is completely au- 3. create[.] compre: hasDomainInput However, the process of annotating componen according to COMPRE's object properties is currently man- ual as it is the case for semantic annotations in general. ebase http Ourtoolthoughprovidesmeansforinsertingtheseannota-prefixscro: tionsdirectlyintotheknowledge-base,thusgraduallybuild-prefixcomprechttp:../ontologies/compre.owl4> ing semantic descriptions for a particular API that can be PREFIx swonto: properties is generated automatically via direct parsing of a scro: staticMethod the source-code. We thus capture and normalize method scro: hasInputType i scro: hasInput Type signatures, identifier names, source-code comments, and scro: hasInput Type available Java annotations in order to obtain a meaningful utputType keyword descriptions of components. These descriptions scro: invokesMethod are lexically analyzed, stored, and indexed using the to- scro: hassignature create[String, string, Syntax] kenization and indexing mechanisms provided by Apache ompre: describedBy I a swonto: QueryCreation]i Lucene, an open-source full-featured text search engine compre: hasDomainInput a swonto: QueryText In the next section, we show how the knowledge gener- compre: hasDomainInput a swonto: URIl ed using this knowledge extractor sub-system can be used compre: hasDomainInput swonto: QueryLanguage Sy for component search. For an extended discussion of our hasDomainoutput ontologies, complete knowledge population samples, we re fer the reader to our ontologies website [1] compre: hasInputrermsquery string .."i ompre: hasInputTerms query syntax URI 3. ontological search compre: hasoutput Terms " query dc: description "create query Listing I shows a partial RDF description obtained luring the knowledge population phase for a Jena API Listing 1. RDF descriptor for an APl method method, the create method. This method belongs to the jueryFactory class and usually used to create a Query This multi-faceted description of components enables object given the specified input. This rdF descriptioN clearly captures the component's metadata at the semantic a) type or signature-based queries; b)metadata keyword queries; c) pure semantic-based queries; or d) blended The underlying data structure of RDF is a labeled di rected graph. Each node-arc-node in this graph represents a queries of the previous three types However, we focus the discussion on pure semantic- triple that consists of three parts, subject, predicate and ob- based queries that rely on domain-specific knowledge ject. Consider Listing I for example, the described method Primarily, search techniques that rely on variations of is always the subject, onto- keyword-based search suffer from synonymity and poly- ogy properties are predicates, and objects are either a re- semic ambiguity that often lead to low recall and precision source, unlabeled node(blank node)or a literal value. For On the other hand, signature matching techniques cannot example, the first triple below uses a property from scro distinguish between components that have the same signa to assert that the method has an input parameter of type ture but serve different purposes, e.g ng Jena api to String. The second triple associates this parameter with create a new query vs read a query from a file. In http://www.w3.org/tr/rdf-primer semantic search, however, these limitations are completely 3http://lucene.apache.org/ dealt with since the semantics of each of the types in si
are serialized using RDF 2 . RDF is web-based language suitable for describing resources and provides an extensible data model for representing machine-processable semantics of data. For each application framework parsed, we thus generate an RDF ontology that represents the instantiated knowledge base for the framework at hand. This knowledge base is managed by Jena [8], an open source Java framework for building Semantic Web applications. The process of generating semantic instances for the concepts and relations specified in SCRO is completely automatic. However, the process of annotating components according to COMPRE’s object properties is currently manual as it is the case for semantic annotations in general. Our tool though provides means for inserting these annotations directly into the knowledge-base, thus gradually building semantic descriptions for a particular API that can be shared, evolved, and reused by a community of users. On the other hand, metatdata modelled by COMPRE’s datatype properties is generated automatically via direct parsing of the source-code. We thus capture and normalize method signatures, identifier names, source-code comments, and available Java annotations in order to obtain a meaningful keyword descriptions of components. These descriptions are lexically analyzed, stored, and indexed using the tokenization and indexing mechanisms provided by Apache Lucene 3 , an open-source full-featured text search engine. In the next section, we show how the knowledge generated using this knowledge extractor sub-system can be used for component search. For an extended discussion of our ontologies, complete knowledge population samples, we refer the reader to our ontologies website [1]. 3. Ontological search Listing 1 shows a partial RDF description obtained during the knowledge population phase for a Jena API method, the create method. This method belongs to the QueryFactory class and usually used to create a Query object given the specified input. This RDF description clearly captures the component’s metadata at the semantic and syntactic level. The underlying data structure of RDF is a labeled directed graph. Each node-arc-node in this graph represents a triple that consists of three parts, subject, predicate and object. Consider Listing 1 for example, the described method in this snippet, create[..], is always the subject, ontology properties are predicates, and objects are either a resource, unlabeled node (blank node) or a literal value. For example, the first triple below uses a property from SCRO to assert that the method has an input parameter of type String. The second triple associates this parameter with 2http://www.w3.org/TR/rdf-primer 3http://lucene.apache.org/ few terms describing its purpose. The third triple, however, tags the same input parameter with a meaningful concept (QueryText) from the domain ontology. Thus, giving the parameter an agreed-upon and meaningful description other than terms or the semantically vague String type. 1. create[..] scro:hasInputType String 2. create[..] compre:hasInputTerms "query string" 3. create[..] compre:hasDomainInput [ a swonto:QueryText] @base PREFIX scro: PREFIX compre: PREFIX swonto: PREFIX dc: a scro:StaticMethod ; scro:hasInputType ; scro:hasInputType ; scro:hasInputType ; scro:hasOutputType ; scro:invokesMethod ; scro:hasSignature "create[String,String,Syntax]"; compre:describedBy [ a swonto:QueryCreation]; compre:hasDomainInput [ a swonto:QueryText]; compre:hasDomainInput [ a swonto:URI]; compre:hasDomainInput [ a swonto:QueryLanguageSyntax]; compre:hasDomainOutput[ a swonto:ExtendedQuery]; compre:hasInputTerms "query string ..."; compre:hasInputTerms "base URI ..."; compre:hasInputTerms "query syntax URI ..."; compre:hasOutputTerms "query ..."; dc:description "create query ..."; .... Listing 1. RDF descriptor for an API method This multi-faceted description of components enables four different types of queries against the knowledge base: a) type or signature-based queries; b) metadata keyword queries; c) pure semantic-based queries; or d) blended queries of the previous three types. However, we focus the discussion on pure semanticbased queries that rely on domain-specific knowledge. Primarily, search techniques that rely on variations of keyword-based search suffer from synonymity and polysemic ambiguity that often lead to low recall and precision. On the other hand, signature matching techniques cannot distinguish between components that have the same signature but serve different purposes, e.g. using Jena API to create a new query vs read a query from a file. In semantic search, however, these limitations are completely dealt with since the semantics of each of the types in sig- 3
natures are encoded and processed during search. Besides are used to create, read or even parse a semantic addressing knowledge representation effectively, semantic query. Nevertheless, this search mechanism is flexible since search offers extensible solutions to component retrieval. it allows a wide range of queries to run against the knowl- Since we are focusing on API usage and reuse, the de- edge base. In fact, the expressive power provided by our scriptions shown in Listing I capture the component's inter- ontologies allows users to express their queries in more de- face and its relationships with other components. However, tails than would otherwise be expressed with any altena these description can be easily extended to capture other tive method. For instance, assume that the user was able to facets(e.g, component's environment) via introducing ad- obtain a Query object as described in the previous exam- ditional ontological properties. ple. The next natural step is to find a component that can Reasoning is one of the primary added benefits in seman- take this query as input, execute it, and return the required tic search. In addition of classifying and checking the con- results. Browsing the Jena API looking for such a compo- sistency of our ontologies, a DL reasoner can also be used nent or even querying using typical keyword-based queries to inferring and thus enriching the knowledge base with ad- would not return an answer since there is an intermediate ditional knowledge that is not explicitly stated. Thus, play- query execution object that must be obtained to complete ing a vital role in improving search precision and recall in the task. This appears to be a dead end. However, using comparison with other search techniques. DL Subsumption semantic search, this request can be expressed fairly easily reasoning, for example, is typically used to establish sub- as follows set inclusion relationships between different concepts an the ontology. Consider the descriptions Query E compre: Component n Listing I for example, when pure semantic-based queries (compre: has DomainInput are used, users need only to provide domain concepts de- swonto: SemanticQuery) n scribing the components interface. Therefore, if the user Compre: has DomainOutput 3 compre provides SemanticQuery as a domain output of the re- has DomainOutput swonto: Result Set quested component, the method shown in the listing would This query expresses the fact that we are looking for a com- still match this request since SemanticQuery suDs ponent that takes a semantic query as inpu returns an- ExtendedQuery as specified in the subsumption hierar- oth chy of our domain ontology. Thus, automatically enablin other component that returns a query solution. Thus, query ing for multiple components at the sam an implicit form of query expansion It is notable that semantic-based retrieval alleviates many 3. 1. Implementation and ranking mechanisms problems typically faced by tools that rely on exact key word or type matching. One of the strengths of our ap- proach, however, is the ability to utilize our various ontolo- We have implemented this approach in a tool called gies in order to perform blended search against the knowl- CompRE, conveniently named after the main ontology in edge base. In particular, this is helpful when components in our modeL. CompRE is deployed as a plug-in for the the knowledge base are not completely annotated or when clipse Integrated Development Environment (IDE). Fi users are still in the process of becoming familiar with ure I shows a snapshot of CompREs main views in the the ontology. Consider for example a user who wishes Eclipse workbench. When loaded for the first time, Com- pRE processes the library code and the component ontology type(SemanticQuery)and one of the actual input types in order to generate the initial knowledge base as described ( Syntax) are known. Furthermore, since the user is not about the other input types, she wants to provide a fer ficient(it only took 4.5 seconds for par terms to filter out the results. This request can be expressed the Jena framework). CompRE also includes a module that using the following query in DL-like syntax: allow users to tag components with semantic references that corresponds to concepts from the domain ontology. These Query E compre: Component n dra (compre: hasDomain Output tured by the storage module, and stored automatically in swonto: SemanticQuery)n knowledge base Upon the conclusion of the knowledge Escro: hasInputType kb: Syntax)n population process, a knowledge repository is created and (compre: hasInputTerms value"base uri) becomes ready for answering user requests CompRe provides two separate views for formulating As expected, executing this query returns not only the queries. The first view is provided as a simple data entry method shown in Listing I but also other unrelated meth- form as shown in the figure. In each entry box, users need ods. It turns out that the input terms specified in the query to provide search restrictions that are either prefixed with are very popular and are used to describe API methods that an ontology name or provided as plain keywords enclosed
natures are encoded and processed during search. Besides addressing knowledge representation effectively, semantic search offers extensible solutions to component retrieval. Since we are focusing on API usage and reuse, the descriptions shown in Listing 1 capture the component’s interface and its relationships with other components. However, these description can be easily extended to capture other facets (e.g., component’s environment) via introducing additional ontological properties. Reasoning is one of the primary added benefits in semantic search. In addition of classifying and checking the consistency of our ontologies, a DL reasoner can also be used to inferring and thus enriching the knowledge base with additional knowledge that is not explicitly stated. Thus, playing a vital role in improving search precision and recall in comparison with other search techniques. DL Subsumption reasoning, for example, is typically used to establish subset inclusion relationships between different concepts and properties in the ontology. Consider the descriptions in Listing 1 for example, when pure semantic-based queries are used, users need only to provide domain concepts describing the component’s interface. Therefore, if the user provides SemanticQuery as a domain output of the requested component, the method shown in the listing would still match this request since SemanticQuery subsumes ExtendedQuery as specified in the subsumption hierarchy of our domain ontology. Thus, automatically enabling an implicit form of query expansion. It is notable that semantic-based retrieval alleviates many problems typically faced by tools that rely on exact keyword or type matching. One of the strengths of our approach, however, is the ability to utilize our various ontologies in order to perform blended search against the knowledge base. In particular, this is helpful when components in the knowledge base are not completely annotated or when users are still in the process of becoming familiar with the ontology. Consider for example a user who wishes to find a component in which the component’s domain output type (SemanticQuery) and one of the actual input types (Syntax) are known. Furthermore, since the user is not sure about the other input types, she wants to provide a few terms to filter out the results. This request can be expressed using the following query in DL-like syntax: Query ≡ compre : Component u (∃compre : hasDomainOutput . swonto : SemanticQuery) u (∃scro : hasInputT ype . kb : Syntax) u (∃compre : hasInputT erms value 00base uri00) As expected, executing this query returns not only the method shown in Listing 1 but also other unrelated methods. It turns out that the input terms specified in the query are very popular and are used to describe API methods that are used to create, read or even parse a semantic query. Nevertheless, this search mechanism is flexible since it allows a wide range of queries to run against the knowledge base. In fact, the expressive power provided by our ontologies allows users to express their queries in more details than would otherwise be expressed with any alternative method. For instance, assume that the user was able to obtain a Query object as described in the previous example. The next natural step is to find a component that can take this query as input, execute it, and return the required results. Browsing the Jena API looking for such a component or even querying using typical keyword-based queries would not return an answer since there is an intermediate query execution object that must be obtained to complete the task. This appears to be a dead end. However, using semantic search, this request can be expressed fairly easily as follows: Query ≡ compre : Component u (∃compre : hasDomainInput . swonto : SemanticQuery) u (∃compre : hasDomainOutput . ∃ compre : hasDomainOutput . swonto : ResultSet) This query expresses the fact that we are looking for a component that takes a semantic query as input and returns another component that returns a query solution. Thus, querying for multiple components at the same time. 3.1. Implementation and ranking mechanisms We have implemented this approach in a tool called CompRE, conveniently named after the main ontology in our model. CompRE is deployed as a plug-in for the Eclipse Integrated Development Environment (IDE). Figure 1 shows a snapshot of CompRE’s main views in the Eclipse workbench. When loaded for the first time, CompRE processes the library code and the component ontology in order to generate the initial knowledge base as described in Section 2.3. This process is completely automatic and ef- ficient (it only took 4.5 seconds for parsing and processing the Jena framework). CompRE also includes a module that allow users to tag components with semantic references that corresponds to concepts from the domain ontology. These annotations entered via drag and drop mechanisms, captured by the storage module, and stored automatically in the knowledge base. Upon the conclusion of the knowledge population process, a knowledge repository is created and becomes ready for answering user requests. CompRE provides two separate views for formulating queries. The first view is provided as a simple data entry form as shown in the figure. In each entry box, users need to provide search restrictions that are either prefixed with an ontology name or provided as plain keywords enclosed 4
Edit Source Refactor Navigate Search Project Run s CompRE Window Help COMPRE C Domain ontology 23 日 L thing o-C queryLanguage Syntax A 曰 Component L ClassType pu工⊥。工往 ss Retr1 eveselectec [ puB工1Resu⊥ LaSer execute 区 arrw-OurrwFer e EnerFaceType|‖(Tat( E Component Search 3S a界 ARQL Query 由@ AnnotationTyp Enter your search criteria.. +base +uri swonto SemanticQuery +C InstanceMetho Dependencies: Start Over ontrolstructure L Repetitionstrucy Figure 1. CompRE: showing the component ontology, domain ontology, and the main search view within quotes. As described in the previous section, using prove this initial ranking, we refine this initial order based the compre-kb prefix tells the system that this is in fact an on suitability measures that consider the current user con- actual API type specified in the knowledge base. However, text. We thus parse the code that is currently being de- the swonto prefix refers to a concept from the currently veloped and create a profile that includes all visible types active domain ontology. Since the domain ontology ca that are either declared by the programmer or inherited be different for each APl, its name is provided as an exter- the user's context. We further analyze each retrieved can- nal configuration parameter. Free-text requirements in each didates signature in terms of the new input types that this query may optionally utilize all fuzzy extensions supported candidate will introduce into the current context if selected by the lucene's query parser, thus allowing a full-featured by the user. Naturally, candidates that introduce more types keyword-based search. Once the form is filled, CompRE should be assigned a lower rank value. However, finding the collects the search requirements and automatically gener- newly introduced type in the context profile, will not count ate a query using the SPARQL 4 query language, it then against this candidates score. executes this query against the knowledge base. CompRE With the absence of keywords in user queries, we ap- also provides a query answering view for advanced users ply only context-based heuristics such that candidates with who wish to edit their own SPARQL queries directly, there- exact matches are put at the top of the list while other can- ore, gaining full control over various aspects of the compo- didates are ranked based on the number and type of their nent ontology. For instance, users may wish to specify that input and output types. For example, consider a user who is the desired component extends a particular compone trying to search for a component that requires two particular or perhaps usedBy a certain number of components as a input types, namely ll and 12. Assume that the repository measure of its popularity in the target library. Regardless contains three components, namely Cl, C2, and C3. Lets of the data entry mechanism used, CompRE executes the also assume that Cl is an exact match, C2 has only one in- query, ranks the retrieved instances, and presents the result put type(l1), and C3 requires three input types (ll, 12, and n a viewer that enables further exploration of each recom- 13). The system is then ranks Cl first, C2 is ranked second mended component and c3 is in fact the least desired since it will introduce a Ranking the retrieved candidates according to their rele- new type to the user context, it is thus included in the result ncy to the user needs saves time and efforts. While al lowing blended queries in our approach ensures flexibil- are simple, easy to implement, and work surprisingly well ity and robustness, it however complicates the ranking pro- cess. When blended or pure syntactic queries are submitted, 4. Experiments and results we initially rely on the traditional, however solidly proven, scoring mechanisms supported by Lucene. In order to im- Due to the lack of independent and standard benchmark test data, search tools evaluation is, to some degree, sub- 4http://www.w3.org/tr/rdf-sparQ-qUery jective. However, we designed our experi
Figure 1. CompRE: showing the component ontology, domain ontology, and the main search view within quotes. As described in the previous section, using the compre-kb prefix tells the system that this is in fact an actual API type specified in the knowledge base. However, the swonto prefix refers to a concept from the currently active domain ontology. Since the domain ontology can be different for each API, its name is provided as an external configuration parameter. Free-text requirements in each query may optionally utilize all fuzzy extensions supported by the Lucene’s query parser, thus allowing a full-featured keyword-based search. Once the form is filled, CompRE collects the search requirements and automatically generate a query using the SPARQL 4 query language, it then executes this query against the knowledge base. CompRE also provides a query answering view for advanced users who wish to edit their own SPARQL queries directly, therefore, gaining full control over various aspects of the component ontology. For instance, users may wish to specify that the desired component extends a particular component or perhaps usedBy a certain number of components as a measure of its popularity in the target library. Regardless of the data entry mechanism used, CompRE executes the query, ranks the retrieved instances, and presents the result in a viewer that enables further exploration of each recommended component. Ranking the retrieved candidates according to their relevancy to the user needs saves time and efforts. While allowing blended queries in our approach ensures flexibility and robustness, it however complicates the ranking process. When blended or pure syntactic queries are submitted, we initially rely on the traditional, however solidly proven, scoring mechanisms supported by Lucene. In order to im- 4http://www.w3.org/TR/rdf-sparql-query/ prove this initial ranking, we refine this initial order based on suitability measures that consider the current user context. We thus parse the code that is currently being developed and create a profile that includes all visible types that are either declared by the programmer or inherited in the user’s context. We further analyze each retrieved candidate’s signature in terms of the new input types that this candidate will introduce into the current context if selected by the user. Naturally, candidates that introduce more types should be assigned a lower rank value. However, finding the newly introduced type in the context profile, will not count against this candidate’s score. With the absence of keywords in user queries, we apply only context-based heuristics such that candidates with exact matches are put at the top of the list while other candidates are ranked based on the number and type of their input and output types. For example, consider a user who is trying to search for a component that requires two particular input types, namely I1 and I2. Assume that the repository contains three components, namely C1, C2, and C3. Lets also assume that C1 is an exact match, C2 has only one input type (I1), and C3 requires three input types (I1, I2, and I3). The system is then ranks C1 first, C2 is ranked second, and C3 is in fact the least desired since it will introduce a new type to the user context, it is thus included in the result set to improve recall, however, ranked last. These heuristics are simple, easy to implement, and work surprisingly well. 4. Experiments and results Due to the lack of independent and standard benchmark test data, search tools evaluation is, to some degree, subjective. However, we designed our experiments such that it 5
increases our confidence level of a fair evaluation. We have required Jena method, makeRewindable, was not com- selected the Jena framework for testing CompRE, the do- pletely annotated with the proper return type. Thus, search main ontology described in Section 2. 1 fits naturally in the produced spurious components since only the input type Jenas application domain. was used during search. However, in most cases when proper tags exist, semantic search can precisely describe 4. 1. Experiment: component search the needed component and improve overall precision val ues as seen in Figure 2. Metadata keyword-based search This experiment is designed to reveal the overlap be- performs poorly due to the two well-known fundamental is- tween various search methods that are supported by Com- sues of polysemy and synonymy. These two problems be- pRE. The fundamental guiding hypothesis we test in this come even more evident when searching for software com- experiment is that pure semantic-based representation and ponents. This is due in part to inconsistent and often in- annotation of library components improve search precision complete API descriptions of library code. Nevertheless when compared with other techniques. Precision is defined keyword search tends to return an exact match when a par- as the ratio of the number of relevant component instances ticular keyword is used to describe only a single component that are recommended by the tool to the total number of in the library(e.g, clone in Q4) ommended instances. Recall is the other commonly used Signature based queries tend to yield low precision in metric in evaluating search systems, it is defined as the ra- cases where the component signature includes one or more tio of the number of relevant component instances that are semantically vague types such as the Java String type. The recommended to the total number of relevant components best example to illustrate this notion is Q3. This task re in the repository. However, in these experiments we fix re- quires accessing a query service over a Http cOnneCtio ll since we are searching for distinct components, 1.e, the therefore, one needs to provide, among other things, the ent we are searching for is either four ind or not found We have selected twelve programming tasks, six of these URL of this service and the text representation of the query tasks were carefully designed by us and the remaining tasks both of which are specified as String objects. Unless there were collected from the Jena developers forum. Each of a clear semantic descriptions for such parameters, the these tasks requires a query to be fired in order to search for matchmaking process would return many false positives a conoco diverse enough and cover various aspects of the prisingly well. We believe that it is often the case that the onent that is required to complete the task. These (low precision). Blended search, however, performed sur- tasks problem domain. For space limitations, we do not include user is certain about a single API type that is used in the these tasks here, rather an extended discussion of the tasks component 's signature or a certain keyword that precisely d results can be found online 1 describes some aspect of the component. These descrip- We then prepared the necessary coding environment and tions can also be coupled with semantic annotations to pro- formulated four search queries for each task, i.e, one query duce higher precision. These results indicate that blended ch kind of search supported by CompRE. Precision search needs to be formally investigated in more details ummery graph for these queries is shown in Fig- Ranking and running time analysis have been computed as well. On average, semantic search achieved 1.75 rank over twelve queries, 1.e., the desired component was ranked either at the first or at most the second position. This ranking score is relatively comparable with other search 590史 schemes in which they achieved 2.27, 1.8, and 1.66 for key- word, signature, and blended search, respectively. However, the average observed response times were 15.5,9, 2.5, for semantic, blended, and signature and keyword searches, re- spectively. All experiments were performed on a Windows Query Number XP machine with 1. 8GHZ Intel processor and IGB mem ory. This relatively lower performance for semantic queries -Semantic Keyword -Signature - Blended is a result of having our queries run through the reasoner Figure 2. Precision graph for Jena queries In general, search time for all search methods is suscepti ble to increase as the size of the knowledge base increases in the case of semantic search, speed is continuously im As expected, semantic search tends to perform poorly hen the components in the knowledge base are incom proving as reasoners evolve. Achieving perfection in com- ponent search is near impossible, however, we believe tha pletely or incorrectly annotated. In Q11 for example, CompRe's internal mechanisms proved effective, and in the http://tech.groupsyahoocom/group/jena-dev/ majority of cases, show a clear support for our hypothe
increases our confidence level of a fair evaluation. We have selected the Jena framework for testing CompRE, the domain ontology described in Section 2.1 fits naturally in the Jena’s application domain. 4.1. Experiment: component search This experiment is designed to reveal the overlap between various search methods that are supported by CompRE. The fundamental guiding hypothesis we test in this experiment is that pure semantic-based representation and annotation of library components improve search precision when compared with other techniques. Precision is defined as the ratio of the number of relevant component instances that are recommended by the tool to the total number of recommended instances. Recall is the other commonly used metric in evaluating search systems, it is defined as the ratio of the number of relevant component instances that are recommended to the total number of relevant components in the repository. However, in these experiments we fix recall since we are searching for distinct components, i.e, the component we are searching for is either found or not found. We have selected twelve programming tasks, six of these tasks were carefully designed by us and the remaining tasks were collected from the Jena developers forum5 . Each of these tasks requires a query to be fired in order to search for a component that is required to complete the task. These tasks are diverse enough and cover various aspects of the problem domain. For space limitations, we do not include these tasks here, rather, an extended discussion of the tasks and results can be found online [1]. We then prepared the necessary coding environment and formulated four search queries for each task, i.e, one query for each kind of search supported by CompRE. Precision summery graph for running these queries is shown in Figure 2. Figure 2. Precision graph for Jena queries As expected, semantic search tends to perform poorly when the components in the knowledge base are incompletely or incorrectly annotated. In Q11 for example, a 5http://tech.groups.yahoo.com/group/jena-dev/ required Jena method, makeRewindable, was not completely annotated with the proper return type. Thus, search produced spurious components since only the input type was used during search. However, in most cases when proper tags exist, semantic search can precisely describe the needed component and improve overall precision values as seen in Figure 2. Metadata keyword-based search performs poorly due to the two well-known fundamental issues of polysemy and synonymy. These two problems become even more evident when searching for software components. This is due in part to inconsistent and often incomplete API descriptions of library code. Nevertheless, keyword search tends to return an exact match when a particular keyword is used to describe only a single component in the library (e.g., clone in Q4). Signature based queries tend to yield low precision in cases where the component signature includes one or more semantically vague types such as the Java String type. The best example to illustrate this notion is Q3. This task requires accessing a query service over a HTTP connection, therefore, one needs to provide, among other things, the URL of this service and the text representation of the query; both of which are specified as String objects. Unless there is a clear semantic descriptions for such parameters, the matchmaking process would return many false positives (low precision). Blended search, however, performed surprisingly well. We believe that it is often the case that the user is certain about a single API type that is used in the component’s signature or a certain keyword that precisely describes some aspect of the component. These descriptions can also be coupled with semantic annotations to produce higher precision. These results indicate that blended search needs to be formally investigated in more details. Ranking and running time analysis have been computed as well. On average, semantic search achieved 1.75 rank over twelve queries, i.e., the desired component was ranked either at the first or at most the second position. This ranking score is relatively comparable with other search schemes in which they achieved 2.27, 1.8, and 1.66 for keyword, signature, and blended search, respectively. However, the average observed response times were 15.5, 9, 2.5, for semantic, blended, and signature and keyword searches, respectively. All experiments were performed on a Windows XP machine with 1.8GHZ Intel processor and 1GB memory. This relatively lower performance for semantic queries is a result of having our queries run through the reasoner. In general, search time for all search methods is susceptible to increase as the size of the knowledge base increases; in the case of semantic search, speed is continuously improving as reasoners evolve. Achieving perfection in component search is near impossible, however, we believe that CompRe’s internal mechanisms proved effective, and in the majority of cases, show a clear support for our hypothesis. 6
4.2. User study cise queries. This is confirmed by the responses we ob- tained upon the conclusion of the experiment, four out of This experiment is designed to asses the usability of se- six students indicated that there is a small initial learning mantic search and to understand the possible difficulties curve that was reduced fairly quickly as they became more faced by end users in learning and using domain ontologies comfortable with the API vocabulary represented in the do- for successfully completing a particular search task against main ontology. Since the last task has to be done without an unfamiliar apl CompRE's assistance, most students argued that this task Six graduate-level MIS students from Northwestem Uni- could have been completed faster had CompRE's assistance versity have voluntarily agreed to participate in this study been allowed. A domain ontology provide a concise de- Although all students have at least seven months of Java ex- scription of API content and vocabulary. This knowledge perience and a good working knowledge of Semantic Web can be used also to successfully finishing a coding task that technologies, no student has been directly exposed to the may require more than one component. Consider task T2 Jena apl. we delivered a one-hour tutorial that includes a for example, this task requires instantiating an intermediate brief introduction to the Jena apl. brief introduction to the object of type QueryExecution, we suspect that the co- domain ontology, and a sample training task that explains herent representation of ontology concepts and axioms aid CompRE's semantic search features. We then charged each users in arriving at such conclusions during the initial time student with other four independent Jena API programming Invested in understanding and learning the taxonomy tasks that vary in scope: T1) data set creation and handling One may conclude that completing the last task(alterna of multiple RDF gtaphs; T2)query construction and execu- tive task)should be relatively easier. After all,students have tion over a given ontology model; T3)result manipulation been using the same API, thus, the knowledge gained about of query solu this API after completing the first three tasks can be helpful services. On average, two distinct components are needed However, when examining the results, there is no dramatic to successfully complete a given task. An environment is improvement in response time for using alternative meth setup for each task with a skeleton code, each student is ods. We suspect that these alternative methods (e.g,ex then asked to finish three consecutive tasks using only Com- ploring documentation, searching for code in the Web, etc. PRE's pure semantic search while the last task in the se- do not provide a systematic and focused learning experi- quence must be completed using other alternative methods ence for programmers. In this study however, we did not of students choice intend to make a systematic comparison between semantic Table I shows task completion time measured from the search mechanisms with normal search practices used by time in which the task is presented to the student until the fi- programmers. However, the obtained results clearly sup- nal correct answer is submitted. Numbers appeared in bold- port our hypothesis and show that semantic search, in most face represents tasks completed using semantic search while cases, yield better API learning experience and can certainly other numbers are underlined increase programmers productivity Overall, students provided positive comments about CompRE and semantic search. Two students indicated that Table 1. User study statistics the SParQl query view was indeed helpful and used in for- Time(Minutes) mulating more complex queries. However, these students requested a thorough integration of the domain ontology T2 T3 including its object properties into the CompRE's domain ontology view. Only one student reported a relative diffi- 3522 0123 culty adapting to a new search approach after being familiar 3240 with other alternative methods. This user also requested a Tooltip feature such that when the user hovers over an on- 161840 ology concept in the domain ontology view, a hover box Avg(Seman appears with class description. Based on this sound and helpful feedback, we are currently adding new features as The most significant conclusion we can draw from these expressive. lumbers is the correlation between the time taken by stu- dents to complete the first semantic task and the last 5. Related work in the sequence. In most cases, there was a significant re- duction in time as students became more familiar with the Due to the benefits acquired by systematic reuse, many domain ontology and thus more able to construct more pre- researchers have proposed solutions and tackled the reuse 7
4.2. User study This experiment is designed to asses the usability of semantic search and to understand the possible difficulties faced by end users in learning and using domain ontologies for successfully completing a particular search task against an unfamiliar API. Six graduate-level MIS students from Northwestern University have voluntarily agreed to participate in this study. Although all students have at least seven months of Java experience and a good working knowledge of Semantic Web technologies, no student has been directly exposed to the Jena API. We delivered a one-hour tutorial that includes a brief introduction to the Jena API, brief introduction to the domain ontology, and a sample training task that explains CompRE’s semantic search features. We then charged each student with other four independent Jena API programming tasks that vary in scope: T1) data set creation and handling of multiple RDF gtaphs; T2) query construction and execution over a given ontology model; T3) result manipulation of query solutions; and T4) access and treatment of remote services. On average, two distinct components are needed to successfully complete a given task. An environment is setup for each task with a skeleton code, each student is then asked to finish three consecutive tasks using only CompRE’s pure semantic search while the last task in the sequence must be completed using other alternative methods of student’s choice. Table 1 shows task completion time measured from the time in which the task is presented to the student until the fi- nal correct answer is submitted. Numbers appeared in boldface represents tasks completed using semantic search while other numbers are underlined. Table 1. User study statistics Time (Minutes) T1 T2 T3 T4 S1 35 22 27 13 S2 33 35 20 11 S3 45 43 31 33 S4 32 40 22 15 S5 52 12 43 8 S6 21 16 18 40 Avg (Semantic) 37.5 21.25 23.6 16 Avg (Alternative) 34 41.5 43 40 The most significant conclusion we can draw from these numbers is the correlation between the time taken by students to complete the first semantic task and the last one in the sequence. In most cases, there was a significant reduction in time as students became more familiar with the domain ontology and thus more able to construct more precise queries. This is confirmed by the responses we obtained upon the conclusion of the experiment, four out of six students indicated that there is a small initial learning curve that was reduced fairly quickly as they became more comfortable with the API vocabulary represented in the domain ontology. Since the last task has to be done without CompRE’s assistance, most students argued that this task could have been completed faster had CompRE’s assistance been allowed. A domain ontology provide a concise description of API content and vocabulary. This knowledge can be used also to successfully finishing a coding task that may require more than one component. Consider task T2 for example, this task requires instantiating an intermediate object of type QueryExecution, we suspect that the coherent representation of ontology concepts and axioms aid users in arriving at such conclusions during the initial time invested in understanding and learning the taxonomy. One may conclude that completing the last task (alternative task) should be relatively easier. After all, students have been using the same API, thus, the knowledge gained about this API after completing the first three tasks can be helpful. However, when examining the results, there is no dramatic improvement in response time for using alternative methods. We suspect that these alternative methods (e.g., exploring documentation, searching for code in the Web, etc.) do not provide a systematic and focused learning experience for programmers. In this study however, we did not intend to make a systematic comparison between semantic search mechanisms with normal search practices used by programmers. However, the obtained results clearly support our hypothesis and show that semantic search, in most cases, yield better API learning experience and can certainly increase programmer’s productivity. Overall, students provided positive comments about CompRE and semantic search. Two students indicated that the SPARQL query view was indeed helpful and used in formulating more complex queries. However, these students requested a thorough integration of the domain ontology including its object properties into the CompRE’s domain ontology view. Only one student reported a relative diffi- culty adapting to a new search approach after being familiar with other alternative methods. This user also requested a Tooltip feature such that when the user hovers over an ontology concept in the domain ontology view, a hover box appears with class description. Based on this sound and helpful feedback, we are currently adding new features as well as modifying CompRE’s interface so it becomes more expressive. 5. Related work Due to the benefits acquired by systematic reuse, many researchers have proposed solutions and tackled the reuse 7
problem from various perspectives. Many approaches utilizes domain knowledge not only usable and achievable, (e.g.3]employ traditional knowledge representation and but also improves precision of search results. Our results variations of signature matching or keyword-based re- also showed that blended search has a great potential, we are trieval. Similar to CompRE, other tools(e. g, [1l)) leverage currently conducting more case studies to asses the value software understanding by being embedded in the develop- of blended search. There are also two other future work ment environment. However, unlike CompRE, these tools directions. Firstly, ranking reused candidates has always rely on a local repository of sample client code to search for been a challenge, therefore, we are currently investigating components. Code Broker [11] for example, use a combina low could ranking be improved using semantic technolo- tion of free-text and signature matching techniques. In or- gies. Secondly, we have not yet investigated how could one der to retrieve appropriate matches, the user must write high motivate library providers to ship domain ontologies with quality doc comments that precisely describe functionality heir software, or how could individually created ontologies If the user comments did not retrieve satisfactory results, the be shared by a community of users system considers the signature of the method immediately following the comments. Finding a well documented code References to populate the repository with is highly unlikely, especially in open-source and legacy software Other component retrieval approaches(e.g, [6], [7Dap- [1] A. Alnusair and T. Zhao. Ontology models, framework ontolgies, and CompRE evaluation ply automated testing techniques to analyze a corpus of http://www.cs.uwmedu/-a client code harvested from the Web. Code Conjurer[6]for [2] B. Antunes, P. Gomez, and NSeco example, helps agile development users in finding suitable system based on the semantic web components on the basis of unit test cases. Therefore, users International Workshop on Semantic Web Enabled Software of the system has to write such test cases in order to invoke Engineering(SWESE), 2007. the system. Once invoked, the system contacts a remote 3 S. Bajracharya, O. Ossher, and C. Lopes. Sourcerer: An server that finds suitable candidates based on the compo- internet-scale software repository. In First International nents interface specified in the test case Work shop on Search-driven Development: Users, Infras Other semantic-based approaches have also been pro- tructure, Tools and Evaluation (SUITE09), 2009 [4 F A Durao, T.A. Vanderlei, E S. Almeida, and S.R. meira osed. However, the full potential of utilizing domain Applying a semantic layer in a source code search tool. In e was not e xplored. Sugumaran and Storey[10] Proceedings of the 23rd ACM Symposium on Applied Com proposed an approach that utilizes domain models; a do- puting, pages 1151-1157, Fortaleza Ceara, Brazil, 2008 main ontology is used mainly for term disambiguate B. Frakes and K. Kang. Software reuse research: Status basic query refinement for keyword-based queries; and future. IEEE Transactions on Software Engineering keywords are then mapped against the ontology to 31(7):529-536,July2005. hat correct terms are being used. However, no semantic- [6] O. Hummel,, W Janjic, and C. Atkinson. Code Conj based descriptions of components have been used. Other Pulling reusable software out of thin air. IEEE Softu proposals([2] and [4))employ ontologies to addressing the 5(5:45-52,2008 [7 O. Hummel and C. Atkinson. Extreme Harvesting: Test knowledge representation problem found in previous ap- proaches. In [4], software assets are classified into domain driven discovery and reuse of software components. In Pro- eedings of the IEEE International Conference on Inform categories(/O, GUl, Security, etc. )and indexed with a do- tion Reuse and integration(IRI'04), 2004 main field as well as other bookkeeping fields to facilitate [81 B. McBride. Jena: a semantic web toolkit. IEEE Internet free text search. Although the SRS [2] proposal uses the Computing,6(6:55-59,2002 ame indexing mechanism, it maintains two [9] F N. Noy and D. L. McGuinness. Ontology development tologies; an ontology for describing software assets as we 101: A guide to creating your first ontology. Stanford as a domain ontology for classifying these assets. Howeve Knowledge System Technical Report KSL-01-05, 2001 the structure of the source code assets and the semantic re- [10] V Sugumaran and V.C. Storey. A semantic-based approach lationships between those assets via axioms and role restric to component retrieval. ACM SIGMIS DATABASE, 24.2003 tions were not fully utilized [11] Y. Ye and G. Fischer. Reuse-conductive development envi- ronments. The international Journal of Automated Software 6. Conclusions and future work engineering,12(2):199-235,2005 [12] Y. Zhao, J. Dong, and T Peng. Ontology classification for semantic web based software engineering. IEEE Transa We proposed an approach for component reuse In addi- tions on Services Computing, 2(4): 303-317, 2009 tion to supporting pure semantic-based search, our approach lso supports other kinds of search techniques. However, our studies showed evidence that pure semantic search that
problem from various perspectives. Many approaches (e.g., [3]) employ traditional knowledge representation and variations of signature matching or keyword-based retrieval. Similar to CompRE, other tools (e.g., [11]) leverage software understanding by being embedded in the development environment. However, unlike CompRE, these tools rely on a local repository of sample client code to search for components. CodeBroker [11] for example, use a combination of free-text and signature matching techniques. In order to retrieve appropriate matches, the user must write high quality doc comments that precisely describe functionality. If the user comments did not retrieve satisfactory results, the system considers the signature of the method immediately following the comments. Finding a well documented code to populate the repository with is highly unlikely, especially in open-source and legacy software. Other component retrieval approaches (e.g., [6], [7]) apply automated testing techniques to analyze a corpus of client code harvested from the Web. Code Conjurer[6] for example, helps agile development users in finding suitable components on the basis of unit test cases. Therefore, users of the system has to write such test cases in order to invoke the system. Once invoked, the system contacts a remote server that finds suitable candidates based on the component’s interface specified in the test case. Other semantic-based approaches have also been proposed. However, the full potential of utilizing domain knowledge was not explored. Sugumaran and Storey[10] proposed an approach that utilizes domain models; a domain ontology is used mainly for term disambiguation and basic query refinement for keyword-based queries; these keywords are then mapped against the ontology to ensure that correct terms are being used. However, no semanticbased descriptions of components have been used. Other proposals ([2] and [4]) employ ontologies to addressing the knowledge representation problem found in previous approaches. In [4], software assets are classified into domain categories (I/O, GUI, Security, etc.) and indexed with a domain field as well as other bookkeeping fields to facilitate free text search. Although the SRS [2] proposal uses the same indexing mechanism, it maintains two separate ontologies; an ontology for describing software assets as well as a domain ontology for classifying these assets. However, the structure of the source code assets and the semantic relationships between those assets via axioms and role restrictions were not fully utilized. 6. Conclusions and future work We proposed an approach for component reuse. In addition to supporting pure semantic-based search, our approach also supports other kinds of search techniques. However, our studies showed evidence that pure semantic search that utilizes domain knowledge not only usable and achievable, but also improves precision of search results. Our results also showed that blended search has a great potential, we are currently conducting more case studies to asses the value of blended search. There are also two other future work directions. Firstly, ranking reused candidates has always been a challenge, therefore, we are currently investigating how could ranking be improved using semantic technologies. Secondly, we have not yet investigated how could one motivate library providers to ship domain ontologies with their software, or how could individually created ontologies be shared by a community of users. References [1] A. Alnusair and T. Zhao. Ontology models, framework ontolgies, and CompRE evaluation. Available online at: http://www.cs.uwm.edu/˜alnusair/compre. [2] B. Antunes, P. Gomez, and N. Seco. SRS: A software reuse system based on the semantic web. Proceedings of the 3rd International Workshop on Semantic Web Enabled Software Engineering (SWESE), 2007. [3] S. Bajracharya, O. Ossher, and C. Lopes. Sourcerer: An internet-scale software repository. In First International Workshop on Search-driven Development: Users, Infrastructure, Tools and Evaluation (SUITE’09), 2009. [4] F. A. Durao, T. A. Vanderlei, E. S. Almeida, and S. R. Meira. Applying a semantic layer in a source code search tool. In Proceedings of the 23rd ACM Symposium on Applied Computing, pages 1151–1157, Fortaleza Ceara, Brazil, 2008. [5] W. B. Frakes and K. Kang. Software reuse research: Status and future. IEEE Transactions on Software Engineering, 31(7):529–536, July 2005. [6] O. Hummel, , W. Janjic, and C. Atkinson. Code Conjurer: Pulling reusable software out of thin air. IEEE Software, 25(5):45–52, 2008. [7] O. Hummel and C. Atkinson. Extreme Harvesting: Test driven discovery and reuse of software components. In Proceedings of the IEEE International Conference on Information Reuse and Integration (IRI’04), 2004. [8] B. McBride. Jena: a semantic web toolkit. IEEE Internet Computing, 6(6):55–59, 2002. [9] F. N. Noy and D. L. McGuinness. Ontology development 101: A guide to creating your first ontology. Stanford Knowledge System Technical Report KSL-01-05, 2001. [10] V. Sugumaran and V. C. Storey. A semantic-based approach to component retrieval. ACM SIGMIS DATABASE, 34(3):8– 24, 2003. [11] Y. Ye and G. Fischer. Reuse-conductive development environments. The International Journal of Automated Software Engineering, 12(2):199–235, 2005. [12] Y. Zhao, J. Dong, and T. Peng. Ontology classification for semantic web based software engineering. IEEE Transactions on Services Computing, 2(4):303–317, 2009. 8