Abdelguerfi, M., Eskicioglu, R., Liebowitz, J. “Knowledge Engineering.” The Electrical Engineering Handbook, Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
94
Knowledge Engineering

M. Abdelguerfi and R. Eskicioglu, University of New Orleans
Jay Liebowitz, George Washington University

94.1 Databases
Database Abstraction • Data Models • Relational Databases • Hierarchical Databases • Network Databases • Architecture of a DBMS • Data Integrity and Security • Emerging Trends

94.2 Rule-Based Expert Systems
Problem Selection • Knowledge Acquisition • Knowledge Representation • Knowledge Encoding • Knowledge Testing and Evaluation • Implementation and Maintenance

94.1 Databases

M. Abdelguerfi and R. Eskicioglu

In the past, file processing techniques were used to design information systems. These systems usually consist of a set of files and a collection of application programs. Permanent records are stored in the files, and application programs are used to update and query the files. The application programs were in general developed individually to meet the needs of different groups of users. In many cases, this approach leads to a duplication of data among the files of different users. Also, the lack of coordination between files belonging to different users often leads to a lack of data consistency. In addition, changes to the underlying data requirements usually necessitate major changes to existing application programs. Among other major problems that arise with the use of file processing techniques are lack of data sharing, reduced programming productivity, and increased program maintenance.

Because of their inherent difficulties and lack of flexibility, file processing techniques have lost a great deal of their popularity and are being replaced by database management systems (DBMS). A DBMS is designed to efficiently manage a shared pool of interrelated data (the database). This includes the existence of features such as a data definition language for the definition of the logical structure of the database (the database schema), a data manipulation language to query and update the database, a concurrency control mechanism to keep the database consistent when shared by several users, a crash recovery strategy to avoid any loss of information after a system crash, and safety mechanisms against any unauthorized access.

Database Abstraction

A DBMS is expected to provide for data independence, i.e., user requests are made at a logical level without any need for knowledge of how the data is stored in actual files. This implies that the internal file structure could be modified without any change to the user’s perception of the database. To achieve data independence, the Standards Planning and Requirements Committee (SPARC) of the American National Standards Institute (ANSI) in its 1977 report recommended three levels of database abstraction (see Fig. 94.1). The lowest level in the abstraction is the internal level. Here, the database is viewed as a collection of files organized according to one of several possible internal data organizations (e.g., a B+-tree data organization). In the conceptual level, the database is viewed at an abstract level. The user at this level is shielded from the internal storage details. At the external level, each group of users has their own perception or view of the database. Each view is derived from
the conceptual database and is designed to meet the needs of a particular group of users. Such a group can only have access to the data specified by its particular view. This, of course, ensures both privacy and security. The mapping between the three levels of abstraction is the task of the DBMS.

When changes to the internal level (such as a change in file organization) do not affect the conceptual and external levels, the system is said to provide for physical data independence. Logical data independence prevents changes to the conceptual level from affecting users’ views. Both types of data independence are desired features in a database system.

FIGURE 94.1 Data abstraction.
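As an illustration of the external level, the SQL view mechanism introduced later in this section can define a separate view for each user group. The following is a minimal sketch, using the FAC_INFO relation from the examples that follow; the view name and the department code are illustrative only:

CREATE VIEW EE_FACULTY AS
   SELECT lname, street, city      -- social_sec# stays hidden from this user group
   FROM   FAC_INFO
   WHERE  dept = “EE”;             -- only Electrical Engineering faculty are visible

Because the view is defined over the conceptual schema, a reorganization of the underlying files at the internal level leaves it unchanged, which is exactly the physical data independence described above.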
Data Models

A data model refers to an integrated set of tools used to describe the data and its structure, data relationships, and data constraints. Some data models provide a set of operators that is used to update and query the database. Data models can be classified in two main categories: record-based and object-based. Both classes are used to describe the database at the conceptual and external levels. With object-based data models, constraints on the data can be specified more explicitly.

There are three main record-based data models: the relational, network, and hierarchical models. In the relational model, data at the conceptual level is represented as a collection of interrelated tables. These tables are normalized so as to minimize data redundancy and update anomalies. In this model, data relationships are implicit and are derived by matching columns in tables. In the hierarchical and network models, the data is represented as a collection of records, and data relationships are explicit and are represented by links. The difference between the last two models is that in the hierarchical model, data is represented as a tree structure, while it is represented as a generalized graph in the network model.

In the hierarchical and network models, the existence of physical pointers (links) to link related records allows an application program to retrieve a single record at a time by following the pointer chain. The process of following the pointer chain and selecting one record at a time is referred to as navigation. In nonnavigational models such as the relational model, records are not related through pointer chains; instead, relationships are established by matching columns in different tables.

The hierarchical and network models require the application programmer to be aware of the internal structure of the database. The relational model, on the other hand, allows for a high degree of physical and logical data independence. Earlier DBMSs were for the most part navigational systems. Because of its simplicity and strong theoretical foundations, the relational model has since received wide acceptance. Today, most DBMSs are based on the relational model.

Other data models include a popular high-level conceptual data model known as the Entity-Relationship (ER) model. The ER model is mainly used for the conceptual design of databases and their applications. The ER model describes data as entities, attributes, and relationships. An entity is an “object” in the real world with an independent existence. Each entity has a set of properties, called attributes, that describes it. A relationship is an association between entities. For example, a professor entity may be described by its name, age, and salary and can be associated with a department entity by the relationship “works for”.

With the advent of advanced database applications, the ER modeling concepts became insufficient. This has led to the enhancement of the ER model with additional concepts, such as generalization, categories, and inheritance, leading to the Enhanced-ER or EER model.

Relational Databases

The relational model was introduced by E. F. Codd [1970]. Since the theoretical underpinnings of the relational model have been well defined, it has become the focus of most commercial DBMSs.

In the relational model, the data is represented as a collection of relations. To a large extent, each relation can be thought of as a table. The example of Fig. 94.2 shows part of a university database composed of two relations.
FAC_INFO gives personal information (last name, social security number, street and city of residence, and department) of a faculty member. DEP_CHAIR gives the last name of the chairman of each department. A faculty member is not allowed to belong to two departments.

FIGURE 94.2 An example of two relations: FAC_INFO and DEP_CHAIR.

Each row in a relation is referred to as a tuple. A column name is called an attribute name. The data type of each attribute name is known as its domain. A relation scheme is a set of attribute names. For instance, the relation scheme (or scheme for short) of the relation FAC_INFO is (lname, social_sec#, street, city, dept). A key is a set of attribute names whose composite value is distinct for all tuples. In addition, no proper subset of the key is allowed to have this property. It is not unusual for a scheme to have several possible keys. In FAC_INFO, both lname and social_sec# are possible keys. In this case, each possible key is known as a candidate key, and the one selected to act as the relation’s key, say, lname, is referred to as the primary key. A superkey is a key without the requirement for minimality. In a relation, an attribute name (or a set of attribute names) is referred to as a foreign key if it is the primary key of another relation. In FAC_INFO, the attribute name dept is a foreign key, since the same attribute is a key in DEP_CHAIR. Because of updates to the database, the content of a relation is dynamic. For this reason, the data in a relation at a given time instant is called an instance of the relation.

There are three integrity constraints that are usually imposed on each instance of a relation: primary key integrity, entity integrity, and referential integrity. The key integrity constraint requires that no two tuples of a relation have the same key value. The entity integrity constraint specifies that the key value of each tuple should have a known value (i.e., no null values are allowed for primary keys). The referential integrity constraint specifies that if a relation r1 contains a foreign key that matches the primary key of a relation r2, then each value of the foreign key in r1 must either match a value of the primary key in r2 or must be null. For the database of Fig. 94.2 to be consistent, each value of dept in FAC_INFO must match a value of dept in DEP_CHAIR.

Relational Database Design

Relational database design [Maier, 1983] refers to the process of generating a set of relation schemes that minimizes data redundancy and removes update anomalies. One of the most popular approaches is the use of normalization theory. Normalization theory is based on the notion of functional dependencies. Functional dependencies are constraints imposed on a database. The notion of superkey, introduced in the previous section, can be formulated as follows: A subset K of a relation scheme R is a superkey if, in any instance of the relation, no two distinct tuples have the same superkey value. If r(R) is used to denote a relation r on a schema R, K ⊆ R a superkey, and t(K) the K-value of tuple t, then no two tuples t1 and t2 in r(R) are such that t1(K) = t2(K).

The notion of a functional dependency can be seen as a generalization of the notion of superkey. Let X and Y be two subsets of R; the functional dependency X → Y exists in r(R) if whenever two tuples in r(R) have the same X-value, their Y-value is also the same. That is, if t1(X) = t2(X), then t1(Y) = t2(Y). Using functional dependencies, one can define the notion of a key more precisely.
A key k of a relation r(R) is such that k Æ R and no proper subset of k has this property. Note that if the schema R is composed of attribute names {A1, A2, . . ., An}, then each attribute name Ai is functionally determined by the key k, i.e., k Æ Ai, i = 1, . . ., n. An FIGURE 94.2 An example of two relations: FAC_INFO and DEP_CHAIR
The normalization process can be thought of as the process of decomposing a scheme with update anomalies and data redundancy into smaller schemes in which these undesirable properties are to a large extent eliminated. Depending on the severity of these undesirable properties, schemes are classified into normal forms. Originally, Codd defined three normal forms: first normal form (1NF), second normal form (2NF), and third normal form (3NF). Thereafter, a stronger version of the 3NF, known as Boyce-Codd normal form (BCNF), was suggested. These four normal forms are based on the concept of functional dependencies.

The 1NF requires that attribute name values be atomic. That is, composite values for attribute names are not allowed. A 2NF scheme is a 1NF scheme in which all nonprime attributes are fully dependent on the key.

Consider the relation of Fig. 94.3. Each tuple in PRODUCT gives the name of a supplier, a product name, its price, and the supplier’s location.

FIGURE 94.3 Instance of PRODUCT (supplier_name, product_name, price, location).

The scheme (supplier_name, product_name, price, location) is in 1NF since each attribute name is atomic. It is assumed that many products can be supplied by a single supplier, that a given product can be supplied by more than one supplier, and that a supplier has only one location. So, (supplier_name, product_name) is the relation’s key, and the functional dependency supplier_name → location should hold for any instance of PRODUCT.

The structure of the relation of Fig. 94.3 does not allow a supplier to appear in the relation unless it offers at least one product. Even the use of null values is not of much help in this case, as product_name is part of a key and therefore cannot be assigned a null value. Another anomaly can be encountered during the deletion process. For instance, deleting the last tuple in the relation results in the loss of the information that Rudd is a supplier located in Metairie. It is seen that the relation PRODUCT suffers from insertion and deletion anomalies.

Modifications can also be a problem in the relation PRODUCT. Suppose that the location of the supplier Martin is moved from Kenner to Slidell. In order not to violate the functional dependency supplier_name → location, the location attribute name of all tuples where the supplier is Martin needs to be changed from Kenner to Slidell. This modification anomaly has a negative effect on performance. In addition, the relation PRODUCT suffers from data redundancy. For example, although Martin has only one location, “Kenner”, this location appears in all three tuples where the supplier_name is Martin.

The update anomalies and data redundancy encountered in PRODUCT are all due to the functional dependency supplier_name → location. The right-hand side of this dependency, location, is a nonprime attribute, and the left-hand side represents part of the key. Therefore, we have a nonprime attribute that is only partially dependent on the key (supplier_name, product_name). As a consequence, the schema (supplier_name, product_name, price, location) is not in 2NF. The removal of the partial dependency supplier_name → location will eliminate all the above anomalies. The removal is achieved by decomposing the scheme (supplier_name, product_name, price, location) into two 2NF schemes: (supplier_name, product_name, price) and (supplier_name, location).
This decomposition results in the relations PRO_INFO and SUP_LOC shown in Fig. 94.4. The keys of PRO_INFO and SUP_LOC are (supplier_name, product_name) and supplier_name, respectively.

FIGURE 94.4 Decomposition of PRODUCT into PRO_INFO and SUP_LOC.

Normalizing schemes into 2NF removes all update anomalies due to nonprime attributes being partially dependent on keys. Anomalies of a different nature, however, are still possible. Update anomalies and data redundancy can originate from transitive dependencies. A nonprime attribute Ai is said to be transitively dependent on a key k via attribute name Aj if k → Aj, Aj → Ai, and Aj does not functionally determine k. A 3NF scheme is a 1NF scheme where no nonprime attribute is transitively dependent on a key.
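Once the tables of Fig. 94.4 exist (their CREATE TABLE statements appear later in this section), the decomposition itself can be expressed with SQL’s INSERT ... SELECT form, which is introduced below. A sketch, assuming PRODUCT is available as a base table:

INSERT INTO PRO_INFO (supplier_name, product_name, price)
   SELECT supplier_name, product_name, price
   FROM   PRODUCT;

INSERT INTO SUP_LOC (supplier_name, location)
   SELECT DISTINCT supplier_name, location    -- DISTINCT collapses the repeated (supplier, location) pairs
   FROM   PRODUCT;

The DISTINCT in the second statement removes exactly the redundant location values that caused the anomalies described above.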
The relation of Fig. 94.5, which is in 2NF, highlights update anomalies and data redundancy due to the transitive dependency of a nonprime attribute on a key. The relation gives the name of a client (client_name), the corresponding supplier (supplier_name), and the supplier’s location. Each client is assumed to have one supplier, and each supplier has only one location. The relation’s key is client_name.

FIGURE 94.5 Instance of SUPPLIES.

A supplier and his location cannot be inserted in SUPPLIES unless the supplier has at least one client. In addition, the relation has a deletion anomaly, since if Tillis is no longer a client of Rudd, the information about Rudd as a supplier and his location is lost. A change to a supplier’s location may require updating the location attribute name of several tuples in the relation. Also, although each supplier has only one location, such a location is sometimes repeated several times unnecessarily, leading to data redundancy.

The relation exhibits the following transitive dependency: client_name → supplier_name, supplier_name → location (but not the inverse). The relation SUPPLIES is clearly in 2NF, but because of the transitive dependency of the nonprime attribute location on the key, it is not in 3NF. This is the cause of the anomalies mentioned above. Eliminating this transitive dependency by splitting the schema into two components will remove these anomalies. Clearly, the resulting two relations SUP_CLI and SUP_LOC are in 3NF (see Fig. 94.6).

FIGURE 94.6 Decomposition of SUPPLIES into SUP_CLI and SUP_LOC.

Each partial dependency of a nonprime attribute on a key can be expressed as a transitive dependency of a nonprime attribute on a key. Therefore, a scheme in 3NF is also in 2NF. BCNF is a stricter form of 3NF, where a relation r on a schema R is in BCNF if whenever a functional dependency X → Y exists in r(R), then X is a superkey of R. The condition of 3NF, which allows Y to be prime if X is not a superkey, does not exist in BCNF. Thus, every scheme in BCNF is also in 3NF, but the opposite is not always true.
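A standard illustration of the difference, not part of the chapter’s running example: consider a scheme (student, course, instructor) in which a student–course pair determines the instructor ({student, course} → instructor) and each instructor teaches a single course (instructor → course). The scheme is in 3NF, since course is a prime attribute (it belongs to the candidate key {student, course}), but it is not in BCNF, because instructor is not a superkey.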
A detailed discussion of higher-level normal forms, such as 4NF and 5NF, which are based on other forms of dependencies, can be found in [Elmasri and Navathe, 1994].

Data Definition and Manipulation in Relational Databases

Upon completion of the relational database design, a descriptive language, usually referred to as a Data Definition Language (DDL), is used to define the designed schemes and their relationships. The DDL can be used to create new schemes or modify existing ones, but it cannot be used to query the database. Once DDL statements are compiled, they are stored in the data dictionary. A data dictionary is a repository where information about database schemas, such as attribute names, indexes, and integrity constraints, is stored. Data dictionaries also contain other information about databases, such as design decisions, usage standards, application program descriptions, and user information. During the processing of a query, the DBMS usually checks the data dictionary. The data dictionary can be seen as a relational database of its own. As a result, the data manipulation languages that are used to manipulate databases can also be used to query the data dictionary.

An important function of a DBMS is to provide a Data Manipulation Language (DML) with which a user can retrieve, change, insert, and delete data from the database. DMLs are classified into two types: procedural and nonprocedural. The main difference between the two types is that in procedural DMLs, a user has to specify the desired data and how to obtain it, while in nonprocedural DMLs, a user has only to describe the desired data. Because they impose less burden on the user, nonprocedural DMLs are normally easier to learn and use.

The component of a DML that deals with data retrieval is referred to as a query language. A query language can be used interactively in a stand-alone manner, or it can be embedded in a general-purpose programming language such as C or Cobol. One of the most popular query languages is SQL (Structured Query Language). SQL is a query language based to a large extent on Codd’s relational algebra, with additional features for data definition and update. Therefore, SQL is a comprehensive relational database language that includes both a DDL and a DML.

SQL includes the following commands for data definition: CREATE TABLE, DROP TABLE, and ALTER TABLE. The CREATE TABLE command is used to create and describe a new relation. The two relations of Fig. 94.4 can be created in the following manner:

CREATE TABLE PRO_INFO
   (supplier_name  VARCHAR(12) NOT NULL,
    product_name   VARCHAR(8)  NOT NULL,
    price          DECIMAL(6,2));

CREATE TABLE SUP_LOC
   (supplier_name  VARCHAR(12) NOT NULL,
    location       VARCHAR(10));

The CREATE TABLE command specifies all the attribute names of a relation and their data types (e.g., INTEGER, DECIMAL, fixed-length character “CHAR”, variable-length character “VARCHAR”, DATE). The constraint NOT NULL is usually specified for those attributes that cannot have null values. The primary key of each relation in the database is usually required to have a nonnull value.

If a relation is created incorrectly, it can be deleted using the DROP TABLE command. The command is DROP TABLE followed by the name of the relation to be deleted. A variation of the DROP command, DROP SCHEMA, is used if the whole schema is no longer needed.
The ALTER TABLE command is used to add new attribute names to an existing relation, as follows:

ALTER TABLE SUP_LOC ADD zip_code CHAR(5);

The SUP_LOC relation now contains an extra attribute name, zip_code. In most DBMSs, the zip_code value of existing tuples will automatically be assigned a null value. Other DBMSs allow for the assignment of an initial value to a newly added attribute name. Also, definitions of attributes can be changed, and new constraints can be added or current constraints dropped.
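As noted earlier, the data dictionary can be queried like any other relational database. Many SQL products expose it through the SQL-standard INFORMATION_SCHEMA catalog views, though view names and availability vary by DBMS; a sketch that lists the columns of SUP_LOC after the ALTER TABLE above:

SELECT column_name, data_type
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  table_name = “SUP_LOC”;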
The DML component of SQL has one basic query statement, sometimes called a mapping, that has the following structure:

SELECT <attribute names>
FROM   <relations>
WHERE  <predicate>

In the above statement, the SELECT clause specifies the attribute names that are to be retrieved, FROM gives the list of the relations involved, and WHERE is a Boolean predicate that completely specifies the tuples to be retrieved.

Consider the database of Fig. 94.4, and suppose that we want the name of all suppliers that supply either beds or sofas. In SQL, this query can be expressed as:

SELECT supplier_name
FROM   PRO_INFO
WHERE  product_name = “bed” OR product_name = “sofa”

The result of an SQL command may contain duplicate values and is therefore not always a true relation. In fact, the result of the above query, shown below, has duplicate entries:

supplier_name
Martin
Martin
Rudd

The entry Martin appears twice in the result, because the supplier Martin supplies both beds and sofas. Removal of duplicates is usually a computationally intensive operation. As a result, duplicate entries are not automatically removed by SQL. To ensure uniqueness, the keyword DISTINCT should be used. If we want the supplier names to be listed only once, the query should be modified as follows:

SELECT DISTINCT supplier_name
FROM   PRO_INFO
WHERE  product_name = “bed” OR product_name = “sofa”

In SQL, a query can involve more than one relation. Suppose that we want the list of all suppliers from Metairie who supply beds. Such a query, shown below, involves both PRO_INFO and SUP_LOC:

SELECT supplier_name
FROM   PRO_INFO, SUP_LOC
WHERE  PRO_INFO.supplier_name = SUP_LOC.supplier_name AND
       product_name = “bed”

When an SQL expression, such as the one above, involves more than one relation, it is sometimes necessary to qualify attribute names, that is, to precede an attribute name by the relation it belongs to, with a period placed between the two. Such a qualification removes possible ambiguities. In SQL, it is possible to have several levels of query nesting; this is done by including a SELECT query statement within the WHERE clause.

The output data can be presented in sorted order by using the SQL ORDER BY clause followed by the attribute name(s) according to which the output is to be sorted.
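For instance, the Metairie query above can be restated with one level of nesting, with the final output sorted (a sketch):

SELECT supplier_name
FROM   PRO_INFO
WHERE  product_name = “bed” AND
       supplier_name IN
          (SELECT supplier_name
           FROM   SUP_LOC
           WHERE  location = “Metairie”)
ORDER BY supplier_name;

The inner SELECT produces the set of Metairie suppliers, and the outer WHERE clause keeps only the bed suppliers that belong to that set.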
In database management applications, it is often desirable to categorize the tuples of a relation by the values of a set of attributes and extract an aggregated characteristic of each category. Such database management tasks are referred to as aggregation functions. For instance, SQL includes the following built-in aggregation functions: SUM, COUNT, AVERAGE, MIN, MAX. The attribute names used for the categorization are referred to as GROUP BY columns.

Consider the relation PROFESSOR of Fig. 94.7. Each tuple of the relation gives the name of a faculty member, the department, and the academic year salary.

FIGURE 94.7 Instance of the relation PROFESSOR.

Suppose that we want to know the number of faculty in each department, with the result ordered by department. This query requests, for each department, a count of the number of faculty. Faculty are therefore categorized according to the attribute name department. As a result, department is referred to as a GROUP BY attribute. In SQL, the above query is formulated as follows:

SELECT   department, COUNT (faculty)
FROM     PROFESSOR
GROUP BY department
ORDER BY department

The result of applying the COUNT aggregation function is a new relation with two attribute names: a GROUP BY attribute (department in this case) and a new attribute called COUNT. The tuples are ordered lexicographically in ascending order according to the ORDER BY attribute, which is department in this case:

department          COUNT (faculty)
Computer Sc.        4
Electrical Eng.     3
Mechanical Eng.     2

The relations created through the CREATE TABLE command are known as base relations. A base relation exists physically and is stored as a file by the DBMS. SQL can be used to create views using the CREATE VIEW command. In contrast to base relations, the creation of a view results in a virtual relation, that is, one that does not necessarily correspond to a physical file. Consider the database of Fig. 94.4, and suppose that we want to create a view giving the name of all suppliers located in Metairie, the products each one provides, and the corresponding prices. Such a view, called METAIRIE_SUPPLIER, can be created as follows:

CREATE VIEW METAIRIE_SUPPLIER AS
   SELECT PRO_INFO.supplier_name, product_name, price
   FROM   PRO_INFO, SUP_LOC
   WHERE  PRO_INFO.supplier_name = SUP_LOC.supplier_name AND
          location = “Metairie”

Because a view is a virtual relation that can be constructed from one or more relations, updating a view may lead to ambiguities. As a result, when a view is generated from more than one relation, there are, in general, restrictions on updating such a view.
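For retrieval, however, a view behaves exactly like a base relation. A sketch using the view just defined:

SELECT product_name, price
FROM   METAIRIE_SUPPLIER
WHERE  price < 100.00;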
Hierarchical Databases

The hierarchical data model [Elmasri and Navathe, 1994] uses a tree data structure to conceptualize associations between different record types. In this model, record types are represented as nodes and associations as links. Each record type, except the root, has only one parent; that is, only parent-child (or one-to-many) relationships are allowed. This restriction gives hierarchical databases their simplicity. Since links are one-way only, from a parent to a child, the design of hierarchical database management systems is made simpler, and only a small set of data manipulation commands is needed.

Because only parent-child relationships are allowed, the hierarchical model cannot efficiently represent two main types of relationships: many-to-many relationships and the case where a record type is a child in more than one hierarchical schema. These two restrictions can be handled by allowing redundant record instances. However, such duplication requires that all copies of the same record be kept consistent at all times.

The example of Fig. 94.8 shows a hierarchical schema. The schema gives the relationship between a DEPARTMENT, its employees (D_EMPLOYEE), the projects (D_PROJECT) handled by the different departments, and how employees are assigned to these projects. It is assumed that an employee belongs to only one department, a project is handled by only one department, and an employee can be assigned to several projects.

FIGURE 94.8 A hierarchical schema.

Notice that since a project has several employees assigned to it, and an employee can be assigned to more than one project, the relationship between D_PROJECT and D_EMPLOYEE is many-to-many. To model this relationship, multiple instances of the same record type D_EMPLOYEE may appear under different projects.
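The printed figure is not reproduced here; from the description above, the schema of Fig. 94.8 can be pictured roughly as the following tree, with the duplicated D_EMPLOYEE record type under D_PROJECT illustrating the redundancy just mentioned:

DEPARTMENT
├── D_EMPLOYEE        (employees belonging to the department)
└── D_PROJECT         (projects handled by the department)
    └── D_EMPLOYEE    (duplicate instances: employees assigned to each project)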
Such redundancies can be reduced to a large extent through the use of logical links. A logical link associates a virtual record from a hierarchical schema with an actual record from either the same schema or another schema. The redundant copy of the actual record is therefore replaced by a virtual record, which is nothing more than a pointer to the actual one.

Hierarchical DDLs are used by a designer to declare the different hierarchical schemas, record types, and logical links. Furthermore, a root node must be declared for each hierarchical schema, and each record type declaration must also specify the parent record type.

Unlike relational DMLs, hierarchical DMLs such as DL/1 are record-at-a-time languages. DL/1 is used by IBM’s IMS hierarchical DBMS. In DL/1, a tree traversal is based on a preorder algorithm, and within each tree, the last record accessed through a DL/1 command can be located through a currency indicator. Retrieval commands are of three types:

GET UNIQUE <record type> WHERE <condition>
Such a command retrieves the leftmost record that meets the imposed restrictions. The search always starts at the root of the tree pointed to by the currency indicator.

GET NEXT [WHERE <condition>]
Starting from the current position, this command uses the preorder algorithm to retrieve the next record that satisfies the restrictions. The clause enclosed between brackets is optional; without it, GET NEXT retrieves the next (preorder) record from the current position.

GET NEXT WITHIN PARENT [WHERE <condition>]
It retrieves all records that have the same parent and that satisfy the restrictions. The parent is assumed to have been selected through a previous GET command.

Four commands are used for record updates:

INSERT
Stores a new record and links it to a parent. The parent has already been selected through a GET command.

REPLACE
The current record (selected through a previous GET command) is modified.

DELETE
The current record and all its descendants are deleted.

GET HOLD
Locks the current record while it is being modified.

The DL/1 commands are usually embedded in a general-purpose (host) language. In this case, a record accessed through a DL/1 command is assigned to a program variable.

Network Databases

In the network model [Elmasri and Navathe, 1994], associations between record types are less restrictive than with the hierarchical model. Here, associations among record types are represented as graphs. One-to-one and one-to-many relationships are described using the notion of set type. Each set type has an owner record type and a member record type. In the example of Fig. 94.8, the relationship between DEPARTMENT and