CHAPTER 5
Predictive Analytics II: Text, Web, and Social Media Analytics

Copyright © 2018 Pearson Education, Inc.

Learning Objectives for Chapter 5
▪ Describe text analytics and understand the need for text mining
▪ Differentiate among text analytics, text mining, and data mining
▪ Understand the different application areas for text mining
▪ Know the process of carrying out a text mining project
▪ Appreciate the different methods to introduce structure to text-based data
▪ Describe sentiment analysis
▪ Develop familiarity with popular applications of sentiment analysis
▪ Learn the common methods for sentiment analysis
▪ Become familiar with speech analytics as it relates to sentiment analysis

CHAPTER OVERVIEW

This chapter provides a comprehensive overview of text analytics/mining and Web analytics/mining along with their popular application areas such as search engines, sentiment analysis, and social network/media analytics. As we have been witnessing in recent years, the unstructured data generated over the Internet of Things (Web, sensor networks, RFID-enabled supply chain systems, surveillance networks, etc.) is increasing at an exponential pace, and there is no indication of its slowing down. This changing nature of data is forcing organizations to make text and Web analytics a critical part of their business intelligence/analytics infrastructure.
CHAPTER OUTLINE
5.1 Opening Vignette: Machine versus Men on Jeopardy!: The Story of Watson
5.2 Text Analytics and Text Mining Overview
5.3 Natural Language Processing (NLP)
5.4 Text Mining Applications
5.5 Text Mining Process
5.6 Sentiment Analysis
5.7 Web Mining Overview
5.8 Search Engines
5.9 Web Usage Mining (Web Analytics)
5.10 Social Analytics

ANSWERS TO END OF SECTION REVIEW QUESTIONS

Section 5.1 Review Questions

1. What is Watson? What is special about it?

Watson is a question-answering (QA) computer system developed by an IBM Research team as part of a project called DeepQA, and named after IBM's first president. What makes it special is that it is able to compete at the human-champion level in real time on the TV quiz show Jeopardy!; in fact, in 2011, it was able to defeat Ken Jennings, who held the record for the longest winning streak in the game. Like Deep Blue did with chess, Watson is showing that computer systems are getting quite good at demonstrating human-like intelligence.

2. What technologies were used in building Watson (both hardware and software)?

Watson is built on the DeepQA framework. The hardware for this system involves a massively parallel processing architecture. In terms of software, Watson uses a variety of AI-related QA technologies, including text mining, natural language processing, question classification and decomposition, automatic source acquisition and evaluation, entity and relation detection, logical form generation, and knowledge representation and reasoning.
3. What are the innovative characteristics of the DeepQA architecture that made Watson superior?

The DeepQA architecture involves massive parallelism, many experts, pervasive confidence estimation, and integration of the latest and greatest in text analytics, involving both shallow and deep semantic knowledge. As implemented in Watson, DeepQA brings together more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. More important than any particular technique is the combination of overlapping approaches that can bring their strengths to bear and contribute to improvements in accuracy, confidence, and speed.

4. Why did IBM spend all that time and money to build Watson? Where is the ROI?

IBM's goal was to advance computer science by exploring new ways for computer technology to affect science, business, and society. The techniques IBM developed with DeepQA and Watson are relevant in a wide variety of domains central to IBM's mission. For example, IBM is currently working on a version of Watson to take on seemingly insurmountable problems in healthcare and medicine. If successful, this could give IBM a distinct competitive advantage in this important technological application area.

Section 5.2 Review Questions

1. What is text analytics? How does it differ from text mining?

Text analytics is a concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms) as well as information extraction, data mining, and Web mining. By contrast, text mining is primarily focused on discovering new and useful knowledge from textual data sources. The overarching goal for both text analytics and text mining is to turn unstructured textual data into actionable information through the application of natural language processing (NLP) and analytics.
However, text analytics is a broader term because of its inclusion of information retrieval. You can think of text analytics as a combination of information retrieval plus text mining.

2. What is text mining? How does it differ from data mining?

Text mining is the application of data mining to unstructured, or less structured, text files. As the names indicate, text mining analyzes words, whereas data mining analyzes numeric data.
3. Why is the popularity of text mining as a BI tool increasing?

The popularity of text mining as a BI tool is increasing because of the rapid growth in text data and the availability of sophisticated BI tools. The benefits of text mining are obvious in areas where very large amounts of textual data are being generated, such as law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), and marketing (customer comments).

4. What are some popular application areas of text mining?

• Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.
• Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.
• Summarization. Summarizing a document to save time on the part of the reader.
• Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.
• Clustering. Grouping similar documents without having a predefined set of categories.
• Concept linking. Connecting related documents by identifying their shared concepts and, by doing so, helping users find information that they perhaps would not have found using traditional search methods.
• Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.

Section 5.3 Review Questions

1. What is NLP?

Natural language processing (NLP) is an important component of text mining and is a subfield of artificial intelligence and computational linguistics.
It studies the problem of "understanding" natural human language, with the view of converting depictions of human language (such as textual documents) into more formal representations (in the form of numeric and symbolic data) that are easier for computer programs to manipulate.
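This conversion from free text into a numeric representation can be sketched with a small term-weighting routine. The following is a simplified, pure-Python illustration of term frequency weighted by inverse document frequency (TF-IDF); the function names and example sentences are invented for this sketch, and a real pipeline would add stemming, stop-word removal, and a proper tokenizer:

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Crude tokenizer: lowercase words only."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf_matrix(docs: list[str]) -> list[dict[str, float]]:
    """Turn each document into a numeric vector: term frequency weighted
    by inverse document frequency (terms shared by all documents vanish)."""
    token_lists = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in token_lists:
        df.update(set(tokens))
    matrix = []
    for tokens in token_lists:
        tf = Counter(tokens)
        matrix.append({term: count * math.log(n_docs / df[term])
                       for term, count in tf.items()})
    return matrix

docs = [
    "text mining extracts knowledge from text",
    "data mining extracts patterns from data",
]
vectors = tfidf_matrix(docs)
# "mining" appears in both documents, so its idf is log(2/2) = 0 and its weight vanishes
print(vectors[0]["mining"])   # 0.0
print(vectors[0]["text"] > 0)  # True: "text" is distinctive to document 0
```

The resulting dictionaries are a sparse version of the term–document matrix discussed later in the chapter; each consecutive step of the text mining process operates on representations like these rather than on raw strings.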
2. How does NLP relate to text mining?

Text mining uses natural language processing to induce structure into the text collection and then uses data mining algorithms such as classification, clustering, association, and sequence discovery to extract knowledge from it.

3. What are some of the benefits and challenges of NLP?

NLP moves beyond syntax-driven text manipulation (which is often called "word counting") to a true understanding and processing of natural language that considers grammatical and semantic constraints as well as the context. The challenges include:

• Part-of-speech tagging. It is difficult to mark up terms in a text as corresponding to a particular part of speech because the part of speech depends not only on the definition of the term but also on the context within which it is used.
• Text segmentation. Some written languages, such as Chinese, Japanese, and Thai, do not have single-word boundaries.
• Word sense disambiguation. Many words have more than one meaning. Selecting the meaning that makes the most sense can only be accomplished by taking into account the context within which the word is used.
• Syntactic ambiguity. The grammar for natural languages is ambiguous; that is, multiple possible sentence structures often need to be considered. Choosing the most appropriate structure usually requires a fusion of semantic and contextual information.
• Imperfect or irregular input. Foreign or regional accents and vocal impediments in speech, and typographical or grammatical errors in texts, make the processing of the language an even more difficult task.
• Speech acts. A sentence can often be considered an action by the speaker. The sentence structure alone may not contain enough information to define this action.

4. What are the most common tasks addressed by NLP?

Following are among the most popular tasks:
• Question answering
• Automatic summarization
• Natural language generation
• Natural language understanding
• Machine translation
• Foreign language reading
• Foreign language writing
• Speech recognition
• Text-to-speech
• Text proofing
• Optical character recognition

Section 5.4 Review Questions

1. List and briefly discuss some of the text mining applications in marketing.

Text mining can be used to increase cross-selling and up-selling by analyzing the unstructured data generated by call centers. Text mining has become invaluable for customer relationship management. Companies can use text mining to analyze rich sets of unstructured text data, combined with the relevant structured data extracted from organizational databases, to predict customer perceptions and subsequent purchasing behavior.

2. How can text mining be used in security and counterterrorism?

Students may use the introductory case in this answer. In 2007, EUROPOL developed an integrated system capable of accessing, storing, and analyzing vast amounts of structured and unstructured data sources in order to track transnational organized crime. Another security-related application of text mining is in the area of deception detection.

3. What are some promising text mining applications in biomedicine?

As in any other experimental approach, it is necessary to analyze the vast amount of data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.
Section 5.5 Review Questions

1. What are the main steps in the text mining process?

See Figure 5.6 (p. 222). Text mining entails three tasks:
• Establish the Corpus: Collect and organize the domain-specific unstructured data.
• Create the Term–Document Matrix: Introduce structure to the corpus.
• Extract Knowledge: Discover novel patterns from the term–document matrix (TDM).

2. What is the reason for normalizing word frequencies? What are the common methods for normalizing word frequencies?

The raw indices need to be normalized in order to have a more consistent TDM for further analysis. Common methods are log frequencies, binary frequencies, and inverse document frequencies.

3. What is SVD? How is it used in text mining?

Singular value decomposition (SVD), which is closely related to principal components analysis, reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower-dimensional space, where each consecutive dimension represents the largest degree of variability (between words and documents) possible.

4. What are the main knowledge extraction methods from the corpus?

The main categories of knowledge extraction methods are classification, clustering, association, and trend analysis.

Section 5.6 Review Questions

1. What is sentiment analysis? How does it relate to text mining?

Sentiment analysis tries to answer the question, "What do people feel about a certain topic?" by digging into the opinions of many using a variety of automated tools. It is also known as opinion mining, subjectivity analysis, and appraisal extraction. Sentiment analysis shares many characteristics and techniques with text mining. However, unlike text mining, which categorizes text by conceptual taxonomies of
topics, sentiment classification generally deals with two classes (positive versus negative), a range of polarity (e.g., star ratings for movies), or a range in strength of opinion.

2. What are the most popular application areas for sentiment analysis? Why?

Customer relationship management (CRM) and customer experience management are popular "voice of the customer" (VOC) applications. Other application areas include "voice of the market" (VOM) and "voice of the employee" (VOE).

3. What would be the expected benefits and beneficiaries of sentiment analysis in politics?

Opinions matter a great deal in politics. Because political discussions are dominated by quotes, sarcasm, and complex references to persons, organizations, and ideas, politics is one of the most difficult, and potentially fruitful, areas for sentiment analysis. By analyzing the sentiment on election forums, one may predict who is more likely to win or lose. Sentiment analysis can help understand what voters are thinking and can clarify a candidate's position on issues. Sentiment analysis can help political organizations, campaigns, and news analysts to better understand which issues and positions matter the most to voters. The technology was successfully applied by both parties to the 2008 and 2012 American presidential election campaigns.

4. What are the main steps in carrying out sentiment analysis projects?

The first step when performing sentiment analysis of a text document is called sentiment detection, during which text data is differentiated between fact and opinion (objective vs. subjective). This is followed by negative-positive (N-P) polarity classification, where a subjective text item is classified on a bipolar range. Following this comes target identification (identifying the person, product, event, etc., that the sentiment is about).
Finally comes collection and aggregation, in which the overall sentiment for the document is calculated based on the sentiments of the individual phrases and words from the first three steps.

5. What are the two common methods for polarity identification? Explain.

Polarity identification can be done via a lexicon (as a reference library) or by using a collection of training documents and inductive machine learning algorithms. The lexicon approach uses a catalog of words, their synonyms, and their meanings, combined with numerical ratings indicating the position on the N-P polarity associated with these words. In this way, affective, emotional, and attitudinal phrases can be classified according to their degree of positivity or negativity. By contrast, the training-document approach uses statistical analysis and machine learning algorithms, such as neural networks, clustering approaches,
and decision trees to ascertain the sentiment for a new text document based on patterns from previous "training" documents with assigned sentiment scores.

Section 5.7 Review Questions

1. What are some of the main challenges the Web poses for knowledge discovery?

• The Web is too big for effective data mining.
• The Web is too complex.
• The Web is too dynamic.
• The Web is not specific to a domain.
• The Web has everything.

2. What is Web mining? How does it differ from regular data mining or text mining?

Web mining is the discovery and analysis of interesting and useful information from the Web and about the Web, usually through Web-based tools. Text mining is less structured because it's based on words instead of numeric data.

3. What are the three main areas of Web mining?

The three main areas of Web mining are Web content mining, Web structure mining, and Web usage (or activity) mining.

4. What is Web content mining? How can it be used for competitive advantage?

Web content mining refers to the extraction of useful information from Web pages. The documents may be extracted in some machine-readable format so that automated techniques can generate some information about the Web pages. Collecting and mining Web content can be used for competitive intelligence (collecting intelligence about competitors' products, services, and customers), which can give your organization a competitive advantage.

5. What is Web structure mining? How does it differ from Web content mining?

Web structure mining is the process of extracting useful information from the links embedded in Web documents. By contrast, Web content mining involves analysis of the specific textual content of Web pages. So, Web structure mining is more related to navigation through a website, whereas Web content mining is more related to text mining and the document hierarchy of a particular Web page.
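The content-versus-structure distinction above can be seen directly in code: a structure-mining pass ignores the page text and keeps only the hyperlinks. Below is a minimal sketch using Python's standard-library `html.parser`; the sample page markup is made up for illustration.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from <a> tags -- the raw material of Web
    structure mining, as opposed to the visible text of the page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page fragment for illustration
page = '<p>See <a href="/docs">the docs</a> and <a href="https://example.com">a partner site</a>.</p>'

extractor = LinkExtractor()
extractor.feed(page)
print(extractor.links)  # → ['/docs', 'https://example.com']
```

Everything outside the `href` attributes, i.e., the visible text that Web content mining would analyze, is simply ignored here; collecting those links across many pages yields the link graph that structure-mining techniques operate on.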
Section 5.8 Review Questions

1. What is a search engine? Why are they important for today's businesses?

A search engine is a software program that searches for documents (Internet sites or files) based on the keywords (individual words, multi-word terms, or a complete sentence) that users have provided that have to do with the subject of their inquiry. This is the most prominent type of information retrieval system for finding relevant content on the Web.

Search engines have become the centerpiece of most Internet-based transactions and other activities. Because people use them extensively to learn about products and services, it is very important for companies to have prominent visibility on the Web; hence the major effort of companies to enhance their search engine optimization (SEO).

2. What is a Web crawler? What is it used for? How does it work?

A Web crawler (also called a spider or a Web spider) is a piece of software that systematically browses (crawls through) the World Wide Web for the purpose of finding and fetching Web pages. It starts with a list of "seed" URLs, goes to the pages of those URLs, and then follows each page's hyperlinks, adding them to the search engine's database. Thus, the Web crawler navigates through the Web in order to construct the database of websites.

3. What is "search engine optimization"? Who benefits from it?

Search engine optimization (SEO) is the intentional activity of affecting the visibility of an e-commerce site or a website in a search engine's natural (unpaid or organic) search results. It involves editing a page's content, HTML, metadata, and associated coding to both increase its relevance to specific keywords and to remove barriers to the indexing activities of search engines. In addition, SEO efforts include promoting a site to increase its number of inbound links. SEO primarily benefits companies with e-commerce sites by making their pages appear toward the top of search engine lists when users query.

4. What things can help Web pages rank higher in the search engine results?

Cross-linking between pages of the same website to provide more links to the most important pages may improve its visibility. Writing content that includes frequently searched keyword phrases, so as to be relevant to a wide variety of search queries, will tend to increase traffic. Updating content so as to keep search engines crawling back frequently can give additional weight to a site. Adding relevant keywords to a Web page's metadata, including the title tag and meta description, will tend to improve the relevancy of a site's search listings, thus increasing traffic. Normalizing the URLs of Web pages that are accessible via multiple URLs, and using canonical link elements and redirects, can help make sure that links to different versions of a URL all count toward the page's link popularity score.
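The crawling procedure described in question 2 (start from seed URLs, follow each page's hyperlinks, record every page discovered) is essentially a breadth-first graph traversal. Below is a minimal sketch against a simulated Web; the URLs and the link graph are hypothetical stand-ins for pages a real crawler would fetch over the network.

```python
from collections import deque

# A tiny simulated "Web": each URL maps to the hyperlinks found on its page.
# (Hypothetical data; a real crawler would fetch and parse live pages.)
SIMULATED_WEB = {
    "http://a.example": ["http://b.example", "http://c.example"],
    "http://b.example": ["http://c.example", "http://d.example"],
    "http://c.example": ["http://a.example"],
    "http://d.example": [],
}

def crawl(seed_urls, fetch_links):
    """Breadth-first crawl: visit the seeds, follow each page's links,
    and return the set of all pages discovered (the crawler's database)."""
    discovered = set(seed_urls)
    frontier = deque(seed_urls)
    while frontier:
        url = frontier.popleft()
        for link in fetch_links(url):
            if link not in discovered:  # skip pages we have already seen
                discovered.add(link)
                frontier.append(link)
    return discovered

pages = crawl(["http://a.example"], lambda url: SIMULATED_WEB.get(url, []))
print(sorted(pages))
```

A production crawler layers politeness delays, robots.txt checks, and URL normalization on top of this loop, but the underlying traversal that builds the search engine's database is the same.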