Feature Engineering
Li Dong, School of Automation, Guangdong University of Technology
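Outline
• 3.1 What is feature engineering?
• 3.2 Automatic word segmentation, part-of-speech tagging, and syntactic analysis in natural language processing
• 3.3 The vector space model and text similarity computation
• 3.4 Similarity computation
• 3.5 Feature scaling and normalization
• 3.6 Feature selection
• 3.7 Feature dimensionality reduction and expansion
Liu Yuanchao, School of Computer Science, Harbin Institute of Technology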
What is feature engineering?
• Definition from Wikipedia (https://en.wikipedia.org/wiki/Feature_engineering):
  ■ Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
• A saying quoted from Zhihu: "Data and features determine the upper limit of machine learning; models and algorithms merely approach that limit."
• Deep learning also relies on features: the input features still need to be combined, transformed, and otherwise processed.
Automatic word segmentation
• What is automatic word segmentation? A computer takes an article or passage written in natural language and outputs it word by word, which is a prerequisite for all later processing.
• Examples:
  ■ "我来到北京清华大学。" → "我/ 来到/ 北京/ 清华大学/ 。/" (the Chinese sentence means "I came to Tsinghua University in Beijing.")
  ■ "I came to Tsinghua University in Beijing." → "I/ came/ to/ Tsinghua/ University/ in/ Beijing/ ./"
• Question to consider: how does automatic segmentation of Chinese differ from that of English?
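The two examples above can be reproduced with a minimal sketch. English text already marks word boundaries with spaces, so a regular expression is enough; Chinese has no such delimiter, so the sketch uses forward maximum matching against a dictionary. The dictionary `TOY_DICT`, the function names, and the choice of forward maximum matching are assumptions made for illustration; the slides do not say which segmentation method is intended.

```python
import re

# Toy dictionary: an assumption for illustration, not a real lexicon.
TOY_DICT = {"我", "来到", "北京", "清华", "清华大学", "大学"}

def fmm_segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens

def english_tokenize(text):
    """Keep runs of word characters together and split off punctuation."""
    return re.findall(r"\w+|[^\w\s]", text)

print("/ ".join(fmm_segment("我来到北京清华大学。", TOY_DICT)))
print("/ ".join(english_tokenize("I came to Tsinghua University in Beijing.")))
```

Because the matcher prefers the longest entry, "清华大学" is kept whole even though "清华" and "大学" are also in the dictionary. This hints at the answer to the question above: Chinese segmentation has to resolve ambiguity that whitespace resolves for free in English.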
Stemming and lemmatization
• Stemming extracts the stem or root form of a word; the result does not necessarily carry complete meaning.
  ■ Original: 'And I also like eating apple'
  ■ After stemming: ['and', 'I', 'also', 'like', 'eat', 'appl']
• Lemmatization reduces a word to its general (dictionary) form, which does carry complete meaning. For example, "drove" becomes "drive".
  ■ Original: 'And I also like eating apple'
  ■ After lemmatization: ['And', 'I', 'also', 'like', 'eat', 'apple']
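The contrast between the two operations can be sketched with a toy example. The suffix list, the lemma table, and the function names below are all assumptions for illustration; this is not the Porter algorithm or any library's lemmatizer, only a rough picture of "rule-based stripping" versus "dictionary lookup".

```python
# Toy suffix rules and lemma table: assumptions for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")
LEMMAS = {"drove": "drive", "eating": "eat", "studies": "study"}

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in a lemma table; unknown words are returned unchanged."""
    return LEMMAS.get(word.lower(), word)

for w in ["eating", "studies", "drove"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The stemmer turns "studies" into the non-word "studi" and leaves the irregular form "drove" untouched, while the lookup returns "study" and "drive". That is the sense in which a stem need not carry complete meaning but a lemma does.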
Part-of-speech tagging
• Part-of-speech (POS) tagging [1] is the procedure that assigns the correct part of speech to every word in the segmentation result, that is, it decides whether each word is a noun, a verb, an adjective, or some other part of speech.
• Example: the tagging result for "I like eating apple." is
  [('I', 'PRP'), ('like', 'VBP'), ('eating', 'VBG'), ('apple', 'NN'), ('.', '.')]

  Tag   Meaning                                   Example
  PRP   personal pronoun                          I, he, she
  VBP   verb, singular present, non-3rd person    take
  VBG   verb, gerund/present participle           taking
  NN    noun, singular                            desk

• Penn Treebank POS tag set: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

[1] Zong Chengqing, Statistical Natural Language Processing, Tsinghua University Press, August 2013.
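To make the output format concrete, here is a minimal lookup tagger. The lexicon `TOY_LEXICON`, the default tag, and the function name are assumptions for illustration; a real tagger learns its model from an annotated corpus and uses context to resolve ambiguous words.

```python
import re

# Toy lexicon mapping lowercase words to Penn Treebank tags (illustrative assumption).
TOY_LEXICON = {"i": "PRP", "like": "VBP", "eating": "VBG", "apple": "NN", ".": "."}

def lookup_tag(tokens, lexicon, default="NN"):
    """Tag each token by lexicon lookup, falling back to a default tag."""
    return [(tok, lexicon.get(tok.lower(), default)) for tok in tokens]

tokens = re.findall(r"\w+|[^\w\s]", "I like eating apple.")
print(lookup_tag(tokens, TOY_LEXICON))
```

A pure lookup cannot tell the verb "like" from the preposition "like", which is why practical taggers also look at neighboring words.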
Syntactic analysis
• Syntactic analysis (parsing): its basic task is to determine the syntactic structure of a sentence, or the dependency relations between the words in it.
[Figure: parse tree with phrase labels (NP-SBJ, VP, ADJP, PP-CLR, NP-TMP) over the words "Pierre Vinken", "61 years old", "will", "the board", "as a nonexecutive director", "Nov. 29".]
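A syntactic structure of this kind is usually stored as a tree. The sketch below represents one as nested tuples and prints it in bracket notation. The example sentence and its bracketing are hand-written assumptions for illustration, not the output of a parser and not the tree shown on the slide.

```python
# Hand-written constituency tree for "I like apples": (label, child, child, ...).
# The bracketing is an illustrative assumption, not parser output.
TREE = ("S",
        ("NP", ("PRP", "I")),
        ("VP", ("VBP", "like"),
               ("NP", ("NNS", "apples"))))

def leaves(node):
    """Return the words at the leaves of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1:]:
        words.extend(leaves(child))
    return words

def to_brackets(node):
    """Render the tree in bracket notation, e.g. (NP (PRP I))."""
    if isinstance(node, str):
        return node
    return "(" + node[0] + " " + " ".join(to_brackets(c) for c in node[1:]) + ")"

print(leaves(TREE))
print(to_brackets(TREE))
```

The innermost labels are POS tags, so a constituency tree builds directly on the tagging step.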
NLTK
• The Natural Language Toolkit (NLTK, http://www.nltk.org) is one of the most widely used Python libraries in NLP. It was developed by Steven Bird and Edward Loper of the computer science department at the University of Pennsylvania.
• It provides many text processing functions:
  ■ Tokenization (splitting text into words)
  ■ Stemming
  ■ Tagging (for example, part-of-speech tagging)
  ■ Parsing (syntactic analysis)
• It also provides interfaces to more than 50 corpora and lexical resources, such as WordNet.
Text Processing API
• http://text-processing.com/ (Natural Language Text Processing APIs)
• Supported functionality:
  ■ Stemming & Lemmatization
  ■ Sentiment Analysis
  ■ Tagging and Chunk Extraction
  ■ Phrase Extraction & Named Entity Recognition
Accessing the Text Processing API with curl
• curl (https://curl.haxx.se/; the name stands for "Client URL") is an open-source command-line file transfer tool that works with URL syntax. It supports Unix, many Linux distributions, Win32, Win64, and other platforms.
• Example: request a sentiment analysis of the word "great".

  $ curl -d "text=great" http://text-processing.com/api/sentiment/
  {
    "probability": {
      "neg": 0.39680315784838732,
      "neutral": 0.28207586364297021,
      "pos": 0.60319684215161262
    },
    "label": "pos"
  }
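The same request can be made from Python with the standard library. This is a minimal sketch that assumes the endpoint still accepts a form-encoded `text` field and still returns JSON in the shape shown in the curl example; the function names are made up for illustration. The network call is wrapped in a function and not executed here, and the parsing step is demonstrated on the reply copied from the curl example.

```python
import json
import urllib.parse
import urllib.request

API_URL = "http://text-processing.com/api/sentiment/"  # endpoint from the curl example

def build_request(text):
    """Build the same form-encoded POST that `curl -d "text=..."` sends."""
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request(API_URL, data=body)

def query_sentiment(text, timeout=10):
    """Send the request and decode the JSON reply (needs network access)."""
    with urllib.request.urlopen(build_request(text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def top_label(reply):
    """Return the predicted label and its probability from a reply dict."""
    label = reply["label"]
    return label, reply["probability"][label]

# Reply copied from the curl example, so this part runs offline.
sample = json.loads('{"probability": {"neg": 0.39680315784838732, '
                    '"neutral": 0.28207586364297021, "pos": 0.60319684215161262}, '
                    '"label": "pos"}')
print(top_label(sample))
```

Calling `query_sentiment("great")` would perform the live request; whether the service is reachable and how it rate-limits callers is not something this sketch can guarantee.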