An Efficient Trie-based Method for Approximate Entity Extraction with Edit-Distance Constraints Dong Deng (Tsinghua,China) Guoliang Li (Tsinghua,China) Jianhua Feng (Tsinghua,China) h -1911-
Dong Deng (Tsinghua, China) Guoliang Li (Tsinghua, China) Jianhua Feng (Tsinghua, China)
Outline ● Motivation ● Problem formulation Trie-based framework Trie-based algorithms o Optimizing partition Scheme ° Experiment ● Conclusion 1/28/2021 aste@ ICDE2012
Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion 1/28/2021 Taste @ ICDE2012 2/42
Named Entity recognition o Dictionary-based Ner Dictionary of Entities Documents Isaac newtone 1 Str- Isaac Newton was an english physicist. Sigmund Freud mathematician, astronomer, natural philosopher. alchemist, and theologian and one of the n e most English influential men in human history. His philosophiae Austrian Naturalis Principia Mathematica, published in 1687, physicist by itself considered to be among the most influential mathematician books in the history of science laying the groundwork astronomer for most of classical mechanics philosopher alchemist 2 Sigmund freud was an austrian psychiatrist who theologian founded the psychoanalytic school of psychology. Freud psychiatrist is best known for his theories of the unconscious mind economist and the defense mechanism of repression and for historian creating the clinical practice of psychoanalysis for sociologist curing psychopathology through dialogue between a patient and a psychoanalyst 1/28/2021 Taste@ ICDE2012 /42
Dictionary-based NER Named Entity Recognition 1/28/2021 Taste @ ICDE2012 Isaac Newton Sigmund Freud English Austrian physicist mathematician astronomer philosopher alchemist theologian psychiatrist economist historian sociologist ... 1 Sir Isaac Newton was an English physicist, mathematician, astronomer, natural philosopher, alchemist, and theologian and one of the most influential men in human history. His Philosophiæ Naturalis Principia Mathematica, published in 1687, is by itself considered to be among the most influential books in the history of science, laying the groundwork for most of classical mechanics. 2 Sigmund Freud was an Austrian psychiatrist who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalyst. Dictionary of Entities Documents 3/42
Automatically add the links ● Wikipedia Q2 KIPEDIA Levenshtein distance the Levenshtein distance fs a measuring the amount of difference between two sequences( ie an edit distance).The o refer specifically to Levenshtein distance The Levenshtein distance betweer(wo strings) defined as the mimimum number of edits needed to transform one string into the other, with the allowable edit operations bei Toolbox insertion deletion, or substitution of a singe character. It is named after Vladimir Levenshtein, who considered this distance in 1965. [11 b Print export For example, the Levenshtein distance between " kitten'andsittingis 3, since the following three edits change one into the other, and there is no way to do it with fewer than v languages three edits. the objective is to find matches for short strings, for instance, strings from a dictionary, in many longer texts, in situations where a sn Here. one of the strings is typically short, while the other is arbitrarily long. This has a wide range of applications, for insta ce, spell checke correcton syst optical character rece and software to assist natural language translation The Levenshtein distance can also be computed between two longer strings, but the cost to compute it, which is roughly proportional to the product of the two string lengths, Levenshtein distance is not the only popular notion of ed distance. Variations can be obtained by changing the set of allowable edit operations: for instance, is available, and each operation is assigned a cost(possibly infinite) further generalized by ence alignment algorithms such h make an operation s cost depend on where it is appled. Computing the Levenshtein distance is based on the observation that if we reserve a matri to hold the Levenshtein distances between all prefixes of the first string and all prefixes of the second, then we can compute the values in the matrix by flood fling the matrix, and thus find the distance between the two full strings as the last value http://en.wikipedia.org/wiki/levenshtein_disTance 1/28/2021 Taste@ ICDE2012 4
Wikipedia http://en.wikipedia.org/wiki/Levenshtein_distance Automatically add the links 1/28/2021 Taste @ ICDE2012 4/42
Real-world Data is Rather Dirty! a DBLP Complete search ● Typo in“aut Argyrios zymnis 008 2EE Argyrios Zymis, Stephen P. Boyd, Dimitry Ml. Gorinevsky: Mixed state estimation for a linear Gaussi an Markov model. CDC2008:3219-3226 I EE Argyris Zymmis, Stephen P. Boyd, Dimitry M. Gorinevsky: Mixed state estimation for a linear Gaussian Markov model. □Awvw:051011 Argyris Zymnis Typo in“ title relaxe Yair Bartal, Moses Charikar, Piotr Indyk: On page migration and otherrelaxed task systems. Theor. Comput. Sci. Tcs)268(1):43-6(2001) 1997 1 EE Yair Bartal, Moses Charikar, Piotr Indyk: On Page Migration and Other Related Task Systems. SODA 1997: 43-52 1/28/2021 related Taste@ ICDE2012
Typo in “author” Typo in “title” relaxed related Argyrios Zymnis Argyris Zymnis DBLP Complete Search 1/28/2021 Real-world Data is Rather Dirty! Taste @ ICDE2012 5/42
Approximate Entity Extraction Approximate dictionary-based entity extraction finds all substrings from the document that approximately match the predefined entities ● For example: Dictionary of entities D ocuments Isaac newton Sigmund freund was an austrian psychiatrist who <Sigmund Freud founded the psychoanalytic school of psychology physicist phys Freud is best known for his theories of the astronomer unconscious mind and the defense mechanism of alchemist theologian repression and for creating the clinical practice of economist psychoanalysis for curing psychopathology sociologist Irou Th dialogue between a patient and a g sychoanalayst 1/28/2021 Taste@ ICDE2012 6/42
Approximate dictionary-based entity extraction finds all substrings from the document that approximately match the predefined entities. For example: Approximate Entity Extraction 1/28/2021 Taste @ ICDE2012 Isaac Newton Sigmund Freud physicist astronomer alchemist theologian economist sociologist ... Sigmund Freund was an Austrian psychiatrest who founded the psychoanalytic school of psychology. Freud is best known for his theories of the unconscious mind and the defense mechanism of repression and for creating the clinical practice of psychoanalysis for curing psychopathology through dialogue between a patient and a psychoanalayst. Dictionary of Entities Documents 6/42
Outline ● Motivation ● Problem formulation Trie-based framework Trie-based algorithms o Optimizing partition Scheme ° Experiment ● Conclusion 1/28/2021 aste@ ICDE2012
Outline Motivation Problem Formulation Trie-based Framework Trie-based Algorithms Optimizing Partition Scheme Experiment Conclusion 1/28/2021 Taste @ ICDE2012 7/42
Problem formulation o Approximate Entity Extraction: Given a dictionary of entities E=ley e,,.,en, a document D, and a predefined edit distance threshold t, approximate entity extraction finds all"similar"pairs such that ED(S, ej s t, where s is a substring of d and eie e (a) dictionary E (b) Document D ID Entities Len Document vancouver an efficient filter for approximates 2 vanateshe membership ChecKing kaushit 3 Ksurajit chaudrD tetretratttrti, >surajit chaudhuri, 4 caushit chaudui 15 vankatesh ganti dong rIn. 5 caushit chakra 15 vancouver, canada. sigmod 2008 1/28/2021 Taste@ ICDE2012
Problem Formulation 1/28/2021 Taste @ ICDE2012 Approximate Entity Extraction: Given a dictionary of entities E = {e1 , e2 , . . . , en }, a document D, and a predefined edit distance threshold τ, approximate entity extraction finds all “similar” pairs such that ED(s, ei) ≤ τ, where s is a substring of D and ei∈ E. 8/42
Edit distance ED(r s): The minimum number of single-character edit operations(insertion/deletion/substitution )to transform r to s o For example: ED(marios, maras)=2 8012345 m101 34 Dij=min (Di l, j +1,Di,jI+1,a 2 1 substitutea-p 321 Ⅰ sert o 0 if a,=b i4321 where t lifa.≠b 5432 654332 1/28/2021 Taste@ ICDE2012 9/42
ED(r, s): The minimum number of single-character edit operations(insertion/deletion/substitution) to transform r to s. For example: ED(marios, maras) = 2 Di,0 = i, D0,j = j, Di,j = min{Di-1, j + 1, Di, j-1 + 1, Di-1, j-1 + t i, j}, 0 if ai = bj 1 if ai bj . Edit Distance 1/28/2021 Taste @ ICDE2012 9/42 where t i, j = Substitute a->i Insert o
State-of-the-art methods NGPP Faerie Inverted Index Prefix of 1-variant family g-grams Filtering a substring matches with Overlap must be larger Condition the prefix of 1-variant of than a threshold partition Shortage ● Large Index size Need to tune parameters Not efficient for large threshold 1/28/2021 Taste@ ICDE2012
State-of-the-art Methods Shortage: Large Index Size Need to Tune Parameters Not efficient for large threshold 1/28/2021 Taste @ ICDE2012 10/42 NGPP Faerie Inverted Index Prefix of 1-variant family q-grams Filtering Condition a substring matches with the prefix of 1-variant of partition Overlap must be larger than a threshold