Challenges in text mining
• Data collection is “free text”
  – Data is not well-organized
    • Semi-structured or unstructured
  – Natural language text contains ambiguities on many levels
    • Lexical, syntactic, semantic, and pragmatic
  – Learning techniques for processing text typically need annotated training examples
    • Expensive to acquire at scale
• What to mine?
CS@UVa CS6501: Text Mining