Text Mining
Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights. Applied to large text datasets, it can uncover hidden relationships, recurring patterns, and important topics.
Regex in Text Mining
- In text mining, regular expressions (regex) are mostly used to match substrings against simple patterns.
- Pattern-based substring matching is one of the most common tasks. In essence, it means checking whether any part of the given text matches a predefined pattern.
- Typical targets include dates, phone numbers, URLs, email addresses, hashtags, emojis, and more.
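As a minimal sketch of pattern-based substring matching, the snippet below uses Python's built-in `re` module to pull a few of these items out of a sample string. The sample text and the deliberately simplified patterns are illustrative assumptions, not production-grade validators.

```python
import re

# Hypothetical sample text used only for illustration.
text = "Contact jane.doe@example.com or visit https://example.com/docs #textmining #regex"

# Simplified patterns (assumptions): real-world emails and URLs are more varied.
EMAIL = r"[\w.+-]+@[\w-]+\.[\w.-]+"
URL = r"https?://\S+"
HASHTAG = r"#\w+"

# re.search answers "does anything in the text match the pattern?"
has_email = re.search(EMAIL, text) is not None

# re.findall returns every non-overlapping match as a list of strings.
emails = re.findall(EMAIL, text)
urls = re.findall(URL, text)
hashtags = re.findall(HASHTAG, text)

print(has_email, emails, urls, hashtags)
```

`re.search` is enough when only a yes/no answer is needed, while `re.findall` is the more convenient choice when the matched substrings themselves are the goal.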
Typical Uses for Regex
- Search text for particular character patterns.
- Validate that a string conforms to a preset pattern (e.g., email addresses or passwords).
- Extract, edit, replace, or delete substrings that match a pattern (for example, removing all HTML tags, URLs, and unwanted Unicode characters).
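The validation and removal uses listed above can be sketched with `re.fullmatch` and `re.sub`. The function names, the simplified email pattern, and the choice to treat "Unicode characters" as "anything outside ASCII" are assumptions made for this example.

```python
import re

# Simplified email pattern (an assumption); it does not cover every valid address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")

def is_valid_email(candidate: str) -> bool:
    """Return True only if the whole string matches the pattern."""
    return EMAIL.fullmatch(candidate) is not None

def clean_text(raw: str) -> str:
    """Remove HTML tags, URLs, and non-ASCII characters, then tidy whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)               # drop HTML tags
    no_urls = re.sub(r"https?://\S+", " ", no_tags)      # drop URLs
    ascii_only = re.sub(r"[^\x00-\x7F]+", " ", no_urls)  # drop non-ASCII characters
    return re.sub(r"\s+", " ", ascii_only).strip()       # collapse runs of whitespace

# Hypothetical input used only for illustration.
sample = "<p>Read more at https://example.com/post ✨ today</p>"
print(is_valid_email("jane.doe@example.com"))
print(is_valid_email("not-an-email"))
print(clean_text(sample))
```

Using `fullmatch` rather than `search` matters for validation: `search` would accept any string that merely contains something email-like, whereas `fullmatch` requires the entire input to fit the pattern.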
Pattern for Extracting Dates in Common Formats
r'\d{1,2}[\/-]\d{1,2}[\/-]\d{2,4}|(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)+[.]?[\s-]?\d{1,2}[\-,\s][\s]?\d{2,4}|'\
r'\d{1,2}[\s](?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)+[.,\s]?[\s]?\d{2,4}|'\
r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[A-Za-z]*[\s]?[,]?[\s]?\d{2,4}|'\
r'\d{1,2}[\/]\d{2,4}|'\
r'[^\-][\s]?\d{4}[\s,][^\-]'
The pattern is meant to cover date formats such as:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
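To show how such formats can be picked out of running text, here is a simplified sketch using `re.findall`. It is not the pattern shown above but a shorter re-implementation written for this example; the names `MONTH`, `ALTERNATIVES`, `DATE_RE`, and `extract_dates` are hypothetical, and the bare-year alternative will also match any standalone four-digit number.

```python
import re

# Month names, abbreviated or written out (assumption: English only).
MONTH = (
    r"(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|"
    r"Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)"
)

# Ordered from most to least specific so longer formats win over shorter ones.
ALTERNATIVES = [
    r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}",                         # 04/20/2009, 4/3/09
    MONTH + r"\.?[\s-]\d{1,2}(?:st|nd|rd|th)?[\s,-]+\d{2,4}",  # Mar-20-2009, Mar 20th, 2009
    r"\d{1,2}\s" + MONTH + r"[.,]?\s\d{2,4}",                 # 20 Mar 2009, 20 March, 2009
    MONTH + r"\.?\s\d{4}",                                    # Feb 2009
    r"\d{1,2}/\d{4}",                                         # 6/2008
    r"\d{4}",                                                 # 2009 (any 4-digit number)
]

# Word boundaries keep the match from starting or ending inside a longer token.
DATE_RE = re.compile(r"\b(?:" + "|".join(ALTERNATIVES) + r")\b")

def extract_dates(text: str) -> list[str]:
    """Return every date-like substring found in the text, in order."""
    return DATE_RE.findall(text)

# Hypothetical sentence used only for illustration.
sample = (
    "Filed 04/20/09, amended Mar. 20, 2009 and 20 March 2009; "
    "reviewed Mar 21st, 2009, Feb 2009, 6/2008, and again in 2010."
)
found = extract_dates(sample)
print(found)
```

The order of the alternatives is the main design choice here: regex alternation takes the first branch that succeeds, so putting the bare-year branch first would cut `Mar. 20, 2009` down to `2009`.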
# Split the Duration column (e.g. "2h 50m") into hour and minute columns using a regex pattern
fp[['duration_in_hr', 'duration_in_min']] = fp['Duration'].str.extract(r'(?:(\d+)h)?\s?(?:(\d+)m)?').fillna(0).astype(int)
fp.head()
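For a self-contained view of what this extraction does, the sketch below runs the same regex on a small stand-in DataFrame. The sample values and the extra `duration_total_min` column are assumptions added for illustration; the real `fp` data is not shown in this article.

```python
import pandas as pd

# Hypothetical stand-in for the fp DataFrame; only the Duration column matters here.
fp = pd.DataFrame({"Duration": ["2h 50m", "19h", "45m", "1h 5m"]})

# Two optional capture groups: digits before "h" and digits before "m".
parts = fp["Duration"].str.extract(r"(?:(\d+)h)?\s?(?:(\d+)m)?")

# A missing component comes back as NaN, so fill it with "0" before casting to int.
parts = parts.fillna("0").astype(int)
fp["duration_in_hr"] = parts[0]
fp["duration_in_min"] = parts[1]

# Combine both parts into one numeric feature (an added convenience column).
fp["duration_total_min"] = fp["duration_in_hr"] * 60 + fp["duration_in_min"]
print(fp)
```

Because both groups are optional, values such as `19h` or `45m` still parse, with the absent part defaulting to zero.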