Text Mining with Regex

Text Mining

Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format in order to identify meaningful patterns and new insights. Applied to large textual datasets, it can uncover hidden relationships, recurring patterns, and important topics.

Regex in Text Mining

  • In text mining, regular expressions (regex, or RegEx) are used mainly for pattern-based substring matching.
  • Substring matching is one of the most common tasks: it checks whether any part of the given text matches a predefined pattern.
  • Typical targets include dates, phone numbers, URLs, email addresses, hashtags, and emojis.
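As a minimal sketch of substring matching, the snippet below pulls a few of these targets out of a made-up sentence with Python's built-in re module. The sample text and the four patterns are illustrative assumptions; they are deliberately simple and will not cover every real-world email address, URL, or phone number format.

```python
import re

# Hypothetical sample text, invented for this illustration.
text = (
    "Contact support@example.com or visit https://example.com/docs for details. "
    "Call 555-123-4567. #textmining #regex"
)

# Simplified patterns (assumptions, not exhaustive validators).
patterns = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.-]+",
    "url": r"https?://\S+",
    "phone": r"\b\d{3}-\d{3}-\d{4}\b",
    "hashtag": r"#\w+",
}

# re.findall returns every non-overlapping substring that matches the pattern.
matches = {name: re.findall(pattern, text) for name, pattern in patterns.items()}

for name, found in matches.items():
    print(name, found)
```

Keeping the patterns in a dictionary makes it easy to add further targets, such as dates, without changing the matching loop.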

Typical Uses for Regex

  • Search text for particular character patterns.
  • Validate text against a predefined pattern (e.g., check email addresses or passwords).
  • Extract, edit, replace, or delete substrings that match a pattern (for example, remove all HTML tags, URLs, or unwanted Unicode characters).
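The three uses above can be sketched with re.search, re.fullmatch, and re.sub. The sample string, the helper name is_valid_email, and the simplified email pattern are assumptions made for illustration; production-grade email validation is considerably stricter.

```python
import re

# Hypothetical HTML fragment, invented for this illustration.
text = "<p>Read the <b>docs</b> at https://example.com/docs today!</p>"

# 1. Search: does a URL appear anywhere in the text?
has_url = re.search(r"https?://\S+", text) is not None


# 2. Validate: the whole string must match the pattern, so use fullmatch.
def is_valid_email(candidate):
    """Return True if candidate looks like name@domain.tld (simplified check)."""
    return re.fullmatch(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", candidate) is not None


# 3. Replace/delete: strip HTML tags, then URLs, then collapse leftover whitespace.
no_tags = re.sub(r"<[^>]+>", "", text)
no_urls = re.sub(r"https?://\S+", "", no_tags)
cleaned = re.sub(r"\s+", " ", no_urls).strip()

print(has_url)
print(is_valid_email("user@example.com"), is_valid_email("user@example"))
print(cleaned)
```

Note the difference between search, which succeeds if the pattern occurs anywhere, and fullmatch, which only succeeds if the entire string fits the pattern; validation usually calls for the latter.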

Pattern for Extracting Dates in Multiple Formats

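date_pattern = (
    # numeric dates: 04/20/2009, 4/3/09
    r'\d{1,2}[/-]\d{1,2}[/-]\d{2,4}|'
    # month name first: Mar-20-2009, March 20, 2009, Mar 20th, 2009
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[.]?[\s-]?\d{1,2}(?:st|nd|rd|th)?[-,\s]\s?\d{2,4}|'
    # day first: 20 Mar 2009, 20 March, 2009
    r'\d{1,2}\s(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*[.,]?\s?\d{2,4}|'
    # month and year only: Feb 2009
    r'(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[A-Za-z]*\s?,?\s?\d{2,4}|'
    # numeric month and year: 6/2008
    r'\d{1,2}/\d{2,4}|'
    # bare four-digit year that is not part of a hyphenated or longer number: 2009
    r'(?<![\d-])\d{4}(?![\d-])'
)

The pattern targets date formats such as the following: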
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009;  Mar. 20, 2009; Mar 20 2009;
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
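To check how such a pattern behaves, the sketch below compiles a simplified variant of the date pattern and runs it against several of the sample formats listed above. The names MONTH, alternatives, and DATE_PATTERN, as well as the sample sentence, are assumptions introduced for this illustration. Because alternatives are tried left to right, the more specific forms are listed before the bare-year fallback.

```python
import re

# Abbreviated month names; a trailing [a-z]* in each alternative allows "March", "Sept", etc.
MONTH = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

alternatives = [
    r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}",                                    # 04/20/2009
    MONTH + r"[a-z]*\.?[\s-]?\d{1,2}(?:st|nd|rd|th)?[-,\s]\s?\d{2,4}",   # Mar 20th, 2009
    r"\d{1,2}\s" + MONTH + r"[a-z]*[.,]?\s?\d{2,4}",                     # 20 March, 2009
    MONTH + r"[a-z]*\s?,?\s?\d{2,4}",                                    # Feb 2009
    r"\d{1,2}/\d{2,4}",                                                  # 6/2008
    r"(?<![\d-])\d{4}(?![\d-])",                                         # 2009
]
DATE_PATTERN = re.compile("|".join(alternatives))

# A subset of the sample formats; each should be matched in full.
samples = [
    "04/20/2009", "4/3/09", "Mar-20-2009", "March 20, 2009", "20 Mar. 2009",
    "Mar 21st, 2009", "Sep 2009", "12/2009", "2010",
]
for sample in samples:
    match = DATE_PATTERN.search(sample)
    print(sample, "->", match.group() if match else None)

# Hypothetical sentence showing extraction from running text.
text = "Admitted 04/20/2009, reviewed Mar 20th, 2009, follow-up 20 March 2009, discharged Feb 2010."
found = DATE_PATTERN.findall(text)
print(found)
```

The pattern only locates date-like substrings; it does not check that a date is valid (for example, it would accept 99/99/2009), so a parsing step is still needed before any date arithmetic.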
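# Split the Duration column into hours and minutes with a regex.
# fp is assumed to be a pandas DataFrame whose 'Duration' column holds strings such as "2h 50m".
fp[['duration_in_hr', 'duration_in_min']] = (
    fp['Duration'].str.extract(r'(?:(\d+)h)?\s?(?:(\d+)m)?').fillna(0).astype(int)
)
fp.head()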
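To see what the duration pattern captures for individual values, here is a small sketch that uses only the standard re module rather than pandas. The helper name split_duration and the sample strings are assumptions for illustration; the pattern itself is the one used in the extract call above, and a missing hour or minute part defaults to 0, mirroring the fillna(0) step.

```python
import re

# Optional "<digits>h", optional whitespace, optional "<digits>m".
DURATION_PATTERN = re.compile(r"(?:(\d+)h)?\s?(?:(\d+)m)?")


def split_duration(value):
    """Return (hours, minutes) parsed from strings such as '2h 50m', '19h', or '45m'."""
    # Every part of the pattern is optional, so match() always succeeds,
    # possibly with an empty match; absent groups fall back to "0".
    match = DURATION_PATTERN.match(value)
    hours, minutes = match.groups(default="0")
    return int(hours), int(minutes)


# Hypothetical sample values.
for value in ["2h 50m", "19h", "45m", ""]:
    print(repr(value), "->", split_duration(value))
```

Because both capture groups are optional, a value that does not follow the expected layout yields (0, 0) rather than an error, so unexpected formats are worth checking separately.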
