Invention sparks curiosity, leading to a continuous cycle of exploration and advancement.
Hi, I’m Tejas Kamble, a Data Engineer, AI Engineer, and Researcher.
I specialize in the dynamic and ever-evolving fields of Artificial Intelligence and Data Science. My expertise lies in harnessing the power of AI, Natural Language Processing (NLP), Data Engineering, and cutting-edge AI/ML technologies to unravel complex problems and unlock new possibilities.
About Me
With a passion for creating intelligent systems, I thrive on developing innovative solutions that bridge the gap between raw data and actionable insights. Whether it’s crafting robust algorithms, engineering data pipelines, or delving into the realms of machine learning, I am dedicated to pushing the boundaries of what AI can achieve.
I’ve actively engaged in developing AI-driven applications, collaborating on research initiatives, and contributing to the advancement of the field. My current endeavors include spearheading two significant projects – one focused on exploring the intersection of AI and healthcare, and another involving the development of a platform that seamlessly integrates AI into everyday life.
Statistics for Machine Learning
The most basic and important part of machine learning and data analysis is understanding the data and analyzing its patterns, or in technical terms, the distribution of the data. To understand machine learning and data science properly, you must know statistics. Below, we discuss the most important statistical techniques.
Types of Statistics:
Descriptive
Inferential
Descriptive Statistics:
1) Measures of Central Tendency (Mean, Median, Mode)
Measures of central tendency are statistical metrics that summarize a dataset by identifying its central point. They provide insights into the “typical” or “average” value in a dataset, making them essential for analyzing and comparing data distributions.
There are three main measures of central tendency: Mean, Median, and Mode.
1. Mean (Arithmetic Average)
The mean is the sum of all values in a dataset divided by the number of values. It is the most commonly used measure of central tendency.
Formula
For a dataset with $n$ values: $\text{Mean}(\bar{x}) = \frac{\sum x_i}{n}$
✅ Easy to calculate and widely used.
✅ Considers all values in the dataset.
❌ Sensitive to outliers (e.g., extreme values can skew the mean).
2. Median (Middle Value of a Sorted Dataset)
The median is the middle value when the dataset is arranged in ascending or descending order. If there are an odd number of values, the median is the middle one. If there are an even number of values, the median is the average of the two middle values.
Steps to Find the Median
Sort the dataset in ascending order.
If $n$ is odd, the median is the middle value: $\text{Median} = x_{\left(\frac{n+1}{2}\right)}$
If $n$ is even, the median is the average of the two middle values: $\text{Median} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$
Example 1 (Odd Number of Values)
Dataset: {3, 7, 9, 11, 15}
Middle value: 9 (third value in sorted order).
Median = 9
Example 2 (Even Number of Values)
Dataset: {2, 5, 8, 12, 14, 18}
Middle two values: 8 and 12
Median: $\frac{8 + 12}{2} = 10$
Median = 10
Pros and Cons
✅ Not affected by outliers, making it a good measure for skewed distributions.
✅ Works well for ordinal data.
❌ Ignores extreme values in the dataset.
3. Mode (Most Frequent Value)
The mode is the value that appears most frequently in a dataset. A dataset can have:
No mode (if all values appear once).
One mode (Unimodal dataset).
Two modes (Bimodal dataset).
More than two modes (Multimodal dataset).
Example 1 (Unimodal Dataset)
Dataset: {2, 4, 4, 6, 8, 10}
The number 4 appears twice, making it the mode.
Mode = 4
Example 2 (Bimodal Dataset)
Dataset: {1, 3, 3, 5, 7, 7, 9}
Two numbers appear most frequently: 3 and 7.
Modes = 3 and 7 (Bimodal)
Example 3 (No Mode)
Dataset: {2, 5, 8, 11, 14}
No repeated values, so no mode exists.
Pros and Cons
✅ Useful for categorical data (e.g., finding the most common brand, product, or category).
✅ Works well for non-numeric data.
❌ May not exist or may not be unique, making interpretation difficult.
Choosing the Best Measure of Central Tendency
| Scenario | Best Measure |
|---|---|
| Symmetric data with no outliers | Mean |
| Skewed data with outliers | Median |
| Categorical or qualitative data | Mode |
| Bimodal/multimodal distribution | Mode |
Example: House Prices
Consider the house prices (in $1000s): {120, 130, 135, 140, 500}
Mean = $\frac{120+130+135+140+500}{5} = 205$ (highly affected by the outlier).
Median = 135 (middle value, better representation).
Mode = No mode (no repetition).
In this case, the median is the best measure because it is unaffected by the extreme value (500).
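As a quick sanity check, here is a minimal sketch using Python's standard-library statistics module on the house-price data above (multimode returns every value here, since nothing repeats):

```python
from statistics import mean, median, multimode

# House prices in $1000s, with one extreme outlier (500)
prices = [120, 130, 135, 140, 500]

print(mean(prices))       # 205 -- pulled upward by the outlier
print(median(prices))     # 135 -- robust to the outlier
print(multimode(prices))  # [120, 130, 135, 140, 500] -- no value repeats, so no single mode
```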
2) Measures of Dispersion (Standard Deviation, Variance)
1. Measures of Dispersion
While measures of central tendency (mean, median, mode) tell us about the center of the data, measures of dispersion describe how spread out the data points are. The most common measures of dispersion include Variance and Standard Deviation.
1.1 Variance (𝜎² or s²)
Variance measures the average squared deviation of each data point from the mean.
Formulas for Variance
Population Variance ($\sigma^2$): $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$, where:
$\sigma^2$ = population variance
$x_i$ = individual data points
$\mu$ = population mean
$N$ = total number of data points in the population
Sample Variance ($s^2$): $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean and $n$ is the sample size.
🔹 Why use (n-1) in the sample variance formula? The denominator $(n-1)$ is known as Bessel’s correction, which corrects for bias when estimating the population variance from a sample.
1.2 Standard Deviation (𝜎 or s)
Standard deviation is simply the square root of variance. It provides a measure of spread in the same units as the data.
Formulas for Standard Deviation
Population Standard Deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample Standard Deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Example: for a sample whose variance works out to 17.2, the standard deviation is $s = \sqrt{17.2} \approx 4.15$.
Thus, the sample standard deviation is 4.15.
2. Population vs. Sample
2.1 Population
A population includes all members of a defined group. When we calculate statistics for an entire population, we use N in the denominator.
Example: The average height of all people in a country.
2.2 Sample
A sample is a subset of a population. Since we usually cannot collect data from the entire population, we estimate statistics using a sample and use n-1 in the denominator.
Example: Measuring the height of 1,000 randomly selected people to estimate the national average.
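A minimal sketch of the N versus n-1 distinction using Python's standard-library statistics module; the heights below are made-up sample values:

```python
from statistics import pvariance, variance, pstdev, stdev

# Hypothetical heights (cm); treat these 5 values as a sample from a population
heights = [160, 165, 170, 175, 180]

print(pvariance(heights))  # 50.0 -- divides by N (population formula)
print(variance(heights))   # 62.5 -- divides by n-1 (Bessel's correction)
print(pstdev(heights))     # ~7.07 (population standard deviation)
print(stdev(heights))      # ~7.91 (sample standard deviation)
```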
3. Types of Variables
Variables are classified into categorical and numerical types.
3.1 Categorical Variables (Qualitative)
Categorical variables represent distinct groups or categories.
Nominal: No order or ranking. (e.g., Colors: {Red, Blue, Green})
Ordinal: Categories have a meaningful order. (e.g., Education level: {High School, Bachelor’s, Master’s})
3.2 Numerical Variables (Quantitative)
Numerical variables represent measurable quantities.
Discrete: Can take only specific values, usually integers. (e.g., Number of children: {0, 1, 2, 3})
Continuous: Can take any value within a range. (e.g., Height: {5.4 ft, 5.5 ft, 6.2 ft})
4. Data Visualization: Histograms and KDE
4.1 Histograms
A histogram is a graphical representation of the frequency distribution of numerical data. It consists of bins, where each bin represents a range of values, and the height of the bar represents the frequency.
🔹 Example of a Histogram:
If we have test scores {50, 55, 60, 60, 65, 70, 75, 80, 85, 90}, a histogram would show how often scores fall within certain ranges (bins like 50-60, 60-70, etc.).
🔹 Advantages of Histograms:
✅ Shows the distribution of data
✅ Helps detect skewness and outliers
4.2 Kernel Density Estimation (KDE)
A Kernel Density Estimate (KDE) is a smoothed version of a histogram. Instead of using bars, KDE uses a smooth curve to estimate the probability density function of a dataset.
🔹 Why Use KDE?
Unlike histograms, KDE does not depend on arbitrary bin boundaries (though it does depend on the kernel bandwidth), often providing a clearer view of the data distribution.
🔹 Example: If we have a dataset of student heights, a KDE plot would give a smooth curve that helps visualize the probability density of different height ranges.
🔹 Difference Between Histogram & KDE
| Feature | Histogram | KDE |
|---|---|---|
| Representation | Binned bars | Smooth curve |
| Sensitivity | Depends on bin width | Depends on kernel bandwidth |
| Use Case | Discrete counts | Probability density estimation |
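To make the comparison concrete, here is a minimal sketch that plots a histogram and a KDE for the test scores from the example above, using matplotlib and scipy.stats.gaussian_kde (the bin edges are an arbitrary choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Test scores from the histogram example above
scores = np.array([50, 55, 60, 60, 65, 70, 75, 80, 85, 90])

# Histogram: bar heights depend on the chosen bin edges
plt.hist(scores, bins=[50, 60, 70, 80, 90, 100], density=True,
         alpha=0.5, label="Histogram")

# KDE: smooth density estimate (depends on kernel bandwidth instead)
kde = gaussian_kde(scores)
xs = np.linspace(40, 100, 200)
plt.plot(xs, kde(xs), label="KDE")

plt.xlabel("Score")
plt.ylabel("Density")
plt.legend()
plt.show()
```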
Understanding Percentiles, Quartiles & the 5-Number Summary
The Foundation of Exploratory Data Analysis (EDA)
When you’re trying to understand a dataset or detect outliers, few tools are more powerful and intuitive than percentiles, quartiles, and the 5-number summary. These help you explore how your data is distributed and identify extreme values with precision.
🔢 1. Percentiles
Percentiles divide the dataset into 100 equal parts. Each percentile tells you the value below which a certain percentage of data falls.
Example:
25th percentile (P25) → 25% of data lies below this value.
90th percentile (P90) → 90% of data lies below this value.
Percentiles help describe the relative standing of a value in the dataset.
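In code, percentiles are a one-liner; a minimal sketch with NumPy on a made-up dataset:

```python
import numpy as np

data = [12, 15, 18, 20, 22, 25, 28, 30, 35, 40]

print(np.percentile(data, 25))  # P25: ~25% of values fall below this
print(np.percentile(data, 90))  # P90: ~90% of values fall below this
```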
🧮 2. Quartiles
Quartiles are specific percentiles that divide your data into four equal parts.
| Quartile | Percentile Equivalent | Meaning |
|---|---|---|
| Q1 (1st Quartile) | 25th Percentile | 25% of data below |
| Q2 (Median) | 50th Percentile | Middle value |
| Q3 (3rd Quartile) | 75th Percentile | 75% of data below |
🧰 3. The 5-Number Summary
The 5-number summary is one of the most important techniques in descriptive statistics. It gives a quick overview of the distribution of your data and is crucial for visualizations like box plots.
✅ It includes:
Minimum – Smallest value in the dataset
Q1 (1st Quartile) – 25% of data lies below this
Median (Q2) – 50% of data lies below this
Q3 (3rd Quartile) – 75% of data lies below this
Maximum – Largest value in the dataset
📐 4. Interquartile Range (IQR)
The Interquartile Range (IQR) measures the spread of the middle 50% of the data. IQR=Q3−Q1
This is a robust measure of variability that is not affected by outliers.
🚨 5. Detecting Outliers using IQR
The IQR can be used to create fences beyond which data points are considered outliers.
🔻 Lower Fence:
Lower Fence=Q1−1.5×IQR
🔺 Upper Fence:
Upper Fence=Q3+1.5×IQR
Any value below the lower fence or above the upper fence is typically classified as an outlier.
📊 6. Example: 5-Number Summary + IQR Calculation
📘 Dataset:
data = [7, 8, 8, 10, 12, 13, 14, 15, 16, 22, 40]
🔍 Step-by-Step:
Minimum = 7
Maximum = 40
Median (Q2) = 13
Q1 (25th percentile) = 9
Q3 (75th percentile) = 16
So the 5-number summary is:
[Minimum = 7, Q1 = 9, Median = 13, Q3 = 16, Maximum = 40]
IQR = Q3 - Q1 = 16 - 9 = 7
Lower Fence = 9 - 1.5 × 7 = -1.5; Upper Fence = 16 + 1.5 × 7 = 26.5
Any value > 26.5 or < -1.5 is an outlier. ✅ So, in this dataset, 40 is an outlier.
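A cross-check in NumPy. Note that NumPy's default linear interpolation gives Q3 = 15.75 rather than the hand-computed 16, since quartile conventions vary slightly, but 40 is flagged as an outlier either way:

```python
import numpy as np

data = np.array([7, 8, 8, 10, 12, 13, 14, 15, 16, 22, 40])

q1, q2, q3 = np.percentile(data, [25, 50, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(q1, q2, q3)                # 9.0 13.0 15.75
print(lower_fence, upper_fence)  # -1.125 25.875
print(data[(data < lower_fence) | (data > upper_fence)])  # [40]
```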
📦 7. Visual Representation: Box Plot
The 5-number summary is used in box plots to visually summarize data.
The box spans from Q1 to Q3
A line marks the median (Q2)
“Whiskers” extend to the smallest/largest non-outlier values
Outliers are shown as dots or stars beyond the whiskers
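A minimal sketch that draws the box plot for the dataset above with matplotlib (the whiskers default to 1.5 × IQR, so 40 appears as an individual point):

```python
import matplotlib.pyplot as plt

data = [7, 8, 8, 10, 12, 13, 14, 15, 16, 22, 40]

# Box spans Q1..Q3, line at the median, whiskers at 1.5*IQR by default
plt.boxplot(data, vert=False)
plt.xlabel("Value")
plt.show()
```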
🌟 Why Is This Important?
| Feature | Reason |
|---|---|
| Exploratory Data Analysis (EDA) | Quickly understand the spread and central tendency |
| Outlier Detection | IQR-based fences help detect anomalies |
| Feature Scaling & Normalization | Useful for feature engineering |
| Robust Statistics | Median and IQR are unaffected by extreme values, unlike mean & standard deviation |
📌 Summary Table
| Term | Description | Formula |
|---|---|---|
| Q1 | 25th percentile | – |
| Q2 (Median) | 50th percentile | – |
| Q3 | 75th percentile | – |
| IQR | Interquartile range | Q3 - Q1 |
| Lower Fence | Threshold for low outliers | Q1 - 1.5 × IQR |
| Upper Fence | Threshold for high outliers | Q3 + 1.5 × IQR |
| Outliers | Values outside fences | x < LF or x > UF |
Correlation and Covariance: A Comprehensive Guide
Covariance and correlation are fundamental statistical concepts used to measure the relationship between variables. While they serve similar purposes, they differ in important ways. Let’s explore both concepts in depth.
Covariance
Covariance measures how two variables change together. It indicates the direction of the linear relationship between variables.
Mathematical Definition
The sample covariance formula is:
Cov(X,Y) = Σ[(x_i – x̄)(y_i – ȳ)] / (n-1)
Where:
x_i and y_i are individual data points
x̄ and ȳ are the means of X and Y
n is the number of data pairs
Interpretation
Positive covariance: Variables tend to move in the same direction
Negative covariance: Variables tend to move in opposite directions
Zero covariance: No linear relationship between variables
Example
Let’s consider stock prices for two companies, A and B, over 5 days:
| Day | Company A | Company B |
|---|---|---|
| 1 | $10 | $5 |
| 2 | $12 | $6 |
| 3 | $11 | $4 |
| 4 | $13 | $7 |
| 5 | $14 | $8 |

Step 1: Compute the means: x̄_A = (10+12+11+13+14)/5 = 12 and ȳ_B = (5+6+4+7+8)/5 = 6
Step 2: Compute each pair of deviations from the means and their products: (-2)(-1) = 2, (0)(0) = 0, (-1)(-2) = 2, (1)(1) = 1, (2)(2) = 4
Step 3: Sum the products and divide by (n-1): Cov(A,B) = (2+0+2+1+4)/(5-1) = 9/4 = 2.25
The positive covariance indicates that when stock A increases, stock B tends to increase as well.
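NumPy reproduces this number directly; np.cov uses the sample (n-1) formula by default, and the off-diagonal entry of the returned matrix is Cov(A, B):

```python
import numpy as np

A = np.array([10, 12, 11, 13, 14])
B = np.array([5, 6, 4, 7, 8])

# Off-diagonal entry of the 2x2 covariance matrix
print(np.cov(A, B)[0, 1])  # 2.25
```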
Types of Covariance
1. Positive Covariance
Indicates that variables tend to move in the same direction.
Example: Height and weight in humans typically have positive covariance because taller people generally weigh more than shorter people.
2. Negative Covariance
Indicates that variables tend to move in opposite directions.
Example: Hours spent studying and number of errors on a test typically have negative covariance because more study time usually results in fewer errors.
3. Zero Covariance
Indicates no linear relationship between variables.
Example: Shoe size and intelligence would likely have zero covariance because there’s no reason one would affect the other.
4. Autocovariance
Measures the covariance of a variable with itself at different time points.
Example: In time series analysis, the price of gold today might have a high autocovariance with its price yesterday.
5. Cross-covariance
Measures the similarity between two different time series at different time lags.
Example: Rainfall amounts and reservoir levels might show cross-covariance with a lag, as it takes time for rainfall to affect reservoir levels.
Correlation
Correlation is a standardized version of covariance that measures both the strength and direction of a linear relationship between variables. It always falls between -1 and 1.
Mathematical Definition
The Pearson correlation coefficient is:
ρ(X,Y) = Cov(X,Y) / (σ_X × σ_Y)
Where:
Cov(X,Y) is the covariance
σ_X is the standard deviation of X
σ_Y is the standard deviation of Y
Interpretation
Correlation of 1: Perfect positive correlation
Correlation of -1: Perfect negative correlation
Correlation of 0: No linear correlation
Correlation between 0 and 1: Positive correlation
Correlation between -1 and 0: Negative correlation
Example
Continuing with our stock price example:
Step 1: Calculate standard deviations, using the sample (n-1) formula to match the covariance above
For Company A: σ_A = sqrt([((-2)² + 0² + (-1)² + 1² + 2²)/(5-1)]) = sqrt(10/4) ≈ 1.58
For Company B: σ_B = sqrt([((-1)² + 0² + (-2)² + 1² + 2²)/(5-1)]) = sqrt(10/4) ≈ 1.58
Step 2: Divide the covariance by the product of the standard deviations
ρ(A,B) = 2.25 / (1.58 × 1.58) = 2.25 / 2.5 = 0.9
(Note: In practice, correlation is always between -1 and 1. The population and sample formulas give the same correlation as long as they are used consistently; mixing a sample covariance with population standard deviations would produce a value above 1, which is impossible.)
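np.corrcoef confirms the hand computation:

```python
import numpy as np

A = np.array([10, 12, 11, 13, 14])
B = np.array([5, 6, 4, 7, 8])

print(np.corrcoef(A, B)[0, 1])  # 0.9
```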
Types of Correlation
1. Pearson Correlation
Measures linear relationships between variables with continuous, normally distributed data.
Example: The relationship between height and weight typically follows a linear pattern suitable for Pearson correlation.
2. Spearman Rank Correlation
Measures monotonic relationships, where variables tend to change together but not necessarily at a constant rate.
Example: The relationship between age and reading ability in children might be monotonic (generally increasing) but not strictly linear.
Pearson Correlation Coefficient and Spearman Rank Correlation Coefficient
Both correlation methods are explained below in detail, including their formulas, interpretations, and when to use each one.
Pearson Correlation Coefficient
The Pearson correlation coefficient (r) measures the linear relationship between two continuous variables. It’s the most commonly used correlation measure in statistics.
Formula
For two variables X and Y with n observations, the Pearson correlation coefficient is calculated as:
r = Σ[(x_i - x̄)(y_i - ȳ)] / sqrt(Σ(x_i - x̄)² × Σ(y_i - ȳ)²)
A value of r close to +1 indicates a very strong positive linear relationship between X and Y.
Assumptions and Limitations
Assumes variables have a linear relationship
Sensitive to outliers
Both variables should be normally distributed for hypothesis testing
Measures only linear relationships
Spearman Rank Correlation Coefficient
The Spearman rank correlation coefficient (ρ or rₛ) measures the monotonic relationship between two variables by using their ranks rather than actual values.
Formula
For two variables X and Y with n observations, the Spearman correlation is calculated as:
ρ = 1 – (6 × Σd²) / (n(n² – 1))
Where:
d is the difference between the ranks of corresponding values of X and Y
n is the number of observations
If there are no tied ranks, this simplifies to:
ρ = Pearson correlation coefficient between the ranks of X and Y
Properties
Range: Always between -1 and +1
Interpretation:
ρ = 1: Perfect monotonic increasing relationship
ρ = -1: Perfect monotonic decreasing relationship
ρ = 0: No monotonic relationship
Similar magnitude interpretation as Pearson (weak, moderate, strong)
Invariant to any monotonic transformation of the variables
Less sensitive to outliers than Pearson
Example Calculation
Consider these data points:
X: 5, 10, 15, 20, 25
Y: 2, 4, 5, 9, 12
Step 1: Rank the values in each variable (1 = lowest, n = highest). Both X and Y are already in increasing order, so each variable gets ranks 1, 2, 3, 4, 5.
Step 2: Compute the rank differences d. Every d = 0, so Σd² = 0.
Step 3: Apply the formula: ρ = 1 - (6 × 0) / (5 × (5² - 1)) = 1
This indicates a perfect monotonic relationship between X and Y.
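scipy.stats confirms this, and comparing against Pearson on the same data shows the difference between the two measures; a minimal sketch:

```python
from scipy.stats import pearsonr, spearmanr

X = [5, 10, 15, 20, 25]
Y = [2, 4, 5, 9, 12]

rho, _ = spearmanr(X, Y)  # ranks agree perfectly
r, _ = pearsonr(X, Y)     # relationship is monotonic but not perfectly linear

print(rho)  # 1.0
print(r)    # ~0.98
```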
Handling Tied Ranks
When ties occur, each tied value is assigned the average of the ranks they would have received if they were distinct. For example, if the 2nd and 3rd positions are tied, both receive rank 2.5.
Assumptions and Limitations
Only assumes variables have a monotonic relationship (not necessarily linear)
Less powerful than Pearson when data is truly linear and normally distributed
Computationally more intensive for large datasets
When to Use Each Coefficient
Use Pearson When:
Data is continuous
The relationship appears linear
Both variables are approximately normally distributed
Outliers are minimal or have been addressed
You need to measure the strength of strictly linear relationships
Use Spearman When:
Data is ordinal or ranked
The relationship appears monotonic but not necessarily linear
Variables are not normally distributed
Outliers are present
You want to capture any monotonic relationship, not just linear ones
You’re analyzing variables where exact values are less important than relative ordering
Practical Examples
Pearson Correlation: Relationship between height and weight in a population (typically linear)
Spearman Correlation: Relationship between customer satisfaction ratings (1-5 scale) and likelihood to repurchase (where the relationship might be monotonic but not strictly linear)
Key Differences Between Covariance and Correlation
Scale: Covariance is affected by the scale of the variables; correlation is standardized between -1 and 1.
Units: Covariance has units (product of the units of the two variables); correlation is unitless.
Comparability: Correlations can be compared across different datasets; covariances generally cannot.
Interpretation: Correlation provides both direction and strength; covariance only reliably indicates direction.
Applications in Data Analysis
Portfolio Management
Positive covariance between assets increases portfolio risk
Negative covariance helps diversify risk
Machine Learning
Correlation analysis helps identify relevant features
Principal Component Analysis uses covariance matrices to reduce dimensionality
Quality Control
Correlation between process variables helps identify root causes of defects
Economic Analysis
Correlation between GDP and unemployment rate helps understand economic cycles
Probability Distribution
The Relationship Between PDF, PMF, and CDF
Probability Mass Function (PMF)
The PMF applies to discrete random variables and gives the probability that a random variable X equals a specific value x.
Mathematical Definition: P(X = x) = PMF(x)
Properties:
Non-negative: PMF(x) ≥ 0 for all x
Sum to 1: Σ PMF(x) = 1 (over all possible values)
Range: 0 ≤ PMF(x) ≤ 1
Example: For a fair six-sided die, the PMF is:
P(X = 1) = P(X = 2) = … = P(X = 6) = 1/6
Probability Density Function (PDF)
The PDF applies to continuous random variables and represents the relative likelihood of the random variable taking on a specific value.
Mathematical Definition: f(x) = dF(x)/dx, where F(x) is the CDF
Properties:
Non-negative: f(x) ≥ 0 for all x
Area equals 1: ∫f(x)dx = 1 (integrated over all possible values)
P(a ≤ X ≤ b) = ∫(from a to b) f(x)dx
Unlike PMF, PDF can exceed 1 at specific points
Example: The PDF of a standard normal distribution is: f(x) = (1/√(2π)) * e^(-x²/2)
Cumulative Distribution Function (CDF)
The CDF applies to both discrete and continuous random variables and gives the probability that X is less than or equal to x.
Mathematical Definition: F(x) = P(X ≤ x)
For discrete variables: F(x) = Σ PMF(t) for all t ≤ x
For continuous variables: F(x) = ∫(from -∞ to x) f(t)dt
Properties:
Non-decreasing: F(x₁) ≤ F(x₂) if x₁ < x₂
Limits: lim(x→-∞) F(x) = 0 and lim(x→∞) F(x) = 1
Range: 0 ≤ F(x) ≤ 1
P(a < X ≤ b) = F(b) – F(a)
Example: For a standard normal distribution, the CDF doesn’t have a simple closed form but is often denoted as Φ(x).
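A minimal sketch with scipy.stats tying the three functions together, using the fair die (a discrete uniform distribution) and the standard normal from the examples above:

```python
from scipy.stats import norm, randint

# PMF and CDF of a fair six-sided die (discrete uniform on 1..6;
# randint's upper bound is exclusive)
die = randint(1, 7)
print(die.pmf(3))   # P(X = 3) = 1/6 ≈ 0.1667
print(die.cdf(3))   # P(X <= 3) = 3/6 = 0.5

# PDF and CDF of the standard normal
print(norm.pdf(0))                 # 1/sqrt(2*pi) ≈ 0.3989
print(norm.cdf(0))                 # Φ(0) = 0.5
print(norm.cdf(1) - norm.cdf(-1))  # P(-1 <= X <= 1) ≈ 0.6827
```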
Types of Probability Distributions
Probability distributions come in two main categories: discrete distributions, which describe countable outcomes (e.g., the number of voters supporting a candidate in election polling), and continuous distributions, which describe measurements on a continuum (e.g., heights).
Poisson Distribution
The Poisson distribution models the number of events occurring in a fixed interval of time or space, when these events happen at a constant average rate.
Properties
PMF: P(X = k) = (e^(-λ) × λ^k) / k!
Mean: E[X] = λ
Variance: Var(X) = λ
Parameter: λ = average number of events per interval
Example
If emails arrive at an average rate of 5 per hour (λ = 5), the probability of receiving exactly 3 emails in a given hour is P(X = 3) = (e⁻⁵ × 5³) / 3! ≈ 0.14.
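The same number falls out of scipy.stats.poisson; a minimal sketch:

```python
from scipy.stats import poisson

lam = 5  # average emails per hour

print(poisson.pmf(3, lam))  # P(X = 3) ≈ 0.1404
print(poisson.cdf(3, lam))  # P(X <= 3) ≈ 0.2650
print(poisson.mean(lam), poisson.var(lam))  # both equal λ = 5
```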
Text Mining & Regex for Feature Engineering
Text Mining (also called Text Analytics) is the process of deriving useful insights, patterns, and structure from unstructured textual data.
Since an estimated 80% or more of real-world data is unstructured (emails, chats, reviews, social media, documents), text mining plays a critical role in data science, NLP, and business intelligence.
📌 Applications of Text Mining:
Sentiment Analysis
Spam Detection
Topic Modeling
Chatbots & Virtual Assistants
Customer Feedback Analysis
Information Retrieval (e.g., search engines)
🧰 Feature Engineering in Text Mining
Feature Engineering is the art of converting raw data into meaningful features that enhance machine learning models.
In text mining, this typically involves:
Cleaning the text
Tokenization
Lowercasing
Removing stop words/punctuation
Extracting specific patterns
Vectorization (TF-IDF, Word2Vec, etc.)
This is where Regex (Regular Expressions) becomes incredibly powerful.
🔎 What is Regex?
Regex stands for Regular Expressions – a sequence of characters used to search, match, or extract patterns in text data.
Think of regex as a search engine for patterns, capable of performing:
Pattern matching
String replacement
Validation
Extraction
⚙️ Why is Regex Important in Feature Engineering?
✅ 1. Automated Text Cleaning
You can use regex to:
Remove HTML tags, special characters, or emojis
Strip URLs, email addresses, or phone numbers
Replace patterns (like converting dates to standard format)
```python
import re

text = "Email me at [email protected] or visit https://site.com"

# Strip URLs (http(s):// or www.) from the text
clean_text = re.sub(r"(https?://\S+|www\.\S+)", "", text)
```
✅ 2. Pattern Extraction as Features
Regex can extract domain-specific features such as:
Hashtags from tweets (#\w+)
Mentions in social media (@\w+)
Capitalized words or abbreviations ([A-Z]{2,})
Dates or timestamps (\d{2}/\d{2}/\d{4})
These extracted features can be used in model training!
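A minimal sketch of these extractions with re.findall; the tweet text is made up for illustration:

```python
import re

tweet = "Loving #Python and #NLP! Thanks @tejas for the tips on 04/04/2025"

print(re.findall(r"#\w+", tweet))               # ['#Python', '#NLP']
print(re.findall(r"@\w+", tweet))               # ['@tejas']
print(re.findall(r"\b[A-Z]{2,}\b", tweet))      # ['NLP']
print(re.findall(r"\d{2}/\d{2}/\d{4}", tweet))  # ['04/04/2025']
```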
✅ 3. Text Normalization & Transformation
Regex helps normalize inconsistent formats like:
Converting all currency symbols to a standard one
Unifying phone numbers
Replacing multiple whitespaces
text = "Price: $25.00 Date: 04/04/2025" normalized = re.sub(r"\s+", " ", text) # removes extra spaces
✅ 4. Efficient Feature Creation at Scale
Rather than writing complex logic or loops, regex allows compact one-liners that scale across datasets with thousands/millions of records.
✅ 5. Regex in Tokenization & NLP Tasks
In advanced NLP preprocessing, regex is used to:
Split compound words
Identify custom tokens
Extract syntactic patterns (like noun-phrases)
Prepare data for Named Entity Recognition (NER) or POS tagging
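As a small illustration of regex-driven tokenization (real pipelines such as spaCy or NLTK use richer rules; this pattern is just a sketch):

```python
import re

sentence = "Don't panic: regex-based tokenizers are simple, fast, and (mostly) predictable."

# Words (keeping internal apostrophes) or single punctuation marks
tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)
print(tokens)
# ["Don't", 'panic', ':', 'regex', '-', 'based', 'tokenizers', ...]
```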
📈 Impact on Model Performance
Proper feature engineering using regex improves accuracy, precision, and recall by:
Reducing noise
Enhancing signal in text
Creating domain-relevant features
Avoiding overfitting by focusing on important patterns
✨ Example Use-Case: Spam Detection
In a spam detection model, regex can be used to extract:
Number of links → https?://\S+
Count of uppercase words (shouting) → [A-Z]{2,}
Suspicious words (free, win, prize, click)
These become input features that help the model learn to distinguish spam from non-spam.
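A sketch of what such a feature extractor might look like; the function name, word list, and patterns are illustrative assumptions, not a reference implementation:

```python
import re

def spam_features(text: str) -> dict:
    """Hypothetical regex-based features for a spam classifier."""
    suspicious = r"\b(?:free|win|winner|prize|click)\b"
    return {
        "num_links": len(re.findall(r"https?://\S+", text)),
        "num_shouting_words": len(re.findall(r"\b[A-Z]{2,}\b", text)),
        "num_suspicious_words": len(re.findall(suspicious, text, flags=re.IGNORECASE)),
    }

print(spam_features("WIN a FREE prize!!! Click http://spam.example now"))
# {'num_links': 1, 'num_shouting_words': 2, 'num_suspicious_words': 4}
```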
🧪 Summary Table
| Task | Regex Pattern | Use |
|---|---|---|
| Remove URLs | https?://\S+ | Clean text |
| Extract hashtags | #\w+ | Feature creation |
| Emails | \S+@\S+ | PII removal or extraction |
| Dates | \d{2}/\d{2}/\d{4} | Standardization |
| Uppercase words | [A-Z]{2,} | Spam detection |
| Emojis | [^\w\s,] | Clean noisy text |
Text Mining
Text mining, also known as text data mining, is the process of transforming unstructured text into a structured format to identify meaningful patterns and new insights. Large textual datasets can be analyzed using text mining to uncover hidden links, patterns, and important topics.
Regex in Text Mining
In text mining, regular expressions (RegEx) are mostly used for substring matching and simple patterns.
Pattern-based substring matching is one of the most common tasks. In essence, it involves determining whether anything in the given text matches a predetermined pattern.
Examples include dates, phone numbers, URLs, email addresses, hashtags, emojis, and more.
Typical Uses for Regex
Search for particular character patterns in text.
Validate text against preset patterns (e.g., validate email addresses or passwords).
Extract, edit, replace, or delete substrings that match a pattern (for example, removing all HTML tags, URLs, or Unicode characters).
Patterns for Extracting Dates in All Types of Formats
Example date formats a robust extractor must handle:
04/20/2009; 04/20/09; 4/20/09; 4/3/09
Mar-20-2009; Mar 20, 2009; March 20, 2009; Mar. 20, 2009; Mar 20 2009
20 Mar 2009; 20 March 2009; 20 Mar. 2009; 20 March, 2009
Mar 20th, 2009; Mar 21st, 2009; Mar 22nd, 2009
Feb 2009; Sep 2009; Oct 2010
6/2008; 12/2009
2009; 2010
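One way to attack this is with a small battery of patterns, one per date family, rather than a single monolithic regex; the patterns below are an illustrative sketch and do not cover every edge case:

```python
import re

month = r"(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*"

patterns = [
    r"\d{1,2}/\d{1,2}/\d{2,4}",                           # 04/20/2009, 4/3/09
    rf"{month}[-. ]\d{{1,2}}(?:st|nd|rd|th)?,? \d{{4}}",  # Mar 20, 2009; Mar 20th, 2009
    rf"\d{{1,2}} {month}\.?,? \d{{4}}",                   # 20 Mar 2009; 20 March, 2009
    rf"{month} \d{{4}}",                                  # Feb 2009
    r"\d{1,2}/\d{4}",                                     # 6/2008
    r"\b(?:19|20)\d{2}\b",                                # 2009
]

text = "Seen on 04/20/2009, again Mar 20th, 2009, and once more in Feb 2009."
for p in patterns:
    print(re.findall(p, text))  # patterns overlap; real code would prioritize them
```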
Blockchain Technology, AI & NLP for Sentiment Analysis on News Data.
A Decentralized Autonomous Organization to Improve Coordination Between Nations Using Blockchain Technology, Artificial Intelligence and Natural Language Processing for Sentiment analysis on News Data.
This paper proposes establishing a decentralized organization with different countries as members, where each country is treated as a node of the blockchain. All countries in the organization are treated equally; there is no superpower among them. The organization gathers large amounts of data from member countries across sectors such as health, education, economy, technology, culture, and agriculture, which together represent a country's overall development. This data is then analyzed for its positive and negative impacts on each of these sectors, giving a concise picture of an individual country's situation in different areas; on that basis, the member countries decide on rewards or penalties for the respective country. Blockchains have the potential to enhance such systems by removing middlemen. Artificial Intelligence plays a major role in this organization: since massive amounts of data are involved, AI is needed to improve the integrity of the results used by smart contracts for decision-making, and to automate and optimize the smart contracts themselves. AI promises to reduce the need for oversight and increase the objectivity of the system. The organization offers a framework for participants to collaborate on building a dataset and hosting a model that is continuously updated using smart contracts. As data grows rapidly, AI will manage it efficiently with lower energy consumption.