How to Make a Retrieval Chatbot Using Machine Learning?



Abstract:

A chatbot is artificial intelligence (AI) software that can
simulate a conversation (or a chat) with a user in natural language
through messaging applications, websites, mobile apps, or the telephone.

A Machine Learning Model Based on Cosine Similarity

The bot relies on the similarity between the input question and all the questions
in the dataset. In order to compute this similarity, we need to choose
a similarity measure that rates the similarity of two sentences.
There are many similarity measures for text, but we will use
cosine similarity here, since it is one of the most common measures in NLP.
Cosine similarity
Definition: how does cosine similarity work?
The cosine of the angle between two non-zero vectors can be derived using the Euclidean dot product formula.
Given two vectors of attributes, A and B, the cosine similarity cos(θ) is represented using
a dot product and magnitudes as

$$\cos(\theta) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

where $A_i$ and $B_i$ are the components of vectors A and B respectively.
The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 indicating orthogonality or decorrelation, while in-between values indicate intermediate similarity or dissimilarity.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. Cosine similarity can be seen as a method of normalizing document length during comparison. 

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (using tf–idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°. 
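As a quick illustration, here is a minimal sketch of the measure using scikit-learn's cosine_similarity; the two count vectors are made up for the example:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two toy term-count vectors over the same four-word vocabulary
a = np.array([[1, 2, 0, 1]])
b = np.array([[1, 1, 1, 0]])

# Returns a 1x1 matrix; for non-negative count vectors the value lies in [0, 1]
print(cosine_similarity(a, b))  # ~0.71 here: the sentences share some terms
```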



Vectorizing questions:

To go from a text question to a vector that represents it, so that we can compute the similarity, we need to transform the text. To transform a text document into a vector we use a feature extraction technique; we will use TF-IDF, because it is the most common one in NLP.
TF-IDF

We will compute the Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document. This will give us a matrix where each row represents a document and each column represents a word in the vocabulary (all the words that appear in at least one document).

TF-IDF is a statistical method for evaluating the significance of a word in a given document.

TF: term frequency (tf) refers to how many times a given term appears in a document.
IDF: inverse document frequency (idf) measures the weight of the word across the corpus, i.e. whether the word is common or rare in the entire collection of documents.
The intuition behind TF-IDF is that terms that appear in many documents carry less information than terms that appear in only a few.
Fortunately, scikit-learn provides a built-in TfidfVectorizer class that produces the TF-IDF matrix quite easily.
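For instance, here is a minimal sketch of the vectorizer on two invented sentences:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "this is a sample",
    "this is another another example example example",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

print(vectorizer.get_feature_names_out())  # the vocabulary, one entry per column
print(tfidf_matrix.shape)                  # (2, vocabulary size)
```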
How does TF-IDF work?
TF-IDF is computed by first computing the following values for each term:
  1. Term frequency: the frequency of the term in the document
  2. Document frequency: the fraction of the documents that contain the term
  3. Inverse document frequency: the logarithmically scaled inverse of the document frequency
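Putting these pieces together, the textbook form of the score (scikit-learn's TfidfVectorizer uses a smoothed variant of the idf by default) is:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \times \mathrm{idf}(t), \qquad \mathrm{idf}(t) = \log\frac{N}{\mathrm{df}(t)}$$

where N is the total number of documents in the corpus and df(t) is the number of documents that contain the term t.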
The term frequency is used because we are concerned with finding documents that have similar
terms: if two documents share the same terms, they are probably very similar.
The inverse document frequency is used to measure how much information
each term carries. A term like "the" appears in almost every document,
so it has a high document frequency; a term with a low document frequency
is therefore favored over a term with a high document frequency,
for the sake of specificity (hence the inverse relation).
Think of this example: will the term "Acetaldehyde" carry information that helps identify
a document as chemistry-related, or will
the term "the", which exists in every document?
Example of tf–idf:
Suppose that we have term count tables of a corpus consisting of only two documents,
as listed below.
Document 1
Term      Term Count
this      1
is        1
a         2
sample    1

Document 2
Term      Term Count
this      1
is        1
another   2
example   3
The calculation of tf–idf for the term "this" is performed as follows.
In its raw frequency form, tf is just the frequency of "this" in each document.
In each document, the word "this" appears once; but since document 2 has more words,
its relative frequency is smaller.
 
An idf is constant per corpus, and accounts for the ratio of documents that include the word "this".
In this case, we have a corpus of two documents, and all of them include the word "this".
So tf–idf is zero for the word "this", which implies that
the word is not very informative, as it appears in all documents.

The word "example" is more interesting - it occurs three times,
but only in the second document. 
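Here is a minimal sketch that reproduces these numbers by hand, using relative frequency for tf and a base-10 logarithm for idf (library implementations such as scikit-learn use slightly different conventions):

```python
import math

# Term counts from the two documents above
doc1 = {"this": 1, "is": 1, "a": 2, "sample": 1}         # 5 words in total
doc2 = {"this": 1, "is": 1, "another": 2, "example": 3}  # 7 words in total
corpus = [doc1, doc2]

def tf(term, doc):
    # Relative frequency of the term within one document
    return doc.get(term, 0) / sum(doc.values())

def idf(term, corpus):
    # Log-scaled inverse fraction of the documents containing the term
    containing = sum(1 for doc in corpus if term in doc)
    return math.log10(len(corpus) / containing)

print(tf("this", doc1), tf("this", doc2))            # 0.2 vs ~0.143
print(tf("this", doc1) * idf("this", corpus))        # 0.0: "this" is uninformative
print(tf("example", doc2) * idf("example", corpus))  # ~0.129: "example" is distinctive
```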
Coding Section:
All of the code is available on my GitHub, and the data as well if you need it. This chatbot is part of my graduation project and is built on the 3rd preparatory story "Black Beauty", so you can ask a question you don't know the answer to and retrieve the right answer for it.
Importing all the required libraries:
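A minimal sketch of the imports this kind of script needs; the exact list depends on your preprocessing, and here I assume pandas for the data and scikit-learn for TF-IDF and cosine similarity:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
```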

Load the dataset, show the head of the data, and describe it:
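A sketch of this step, assuming the question-answer pairs live in a CSV file; the file name and column layout are hypothetical, so adjust them to your dataset:

```python
import pandas as pd

# Hypothetical file name; the dataset holds question/answer pairs from "Black Beauty"
data = pd.read_csv("black_beauty_qa.csv")

print(data.head())      # first five rows of the data
print(data.describe())  # summary statistics of the dataset
```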

Generate chatbot response:

To generate a response from our chatbot for an input question, the concept of document similarity will be used. As already discussed, the TfidfVectorizer is used to convert a collection of raw documents to a matrix of TF-IDF features, and to find the similarity between the question entered by the user and the questions in the dataset we will use cosine similarity.
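Here is a minimal sketch of that idea. The file name, the question/answer column names, and the 0.3 fallback threshold are assumptions for illustration, not fixed parts of the method:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

data = pd.read_csv("black_beauty_qa.csv")  # hypothetical file with question/answer columns

# Fit TF-IDF once on all the stored questions
vectorizer = TfidfVectorizer(stop_words="english")
question_vectors = vectorizer.fit_transform(data["question"])

def generate_response(user_question):
    # Vectorize the user's question with the same fitted vocabulary
    query_vector = vectorizer.transform([user_question])
    # Cosine similarity between the query and every stored question
    similarities = cosine_similarity(query_vector, question_vectors).flatten()
    best = similarities.argmax()
    if similarities[best] < 0.3:  # assumed threshold for "no good match"
        return "Sorry, I am not sure I understand the question."
    return data["answer"].iloc[best]

print(generate_response("Who is Black Beauty?"))
```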
So, this is how to make a machine learning model based on cosine similarity. Leave your feedback in the comments or send it to my mail. Thanks, I hope this post is helpful. Wish you the best. ;)
