How to Make a Retrieval Chatbot Using Deep Learning and the Bag-of-Words Technique



A deep learning model based on bag-of-words


First, what is a bag-of-words?
A bag-of-words model, or BoW for short, is a way of extracting features from text for use in modeling, such as with machine learning algorithms.




The approach is very simple and flexible, and can be used in a myriad of ways for extracting features from documents.
A bag-of-words is a representation of text that describes the occurrence of words within a document. It involves two things:
  1. A vocabulary of known words. 
  2. A measure of the presence of known words. 
It is called a “bag” of words, because any information about the order or structure of words in the document is discarded. The model is only concerned with whether known words occur in the document, not where in the document.
The intuition is that documents are similar if they have similar content. Further, that from the content alone we can learn something about the meaning of the document.
The bag-of-words can be as simple or complex as you like. The complexity comes both in deciding how to design the vocabulary of known words (or tokens) and how to score the presence of known words.

Example of the Bag-of-Words Model



Let’s make the bag-of-words model concrete with a worked example.
Step 1: Collect Data
Below is a snippet of the first few lines of text from the book “A Tale of Two Cities” by Charles Dickens, taken from Project Gutenberg.
It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness,
For this small example, let’s treat each line as a separate “document” and the 4 lines as our entire corpus of documents.


Step 2: Design the Vocabulary
Now we can make a list of all of the words in our model vocabulary.
The unique words here (ignoring case and punctuation) are:
  • “it” 
  • “was” 
  • “the” 
  • “best” 
  • “of” 
  • “times” 
  • “worst” 
  • “age” 
  • “wisdom” 
  • “foolishness” 
That is a vocabulary of 10 words from a corpus containing 24 words.
Step 3: Create Document Vectors
The next step is to score the words in each document.
The objective is to turn each document of free text into a vector that we can use as input or output for a machine learning model.
Because we know the vocabulary has 10 words, we can use a fixed-length document representation of 10, with one position in the vector to score each word.
The simplest scoring method is to mark the presence of words as a boolean value, 0 for absent, 1 for present.
Using the arbitrary ordering of words listed above in our vocabulary, we can step through the first document (“It was the best of times”) and convert it into a binary vector.
The scoring of the document would look as follows:
  • “it” = 1 
  • “was” = 1 
  • “the” = 1 
  • “best” = 1 
  • “of” = 1 
  • “times” = 1 
  • “worst” = 0 
  • “age” = 0 
  • “wisdom” = 0 
  • “foolishness” = 0 
As a binary vector, this would look as follows:
[1, 1, 1, 1, 1, 1, 0, 0, 0, 0]

The other three documents would look as follows:
"it was the worst of times"
= [1, 1, 1, 0, 1, 1, 1, 0, 0, 0] "it was the age
of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness"
= [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]


"it was the worst of times" = [1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
"it was the age of wisdom" = [1, 1, 1, 0, 1, 0, 0, 1, 1, 0]
"it was the age of foolishness" = [1, 1, 1, 0, 1, 0, 0, 1, 0, 1]
All ordering of the words is nominally discarded, and we have a consistent way of extracting features from any document in our corpus, ready for use in modeling.
New documents that overlap with the vocabulary of known words, but may contain words outside of the vocabulary, can still be encoded: only the occurrence of known words is scored, and unknown words are ignored.
You can see how this might naturally scale to large vocabularies and larger documents.
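
To make the encoding concrete in code, here is a minimal sketch in plain Python (no libraries assumed) that reproduces the vectors above:

# A minimal sketch of the binary bag-of-words encoding from the worked example.
docs = [
    "It was the best of times",
    "it was the worst of times",
    "it was the age of wisdom",
    "it was the age of foolishness",
]

# Build the vocabulary: unique words, ignoring case and punctuation.
vocab = []
for doc in docs:
    for word in doc.lower().replace(",", "").split():
        if word not in vocab:
            vocab.append(word)

# Encode each document as a binary vector over the vocabulary.
def encode(doc):
    words = doc.lower().replace(",", "").split()
    return [1 if w in words else 0 for w in vocab]

for doc in docs:
    print(doc, "=", encode(doc))
# "It was the best of times" = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]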

Managing Vocabulary

As the vocabulary size increases, so does the vector representation of documents.
In the previous example, the length of the document vector is equal to
the number of known words.
You can imagine that for a very large corpus, such as thousands of books,
that the length of the vector might be thousands or millions of positions.
Further, each document may contain very few of the known words in the vocabulary.
This results in a vector with lots of zero scores, called a sparse vector or sparse representation.
Sparse vectors require more memory and computational resources
when modeling and the vast number of positions or dimensions
can make the modeling process very challenging for traditional algorithms.
As such, there is pressure to decrease the size of the vocabulary when using a bag-of-words model.
There are simple text cleaning techniques that can be used as a first step (sketched in code after this list), such as:
  • Ignoring case. 
  • Ignoring punctuation. 
  • Ignoring frequent words that don’t contain much information, called stop words, like “a,” “of,” etc. 
  • Fixing misspelled words. 
  • Reducing words to their stem (e.g. “play” from “playing”) using stemming algorithms. 
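
Here is a sketch of those cleaning steps, assuming NLTK for the stop-word list and stemmer (any similar library would do; run nltk.download("stopwords") once first):

# Cleaning sketch: lowercase, strip punctuation, drop stop words, stem.
import string
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def clean(text):
    text = text.lower()                                               # ignore case
    text = text.translate(str.maketrans("", "", string.punctuation))  # ignore punctuation
    words = [w for w in text.split() if w not in stop_words]          # drop stop words
    return [stemmer.stem(w) for w in words]                           # reduce to stems

print(clean("It was the best of times, it was the worst of times."))
# ['best', 'time', 'worst', 'time']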
A more sophisticated approach is to create a vocabulary of grouped words.
This both changes the scope of the vocabulary and allows the bag-of-words to capture
a little bit more meaning from the document.
In this approach, each word or token is called a “gram”.
Creating a vocabulary of two-word pairs is, in turn, called a bi-gram model. Again,
only the bi-grams that appear in the corpus are modeled, not all possible bi-grams.
For example, the bi-grams in the first line of text in the previous section:
“It was the best of times” are as follows:
  • “it was” 
  • “was the” 
  • “the best” 
  • “best of” 
  • “of times” 
A vocabulary that tracks triplets of words is called a tri-gram model, and the general approach is called the n-gram model, where n refers to the number of grouped words.
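
Extracting the n-grams themselves is a one-liner; here is a minimal sketch (plain Python) that reproduces the bi-gram list above:

# Slide a window of n words across the text and join each window into a gram.
def ngrams(text, n):
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

print(ngrams("It was the best of times", 2))
# ['it was', 'was the', 'the best', 'best of', 'of times']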

Limitations of Bag-of-Words

The bag-of-words model is very simple to understand and implement and offers
a lot of flexibility for customization on your specific text data.
It has been used with great success on prediction problems like
language modeling and document classification.
Nevertheless, it suffers from some shortcomings, such as:
  • Vocabulary: The vocabulary requires careful design, most specifically in order to manage the size, which impacts the sparsity of the document representations. 
  • Sparsity: Sparse representations are harder to model, both for computational reasons (space and time complexity) and for information reasons, where the challenge is for the models to harness so little information in such a large representational space. 
  • Meaning: Discarding word order ignores the context, and in turn the meaning, of words in the document (semantics). Context and meaning can offer a lot to the model; if modeled, they could tell the difference between the same words differently arranged (“this is interesting” vs “is this interesting”), synonyms (“old bike” vs “used bike”), and much more. 


Coding Section:
The full code is available on my GitHub, along with the data if you need it. This chatbot is part of my graduation project and is built on the 3rd-preparatory story “Black Beauty”, so you can ask a question you don’t know and retrieve the right answer to it.



Importing the required libraries first:
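
The original screenshots are on my GitHub; as a sketch, the imports look roughly like this (NLTK for tokenizing/stemming and TensorFlow/Keras for the network are assumptions here; the original stack may differ, e.g. TFLearn):

# Sketch of the imports (NLTK + TensorFlow/Keras assumed).
import json
import string

import numpy as np
import nltk
from nltk.stem.snowball import SnowballStemmer
import tensorflow as tf

nltk.download("punkt")  # tokenizer data (newer NLTK may also need "punkt_tab")
stemmer = SnowballStemmer("english")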



Next, preprocess the data: tokenize the questions into tokens and use an English stemmer to strip affixes from the words.
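
A sketch of that step follows. The data format here is an assumption: a JSON file of question/answer pairs like [{"question": "...", "answer": "..."}, ...].

# Tokenize and stem every question, collecting the vocabulary as we go.
with open("data.json") as f:
    qa_pairs = json.load(f)

vocab = []      # every stemmed token seen across all questions
docs = []       # per-question lists of stemmed tokens
answers = []    # the answer each question maps to

for pair in qa_pairs:
    tokens = nltk.word_tokenize(pair["question"])
    stems = [stemmer.stem(t.lower()) for t in tokens if t not in string.punctuation]
    docs.append(stems)
    answers.append(pair["answer"])
    for s in stems:
        if s not in vocab:
            vocab.append(s)

vocab = sorted(vocab)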

Then build the bag-of-words vectors from the preprocessed data.
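
Continuing the sketch, each question becomes a binary bag-of-words vector over the vocabulary, and each answer becomes a one-hot target vector:

# Binary scoring: 1 if the vocabulary word appears in the question, else 0.
def bag_of_words(stems, vocab):
    return np.array([1.0 if word in stems else 0.0 for word in vocab], dtype=np.float32)

unique_answers = sorted(set(answers))

training = np.array([bag_of_words(d, vocab) for d in docs])
output = np.zeros((len(docs), len(unique_answers)), dtype=np.float32)
for i, ans in enumerate(answers):
    output[i, unique_answers.index(ans)] = 1.0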






Then the data is ready to fit into a neural network. Build the model, which consists of an input layer whose length matches the training data numpy array, two hidden layers of 8 neurons each, and an output layer whose length matches the output numpy array.
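
Here is a Keras sketch of that architecture (the original screenshots may use a different framework, but the shape is the same: input → 8 → 8 → softmax output):

# Input the size of the bag-of-words vector, two hidden layers of 8 neurons,
# softmax output the size of the answer set.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(training.shape[1],)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(output.shape[1], activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(training, output, epochs=1000, batch_size=8, verbose=0)  # small data, many epochs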




Result of our model; because the data is small, it gives high accuracy.

 



Make a bag-of-words vector for each question the user asks; it is fed to the model, which classifies it according to the stored data.
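
Sketched in code, the user's raw question goes through the same tokenize/stem pipeline as the training questions, so it lands in the same vector space:

# Encode a raw question with the training-time preprocessing.
def question_to_bag(question, vocab):
    tokens = nltk.word_tokenize(question)
    stems = [stemmer.stem(t.lower()) for t in tokens if t not in string.punctuation]
    return bag_of_words(stems, vocab)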

 



Then start chatting with Black Beauty.
This function calls the bag-of-words function, which takes the question and translates it into a bag-of-words vector, then feeds it to the model to classify, against the stored data, which answer the question maps to. We set a threshold of 80%: if the confidence is lower than the threshold, the model suggests the nearest stored question to the one that was asked.
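
A sketch of that chat loop, under the same assumptions as above (the exact fallback message and prompt text are mine, not from the original screenshots):

# Answer when the model is at least 80% confident; otherwise suggest the
# nearest stored question (the question behind the top-scoring class).
def chat():
    print("Start talking with Black Beauty (type 'quit' to stop).")
    while True:
        question = input("You: ")
        if question.lower() == "quit":
            break
        probs = model.predict(question_to_bag(question, vocab)[np.newaxis, :], verbose=0)[0]
        best = int(np.argmax(probs))
        if probs[best] >= 0.8:
            print("Bot:", unique_answers[best])
        else:
            nearest = next(p["question"] for p in qa_pairs
                           if p["answer"] == unique_answers[best])
            print("Bot: I'm not sure. Did you mean:", nearest)

chat()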


 
Result from chatting with Black Beauty

 
So, this is how to make a deep learning model based on bag-of-words. Leave your feedback in the comments or send it by mail. Thanks, I hope this post is helpful. Wish you the best! ;)

