I have completed the last lesson in the Udacity PyTorch course, which was very difficult indeed. PyTorch has a steep learning curve and takes much longer to learn than another machine learning library, sklearn. The main difference between the two is that PyTorch is designed for deep learning and sklearn is not. Much of PyTorch's core is written in C++ with a Python interface on top, and PyTorch models can even be exported and loaded into C++ programs, but that is certainly something I am not ready for at this time.
The last lesson in the free PyTorch course that I took was about sentiment analysis, which is something I have a bit of familiarity with, having studied it in a free course from another training provider. While the sentiment analysis I did with sklearn dealt with tabulated data, the sentiment analysis in this PyTorch lesson works from a plain text file.
In the exercise in this post, a model is trained, validated and tested on a text document of movie reviews, and is then used to make predictions on new reviews. The movie reviews can be found here:- https://github.com/lukysummer/Movie-Review-Sentiment-Analysis-LSTM-Pytorch/tree/master/data
I was unable to link the web address of the movie reviews to the program that I created in Google Colab, so I used Control-A to select the entire document and Control-C to copy it, then pasted it into MS Word and saved it as a .txt document. If MS Word is not available then any text editor will do. Once the movie reviews and the corresponding labels were saved as .txt files, I copied them into my Google Drive so they could be accessed by the program.
After I created the program, I imported the library I would need, numpy, because it would be used to read the movie review and labels into the program:-
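A minimal sketch of the loading step is shown here. The helper name `read_text_file` and the file contents are hypothetical, and reading with Python's built-in file handling is an assumption about the exact calls used; numpy is imported up front for the array work in later steps. In Colab the paths would point at the .txt files in the mounted Google Drive instead of a temporary folder.

```python
import tempfile
from pathlib import Path

import numpy as np  # not used in this snippet; needed for the array steps later


def read_text_file(path):
    """Return the full contents of a UTF-8 text file as one string."""
    return Path(path).read_text(encoding="utf-8")


# Demonstration with a throwaway file so the snippet runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "reviews.txt"
    sample_path.write_text("Good film\nBad film\n", encoding="utf-8")
    reviews = read_text_file(sample_path)

print(reviews)
```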
The data is then pre-processed by converting the text to lower case, splitting it into individual reviews on new lines, and splitting those on spaces to create a list of all the words in the text file:-
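The pre-processing could look something like the sketch below. The function name `preprocess` is hypothetical, and stripping punctuation is an assumption, since only lower-casing and splitting are described above.

```python
from string import punctuation


def preprocess(reviews_text):
    """Lower-case the text, drop punctuation, and return (reviews, words)."""
    text = reviews_text.lower()
    # Assumption: punctuation is removed so "film!" and "film" count as one word.
    text = "".join(ch for ch in text if ch not in punctuation)
    reviews = text.split("\n")          # one review per line
    words = " ".join(reviews).split()   # flat list of every word
    return reviews, words


reviews, words = preprocess("Great film!\nTerrible plot, bad acting.")
print(reviews)
print(words)
```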
The words in the text file are then encoded, with each unique word being assigned a number:-
The number of unique words is printed out:-
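The encoding and the vocabulary count from the two steps above can be sketched as follows. The names `build_vocab` and `encode_reviews` are hypothetical, and numbering the words from 1 in order of frequency (leaving 0 free for padding) is an assumption rather than something stated in the post.

```python
from collections import Counter


def build_vocab(words):
    """Map each unique word to an integer, most frequent first, starting at 1."""
    counts = Counter(words)
    ordered = sorted(counts, key=counts.get, reverse=True)
    # Index 0 is left unused so it can act as the padding value later.
    return {word: i for i, word in enumerate(ordered, start=1)}


def encode_reviews(reviews, vocab_to_int):
    """Replace every word in every review with its integer code."""
    return [[vocab_to_int[w] for w in review.split()] for review in reviews]


reviews = ["good film", "bad film"]
words = " ".join(reviews).split()
vocab_to_int = build_vocab(words)
encoded = encode_reviews(reviews, vocab_to_int)

print("Unique words:", len(vocab_to_int))
print(encoded)
```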
The labels are then encoded:-
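A small sketch of the label encoding, assuming the labels file holds one word per line, either "positive" or "negative", in the same order as the reviews. The function name `encode_labels` is hypothetical.

```python
import numpy as np


def encode_labels(labels_text):
    """Turn newline-separated 'positive'/'negative' labels into a 0/1 array."""
    labels = labels_text.strip().split("\n")
    return np.array([1 if label.strip() == "positive" else 0 for label in labels])


encoded_labels = encode_labels("positive\nnegative\npositive\n")
print(encoded_labels)
```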
Extremely long and short reviews are then removed from the list of reviews:-
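One way to filter the reviews is sketched below. The `min_len` and `max_len` thresholds are illustrative assumptions, not the values used in the notebook, and the function name `drop_outliers` is hypothetical. The important detail is that each review's label is dropped with it, so the two lists stay aligned.

```python
def drop_outliers(encoded_reviews, labels, min_len=1, max_len=None):
    """Keep only reviews whose length lies within [min_len, max_len]."""
    kept = [
        (review, label)
        for review, label in zip(encoded_reviews, labels)
        if len(review) >= min_len and (max_len is None or len(review) <= max_len)
    ]
    kept_reviews = [review for review, _ in kept]
    kept_labels = [label for _, label in kept]
    return kept_reviews, kept_labels


reviews_int = [[4, 2], [], [7, 1, 3, 9, 5]]
labels = [1, 0, 1]
reviews_int, labels = drop_outliers(reviews_int, labels, min_len=1, max_len=4)
print(reviews_int, labels)
```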
The movie reviews are then padded or truncated to a length of 200:-
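The padding and truncation can be sketched like this. Padding with zeros on the left is an assumption (right padding would also work), and `pad_features` is a hypothetical name. The example uses a length of 5 so the result is easy to read; the post uses 200.

```python
import numpy as np


def pad_features(encoded_reviews, seq_length=200):
    """Left-pad short reviews with zeros and truncate long ones to seq_length."""
    features = np.zeros((len(encoded_reviews), seq_length), dtype=np.int64)
    for i, review in enumerate(encoded_reviews):
        trimmed = review[:seq_length]
        if trimmed:  # guard: an empty review would otherwise select the whole row
            features[i, -len(trimmed):] = trimmed
    return features


features = pad_features([[1, 2, 3], [1, 2, 3, 4, 5, 6, 7]], seq_length=5)
print(features)
```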
The movie reviews are then split into train, validation and test sets:-
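A sketch of the split, assuming 80% of the data goes to training and the remainder is halved between validation and test. Those fractions and the name `split_data` are assumptions for illustration.

```python
import numpy as np


def split_data(features, labels, train_frac=0.8):
    """Split into train, validation and test sets, in that order."""
    n_train = int(len(features) * train_frac)
    n_val = (len(features) - n_train) // 2
    train = (features[:n_train], labels[:n_train])
    val = (features[n_train:n_train + n_val], labels[n_train:n_train + n_val])
    test = (features[n_train + n_val:], labels[n_train + n_val:])
    return train, val, test


features = np.arange(40).reshape(10, 4)
labels = np.array([1, 0] * 5)
(train_x, train_y), (val_x, val_y), (test_x, test_y) = split_data(features, labels)
print(train_x.shape, val_x.shape, test_x.shape)
```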
Dataloaders are then created for train, validation and test sets:-
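Creating a dataloader could look roughly like this. The batch size and the helper name `make_loader` are assumptions; `drop_last=True` is included here so every batch has the same size, which is a design choice rather than a requirement.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset


def make_loader(features, labels, batch_size=50, shuffle=True):
    """Wrap numpy features and labels in a DataLoader that yields tensor batches."""
    dataset = TensorDataset(torch.from_numpy(features), torch.from_numpy(labels))
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, drop_last=True)


# Tiny stand-in data: 10 "reviews" of length 6.
features = np.zeros((10, 6), dtype=np.int64)
labels = np.ones(10, dtype=np.int64)
train_loader = make_loader(features, labels, batch_size=5)

sample_x, sample_y = next(iter(train_loader))
print(sample_x.shape, sample_y.shape)
```

The same helper would be called three times, once each for the train, validation and test arrays.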
The device is defined, which will select the graphics processing unit, or GPU, if it is available:-
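The usual way of doing this in PyTorch is a one-liner that falls back to the CPU when no GPU is present:

```python
import torch

# Use the GPU when CUDA is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Running on:", device)
```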
The recurrent neural network, or RNN, is defined:-
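A sketch of what such a network can look like is given below. It assumes an embedding layer feeding an LSTM, followed by dropout, a linear layer and a sigmoid; the layer sizes are illustrative, and the class is a simplified stand-in rather than the exact model in the notebook. To keep it short, the LSTM starts from its default zero hidden state and the forward pass returns only the probabilities.

```python
import torch
from torch import nn


class SentimentRNN(nn.Module):
    """Embedding -> LSTM -> dropout -> linear -> sigmoid (simplified sketch)."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, num_layers=n_layers,
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(0.3)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        embeds = self.embedding(x)           # (batch, seq, embedding_dim)
        lstm_out, _ = self.lstm(embeds)      # (batch, seq, hidden_dim)
        last_step = lstm_out[:, -1, :]       # output at the final time step
        out = torch.sigmoid(self.fc(self.dropout(last_step)))
        return out.squeeze(1)                # (batch,) probabilities


model = SentimentRNN(vocab_size=100, embedding_dim=8, hidden_dim=16, n_layers=2)
model.eval()
with torch.no_grad():
    probs = model(torch.randint(0, 100, (4, 20)))
print(probs.shape)
```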
The hyperparameters for the model are then set:-
Several variables concerning loss and optimisation are defined:-
The model is then trained:-
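The loss, optimiser and training loop can be sketched together as follows. Binary cross-entropy loss, the Adam optimiser, the learning rate and the gradient clipping value are all assumptions, and `TinySentimentNet` is a hypothetical stand-in model trained on random data, included only so the loop runs on its own.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


class TinySentimentNet(nn.Module):
    """Small stand-in model: integer sequences in, probabilities out."""

    def __init__(self, vocab_size=50, embedding_dim=8, hidden_dim=16):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        lstm_out, _ = self.lstm(self.embedding(x))
        return torch.sigmoid(self.fc(lstm_out[:, -1, :])).squeeze(1)


def train(model, loader, epochs=2, lr=0.001, clip=5.0, device="cpu"):
    """Train with BCE loss and Adam; return the loss recorded for each batch."""
    criterion = nn.BCELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    model.train()
    losses = []
    for _ in range(epochs):
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.float().to(device)
            optimizer.zero_grad()
            loss = criterion(model(inputs), targets)
            loss.backward()
            # Clip gradients to reduce the risk of exploding gradients in the LSTM.
            nn.utils.clip_grad_norm_(model.parameters(), clip)
            optimizer.step()
            losses.append(loss.item())
    return losses


torch.manual_seed(0)
x = torch.randint(1, 50, (20, 12))   # 20 random "reviews" of length 12
y = torch.randint(0, 2, (20,))       # random 0/1 labels
loader = DataLoader(TensorDataset(x, y), batch_size=5, shuffle=True)
losses = train(TinySentimentNet(), loader)
print(len(losses), "batches trained")
```

In the real exercise the validation loader would also be checked at intervals during training to watch for overfitting.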
The network is then tested:-
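Testing amounts to running the held-out data through the model without updating any weights, and recording the loss and accuracy. In this sketch the function name `evaluate` is hypothetical, and `AlwaysPositive` is a deliberately trivial stand-in model that exists only so the example runs by itself.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def evaluate(model, loader, device="cpu"):
    """Return (mean BCE loss, accuracy) over every example in the loader."""
    criterion = nn.BCELoss()
    model.to(device)
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():
        for inputs, targets in loader:
            inputs, targets = inputs.to(device), targets.float().to(device)
            probs = model(inputs)
            total_loss += criterion(probs, targets).item() * len(targets)
            # Probabilities are rounded to 0 or 1 before comparing with the labels.
            correct += (torch.round(probs) == targets).sum().item()
            seen += len(targets)
    return total_loss / seen, correct / seen


class AlwaysPositive(nn.Module):
    """Stand-in model that predicts 0.9 for every review."""

    def forward(self, x):
        return torch.full((x.shape[0],), 0.9, device=x.device)


x = torch.zeros((4, 5), dtype=torch.long)
y = torch.tensor([1, 1, 0, 1])
test_loader = DataLoader(TensorDataset(x, y), batch_size=2)
test_loss, test_acc = evaluate(AlwaysPositive(), test_loader)
print(f"loss={test_loss:.3f} accuracy={test_acc:.2f}")
```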
A negative test review is defined.
The test review is processed by converting it to lower case, splitting the text into words, and tokenising it.
The sequence of numbers is then padded to a length of 200:-
The test review is converted to a tensor and passed to the model:-
The predict function is then defined:-
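The steps above (tokenising, padding, converting to a tensor and calling the model) can be pulled together in a sketch like the one below. Skipping words that are not in the vocabulary, left padding, and the 0.5 threshold are assumptions; the helper names, the tiny vocabulary and the `FixedProb` stand-in model are hypothetical and only there to make the example runnable.

```python
from string import punctuation

import torch
from torch import nn


def tokenize_review(review, vocab_to_int):
    """Lower-case, strip punctuation, and map known words to their integers."""
    text = "".join(ch for ch in review.lower() if ch not in punctuation)
    # Assumption: words missing from the vocabulary are simply skipped.
    return [vocab_to_int[w] for w in text.split() if w in vocab_to_int]


def pad_tokens(tokens, seq_length=200):
    """Truncate to seq_length, then left-pad with zeros."""
    tokens = tokens[:seq_length]
    return [0] * (seq_length - len(tokens)) + tokens


def predict(model, review, vocab_to_int, seq_length=200, device="cpu"):
    """Return (label, probability) for a single review string."""
    model.to(device)
    model.eval()
    tokens = pad_tokens(tokenize_review(review, vocab_to_int), seq_length)
    batch = torch.tensor([tokens], dtype=torch.long, device=device)
    with torch.no_grad():
        prob = model(batch).item()
    return ("positive" if prob >= 0.5 else "negative"), prob


class FixedProb(nn.Module):
    """Stand-in for a trained model: always returns the same probability."""

    def __init__(self, prob):
        super().__init__()
        self.prob = prob

    def forward(self, x):
        return torch.full((x.shape[0],), self.prob, device=x.device)


vocab_to_int = {"the": 1, "worst": 2, "film": 3}
tokens = tokenize_review("The worst film I have ever seen!", vocab_to_int)
padded = pad_tokens(tokens, seq_length=5)
label, prob = predict(FixedProb(0.1), "The worst film I have ever seen!",
                      vocab_to_int, seq_length=5)
print(tokens, padded, label)
```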
The positive test review is defined and the model is used to predict its sentiment:-
The code for this program can be found in its entirety in my personal GitHub account, here:- Udacity-Course/SentimentRNN_PyTorch.ipynb at main · TracyRenee61/Udacity-Course (github.com)