Get started with NLTK (and Project Gutenberg) in Python
If you’re new to performing NLP with the Natural Language Toolkit (nltk) in Python, here’s a compilation of functions to quickly get you up and running. As usual, this post purposefully contains (1) more code than text and (2) code you can copy directly into a Jupyter Notebook and run to expedite your progress.
With that said, I suggest you head to nltk’s official website (https://www.nltk.org) after this post for more depth and explanations.
To start Natural Language Processing, you need some text. One of the best places to grab large text files is Project Gutenberg. Run the code below to get The Great Gatsby by F. Scott Fitzgerald as a string in Python.
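The download step might look something like the sketch below. It uses only the standard library. The function name `fetch_text` and the URL (assumed to be the plain-text file for Project Gutenberg ebook #64317) are assumptions, so check the link on the book's Gutenberg page if the download fails.

```python
from urllib.request import urlopen

# Assumed location of the plain-text edition of The Great Gatsby
# (Project Gutenberg ebook #64317); verify it on the book's page.
GATSBY_URL = "https://www.gutenberg.org/files/64317/64317-0.txt"


def fetch_text(url, encoding="utf-8-sig"):
    """Download a plain-text file and return its contents as one string."""
    with urlopen(url) as response:
        raw_bytes = response.read()
    # "utf-8-sig" also drops a leading byte-order mark if the file has one.
    return raw_bytes.decode(encoding)


# In a notebook, uncomment the next line to fetch the book:
# text = fetch_text(GATSBY_URL)
```

Keep in mind that the returned string still includes Project Gutenberg's license header and footer, so any character counts cover those too.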
Next, run the code below to print some information about your newly acquired string.
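A minimal sketch of that summary step follows. The helper name `describe_text` is hypothetical, and the short `sample` string is only a stand-in so the cell runs on its own; pass the full book string instead to get real numbers.

```python
def describe_text(text, phrase="west egg"):
    """Return the character count and a case-insensitive phrase count."""
    return {
        "characters": len(text),
        "phrase": phrase,
        "mentions": text.lower().count(phrase.lower()),
    }


# Stand-in string so the cell runs on its own; use the full book instead.
sample = "I lived at West Egg. West Egg is the less fashionable of the two."
info = describe_text(sample)
print(
    f"{info['characters']:,} characters, "
    f"{info['mentions']} mentions of '{info['phrase']}'"
)
```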
OK, so you have a ~276,479-character string that mentions “west egg” 23 times. To start Natural Language Processing with nltk on this text, you must first tokenize your string. Tokenization splits text into smaller units called tokens (most often words or sentences). In our case, we will split our text into words:
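Word tokenization could be sketched as follows. This version uses `wordpunct_tokenize`, a regex-based tokenizer that needs no extra data download; `nltk.word_tokenize` is the more common choice but requires the punkt data first. The helper name `tokenize_words` and the choice to lowercase and drop non-alphabetic tokens are assumptions, not necessarily what every analysis wants.

```python
from nltk.tokenize import wordpunct_tokenize


def tokenize_words(text):
    """Split text into lowercase word tokens, dropping punctuation and digits."""
    raw_tokens = wordpunct_tokenize(text)
    # Keep alphabetic tokens only and lowercase them so "Egg" matches "egg".
    return [tok.lower() for tok in raw_tokens if tok.isalpha()]


# Stand-in sentence adapted from the book's opening line.
sample = "In my younger and more vulnerable years my father gave me some advice."
tokens = tokenize_words(sample)
print(tokens[:5])
```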
Now that you have tokenized The Great Gatsby into words, you have the option of lemmatizing your word tokens. Lemmatization groups all inflections of a word into a single form so that each word is easier to analyze. For example, if you lemmatize “swim”, “swimming”, and “swam”, they would all be converted to “swim”. This makes certain analyses much easier, including everything we do in this post.
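Here is one way the lemmatization step might look, using nltk's `WordNetLemmatizer`. Note that this lemmatizer treats every word as a noun unless told otherwise, so the sketch passes `pos="v"` to get verb forms like the “swim” example. The helper name `lemmatize_tokens` and its optional `lemmatizer` argument are hypothetical conveniences.

```python
from nltk.stem import WordNetLemmatizer

# One-time setup in a fresh environment (uncomment on first run):
# import nltk; nltk.download("wordnet")


def lemmatize_tokens(tokens, pos="v", lemmatizer=None):
    """Lemmatize each token, treating it as the given part of speech."""
    if lemmatizer is None:
        lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(tok, pos=pos) for tok in tokens]


# With the WordNet data installed, try:
# lemmatize_tokens(["swim", "swimming", "swam"])
```

Treating every token as a verb is a simplification; a fuller approach would tag each token's part of speech first and pass the matching `pos` value.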
With the lemmatized tokens, you may naturally wonder which are most prevalent in your text. To return and print the top tokens, run the following code; you can change the top_N argument to return more or fewer top tokens:
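Counting the most frequent tokens can be sketched with nltk's `FreqDist`. The helper name `top_tokens` and the tiny `sample_tokens` list are stand-ins; swap in your lemmatized tokens from the book.

```python
from nltk import FreqDist


def top_tokens(tokens, top_N=10):
    """Return the top_N most frequent tokens as (token, count) pairs."""
    return FreqDist(tokens).most_common(top_N)


# Stand-in tokens so the cell runs on its own.
sample_tokens = ["gatsby", "egg", "west", "egg", "gatsby", "egg"]
for token, count in top_tokens(sample_tokens, top_N=2):
    print(f"{token}: {count}")
```

On a full book the top of this list is usually dominated by words like “the” and “and”. Filtering against `nltk.corpus.stopwords.words("english")` (which needs the stopwords data downloaded) is one common way to remove them.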
Top tokens are interesting, but what about the top bigrams (pairs of adjacent words)? Try the code below:
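A bigram count might be sketched like this, pairing `nltk.util.bigrams` with `FreqDist`. The helper name `top_bigrams` and the sample tokens are stand-ins for illustration.

```python
from nltk import FreqDist
from nltk.util import bigrams


def top_bigrams(tokens, top_N=10):
    """Return the top_N most frequent adjacent word pairs with their counts."""
    return FreqDist(bigrams(tokens)).most_common(top_N)


# Stand-in tokens so the cell runs on its own.
sample_tokens = ["west", "egg", "is", "not", "east", "egg", "but", "west", "egg"]
for pair, count in top_bigrams(sample_tokens, top_N=1):
    print(" ".join(pair), count)
```

For longer sequences, `nltk.util.ngrams(tokens, 3)` yields trigrams in the same way, and any other value of `n` works too.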
You can continue to analyze sets of words (look into trigrams and n-grams in nltk), but to keep the post short, let’s move on to analyzing sentiment. Read and then run the following block of code to get the sentiment score for your text. The score ranges from -1 to 1, where -1 is completely negative and 1 is completely positive:
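One scorer in nltk that produces a value between -1 and 1 is VADER's compound score, so the sketch below assumes VADER; other sentiment tools would work differently. The helper names `sentiment_score` and `mean_sentiment`, and the optional `analyzer` argument, are hypothetical.

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# One-time setup in a fresh environment (uncomment on first run):
# import nltk; nltk.download("vader_lexicon")


def sentiment_score(text, analyzer=None):
    """Return VADER's compound score for text, between -1 and 1."""
    if analyzer is None:
        analyzer = SentimentIntensityAnalyzer()  # loads the VADER lexicon
    return analyzer.polarity_scores(text)["compound"]


def mean_sentiment(passages, analyzer=None):
    """Average the compound score over a list of passages."""
    if analyzer is None:
        analyzer = SentimentIntensityAnalyzer()
    scores = [sentiment_score(passage, analyzer) for passage in passages]
    return sum(scores) / len(scores) if scores else 0.0


# With the lexicon installed, try:
# sentiment_score("What a gorgeous, gorgeous party!")
```

VADER was designed for short passages, and its compound score tends to saturate near -1 or 1 on very long inputs. For a whole book, averaging over sentences or paragraphs with something like `mean_sentiment` is likely to be more informative than scoring the full string at once.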
(Optional) In the real world, you may need to use the Natural Language Toolkit on text that is stored as a column in a pandas DataFrame. If that’s the case, you can use the function below to create tokens from a single column in a DataFrame:
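A function along these lines could handle the DataFrame case. The name `tokens_from_column`, the `review` column, and the choice to return one flat list of lowercase tokens are all assumptions for illustration; it uses `wordpunct_tokenize` so no nltk data download is required.

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize


def tokens_from_column(df, column):
    """Tokenize every non-null entry of one column into a flat token list."""
    tokens = []
    for entry in df[column].dropna().astype(str):
        # Lowercase and keep alphabetic tokens only, dropping punctuation.
        tokens.extend(
            tok.lower() for tok in wordpunct_tokenize(entry) if tok.isalpha()
        )
    return tokens


# Hypothetical DataFrame with one text column and a missing value.
reviews = pd.DataFrame(
    {"review": ["West Egg glittered.", None, "East Egg did too!"]}
)
print(tokens_from_column(reviews, "review"))
```

If you would rather keep one token list per row, `df[column].dropna().astype(str).apply(...)` with the same tokenizing logic returns a Series instead of a flat list.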
Hope you’re off the ground with nltk! If so, you can try running these functions on other books from Project Gutenberg or on your own text files.
If you’re ready to create a word cloud from your text (like the one below), check out my next post.