Lab 9: 🔄 RNN for Text Processing
Teaching a computer to read, understand, and write by giving it a
memory.
Libraries: TensorFlow, Keras, NLTK • Estimated Time: 3 hours
Part 1: The Power of Memory
Imagine reading a book one word at a time, but with amnesia. Every time you read a new word, you forget
all the previous ones. You'd never understand the story!
"The cat, which was chased by the ____, climbed the tree."
To fill in the blank, you need to remember the word "cat." The CNNs we've used so far are like readers
with amnesia—they see an image all at once but have no memory of what came before. They're great for
photos, but terrible for sequences like text or stock prices.
Introducing RNNs: Recurrent Neural Networks
An RNN is a special type of network with a loop. After it processes a word, it passes its output (a hidden state) back into itself as an extra input for the next word. This feedback loop acts as a memory, allowing the network to keep track of the context from previous words in the sentence.
How an RNN Reads a Sentence
- Reads "The", creates a summary (a hidden state), and passes it to the next step.
- Reads "cat", combines it with the summary of "The", creates a new, richer summary, and passes it
on.
- Reads "which", combines it with the summary of "The cat", and so on...
This "summary" is just a vector of numbers, but it allows the network to maintain
context through time.
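To make the idea of a running "summary" concrete, here is a minimal NumPy sketch of the recurrent update. The weights and word vectors are random placeholders, and the `tanh` update mirrors what a Keras `SimpleRNN` cell computes at each step:

import numpy as np

# Toy sizes: each word vector has 4 numbers, the memory ("hidden state") has 3.
embedding_dim, hidden_units = 4, 3

rng = np.random.default_rng(0)
W_x = rng.normal(size=(embedding_dim, hidden_units))  # input-to-hidden weights
W_h = rng.normal(size=(hidden_units, hidden_units))   # hidden-to-hidden weights
b = np.zeros(hidden_units)

h = np.zeros(hidden_units)                       # blank memory before the first word
sentence = rng.normal(size=(5, embedding_dim))   # five made-up word vectors

for x_t in sentence:
    # The new summary depends on the current word AND the previous summary.
    h = np.tanh(x_t @ W_x + h @ W_h + b)

print("Final hidden state (the sentence 'summary'):", h)

The important point is the loop: the same weights are reused at every step, and `h` carries information forward from one word to the next.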
Part 2: From Words to Numbers
A computer can't read words. It only understands numbers. The first and most important step in any NLP
task is to convert our text into a numerical format. This is called Vectorization.
Step 1: Get some text
We'll start with a simple text: the script of Shakespeare's Macbeth, which we can download easily.
import tensorflow as tf
import numpy as np

# Download the text of Macbeth from Project Gutenberg and decode it into a string.
path_to_file = tf.keras.utils.get_file('macbeth.txt', 'https://www.gutenberg.org/files/2264/2264-0.txt')
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
Step 2: Tokenization - Splitting Text into Words
Next, we create a "vocabulary" of all the unique words in the text and assign a unique integer to each
word. This process is called tokenization.
# Build a word-level vocabulary: strip punctuation, lowercase, and split on spaces.
tokenizer = tf.keras.preprocessing.text.Tokenizer(
    filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
    lower=True,
    split=' ')
tokenizer.fit_on_texts([text])

# Each unique word now has its own integer index.
print(list(tokenizer.word_index.items())[:10])

# +1 because index 0 is reserved for padding.
vocab_size = len(tokenizer.word_index) + 1
print("Vocabulary Size:", vocab_size)
Step 3: Creating Sequences
Now we convert our long text into many shorter, overlapping sequences. If we have the sentence "The cat
sat on the mat", we can create training examples like:
- Input: "The cat", Target: "sat"
- Input: "The cat sat", Target: "on"
- Input: "The cat sat on", Target: "the"
# Convert the whole text into one long list of integer word indices.
full_sequence = tokenizer.texts_to_sequences([text])[0]

# Build overlapping n-gram sequences: [w1, w2], [w1, w2, w3], ...
# Note: this grows quickly for a long text; if Colab runs out of memory, you may
# want to truncate each sequence to its last ~100 words (a smaller maxlen below
# does exactly that, since pad_sequences truncates from the front by default).
input_sequences = []
for i in range(1, len(full_sequence)):
    n_gram_sequence = full_sequence[:i+1]
    input_sequences.append(n_gram_sequence)

# Pad on the left so every sequence has the same length.
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(tf.keras.preprocessing.sequence.pad_sequences(
    input_sequences, maxlen=max_sequence_len, padding='pre'))

# Inputs are every word except the last; the label is the final word.
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
ys = tf.keras.utils.to_categorical(labels, num_classes=vocab_size)

print("Input example:", xs[5])
print("Label for example:", labels[5])
💡 Your Turn: Decode a Sequence
The numbers above are hard to read. Use the `tokenizer.index_word` dictionary to convert the integer
sequence from `xs[5]` back into human-readable words. This helps verify that your preprocessing is
working correctly.
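One possible approach, assuming the `tokenizer` and `xs` variables from the cells above (index 0 is just padding, so we skip it):

# Turn the integer sequence back into words, skipping the padding zeros.
decoded = [tokenizer.index_word[i] for i in xs[5] if i != 0]
print(" ".join(decoded))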
Part 3: Building a Language Model
Now for the fun part. We will build a model that takes a sequence of words as input and tries to predict
the very next word.
Key Keras Layers for NLP
- Embedding: This is the magic layer. It turns our integer
indices into dense vectors of a fixed size. It learns to place similar words close to each other
in vector space (e.g., "king" and "queen" will have similar vectors).
- LSTM (Long Short-Term Memory): A more advanced and powerful
type of RNN layer that is much better at remembering long-term dependencies. We almost always
use LSTM or GRU instead of a plain `SimpleRNN`.
- Dense: Our familiar fully-connected layer for the final
output.
model = tf.keras.Sequential([
    # Embedding: turn each word index into a dense 100-dimensional vector.
    tf.keras.layers.Embedding(vocab_size, 100, input_length=max_sequence_len-1),
    # LSTM: read the sequence of word vectors while keeping a 150-unit memory.
    tf.keras.layers.LSTM(150),
    # Dense softmax: a probability for every word in the vocabulary.
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
💡 Your Turn: Experiment with Complexity
The numbers `100` in the Embedding layer and `150` in the LSTM layer are hyperparameters. What
happens to the "Total params" in `model.summary()` if you change the LSTM units to `256`? What if
you change the embedding dimension to `64`? Getting a feel for how these parameters affect model
size is a key skill.
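If you want to sanity-check the numbers that `model.summary()` reports, you can work them out by hand. Here is a rough sketch using the standard formulas for these layers (the sizes below match the model above; rerun it with `lstm_units = 256` or `embedding_dim = 64` to see the effect):

embedding_dim, lstm_units = 100, 150

# Embedding: one trainable vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (lstm_units * embedding_dim + lstm_units * lstm_units + lstm_units)

# Dense: a weight from every LSTM unit to every vocabulary entry, plus biases.
dense_params = lstm_units * vocab_size + vocab_size

print(embedding_params, lstm_params, dense_params)

Once you are happy with the architecture, train the model (this can take a while on CPU, so consider a GPU runtime in Colab):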
history = model.fit(xs, ys, epochs=50, verbose=1)
Part 4: Generating New Text
Now that our model has been trained to predict the next word, we can use it to generate brand new text
that sounds like Shakespeare!
seed_text = "To be or not to be"
next_words = 100
for _ in range(next_words):
    # Convert the current text to integer indices and pad to the training length.
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = tf.keras.preprocessing.sequence.pad_sequences(
        [token_list], maxlen=max_sequence_len-1, padding='pre')

    # Predict a probability for every word and pick the most likely one.
    predicted_probs = model.predict(token_list, verbose=0)
    predicted_index = np.argmax(predicted_probs)

    # Look up the word that corresponds to the predicted index.
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted_index:
            output_word = word
            break

    seed_text += " " + output_word

print(seed_text)
💡 Your Turn: Try a Different Seed
The generated text is completely dependent on your starting "seed." Rerun the generation code block,
but this time, try a different famous line from Shakespeare, like `"all the world's a stage"` or
`"what's in a name"`. How does the story change?
💡 Your Turn: Add Temperature
Taking the `argmax` always picks the most likely word, which can make the output repetitive. A common technique is to use "temperature" to introduce randomness: convert the probabilities to log space, divide by a temperature, and sample with `tf.random.categorical` (which expects logits rather than probabilities). Try a temperature of `1.0` (more random) and `0.5` (more predictable).
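A minimal sketch of the sampling step, assuming `predicted_probs` comes from the `model.predict` call in the generation loop above:

temperature = 1.0  # try 0.5 for safer, more predictable choices

# tf.random.categorical expects logits (log-probabilities), so take the log first.
logits = np.log(predicted_probs + 1e-9) / temperature
predicted_index = tf.random.categorical(logits, num_samples=1).numpy()[0][0]

Drop this in place of the `np.argmax` line and compare the generated text at different temperatures.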
Part 5: Your Mission - Character-Level Text Generator
Assignment: Build a Character-Level Model
Instead of tokenizing by words, you can tokenize by individual characters. This allows
the model to learn spelling, punctuation, and capitalization from scratch. It's often more creative,
though sometimes less coherent.
Your Tasks:
- Get New Data: Find the plain text of a different book from Project Gutenberg.
Choose something with a distinct style, like Alice in Wonderland or Frankenstein.
- Character Tokenization: Do not use the Keras `Tokenizer`. Instead, create your vocabulary yourself. The vocabulary will be the set of all unique characters in the text. Create a `char_to_index` and an `index_to_char` dictionary (a minimal sketch is shown after this list).
- Create Sequences: Create input sequences and target labels. For example, if
your text is "hello" and your sequence length is 3, your first input would be "hel" and the
target would be "l".
- Build and Train: Build an LSTM model similar to the one in this lab. The
`Embedding` layer is not strictly necessary for character models but can still be useful. Train
it on your character sequences.
- Generate Text: Write a generation loop that works at the character level. Give
it a seed like "Alice was" and see what it writes!
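To get you started on the character tokenization and sequence steps, here is a minimal sketch. It assumes your new book has been loaded into a string called `text`, and the sequence length of 40 is just an example value:

# Vocabulary = every unique character in the text.
chars = sorted(set(text))
char_to_index = {c: i for i, c in enumerate(chars)}
index_to_char = {i: c for i, c in enumerate(chars)}

# Encode the whole text as a list of integers.
encoded = [char_to_index[c] for c in text]

# Sliding windows: 40 characters of input, the next character as the target.
seq_length = 40
inputs, targets = [], []
for i in range(len(encoded) - seq_length):
    inputs.append(encoded[i:i + seq_length])
    targets.append(encoded[i + seq_length])

From here the model building, training, and generation loop follow the same pattern as the word-level model in this lab.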
Part 6: Bonus - Sentiment Analysis
Language modeling is fun, but a more common real-world task is classification. Sentiment
analysis—determining if a review is positive or negative—is a classic NLP problem.
The goal is to classify the sentiment of movie reviews into 5 categories (negative, somewhat
negative, neutral, somewhat positive, positive).
Your Challenge:
Use the skills you've learned to build a classifier. Your pipeline will be:
- Load the data using Pandas.
- Use the Keras `Tokenizer` to vectorize the review phrases.
- Build a model. A good starting point is `Embedding` -> `LSTM` -> `Dense(5, activation='softmax')` (see the sketch after this list).
- Train the model and submit your predictions to Kaggle!
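Here is a minimal sketch of such a classifier. It assumes you have already tokenized the phrases into padded integer sequences `x_train` with integer labels `y_train`, and that `review_vocab_size` and `max_phrase_len` come from your own tokenization step; these names and the layer sizes are placeholders to adapt to your data:

sentiment_model = tf.keras.Sequential([
    tf.keras.layers.Embedding(review_vocab_size, 64, input_length=max_phrase_len),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(5, activation='softmax')  # one output per sentiment class
])

# sparse_categorical_crossentropy lets us use the integer labels 0-4 directly.
sentiment_model.compile(loss='sparse_categorical_crossentropy',
                        optimizer='adam', metrics=['accuracy'])
sentiment_model.fit(x_train, y_train, epochs=5, validation_split=0.1)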
Part 7: Submission Guidelines
- Complete all "Your Turn" tasks and the main "Lab Assignment" (Character-Level Generator) in a single
Google Colab notebook. The Kaggle project is a bonus.
- For the assignment, clearly show your tokenization, model building, training, and the final
generated text.
- Add a Text Cell at the end to discuss your results. Did the generated text make sense? What was the
effect of training for more epochs?
- Ensure all your code cells have been run so that their outputs and plots are visible.
- When you are finished, generate a shareable link. In Colab, click "Share" and set the access to "Anyone with the link" with the "Viewer" role.
- Click "Copy link" and submit this link as your assignment.