Lab 9: 🔄 RNN for Text Processing

Teaching a computer to read, understand, and write by giving it a memory.

Libraries: TensorFlow, Keras, NumPy • Estimated Time: 3 hours

Part 1: The Power of Memory

Imagine reading a book one word at a time, but with amnesia. Every time you read a new word, you forget all the previous ones. You'd never understand the story!

"The cat, which was chased by the ____, climbed the tree."

To fill in the blank, you need to remember the word "cat." The CNNs we've used so far are like readers with amnesia—they see an image all at once but have no memory of what came before. They're great for photos, but terrible for sequences like text or stock prices.

Introducing RNNs: Recurrent Neural Networks

An RNN is a special type of network with a loop. When it processes a word, it takes its output and feeds it back into itself as an input for the next word. This feedback loop acts as a memory, allowing it to keep track of the context from previous words in the sentence.

How an RNN Reads a Sentence

  1. Reads "The", creates a summary (a hidden state), and passes it to the next step.
  2. Reads "cat", combines it with the summary of "The", creates a new, richer summary, and passes it on.
  3. Reads "which", combines it with the summary of "The cat", and so on...

This "summary" is just a vector of numbers, but it allows the network to maintain context through time.

Part 2: From Words to Numbers

A computer can't read words. It only understands numbers. The first and most important step in any NLP task is to convert our text into a numerical format. This is called Vectorization.

Step 1: Get some text

We'll start with a simple text: the script of Shakespeare's Macbeth, which we can download easily.

import tensorflow as tf
import numpy as np

path_to_file = tf.keras.utils.get_file('macbeth.txt', 'https://www.gutenberg.org/files/2264/2264-0.txt')

# Read the file and decode it as UTF-8 text
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

Step 2: Tokenization - Splitting Text into Words

Next, we create a "vocabulary" of all the unique words in the text and assign a unique integer to each word. This process is called tokenization.

# The Keras Tokenizer does all the heavy lifting for us
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True, split=' ')

# Give it our text to build the vocabulary
tokenizer.fit_on_texts([text])

# We can now see the word -> index mapping
print(list(tokenizer.word_index.items())[:10])

vocab_size = len(tokenizer.word_index) + 1 # Add 1 for the 0 padding index
print("Vocabulary Size:", vocab_size)

Step 3: Creating Sequences

Now we convert our long text into many shorter, overlapping sequences. If we have the sentence "The cat sat on the mat", we can create training examples like "The" -> "cat", "The cat" -> "sat", "The cat sat" -> "on", and so on: each sequence's final word becomes the label the model must predict.

# Convert the full text to a sequence of integers
full_sequence = tokenizer.texts_to_sequences([text])[0]

# Create training examples and labels
input_sequences = []
for i in range(1, len(full_sequence)):
  n_gram_sequence = full_sequence[:i+1]
  input_sequences.append(n_gram_sequence)

# Pad sequences so they are all the same length
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(tf.keras.preprocessing.sequence.pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# Create predictors and labels
xs, labels = input_sequences[:, :-1], input_sequences[:, -1]
ys = tf.keras.utils.to_categorical(labels, num_classes=vocab_size)

print("Input example:", xs[5])
print("Label for example:", labels[5])

💡 Your Turn: Decode a Sequence

The numbers above are hard to read. Use the `tokenizer.index_word` dictionary to convert the integer sequence from `xs[5]` back into human-readable words. This helps verify that your preprocessing is working correctly.
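If you get stuck, one possible approach (the 0s are padding and can be skipped) looks like this:

# Look up each non-zero index in the reverse mapping and join the words back together
decoded = " ".join(tokenizer.index_word[int(i)] for i in xs[5] if i != 0)
print(decoded)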

Part 3: Building a Language Model

Now for the fun part. We will build a model that takes a sequence of words as input and tries to predict the very next word.

Key Keras Layers for NLP

  • Embedding: This is the magic layer. It turns our integer indices into dense vectors of a fixed size. It learns to place similar words close to each other in vector space (e.g., "king" and "queen" will have similar vectors).
  • LSTM (Long Short-Term Memory): A more advanced and powerful type of RNN layer that is much better at remembering long-term dependencies. We almost always use LSTM or GRU instead of a plain `SimpleRNN`.
  • Dense: Our familiar fully-connected layer for the final output.

# Build the language model: Embedding -> LSTM -> softmax over the vocabulary
model = tf.keras.Sequential([
  tf.keras.layers.Embedding(vocab_size, 100, input_length=max_sequence_len-1),
  tf.keras.layers.LSTM(150),
  tf.keras.layers.Dense(vocab_size, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

💡 Your Turn: Experiment with Complexity

The numbers `100` in the Embedding layer and `150` in the LSTM layer are hyperparameters. What happens to the "Total params" in `model.summary()` if you change the LSTM units to `256`? What if you change the embedding dimension to `64`? Getting a feel for how these parameters affect model size is a key skill.
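If you want to check your answer by hand, a rough sketch of the arithmetic (using the variable names from the code above) looks like this; the LSTM formula reflects its four internal gates, each with its own input weights, recurrent weights, and bias.

# Hand-computed parameter counts -- these should match model.summary()
embedding_params = vocab_size * 100               # one 100-dim vector per word in the vocabulary
lstm_params = 4 * (150 * (100 + 150) + 150)       # 4 gates x (input kernel + recurrent kernel + bias)
dense_params = 150 * vocab_size + vocab_size      # weights from 150 LSTM units to every word, plus biases
print(embedding_params, lstm_params, dense_params)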

# Train the model (this will take a while!)
history = model.fit(xs, ys, epochs=50, verbose=1)
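Once training finishes, it helps to plot how accuracy changed over the epochs. A minimal sketch, assuming matplotlib is available in your Colab environment:

import matplotlib.pyplot as plt

# The History object records one accuracy value per epoch
plt.plot(history.history['accuracy'])
plt.xlabel('Epoch')
plt.ylabel('Training accuracy')
plt.title('Next-word prediction accuracy')
plt.show()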

Part 4: Generating New Text

Now that our model has been trained to predict the next word, we can use it to generate brand new text that sounds like Shakespeare!

seed_text = "To be or not to be"
next_words = 100
for _ in range(next_words):
  token_list = tokenizer.texts_to_sequences([seed_text])[0]
  token_list = tf.keras.preprocessing.sequence.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
  predicted_probs = model.predict(token_list, verbose=0)
  predicted_index = np.argmax(predicted_probs)
  # Map the predicted index back to its word with the reverse lookup (empty string for the padding index)
  output_word = tokenizer.index_word.get(int(predicted_index), "")
  seed_text += " " + output_word

print(seed_text)

💡 Your Turn: Try a Different Seed

The generated text is completely dependent on your starting "seed." Rerun the generation code block, but this time, try a different famous line from Shakespeare, like `"all the world's a stage"` or `"what's in a name"`. How does the story change?

💡 Your Turn: Add Temperature

Taking the `argmax` always picks the most likely word, which can be boring. A common technique is to use "temperature" to introduce randomness. Instead of `np.argmax`, sample the next word: take the log of `predicted_probs`, divide it by the temperature, and pass the result to `tf.random.categorical(..., num_samples=1)`, which expects logits rather than raw probabilities. Try a temperature of `1.0` (more random) and `0.5` (more predictable).
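Here is one possible sampling helper, a sketch that assumes `predicted_probs` has shape `(1, vocab_size)` as returned by `model.predict` above; the function name is just an illustrative choice.

def sample_with_temperature(predicted_probs, temperature=1.0):
  # Convert probabilities back into logits; dividing by the temperature flattens or sharpens the distribution
  logits = np.log(predicted_probs + 1e-9) / temperature
  # tf.random.categorical samples one index per row of logits
  sampled = tf.random.categorical(logits, num_samples=1)
  return int(sampled[0, 0])

# In the generation loop, replace the argmax line with, for example:
# predicted_index = sample_with_temperature(predicted_probs, temperature=0.5)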

Part 5: Your Mission - Character-Level Text Generator

Assignment: Build a Character-Level Model

Instead of tokenizing by words, you can tokenize by individual characters. This allows the model to learn spelling, punctuation, and capitalization from scratch. It's often more creative, though sometimes less coherent.

Your Tasks:

  1. Get New Data: Find the plain text of a different book from Project Gutenberg. Choose something with a distinct style, like Alice in Wonderland or Frankenstein.
  2. Character Tokenization: Do not use the Keras `Tokenizer`. Instead, create your vocabulary yourself. The vocabulary will be the set of all unique characters in the text. Create a `char_to_index` and `index_to_char` dictionary (a starter sketch appears after this task list).
  3. Create Sequences: Create input sequences and target labels. For example, if your text is "hello" and your sequence length is 3, your first input would be "hel" and the target would be "l".
  4. Build and Train: Build an LSTM model similar to the one in this lab. The `Embedding` layer is not strictly necessary for character models but can still be useful. Train it on your character sequences.
  5. Generate Text: Write a generation loop that works at the character level. Give it a seed like "Alice was" and see what it writes!
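As promised above, here is a starter sketch for the character tokenization and sequence steps; the file name `my_book.txt` and the sequence length of 40 are placeholder choices you should adapt to your own text.

# Starter sketch for Tasks 2 and 3 (names and numbers are illustrative)
text = open('my_book.txt', 'rb').read().decode(encoding='utf-8')  # your Project Gutenberg text

chars = sorted(set(text))                                # vocabulary = every unique character
char_to_index = {c: i for i, c in enumerate(chars)}
index_to_char = {i: c for c, i in char_to_index.items()}

encoded = [char_to_index[c] for c in text]               # the whole book as a list of integers

seq_len = 40                                             # each input is 40 characters long
inputs = [encoded[i:i + seq_len] for i in range(len(encoded) - seq_len)]
targets = [encoded[i + seq_len] for i in range(len(encoded) - seq_len)]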

Part 6: Bonus - Sentiment Analysis

Language modeling is fun, but a more common real-world task is classification. Sentiment analysis—determining if a review is positive or negative—is a classic NLP problem.

Kaggle: Sentiment Analysis on Movie Reviews

The goal is to classify the sentiment of movie reviews into 5 categories (negative, somewhat negative, neutral, somewhat positive, positive).

Your Challenge:

Use the skills you've learned to build a classifier. Your pipeline will be roughly:

  1. Tokenize and pad the review phrases, just as you did with the Shakespeare text.
  2. Build an Embedding -> LSTM -> Dense model, this time with a 5-unit softmax output (one unit per sentiment class).
  3. Train, evaluate on a held-out split, and submit your predictions to Kaggle.
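As a starting point, the classifier could look something like the sketch below; `max_phrase_len` and `vocab_size` depend on how you preprocess the Kaggle phrases, and the layer sizes are only illustrative.

# Sketch of a 5-class sentiment classifier (hyperparameters are illustrative)
sentiment_model = tf.keras.Sequential([
  tf.keras.layers.Embedding(vocab_size, 64, input_length=max_phrase_len),
  tf.keras.layers.LSTM(64),
  tf.keras.layers.Dense(5, activation='softmax')   # one output per sentiment class
])
sentiment_model.compile(loss='sparse_categorical_crossentropy',
                        optimizer='adam', metrics=['accuracy'])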

Part 7: Submission Guidelines

  1. Complete all "Your Turn" tasks and the main "Lab Assignment" (Character-Level Generator) in a single Google Colab notebook. The Kaggle project is a bonus.
  2. For the assignment, clearly show your tokenization, model building, training, and the final generated text.
  3. Add a Text Cell at the end to discuss your results. Did the generated text make sense? What was the effect of training for more epochs?
  4. Ensure all your code cells have been run so that their outputs and plots are visible.
  5. When you are finished, generate a shareable link. In Colab, click "Share" and set access to "Anyone with the link" with the role "Viewer".
  6. Click "Copy link" and submit this link as your assignment.