Assume we have a collection of texts (a corpus) containing \(n\) unique words. We pair each unique word \(w\) with its frequency of occurrence in the corpus, denoted \(f(w)\). The rank \(r(w)\) of a word \(w\), with \(1 \leq r(w) \leq n\), is its position in the list of unique words sorted in descending order of frequency.

Zipf’s law is an empirical law that states that the frequency of a word is inversely proportional to its rank:

\[f(w) \propto \frac{1}{r(w)}\]

or more specifically: \(f(w) = \frac{C}{r(w)}\)

where \(C\) is a constant that depends on the corpus.
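As a concrete illustration (the constant \(C = 1000\) below is made up, not taken from any real corpus): if the most frequent word occurs 1000 times, Zipf's law predicts about 500 occurrences for the second-ranked word, about 333 for the third, and so on.

```python
# Predicted Zipfian frequencies for the top ranks, assuming the most
# frequent word occurs C = 1000 times (an illustrative value).
C = 1000
predicted = {rank: C / rank for rank in range(1, 6)}
# predicted[1] is 1000.0, predicted[2] is 500.0, predicted[3] is ~333.3
```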

Zipf’s law is an instance of what is called a power law. A power law expresses a functional relationship between two quantities in which one quantity varies as a power of the other. A familiar example is the relation between the area of a square and the length of its side: when the length is increased by a factor of \(k\), the area increases by a factor of \(k^2\).

There are many interesting power laws in nature. For example, Kleiber’s law relates the basal metabolic rate of an animal to its mass by:

\[r = 70m^{3/4}\]

where \(r\) is the basal metabolic rate in kilocalories per day, and \(m\) is the mass of the animal in kilograms.
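Kleiber's law can be evaluated directly (the 70 kg mass used below is just an illustrative value):

```python
def basal_metabolic_rate(mass_kg):
    """Kleiber's law: basal metabolic rate in kcal/day for a mass in kg."""
    return 70 * mass_kg ** 0.75

# A 70 kg animal: 70 * 70**0.75, roughly 1694 kcal/day.
rate = basal_metabolic_rate(70)
```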

Zipf’s law is a special case of the general form of a power law:

\[x = cy^k\]

where \(c\) and \(k\) are constants. In Zipf’s law, \(x\) is the frequency of a word, \(y\) is its rank, \(c\) is the constant \(C\), and \(k = -1\).
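Taking logarithms of both sides explains why power laws are usually examined on log-log axes:

\[\log x = \log c + k \log y\]

On log-log axes a power law is therefore a straight line with slope \(k\); for Zipf's law the slope should be close to \(-1\).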

Define a function that takes a text file as input and maps each unique word to its frequency of occurrence in the file. The output will be a dictionary that maps each unique word to a dictionary {'frequency': f, 'rank': r}, where f is the frequency and r is the rank.


Define a function that takes the output of the previous exercise and plots the frequency vs. rank in a normal and a log-log scale.


The complete solution, zipfs.py:

import sys
import numpy as np
from matplotlib import pyplot as plt
from functools import reduce
from immutables import Map
from funcutils import tap
from icecream import ic
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")

def read_input(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        text = file.read()
    return text

def tokenize_text(text):
    """Tokenize the input text into words."""
    return word_tokenize(text)

def update_count(fdict, item):
    """Update the frequency dictionary with the given item."""
    return fdict.set(item, fdict.get(item,0) + 1)

def compute_frequencies(token_list):
    """Compute frequency distribution of tokens in the list."""
    return reduce(update_count, token_list, Map())

def frequency_to_rank(freq_dict):
    """Convert frequency dictionary to rank dictionary."""
    sorted_items = sorted(freq_dict.items(), key=lambda x: x[1], reverse=True)
    return {word: {'frequency': freq, 'rank': rank + 1}
            for rank, (word, freq) in enumerate(sorted_items)}

def filter_tokens(token_list):
    """Filter out non-alphabetic tokens and convert to lowercase."""
    return [token.lower() for token in token_list if token.isalpha()]

def extract_data(filename):
    """Extract data from the input file and return frequency-rank mapping."""
    return (
        # add rank information to frequencies
        frequency_to_rank(
            tap(
                # Compute frequencies of tokens
                compute_frequencies(
                    tap(
                        # Filter out non-alphabetic tokens and convert to lowercase
                        filter_tokens(
                            # Tokenize the text
                            tokenize_text(
                                # The corpus is read from the file
                                read_input(filename)
                            )
                        ),
                        # tap logs the intermediate value and passes it through
                        lambda x: ic(f'Tokenized {len(x)} items')
                    )
                ),
                # tap logs the intermediate value and passes it through
                lambda x: ic(f'Computed frequencies for {len(x)} unique items')
            )
        )
    )

def plot_data(data):

    # Extract ranks and frequencies

    pairs = [(info['rank'], info['frequency']) for info in data.values()]
    ranks  = [r for r, f in pairs]
    freqs  = [f for r, f in pairs]

    # Plot results

    fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4), layout="constrained")

    # Left: linear
    ax_lin.plot(ranks, freqs, marker="o", markersize=1, linestyle="None")
    ax_lin.set_title("Linear plot")
    ax_lin.set_xlabel("Rank")
    ax_lin.set_ylabel("Frequency")

    # Right: log-log scale
    ax_log.plot(ranks, freqs, marker="o", markersize=1, linestyle="None")
    ax_log.set_title("Log-log plot")
    ax_log.set_xlabel("Rank")
    ax_log.set_ylabel("Frequency")
    ax_log.set_yscale("log")
    ax_log.set_xscale("log")

    plt.show()

if __name__ == '__main__':
    data = extract_data(sys.argv[1] if len(sys.argv) > 1 else 'mtc-full.txt')
    plot_data(data)
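Beyond eyeballing the plots, one quantitative check is a least-squares fit of \(\log f\) against \(\log r\) (the sketch below uses idealized data, not the corpus itself): a slope close to \(-1\) is consistent with Zipf's law.

```python
import numpy as np

# Idealized Zipfian data: f(r) = C / r for ranks 1..1000, with C = 1000.
ranks = np.arange(1, 1001)
freqs = 1000.0 / ranks

# Fit log f = k * log r + log c; the slope k estimates the exponent.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
# For this exact power law the slope is -1; real corpora only approximate it.
```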

What do you think? Does the data follow Zipf’s law?