Zipf’s Law
Assume we have a collection of texts (a corpus) containing \(n\) unique words. We pair each unique word \(w\) with its frequency of occurrence in the corpus, denoted \(f(w)\). The rank \(r(w)\) of a word \(w\), with \(1 \leq r(w) \leq n\), is defined as its position in the list of unique words sorted in descending order of frequency.
Zipf’s law is an empirical law that states that the frequency of a word is inversely proportional to its rank:
\[f(w) \propto \frac{1}{r(w)}\]or more specifically: \(f(w) = \frac{C}{r(w)}\)
where \(C\) is a constant that depends on the corpus.
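As a toy illustration (the constant \(C = 1000\) is a made-up value, as if the most frequent word in some hypothetical corpus occurred 1000 times), the predicted frequencies for the first few ranks fall off like this:

```python
# Predicted Zipfian frequencies f(w) = C / r(w) for ranks 1..5,
# assuming a hypothetical corpus with C = 1000.
C = 1000
predicted = [C / r for r in range(1, 6)]
print(predicted)  # [1000.0, 500.0, 333.33..., 250.0, 200.0]
```

Note how quickly the frequencies drop: the second-ranked word is predicted to occur only half as often as the first, and the fifth only a fifth as often.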
Zipf’s law is an instance of what is called a power law. A power law expresses a functional relationship between two quantities in which one quantity varies as a power of the other. A familiar example is the relation between the area of a square and the length of its side: when the length is increased by a factor of \(k\), the area increases by a factor of \(k^2\).
There are many interesting power laws in nature. For example, Kleiber’s law relates the basal metabolic rate of an animal to its mass by:
\[r = 70m^{3/4}\]where \(r\) is the basal metabolic rate in kilocalories per day, and \(m\) is the mass of the animal in kilograms.
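As a quick sanity check of the formula (the 70 kg mass is just an illustrative input, roughly an adult human), we can evaluate it directly:

```python
# Kleiber's law: basal metabolic rate (kcal/day) for an animal of mass m (kg).
def kleiber(m):
    return 70 * m ** 0.75

# A 70 kg animal comes out to roughly 1700 kcal/day, which is in the
# ballpark of typical human basal metabolic rate estimates.
print(round(kleiber(70)))
```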
Zipf’s law is a special case of the general form of a power law:
\[x = cy^k\]where \(c\) and \(k\) are constants. In Zipf’s law, \(x\) is the frequency of a word, \(y\) is its rank, \(c\) is the constant \(C\), and \(k = -1\).
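A useful consequence of this form: taking logarithms of both sides gives \(\log x = \log c + k \log y\), so on a log-log plot a power law appears as a straight line with slope \(k\). This is why the log-log plot in the exercise below is informative, and it also suggests a way to estimate \(k\) by fitting a line. A sketch on synthetic, perfectly Zipfian data (the constant 1000 is arbitrary):

```python
import numpy as np

# Synthetic Zipfian data: f(r) = C / r with C = 1000.
ranks = np.arange(1, 101)
freqs = 1000 / ranks

# In log-log space the power law x = c * y**k becomes the line
# log x = log c + k * log y, so a degree-1 fit recovers k.
k, log_c = np.polyfit(np.log(ranks), np.log(freqs), 1)
print(k)  # close to -1 for ideal Zipfian data
```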
Define a function that takes a text file as input and maps each unique word to its frequency of occurrence in the file. The output will be a dictionary that maps each unique word to a dictionary {'frequency': f, 'rank': r}, where f is the frequency and r is the rank.
Notes
- See the dictionary, reading files, sorting, strings, and tokenization snippets.
- In computing frequencies, if you solved the Frequency count exercise you can reuse your solution – but beware the efficiency considerations mentioned there – or use the built-in collections.Counter.
- Work with a sample from the METU Turkish Corpus.
- Once you are done you can work on the full corpus.
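A minimal sketch of the exercise using the built-in collections.Counter mentioned in the notes. It assumes naive lowercasing and whitespace splitting rather than a proper tokenizer, so treat it as a starting point only:

```python
from collections import Counter

def word_ranks(text):
    """Map each word to {'frequency': f, 'rank': r}.

    Minimal sketch: lowercases and splits on whitespace; a real
    solution would use a proper tokenizer.
    """
    counts = Counter(text.lower().split())
    # most_common() yields (word, count) pairs in descending frequency
    # order, so enumerating it (from 1) assigns ranks directly.
    return {word: {'frequency': f, 'rank': r}
            for r, (word, f) in enumerate(counts.most_common(), start=1)}

d = word_ranks("the cat and the dog and the bird")
print(d['the'])  # {'frequency': 3, 'rank': 1}
```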
Solutions
See the solutions.
Plotting frequency vs. rank
Define a function that takes the output of the previous exercise and plots frequency vs. rank on both a linear and a log-log scale.
Solutions
import sys
from functools import reduce

from matplotlib import pyplot as plt
from immutables import Map
from funcutils import tap
from icecream import ic
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")


def read_input(filename):
    """Read the entire contents of the given file."""
    with open(filename, 'r', encoding='utf-8') as file:
        text = file.read()
    return text


def tokenize_text(text):
    """Tokenize the input text into words."""
    return word_tokenize(text)


def update_count(fdict, item):
    """Update the frequency dictionary with the given item."""
    return fdict.set(item, fdict.get(item, 0) + 1)


def compute_frequencies(token_list):
    """Compute the frequency distribution of tokens in the list."""
    return reduce(update_count, token_list, Map())


def frequency_to_rank(freq_dict):
    """Convert a frequency dictionary to a frequency-rank dictionary."""
    sorted_items = sorted(freq_dict.items(), key=lambda x: x[1], reverse=True)
    rank_dict = {word: {'frequency': freq, 'rank': rank + 1}
                 for rank, (word, freq) in enumerate(sorted_items)}
    return rank_dict


def filter_tokens(token_list):
    """Filter out non-alphabetic tokens and convert to lowercase."""
    return [token.lower() for token in token_list if token.isalpha()]


def extract_data(filename):
    """Extract data from the input file and return the frequency-rank mapping."""
    return (
        # add rank information to the frequencies
        frequency_to_rank(
            tap(
                # compute the frequencies of the tokens
                compute_frequencies(
                    tap(
                        # filter out non-alphabetic tokens and convert to lowercase
                        filter_tokens(
                            # tokenize the text read from the corpus file
                            tokenize_text(read_input(filename))
                        ),
                        # this lambda lets tap look into the process
                        lambda x: ic(f'Tokenized {len(x)} items')
                    )
                ),
                # this lambda lets tap look into the process
                lambda x: ic(f'Computed frequencies for {len(x)} unique items')
            )
        )
    )


def plot_data(data):
    """Plot frequency vs. rank on linear and log-log scales side by side."""
    # extract ranks and frequencies
    pairs = [(info['rank'], info['frequency']) for info in data.values()]
    ranks = [r for r, f in pairs]
    freqs = [f for r, f in pairs]
    # plot the results
    fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4),
                                         layout="constrained")
    # left: linear scale
    ax_lin.plot(ranks, freqs, marker="o", markersize=1, linestyle="None")
    ax_lin.set_title("Linear plot")
    ax_lin.set_xlabel("Rank")
    ax_lin.set_ylabel("Frequency")
    # right: log-log scale
    ax_log.plot(ranks, freqs, marker="o", markersize=1, linestyle="None")
    ax_log.set_title("Log-log plot")
    ax_log.set_xlabel("Rank")
    ax_log.set_ylabel("Frequency")
    ax_log.set_yscale("log")
    ax_log.set_xscale("log")
    plt.show()


if __name__ == '__main__':
    data = extract_data(sys.argv[1] if len(sys.argv) > 1 else 'mtc-full.txt')
    plot_data(data)
What do you think? Does the data follow Zipf’s law?
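One way to answer this quantitatively rather than by eye: fit a line to the log-log data and inspect its slope, since Zipf's law predicts a slope near \(-1\). A sketch, assuming input in the {word: {'frequency': f, 'rank': r}} format produced by extract_data above (zipf_slope is a hypothetical helper name, not part of the solution):

```python
import numpy as np

def zipf_slope(data):
    """Estimate the power-law exponent k from a frequency-rank mapping.

    data: {word: {'frequency': f, 'rank': r}} as produced above.
    """
    ranks = np.array([info['rank'] for info in data.values()], dtype=float)
    freqs = np.array([info['frequency'] for info in data.values()], dtype=float)
    # least-squares line in log-log space; the slope estimates k
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# For perfectly Zipfian data the slope would be exactly -1; a real
# corpus typically lands somewhere in that neighbourhood.
toy = {w: {'frequency': 100 // r, 'rank': r}
       for r, w in enumerate(['a', 'b', 'c', 'd'], 1)}
print(zipf_slope(toy))
```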