Tokenization

The simplest built-in way to tokenize text in Python is the str.split() method, which splits a string on whitespace. It does not separate punctuation, so a period stays attached to the word before it.

text = "Ali ile Ayşe okula gitti. Sonra eve döndüler."
words = text.split()
print(words) # ['Ali', 'ile', 'Ayşe', 'okula', 'gitti.', 'Sonra', 'eve', 'döndüler.']
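Because split() only separates on whitespace, punctuation ends up glued to the neighboring word. One possible workaround that stays within the standard library is a small regular expression. The sketch below is only an illustration: the helper name simple_tokenize and the pattern are assumptions chosen for this example, not part of any library, and the pattern ignores cases such as abbreviations or apostrophes.

```python
import re

def simple_tokenize(text):
    """Hypothetical helper: split text into word and punctuation tokens.

    \\w+ matches runs of letters, digits, and underscores (Unicode-aware
    for str patterns in Python 3, so Turkish letters like ş and ö are kept).
    [^\\w\\s] matches any single character that is neither a word character
    nor whitespace, i.e. one punctuation mark per token.
    """
    return re.findall(r"\w+|[^\w\s]", text)

text = "Ali ile Ayşe okula gitti. Sonra eve döndüler."
print(simple_tokenize(text))
# ['Ali', 'ile', 'Ayşe', 'okula', 'gitti', '.', 'Sonra', 'eve', 'döndüler', '.']
```

This treats every punctuation character as its own token, which is a deliberately crude rule; a dedicated tokenizer is usually a better fit for real text.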

Tokenization with `NLTK`

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

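# Download the Punkt tokenizer models (needed once per environment).
# Newer NLTK releases may require "punkt_tab" instead of "punkt".
nltk.download("punkt")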

text = "Ali ile Ayşe okula gitti. Sonra eve döndüler."
print(sent_tokenize(text, language="turkish"))
#> ['Ali ile Ayşe okula gitti.', 'Sonra eve döndüler.']
print(word_tokenize(text, language="turkish"))
#> ['Ali', 'ile', 'Ayşe', 'okula', 'gitti', '.', 'Sonra', 'eve', 'döndüler', '.']
