Bauke Brenninkmeijer
Showcase the silent revolution of text tokenizers
MSc in CS and Data Science @Nijmegen
Data Scientist @ABNAMRO since 2019
Co-founder of DSFC
@baukebrenninkmeijer
During my work: nothing, unfortunately.
Outside of work: currently, also nothing.
But I'm a curious person, and I've worked with NLP before without ever diving deep into tokenizers.
In an attempt to protect myself from looking foolish, I will limit the scope of this talk to English.
Asian languages like Chinese often require different approaches, which I know little about.
tokens
'tokenizer'
't', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r'
'token', 'izer'
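The two splits above can be sketched in a few lines of Python: character-level tokenization just splits the string into characters, while the `'token', 'izer'` split can be approximated with a greedy longest-match over a vocabulary. The function names and the toy vocabulary are hypothetical, for illustration only; real subword tokenizers (BPE, WordPiece) learn their vocabularies from data.

```python
def char_tokenize(text):
    # Character-level tokenization: every character is a token.
    return list(text)

def greedy_subword_tokenize(word, vocab):
    # Greedy longest-match-first subword tokenization: a simplified,
    # hypothetical stand-in for BPE/WordPiece-style segmentation.
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until
        # we find a vocabulary entry (single characters always match).
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

vocab = {"token", "izer"}  # toy vocabulary for illustration
print(char_tokenize("tokenizer"))                 # one token per character
print(greedy_subword_tokenize("tokenizer", vocab))  # subword split
```

With this toy vocabulary, `"tokenizer"` comes out as `['token', 'izer']`, matching the split on the slide.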
Where are the tokenizers?
Please find the slides on my github: https://github.com/Baukebrenninkmeijer/diving-into-nlp-encoders
