Implement a tokeniser in Python
I am trying to implement a tokeniser in Python (without using the NLTK library) that splits a string into words using blank spaces. Example usage:
>>> tokens = tokenise1("a (small, simple) example")
>>> tokens
['a', '(small,', 'simple)', 'example']
I can think of a way using regular expressions, but the return value includes the white spaces, which I don't want. How do I correct the return value to match the example usage?
What I have so far is:
import re

def tokenise1(string):
    return re.split(r'(\s+)', string)
and it returns:
['', 'a', ' ', '(small,', ' ', 'simple)', ' ', 'example', '']
so I need to get rid of the white space in the return value.
The output contains the spaces because you capture them with the () group. Instead, you can split without the capture group:
>>> re.split(r'\s+', string)
['a', '(small,', 'simple)', 'example']
\s+ matches one or more whitespace characters (spaces, tabs, newlines).
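Putting it together, here is a minimal sketch of the corrected tokenise1. The filtering of empty strings is an extra safeguard I am adding: re.split produces '' entries when the input has leading or trailing whitespace (as in your captured output above), which your example input happens not to have.

import re

def tokenise1(string):
    # Split on runs of whitespace; without a capture group,
    # the separators themselves are not returned.
    # Drop empty strings that re.split produces when the input
    # starts or ends with whitespace.
    return [token for token in re.split(r'\s+', string) if token]

>>> tokenise1("a (small, simple) example")
['a', '(small,', 'simple)', 'example']
>>> tokenise1("  leading and trailing  ")
['leading', 'and', 'trailing']

As a design note, plain string.split() with no arguments also splits on runs of whitespace and already discards empty strings, so it would work here too if you don't need a regex.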