regex - Implement a tokeniser in Python -
I'm trying to implement a tokeniser in Python (without using NLTK libraries) that splits a string into words on blank spaces. Example usage is:

>>> tokens = tokenise1("a (small, simple) example")
>>> tokens
['a', '(small,', 'simple)', 'example']

I can think of a way using regular expressions, but the return value includes the white spaces, which I don't want. How do I correct the return value so it matches the example usage above?
What I have so far is:

import re

def tokenise1(string):
    return re.split(r'(\s+)', string)

and it returns:

['', 'a', ' ', '(small,', ' ', 'simple)', ' ', 'example', '']

so I need to get rid of the white space in the return value.
The output contains the spaces because you capture them with the parentheses (). Instead, split without a capturing group:

>>> re.split(r'\s+', string)
['a', '(small,', 'simple)', 'example']

\s+ matches one or more whitespace characters, and because the separator is not captured, it is dropped from the result.
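Putting this together, a minimal sketch of the full function might look like the following. Note that re.split still yields empty strings when the input has leading or trailing whitespace, so stripping the input first (an addition not in the original question) avoids that:

```python
import re

def tokenise1(string):
    # Strip first so leading/trailing whitespace does not
    # produce empty strings at the ends of the result.
    stripped = string.strip()
    if not stripped:
        return []
    # Split on runs of one or more whitespace characters;
    # no capturing group, so the separators are discarded.
    return re.split(r'\s+', stripped)

print(tokenise1("a (small, simple) example"))
# ['a', '(small,', 'simple)', 'example']
```

The empty-string guard matters because re.split(r'\s+', '') returns [''], not [].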