regex - Implement a tokeniser in Python


I am trying to implement a tokeniser in Python (without using the nltk library) that splits a string into words using blank spaces. Example usage:

>>> tokens = tokenise1("a (small, simple) example")
>>> tokens
['a', '(small,', 'simple)', 'example']

I can think of a way to do this with regular expressions, but the return value includes the whitespace, which I don't want. How do I correct the return value so it matches the example usage?

What I have so far is:

import re

def tokenise1(string):
    return re.split(r'(\s+)', string)

and it returns:

['', 'a', ' ', '(small,', ' ', 'simple)', ' ', 'example', ''] 

so I need to get rid of the whitespace in the return value.

The output contains the spaces because you capture them with (). Instead, you can split without the capturing group:

>>> re.split(r'\s+', string)
['a', '(small,', 'simple)', 'example']
  • \s+ matches one or more whitespace characters.
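
Putting it together, a minimal corrected version might look like this (a sketch, not the only way to do it; note that re.split can still produce empty strings at the ends of the list if the input has leading or trailing whitespace, so stripping the input first keeps the output clean):

import re

def tokenise1(string):
    # Strip first so leading/trailing whitespace doesn't produce
    # empty strings at the ends of the result list.
    return re.split(r'\s+', string.strip())

tokens = tokenise1("a (small, simple) example")
print(tokens)  # ['a', '(small,', 'simple)', 'example']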
