Web scraping using rvest


I want to scrape the text from the following website: http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageindex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172

My code:

  library(rvest)
  # read the page first; html_nodes() needs a parsed document, not a bare URL
  html <- read_html("http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageindex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172")
  main_content <- html_nodes(html, css = "#document_content")
  main_text <- main_content %>% html_nodes("p") %>% html_text()

However, this way not all of the text is extracted, because some of it sits inside <dd>...</dd> nodes.

I wonder whether something like html_nodes("p") or html_nodes("dd") or html_nodes("dt") can replace html_nodes("p") in the above code.

How can I achieve this? Or is there another way to accomplish the task? Ideally, I don't want to use

  main_text <- main_content   %>% html_text() 

because I want to keep each sentence separate.

When selecting with CSS, if you separate the nodes you want with a comma, it acts as a logical OR:

  library("rvest")
  url = "http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageindex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172"
  page <- read_html(url)
  main_text <- page %>%
    html_nodes("#document_content") %>%
    html_nodes("p,dd,dt") %>%
    html_text()
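
To try the comma-separated selector without hitting the live site, here is a minimal offline sketch. The HTML fragment and the name `snippet` are made up for illustration and only mimic the kind of structure described in the question (a `#document_content` container holding `p`, `dt` and `dd` nodes); the real page may be laid out differently.

```r
library(rvest)

# Hypothetical HTML fragment standing in for the real page (an assumption,
# not the actual markup of curia.europa.eu)
snippet <- '<div id="document_content">
  <p>First paragraph.</p>
  <dl><dt>Term</dt><dd>Definition text.</dd></dl>
  <p>Second paragraph.</p>
</div>'

# read_html() also accepts a literal HTML string
page <- read_html(snippet)

# "p, dd, dt" matches any node that is a <p>, a <dd> or a <dt>
main_text <- page %>%
  html_nodes("#document_content") %>%
  html_nodes("p, dd, dt") %>%
  html_text()

print(main_text)
```

Each matched node becomes its own element of the character vector, so the pieces of text stay separate rather than being merged into one string as with calling `html_text()` on the whole container.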
