Web scraping using rvest
I want to scrape the text from the following website: http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageindex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172
My code:
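library(rvest)
# Download and parse the page
html <- read_html("http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageindex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172")
# Restrict to the container that holds the document text
main_content <- html_nodes(html, css = "#document_content")
# Extract the text of every <p> node inside it
main_text <- main_content %>% html_nodes("p") %>% html_text()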
However, this way not all of the text is extracted, because some of it sits in <dd>...</dd> nodes.
I wonder whether something like html_nodes("p") or html_nodes("dd") or html_nodes("dt") could replace html_nodes("p") in the code above.
How can I achieve this? Or is there another way to accomplish the task? Ideally, I don't want to use
main_text <- main_content %>% html_text()
because I want to separate each sentence.
When selecting with CSS, separate the nodes you want with a comma; the comma acts as a logical OR:
library("rvest")
url <- "http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageindex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172"
page <- read_html(url)
main_text <- page %>%
  html_nodes("#document_content") %>%
  html_nodes("p,dd,dt") %>%
  html_text()
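
To try the comma selector without hitting the live site, here is a minimal sketch on a small inline document. The markup in sample_html is made up for illustration and only assumed to resemble the real page (text split across p, dt and dd nodes inside #document_content); html_name() is used to keep the tag next to each piece of text, which may help when separating the pieces later.

```r
library(rvest)

# Hypothetical stand-in for the real page: a tiny inline document with
# text spread over <p>, <dt> and <dd> nodes inside #document_content.
sample_html <- '<html><body><div id="document_content">
  <p>First paragraph.</p>
  <dl><dt>1</dt><dd>First numbered point.</dd></dl>
  <p>Second paragraph.</p>
</div><p>Outside the container.</p></body></html>'

page <- read_html(sample_html)

# "p, dd, dt" matches a node if it is a <p> OR a <dd> OR a <dt>
nodes <- page %>%
  html_nodes("#document_content") %>%
  html_nodes("p, dd, dt")

# One row per matched node: its tag name and its text
result <- data.frame(
  tag  = html_name(nodes),
  text = html_text(nodes),
  stringsAsFactors = FALSE
)
print(result)
```

Each matched node becomes its own element, so the text stays split per node rather than being merged into one string as with html_text() on the whole container. The paragraph outside #document_content is not selected.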