Web scraping using rvest

I want to scrape the text from the following website: http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageindex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172

My code:

  library(rvest)
  # read_html() is needed to fetch and parse the page; the URL must be a quoted string
  html <- read_html("http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageindex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172")
  main_content <- html_nodes(html, css = "#document_content")
  main_text <- main_content %>% html_nodes("p") %>% html_text()

However, this way not all of the text is extracted, because some of it sits in <dd>...</dd> nodes.

I wonder if something like html_nodes("p") or html_nodes("dd") or html_nodes("dt") can replace html_nodes("p") in the above code.

How can I achieve this? Or is there another way to accomplish the task? Ideally, I don't want to use

  main_text <- main_content %>% html_text()

because I want to keep each sentence separate.

When selecting with CSS, separate the nodes you want with a comma; the comma acts as a logical OR:

  library("rvest")
  url <- "http://curia.europa.eu/juris/document/document.jsf?text=&docid=49703&pageindex=0&doclang=en&mode=lst&dir=&occ=first&part=1&cid=656172"
  page <- read_html(url)
  main_text <- page %>%
    html_nodes("#document_content") %>%
    html_nodes("p,dd,dt") %>%
    html_text()
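To check the comma selector without hitting the live site, here is a minimal sketch that runs the same pipeline on an inline HTML string. The markup in `snippet` is a made-up stand-in, not the real structure of the curia.europa.eu page; only the `#document_content` id and the `p`, `dd`, `dt` tags are taken from the question.

```r
library(rvest)

# Hypothetical stand-in for the real page: a container holding a mix of
# <p>, <dt> and <dd> children, plus one <p> outside the container.
snippet <- '<div id="document_content">
  <p>First paragraph.</p>
  <dl><dt>Term</dt><dd>Definition text.</dd></dl>
  <span>Ignored, not in the selector.</span>
</div>
<p>Outside the container.</p>'

# read_html() also accepts a string of literal HTML
page <- read_html(snippet)

main_text <- page %>%
  html_nodes("#document_content") %>%  # restrict to the container
  html_nodes("p,dd,dt") %>%            # comma = match any of these tags
  html_text()

print(main_text)
```

Each matched node becomes its own element of the character vector, so the pieces stay separate instead of being collapsed into one string as with a single html_text() call on the container.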
