Type: | Package |
Title: | Scrape the Web with Extra Power |
Version: | 1.2 |
Date: | 2021-06-21 |
Author: | Julien Boelaert <jubo.stats@gmail.com> |
Maintainer: | Julien Boelaert <jubo.stats@gmail.com> |
Description: | Tools for scraping information from webpages and other XML contents, using XPath or CSS selectors. |
Depends: | R (≥ 4.0.0) |
License: | GPL (≥ 3) |
Imports: | XML, xml2, rvest, data.table, parallel |
Encoding: | UTF-8 |
NeedsCompilation: | no |
Packaged: | 2021-06-22 17:24:53 UTC; bart |
Repository: | CRAN |
Date/Publication: | 2021-06-23 07:00:02 UTC |
Scrape the Web with Extra Power.
Description
A set of tools for scraping information from webpages and other XML contents.
Details
Package: | scraEP |
Type: | Package |
Version: | 1.2 |
Date: | 2021-06-21 |
License: | GPL (>=3) |
Function xscrape
is a general tool to extract information from html pages into data frames using XPath or CSS queries.
Function unaccent
removes accents from character strings.
Function strcomp
compares the elements of two character vectors.
Author(s)
Julien Boelaert jubo.stats@gmail.com
Compare the contents of two vectors of character strings.
Description
This function compares the contents of two vectors of character strings: how many elements are present in both vectors, how many are only present in each one, etc., and lists the elements that are only present in one of the vectors.
Usage
strcomp(text1, text2)
Arguments
text1 , text2 |
two character vectors to be compared |
Value
A list containing the following elements:
matchTable
: a cross table of all the unique elements of the concatenation of text1 and text2, indicating whether they belong to text1 and/or to text2.
matchOneInTwo
: a table of the elements of text1 according to how many times they are present in text2.
matchTwoInOne
: a table of the elements of text2 according to how many times they are present in text1.
tabOneInTwo
: a table of the elements of text1 according to whether they are present in text2.
tabTwoInOne
: a table of the elements of text2 according to whether they are present in text1.
oneNotInTwo
: a vector containing the elements of text1 not present in text2.
twoNotInOne
: a vector containing the elements of text2 not present in text1.
Author(s)
Julien Boelaert jubo.stats@gmail.com
Examples
str1 <- c("Alice", "Alice", "Bob", "Carol")
str2 <- c("Bob", "Denise", "Emerson", "Foteini")
strcomp(str1, str2)
Remove all accents from character strings.
Description
This function removes all diacritic accents from character strings (eg. transforms "àüî" into "aui").
Usage
unaccent(text)
Arguments
text |
a character vector |
Value
A character vector, containing the unaccentuated version of the given text.
Author(s)
Julien Boelaert jubo.stats@gmail.com
Examples
unaccent("âàéèïì")
Wikipedia page for R.
Description
Toy example to extract data using xscrape
.
Usage
data(wiki)
Format
Object wiki
is a character vector.
Details
Object wiki
is a raw webpage, containing the source of the
'R (programming language)' article on English wikipedia
(<https://en.wikipedia.org/wiki/R_(programming_language)>, retrieved on 15/11/2017).
Author(s)
Julien Boelaert jubo.stats@gmail.com
Extract information from webpages to a data.frame, using XPath or CSS queries.
Description
This function transforms an html/xml page (or list of pages) into a data.frame, extracting nodes specified by their XPath.
Usage
xscrape(pages,
col.xpath = ".", row.xpath = "/html",
col.css = NULL, row.css = NULL,
collapse = " | ", encoding = NULL,
page.name = TRUE, nice.text = TRUE,
parallel = 0,
engine = c("auto", "XML", "xml2"))
Arguments
pages |
an object of class |
col.xpath |
a character vector of XPath queries used for creating the result columns. If the vector is named, these names are given to the columns. The default "." takes the text from the whole of each page or intermediary node (specified by |
row.xpath |
a character string, containing an XPath query for creating the result rows. The result of this query (on each page) becomes a row in the resulting data.frame. If not specified (default), the intermediary nodes are whole html pages, so that each page becomes a row in the result. |
col.css |
same as |
row.css |
same as |
collapse |
a character string, containing the separator that will be used in case a |
encoding |
a character string (eg. "UTF-8" or "ISO-8859-1"), containing the encoding parameter that will be used by |
page.name |
a logical. If TRUE, the result will contain a column indicating the name of the page each row was extracted from. If |
nice.text |
a logical. If TRUE (only possible with engine xml2), the rvest::html_text2 function is used to extract text into the result, often making the text much cleaner. If FALSE, the function runs faster, but the text might be less clean. |
parallel |
a numeric, indicating the number of cores to use for parallel computation. The default 0 takes all available cores. The parallelization is done on the pages if their number is greater than the number of provided cores, otherwise it is done on the intermediary nodes. Note that parallelization relies on parallel::mclapply, and is thus not supported on Windows systems. |
engine |
a character string, indicating the engine to use for data extraction: either "XML", "xml2", or "auto" (default). The default will adapt the engine to the type of |
Details
If a col.xpath
or col.css
query designs a full node, only its text is extracted. If it designs an attribute (eg. ends with '/@href' for weblinks), only the attribute's value is extracted.
If a col.xpath
or col.css
query matches no elements in a page, returned value is NA
. If it matches multiple elements, they are concatenated into a single character string, separated by collapse
.
Value
A data.frame, where each row corresponds to an intermediary node (either a full page or an XML node within a page, specified by row.xpath
or row.css
), and each column corresponds to the text of a col.xpath
or col.css
query.
Author(s)
Julien Boelaert jubo.stats@gmail.com
Examples
## Extract all external links and their titles from a wikipedia page
data(wiki)
wiki.parse <- XML::htmlParse(wiki)
links <- xscrape(wiki.parse,
row.xpath= "//a[starts-with(./@href, 'http')]",
col.xpath= c(title= ".", link= "./@href"),
parallel = 1)
## Not run:
## Convert results from a search for 'R' on duckduckgo.com
## First download the search page
duck <- XML::htmlParse("http://duckduckgo.com/html/?q=R")
## Then run xscrape on the dowloaded and parsed page
results <- xscrape(duck,
row.xpath= "//div[contains(@class, 'result__body')]",
col.xpath= c(title= "./h2",
snippet= ".//*[@class='result__snippet']",
url= ".//a[@class='result__url']/@href"))
## End(Not run)
## Not run:
## Convert results from a search for 'R' and 'Julia' on duckduckgo.com
## Directly provide the URLs to xscrape
results <- xscrape(c("http://duckduckgo.com/html/?q=R",
"http://duckduckgo.com/html/?q=julia"),
row.xpath= "//div[contains(@class, 'result__body')]",
col.xpath= c(title= "./h2",
snippet= ".//*[@class='result__snippet']",
url= ".//a[@class='result__url']/@href"))
## End(Not run)