Type: | Package |
Title: | Count Words and Characters in R Markdown and Jupyter Notebooks |
Version: | 0.3.1 |
Date: | 2025-05-20 |
Description: | Computes word, character, and non-whitespace character counts in R Markdown documents and Jupyter notebooks, with or without code chunks. Returns results as a data frame. |
Imports: | jsonlite, knitr, rstudioapi |
Suggests: | testthat |
License: | GPL-3 |
URL: | https://github.com/sigbertklinke/rmdwc |
Encoding: | UTF-8 |
RoxygenNote: | 7.3.2 |
NeedsCompilation: | no |
Packaged: | 2025-05-20 10:43:26 UTC; sigbert |
Author: | Sigbert Klinke [aut, cre] |
Maintainer: | Sigbert Klinke <sigbert@hu-berlin.de> |
Repository: | CRAN |
Date/Publication: | 2025-05-20 12:00:02 UTC |
Count text elements in Jupyter Notebook files
Description
This function extracts text from specific cell types (e.g., markdown) in one or more .ipynb files and counts the number of characters, words, and lines. It optionally excludes certain patterns (e.g., code fences).
The function uses the helper function rmdcount() to perform the counting on the extracted text.
Usage
ipynbcount(
  files,
  celltype = c("markdown"),
  space = "[[:space:]]",
  word = "[[:space:]]+",
  line = "\n",
  exclude = "```\\{.*?```"
)
Arguments
files |
character: vector of paths to .ipynb files |
celltype |
character: vector indicating which cell types to include (default is c("markdown")) |
space |
character: pattern to split a text at spaces (default: "[[:space:]]") |
word |
character: pattern to split a text at word boundaries (default: "[[:space:]]+") |
line |
character: pattern to split lines (default: "\n") |
exclude |
character: pattern to exclude text parts, e.g. code chunks (default: "```\\{.*?```") |
Details
This function assumes that the notebook files are valid JSON and contain a list of cells under the cells field.
It temporarily writes the extracted content to a file to reuse the rmdcount() logic.
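The extraction step described above can be sketched as follows. This is only an illustration under assumptions, not the package's actual code: the helper name extract_cells() and the tiny notebook are made up for the example, and jsonlite (listed under Imports) is used to parse the JSON.

```r
library(jsonlite)

# A tiny, made-up notebook with one markdown cell and one code cell
nb_json <- '{
  "cells": [
    {"cell_type": "markdown", "source": ["# Title\\n", "Some text here."]},
    {"cell_type": "code",     "source": ["x <- 1\\n", "print(x)"]}
  ],
  "nbformat": 4
}'
nb_file <- tempfile(fileext = ".ipynb")
writeLines(nb_json, nb_file)

# Hypothetical helper (not part of rmdwc): collect the source text
# of all cells whose type is listed in 'celltype'
extract_cells <- function(path, celltype = "markdown") {
  nb   <- fromJSON(path, simplifyVector = FALSE)
  keep <- Filter(function(cell) cell$cell_type %in% celltype, nb$cells)
  # 'source' may be a single string or a list of line fragments
  texts <- vapply(keep,
                  function(cell) paste(unlist(cell$source), collapse = ""),
                  character(1))
  paste(texts, collapse = "\n")
}

md_text  <- extract_cells(nb_file)                        # markdown only
all_text <- extract_cells(nb_file, c("markdown", "code")) # markdown and code

# Write the extracted text to a temporary file so that a file-based
# counter could be applied to it
tmp <- tempfile(fileext = ".Rmd")
writeLines(md_text, tmp)
```

With the package installed, the temporary file could then be passed to rmdcount(); the sketch stops before that step so that it runs without rmdwc.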
Value
A data frame with counts of characters, words, and lines for each file. Additional columns include file (base name) and path (directory).
Examples
file <- system.file('ipynb/example_data_analysis.ipynb', package="rmdwc")
ipynbcount(file) # without code
ipynbcount(file, celltype=c("markdown", "code")) # with code
Word, character and non-whitespace characters count
Description
rmdcount counts lines, words, bytes, characters and non-whitespace characters in R Markdown files excluding code chunks.
txtcount counts lines, words, bytes, characters and non-whitespace characters in plain text files.
Note that the counts may differ slightly from those of unix wc and LibreOffice because the result depends on the definition of a line, a word and a character.
Usage
rmdcount(
  files = NULL,
  space = "[[:space:]]",
  word = "[[:space:]]+",
  line = "\n",
  exclude = "```\\{.*?```"
)
txtcount(
  files = NULL,
  space = "[[:space:]]",
  word = "[[:space:]]+",
  line = "\n"
)
Arguments
files |
character: file name(s) |
space |
character: pattern to split a text at spaces (default: "[[:space:]]") |
word |
character: pattern to split a text at word boundaries (default: "[[:space:]]+") |
line |
character: pattern to split lines (default: "\n") |
exclude |
character: pattern to exclude text parts, e.g. code chunks (default: "```\\{.*?```") |
Details
We define:
- Line
the number of lines. It differs from unix wc -l since wc counts the number of newlines.
- Word
a character or sequence of characters delimited by white space. However, a "word" is in general a fuzzy concept; for example, is "3.141593" a word? Therefore different programs may count differently. For more details see the discussion of the LibreOffice bug "Word count gives wrong results - Another Example", Comment 5.
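The two definitions can be illustrated with base R alone. This is a minimal sketch, assuming that lines are counted by splitting at "\n" and words by splitting at white space; the helper names count_lines() and count_newlines() are made up for the example.

```r
# Two strings with the same visible lines, with and without a final newline
with_nl    <- "first line\nsecond line\n"
without_nl <- "first line\nsecond line"

# Hypothetical helpers: lines obtained by splitting vs. plain newline count
count_lines    <- function(txt, line = "\n") length(strsplit(txt, line)[[1]])
count_newlines <- function(txt) {
  lengths(regmatches(txt, gregexpr("\n", txt, fixed = TRUE)))
}

count_lines(with_nl)       # 2
count_newlines(with_nl)    # 2
count_lines(without_nl)    # 2
count_newlines(without_nl) # 1, which is what a newline-based count reports

# "3.141593" is one word when splitting at white space only ...
words_ws    <- strsplit("pi is 3.141593", "[[:space:]]+")[[1]]
# ... but two words if every non-alphanumeric character is a delimiter
words_alnum <- strsplit("pi is 3.141593", "[^[:alnum:]]+")[[1]]
length(words_ws)    # 3
length(words_alnum) # 4
```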
The following approach is used to detect lines, bytes, words, characters and non-whitespace characters.
- lines
strsplit(rmd, line)[[1]] with line='\n'
- bytes
charToRaw(rmd)
- words
strsplit(rmd, word)[[1]] with word='[[:space:]]+'
- characters
strsplit(rmd, '')[[1]]
- non-whitespace characters
strsplit(gsub(space, '', rmd), '')[[1]] with space='[[:space:]]'
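The expressions listed above can be combined into a small base R function and tried on a string. This is a sketch only: the name count_sketch() is made up, and it is an assumption that each count is the length of the corresponding result; the package's own implementation may differ in details.

```r
# Hypothetical helper that applies the listed expressions to one string
count_sketch <- function(rmd, space = "[[:space:]]", word = "[[:space:]]+",
                         line = "\n") {
  list(
    lines = length(strsplit(rmd, line)[[1]]),
    words = length(strsplit(rmd, word)[[1]]),
    bytes = length(charToRaw(rmd)),
    chars = length(strsplit(rmd, "")[[1]]),
    nonws = length(strsplit(gsub(space, "", rmd), "")[[1]])
  )
}

# ASCII-only text, so bytes and characters coincide here
txt <- "Hello world\nThis is R\n"
res <- count_sketch(txt)
str(res)
# lines = 2, words = 5, bytes = 22, chars = 22, nonws = 17
```

Note that strsplit() returns an empty first element when the text starts with a match, so text with leading white space would yield one extra "word" in this sketch.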
If rmdcount is used then code chunks are deleted with gsub('```\\{.*?```', '', rmd) before counting.
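The effect of the chunk pattern can be checked on a small made-up document. The sketch below uses only base R and the pattern shown above; the example strings are assumptions for illustration.

```r
# A made-up R Markdown fragment with one code chunk
rmd <- "Intro text.\n\n```{r}\nx <- 1\n```\n\nClosing text.\n"

# Remove everything from an opening fence followed by '{' up to the
# next fence; '.*?' is non-greedy, so each chunk is removed separately
stripped <- gsub("```\\{.*?```", "", rmd)

words_before <- strsplit(rmd, "[[:space:]]+")[[1]]
words_after  <- strsplit(stripped, "[[:space:]]+")[[1]]
length(words_before) # 9, including the fence markers and the code
length(words_after)  # 4: "Intro" "text." "Closing" "text."

# The pattern requires '{' directly after the opening fence, so a plain
# fenced block without braces is left untouched
plain <- "a\n```\ncode\n```\nb"
identical(gsub("```\\{.*?```", "", plain), plain) # TRUE
```

This relies on R's default regular expression engine, where "." also matches a newline; with perl=TRUE the pattern would need the (?s) modifier to span several lines.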
Value
A data frame with the following elements:
- file
basename of file
- lines
number of lines
- words
number of words
- bytes
number of bytes
- chars
number of characters
- nonws
number of non-whitespace characters
- path
path of file
Examples
# count excluding code chunks
files <- system.file('rmarkdown/rstudio_pdf.Rmd', package="rmdwc")
rmdcount(files)
# count including code chunks
txtcount(files) # or rmdcount(files, exclude='')
# count for a set of R Markdown docs
files <- list.files(path=system.file('rmarkdown', package="rmdwc"),
                    pattern="\\.Rmd$", full.names=TRUE)  # pattern is a regular expression, not a glob
rmdcount(files)
# use of rmdcount() in an R Markdown document
if (interactive()) {
  files <- system.file('rmarkdown/rstudio_pdf.Rmd', package="rmdwc")
  file.edit(files) # SAVE(!) the file and knit it
}
# count including code chunks
files <- system.file('rmarkdown/rstudio_pdf.Rmd', package="rmdwc")
txtcount(files)
rmdcountAddin
Description
Applies rmdcount to the current R Markdown document.
Usage
rmdcountAddin()
Value
nothing
Examples
if (interactive()) rmdcountAddin()
Word-, character and non-whitespace characters count for a text
Description
Counts words, characters and non-whitespace characters in a string. It is used in rmdcount; see the details there.
Usage
rmdwcl(rmd, space = "[[:space:]]", word = "[[:space:]]+", line = "\n")
Arguments
rmd |
character: R Markdown document as string |
space |
character: pattern to split a text at spaces (default: "[[:space:]]") |
word |
character: pattern to split a text at word boundaries (default: "[[:space:]]+") |
line |
character: pattern to split lines (default: "\n") |
Value
a list
Examples
file <- system.file('rmarkdown/rstudio_pdf.Rmd', package="rmdwc")
fcont <- readChar(file, file.info(file)$size)
rmdwcl(fcont)