regex - Regular expressions in R: pattern repetitions with {}


Question: 

I am having trouble with a regular expression in R. The goal is to parse a Markdown/reST/knitr report text file in R to remove my own custom comments. These comments are put in the following form:

Some sentence is about something <find a citation to this>.

As Markdown uses <> for HTML tags, I need to remove these comments (with my custom function) to avoid confusion. After I do that, the sentence takes the following form:

Some sentence is about something .

Note the space between the last word and the dot. It is easy to remove that, but the text may also contain reST comments that embed R code (knitr), and these begin with ..:

.. {r chunk-name}
.. some R code 
.. ..

So basically I need to replace the " ." in the former case, but not in the latter. I thought I would achieve this using the repetition modifier of R regexp atoms:

gsub(pattern=" \\.{1}",replacement=".",x="Something ..")
[1] "Something.."

I was expecting this expression to match a single space followed by exactly one dot, and no more. However, the string gets replaced regardless of whether there is one dot or two. I am a real newbie with this, so I am probably missing something obvious. Even so, any help will be greatly appreciated.

Regards, Maxim




3 Answers: 

You can remove everything from the last space up to the . and paste a . at the end of the string, no?

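# the example sentence from the question
tt <- "Some sentence is about something <find a citation to this>."
# capture everything up to the last space that precedes a "<",
# drop the rest of the line, then append the final dot
paste0(gsub("(.*)[ ].*<.*$", "\\1", tt), ".")
# [1] "Some sentence is about something."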

That said, you should really read this.

Alternatively, if the comments occur in the middle of a sentence and you just want to remove them together with the spaces in front of them, then:

# remove everything within <...> including < and >
# and any spaces preceding them
gsub("[ ]*<.*?>", "", tt)
# [1] "Some sentence is about something."

# example:
tt <- ".. some sentences are wrong <bla bla>. But some are <bla bla> right."
gsub("[ ]*<.*?>", "", tt)
# [1] ".. some sentences are wrong. But some are right."

Note the difference between .*> and .*?>. The first one is "greedy" in the sense that it'll match all characters up to the last >. The second one stops at the first > it finds, which is what you want here, because each comment should be removed separately.
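
To see the difference in isolation, here is a small sketch on a made-up string (the string and the variable names are mine, not from the question):

```r
# Two bracketed chunks in one string, without spaces, so the effect is easy to read.
x <- "a<b>c<d>e"

# Greedy: .* runs up to the LAST ">", so everything between the first "<"
# and the last ">" disappears in a single match.
greedy <- gsub("<.*>", "", x)
print(greedy)
# [1] "ae"

# Lazy: .*? stops at the FIRST ">", so each chunk is removed separately.
lazy <- gsub("<.*?>", "", x)
print(lazy)
# [1] "ace"
```

If you would rather not rely on lazy quantifiers, the negated class "<[^>]*>" gives the same result as the lazy version on this input.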

 

The match succeeds as soon as the pattern is satisfied: {1} only asks for one dot after the space, and nothing in the pattern looks ahead to check that another dot does not follow. I'm not sure it is general enough, but a negated character class works on the single test case offered:

> gsub(pattern=" \\.[^.]| \\.$",replacement=".",x="Something .")
[1] "Something."
> gsub(pattern=" \\.[^.]| \\.$",replacement=".",x="Something ..")
[1] "Something .."
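
To make the first point concrete: {1} means "exactly once", which is what an atom without a quantifier already means, so it adds nothing. The sketch below checks that, and it also shows a limitation of the pattern above on a string that is not in the question (the "One . two" input is my assumption): when the dot is followed by another character, [^.] consumes that character too. Capturing it and putting it back with \\1 is one possible workaround, not necessarily the most general one.

```r
x <- c("Something .", "Something ..", "One . two")

# " \\.{1}" and " \\." are the same pattern, so both also hit the first dot of ".."
with_quantifier    <- gsub(" \\.{1}", ".", x)
without_quantifier <- gsub(" \\.", ".", x)
print(identical(with_quantifier, without_quantifier))
# [1] TRUE

# The pattern from this answer leaves ".." alone,
# but it swallows the character that follows a single dot.
swallowed <- gsub(" \\.[^.]| \\.$", ".", x)
# result: c("Something.", "Something ..", "One.two")

# Variant: capture what follows the dot (a non-dot character, or the end
# of the string) and write it back with \\1.
kept <- gsub(" \\.([^.]|$)", ".\\1", x)
# result: c("Something.", "Something ..", "One. two")
print(kept)
```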
 

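You can accomplish what you want with a negative lookahead, which is available in Perl-compatible regular expressions (perl=TRUE). It basically says: match the pattern, but only if it is not followed by this other pattern. The lookahead itself consumes no characters, so whatever comes after the dot is left untouched. A quick example: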

> gsub(pattern=" \\.(?!\\.)",replacement=".",x="Something .", perl=TRUE)
[1] "Something."
> gsub(pattern=" \\.(?!\\.)",replacement=".",x="Something ..", perl=TRUE)
[1] "Something .."
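
Putting the pieces together, the whole clean-up could look roughly like the sketch below. The function name strip_comments is hypothetical (the question does not show the custom function), and it assumes that no line contains both < and > apart from the custom comments; a chunk line such as "if (a < b && c > d)" would be damaged by the first step.

```r
# Hypothetical helper: remove <...> comments, then tidy the " ." left behind.
strip_comments <- function(x) {
  # Step 1: drop each <...> comment (lazy match, so one comment at a time).
  x <- gsub("<.*?>", "", x, perl = TRUE)
  # Step 2: remove a space before a dot, unless that dot is followed by
  # another dot (which protects the reST ".." markers).
  gsub(" \\.(?!\\.)", ".", x, perl = TRUE)
}

report <- c(
  "Some sentence is about something <find a citation to this>.",
  ".. {r chunk-name}",
  ".. some R code",
  ".. .."
)

cleaned <- strip_comments(report)
print(cleaned[1])
# [1] "Some sentence is about something."

# The three reST lines come back unchanged.
print(identical(cleaned[2:4], report[2:4]))
# [1] TRUE
```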
 
