Homework3

Question 1

Find: \s{2,}  
Replace: ,

This expression finds every run of two or more consecutive whitespace characters and replaces it with a single comma.
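As a quick check outside the editor, the same substitution can be sketched with Python's `re` module. The sample row is invented for illustration and is not the homework data.

```python
import re

# Hypothetical sample row: fields separated by runs of two or more spaces.
row = "First String    Second      1.22      3.4"

# \s{2,} matches two or more whitespace characters, so the single
# space inside "First String" is left alone.
csv_row = re.sub(r"\s{2,}", ",", row)
print(csv_row)  # First String,Second,1.22,3.4
```

Because the quantifier starts at two, single spaces inside a field survive the replacement.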

Question 2

Find: ([A-Za-z]+),\s([A-Za-z]+),\s(.*)  
Replace: \2 \1 \(\3\)

This expression captures the first word, the second word, and the rest of the line, each separated by a comma and a space. The replacement swaps the first two words and wraps the remainder in parentheses.
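The reordering of capture groups can be tried in Python as a rough sketch. The name and institution are made up, and the second group is written as `[A-Za-z]+` so that it matches a whole word. In Python's replacement string, literal parentheses need no escaping, which differs from the Notepad++ replace box.

```python
import re

# Hypothetical input line in "Last, First, Affiliation" form.
line = "Smith, Jane, Example University"

# Group 1 = first word, group 2 = second word, group 3 = rest of the line.
pattern = r"([A-Za-z]+),\s([A-Za-z]+),\s(.*)"

# Swap the first two groups and wrap the third in parentheses.
swapped = re.sub(pattern, r"\2 \1 (\3)", line)
print(swapped)  # Jane Smith (Example University)
```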

Question 3

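Find: (\d{4} .+?\.mp3)\s*  
Replace: \1\n

This expression captures each four-digit number together with its title through the .mp3 extension, consumes any whitespace after it, and adds a line break so that each file ends up on its own line.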
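One way to try this kind of split outside the editor is Python's `re` module. The sketch below assumes a pattern that captures the four-digit number and the title as a single group and consumes the space that follows; the track names are invented.

```python
import re

# Hypothetical single-line listing of files.
listing = "0001 First Tune.mp3 0002 Second Tune.mp3 0003 Third Tune.mp3"

# Lazy .+? stops at the first ".mp3"; \s* eats the separating space
# so the next entry starts cleanly on its own line.
one_per_line = re.sub(r"(\d{4} .+?\.mp3)\s*", r"\1\n", listing)
print(one_per_line)
```

The lazy quantifier matters here: a greedy `.+` would run to the last `.mp3` on the line and treat the whole listing as one match.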

Question 4

Find: (\d{4})\s(.+?)(\.mp3)  
Replace: \2_\1\3

This expression moves the four-digit number after the title, joined by an underscore, and keeps the .mp3 extension at the end.
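A possible Python sketch of this rearrangement follows. It assumes the intended result keeps the extension last, so the extension gets its own capture group; that assumption and the track names are mine, not taken from the assignment text.

```python
import re

# Hypothetical listing, one file per line.
listing = "0001 First Tune.mp3\n0002 Second Tune.mp3"

# Group 1 = number, group 2 = title, group 3 = extension.
# The \s between groups 1 and 2 drops the space after the number.
renamed = re.sub(r"(\d{4})\s(.+?)(\.mp3)", r"\2_\1\3", listing)
print(renamed)
```

Since `.` does not match a newline by default, each line is rewritten independently.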

Question 5

Find: (\w)(\w+),(\w+),[^,]+,([^,]+)  
Replace: \1_\3,\4

This expression reduces the first name to its first letter followed by an underscore, keeps the second name in full, and drops the third field so that only the last field follows the comma.

Question 6

Find: (\w)(\w+),(.{4})\w+,[^,]+,([^,]+)  
Replace: \1_\3,\4

This expression shortens the species names further by keeping only the first four characters of the second name (the third capture group).
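The patterns from Questions 5 and 6 can be compared side by side in Python. This is only a sketch on one invented row. In Python, `[^,]+` also matches newline characters, so the example applies the patterns to a single row rather than to a multi-line block.

```python
import re

# Hypothetical row: two names followed by two numeric fields.
row = "Bombus,impatiens,10.2,44"

# Question 5: initial of the first name, full second name, last field.
q5 = re.sub(r"(\w)(\w+),(\w+),[^,]+,([^,]+)", r"\1_\3,\4", row)

# Question 6: same idea, but only the first four characters of the second name.
q6 = re.sub(r"(\w)(\w+),(.{4})\w+,[^,]+,([^,]+)", r"\1_\3,\4", row)

print(q5)  # B_impatiens,44
print(q6)  # B_impa,44
```

Group 2 is captured but never used in the replacement, which is how the rest of the first name is discarded.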

Question 7

Find: (\w{3})\w+,(\w{3})\w+,(\d.+),(\d+)  
Replace: \1\2, \4, \3

This expression fuses the first three letters of each of the two species names, swaps the order of the two numeric fields, and adds a space after each comma.
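The same find and replace can be sketched in Python on an invented row, to confirm which field ends up where.

```python
import re

# Hypothetical row: two names followed by two numeric fields.
row = "Bombus,impatiens,10.2,44"

# Groups 1 and 2 = first three letters of each name,
# group 3 = first numeric field, group 4 = last numeric field.
pattern = r"(\w{3})\w+,(\w{3})\w+,(\d.+),(\d+)"

fused = re.sub(pattern, r"\1\2, \4, \3", row)
print(fused)  # Bomimp, 44, 10.2
```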

Question 8

In the "pathogen_binary" column, I would expect to see only 1s and 0s. These are the only two valid values for binary data; therefore, the NAs should be excluded from the analysis.

The regular expression [^a-zA-Z0-9\s\n\r] can be used to clean both the "bombus_sp" and "host_plant" columns. It matches any character that is not a letter, a digit, or whitespace. I pasted each column into Notepad++ separately and ran this search to remove all special characters, which leaves the letters, digits, spaces, and line breaks intact.

The "bee_caste" column has leading and trailing white space that needs to be removed. This white space can be matched with ^[\s]+|[\s]+$ and replaced with "" (an empty string).
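Both clean-up patterns can be tested on single values in Python. The helper names `strip_special` and `trim_edges` and the sample values are hypothetical, chosen only for this sketch.

```python
import re

# Any character that is not a letter, digit, or whitespace.
SPECIAL = re.compile(r"[^a-zA-Z0-9\s\n\r]")
# Whitespace at the start or at the end of the value.
EDGE_WS = re.compile(r"^[\s]+|[\s]+$")


def strip_special(value: str) -> str:
    """Remove special characters, keeping letters, digits, and whitespace."""
    return SPECIAL.sub("", value)


def trim_edges(value: str) -> str:
    """Remove leading and trailing whitespace only."""
    return EDGE_WS.sub("", value)


print(strip_special("Bombus impatiens*"))  # Bombus impatiens
print(repr(trim_edges("  worker \t")))     # 'worker'
```

The second pattern only touches the ends of the value, so a space inside a two-word entry is preserved. The first pattern also removes underscores and hyphens, since they are neither letters, digits, nor whitespace.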

Sam Troast

2025-02-05