Edit

My question was very badly written, but the new title reflects the actual question. Thanks to 3 very friendly and dedicated users (@harsh3466 @tuna @learnbyexample) I was able to find a solution for my files, so thank you guys!!!

For anyone who comes across this post, here are 3 possible ways to achieve the desired result.

Solution 1 (https://lemmy.ml/post/25346014/16383487)

#! /bin/bash
files="/home/USER/projects/test.md"

mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"

while IFS= read -r line; do
	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
	sed -i "s/$line/${dashlink}/" "$files"

	#Puts everything to lowercase after a hashtag
	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
	sed -i "s/$dashlink/${lowercaselink}/" "$files"

	#Removes spaces (%20) from markdown links after a hashtag
	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
	sed -i "s/$lowercaselink/${spacelink}/" "$files"

done <<<"$mdlinks2"

Solution 2 (https://lemmy.ml/post/25346014/16453351)

sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'

Solution 3 (https://lemmy.ml/post/25346014/16453161)

perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'

Relevant links

https://mike.bailey.net.au/notes/software/apps/obsidian/issues/markdown-heading-anchors/#background


Hi everyone !

I need some assistance with string manipulation using sed and regex. I spent a whole day on trial and error and searching the web for a solution, but it’s way beyond my capabilities. Maybe there are some sed/regex gurus here who are willing to give me a helping hand!

From everything I gathered around the web, it seems to be a rather complicated regex and sed substitution, so here we go!

What Am I trying to achieve?

I have a lot of markdown guides I want to host on a self-hosted, Forgejo-based git server. However, the classic markdown links are not the same as the ones on GitHub/Forgejo…

Convert the following string:

[Some text](#Header%20Linking%20MARKDOWN.md)

Into

[Some text](#header-linking-markdown.md)

As you can see, these are the requirements:

  • Pattern: [Some text](#link%20to%20header.md)
  • Only edit what’s between parentheses
  • Replace space (%20) with -
  • Everything as lowercase
  • Links are sometimes in nested parentheses
    • e.g. (look here [Some text](#link%20to%20header.md))
  • Do not change a line that begins with https (external links)

While the whole thing is probably a bit complex, the trickiest part is probably the nested parentheses :/

What I tried

The furthest I got was the following:

sed -Ei 's|\(([^\)]+)\)|\L&|g' test3.md #make everything between parentheses lowercase

sed -i '/https/ ! s/%20/-/g' test3.md #change every %20 occurrence to -

These sed/regex substitutions are what I put together while roaming the web, but they have a lot of flaws and don’t work with nested parentheses. Also, the second command changes every %20 occurrence on any line that does not contain https, not just those inside links.
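The flaw in the first command can be seen with a small test. The sketch below runs the lowercase substitution on a made-up line (the variable names sample and result are illustrative, and \L assumes GNU sed): everything inside any pair of parentheses is lowercased, including an external URL and ordinary prose.

```shell
# Hypothetical sample: an external link followed by a parenthesised remark.
sample='See [Docs](https://Example.com/Page) (Also Here)'

# Same substitution as above, reading from a pipe instead of editing a file.
result="$(printf '%s\n' "$sample" | sed -E 's|\(([^\)]+)\)|\L&|g')"

printf '%s\n' "$result"
```

Both the URL and the remark come back in lowercase, which is why the pattern needs to be anchored to the markdown link syntax rather than to parentheses alone.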

The closest solution I found on Stack Overflow looks similar, but I wasn’t able to fit it to my needs. My lack of regex/sed understanding makes it impossible for me to adapt it to my requirements.


I would appreciate any help, even if a change of tool is needed. However, I’m mostly interested in the learning process, so a script or CLI alternative would be very welcome :) Actually, any help is appreciated :D !

Thanks in advance.

  • N0x0n@lemmy.mlOP
    9 months ago

    Sure :)

    I don’t know if it’s still a thing, but in the past some web URLs had spaces in their addresses, e.g.

    https://www.my/%20website%20with%20spaces.com
    

    In markdown you can link to external web addresses like so

    [some link to a web address](https://my/%20website%20with%20spaces.com)
    

    However, /https/ ! s|%20|-|g replaces all occurrences of %20 (which represents a space in a URL? Sorry if I’m wrong here :s still have a lot to learn) with -. This would break the link to the web URL: [some link to a web address](https://my-website-with-spaces.com/). Am I wrong here?
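One way to settle this is to run the command on a few sample lines. In the sketch below the sample text and the variable names sample and result are made up for illustration. The address /https/! tells sed to skip every line that contains https, so the external link is left alone, while %20 anywhere else is replaced.

```shell
# Hypothetical sample: an internal link, an external link, and plain text.
sample='[Some text](#Header%20Linking)
[some link](https://example.com/with%20spaces)
Plain text with %20 in it'

# Replace %20 with - on every line that does NOT contain "https".
result="$(printf '%s\n' "$sample" | sed '/https/!s/%20/-/g')"

printf '%s\n' "$result"
```

So the https URL survives, but the command is still too broad in two ways: it also rewrites %20 in ordinary text, and it skips an internal link that happens to share a line with an https link.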


    If I may, I just found something else that doesn’t quite work 😅 and it seems a bit harder to fix, I think! Sometimes I have links in this form:

    [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)
    

    As you can see, I prefix the header with 1.3, but as dumb as it is… that part also needs to be converted, to 1-3-subtitles

    e.g.

    [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)
    

    Needs to become

    [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1-3-Subtitles)
    

    Sorry for my bad English, I’m trying my best haha! Hope it’s comprehensible.

    Edit:

    I don’t know why, but Lemmy adds /%20 instead of %20 in my fake URLs ://

    • harsh3466@lemmy.ml
      9 months ago

      Okay. To address the %20, the https links, and the placeholder links, I came up with a bash script.

      Because of the variation in the links, I didn’t try to write a single sed command that matches %20 only in anchor and placeholder markdown links while ignoring https links and all other text in the document. Instead, the script first collects the links that need changing and then fixes them one at a time.

      To do that, I used grep, a while loop, IFS, and sed.

      Here’s the script:

      #! /bin/bash
      
      mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn"
      
      while IFS= read -r line; do
      	dashlink="$(echo "$line" | sed 's/%20/-/g')"
      	sed -i "s/$line/${dashlink}/" /path/to/file
      done <<<"$mdlinks"
      

      I’m not sure how familiar you are with bash scripting, so I’ll do the same breakdown:

      #! /bin/bash - This tells the shell what interpreter to use for the script. In this case it’s bash.

      mdlinks="$(grep -Po ']\((?!https).*\)' /path/to/file" - This line uses grep to search for markdown link enclosures excluding https links and to output only the text that matches and saves all of that into a variable called mdlinks. Each link match will be a new line inside the variable.

      The breakdown of the grep command is as follows:

      grep - invokes the grep command

      -Po - two command flags. The P tells grep to use perl regular expressions. The o tells grep to only print the output that matches, rather than the entire line.

      ' - opens the regex statement

      ]\( - finds a closing bracket followed by an opening parentheses

      (?!https) - This is a negative lookahead, which is a feature available in Perl regex. It tells grep not to match if the https string comes next. The parentheses enclose the negative lookahead. The ?! is what indicates it’s a negative lookahead, and https is the string to look for and reject.

      .*\) - matches everything that follows, up to and including the last closing parenthesis on the line.

      ' - closes the regex statement

      /path/to/file - the file to search for matches

      while IFS= read -r line; do - this starts a while loop that reads its input one line at a time. The read command does what it says and reads one line of the input per iteration, in this case from our variable mdlinks. Setting the Internal Field Separator to empty (IFS=) for the read command stops read from trimming leading and trailing whitespace from each line. The -r flag tells read not to treat the backslash as an escape character, so it stays a normal part of the input. line is the variable that each line will be saved in as they are worked through the loop. The ; ends the while setup, and do opens the loop for the commands we want to run using the input saved in line.
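The effect of IFS= and -r is easiest to see side by side. This is a minimal sketch with a made-up input line (the names input, kept and stripped are illustrative). A here-document is used so the snippet also runs in plain sh; in bash, <<<"$input" would feed read the same way.

```shell
# Hypothetical line with leading spaces and a backslash in it.
input='  ](#a\b%20c)'

# With IFS= and -r the line is stored exactly as written.
IFS= read -r kept <<EOF
$input
EOF

# Without them, leading whitespace is trimmed and the backslash is consumed.
read stripped <<EOF
$input
EOF

printf '[%s]\n' "$kept" "$stripped"
```

For markdown links this rarely matters, but using IFS= read -r by default means the loop never silently alters a line before sed sees it.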

      dashlink="$(echo "$line" | sed 's/%20/-/g')" - This command sequence runs the markdown link saved in the line variable into sed to find all instances of %20 and replace them with a -.

      dashlink - the variable we’re saving the new link with dashes to.

      = - separates the variable from the input being saved into the variable.

      " - opens this command string for variable expansion.

      $ - tells bash to do command substitution, meaning that the output of the following commands will be saved to the variable, rather than the actual text of the commands that follows.

      ( - opens the command set

      echo - prints the given text or arguments to standard output, in this case the given argument is the variable $line

      " - tells bash to expand any variables contained within the quotes while protecting the result from word splitting and filename expansion, so spaces or special shell characters saved in the variable are passed through unchanged.

      $line - the variable containing our active markdown link from the text document

      " - the closing quote ending the argument and the expansion enclosure

      | - This is a pipe, which redirects the standard output of the command on the left into the command on the right. Meaning we’re taking the markdown link currently saved in the variable and feeding it into sed

      sed - invokes sed so we can manipulate our text, and because sed is receiving redirected input, and we’ve specified no flags, the modified text will be printed to standard output.

      's/%20/-/g' - Our pattern match/substitution, which will find all occurrences of the string %20 in the markdown link fed into sed and replace them with -.

      )" - closes our command sequence for command substitution, and the variable expansion. At this point the text printed to standard output by sed is saved to the variable dashlink

      The next line is: sed -i "s/$line/${dashlink}/" /path/to/file, which uses sed to take the line and dashlink variables and use them to find the exact original markdown link in the text containing the %20 sequences, and replace it with the properly formatted markdown link with dashes.

      sed -i - invokes sed and uses the -i flag to edit the file in place.

      " - The double quote enclosure allows the expansion of variables in the pattern match/replacement sequence so it searches for the markdown link, and not the literal text string $line.

      s/ - opens our match/modify sequence.

      $line - the original markdown link that will be found

      / - ends the pattern matching section

      ${dashlink} - The variable containing the previously modified markdown link that now has dashes. This expands to that properly formatted link which will be written into the text file replacing the malformed link. The curly braces are optional here: $dashlink would expand the same way, since the braces are only required when the variable name is directly followed by characters that could be part of a name.

      /" - ends the text modification section and closes the variable expansion.

      /path/to/file - the file to be worked on
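One caveat with this step: sed treats the expanded $line as a regular expression, not as literal text, so a . in the link matches any character and a / would end the pattern early. A possible way around that, sketched here under the assumption of GNU sed and with made-up sample values (line, new, text, naive, escaped and careful are illustrative names), is to backslash-escape the special characters before using the link as a pattern.

```shell
# Hypothetical link to find, its replacement, and a text with a near-miss.
line='](a.md#1.3%20Sub)'
new='](a.md#1-3-sub)'
text='see [x](aXmd#1.3%20Sub) and [y](a.md#1.3%20Sub)'

# Unescaped: the "." in the pattern also matches the "X" in the first link.
naive="$(printf '%s\n' "$text" | sed "s/$line/$new/")"

# Escape [ \ . * ^ $ and / so the pattern only matches the literal link.
escaped="$(printf '%s\n' "$line" | sed 's|[[\.*^$/]|\\&|g')"
careful="$(printf '%s\n' "$text" | sed "s/$escaped/$new/")"

printf '%s\n' "$naive" "$careful"
```

The replacement side has a similar limitation: a &, \ or / in the new link would need escaping too. The sample replacement here contains none of them.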

      Finally we have done<<<"$mdlinks", which ends the while loop and feeds the mdlinks variable into it.

      done - closes the while loop

      <<< - This feeds the given argument into the while loop for processing

      " - expands the variable within while ignoring nonstandard characters

      $mdlinks - the variable we’re feeding in with all of our links containing %20, except for https links.

      " - closes the variable expansion.

      If you’ve never written/created your own bash script, here’s what you need to do.

      • In your home directory, or in the directory you’re working in with these files, use a text editor like vim, nano, gedit, kate, or whatever plain text editor you want to create a new file. Call the file whatever you want.

      • Paste the entirety of the script text into the file. Modify the file paths as needed to point at the file you want to work on. If you’re working on multiple files, you’ll need to update the script for each new file path as you finish one and move on to the next.

      • Save and exit the file.

      • Make the file executable at the terminal with chmod +x /path/to/script/file (sudo is only needed if you don’t own the file).

      • To run it:

        • Change directory to the directory that contains the script file (if you’re not already there)
        • At the command line, use the command ./name-of-script-file
      • N0x0n@lemmy.mlOP
        9 months ago

        First, thanks again for sharing your knowledge with me. I really appreciate the time/effort you took to write all of this. I know that’s a lot of thank-yous :/ but I’m really grateful, this is very valuable information I will keep in my knowledge base. It’s really time I learn proper bash/Python/Perl scripting with all those tools (grep/sed/regex).

        Second, YOU MISSED A DAMNED parenthesis you fool xD ! mdlinks="$(grep -Po ']\((?!https).*\)' ~/mkdn)" It took me some time to figure it out with a very uninformative error (bashscript.sh: line 8: unexpected EOF while looking for matching "') but as expected it works !

        From
        -------
        [Just a test](#Just%20a%20test.md)
        [Just a link](https://mylink/%20with%20space.com)
        %20
        
        To
        -------
        [Just a test](#Just-a-test.md)
        [Just a link](https://mylink/%20with%20space.com)
        %20
        

        Next, to show you my appreciation and not take everything for granted and be spoon-fed, I tried to find a solution myself for something else. I will try to explain as best I can how I solved it.

        From
        -------
        [Just a test](Another%20markdown%20file.md#Hello%20World)
        
        To
        -------
        [Just a test](Another%20markdown%20file.md#hello-world)
        

        The part before the hashtag needs to keep its initial form (it links to the original markdown file). So, instead of just playing around with Perl and regex (which doesn’t end well when done blindly without the proper knowledge), I did some simple string manipulation. It’s not very elegant but does the trick, thanks to your well-written breakdown.

        • I printed out the $mdlinks variable just to see what it prints out
        • Copied and changed your Perl/regex to find the first hashtag (#) and save everything from there on into a new variable ($mdlinks2)
        • Fed your $mdlinks variable into my new Perl/regex
        • Feed my new variable into done? (I’m a bit confused here but okay xD)
        #! /bin/bash
        mdlinks="$(grep -Po ']\((?!https).*\)' "/home/dany/newtest.md")"
        echo $mdlinks
        
        mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
        echo $mdlinks2
        
        while IFS= read -r line; do
        	dashlink="$(echo "$line" | sed 's|%20|-|g')"
        	sed -i "s/$line/${dashlink}/" "/home/dany/newtest.md"
        done <<<"$mdlinks2"
        

        Yes, not very elegant but It’s the best I could do currently :/ However, I still got a YES effect :P
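To see what each variable holds at every step, the same commands can be run on sample text instead of a file. This is only a sketch: the sample lines reuse the examples above, it assumes a grep with -P support (GNU grep), and printf piped into grep stands in for the file argument and the <<< here-string.

```shell
# Sample text standing in for the markdown file.
sample='[Just a test](Another%20markdown%20file.md#Hello%20World)
[Just a link](https://mylink/%20with%20space.com)'

# Step 1: keep only link targets that do not start with https.
mdlinks="$(printf '%s\n' "$sample" | grep -Po ']\((?!https).*\)')"

# Step 2: keep only the part from the first # onwards.
mdlinks2="$(printf '%s\n' "$mdlinks" | grep -Po '#.*')"

# Step 3: what the loop body computes for that fragment.
dashlink="$(printf '%s\n' "$mdlinks2" | sed 's|%20|-|g')"

printf '%s\n' "$mdlinks" "$mdlinks2" "$dashlink"
```

Because only the fragment after the # is passed to the loop, the %20 in the file name before the # is never part of the text being replaced.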


        To answer your question:

        Quick question as I’m working on this, in the new link example, is the BDMV and other capitalized text in this link supposed to be converted to lowercase, or to remain uppercase?

        As you can see in my string manipulation above, the part before the # needs to keep its original form :) (Sorry, I wasn’t aware of this before working with the original files.) I solved it with some string manipulation as shown above.

        I’m a bit tired from all this searching/trial & error. Tomorrow I will try to wrap everything up and answer your post below :) ! Also, I need to clean up the mess I made in my home directory xD.

        Thanks again for your help ! Have a good night/day !

        • harsh3466@lemmy.ml
          9 months ago

          Oh god! I’m sorry about the missing )! I must have dropped it when copying things from my notes over to post the comment! (≧▽≦)

          Despite my error, I’m glad it worked, and even happier that you were able to take what we had worked out and modify it further to fit your other requirements. It’s fun helping each other out, and it’s also great learning.

          I learn by problem solving, so I’ve got all my notes from working on this in my knowledge base as well!

          In the future, feel free to ping me if you need help with other linux/cli/bash things. As I’ve mentioned before I’m no expert, but happy to help where I can.

          • N0x0n@lemmy.mlOP
            9 months ago

            Hello :) I promise this is the last time I will bother you (I know what you are going to say :P) ! If it’s not too much, could you give me just a few hints on how I could improve the final script a bit?

            #! /bin/bash
            
            files="/home/USER/projects/test.md"
            
            mdlinks="$(grep -Po ']\((?!https).*\)' "$files")"
            mdlinks2="$(grep -Po '#.*' <<<$mdlinks)"
            
            while IFS= read -r line; do
            	#Converts 1.2 to 1-2 (For a third level heading needs to add a supplementary [0-9]) 
            	dashlink="$(echo "$line" | sed -r 's|(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"
            	sed -i "s/$line/${dashlink}/" "$files"
            
            	#Puts everything to lowercase after a hashtag
            	lowercaselink="$(echo "$dashlink" | sed -r 's|#.+\)|\L&|')"
            	sed -i "s/$dashlink/${lowercaselink}/" "$files"
            
            	#Removes spaces (%20) from markdown links after a hashtag
            	spacelink="$(echo "$lowercaselink" | sed 's|%20|-|g')"
            	sed -i "s/$lowercaselink/${spacelink}/" "$files"
            
            done <<<"$mdlinks2"
            

            This works perfectly and fulfills all my needs (thanks !!) ! However, I’m not very fond of the variable string manipulation ($mdlinks2). If you have some tips without spoiling too much, that would be great; otherwise it’s okay, it works exactly how I imagined it and ticks all use cases. Also, if you could give some pointers for an overall improvement, or if you see something that could potentially create some strange loop or looks off, feel free to comment in your spare time :).


            Another question which has nothing to do with the post and gets a bit off topic… You gave me the push I needed and I saw the power and usefulness of proper knowledge of sed/bash/Perl. It’s time I finally learn a scripting language ! I want to hear your opinion on which tools you would recommend. Most people would say Python for beginners, but I heard so many good things about Perl (Exiftool is a good example of how powerful Perl can be), though the syntax scares me a little bit compared to Python.

            Any good book material you have in mind for a beginner?

            Thanks again for everything !!!

            • harsh3466@lemmy.ml
              9 months ago

              Hello! I will take a look at it, I just haven’t had a chance over the last day. Give me a couple days and I will give some feedback. Bear in mind I am not an expert, so I might not have much to offer, but I’ll share what I can. :)

              • N0x0n@lemmy.mlOP
                9 months ago

                Hey, take your time :) Don’t worry even if you forget, you did more than enough to help some random person on the web ! 2 other users came up with plain/bare-bones regex solutions, if you want to have a look, and maybe there’s something you can learn from them? (I doubt it xD).

                Plain sed regex (https://lemmy.ml/post/25346014/16453351)

                sed -E ':l;s/(\[[^]]*\]\()([^)#]*#[^)]*\))/\1\n\2/;Te;H;g;s/\n//;s/\n.*//;x;s/.*\n//;/^https?:/!{:h;s/^([^#]*#[^)]*)(%20|\.)([^)]*\))/\1-\3/;th;s/(#[^)]*\))/\L\1/;};tl;:e;H;z;x;s/\n//;'
                

                Plain Perl regex (https://lemmy.ml/post/25346014/16453161)

                perl -pe 's/\[[^]]+\]\((?!https?)[^#]*#\K[^)]+(?=\))/lc $&=~s:%20|\d\K\.(?=\d):-:gr/ge'
                

                Nonetheless, I really prefer your solution because, as someone else said, I will have an easier time changing a script I “understand”. Soo thanks again !

    • harsh3466@lemmy.ml
      9 months ago

      Quick question as I’m working on this, in the new link example, is the BDMV and other capitalized text in this link supposed to be converted to lowercase, or to remain uppercase?

      Edit: expanded the question to question case in the whole link

      • N0x0n@lemmy.mlOP
        9 months ago

        Hello !!!

        Sorry for the very late response, I had something else to do. I will read everything carefully and respond to every post :) I also thought about it overnight and I think that sed and regex weren’t the best option here (as others have mentioned).

        I think a Python script or bash (as you mentioned a bit later) would be a better way. I’m sorry that I put you through all of this… wrong tool for the job :s.

    • harsh3466@lemmy.ml
      9 months ago

      Don’t worry or apologize about your English. I’m having no trouble understanding. :)

      I’m going to take the second part first and come back with another comment to address the %20 and https bits.

      So these variations, like [1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles), are where you would start to craft a new expression. Trying to catch every variation in a single expression would get too complicated and would be more likely to fail and/or modify text you don’t want modified.

      So in this case, here’s the expression I’d use:

      sed -ri 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|' somefile

      And the breakdown:

      sed -ri calls sed with extended regular expressions enabled (-r) and tells it to edit the file in place (-i)

      's| - Begins the pattern match|modify expression

      ( - This very first opening parentheses is a special metacharacter that is used to group a sub-expression within the larger expression. By doing this we can create variables that we can refer to in the modification portion of the command.

      ]\( - Find the closing bracket character and an opening parentheses character, which we know will be the beginning of a markdown url. The backslash precedes the open parentheses to escape it and indicate it needs to look for the actual open parentheses character

      .+ - Find any character (indicated by the .) one or more times (indicated by the +). This will find any characters until it gets to the next specified character in the expression

      [0-9]+ - This is two parts. The first part is [0-9]. The brackets are metacharacters in regex that enclose a character set to match from. In this case the character set is the numbers zero to nine. What this means on its own is that sed will look for one occurrence of any number between zero and nine. The + tells sed to find one or more occurrences of a number between zero and nine until it gets to the next portion of the pattern. I did this because I don’t know the upper bounds of the documentation numeration you’re working with in the links. If all the links only contain single digit numbers before the decimal, you can remove the +.

      ) - This closing parentheses marks the end of the subexpression that we want to refer to. In this case, the sub expression is capturing from the closing bracket up to (but not including) the decimal in the number.

      \. - This tells sed to find the period/dot/decimal character in the number. It’s preceded by the backslash because the period/dot/decimal character is a metacharacter in regular expressions.

      ( - This is the beginning of a new subexpression

      [0-9]+ - The numeral capture repeats to find the number after the period/dot/decimal. Similarly to the number before the decimal, if the number after the decimal is only ever single digit, the + can be removed.

      .+ - Find any character (indicated by the .) one or more times (indicated by the +). This will find any characters until it gets to the next specified character in the expression, taking us to the end of the url

      \) - Find the closing parenthesis of the url. The backslash precedes the closing parenthesis to escape it and indicate it needs to look for the actual closing parenthesis character.

      ) - This closes our second subexpression, which captures everything from the number after the decimal to the closing parentheses of the link.

      | - Indicates the end of the pattern matching portion of the expression/command and the beginning of the modification part of the command/expression.

      \1 - This is how we refer to or call the subexpressions. The syntax is a backslash followed by a number, and the number indicates the sequential position of the subexpression. So \1 refers to this portion of the regex in the command above: (]\(.+[0-9]+). This section of the expression is capturing everything from the closing bracket up to (but not including) the period/dot/decimal character. By using it in this position in the substitution/modification, we’re just using it as a variable, so in the substitution, it’s going to put everything it finds in the first subexpression first in the new/modified string of text.

      - - This tells sed to put a dash immediately after the first subexpression in this new/modified string of text, effectively replacing the period/dot/decimal in the number portion of the url.

      \2 - This is calling the second subexpression, which is this portion of the pattern matching regex: [0-9]+.+\). This captures everything in the url from the number after the period/dot/decimal (not including the decimal), to the closing parenthesis of the markdown url. Used in this position of the substitution it tells sed to place it after the dash in the new/modified text.

      |' - This indicates the end of the modification portion of the command and closes the match|substitution expression.

      somefile - The file to be worked on

      Here is the full command again: sed -ri 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|' somefile

      Altogether what this does is: Begin the first subexpression that starts with finding a closing bracket followed by an opening parentheses followed by any character one or more times until finding at least one or more numbers between zero and nine until it finds a decimal, and then close and remember what was found for this sub expression (not including the decimal). Then begin the second subexpression that starts with finding a number between zero and nine one or more times, and then find any character any number of times until a closing parentheses is found. Then close and remember what was found in this subexpression. Replace everything with subexpression one followed by a dash followed by subexpression two.

      If you also need this markdown link text to be converted to lowercase, just add \L to the replacement section before the \1 like so:

      sed -ri 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\L\1-\2|' somefile
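Both variants can be tried on a single sample line before touching a file. The sketch below assumes GNU sed (for -r and \L) and drops -i so the result goes to standard output; the variable names sample, dashed and lowered are illustrative.

```shell
# The link from the example above.
sample='[1.3 Subtitles](BDMV_svt-av1_encode_anime.md#1.3%20Subtitles)'

# Variant 1: only swap the dot between the two numbers for a dash.
dashed="$(printf '%s\n' "$sample" |
  sed -r 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\1-\2|')"

# Variant 2: same, but \L lowercases everything that was matched.
lowered="$(printf '%s\n' "$sample" |
  sed -r 's|(]\(.+[0-9]+)\.([0-9]+.+\))|\L\1-\2|')"

printf '%s\n' "$dashed" "$lowered"
```

Note that the \L variant lowercases the whole match from ]( to the closing parenthesis, so the file name before the # becomes lowercase as well. The link text in square brackets is outside the match and stays as it is.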