7
AlgoRythm
16h

Markdown syntax is ambiguous and always forgiving (any text in a markdown document is valid markdown). Syntax is complicated and context-dependent.

I honestly think it might be one of the trickiest languages to parse.

Comments
  • 2
    Never thought about it but you might be right.
  • 2
    @Lensflare if you add in the fact that Markdown allows HTML alongside markdown, it’s fucking insanely complicated. Now you’ve gotta parse HTML and Markdown - and you need to consider malformed syntax of BOTH.

    I’m fully convinced it’s impossible to write a “perfect” MD parser for this reason. That’s why there are so many different flavors of MD. The people building the parsers need to pick and choose what they will implement!
  • 2
    @AlgoRythm yeah, markdown seems like it’s not as extensible as HTML. Thinking about it, HTML is based on XML and the X is for Extensible. It’s literally in the name lol.
  • 2
    @Lensflare right. In general, extending markdown requires adding to the language’s grammar and extending HTML just requires coming up with a new tag name.

    Some libraries let you customize how the markdown renders to HTML, but very few offer an interface for changing how the language itself is parsed. markdown-it (VSCode’s parser/renderer of choice) is one that does.
  • 2
    Ah cool, you're actually doing it. I wouldn't accept invalid HTML at all, or even any HTML at all for starters. If you take that on all at once you'll go crazy. You're making it in Zig, right? I would start by looking at a Python example from GPT I guess. Just for the base approach. Python examples are always nice. When does the parsing get hard? Does it only parse, or does it transpile to HTML too?
  • 2
    @retoor Yeah, frankly I don't think I'll be supporting HTML this iteration. I've done a lot of research and prototyping, and the part where parsing gets hard is generally the really complicated stuff, like headers/titles inside of nested block quotes, or a stray [ in the body of a paragraph somewhere. You can't be sure if it's a link or just a random character until you hit a delimiter like a newline, which really sucks for both logic and performance.

    I have spent about 4 hours modifying my prototype and writing up docs on my approach and still I'm not even at the point where I'm scanning characters yet.

    The cool news: I think I'm going to support outputting either rendered HTML or, optionally, the AST in JSON format (so you can render or process the parsed MD document however you wish)
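
The stray [ problem from this comment can be shown with a small lookahead scanner. This is only a sketch in Python, not the author's prototype: the names `try_parse_link` and `tokenize_inline` are made up for illustration, and the link syntax is deliberately reduced to `[label](url)` on a single line, with no nesting, escapes, or titles.

```python
from typing import List, Optional, Tuple


def try_parse_link(text: str, start: int) -> Optional[Tuple[str, str, int]]:
    """Try to read `[label](url)` where text[start] == '['.

    Returns (label, url, index_after_link), or None if this '[' is plain text.
    Hypothetical helper: a simplified rule set, not CommonMark.
    """
    # Scan the label; a newline or another '[' means this was not a link.
    close = start + 1
    while close < len(text) and text[close] not in "[]\n":
        close += 1
    if close >= len(text) or text[close] != "]":
        return None
    # The closing bracket must be followed directly by '('.
    if close + 1 >= len(text) or text[close + 1] != "(":
        return None
    # Scan the URL up to ')' on the same line.
    end = close + 2
    while end < len(text) and text[end] not in ")\n":
        end += 1
    if end >= len(text) or text[end] != ")":
        return None
    return text[start + 1:close], text[close + 2:end], end + 1


def tokenize_inline(text: str) -> List[Tuple[str, ...]]:
    """Split text into ("text", ...) and ("link", label, url) tokens."""
    tokens: List[Tuple[str, ...]] = []
    buf: List[str] = []  # plain-text characters not yet emitted
    i = 0
    while i < len(text):
        if text[i] == "[":
            link = try_parse_link(text, i)
            if link is not None:
                if buf:
                    tokens.append(("text", "".join(buf)))
                    buf = []
                label, url, i = link
                tokens.append(("link", label, url))
                continue
            # Lookahead failed: fall through and keep '[' as a literal.
        buf.append(text[i])
        i += 1
    if buf:
        tokens.append(("text", "".join(buf)))
    return tokens
```

With the input `a [ stray and [site](https://example.com) ok`, the first [ fails the lookahead and stays literal text, while the second becomes a link token. Every failed lookahead rescans part of the line, so a line full of stray brackets costs roughly quadratic time, which is the performance worry mentioned above.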
  • 1
    @AlgoRythm exactly. I wanted to say that about the extensibility and you phrased it perfectly.
  • 1
    @AlgoRythm just an idea, not sure if possible or practical:
    Could you detect the html parts and delegate the parsing of them to an existing html parser?
    Then you could at least focus just on the markdown part.
  • 2
    @Lensflare Realistically: no.

    Depending on your level of support for HTML/MD mixed, you'll need different things, but almost all cases require that you find the *bounds* of an HTML span.

    For example, if someone writes a <select> inside your markdown document, you need to find the MATCHING </select>, and then what you do with that data is at your discretion.

    Do you want to parse the HTML and integrate it into your own AST with syntax correction etc etc like a real HTML parser? Then you're gonna need the AST from the HTML parser.

    Do you want to do the reasonable thing and just blit the HTML text data into your rendered MD output without a care in the world? Then just do that.

    Either way, most HTML parsers aren't equipped to parse from a specific opening tag to its matching closing tag.

    Even more interesting: what if that HTML is malformed and an opening tag is created but a closing one never is? Then the whole rest of the document might be interpreted as HTML
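
Finding the bounds of an HTML span, as described in this comment, can be sketched with a depth counter over tags of the same name. This is a naive illustration in Python: `find_html_span` is a hypothetical helper, and it ignores comments, void and self-closing elements, and `>` characters inside attribute values.

```python
import re
from typing import Optional, Tuple

# Matches an opening tag and captures its name (hypothetical, simplified).
OPENING_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>")


def find_html_span(text: str, start: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the element opened at text[start].

    Returns None if text[start] is not an opening tag, or if the
    matching closing tag never appears.
    """
    opening = OPENING_TAG.match(text, start)
    if opening is None:
        return None
    name = opening.group(1)
    # Finds later opening or closing tags that share the same name.
    same_tag = re.compile(
        r"<(/?)" + re.escape(name) + r"\b[^>]*>", re.IGNORECASE
    )
    depth = 1
    pos = opening.end()
    while True:
        m = same_tag.search(text, pos)
        if m is None:
            return None  # opened but never closed
        depth += -1 if m.group(1) else 1
        pos = m.end()
        if depth == 0:
            return start, pos
```

Returning None for an unclosed tag leaves the policy to the caller. One option is to treat the `<` as literal text, so a missing closing tag does not swallow the rest of the document as HTML.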
  • 1
    @AlgoRythm do only the JSON one and base everything on that, I would say. That's the API then, I guess. There's no markdown file so big that the JSON middle layer would make it slow. JSON is perfect. So, I would just get a [{"type" : "h1", "content": "title"}, {"type" : "p", "content": "text under previous title"}], right? I assume you would reuse the HTML tag names for naming the JSON elements, right? It would be comfortable. That way you still kinda have both, JSON and HTML, and it's prolly just the best choice I guess. Markdown -> HTML DOM JSON transpiler 😁. Sounds very useful / something that makes people happy.
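
The JSON shape suggested in this comment can be sketched as a tiny block parser with HTML rendered from the same node list. This is an assumption-heavy Python illustration, not the planned Zig implementation: `parse_blocks` and `render_html` are hypothetical names, and only `#` headings and plain paragraphs are handled.

```python
import html
import json
import re
from typing import Dict, List

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*?)\s*$")


def parse_blocks(markdown: str) -> List[Dict[str, str]]:
    """Parse headings and paragraphs into a flat list of nodes."""
    nodes: List[Dict[str, str]] = []
    paragraph: List[str] = []  # lines of the paragraph being collected

    def flush() -> None:
        if paragraph:
            nodes.append({"type": "p", "content": " ".join(paragraph)})
            paragraph.clear()

    for line in markdown.splitlines():
        heading = HEADING.match(line)
        if heading:
            flush()
            level = len(heading.group(1))
            nodes.append({"type": "h%d" % level, "content": heading.group(2)})
        elif line.strip() == "":
            flush()  # a blank line ends the current paragraph
        else:
            paragraph.append(line.strip())
    flush()
    return nodes


def render_html(nodes: List[Dict[str, str]]) -> str:
    """Render nodes to HTML, reusing each node type as the tag name."""
    return "\n".join(
        "<{0}>{1}</{0}>".format(node["type"], html.escape(node["content"]))
        for node in nodes
    )


def to_json(nodes: List[Dict[str, str]]) -> str:
    """Serialize the node list for consumers that want the AST itself."""
    return json.dumps(nodes)
```

Because the node types reuse HTML tag names, the renderer needs no mapping table, and both the HTML and JSON outputs come from the same parsed structure.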