[rust] add support for accesing the list of line offsets #3860

froydnj · 2026-01-18T18:40:25Z

As per subject. rubyfmt spends a non-trivial amount of time figuring out where the newlines are in the source string, and since the parser already has this information, it seems silly to have to recompute it.

I'm glad I wrote tests, though, because the implementation seems surprising to me -- I don't think I ever would have expected the list to contain offsets for newlines that aren't actually present in the source string. I guess it's too late to change the contract here? (I considered making this code return list[1..], but I think the consistency between it and the C implementation is more important; happy to reconsider or to get the underlying library changed.)

cc @reese

froydnj · 2026-01-18T18:47:07Z

Oh, I see -- the doc comment on pm_newline_list_t is a little ambiguous, or maybe I just read too fast:

prism/include/prism/util/pm_newline_list.h

Lines 24 to 40 in 6881fd5

    
           /** 
        
            * A list of offsets of newlines in a string. The offsets are assumed to be 
        
            * sorted/inserted in ascending order. 
        
            */ 
        
           typedef struct { 
        
               /** A pointer to the start of the source string. */ 
        
               const uint8_t *start; 
        
               /** The number of offsets in the list. */ 
        
               size_t size; 
        
               /** The capacity of the list that has been allocated. */ 
        
               size_t capacity; 
        
               /** The list of offsets. */ 
        
               size_t *offsets; 
        
           } pm_newline_list_t;

The indices in the list are actually the indices for the beginning of each line, not of the actual newlines in the source string. OK, that makes the tests make a little more sense, but in that case, I think the function name should be something more like line_offsets and the documentation should be changed.

froydnj · 2026-01-20T15:51:31Z

Moving this to draft; I think if we're going to provide line offsets, we might as well go ahead and just expose an iterator over the lines instead, since that's what people are going to be doing anyway.

eregon · 2026-01-20T19:24:14Z

The list of line offsets is useful for quite a few other things, which probably will be useful in Rust too:

prism/lib/prism/parse_result.rb

Lines 79 to 169 in 38c64e5

    
           # Converts the line number to a byte offset corresponding to the start of that line 
        
           def line_to_byte_offset(line) 
        
             l = line - @start_line 
        
             if l < 0 || l >= offsets.size 
        
               raise ArgumentError, "line #{line} is out of range" 
        
             end 
        
             offsets[l] 
        
           end 
        
           # Binary search through the offsets to find the line number for the given 
        
           # byte offset. 
        
           def line(byte_offset) 
        
             start_line + find_line(byte_offset) 
        
           end 
        
           # Return the byte offset of the start of the line corresponding to the given 
        
           # byte offset. 
        
           def line_start(byte_offset) 
        
             offsets[find_line(byte_offset)] 
        
           end 
        
           # Returns the byte offset of the end of the line corresponding to the given 
        
           # byte offset. 
        
           def line_end(byte_offset) 
        
             offsets[find_line(byte_offset) + 1] || source.bytesize 
        
           end 
        
           # Return the column number for the given byte offset. 
        
           def column(byte_offset) 
        
             byte_offset - line_start(byte_offset) 
        
           end 
        
           # Return the character offset for the given byte offset. 
        
           def character_offset(byte_offset) 
        
             (source.byteslice(0, byte_offset) or raise).length 
        
           end 
        
           # Return the column number in characters for the given byte offset. 
        
           def character_column(byte_offset) 
        
             character_offset(byte_offset) - character_offset(line_start(byte_offset)) 
        
           end 
        
           # Returns the offset from the start of the file for the given byte offset 
        
           # counting in code units for the given encoding. 
        
           # 
        
           # This method is tested with UTF-8, UTF-16, and UTF-32. If there is the 
        
           # concept of code units that differs from the number of characters in other 
        
           # encodings, it is not captured here. 
        
           # 
        
           # We purposefully replace invalid and undefined characters with replacement 
        
           # characters in this conversion. This happens for two reasons. First, it's 
        
           # possible that the given byte offset will not occur on a character 
        
           # boundary. Second, it's possible that the source code will contain a 
        
           # character that has no equivalent in the given encoding. 
        
           def code_units_offset(byte_offset, encoding) 
        
             byteslice = (source.byteslice(0, byte_offset) or raise).encode(encoding, invalid: :replace, undef: :replace) 
        
             if encoding == Encoding::UTF_16LE || encoding == Encoding::UTF_16BE 
        
               byteslice.bytesize / 2 
        
             else 
        
               byteslice.length 
        
             end 
        
           end 
        
           # Generate a cache that targets a specific encoding for calculating code 
        
           # unit offsets. 
        
           def code_units_cache(encoding) 
        
             CodeUnitsCache.new(source, encoding) 
        
           end 
        
           # Returns the column number in code units for the given encoding for the 
        
           # given byte offset. 
        
           def code_units_column(byte_offset, encoding) 
        
             code_units_offset(byte_offset, encoding) - code_units_offset(line_start(byte_offset), encoding) 
        
           end 
        
           # Freeze this object and the objects it contains. 
        
           def deep_freeze 
        
             source.freeze 
        
             offsets.freeze 
        
             freeze 
        
           end 
        
           private 
        
           # Binary search through the offsets to find the line number for the given 
        
           # byte offset. 
        
           def find_line(byte_offset) 
        
             index = offsets.bsearch_index { |offset| offset > byte_offset } || offsets.length 
        
             index - 1 
        
           end

Agreed the name pm_newline_list_t is confusing 😅
Would need to fix that without breaking compatibility though.

[rust] add support for accesing the list of line offsets

ee562a8

froydnj force-pushed the froydnj-rust-newlines branch from 3880eda to ee562a8 Compare January 18, 2026 18:59

froydnj changed the title ~~[rust] add support for accesing the list of newline offsets~~ [rust] add support for accesing the list of line offsets Jan 18, 2026

froydnj marked this pull request as draft January 20, 2026 15:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[rust] add support for accesing the list of line offsets #3860

[rust] add support for accesing the list of line offsets #3860

froydnj commented Jan 18, 2026

Uh oh!

froydnj commented Jan 18, 2026

Uh oh!

froydnj commented Jan 20, 2026

Uh oh!

eregon commented Jan 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

[rust] add support for accesing the list of line offsets #3860

Are you sure you want to change the base?

[rust] add support for accesing the list of line offsets #3860

Conversation

froydnj commented Jan 18, 2026

Uh oh!

froydnj commented Jan 18, 2026

Uh oh!

froydnj commented Jan 20, 2026

Uh oh!

eregon commented Jan 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants