Skip to content

Conversation

@froydnj
Copy link
Contributor

@froydnj froydnj commented Jan 18, 2026

As per subject. rubyfmt spends a non-trivial amount of time figuring out where the newlines are in the source string, and since the parser already has this information, it seems silly to have to recompute it.

I'm glad I wrote tests, though, because the implementation seems surprising to me -- I don't think I ever would have expected the list to contain offsets for newlines that aren't actually present in the source string. I guess it's too late to change the contract here? (I considered making this code return list[1..], but I think the consistency between it and the C implementation is more important; happy to reconsider or to get the underlying library changed.)

cc @reese

@froydnj
Copy link
Contributor Author

froydnj commented Jan 18, 2026

Oh, I see -- the doc comment on pm_newline_list_t is a little ambiguous, or maybe I just read too fast:

/**
* A list of offsets of newlines in a string. The offsets are assumed to be
* sorted/inserted in ascending order.
*/
typedef struct {
/** A pointer to the start of the source string. */
const uint8_t *start;
/** The number of offsets in the list. */
size_t size;
/** The capacity of the list that has been allocated. */
size_t capacity;
/** The list of offsets. */
size_t *offsets;
} pm_newline_list_t;

The indices in the list are actually the indices for the beginning of each line, not of the actual newlines in the source string. OK, that makes the tests make a little more sense, but in that case, I think the function name should be something more like line_offsets and the documentation should be changed.

@froydnj froydnj force-pushed the froydnj-rust-newlines branch from 3880eda to ee562a8 Compare January 18, 2026 18:59
@froydnj froydnj changed the title [rust] add support for accesing the list of newline offsets [rust] add support for accesing the list of line offsets Jan 18, 2026
@froydnj
Copy link
Contributor Author

froydnj commented Jan 20, 2026

Moving this to draft; I think if we're going to provide line offsets, we might as well go ahead and just expose an iterator over the lines instead, since that's what people are going to be doing anyway.

@froydnj froydnj marked this pull request as draft January 20, 2026 15:51
@eregon
Copy link
Member

eregon commented Jan 20, 2026

The list of line offsets is useful for quite a few other things, which probably will be useful in Rust too:

# Converts the line number to a byte offset corresponding to the start of that line
def line_to_byte_offset(line)
l = line - @start_line
if l < 0 || l >= offsets.size
raise ArgumentError, "line #{line} is out of range"
end
offsets[l]
end
# Binary search through the offsets to find the line number for the given
# byte offset.
def line(byte_offset)
start_line + find_line(byte_offset)
end
# Return the byte offset of the start of the line corresponding to the given
# byte offset.
def line_start(byte_offset)
offsets[find_line(byte_offset)]
end
# Returns the byte offset of the end of the line corresponding to the given
# byte offset.
def line_end(byte_offset)
offsets[find_line(byte_offset) + 1] || source.bytesize
end
# Return the column number for the given byte offset.
def column(byte_offset)
byte_offset - line_start(byte_offset)
end
# Return the character offset for the given byte offset.
def character_offset(byte_offset)
(source.byteslice(0, byte_offset) or raise).length
end
# Return the column number in characters for the given byte offset.
def character_column(byte_offset)
character_offset(byte_offset) - character_offset(line_start(byte_offset))
end
# Returns the offset from the start of the file for the given byte offset
# counting in code units for the given encoding.
#
# This method is tested with UTF-8, UTF-16, and UTF-32. If there is the
# concept of code units that differs from the number of characters in other
# encodings, it is not captured here.
#
# We purposefully replace invalid and undefined characters with replacement
# characters in this conversion. This happens for two reasons. First, it's
# possible that the given byte offset will not occur on a character
# boundary. Second, it's possible that the source code will contain a
# character that has no equivalent in the given encoding.
def code_units_offset(byte_offset, encoding)
byteslice = (source.byteslice(0, byte_offset) or raise).encode(encoding, invalid: :replace, undef: :replace)
if encoding == Encoding::UTF_16LE || encoding == Encoding::UTF_16BE
byteslice.bytesize / 2
else
byteslice.length
end
end
# Generate a cache that targets a specific encoding for calculating code
# unit offsets.
def code_units_cache(encoding)
CodeUnitsCache.new(source, encoding)
end
# Returns the column number in code units for the given encoding for the
# given byte offset.
def code_units_column(byte_offset, encoding)
code_units_offset(byte_offset, encoding) - code_units_offset(line_start(byte_offset), encoding)
end
# Freeze this object and the objects it contains.
def deep_freeze
source.freeze
offsets.freeze
freeze
end
private
# Binary search through the offsets to find the line number for the given
# byte offset.
def find_line(byte_offset)
index = offsets.bsearch_index { |offset| offset > byte_offset } || offsets.length
index - 1
end

Agreed the name pm_newline_list_t is confusing 😅
Would need to fix that without breaking compatibility though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants