[SVCS-531] Separate csv and tsv function and remove use of sniff#285
[SVCS-531] Separate csv and tsv function and remove use of sniff#285AddisonSchiller wants to merge 3 commits intoCenterForOpenScience:developfrom
Conversation
cslzchen
left a comment
There was a problem hiding this comment.
As discussed, looks good. I will locally test it.
Double check tests.
|
@cslzchen Added new test file to look over |
cslzchen
left a comment
There was a problem hiding this comment.
As discussed, there is some error handling issues (not from your code though) but worth taking a look at. 🔥 🔥 .
Csv.sniff could cause random characters or spaces to be used as the delimiter. Separating these functions and using a hard coded dialect fixes this display problem.
94d40a2 to
93235e1
Compare
cslzchen
left a comment
There was a problem hiding this comment.
Looks good and works as expected. 🎆 🎆 Move to PCR.
d9dc5f3 to
b1083c5
Compare
felliott
left a comment
There was a problem hiding this comment.
@TomBaxter, sending this one to you. Needs some fixups:
- some
.seek()s need to switch back to.read()s - what are we losing by removing sniffing from the csv detector?
- remove quoting for tsv
- minor error handling de-duplication
| :return: tuple of table headers and data | ||
| """ | ||
| data = fp.read(2048) | ||
| data = fp.seek(2048) |
There was a problem hiding this comment.
seek just returns the offset the pointer was advanced to. This should probably be read.
| return parse_stdlib(reader) | ||
|
|
||
| def tsv_stdlib(fp): | ||
| data = fp.seek(2048) |
| except csv.Error: | ||
| dialect = csv.excel | ||
| else: | ||
| _set_dialect_quote_attrs(dialect, data) |
There was a problem hiding this comment.
I don't think I've ever seen a tsv with quoting in it. Has anyone else? Maybe we leave quoting alone until it's reported as an issue.
| # on certain exceptions | ||
| except Exception as e: | ||
| raise TabularRendererError('Cannot render file as csv/tsv. ' | ||
| 'The file may be empty or corrupt', |
There was a problem hiding this comment.
Nitpick: indentation is weird here.
| 'The file may be empty or corrupt', | ||
| code=HTTPStatus.BAD_REQUEST, | ||
| extension='csv') from e | ||
|
|
There was a problem hiding this comment.
Since this is identical to the error raised in the next stanza, could we just throw the error instead?
| 'Please download and view it locally.', | ||
| code=400, | ||
| code=HTTPStatus.BAD_REQUEST, | ||
| extension='csv', |
There was a problem hiding this comment.
Since both the csv and tsv parser call this, can we make sure the correct extension is being passed?
| fp.seek(0) | ||
| # set the dialect instead of sniffing for it. | ||
| # sniffing can cause things like spaces or characters to be the delimiter | ||
| dialect = csv.excel |
There was a problem hiding this comment.
Hmmm. I'm not sure how I feel about this. I like that it solves the issue of really long lines, but it bugs me a bit that we're throwing out support for alternative delimiters. If we use the csv.excel dialect, do we still support tab- and pipe-delimited text? If not, can we document that in a comment, so we'll know what to fix if we encounter it?
| 'Please download and view it locally.', | ||
| code=400, | ||
| code=HTTPStatus.BAD_REQUEST, | ||
| extension='csv', |
| :return: tuple of table headers and data | ||
| """ | ||
| data = fp.read(2048) | ||
| data = fp.seek(2048) |
| fp.seek(0) | ||
| # set the dialect instead of sniffing for it. | ||
| # sniffing can cause things like spaces or characters to be the delimiter | ||
| dialect = csv.excel |
| return parse_stdlib(reader) | ||
|
|
||
| def tsv_stdlib(fp): | ||
| data = fp.seek(2048) |
| except csv.Error: | ||
| dialect = csv.excel | ||
| else: | ||
| _set_dialect_quote_attrs(dialect, data) |
|
PR closed and replaced by #308. |
refs: https://openscience.atlassian.net/browse/SVCS-531
Purpose
Csv.sniff could cause random characters or spaces to be used
as the delimiter. Separating these functions and using a hard coded
dialect fixes this display problem.
Summary of changes
Broke up the csv/tsv function into two new ones, with the bulk of it in a helper function.
Instead of using Csv.sniff, just use csv.excel, or csv.excel_tab to set the delimiter. This stops things like spaces, characters, numbers etc from being detected as the delimiter.
QA/Testing notes
There is a zip file of csv/tsv files on the JIRA ticket that display the error. Trying them on staging/prod will show you what the error looks like. These display errors should not be present with the changes.