Added script for PyTorch issues debugging#137
Conversation
ef0xa
left a comment
There was a problem hiding this comment.
This mostly looks very good. Have a lot of small comments on general performance and 'go/house style' that are just meant to be pointers. Feel free to ignore the stuff labeled OPTIONAL or MINOR: just keep them in mind for next time.
cmd/diagnostic/PyTorch.go
Outdated
| } | ||
| } | ||
| ` | ||
| data := map[string]interface{}{ |
There was a problem hiding this comment.
MINOR: General go HTTP/REST performance tips:
- use a struct here rather than
map[string]anyto avoid extraneous allocations. - don't make a map to iterate through the headers: just do
req.Header.Set("Content-Type", "application/json") - don't make a new
http.Clientfor each request: they're safe to share between invocations. if you don't care which client to use, usehttp.DefaultClient - either or both of the client or request context should have a timeout.
| "github.com/spf13/cobra" | ||
| ) | ||
|
|
||
| func getPodMachineID(podID, apiKey string) string { |
There was a problem hiding this comment.
this function should return an error.
cmd/diagnostic/PyTorch.go
Outdated
| } | ||
| jsonData, _ := json.Marshal(data) | ||
|
|
||
| req, _ := http.NewRequest("POST", url, bytes.NewBuffer(jsonData)) |
There was a problem hiding this comment.
check this error and return it.
| headers := map[string]string{ | ||
| "Content-Type": "application/json", | ||
| } | ||
| query := ` |
There was a problem hiding this comment.
OPTIONAL: this can be a const.
cmd/diagnostic/PyTorch.go
Outdated
|
|
||
| var result map[string]interface{} | ||
| json.NewDecoder(resp.Body).Decode(&result) | ||
| if pod, ok := result["data"].(map[string]interface{})["pod"].(map[string]interface{}); ok { |
There was a problem hiding this comment.
again, don't use map[string]any: just make a struct type that's the shape you want. in this case, it would be
type Result struct {
Data struct {
Pod struct {
MachineID string `json:"machineId"`
} `json:"pod"`
} `json:"data"`
}| driverVersion := driverVersionRegex.FindStringSubmatch(output) | ||
| gpuName := gpuNameRegex.FindStringSubmatch(output) | ||
|
|
||
| info := map[string]string{ |
There was a problem hiding this comment.
(MINOR, OPTIONAL): see above re: structs vs maps
|
|
||
| client := &http.Client{} | ||
| resp, err := client.Do(req) | ||
| if err != nil { |
There was a problem hiding this comment.
use return fmt.Errorf with %w instead of just printing the error.
| return parseNvidiaSMIOutput(string(output)) | ||
| } | ||
|
|
||
| func getSystemInfo() map[string]interface{} { |
There was a problem hiding this comment.
(OPTIONAL, MINOR): this function is tiny and only called once. inline it.
| return results | ||
| } | ||
|
|
||
| func saveInfoToFile(info map[string]interface{}, filename string) { |
There was a problem hiding this comment.
(OPTIONAL, MINOR): this function is tiny and only called once. inline it.
| return info | ||
| } | ||
|
|
||
| func getNvidiaSMIInfo() map[string]string { |
There was a problem hiding this comment.
(OPTIONAL, MINOR): this function is tiny and only called once. inline it.
|
What will happen if people try to run this on the client? Is there anything telling them that it is a diagnostic to be run specifically through web terminal on the pod? I think somebody will try to run it locally and be confused by the output if not |
Most of the output in json file would be empty though script will try to run PyTorch test. |
Added new command: gpu-test
Curently it's including one of my script to test out PyTorch (PyTorch,go file)
Saves debug informations to /workspace/gpu_diagnostics.json
What script does:
Gather all informations about host including:
Runs test on all attached GPU's to makre sure PyTorch fully utilizes CUDA
Logs error messages into .json file for tech support to review.