How do I read a UTF-16 text file into a string in Go?

I can read the file into a byte array,

but when I convert it to a string,

it treats the UTF-16 bytes as ASCII.

How do I convert it correctly?

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    // read the whole file
    f, err := os.Open("test.txt")
    if err != nil {
        fmt.Printf("error opening file: %v\n",err)
        os.Exit(1)
    }
    r := bufio.NewReader(f)
    var s,b,e = r.ReadLine()
    if e==nil{
        fmt.Println(b)
        fmt.Println(s)
        fmt.Println(string(s))
    }
}

Output:

false

[255 254 91 0 83 0 99 0 114 0 105 0 112 0 116 0 32 0 73 0 110 0 102 0 111 0 93 0
13 0]

[ S c r i p t   I n f o ]

Update:

After testing the two examples, I now understand what the exact problem is.

On Windows, if I add a line break (CR+LF) at the end of the line, the CR is read as part of the line, because the ReadLine function cannot handle Unicode correctly ([0D 0A] = OK, [0D 00 0A 00] = not OK).

If the ReadLine function could recognize Unicode, it would understand [0D 00 0A 00] and return []uint16 instead of []byte.

So I think I should not use bufio.NewReader, since it cannot read UTF-16. I don't see any way for bufio.Reader.ReadLine to accept a parameter indicating whether the text being read is UTF-8, UTF-16LE, or UTF-32. Is there a readline function for Unicode text in the Go library?

Author: CL So | 2013-04-03

go readline unicode utf-16

UTF-16, UTF-8, and Byte Order Marks are defined by the Unicode Consortium: UTF-16 FAQ, UTF-8 FAQ, and Byte Order Mark (BOM) FAQ.
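As a small illustration of what a byte order mark looks like in practice, here is a sketch that inspects the first bytes of a buffer. The function name detectBOM and its string results are assumptions made for this example, and UTF-32 marks are deliberately ignored, so FF FE 00 00 would be reported as UTF-16LE.

```go
package main

import "fmt"

// detectBOM (hypothetical helper) reports which encoding the leading
// byte order mark suggests, or "unknown" if no known mark is present.
// UTF-32 byte order marks are not handled in this sketch.
func detectBOM(b []byte) string {
	switch {
	case len(b) >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF:
		return "UTF-8"
	case len(b) >= 2 && b[0] == 0xFE && b[1] == 0xFF:
		return "UTF-16BE"
	case len(b) >= 2 && b[0] == 0xFF && b[1] == 0xFE:
		return "UTF-16LE"
	}
	return "unknown"
}

func main() {
	// The first bytes printed in the question: 255 254 91 0 ...
	fmt.Println(detectBOM([]byte{255, 254, 91, 0})) // prints "UTF-16LE"
}
```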

Issue 4802: bufio: reading lines is too cumbersome

Reading lines from a file is too cumbersome in Go.

People are often drawn to bufio.Reader.ReadLine because of its name,
but it has a weird signature, returning (line []byte, isPrefix bool,
err error), and requires a lot of work.

ReadSlice and ReadString require a delimiter byte, which is almost
always the obvious and unsightly '\n', and also can return both a line
and an EOF

Revision: f685026a2d38

bufio: new Scanner interface

Add a new, simple interface for scanning (probably textual) data,
based on a new type called Scanner. It does its own internal
buffering, so should be plausibly efficient even without injecting a
bufio.Reader. The format of the input is defined by a "split
function", by default splitting into lines.
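To make the "split function" idea concrete, here is a minimal sketch of bufio.Scanner with its default line splitting. The helper name scanLines and the sample text are assumptions for this example. Note that the default split function looks for the single byte '\n', so it suits UTF-8 or ASCII input, not raw UTF-16.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// scanLines (hypothetical helper) collects the lines of text using the
// Scanner's default split function, bufio.ScanLines.
func scanLines(text string) []string {
	scanner := bufio.NewScanner(strings.NewReader(text))
	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	return lines
}

func main() {
	// Both "\r\n" and "\n" line endings are stripped; the last line
	// is returned even though it has no trailing newline.
	for _, line := range scanLines("first\r\nsecond\nthird") {
		fmt.Printf("%q\n", line)
	}
}
```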

go1.1beta1 released

You can download binary and source distributions from the usual place:
https://code.google.com/p/go/downloads/list?q=go1.1beta1

Here is a program that uses the Unicode rules to convert the lines of a UTF-16 text file to Go UTF-8 encoded strings. The code has been revised to take advantage of the new bufio.Scanner interface in Go 1.1.

package main

import (
    "bufio"
    "bytes"
    "encoding/binary"
    "fmt"
    "os"
    "runtime"
    "unicode/utf16"
    "unicode/utf8"
)

// UTF16BytesToString converts UTF-16 encoded bytes, in big or little endian byte order,
// to a UTF-8 encoded string.
func UTF16BytesToString(b []byte, o binary.ByteOrder) string {
    utf := make([]uint16, (len(b)+(2-1))/2)
    for i := 0; i+(2-1) < len(b); i += 2 {
        utf[i/2] = o.Uint16(b[i:])
    }
    if len(b)/2 < len(utf) {
        utf[len(utf)-1] = utf8.RuneError
    }
    return string(utf16.Decode(utf))
}

// UTF-16 endian byte order
const (
    unknownEndian = iota
    bigEndian
    littleEndian
)

// dropCREndian drops a terminal \r from the endian data.
func dropCREndian(data []byte, t1, t2 byte) []byte {
    if len(data) > 1 {
        if data[len(data)-2] == t1 && data[len(data)-1] == t2 {
            return data[0 : len(data)-2]
        }
    }
    return data
}

// dropCRBE drops a terminal \r from the big endian data.
func dropCRBE(data []byte) []byte {
    return dropCREndian(data, '\x00', '\r')
}

// dropCRLE drops a terminal \r from the little endian data.
func dropCRLE(data []byte) []byte {
    return dropCREndian(data, '\r', '\x00')
}

// dropCR drops a terminal \r from the data and reports the byte order
// implied by the \r that was found, if any.
func dropCR(data []byte) ([]byte, int) {
    if d := dropCRLE(data); len(d) != len(data) {
        return d, littleEndian
    }
    if d := dropCRBE(data); len(d) != len(data) {
        return d, bigEndian
    }
    return data, unknownEndian
}

// ScanUTF16LinesFunc returns a split function for a Scanner that returns each
// line of text, stripped of any trailing end-of-line marker. The returned line
// may be empty. The end-of-line marker is one optional carriage return followed
// by one mandatory newline. In regular expression notation, it is `\r?\n`.
// The last non-empty line of input will be returned even if it has no
// newline. It also returns a function that reports the byte order detected so far.
func ScanUTF16LinesFunc(byteOrder binary.ByteOrder) (bufio.SplitFunc, func() binary.ByteOrder) {
    // Function closure variables
    var endian = unknownEndian
    switch byteOrder {
    case binary.BigEndian:
        endian = bigEndian
    case binary.LittleEndian:
        endian = littleEndian
    }
    const bom = 0xFEFF
    var checkBOM bool = endian == unknownEndian
    // Scanner split function
    splitFunc := func(data []byte, atEOF bool) (advance int, token []byte, err error) {
        if atEOF && len(data) == 0 {
            return 0, nil, nil
        }
        if checkBOM {
            checkBOM = false
            if len(data) > 1 {
                switch uint16(bom) {
                case uint16(data[0])<<8 | uint16(data[1]):
                    endian = bigEndian
                    return 2, nil, nil
                case uint16(data[1])<<8 | uint16(data[0]):
                    endian = littleEndian
                    return 2, nil, nil
                }
            }
        }
        // Scan for newline-terminated lines.
        i := 0
        for {
            j := bytes.IndexByte(data[i:], '\n')
            if j < 0 {
                break
            }
            i += j
            switch e := i % 2; e {
            case 1: // UTF-16BE
                if endian != littleEndian {
                    if i > 1 {
                        if data[i-1] == '\x00' {
                            endian = bigEndian
                            // We have a full newline-terminated line.
                            return i + 1, dropCRBE(data[0 : i-1]), nil
                        }
                    }
                }
            case 0: // UTF-16LE
                if endian != bigEndian {
                    if i+1 < len(data) {
                        i++
                        if data[i] == '\x00' {
                            endian = littleEndian
                            // We have a full newline-terminated line.
                            return i + 1, dropCRLE(data[0 : i-1]), nil
                        }
                    }
                }
            }
            i++
        }
        // If we're at EOF, we have a final, non-terminated line. Return it.
        if atEOF {
            // Drop CR.
            advance = len(data)
            switch endian {
            case bigEndian:
                data = dropCRBE(data)
            case littleEndian:
                data = dropCRLE(data)
            default:
                data, endian = dropCR(data)
            }
            if endian == unknownEndian {
                if runtime.GOOS == "windows" {
                    endian = littleEndian
                } else {
                    endian = bigEndian
                }
            }
            return advance, data, nil
        }
        // Request more data.
        return 0, nil, nil
    }
    // Endian byte order function
    orderFunc := func() (byteOrder binary.ByteOrder) {
        switch endian {
        case bigEndian:
            byteOrder = binary.BigEndian
        case littleEndian:
            byteOrder = binary.LittleEndian
        }
        return byteOrder
    }
    return splitFunc, orderFunc
}

func main() {
    file, err := os.Open("utf16.le.txt")
    if err != nil {
        fmt.Println(err)
        os.Exit(1)
    }
    defer file.Close()
    fmt.Println(file.Name())
    rdr := bufio.NewReader(file)
    scanner := bufio.NewScanner(rdr)
    var bo binary.ByteOrder // unknown, infer from data
    // bo = binary.LittleEndian // windows
    splitFunc, orderFunc := ScanUTF16LinesFunc(bo)
    scanner.Split(splitFunc)
    for scanner.Scan() {
        b := scanner.Bytes()
        s := UTF16BytesToString(b, orderFunc())
        fmt.Println(len(s), s)
        fmt.Println(len(b), b)
    }
    fmt.Println(orderFunc())
    if err := scanner.Err(); err != nil {
        fmt.Println(err)
    }
}

Output:

utf16.le.txt
15 "Hello, 世界"
22 [34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 0 22 78 76 117 34 0]
0 
0 []
15 "Hello, 世界"
22 [34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 0 22 78 76 117 34 0]
LittleEndian
utf16.be.txt
15 "Hello, 世界"
22 [0 34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 78 22 117 76 0 34]
0 
0 []
15 "Hello, 世界"
22 [0 34 0 72 0 101 0 108 0 108 0 111 0 44 0 32 78 22 117 76 0 34]
BigEndian

Now I understand the problem: it is not in the conversion, it is in ReadLine. The question has been updated accordingly.
Here is a revised program to solve your problem.
Thank you for your program. I will revise mine based on your revision, because there are still many standards for line breaks. Since no package in Go can read UTF-16, I think I should also report this issue to Google, because nowadays a modern programming language should be able to handle Unicode correctly, especially in internet applications.
I will have a new, much improved version soon for Go 1.1 beta.
Here is the new version. It uses bufio.Scanner instead of bufio.ReadLine and bufio.ReadBytes and has many other improvements too. Run it using Go 1.1 or later.

Author: peterSO

The latest version of golang.org/x/text/encoding/unicode makes this easier because it includes unicode.BOMOverride, which intelligently interprets the BOM.

Here is ReadFileUTF16(), which is like os.ReadFile() but decodes UTF-16.

package main

import (
    "bytes"
    "fmt"
    "io/ioutil"
    "log"
    "strings"

    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

// Similar to ioutil.ReadFile() but decodes UTF-16.  Useful when
// reading data from MS-Windows systems that generate UTF-16BE files,
// but will do the right thing if other BOMs are found.
func ReadFileUTF16(filename string) ([]byte, error) {
    // Read the file into a []byte:
    raw, err := ioutil.ReadFile(filename)
    if err != nil {
        return nil, err
    }
    // Make a transformer that converts MS-Win default to UTF8:
    win16be := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    // Make a transformer that is like win16be, but abides by BOM:
    utf16bom := unicode.BOMOverride(win16be.NewDecoder())
    // Make a Reader that uses utf16bom:
    unicodeReader := transform.NewReader(bytes.NewReader(raw), utf16bom)
    // Decode and return the result:
    decoded, err := ioutil.ReadAll(unicodeReader)
    return decoded, err
}

func main() {
    data, err := ReadFileUTF16("inputfile.txt")
    if err != nil {
        log.Fatal(err)
    }
    final := strings.Replace(string(data), "\r\n", "\n", -1)
    fmt.Println(final)
}

Here is NewScannerUTF16, which is like os.Open() but returns a reader that a scanner can consume.

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"

    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

type utfScanner interface {
    Read(p []byte) (n int, err error)
}

// Creates a scanner similar to os.Open() but decodes the file as UTF-16.
// Useful when reading data from MS-Windows systems that generate UTF-16BE
// files, but will do the right thing if other BOMs are found.
func NewScannerUTF16(filename string) (utfScanner, error) {
    // Open the file:
    file, err := os.Open(filename)
    if err != nil {
        return nil, err
    }
    // Make a transformer that converts MS-Win default to UTF8:
    win16be := unicode.UTF16(unicode.BigEndian, unicode.IgnoreBOM)
    // Make a transformer that is like win16be, but abides by BOM:
    utf16bom := unicode.BOMOverride(win16be.NewDecoder())
    // Make a Reader that uses utf16bom:
    unicodeReader := transform.NewReader(file, utf16bom)
    return unicodeReader, nil
}

func main() {
    s, err := NewScannerUTF16("inputfile.txt")
    if err != nil {
        log.Fatal(err)
    }
    scanner := bufio.NewScanner(s)
    for scanner.Scan() {
        fmt.Println(scanner.Text()) // Println will add back the final '\n'
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, "reading inputfile:", err)
    }
}

FYI: I have put these functions into an open source module and made further improvements. See https://github.com/TomOnTime/utfutil/

Author: TomOnTime

For example:

package main

import (
    "errors"
    "fmt"
    "log"
    "unicode/utf16"
)

func utf16toString(b []uint8) (string, error) {
    if len(b)&1 != 0 {
        return "", errors.New("len(b) must be even")
    }
    // Check BOM
    var bom int
    if len(b) >= 2 {
        switch n := int(b[0])<<8 | int(b[1]); n {
        case 0xfffe:
            bom = 1
            fallthrough
        case 0xfeff:
            b = b[2:]
        }
    }
    w := make([]uint16, len(b)/2)
    for i := range w {
        w[i] = uint16(b[2*i+bom&1])<<8 | uint16(b[2*i+(bom+1)&1])
    }
    return string(utf16.Decode(w)), nil
}

func main() {
    // Simulated data from e.g. a file
    b := []byte{255, 254, 91, 0, 83, 0, 99, 0, 114, 0, 105, 0, 112, 0, 116, 0, 32, 0, 73, 0, 110, 0, 102, 0, 111, 0, 93, 0, 13, 0}
    s, err := utf16toString(b)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%q", s)
}

(Also here)

Output:

"[Script Info]\r"

I would also recommend using encoding/binary to read it as a []uint16 to begin with.
I wouldn't recommend that.
Why? Note that characters in UTF-16 are not always encoded in two bytes (that is true only for the BMP).
Now I understand the problem: it is not in the conversion, it is in ReadLine. The question has been updated accordingly.
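The comment above about the BMP can be checked with the standard unicode/utf16 package. This is a minimal sketch, and the helper name codeUnits is an assumption for this example: a character inside the Basic Multilingual Plane takes one 16-bit code unit, while a character outside it is encoded as a surrogate pair of two units (four bytes).

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// codeUnits (hypothetical helper) returns how many 16-bit code units
// UTF-16 needs to encode the rune r.
func codeUnits(r rune) int {
	return len(utf16.Encode([]rune{r}))
}

func main() {
	fmt.Println(codeUnits('世'))          // inside the BMP: prints 1
	fmt.Println(codeUnits('\U0001F600')) // outside the BMP: prints 2
}
```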

Author: zzzz

Here is the simplest way to read it:

package main

import (
    "bufio"
    "fmt"
    "log"
    "os"

    "golang.org/x/text/encoding/unicode"
    "golang.org/x/text/transform"
)

func main() {
    file, err := os.Open("./text.txt")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()
    scanner := bufio.NewScanner(transform.NewReader(file, unicode.UTF16(unicode.LittleEndian, unicode.UseBOM).NewDecoder()))
    for scanner.Scan() {
        // Println rather than Printf, so the text is not treated as a format string.
        fmt.Println(scanner.Text())
    }
    if err := scanner.Err(); err != nil {
        log.Fatal(err)
    }
}

Since Windows uses little-endian order by default, we use the unicode.UseBOM policy to pick up the BOM from the text, with unicode.LittleEndian as the fallback.

Very simple, +1

Author: Joel Chen

If you want something to print as a string, you can use fmt.Sprint

package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    // read the whole file
    f, err := os.Open("test.txt")
    if err != nil {
        fmt.Printf("error opening file: %v\n", err)
        return
    }
    r := bufio.NewReader(f)
    var s, _, e = r.ReadLine()
    if e != nil {
        fmt.Println(e)
        return
    }
    fmt.Println(fmt.Sprint(string(s)))
}

Author: Jess
