Java: Getting a Text File's Encoding + Converting a String's Encoding

snoworca 2015. 11. 18. 15:06

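When you create a project that depends on several other projects written in Eclipse, encoding problems can be an unpleasant surprise.

This post looks at how to convert the source files to UTF-8 in one pass when the files in those projects have each been saved in a different encoding.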

1. Getting the encoding of a text file (or a java file)

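There are several ways to check the encoding of a text file or a Java source file, but a simple and reasonably reliable one is to use the juniversalchardet (https://code.google.com/p/juniversalchardet/) library. (A brute-force alternative is to read the file, decode it into a String with the encoding you want to test, take the character array, and look for the 0xfffd character.)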
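The brute-force check mentioned above can be sketched roughly as follows. This is only an illustration and not part of juniversalchardet: the class name ReplacementCharCheck and the method decodesCleanly are hypothetical, and the sample bytes are the EUC-KR encoding of "한글" (written as raw bytes so the source stays ASCII).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReplacementCharCheck {

  /** Returns true if decoding the bytes with the candidate charset yields no U+FFFD. */
  public static boolean decodesCleanly(byte[] data, Charset candidate) {
    // new String(byte[], Charset) substitutes U+FFFD for malformed or unmappable input
    String decoded = new String(data, candidate);
    return decoded.indexOf('\uFFFD') < 0;
  }

  public static void main(String[] args) {
    // EUC-KR bytes for the two Hangul syllables U+D55C U+AE00
    byte[] eucKrBytes = {(byte) 0xC7, (byte) 0xD1, (byte) 0xB1, (byte) 0xDB};
    System.out.println("UTF-8:  " + decodesCleanly(eucKrBytes, StandardCharsets.UTF_8));
    System.out.println("EUC-KR: " + decodesCleanly(eucKrBytes, Charset.forName("EUC-KR")));
  }
}
```

A check like this can only rule encodings out. Single-byte charsets such as ISO-8859-1 map every byte to some character, so they never produce 0xfffd, which is one reason a dedicated detector is the more practical choice.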

First, download the library and add it to your project.

The Maven Repository address is below. (The library can also be downloaded from that page.)

http://www.mvnrepository.com/artifact/com.googlecode.juniversalchardet/juniversalchardet/1.0.3

The code below is taken from the sample code on the main page of the juniversalchardet project.

This is all it takes to get the encoding of a text file.

Sample code:

import org.mozilla.universalchardet.UniversalDetector;

public class TestDetector {
  public static void main(String[] args) throws java.io.IOException {
    byte[] buf = new byte[4096];
    String fileName = args[0];
    UniversalDetector detector = new UniversalDetector(null);

    // try-with-resources closes the stream even if reading fails
    try (java.io.FileInputStream fis = new java.io.FileInputStream(fileName)) {
      int nread;
      // Feed the detector until the file ends or the detector reports it is done
      while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
      }
    }
    // Signal the end of the input
    detector.dataEnd();

    // getDetectedCharset() returns null when no encoding could be determined
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
      System.out.println("Detected encoding = " + encoding);
    } else {
      System.out.println("No encoding detected.");
    }

    detector.reset();
  }
}

2. Changing the encoding of a String

The code below encodes the string "헬로월드!" into EUC-KR bytes and prints it, then decodes those bytes back into the original string and prints it again.

Strings can be encoded and decoded with java.nio.charset.Charset.

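import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class EncodingExample {
  public static void main(String[] args) {
    // A Java String is UTF-16 internally; an encoding only applies once it becomes bytes.
    String str = "헬로월드!";

    Charset eucKRCharset = Charset.forName("EUC-KR");
    ByteBuffer resultByteBuffer = eucKRCharset.encode(CharBuffer.wrap(str));
    // Copy only the encoded bytes: array() may return a backing array larger than the data.
    byte[] resultBytes = new byte[resultByteBuffer.remaining()];
    resultByteBuffer.get(resultBytes);

    // When creating a String from EUC-KR bytes, pass the charset as the second argument.
    System.out.println(new String(resultBytes, eucKRCharset));
    // Without it the platform default charset is used; if that is UTF-8, error characters (�, 0xfffd) are printed.
    System.out.println(new String(resultBytes));

    // Decode the EUC-KR bytes back into the original string.
    CharBuffer charBuffer = eucKRCharset.decode(ByteBuffer.wrap(resultBytes));
    System.out.println(charBuffer.toString());
  }
}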
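The same round trip can also be written with String.getBytes(Charset) and new String(byte[], Charset), which avoids handling buffers directly. The sketch below is only an illustration: the class name TranscodeSketch and the helper transcode are hypothetical, and the string literal is "헬로월드!" written with Unicode escapes so the source stays ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class TranscodeSketch {

  /** Re-encodes bytes from one charset to another by going through a String. */
  public static byte[] transcode(byte[] input, Charset from, Charset to) {
    return new String(input, from).getBytes(to);
  }

  public static void main(String[] args) {
    Charset eucKr = Charset.forName("EUC-KR");
    String str = "\uD5EC\uB85C\uC6D4\uB4DC!";

    // Encode to EUC-KR, then convert those bytes to UTF-8
    byte[] eucKrBytes = str.getBytes(eucKr);
    byte[] utf8Bytes = transcode(eucKrBytes, eucKr, StandardCharsets.UTF_8);

    System.out.println("EUC-KR bytes: " + eucKrBytes.length);
    System.out.println("UTF-8 bytes:  " + utf8Bytes.length);
    // The converted bytes should match encoding the string to UTF-8 directly
    System.out.println(Arrays.equals(utf8Bytes, str.getBytes(StandardCharsets.UTF_8)));
  }
}
```

Going through a String works because the String itself has no "encoding" in this sense; only the byte arrays on either side do.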

3. Application - converting every Java source file in a workspace to UTF-8

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.mozilla.universalchardet.UniversalDetector;

public class SourceEncodingConverter {

	public static void main(String[] args) throws IOException {
		// Project folder path
		decodingProjectSources(new File("/home/name/workspace"));
	}

	public static void decodingProjectSources(File file) throws IOException {
		// Convert only java files (extension matched case-insensitively) to UTF-8
		if(file.isFile() && file.getName().matches("^.*\\.((?i)JAVA)$")) {
			String encoding = readEncoding(file);
			if(!encoding.equals("UTF-8")) {
				decodingFile(file, encoding);
			}
		} else if(file.isDirectory()) {
			File[] list = file.listFiles();
			// listFiles() returns null if the directory cannot be read
			if(list == null) return;
			for(File childFile : list) {
				decodingProjectSources(childFile);
			}
		}
	}

	public static void decodingFile(File file, String encoding) throws IOException {
		Charset charset = Charset.forName(encoding);
		// Read the whole file and decode it with the detected charset
		byte[] bytes = Files.readAllBytes(file.toPath());
		String text = new String(bytes, charset);
		// Overwrite the file, writing explicitly as UTF-8 rather than the platform default
		Files.write(file.toPath(), text.getBytes(StandardCharsets.UTF_8));
	}

	public static String readEncoding(File file) throws IOException {
		byte[] buf = new byte[4096];
		UniversalDetector detector = new UniversalDetector(null);
		// try-with-resources closes the stream even if reading fails
		try (FileInputStream fis = new FileInputStream(file)) {
			int nread;
			while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
				detector.handleData(buf, 0, nread);
			}
		}
		detector.dataEnd();
		String encoding = detector.getDetectedCharset();
		detector.reset();
		// Fall back to UTF-8 when no encoding is detected
		return encoding == null ? "UTF-8" : encoding;
	}
}
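Because decodingFile overwrites each file in place, a wrongly detected encoding would silently damage the source. One possible safeguard, sketched here under the assumption that you would rather skip a doubtful file than convert it, is to decode strictly with a CharsetDecoder that reports errors instead of substituting 0xfffd. The class name StrictDecodeSketch and the method decodeStrict are hypothetical and do not appear in the code above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecodeSketch {

  /** Decodes strictly: throws instead of substituting U+FFFD for bad input. */
  public static String decodeStrict(byte[] data, Charset charset)
      throws CharacterCodingException {
    return charset.newDecoder()
        .onMalformedInput(CodingErrorAction.REPORT)
        .onUnmappableCharacter(CodingErrorAction.REPORT)
        .decode(ByteBuffer.wrap(data))
        .toString();
  }

  public static void main(String[] args) {
    // EUC-KR bytes that are not a valid UTF-8 sequence
    byte[] notUtf8 = {(byte) 0xC7, (byte) 0xD1, (byte) 0xB1, (byte) 0xDB};
    try {
      decodeStrict(notUtf8, StandardCharsets.UTF_8);
      System.out.println("decoded cleanly");
    } catch (CharacterCodingException e) {
      // A caller could log the file and leave it untouched here
      System.out.println("not valid for this charset: " + e);
    }
  }
}
```

In a converter like the one above, a helper of this kind could replace the lenient new String(bytes, charset) call, so that a file is rewritten only when every byte decodes under the detected charset.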
