#
# spec file for package tesseract-ocr
#
# Copyright (c) 2020 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
# upon. The license for this file, and modifications and additions to the
# file, is the same license as for the pristine package itself (unless the
# license for the pristine package is not an Open Source License, in which
# case the license is the MIT License). An "Open Source License" is a
# license that conforms to the Open Source Definition (Version 1.9)
# published by the Open Source Initiative.

# Please submit bugfixes or comments via https://bugs.opensuse.org/
#


%define so_ver 4
Name:           tesseract-ocr
Version:        4.1.1
Release:        3.3
Summary:        Open Source OCR Engine
License:        Apache-2.0 AND GPL-2.0-or-later
URL:            https://github.com/tesseract-ocr/tesseract
Source0:        https://github.com/tesseract-ocr/tesseract/archive/%{version}.tar.gz#/%{name}-%{version}.tar.gz
# PATCH-FIX-OPENSUSE -- boo#1159231
Patch0:         tesseract-ocr-no-cpudetection.patch
BuildRequires:  asciidoc
BuildRequires:  autoconf
BuildRequires:  automake
BuildRequires:  doxygen
BuildRequires:  fdupes
BuildRequires:  gcc-c++
BuildRequires:  libtool
BuildRequires:  libxslt-tools
BuildRequires:  opencl-headers
BuildRequires:  pkgconfig >= 0.9.0
BuildRequires:  pkgconfig(OpenCL)
BuildRequires:  pkgconfig(cairo)
BuildRequires:  pkgconfig(fontconfig)
BuildRequires:  pkgconfig(icu-i18n) >= 52.1
BuildRequires:  pkgconfig(icu-uc) >= 52.1
BuildRequires:  pkgconfig(lept) >= 1.74
BuildRequires:  pkgconfig(libarchive)
BuildRequires:  pkgconfig(pango) >= 1.22.0
BuildRequires:  pkgconfig(pangocairo) >= 1.22.0
BuildRequires:  pkgconfig(pangoft2) >= 1.22.0
Recommends:     tesseract-ocr-traineddata-english

%description
A commercial quality OCR engine originally developed at HP between 1985 and
1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was
open-sourced by HP and UNLV in 2005.
Since 2007 it has been developed by Google.

%package devel
Summary:        Tesseract Open Source OCR Engine Development files
Requires:       libtesseract%{so_ver} = %{version}
Requires:       pkgconfig(lept) >= 1.74
Requires:       pkgconfig(libarchive)

%description devel
This package contains development files for the Tesseract Open Source OCR
Engine.

%package -n libtesseract%{so_ver}
Summary:        Open Source OCR Engine

%description -n libtesseract%{so_ver}
A commercial quality OCR engine originally developed at HP between 1985 and
1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was
open-sourced by HP and UNLV in 2005. Since 2007 it has been developed by
Google.

%prep
%autosetup -n tesseract-%{version} -p1

%build
autoreconf -fiv
%configure \
  --disable-static
%make_build all training doc

%install
%make_install all training-install

# Remove libtool config files
rm -f %{buildroot}%{_libdir}/libtesseract.la

# Manually install the devel docs in order to fix rpmlint warnings "files-duplicate" and "doc-file-dependency"
mkdir -p %{buildroot}%{_defaultdocdir}/%{name}-devel
cp -a doc/html/ %{buildroot}%{_defaultdocdir}/%{name}-devel/

# Fix rpmlint warning "doc-file-dependency"
rm -f %{buildroot}%{_defaultdocdir}/%{name}-devel/html/installdox

# Fix rpmlint warning "non-executable-in-bin"
chmod 0755 %{buildroot}%{_bindir}/tesstrain_utils.sh

# Fix rpmlint warning "files-duplicate"
%fdupes -s %{buildroot}

%post -n libtesseract%{so_ver} -p /sbin/ldconfig
%postun -n libtesseract%{so_ver} -p /sbin/ldconfig

%files
%doc AUTHORS ChangeLog README.md
%license LICENSE
%{_bindir}/*
%dir %{_datadir}/tessdata
%{_datadir}/tessdata/configs/
%{_datadir}/tessdata/tessconfigs/
%{_datadir}/tessdata/pdf.ttf
%{_mandir}/man1/*.1%{?ext_man}
%{_mandir}/man5/*.5%{?ext_man}

%files devel
%doc %{_defaultdocdir}/tesseract-ocr-devel/
%{_includedir}/tesseract/
%{_libdir}/libtesseract*.so
%{_libdir}/pkgconfig/*.pc

%files -n libtesseract%{so_ver}
%{_libdir}/libtesseract.so.%{so_ver}*

%changelog
* Thu Mar 26 2020 Bernhard Wiedemann
- Add tesseract-ocr-no-cpudetection.patch to avoid crashing on
  older CPUs and to make package build reproducible (boo#1159231)
* Fri Jan 3 2020 Tomáš Chvátal
- Require libarchive in the devel package
* Fri Dec 27 2019 Ismail Dönmez
- Update to version 4.1.1
  * Bugfixes
* Fri Dec 13 2019 Martin Pluskal
- Packaging Cleanups
- Update dependencies and enable OpenCL
* Fri Dec 13 2019 hiwatari.seiji@gmail.com
- Update to 4.1.0
  * Added a new output option formatted in the ALTO standard
  * SIMD optimization
  * Bugfixes
- Update to 4.0.0
  * New OCR engine based on LSTMs
  * Removed Cube OCR engine
  * Updated build system
  * Cleanups and fixes
* Tue Feb 20 2018 jweberhofer@weberhofer.at
- Update to 3.05.01
  * Fixed several build issues
  * Fixed C-API
  * Backport pdfrenderer changes
  * Code clean up
- Spec file cleaned up
* Fri Feb 17 2017 idonmez@suse.com
- Update to 3.05.00
  * Made some fine tuning to the hOCR output.
  * Added TSV as another optional output format.
  * Fixed ABI break introduced in 3.04.00 with the AnalyseLayout()
    method.
  * text2image tool - Enable all OpenType ligatures available in a
    font. This feature requires Pango 1.38 or newer.
  * Training tools - Replaced asserts with tprintf() and exit(1).
  * Improved multipage tiff processing.
  * Improved the embedded pdf font (pdf.ttf).
  * Enable selection of OCR engine mode from command line.
  * Changed tesseract command line parameter '-psm' to '--psm'.
  * Added new C API for orientation and script detection, removed
    the old one.
  * Fixed many compiler warnings.
  * Fixed memory and resource leaks.
* Fri Feb 19 2016 idonmez@suse.com
- Update to 3.04.01
  * No changelog upstream
* Fri Oct 2 2015 asterios.dramis@gmail.com
- Update to version 3.04.00:
  * Added OpenCL support (experimental).
  * Many bug fixes.
  From version 3.03.00:
  * Added new training tool text2image to generate box/tif file
    pairs from text and truetype fonts.
  * Added support for PDF output with searchable text.
  * Removed entire IMAGE class and all code in image directory.
  * Tesseract executable: support for output to stdout; limited
    support for one page images from stdin (especially on Windows)
  * Added Renderer to API to allow document-level processing and
    output of document formats, like hOCR, PDF.
  * Major refactor of word-level recognition, beam search,
    eliminating dead code.
  * Refactored classifier to make it easier to add new ones.
  * Generalized feature extractor to allow feature extraction from
    greyscale.
  * Improved sub/superscript treatment.
  * Improved baseline fit.
  * Added set_unicharset_properties to training tools.
  * Many bug fixes.
  * More training source data included.
- Added new build requirements cairo-devel, doxygen, libicu-devel
  and pango-devel.
- Recommend tesseract-ocr-traineddata-english instead of
  tesseract-ocr-traineddata-american (based on new (3.04.00)
  tesseract-ocr traineddata files).
* Mon Sep 14 2015 asterios.dramis@gmail.com
- Fix Recommends: entry to tesseract-ocr-traineddata-american.
* Sat Jun 20 2015 mailaender@opensuse.org
- rename to match upstream tarball and fix boo#900303
* Sat Jun 22 2013 asterios.dramis@gmail.com
- Split library into separate package (libtesseract3).
- Removed debuginfo package (not needed).
- There is no need anymore to regenerate the build system (removed
  automake and libtool build requirements).
- Added pkg-config build requirement (fix for rpmlint error
  "no-pkg-config-provides"). Removed also not needed
  "Provides: pkgconfig(%%{name})" entry.
* Mon May 6 2013 idonmez@suse.com
- Update license, some files are GPL-2.0+ licensed
* Mon Oct 29 2012 jw@suse.com
- Update to version 3.02.02
  * untested
- Notable features:
  * Hebrew with BiDi support.
  * More languages.
- removed upstreamed patch0
* Mon Jun 25 2012 asterios.dramis@gmail.com
- Update to version 3.01:
  * Removed old/dead serialise/deserialise methods on *LISTIZED
    classes.
  * Total rewrite of DENORM to better encapsulate operation and make
    for potential to extract features from images.
  * Thread-safety! Moved all critical globals and statics to members
    of the appropriate class. Tesseract is now thread-safe (multiple
    instances can be used in parallel in multiple threads.) with the
    minor exception that some control parameters are still global
    and affect all threads.
  * Added Cube, a new recognizer for Arabic. Cube can also be used
    in combination with normal Tesseract for other languages with an
    improvement in accuracy at the cost of (much) lower speed. There
    is no training module for Cube yet.
  * OcrEngineMode in Init replaces AccuracyVSpeed to control cube.
  * Greatly improved segmentation search with consequent accuracy
    and speed improvements, especially for Chinese.
  * Added PageIterator and ResultIterator as cleaner ways to get the
    full results out of Tesseract, that are not currently provided
    by any of the TessBaseAPI::Get* methods. All other methods, such
    as the ETEXT_STRUCT in particular are deprecated and will be
    deleted in the future.
  * ApplyBoxes totally rewritten to make training easier. It can now
    cope with touching/overlapping training characters, and a new
    boxfile format allows word boxes instead of character boxes, BUT
    to use that you have to have already bootstrapped the language
    with character boxes. "Cyclic dependency" on traineddata.
  * Auto orientation and script detection added to page layout
    analysis.
  * Deleted *lots* of dead code.
  * Fixxht module replaced with scalable data-driven module.
  * Output font characteristics accuracy improved.
  * Removed the double conversion at each classification.
  * Upgraded oldest structs to be classes and deprecated PBLOB.
  * Removed non-deterministic baseline fit.
  * Added fixed length dawgs for Chinese.
  * Handling of vertical text improved.
  * Handling of leader dots improved.
  * Table detection greatly improved.
- Removed the various languages traineddata subpackages (to be
  included in a separate package "tesseract-traineddata").
- Changed License to Apache-2.0 (SPDX style).
- Removed libtiff-devel build dependency (not needed anymore).
- Added new build dependency liblept-devel, required now by the
  package.
- Added automake and libtool build dependencies in order to
  regenerate the build system because of missing Makefile.in.
- Removed tesseract-traineddata-deu from recommended entries.
- Removed nonvoid.patch (fixed upstream).
- Added a patch (svutil.cpp_fix.patch) to fix compilation due to
  missing includes (taken from upstream).
- Disabled compilation of static libraries.
* Mon Oct 25 2010 prusnak@opensuse.org
- fixed missing returns in nonvoid functions (nonvoid.patch)
- added missing post/postun scripts calling ldconfig
* Thu Sep 23 2010 michal.smrz@opensuse.cz
- update to tesseract-3.00
- added plenty of new supported languages
- created tesseract-package-creator.py which will, hopefully,
  make future updates easier
* Fri Jul 10 2009 puzel@novell.com
- update to tesseract-2.04
  * Integrated bug fixes and patches and misc changes for
    portability.
  * Integrated a patch to remove some of the "access" macros.
  * Removed dependence on lua from the viewer, speeding it up
    dramatically.
  * Fixed the viewer so it compiles and runs properly!