CX. PDF functions
The PDF functions in PHP can create PDF files using the PDFlib
library created by Thomas
Merz.
The documentation in this section is only meant to be an overview
of the available functions in the PDFlib library and should not be
considered an exhaustive reference. Please consult the
documentation included in the source distribution of PDFlib for
the full and detailed explanation of each function here. It
provides a very good overview of what PDFlib is capable of doing
and contains the most up-to-date documentation of all functions.
All of the functions in PDFlib and the PHP module have identical
function names and parameters. You will need to understand some
of the basic concepts of PDF and PostScript to efficiently use
this extension. All lengths and coordinates are measured in
PostScript points. A PostScript point is 1/72 inch, so the default
coordinate system has 72 points to an inch. Please see the
PDFlib documentation included with the source distribution of
PDFlib for a more thorough explanation of the coordinate system
used.
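As a rough illustration of the unit arithmetic, the following sketch converts a few common units to points. The helper to_points() is hypothetical and not part of PDFlib or the PHP binding; it assumes the default coordinate system with 72 points to an inch.

```php
<?php
// Hypothetical helper, not part of PDFlib or the PHP binding.
// Assumes the default coordinate system, where one point is 1/72 inch.
function to_points($value, $unit)
{
    // Points per unit for a few common units.
    $factors = array(
        'pt' => 1.0,
        'in' => 72.0,
        'mm' => 72.0 / 25.4,
        'cm' => 72.0 / 2.54,
    );
    if (!isset($factors[$unit])) {
        return false; // unknown unit
    }
    return $value * $factors[$unit];
}

// US Letter (8.5 x 11 in) and A4 (210 x 297 mm), rounded to whole points.
echo round(to_points(8.5, 'in')), " x ", round(to_points(11, 'in')), "\n";
echo round(to_points(210, 'mm')), " x ", round(to_points(297, 'mm')), "\n";
```

The A4 result rounds to the 595 x 842 page size passed to pdf_begin_page() in the first example below.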
Please note that most of the PDF functions require a
pdfdoc as their first parameter. Please
see the examples below for more
information.
Note:
If you're interested in alternative free PDF generators that do not
utilize external PDF libraries, see
this related FAQ.
Note:
This extension has been moved to PECL as
of PHP 4.3.9.
PDFlib is available for download at http://www.pdflib.com/products/pdflib/index.html, but requires that you purchase
a license for commercial use. The JPEG and TIFF libraries are required to compile
this extension.
Any version of PHP 4 after March 9, 2000 does not support versions
of PDFlib older than 3.0.
PDFlib 3.0 or greater is supported by PHP 3.0.19 and later.
This PECL extension
is not bundled with PHP.
Additional information, such as new releases,
downloads, source files, maintainer information, and a CHANGELOG, can
be found here:
http://pecl.php.net/package/pdflib.
To get these functions to work in PHP < 4.3.9, you have to compile PHP with
--with-pdflib[=DIR]. DIR is the PDFlib
base install directory, defaults to /usr/local.
In addition you can specify the JPEG, TIFF, and PNG libraries for PDFlib to
use, which is optional for PDFlib 4.x.
To do so add to your configure line the options
--with-jpeg-dir[=DIR]
--with-png-dir[=DIR]
--with-tiff-dir[=DIR].
When using version 3.x of PDFlib, you should configure PDFlib
with the option --enable-shared-pdflib.
As of PHP 4.3.9, you must install this extension through PEAR, using the following command:
pear install pdflib.
This extension does not define any configuration directives in php.ini.
Starting with PHP 4.0.5, the PHP extension for PDFlib is
officially supported by PDFlib GmbH. This means that all the
functions described in the PDFlib manual (V3.00 or greater) are
supported by PHP 4 with exactly the same meaning and the same
parameters. Only the return values may differ from the PDFlib
manual, because the PHP convention of returning
FALSE was adopted. For compatibility reasons,
this binding for PDFlib still supports the old functions, but they
should be replaced by their new versions. PDFlib GmbH will not
support any problems arising from the use of these deprecated
functions.
Table 1. Deprecated functions and their replacements
Most of the functions are fairly easy to use. The most difficult part
is probably creating your first PDF document. The following
example should help to get you started.
It creates test.pdf
with one page. The page contains the text "Times Roman outlined" in an
outlined, 10pt font (the size passed to pdf_setfont() in the code). The text is also underlined.
Example 1. Creating a PDF document with PDFlib
<?php
$pdf = pdf_new();
pdf_open_file($pdf, "test.pdf");
pdf_set_info($pdf, "Author", "Uwe Steinmann");
pdf_set_info($pdf, "Title", "Test for PHP wrapper of PDFlib 2.0");
pdf_set_info($pdf, "Creator", "See Author");
pdf_set_info($pdf, "Subject", "Testing");
pdf_begin_page($pdf, 595, 842);
pdf_add_outline($pdf, "Page 1");
$font = pdf_findfont($pdf, "Times New Roman", "winansi", 1);
pdf_setfont($pdf, $font, 10);
pdf_set_value($pdf, "textrendering", 1);
pdf_show_xy($pdf, "Times Roman outlined", 50, 750);
pdf_moveto($pdf, 50, 740);
pdf_lineto($pdf, 330, 740);
pdf_stroke($pdf);
pdf_end_page($pdf);
pdf_close($pdf);
pdf_delete($pdf);
echo "<A HREF=getpdf.php>finished</A>";
?>
The script getpdf.php just returns the pdf document.
Example 2. Outputting a precalculated PDF
<?php
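// $filename is assumed to name an existing PDF, for example the test.pdf
// created by the first example.
$filename = "test.pdf";
$len = filesize($filename);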
header("Content-type: application/pdf");
header("Content-Length: $len");
header("Content-Disposition: inline; filename=foo.pdf");
readfile($filename);
?>
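A slightly more defensive variant of the script above might check that the file exists before sending any headers. This is only a sketch: pdf_download_headers() is a hypothetical helper, and the file names are placeholders.

```php
<?php
// Hypothetical helper: build the response headers for a PDF file on disk.
// Returns false when the file is missing or unreadable.
function pdf_download_headers($path, $downloadName)
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }
    return array(
        "Content-Type: application/pdf",
        "Content-Length: " . filesize($path),
        // basename() keeps directory parts out of the suggested file name.
        "Content-Disposition: inline; filename=\"" . basename($downloadName) . "\"",
    );
}

$file = "test.pdf"; // assumed to be the file created by the first example
$headers = pdf_download_headers($file, "foo.pdf");
if ($headers === false) {
    header("HTTP/1.0 404 Not Found");
    echo "PDF not found";
} else {
    foreach ($headers as $line) {
        header($line);
    }
    readfile($file);
}
```

Building the header list separately from sending it keeps the file check in one place and makes the logic easy to exercise without a web server.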
The PDFlib distribution contains a more complex example which
creates a page with an analog clock. Here we use the in-memory
creation feature of PDFlib to alleviate the need to use temporary
files. The example was converted to PHP from the PDFlib example.
(The same example is available in the CLibPDF documentation.)
Example 3. pdfclock example from PDFlib distribution
<?php
$radius = 200;
$margin = 20;
$pagecount = 10;
$pdf = pdf_new();
if (!pdf_open_file($pdf, "")) {
    echo "error";
    exit;
}
pdf_set_parameter($pdf, "warning", "true");
pdf_set_info($pdf, "Creator", "pdf_clock.php");
pdf_set_info($pdf, "Author", "Uwe Steinmann");
pdf_set_info($pdf, "Title", "Analog Clock");
while ($pagecount-- > 0) {
    pdf_begin_page($pdf, 2 * ($radius + $margin), 2 * ($radius + $margin));
    pdf_set_parameter($pdf, "transition", "wipe");
    pdf_set_value($pdf, "duration", 0.5);
    // Move the origin to the centre of the page.
    pdf_translate($pdf, $radius + $margin, $radius + $margin);
    pdf_save($pdf);
    pdf_setrgbcolor($pdf, 0.0, 0.0, 1.0);
    // Minute strokes, one every 6 degrees.
    pdf_setlinewidth($pdf, 2.0);
    for ($alpha = 0; $alpha < 360; $alpha += 6) {
        pdf_rotate($pdf, 6.0);
        pdf_moveto($pdf, $radius, 0.0);
        pdf_lineto($pdf, $radius-$margin/3, 0.0);
        pdf_stroke($pdf);
    }
    pdf_restore($pdf);
    pdf_save($pdf);
    // Five-minute strokes, one every 30 degrees.
    pdf_setlinewidth($pdf, 3.0);
    for ($alpha = 0; $alpha < 360; $alpha += 30) {
        pdf_rotate($pdf, 30.0);
        pdf_moveto($pdf, $radius, 0.0);
        pdf_lineto($pdf, $radius-$margin, 0.0);
        pdf_stroke($pdf);
    }
    $ltime = getdate();
    // Hour hand.
    pdf_save($pdf);
    pdf_rotate($pdf,-(($ltime['minutes']/60.0)+$ltime['hours']-3.0)*30.0);
    pdf_moveto($pdf, -$radius/10, -$radius/20);
    pdf_lineto($pdf, $radius/2, 0.0);
    pdf_lineto($pdf, -$radius/10, $radius/20);
    pdf_closepath($pdf);
    pdf_fill($pdf);
    pdf_restore($pdf);
    // Minute hand.
    pdf_save($pdf);
    pdf_rotate($pdf,-(($ltime['seconds']/60.0)+$ltime['minutes']-15.0)*6.0);
    pdf_moveto($pdf, -$radius/10, -$radius/20);
    pdf_lineto($pdf, $radius * 0.8, 0.0);
    pdf_lineto($pdf, -$radius/10, $radius/20);
    pdf_closepath($pdf);
    pdf_fill($pdf);
    pdf_restore($pdf);
    // Second hand, drawn in red.
    pdf_setrgbcolor($pdf, 1.0, 0.0, 0.0);
    pdf_setlinewidth($pdf, 2);
    pdf_save($pdf);
    pdf_rotate($pdf, -(($ltime['seconds'] - 15.0) * 6.0));
    pdf_moveto($pdf, -$radius/5, 0.0);
    pdf_lineto($pdf, $radius, 0.0);
    pdf_stroke($pdf);
    pdf_restore($pdf);
    // Small filled circle at the centre.
    pdf_circle($pdf, 0, 0, $radius/30);
    pdf_fill($pdf);
    pdf_restore($pdf);
    pdf_end_page($pdf);
    // Wait a second so that each page shows a different time.
    sleep(1);
}
pdf_close($pdf);
$buf = pdf_get_buffer($pdf);
$len = strlen($buf);
header("Content-type: application/pdf");
header("Content-Length: $len");
header("Content-Disposition: inline; filename=foo.pdf");
echo $buf;
pdf_delete($pdf);
?>
Note:
An alternative PHP module for PDF document creation, based on
FastIO's ClibPDF, is
available. Please see the ClibPDF
section for details. Note that ClibPDF has a slightly different API
than PDFlib.
PDF functions
phpguy at theos dot me dot uk
01-Mar-2006 05:17
On my system at least (debian stable) the command to install pdflib is not
pear install pdflib
but rather
pecl install pdflib
spingary at yahoo dot com
12-Jan-2006 12:55
I was having trouble with streaming inline PDFs using PHP 5.0.2, Apache 2.0.54.
This is my code:
<?
header("Pragma: public");
header("Expires: Mon, 26 Jul 1997 05:00:00 GMT");
header("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
header("Cache-Control: must-revalidate");
header("Content-type: application/pdf");
header("Content-Length: ".filesize($file));
header("Content-disposition: inline; filename=$file");
header("Accept-Ranges: ".filesize($file));
readfile($file);
exit();
?>
It would work fine in Mozilla Firefox (1.0.7) but with IE (6.0.2800.1106) it would not bring up the Adobe Reader plugin and instead ask me to save it or open it as a PHP file.
Oddly enough, I turned off ZLib.compression and it started working. I guess the compression is confusing IE. I tried leaving out the content-length header thinking maybe it was unmatched filesize (uncompressed number vs actual received compressed size), but then without it it screws up Firefox too.
What I ended up doing was disabling Zlib compression for the PDF output pages using ini_set:
<?
ini_set('zlib.output_compression','Off');
?>
Maybe this will help someone. Will post over in the PDF section as well.
davedotmarshallatcspencerltddotcodotuk
08-Nov-2005 04:17
RE: thodge at ipswich dot qld dot gov dot au
I think the line:
preg_match_all(
'/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
$postScriptData,
$matches
);
should read:
preg_match_all(
'/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
$psData,
$matches
);
ontwerp AT zonnet.nl
03-Nov-2005 11:01
I was searching for a low-cost, open-source option for combining static HTML files [as templates] with dynamic output from Perl or PHP routines. Sooner or later I found out that this was the most stable, fastest and most customizable way to produce usable PDFs with nice formatting:
1] Create the HTML page output [Perl -> HTML output, direct HTML output from any application, PHP echo statements, etc.] [sort these HTML files locally]
2] Convert all HTML [including web image links, tables, font formatting, etc.] to [E]PS files with the Perl application html2ps [as mentioned below]
http://user.it.uu.se/~jan/html2ps.html [sort all PS files by their future PDF page positions]
3] Use the free ps2pdf/ps2pdfwr Linux application
http://www.ps2pdf.com/convert/index.htm [uses Ghostscript, Ghostview libraries and so on]
It has great formatting options such as headers, footers and numbering
[sort the PDF files]
4] Merge all PDF files into one PDF file with pdftk [PDF toolkit], which offers optional compression/encryption, background stamps, etc.
You may ask why I use different scripts:
- The Perl/PHP combination works well: in my experience Perl is faster at some tasks, such as conversion to PS files
- PS to PDF is quicker than direct PHP to PDF [in my experience!]
- I have total control over every file: whenever I change the HTML files used as templates, I only need an editor or another application [online or offline].
P.S. I had to build an open-source solution for creating simple analysis reports made up of things like:
- first page [name / title / # / date]
- some static info [such as introduction, copyrights, etc.]
- some dynamic info [output from PHP -> database queries] combined
with HTML tags/images, etc.
And all of this mixed [so separated into files for transparency]. The three-step approach (data -> HTML, HTML -> PS, PS -> PDF) is also easier and quicker to program or adjust at every step.
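The conversion steps above could be driven from PHP roughly as follows. This is a sketch under assumptions: build_pipeline() and the file names are hypothetical, and html2ps, ps2pdf and pdftk are assumed to be installed and on the PATH.

```php
<?php
// Sketch only: builds the shell commands for steps 2 to 4.
// build_pipeline() and all file names are hypothetical.
function build_pipeline($htmlFiles, $finalPdf)
{
    $commands = array();
    $pdfParts = array();
    foreach ($htmlFiles as $html) {
        $base = preg_replace('/\.html?$/i', '', $html);
        $ps   = $base . '.ps';
        $pdf  = $base . '.pdf';
        // Step 2: HTML -> PostScript (html2ps writes to standard output).
        $commands[] = 'html2ps ' . escapeshellarg($html) . ' > ' . escapeshellarg($ps);
        // Step 3: PostScript -> PDF.
        $commands[] = 'ps2pdf ' . escapeshellarg($ps) . ' ' . escapeshellarg($pdf);
        $pdfParts[] = escapeshellarg($pdf);
    }
    // Step 4: concatenate the per-page PDFs in the order given.
    $commands[] = 'pdftk ' . implode(' ', $pdfParts) . ' cat output ' . escapeshellarg($finalPdf);
    return $commands;
}

foreach (build_pipeline(array('cover.html', 'report.html'), 'all.pdf') as $cmd) {
    echo $cmd, "\n"; // pass each line to shell_exec() to actually run it
}
```

Printing the commands first makes it easy to check the page order before anything is executed.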
Correct me if I'm wrong [mail me to]
ing. Valentijn Langendorff
Design & Technologist
ragnar at deulos dot com
07-Oct-2005 07:30
After a whole day spent understanding how pdflib works, I came to the conclusion that it is hard enough to draw using only words, and on top of that drawing a single line can take something like four lines of code. So I wrote my own functions to make life easier and the code easier to understand, modify and draw with. I also made a function that draws a rectangle with rounded corners, with the option to fill it ;)
You can get it from http://www.deulos.com/pdf_php.php
Feel free to make suggestions or whatever you like ;o)
17-Sep-2005 11:26
some code that can be very helpful for starters.
<?php
$pdf = pdf_new();
PDF_open_file($pdf);
PDF_set_info($pdf, "author", "Alexander Pas");
PDF_set_info($pdf, "title", "PDF by PHP Example");
PDF_set_info($pdf, "creator", "Alexander Pas");
PDF_set_info($pdf, "subject", "Testing Code");
pdf_set_parameter($pdf, "FontOutline", "Arial=arial.ttf");
$font1 = PDF_findfont($pdf, "Helvetica-Bold", "winansi", 0);
$font2 = PDF_findfont($pdf, "Arial", "winansi", 1);
$image1 = PDF_open_image_file($pdf, "gif", "image.gif");
PDF_begin_page($pdf, 450, 450);
$bookmark = PDF_add_bookmark($pdf, "Front");
PDF_setfont($pdf, $font1, 12);
PDF_show_xy($pdf, "First Page!", 5, 225);
pdf_place_image($pdf, $image1, 255, 5, 1);
PDF_end_page($pdf);
PDF_begin_page($pdf, 450, 225);
$bookmark1 = PDF_add_bookmark($pdf, "Chapter1", $bookmark);
PDF_setfont($pdf, $font2, 12);
PDF_show_xy($pdf, "Chapter1!", 225, 5);
PDF_add_bookmark($pdf, "Chapter1.1", $bookmark1);
PDF_setfont($pdf, $font1, 12);
PDF_show_xy($pdf, "Chapter1.1", 225, 5);
PDF_end_page($pdf);
PDF_close($pdf);
$output = PDF_get_buffer($pdf);
header("Content-type: application/pdf");
header("Content-Length: ".strlen($output));
header("Content-Disposition: attachment; filename=test.pdf");
echo $output;
PDF_delete($pdf);
?>
thodge at ipswich dot qld dot gov dot au
04-Sep-2005 10:22
Yet another addition to the PDF text extraction code last posted by jorromer. That code only seemed to work for PDF 1.2 (Acrobat 3.x) or below. This pdfExtractText function uses regular expressions to cover cases I have found in PDF 1.3 and 1.4 documents. The code also handles closing brackets in the text stream, which were ignored by the previous version. My regular expression skills are somewhat lacking, so improvements may be possible by a more skilled programmer. I'm sure there are still cases that this function will not handle, but I haven't come across any yet...
<?php
function pdf2string($sourcefile) {
$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
$searchstart = 'stream';
$searchend = 'endstream';
$pdfText = '';
$pos = 0;
$pos2 = 0;
$startpos = 0;
while ($pos !== false && $pos2 !== false) {
$pos = strpos($content, $searchstart, $startpos);
$pos2 = strpos($content, $searchend, $startpos + 1);
if ($pos !== false && $pos2 !== false){
if ($content[$pos] == 0x0d && $content[$pos + 1] == 0x0a) {
$pos += 2;
} else if ($content[$pos] == 0x0a) {
$pos++;
}
if ($content[$pos2 - 2] == 0x0d && $content[$pos2 - 1] == 0x0a) {
$pos2 -= 2;
} else if ($content[$pos2 - 1] == 0x0a) {
$pos2--;
}
$textsection = substr(
$content,
$pos + strlen($searchstart) + 2,
$pos2 - $pos - strlen($searchstart) - 1
);
$data = @gzuncompress($textsection);
$pdfText .= pdfExtractText($data);
$startpos = $pos2 + strlen($searchend) - 1;
}
}
return preg_replace('/(\s)+/', ' ', $pdfText);
}
function pdfExtractText($psData){
if (!is_string($psData)) {
return '';
}
$text = '';
$psData = str_replace('\)', '##ENDBRACKET##', $psData);
$psData = str_replace('\]', '##ENDSBRACKET##', $psData);
preg_match_all(
'/(T[wdcm*])[\s]*(\[([^\]]*)\]|\(([^\)]*)\))[\s]*Tj/si',
$psData, // the original post passed the undefined variable $postScriptData here
$matches
);
for ($i = 0; $i < sizeof($matches[0]); $i++) {
if ($matches[3][$i] != '') {
preg_match_all('/\(([^)]*)\)/si', $matches[3][$i], $subMatches);
foreach ($subMatches[1] as $subMatch) {
$text .= $subMatch;
}
} else if ($matches[4][$i] != '') {
$text .= ($matches[1][$i] == 'Tc' ? ' ' : '') . $matches[4][$i];
}
}
$trans = array(
'...' => '…',
'\205' => '…',
'\221' => chr(145),
'\222' => chr(146),
'\223' => chr(147),
'\224' => chr(148),
'\226' => '-',
'\267' => '•',
'\(' => '(',
'\[' => '[',
'##ENDBRACKET##' => ')',
'##ENDSBRACKET##' => ']',
chr(133) => '-',
chr(141) => chr(147),
chr(142) => chr(148),
chr(143) => chr(145),
chr(144) => chr(146),
);
$text = strtr($text, $trans);
return $text;
}
?>
28-Aug-2005 09:58
If you want to display the number of pages (for example: page 1 of 3) then the following code could be helpful:
<?php
...
$pdf->begin_page_ext(842,595 , "");
.. add text,images,...
$pdf->suspend_page("");
$pdf->begin_page_ext(842,595 , "");
.. add text,images,...
$pdf->suspend_page("");
... create all pages
$pdf->resume_page("pagenumber 1");
... add number of pages to page 1
$pdf->end_page_ext("");
$pdf->resume_page("pagenumber 2");
... add number of pages to page 2
$pdf->end_page_ext("");
...
?>
jorromer at uchile dot cl -- Krash
07-Jun-2005 10:51
I recently used mattb's code below for extracting text from PDF files. I modified the code to extract only text fields.
I hope this helps someone.
Here is the function:
<?php
$text = pdf2string("file.pdf");
echo $text;
function pdf2string($sourcefile){
$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
$searchstart = 'stream';
$searchend = 'endstream';
$pdfdocument = '';
$pos = 0;
$pos2 = 0;
$startpos = 0;
while( $pos !== false && $pos2 !== false ){
$pos = strpos($content, $searchstart, $startpos);
$pos2 = strpos($content, $searchend, $startpos + 1);
if ($pos !== false && $pos2 !== false){
if ($content[$pos]==0x0d && $content[$pos+1]==0x0a) $pos+=2;
else if ($content[$pos]==0x0a) $pos++;
if ($content[$pos2-2]==0x0d && $content[$pos2-1]==0x0a) $pos2-=2;
else if ($content[$pos2-1]==0x0a) $pos2--;
$textsection = substr($content, $pos + strlen($searchstart) + 2, $pos2 - $pos - strlen($searchstart) - 1);
$data = @gzuncompress($textsection);
$data = ExtractText2($data);
$startpos = $pos2 + strlen($searchend) - 1;
if ($data === false){
return -1;}
$pdfdocument .= $data;}}
return $pdfdocument;}
function ExtractText2($postScriptData){
$sw = true;
$text = ''; // initialise the result so the .= below does not hit an undefined variable
$textStart = 0;
$len = strlen($postScriptData);
while ($sw){
$ini = strpos($postScriptData, '(', $textStart);
$end = strpos($postScriptData, ')', $textStart+1);
if (($ini>0) && ($end>$ini)){
$valtext = strpos($postScriptData,'Tj',$end+1);
if ($valtext == $end + 2)
$text .= substr($postScriptData,$ini+1,$end - $ini - 1);}
$textStart = $end + 1;
if ($len<=$textStart) $sw=false;
if (($ini == 0) && ($end == 0)) $sw=false;}
$trans = array("\\341" => "a","\\351" => "e","\\355" => "i","\\363" => "o","\\223" => "","\\224" => "");
$text = strtr($text, $trans);
return $text;
}
?>
jonathan dot beckett at gmail dot com
06-Jun-2005 03:03
After spending ages writing my own PDF to text extraction routine (well... a couple of hours), I realised that you have to interpret the entire stream to have a hope of getting all the characters you really want - so I started digging.
I then discovered that the XPDF project has everything you need to deal with PDFs - Linux and Win32 binaries are available. Most distro's have the RPMs too.
The resultant command is thus;
$result = shell_exec("pdftotext -raw ".$filename." -");
...it works perfectly for content searching purposes.
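If the file name can come from user input, it is probably safer to escape it before handing it to the shell. The following is a minimal sketch of that idea; the helper names are hypothetical, and the pdftotext binary from Xpdf is assumed to be on the PATH.

```php
<?php
// Sketch: the same pdftotext call as above, with the file name shell-escaped.
// Both function names are hypothetical helpers.
function pdftotext_command($filename)
{
    // "-raw" and the trailing "-" (write to standard output) are taken
    // from the command shown above.
    return 'pdftotext -raw ' . escapeshellarg($filename) . ' -';
}

function pdf_to_text($filename)
{
    if (!is_readable($filename)) {
        return false;
    }
    $result = shell_exec(pdftotext_command($filename));
    // shell_exec() does not return a string on failure or empty output.
    return is_string($result) ? $result : false;
}

echo pdftotext_command("my report.pdf"), "\n";
```

Without escapeshellarg(), a file name containing spaces or shell metacharacters would break the command or be interpreted by the shell.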
q
02-Jun-2005 02:24
It seems that the newest adobe reader 7 (using pdf 1.6) is no longer fully compatible with pdfs generated with PDFlib <= 5. The solution is to upgrade to PDFlib 6. Unfortunately, this means coughing up some more cash to the authors, if you need to get rid of the watermark.
santa at selekcia dot com
18-May-2005 11:53
The pdf2string function used here does not work correctly with all PDFs. There are problems when the PDF uses 0x0D, 0x0A as the line separator. A better way is to determine the stream length from the /Length tag and to check whether the first two characters are 0x0D alone or 0x0D and 0x0A both.
When I have updated the code I will send it, but if someone has already changed it, please publish it. It might be better to extend the standard PDF library included with PHP with functionality for post-processing PDFs. That is sometimes useful, for example for working with templates.
Thanks to all developers extending PHP functions, and to the base team.
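The /Length-based approach suggested in this note could look roughly like the sketch below. It is not the poster's code: pdf_streams_by_length() is a hypothetical helper, and it only handles a direct "/Length N" entry in a dictionary without nested dictionaries, followed by the stream keyword and either CR LF or LF.

```php
<?php
// Sketch: cut stream data out of a PDF using the /Length entry rather than
// searching for "endstream". Hypothetical helper with known limits:
// indirect lengths ("/Length 12 0 R") and nested dictionaries are skipped.
function pdf_streams_by_length($content)
{
    $streams = array();
    $offset  = 0;
    // "/Length N" not followed by "G R", then the end of the dictionary,
    // the "stream" keyword and its end-of-line marker (CR LF or LF).
    $pattern = '/\/Length\s+(\d++)(?!\s+\d+\s+R)[^>]*>>\s*stream(\r\n|\n)/';
    while ($offset <= strlen($content)
        && preg_match($pattern, $content, $m, PREG_OFFSET_CAPTURE, $offset)) {
        $length = (int) $m[1][0];
        $start  = $m[0][1] + strlen($m[0][0]); // first byte of stream data
        $streams[] = substr($content, $start, $length);
        $offset = $start + $length;            // continue after the data
    }
    return $streams;
}

$sample = "1 0 obj\n<< /Length 5 >>\nstream\r\nhello\r\nendstream\nendobj\n";
print_r(pdf_streams_by_length($sample));
```

Each returned string can then be passed to gzuncompress(), as the pdf2string() versions on this page do. Because the data is cut by length, bytes that happen to spell "endstream" inside a stream do not end it early.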
webadmin at secretscreen dot com
05-Apr-2005 02:51
I found this info about pdflib scope on a Chinese (I think) site and translated it. I was trying to do pdf_setfont and kept getting the wrong scope error. Turns out it has to be in the Page scope. So pdf_setfont will only work when called between pdf_begin_page and pdf_end_page.
#########################################
When a PDFlib API function is called, the error "Can't - IN 'document' scope" occurs
PDFlib has a concept of "scope": each PDFlib API function may only be called in certain scopes. This error occurs when a function is called outside the scopes it is allowed in. Please check the position of the API call against the list below.
Path: from PDF_moveto(), PDF_circle(), PDF_arc(), PDF_arcn() or PDF_rect() up to PDF_stroke(), PDF_closepath_stroke(), PDF_fill(), PDF_fill_stroke(), PDF_closepath_fill_stroke(), PDF_clip() or PDF_endpath()
Page: between PDF_begin_page() and PDF_end_page(), but outside path scope
Template: between PDF_begin_template() and PDF_end_template(), but outside path scope
Pattern: between PDF_begin_pattern() and PDF_end_pattern(), but outside path scope
Font: between PDF_begin_font() and PDF_end_font(), but outside glyph scope
Glyph: between PDF_begin_glyph() and PDF_end_glyph(), but outside path scope
Document: between PDF_open_*() and PDF_close(), but outside page, template and pattern scope
Object: between PDF_new() and PDF_delete(), but outside all of the other scopes
Null: outside object scope
Any: All scopes other than
##########################################
Hope this helps others as much as it helped me!!!
kevin at kevinnading dot com
30-Mar-2005 12:46
Hey people.. the bug with IE not accepting a pdf created via post.. If you can use a get method instead then it will work fine. both post and get methods work in firefox, but only the get method seems to work in IE. However, you may use a content-disposition attachment(means requires user interaction) to popup an open/save dialog box to the user and post/get both work in IE and firefox. Hope this helps!
beanjammin dot removethis at gmail dot com
30-Mar-2005 10:32
This was originally posted by mat3582 at NOSPAM dot hotmail dot com on the Session Handling Functions manual page, however as it is pdf specific I hope that moving it here will make it easier for others to find.
I fought this for longer than I'd care to admit after a web server distros switch before discovering my problem was session related and subsequently discovering Mat's post.
// Mats Note:
Outputting a PDF file to an MSIE browser didn't work (MSIE mistook the file for an ActiveX control,
then failed to download) until I added
<?php
ini_set('session.cache_limiter',"0");
?>
to my script. I hope this will help someone else.
// End Mats Note
In addition to Mat's suggestion, the php.ini file can also be edited to add or change the session.cache_limiter setting to 0.
chu61 dot tw at gmail dot com
06-Mar-2005 07:57
How do you find out how many pages a PDF has? I read the PDF specification, version 1.6, and found this:
PDF uses a "page tree" to define the ordering of pages in the document. The tree structure allows PDF applications to quickly open a document containing thousands of pages using only a little memory.
If a PDF has 63 pages, the page tree node will look like this...
2 0 obj
<< /Type /Pages
/Kids [ 4 0 R
10 0 R
]
/Count 63 <---- YES, got it
>>
endobj
[P.S.] A PDF may contain more than one page tree node. The right answer is in the root page tree node: if a node has /Count XX together with a /Parent XXX entry, it is not the root page tree node.
So you must find the node that has /Count XX and no /Parent entry, and you will get the total number of pages in the PDF.
This works for %PDF-1.0 through %PDF-1.5.
Alex from Taipei, Taiwan
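The rule described in this note (a /Pages node with /Count and no /Parent) can be sketched in PHP as follows. The function pdf_page_count() is a hypothetical helper; it assumes the page tree objects are stored uncompressed in the file, so it will not work when they sit inside compressed object streams.

```php
<?php
// Sketch: read the page count from the root page tree node, that is, the
// /Type /Pages object that has /Count but no /Parent.
// Hypothetical helper; assumes the objects are stored uncompressed.
function pdf_page_count($content)
{
    // Split the file into "N G obj ... endobj" bodies.
    if (!preg_match_all('/\d+\s+\d+\s+obj(.*?)endobj/s', $content, $objects)) {
        return false;
    }
    foreach ($objects[1] as $body) {
        $isPages   = preg_match('/\/Type\s*\/Pages\b/', $body);
        $hasParent = preg_match('/\/Parent\b/', $body);
        if ($isPages && !$hasParent && preg_match('/\/Count\s+(\d+)/', $body, $m)) {
            return (int) $m[1];
        }
    }
    return false;
}

// Usage: pass the raw file contents.
// $pages = pdf_page_count(file_get_contents("file.pdf"));
```

Intermediate page tree nodes are skipped because they carry a /Parent entry, so only the root node's /Count is returned.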
mattb at bluewebstudios dot com
04-Feb-2005 01:44
I recently tested Donatas' code below for the extraction of text from PDF files. After running into a few problems where PDF files were not being read at all, I've modified it somewhat. It still isn't perfect, but should work great for searching. Thanks Donatas.
<?php
$test = pdf2string("<pathtoPDFfile>");
echo "$test";
function pdf2string($sourcefile)
{
$fp = fopen($sourcefile, 'rb');
$content = fread($fp, filesize($sourcefile));
fclose($fp);
$searchstart = 'stream';
$searchend = 'endstream';
$pdfdocument = "";
$pos = 0;
$pos2 = 0;
$startpos = 0;
while( $pos !== false && $pos2 !== false )
{
$pos = strpos($content, $searchstart, $startpos);
$pos2 = strpos($content, $searchend, $startpos + 1);
if( $pos !== false && $pos2 !== false )
{
$textsection = substr($content, $pos + strlen($searchstart) + 2, $pos2 - $pos - strlen($searchstart) - 1);
$data = @gzuncompress($textsection);
$data = ExtractText($data);
$startpos = $pos2 + strlen($searchend) - 1;
if( $data === false ) { return -1; }
$pdfdocument = $pdfdocument . $data;
}
}
return $pdfdocument;
}
function ExtractText($postScriptData)
{
while( (($textStart = strpos($postScriptData, '(', $textStart)) && ($textEnd = strpos($postScriptData, ')', $textStart + 1)) && substr($postScriptData, $textEnd - 1) != '\\') )
{
$plainText .= substr($postScriptData, $textStart + 1, $textEnd - $textStart - 1);
if( substr($postScriptData, $textEnd + 1, 1) == ']' ) {
$plainText .= ' ';
}
$textStart = $textStart < $textEnd ? $textEnd : $textStart + 1;
}
return stripslashes($plainText);
}
?>
ken at thesmallbox.com
30-Oct-2004 08:13
Please note that these functions have been removed from PHP 5. They are still available through the pdflib PECL module.
13-Aug-2004 11:58
for people who are using PDF_FINDFONT there is a catch..
--------------------------------------------------------
int PDF_findfont(PDF *p, const char *fontname, const char *encoding, int embed)
Deprecated, use PDF_load_font( ).
----
use PDF_load_font instead....
arjen at queek dot nl
15-Jul-2004 07:50
If you prefer an OO approach to the PDF functions, you can use this snippet of code (PHP 5 only, and it does add some overhead). It's just a starting point; extend/improve as you wish...
You can call any pdf_* function on your object by stripping pdf_ from the function name. Plus, you don't have to pass the PDF resource as the first argument.
For example:
<?php
pdf_show($pdf, $text); ?>
Can become:
<?php
$pdf->show($text); ?>
Code:
<?php
class PDF {
private $pdf;
public function __construct() {
$this->pdf = pdf_new();
}
public function __call($function, $arguments) {
array_unshift($arguments, $this->pdf);
return call_user_func_array('pdf_' . $function, $arguments);
}
}
?>
michi (Alt+Q) marel.at
01-Jul-2004 07:10
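<?php
// Convert millimetres to PostScript points (72 points per inch,
// 25.4 mm per inch), rounded to the nearest whole point.
function calcToPt($intMillimeter) {
    $intPoints = ($intMillimeter*72)/25.4;
    $intPoints = round($intPoints);
    return $intPoints;
}
// Begin an A4 page (210 x 297 mm).
pdf_begin_page( $pdf, calcToPt(210), calcToPt(297)); ?>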
donatas at spurgius dot com
22-Jun-2004 12:56
I've been looking for a way to extract plain text from PDF documents (needed to search for text inside 'em). Not being able to find one I wrote the needed functions myself. here you go folks.
<?php
function pdf2string ($sourceFile)
{
$textArray = array ();
$objStart = 0;
$fp = fopen ($sourceFile, 'rb');
$content = fread ($fp, filesize ($sourceFile));
fclose ($fp);
$searchTagStart = chr(13).chr(10).'stream';
$searchTagStartLenght = strlen ($searchTagStart);
while ((($objStart = strpos ($content, $searchTagStart, $objStart)) && ($objEnd = strpos ($content, 'endstream', $objStart+1))))
{
$data = substr ($content, $objStart + $searchTagStartLenght + 2, $objEnd - ($objStart + $searchTagStartLenght) - 2);
$data = @gzuncompress ($data);
if ($data !== FALSE && strpos ($data, 'BT') !== FALSE && strpos ($data, 'ET') !== FALSE)
{
$textArray [] = ExtractText ($data);
}
$objStart = $objStart < $objEnd ? $objEnd : $objStart + 1;
}
return $textArray;
}
function ExtractText ($postScriptData)
{
while ((($textStart = strpos ($postScriptData, '(', $textStart)) && ($textEnd = strpos ($postScriptData, ')', $textStart + 1)) && substr ($postScriptData, $textEnd - 1) != '\\'))
{
$plainText .= substr ($postScriptData, $textStart + 1, $textEnd - $textStart - 1);
if (substr ($postScriptData, $textEnd + 1, 1) == ']') {
$plainText .= ' ';
}
$textStart = $textStart < $textEnd ? $textEnd : $textStart + 1;
}
return stripslashes ($plainText);
}
?>
uwe at steinmann dot cx
13-May-2004 06:25
Those looking for a free replacement of pdflib may consider
pslib at http://pslib.sourceforge.net which produces PostScript but it can be easily turned into PDF by Acrobat Distiller or ghostscript. The API is very similar and even hypertext functions are supported. There
is also a php extension for pslib in PECL, called ps.
james at lanpad dot org
18-Apr-2004 08:36
PDFlib has a free replacement that is also much easier to work with (no more working with coordinates from the bottom left-hand corner!):
http://www.fpdf.org
It's also free for commercial use and is very usable, unlike the PDFlib extension.
matic at koncan dot net
11-Jan-2004 06:22
The solution for IE (refresh):
...
$buf = PDF_get_buffer($p);
$len = strlen($buf);
header("Cache-Control: no-store");
header("Cache-Control: no-cache");
header("Cache-Control: must-revalidate");
header("Content-type: application/pdf");
header("Content-Length: $len");
header("Content-Disposition: inline; filename=file.pdf");
print $buf;
PDF_delete($p);
SenorTZ senortz at nospam dot yahoo dot com
28-Jul-2003 06:23
About creating a PDF document based on the content of another document (let's say a text file):
I tried to pass the name of the file whose content I want to read to the PDF-creator page through a link on the sender page, so that the PDF-creator page could generate a PDF document containing that content. The problem is that when I referred to the PDF-creator page via the link your_root/create_pdf.php?filename=$your_file_name, the PDF-creator page did not behave well when it had a line like $filename = $_GET["filename"] before creating the PDF document.
I solved this by using, instead of the link, a form with a button on the sender page. The form has "create_pdf.php" as its action, "post" as its method, and a hidden field containing the "filename" value. It works this way if the PDF-creator page has a line like $filename = $_POST["filename"].
I would like to understand why it works this way and not the other way.
I hope this helps. Here are the pieces of code I used.
Sender page:
print("<form name='to_pdf' action='see_pdf_file.php' method='post'>");
print("<br/><input type='submit' value='PDF'><input type='hidden' name='filename' value='$filename'></form>");
PDF-creator page:
<?
$filename = $_POST["filename"];
$file_handle = fopen($filename, "r");
$file_content = file_get_contents($filename);
fclose($file_handle);
$file_content = wordwrap($file_content,72,"|");
$a_row = explode("|",$file_content);
$i = 0;
$pdf = pdf_new();
pdf_open_file($pdf, "");
pdf_begin_page($pdf, 595, 842);
pdf_set_font($pdf, "Times-Roman", 16, "host");
pdf_add_outline($pdf, "Page 1");
pdf_set_value($pdf, "textrendering", 1);
pdf_show_xy($pdf, 'The content of the file:',50,700);
while ($a_row[$i] != "")
{
pdf_continue_text($pdf,$a_row[$i]);
$i++;
}
pdf_end_page($pdf);
pdf_close($pdf);
$data = pdf_get_buffer($pdf);
header("Content-type: application/pdf");
header("Content-disposition: inline; filename=test.pdf");
header("Content-length: " . strlen($data));
echo $data;
?>
PDFlib and PHP 4.3.1 used.
Thanks.
bmironov at jonview dot com
24-Jun-2003 03:46
RedHat 9 + Apache 2.0 + PHP 4.3.2 + Oracle 9i + PDFlib 5.0.1 (binary distribution)
It seems to be a working bundle if you do some magic with ./configure:
RedHat 9:
kernel-2.4.20-18.9
Apache 2.0.46:
./configure --enable-so --enable-rewrite=shared --enable-status --enable-mpm=prefork
PHP 4.3.2:
./configure \
--program-prefix= \
--prefix=/usr \
--exec-prefix=/usr \
--bindir=/usr/bin \
--sbindir=/usr/sbin \
--sysconfdir=/etc \
--datadir=/usr/share \
--includedir=/usr/include \
--libdir=/usr/lib \
--libexecdir=/usr/libexec \
--localstatedir=/var \
--sharedstatedir=/usr/com \
--mandir=/usr/share/man \
--infodir=/usr/share/info \
--with-config-file-path=/etc \
--with-config-file-scan-dir=/etc/php.d \
--without-tsrm-pthreads \ # !!!!!!!!!!!!!!!!!!!!
--with-zlib \
--with-gd \
--enable-gd-native-ttf \
--with-ttf \
--without-mysql \
--with-apxs2filter=/usr/local/apache2/bin/apxs \
--with-oci8 \
--enable-sigchild \
--enable-inline-optimization
Oracle9i:
ln -s $ORACLE_HOME/rdbms/public/nzerror.h $ORACLE_HOME/rdbms/demo/nzerror.h
ln -s $ORACLE_HOME/rdbms/public/nzt.h $ORACLE_HOME/rdbms/demo/nzt.h
ln -s $ORACLE_HOME/rdbms/public/ociextp.h $ORACLE_HOME/rdbms/demo/ociextp.h
If you want to use bundled GD-library then:
1) install following packages: libjpeg, libjpeg-devel, libpng, libpng-devel, freetype, freetype-devel, libtiff, libtiff-devel, zlib, zlib-devel
2) ln -s /usr/lib/libjpeg.so.62 /usr/lib/libjpeg.so
ln -s /usr/lib/libpng.so.62 /usr/lib/libpng.so
This seems to be a working combination, because it avoids the following problems:
1) This error message in Apache's error_log:
Module compiled with module API=20020429, debug=0, thread-safety=0
PHP compiled with module API=20020429, debug=0, thread-safety=1
2) This error message in Apache's error_log:
[notice] child pid 12345 exit signal Segmentation fault (11)
3) MS Internet Explorer crashing, or showing confusing messages about opening "Adobe Acrobat Control for ActiveX", when it displays the PDF output of your PHP script via the Acrobat plug-in.
Hope it will save you some time.
Good luck,
Boris
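The first error above is a build mismatch: the module and PHP report different thread-safety values. Here is a minimal sketch for spotting that in a pair of error_log lines. The function names parse_build_line() and build_mismatches() are my own, not part of PHP or PDFlib, and I assume the log lines keep the format quoted above.

```php
<?php
// Extract the API, debug and thread-safety values from one error_log line.
// Hypothetical helper; assumes the "key=value" format quoted in the note above.
function parse_build_line($line)
{
    $settings = array();
    if (preg_match_all('/(API|debug|thread-safety)=(\d+)/', $line, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $settings[$match[1]] = $match[2];
        }
    }
    return $settings;
}

// Return the settings whose values differ between the module line and the PHP line.
function build_mismatches($moduleLine, $phpLine)
{
    $module = parse_build_line($moduleLine);
    $php = parse_build_line($phpLine);
    $diff = array();
    foreach ($module as $key => $value) {
        if (isset($php[$key]) && $php[$key] !== $value) {
            $diff[$key] = array('module' => $value, 'php' => $php[$key]);
        }
    }
    return $diff;
}

$moduleLine = 'Module compiled with module API=20020429, debug=0, thread-safety=0';
$phpLine    = 'PHP compiled with module API=20020429, debug=0, thread-safety=1';
foreach (build_mismatches($moduleLine, $phpLine) as $key => $pair) {
    echo "$key differs: module={$pair['module']}, php={$pair['php']}\n";
}
```

For the two lines quoted above, only thread-safety is reported as different, which is the symptom of loading a non-thread-safe module into a thread-safe PHP build.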
matt at nospam dot org
29-Aug-2002 11:11
Adding to my prior note: IE 6 has a strange habit of using GET when refreshing a PDF document, even though the page was originally requested with POST. This may be the root cause of all the trouble listed above regarding POST and PDF.
So, I recommend:
1) Using a two-page form/action handler for PDF rendering instead of the standard $PHP_SELF form/self handler, to resolve the problem discussed above.
2) Using either GET, or a self-posting form that sets cookies and then redirects to the PDF creation page, instead of POST, so that the parameters reach the page. HTH
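A rough sketch of the redirect variant of point 2, passing the fields in the query string instead of cookies. The function build_pdf_url(), the page name makepdf.php and the field names are made up for illustration, and http_build_query() needs PHP 5 or later. The form handler validates the submitted data and then redirects, so the PDF page is always reached with GET and a refresh in IE repeats the same request.

```php
<?php
// Build the URL of the PDF-generating page, carrying the form fields as a
// GET query string. Hypothetical helper, not part of PHP or PDFlib.
function build_pdf_url($target, $fields)
{
    $query = http_build_query($fields, '', '&');
    if ($query === '') {
        return $target;
    }
    // Append with "&" if the target already has a query string.
    $separator = (strpos($target, '?') === false) ? '?' : '&';
    return $target . $separator . $query;
}

// In the form handler, after validating the submitted data, one would do:
//   header('Location: ' . build_pdf_url('makepdf.php', $_POST));
//   exit;

echo build_pdf_url('makepdf.php', array('id' => 42, 'title' => 'Annual report')), "\n";
```

Keep in mind that everything passed this way shows up in the URL, so this is only suitable for small, non-sensitive parameters.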
gilbertng at hongkong dot com
11-Jun-2002 03:23
Hope it can help someone:
$pdf = pdf_new();
//pdf_open_file($pdf,"");
if (!pdf_open_file($pdf, "")) {
print "error";
exit;
}
PDF_set_parameter($pdf, "resourcefile", "/usr/local/pdflib/fonts/pdflib.upr");
PDF_set_parameter($pdf,"prefix","/usr/local/pdflib/fonts");
pdf_begin_page($pdf, 595, 842);
pdf_add_outline($pdf, "Page 1");
//pdf_set_font($pdf, "Times-Roman", 30, "host");
// select a font for Chinese characters
$font = pdf_findfont($pdf, "MHei-Medium", "B5pc-H",0);
if ($font) {
pdf_setfont($pdf, $font, 30);
}
pdf_set_value($pdf, "textrendering",0);
pdf_show_xy($pdf, " 100 Roman outlined", 50, 750);
pdf_set_font($pdf, "Times-Roman", 30, "host");
pdf_show_xy($pdf, " Times Roman outlined", 50, 600);
pdf_moveto($pdf, 50, 740);
pdf_lineto($pdf, 330, 740);
pdf_stroke($pdf);
pdf_end_page($pdf);
pdf_close($pdf);
$buf = pdf_get_buffer($pdf);
$len = strlen($buf);
header("Content-type: application/pdf");
header("Content-Length: $len");
header("Content-Disposition: inline; filename=foo.pdf");
print $buf;
pdf_delete($pdf);
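A side note on the 595 and 842 passed to pdf_begin_page(): these are the A4 dimensions (210 mm by 297 mm) in PostScript points, at 72 points per inch. A small sketch of the arithmetic, with mm_to_points() being my own helper name rather than a PDFlib function:

```php
<?php
// Convert millimetres to PostScript points (72 points per inch, 25.4 mm per inch).
// Hypothetical helper, not part of PDFlib.
function mm_to_points($mm)
{
    return $mm / 25.4 * 72;
}

$width  = round(mm_to_points(210)); // A4 width in points
$height = round(mm_to_points(297)); // A4 height in points
echo "A4 is {$width} x {$height} points\n";
```

The same helper can be used to work out other page sizes or margins given in millimetres.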
chernyshevsky at hotmail dot com
06-May-2002 03:22
If you're wondering how to highlight words inside a PDF file, take a look at this script I've written (doesn't need PDFLib)
http://zeus.jtlnet.com/~conradis/pdfhi.php.txt
It's a whole lot harder than you think. (Rarely has so much code been written that does so little, that's what I say :-) Worth looking at if you want to do searches inside a PDF.
pbierans at lynet dot de
27-Mar-2002 09:56
Load the extension, open a PDF, add a font, modify the PDF in memory, and send it to the browser:
<?php
header("Expires: Mon, 26 Jul 1997 05:00:00 GMT");
header("Last-Modified: ".gmdate("D, d M Y H:i:s")." GMT");
header("Cache-Control: no-store, no-cache, must-revalidate");
header("Cache-Control: post-check=0, pre-check=0", false);
header("Pragma: no-cache");
$ext_name="libpdf_php.so";
// extension_loaded() expects the extension name ("pdf"), not the library file name
if (!extension_loaded('pdf') && !@dl($ext_name))
{
?>
<table width="100%" border="0"><tr><td align="center">
<table style="border: solid #f0f0f0 2px;"><tr>
<td valign="middle" style="padding: 20px; margin: 0px;">
<p style="font-family: arial; font-size: 12px; ">
<b>Sorry,</b><br>
<br>
A PDF can not be generated right now.<br>
The administrator has been informed and will fix this as
soon as possible.<br>
Please try again later.
</p>
</td></tr></table>
</td></tr></table>
<?php
mail('admin@domain.com','Error: PDFLib not found',
"Called by script:\n ".$SCRIPT_FILENAME.'?'.$QUERY_STRING,
"From: warnings@domain.com\n");
exit;
}
// seed the random number generator from the microsecond part of microtime()
srand((float)microtime()*1000000);
$usnr= gmdate("Ymd-His-").rand(1000,9999).'-';
$pdf_file=$usnr.'result.pdf';
$src_file='source.pdf';
$pdf = pdf_new();
pdf_open_file($pdf);
pdf_set_parameter($pdf, 'serial', 'if-you-have-one');
pdf_set_parameter($pdf, 'FontAFM', 'TradeGothic=Tg______.afm');
pdf_set_parameter($pdf, 'FontOutline', 'TradeGothic=Tg______.pfb');
pdf_set_parameter($pdf, 'FontPFM', 'TradeGothic=Tg______.pfm');
$src_doc =pdf_open_pdi($pdf,$src_file,'', 0);
$src_page =pdf_open_pdi_page($pdf,$src_doc,1,'');
$src_width =pdf_get_pdi_value($pdf,'width' ,$src_doc,$src_page,0);
$src_height=pdf_get_pdi_value($pdf,'height',$src_doc,$src_page,0);
pdf_begin_page($pdf, $src_width, $src_height);
{
pdf_place_pdi_page($pdf,$src_page,0,0,1,1);
pdf_close_pdi_page($pdf,$src_page);
pdf_set_font($pdf, 'TradeGothic', 8, 'host');
pdf_show_xy($pdf, 'Now: '.gmdate("Y-m-d H:i:s"),50,50);
}
pdf_end_page($pdf);
pdf_close($pdf);
$pdfdata = pdf_get_buffer($pdf);
$pdfsize = strlen($pdfdata);
header('Content-type: application/pdf');
header('Content-disposition: attachment; filename="'.$pdf_file.'"');
header('Content-length: '.$pdfsize);
echo $pdfdata;
exit;
?>
a dot marchand dot nospam at home dot com
01-May-2001 12:42
To continue on the Internet Explorer (Iexplorer, IE) requirements: instead of Content-Length, a simple
header("Accept-Ranges: bytes");
is enough to make the getpdf.php file work correctly. Even Netscape works without error with this modification.
Aurelien
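To make the two variants easier to compare, here is a sketch that builds the header list for an in-memory PDF either with Content-Length or with Accept-Ranges. The function name build_pdf_headers() and the placeholder data are assumptions for illustration; in a real script the data would come from pdf_get_buffer().

```php
<?php
// Return the response headers for a PDF held in memory.
// Hypothetical helper; $useRanges selects "Accept-Ranges: bytes"
// instead of an explicit Content-Length.
function build_pdf_headers($data, $filename, $useRanges = false)
{
    $headers = array(
        'Content-Type: application/pdf',
        'Content-Disposition: inline; filename="' . $filename . '"'
    );
    if ($useRanges) {
        $headers[] = 'Accept-Ranges: bytes';
    } else {
        $headers[] = 'Content-Length: ' . strlen($data);
    }
    return $headers;
}

$data = '%PDF-1.4 placeholder'; // stands in for the result of pdf_get_buffer($pdf)
foreach (build_pdf_headers($data, 'test.pdf', true) as $line) {
    header($line);
}
echo $data;
```

Which variant a given browser and plug-in combination prefers is something you will have to test yourself; the notes above only report what worked for their authors.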