marquez7

2 years ago

grep command to search .docx files???

So I know the grep command is used to search for patterns in text files. I tried several times to search through some .docx files for practice, and the grep command won't work. I keep getting a binary file error.


So I used the "-a" option, as explained by the man page, to override the issue, and nothing happens.

From what I gather, the grep command only works on .txt files? (Most system files are saved as plain text, but what if I want to use it to search user files on the network that are .docx files?) What command is used to search through .pdf and .docx files? More specifically .docx, since most people tend to save with that extension.
I got the same error on my Mac when I tried the command there as well.

harock
2 years ago
I believe this is because the .docx format is not plain text. Instead it's a form of zip archive (sorry, my knowledge is limited), so you can't simply grep these files the way you would a text file.

What you could do is install a package called docx2txt, which converts a .docx file to plain text.

You can then grep the text it sends to stdout. For example:

I created a docx file (test.docx) with the following lines:

This is a test
This is a 2nd line
This line has a turtle on it
We shall look for that
Because Grep is fun


Then I used the following command to grep for the string 'turtle':

/test$ docx2txt < test.docx | grep 'turtle'

This line has a turtle on it


I hope this is helpful,

Good luck
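
The point above that a .docx is really a zip archive can be checked without installing anything. The following is a minimal sketch using only Python's standard library. The function name `describe_docx` is made up for illustration, and the file names are whatever you pass on the command line (for example the test.docx from the post above).

```python
import sys
import zipfile


def describe_docx(path):
    """Return (is_zip, member_names) for the file at `path`."""
    # is_zipfile() looks for the zip signature instead of trusting the extension.
    if not zipfile.is_zipfile(path):
        return False, []
    with zipfile.ZipFile(path) as archive:
        # namelist() returns the paths of the files stored inside the archive.
        return True, archive.namelist()


if __name__ == "__main__":
    # Usage (script name is hypothetical): python describe_docx.py test.docx
    for path in sys.argv[1:]:
        is_zip, names = describe_docx(path)
        print(path, "is a zip archive" if is_zip else "is not a zip archive")
        for name in names:
            print("   ", name)
```

For a file saved by Word, the listing would normally include word/document.xml, which holds the main body text. The members are usually stored compressed, so the words typed into the document do not appear as readable bytes in the .docx file itself, which would explain why grep -a finds nothing.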
2 years ago
Scott is correct. On its own, grep mostly only works on plain text. A .docx file is actually a zip archive of XML documents (the XML carries all the extra formatting options), so grep can't interpret it the way you want.
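
Putting the two replies together (a zip container with XML inside), a grep-like search can be sketched in Python with only the standard library. This is an illustration under assumptions, not how docx2txt works internally: the names `docx_paragraphs` and `grep_docx` are invented here, the namespace is the one used by ordinary (non-strict) Word files, and only the main body in word/document.xml is read, so headers, footers and footnotes are skipped.

```python
import re
import sys
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by ordinary .docx files (assumed; "strict"
# OOXML files use a different one and would yield no paragraphs here).
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Yield the plain text of each paragraph in the document body."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    for para in root.iter(W_NS + "p"):
        # Word may split one sentence across several <w:t> runs; join them.
        yield "".join(t.text or "" for t in para.iter(W_NS + "t"))


def grep_docx(pattern, path):
    """Return the paragraphs of `path` that match the regex `pattern`."""
    regex = re.compile(pattern)
    return [line for line in docx_paragraphs(path) if regex.search(line)]


if __name__ == "__main__" and len(sys.argv) >= 3:
    # Usage (script name is hypothetical): python grep_docx.py turtle test.docx
    for path in sys.argv[2:]:
        for line in grep_docx(sys.argv[1], path):
            print(f"{path}: {line}")
```

Joining the runs matters because plain tag stripping on the raw XML can miss a word that Word happened to split across two formatting runs.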
2 years ago
Thanks. This was extremely helpful. I'll definitely give it a try as soon as I can get access to my computer.
nicholashendrix
2 years ago
Hello! I think your best bet would be to convert it to a plain text format before attempting to parse it with command line tools (like grep). Let us know if you have any further questions.
2 years ago
OK. I do apologize for asking too many questions here. I couldn't find the docx2txt package through a yum database search (Red Hat Enterprise Linux 7 and CentOS 7). Does the program have to be downloaded from a third-party website?
I just manually copied the contents of the .docx files I wanted to try out, pasted them into .txt files, and was able to practice the grep command. I really wanted to try out the docx2txt command, but couldn't find it. I usually log in to the servers via ssh through the terminal. I think I'm going to have to log in to the virtual machine and download it from the website.
2 years ago
If you're good with Python, you can install a Python version of it with pip. Here's the GitHub page

rpmfind.net has an RPM package if you'd rather use that. Just run these commands to download and install it:

wget "https://www.rpmfind.net/linux/Mandriva/devel/cooker/x86_64/media/contrib/release/docx2txt-1.2-1-mdv2012.0.noarch.rpm"
sudo rpm -i docx2txt-1.2-1-mdv2012.0.noarch.rpm

2 years ago
Thanks.